* Re: [fuse-devel] Debugging a stale kernel cache during file growth
[not found] <898a4e10-6193-4671-b3b1-7c7bc562a671@fmap.me>
@ 2026-04-16 7:24 ` Amir Goldstein
2026-04-16 12:12 ` Miklos Szeredi
0 siblings, 1 reply; 9+ messages in thread
From: Amir Goldstein @ 2026-04-16 7:24 UTC (permalink / raw)
To: Nikolay Amiantov; +Cc: fuse-devel, linux-fsdevel, fuse-devel
[-- Attachment #1: Type: text/plain, Size: 2243 bytes --]
[CC new fuse-devel list and fsdevel]
On Wed, Apr 15, 2026 at 7:24 PM Nikolay Amiantov via fuse-devel
<fuse-devel@lists.sourceforge.net> wrote:
>
> Hi everybody,
>
> I've recently encountered a weird issue with JuiceFS [1], a network FS
> which uses FUSE. tl;dr: when a file was being slowly appended, a reader
> of the same file on another host would periodically read a block of zero
> bytes instead of the actual data.
>
> While researching it I built a minimal reproducible example (MRE); it
> turns out this issue can be triggered on a fresh kernel and libfuse3 with:
> * A test FUSE FS which exposes a slowly growing file containing only
> 0xAA bytes, with the kernel stat cache disabled and no explicit read
> cache handling (so the kernel manages the read cache automatically);
> * A script which reads the file sequentially, checking for zeroes,
> while also hammering `os.stat` on the same file from separate threads.
>
> I couldn't find any prior discussion of this issue. I suspect a
> kernel bug; sadly, I have no prior experience with the kernel-side FUSE
> subsystem, but with a lot of (disclosure) LLM help explaining the FUSE
> module and the kernel read cache architecture, I think I understand what
> happens and have implemented a partial fix which, to me, makes sense.
>
> If I understand correctly, a full fix is impossible without VFS-level
> locking changes; I've actually managed to reproduce a similar issue in
> NFS [2]; this may be applicable to any network FS which may increase an
> inode size outside of `read`/`write`s.
>
> All my code, experiments and a kernel patch which makes the issue less
> frequent (but does not eliminate it) can be found at
> https://github.com/abbradar/fuse_growtest
>
> Any help checking my findings and pointing me in the right direction is
> appreciated!
>
> Thanks for your help,
> Nikolay.
>
> [1]: https://github.com/juicedata/juicefs/issues/5038
> [2]: https://github.com/abbradar/nfs_stale_cache_test
>
>
Hi Nikolay,
Your question relates to the kernel filesystem layer and you also
mention NFS, so I've added the fsdevel list for the attention of
relevant developers who are not subscribed to the fuse-devel list.
Also attaching your patch here for convenience.
Thanks,
Amir.
[-- Attachment #2: 0001-fuse-fix-stale-page-cache-data-race-on-file-growth.patch --]
[-- Type: text/x-patch, Size: 2943 bytes --]
From 007c8531c0644e259321b8e1b151003439f75ebf Mon Sep 17 00:00:00 2001
From: Nikolay Amiantov <ab@fmap.me>
Date: Wed, 15 Apr 2026 07:28:19 +0000
Subject: [PATCH] fuse: fix stale page cache data race on file growth
When a FUSE server reports a larger file size via lookup or getattr,
fuse_change_attributes() updates i_size before invalidating stale page
cache entries. The page straddling the old EOF contains kernel-generated
zero-fill beyond the old i_size. Once i_size is increased, these zeroes
become visible to concurrent readers until the cache is invalidated
later in the same call.
Fix this by evicting the affected page before updating i_size. The fi->lock
spinlock is dropped before the invalidation (which can sleep), then
reacquired to recheck FUSE_I_SIZE_UNSTABLE before updating i_size.
---
fs/fuse/inode.c | 38 ++++++++++++++++++++++++++++++++++++--
1 file changed, 36 insertions(+), 2 deletions(-)
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index 735abf426a06..61ae0f94e4dc 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -334,8 +334,42 @@ void fuse_change_attributes(struct inode *inode, struct fuse_attr *attr,
* extend local i_size without keeping userspace server in sync. So,
* attr->size coming from server can be stale. We cannot trust it.
*/
- if (!(cache_mask & STATX_SIZE))
- i_size_write(inode, attr->size);
+ if (!(cache_mask & STATX_SIZE)) {
+ /*
+ * When a file grows remotely, the page straddling the old
+ * EOF contains zero-fill beyond oldsize. Those zeroes are
+ * valid while i_size equals oldsize (they are beyond EOF),
+ * but become stale once i_size is increased: concurrent
+ * readers would see zeroes instead of the data written by
+ * the remote host. Evict the affected page(s) BEFORE updating
+ * i_size. Any reader that re-populates the cache between the
+ * invalidation and the i_size update will issue a fresh
+ * FUSE_READ with the new data there.
+ *
+ * There is a residual race: a reader that has
+ * already obtained a folio reference via the lockless
+ * filemap_get_read_batch() but has not yet reached the
+ * i_size_read() in filemap_read() would hold a ref that
+ * prevents invalidate_inode_pages2_range() from evicting
+ * the folio. The reader would then use the new i_size
+ * (read after our i_size_write) to copy stale data.
+ */
+ if (S_ISREG(inode->i_mode) && attr->size > oldsize) {
+ spin_unlock(&fi->lock);
+ invalidate_inode_pages2_range(inode->i_mapping,
+ oldsize >> PAGE_SHIFT,
+ (attr->size - 1) >> PAGE_SHIFT);
+ spin_lock(&fi->lock);
+ /*
+ * Recheck — a write or truncate may have set
+ * FUSE_I_SIZE_UNSTABLE while we dropped the lock.
+ */
+ if (!test_bit(FUSE_I_SIZE_UNSTABLE, &fi->state))
+ i_size_write(inode, attr->size);
+ } else {
+ i_size_write(inode, attr->size);
+ }
+ }
spin_unlock(&fi->lock);
if (!cache_mask && S_ISREG(inode->i_mode)) {
--
2.47.0
^ permalink raw reply related [flat|nested] 9+ messages in thread
* Re: [fuse-devel] Debugging a stale kernel cache during file growth
2026-04-16 7:24 ` [fuse-devel] Debugging a stale kernel cache during file growth Amir Goldstein
@ 2026-04-16 12:12 ` Miklos Szeredi
2026-04-16 12:41 ` Nikolay Amiantov
2026-04-16 22:54 ` Matthew Wilcox
0 siblings, 2 replies; 9+ messages in thread
From: Miklos Szeredi @ 2026-04-16 12:12 UTC (permalink / raw)
To: Matthew Wilcox
Cc: Nikolay Amiantov, fuse-devel, linux-fsdevel, Amir Goldstein,
fuse-devel, linux-mm
On Thu, 16 Apr 2026 at 09:24, Amir Goldstein <amir73il@gmail.com> wrote:
> > I've recently encountered a weird issue with JuiceFS [1], a network FS
> > which uses FUSE. tl;dr: when a file was being slowly appended, a reader
> > of the same file on another host would periodically read a block of zero
> > bytes instead of the actual data.
Thanks for the report.
I wonder if we could clear PG_uptodate on the page which had its zero
bytes exposed by the i_size increase?
Willy?
Thanks,
Miklos
* Re: [fuse-devel] Debugging a stale kernel cache during file growth
2026-04-16 12:12 ` Miklos Szeredi
@ 2026-04-16 12:41 ` Nikolay Amiantov
2026-04-16 12:49 ` Nikolay Amiantov
2026-04-16 23:19 ` Matthew Wilcox
2026-04-16 22:54 ` Matthew Wilcox
1 sibling, 2 replies; 9+ messages in thread
From: Nikolay Amiantov @ 2026-04-16 12:41 UTC (permalink / raw)
To: Miklos Szeredi, Matthew Wilcox
Cc: fuse-devel, linux-fsdevel, Amir Goldstein, fuse-devel, linux-mm
[-- Attachment #1.1: Type: text/plain, Size: 1336 bytes --]
On 4/16/26 19:12, Miklos Szeredi wrote:
> I wonder if we could clear PG_uptodate on the page which had its zero
> bytes exposed by the i_size increase?
I've actually tried that first. The idea was to get or create a new page
on the EOF boundary, lock it, and clear its uptodate flag if needed. But
this resulted in an instantaneous EIO in my test. If I understand
correctly, this is because of another race condition:
* A fresh page gets created and read by FUSE; uptodate is true;
* The page is unlocked on return from `fuse_read_folio`;
* Simultaneously, we run `getattr`. The page gets locked, uptodate is
reset, the page is unlocked;
* Now, back from `fuse_read_folio`, `filemap_read_folio` gets this page,
waits in `folio_wait_locked_killable` (while the `getattr` still holds
the lock), and then checks `folio_test_uptodate`;
* The page is !uptodate, so an EIO is returned.
So it effectively makes a successful `read` impossible whenever a
`getattr` for a growing file happens simultaneously.
Finally, if I understand correctly, this also leaves a (much smaller)
theoretical race condition in `filemap_read` between checking uptodate
and getting the current inode size.
Attached is the patch with this attempt; please check that it does what
you meant in case I misunderstood.
Cheers,
Nikolay.
[-- Attachment #2: 0001-fuse-fix-stale-page-cache-data-race-on-file-growth.patch --]
[-- Type: text/x-patch, Size: 1830 bytes --]
From 512194b982fd0edbc1dcaa50fafad75b1be26d42 Mon Sep 17 00:00:00 2001
From: Nikolay Amiantov <ab@fmap.me>
Date: Wed, 15 Apr 2026 07:28:19 +0000
Subject: [PATCH] fuse: fix stale page cache data race on file growth
---
fs/fuse/inode.c | 36 ++++++++++++++++++++++++++++++++++--
1 file changed, 34 insertions(+), 2 deletions(-)
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index 735abf426a06..20741869ac2f 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -334,10 +334,42 @@ void fuse_change_attributes(struct inode *inode, struct fuse_attr *attr,
* extend local i_size without keeping userspace server in sync. So,
* attr->size coming from server can be stale. We cannot trust it.
*/
- if (!(cache_mask & STATX_SIZE))
- i_size_write(inode, attr->size);
+ if (!(cache_mask & STATX_SIZE)) {
+ if (S_ISREG(inode->i_mode) && attr->size > oldsize) {
+ struct folio *folio;
+ pgoff_t index = oldsize >> PAGE_SHIFT;
+
+ spin_unlock(&fi->lock);
+ folio = __filemap_get_folio(inode->i_mapping, index,
+ FGP_LOCK | FGP_CREAT,
+ mapping_gfp_mask(inode->i_mapping));
+ if (!IS_ERR(folio)) {
+ spin_lock(&fi->lock);
+ if (!test_bit(FUSE_I_SIZE_UNSTABLE, &fi->state)) {
+ folio_clear_uptodate(folio);
+ i_size_write(inode, attr->size);
+ }
+ spin_unlock(&fi->lock);
+
+ folio_unlock(folio);
+ folio_put(folio);
+ goto size_updated;
+ }
+ spin_lock(&fi->lock);
+ /*
+ * Folio alloc failed (ENOMEM). Recheck in case a
+ * write/truncate started while we dropped the lock.
+ */
+ if (!test_bit(FUSE_I_SIZE_UNSTABLE, &fi->state))
+ i_size_write(inode, attr->size);
+ } else {
+ i_size_write(inode, attr->size);
+ }
+ }
spin_unlock(&fi->lock);
+size_updated:
+
if (!cache_mask && S_ISREG(inode->i_mode)) {
bool inval = false;
--
2.47.0
* Re: [fuse-devel] Debugging a stale kernel cache during file growth
2026-04-16 12:41 ` Nikolay Amiantov
@ 2026-04-16 12:49 ` Nikolay Amiantov
2026-04-16 23:19 ` Matthew Wilcox
1 sibling, 0 replies; 9+ messages in thread
From: Nikolay Amiantov @ 2026-04-16 12:49 UTC (permalink / raw)
To: Miklos Szeredi, Matthew Wilcox
Cc: fuse-devel, linux-fsdevel, fuse-devel, linux-mm
On 4/16/26 19:41, Nikolay Amiantov via fuse-devel wrote:
> Finally, if I understand correctly, this also leaves a (much smaller)
> theoretical race condition in `filemap_read` between checking uptodate
> and getting the current inode size.
Correction: this "would have resulted" in a race condition if we were
retrying to get a fresh folio instead of returning an EIO; I had assumed
that was the case when I tried the patch.
* Re: [fuse-devel] Debugging a stale kernel cache during file growth
2026-04-16 12:12 ` Miklos Szeredi
2026-04-16 12:41 ` Nikolay Amiantov
@ 2026-04-16 22:54 ` Matthew Wilcox
1 sibling, 0 replies; 9+ messages in thread
From: Matthew Wilcox @ 2026-04-16 22:54 UTC (permalink / raw)
To: Miklos Szeredi
Cc: Nikolay Amiantov, fuse-devel, linux-fsdevel, Amir Goldstein,
fuse-devel, linux-mm
On Thu, Apr 16, 2026 at 02:12:37PM +0200, Miklos Szeredi wrote:
> On Thu, 16 Apr 2026 at 09:24, Amir Goldstein <amir73il@gmail.com> wrote:
>
> > > I've recently encountered a weird issue with JuiceFS [1], a network FS
> > > which uses FUSE. tl;dr: when a file was being slowly appended, a reader
> > > of the same file on another host would periodically read a block of zero
> > > bytes instead of the actual data.
>
> Thanks for the report.
>
> I wonder if we could clear PG_uptodate on the page which had its zero
> bytes exposed by the i_size increase?
>
> Willy?
I think every filesystem which clears PG_uptodate is doing it wrong.
I know we have ~30 places which do it, and I haven't audited them all,
but clearing the uptodate bit can lead to the VM throwing an absolute
fit if any of the pages in that folio are mapped.
I don't think it'll make much difference whether it's cleared or
invalidated from the page cache. Either way we're re-reading all
the data in it, which would dominate the time saved by not doing a trip
through the page allocator.
* Re: [fuse-devel] Debugging a stale kernel cache during file growth
2026-04-16 12:41 ` Nikolay Amiantov
2026-04-16 12:49 ` Nikolay Amiantov
@ 2026-04-16 23:19 ` Matthew Wilcox
[not found] ` <800fa535-da92-41c0-bea9-40ee27639502@fmap.me>
2026-04-17 13:48 ` Miklos Szeredi
1 sibling, 2 replies; 9+ messages in thread
From: Matthew Wilcox @ 2026-04-16 23:19 UTC (permalink / raw)
To: Nikolay Amiantov
Cc: Miklos Szeredi, fuse-devel, linux-fsdevel, Amir Goldstein,
fuse-devel, linux-mm
On Thu, Apr 16, 2026 at 07:41:37PM +0700, Nikolay Amiantov wrote:
> +++ b/fs/fuse/inode.c
> @@ -334,10 +334,42 @@ void fuse_change_attributes(struct inode *inode, struct fuse_attr *attr,
> * extend local i_size without keeping userspace server in sync. So,
> * attr->size coming from server can be stale. We cannot trust it.
> */
> - if (!(cache_mask & STATX_SIZE))
> - i_size_write(inode, attr->size);
> + if (!(cache_mask & STATX_SIZE)) {
> + if (S_ISREG(inode->i_mode) && attr->size > oldsize) {
> + struct folio *folio;
> + pgoff_t index = oldsize >> PAGE_SHIFT;
> +
> + spin_unlock(&fi->lock);
> + folio = __filemap_get_folio(inode->i_mapping, index,
> + FGP_LOCK | FGP_CREAT,
> + mapping_gfp_mask(inode->i_mapping));
> + if (!IS_ERR(folio)) {
> + spin_lock(&fi->lock);
> + if (!test_bit(FUSE_I_SIZE_UNSTABLE, &fi->state)) {
> + folio_clear_uptodate(folio);
> + i_size_write(inode, attr->size);
> + }
> + spin_unlock(&fi->lock);
> +
> + folio_unlock(folio);
> + folio_put(folio);
> + goto size_updated;
Yes, at this point, you've left the folio in an error state. I'm sure you
didn't mean to do that, but the VFS interprets unlocked && !uptodate as
"an error happened" (there is a minor exception to this involving failed
readahead, but let's set that aside).
What you could do, rather than unlocking the folio here, is to initiate
a read of the folio and allow the read to unlock it. But I don't think
this is a good idea; I like the idea of invalidating the folio much
better.
* Re: [fuse-devel] Debugging a stale kernel cache during file growth
[not found] ` <800fa535-da92-41c0-bea9-40ee27639502@fmap.me>
@ 2026-04-17 6:30 ` Nikolay Amiantov
0 siblings, 0 replies; 9+ messages in thread
From: Nikolay Amiantov @ 2026-04-17 6:30 UTC (permalink / raw)
To: fuse-devel, linux-fsdevel
Re-sending this to the lists that rejected the previous message, since I
had failed to configure my email client to always use plain text.
On 4/17/26 13:24, Nikolay Amiantov via fuse-devel wrote:
> On 4/17/26 06:19, Matthew Wilcox wrote:
>> Yes, at this point, you've left the folio in an error state. I'm sure you
>> didn't mean to do that, but the VFS interprets unlocked && !uptodate as
>> "an error happened" (there is a minor exception to this involving failed
>> readahead, but let's set that aside).
>
> Thanks, I see!
>
> To save my reasoning somewhere: another way to do this would be
> NFS/CIFS-style, in a lazy way. They set a flag in `getattr` and
> invalidate later in `read()` instead. This could avoid relocking the
> spinlock; I still opted for invalidating inside `getattr`, though, since
> FUSE already has invalidation later in the same call, and the cost of
> relocking feels low to me in this case.
>
> Any ideas on how to resolve the remaining race condition [1]? If I'm
> correct, it affects any network FS and can't be fixed without changing
> the common VFS code somehow; I'd like someone to confirm my
> conclusions, though.
>
> I'm in over my head here, but willing to learn; if someone is willing
> to mentor me on designing the fix, I'd be happy to take it on. My best
> uneducated guess is to introduce another flag for a page and check it
> *after* we get the inode size in `filemap_read()`; if it's set, retry
> reading.
>
> 1: https://github.com/abbradar/nfs_stale_cache_test
>
* Re: [fuse-devel] Debugging a stale kernel cache during file growth
2026-04-16 23:19 ` Matthew Wilcox
[not found] ` <800fa535-da92-41c0-bea9-40ee27639502@fmap.me>
@ 2026-04-17 13:48 ` Miklos Szeredi
2026-05-04 16:49 ` Nikolay Amiantov
1 sibling, 1 reply; 9+ messages in thread
From: Miklos Szeredi @ 2026-04-17 13:48 UTC (permalink / raw)
To: Matthew Wilcox
Cc: Nikolay Amiantov, fuse-devel, linux-fsdevel, Amir Goldstein,
fuse-devel, linux-mm
On Fri, 17 Apr 2026 at 01:19, Matthew Wilcox <willy@infradead.org> wrote:
> What you could do, rather than unlock the folio here is to initiate a
> read of the folio and allow the read to unlock the folio. But I don't
> think this is a good idea, I like the idea of invalidating the folio
> much better.
There's still a race window if the page is invalidated after being
ref-ed by filemap_read() and before i_size is read.
Should that code check for a truncated page and retry?
Thanks,
Miklos
* Re: [fuse-devel] Debugging a stale kernel cache during file growth
2026-04-17 13:48 ` Miklos Szeredi
@ 2026-05-04 16:49 ` Nikolay Amiantov
0 siblings, 0 replies; 9+ messages in thread
From: Nikolay Amiantov @ 2026-05-04 16:49 UTC (permalink / raw)
To: Miklos Szeredi, Matthew Wilcox
Cc: fuse-devel, linux-fsdevel, Amir Goldstein, fuse-devel, linux-mm
Kindly bumping this discussion.
On 4/17/26 20:48, Miklos Szeredi wrote:
> Should that code check for a truncated page and retry?
If I understand you correctly, this can be accomplished with a new page
flag, because there's no other way to find out that the truncation
happened; would you say this is the right way forward?
Sadly, I haven't had time to make a PoC fix yet, but I want to try to
tackle it this weekend; meanwhile, any guidance is appreciated!
Thanks,
Nikolay.