Linux filesystem development
* Re: [fuse-devel] Debugging a stale kernel cache during file growth
       [not found] <898a4e10-6193-4671-b3b1-7c7bc562a671@fmap.me>
@ 2026-04-16  7:24 ` Amir Goldstein
  2026-04-16 12:12   ` Miklos Szeredi
  0 siblings, 1 reply; 9+ messages in thread
From: Amir Goldstein @ 2026-04-16  7:24 UTC (permalink / raw)
  To: Nikolay Amiantov; +Cc: fuse-devel, linux-fsdevel, fuse-devel

[-- Attachment #1: Type: text/plain, Size: 2243 bytes --]

[CC new fuse-devel list and fsdevel]

On Wed, Apr 15, 2026 at 7:24 PM Nikolay Amiantov via fuse-devel
<fuse-devel@lists.sourceforge.net> wrote:
>
> Hi everybody,
>
> I've recently encountered a weird issue with JuiceFS [1], a network FS
> which uses FUSE. tl;dr: when a file was being slowly appended, a reader
> of the same file on another host would periodically read a block of zero
> bytes instead of the actual data.
>
> While researching it I've built an MRE; turns out, this issue can be
> triggered on a fresh kernel and libfuse3 with:
> * A test FUSE FS which exposes a slowly growing file containing only
> 0xAA bytes, with disabled kernel stat cache and no handling of the read
> cache (so, with automatic cache handling by the kernel);
> * A script which reads the file sequentially checking for zeroes, while
> also hammering `os.stat` on the same file from separate threads.
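
A minimal sketch of that access pattern (the layout and names here are illustrative, not taken from the actual MRE; run against a local filesystem, as below, it will NOT reproduce the bug, since the page cache on a single host stays coherent — it only shows what the reproducer does against the FUSE mount):

```python
import os
import tempfile
import threading
import time

# Illustrative stand-in for the growing file on the test FUSE FS.
path = os.path.join(tempfile.mkdtemp(), "growing")
open(path, "wb").close()

stop = threading.Event()

def grower():
    # Slowly append 0xAA bytes, like the test FS's growing file.
    with open(path, "ab") as f:
        for _ in range(50):
            f.write(b"\xaa" * 64)
            f.flush()
            time.sleep(0.001)
    stop.set()

def stat_hammer():
    # Hammer stat on the same file from a separate thread.
    while not stop.is_set():
        os.stat(path)

def read_and_count_zeroes():
    # Read sequentially, counting zero bytes; the file only ever
    # contains 0xAA, so any zero byte indicates the stale-cache bug.
    zeroes = 0
    with open(path, "rb") as f:
        while True:
            chunk = f.read(4096)
            if chunk:
                zeroes += chunk.count(0)
            elif stop.is_set():
                return zeroes
            else:
                time.sleep(0.001)

threads = [threading.Thread(target=grower), threading.Thread(target=stat_hammer)]
for t in threads:
    t.start()
seen_zeroes = read_and_count_zeroes()
for t in threads:
    t.join()
print("zero bytes seen:", seen_zeroes)
```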
>
> I couldn't find any prior discussion of this issue. I'm suspecting a
> kernel bug; sadly, I have no prior experience with the kernel-side FUSE
> subsystem, but with a lot of (disclosure) LLM help explaining the FUSE
> module and the kernel read cache architecture, I think I understand what
> happens and implemented a partial fix which, to me, makes sense.
>
> If I understand correctly, a full fix is impossible without VFS-level
> locking changes; I've actually managed to reproduce a similar issue in
> NFS [2]; this may be applicable to any network FS which may increase an
> inode size outside of `read`/`write`s.
>
> All my code, experiments and a kernel patch which makes the issue less
> frequent (but still existing) can be found at
> https://github.com/abbradar/fuse_growtest
>
> Any help checking my findings and pointing me in the right direction is
> appreciated!
>
> Thanks for your help,
> Nikolay.
>
> [1]: https://github.com/juicedata/juicefs/issues/5038
> [2]: https://github.com/abbradar/nfs_stale_cache_test
>
>

Hi Nikolay,

Your question relates to kernel filesystems and you also mention NFS,
so I've added the fsdevel list for the attention of relevant developers
who are not subscribed to the fuse-devel list.

Also attaching your patch here for convenience.

Thanks,
Amir.

[-- Attachment #2: 0001-fuse-fix-stale-page-cache-data-race-on-file-growth.patch --]
[-- Type: text/x-patch, Size: 2943 bytes --]

From 007c8531c0644e259321b8e1b151003439f75ebf Mon Sep 17 00:00:00 2001
From: Nikolay Amiantov <ab@fmap.me>
Date: Wed, 15 Apr 2026 07:28:19 +0000
Subject: [PATCH] fuse: fix stale page cache data race on file growth

When a FUSE server reports a larger file size via lookup or getattr,
fuse_change_attributes() updates i_size before invalidating stale page
cache entries. The page straddling the old EOF contains kernel-generated
zero-fill beyond the old i_size. Once i_size is increased, these zeroes
become visible to concurrent readers until the cache is invalidated
later on.

Fix this by evicting the affected page before updating i_size. The fi->lock
spinlock is dropped before the invalidation (which can sleep), then
reacquired to recheck FUSE_I_SIZE_UNSTABLE before updating i_size.
---
 fs/fuse/inode.c | 38 ++++++++++++++++++++++++++++++++++++--
 1 file changed, 36 insertions(+), 2 deletions(-)

diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index 735abf426a06..61ae0f94e4dc 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -334,8 +334,42 @@ void fuse_change_attributes(struct inode *inode, struct fuse_attr *attr,
 	 * extend local i_size without keeping userspace server in sync. So,
 	 * attr->size coming from server can be stale. We cannot trust it.
 	 */
-	if (!(cache_mask & STATX_SIZE))
-		i_size_write(inode, attr->size);
+	if (!(cache_mask & STATX_SIZE)) {
+		/*
+		 * When a file grows remotely, the page straddling the old
+		 * EOF contains zero-fill beyond oldsize.  Those zeroes are
+		 * valid while i_size equals oldsize (they are beyond EOF),
+		 * but become stale once i_size is increased: concurrent
+		 * readers would see zeroes instead of the data written by
+		 * the remote host. Evict the affected page(s) BEFORE updating
+		 * i_size.  Any reader that re-populates the cache between the
+		 * invalidation and the i_size update will issue a fresh
+		 * FUSE_READ with the new data there.
+		 *
+		 * There is a residual race: a reader that has
+		 * already obtained a folio reference via the lockless
+		 * filemap_get_read_batch() but has not yet reached the
+		 * i_size_read() in filemap_read() would hold a ref that
+		 * prevents invalidate_inode_pages2_range() from evicting
+		 * the folio.  The reader would then use the new i_size
+		 * (read after our i_size_write) to copy stale data.
+		 */
+		if (S_ISREG(inode->i_mode) && attr->size > oldsize) {
+			spin_unlock(&fi->lock);
+			invalidate_inode_pages2_range(inode->i_mapping,
+						      oldsize >> PAGE_SHIFT,
+						      (attr->size - 1) >> PAGE_SHIFT);
+			spin_lock(&fi->lock);
+			/*
+			 * Recheck — a write or truncate may have set
+			 * FUSE_I_SIZE_UNSTABLE while we dropped the lock.
+			 */
+			if (!test_bit(FUSE_I_SIZE_UNSTABLE, &fi->state))
+				i_size_write(inode, attr->size);
+		} else {
+			i_size_write(inode, attr->size);
+		}
+	}
 	spin_unlock(&fi->lock);
 
 	if (!cache_mask && S_ISREG(inode->i_mode)) {
-- 
2.47.0


^ permalink raw reply related	[flat|nested] 9+ messages in thread

* Re: [fuse-devel] Debugging a stale kernel cache during file growth
  2026-04-16  7:24 ` [fuse-devel] Debugging a stale kernel cache during file growth Amir Goldstein
@ 2026-04-16 12:12   ` Miklos Szeredi
  2026-04-16 12:41     ` Nikolay Amiantov
  2026-04-16 22:54     ` Matthew Wilcox
  0 siblings, 2 replies; 9+ messages in thread
From: Miklos Szeredi @ 2026-04-16 12:12 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Nikolay Amiantov, fuse-devel, linux-fsdevel, Amir Goldstein,
	fuse-devel, linux-mm

On Thu, 16 Apr 2026 at 09:24, Amir Goldstein <amir73il@gmail.com> wrote:

> > I've recently encountered a weird issue with JuiceFS [1], a network FS
> > which uses FUSE. tl;dr: when a file was being slowly appended, a reader
> > of the same file on another host would periodically read a block of zero
> > bytes instead of the actual data.

Thanks for the report.

I wonder if we could clear PG_uptodate on the page which had its zero
bytes exposed by the i_size increase?

Willy?

Thanks,
Miklos


* Re: [fuse-devel] Debugging a stale kernel cache during file growth
  2026-04-16 12:12   ` Miklos Szeredi
@ 2026-04-16 12:41     ` Nikolay Amiantov
  2026-04-16 12:49       ` Nikolay Amiantov
  2026-04-16 23:19       ` Matthew Wilcox
  2026-04-16 22:54     ` Matthew Wilcox
  1 sibling, 2 replies; 9+ messages in thread
From: Nikolay Amiantov @ 2026-04-16 12:41 UTC (permalink / raw)
  To: Miklos Szeredi, Matthew Wilcox
  Cc: fuse-devel, linux-fsdevel, Amir Goldstein, fuse-devel, linux-mm


[-- Attachment #1.1: Type: text/plain, Size: 1336 bytes --]

On 4/16/26 19:12, Miklos Szeredi wrote:
> I wonder if we could clear PG_uptodate on the page which had its zero
> bytes exposed by the i_size increase?

I've actually tried that first. The idea was to get or create a new page 
on the EOF boundary, lock it, and poison it by clearing uptodate if 
needed. But this resulted in an instantaneous EIO in my test. If I 
understand correctly, this is because of another race condition:

* A fresh page gets created and read by FUSE; uptodate is true;
* The page is unlocked on return from `fuse_read_folio`;
* Simultaneously, we run `getattr`. The page gets locked, uptodate is 
reset, the page is unlocked;
* Now back from `fuse_read_folio`, `filemap_read_folio` gets this page, 
waits on `folio_wait_locked_killable` (waiting for the getattr to reset 
uptodate), and then checks `folio_test_uptodate`;
* The page is !uptodate, so an EIO is returned.

So it effectively makes it impossible for a `read` to succeed while a 
`getattr` for a growing file happens simultaneously.
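That interleaving can be modelled deterministically in userspace (all names here are illustrative stand-ins, not real kernel API; the calls run sequentially in one thread to pin down the exact ordering from the list above):

```python
import threading

EIO = 5

class Folio:
    def __init__(self):
        self.lock = threading.Lock()
        self.uptodate = False

def fuse_read_folio(folio):
    # The FS fills the folio, marks it uptodate, and unlocks it.
    folio.uptodate = True
    folio.lock.release()

def getattr_clears_uptodate(folio):
    # The attempted fix: lock the folio and poison its uptodate bit.
    with folio.lock:
        folio.uptodate = False

def filemap_read_folio(folio):
    folio.lock.acquire()                # the folio enters locked
    fuse_read_folio(folio)              # read completes, folio unlocked
    getattr_clears_uptodate(folio)      # ...and getattr races in here
    # Back in the read path: wait on the folio lock, then check the
    # uptodate bit -- the folio_wait_locked_killable() plus
    # folio_test_uptodate() steps from the sequence above.
    with folio.lock:
        if not folio.uptodate:
            return -EIO                 # unlocked && !uptodate => error
    return 0

result = filemap_read_folio(Folio())
print(result)   # -5: the getattr wins the race and the read fails
```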

Finally, if I understand correctly, this also leaves a (much smaller) 
theoretical race condition in `filemap_read` between checking uptodate 
and getting the current inode size.

Attached is the patch with this attempt; please check that it does what 
you meant in case I misunderstood.

Cheers,
Nikolay.

[-- Attachment #1.2: Type: text/html, Size: 1901 bytes --]

[-- Attachment #2: 0001-fuse-fix-stale-page-cache-data-race-on-file-growth.patch --]
[-- Type: text/x-patch, Size: 1830 bytes --]

From 512194b982fd0edbc1dcaa50fafad75b1be26d42 Mon Sep 17 00:00:00 2001
From: Nikolay Amiantov <ab@fmap.me>
Date: Wed, 15 Apr 2026 07:28:19 +0000
Subject: [PATCH] fuse: fix stale page cache data race on file growth

---
 fs/fuse/inode.c | 36 ++++++++++++++++++++++++++++++++++--
 1 file changed, 34 insertions(+), 2 deletions(-)

diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index 735abf426a06..20741869ac2f 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -334,10 +334,42 @@ void fuse_change_attributes(struct inode *inode, struct fuse_attr *attr,
 	 * extend local i_size without keeping userspace server in sync. So,
 	 * attr->size coming from server can be stale. We cannot trust it.
 	 */
-	if (!(cache_mask & STATX_SIZE))
-		i_size_write(inode, attr->size);
+	if (!(cache_mask & STATX_SIZE)) {
+		if (S_ISREG(inode->i_mode) && attr->size > oldsize) {
+			struct folio *folio;
+			pgoff_t index = oldsize >> PAGE_SHIFT;
+
+			spin_unlock(&fi->lock);
+			folio = __filemap_get_folio(inode->i_mapping, index,
+						    FGP_LOCK | FGP_CREAT,
+						    mapping_gfp_mask(inode->i_mapping));
+			if (!IS_ERR(folio)) {
+				spin_lock(&fi->lock);
+				if (!test_bit(FUSE_I_SIZE_UNSTABLE, &fi->state)) {
+					folio_clear_uptodate(folio);
+					i_size_write(inode, attr->size);
+				}
+				spin_unlock(&fi->lock);
+
+				folio_unlock(folio);
+				folio_put(folio);
+				goto size_updated;
+			}
+			spin_lock(&fi->lock);
+			/*
+			 * Folio alloc failed (ENOMEM). Recheck in case a
+			 * write/truncate started while we dropped the lock.
+			 */
+			if (!test_bit(FUSE_I_SIZE_UNSTABLE, &fi->state))
+				i_size_write(inode, attr->size);
+		} else {
+			i_size_write(inode, attr->size);
+		}
+	}
 	spin_unlock(&fi->lock);
 
+size_updated:
+
 	if (!cache_mask && S_ISREG(inode->i_mode)) {
 		bool inval = false;
 
-- 
2.47.0



* Re: [fuse-devel] Debugging a stale kernel cache during file growth
  2026-04-16 12:41     ` Nikolay Amiantov
@ 2026-04-16 12:49       ` Nikolay Amiantov
  2026-04-16 23:19       ` Matthew Wilcox
  1 sibling, 0 replies; 9+ messages in thread
From: Nikolay Amiantov @ 2026-04-16 12:49 UTC (permalink / raw)
  To: Miklos Szeredi, Matthew Wilcox
  Cc: fuse-devel, linux-fsdevel, fuse-devel, linux-mm

On 4/16/26 19:41, Nikolay Amiantov via fuse-devel wrote:
> Finally, if I understand correctly, this also leaves a (much smaller) 
> theoretical race condition in `filemap_read` between checking uptodate 
> and getting the current inode size.
Correction: it "would have resulted" in a race condition if we were 
retrying to get a fresh folio instead of returning an EIO; I had 
assumed that was the case when I tried the patch.


* Re: [fuse-devel] Debugging a stale kernel cache during file growth
  2026-04-16 12:12   ` Miklos Szeredi
  2026-04-16 12:41     ` Nikolay Amiantov
@ 2026-04-16 22:54     ` Matthew Wilcox
  1 sibling, 0 replies; 9+ messages in thread
From: Matthew Wilcox @ 2026-04-16 22:54 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: Nikolay Amiantov, fuse-devel, linux-fsdevel, Amir Goldstein,
	fuse-devel, linux-mm

On Thu, Apr 16, 2026 at 02:12:37PM +0200, Miklos Szeredi wrote:
> On Thu, 16 Apr 2026 at 09:24, Amir Goldstein <amir73il@gmail.com> wrote:
> 
> > > I've recently encountered a weird issue with JuiceFS [1], a network FS
> > > which uses FUSE. tl;dr: when a file was being slowly appended, a reader
> > > of the same file on another host would periodically read a block of zero
> > > bytes instead of the actual data.
> 
> Thanks for the report.
> 
> I wonder if we could clear PG_uptodate on the page which had its zero
> bytes exposed by the i_size increase?
> 
> Willy?

I think every filesystem which clears PG_uptodate is doing it wrong.
I know we have ~30 places which do it, and I haven't audited them all, 
but clearing the uptodate bit can lead to the VM throwing an absolute
fit if any of the pages in that folio are mapped.

I don't think it'll make much difference whether it's cleared or
invalidated from the page cache.  Either way we're re-reading all
the data in it, which would dominate the time saved by not doing a trip
through the page allocator.


* Re: [fuse-devel] Debugging a stale kernel cache during file growth
  2026-04-16 12:41     ` Nikolay Amiantov
  2026-04-16 12:49       ` Nikolay Amiantov
@ 2026-04-16 23:19       ` Matthew Wilcox
       [not found]         ` <800fa535-da92-41c0-bea9-40ee27639502@fmap.me>
  2026-04-17 13:48         ` Miklos Szeredi
  1 sibling, 2 replies; 9+ messages in thread
From: Matthew Wilcox @ 2026-04-16 23:19 UTC (permalink / raw)
  To: Nikolay Amiantov
  Cc: Miklos Szeredi, fuse-devel, linux-fsdevel, Amir Goldstein,
	fuse-devel, linux-mm

On Thu, Apr 16, 2026 at 07:41:37PM +0700, Nikolay Amiantov wrote:
> +++ b/fs/fuse/inode.c
> @@ -334,10 +334,42 @@ void fuse_change_attributes(struct inode *inode, struct fuse_attr *attr,
>  	 * extend local i_size without keeping userspace server in sync. So,
>  	 * attr->size coming from server can be stale. We cannot trust it.
>  	 */
> -	if (!(cache_mask & STATX_SIZE))
> -		i_size_write(inode, attr->size);
> +	if (!(cache_mask & STATX_SIZE)) {
> +		if (S_ISREG(inode->i_mode) && attr->size > oldsize) {
> +			struct folio *folio;
> +			pgoff_t index = oldsize >> PAGE_SHIFT;
> +
> +			spin_unlock(&fi->lock);
> +			folio = __filemap_get_folio(inode->i_mapping, index,
> +						    FGP_LOCK | FGP_CREAT,
> +						    mapping_gfp_mask(inode->i_mapping));
> +			if (!IS_ERR(folio)) {
> +				spin_lock(&fi->lock);
> +				if (!test_bit(FUSE_I_SIZE_UNSTABLE, &fi->state)) {
> +					folio_clear_uptodate(folio);
> +					i_size_write(inode, attr->size);
> +				}
> +				spin_unlock(&fi->lock);
> +
> +				folio_unlock(folio);
> +				folio_put(folio);
> +				goto size_updated;

Yes, at this point, you've left the folio in an error state.  I'm sure you
didn't mean to do that, but the VFS interprets unlocked && !uptodate as
"an error happened" (there is a minor exception to this involving failed
readahead, but let's set that aside).

What you could do, rather than unlock the folio here is to initiate a
read of the folio and allow the read to unlock the folio.  But I don't
think this is a good idea, I like the idea of invalidating the folio
much better.



* Re: [fuse-devel] Debugging a stale kernel cache during file growth
       [not found]         ` <800fa535-da92-41c0-bea9-40ee27639502@fmap.me>
@ 2026-04-17  6:30           ` Nikolay Amiantov
  0 siblings, 0 replies; 9+ messages in thread
From: Nikolay Amiantov @ 2026-04-17  6:30 UTC (permalink / raw)
  To: fuse-devel, linux-fsdevel

Re-sending this to the lists that rejected the previous message, since I 
failed to configure the email client to always use plain text.

On 4/17/26 13:24, Nikolay Amiantov via fuse-devel wrote:
> On 4/17/26 06:19, Matthew Wilcox wrote:
>> Yes, at this point, you've left the folio in an error state.  I'm sure you
>> didn't mean to do that, but the VFS interprets unlocked && !uptodate as
>> "an error happened" (there is a minor exception to this involving failed
>> readahead, but let's set that aside).
>
> Thanks, I see!
>
> To save my reasoning somewhere: another way to do this would be 
> NFS/CIFS-style, in a lazy way. They set a flag in `getattr` and 
> invalidate later in `read()` instead. This could avoid relocking the 
> spinlock; I still opted for invalidating inside `getattr` though since 
> FUSE already has invalidation later in the same call, and the cost of 
> relocking feels low to me in this case.
>
> Any ideas on how to resolve the remaining race condition [1]? If I'm 
> correct it affects any network FS, and can't be fixed without changing 
> the common VFS code somehow. I'd like someone to confirm my 
> conclusions though.
>
> I'm in way over my head here, though I'm willing to learn; if someone 
> is willing to mentor me on designing the fix, I'd be grateful. My best 
> uneducated guess is to introduce another page flag and check it 
> *after* we get the inode size in `filemap_read()`; if it's set, retry 
> the read.
>
> 1: https://github.com/abbradar/nfs_stale_cache_test
>
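
The retry idea in the quoted message can be sketched as a userspace toy (every name is invented for illustration, and the race is injected deterministically; without the `stale` recheck, this loop would return the stale zero-fill bytes):

```python
class Page:
    def __init__(self, data):
        self.data = data
        self.stale = False        # the hypothetical new page flag

old = Page(b"\xaa\xaa\x00\x00")   # zero-fill beyond the old EOF (size 2)
new = Page(b"\xaa\xaa\xaa\xaa")   # what the remote writer actually wrote
state = {"page": old, "size": 2, "lookups": 0}

def get_page():
    # Stands in for the lockless filemap_get_read_batch() page lookup.
    state["lookups"] += 1
    return state["page"]

def read_size():
    # Stands in for i_size_read(); on the first pass we inject the race
    # here: invalidation and the size update land after the reader has
    # already taken its page reference.
    if state["lookups"] == 1:
        old.stale = True          # eviction marks the old page
        state["page"] = new
        state["size"] = 4
    return state["size"]

def filemap_read():
    while True:
        page = get_page()
        size = read_size()
        if page.stale:            # proposed: recheck AFTER sampling i_size
            continue              # page was invalidated under us -- retry
        return page.data[:size]

data = filemap_read()
print(data)                       # the retry picks up the fresh page
```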


* Re: [fuse-devel] Debugging a stale kernel cache during file growth
  2026-04-16 23:19       ` Matthew Wilcox
       [not found]         ` <800fa535-da92-41c0-bea9-40ee27639502@fmap.me>
@ 2026-04-17 13:48         ` Miklos Szeredi
  2026-05-04 16:49           ` Nikolay Amiantov
  1 sibling, 1 reply; 9+ messages in thread
From: Miklos Szeredi @ 2026-04-17 13:48 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Nikolay Amiantov, fuse-devel, linux-fsdevel, Amir Goldstein,
	fuse-devel, linux-mm

On Fri, 17 Apr 2026 at 01:19, Matthew Wilcox <willy@infradead.org> wrote:
> What you could do, rather than unlock the folio here is to initiate a
> read of the folio and allow the read to unlock the folio.  But I don't
> think this is a good idea, I like the idea of invalidating the folio
> much better.

There's still a race window if the page is invalidated after being
ref-ed by filemap_read() and before i_size is read.

Should that code check for a truncated page and retry?

Thanks,
Miklos


* Re: [fuse-devel] Debugging a stale kernel cache during file growth
  2026-04-17 13:48         ` Miklos Szeredi
@ 2026-05-04 16:49           ` Nikolay Amiantov
  0 siblings, 0 replies; 9+ messages in thread
From: Nikolay Amiantov @ 2026-05-04 16:49 UTC (permalink / raw)
  To: Miklos Szeredi, Matthew Wilcox
  Cc: fuse-devel, linux-fsdevel, Amir Goldstein, fuse-devel, linux-mm

Kindly bringing this discussion back up.

On 4/17/26 20:48, Miklos Szeredi wrote:
> Should that code check for a truncated page and retry?

If I understand you correctly, this can be accomplished with a new page 
flag, because there's no other way to find out that the truncation 
happened; would you say this is the right way forward?

Sadly I haven't had time to make a PoC fix yet, but I want to try to 
tackle it this weekend; meanwhile, any guidance is appreciated!

Thanks,
Nikolay.


