* Re: [fuse-devel] Debugging a stale kernel cache during file growth
  [not found] <898a4e10-6193-4671-b3b1-7c7bc562a671@fmap.me>
@ 2026-04-16  7:24 ` Amir Goldstein
  2026-04-16 12:12   ` Miklos Szeredi
  0 siblings, 1 reply; 9+ messages in thread
From: Amir Goldstein @ 2026-04-16 7:24 UTC (permalink / raw)
  To: Nikolay Amiantov; +Cc: fuse-devel, linux-fsdevel, fuse-devel

[CC new fuse-devel list and fsdevel]

On Wed, Apr 15, 2026 at 7:24 PM Nikolay Amiantov via fuse-devel
<fuse-devel@lists.sourceforge.net> wrote:
>
> Hi everybody,
>
> I've recently encountered a weird issue with JuiceFS [1], a network FS
> which uses FUSE. tl;dr: when a file was being slowly appended, a reader
> of the same file on another host would periodically read a block of zero
> bytes instead of the actual data.
>
> While researching it I've built an MRE; it turns out this issue can be
> triggered on a fresh kernel and libfuse3 with:
> * A test FUSE FS which exposes a slowly growing file containing only
>   0xAA bytes, with the kernel stat cache disabled and no handling of the
>   read cache (so, with automatic cache handling by the kernel);
> * A script which reads the file sequentially checking for zeroes, while
>   also hammering `os.stat` on the same file from separate threads.
>
> I couldn't find any prior discussion of this issue. I suspect a
> kernel bug; sadly, I have no prior experience with the kernel-side FUSE
> subsystem, but with a lot of (disclosure) LLM help explaining the FUSE
> module and the kernel read cache architecture, I think I understand what
> happens and have implemented a partial fix which, to me, makes sense.
>
> If I understand correctly, a full fix is impossible without VFS-level
> locking changes; I've actually managed to reproduce a similar issue in
> NFS [2], so this may be applicable to any network FS which can increase
> an inode size outside of `read`/`write`s.
>
> All my code, experiments and a kernel patch which makes the issue less
> frequent (but still present) can be found at
> https://github.com/abbradar/fuse_growtest
>
> Any help checking my findings and pointing me in the right direction is
> appreciated!
>
> Thanks for your help,
> Nikolay.
>
> [1]: https://github.com/juicedata/juicefs/issues/5038
> [2]: https://github.com/abbradar/nfs_stale_cache_test

Hi Nikolay,

Your question relates to kernel filesystems and you also mention NFS,
so I have added the fsdevel list for the attention of relevant
developers who are not subscribed to the fuse-devel list.

Also attaching your patch here for convenience.

Thanks,
Amir.

[-- Attachment #2: 0001-fuse-fix-stale-page-cache-data-race-on-file-growth.patch --]

From 007c8531c0644e259321b8e1b151003439f75ebf Mon Sep 17 00:00:00 2001
From: Nikolay Amiantov <ab@fmap.me>
Date: Wed, 15 Apr 2026 07:28:19 +0000
Subject: [PATCH] fuse: fix stale page cache data race on file growth

When a FUSE server reports a larger file size via lookup or getattr,
fuse_change_attributes() updates i_size before invalidating stale page
cache entries. The page before the EOF contains kernel-generated
zero-fill beyond the old i_size. Once i_size is increased, these zeroes
become visible to concurrent readers before we invalidate the cache
later on.

Fix this by evicting the affected page before updating i_size. The
fi->lock spinlock is dropped before the invalidation (which can sleep),
then reacquired to recheck FUSE_I_SIZE_UNSTABLE before updating i_size.
---
 fs/fuse/inode.c | 38 ++++++++++++++++++++++++++++++++++++--
 1 file changed, 36 insertions(+), 2 deletions(-)

diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index 735abf426a06..61ae0f94e4dc 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -334,8 +334,42 @@ void fuse_change_attributes(struct inode *inode, struct fuse_attr *attr,
 	 * extend local i_size without keeping userspace server in sync. So,
 	 * attr->size coming from server can be stale. We cannot trust it.
 	 */
-	if (!(cache_mask & STATX_SIZE))
-		i_size_write(inode, attr->size);
+	if (!(cache_mask & STATX_SIZE)) {
+		/*
+		 * When a file grows remotely, the page straddling the old
+		 * EOF contains zero-fill beyond oldsize. Those zeroes are
+		 * valid while i_size equals oldsize (they are beyond EOF),
+		 * but become stale once i_size is increased: concurrent
+		 * readers would see zeroes instead of the data written by
+		 * the remote host. Evict the affected page(s) BEFORE
+		 * updating i_size. Any reader that re-populates the cache
+		 * between the invalidation and the i_size update will
+		 * issue a fresh FUSE_READ and find the new data there.
+		 *
+		 * There is a residual race: a reader that has already
+		 * obtained a folio reference via the lockless
+		 * filemap_get_read_batch() but has not yet reached the
+		 * i_size_read() in filemap_read() holds a ref that
+		 * prevents invalidate_inode_pages2_range() from evicting
+		 * the folio. That reader then uses the new i_size (read
+		 * after our i_size_write) to copy stale data.
+		 */
+		if (S_ISREG(inode->i_mode) && attr->size > oldsize) {
+			spin_unlock(&fi->lock);
+			invalidate_inode_pages2_range(inode->i_mapping,
+					oldsize >> PAGE_SHIFT,
+					(attr->size - 1) >> PAGE_SHIFT);
+			spin_lock(&fi->lock);
+			/*
+			 * Recheck: a write or truncate may have set
+			 * FUSE_I_SIZE_UNSTABLE while we dropped the lock.
+			 */
+			if (!test_bit(FUSE_I_SIZE_UNSTABLE, &fi->state))
+				i_size_write(inode, attr->size);
+		} else {
+			i_size_write(inode, attr->size);
+		}
+	}
 	spin_unlock(&fi->lock);
 
 	if (!cache_mask && S_ISREG(inode->i_mode)) {
-- 
2.47.0
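The reproducer described in the report (a sequential reader racing `os.stat`-hammering threads against a slowly growing 0xAA-filled file) can be sketched roughly as below. This is an illustrative stand-in for the actual MRE in the fuse_growtest repository, so all names and constants are assumptions; the local-temp-file fallback in `demo()` only exercises the access pattern, and a coherent local page cache will not show the zero-fill that the FUSE mount does.

```python
import os
import tempfile
import threading
import time

FILL = 0xAA        # the test FS in the report serves only 0xAA bytes
BLOCK = 4096

def hammer_stat(path, stop):
    # stat() drives FUSE getattr, which is what publishes the grown
    # i_size (and, in the buggy window, exposes stale zero-fill).
    while not stop.is_set():
        os.stat(path)

def first_bad_offset(path, total):
    """Read `path` sequentially; return the offset of the first chunk
    containing a byte other than FILL (e.g. zero-fill), or -1."""
    pos = 0
    with open(path, "rb") as f:
        while pos < total:
            chunk = f.read(BLOCK)
            if not chunk:
                time.sleep(0.001)  # at EOF: the writer hasn't caught up
                continue
            if chunk.count(FILL) != len(chunk):
                return pos
            pos += len(chunk)
    return -1

def demo(path=None, total=64 * BLOCK):
    # With no path given, fall back to a local temp file plus an
    # appender thread; the local page cache is coherent, so this
    # mode should always return -1.
    own_file = path is None
    if own_file:
        fd, path = tempfile.mkstemp()
        os.close(fd)
    stop = threading.Event()

    def appender():
        with open(path, "ab") as f:
            for _ in range(total // 512):
                f.write(bytes([FILL]) * 512)
                f.flush()

    threads = [threading.Thread(target=appender)] if own_file else []
    threads += [threading.Thread(target=hammer_stat, args=(path, stop))
                for _ in range(2)]
    for t in threads:
        t.start()
    try:
        return first_bad_offset(path, total)
    finally:
        stop.set()
        for t in threads:
            t.join()
        if own_file:
            os.unlink(path)
```

Pointed at a file on the affected FUSE mount (e.g. a hypothetical `demo("/mnt/growtest/file", total)`), a non-negative return would indicate the reader observed zero-fill; `demo()` with no arguments exercises the same pattern on a local file.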
* Re: [fuse-devel] Debugging a stale kernel cache during file growth
  2026-04-16  7:24 ` [fuse-devel] Debugging a stale kernel cache during file growth Amir Goldstein
@ 2026-04-16 12:12   ` Miklos Szeredi
  2026-04-16 12:41     ` Nikolay Amiantov
  2026-04-16 22:54     ` Matthew Wilcox
  0 siblings, 2 replies; 9+ messages in thread
From: Miklos Szeredi @ 2026-04-16 12:12 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Nikolay Amiantov, fuse-devel, linux-fsdevel, Amir Goldstein,
	fuse-devel, linux-mm

On Thu, 16 Apr 2026 at 09:24, Amir Goldstein <amir73il@gmail.com> wrote:
> > I've recently encountered a weird issue with JuiceFS [1], a network FS
> > which uses FUSE. tl;dr: when a file was being slowly appended, a reader
> > of the same file on another host would periodically read a block of zero
> > bytes instead of the actual data.

Thanks for the report.

I wonder if we could clear PG_uptodate on the page which had its zero
bytes exposed by the i_size increase?

Willy?

Thanks,
Miklos
* Re: [fuse-devel] Debugging a stale kernel cache during file growth
  2026-04-16 12:12 ` Miklos Szeredi
@ 2026-04-16 12:41   ` Nikolay Amiantov
  2026-04-16 12:49     ` Nikolay Amiantov
  2026-04-16 23:19     ` Matthew Wilcox
  1 sibling, 2 replies; 9+ messages in thread
From: Nikolay Amiantov @ 2026-04-16 12:41 UTC (permalink / raw)
  To: Miklos Szeredi, Matthew Wilcox
  Cc: fuse-devel, linux-fsdevel, Amir Goldstein, fuse-devel, linux-mm

On 4/16/26 19:12, Miklos Szeredi wrote:
> I wonder if we could clear PG_uptodate on the page which had its zero
> bytes exposed by the i_size increase?

I actually tried that first. The idea was to get or create the page on
the EOF boundary, lock it, and poison it by resetting uptodate when
needed. But this resulted in an instantaneous EIO in my test. If I
understand correctly, this is because of another race condition:

* A fresh page gets created and read by FUSE; uptodate is true;
* The page is unlocked on return from `fuse_read_folio`;
* Simultaneously, we run `getattr`. The page gets locked, uptodate is
  reset, the page is unlocked;
* Now, back from `fuse_read_folio`, `filemap_read_folio` gets this page,
  waits on `folio_wait_locked_killable` (i.e. for the getattr that
  resets uptodate to finish), and then checks `folio_test_uptodate`;
* The page is !uptodate, so an EIO is returned.

So it effectively makes a successful `read` impossible whenever a
`getattr` for a growing file happens simultaneously.

Finally, if I understand correctly, this also leaves a (much smaller)
theoretical race window in `filemap_read` between checking uptodate and
getting the current inode size.

Attached is the patch with this attempt; please check that it does what
you meant, in case I misunderstood.

Cheers,
Nikolay.
[-- Attachment #2: 0001-fuse-fix-stale-page-cache-data-race-on-file-growth.patch --]

From 512194b982fd0edbc1dcaa50fafad75b1be26d42 Mon Sep 17 00:00:00 2001
From: Nikolay Amiantov <ab@fmap.me>
Date: Wed, 15 Apr 2026 07:28:19 +0000
Subject: [PATCH] fuse: fix stale page cache data race on file growth

---
 fs/fuse/inode.c | 36 ++++++++++++++++++++++++++++++++++--
 1 file changed, 34 insertions(+), 2 deletions(-)

diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index 735abf426a06..20741869ac2f 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -334,10 +334,42 @@ void fuse_change_attributes(struct inode *inode, struct fuse_attr *attr,
 	 * extend local i_size without keeping userspace server in sync. So,
 	 * attr->size coming from server can be stale. We cannot trust it.
 	 */
-	if (!(cache_mask & STATX_SIZE))
-		i_size_write(inode, attr->size);
+	if (!(cache_mask & STATX_SIZE)) {
+		if (S_ISREG(inode->i_mode) && attr->size > oldsize) {
+			struct folio *folio;
+			pgoff_t index = oldsize >> PAGE_SHIFT;
+
+			spin_unlock(&fi->lock);
+			folio = __filemap_get_folio(inode->i_mapping, index,
+					FGP_LOCK | FGP_CREAT,
+					mapping_gfp_mask(inode->i_mapping));
+			if (!IS_ERR(folio)) {
+				spin_lock(&fi->lock);
+				if (!test_bit(FUSE_I_SIZE_UNSTABLE, &fi->state)) {
+					folio_clear_uptodate(folio);
+					i_size_write(inode, attr->size);
+				}
+				spin_unlock(&fi->lock);
+
+				folio_unlock(folio);
+				folio_put(folio);
+				goto size_updated;
+			}
+			spin_lock(&fi->lock);
+			/*
+			 * Folio alloc failed (ENOMEM). Recheck in case a
+			 * write/truncate started while we dropped the lock.
+			 */
+			if (!test_bit(FUSE_I_SIZE_UNSTABLE, &fi->state))
+				i_size_write(inode, attr->size);
+		} else {
+			i_size_write(inode, attr->size);
+		}
+	}
 	spin_unlock(&fi->lock);
 
+size_updated:
 	if (!cache_mask && S_ISREG(inode->i_mode)) {
 		bool inval = false;
-- 
2.47.0
* Re: [fuse-devel] Debugging a stale kernel cache during file growth
  2026-04-16 12:41 ` Nikolay Amiantov
@ 2026-04-16 12:49   ` Nikolay Amiantov
  2026-04-16 23:19   ` Matthew Wilcox
  1 sibling, 0 replies; 9+ messages in thread
From: Nikolay Amiantov @ 2026-04-16 12:49 UTC (permalink / raw)
  To: Miklos Szeredi, Matthew Wilcox
  Cc: fuse-devel, linux-fsdevel, fuse-devel, linux-mm

On 4/16/26 19:41, Nikolay Amiantov via fuse-devel wrote:
> Finally, if I understand correctly, this also leaves a (much smaller)
> theoretical race condition in `filemap_read` between checking uptodate
> and getting the current inode size.

Correction: it "would have resulted" in a race condition if we retried
with a fresh folio instead of returning an EIO; I had assumed that was
the case when I tried the patch.
* Re: [fuse-devel] Debugging a stale kernel cache during file growth
  2026-04-16 12:41 ` Nikolay Amiantov
  2026-04-16 12:49   ` Nikolay Amiantov
@ 2026-04-16 23:19   ` Matthew Wilcox
  [not found]     ` <800fa535-da92-41c0-bea9-40ee27639502@fmap.me>
  2026-04-17 13:48     ` Miklos Szeredi
  1 sibling, 2 replies; 9+ messages in thread
From: Matthew Wilcox @ 2026-04-16 23:19 UTC (permalink / raw)
  To: Nikolay Amiantov
  Cc: Miklos Szeredi, fuse-devel, linux-fsdevel, Amir Goldstein,
	fuse-devel, linux-mm

On Thu, Apr 16, 2026 at 07:41:37PM +0700, Nikolay Amiantov wrote:
> +++ b/fs/fuse/inode.c
> @@ -334,10 +334,42 @@ void fuse_change_attributes(struct inode *inode, struct fuse_attr *attr,
>  	 * extend local i_size without keeping userspace server in sync. So,
>  	 * attr->size coming from server can be stale. We cannot trust it.
>  	 */
> -	if (!(cache_mask & STATX_SIZE))
> -		i_size_write(inode, attr->size);
> +	if (!(cache_mask & STATX_SIZE)) {
> +		if (S_ISREG(inode->i_mode) && attr->size > oldsize) {
> +			struct folio *folio;
> +			pgoff_t index = oldsize >> PAGE_SHIFT;
> +
> +			spin_unlock(&fi->lock);
> +			folio = __filemap_get_folio(inode->i_mapping, index,
> +					FGP_LOCK | FGP_CREAT,
> +					mapping_gfp_mask(inode->i_mapping));
> +			if (!IS_ERR(folio)) {
> +				spin_lock(&fi->lock);
> +				if (!test_bit(FUSE_I_SIZE_UNSTABLE, &fi->state)) {
> +					folio_clear_uptodate(folio);
> +					i_size_write(inode, attr->size);
> +				}
> +				spin_unlock(&fi->lock);
> +
> +				folio_unlock(folio);
> +				folio_put(folio);
> +				goto size_updated;

Yes, at this point you've left the folio in an error state. I'm sure you
didn't mean to do that, but the VFS interprets unlocked && !uptodate as
"an error happened" (there is a minor exception to this involving failed
readahead, but let's set that aside).

What you could do, rather than unlock the folio here, is initiate a read
of the folio and allow the read to unlock it. But I don't think this is
a good idea; I like the idea of invalidating the folio much better.
[parent not found: <800fa535-da92-41c0-bea9-40ee27639502@fmap.me>]
* Re: [fuse-devel] Debugging a stale kernel cache during file growth
  [not found] ` <800fa535-da92-41c0-bea9-40ee27639502@fmap.me>
@ 2026-04-17  6:30   ` Nikolay Amiantov
  0 siblings, 0 replies; 9+ messages in thread
From: Nikolay Amiantov @ 2026-04-17 6:30 UTC (permalink / raw)
  To: fuse-devel, linux-fsdevel

Re-sending this to the lists that rejected the previous message, since I
had failed to configure my email client to always use plain text.

On 4/17/26 13:24, Nikolay Amiantov via fuse-devel wrote:
> On 4/17/26 06:19, Matthew Wilcox wrote:
>> Yes, at this point you've left the folio in an error state. I'm sure
>> you didn't mean to do that, but the VFS interprets unlocked &&
>> !uptodate as "an error happened" (there is a minor exception to this
>> involving failed readahead, but let's set that aside).
>
> Thanks, I see!
>
> To save my reasoning somewhere: another way to do this would be
> NFS/CIFS-style, in a lazy way: set a flag in `getattr` and invalidate
> later, in `read()`. This could avoid relocking the spinlock; I still
> opted for invalidating inside `getattr`, though, since FUSE already
> has invalidation later in the same call, and the cost of relocking
> feels low to me in this case.
>
> Any ideas on how to resolve the remaining race condition [1]? If I'm
> correct, it affects any network FS and can't be fixed without changing
> the common VFS code somehow. I'd like someone to confirm my
> conclusions, though.
>
> I'm in way over my head here, but willing to learn; if someone is
> willing to mentor me on designing the fix, I'd be grateful. My best
> uneducated guess is to introduce another flag for a page and check it
> *after* we get the inode size in `filemap_read()`; if it's set, retry
> reading.
>
> 1: https://github.com/abbradar/nfs_stale_cache_test
* Re: [fuse-devel] Debugging a stale kernel cache during file growth
  2026-04-16 23:19 ` Matthew Wilcox
  [not found]   ` <800fa535-da92-41c0-bea9-40ee27639502@fmap.me>
@ 2026-04-17 13:48   ` Miklos Szeredi
  2026-05-04 16:49     ` Nikolay Amiantov
  1 sibling, 1 reply; 9+ messages in thread
From: Miklos Szeredi @ 2026-04-17 13:48 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Nikolay Amiantov, fuse-devel, linux-fsdevel, Amir Goldstein,
	fuse-devel, linux-mm

On Fri, 17 Apr 2026 at 01:19, Matthew Wilcox <willy@infradead.org> wrote:
> What you could do, rather than unlock the folio here is to initiate a
> read of the folio and allow the read to unlock the folio. But I don't
> think this is a good idea, I like the idea of invalidating the folio
> much better.

There's still a race window if the page is invalidated after being
ref-ed by filemap_read() and before i_size is read. Should that code
check for a truncated page and retry?

Thanks,
Miklos
* Re: [fuse-devel] Debugging a stale kernel cache during file growth
  2026-04-17 13:48 ` Miklos Szeredi
@ 2026-05-04 16:49   ` Nikolay Amiantov
  0 siblings, 0 replies; 9+ messages in thread
From: Nikolay Amiantov @ 2026-05-04 16:49 UTC (permalink / raw)
  To: Miklos Szeredi, Matthew Wilcox
  Cc: fuse-devel, linux-fsdevel, Amir Goldstein, fuse-devel, linux-mm

Kindly bringing this discussion up again.

On 4/17/26 20:48, Miklos Szeredi wrote:
> Should that code check for a truncated page and retry?

If I understand you correctly, this can be accomplished with a new page
flag, because there's no other way to find out that the truncation
happened; would you say this is the right way forward?

Sadly, I haven't had time to try to make a PoC fix yet, but I want to
tackle it this weekend; meanwhile, any guidance is appreciated!

Thanks,
Nikolay.
* Re: [fuse-devel] Debugging a stale kernel cache during file growth
  2026-04-16 12:12 ` Miklos Szeredi
  2026-04-16 12:41   ` Nikolay Amiantov
@ 2026-04-16 22:54   ` Matthew Wilcox
  1 sibling, 0 replies; 9+ messages in thread
From: Matthew Wilcox @ 2026-04-16 22:54 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: Nikolay Amiantov, fuse-devel, linux-fsdevel, Amir Goldstein,
	fuse-devel, linux-mm

On Thu, Apr 16, 2026 at 02:12:37PM +0200, Miklos Szeredi wrote:
> On Thu, 16 Apr 2026 at 09:24, Amir Goldstein <amir73il@gmail.com> wrote:
> > > I've recently encountered a weird issue with JuiceFS [1], a network FS
> > > which uses FUSE. tl;dr: when a file was being slowly appended, a reader
> > > of the same file on another host would periodically read a block of zero
> > > bytes instead of the actual data.
>
> Thanks for the report.
>
> I wonder if we could clear PG_uptodate on the page which had its zero
> bytes exposed by the i_size increase?
>
> Willy?

I think every filesystem which clears PG_uptodate is doing it wrong. I
know we have ~30 places which do it, and I haven't audited them all,
but clearing the uptodate bit can lead to the VM throwing an absolute
fit if any of the pages in that folio are mapped.

I don't think it'll make much difference whether it's cleared or
invalidated from the page cache. Either way we're re-reading all the
data in it, which would dominate the time saved by not doing a trip
through the page allocator.