* [PATCH] fuse: fix readahead reclaim deadlock
From: Joanne Koong @ 2025-09-25 22:44 UTC
To: miklos; +Cc: linux-fsdevel, osandov, kernel-team

A deadlock can occur if the server triggers reclaim while servicing a
readahead request, and reclaim attempts to evict the inode of the file
being read ahead:

>>> stack_trace(1504735)
folio_wait_bit_common (mm/filemap.c:1308:4)
folio_lock (./include/linux/pagemap.h:1052:3)
truncate_inode_pages_range (mm/truncate.c:336:10)
fuse_evict_inode (fs/fuse/inode.c:161:2)
evict (fs/inode.c:704:3)
dentry_unlink_inode (fs/dcache.c:412:3)
__dentry_kill (fs/dcache.c:615:3)
shrink_kill (fs/dcache.c:1060:12)
shrink_dentry_list (fs/dcache.c:1087:3)
prune_dcache_sb (fs/dcache.c:1168:2)
super_cache_scan (fs/super.c:221:10)
do_shrink_slab (mm/shrinker.c:435:9)
shrink_slab (mm/shrinker.c:626:10)
shrink_node (mm/vmscan.c:5951:2)
shrink_zones (mm/vmscan.c:6195:3)
do_try_to_free_pages (mm/vmscan.c:6257:3)
do_swap_page (mm/memory.c:4136:11)
handle_pte_fault (mm/memory.c:5562:10)
handle_mm_fault (mm/memory.c:5870:9)
do_user_addr_fault (arch/x86/mm/fault.c:1338:10)
handle_page_fault (arch/x86/mm/fault.c:1481:3)
exc_page_fault (arch/x86/mm/fault.c:1539:2)
asm_exc_page_fault+0x22/0x27

During readahead, the folio is locked. When fuse_evict_inode() is
called, it attempts to remove all folios associated with the inode from
the page cache (truncate_inode_pages_range()), which requires acquiring
the folio lock. If the server triggers reclaim while servicing a
readahead request, reclaim will block indefinitely waiting for the folio
lock, while readahead cannot relinquish the lock because it is itself
blocked in reclaim, resulting in a deadlock.

The inode is only evicted if it has no remaining references after its
dentry is unlinked. Since readahead is asynchronous, it is not
guaranteed that the inode will have any references at this point.

This fixes the deadlock by holding a reference on the inode while
readahead is in progress, which prevents the inode from being evicted
until readahead completes. Additionally, this also prevents a malicious
or buggy server from indefinitely blocking kswapd if it never fulfills a
readahead request.

Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Reported-by: Omar Sandoval <osandov@fb.com>
---
 fs/fuse/file.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index f1ef77a0be05..8e759061b843 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -893,6 +893,7 @@ static void fuse_readpages_end(struct fuse_mount *fm, struct fuse_args *args,
 	if (ia->ff)
 		fuse_file_put(ia->ff, false);

+	iput(inode);
 	fuse_io_free(ia);
 }

@@ -973,6 +974,12 @@ static void fuse_readahead(struct readahead_control *rac)
 		ia = fuse_io_alloc(NULL, cur_pages);
 		if (!ia)
 			break;
+		/*
+		 * Acquire the inode ref here to prevent reclaim from
+		 * deadlocking. The ref gets dropped in fuse_readpages_end().
+		 */
+		igrab(inode);
+
 		ap = &ia->ap;

 		while (pages < cur_pages) {
-- 
2.47.3
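The invariant the patch relies on (an inode pinned by an in-flight readahead cannot be evicted until the completion handler drops the pin) can be illustrated with a small userspace toy model. This is only a sketch, not kernel code: every `toy_*` name is hypothetical, and "evict on the last put" is a simplification of what the real iput() does.

```c
/* Userspace toy model; NOT kernel code. All toy_* names are hypothetical. */
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct toy_inode {
	int  refcount;	/* stands in for the inode reference count */
	bool evicted;	/* set once the last reference is dropped */
};

/* Take a reference, loosely mirroring igrab(); fails once evicted. */
static struct toy_inode *toy_igrab(struct toy_inode *inode)
{
	if (inode->evicted)
		return NULL;
	inode->refcount++;
	return inode;
}

/* Drop a reference, loosely mirroring iput(); the toy evicts on the last put. */
static void toy_iput(struct toy_inode *inode)
{
	assert(inode->refcount > 0);
	if (--inode->refcount == 0)
		inode->evicted = true;
}

/* Readahead submission: pin the inode until the reply arrives. */
static bool toy_readahead_submit(struct toy_inode *inode)
{
	return toy_igrab(inode) != NULL;
}

/* Readahead completion: folio unlock is not modeled; just drop the pin. */
static void toy_readahead_end(struct toy_inode *inode)
{
	toy_iput(inode);
}
```

In this model, dropping the dentry's reference while a readahead is outstanding leaves the inode alive; eviction can only happen once toy_readahead_end() has run, which is the ordering the patch aims for with igrab() in fuse_readahead() and iput() in fuse_readpages_end().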
* Re: [PATCH] fuse: fix readahead reclaim deadlock
From: Gao Xiang @ 2025-09-26 6:51 UTC
To: Joanne Koong, miklos; +Cc: linux-fsdevel, osandov, kernel-team

On 2025/9/26 06:44, Joanne Koong wrote:
> @@ -893,6 +893,7 @@ static void fuse_readpages_end(struct fuse_mount *fm, struct fuse_args *args,
>  	if (ia->ff)
>  		fuse_file_put(ia->ff, false);
>
> +	iput(inode);

It's somewhat odd to use `igrab` and `iput` in the read(ahead)
context.

I wonder, for FUSE, whether it's possible to just wait for ongoing
locked folios when i_count == 0 (e.g. in .drop_inode) before
adding the inode to the LRU, so that the final inode eviction doesn't
have to wait for its own readahead requests and a deadlock like this
can be avoided.

Thanks,
Gao Xiang
* Re: [PATCH] fuse: fix readahead reclaim deadlock
From: Gao Xiang @ 2025-09-26 7:19 UTC
To: Joanne Koong, miklos; +Cc: linux-fsdevel, osandov, kernel-team

On 2025/9/26 14:51, Gao Xiang wrote:
> It's somewhat odd to use `igrab` and `iput` in the read(ahead)
> context.
>
> I wonder, for FUSE, whether it's possible to just wait for ongoing
> locked folios when i_count == 0 (e.g. in .drop_inode) before
> adding the inode to the LRU, so that the final inode eviction doesn't
> have to wait for its own readahead requests and a deadlock like this
> can be avoided.

Oh, it was in the dentry LRU list instead, so I don't think that can
work.

Alternatively, kernel filesystems normally use GFP_NOFS to avoid such
deadlocks (see `if (!(sc->gfp_mask & __GFP_FS))` in
super_cache_scan()). I wonder if the daemon should simply use
prctl(PR_SET_IO_FLUSHER) so that the user daemon won't re-enter
the fs reclaim context.

Thanks,
Gao Xiang
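The `__GFP_FS` test mentioned above works as an early bail-out: when the allocation that triggered reclaim does not permit filesystem recursion, the superblock shrinker declines to scan. A minimal userspace toy model of that gate follows, assuming made-up flag values and hypothetical `toy_`/`TOY_` names; only the shape of the check is taken from the thread.

```c
/* Userspace toy model; NOT kernel code. Flag values and names are made up. */
#include <assert.h>

#define TOY_GFP_IO	0x1u
#define TOY_GFP_FS	0x2u
#define TOY_GFP_KERNEL	(TOY_GFP_IO | TOY_GFP_FS)	/* may recurse into the fs */
#define TOY_GFP_NOFS	(TOY_GFP_IO)			/* must not recurse into the fs */

#define TOY_SHRINK_STOP	(~0UL)

struct toy_shrink_control {
	unsigned int  gfp_mask;		/* constraints of the triggering allocation */
	unsigned long nr_to_scan;	/* how many objects reclaim asked us to scan */
};

/*
 * Mirrors the shape of the check quoted in the thread: without the FS
 * bit, refuse to prune dentries/inodes so that reclaim cannot call back
 * into filesystem code.
 */
static unsigned long toy_super_cache_scan(const struct toy_shrink_control *sc)
{
	assert(sc != 0);
	if (!(sc->gfp_mask & TOY_GFP_FS))
		return TOY_SHRINK_STOP;

	/* Pretend every scanned object was freed. */
	return sc->nr_to_scan;
}
```

With a NOFS-style mask the toy shrinker returns TOY_SHRINK_STOP and never reaches the eviction path; with a KERNEL-style mask it proceeds, which is the case that appears in the stack trace.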
* Re: [PATCH] fuse: fix readahead reclaim deadlock
From: Joanne Koong @ 2025-09-29 17:25 UTC
To: Gao Xiang; +Cc: miklos, linux-fsdevel, osandov, kernel-team

On Fri, Sep 26, 2025 at 12:19 AM Gao Xiang <hsiangkao@linux.alibaba.com> wrote:
>
> Alternatively, kernel filesystems normally use GFP_NOFS to avoid such
> deadlocks (see `if (!(sc->gfp_mask & __GFP_FS))` in
> super_cache_scan()). I wonder if the daemon should simply use
> prctl(PR_SET_IO_FLUSHER) so that the user daemon won't re-enter
> the fs reclaim context.

Hi Gao,

We cannot rely on the daemon to set this, unfortunately. This can tie
up reclaim and kswapd for the entire system, so I think this
enforcement needs to be guaranteed at the kernel level. For
example, there is the possibility of malicious servers, which we
cannot rely on to set PR_SET_IO_FLUSHER.

Thanks,
Joanne
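For a cooperating server, the opt-in discussed above would look roughly like the sketch below. The helper name `mark_self_io_flusher` is hypothetical; the prctl(2) man page documents PR_SET_IO_FLUSHER as available since Linux 5.6 and as requiring CAP_SYS_RESOURCE, so the call can legitimately fail on an unprivileged or older system. As pointed out in the reply, the kernel cannot depend on a server making this call.

```c
#include <errno.h>
#include <sys/prctl.h>

/* Fallback for older userspace headers; value taken from linux/prctl.h. */
#ifndef PR_SET_IO_FLUSHER
#define PR_SET_IO_FLUSHER 57
#endif

/*
 * Hypothetical helper: mark the calling process as an I/O flusher.
 * Returns 0 on success or a negative errno value (for example -EPERM
 * without CAP_SYS_RESOURCE, or -EINVAL on kernels without the option).
 */
static int mark_self_io_flusher(void)
{
	/* The remaining prctl() arguments must be zero for this option. */
	if (prctl(PR_SET_IO_FLUSHER, 1UL, 0UL, 0UL, 0UL) == -1)
		return -errno;
	return 0;
}
```

A server would call this once at startup, before it begins servicing requests, and decide for itself whether a failure is fatal.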
* Re: [PATCH] fuse: fix readahead reclaim deadlock
From: Gao Xiang @ 2025-09-30 2:21 UTC
To: Joanne Koong; +Cc: miklos, linux-fsdevel, osandov, kernel-team

On 2025/9/30 01:25, Joanne Koong wrote:
> Hi Gao,
>
> We cannot rely on the daemon to set this, unfortunately. This can tie
> up reclaim and kswapd for the entire system, so I think this
> enforcement needs to be guaranteed at the kernel level. For
> example, there is the possibility of malicious servers, which we
> cannot rely on to set PR_SET_IO_FLUSHER.

Hi Joanne,

Yes, currently I don't have a saner way in mind, but iput()
in such a nested context sounds like a new entry (e.g. I thought
the kernel page fault path should have nothing tangled with
evict() directly, but I may be wrong.)

In principle, a typical kernel filesystem holds a valid `file`
during the entire buffered read (file)/mmap (vma->vm_file)
submission path (and of course it won't upcall to userspace
and then do arbitrary things in userspace for I/O processing).

So for kernel filesystems I think the GFP_NOFS allocation
isn't needed, since `file.f_path` always holds a valid dentry
ref during submission, so the dentry/inode reclaim
above is impossible IMO.

Thanks,
Gao Xiang
* Re: [PATCH] fuse: fix readahead reclaim deadlock
From: Gao Xiang @ 2025-09-30 2:35 UTC
To: Joanne Koong; +Cc: miklos, linux-fsdevel, osandov, kernel-team

On 2025/9/30 10:21, Gao Xiang wrote:
> Yes, currently I don't have a saner way in mind, but iput()
> in such a nested context sounds like a new entry (e.g. I thought
> the kernel page fault path should have nothing tangled with
> evict() directly, but I may be wrong.)

To clarify: by "kernel page fault path" I mean the kernel file read
page fault path, tangled with this particular inode in progress
(I don't mean random inode reclamation).
* Re: [PATCH] fuse: fix readahead reclaim deadlock
From: Miklos Szeredi @ 2025-09-30 10:08 UTC
To: Gao Xiang; +Cc: Joanne Koong, linux-fsdevel, osandov, kernel-team

On Tue, 30 Sept 2025 at 04:21, Gao Xiang <hsiangkao@linux.alibaba.com> wrote:

> In principle, a typical kernel filesystem holds a valid `file`
> during the entire buffered read (file)/mmap (vma->vm_file)

Actually, fuse does hold a ref to fuse_file, which should make sure
that the inode is not released while the readahead is ongoing.

See igrab() in fuse_prepare_release() and iput() in fuse_release_end().

So I don't understand what's going on.

Joanne, do you have a reproducer?

Thanks,
Miklos
* Re: [PATCH] fuse: fix readahead reclaim deadlock
From: Joanne Koong @ 2025-09-30 18:47 UTC
To: Miklos Szeredi; +Cc: Gao Xiang, linux-fsdevel, osandov, kernel-team

On Tue, Sep 30, 2025 at 3:09 AM Miklos Szeredi <miklos@szeredi.hu> wrote:
>
> Actually, fuse does hold a ref to fuse_file, which should make sure
> that the inode is not released while the readahead is ongoing.
>
> See igrab() in fuse_prepare_release() and iput() in fuse_release_end().

If the file is mmapped, couldn't the release happen before the page
fault? This is the chain of events I'm thinking of: the file is opened,
the file is mmapped, the file is closed (which triggers the igrab() and
iput() you mentioned in fuse_prepare_release()/fuse_release_end()), then
the client tries to read from the mmapped address, which triggers a page
fault, which in turn triggers readahead.

Or am I missing something here? I'm not super familiar with the mmap
code, but as far as I can tell from the logic in vm_mmap_pgoff() ->
do_mmap() -> __mmap_region(), it doesn't grab a refcount on the inode
or the file descriptor.

> So I don't understand what's going on.
>
> Joanne, do you have a reproducer?

I don't have a reproducer, but we saw the stack trace in the commit
message on our servers a couple of times and used drgn to pinpoint
that there was a readahead request submitted.

Thanks,
Joanne
* Re: [PATCH] fuse: fix readahead reclaim deadlock
From: Miklos Szeredi @ 2025-09-30 18:55 UTC
To: Joanne Koong; +Cc: Gao Xiang, linux-fsdevel, osandov, kernel-team

On Tue, 30 Sept 2025 at 20:47, Joanne Koong <joannelkoong@gmail.com> wrote:

> Or am I missing something here? I'm not super familiar with the mmap
> code but as far as I can tell from the logic in vm_mmap_pgoff() ->
> do_mmap() -> __mmap_region(), it doesn't grab a refcount on the inode
> or the file descriptor.

mmap keeps a ref on the file (vma->vm_file), so only munmap() will
trigger ->release().

Thanks,
Miklos
* Re: [PATCH] fuse: fix readahead reclaim deadlock
From: Joanne Koong @ 2025-10-01 0:18 UTC
To: Miklos Szeredi; +Cc: Gao Xiang, linux-fsdevel, osandov, kernel-team

On Tue, Sep 30, 2025 at 11:55 AM Miklos Szeredi <miklos@szeredi.hu> wrote:
>
> mmap keeps a ref on the file (vma->vm_file), so only munmap() will
> trigger ->release().

Ah okay, I see. I thought the .release callback was for when the last fd
for a file gets closed, but clearly that's not how it works. Thanks for
clarifying that.

I'm confused too, then, about how we can have a fuse inode with no
refcount on it but with one of its folios indefinitely locked. The
paths where a folio can be indefinitely locked, afaict, are
reads/readahead, buffered writes, and notify retrieves. But each of
those paths holds a reference to the inode, so I'm confused about how
we end up in this situation.

Thanks,
Joanne
* Re: [PATCH] fuse: fix readahead reclaim deadlock
From: Joanne Koong @ 2025-10-07 0:37 UTC
To: Miklos Szeredi; +Cc: Gao Xiang, linux-fsdevel, osandov, kernel-team

On Tue, Sep 30, 2025 at 3:09 AM Miklos Szeredi <miklos@szeredi.hu> wrote:
>
> Actually, fuse does hold a ref to fuse_file, which should make sure
> that the inode is not released while the readahead is ongoing.
>
> See igrab() in fuse_prepare_release() and iput() in fuse_release_end().
>
> So I don't understand what's going on.
>
> Joanne, do you have a reproducer?

Omar figured out that the servers where we ran into this had
fc->no_open set. The igrab() in fuse_prepare_release() you mentioned
above happens only when fc->no_open is not set.

Commit e26ee4efbc79 ("fuse: allocate ff->release_args only if release
is needed") is what changed this behavior. I think reverting that
commit fixes the issue. Unless you prefer another solution?

Thanks,
Joanne
* Re: [PATCH] fuse: fix readahead reclaim deadlock
  2025-09-25 22:44 [PATCH] fuse: fix readahead reclaim deadlock Joanne Koong
  2025-09-26  6:51 ` Gao Xiang
@ 2025-09-26  9:01 ` Miklos Szeredi
  1 sibling, 0 replies; 12+ messages in thread
From: Miklos Szeredi @ 2025-09-26  9:01 UTC
To: Joanne Koong
Cc: linux-fsdevel, osandov, kernel-team, Matthew Wilcox, linux-mm

On Fri, 26 Sept 2025 at 00:45, Joanne Koong <joannelkoong@gmail.com> wrote:
>
> A deadlock can occur if the server triggers reclaim while servicing a
> readahead request, and reclaim attempts to evict the inode of the file
> being read ahead:
>
> >>> stack_trace(1504735)
> folio_wait_bit_common (mm/filemap.c:1308:4)
> folio_lock (./include/linux/pagemap.h:1052:3)
> truncate_inode_pages_range (mm/truncate.c:336:10)
> fuse_evict_inode (fs/fuse/inode.c:161:2)
> evict (fs/inode.c:704:3)
> dentry_unlink_inode (fs/dcache.c:412:3)
> __dentry_kill (fs/dcache.c:615:3)
> shrink_kill (fs/dcache.c:1060:12)
> shrink_dentry_list (fs/dcache.c:1087:3)
> prune_dcache_sb (fs/dcache.c:1168:2)
> super_cache_scan (fs/super.c:221:10)
> do_shrink_slab (mm/shrinker.c:435:9)
> shrink_slab (mm/shrinker.c:626:10)
> shrink_node (mm/vmscan.c:5951:2)
> shrink_zones (mm/vmscan.c:6195:3)
> do_try_to_free_pages (mm/vmscan.c:6257:3)
> do_swap_page (mm/memory.c:4136:11)
> handle_pte_fault (mm/memory.c:5562:10)
> handle_mm_fault (mm/memory.c:5870:9)
> do_user_addr_fault (arch/x86/mm/fault.c:1338:10)
> handle_page_fault (arch/x86/mm/fault.c:1481:3)
> exc_page_fault (arch/x86/mm/fault.c:1539:2)
> asm_exc_page_fault+0x22/0x27
>
> During readahead, the folio is locked. When fuse_evict_inode() is
> called, it attempts to remove all folios associated with the inode from
> the page cache (truncate_inode_pages_range()), which requires acquiring
> the folio lock. If the server triggers reclaim while servicing a
> readahead request, reclaim will block indefinitely waiting for the folio
> lock, while readahead cannot relinquish the lock because it is itself
> blocked in reclaim, resulting in a deadlock.
>
> The inode is only evicted if it has no remaining references after its
> dentry is unlinked. Since readahead is asynchronous, it is not
> guaranteed that the inode will have any references at this point.
>
> This fixes the deadlock by holding a reference on the inode while
> readahead is in progress, which prevents the inode from being evicted
> until readahead completes. Additionally, this also prevents a malicious
> or buggy server from indefinitely blocking kswapd if it never fulfills a
> readahead request.

I don't see a better way to fix this, but Cc-ing Willy as the
readahead/mm expert.

> Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
> Reported-by: Omar Sandoval <osandov@fb.com>

This is not a new bug, right? So at least add a

Cc: stable@vger.kernel.org

Thanks,
Miklos

> ---
>  fs/fuse/file.c | 7 +++++++
>  1 file changed, 7 insertions(+)
>
> diff --git a/fs/fuse/file.c b/fs/fuse/file.c
> index f1ef77a0be05..8e759061b843 100644
> --- a/fs/fuse/file.c
> +++ b/fs/fuse/file.c
> @@ -893,6 +893,7 @@ static void fuse_readpages_end(struct fuse_mount *fm, struct fuse_args *args,
>  	if (ia->ff)
>  		fuse_file_put(ia->ff, false);
>
> +	iput(inode);
>  	fuse_io_free(ia);
>  }
>
> @@ -973,6 +974,12 @@ static void fuse_readahead(struct readahead_control *rac)
>  		ia = fuse_io_alloc(NULL, cur_pages);
>  		if (!ia)
>  			break;
> +		/*
> +		 * Acquire the inode ref here to prevent reclaim from
> +		 * deadlocking. The ref gets dropped in fuse_readpages_end().
> +		 */
> +		igrab(inode);
> +
>  		ap = &ia->ap;
>
>  		while (pages < cur_pages) {
> --
> 2.47.3
>
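
[Editor's sketch, not part of the thread.] The effect of the quoted patch
can be pictured with a toy userspace model of the reference counting it
adds. All names here (toy_inode, toy_readahead_start, toy_readahead_end,
toy_try_evict) are hypothetical and this is not the kernel
implementation. The model assumes only what the commit message states:
the inode can be evicted only with no remaining references, and eviction
has to take the lock on every folio of the inode.

```c
#include <assert.h>

/* Toy userspace model of the patch; names are hypothetical, not kernel APIs. */
struct toy_inode {
	int refcount;      /* references pinning the inode */
	int locked_folios; /* folios locked by in-flight readahead */
};

/* Models the igrab() added to fuse_readahead(): pin, then lock a folio. */
void toy_readahead_start(struct toy_inode *inode)
{
	inode->refcount++;
	inode->locked_folios++;
}

/* Models fuse_readpages_end(): unlock the folio, then the added iput(). */
void toy_readahead_end(struct toy_inode *inode)
{
	assert(inode->locked_folios > 0 && inode->refcount > 0);
	inode->locked_folios--;
	inode->refcount--;
}

/*
 * Models reclaim trying to evict the inode. Returns:
 *    0  inode still referenced, so it is not evicted
 *    1  evicted without blocking
 *   -1  eviction would block on a locked folio (the reported deadlock)
 */
int toy_try_evict(const struct toy_inode *inode)
{
	if (inode->refcount > 0)
		return 0;
	return inode->locked_folios > 0 ? -1 : 1;
}
```

With the extra reference held, toy_try_evict() returns 0 for the whole
window in which a folio is locked, and returns 1 only after
toy_readahead_end() has run. The -1 case is reachable only for an inode
that has a locked folio and no reference, which is the state the patch is
meant to rule out.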
Thread overview: 12+ messages
2025-09-25 22:44 [PATCH] fuse: fix readahead reclaim deadlock Joanne Koong
2025-09-26  6:51 ` Gao Xiang
2025-09-26  7:19 ` Gao Xiang
2025-09-29 17:25 ` Joanne Koong
2025-09-30  2:21 ` Gao Xiang
2025-09-30  2:35 ` Gao Xiang
2025-09-30 10:08 ` Miklos Szeredi
2025-09-30 18:47 ` Joanne Koong
2025-09-30 18:55 ` Miklos Szeredi
2025-10-01  0:18 ` Joanne Koong
2025-10-07  0:37 ` Joanne Koong
2025-09-26  9:01 ` Miklos Szeredi