[PATCH] fuse: fix readahead reclaim deadlock

linux-fsdevel.vger.kernel.org archive mirror

* [PATCH] fuse: fix readahead reclaim deadlock
@ 2025-09-25 22:44 Joanne Koong
  2025-09-26  6:51 ` Gao Xiang
  2025-09-26  9:01 ` Miklos Szeredi
  0 siblings, 2 replies; 12+ messages in thread
From: Joanne Koong @ 2025-09-25 22:44 UTC (permalink / raw)
  To: miklos; +Cc: linux-fsdevel, osandov, kernel-team

A deadlock can occur if the server triggers reclaim while servicing a
readahead request, and reclaim attempts to evict the inode of the file
being read ahead:

>>> stack_trace(1504735)
 folio_wait_bit_common (mm/filemap.c:1308:4)
 folio_lock (./include/linux/pagemap.h:1052:3)
 truncate_inode_pages_range (mm/truncate.c:336:10)
 fuse_evict_inode (fs/fuse/inode.c:161:2)
 evict (fs/inode.c:704:3)
 dentry_unlink_inode (fs/dcache.c:412:3)
 __dentry_kill (fs/dcache.c:615:3)
 shrink_kill (fs/dcache.c:1060:12)
 shrink_dentry_list (fs/dcache.c:1087:3)
 prune_dcache_sb (fs/dcache.c:1168:2)
 super_cache_scan (fs/super.c:221:10)
 do_shrink_slab (mm/shrinker.c:435:9)
 shrink_slab (mm/shrinker.c:626:10)
 shrink_node (mm/vmscan.c:5951:2)
 shrink_zones (mm/vmscan.c:6195:3)
 do_try_to_free_pages (mm/vmscan.c:6257:3)
 do_swap_page (mm/memory.c:4136:11)
 handle_pte_fault (mm/memory.c:5562:10)
 handle_mm_fault (mm/memory.c:5870:9)
 do_user_addr_fault (arch/x86/mm/fault.c:1338:10)
 handle_page_fault (arch/x86/mm/fault.c:1481:3)
 exc_page_fault (arch/x86/mm/fault.c:1539:2)
 asm_exc_page_fault+0x22/0x27

During readahead, the folio is locked. When fuse_evict_inode() is
called, it attempts to remove all folios associated with the inode from
the page cache (truncate_inode_pages_range()), which requires acquiring
the folio lock. If the server triggers reclaim while servicing a
readahead request, reclaim will block indefinitely waiting for the folio
lock, while readahead cannot relinquish the lock because it is itself
blocked in reclaim, resulting in a deadlock.

The inode is only evicted if it has no remaining references after its
dentry is unlinked. Since readahead is asynchronous, it is not
guaranteed that the inode will have any references at this point.

This fixes the deadlock by holding a reference on the inode while
readahead is in progress, which prevents the inode from being evicted
until readahead completes. Additionally, this also prevents a malicious
or buggy server from indefinitely blocking kswapd if it never fulfills a
readahead request.

Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Reported-by: Omar Sandoval <osandov@fb.com>
---
 fs/fuse/file.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index f1ef77a0be05..8e759061b843 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -893,6 +893,7 @@ static void fuse_readpages_end(struct fuse_mount *fm, struct fuse_args *args,
 	if (ia->ff)
 		fuse_file_put(ia->ff, false);
 
+	iput(inode);
 	fuse_io_free(ia);
 }
 
@@ -973,6 +974,12 @@ static void fuse_readahead(struct readahead_control *rac)
 		ia = fuse_io_alloc(NULL, cur_pages);
 		if (!ia)
 			break;
+		/*
+		 *  Acquire the inode ref here to prevent reclaim from
+		 *  deadlocking. The ref gets dropped in fuse_readpages_end().
+		 */
+		igrab(inode);
+
 		ap = &ia->ap;
 
 		while (pages < cur_pages) {
-- 
2.47.3
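
For readers following along, the pinning idea in this patch can be modelled in userspace. The sketch below is a toy, not kernel code: `toy_inode`, `toy_igrab()`, `toy_iput()` and `evicted_during_readahead()` are hypothetical names, and the model assumes eviction happens exactly when the count reaches zero (the real igrab() can also fail, which is ignored here).

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for struct inode: a reference count and an "evicted" flag. */
struct toy_inode {
	int  i_count;
	bool evicted;
};

/* Model of igrab(): take an extra reference (assumed to always succeed). */
static void toy_igrab(struct toy_inode *inode)
{
	inode->i_count++;
}

/* Model of iput(): drop a reference; dropping the last one evicts. */
static void toy_iput(struct toy_inode *inode)
{
	assert(inode->i_count > 0);
	if (--inode->i_count == 0)
		inode->evicted = true;
}

/*
 * Replays the sequence from the commit message and reports whether the
 * inode was evicted while the readahead request was still in flight.
 * With pin == false this mirrors the old behaviour, with pin == true
 * the patched one.
 */
static bool evicted_during_readahead(bool pin)
{
	struct toy_inode inode = { .i_count = 1, .evicted = false }; /* dentry's ref */
	bool evicted_in_flight;

	if (pin)
		toy_igrab(&inode);      /* readahead starts and pins the inode */

	toy_iput(&inode);               /* reclaim kills the dentry */
	evicted_in_flight = inode.evicted;

	if (pin)
		toy_iput(&inode);       /* readahead completes and drops the pin */

	return evicted_in_flight;
}
```

In this model the unpinned run evicts the inode while the request is outstanding, which is the point where the real kernel would block on the folio lock; the pinned run defers eviction until the completion path drops its reference.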



* Re: [PATCH] fuse: fix readahead reclaim deadlock
  2025-09-25 22:44 [PATCH] fuse: fix readahead reclaim deadlock Joanne Koong
@ 2025-09-26  6:51 ` Gao Xiang
  2025-09-26  7:19   ` Gao Xiang
  2025-09-26  9:01 ` Miklos Szeredi
  1 sibling, 1 reply; 12+ messages in thread
From: Gao Xiang @ 2025-09-26  6:51 UTC (permalink / raw)
  To: Joanne Koong, miklos; +Cc: linux-fsdevel, osandov, kernel-team



On 2025/9/26 06:44, Joanne Koong wrote:
> @@ -893,6 +893,7 @@ static void fuse_readpages_end(struct fuse_mount *fm, struct fuse_args *args,
>   	if (ia->ff)
>   		fuse_file_put(ia->ff, false);
>   
> +	iput(inode);

It's somewhat odd to use `igrab` and `iput` in the read(ahead)
context.

I wonder whether, for FUSE, it would be possible to just wait for
in-flight locked folios when i_count == 0 (e.g. in .drop_inode) before
adding the inode to the LRU, so that the final inode eviction never has
to wait for its own readahead requests and a deadlock like this is
avoided.

Thanks,
Gao Xiang





* Re: [PATCH] fuse: fix readahead reclaim deadlock
  2025-09-26  6:51 ` Gao Xiang
@ 2025-09-26  7:19   ` Gao Xiang
  2025-09-29 17:25     ` Joanne Koong
  0 siblings, 1 reply; 12+ messages in thread
From: Gao Xiang @ 2025-09-26  7:19 UTC (permalink / raw)
  To: Joanne Koong, miklos; +Cc: linux-fsdevel, osandov, kernel-team



On 2025/9/26 14:51, Gao Xiang wrote:
> It's somewhat odd to use `igrab` and `iput` in the read(ahead)
> context.
> 
> I wonder whether, for FUSE, it would be possible to just wait for
> in-flight locked folios when i_count == 0 (e.g. in .drop_inode) before
> adding the inode to the LRU, so that the final inode eviction never has
> to wait for its own readahead requests and a deadlock like this is
> avoided.

Oh, it was on the dentry LRU list instead, so I don't think that can
work.

Alternatively, kernel filesystems normally use GFP_NOFS to avoid such
deadlocks (see `if (!(sc->gfp_mask & __GFP_FS))` in
super_cache_scan()). I wonder if the daemon should simply use
prctl(PR_SET_IO_FLUSHER) so that the user daemon is never pulled
into the fs reclaim context again.
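
A minimal sketch of what the daemon-side call could look like, assuming a Linux 5.6+ kernel and a process holding CAP_SYS_RESOURCE. The helper name `mark_self_io_flusher()` is hypothetical; only prctl() and PR_SET_IO_FLUSHER come from the kernel interface.

```c
#include <errno.h>
#include <sys/prctl.h>

#ifndef PR_SET_IO_FLUSHER
#define PR_SET_IO_FLUSHER 57	/* uapi value, for older libc headers */
#endif

/*
 * Ask the kernel to treat the calling thread as an I/O flusher.
 * Returns 0 on success or a negative errno value on failure
 * (for example -EPERM without CAP_SYS_RESOURCE, or -EINVAL on
 * kernels that predate the option).
 */
static int mark_self_io_flusher(void)
{
	if (prctl(PR_SET_IO_FLUSHER, 1, 0, 0, 0) == -1)
		return -errno;
	return 0;
}
```

A server would call this once at startup, before it begins servicing requests, and decide for itself whether a failure is fatal.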

Thanks,
Gao Xiang




* Re: [PATCH] fuse: fix readahead reclaim deadlock
  2025-09-25 22:44 [PATCH] fuse: fix readahead reclaim deadlock Joanne Koong
  2025-09-26  6:51 ` Gao Xiang
@ 2025-09-26  9:01 ` Miklos Szeredi
  1 sibling, 0 replies; 12+ messages in thread
From: Miklos Szeredi @ 2025-09-26  9:01 UTC (permalink / raw)
  To: Joanne Koong
  Cc: linux-fsdevel, osandov, kernel-team, Matthew Wilcox, linux-mm

On Fri, 26 Sept 2025 at 00:45, Joanne Koong <joannelkoong@gmail.com> wrote:
>
> This fixes the deadlock by holding a reference on the inode while
> readahead is in progress, which prevents the inode from being evicted
> until readahead completes. Additionally, this also prevents a malicious
> or buggy server from indefinitely blocking kswapd if it never fulfills a
> readahead request.

I don't see a better way to fix this, but Cc-ing Willy as the
readahead/mm expert.

> Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
> Reported-by: Omar Sandoval <osandov@fb.com>

This is not a new bug, right?  So at least add a

Cc: stable@vger.kernel.org

Thanks,
Miklos



* Re: [PATCH] fuse: fix readahead reclaim deadlock
  2025-09-26  7:19   ` Gao Xiang
@ 2025-09-29 17:25     ` Joanne Koong
  2025-09-30  2:21       ` Gao Xiang
  0 siblings, 1 reply; 12+ messages in thread
From: Joanne Koong @ 2025-09-29 17:25 UTC (permalink / raw)
  To: Gao Xiang; +Cc: miklos, linux-fsdevel, osandov, kernel-team

On Fri, Sep 26, 2025 at 12:19 AM Gao Xiang <hsiangkao@linux.alibaba.com> wrote:
>
> Oh, it was on the dentry LRU list instead, so I don't think that can
> work.
>
> Alternatively, kernel filesystems normally use GFP_NOFS to avoid such
> deadlocks (see `if (!(sc->gfp_mask & __GFP_FS))` in
> super_cache_scan()). I wonder if the daemon should simply use
> prctl(PR_SET_IO_FLUSHER) so that the user daemon is never pulled
> into the fs reclaim context again.

Hi Gao,

Unfortunately, we cannot rely on the daemon to set this. The deadlock
can tie up reclaim and kswapd for the entire system, so I think the
protection needs to be guaranteed at the kernel level. For example,
there is the possibility of malicious servers, which we cannot rely on
to set PR_SET_IO_FLUSHER.

Thanks,
Joanne



* Re: [PATCH] fuse: fix readahead reclaim deadlock
  2025-09-29 17:25     ` Joanne Koong
@ 2025-09-30  2:21       ` Gao Xiang
  2025-09-30  2:35         ` Gao Xiang
  2025-09-30 10:08         ` Miklos Szeredi
  0 siblings, 2 replies; 12+ messages in thread
From: Gao Xiang @ 2025-09-30  2:21 UTC (permalink / raw)
  To: Joanne Koong; +Cc: miklos, linux-fsdevel, osandov, kernel-team



On 2025/9/30 01:25, Joanne Koong wrote:
> Hi Gao,
> 
> Unfortunately, we cannot rely on the daemon to set this. The deadlock
> can tie up reclaim and kswapd for the entire system, so I think the
> protection needs to be guaranteed at the kernel level. For example,
> there is the possibility of malicious servers, which we cannot rely on
> to set PR_SET_IO_FLUSHER.

Hi Joanne,

Yes, I don't currently have a saner approach in mind, but iput()
in such a nested context sounds like a new entry point (e.g. I thought the
kernel page fault path should have nothing tangled with
evict() directly but I may be wrong.)

In principle, a kernel filesystem typically holds a valid `file`
during the entire buffered read (file)/mmap (vma->vm_file)
submission path (and of course it won't upcall to userspace
and then do arbitrary things in userspace for I/O processing).

So for kernel filesystems I think the GFP_NOFS allocation
isn't needed, since `file.f_path` always holds a valid dentry
ref during the submission, so the dentry/inode reclaim
above is impossible IMO.
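
As a reminder of how the GFP_NOFS gate mentioned in this thread behaves, here is a toy model of the `__GFP_FS` check in super_cache_scan(). Everything prefixed `TOY_`/`toy_` is hypothetical, including the flag value; only the shape of the test, and the fact that the kernel returns SHRINK_STOP in that case, is taken from the kernel.

```c
/* Hypothetical stand-ins; the real __GFP_FS bit depends on the kernel version. */
#define TOY_GFP_FS	0x1u
#define TOY_SHRINK_STOP	(~0UL)

struct toy_shrink_control {
	unsigned int  gfp_mask;		/* flags of the allocation that triggered reclaim */
	unsigned long nr_to_scan;	/* how many objects the caller wants scanned */
};

/*
 * Model of the gate at the top of super_cache_scan(): when the
 * allocation context forbids re-entering filesystem code, the
 * superblock shrinker refuses to prune dentries and inodes at all.
 */
static unsigned long toy_super_cache_scan(const struct toy_shrink_control *sc)
{
	if (!(sc->gfp_mask & TOY_GFP_FS))
		return TOY_SHRINK_STOP;

	/* Otherwise pretend every requested object was pruned. */
	return sc->nr_to_scan;
}
```

With the FS bit clear the toy shrinker bails out before touching any dentry, which is why a reclaim pass started from a NOFS context could not reach the evict() call in the stack trace.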

Thanks,
Gao Xiang

> 
> Thanks,
> Joanne
> 


* Re: [PATCH] fuse: fix readahead reclaim deadlock
  2025-09-30  2:21       ` Gao Xiang
@ 2025-09-30  2:35         ` Gao Xiang
  2025-09-30 10:08         ` Miklos Szeredi
  1 sibling, 0 replies; 12+ messages in thread
From: Gao Xiang @ 2025-09-30  2:35 UTC (permalink / raw)
  To: Joanne Koong; +Cc: miklos, linux-fsdevel, osandov, kernel-team



On 2025/9/30 10:21, Gao Xiang wrote:
> Hi Joanne,
> 
> Yes, I don't currently have a saner approach in mind, but iput()
> in such a nested context sounds like a new entry point (e.g. I thought the
> kernel page fault path should have nothing tangled with

To clarify:
  ^ I mean the kernel file read page fault path being tangled with the
particular inode whose read is in progress (I don't mean random inode
reclamation).

> evict() directly but I may be wrong.)



* Re: [PATCH] fuse: fix readahead reclaim deadlock
  2025-09-30  2:21       ` Gao Xiang
  2025-09-30  2:35         ` Gao Xiang
@ 2025-09-30 10:08         ` Miklos Szeredi
  2025-09-30 18:47           ` Joanne Koong
  2025-10-07  0:37           ` Joanne Koong
  1 sibling, 2 replies; 12+ messages in thread
From: Miklos Szeredi @ 2025-09-30 10:08 UTC (permalink / raw)
  To: Gao Xiang; +Cc: Joanne Koong, linux-fsdevel, osandov, kernel-team

On Tue, 30 Sept 2025 at 04:21, Gao Xiang <hsiangkao@linux.alibaba.com> wrote:

> In principle, a typical kernel filesystem holds a valid `file`
> during the entire buffered read (file)/mmap (vma->vm_file)

Actually, fuse does hold a ref to fuse_file, which should make sure
that the inode is not released while the readahead is ongoing.

See igrab() in fuse_prepare_release() and iput() in fuse_release_end().

So I don't understand what's going on.

Joanne, do you have a reproducer?

Thanks,
Miklos


* Re: [PATCH] fuse: fix readahead reclaim deadlock
  2025-09-30 10:08         ` Miklos Szeredi
@ 2025-09-30 18:47           ` Joanne Koong
  2025-09-30 18:55             ` Miklos Szeredi
  2025-10-07  0:37           ` Joanne Koong
  1 sibling, 1 reply; 12+ messages in thread
From: Joanne Koong @ 2025-09-30 18:47 UTC (permalink / raw)
  To: Miklos Szeredi; +Cc: Gao Xiang, linux-fsdevel, osandov, kernel-team

On Tue, Sep 30, 2025 at 3:09 AM Miklos Szeredi <miklos@szeredi.hu> wrote:
>
> On Tue, 30 Sept 2025 at 04:21, Gao Xiang <hsiangkao@linux.alibaba.com> wrote:
>
> > In principle, a typical kernel filesystem holds a valid `file`
> > during the entire buffered read (file)/mmap (vma->vm_file)
>
> Actually, fuse does hold a ref to fuse_file, which should make sure
> that the inode is not released while the readahead is ongoing.
>
> See igrab() in fuse_prepare_release() and iput() in fuse_release_end().

If the file is mmapped, couldn't the release happen before the page fault?

This is the chain of events I'm thinking of:
file is opened, file is mmapped, file is closed (which triggers the
igrab() and iput() you mentioned in
fuse_prepare_release()/fuse_release_end()), then client tries to read
from the mmapped address which triggers a page fault which triggers
readahead.

Or am I missing something here? I'm not super familiar with the mmap
code but as far as I can tell from the logic in vm_mmap_pgoff() ->
do_mmap() -> __mmap_region(), it doesn't grab a refcount on the inode
or the file descriptor.

>
> So I don't understand what's going on.
>
> Joanne, do you have a reproducer?

I don't have a reproducer but we saw the stack trace in the commit
message on our servers a couple of times and used drgn to pinpoint
that there was a readahead request submitted.

Thanks,
Joanne
>
> Thanks,
> Miklos


* Re: [PATCH] fuse: fix readahead reclaim deadlock
  2025-09-30 18:47           ` Joanne Koong
@ 2025-09-30 18:55             ` Miklos Szeredi
  2025-10-01  0:18               ` Joanne Koong
  0 siblings, 1 reply; 12+ messages in thread
From: Miklos Szeredi @ 2025-09-30 18:55 UTC (permalink / raw)
  To: Joanne Koong; +Cc: Gao Xiang, linux-fsdevel, osandov, kernel-team

On Tue, 30 Sept 2025 at 20:47, Joanne Koong <joannelkoong@gmail.com> wrote:

> Or am I missing something here? I'm not super familiar with the mmap
> code but as far as I can tell from the logic in vm_mmap_pgoff() ->
> do_mmap() -> __mmap_region(), it doesn't grab a refcount on the inode
> or the file descriptor.

mmap keeps a ref on the file (vma->vm_file), so only munmap() will
trigger ->release().

Thanks,
Miklos


* Re: [PATCH] fuse: fix readahead reclaim deadlock
  2025-09-30 18:55             ` Miklos Szeredi
@ 2025-10-01  0:18               ` Joanne Koong
  0 siblings, 0 replies; 12+ messages in thread
From: Joanne Koong @ 2025-10-01  0:18 UTC (permalink / raw)
  To: Miklos Szeredi; +Cc: Gao Xiang, linux-fsdevel, osandov, kernel-team

On Tue, Sep 30, 2025 at 11:55 AM Miklos Szeredi <miklos@szeredi.hu> wrote:
>
> On Tue, 30 Sept 2025 at 20:47, Joanne Koong <joannelkoong@gmail.com> wrote:
>
> > Or am I missing something here? I'm not super familiar with the mmap
> > code but as far as I can tell from the logic in vm_mmap_pgoff() ->
> > do_mmap() -> __mmap_region(), it doesn't grab a refcount on the inode
> > or the file descriptor.
>
> mmap keeps a ref on the file (vma->vm_file), so only munmap() will
> trigger ->release().

Ah okay I see, I thought the .release callback is for when the last fd
for a file gets closed but clearly that's not how it works. Thanks for
clarifying that.

I'm confused too then how we can have a fuse inode with no refcount on
it but with one of its folios indefinitely locked. The paths where a
folio can be indefinitely locked afaict are from reads/readahead,
buffered writes, and notify retrieves. But each of those paths holds a
reference to the inode, so I'm confused how we end up in this
situation.

Thanks,
Joanne
>
> Thanks,
> Miklos


* Re: [PATCH] fuse: fix readahead reclaim deadlock
  2025-09-30 10:08         ` Miklos Szeredi
  2025-09-30 18:47           ` Joanne Koong
@ 2025-10-07  0:37           ` Joanne Koong
  1 sibling, 0 replies; 12+ messages in thread
From: Joanne Koong @ 2025-10-07  0:37 UTC (permalink / raw)
  To: Miklos Szeredi; +Cc: Gao Xiang, linux-fsdevel, osandov, kernel-team

On Tue, Sep 30, 2025 at 3:09 AM Miklos Szeredi <miklos@szeredi.hu> wrote:
>
> On Tue, 30 Sept 2025 at 04:21, Gao Xiang <hsiangkao@linux.alibaba.com> wrote:
>
> > In principle, a typical kernel filesystem holds a valid `file`
> > during the entire buffered read (file)/mmap (vma->vm_file)
>
> Actually, fuse does hold a ref to fuse_file, which should make sure
> that the inode is not released while the readahead is ongoing.
>
> See igrab() in fuse_prepare_release() and iput() in fuse_release_end().
>
> So I don't understand what's going on.
>
> Joanne, do you have a reproducer?

Omar figured out that the servers where we ran into this had fc->no_open set.

The igrab() in fuse_prepare_release() you mentioned above happens only
when fc->no_open is not set. commit e26ee4efbc79 "fuse: allocate
ff->release_args only if release is needed" is what changed this
behavior.

If we revert this commit, I think this fixes the issue. Unless you
prefer another solution?

Thanks,
Joanne

>
> Thanks,
> Miklos


end of thread, other threads:[~2025-10-07  0:37 UTC | newest]

Thread overview: 12+ messages
2025-09-25 22:44 [PATCH] fuse: fix readahead reclaim deadlock Joanne Koong
2025-09-26  6:51 ` Gao Xiang
2025-09-26  7:19   ` Gao Xiang
2025-09-29 17:25     ` Joanne Koong
2025-09-30  2:21       ` Gao Xiang
2025-09-30  2:35         ` Gao Xiang
2025-09-30 10:08         ` Miklos Szeredi
2025-09-30 18:47           ` Joanne Koong
2025-09-30 18:55             ` Miklos Szeredi
2025-10-01  0:18               ` Joanne Koong
2025-10-07  0:37           ` Joanne Koong
2025-09-26  9:01 ` Miklos Szeredi
