* Potential Linux Crash: WARNING in ext4_dirty_folio in Linux kernel v6.13-rc5
From: Luka @ 2025-03-06  2:42 UTC
To: Andrew Morton; +Cc: linux-mm, linux-kernel

Dear Linux Kernel Experts,

Hello! I am a security researcher focused on testing Linux kernel
vulnerabilities. Recently, while testing the v6.13-rc5 Linux kernel, we
encountered a crash in the mm subsystem. We have captured the call
trace for this crash. Unfortunately, we have not been able to reproduce
the issue in our local environment, so we are unable to provide a PoC
(proof of concept) at this time.

We fully understand the complexity and importance of Linux kernel
maintenance, and we would like to share this finding with you for
further analysis and confirmation of the root cause. Below is a
summary of the relevant information:

Kernel Version: v6.13.0-rc5

Kernel Module: mm/page_alloc.c

———————————————————————— Call Trace ————————————————————————

WARNING: CPU: 1 PID: 333 at mm/page_alloc.c:4240 __alloc_pages_slowpath mm/page_alloc.c:4240 [inline]
WARNING: CPU: 1 PID: 333 at mm/page_alloc.c:4240 __alloc_pages_noprof+0x1808/0x2040 mm/page_alloc.c:4766
Modules linked in:
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/01/2014
RIP: 0010:__alloc_pages_slowpath mm/page_alloc.c:4240 [inline]
RIP: 0010:__alloc_pages_noprof+0x1808/0x2040 mm/page_alloc.c:4766
Code: 89 fa 48 c1 ea 03 0f b6 14 02 48 89 f8 83 e0 07 83 c0 03 38 d0 7c 08 84 d2 0f 85 b3 07 00 00 f6 43 2d 08 0f 84 30 ed ff ff 90 <0f> 0b 90 e9 27 ed ff ff 44 89 4c 24 38 65 8b 15 c0 89 52 78 89 d2
RSP: 0018:ffff8880141ee990 EFLAGS: 00010202
RAX: 0000000000000007 RBX: ffff888012544400 RCX: 0000000000000000
RDX: 0000000000000000 RSI: 0000000000000002 RDI: ffff88801254442c
RBP: 0000000000048c40 R08: 0000000000000801 R09: 00000000000000f7
R10: 0000000000000000 R11: ffff88813fffdc40 R12: 0000000000000000
R13: 0000000000000400 R14: 0000000000048c40 R15: 0000000000000000
FS:  0000555589d15480(0000) GS:ffff88811b280000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 000055e47d593e61 CR3: 00000000141ce000 CR4: 0000000000350ef0
Call Trace:
 <TASK>
 alloc_pages_mpol_noprof+0xda/0x300 mm/mempolicy.c:2269
 folio_alloc_noprof+0x1e/0x70 mm/mempolicy.c:2355
 filemap_alloc_folio_noprof+0x2b2/0x2f0 mm/filemap.c:1009
 __filemap_get_folio+0x16d/0x3d0 mm/filemap.c:1951
 ext4_mb_load_buddy_gfp+0x42b/0xc00 fs/ext4/mballoc.c:1640
 ext4_discard_preallocations+0x45c/0xc70 fs/ext4/mballoc.c:5592
 ext4_clear_inode+0x3d/0x1e0 fs/ext4/super.c:1523
 ext4_evict_inode+0x1b2/0x1330 fs/ext4/inode.c:323
 evict+0x337/0x7c0 fs/inode.c:796
 dispose_list fs/inode.c:845 [inline]
 prune_icache_sb+0x189/0x290 fs/inode.c:1033
 super_cache_scan+0x33d/0x510 fs/super.c:223
 do_shrink_slab mm/shrinker.c:437 [inline]
 shrink_slab+0x43e/0x930 mm/shrinker.c:664
 shrink_node_memcgs mm/vmscan.c:5931 [inline]
 shrink_node+0x4dd/0x15c0 mm/vmscan.c:5970
 shrink_zones mm/vmscan.c:6215 [inline]
 do_try_to_free_pages+0x284/0x1160 mm/vmscan.c:6277
 try_to_free_pages+0x1ee/0x3e0 mm/vmscan.c:6527
 __perform_reclaim mm/page_alloc.c:3929 [inline]
 __alloc_pages_direct_reclaim mm/page_alloc.c:3951 [inline]
 __alloc_pages_slowpath mm/page_alloc.c:4382 [inline]
 __alloc_pages_noprof+0xa48/0x2040 mm/page_alloc.c:4766
 alloc_pages_bulk_noprof+0x6d6/0xf40 mm/page_alloc.c:4701
 alloc_pages_bulk_array_mempolicy_noprof+0x1fd/0xcb0 mm/mempolicy.c:2559
 vm_area_alloc_pages mm/vmalloc.c:3565 [inline]
 __vmalloc_area_node mm/vmalloc.c:3669 [inline]
 __vmalloc_node_range_noprof+0x453/0x1170 mm/vmalloc.c:3846
 __vmalloc_node_noprof+0xad/0xf0 mm/vmalloc.c:3911
 xt_counters_alloc+0x32/0x60 net/netfilter/x_tables.c:1380
 __do_replace net/ipv4/netfilter/ip_tables.c:1046 [inline]
 do_replace net/ipv4/netfilter/ip_tables.c:1141 [inline]
 do_ipt_set_ctl+0x6d8/0x10d0 net/ipv4/netfilter/ip_tables.c:1635
 nf_setsockopt+0x7d/0xe0 net/netfilter/nf_sockopt.c:101
 ip_setsockopt+0xa4/0xc0 net/ipv4/ip_sockglue.c:1424
 tcp_setsockopt+0x9c/0x100 net/ipv4/tcp.c:4030
 do_sock_setsockopt+0xd3/0x1a0 net/socket.c:2313
 __sys_setsockopt+0x105/0x170 net/socket.c:2338
 __do_sys_setsockopt net/socket.c:2344 [inline]
 __se_sys_setsockopt net/socket.c:2341 [inline]
 __x64_sys_setsockopt+0xbd/0x160 net/socket.c:2341
 do_syscall_x64 arch/x86/entry/common.c:52 [inline]
 do_syscall_64+0xa6/0x1a0 arch/x86/entry/common.c:83
 entry_SYSCALL_64_after_hwframe+0x77/0x7f
RIP: 0033:0x7fc5c73fa87e
Code: 0f 1f 40 00 48 c7 c2 b0 ff ff ff f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b1 0f 1f 00 f3 0f 1e fa 49 89 ca b8 36 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 0a c3 66 0f 1f 84 00 00 00 00 00 48 c7 c2 b0
RSP: 002b:00007ffc1866e9a8 EFLAGS: 00000206 ORIG_RAX: 0000000000000036
RAX: ffffffffffffffda RBX: 00007ffc1866ea30 RCX: 00007fc5c73fa87e
RDX: 0000000000000040 RSI: 0000000000000000 RDI: 0000000000000003
RBP: 0000000000000003 R08: 00000000000002d8 R09: 00007ffc1866ef30
R10: 00007fc5c75c0c60 R11: 0000000000000206 R12: 00007fc5c75c0c00
R13: 00007ffc1866e9cc R14: 0000000000000000 R15: 00007fc5c75c2dc0
 </TASK>

———————————————————————— Call Trace ————————————————————————

If you need more details or additional test results, please feel free
to let us know. Thank you so much for your attention! Please don't
hesitate to reach out if you have any suggestions or need further
communication.

Best regards,
Luka
* Re: Potential Linux Crash: WARNING in ext4_dirty_folio in Linux kernel v6.13-rc5
From: Matthew Wilcox @ 2025-03-06  5:13 UTC
To: Luka
Cc: Andrew Morton, linux-mm, linux-kernel, Theodore Ts'o, Andreas Dilger,
    linux-ext4, linux-fsdevel

On Thu, Mar 06, 2025 at 10:42:58AM +0800, Luka wrote:
> We fully understand the complexity and importance of Linux kernel
> maintenance, and we would like to share this finding with you for
> further analysis and confirmation of the root cause. Below is a
> summary of the relevant information:

This is the exact same problem I just analysed for you.  Except this
time it's ext4 rather than FAT.

https://lore.kernel.org/linux-mm/Z8kuWyqj8cS-stKA@casper.infradead.org/
for the benefit of the ext4 people who're just finding out about this.

> Kernel Version: v6.13.0-rc5
>
> Kernel Module: mm/page_alloc.c
>
> WARNING: CPU: 1 PID: 333 at mm/page_alloc.c:4240
>   __alloc_pages_slowpath mm/page_alloc.c:4240 [inline]
> WARNING: CPU: 1 PID: 333 at mm/page_alloc.c:4240
>   __alloc_pages_noprof+0x1808/0x2040 mm/page_alloc.c:4766
> [...]
* Re: Potential Linux Crash: WARNING in ext4_dirty_folio in Linux kernel v6.13-rc5
From: Matt Fleming @ 2025-03-26 10:59 UTC
To: willy
Cc: adilger.kernel, akpm, linux-ext4, linux-fsdevel, linux-kernel, linux-mm,
    luka.2016.cs, tytso, Barry Song, kernel-team, Vlastimil Babka,
    Miklos Szeredi, Amir Goldstein

On Thu, Mar 06, 2025 at 05:13:51 +0000, Matthew wrote:
> This is the exact same problem I just analysed for you.  Except this
> time it's ext4 rather than FAT.
>
> https://lore.kernel.org/linux-mm/Z8kuWyqj8cS-stKA@casper.infradead.org/
> for the benefit of the ext4 people who're just finding out about this.

Hi there,

I'm also seeing this PF_MEMALLOC WARN triggered from kswapd in 6.12.19.

Does overlayfs need some kind of background inode reclaim support?

Call Trace:
 <TASK>
 __alloc_pages_noprof+0x31c/0x330
 alloc_pages_mpol_noprof+0xe3/0x1d0
 folio_alloc_noprof+0x5b/0xa0
 __filemap_get_folio+0x1f3/0x380
 __getblk_slow+0xa3/0x1e0
 __ext4_get_inode_loc+0x121/0x4b0
 ext4_get_inode_loc+0x40/0xa0
 ext4_reserve_inode_write+0x39/0xc0
 __ext4_mark_inode_dirty+0x5b/0x220
 ext4_evict_inode+0x26d/0x690
 evict+0x112/0x2a0
 __dentry_kill+0x71/0x180
 dput+0xeb/0x1b0
 ovl_stack_put+0x2e/0x50 [overlay]
 ovl_destroy_inode+0x3a/0x60 [overlay]
 destroy_inode+0x3b/0x70
 __dentry_kill+0x71/0x180
 shrink_dentry_list+0x6b/0xe0
 prune_dcache_sb+0x56/0x80
 super_cache_scan+0x12c/0x1e0
 do_shrink_slab+0x13b/0x350
 shrink_slab+0x278/0x3a0
 shrink_node+0x328/0x880
 balance_pgdat+0x36d/0x740
 kswapd+0x1f0/0x380
 kthread+0xd2/0x100
 ret_from_fork+0x34/0x50
 ret_from_fork_asm+0x1a/0x30
 </TASK>

Thanks,
Matt
* Re: Potential Linux Crash: WARNING in ext4_dirty_folio in Linux kernel v6.13-rc5
From: Matt Fleming @ 2025-04-03 12:29 UTC
To: willy
Cc: adilger.kernel, akpm, linux-ext4, linux-fsdevel, linux-kernel, linux-mm,
    luka.2016.cs, tytso, Barry Song, kernel-team, Vlastimil Babka,
    Miklos Szeredi, Amir Goldstein, Dave Chinner, Qi Zheng, Roman Gushchin,
    Muchun Song

On Wed, Mar 26, 2025 at 10:59 AM Matt Fleming <matt@readmodwrite.com> wrote:
>
> Hi there,
>
> I'm also seeing this PF_MEMALLOC WARN triggered from kswapd in 6.12.19.
>
> Does overlayfs need some kind of background inode reclaim support?

Hey everyone, I know there was some off-list discussion last week at
LSFMM, but I don't think a definite solution has been proposed for the
below stacktrace.

What is the shrinker API policy wrt memory allocation and I/O? Should
overlayfs do something more like XFS and background reclaim to avoid
GFP_NOFAIL allocations when kswapd is shrinking caches?

> Call Trace:
>  <TASK>
>  __alloc_pages_noprof+0x31c/0x330
>  alloc_pages_mpol_noprof+0xe3/0x1d0
>  folio_alloc_noprof+0x5b/0xa0
>  __filemap_get_folio+0x1f3/0x380
>  __getblk_slow+0xa3/0x1e0
>  __ext4_get_inode_loc+0x121/0x4b0
>  ext4_get_inode_loc+0x40/0xa0
>  ext4_reserve_inode_write+0x39/0xc0
>  __ext4_mark_inode_dirty+0x5b/0x220
>  ext4_evict_inode+0x26d/0x690
>  evict+0x112/0x2a0
>  __dentry_kill+0x71/0x180
>  dput+0xeb/0x1b0
>  ovl_stack_put+0x2e/0x50 [overlay]
>  ovl_destroy_inode+0x3a/0x60 [overlay]
>  destroy_inode+0x3b/0x70
>  __dentry_kill+0x71/0x180
>  shrink_dentry_list+0x6b/0xe0
>  prune_dcache_sb+0x56/0x80
>  super_cache_scan+0x12c/0x1e0
>  do_shrink_slab+0x13b/0x350
>  shrink_slab+0x278/0x3a0
>  shrink_node+0x328/0x880
>  balance_pgdat+0x36d/0x740
>  kswapd+0x1f0/0x380
>  kthread+0xd2/0x100
>  ret_from_fork+0x34/0x50
>  ret_from_fork_asm+0x1a/0x30
>  </TASK>
* Re: Potential Linux Crash: WARNING in ext4_dirty_folio in Linux kernel v6.13-rc5
From: Vlastimil Babka @ 2025-04-03 12:58 UTC
To: Matt Fleming, willy
Cc: adilger.kernel, akpm, linux-ext4, linux-fsdevel, linux-kernel, linux-mm,
    luka.2016.cs, tytso, Barry Song, kernel-team, Miklos Szeredi,
    Amir Goldstein, Dave Chinner, Qi Zheng, Roman Gushchin, Muchun Song,
    Michal Hocko

On 4/3/25 14:29, Matt Fleming wrote:
> On Wed, Mar 26, 2025 at 10:59 AM Matt Fleming <matt@readmodwrite.com> wrote:
>>
>> Hi there,

+ Cc also Michal

>> I'm also seeing this PF_MEMALLOC WARN triggered from kswapd in 6.12.19.

We're talking about __alloc_pages_slowpath() doing
WARN_ON_ONCE(current->flags & PF_MEMALLOC); for __GFP_NOFAIL allocations.

kswapd() sets:

	tsk->flags |= PF_MEMALLOC | PF_KSWAPD;

so any __GFP_NOFAIL allocation done in the kswapd context risks this
warning. It's also objectively bad IMHO because for direct reclaim we can
loop and hope kswapd rescues us, but kswapd would then have to rely on
direct reclaimers to get unstuck. I don't see an easy generic solution?

>> Does overlayfs need some kind of background inode reclaim support?
>
> Hey everyone, I know there was some off-list discussion last week at
> LSFMM, but I don't think a definite solution has been proposed for the
> below stacktrace.
>
> What is the shrinker API policy wrt memory allocation and I/O? Should
> overlayfs do something more like XFS and background reclaim to avoid
> GFP_NOFAIL allocations when kswapd is shrinking caches?
>
>> Call Trace:
>> [...]
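
To make the warning under discussion concrete, the following is a heavily
simplified sketch of the two pieces Vlastimil describes: kswapd marking
itself with PF_MEMALLOC, and the page allocator slow path warning when a
__GFP_NOFAIL request arrives from a task that is already inside reclaim.
It is modelled loosely on mm/vmscan.c and mm/page_alloc.c; the real
functions have different signatures and much more surrounding logic.

	#include <linux/gfp.h>
	#include <linux/sched.h>

	/* kswapd marks itself as a reclaimer for its whole lifetime: */
	static int kswapd(void *p)
	{
		struct task_struct *tsk = current;

		tsk->flags |= PF_MEMALLOC | PF_KSWAPD;

		/* ... balance_pgdat() -> shrink_node() -> shrink_slab(),
		 * which may reach super_cache_scan() -> evict() ->
		 * ->evict_inode() and allocate memory along the way ... */

		tsk->flags &= ~(PF_MEMALLOC | PF_KSWAPD);
		return 0;
	}

	/* The slow path refuses to "support" __GFP_NOFAIL from a task that
	 * is itself reclaiming: such a task cannot recurse into reclaim and
	 * can only loop waiting for someone else to free memory. */
	static struct page *__alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order)
	{
		struct page *page = NULL;

		/* ... reclaim/compaction retry loops elided ... */

		if (gfp_mask & __GFP_NOFAIL) {
			WARN_ON_ONCE(current->flags & PF_MEMALLOC);
			/* ... then keep retrying rather than return NULL ... */
		}

		return page;
	}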
* Re: Potential Linux Crash: WARNING in ext4_dirty_folio in Linux kernel v6.13-rc5
From: Michal Hocko @ 2025-04-03 14:33 UTC
To: Vlastimil Babka
Cc: Matt Fleming, willy, adilger.kernel, akpm, linux-ext4, linux-fsdevel,
    linux-kernel, linux-mm, luka.2016.cs, tytso, Barry Song, kernel-team,
    Miklos Szeredi, Amir Goldstein, Dave Chinner, Qi Zheng, Roman Gushchin,
    Muchun Song

On Thu 03-04-25 14:58:25, Vlastimil Babka wrote:
> On 4/3/25 14:29, Matt Fleming wrote:
> > On Wed, Mar 26, 2025 at 10:59 AM Matt Fleming <matt@readmodwrite.com> wrote:
> >>
> >> Hi there,
>
> + Cc also Michal
>
> >> I'm also seeing this PF_MEMALLOC WARN triggered from kswapd in 6.12.19.
>
> We're talking about __alloc_pages_slowpath() doing
> WARN_ON_ONCE(current->flags & PF_MEMALLOC); for __GFP_NOFAIL allocations.
>
> kswapd() sets:
>
> 	tsk->flags |= PF_MEMALLOC | PF_KSWAPD;
>
> so any __GFP_NOFAIL allocation done in the kswapd context risks this
> warning. It's also objectively bad IMHO because for direct reclaim we can
> loop and hope kswapd rescues us, but kswapd would then have to rely on
> direct reclaimers to get unstuck. I don't see an easy generic solution?

Right. I do not think NOFAIL request from the reclaim context is really
something we can commit to support. This really needs to be addressed
on the shrinker side.

> [...]

--
Michal Hocko
SUSE Labs
* Re: Potential Linux Crash: WARNING in ext4_dirty_folio in Linux kernel v6.13-rc5
From: Matthew Wilcox @ 2025-04-03 17:12 UTC
To: Matt Fleming
Cc: adilger.kernel, akpm, linux-ext4, linux-fsdevel, linux-kernel, linux-mm,
    luka.2016.cs, tytso, Barry Song, kernel-team, Vlastimil Babka,
    Miklos Szeredi, Amir Goldstein, Dave Chinner, Qi Zheng, Roman Gushchin,
    Muchun Song

On Thu, Apr 03, 2025 at 01:29:44PM +0100, Matt Fleming wrote:
> On Wed, Mar 26, 2025 at 10:59 AM Matt Fleming <matt@readmodwrite.com> wrote:
> >
> > Hi there,
> >
> > I'm also seeing this PF_MEMALLOC WARN triggered from kswapd in 6.12.19.
> >
> > Does overlayfs need some kind of background inode reclaim support?
>
> Hey everyone, I know there was some off-list discussion last week at
> LSFMM, but I don't think a definite solution has been proposed for the
> below stacktrace.

Hi Matt,

We did have a substantial discussion at LSFMM and we just had another
discussion on the ext4 call.  I'm going to try to summarise those
discussions here, and people can jump in to correct me (I'm not really
an expert on this part of MM-FS interaction).

At LSFMM, we came up with a solution that doesn't work, so let's start
with ideas that don't work:

 - Allow PF_MEMALLOC to dip into the atomic reserves.  With large block
   devices, we might end up doing emergency high-order allocations, and
   that makes everybody nervous.
 - Only allow inode reclaim from kswapd and not from direct reclaim.
   Your stack trace here is from kswapd, so obviously that doesn't work.
 - Allow ->evict_inode to return an error.  At this point the inode has
   been taken off the lists, which means that somebody else may have
   already started constructing it again, and we can't just put it back
   on the lists.

Jan explained that _usually_ the reclaim path is not the last holder of
a reference to the inode.  What's happening here is that we've lost a
race where the dentry is being turned negative by somebody else at the
same time; usually they'd have the last reference and call evict.  But
if the shrinker has the last reference, it has to do the eviction.

Jan does not think that Overlayfs is a factor here.  It may change the
timing somewhat but should not make the race wider (nor narrower).

Ideas still on the table:

 - Convert all filesystems to use the XFS inode management scheme.
   Nobody is thrilled by this large amount of work.
 - Find a simpler version of the XFS scheme to implement for other
   filesystems.
* Re: Potential Linux Crash: WARNING in ext4_dirty_folio in Linux kernel v6.13-rc5
From: James Bottomley @ 2025-04-03 19:32 UTC
To: Matthew Wilcox, Matt Fleming
Cc: adilger.kernel, akpm, linux-ext4, linux-fsdevel, linux-kernel, linux-mm,
    luka.2016.cs, tytso, Barry Song, kernel-team, Vlastimil Babka,
    Miklos Szeredi, Amir Goldstein, Dave Chinner, Qi Zheng, Roman Gushchin,
    Muchun Song

On Thu, 2025-04-03 at 18:12 +0100, Matthew Wilcox wrote:
[...]
> Ideas still on the table:
>
> - Convert all filesystems to use the XFS inode management scheme.
>   Nobody is thrilled by this large amount of work.
> - Find a simpler version of the XFS scheme to implement for other
>   filesystems.

What's wrong with a simpler fix: if we're in PF_MEMALLOC when we try to
run inode.c:evict(), send it through a workqueue?  It will require some
preallocation (say using a superblock based work entry ... or simply
reuse the destroy_work) but it should be doable.

The analysis says that evicting from reclaim is very rare because it's
a deletion race, so it shouldn't matter that it's firing once per inode
with this condition.

Regards,

James
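
As a rough illustration of the workqueue idea above (purely a sketch,
not a real patch): the helper names deferred_evict and evict_or_defer
are invented here, and a real version would use a preallocated work item
as James notes, rather than allocating one at eviction time.

	#include <linux/fs.h>
	#include <linux/sched.h>
	#include <linux/slab.h>
	#include <linux/workqueue.h>

	/* Invented for this sketch; a real patch would embed a preallocated
	 * work item (per-inode or per-superblock) instead of allocating one
	 * here, since this allocation can itself fail under pressure. */
	struct deferred_evict {
		struct work_struct work;
		struct inode *inode;
	};

	static void deferred_evict_workfn(struct work_struct *work)
	{
		struct deferred_evict *de =
			container_of(work, struct deferred_evict, work);

		evict(de->inode);	/* now runs in ordinary process context */
		kfree(de);
	}

	/* Hypothetical replacement for the direct evict() calls in fs/inode.c */
	static void evict_or_defer(struct inode *inode)
	{
		struct deferred_evict *de;

		if (!(current->flags & PF_MEMALLOC)) {
			evict(inode);
			return;
		}

		/* Reclaim context (kswapd or direct reclaim): punt the
		 * eviction, and with it any GFP_NOFAIL allocations, to a
		 * workqueue. */
		de = kzalloc(sizeof(*de), GFP_NOWAIT);
		if (!de) {
			evict(inode);	/* fall back to the old behaviour */
			return;
		}
		de->inode = inode;
		INIT_WORK(&de->work, deferred_evict_workfn);
		schedule_work(&de->work);
	}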
* Re: Potential Linux Crash: WARNING in ext4_dirty_folio in Linux kernel v6.13-rc5
From: Vlastimil Babka @ 2025-04-04  9:09 UTC
To: Matthew Wilcox, Matt Fleming
Cc: adilger.kernel, akpm, linux-ext4, linux-fsdevel, linux-kernel, linux-mm,
    luka.2016.cs, tytso, Barry Song, kernel-team, Miklos Szeredi,
    Amir Goldstein, Dave Chinner, Qi Zheng, Roman Gushchin, Muchun Song,
    James E.J. Bottomley

On 4/3/25 19:12, Matthew Wilcox wrote:
> Ideas still on the table:
>
> - Convert all filesystems to use the XFS inode management scheme.
>   Nobody is thrilled by this large amount of work.
> - Find a simpler version of the XFS scheme to implement for other
>   filesystems.

I don't know the XFS scheme, but this situation seems like a match for
the mempool semantics? (I assume it's also a lot of work to implement)
* Re: Potential Linux Crash: WARNING in ext4_dirty_folio in Linux kernel v6.13-rc5
From: Matthew Wilcox @ 2025-04-04 13:50 UTC
To: Vlastimil Babka
Cc: Matt Fleming, adilger.kernel, akpm, linux-ext4, linux-fsdevel,
    linux-kernel, linux-mm, luka.2016.cs, tytso, Barry Song, kernel-team,
    Miklos Szeredi, Amir Goldstein, Dave Chinner, Qi Zheng, Roman Gushchin,
    Muchun Song, James E.J. Bottomley

On Fri, Apr 04, 2025 at 11:09:37AM +0200, Vlastimil Babka wrote:
> On 4/3/25 19:12, Matthew Wilcox wrote:
> > Ideas still on the table:
> >
> > - Convert all filesystems to use the XFS inode management scheme.
> >   Nobody is thrilled by this large amount of work.
> > - Find a simpler version of the XFS scheme to implement for other
> >   filesystems.
>
> I don't know the XFS scheme, but this situation seems like a match for
> the mempool semantics? (I assume it's also a lot of work to implement)

Ah; no.  Evicting an inode may consume an arbitrary amount of memory,
run transactions, wait for I/O, etc., etc.  We really shouldn't be doing
it as part of memory reclaim.  I should probably have said that as part
of this writeup, so thanks for bringing it up.
* Re: Potential Linux Crash: WARNING in ext4_dirty_folio in Linux kernel v6.13-rc5
From: Dave Chinner @ 2025-04-07 23:00 UTC
To: Matthew Wilcox
Cc: Matt Fleming, adilger.kernel, akpm, linux-ext4, linux-fsdevel,
    linux-kernel, linux-mm, luka.2016.cs, tytso, Barry Song, kernel-team,
    Vlastimil Babka, Miklos Szeredi, Amir Goldstein, Qi Zheng,
    Roman Gushchin, Muchun Song

On Thu, Apr 03, 2025 at 06:12:26PM +0100, Matthew Wilcox wrote:
[...]
> At LSFMM, we came up with a solution that doesn't work, so let's start
> with ideas that don't work:
>
> - Allow PF_MEMALLOC to dip into the atomic reserves.  With large block
>   devices, we might end up doing emergency high-order allocations, and
>   that makes everybody nervous.
> - Only allow inode reclaim from kswapd and not from direct reclaim.

That's what GFP_NOFS does. We already rely on kswapd to do inode
reclaim rather than direct reclaim when filesystem cache pressure is
driving memory reclaim...

>   Your stack trace here is from kswapd, so obviously that doesn't work.
> - Allow ->evict_inode to return an error.  At this point the inode has
>   been taken off the lists, which means that somebody else may have
>   already started constructing it again, and we can't just put it back
>   on the lists.

No. When ->evict_inode is called, the inode hasn't been taken off the
inode hash list. Hence the inode can still be found via cache lookups
whilst evict_inode() is running. However, the inode will have I_FREEING
set, so lookups will call wait_on_freeing_inode() before retrying the
lookup. They will get woken by the inode_wake_up_bit() call in evict()
that happens after ->evict_inode returns, so I_FREEING is what provides
->evict_inode serialisation against new lookups trying to recreate the
inode whilst it is being torn down.

IOWs, nothing should be reconstructing the inode whilst evict() is
tearing it down because it can still be found in the inode hash.

[...]

> Ideas still on the table:
>
> - Convert all filesystems to use the XFS inode management scheme.
>   Nobody is thrilled by this large amount of work.

There is no need to do that.

> - Find a simpler version of the XFS scheme to implement for other
>   filesystems.

If we push the last half of evict_inode() out to the background thread
(i.e. go async before remove_inode_hash() is called), then new lookups
will still serialise on the inode hash due to I_FREEING being set.

i.e. problems only arise if the inode is removed from lookup visibility
whilst it still has cleanup work pending.

e.g. have the filesystem provide a ->evict_inode_async() method that
either completes inode eviction directly or punts it to a workqueue
where it does the work and then completes inode eviction. As long as
all this work is done whilst the inode is marked I_FREEING and is
present in the inode hash, then new lookups will serialise on the
eviction work regardless of how it is scheduled.

It is likely we could simplify the XFS code by converting it over to a
mechanism like this, rather than playing the long-standing "defer
everything to background threads from ->destroy_inode()" game that we
currently do.

-Dave.
--
Dave Chinner
david@fromorbit.com
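
A sketch of how the proposed hook might slot into fs/inode.c:evict().
The ->evict_inode_async() name comes from the proposal above, but
evict_inode_finish() and the exact split of work are invented here for
illustration, and the real evict() does considerably more (writeback,
truncating the page cache, LRU handling) than is shown. The invariant
being illustrated is the one Dave describes: the inode stays hashed with
I_FREEING set until teardown completes, so concurrent lookups keep
waiting in wait_on_freeing_inode() no matter where the work runs.

	#include <linux/fs.h>

	/* Final step, invented name: only safe once the filesystem is
	 * completely done with the inode (transactions committed, IO done). */
	static void evict_inode_finish(struct inode *inode)
	{
		remove_inode_hash(inode);
		/* wakes waiters sleeping in wait_on_freeing_inode() */
		inode_wake_up_bit(inode, __I_NEW);
		destroy_inode(inode);
	}

	static void evict(struct inode *inode)
	{
		const struct super_operations *op = inode->i_sb->s_op;

		/* I_FREEING is already set and the inode is still hashed. */

		if (op->evict_inode_async) {
			/* The filesystem finishes eviction inline or punts
			 * the expensive part to a workqueue; either way it
			 * calls evict_inode_finish() when it is done. */
			op->evict_inode_async(inode);
			return;
		}

		if (op->evict_inode)
			op->evict_inode(inode);
		evict_inode_finish(inode);
	}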