* Potential Linux Crash: WARNING in ext4_dirty_folio in Linux kernel v6.13-rc5
From: Luka @ 2025-03-06  2:42 UTC
To: Andrew Morton; +Cc: linux-mm, linux-kernel

Dear Linux Kernel Experts,

Hello! I am a security researcher focused on testing Linux kernel
vulnerabilities. Recently, while testing the v6.13-rc5 Linux kernel, we
encountered a crash in the mm subsystem. We have captured the call
trace for this crash. Unfortunately, we have not been able to reproduce
the issue in our local environment, so we are unable to provide a PoC
(proof of concept) at this time.

We fully understand the complexity and importance of Linux kernel
maintenance, and we would like to share this finding with you for
further analysis and confirmation of the root cause. Below is a
summary of the relevant information:

Kernel Version: v6.13.0-rc5

Kernel Module: mm/page_alloc.c

———————————————————————— Call Trace ————————————————————————

WARNING: CPU: 1 PID: 333 at mm/page_alloc.c:4240 __alloc_pages_slowpath mm/page_alloc.c:4240 [inline]
WARNING: CPU: 1 PID: 333 at mm/page_alloc.c:4240 __alloc_pages_noprof+0x1808/0x2040 mm/page_alloc.c:4766
Modules linked in:
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/01/2014
RIP: 0010:__alloc_pages_slowpath mm/page_alloc.c:4240 [inline]
RIP: 0010:__alloc_pages_noprof+0x1808/0x2040 mm/page_alloc.c:4766
Code: 89 fa 48 c1 ea 03 0f b6 14 02 48 89 f8 83 e0 07 83 c0 03 38 d0 7c 08 84 d2 0f 85 b3 07 00 00 f6 43 2d 08 0f 84 30 ed ff ff 90 <0f> 0b 90 e9 27 ed ff ff 44 89 4c 24 38 65 8b 15 c0 89 52 78 89 d2
RSP: 0018:ffff8880141ee990 EFLAGS: 00010202
RAX: 0000000000000007 RBX: ffff888012544400 RCX: 0000000000000000
RDX: 0000000000000000 RSI: 0000000000000002 RDI: ffff88801254442c
RBP: 0000000000048c40 R08: 0000000000000801 R09: 00000000000000f7
R10: 0000000000000000 R11: ffff88813fffdc40 R12: 0000000000000000
R13: 0000000000000400 R14: 0000000000048c40 R15: 0000000000000000
FS:  0000555589d15480(0000) GS:ffff88811b280000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 000055e47d593e61 CR3: 00000000141ce000 CR4: 0000000000350ef0
Call Trace:
 <TASK>
 alloc_pages_mpol_noprof+0xda/0x300 mm/mempolicy.c:2269
 folio_alloc_noprof+0x1e/0x70 mm/mempolicy.c:2355
 filemap_alloc_folio_noprof+0x2b2/0x2f0 mm/filemap.c:1009
 __filemap_get_folio+0x16d/0x3d0 mm/filemap.c:1951
 ext4_mb_load_buddy_gfp+0x42b/0xc00 fs/ext4/mballoc.c:1640
 ext4_discard_preallocations+0x45c/0xc70 fs/ext4/mballoc.c:5592
 ext4_clear_inode+0x3d/0x1e0 fs/ext4/super.c:1523
 ext4_evict_inode+0x1b2/0x1330 fs/ext4/inode.c:323
 evict+0x337/0x7c0 fs/inode.c:796
 dispose_list fs/inode.c:845 [inline]
 prune_icache_sb+0x189/0x290 fs/inode.c:1033
 super_cache_scan+0x33d/0x510 fs/super.c:223
 do_shrink_slab mm/shrinker.c:437 [inline]
 shrink_slab+0x43e/0x930 mm/shrinker.c:664
 shrink_node_memcgs mm/vmscan.c:5931 [inline]
 shrink_node+0x4dd/0x15c0 mm/vmscan.c:5970
 shrink_zones mm/vmscan.c:6215 [inline]
 do_try_to_free_pages+0x284/0x1160 mm/vmscan.c:6277
 try_to_free_pages+0x1ee/0x3e0 mm/vmscan.c:6527
 __perform_reclaim mm/page_alloc.c:3929 [inline]
 __alloc_pages_direct_reclaim mm/page_alloc.c:3951 [inline]
 __alloc_pages_slowpath mm/page_alloc.c:4382 [inline]
 __alloc_pages_noprof+0xa48/0x2040 mm/page_alloc.c:4766
 alloc_pages_bulk_noprof+0x6d6/0xf40 mm/page_alloc.c:4701
 alloc_pages_bulk_array_mempolicy_noprof+0x1fd/0xcb0 mm/mempolicy.c:2559
 vm_area_alloc_pages mm/vmalloc.c:3565 [inline]
 __vmalloc_area_node mm/vmalloc.c:3669 [inline]
 __vmalloc_node_range_noprof+0x453/0x1170 mm/vmalloc.c:3846
 __vmalloc_node_noprof+0xad/0xf0 mm/vmalloc.c:3911
 xt_counters_alloc+0x32/0x60 net/netfilter/x_tables.c:1380
 __do_replace net/ipv4/netfilter/ip_tables.c:1046 [inline]
 do_replace net/ipv4/netfilter/ip_tables.c:1141 [inline]
 do_ipt_set_ctl+0x6d8/0x10d0 net/ipv4/netfilter/ip_tables.c:1635
 nf_setsockopt+0x7d/0xe0 net/netfilter/nf_sockopt.c:101
 ip_setsockopt+0xa4/0xc0 net/ipv4/ip_sockglue.c:1424
 tcp_setsockopt+0x9c/0x100 net/ipv4/tcp.c:4030
 do_sock_setsockopt+0xd3/0x1a0 net/socket.c:2313
 __sys_setsockopt+0x105/0x170 net/socket.c:2338
 __do_sys_setsockopt net/socket.c:2344 [inline]
 __se_sys_setsockopt net/socket.c:2341 [inline]
 __x64_sys_setsockopt+0xbd/0x160 net/socket.c:2341
 do_syscall_x64 arch/x86/entry/common.c:52 [inline]
 do_syscall_64+0xa6/0x1a0 arch/x86/entry/common.c:83
 entry_SYSCALL_64_after_hwframe+0x77/0x7f
RIP: 0033:0x7fc5c73fa87e
Code: 0f 1f 40 00 48 c7 c2 b0 ff ff ff f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b1 0f 1f 00 f3 0f 1e fa 49 89 ca b8 36 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 0a c3 66 0f 1f 84 00 00 00 00 00 48 c7 c2 b0
RSP: 002b:00007ffc1866e9a8 EFLAGS: 00000206 ORIG_RAX: 0000000000000036
RAX: ffffffffffffffda RBX: 00007ffc1866ea30 RCX: 00007fc5c73fa87e
RDX: 0000000000000040 RSI: 0000000000000000 RDI: 0000000000000003
RBP: 0000000000000003 R08: 00000000000002d8 R09: 00007ffc1866ef30
R10: 00007fc5c75c0c60 R11: 0000000000000206 R12: 00007fc5c75c0c00
R13: 00007ffc1866e9cc R14: 0000000000000000 R15: 00007fc5c75c2dc0
 </TASK>

———————————————————————— Call Trace ————————————————————————

If you need more details or additional test results, please feel free
to let us know. Thank you so much for your attention! Please don't
hesitate to reach out if you have any suggestions or need further
communication.

Best regards,
Luka
* Re: Potential Linux Crash: WARNING in ext4_dirty_folio in Linux kernel v6.13-rc5
From: Matthew Wilcox @ 2025-03-06  5:13 UTC
To: Luka
Cc: Andrew Morton, linux-mm, linux-kernel, Theodore Ts'o, Andreas Dilger,
    linux-ext4, linux-fsdevel

On Thu, Mar 06, 2025 at 10:42:58AM +0800, Luka wrote:
> We fully understand the complexity and importance of Linux kernel
> maintenance, and we would like to share this finding with you for
> further analysis and confirmation of the root cause. Below is a
> summary of the relevant information:

This is the exact same problem I just analysed for you.  Except this
time it's ext4 rather than FAT.

https://lore.kernel.org/linux-mm/Z8kuWyqj8cS-stKA@casper.infradead.org/
for the benefit of the ext4 people who're just finding out about this.

> Kernel Version: v6.13.0-rc5
>
> Kernel Module: mm/page_alloc.c
>
> WARNING: CPU: 1 PID: 333 at mm/page_alloc.c:4240
>   __alloc_pages_slowpath mm/page_alloc.c:4240 [inline]
> WARNING: CPU: 1 PID: 333 at mm/page_alloc.c:4240
>   __alloc_pages_noprof+0x1808/0x2040 mm/page_alloc.c:4766
> [...]
* Re: Potential Linux Crash: WARNING in ext4_dirty_folio in Linux kernel v6.13-rc5
From: Matt Fleming @ 2025-03-26 10:59 UTC
To: willy
Cc: adilger.kernel, akpm, linux-ext4, linux-fsdevel, linux-kernel, linux-mm,
    luka.2016.cs, tytso, Barry Song, kernel-team, Vlastimil Babka,
    Miklos Szeredi, Amir Goldstein

On Thu, Mar 06, 2025 at 05:13:51 +0000, Matthew wrote:
> This is the exact same problem I just analysed for you.  Except this
> time it's ext4 rather than FAT.
>
> https://lore.kernel.org/linux-mm/Z8kuWyqj8cS-stKA@casper.infradead.org/
> for the benefit of the ext4 people who're just finding out about this.

Hi there,

I'm also seeing this PF_MEMALLOC WARN triggered from kswapd in 6.12.19.

Does overlayfs need some kind of background inode reclaim support?

Call Trace:
 <TASK>
 __alloc_pages_noprof+0x31c/0x330
 alloc_pages_mpol_noprof+0xe3/0x1d0
 folio_alloc_noprof+0x5b/0xa0
 __filemap_get_folio+0x1f3/0x380
 __getblk_slow+0xa3/0x1e0
 __ext4_get_inode_loc+0x121/0x4b0
 ext4_get_inode_loc+0x40/0xa0
 ext4_reserve_inode_write+0x39/0xc0
 __ext4_mark_inode_dirty+0x5b/0x220
 ext4_evict_inode+0x26d/0x690
 evict+0x112/0x2a0
 __dentry_kill+0x71/0x180
 dput+0xeb/0x1b0
 ovl_stack_put+0x2e/0x50 [overlay]
 ovl_destroy_inode+0x3a/0x60 [overlay]
 destroy_inode+0x3b/0x70
 __dentry_kill+0x71/0x180
 shrink_dentry_list+0x6b/0xe0
 prune_dcache_sb+0x56/0x80
 super_cache_scan+0x12c/0x1e0
 do_shrink_slab+0x13b/0x350
 shrink_slab+0x278/0x3a0
 shrink_node+0x328/0x880
 balance_pgdat+0x36d/0x740
 kswapd+0x1f0/0x380
 kthread+0xd2/0x100
 ret_from_fork+0x34/0x50
 ret_from_fork_asm+0x1a/0x30
 </TASK>

Thanks,
Matt
* Re: Potential Linux Crash: WARNING in ext4_dirty_folio in Linux kernel v6.13-rc5
From: Matt Fleming @ 2025-04-03 12:29 UTC
To: willy
Cc: adilger.kernel, akpm, linux-ext4, linux-fsdevel, linux-kernel, linux-mm,
    luka.2016.cs, tytso, Barry Song, kernel-team, Vlastimil Babka,
    Miklos Szeredi, Amir Goldstein, Dave Chinner, Qi Zheng, Roman Gushchin,
    Muchun Song

On Wed, Mar 26, 2025 at 10:59 AM Matt Fleming <matt@readmodwrite.com> wrote:
>
> Hi there,
>
> I'm also seeing this PF_MEMALLOC WARN triggered from kswapd in 6.12.19.
>
> Does overlayfs need some kind of background inode reclaim support?

Hey everyone, I know there was some off-list discussion last week at
LSFMM, but I don't think a definite solution has been proposed for the
below stacktrace.

What is the shrinker API policy wrt memory allocation and I/O? Should
overlayfs do something more like XFS and background reclaim to avoid
GFP_NOFAIL allocations when kswapd is shrinking caches?

> Call Trace:
>  <TASK>
>  __alloc_pages_noprof+0x31c/0x330
>  alloc_pages_mpol_noprof+0xe3/0x1d0
>  folio_alloc_noprof+0x5b/0xa0
>  __filemap_get_folio+0x1f3/0x380
>  __getblk_slow+0xa3/0x1e0
>  __ext4_get_inode_loc+0x121/0x4b0
>  ext4_get_inode_loc+0x40/0xa0
>  ext4_reserve_inode_write+0x39/0xc0
>  __ext4_mark_inode_dirty+0x5b/0x220
>  ext4_evict_inode+0x26d/0x690
>  evict+0x112/0x2a0
>  __dentry_kill+0x71/0x180
>  dput+0xeb/0x1b0
>  ovl_stack_put+0x2e/0x50 [overlay]
>  ovl_destroy_inode+0x3a/0x60 [overlay]
>  destroy_inode+0x3b/0x70
>  __dentry_kill+0x71/0x180
>  shrink_dentry_list+0x6b/0xe0
>  prune_dcache_sb+0x56/0x80
>  super_cache_scan+0x12c/0x1e0
>  do_shrink_slab+0x13b/0x350
>  shrink_slab+0x278/0x3a0
>  shrink_node+0x328/0x880
>  balance_pgdat+0x36d/0x740
>  kswapd+0x1f0/0x380
>  kthread+0xd2/0x100
>  ret_from_fork+0x34/0x50
>  ret_from_fork_asm+0x1a/0x30
>  </TASK>
* Re: Potential Linux Crash: WARNING in ext4_dirty_folio in Linux kernel v6.13-rc5
From: Vlastimil Babka @ 2025-04-03 12:58 UTC
To: Matt Fleming, willy
Cc: adilger.kernel, akpm, linux-ext4, linux-fsdevel, linux-kernel, linux-mm,
    luka.2016.cs, tytso, Barry Song, kernel-team, Miklos Szeredi,
    Amir Goldstein, Dave Chinner, Qi Zheng, Roman Gushchin, Muchun Song,
    Michal Hocko

On 4/3/25 14:29, Matt Fleming wrote:
> On Wed, Mar 26, 2025 at 10:59 AM Matt Fleming <matt@readmodwrite.com> wrote:
>>
>> Hi there,

+ Cc also Michal

>> I'm also seeing this PF_MEMALLOC WARN triggered from kswapd in 6.12.19.

We're talking about __alloc_pages_slowpath() doing
WARN_ON_ONCE(current->flags & PF_MEMALLOC); for __GFP_NOFAIL allocations.

kswapd() sets:

	tsk->flags |= PF_MEMALLOC | PF_KSWAPD;

so any __GFP_NOFAIL allocation done in the kswapd context risks this
warning. It's also objectively bad IMHO because for direct reclaim we can
loop and hope kswapd rescues us, but kswapd would then have to rely on
direct reclaimers to get unstuck. I don't see an easy generic solution?

>> Does overlayfs need some kind of background inode reclaim support?
>
> Hey everyone, I know there was some off-list discussion last week at
> LSFMM, but I don't think a definite solution has been proposed for the
> below stacktrace.
>
> What is the shrinker API policy wrt memory allocation and I/O? Should
> overlayfs do something more like XFS and background reclaim to avoid
> GFP_NOFAIL allocations when kswapd is shrinking caches?
>
>> Call Trace:
>> [...]
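
To make the warning under discussion concrete, the following is a heavily
simplified sketch of the two pieces Vlastimil describes: kswapd marking
itself with PF_MEMALLOC, and the page allocator slow path warning when a
__GFP_NOFAIL request arrives from a task that is already inside reclaim.
It is modelled loosely on mm/vmscan.c and mm/page_alloc.c; the real
functions have different signatures and much more surrounding logic.

	#include <linux/gfp.h>
	#include <linux/sched.h>

	/* kswapd marks itself as a reclaimer for its whole lifetime: */
	static int kswapd(void *p)
	{
		struct task_struct *tsk = current;

		tsk->flags |= PF_MEMALLOC | PF_KSWAPD;

		/* ... balance_pgdat() -> shrink_node() -> shrink_slab(),
		 * which may reach super_cache_scan() -> evict() ->
		 * ->evict_inode() and allocate memory along the way ... */

		tsk->flags &= ~(PF_MEMALLOC | PF_KSWAPD);
		return 0;
	}

	/* The slow path refuses to "support" __GFP_NOFAIL from a task that
	 * is itself reclaiming: such a task cannot recurse into reclaim and
	 * can only loop waiting for someone else to free memory. */
	static struct page *__alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order)
	{
		struct page *page = NULL;

		/* ... reclaim/compaction retry loops elided ... */

		if (gfp_mask & __GFP_NOFAIL) {
			WARN_ON_ONCE(current->flags & PF_MEMALLOC);
			/* ... then keep retrying rather than return NULL ... */
		}

		return page;
	}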
* Re: Potential Linux Crash: WARNING in ext4_dirty_folio in Linux kernel v6.13-rc5
From: Michal Hocko @ 2025-04-03 14:33 UTC
To: Vlastimil Babka
Cc: Matt Fleming, willy, adilger.kernel, akpm, linux-ext4, linux-fsdevel,
    linux-kernel, linux-mm, luka.2016.cs, tytso, Barry Song, kernel-team,
    Miklos Szeredi, Amir Goldstein, Dave Chinner, Qi Zheng, Roman Gushchin,
    Muchun Song

On Thu 03-04-25 14:58:25, Vlastimil Babka wrote:
> On 4/3/25 14:29, Matt Fleming wrote:
> > On Wed, Mar 26, 2025 at 10:59 AM Matt Fleming <matt@readmodwrite.com> wrote:
> >>
> >> Hi there,
>
> + Cc also Michal
>
> >> I'm also seeing this PF_MEMALLOC WARN triggered from kswapd in 6.12.19.
>
> We're talking about __alloc_pages_slowpath() doing
> WARN_ON_ONCE(current->flags & PF_MEMALLOC); for __GFP_NOFAIL allocations.
>
> kswapd() sets:
>
> 	tsk->flags |= PF_MEMALLOC | PF_KSWAPD;
>
> so any __GFP_NOFAIL allocation done in the kswapd context risks this
> warning. It's also objectively bad IMHO because for direct reclaim we can
> loop and hope kswapd rescues us, but kswapd would then have to rely on
> direct reclaimers to get unstuck. I don't see an easy generic solution?

Right. I do not think NOFAIL request from the reclaim context is really
something we can commit to support. This really needs to be addressed
on the shrinker side.

> [...]

--
Michal Hocko
SUSE Labs
* Re: Potential Linux Crash: WARNING in ext4_dirty_folio in Linux kernel v6.13-rc5
From: Matthew Wilcox @ 2025-04-03 17:12 UTC
To: Matt Fleming
Cc: adilger.kernel, akpm, linux-ext4, linux-fsdevel, linux-kernel, linux-mm,
    luka.2016.cs, tytso, Barry Song, kernel-team, Vlastimil Babka,
    Miklos Szeredi, Amir Goldstein, Dave Chinner, Qi Zheng, Roman Gushchin,
    Muchun Song

On Thu, Apr 03, 2025 at 01:29:44PM +0100, Matt Fleming wrote:
> On Wed, Mar 26, 2025 at 10:59 AM Matt Fleming <matt@readmodwrite.com> wrote:
> >
> > Hi there,
> >
> > I'm also seeing this PF_MEMALLOC WARN triggered from kswapd in 6.12.19.
> >
> > Does overlayfs need some kind of background inode reclaim support?
>
> Hey everyone, I know there was some off-list discussion last week at
> LSFMM, but I don't think a definite solution has been proposed for the
> below stacktrace.

Hi Matt,

We did have a substantial discussion at LSFMM and we just had another
discussion on the ext4 call.  I'm going to try to summarise those
discussions here, and people can jump in to correct me (I'm not really
an expert on this part of MM-FS interaction).

At LSFMM, we came up with a solution that doesn't work, so let's start
with ideas that don't work:

 - Allow PF_MEMALLOC to dip into the atomic reserves.  With large block
   devices, we might end up doing emergency high-order allocations, and
   that makes everybody nervous.
 - Only allow inode reclaim from kswapd and not from direct reclaim.
   Your stack trace here is from kswapd, so obviously that doesn't work.
 - Allow ->evict_inode to return an error.  At this point the inode has
   been taken off the lists, which means that somebody else may have
   already started constructing it again, and we can't just put it back
   on the lists.

Jan explained that _usually_ the reclaim path is not the last holder of
a reference to the inode.  What's happening here is that we've lost a
race where the dentry is being turned negative by somebody else at the
same time; usually they'd have the last reference and call evict.  But
if the shrinker has the last reference, it has to do the eviction.

Jan does not think that Overlayfs is a factor here.  It may change the
timing somewhat but should not make the race wider (nor narrower).

Ideas still on the table:

 - Convert all filesystems to use the XFS inode management scheme.
   Nobody is thrilled by this large amount of work.
 - Find a simpler version of the XFS scheme to implement for other
   filesystems.
* Re: Potential Linux Crash: WARNING in ext4_dirty_folio in Linux kernel v6.13-rc5
From: James Bottomley @ 2025-04-03 19:32 UTC
To: Matthew Wilcox, Matt Fleming
Cc: adilger.kernel, akpm, linux-ext4, linux-fsdevel, linux-kernel, linux-mm,
    luka.2016.cs, tytso, Barry Song, kernel-team, Vlastimil Babka,
    Miklos Szeredi, Amir Goldstein, Dave Chinner, Qi Zheng, Roman Gushchin,
    Muchun Song

On Thu, 2025-04-03 at 18:12 +0100, Matthew Wilcox wrote:
[...]
> Ideas still on the table:
>
> - Convert all filesystems to use the XFS inode management scheme.
>   Nobody is thrilled by this large amount of work.
> - Find a simpler version of the XFS scheme to implement for other
>   filesystems.

What's wrong with a simpler fix: if we're in PF_MEMALLOC when we try to
run inode.c:evict(), send it through a workqueue?  It will require some
preallocation (say using a superblock based work entry ... or simply
reuse the destroy_work) but it should be doable.

The analysis says that evicting from reclaim is very rare because it's
a deletion race, so it shouldn't matter that it's firing once per inode
with this condition.

Regards,

James
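
As a rough illustration of the workqueue idea above (purely a sketch,
not a real patch): the helper names deferred_evict and evict_or_defer
are invented here, and a real version would use a preallocated work item
as James notes, rather than allocating one at eviction time.

	#include <linux/fs.h>
	#include <linux/sched.h>
	#include <linux/slab.h>
	#include <linux/workqueue.h>

	/* Invented for this sketch; a real patch would embed a preallocated
	 * work item (per-inode or per-superblock) instead of allocating one
	 * here, since this allocation can itself fail under pressure. */
	struct deferred_evict {
		struct work_struct work;
		struct inode *inode;
	};

	static void deferred_evict_workfn(struct work_struct *work)
	{
		struct deferred_evict *de =
			container_of(work, struct deferred_evict, work);

		evict(de->inode);	/* now runs in ordinary process context */
		kfree(de);
	}

	/* Hypothetical replacement for the direct evict() calls in fs/inode.c */
	static void evict_or_defer(struct inode *inode)
	{
		struct deferred_evict *de;

		if (!(current->flags & PF_MEMALLOC)) {
			evict(inode);
			return;
		}

		/* Reclaim context (kswapd or direct reclaim): punt the
		 * eviction, and with it any GFP_NOFAIL allocations, to a
		 * workqueue. */
		de = kzalloc(sizeof(*de), GFP_NOWAIT);
		if (!de) {
			evict(inode);	/* fall back to the old behaviour */
			return;
		}
		de->inode = inode;
		INIT_WORK(&de->work, deferred_evict_workfn);
		schedule_work(&de->work);
	}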
* Re: Potential Linux Crash: WARNING in ext4_dirty_folio in Linux kernel v6.13-rc5
From: Vlastimil Babka @ 2025-04-04  9:09 UTC
To: Matthew Wilcox, Matt Fleming
Cc: adilger.kernel, akpm, linux-ext4, linux-fsdevel, linux-kernel, linux-mm,
    luka.2016.cs, tytso, Barry Song, kernel-team, Miklos Szeredi,
    Amir Goldstein, Dave Chinner, Qi Zheng, Roman Gushchin, Muchun Song,
    James E.J. Bottomley

On 4/3/25 19:12, Matthew Wilcox wrote:
> Ideas still on the table:
>
> - Convert all filesystems to use the XFS inode management scheme.
>   Nobody is thrilled by this large amount of work.
> - Find a simpler version of the XFS scheme to implement for other
>   filesystems.

I don't know the XFS scheme, but this situation seems like a match for
the mempool semantics? (I assume it's also a lot of work to implement)
* Re: Potential Linux Crash: WARNING in ext4_dirty_folio in Linux kernel v6.13-rc5
From: Matthew Wilcox @ 2025-04-04 13:50 UTC
To: Vlastimil Babka
Cc: Matt Fleming, adilger.kernel, akpm, linux-ext4, linux-fsdevel,
    linux-kernel, linux-mm, luka.2016.cs, tytso, Barry Song, kernel-team,
    Miklos Szeredi, Amir Goldstein, Dave Chinner, Qi Zheng, Roman Gushchin,
    Muchun Song, James E.J. Bottomley

On Fri, Apr 04, 2025 at 11:09:37AM +0200, Vlastimil Babka wrote:
> On 4/3/25 19:12, Matthew Wilcox wrote:
> > Ideas still on the table:
> >
> > - Convert all filesystems to use the XFS inode management scheme.
> >   Nobody is thrilled by this large amount of work.
> > - Find a simpler version of the XFS scheme to implement for other
> >   filesystems.
>
> I don't know the XFS scheme, but this situation seems like a match for
> the mempool semantics? (I assume it's also a lot of work to implement)

Ah; no.  Evicting an inode may consume an arbitrary amount of memory,
run transactions, wait for I/O, etc., etc.  We really shouldn't be doing
it as part of memory reclaim.  I should probably have said that as part
of this writeup, so thanks for bringing it up.
* Re: Potential Linux Crash: WARNING in ext4_dirty_folio in Linux kernel v6.13-rc5
From: Dave Chinner @ 2025-04-07 23:00 UTC
To: Matthew Wilcox
Cc: Matt Fleming, adilger.kernel, akpm, linux-ext4, linux-fsdevel,
    linux-kernel, linux-mm, luka.2016.cs, tytso, Barry Song, kernel-team,
    Vlastimil Babka, Miklos Szeredi, Amir Goldstein, Qi Zheng,
    Roman Gushchin, Muchun Song

On Thu, Apr 03, 2025 at 06:12:26PM +0100, Matthew Wilcox wrote:
[...]
> At LSFMM, we came up with a solution that doesn't work, so let's start
> with ideas that don't work:
>
> - Allow PF_MEMALLOC to dip into the atomic reserves.  With large block
>   devices, we might end up doing emergency high-order allocations, and
>   that makes everybody nervous.
> - Only allow inode reclaim from kswapd and not from direct reclaim.

That's what GFP_NOFS does. We already rely on kswapd to do inode
reclaim rather than direct reclaim when filesystem cache pressure is
driving memory reclaim...

>   Your stack trace here is from kswapd, so obviously that doesn't work.
> - Allow ->evict_inode to return an error.  At this point the inode has
>   been taken off the lists, which means that somebody else may have
>   already started constructing it again, and we can't just put it back
>   on the lists.

No. When ->evict_inode is called, the inode hasn't been taken off the
inode hash list. Hence the inode can still be found via cache lookups
whilst evict_inode() is running. However, the inode will have I_FREEING
set, so lookups will call wait_on_freeing_inode() before retrying the
lookup. They will get woken by the inode_wake_up_bit() call in evict()
that happens after ->evict_inode returns, so I_FREEING is what provides
->evict_inode serialisation against new lookups trying to recreate the
inode whilst it is being torn down.

IOWs, nothing should be reconstructing the inode whilst evict() is
tearing it down because it can still be found in the inode hash.

[...]

> Ideas still on the table:
>
> - Convert all filesystems to use the XFS inode management scheme.
>   Nobody is thrilled by this large amount of work.

There is no need to do that.

> - Find a simpler version of the XFS scheme to implement for other
>   filesystems.

If we push the last half of evict_inode() out to the background thread
(i.e. go async before remove_inode_hash() is called), then new lookups
will still serialise on the inode hash due to I_FREEING being set.

i.e. problems only arise if the inode is removed from lookup visibility
whilst it still has cleanup work pending.

e.g. have the filesystem provide a ->evict_inode_async() method that
either completes inode eviction directly or punts it to a workqueue
where it does the work and then completes inode eviction. As long as
all this work is done whilst the inode is marked I_FREEING and is
present in the inode hash, then new lookups will serialise on the
eviction work regardless of how it is scheduled.

It is likely we could simplify the XFS code by converting it over to a
mechanism like this, rather than playing the long-standing "defer
everything to background threads from ->destroy_inode()" game that we
currently do.

-Dave.
--
Dave Chinner
david@fromorbit.com
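
A sketch of how the proposed hook might slot into fs/inode.c:evict().
The ->evict_inode_async() name comes from the proposal above, but
evict_inode_finish() and the exact split of work are invented here for
illustration, and the real evict() does considerably more (writeback,
truncating the page cache, LRU handling) than is shown. The invariant
being illustrated is the one Dave describes: the inode stays hashed with
I_FREEING set until teardown completes, so concurrent lookups keep
waiting in wait_on_freeing_inode() no matter where the work runs.

	#include <linux/fs.h>

	/* Final step, invented name: only safe once the filesystem is
	 * completely done with the inode (transactions committed, IO done). */
	static void evict_inode_finish(struct inode *inode)
	{
		remove_inode_hash(inode);
		/* wakes waiters sleeping in wait_on_freeing_inode() */
		inode_wake_up_bit(inode, __I_NEW);
		destroy_inode(inode);
	}

	static void evict(struct inode *inode)
	{
		const struct super_operations *op = inode->i_sb->s_op;

		/* I_FREEING is already set and the inode is still hashed. */

		if (op->evict_inode_async) {
			/* The filesystem finishes eviction inline or punts
			 * the expensive part to a workqueue; either way it
			 * calls evict_inode_finish() when it is done. */
			op->evict_inode_async(inode);
			return;
		}

		if (op->evict_inode)
			op->evict_inode(inode);
		evict_inode_finish(inode);
	}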