* Hang with xfs/285 on 2026-03-02 kernel
@ 2026-04-03 15:35 Matthew Wilcox
2026-04-04 11:42 ` Dave Chinner
2026-04-07 5:41 ` Christoph Hellwig
0 siblings, 2 replies; 9+ messages in thread
From: Matthew Wilcox @ 2026-04-03 15:35 UTC (permalink / raw)
To: linux-xfs
This is with commit 5619b098e2fb so after 7.0-rc6
xfs/285 run fstests xfs/285 at 2026-04-03 06:11:42
XFS (vdc): Mounting V5 Filesystem e091474f-2cd9-4425-a30c-1114d62d130b
XFS (vdc): Ending clean mount
INFO: task fsstress:3762792 blocked for more than 120 seconds.
Not tainted 7.0.0-rc6-ktest-00166-g5619b098e2fb #104
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:fsstress state:D stack:0 pid:3762792 tgid:3762792 ppid:3762783 task_flags:0x440140 flags:0x00080000
Call Trace:
<TASK>
__schedule+0x560/0xfc0
schedule+0x3e/0x140
schedule_timeout+0xb3/0x110
__down_common+0x15c/0x2c0
__down+0x1d/0x30
down+0x68/0x80
xfs_buf_lock+0x4b/0x170
xfs_buf_find_lock+0x69/0x140
xfs_buf_get_map+0x265/0xbd0
xfs_buf_read_map+0x59/0x2e0
xfs_trans_read_buf_map+0x1bb/0x560
? xfs_read_agi+0xab/0x1a0
xfs_read_agi+0xab/0x1a0
xfs_ialloc_read_agi+0x61/0x200
xfs_iwalk_ag_start.constprop.0+0x4e/0x1e0
xfs_iwalk_ag+0x78/0x2d0
xfs_iwalk_args.constprop.0+0x67/0x120
xfs_iwalk+0x93/0xa0
? __pfx_xfs_bulkstat_iwalk+0x10/0x10
xfs_bulkstat+0xce/0x150
? __pfx_xfs_fsbulkstat_one_fmt+0x10/0x10
xfs_ioc_fsbulkstat.isra.0+0x122/0x1f0
xfs_file_ioctl+0xd52/0x1230
? find_held_lock+0x31/0x90
? kmem_cache_free+0x26c/0x460
? lock_release+0xba/0x260
? putname+0x45/0x80
? kmem_cache_free+0x271/0x460
__x64_sys_ioctl+0x4d0/0x9d0
x64_sys_call+0xf1f/0x1dd0
do_syscall_64+0x74/0x3f0
entry_SYSCALL_64_after_hwframe+0x76/0x7e
RIP: 0033:0x7f37be22237b
RSP: 002b:00007ffe8acd1a30 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
RAX: ffffffffffffffda RBX: 0000000000002300 RCX: 00007f37be22237b
RDX: 00007ffe8acd1aa0 RSI: ffffffffc0205865 RDI: 0000000000000003
RBP: 0000000000000003 R08: 00007f37be2fdac0 R09: 0000000000000001
R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000080
R13: 00007ffe8acd1aa0 R14: 000055ac9b289fe0 R15: 000000000001382d
</TASK>
INFO: task fsstress:3762792 blocked on a semaphore likely last held by task fsstress:3762793
task:fsstress state:D stack:0 pid:3762793 tgid:3762793 ppid:3762783 task_flags:0x440140 flags:0x00080800
Call Trace:
<TASK>
__schedule+0x560/0xfc0
schedule+0x3e/0x140
schedule_timeout+0x84/0x110
? __pfx_process_timeout+0x10/0x10
io_schedule_timeout+0x5b/0x80
xfs_buf_alloc+0x793/0x7d0
xfs_buf_get_map+0x651/0xbd0
? _raw_spin_unlock+0x26/0x50
xfs_trans_get_buf_map+0x141/0x300
xfs_ialloc_inode_init+0x130/0x2c0
xfs_ialloc_ag_alloc+0x226/0x710
xfs_dialloc+0x22d/0x980
? xfs_ilock+0x168/0x2b0
xfs_create+0x29e/0x4a0
? __get_acl+0x2d/0x1c0
xfs_generic_create+0x2a4/0x330
xfs_vn_mkdir+0x1e/0x30
vfs_mkdir+0xaf/0x1f0
filename_mkdirat+0x81/0x190
__x64_sys_mkdir+0x32/0x50
x64_sys_call+0x8e4/0x1dd0
do_syscall_64+0x74/0x3f0
entry_SYSCALL_64_after_hwframe+0x76/0x7e
RIP: 0033:0x7f37be218b47
RSP: 002b:00007ffe8acd1958 EFLAGS: 00000206 ORIG_RAX: 0000000000000053
RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f37be218b47
RDX: 0000000000000000 RSI: 00000000000001ff RDI: 000055ac9ac6de40
RBP: 00007ffe8acd1ac0 R08: 000000055ac9aeaa R09: 00007f37be2fdac0
R10: 0000000000000007 R11: 0000000000000206 R12: 00000000000001ff
R13: 00007ffe8acd1ac0 R14: 0000000000002a8d R15: 000055ac98e46790
</TASK>
INFO: task fsstress:3762794 blocked for more than 120 seconds.
Not tainted 7.0.0-rc6-ktest-00166-g5619b098e2fb #104
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:fsstress state:D stack:0 pid:3762794 tgid:3762794 ppid:3762783 task_flags:0x440140 flags:0x00080000
Call Trace:
<TASK>
__schedule+0x560/0xfc0
schedule+0x3e/0x140
schedule_timeout+0xb3/0x110
__down_common+0x15c/0x2c0
__down+0x1d/0x30
down+0x68/0x80
xfs_buf_lock+0x4b/0x170
xfs_buf_find_lock+0x69/0x140
xfs_buf_get_map+0x265/0xbd0
? find_held_lock+0x31/0x90
xfs_buf_read_map+0x59/0x2e0
xfs_trans_read_buf_map+0x1bb/0x560
? xfs_read_agi+0xab/0x1a0
xfs_read_agi+0xab/0x1a0
xfs_ialloc_read_agi+0x61/0x200
xfs_dialloc+0x1f1/0x980
? xfs_ilock+0x168/0x2b0
xfs_create+0x29e/0x4a0
? __get_acl+0x2d/0x1c0
xfs_generic_create+0x2a4/0x330
xfs_vn_mknod+0x18/0x20
vfs_mknod+0xcd/0x200
filename_mknodat+0x1fd/0x2a0
__x64_sys_mknodat+0x3f/0x60
x64_sys_call+0x1c77/0x1dd0
do_syscall_64+0x74/0x3f0
entry_SYSCALL_64_after_hwframe+0x76/0x7e
RIP: 0033:0x7f37be218bf3
RSP: 002b:00007ffe8acd1958 EFLAGS: 00000256 ORIG_RAX: 0000000000000103
RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f37be218bf3
RDX: 0000000000002124 RSI: 000055ac9ac3ff40 RDI: 00000000ffffff9c
RBP: 00007ffe8acd1ac0 R08: 000000055ac9afba R09: 00007f37be2fdac0
R10: 0000000000000000 R11: 0000000000000256 R12: 0000000000002124
R13: 0000000000000000 R14: 00000000000017a5 R15: 000055ac98e468f0
</TASK>
INFO: task fsstress:3762794 blocked on a semaphore likely last held by task fsstress:3762793
task:fsstress state:D stack:0 pid:3762793 tgid:3762793 ppid:3762783 task_flags:0x440140 flags:0x00080800
Call Trace:
<TASK>
__schedule+0x560/0xfc0
schedule+0x3e/0x140
schedule_timeout+0x84/0x110
? __pfx_process_timeout+0x10/0x10
io_schedule_timeout+0x5b/0x80
xfs_buf_alloc+0x793/0x7d0
xfs_buf_get_map+0x651/0xbd0
? _raw_spin_unlock+0x26/0x50
xfs_trans_get_buf_map+0x141/0x300
xfs_ialloc_inode_init+0x130/0x2c0
xfs_ialloc_ag_alloc+0x226/0x710
xfs_dialloc+0x22d/0x980
? xfs_ilock+0x168/0x2b0
xfs_create+0x29e/0x4a0
? __get_acl+0x2d/0x1c0
xfs_generic_create+0x2a4/0x330
xfs_vn_mkdir+0x1e/0x30
vfs_mkdir+0xaf/0x1f0
filename_mkdirat+0x81/0x190
__x64_sys_mkdir+0x32/0x50
x64_sys_call+0x8e4/0x1dd0
do_syscall_64+0x74/0x3f0
entry_SYSCALL_64_after_hwframe+0x76/0x7e
RIP: 0033:0x7f37be218b47
RSP: 002b:00007ffe8acd1958 EFLAGS: 00000206 ORIG_RAX: 0000000000000053
RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f37be218b47
RDX: 0000000000000000 RSI: 00000000000001ff RDI: 000055ac9ac6de40
RBP: 00007ffe8acd1ac0 R08: 000000055ac9aeaa R09: 00007f37be2fdac0
R10: 0000000000000007 R11: 0000000000000206 R12: 00000000000001ff
R13: 00007ffe8acd1ac0 R14: 0000000000002a8d R15: 000055ac98e46790
</TASK>
INFO: task fsstress:3762795 blocked for more than 120 seconds.
Not tainted 7.0.0-rc6-ktest-00166-g5619b098e2fb #104
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:fsstress state:D stack:0 pid:3762795 tgid:3762795 ppid:3762783 task_flags:0x440140 flags:0x00080000
Call Trace:
<TASK>
__schedule+0x560/0xfc0
schedule+0x3e/0x140
schedule_timeout+0xb3/0x110
__down_common+0x15c/0x2c0
__down+0x1d/0x30
down+0x68/0x80
xfs_buf_lock+0x4b/0x170
xfs_buf_find_lock+0x69/0x140
xfs_buf_get_map+0x265/0xbd0
? xfs_trans_add_item+0xf2/0x1b0
xfs_buf_read_map+0x59/0x2e0
xfs_trans_read_buf_map+0x1bb/0x560
? xfs_read_agi+0xab/0x1a0
xfs_read_agi+0xab/0x1a0
xfs_ialloc_read_agi+0x61/0x200
xfs_iwalk_ag_start.constprop.0+0x4e/0x1e0
xfs_iwalk_ag+0x78/0x2d0
xfs_iwalk_args.constprop.0+0x67/0x120
xfs_iwalk+0x93/0xa0
? __pfx_xfs_bulkstat_iwalk+0x10/0x10
xfs_bulkstat+0xce/0x150
? __pfx_xfs_fsbulkstat_one_fmt+0x10/0x10
xfs_ioc_fsbulkstat.isra.0+0x122/0x1f0
xfs_file_ioctl+0xd52/0x1230
? find_held_lock+0x31/0x90
? kmem_cache_free+0x26c/0x460
? lock_release+0xba/0x260
? putname+0x45/0x80
? kmem_cache_free+0x271/0x460
__x64_sys_ioctl+0x4d0/0x9d0
x64_sys_call+0xf1f/0x1dd0
do_syscall_64+0x74/0x3f0
entry_SYSCALL_64_after_hwframe+0x76/0x7e
RIP: 0033:0x7f37be22237b
RSP: 002b:00007ffe8acd1a30 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
RAX: ffffffffffffffda RBX: 0000000000001f38 RCX: 00007f37be22237b
RDX: 00007ffe8acd1aa0 RSI: ffffffffc0205865 RDI: 0000000000000003
RBP: 0000000000000003 R08: 00007f37be2fdac0 R09: 0000000000000001
R10: 0000000000000000 R11: 0000000000000246 R12: 00000000000003e7
R13: 00007ffe8acd1aa0 R14: 000055ac9b20b7d0 R15: 00000000000130f4
</TASK>
INFO: task fsstress:3762795 blocked on a semaphore likely last held by task fsstress:3762793
task:fsstress state:D stack:0 pid:3762793 tgid:3762793 ppid:3762783 task_flags:0x440140 flags:0x00080800
Call Trace:
<TASK>
__schedule+0x560/0xfc0
schedule+0x3e/0x140
schedule_timeout+0x84/0x110
? __pfx_process_timeout+0x10/0x10
io_schedule_timeout+0x5b/0x80
xfs_buf_alloc+0x793/0x7d0
xfs_buf_get_map+0x651/0xbd0
? _raw_spin_unlock+0x26/0x50
xfs_trans_get_buf_map+0x141/0x300
xfs_ialloc_inode_init+0x130/0x2c0
xfs_ialloc_ag_alloc+0x226/0x710
xfs_dialloc+0x22d/0x980
? xfs_ilock+0x168/0x2b0
xfs_create+0x29e/0x4a0
? __get_acl+0x2d/0x1c0
xfs_generic_create+0x2a4/0x330
xfs_vn_mkdir+0x1e/0x30
vfs_mkdir+0xaf/0x1f0
filename_mkdirat+0x81/0x190
__x64_sys_mkdir+0x32/0x50
x64_sys_call+0x8e4/0x1dd0
do_syscall_64+0x74/0x3f0
entry_SYSCALL_64_after_hwframe+0x76/0x7e
RIP: 0033:0x7f37be218b47
RSP: 002b:00007ffe8acd1958 EFLAGS: 00000206 ORIG_RAX: 0000000000000053
RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f37be218b47
RDX: 0000000000000000 RSI: 00000000000001ff RDI: 000055ac9ac6de40
RBP: 00007ffe8acd1ac0 R08: 000000055ac9aeaa R09: 00007f37be2fdac0
R10: 0000000000000007 R11: 0000000000000206 R12: 00000000000001ff
R13: 00007ffe8acd1ac0 R14: 0000000000002a8d R15: 000055ac98e46790
</TASK>
INFO: task kworker/8:19:3762862 blocked for more than 120 seconds.
Not tainted 7.0.0-rc6-ktest-00166-g5619b098e2fb #104
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:kworker/8:19 state:D stack:0 pid:3762862 tgid:3762862 ppid:2 task_flags:0x4248060 flags:0x00080000
Workqueue: xfs-conv/vdc xfs_end_io
Call Trace:
<TASK>
__schedule+0x560/0xfc0
schedule+0x3e/0x140
schedule_timeout+0xb3/0x110
__down_common+0x15c/0x2c0
__down+0x1d/0x30
down+0x68/0x80
xfs_buf_lock+0x4b/0x170
xfs_buf_find_lock+0x69/0x140
xfs_buf_get_map+0x265/0xbd0
? xfs_btree_overlapped_query_range+0x39f/0x620
xfs_buf_read_map+0x59/0x2e0
xfs_trans_read_buf_map+0x1bb/0x560
? xfs_read_agf+0xa3/0x170
xfs_read_agf+0xa3/0x170
xfs_alloc_read_agf+0x73/0x370
xfs_alloc_fix_freelist+0x2dc/0x670
? find_held_lock+0x31/0x90
xfs_free_extent_fix_freelist+0x5e/0x80
xfs_rmap_finish_one+0xc4/0x300
? kmem_cache_alloc_noprof+0x36a/0x450
? xfs_rmap_update_create_done+0x29/0xb0
xfs_rmap_update_finish_item+0x1e/0x40
xfs_defer_finish_one+0xc0/0x2d0
? xfs_defer_relog+0x56/0x280
xfs_defer_finish_noroll+0x1ad/0x540
xfs_trans_commit+0x4e/0x70
xfs_iomap_write_unwritten+0xdd/0x340
xfs_end_ioend_write+0x219/0x2c0
xfs_end_io+0xdc/0xf0
process_one_work+0x1fb/0x570
? lock_is_held_type+0x93/0x100
worker_thread+0x1e6/0x3f0
? __pfx_worker_thread+0x10/0x10
kthread+0x10d/0x140
? __pfx_kthread+0x10/0x10
ret_from_fork+0x1b4/0x250
? __pfx_kthread+0x10/0x10
ret_from_fork_asm+0x1a/0x30
</TASK>
INFO: task kworker/8:19:3762862 blocked on a semaphore likely last held by task fsstress:3762793
task:fsstress state:D stack:0 pid:3762793 tgid:3762793 ppid:3762783 task_flags:0x440140 flags:0x00080800
Call Trace:
<TASK>
__schedule+0x560/0xfc0
schedule+0x3e/0x140
schedule_timeout+0x84/0x110
? __pfx_process_timeout+0x10/0x10
io_schedule_timeout+0x5b/0x80
xfs_buf_alloc+0x793/0x7d0
xfs_buf_get_map+0x651/0xbd0
? _raw_spin_unlock+0x26/0x50
xfs_trans_get_buf_map+0x141/0x300
xfs_ialloc_inode_init+0x130/0x2c0
xfs_ialloc_ag_alloc+0x226/0x710
xfs_dialloc+0x22d/0x980
? xfs_ilock+0x168/0x2b0
xfs_create+0x29e/0x4a0
? __get_acl+0x2d/0x1c0
xfs_generic_create+0x2a4/0x330
xfs_vn_mkdir+0x1e/0x30
vfs_mkdir+0xaf/0x1f0
filename_mkdirat+0x81/0x190
__x64_sys_mkdir+0x32/0x50
x64_sys_call+0x8e4/0x1dd0
do_syscall_64+0x74/0x3f0
entry_SYSCALL_64_after_hwframe+0x76/0x7e
RIP: 0033:0x7f37be218b47
RSP: 002b:00007ffe8acd1958 EFLAGS: 00000206 ORIG_RAX: 0000000000000053
RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f37be218b47
RDX: 0000000000000000 RSI: 00000000000001ff RDI: 000055ac9ac6de40
RBP: 00007ffe8acd1ac0 R08: 000000055ac9aeaa R09: 00007f37be2fdac0
R10: 0000000000000007 R11: 0000000000000206 R12: 00000000000001ff
R13: 00007ffe8acd1ac0 R14: 0000000000002a8d R15: 000055ac98e46790
</TASK>
Showing all locks held in the system:
1 lock held by khungtaskd/100:
#0: ffffffff826d67c0 (rcu_read_lock){....}-{1:2}, at: debug_show_all_locks+0x51/0x115
5 locks held by kworker/u64:0/3558666:
#0: ffff88810331dd48 ((wq_completion)xfs-blockgc/vdc){....}-{0:0}, at: process_one_work+0x45c/0x570
#1: ffff88810f08fe48 ((work_completion)(&(&pag->pag_blockgc_work)->work)){....}-{0:0}, at: process_one_work+0x1bb/0x570
#2: ffff888155e3c928 (&sb->s_type->i_mutex_key#17){....}-{3:3}, at: xfs_ilock_nowait+0x1ee/0x330
#3: ffff8881478f55e0 (sb_internal#2){....}-{0:0}, at: xfs_free_eofblocks+0xda/0x1c0
#4: ffff888155e3c718 (&xfs_nondir_ilock_class){....}-{3:3}, at: xfs_ilock+0x168/0x2b0
4 locks held by fsstress/3762793:
#0: ffff8881478f53f0 (sb_writers#10){....}-{0:0}, at: filename_create+0x6e/0x180
#1: ffff88816e264228 (&inode->i_sb->s_type->i_mutex_dir_key/1){....}-{3:3}, at: filename_create+0xad/0x180
#2: ffff8881478f55e0 (sb_internal#2){....}-{0:0}, at: xfs_trans_alloc_icreate+0x58/0x100
#3: ffff88816e264018 (&xfs_dir_ilock_class/5){....}-{3:3}, at: xfs_ilock+0x168/0x2b0
4 locks held by fsstress/3762794:
#0: ffff8881478f53f0 (sb_writers#10){....}-{0:0}, at: filename_create+0x6e/0x180
#1: ffff888038517328 (&inode->i_sb->s_type->i_mutex_dir_key/1){....}-{3:3}, at: filename_create+0xad/0x180
#2: ffff8881478f55e0 (sb_internal#2){....}-{0:0}, at: xfs_trans_alloc_icreate+0x58/0x100
#3: ffff888038517118 (&xfs_dir_ilock_class/5){....}-{3:3}, at: xfs_ilock+0x168/0x2b0
4 locks held by kworker/8:19/3762862:
#0: ffff88815e73ed48 ((wq_completion)xfs-conv/vdc){....}-{0:0}, at: process_one_work+0x45c/0x570
#1: ffff888104efbe48 ((work_completion)(&ip->i_ioend_work)){....}-{0:0}, at: process_one_work+0x1bb/0x570
#2: ffff8881478f55e0 (sb_internal#2){....}-{0:0}, at: xfs_trans_alloc_inode+0x7d/0x190
#3: ffff888137bb2b18 (&xfs_nondir_ilock_class){....}-{3:3}, at: xfs_ilock+0x168/0x2b0
(there are more messages after this, but I doubt they're useful)
^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: Hang with xfs/285 on 2026-03-02 kernel
  2026-04-03 15:35 Hang with xfs/285 on 2026-03-02 kernel Matthew Wilcox
@ 2026-04-04 11:42 ` Dave Chinner
  2026-04-04 20:40   ` Matthew Wilcox
  2026-04-05  1:03   ` Ritesh Harjani
  2026-04-07  5:41 ` Christoph Hellwig
  1 sibling, 2 replies; 9+ messages in thread
From: Dave Chinner @ 2026-04-04 11:42 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: linux-xfs

On Fri, Apr 03, 2026 at 04:35:46PM +0100, Matthew Wilcox wrote:
> This is with commit 5619b098e2fb so after 7.0-rc6
> INFO: task fsstress:3762792 blocked on a semaphore likely last held by task fsstress:3762793
> task:fsstress state:D stack:0 pid:3762793 tgid:3762793 ppid:3762783 task_flags:0x440140 flags:0x00080800
> Call Trace:
> <TASK>
> __schedule+0x560/0xfc0
> schedule+0x3e/0x140
> schedule_timeout+0x84/0x110
> ? __pfx_process_timeout+0x10/0x10
> io_schedule_timeout+0x5b/0x80
> xfs_buf_alloc+0x793/0x7d0

-ENOMEM.

It'll be looping here:

fallback:
	for (;;) {
		bp->b_addr = __vmalloc(size, gfp_mask);
		if (bp->b_addr)
			break;
		if (flags & XBF_READ_AHEAD)
			return -ENOMEM;
		XFS_STATS_INC(bp->b_mount, xb_page_retries);
		memalloc_retry_wait(gfp_mask);
	}

If it is looping here long enough to trigger the hang check timer,
then the MM subsystem is not making progress reclaiming memory. This
is probably a 16kB allocation (it's an inode cluster buffer), and
the allocation context is NOFAIL because it is within a transaction
(this loop pre-dates __vmalloc() supporting __GFP_NOFAIL)....

All the other tasks are backed up on the AGI buffer lock held ...

> xfs_buf_get_map+0x651/0xbd0
> ? _raw_spin_unlock+0x26/0x50
> xfs_trans_get_buf_map+0x141/0x300
> xfs_ialloc_inode_init+0x130/0x2c0
> xfs_ialloc_ag_alloc+0x226/0x710
> xfs_dialloc+0x22d/0x980

... here by the task blocked on memory allocation.

This smells like a persistent ENOMEM/memory reclaim issue and XFS is
just the messenger...

-Dave.
-- 
Dave Chinner
dgc@kernel.org

^ permalink raw reply	[flat|nested] 9+ messages in thread
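The retry policy of the loop Dave quotes can be modelled outside the kernel. The sketch below is illustrative Python, not kernel code: `vmalloc_attempt` stands in for `__vmalloc()`, and `XBF_READ_AHEAD` is reduced to a boolean. It shows the asymmetry the thread turns on later: readahead fails fast with -ENOMEM, while every other caller retries indefinitely (capped here so the model terminates).

```python
from typing import Callable, Optional

ENOMEM = -12  # illustrative errno value, not imported from anywhere


def xfs_buf_alloc_fallback(vmalloc_attempt: Callable[[], Optional[str]],
                           read_ahead: bool, max_tries: int = 1000):
    """Toy model of the xfs_buf_alloc() fallback loop quoted above.

    vmalloc_attempt returns an address or None.  Readahead gives up
    after one failed attempt; everything else retries until the
    allocation succeeds (here capped at max_tries so the model ends).
    """
    retries = 0
    for _ in range(max_tries):
        addr = vmalloc_attempt()
        if addr is not None:
            return addr, retries
        if read_ahead:
            return ENOMEM, retries        # early exit, no looping
        retries += 1                      # XFS_STATS_INC(..., xb_page_retries)
        # memalloc_retry_wait() would back off here before retrying
    raise TimeoutError("allocation never succeeded - hung-task territory")


# A transactional buffer succeeds on the third attempt:
attempts = iter([None, None, "addr"])
result, retries = xfs_buf_alloc_fallback(lambda: next(attempts),
                                         read_ahead=False)
# Readahead fails fast instead of looping:
ra_result, ra_retries = xfs_buf_alloc_fallback(lambda: None, read_ahead=True)
```

The model makes it easy to see why only the non-readahead path can pin a task in D state long enough to trip the hung-task detector.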
* Re: Hang with xfs/285 on 2026-03-02 kernel
  2026-04-04 11:42 ` Dave Chinner
@ 2026-04-04 20:40   ` Matthew Wilcox
  2026-04-05 22:29     ` Dave Chinner
  2026-04-05  1:03   ` Ritesh Harjani
  1 sibling, 1 reply; 9+ messages in thread
From: Matthew Wilcox @ 2026-04-04 20:40 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Sat, Apr 04, 2026 at 10:42:59PM +1100, Dave Chinner wrote:
> On Fri, Apr 03, 2026 at 04:35:46PM +0100, Matthew Wilcox wrote:
> > This is with commit 5619b098e2fb so after 7.0-rc6
> > INFO: task fsstress:3762792 blocked on a semaphore likely last held by task fsstress:3762793
> > task:fsstress state:D stack:0 pid:3762793 tgid:3762793 ppid:3762783 task_flags:0x440140 flags:0x00080800
> > Call Trace:
> > <TASK>
> > __schedule+0x560/0xfc0
> > schedule+0x3e/0x140
> > schedule_timeout+0x84/0x110
> > ? __pfx_process_timeout+0x10/0x10
> > io_schedule_timeout+0x5b/0x80
> > xfs_buf_alloc+0x793/0x7d0
>
> -ENOMEM.
>
> It'll be looping here:
>
> fallback:
> 	for (;;) {
> 		bp->b_addr = __vmalloc(size, gfp_mask);
> 		if (bp->b_addr)
> 			break;
> 		if (flags & XBF_READ_AHEAD)
> 			return -ENOMEM;
> 		XFS_STATS_INC(bp->b_mount, xb_page_retries);
> 		memalloc_retry_wait(gfp_mask);
> 	}
>
> If it is looping here long enough to trigger the hang check timer,
> then the MM subsystem is not making progress reclaiming memory. This
> is probably a 16kB allocation (it's an inode cluster buffer), and
> the allocation context is NOFAIL because it is within a transaction
> (this loop pre-dates __vmalloc() supporting __GFP_NOFAIL)....

There may be something else going on.  I reproduced it again and ssh'd
into the VM.

# free
               total        used        free      shared  buff/cache   available
Mem:         3988260     1197132      240080         144     3147496     2791128
Swap:        2097148      258128     1839020

There are five instances of fsstress running.  Very slowly, but they
are accumulating seconds of CPU time:

root@deadly-kvm:~# ps -aux |grep fsstress
root     3745227  0.0  0.0   2664  1476 ?        S    06:48   0:00 ./ltp/fsstress -p 4 -d /mnt/scratch -n 2000000
root     3745236  7.5  1.6 127928 65256 ?        D    06:48  42:54 ./ltp/fsstress -p 4 -d /mnt/scratch -n 2000000
root     3745237  7.6  1.5 124644 61308 ?        D    06:48  42:55 ./ltp/fsstress -p 4 -d /mnt/scratch -n 2000000
root     3745238  7.6  1.6 130844 65584 ?        D    06:48  43:01 ./ltp/fsstress -p 4 -d /mnt/scratch -n 2000000
root     3745239  7.6  1.6 126524 66536 ?        D    06:48  42:58 ./ltp/fsstress -p 4 -d /mnt/scratch -n 2000000
root@deadly-kvm:~# ps -aux |grep fsstress
root     3745227  0.0  0.0   2664  1476 ?        S    06:48   0:00 ./ltp/fsstress -p 4 -d /mnt/scratch -n 2000000
root     3745236  5.5  1.6 133116 66708 ?        R    06:48  45:44 ./ltp/fsstress -p 4 -d /mnt/scratch -n 2000000
root     3745237  5.5  1.5 130136 62516 ?        R    06:48  45:45 ./ltp/fsstress -p 4 -d /mnt/scratch -n 2000000
root     3745238  5.5  1.6 136520 65944 ?        R    06:48  45:52 ./ltp/fsstress -p 4 -d /mnt/scratch -n 2000000
root     3745239  5.5  1.7 131988 67884 ?        R    06:48  45:50 ./ltp/fsstress -p 4 -d /mnt/scratch -n 2000000

# cat /proc/3745239/stack
[<0>] xfs_buf_lock+0x4b/0x170
[<0>] xfs_buf_find_lock+0x69/0x140
[<0>] xfs_buf_get_map+0x265/0xbd0
[<0>] xfs_buf_read_map+0x59/0x2e0
[<0>] xfs_trans_read_buf_map+0x1bb/0x560
[<0>] xfs_read_agi+0xab/0x1a0
(...)

# cat /proc/3745238/stack
[<0>] xfs_buf_alloc+0x793/0x7d0
[<0>] xfs_buf_get_map+0x651/0xbd0
[<0>] xfs_buf_readahead_map+0x3b/0x1b0
[<0>] xfs_iwalk_ichunk_ra+0xe9/0x130
[<0>] xfs_iwalk_ag+0x185/0x2d0
(...)

It doesn't _seem_ like the system is struggling for memory.

# cat /proc/meminfo
MemTotal:        3988260 kB
MemFree:          241956 kB
MemAvailable:    2781960 kB
Buffers:            5184 kB
Cached:          2503020 kB
SwapCached:         4860 kB
Active:          2062948 kB
Inactive:         713828 kB
Active(anon):      85800 kB
Inactive(anon):   182968 kB
Active(file):    1977148 kB
Inactive(file):   530860 kB
Unevictable:           0 kB
Mlocked:               0 kB
SwapTotal:       2097148 kB
SwapFree:        1823052 kB
Dirty:                 0 kB
Writeback:             0 kB
AnonPages:        267836 kB
Mapped:            16280 kB
Shmem:               144 kB
KReclaimable:     628212 kB
Slab:             783840 kB
SReclaimable:     628212 kB
SUnreclaim:       155628 kB
KernelStack:        3536 kB
PageTables:         3680 kB
SecPageTables:         0 kB
NFS_Unstable:          0 kB
Bounce:                0 kB
WritebackTmp:          0 kB
CommitLimit:     4091276 kB
Committed_AS:     560852 kB
VmallocTotal:   34359738367 kB
VmallocUsed:       13004 kB
VmallocChunk:          0 kB
Percpu:             7360 kB
AnonHugePages:     12288 kB
ShmemHugePages:        0 kB
ShmemPmdMapped:        0 kB
FileHugePages:         0 kB
FilePmdMapped:         0 kB
Balloon:               0 kB
DirectMap4k:      153396 kB
DirectMap2M:     4040704 kB
DirectMap1G:     2097152 kB

and an excerpt of zoneinfo:

Node 0, zone   Normal
  pages free     27350
        boost    18939
        min      27357
        low      29461
        high     31565
        promo    33669
        spanned  524288
        present  524288
        managed  496128
        cma      0
        protection: (0, 0, 0, 0)
      nr_free_pages 27350
      nr_free_pages_blocks 0
      nr_zone_inactive_anon 21269
      nr_zone_active_anon 9703
      nr_zone_inactive_file 62769
      nr_zone_active_file 228878
      nr_zone_unevictable 0
      nr_zone_write_pending 0
      nr_mlock 0
      nr_free_cma 0

^ permalink raw reply	[flat|nested] 9+ messages in thread
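One detail in the zoneinfo excerpt above is easy to miss: the Normal zone's free pages (27350) sit just below its min watermark (27357), and the boost term (18939) accounts for most of that watermark. The arithmetic can be checked with a short script. This is purely an illustration of the numbers already quoted; any conclusion about which allocations would fail against the boosted watermark is a reading of the data, not something the excerpt itself proves.

```python
# Values copied from the /proc/zoneinfo excerpt above (Node 0, zone Normal),
# all in pages.
zone = {
    "free": 27350,
    "boost": 18939,
    "min": 27357,
    "low": 29461,
    "high": 31565,
}

# Free pages are below the min watermark, so watermark-gated allocation
# attempts have no headroom at all.
below_min = zone["free"] < zone["min"]

# How much of the min watermark is boost.
boost_share = zone["boost"] / zone["min"]

print(f"free below min: {below_min}")
print(f"boost share of min watermark: {boost_share:.0%}")
```

So the zone is not out of memory in the `free`/`meminfo` sense, yet it is right at the line where allocations that cannot reclaim would start failing.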
* Re: Hang with xfs/285 on 2026-03-02 kernel
  2026-04-04 20:40 ` Matthew Wilcox
@ 2026-04-05 22:29   ` Dave Chinner
  0 siblings, 0 replies; 9+ messages in thread
From: Dave Chinner @ 2026-04-05 22:29 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: linux-xfs

On Sat, Apr 04, 2026 at 09:40:37PM +0100, Matthew Wilcox wrote:
> On Sat, Apr 04, 2026 at 10:42:59PM +1100, Dave Chinner wrote:
> > On Fri, Apr 03, 2026 at 04:35:46PM +0100, Matthew Wilcox wrote:
> > > This is with commit 5619b098e2fb so after 7.0-rc6
> > > INFO: task fsstress:3762792 blocked on a semaphore likely last held by task fsstress:3762793
> > > task:fsstress state:D stack:0 pid:3762793 tgid:3762793 ppid:3762783 task_flags:0x440140 flags:0x00080800
> > > Call Trace:
> > > <TASK>
> > > __schedule+0x560/0xfc0
> > > schedule+0x3e/0x140
> > > schedule_timeout+0x84/0x110
> > > ? __pfx_process_timeout+0x10/0x10
> > > io_schedule_timeout+0x5b/0x80
> > > xfs_buf_alloc+0x793/0x7d0
> >
> > -ENOMEM.
> >
> > It'll be looping here:
> >
> > fallback:
> > 	for (;;) {
> > 		bp->b_addr = __vmalloc(size, gfp_mask);
> > 		if (bp->b_addr)
> > 			break;
> > 		if (flags & XBF_READ_AHEAD)
> > 			return -ENOMEM;
> > 		XFS_STATS_INC(bp->b_mount, xb_page_retries);
> > 		memalloc_retry_wait(gfp_mask);
> > 	}
> >
> > If it is looping here long enough to trigger the hang check timer,
> > then the MM subsystem is not making progress reclaiming memory. This
> > is probably a 16kB allocation (it's an inode cluster buffer), and
> > the allocation context is NOFAIL because it is within a transaction
> > (this loop pre-dates __vmalloc() supporting __GFP_NOFAIL)....
>
> There may be something else going on.  I reproduced it again and ssh'd
> into the VM.
>
> # free
>                total        used        free      shared  buff/cache   available
> Mem:         3988260     1197132      240080         144     3147496     2791128
> Swap:        2097148      258128     1839020
>
> There are five instances of fsstress running.  Very slowly, but they
> are accumulating seconds of CPU time:
>
> root@deadly-kvm:~# ps -aux |grep fsstress
> root     3745227  0.0  0.0   2664  1476 ?        S    06:48   0:00 ./ltp/fsstress -p 4 -d /mnt/scratch -n 2000000
> root     3745236  7.5  1.6 127928 65256 ?        D    06:48  42:54 ./ltp/fsstress -p 4 -d /mnt/scratch -n 2000000
> root     3745237  7.6  1.5 124644 61308 ?        D    06:48  42:55 ./ltp/fsstress -p 4 -d /mnt/scratch -n 2000000
> root     3745238  7.6  1.6 130844 65584 ?        D    06:48  43:01 ./ltp/fsstress -p 4 -d /mnt/scratch -n 2000000
> root     3745239  7.6  1.6 126524 66536 ?        D    06:48  42:58 ./ltp/fsstress -p 4 -d /mnt/scratch -n 2000000
> root@deadly-kvm:~# ps -aux |grep fsstress
> root     3745227  0.0  0.0   2664  1476 ?        S    06:48   0:00 ./ltp/fsstress -p 4 -d /mnt/scratch -n 2000000
> root     3745236  5.5  1.6 133116 66708 ?        R    06:48  45:44 ./ltp/fsstress -p 4 -d /mnt/scratch -n 2000000
> root     3745237  5.5  1.5 130136 62516 ?        R    06:48  45:45 ./ltp/fsstress -p 4 -d /mnt/scratch -n 2000000
> root     3745238  5.5  1.6 136520 65944 ?        R    06:48  45:52 ./ltp/fsstress -p 4 -d /mnt/scratch -n 2000000
> root     3745239  5.5  1.7 131988 67884 ?        R    06:48  45:50 ./ltp/fsstress -p 4 -d /mnt/scratch -n 2000000
>
> # cat /proc/3745239/stack
> [<0>] xfs_buf_lock+0x4b/0x170
> [<0>] xfs_buf_find_lock+0x69/0x140
> [<0>] xfs_buf_get_map+0x265/0xbd0
> [<0>] xfs_buf_read_map+0x59/0x2e0
> [<0>] xfs_trans_read_buf_map+0x1bb/0x560
> [<0>] xfs_read_agi+0xab/0x1a0
> (...)

It would be helpful to quote the full stack traces...

> # cat /proc/3745238/stack
> [<0>] xfs_buf_alloc+0x793/0x7d0
> [<0>] xfs_buf_get_map+0x651/0xbd0
> [<0>] xfs_buf_readahead_map+0x3b/0x1b0
> [<0>] xfs_iwalk_ichunk_ra+0xe9/0x130
> [<0>] xfs_iwalk_ag+0x185/0x2d0
> (...)

However, how is memory allocation stuck here? That's the readahead
path, which triggers an early exit from the __vmalloc() fallback
loop. i.e. xfs_buf_alloc() does not loop forever on readahead - it
tries once and then exits.

Yes, this bulkstat path is holding the AGI buffer locked, and the
previous thread is waiting on the AGI buffer lock, but that doesn't
mean the system is deadlocked - it's just lockstepping on the AGI
buffer lock due to the long hold in the bulkstat path....

i.e. these traces do not indicate that there is any sort of memory
allocation problem in the system, just bulkstat slowing down other
operations...

-Dave.
-- 
Dave Chinner
dgc@kernel.org

^ permalink raw reply	[flat|nested] 9+ messages in thread
* Re: Hang with xfs/285 on 2026-03-02 kernel
  2026-04-04 11:42 ` Dave Chinner
  2026-04-04 20:40   ` Matthew Wilcox
@ 2026-04-05  1:03   ` Ritesh Harjani
  2026-04-05 22:16     ` Dave Chinner
  1 sibling, 1 reply; 9+ messages in thread
From: Ritesh Harjani @ 2026-04-05 1:03 UTC (permalink / raw)
  To: Dave Chinner, Matthew Wilcox; +Cc: linux-xfs

Dave Chinner <dgc@kernel.org> writes:

> On Fri, Apr 03, 2026 at 04:35:46PM +0100, Matthew Wilcox wrote:
>> This is with commit 5619b098e2fb so after 7.0-rc6
>> INFO: task fsstress:3762792 blocked on a semaphore likely last held by task fsstress:3762793
>> task:fsstress state:D stack:0 pid:3762793 tgid:3762793 ppid:3762783 task_flags:0x440140 flags:0x00080800
>> Call Trace:
>> <TASK>
>> __schedule+0x560/0xfc0
>> schedule+0x3e/0x140
>> schedule_timeout+0x84/0x110
>> ? __pfx_process_timeout+0x10/0x10
>> io_schedule_timeout+0x5b/0x80
>> xfs_buf_alloc+0x793/0x7d0
>
> -ENOMEM.
>
> It'll be looping here:
>
> fallback:
> 	for (;;) {
> 		bp->b_addr = __vmalloc(size, gfp_mask);
> 		if (bp->b_addr)
> 			break;
> 		if (flags & XBF_READ_AHEAD)
> 			return -ENOMEM;
> 		XFS_STATS_INC(bp->b_mount, xb_page_retries);
> 		memalloc_retry_wait(gfp_mask);
> 	}
>
> If it is looping here long enough to trigger the hang check timer,
> then the MM subsystem is not making progress reclaiming memory. This

Hi Dave,

If that's the case and if we expect the MM subsystem to do memory
reclaim, shouldn't we be passing the __GFP_DIRECT_RECLAIM flag to our
fallback loop? I see that we might have cleared this flag and also set
__GFP_NORETRY, in the above if condition if allocation size is >PAGE_SIZE.

So shouldn't we do?

	if (size > PAGE_SIZE) {
		if (!is_power_of_2(size))
			goto fallback;
-		gfp_mask &= ~__GFP_DIRECT_RECLAIM;
-		gfp_mask |= __GFP_NORETRY;
+		gfp_t alloc_gfp = (gfp_mask & ~__GFP_DIRECT_RECLAIM) | __GFP_NORETRY;
+		folio = folio_alloc(alloc_gfp, get_order(size));
+	} else {
+		folio = folio_alloc(gfp_mask, get_order(size));
	}
-	folio = folio_alloc(gfp_mask, get_order(size));
	if (!folio) {
		if (size <= PAGE_SIZE)
			return -ENOMEM;
		trace_xfs_buf_backing_fallback(bp, _RET_IP_);
		goto fallback;
	}

-ritesh

> is probably a 16kB allocation (it's an inode cluster buffer), and
> the allocation context is NOFAIL because it is within a transaction
> (this loop pre-dates __vmalloc() supporting __GFP_NOFAIL)....
>
> All the other tasks are backed up on the AGI buffer lock held ...
>
>> xfs_buf_get_map+0x651/0xbd0
>> ? _raw_spin_unlock+0x26/0x50
>> xfs_trans_get_buf_map+0x141/0x300
>> xfs_ialloc_inode_init+0x130/0x2c0
>> xfs_ialloc_ag_alloc+0x226/0x710
>> xfs_dialloc+0x22d/0x980
>
> ... here by the task blocked on memory allocation.
>
> This smells like a persistent ENOMEM/memory reclaim issue and XFS is
> just the messenger...
>
> -Dave.
> --
> Dave Chinner
> dgc@kernel.org

^ permalink raw reply	[flat|nested] 9+ messages in thread
* Re: Hang with xfs/285 on 2026-03-02 kernel
  2026-04-05  1:03 ` Ritesh Harjani
@ 2026-04-05 22:16   ` Dave Chinner
  2026-04-06  0:27     ` Ritesh Harjani
  0 siblings, 1 reply; 9+ messages in thread
From: Dave Chinner @ 2026-04-05 22:16 UTC (permalink / raw)
  To: Ritesh Harjani; +Cc: Matthew Wilcox, linux-xfs

On Sun, Apr 05, 2026 at 06:33:59AM +0530, Ritesh Harjani wrote:
> Dave Chinner <dgc@kernel.org> writes:
>
> > On Fri, Apr 03, 2026 at 04:35:46PM +0100, Matthew Wilcox wrote:
> >> This is with commit 5619b098e2fb so after 7.0-rc6
> >> INFO: task fsstress:3762792 blocked on a semaphore likely last held by task fsstress:3762793
> >> task:fsstress state:D stack:0 pid:3762793 tgid:3762793 ppid:3762783 task_flags:0x440140 flags:0x00080800
> >> Call Trace:
> >> <TASK>
> >> __schedule+0x560/0xfc0
> >> schedule+0x3e/0x140
> >> schedule_timeout+0x84/0x110
> >> ? __pfx_process_timeout+0x10/0x10
> >> io_schedule_timeout+0x5b/0x80
> >> xfs_buf_alloc+0x793/0x7d0
> >
> > -ENOMEM.
> >
> > It'll be looping here:
> >
> > fallback:
> > 	for (;;) {
> > 		bp->b_addr = __vmalloc(size, gfp_mask);
> > 		if (bp->b_addr)
> > 			break;
> > 		if (flags & XBF_READ_AHEAD)
> > 			return -ENOMEM;
> > 		XFS_STATS_INC(bp->b_mount, xb_page_retries);
> > 		memalloc_retry_wait(gfp_mask);
> > 	}
> >
> > If it is looping here long enough to trigger the hang check timer,
> > then the MM subsystem is not making progress reclaiming memory. This
>
> Hi Dave,
>
> If that's the case and if we expect the MM subsystem to do memory
> reclaim, shouldn't we be passing the __GFP_DIRECT_RECLAIM flag to our
> fallback loop? I see that we might have cleared this flag and also set
> __GFP_NORETRY, in the above if condition if allocation size is >PAGE_SIZE.
>
> So shouldn't we do?
>
> 	if (size > PAGE_SIZE) {
> 		if (!is_power_of_2(size))
> 			goto fallback;
> -		gfp_mask &= ~__GFP_DIRECT_RECLAIM;
> -		gfp_mask |= __GFP_NORETRY;
> +		gfp_t alloc_gfp = (gfp_mask & ~__GFP_DIRECT_RECLAIM) | __GFP_NORETRY;
> +		folio = folio_alloc(alloc_gfp, get_order(size));
> +	} else {
> +		folio = folio_alloc(gfp_mask, get_order(size));
> 	}
> -	folio = folio_alloc(gfp_mask, get_order(size));
> 	if (!folio) {
> 		if (size <= PAGE_SIZE)
> 			return -ENOMEM;
> 		trace_xfs_buf_backing_fallback(bp, _RET_IP_);
> 		goto fallback;
> 	}

Possibly.

That said, we really don't want stuff like compaction to run here
-ever- because of how expensive it is for hot paths when memory is
low, and the only knob we have to control that is
__GFP_DIRECT_RECLAIM.

However, turning off direct reclaim should make no difference in
the long run because vmalloc is only trying to allocate a batch of
single page folios.

If we are in low memory situations where single page folios are not
available, then even for a NORETRY/no direct reclaim allocation
the expectation is that the failed allocation attempt would be
kicking kswapd to perform background memory reclaim.

This is especially true when the allocation is GFP_NOFS/GFP_NOIO
even with direct reclaim turned on - if all the memory is held in
shrinkable fs/vfs caches then direct reclaim cannot reclaim anything
filesystem/IO related.

i.e. background reclaim making forwards progress is absolutely
necessary for any sort of "nofail" allocation loop to succeed
regardless of whether direct reclaim is enabled or not.

Hence if background memory reclaim is making progress, this
allocation loop should eventually succeed. If the allocation is not
succeeding, then it implies that some critical resource in the
allocation path is not being refilled either on allocation failure
or by background reclaim, and hence the allocation failure persists
because nothing alleviates the resource shortage that is triggering
the ENOMEM issue.

So the question is: where in the __vmalloc allocation path is the
ENOMEM error being generated from, and is it the same place every
time?

-Dave.
-- 
Dave Chinner
dgc@kernel.org

^ permalink raw reply	[flat|nested] 9+ messages in thread
* Re: Hang with xfs/285 on 2026-03-02 kernel 2026-04-05 22:16 ` Dave Chinner @ 2026-04-06 0:27 ` Ritesh Harjani 2026-04-06 21:45 ` Dave Chinner 0 siblings, 1 reply; 9+ messages in thread From: Ritesh Harjani @ 2026-04-06 0:27 UTC (permalink / raw) To: Dave Chinner; +Cc: Matthew Wilcox, linux-xfs Thanks Dave for your inputs. I have few more data points on the same. It will be nice to know your thoughts on this. Dave Chinner <dgc@kernel.org> writes: > On Sun, Apr 05, 2026 at 06:33:59AM +0530, Ritesh Harjani wrote: >> Dave Chinner <dgc@kernel.org> writes: >> >> > On Fri, Apr 03, 2026 at 04:35:46PM +0100, Matthew Wilcox wrote: >> >> This is with commit 5619b098e2fb so after 7.0-rc6 >> >> INFO: task fsstress:3762792 blocked on a semaphore likely last held by task fsstress:3762793 >> >> task:fsstress state:D stack:0 pid:3762793 tgid:3762793 ppid:3762783 task_flags:0x440140 flags:0x00080800 >> >> Call Trace: >> >> <TASK> >> >> __schedule+0x560/0xfc0 >> >> schedule+0x3e/0x140 >> >> schedule_timeout+0x84/0x110 >> >> ? __pfx_process_timeout+0x10/0x10 >> >> io_schedule_timeout+0x5b/0x80 >> >> xfs_buf_alloc+0x793/0x7d0 >> > >> > -ENOMEM. >> > >> > It'll be looping here: >> > >> > fallback: >> > for (;;) { >> > bp->b_addr = __vmalloc(size, gfp_mask); >> > if (bp->b_addr) >> > break; >> > if (flags & XBF_READ_AHEAD) >> > return -ENOMEM; >> > XFS_STATS_INC(bp->b_mount, xb_page_retries); >> > memalloc_retry_wait(gfp_mask); >> > } >> > >> > If it is looping here long enough to trigger the hang check timer, >> > then the MM subsystem is not making progress reclaiming memory. This >> >> Hi Dave, >> >> If that's the case and if we expect the MM subsystem to do memory >> reclaim, shouldn't we be passing the __GFP_DIRECT_RECLAIM flag to our >> fallback loop? I see that we might have cleared this flag and also set >> __GFP_NORETRY, in the above if condition if allocation size is >PAGE_SIZE. >> >> So shouldn't we do? 
>>
>> if (size > PAGE_SIZE) {
>> 	if (!is_power_of_2(size))
>> 		goto fallback;
>> -	gfp_mask &= ~__GFP_DIRECT_RECLAIM;
>> -	gfp_mask |= __GFP_NORETRY;
>> +	gfp_t alloc_gfp = (gfp_mask & ~__GFP_DIRECT_RECLAIM) | __GFP_NORETRY;
>> +	folio = folio_alloc(alloc_gfp, get_order(size));
>> +} else {
>> +	folio = folio_alloc(gfp_mask, get_order(size));
>> }
>> -folio = folio_alloc(gfp_mask, get_order(size));
>> if (!folio) {
>> 	if (size <= PAGE_SIZE)
>> 		return -ENOMEM;
>> 	trace_xfs_buf_backing_fallback(bp, _RET_IP_);
>> 	goto fallback;
>> }

> Possibly.
>
> That said, we really don't want stuff like compaction to
> run here -ever- because of how expensive it is for hot paths when
> memory is low, and the only knob we have to control that is
> __GFP_DIRECT_RECLAIM.

Looking at __alloc_pages_direct_compact(), it returns immediately for
order=0 allocations.

> However, turning off direct reclaim should make no difference in
> the long run because vmalloc is only trying to allocate a batch of
> single page folios.
>
> If we are in low memory situations where no single page folios are
> available, then even for a NORETRY/no direct reclaim allocation
> the expectation is that the failed allocation attempt would be
> kicking kswapd to perform background memory reclaim.
>
> This is especially true when the allocation is GFP_NOFS/GFP_NOIO
> even with direct reclaim turned on - if all the memory is held in
> shrinkable fs/vfs caches then direct reclaim cannot reclaim anything
> filesystem/IO related.

So, looking at the logs from Matthew, I think this case might have
benefitted from __GFP_DIRECT_RECLAIM, because we have many clean
inactive file pages. So theoretically, IMO, direct reclaim should be
able to use one of those clean file pages (after it gets
direct-reclaimed):

nr_zone_inactive_file 62769
nr_zone_write_pending 0

> i.e.
background reclaim making forwards progress is absolutely
> necessary for any sort of "nofail" allocation loop to succeed
> regardless of whether direct reclaim is enabled or not.
>
> Hence if background memory reclaim is making progress, this
> allocation loop should eventually succeed. If the allocation is not
> succeeding, then it implies that some critical resource in the
> allocation path is not being refilled either on allocation failure
> or by background reclaim, and hence the allocation failure persists
> because nothing alleviates the resource shortage that is triggering
> the ENOMEM issue.

I agree, background memory reclaim / the kswapd thread should have made
forward progress.

I am not sure why we are hitting hung task issues in this case, then.
It could be because of the multiple fsstress threads running in
parallel (from the ps -eax output), and maybe some other process ends
up using the pages reclaimed by background kswapd (just a theory).

> So the question is: where in the __vmalloc allocation path is the
> ENOMEM error being generated from, and is it the same place every
> time?

Although I can't say for sure, in this case, after looking at the code
and knowing that we are not passing __GFP_DIRECT_RECLAIM, it might be
returning from here (after get_page_from_freelist() couldn't get a free
page):

__alloc_pages_slowpath() {
	...
	/* Caller is not willing to reclaim, we can't balance anything */
	if (!can_direct_reclaim)
		goto nopage;

So, with the above data, I think that in this case passing
__GFP_DIRECT_RECLAIM in the vmalloc fallback path might help. And
either way, until we have a page allocated, we do an infinite retry
anyway, so we may as well pass the __GFP_DIRECT_RECLAIM flag to it,
right?

fallback:
	for (;;) {
		bp->b_addr = __vmalloc(size, gfp_mask);
		if (bp->b_addr)
			break;
		if (flags & XBF_READ_AHEAD)
			return -ENOMEM;
		XFS_STATS_INC(bp->b_mount, xb_page_retries);
		memalloc_retry_wait(gfp_mask);
	}

Thoughts?
I am not sure how easily this issue is reproducible at Matthew's end.
But let me also keep a kvm guest with the same kernel version to see if
I can replicate this at my end in an overnight run of xfs/285 in a loop.

-ritesh

^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: Hang with xfs/285 on 2026-03-02 kernel
2026-04-06 0:27 ` Ritesh Harjani
@ 2026-04-06 21:45 ` Dave Chinner
0 siblings, 0 replies; 9+ messages in thread
From: Dave Chinner @ 2026-04-06 21:45 UTC (permalink / raw)
To: Ritesh Harjani; +Cc: Matthew Wilcox, linux-xfs

On Mon, Apr 06, 2026 at 05:57:06AM +0530, Ritesh Harjani wrote:
> > However, turning off direct reclaim should make no difference in
> > the long run because vmalloc is only trying to allocate a batch of
> > single page folios.
> >
> > If we are in low memory situations where no single page folios are
> > available, then even for a NORETRY/no direct reclaim allocation
> > the expectation is that the failed allocation attempt would be
> > kicking kswapd to perform background memory reclaim.
> >
> > This is especially true when the allocation is GFP_NOFS/GFP_NOIO
> > even with direct reclaim turned on - if all the memory is held in
> > shrinkable fs/vfs caches then direct reclaim cannot reclaim anything
> > filesystem/IO related.
>
> So, looking at the logs from Matthew, I think this case might have
> benefitted from __GFP_DIRECT_RECLAIM, because we have many clean
> inactive file pages. So theoretically, IMO, direct reclaim should be
> able to use one of those clean file pages (after it gets
> direct-reclaimed):
>
> nr_zone_inactive_file 62769
> nr_zone_write_pending 0

You miss the point - this is not an isolated use case. e.g. look at
xlog_kvmalloc() - it's also a ~__GFP_DIRECT_RECLAIM, NORETRY vmalloc()
loop. What's to stop that one from getting stuck in exactly the same
way?

To that point, kvmalloc(GFP_NOFAIL) now implements the semantics that
xlog_kvmalloc() requires - it turns off direct reclaim (and hence
costly compaction) for the kmalloc() allocation attempt, then falls
back to vmalloc(GFP_NOFAIL) if kmalloc() fails.

That's also pretty much the exact semantics we are trying to implement
in xfs_buf_alloc(), yes? i.e.
xfs_buf_alloc() does:

For buffers < PAGE_SIZE, it calls kmalloc() directly and returns.
For buffers == PAGE_SIZE, it calls folio_alloc(GFP_KERNEL).
For buffers > PAGE_SIZE, it calls folio_alloc(NORETRY, ~__GFP_DIRECT_RECLAIM).

If either folio_alloc() call fails, it effectively runs an open coded
__vmalloc() no-fail loop.

IOWs, we are implementing essentially the same semantics as
kvmalloc(__GFP_NOFAIL), modulo the reclaim flags for the __vmalloc()
loop. If we are going to change the flags for the vmalloc() loop back
to the original gfp_mask, then we are essentially reimplementing
kvmalloc(GFP_NOFAIL) semantics exactly. At which point....

> > i.e. background reclaim making forwards progress is absolutely
> > necessary for any sort of "nofail" allocation loop to succeed
> > regardless of whether direct reclaim is enabled or not.
> >
> > Hence if background memory reclaim is making progress, this
> > allocation loop should eventually succeed. If the allocation is not
> > succeeding, then it implies that some critical resource in the
> > allocation path is not being refilled either on allocation failure
> > or by background reclaim, and hence the allocation failure persists
> > because nothing alleviates the resource shortage that is triggering
> > the ENOMEM issue.
>
> I agree, background memory reclaim / the kswapd thread should have
> made forward progress.
>
> I am not sure why we are hitting hung task issues in this case, then.
> It could be because of the multiple fsstress threads running in
> parallel (from the ps -eax output), and maybe some other process ends
> up using the pages reclaimed by background kswapd (just a theory).

I don't think that's the case, because kswapd is supposed to run until
watermarks are reached, and that means all free page pools are supposed
to have at least some free pages in them...

That's why I think there's a reclaim bug lurking here - allocation
appears to be stalling on something that background reclaim is not
refilling.
And if allocation is stalling on buffer allocation, then it can stall
in other critical parts of XFS, too.

Background reclaim not doing sufficient work to make looping
non-blocking, no-retry allocations succeed seems like a memory
allocation/reclaim bug to me, not an XFS issue...

> > So the question is: where in the __vmalloc allocation path is the
> > ENOMEM error being generated from, and is it the same place every
> > time?
>
> Although I can't say for sure, in this case, after looking at the
> code and knowing that we are not passing __GFP_DIRECT_RECLAIM, it
> might be returning from here (after get_page_from_freelist() couldn't
> get a free page):
>
> __alloc_pages_slowpath() {
> 	...
> 	/* Caller is not willing to reclaim, we can't balance anything */
> 	if (!can_direct_reclaim)
> 		goto nopage;

Sure, we can't balance anything, but we've set ALLOC_KSWAPD early in
this function and so every time we get to the above point in the
allocation code we've already run this:

retry:
	/* Ensure kswapd doesn't accidentally go to sleep as long as we loop */
	if (alloc_flags & ALLOC_KSWAPD)
		wake_all_kswapds(order, gfp_mask, ac);

Hence kswapds should be active and doing reclaim work to bring
everything back to minimum free pool watermarks. That *should* be
sufficient for a no-direct-reclaim allocation loop to make progress.

-Dave.
--
Dave Chinner
dgc@kernel.org

^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: Hang with xfs/285 on 2026-03-02 kernel
2026-04-03 15:35 Hang with xfs/285 on 2026-03-02 kernel Matthew Wilcox
2026-04-04 11:42 ` Dave Chinner
@ 2026-04-07 5:41 ` Christoph Hellwig
1 sibling, 0 replies; 9+ messages in thread
From: Christoph Hellwig @ 2026-04-07 5:41 UTC (permalink / raw)
To: Matthew Wilcox; +Cc: linux-xfs

I tried to reproduce it, but failed. Can you share more context?
mkfs/mount options, kernel .config, system information?

^ permalink raw reply [flat|nested] 9+ messages in thread
end of thread, other threads:[~2026-04-07 5:41 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz / follow: Atom feed
-- links below jump to the message on this page --)
2026-04-03 15:35 Hang with xfs/285 on 2026-03-02 kernel Matthew Wilcox
2026-04-04 11:42 ` Dave Chinner
2026-04-04 20:40 ` Matthew Wilcox
2026-04-05 22:29 ` Dave Chinner
2026-04-05 1:03 ` Ritesh Harjani
2026-04-05 22:16 ` Dave Chinner
2026-04-06 0:27 ` Ritesh Harjani
2026-04-06 21:45 ` Dave Chinner
2026-04-07 5:41 ` Christoph Hellwig