public inbox for linux-xfs@vger.kernel.org
 help / color / mirror / Atom feed
* Hang with xfs/285 on 2026-03-02 kernel
@ 2026-04-03 15:35 Matthew Wilcox
  2026-04-04 11:42 ` Dave Chinner
  2026-04-07  5:41 ` Christoph Hellwig
  0 siblings, 2 replies; 9+ messages in thread
From: Matthew Wilcox @ 2026-04-03 15:35 UTC (permalink / raw)
  To: linux-xfs

This is with commit 5619b098e2fb so after 7.0-rc6

xfs/285       run fstests xfs/285 at 2026-04-03 06:11:42
XFS (vdc): Mounting V5 Filesystem e091474f-2cd9-4425-a30c-1114d62d130b
XFS (vdc): Ending clean mount
INFO: task fsstress:3762792 blocked for more than 120 seconds.
      Not tainted 7.0.0-rc6-ktest-00166-g5619b098e2fb #104
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:fsstress        state:D stack:0     pid:3762792 tgid:3762792 ppid:3762783 task_flags:0x440140 flags:0x00080000
Call Trace:
 <TASK>
 __schedule+0x560/0xfc0
 schedule+0x3e/0x140
 schedule_timeout+0xb3/0x110
 __down_common+0x15c/0x2c0
 __down+0x1d/0x30
 down+0x68/0x80
 xfs_buf_lock+0x4b/0x170
 xfs_buf_find_lock+0x69/0x140
 xfs_buf_get_map+0x265/0xbd0
 xfs_buf_read_map+0x59/0x2e0
 xfs_trans_read_buf_map+0x1bb/0x560
 ? xfs_read_agi+0xab/0x1a0
 xfs_read_agi+0xab/0x1a0
 xfs_ialloc_read_agi+0x61/0x200
 xfs_iwalk_ag_start.constprop.0+0x4e/0x1e0
 xfs_iwalk_ag+0x78/0x2d0
 xfs_iwalk_args.constprop.0+0x67/0x120
 xfs_iwalk+0x93/0xa0
 ? __pfx_xfs_bulkstat_iwalk+0x10/0x10
 xfs_bulkstat+0xce/0x150
 ? __pfx_xfs_fsbulkstat_one_fmt+0x10/0x10
 xfs_ioc_fsbulkstat.isra.0+0x122/0x1f0
 xfs_file_ioctl+0xd52/0x1230
 ? find_held_lock+0x31/0x90
 ? kmem_cache_free+0x26c/0x460
 ? lock_release+0xba/0x260
 ? putname+0x45/0x80
 ? kmem_cache_free+0x271/0x460
 __x64_sys_ioctl+0x4d0/0x9d0
 x64_sys_call+0xf1f/0x1dd0
 do_syscall_64+0x74/0x3f0
 entry_SYSCALL_64_after_hwframe+0x76/0x7e
RIP: 0033:0x7f37be22237b
RSP: 002b:00007ffe8acd1a30 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
RAX: ffffffffffffffda RBX: 0000000000002300 RCX: 00007f37be22237b
RDX: 00007ffe8acd1aa0 RSI: ffffffffc0205865 RDI: 0000000000000003
RBP: 0000000000000003 R08: 00007f37be2fdac0 R09: 0000000000000001
R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000080
R13: 00007ffe8acd1aa0 R14: 000055ac9b289fe0 R15: 000000000001382d
 </TASK>
INFO: task fsstress:3762792 blocked on a semaphore likely last held by task fsstress:3762793
task:fsstress        state:D stack:0     pid:3762793 tgid:3762793 ppid:3762783 task_flags:0x440140 flags:0x00080800
Call Trace:
 <TASK>
 __schedule+0x560/0xfc0
 schedule+0x3e/0x140
 schedule_timeout+0x84/0x110
 ? __pfx_process_timeout+0x10/0x10
 io_schedule_timeout+0x5b/0x80
 xfs_buf_alloc+0x793/0x7d0
 xfs_buf_get_map+0x651/0xbd0
 ? _raw_spin_unlock+0x26/0x50
 xfs_trans_get_buf_map+0x141/0x300
 xfs_ialloc_inode_init+0x130/0x2c0
 xfs_ialloc_ag_alloc+0x226/0x710
 xfs_dialloc+0x22d/0x980
 ? xfs_ilock+0x168/0x2b0
 xfs_create+0x29e/0x4a0
 ? __get_acl+0x2d/0x1c0
 xfs_generic_create+0x2a4/0x330
 xfs_vn_mkdir+0x1e/0x30
 vfs_mkdir+0xaf/0x1f0
 filename_mkdirat+0x81/0x190
 __x64_sys_mkdir+0x32/0x50
 x64_sys_call+0x8e4/0x1dd0
 do_syscall_64+0x74/0x3f0
 entry_SYSCALL_64_after_hwframe+0x76/0x7e
RIP: 0033:0x7f37be218b47
RSP: 002b:00007ffe8acd1958 EFLAGS: 00000206 ORIG_RAX: 0000000000000053
RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f37be218b47
RDX: 0000000000000000 RSI: 00000000000001ff RDI: 000055ac9ac6de40
RBP: 00007ffe8acd1ac0 R08: 000000055ac9aeaa R09: 00007f37be2fdac0
R10: 0000000000000007 R11: 0000000000000206 R12: 00000000000001ff
R13: 00007ffe8acd1ac0 R14: 0000000000002a8d R15: 000055ac98e46790
 </TASK>
INFO: task fsstress:3762794 blocked for more than 120 seconds.
      Not tainted 7.0.0-rc6-ktest-00166-g5619b098e2fb #104
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:fsstress        state:D stack:0     pid:3762794 tgid:3762794 ppid:3762783 task_flags:0x440140 flags:0x00080000
Call Trace:
 <TASK>
 __schedule+0x560/0xfc0
 schedule+0x3e/0x140
 schedule_timeout+0xb3/0x110
 __down_common+0x15c/0x2c0
 __down+0x1d/0x30
 down+0x68/0x80
 xfs_buf_lock+0x4b/0x170
 xfs_buf_find_lock+0x69/0x140
 xfs_buf_get_map+0x265/0xbd0
 ? find_held_lock+0x31/0x90
 xfs_buf_read_map+0x59/0x2e0
 xfs_trans_read_buf_map+0x1bb/0x560
 ? xfs_read_agi+0xab/0x1a0
 xfs_read_agi+0xab/0x1a0
 xfs_ialloc_read_agi+0x61/0x200
 xfs_dialloc+0x1f1/0x980
 ? xfs_ilock+0x168/0x2b0
 xfs_create+0x29e/0x4a0
 ? __get_acl+0x2d/0x1c0
 xfs_generic_create+0x2a4/0x330
 xfs_vn_mknod+0x18/0x20
 vfs_mknod+0xcd/0x200
 filename_mknodat+0x1fd/0x2a0
 __x64_sys_mknodat+0x3f/0x60
 x64_sys_call+0x1c77/0x1dd0
 do_syscall_64+0x74/0x3f0
 entry_SYSCALL_64_after_hwframe+0x76/0x7e
RIP: 0033:0x7f37be218bf3
RSP: 002b:00007ffe8acd1958 EFLAGS: 00000256 ORIG_RAX: 0000000000000103
RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f37be218bf3
RDX: 0000000000002124 RSI: 000055ac9ac3ff40 RDI: 00000000ffffff9c
RBP: 00007ffe8acd1ac0 R08: 000000055ac9afba R09: 00007f37be2fdac0
R10: 0000000000000000 R11: 0000000000000256 R12: 0000000000002124
R13: 0000000000000000 R14: 00000000000017a5 R15: 000055ac98e468f0
 </TASK>
INFO: task fsstress:3762794 blocked on a semaphore likely last held by task fsstress:3762793
task:fsstress        state:D stack:0     pid:3762793 tgid:3762793 ppid:3762783 task_flags:0x440140 flags:0x00080800
Call Trace:
 <TASK>
 __schedule+0x560/0xfc0
 schedule+0x3e/0x140
 schedule_timeout+0x84/0x110
 ? __pfx_process_timeout+0x10/0x10
 io_schedule_timeout+0x5b/0x80
 xfs_buf_alloc+0x793/0x7d0
 xfs_buf_get_map+0x651/0xbd0
 ? _raw_spin_unlock+0x26/0x50
 xfs_trans_get_buf_map+0x141/0x300
 xfs_ialloc_inode_init+0x130/0x2c0
 xfs_ialloc_ag_alloc+0x226/0x710
 xfs_dialloc+0x22d/0x980
 ? xfs_ilock+0x168/0x2b0
 xfs_create+0x29e/0x4a0
 ? __get_acl+0x2d/0x1c0
 xfs_generic_create+0x2a4/0x330
 xfs_vn_mkdir+0x1e/0x30
 vfs_mkdir+0xaf/0x1f0
 filename_mkdirat+0x81/0x190
 __x64_sys_mkdir+0x32/0x50
 x64_sys_call+0x8e4/0x1dd0
 do_syscall_64+0x74/0x3f0
 entry_SYSCALL_64_after_hwframe+0x76/0x7e
RIP: 0033:0x7f37be218b47
RSP: 002b:00007ffe8acd1958 EFLAGS: 00000206 ORIG_RAX: 0000000000000053
RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f37be218b47
RDX: 0000000000000000 RSI: 00000000000001ff RDI: 000055ac9ac6de40
RBP: 00007ffe8acd1ac0 R08: 000000055ac9aeaa R09: 00007f37be2fdac0
R10: 0000000000000007 R11: 0000000000000206 R12: 00000000000001ff
R13: 00007ffe8acd1ac0 R14: 0000000000002a8d R15: 000055ac98e46790
 </TASK>
INFO: task fsstress:3762795 blocked for more than 120 seconds.
      Not tainted 7.0.0-rc6-ktest-00166-g5619b098e2fb #104
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:fsstress        state:D stack:0     pid:3762795 tgid:3762795 ppid:3762783 task_flags:0x440140 flags:0x00080000
Call Trace:
 <TASK>
 __schedule+0x560/0xfc0
 schedule+0x3e/0x140
 schedule_timeout+0xb3/0x110
 __down_common+0x15c/0x2c0
 __down+0x1d/0x30
 down+0x68/0x80
 xfs_buf_lock+0x4b/0x170
 xfs_buf_find_lock+0x69/0x140
 xfs_buf_get_map+0x265/0xbd0
 ? xfs_trans_add_item+0xf2/0x1b0
 xfs_buf_read_map+0x59/0x2e0
 xfs_trans_read_buf_map+0x1bb/0x560
 ? xfs_read_agi+0xab/0x1a0
 xfs_read_agi+0xab/0x1a0
 xfs_ialloc_read_agi+0x61/0x200
 xfs_iwalk_ag_start.constprop.0+0x4e/0x1e0
 xfs_iwalk_ag+0x78/0x2d0
 xfs_iwalk_args.constprop.0+0x67/0x120
 xfs_iwalk+0x93/0xa0
 ? __pfx_xfs_bulkstat_iwalk+0x10/0x10
 xfs_bulkstat+0xce/0x150
 ? __pfx_xfs_fsbulkstat_one_fmt+0x10/0x10
 xfs_ioc_fsbulkstat.isra.0+0x122/0x1f0
 xfs_file_ioctl+0xd52/0x1230
 ? find_held_lock+0x31/0x90
 ? kmem_cache_free+0x26c/0x460
 ? lock_release+0xba/0x260
 ? putname+0x45/0x80
 ? kmem_cache_free+0x271/0x460
 __x64_sys_ioctl+0x4d0/0x9d0
 x64_sys_call+0xf1f/0x1dd0
 do_syscall_64+0x74/0x3f0
 entry_SYSCALL_64_after_hwframe+0x76/0x7e
RIP: 0033:0x7f37be22237b
RSP: 002b:00007ffe8acd1a30 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
RAX: ffffffffffffffda RBX: 0000000000001f38 RCX: 00007f37be22237b
RDX: 00007ffe8acd1aa0 RSI: ffffffffc0205865 RDI: 0000000000000003
RBP: 0000000000000003 R08: 00007f37be2fdac0 R09: 0000000000000001
R10: 0000000000000000 R11: 0000000000000246 R12: 00000000000003e7
R13: 00007ffe8acd1aa0 R14: 000055ac9b20b7d0 R15: 00000000000130f4
 </TASK>
INFO: task fsstress:3762795 blocked on a semaphore likely last held by task fsstress:3762793
task:fsstress        state:D stack:0     pid:3762793 tgid:3762793 ppid:3762783 task_flags:0x440140 flags:0x00080800
Call Trace:
 <TASK>
 __schedule+0x560/0xfc0
 schedule+0x3e/0x140
 schedule_timeout+0x84/0x110
 ? __pfx_process_timeout+0x10/0x10
 io_schedule_timeout+0x5b/0x80
 xfs_buf_alloc+0x793/0x7d0
 xfs_buf_get_map+0x651/0xbd0
 ? _raw_spin_unlock+0x26/0x50
 xfs_trans_get_buf_map+0x141/0x300
 xfs_ialloc_inode_init+0x130/0x2c0
 xfs_ialloc_ag_alloc+0x226/0x710
 xfs_dialloc+0x22d/0x980
 ? xfs_ilock+0x168/0x2b0
 xfs_create+0x29e/0x4a0
 ? __get_acl+0x2d/0x1c0
 xfs_generic_create+0x2a4/0x330
 xfs_vn_mkdir+0x1e/0x30
 vfs_mkdir+0xaf/0x1f0
 filename_mkdirat+0x81/0x190
 __x64_sys_mkdir+0x32/0x50
 x64_sys_call+0x8e4/0x1dd0
 do_syscall_64+0x74/0x3f0
 entry_SYSCALL_64_after_hwframe+0x76/0x7e
RIP: 0033:0x7f37be218b47
RSP: 002b:00007ffe8acd1958 EFLAGS: 00000206 ORIG_RAX: 0000000000000053
RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f37be218b47
RDX: 0000000000000000 RSI: 00000000000001ff RDI: 000055ac9ac6de40
RBP: 00007ffe8acd1ac0 R08: 000000055ac9aeaa R09: 00007f37be2fdac0
R10: 0000000000000007 R11: 0000000000000206 R12: 00000000000001ff
R13: 00007ffe8acd1ac0 R14: 0000000000002a8d R15: 000055ac98e46790
 </TASK>
INFO: task kworker/8:19:3762862 blocked for more than 120 seconds.
      Not tainted 7.0.0-rc6-ktest-00166-g5619b098e2fb #104
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:kworker/8:19    state:D stack:0     pid:3762862 tgid:3762862 ppid:2      task_flags:0x4248060 flags:0x00080000
Workqueue: xfs-conv/vdc xfs_end_io
Call Trace:
 <TASK>
 __schedule+0x560/0xfc0
 schedule+0x3e/0x140
 schedule_timeout+0xb3/0x110
 __down_common+0x15c/0x2c0
 __down+0x1d/0x30
 down+0x68/0x80
 xfs_buf_lock+0x4b/0x170
 xfs_buf_find_lock+0x69/0x140
 xfs_buf_get_map+0x265/0xbd0
 ? xfs_btree_overlapped_query_range+0x39f/0x620
 xfs_buf_read_map+0x59/0x2e0
 xfs_trans_read_buf_map+0x1bb/0x560
 ? xfs_read_agf+0xa3/0x170
 xfs_read_agf+0xa3/0x170
 xfs_alloc_read_agf+0x73/0x370
 xfs_alloc_fix_freelist+0x2dc/0x670
 ? find_held_lock+0x31/0x90
 xfs_free_extent_fix_freelist+0x5e/0x80
 xfs_rmap_finish_one+0xc4/0x300
 ? kmem_cache_alloc_noprof+0x36a/0x450
 ? xfs_rmap_update_create_done+0x29/0xb0
 xfs_rmap_update_finish_item+0x1e/0x40
 xfs_defer_finish_one+0xc0/0x2d0
 ? xfs_defer_relog+0x56/0x280
 xfs_defer_finish_noroll+0x1ad/0x540
 xfs_trans_commit+0x4e/0x70
 xfs_iomap_write_unwritten+0xdd/0x340
 xfs_end_ioend_write+0x219/0x2c0
 xfs_end_io+0xdc/0xf0
 process_one_work+0x1fb/0x570
 ? lock_is_held_type+0x93/0x100
 worker_thread+0x1e6/0x3f0
 ? __pfx_worker_thread+0x10/0x10
 kthread+0x10d/0x140
 ? __pfx_kthread+0x10/0x10
 ret_from_fork+0x1b4/0x250
 ? __pfx_kthread+0x10/0x10
 ret_from_fork_asm+0x1a/0x30
 </TASK>
INFO: task kworker/8:19:3762862 blocked on a semaphore likely last held by task fsstress:3762793
task:fsstress        state:D stack:0     pid:3762793 tgid:3762793 ppid:3762783 task_flags:0x440140 flags:0x00080800
Call Trace:
 <TASK>
 __schedule+0x560/0xfc0
 schedule+0x3e/0x140
 schedule_timeout+0x84/0x110
 ? __pfx_process_timeout+0x10/0x10
 io_schedule_timeout+0x5b/0x80
 xfs_buf_alloc+0x793/0x7d0
 xfs_buf_get_map+0x651/0xbd0
 ? _raw_spin_unlock+0x26/0x50
 xfs_trans_get_buf_map+0x141/0x300
 xfs_ialloc_inode_init+0x130/0x2c0
 xfs_ialloc_ag_alloc+0x226/0x710
 xfs_dialloc+0x22d/0x980
 ? xfs_ilock+0x168/0x2b0
 xfs_create+0x29e/0x4a0
 ? __get_acl+0x2d/0x1c0
 xfs_generic_create+0x2a4/0x330
 xfs_vn_mkdir+0x1e/0x30
 vfs_mkdir+0xaf/0x1f0
 filename_mkdirat+0x81/0x190
 __x64_sys_mkdir+0x32/0x50
 x64_sys_call+0x8e4/0x1dd0
 do_syscall_64+0x74/0x3f0
 entry_SYSCALL_64_after_hwframe+0x76/0x7e
RIP: 0033:0x7f37be218b47
RSP: 002b:00007ffe8acd1958 EFLAGS: 00000206 ORIG_RAX: 0000000000000053
RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f37be218b47
RDX: 0000000000000000 RSI: 00000000000001ff RDI: 000055ac9ac6de40
RBP: 00007ffe8acd1ac0 R08: 000000055ac9aeaa R09: 00007f37be2fdac0
R10: 0000000000000007 R11: 0000000000000206 R12: 00000000000001ff
R13: 00007ffe8acd1ac0 R14: 0000000000002a8d R15: 000055ac98e46790
 </TASK>

Showing all locks held in the system:
1 lock held by khungtaskd/100:
 #0: ffffffff826d67c0 (rcu_read_lock){....}-{1:2}, at: debug_show_all_locks+0x51/0x115
5 locks held by kworker/u64:0/3558666:
 #0: ffff88810331dd48 ((wq_completion)xfs-blockgc/vdc){....}-{0:0}, at: process_one_work+0x45c/0x570
 #1: ffff88810f08fe48 ((work_completion)(&(&pag->pag_blockgc_work)->work)){....}-{0:0}, at: process_one_work+0x1bb/0x570
 #2: ffff888155e3c928 (&sb->s_type->i_mutex_key#17){....}-{3:3}, at: xfs_ilock_nowait+0x1ee/0x330
 #3: ffff8881478f55e0 (sb_internal#2){....}-{0:0}, at: xfs_free_eofblocks+0xda/0x1c0
 #4: ffff888155e3c718 (&xfs_nondir_ilock_class){....}-{3:3}, at: xfs_ilock+0x168/0x2b0
4 locks held by fsstress/3762793:
 #0: ffff8881478f53f0 (sb_writers#10){....}-{0:0}, at: filename_create+0x6e/0x180
 #1: ffff88816e264228 (&inode->i_sb->s_type->i_mutex_dir_key/1){....}-{3:3}, at: filename_create+0xad/0x180
 #2: ffff8881478f55e0 (sb_internal#2){....}-{0:0}, at: xfs_trans_alloc_icreate+0x58/0x100
 #3: ffff88816e264018 (&xfs_dir_ilock_class/5){....}-{3:3}, at: xfs_ilock+0x168/0x2b0
4 locks held by fsstress/3762794:
 #0: ffff8881478f53f0 (sb_writers#10){....}-{0:0}, at: filename_create+0x6e/0x180
 #1: ffff888038517328 (&inode->i_sb->s_type->i_mutex_dir_key/1){....}-{3:3}, at: filename_create+0xad/0x180
 #2: ffff8881478f55e0 (sb_internal#2){....}-{0:0}, at: xfs_trans_alloc_icreate+0x58/0x100
 #3: ffff888038517118 (&xfs_dir_ilock_class/5){....}-{3:3}, at: xfs_ilock+0x168/0x2b0
4 locks held by kworker/8:19/3762862:
 #0: ffff88815e73ed48 ((wq_completion)xfs-conv/vdc){....}-{0:0}, at: process_one_work+0x45c/0x570
 #1: ffff888104efbe48 ((work_completion)(&ip->i_ioend_work)){....}-{0:0}, at: process_one_work+0x1bb/0x570
 #2: ffff8881478f55e0 (sb_internal#2){....}-{0:0}, at: xfs_trans_alloc_inode+0x7d/0x190
 #3: ffff888137bb2b18 (&xfs_nondir_ilock_class){....}-{3:3}, at: xfs_ilock+0x168/0x2b0

(there are more messages after this, but I doubt they're useful)

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: Hang with xfs/285 on 2026-03-02 kernel
  2026-04-03 15:35 Hang with xfs/285 on 2026-03-02 kernel Matthew Wilcox
@ 2026-04-04 11:42 ` Dave Chinner
  2026-04-04 20:40   ` Matthew Wilcox
  2026-04-05  1:03   ` Ritesh Harjani
  2026-04-07  5:41 ` Christoph Hellwig
  1 sibling, 2 replies; 9+ messages in thread
From: Dave Chinner @ 2026-04-04 11:42 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: linux-xfs

On Fri, Apr 03, 2026 at 04:35:46PM +0100, Matthew Wilcox wrote:
> This is with commit 5619b098e2fb so after 7.0-rc6
> INFO: task fsstress:3762792 blocked on a semaphore likely last held by task fsstress:3762793
> task:fsstress        state:D stack:0     pid:3762793 tgid:3762793 ppid:3762783 task_flags:0x440140 flags:0x00080800
> Call Trace:
>  <TASK>
>  __schedule+0x560/0xfc0
>  schedule+0x3e/0x140
>  schedule_timeout+0x84/0x110
>  ? __pfx_process_timeout+0x10/0x10
>  io_schedule_timeout+0x5b/0x80
>  xfs_buf_alloc+0x793/0x7d0

-ENOMEM.

It'll be looping here:

fallback:
        for (;;) {
                bp->b_addr = __vmalloc(size, gfp_mask);
                if (bp->b_addr)
                        break;
                if (flags & XBF_READ_AHEAD)
                        return -ENOMEM;
                XFS_STATS_INC(bp->b_mount, xb_page_retries);
                memalloc_retry_wait(gfp_mask);
        }

If it is looping here long enough to trigger the hang check timer,
then the MM subsystem is not making progress reclaiming memory. This
is probably a 16kB allocation (it's an inode cluster buffer), and
the allocation context is NOFAIL because it is within a transaction
(this loop pre-dates __vmalloc() supporting __GFP_NOFAIL)....
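
(As an illustrative aside only - assuming __vmalloc() honours
__GFP_NOFAIL for this gfp_mask, the blocking side of that open coded
loop could in principle collapse to something like:

	/*
	 * Hypothetical, untested sketch: let the allocator provide
	 * the retry-forever semantics instead of open coding them.
	 * Note this loses the xb_page_retries stat bump per failure.
	 */
	if (!(flags & XBF_READ_AHEAD)) {
		bp->b_addr = __vmalloc(size, gfp_mask | __GFP_NOFAIL);
		return 0;
	}
	bp->b_addr = __vmalloc(size, gfp_mask);
	return bp->b_addr ? 0 : -ENOMEM;

but that's a cleanup, not a fix for whatever is failing here.)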

All the other tasks are backed up on the AGI buffer lock held ...

>  xfs_buf_get_map+0x651/0xbd0
>  ? _raw_spin_unlock+0x26/0x50
>  xfs_trans_get_buf_map+0x141/0x300
>  xfs_ialloc_inode_init+0x130/0x2c0
>  xfs_ialloc_ag_alloc+0x226/0x710
>  xfs_dialloc+0x22d/0x980

... here by the task blocked on memory allocation.

This smells like a persistent ENOMEM/memory reclaim issue and XFS is
just the messenger...

-Dave.
-- 
Dave Chinner
dgc@kernel.org

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: Hang with xfs/285 on 2026-03-02 kernel
  2026-04-04 11:42 ` Dave Chinner
@ 2026-04-04 20:40   ` Matthew Wilcox
  2026-04-05 22:29     ` Dave Chinner
  2026-04-05  1:03   ` Ritesh Harjani
  1 sibling, 1 reply; 9+ messages in thread
From: Matthew Wilcox @ 2026-04-04 20:40 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Sat, Apr 04, 2026 at 10:42:59PM +1100, Dave Chinner wrote:
> On Fri, Apr 03, 2026 at 04:35:46PM +0100, Matthew Wilcox wrote:
> > This is with commit 5619b098e2fb so after 7.0-rc6
> > INFO: task fsstress:3762792 blocked on a semaphore likely last held by task fsstress:3762793
> > task:fsstress        state:D stack:0     pid:3762793 tgid:3762793 ppid:3762783 task_flags:0x440140 flags:0x00080800
> > Call Trace:
> >  <TASK>
> >  __schedule+0x560/0xfc0
> >  schedule+0x3e/0x140
> >  schedule_timeout+0x84/0x110
> >  ? __pfx_process_timeout+0x10/0x10
> >  io_schedule_timeout+0x5b/0x80
> >  xfs_buf_alloc+0x793/0x7d0
> 
> -ENOMEM.
> 
> It'll be looping here:
> 
> fallback:
>         for (;;) {
>                 bp->b_addr = __vmalloc(size, gfp_mask);
>                 if (bp->b_addr)
>                         break;
>                 if (flags & XBF_READ_AHEAD)
>                         return -ENOMEM;
>                 XFS_STATS_INC(bp->b_mount, xb_page_retries);
>                 memalloc_retry_wait(gfp_mask);
>         }
> 
> If it is looping here long enough to trigger the hang check timer,
> then the MM subsystem is not making progress reclaiming memory. This
> is probably a 16kB allocation (it's an inode cluster buffer), and
> the allocation context is NOFAIL because it is within a transaction
> (this loop pre-dates __vmalloc() supporting __GFP_NOFAIL)....

There may be something else going on.  I reproduced it again and ssh'd
into the VM.

# free
               total        used        free      shared  buff/cache   available
Mem:         3988260     1197132      240080         144     3147496     2791128
Swap:        2097148      258128     1839020

There are five instances of fsstress running.  They are accumulating
seconds of CPU time, albeit very slowly:

root@deadly-kvm:~# ps -aux |grep fsstress
root     3745227  0.0  0.0   2664  1476 ?        S    06:48   0:00 ./ltp/fsstress -p 4 -d /mnt/scratch -n 2000000
root     3745236  7.5  1.6 127928 65256 ?        D    06:48  42:54 ./ltp/fsstress -p 4 -d /mnt/scratch -n 2000000
root     3745237  7.6  1.5 124644 61308 ?        D    06:48  42:55 ./ltp/fsstress -p 4 -d /mnt/scratch -n 2000000
root     3745238  7.6  1.6 130844 65584 ?        D    06:48  43:01 ./ltp/fsstress -p 4 -d /mnt/scratch -n 2000000
root     3745239  7.6  1.6 126524 66536 ?        D    06:48  42:58 ./ltp/fsstress -p 4 -d /mnt/scratch -n 2000000
root@deadly-kvm:~# ps -aux |grep fsstress
root     3745227  0.0  0.0   2664  1476 ?        S    06:48   0:00 ./ltp/fsstress -p 4 -d /mnt/scratch -n 2000000
root     3745236  5.5  1.6 133116 66708 ?        R    06:48  45:44 ./ltp/fsstress -p 4 -d /mnt/scratch -n 2000000
root     3745237  5.5  1.5 130136 62516 ?        R    06:48  45:45 ./ltp/fsstress -p 4 -d /mnt/scratch -n 2000000
root     3745238  5.5  1.6 136520 65944 ?        R    06:48  45:52 ./ltp/fsstress -p 4 -d /mnt/scratch -n 2000000
root     3745239  5.5  1.7 131988 67884 ?        R    06:48  45:50 ./ltp/fsstress -p 4 -d /mnt/scratch -n 2000000

# cat /proc/3745239/stack
[<0>] xfs_buf_lock+0x4b/0x170
[<0>] xfs_buf_find_lock+0x69/0x140
[<0>] xfs_buf_get_map+0x265/0xbd0
[<0>] xfs_buf_read_map+0x59/0x2e0
[<0>] xfs_trans_read_buf_map+0x1bb/0x560
[<0>] xfs_read_agi+0xab/0x1a0
(...)

# cat /proc/3745238/stack
[<0>] xfs_buf_alloc+0x793/0x7d0
[<0>] xfs_buf_get_map+0x651/0xbd0
[<0>] xfs_buf_readahead_map+0x3b/0x1b0
[<0>] xfs_iwalk_ichunk_ra+0xe9/0x130
[<0>] xfs_iwalk_ag+0x185/0x2d0
(...)

It doesn't _seem_ like the system is struggling for memory.

# cat /proc/meminfo
MemTotal:        3988260 kB
MemFree:          241956 kB
MemAvailable:    2781960 kB
Buffers:            5184 kB
Cached:          2503020 kB
SwapCached:         4860 kB
Active:          2062948 kB
Inactive:         713828 kB
Active(anon):      85800 kB
Inactive(anon):   182968 kB
Active(file):    1977148 kB
Inactive(file):   530860 kB
Unevictable:           0 kB
Mlocked:               0 kB
SwapTotal:       2097148 kB
SwapFree:        1823052 kB
Dirty:                 0 kB
Writeback:             0 kB
AnonPages:        267836 kB
Mapped:            16280 kB
Shmem:               144 kB
KReclaimable:     628212 kB
Slab:             783840 kB
SReclaimable:     628212 kB
SUnreclaim:       155628 kB
KernelStack:        3536 kB
PageTables:         3680 kB
SecPageTables:         0 kB
NFS_Unstable:          0 kB
Bounce:                0 kB
WritebackTmp:          0 kB
CommitLimit:     4091276 kB
Committed_AS:     560852 kB
VmallocTotal:   34359738367 kB
VmallocUsed:       13004 kB
VmallocChunk:          0 kB
Percpu:             7360 kB
AnonHugePages:     12288 kB
ShmemHugePages:        0 kB
ShmemPmdMapped:        0 kB
FileHugePages:         0 kB
FilePmdMapped:         0 kB
Balloon:               0 kB
DirectMap4k:      153396 kB
DirectMap2M:     4040704 kB
DirectMap1G:     2097152 kB

and an excerpt of zoneinfo:

Node 0, zone   Normal
  pages free     27350
        boost    18939
        min      27357
        low      29461
        high     31565
        promo    33669
        spanned  524288
        present  524288
        managed  496128
        cma      0
        protection: (0, 0, 0, 0)
      nr_free_pages 27350
      nr_free_pages_blocks 0
      nr_zone_inactive_anon 21269
      nr_zone_active_anon 9703
      nr_zone_inactive_file 62769
      nr_zone_active_file 228878
      nr_zone_unevictable 0
      nr_zone_write_pending 0
      nr_mlock     0
      nr_free_cma  0


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: Hang with xfs/285 on 2026-03-02 kernel
  2026-04-04 11:42 ` Dave Chinner
  2026-04-04 20:40   ` Matthew Wilcox
@ 2026-04-05  1:03   ` Ritesh Harjani
  2026-04-05 22:16     ` Dave Chinner
  1 sibling, 1 reply; 9+ messages in thread
From: Ritesh Harjani @ 2026-04-05  1:03 UTC (permalink / raw)
  To: Dave Chinner, Matthew Wilcox; +Cc: linux-xfs

Dave Chinner <dgc@kernel.org> writes:

> On Fri, Apr 03, 2026 at 04:35:46PM +0100, Matthew Wilcox wrote:
>> This is with commit 5619b098e2fb so after 7.0-rc6
>> INFO: task fsstress:3762792 blocked on a semaphore likely last held by task fsstress:3762793
>> task:fsstress        state:D stack:0     pid:3762793 tgid:3762793 ppid:3762783 task_flags:0x440140 flags:0x00080800
>> Call Trace:
>>  <TASK>
>>  __schedule+0x560/0xfc0
>>  schedule+0x3e/0x140
>>  schedule_timeout+0x84/0x110
>>  ? __pfx_process_timeout+0x10/0x10
>>  io_schedule_timeout+0x5b/0x80
>>  xfs_buf_alloc+0x793/0x7d0
>
> -ENOMEM.
>
> It'll be looping here:
>
> fallback:
>         for (;;) {
>                 bp->b_addr = __vmalloc(size, gfp_mask);
>                 if (bp->b_addr)
>                         break;
>                 if (flags & XBF_READ_AHEAD)
>                         return -ENOMEM;
>                 XFS_STATS_INC(bp->b_mount, xb_page_retries);
>                 memalloc_retry_wait(gfp_mask);
>         }
>
> If it is looping here long enough to trigger the hang check timer,
> then the MM subsystem is not making progress reclaiming memory. This

Hi Dave,

If that's the case and if we expect the MM subsystem to do memory
reclaim, shouldn't we be passing the __GFP_DIRECT_RECLAIM flag to our
fallback loop? I see that we might have cleared this flag and also set
__GFP_NORETRY in the above if condition when the allocation size is >PAGE_SIZE.

So shouldn't we do?

        if (size > PAGE_SIZE) {
                if (!is_power_of_2(size))
                        goto fallback;
-               gfp_mask &= ~__GFP_DIRECT_RECLAIM;
-               gfp_mask |= __GFP_NORETRY;
+               gfp_t alloc_gfp = (gfp_mask & ~__GFP_DIRECT_RECLAIM) | __GFP_NORETRY;
+               folio = folio_alloc(alloc_gfp, get_order(size));
+       } else {
+               folio = folio_alloc(gfp_mask, get_order(size));
        }
-       folio = folio_alloc(gfp_mask, get_order(size));
        if (!folio) {
                if (size <= PAGE_SIZE)
                        return -ENOMEM;
                trace_xfs_buf_backing_fallback(bp, _RET_IP_);
                goto fallback;
        }


-ritesh

> is probably a 16kB allocation (it's an inode cluster buffer), and
> the allocation context is NOFAIL because it is within a transaction
> (this loop pre-dates __vmalloc() supporting __GFP_NOFAIL)....
>
> All the other tasks are backed up on the AGI buffer lock held ...
>
>>  xfs_buf_get_map+0x651/0xbd0
>>  ? _raw_spin_unlock+0x26/0x50
>>  xfs_trans_get_buf_map+0x141/0x300
>>  xfs_ialloc_inode_init+0x130/0x2c0
>>  xfs_ialloc_ag_alloc+0x226/0x710
>>  xfs_dialloc+0x22d/0x980
>
> ... here by the task blocked on memory allocation.
>
> This smells like a persistent ENOMEM/memory reclaim issue and XFS is
> just the messenger...
>
> -Dave.
> -- 
> Dave Chinner
> dgc@kernel.org

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: Hang with xfs/285 on 2026-03-02 kernel
  2026-04-05  1:03   ` Ritesh Harjani
@ 2026-04-05 22:16     ` Dave Chinner
  2026-04-06  0:27       ` Ritesh Harjani
  0 siblings, 1 reply; 9+ messages in thread
From: Dave Chinner @ 2026-04-05 22:16 UTC (permalink / raw)
  To: Ritesh Harjani; +Cc: Matthew Wilcox, linux-xfs

On Sun, Apr 05, 2026 at 06:33:59AM +0530, Ritesh Harjani wrote:
> Dave Chinner <dgc@kernel.org> writes:
> 
> > On Fri, Apr 03, 2026 at 04:35:46PM +0100, Matthew Wilcox wrote:
> >> This is with commit 5619b098e2fb so after 7.0-rc6
> >> INFO: task fsstress:3762792 blocked on a semaphore likely last held by task fsstress:3762793
> >> task:fsstress        state:D stack:0     pid:3762793 tgid:3762793 ppid:3762783 task_flags:0x440140 flags:0x00080800
> >> Call Trace:
> >>  <TASK>
> >>  __schedule+0x560/0xfc0
> >>  schedule+0x3e/0x140
> >>  schedule_timeout+0x84/0x110
> >>  ? __pfx_process_timeout+0x10/0x10
> >>  io_schedule_timeout+0x5b/0x80
> >>  xfs_buf_alloc+0x793/0x7d0
> >
> > -ENOMEM.
> >
> > It'll be looping here:
> >
> > fallback:
> >         for (;;) {
> >                 bp->b_addr = __vmalloc(size, gfp_mask);
> >                 if (bp->b_addr)
> >                         break;
> >                 if (flags & XBF_READ_AHEAD)
> >                         return -ENOMEM;
> >                 XFS_STATS_INC(bp->b_mount, xb_page_retries);
> >                 memalloc_retry_wait(gfp_mask);
> >         }
> >
> > If it is looping here long enough to trigger the hang check timer,
> > then the MM subsystem is not making progress reclaiming memory. This
> 
> Hi Dave,
> 
> If that's the case and if we expect the MM subsystem to do memory
> reclaim, shouldn't we be passing the __GFP_DIRECT_RECLAIM flag to our
> fallback loop? I see that we might have cleared this flag and also set
> __GFP_NORETRY in the above if condition when the allocation size is >PAGE_SIZE.
> 
> So shouldn't we do?
> 
>         if (size > PAGE_SIZE) {
>                 if (!is_power_of_2(size))
>                         goto fallback;
> -               gfp_mask &= ~__GFP_DIRECT_RECLAIM;
> -               gfp_mask |= __GFP_NORETRY;
> +               gfp_t alloc_gfp = (gfp_mask & ~__GFP_DIRECT_RECLAIM) | __GFP_NORETRY;
> +               folio = folio_alloc(alloc_gfp, get_order(size));
> +       } else {
> +               folio = folio_alloc(gfp_mask, get_order(size));
>         }
> -       folio = folio_alloc(gfp_mask, get_order(size));
>         if (!folio) {
>                 if (size <= PAGE_SIZE)
>                         return -ENOMEM;
>                 trace_xfs_buf_backing_fallback(bp, _RET_IP_);
>                 goto fallback;
>         }

Possibly.

That said, we really don't want stuff like compaction to
run here -ever- because of how expensive it is for hot paths when
memory is low, and the only knob we have to control that is
__GFP_DIRECT_RECLAIM.

However, turning off direct reclaim should make no difference in
the long run because vmalloc is only trying to allocate a batch of
single page folios.

If we are in low memory situations where no single page folios are
available, then even for a NORETRY/no direct reclaim allocation
the expectation is that the failed allocation attempt would be
kicking kswapd to perform background memory reclaim.

This is especially true when the allocation is GFP_NOFS/GFP_NOIO
even with direct reclaim turned on - if all the memory is held in
shrinkable fs/vfs caches then direct reclaim cannot reclaim anything
filesystem/IO related.

i.e. background reclaim making forwards progress is absolutely
necessary for any sort of "nofail" allocation loop to succeed
regardless of whether direct reclaim is enabled or not.

Hence if background memory reclaim is making progress, this
allocation loop should eventually succeed. If the allocation is not
succeeding, then it implies that some critical resource in the
allocation path is not being refilled either on allocation failure
or by background reclaim, and hence the allocation failure persists
because nothing alleviates the resource shortage that is triggering
the ENOMEM issue.

So the question is: where in the __vmalloc allocation path is the
ENOMEM error being generated from, and is it the same place every
time?

-Dave.
-- 
Dave Chinner
dgc@kernel.org

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: Hang with xfs/285 on 2026-03-02 kernel
  2026-04-04 20:40   ` Matthew Wilcox
@ 2026-04-05 22:29     ` Dave Chinner
  0 siblings, 0 replies; 9+ messages in thread
From: Dave Chinner @ 2026-04-05 22:29 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: linux-xfs

On Sat, Apr 04, 2026 at 09:40:37PM +0100, Matthew Wilcox wrote:
> On Sat, Apr 04, 2026 at 10:42:59PM +1100, Dave Chinner wrote:
> > On Fri, Apr 03, 2026 at 04:35:46PM +0100, Matthew Wilcox wrote:
> > > This is with commit 5619b098e2fb so after 7.0-rc6
> > > INFO: task fsstress:3762792 blocked on a semaphore likely last held by task fsstress:3762793
> > > task:fsstress        state:D stack:0     pid:3762793 tgid:3762793 ppid:3762783 task_flags:0x440140 flags:0x00080800
> > > Call Trace:
> > >  <TASK>
> > >  __schedule+0x560/0xfc0
> > >  schedule+0x3e/0x140
> > >  schedule_timeout+0x84/0x110
> > >  ? __pfx_process_timeout+0x10/0x10
> > >  io_schedule_timeout+0x5b/0x80
> > >  xfs_buf_alloc+0x793/0x7d0
> > 
> > -ENOMEM.
> > 
> > It'll be looping here:
> > 
> > fallback:
> >         for (;;) {
> >                 bp->b_addr = __vmalloc(size, gfp_mask);
> >                 if (bp->b_addr)
> >                         break;
> >                 if (flags & XBF_READ_AHEAD)
> >                         return -ENOMEM;
> >                 XFS_STATS_INC(bp->b_mount, xb_page_retries);
> >                 memalloc_retry_wait(gfp_mask);
> >         }
> > 
> > If it is looping here long enough to trigger the hang check timer,
> > then the MM subsystem is not making progress reclaiming memory. This
> > is probably a 16kB allocation (it's an inode cluster buffer), and
> > the allocation context is NOFAIL because it is within a transaction
> > (this loop pre-dates __vmalloc() supporting __GFP_NOFAIL)....
> 
> There may be something else going on.  I reproduced it again and ssh'd
> into the VM.
> 
> # free
>                total        used        free      shared  buff/cache   available
> Mem:         3988260     1197132      240080         144     3147496     2791128
> Swap:        2097148      258128     1839020
> 
> There are five instances of fsstress running.  They are accumulating
> seconds of CPU time, albeit very slowly:
> 
> root@deadly-kvm:~# ps -aux |grep fsstress
> root     3745227  0.0  0.0   2664  1476 ?        S    06:48   0:00 ./ltp/fsstress -p 4 -d /mnt/scratch -n 2000000
> root     3745236  7.5  1.6 127928 65256 ?        D    06:48  42:54 ./ltp/fsstress -p 4 -d /mnt/scratch -n 2000000
> root     3745237  7.6  1.5 124644 61308 ?        D    06:48  42:55 ./ltp/fsstress -p 4 -d /mnt/scratch -n 2000000
> root     3745238  7.6  1.6 130844 65584 ?        D    06:48  43:01 ./ltp/fsstress -p 4 -d /mnt/scratch -n 2000000
> root     3745239  7.6  1.6 126524 66536 ?        D    06:48  42:58 ./ltp/fsstress -p 4 -d /mnt/scratch -n 2000000
> root@deadly-kvm:~# ps -aux |grep fsstress
> root     3745227  0.0  0.0   2664  1476 ?        S    06:48   0:00 ./ltp/fsstress -p 4 -d /mnt/scratch -n 2000000
> root     3745236  5.5  1.6 133116 66708 ?        R    06:48  45:44 ./ltp/fsstress -p 4 -d /mnt/scratch -n 2000000
> root     3745237  5.5  1.5 130136 62516 ?        R    06:48  45:45 ./ltp/fsstress -p 4 -d /mnt/scratch -n 2000000
> root     3745238  5.5  1.6 136520 65944 ?        R    06:48  45:52 ./ltp/fsstress -p 4 -d /mnt/scratch -n 2000000
> root     3745239  5.5  1.7 131988 67884 ?        R    06:48  45:50 ./ltp/fsstress -p 4 -d /mnt/scratch -n 2000000
> 
> # cat /proc/3745239/stack
> [<0>] xfs_buf_lock+0x4b/0x170
> [<0>] xfs_buf_find_lock+0x69/0x140
> [<0>] xfs_buf_get_map+0x265/0xbd0
> [<0>] xfs_buf_read_map+0x59/0x2e0
> [<0>] xfs_trans_read_buf_map+0x1bb/0x560
> [<0>] xfs_read_agi+0xab/0x1a0
> (...)

It would be helpful to quote the full stack traces...

> # cat /proc/3745238/stack
> [<0>] xfs_buf_alloc+0x793/0x7d0
> [<0>] xfs_buf_get_map+0x651/0xbd0
> [<0>] xfs_buf_readahead_map+0x3b/0x1b0
> [<0>] xfs_iwalk_ichunk_ra+0xe9/0x130
> [<0>] xfs_iwalk_ag+0x185/0x2d0
> (...)

However, how is memory allocation stuck here? That's the readahead
path, which triggers an early exit from the __vmalloc() fallback
loop.  i.e. xfs_buf_alloc() does not loop forever on readahead - it
tries once and then exits.

Yes, this bulkstat path is holding the AGI buffer locked, and the
previous thread is waiting on the AGI buffer lock, but that doesn't
mean the system is deadlocked - it's just lockstepping on the AGI
buffer lock due to the long hold in the bulkstat path....

i.e. these traces do not indicate that there is any sort of memory
allocation problem in the system, just bulkstat slowing down other
operations...

-Dave.
-- 
Dave Chinner
dgc@kernel.org

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: Hang with xfs/285 on 2026-03-02 kernel
  2026-04-05 22:16     ` Dave Chinner
@ 2026-04-06  0:27       ` Ritesh Harjani
  2026-04-06 21:45         ` Dave Chinner
  0 siblings, 1 reply; 9+ messages in thread
From: Ritesh Harjani @ 2026-04-06  0:27 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Matthew Wilcox, linux-xfs


Thanks, Dave, for your inputs. I have a few more data points on this;
it would be nice to know your thoughts.

Dave Chinner <dgc@kernel.org> writes:

> On Sun, Apr 05, 2026 at 06:33:59AM +0530, Ritesh Harjani wrote:
>> Dave Chinner <dgc@kernel.org> writes:
>> 
>> > On Fri, Apr 03, 2026 at 04:35:46PM +0100, Matthew Wilcox wrote:
>> >> This is with commit 5619b098e2fb so after 7.0-rc6
>> >> INFO: task fsstress:3762792 blocked on a semaphore likely last held by task fsstress:3762793
>> >> task:fsstress        state:D stack:0     pid:3762793 tgid:3762793 ppid:3762783 task_flags:0x440140 flags:0x00080800
>> >> Call Trace:
>> >>  <TASK>
>> >>  __schedule+0x560/0xfc0
>> >>  schedule+0x3e/0x140
>> >>  schedule_timeout+0x84/0x110
>> >>  ? __pfx_process_timeout+0x10/0x10
>> >>  io_schedule_timeout+0x5b/0x80
>> >>  xfs_buf_alloc+0x793/0x7d0
>> >
>> > -ENOMEM.
>> >
>> > It'll be looping here:
>> >
>> > fallback:
>> >         for (;;) {
>> >                 bp->b_addr = __vmalloc(size, gfp_mask);
>> >                 if (bp->b_addr)
>> >                         break;
>> >                 if (flags & XBF_READ_AHEAD)
>> >                         return -ENOMEM;
>> >                 XFS_STATS_INC(bp->b_mount, xb_page_retries);
>> >                 memalloc_retry_wait(gfp_mask);
>> >         }
>> >
>> > If it is looping here long enough to trigger the hang check timer,
>> > then the MM subsystem is not making progress reclaiming memory. This
>> 
>> Hi Dave,
>> 
>> If that's the case and if we expect the MM subsystem to do memory
>> reclaim, shouldn't we be passing the __GFP_DIRECT_RECLAIM flag to our
>> fallback loop? I see that we might have cleared this flag and also set
>> __GFP_NORETRY in the above if condition when the allocation size is >PAGE_SIZE.
>> 
>> So shouldn't we do?
>> 
>>         if (size > PAGE_SIZE) {
>>                 if (!is_power_of_2(size))
>>                         goto fallback;
>> -               gfp_mask &= ~__GFP_DIRECT_RECLAIM;
>> -               gfp_mask |= __GFP_NORETRY;
>> +               gfp_t alloc_gfp = (gfp_mask & ~__GFP_DIRECT_RECLAIM) | __GFP_NORETRY;
>> +               folio = folio_alloc(alloc_gfp, get_order(size));
>> +       } else {
>> +               folio = folio_alloc(gfp_mask, get_order(size));
>>         }
>> -       folio = folio_alloc(gfp_mask, get_order(size));
>>         if (!folio) {
>>                 if (size <= PAGE_SIZE)
>>                         return -ENOMEM;
>>                 trace_xfs_buf_backing_fallback(bp, _RET_IP_);
>>                 goto fallback;
>>         }
>
> Possibly.
>
> That said, we really don't want stuff like compaction to
> run here -ever- because of how expensive it is for hot paths when
> memory is low, and the only knob we have to control that is
> __GFP_DIRECT_RECLAIM.
>

Looking at __alloc_pages_direct_compact(), it returns immediately for
order=0 allocations.


> However, turning off direct reclaim should make no difference in
> the long run because vmalloc is only trying to allocate a batch of
> single page folios.
>
> If we are in low memory situations where no single page folios are
> available, then even for a NORETRY/no direct reclaim allocation
> the expectation is that the failed allocation attempt would be
> kicking kswapd to perform background memory reclaim.
>
> This is especially true when the allocation is GFP_NOFS/GFP_NOIO
> even with direct reclaim turned on - if all the memory is held in
> shrinkable fs/vfs caches then direct reclaim cannot reclaim anything
> filesystem/IO related.
>

So, looking at the logs from Matthew, I think this case might have
benefitted from __GFP_DIRECT_RECLAIM, because we have many clean
inactive file pages. So theoretically, IMO direct reclaim should be
able to use one of those clean file pages (after it gets direct-reclaimed):

      nr_zone_inactive_file 62769
      nr_zone_write_pending 0


> i.e. background reclaim making forwards progress is absolutely
> necessary for any sort of "nofail" allocation loop to succeed
> regardless of whether direct reclaim is enabled or not.
>
> Hence if background memory reclaim is making progress, this
> allocation loop should eventually succeed. If the allocation is not
> succeeding, then it implies that some critical resource in the
> allocation path is not being refilled either on allocation failure
> or by background reclaim, and hence the allocation failure persists
> because nothing alleviates the resource shortage that is triggering
> the ENOMEM issue.

I agree, the background memory reclaim / kswapd thread should have made
forward progress.

I am not sure why we are hitting hung task issues in this case, then.
It could be because of multiple fsstress threads running in parallel
(from the ps -eax output), with some other process ending up using the
pages reclaimed by background kswapd (just a theory).

>
> So the question is: where in the __vmalloc allocation path is the
> ENOMEM error being generated from, and is it the same place every
> time?
>

I can't say for sure, but in this case, after looking at the code and
knowing that we are not passing __GFP_DIRECT_RECLAIM, it might be
returning from here (after get_page_from_freelist() couldn't get a
free page):

__alloc_pages_slowpath() {
    ...
	/* Caller is not willing to reclaim, we can't balance anything */
	if (!can_direct_reclaim)
		goto nopage;


So, with the above data, I think passing __GFP_DIRECT_RECLAIM in the
vmalloc fallback path might help in this case. Either way, until we
have a page allocated we retry indefinitely anyway, so we may as well
pass the __GFP_DIRECT_RECLAIM flag to it, right?

fallback:
	for (;;) {
		bp->b_addr = __vmalloc(size, gfp_mask);
		if (bp->b_addr)
			break;
		if (flags & XBF_READ_AHEAD)
			return -ENOMEM;
		XFS_STATS_INC(bp->b_mount, xb_page_retries);
		memalloc_retry_wait(gfp_mask);
	}
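
i.e. a minimal, untested sketch of the idea (a real patch would
probably still want the readahead case to stay non-blocking):

fallback:
	/*
	 * Untested idea: re-enable direct reclaim for the blocking
	 * retry loop, since it was cleared earlier for the high-order
	 * folio_alloc() attempt.
	 */
	if (!(flags & XBF_READ_AHEAD))
		gfp_mask |= __GFP_DIRECT_RECLAIM;
	for (;;) {
		bp->b_addr = __vmalloc(size, gfp_mask);
		if (bp->b_addr)
			break;
		if (flags & XBF_READ_AHEAD)
			return -ENOMEM;
		XFS_STATS_INC(bp->b_mount, xb_page_retries);
		memalloc_retry_wait(gfp_mask);
	}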

Thoughts?

I am not sure how easily this issue is reproducible at Matthew's end,
but let me also set up a kvm guest with the same kernel version to see
if I can replicate this at my end with an overnight run of xfs/285 in a
loop.


-ritesh

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: Hang with xfs/285 on 2026-03-02 kernel
  2026-04-06  0:27       ` Ritesh Harjani
@ 2026-04-06 21:45         ` Dave Chinner
  0 siblings, 0 replies; 9+ messages in thread
From: Dave Chinner @ 2026-04-06 21:45 UTC (permalink / raw)
  To: Ritesh Harjani; +Cc: Matthew Wilcox, linux-xfs

On Mon, Apr 06, 2026 at 05:57:06AM +0530, Ritesh Harjani wrote:
> > However, turning off direct reclaim should make no difference in
> > the long run because vmalloc is only trying to allocate a batch of
> > single page folios.
> >
> > If we are in low memory situations where no single page folios are
> > available, then even for a NORETRY/no direct reclaim allocation
> > the expectation is that the failed allocation attempt would be
> > kicking kswapd to perform background memory reclaim.
> >
> > This is especially true when the allocation is GFP_NOFS/GFP_NOIO
> > even with direct reclaim turned on - if all the memory is held in
> > shrinkable fs/vfs caches then direct reclaim cannot reclaim anything
> > filesystem/IO related.
> >
> 
> So, looking at the logs from Matthew, I think this case might have
> benefitted from __GFP_DIRECT_RECLAIM, because we have many clean
> inactive file pages. So theoretically, IMO direct reclaim should be
> able to use one of those clean file pages (after it gets direct-reclaimed):
> 
>       nr_zone_inactive_file 62769
>       nr_zone_write_pending 0

You miss the point - this is not an isolated use case. e.g. look at
xlog_kvmalloc() - it's also a ~__GFP_DIRECT_RECLAIM, NORETRY vmalloc()
loop. What's to stop that one from getting stuck in exactly the same
way?

To that point, kvmalloc(__GFP_NOFAIL) now implements the semantics
that xlog_kvmalloc() requires - it turns off direct reclaim (and
hence costly compaction) for the kmalloc() allocation attempt, then
falls back to vmalloc(__GFP_NOFAIL) if kmalloc fails.

That's also pretty much the exact semantics we are trying to
implement in xfs_buf_alloc(), yes? i.e. xfs_buf_alloc() does:

For buffers < PAGE_SIZE, it calls kmalloc() directly and returns.

For buffers == PAGE_SIZE, it calls folio_alloc(GFP_KERNEL).

For buffers > PAGE_SIZE, it calls folio_alloc(NORETRY, ~__GFP_DIRECT_RECLAIM)

If either folio_alloc() call fails, it effectively runs an open-coded
__vmalloc() no-fail loop.

IOWs, we are implementing essentially the same semantics as
kvmalloc(__GFP_NOFAIL), modulo the reclaim flags for the __vmalloc()
loop. If we are going to change the flags for the __vmalloc() loop
back to the originals, then we are reimplementing
kvmalloc(__GFP_NOFAIL) semantics exactly. At which point....
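
i.e. at that point we could presumably just call kvmalloc() directly.
A purely illustrative, untested sketch (assuming kvmalloc() accepts
the gfp_mask used here and that the freeing side switches to kvfree()):

	/*
	 * Hypothetical: replace the folio_alloc()/__vmalloc() fallback
	 * dance with kvmalloc(), which already tries kmalloc() without
	 * direct reclaim and then falls back to a nofail vmalloc().
	 */
	if (!(flags & XBF_READ_AHEAD)) {
		bp->b_addr = kvmalloc(size, gfp_mask | __GFP_NOFAIL);
		return 0;
	}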

> > i.e. background reclaim making forwards progress is absolutely
> > necessary for any sort of "nofail" allocation loop to succeed
> > regardless of whether direct reclaim is enabled or not.
> >
> > Hence if background memory reclaim is making progress, this
> > allocation loop should eventually succeed. If the allocation is not
> > succeeding, then it implies that some critical resource in the
> > allocation path is not being refilled either on allocation failure
> > or by background reclaim, and hence the allocation failure persists
> > because nothing alleviates the resource shortage that is triggering
> > the ENOMEM issue.
> 
> I agree, the background memory reclaim / kswapd thread should have made
> forward progress.
> 
> I am not sure why we are hitting hung task issues in this case, then.
> It could be because of multiple fsstress threads running in parallel
> (from the ps -eax output), with some other process ending up using the
> pages reclaimed by background kswapd (just a theory).

I don't think that's the case, because kswapd is supposed to run
until watermarks are reached and that means all free page pools are
supposed to have at least some free pages in them...

That's why I think there's a reclaim bug lurking here - allocation
appears to be stalling on something that background reclaim is not
refilling.  And if memory allocation is stalling in the buffer
allocation path, then it can stall in other critical parts of XFS,
too. Background reclaim not doing enough work to allow looping
non-blocking, no-retry allocations to succeed seems like a memory
allocation/reclaim bug to me, not an XFS issue...

> > So the question is: where in the __vmalloc allocation path is the
> > ENOMEM error being generated from, and is it the same place every
> > time?
> >
> 
> I can't say for sure, but in this case, after looking at the code and
> knowing that we are not passing __GFP_DIRECT_RECLAIM, it might be
> returning from here (after get_page_from_freelist() couldn't get a
> free page):
> 
> __alloc_pages_slowpath() {
>     ...
> 	/* Caller is not willing to reclaim, we can't balance anything */
> 	if (!can_direct_reclaim)
> 		goto nopage;

Sure, we can't balance anything, but we've set ALLOC_KSWAPD early in
this function and so every time we get to the above point in the
allocation code we've already run this:

retry:
        /* Ensure kswapd doesn't accidentally go to sleep as long as we loop */
        if (alloc_flags & ALLOC_KSWAPD)
                wake_all_kswapds(order, gfp_mask, ac);

Hence kswapds should be active and doing reclaim work to bring
everything back to minimum free pool watermarks.  That *should* be
sufficient for a no-direct-reclaim allocation loop to make progress.

-Dave.
-- 
Dave Chinner
dgc@kernel.org

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: Hang with xfs/285 on 2026-03-02 kernel
  2026-04-03 15:35 Hang with xfs/285 on 2026-03-02 kernel Matthew Wilcox
  2026-04-04 11:42 ` Dave Chinner
@ 2026-04-07  5:41 ` Christoph Hellwig
  1 sibling, 0 replies; 9+ messages in thread
From: Christoph Hellwig @ 2026-04-07  5:41 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: linux-xfs

I tried to reproduce it, but failed.  Can you share more context?
mkfs/mount options, kernel .config, system information?


^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2026-04-07  5:41 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2026-04-03 15:35 Hang with xfs/285 on 2026-03-02 kernel Matthew Wilcox
2026-04-04 11:42 ` Dave Chinner
2026-04-04 20:40   ` Matthew Wilcox
2026-04-05 22:29     ` Dave Chinner
2026-04-05  1:03   ` Ritesh Harjani
2026-04-05 22:16     ` Dave Chinner
2026-04-06  0:27       ` Ritesh Harjani
2026-04-06 21:45         ` Dave Chinner
2026-04-07  5:41 ` Christoph Hellwig

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox