public inbox for linux-fsdevel@vger.kernel.org
* inconsistent lock state in the new fserror code
@ 2026-02-13  6:15 Christoph Hellwig
  2026-02-13 16:00 ` Darrick J. Wong
  0 siblings, 1 reply; 9+ messages in thread
From: Christoph Hellwig @ 2026-02-13  6:15 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs, linux-fsdevel, viro, brauner, jack

xfstests generic/108 makes lockdep unhappy with the new fserror code
in Linus' tree, see the trace below.  The problem seems to be that
igrab takes i_lock to protect against an inode that is being freed.
Error reporting doesn't care about that, but we don't really have
a good interface to just grab a reference.

[  149.494670] ================================
[  149.494871] WARNING: inconsistent lock state
[  149.495073] 6.19.0+ #4827 Tainted: G                 N 
[  149.495336] --------------------------------
[  149.495560] inconsistent {HARDIRQ-ON-W} -> {IN-HARDIRQ-W} usage.
[  149.495857] swapper/1/0 [HC1[1]:SC0[0]:HE0:SE1] takes:
[  149.496111] ffff88811ed1b140 (&sb->s_type->i_lock_key#33){?.+.}-{3:3}, at: igrab+0x1a/0xb0
[  149.496543] {HARDIRQ-ON-W} state was registered at:
[  149.496853]   lock_acquire+0xca/0x2c0
[  149.497057]   _raw_spin_lock+0x2e/0x40
[  149.497257]   unlock_new_inode+0x2c/0xc0
[  149.497460]   xfs_iget+0xcf4/0x1080
[  149.497643]   xfs_trans_metafile_iget+0x3d/0x100
[  149.497882]   xfs_metafile_iget+0x2b/0x50
[  149.498144]   xfs_mount_setup_metadir+0x20/0x60
[  149.498163]   xfs_mountfs+0x457/0xa60
[  149.498163]   xfs_fs_fill_super+0x6b3/0xa90
[  149.498163]   get_tree_bdev_flags+0x13c/0x1e0
[  149.498163]   vfs_get_tree+0x27/0xe0
[  149.498163]   vfs_cmd_create+0x54/0xe0
[  149.498163]   __do_sys_fsconfig+0x309/0x620
[  149.498163]   do_syscall_64+0x8b/0xf80
[  149.498163]   entry_SYSCALL_64_after_hwframe+0x76/0x7e
[  149.498163] irq event stamp: 139080
[  149.498163] hardirqs last  enabled at (139079): [<ffffffff813a923c>] do_idle+0x1ec/0x270
[  149.498163] hardirqs last disabled at (139080): [<ffffffff828a8d09>] common_interrupt+0x19/0xe0
[  149.498163] softirqs last  enabled at (139032): [<ffffffff8134a853>] __irq_exit_rcu+0xc3/0x120
[  149.498163] softirqs last disabled at (139025): [<ffffffff8134a853>] __irq_exit_rcu+0xc3/0x120
[  149.498163] 
[  149.498163] other info that might help us debug this:
[  149.498163]  Possible unsafe locking scenario:
[  149.498163] 
[  149.498163]        CPU0
[  149.498163]        ----
[  149.498163]   lock(&sb->s_type->i_lock_key#33);
[  149.498163]   <Interrupt>
[  149.498163]     lock(&sb->s_type->i_lock_key#33);
[  149.498163] 
[  149.498163]  *** DEADLOCK ***
[  149.498163] 
[  149.498163] 1 lock held by swapper/1/0:
[  149.498163]  #0: ffff8881052c81a0 (&vblk->vqs[i].lock){-.-.}-{3:3}, at: virtblk_done+0x4b/0x110
[  149.498163] 
[  149.498163] stack backtrace:
[  149.498163] CPU: 1 UID: 0 PID: 0 Comm: swapper/1 Tainted: G                 N  6.19.0+ #4827 PREEMPT(full) 
[  149.498163] Tainted: [N]=TEST
[  149.498163] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.17.0-0-gb52ca86e094d-prebuilt.qemu.org 04/01/2014
[  149.498163] Call Trace:
[  149.498163]  <IRQ>
[  149.498163]  dump_stack_lvl+0x5b/0x80
[  149.498163]  print_usage_bug.part.0+0x22c/0x2c0
[  149.498163]  mark_lock+0xa6f/0xe90
[  149.498163]  ? mempool_alloc_noprof+0x91/0x130
[  149.498163]  ? set_track_prepare+0x39/0x60
[  149.498163]  ? mempool_alloc_noprof+0x91/0x130
[  149.498163]  ? fserror_report+0x8a/0x260
[  149.498163]  ? iomap_finish_ioend_buffered+0x170/0x210
[  149.498163]  ? clone_endio+0x8f/0x1c0
[  149.498163]  ? blk_update_request+0x1e4/0x4d0
[  149.498163]  ? blk_mq_end_request+0x1b/0x100
[  149.498163]  ? virtblk_done+0x6f/0x110
[  149.498163]  ? vring_interrupt+0x59/0x80
[  149.498163]  ? __handle_irq_event_percpu+0x8a/0x2e0
[  149.498163]  ? handle_irq_event+0x33/0x70
[  149.498163]  ? handle_edge_irq+0xdd/0x1e0
[  149.498163]  __lock_acquire+0x10b6/0x25e0
[  149.498163]  ? __pcs_replace_empty_main+0x369/0x510
[  149.498163]  ? __pcs_replace_empty_main+0x369/0x510
[  149.498163]  lock_acquire+0xca/0x2c0
[  149.498163]  ? igrab+0x1a/0xb0
[  149.498163]  ? rcu_is_watching+0x11/0x50
[  149.498163]  ? __kmalloc_noprof+0x3ab/0x5a0
[  149.498163]  _raw_spin_lock+0x2e/0x40
[  149.498163]  ? igrab+0x1a/0xb0
[  149.498163]  igrab+0x1a/0xb0
[  149.498163]  fserror_report+0x135/0x260
[  149.498163]  iomap_finish_ioend_buffered+0x170/0x210
[  149.498163]  ? __pfx_stripe_end_io+0x10/0x10
[  149.498163]  clone_endio+0x8f/0x1c0
[  149.498163]  blk_update_request+0x1e4/0x4d0
[  149.498163]  ? __pfx_sg_pool_free+0x10/0x10
[  149.498163]  ? mempool_free+0x3d/0x50
[  149.498163]  blk_mq_end_request+0x1b/0x100
[  149.498163]  virtblk_done+0x6f/0x110
[  149.498163]  vring_interrupt+0x59/0x80
[  149.498163]  __handle_irq_event_percpu+0x8a/0x2e0
[  149.498163]  handle_irq_event+0x33/0x70
[  149.498163]  handle_edge_irq+0xdd/0x1e0
[  149.498163]  __common_interrupt+0x6f/0x180
[  149.498163]  common_interrupt+0xb7/0xe0
[  149.498163]  </IRQ>
[  149.498163]  <TASK>
[  149.498163]  asm_common_interrupt+0x26/0x40
[  149.498163] RIP: 0010:default_idle+0xf/0x20
[  149.498163] Code: 4c 01 c7 4c 29 c2 e9 6e ff ff ff 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 f3 0f 1e fa eb 07 0f 00 2d a9 88 15 00 fb f4 <fa> c3 cc cc cc cc 66 66 2e 0f 1f 84 00 00 00 00 00 90 90 90 90 90
[  149.498163] RSP: 0018:ffffc9000009fed8 EFLAGS: 00000206
[  149.498163] RAX: 0000000000021f47 RBX: ffff888100ad53c0 RCX: 0000000000000000
[  149.498163] RDX: 0000000000000000 RSI: ffffffff83278fb9 RDI: ffffffff8329090b
[  149.498163] RBP: 0000000000000001 R08: 0000000000000001 R09: 0000000000000000
[  149.498163] R10: 0000000000000001 R11: 0000000000000000 R12: 0000000000000000
[  149.498163] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[  149.498163]  default_idle_call+0x7e/0x1a0
[  149.498163]  do_idle+0x1ec/0x270
[  149.498163]  cpu_startup_entry+0x24/0x30
[  149.498163]  start_secondary+0xf7/0x100
[  149.498163]  common_startup_64+0x13e/0x148
[  149.498163]  </TASK>

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: inconsistent lock state in the new fserror code
  2026-02-13  6:15 inconsistent lock state in the new fserror code Christoph Hellwig
@ 2026-02-13 16:00 ` Darrick J. Wong
  2026-02-13 19:07   ` Darrick J. Wong
  0 siblings, 1 reply; 9+ messages in thread
From: Darrick J. Wong @ 2026-02-13 16:00 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-xfs, linux-fsdevel, viro, brauner, jack

On Thu, Feb 12, 2026 at 10:15:57PM -0800, Christoph Hellwig wrote:
> xfstests generic/108 makes lockdep unhappy with the new fserror code
> in Linus' tree, see the trace below.  The problem seems to be that
> igrab takes i_lock to protect against an inode that is being freed.
> Error reporting doesn't care about that, but we don't really have
> a good interface to just grab a reference.
> 
> [  149.494670] ================================
> [  149.494871] WARNING: inconsistent lock state
> [  149.495073] 6.19.0+ #4827 Tainted: G                 N 
> [  149.495336] --------------------------------
> [  149.495560] inconsistent {HARDIRQ-ON-W} -> {IN-HARDIRQ-W} usage.
> [  149.495857] swapper/1/0 [HC1[1]:SC0[0]:HE0:SE1] takes:
> [  149.496111] ffff88811ed1b140 (&sb->s_type->i_lock_key#33){?.+.}-{3:3}, at: igrab+0x1a/0xb0
> [  149.496543] {HARDIRQ-ON-W} state was registered at:
> [  149.496853]   lock_acquire+0xca/0x2c0
> [  149.497057]   _raw_spin_lock+0x2e/0x40
> [  149.497257]   unlock_new_inode+0x2c/0xc0
> [  149.497460]   xfs_iget+0xcf4/0x1080
> [  149.497643]   xfs_trans_metafile_iget+0x3d/0x100
> [  149.497882]   xfs_metafile_iget+0x2b/0x50
> [  149.498144]   xfs_mount_setup_metadir+0x20/0x60
> [  149.498163]   xfs_mountfs+0x457/0xa60
> [  149.498163]   xfs_fs_fill_super+0x6b3/0xa90
> [  149.498163]   get_tree_bdev_flags+0x13c/0x1e0
> [  149.498163]   vfs_get_tree+0x27/0xe0
> [  149.498163]   vfs_cmd_create+0x54/0xe0
> [  149.498163]   __do_sys_fsconfig+0x309/0x620
> [  149.498163]   do_syscall_64+0x8b/0xf80
> [  149.498163]   entry_SYSCALL_64_after_hwframe+0x76/0x7e
> [  149.498163] irq event stamp: 139080
> [  149.498163] hardirqs last  enabled at (139079): [<ffffffff813a923c>] do_idle+0x1ec/0x270
> [  149.498163] hardirqs last disabled at (139080): [<ffffffff828a8d09>] common_interrupt+0x19/0xe0
> [  149.498163] softirqs last  enabled at (139032): [<ffffffff8134a853>] __irq_exit_rcu+0xc3/0x120
> [  149.498163] softirqs last disabled at (139025): [<ffffffff8134a853>] __irq_exit_rcu+0xc3/0x120
> [  149.498163] 
> [  149.498163] other info that might help us debug this:
> [  149.498163]  Possible unsafe locking scenario:
> [  149.498163] 
> [  149.498163]        CPU0
> [  149.498163]        ----
> [  149.498163]   lock(&sb->s_type->i_lock_key#33);
> [  149.498163]   <Interrupt>
> [  149.498163]     lock(&sb->s_type->i_lock_key#33);

Er... is lockdep telling us here that we could take i_lock in
unlock_new_inode, get interrupted, and then take another i_lock?

> [  149.498163] 
> [  149.498163]  *** DEADLOCK ***
> [  149.498163] 
> [  149.498163] 1 lock held by swapper/1/0:
> [  149.498163]  #0: ffff8881052c81a0 (&vblk->vqs[i].lock){-.-.}-{3:3}, at: virtblk_done+0x4b/0x110
> [  149.498163] 
> [  149.498163] stack backtrace:
> [  149.498163] CPU: 1 UID: 0 PID: 0 Comm: swapper/1 Tainted: G                 N  6.19.0+ #4827 PREEMPT(full) 
> [  149.498163] Tainted: [N]=TEST
> [  149.498163] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.17.0-0-gb52ca86e094d-prebuilt.qemu.org 04/01/2014
> [  149.498163] Call Trace:
> [  149.498163]  <IRQ>
> [  149.498163]  dump_stack_lvl+0x5b/0x80
> [  149.498163]  print_usage_bug.part.0+0x22c/0x2c0
> [  149.498163]  mark_lock+0xa6f/0xe90
> [  149.498163]  ? mempool_alloc_noprof+0x91/0x130
> [  149.498163]  ? set_track_prepare+0x39/0x60
> [  149.498163]  ? mempool_alloc_noprof+0x91/0x130
> [  149.498163]  ? fserror_report+0x8a/0x260
> [  149.498163]  ? iomap_finish_ioend_buffered+0x170/0x210
> [  149.498163]  ? clone_endio+0x8f/0x1c0
> [  149.498163]  ? blk_update_request+0x1e4/0x4d0
> [  149.498163]  ? blk_mq_end_request+0x1b/0x100
> [  149.498163]  ? virtblk_done+0x6f/0x110
> [  149.498163]  ? vring_interrupt+0x59/0x80
> [  149.498163]  ? __handle_irq_event_percpu+0x8a/0x2e0
> [  149.498163]  ? handle_irq_event+0x33/0x70
> [  149.498163]  ? handle_edge_irq+0xdd/0x1e0
> [  149.498163]  __lock_acquire+0x10b6/0x25e0
> [  149.498163]  ? __pcs_replace_empty_main+0x369/0x510
> [  149.498163]  ? __pcs_replace_empty_main+0x369/0x510
> [  149.498163]  lock_acquire+0xca/0x2c0
> [  149.498163]  ? igrab+0x1a/0xb0
> [  149.498163]  ? rcu_is_watching+0x11/0x50
> [  149.498163]  ? __kmalloc_noprof+0x3ab/0x5a0
> [  149.498163]  _raw_spin_lock+0x2e/0x40
> [  149.498163]  ? igrab+0x1a/0xb0
> [  149.498163]  igrab+0x1a/0xb0
> [  149.498163]  fserror_report+0x135/0x260
> [  149.498163]  iomap_finish_ioend_buffered+0x170/0x210
> [  149.498163]  ? __pfx_stripe_end_io+0x10/0x10
> [  149.498163]  clone_endio+0x8f/0x1c0
> [  149.498163]  blk_update_request+0x1e4/0x4d0
> [  149.498163]  ? __pfx_sg_pool_free+0x10/0x10
> [  149.498163]  ? mempool_free+0x3d/0x50
> [  149.498163]  blk_mq_end_request+0x1b/0x100
> [  149.498163]  virtblk_done+0x6f/0x110
> [  149.498163]  vring_interrupt+0x59/0x80
> [  149.498163]  __handle_irq_event_percpu+0x8a/0x2e0
> [  149.498163]  handle_irq_event+0x33/0x70
> [  149.498163]  handle_edge_irq+0xdd/0x1e0
> [  149.498163]  __common_interrupt+0x6f/0x180
> [  149.498163]  common_interrupt+0xb7/0xe0

Hrmm, so we're calling fserror_report/igrab from an interrupt handler.
The bio endio function is for writeback ioend completion.

igrab takes i_lock to check if the inode is in FREEING or WILL_FREE
state.  However, the fact that it's in writeback presumably means that
the vfs still holds an i_count on this inode, so the inode cannot be
freed until iomap_finish_ioend_buffered completes.  So instead of
calling igrab directly, could we perhaps get away with:

	/*
	 * Only referenced inodes may be passed into this function!
	 * This means they cannot be INEW, FREEING, or WILL_FREE.
	 *
	 * Can't iput from non-sleeping context, so grabbing another
	 * reference must be the last thing before submitting the event
	 */
	if (inode && !atomic_inc_not_zero(&inode->i_count)) {
		/* warn about precondition violation and lost error */
		goto lost_event;
	}

	schedule_work(&event->work);

Hm?

It also occurred to me that we shouldn't be calling fserror_report on
any of the metadata inodes, which means another fix patch is needed for
the callsites in xfs_health.c.

--D

> [  149.498163]  </IRQ>
> [  149.498163]  <TASK>
> [  149.498163]  asm_common_interrupt+0x26/0x40
> [  149.498163] RIP: 0010:default_idle+0xf/0x20
> [  149.498163] Code: 4c 01 c7 4c 29 c2 e9 6e ff ff ff 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 f3 0f 1e fa eb 07 0f 00 2d a9 88 15 00 fb f4 <fa> c3 cc cc cc cc 66 66 2e 0f 1f 84 00 00 00 00 00 90 90 90 90 90
> [  149.498163] RSP: 0018:ffffc9000009fed8 EFLAGS: 00000206
> [  149.498163] RAX: 0000000000021f47 RBX: ffff888100ad53c0 RCX: 0000000000000000
> [  149.498163] RDX: 0000000000000000 RSI: ffffffff83278fb9 RDI: ffffffff8329090b
> [  149.498163] RBP: 0000000000000001 R08: 0000000000000001 R09: 0000000000000000
> [  149.498163] R10: 0000000000000001 R11: 0000000000000000 R12: 0000000000000000
> [  149.498163] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
> [  149.498163]  default_idle_call+0x7e/0x1a0
> [  149.498163]  do_idle+0x1ec/0x270
> [  149.498163]  cpu_startup_entry+0x24/0x30
> [  149.498163]  start_secondary+0xf7/0x100
> [  149.498163]  common_startup_64+0x13e/0x148
> [  149.498163]  </TASK>


* Re: inconsistent lock state in the new fserror code
  2026-02-13 16:00 ` Darrick J. Wong
@ 2026-02-13 19:07   ` Darrick J. Wong
  2026-02-13 22:38     ` Dave Chinner
  0 siblings, 1 reply; 9+ messages in thread
From: Darrick J. Wong @ 2026-02-13 19:07 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-xfs, linux-fsdevel, viro, brauner, jack

On Fri, Feb 13, 2026 at 08:00:41AM -0800, Darrick J. Wong wrote:
> On Thu, Feb 12, 2026 at 10:15:57PM -0800, Christoph Hellwig wrote:
> > xfstests generic/108 makes lockdep unhappy with the new fserror code
> > in Linus' tree, see the trace below.  The problem seems to be that
> > igrab takes i_lock to protect against an inode that is being freed.
> > Error reporting doesn't care about that, but we don't really have
> > a good interface to just grab a reference.
> > 
> > [  149.494670] ================================
> > [  149.494871] WARNING: inconsistent lock state
> > [  149.495073] 6.19.0+ #4827 Tainted: G                 N 
> > [  149.495336] --------------------------------
> > [  149.495560] inconsistent {HARDIRQ-ON-W} -> {IN-HARDIRQ-W} usage.
> > [  149.495857] swapper/1/0 [HC1[1]:SC0[0]:HE0:SE1] takes:
> > [  149.496111] ffff88811ed1b140 (&sb->s_type->i_lock_key#33){?.+.}-{3:3}, at: igrab+0x1a/0xb0
> > [  149.496543] {HARDIRQ-ON-W} state was registered at:
> > [  149.496853]   lock_acquire+0xca/0x2c0
> > [  149.497057]   _raw_spin_lock+0x2e/0x40
> > [  149.497257]   unlock_new_inode+0x2c/0xc0
> > [  149.497460]   xfs_iget+0xcf4/0x1080
> > [  149.497643]   xfs_trans_metafile_iget+0x3d/0x100
> > [  149.497882]   xfs_metafile_iget+0x2b/0x50
> > [  149.498144]   xfs_mount_setup_metadir+0x20/0x60
> > [  149.498163]   xfs_mountfs+0x457/0xa60
> > [  149.498163]   xfs_fs_fill_super+0x6b3/0xa90
> > [  149.498163]   get_tree_bdev_flags+0x13c/0x1e0
> > [  149.498163]   vfs_get_tree+0x27/0xe0
> > [  149.498163]   vfs_cmd_create+0x54/0xe0
> > [  149.498163]   __do_sys_fsconfig+0x309/0x620
> > [  149.498163]   do_syscall_64+0x8b/0xf80
> > [  149.498163]   entry_SYSCALL_64_after_hwframe+0x76/0x7e
> > [  149.498163] irq event stamp: 139080
> > [  149.498163] hardirqs last  enabled at (139079): [<ffffffff813a923c>] do_idle+0x1ec/0x270
> > [  149.498163] hardirqs last disabled at (139080): [<ffffffff828a8d09>] common_interrupt+0x19/0xe0
> > [  149.498163] softirqs last  enabled at (139032): [<ffffffff8134a853>] __irq_exit_rcu+0xc3/0x120
> > [  149.498163] softirqs last disabled at (139025): [<ffffffff8134a853>] __irq_exit_rcu+0xc3/0x120
> > [  149.498163] 
> > [  149.498163] other info that might help us debug this:
> > [  149.498163]  Possible unsafe locking scenario:
> > [  149.498163] 
> > [  149.498163]        CPU0
> > [  149.498163]        ----
> > [  149.498163]   lock(&sb->s_type->i_lock_key#33);
> > [  149.498163]   <Interrupt>
> > [  149.498163]     lock(&sb->s_type->i_lock_key#33);
> 
> Er... is lockdep telling us here that we could take i_lock in
> unlock_new_inode, get interrupted, and then take another i_lock?
> 
> > [  149.498163] 
> > [  149.498163]  *** DEADLOCK ***
> > [  149.498163] 
> > [  149.498163] 1 lock held by swapper/1/0:
> > [  149.498163]  #0: ffff8881052c81a0 (&vblk->vqs[i].lock){-.-.}-{3:3}, at: virtblk_done+0x4b/0x110
> > [  149.498163] 
> > [  149.498163] stack backtrace:
> > [  149.498163] CPU: 1 UID: 0 PID: 0 Comm: swapper/1 Tainted: G                 N  6.19.0+ #4827 PREEMPT(full) 
> > [  149.498163] Tainted: [N]=TEST
> > [  149.498163] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.17.0-0-gb52ca86e094d-prebuilt.qemu.org 04/01/2014
> > [  149.498163] Call Trace:
> > [  149.498163]  <IRQ>
> > [  149.498163]  dump_stack_lvl+0x5b/0x80
> > [  149.498163]  print_usage_bug.part.0+0x22c/0x2c0
> > [  149.498163]  mark_lock+0xa6f/0xe90
> > [  149.498163]  ? mempool_alloc_noprof+0x91/0x130
> > [  149.498163]  ? set_track_prepare+0x39/0x60
> > [  149.498163]  ? mempool_alloc_noprof+0x91/0x130
> > [  149.498163]  ? fserror_report+0x8a/0x260
> > [  149.498163]  ? iomap_finish_ioend_buffered+0x170/0x210
> > [  149.498163]  ? clone_endio+0x8f/0x1c0
> > [  149.498163]  ? blk_update_request+0x1e4/0x4d0
> > [  149.498163]  ? blk_mq_end_request+0x1b/0x100
> > [  149.498163]  ? virtblk_done+0x6f/0x110
> > [  149.498163]  ? vring_interrupt+0x59/0x80
> > [  149.498163]  ? __handle_irq_event_percpu+0x8a/0x2e0
> > [  149.498163]  ? handle_irq_event+0x33/0x70
> > [  149.498163]  ? handle_edge_irq+0xdd/0x1e0
> > [  149.498163]  __lock_acquire+0x10b6/0x25e0
> > [  149.498163]  ? __pcs_replace_empty_main+0x369/0x510
> > [  149.498163]  ? __pcs_replace_empty_main+0x369/0x510
> > [  149.498163]  lock_acquire+0xca/0x2c0
> > [  149.498163]  ? igrab+0x1a/0xb0
> > [  149.498163]  ? rcu_is_watching+0x11/0x50
> > [  149.498163]  ? __kmalloc_noprof+0x3ab/0x5a0
> > [  149.498163]  _raw_spin_lock+0x2e/0x40
> > [  149.498163]  ? igrab+0x1a/0xb0
> > [  149.498163]  igrab+0x1a/0xb0
> > [  149.498163]  fserror_report+0x135/0x260
> > [  149.498163]  iomap_finish_ioend_buffered+0x170/0x210
> > [  149.498163]  ? __pfx_stripe_end_io+0x10/0x10
> > [  149.498163]  clone_endio+0x8f/0x1c0
> > [  149.498163]  blk_update_request+0x1e4/0x4d0
> > [  149.498163]  ? __pfx_sg_pool_free+0x10/0x10
> > [  149.498163]  ? mempool_free+0x3d/0x50
> > [  149.498163]  blk_mq_end_request+0x1b/0x100
> > [  149.498163]  virtblk_done+0x6f/0x110
> > [  149.498163]  vring_interrupt+0x59/0x80
> > [  149.498163]  __handle_irq_event_percpu+0x8a/0x2e0
> > [  149.498163]  handle_irq_event+0x33/0x70
> > [  149.498163]  handle_edge_irq+0xdd/0x1e0
> > [  149.498163]  __common_interrupt+0x6f/0x180
> > [  149.498163]  common_interrupt+0xb7/0xe0
> 
> Hrmm, so we're calling fserror_report/igrab from an interrupt handler.
> The bio endio function is for writeback ioend completion.
> 
> igrab takes i_lock to check if the inode is in FREEING or WILL_FREE
> state.  However, the fact that it's in writeback presumably means that
> the vfs still holds an i_count on this inode, so the inode cannot be
> freed until iomap_finish_ioend_buffered completes.  So instead of
> calling igrab directly, could we perhaps get away with:
> 
> 	/*
> 	 * Only referenced inodes may be passed into this function!
> 	 * This means they cannot be INEW, FREEING, or WILL_FREE.
> 	 *
> 	 * Can't iput from non-sleeping context, so grabbing another
> 	 * reference must be the last thing before submitting the event
> 	 */
> 	if (inode && !atomic_inc_not_zero(&inode->i_count)) {
> 		/* warn about precondition violation and lost error */
> 		goto lost_event;
> 	}
> 
> 	schedule_work(&event->work);
> 
> Hm?

/me hands himself another cup of coffee, changes that to:

	/*
	 * Can't iput from non-sleeping context, so grabbing another
	 * reference to the inode must be the last thing before
	 * submitting the event.  Open-code the igrab here to avoid
	 * taking i_lock in interrupt context.
	 */
	if (inode) {
		WARN_ON_ONCE(inode_unhashed(inode));
		WARN_ON_ONCE(inode_state_read_once(inode) &
					(I_NEW | I_FREEING | I_WILL_FREE));

		if (!atomic_inc_not_zero(&inode->i_count))
			goto lost_event;
		event->inode = inode;
	}

This seems to fix the lockdep report, will send that out for full
testing this evening.

--D

> It also occurred to me that we shouldn't be calling fserror_report on
> any of the metadata inodes, which means another fix patch is needed for
> the callsites in xfs_health.c.
> 
> --D
> 
> > [  149.498163]  </IRQ>
> > [  149.498163]  <TASK>
> > [  149.498163]  asm_common_interrupt+0x26/0x40
> > [  149.498163] RIP: 0010:default_idle+0xf/0x20
> > [  149.498163] Code: 4c 01 c7 4c 29 c2 e9 6e ff ff ff 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 f3 0f 1e fa eb 07 0f 00 2d a9 88 15 00 fb f4 <fa> c3 cc cc cc cc 66 66 2e 0f 1f 84 00 00 00 00 00 90 90 90 90 90
> > [  149.498163] RSP: 0018:ffffc9000009fed8 EFLAGS: 00000206
> > [  149.498163] RAX: 0000000000021f47 RBX: ffff888100ad53c0 RCX: 0000000000000000
> > [  149.498163] RDX: 0000000000000000 RSI: ffffffff83278fb9 RDI: ffffffff8329090b
> > [  149.498163] RBP: 0000000000000001 R08: 0000000000000001 R09: 0000000000000000
> > [  149.498163] R10: 0000000000000001 R11: 0000000000000000 R12: 0000000000000000
> > [  149.498163] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
> > [  149.498163]  default_idle_call+0x7e/0x1a0
> > [  149.498163]  do_idle+0x1ec/0x270
> > [  149.498163]  cpu_startup_entry+0x24/0x30
> > [  149.498163]  start_secondary+0xf7/0x100
> > [  149.498163]  common_startup_64+0x13e/0x148
> > [  149.498163]  </TASK>
> 


* Re: inconsistent lock state in the new fserror code
  2026-02-13 19:07   ` Darrick J. Wong
@ 2026-02-13 22:38     ` Dave Chinner
  2026-02-14  5:55       ` Darrick J. Wong
  2026-02-17  5:47       ` Christoph Hellwig
  0 siblings, 2 replies; 9+ messages in thread
From: Dave Chinner @ 2026-02-13 22:38 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Christoph Hellwig, linux-xfs, linux-fsdevel, viro, brauner, jack

On Fri, Feb 13, 2026 at 11:07:57AM -0800, Darrick J. Wong wrote:
> On Fri, Feb 13, 2026 at 08:00:41AM -0800, Darrick J. Wong wrote:
> > On Thu, Feb 12, 2026 at 10:15:57PM -0800, Christoph Hellwig wrote:
> > > [  149.498163] other info that might help us debug this:
> > > [  149.498163]  Possible unsafe locking scenario:
> > > [  149.498163] 
> > > [  149.498163]        CPU0
> > > [  149.498163]        ----
> > > [  149.498163]   lock(&sb->s_type->i_lock_key#33);
> > > [  149.498163]   <Interrupt>
> > > [  149.498163]     lock(&sb->s_type->i_lock_key#33);
> > 
> > Er... is lockdep telling us here that we could take i_lock in
> > unlock_new_inode, get interrupted, and then take another i_lock?

Yes.

> > > [  149.498163] 
> > > [  149.498163]  *** DEADLOCK ***
> > > [  149.498163] 
> > > [  149.498163] 1 lock held by swapper/1/0:
> > > [  149.498163]  #0: ffff8881052c81a0 (&vblk->vqs[i].lock){-.-.}-{3:3}, at: virtblk_done+0x4b/0x110
> > > [  149.498163] 
> > > [  149.498163] stack backtrace:
> > > [  149.498163] CPU: 1 UID: 0 PID: 0 Comm: swapper/1 Tainted: G                 N  6.19.0+ #4827 PREEMPT(full) 
> > > [  149.498163] Tainted: [N]=TEST
> > > [  149.498163] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.17.0-0-gb52ca86e094d-prebuilt.qemu.org 04/01/2014
> > > [  149.498163] Call Trace:
> > > [  149.498163]  <IRQ>
> > > [  149.498163]  dump_stack_lvl+0x5b/0x80
> > > [  149.498163]  print_usage_bug.part.0+0x22c/0x2c0
> > > [  149.498163]  mark_lock+0xa6f/0xe90
> > > [  149.498163]  __lock_acquire+0x10b6/0x25e0
> > > [  149.498163]  lock_acquire+0xca/0x2c0
> > > [  149.498163]  _raw_spin_lock+0x2e/0x40
> > > [  149.498163]  igrab+0x1a/0xb0
> > > [  149.498163]  fserror_report+0x135/0x260
> > > [  149.498163]  iomap_finish_ioend_buffered+0x170/0x210
> > > [  149.498163]  clone_endio+0x8f/0x1c0
> > > [  149.498163]  blk_update_request+0x1e4/0x4d0
> > > [  149.498163]  blk_mq_end_request+0x1b/0x100
> > > [  149.498163]  virtblk_done+0x6f/0x110
> > > [  149.498163]  vring_interrupt+0x59/0x80

Ok, so why are we calling iomap_finish_ioend_buffered() from IRQ
context? That looks like a bug because the only IO completion call
chain that can get into iomap_finish_ioend_buffered() is supposedly:

iomap_finish_ioends
  iomap_finish_ioend
    iomap_finish_ioend_buffered

And the comment above iomap_finish_ioends() says:

/*
 * Ioend completion routine for merged bios. This can only be called from task
 * contexts as merged ioends can be of unbound length. Hence we have to break up
 * the writeback completions into manageable chunks to avoid long scheduler
 * holdoffs. We aim to keep scheduler holdoffs down below 10ms so that we get
 * good batch processing throughput without creating adverse scheduler latency
 * conditions.
 */

Ah, there's the problem - pure buffered overwrites from XFS use
ioend_writeback_end_bio(), not xfs_end_bio(). Hence the buffered
write completion is not punted to a workqueue, and it calls
iomap_finish_ioend_buffered() direct from the bio completion
context.

Yeah, that seems like a bug that needs fixing in the
ioend_writeback_end_bio() function - if there's an IO error, it
needs to punt the processing of the ioend to a workqueue...

> > > [  149.498163]  __handle_irq_event_percpu+0x8a/0x2e0
> > > [  149.498163]  handle_irq_event+0x33/0x70
> > > [  149.498163]  handle_edge_irq+0xdd/0x1e0
> > > [  149.498163]  __common_interrupt+0x6f/0x180
> > > [  149.498163]  common_interrupt+0xb7/0xe0
> > 
> > Hrmm, so we're calling fserror_report/igrab from an interrupt handler.
> > The bio endio function is for writeback ioend completion.

Yup, this is one of the reasons writeback doesn't hold an inode
reference over IO - we can't call iput() from an interrupt context.

> > igrab takes i_lock to check if the inode is in FREEING or WILL_FREE
> > state.  However, the fact that it's in writeback presumably means that
> > the vfs still holds an i_count on this inode,

Writeback holds an inode reference over submission only.

> > so the inode cannot be
> > freed until iomap_finish_ioend_buffered completes.

iput()->iput_final()->evict will block in inode_wait_for_writeback()
waiting for outstanding writeback to complete before it starts
tearing down the inode. This isn't controlled by reference counts.

> /me hands himself another cup of coffee, changes that to:
> 
> 	/*
> 	 * Can't iput from non-sleeping context, so grabbing another
> 	 * reference to the inode must be the last thing before
> 	 * submitting the event.  Open-code the igrab here to avoid
> 	 * taking i_lock in interrupt context.
> 	 */
> 	if (inode) {
> 		WARN_ON_ONCE(inode_unhashed(inode));
> 		WARN_ON_ONCE(inode_state_read_once(inode) &
> 					(I_NEW | I_FREEING | I_WILL_FREE));

It is valid for the inode to have a zero reference count and have either
I_FREEING or I_WILL_FREE set here if another task has dropped the
final inode reference while writeback IO is still in flight.

> 		if (!atomic_inc_not_zero(&inode->i_count))
> 			goto lost_event;

Overall, I'm not sure using atomic_inc_not_zero() is safe here. It
may be, but I don't think this is how the problem should be solved.
Punt ioends with IO errors to a workqueue, and then nothing needs to
change w.r.t. the fserror handling of the inodes, i.e. it will be
safe to use inode->i_lock and hence igrab()...

Cheers,

Dave.
-- 
Dave Chinner
dgc@kernel.org


* Re: inconsistent lock state in the new fserror code
  2026-02-13 22:38     ` Dave Chinner
@ 2026-02-14  5:55       ` Darrick J. Wong
  2026-02-17  5:47       ` Christoph Hellwig
  1 sibling, 0 replies; 9+ messages in thread
From: Darrick J. Wong @ 2026-02-14  5:55 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Christoph Hellwig, linux-xfs, linux-fsdevel, viro, brauner, jack

On Sat, Feb 14, 2026 at 09:38:26AM +1100, Dave Chinner wrote:
> On Fri, Feb 13, 2026 at 11:07:57AM -0800, Darrick J. Wong wrote:
> > On Fri, Feb 13, 2026 at 08:00:41AM -0800, Darrick J. Wong wrote:
> > > On Thu, Feb 12, 2026 at 10:15:57PM -0800, Christoph Hellwig wrote:
> > > > [  149.498163] other info that might help us debug this:
> > > > [  149.498163]  Possible unsafe locking scenario:
> > > > [  149.498163] 
> > > > [  149.498163]        CPU0
> > > > [  149.498163]        ----
> > > > [  149.498163]   lock(&sb->s_type->i_lock_key#33);
> > > > [  149.498163]   <Interrupt>
> > > > [  149.498163]     lock(&sb->s_type->i_lock_key#33);
> > > 
> > > Er... is lockdep telling us here that we could take i_lock in
> > > unlock_new_inode, get interrupted, and then take another i_lock?
> 
> Yes.
> 
> > > > [  149.498163] 
> > > > [  149.498163]  *** DEADLOCK ***
> > > > [  149.498163] 
> > > > [  149.498163] 1 lock held by swapper/1/0:
> > > > [  149.498163]  #0: ffff8881052c81a0 (&vblk->vqs[i].lock){-.-.}-{3:3}, at: virtblk_done+0x4b/0x110
> > > > [  149.498163] 
> > > > [  149.498163] stack backtrace:
> > > > [  149.498163] CPU: 1 UID: 0 PID: 0 Comm: swapper/1 Tainted: G                 N  6.19.0+ #4827 PREEMPT(full) 
> > > > [  149.498163] Tainted: [N]=TEST
> > > > [  149.498163] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.17.0-0-gb52ca86e094d-prebuilt.qemu.org 04/01/2014
> > > > [  149.498163] Call Trace:
> > > > [  149.498163]  <IRQ>
> > > > [  149.498163]  dump_stack_lvl+0x5b/0x80
> > > > [  149.498163]  print_usage_bug.part.0+0x22c/0x2c0
> > > > [  149.498163]  mark_lock+0xa6f/0xe90
> > > > [  149.498163]  __lock_acquire+0x10b6/0x25e0
> > > > [  149.498163]  lock_acquire+0xca/0x2c0
> > > > [  149.498163]  _raw_spin_lock+0x2e/0x40
> > > > [  149.498163]  igrab+0x1a/0xb0
> > > > [  149.498163]  fserror_report+0x135/0x260
> > > > [  149.498163]  iomap_finish_ioend_buffered+0x170/0x210
> > > > [  149.498163]  clone_endio+0x8f/0x1c0
> > > > [  149.498163]  blk_update_request+0x1e4/0x4d0
> > > > [  149.498163]  blk_mq_end_request+0x1b/0x100
> > > > [  149.498163]  virtblk_done+0x6f/0x110
> > > > [  149.498163]  vring_interrupt+0x59/0x80
> 
> Ok, so why are we calling iomap_finish_ioend_buffered() from IRQ
> context? That looks like a bug because the only IO completion call
> chain that can get into iomap_finish_ioend_buffered() is supposedly:
> 
> iomap_finish_ioends
>   iomap_finish_ioend
>     iomap_finish_ioend_buffered
> 
> And the comment above iomap_finish_ioends() says:
> 
> /*
>  * Ioend completion routine for merged bios. This can only be called from task
>  * contexts as merged ioends can be of unbound length. Hence we have to break up
>  * the writeback completions into manageable chunks to avoid long scheduler
>  * holdoffs. We aim to keep scheduler holdoffs down below 10ms so that we get
>  * good batch processing throughput without creating adverse scheduler latency
>  * conditions.
>  */
> 
> Ah, there's the problem - pure buffered overwrites from XFS use
> ioend_writeback_end_bio(), not xfs_end_bio(). Hence the buffered
> write completion is not punted to a workqueue, and it calls
> iomap_finish_ioend_buffered() direct from the bio completion
> context.
> 
> Yeah, that seems like a bug that needs fixing in the
> ioend_writeback_end_bio() function - if there's an IO error, it
> needs to punt the processing of the ioend to a workqueue...

<nod> That's a much simpler approach, particularly if we're only bumping
to a workqueue to handle IO errors (which means there's no need for
merging).
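
[Editorial sketch, for illustration only: a userspace toy — not kernel code, all names invented — showing the completion-path split Dave suggests, where successful writeback still finishes inline in bio end_io context but any error completion is punted to task context.]

```c
/* Hypothetical stand-in for the kernel's ioend; field names are made up. */
struct ioend_sketch {
	int error;           /* bio status: 0 on success, negative errno on failure */
	int finished_inline; /* completed directly in bio end_io context */
	int deferred;        /* punted to a workqueue for task-context completion */
};

/* Stand-in for queue_work(); here we only record the decision. */
static void defer_to_workqueue(struct ioend_sketch *io)
{
	io->deferred = 1;
}

/*
 * The control flow Dave describes for ioend_writeback_end_bio(): pure
 * overwrite completions can keep finishing inline in interrupt context,
 * but an IO error must reach task context before the fserror code runs,
 * so that taking i_lock / calling igrab() is safe.
 */
void end_bio_sketch(struct ioend_sketch *io)
{
	if (io->error)
		defer_to_workqueue(io);
	else
		io->finished_inline = 1;
}
```

[In real kernel code the deferral would be INIT_WORK() plus queue_work(), and the worker would then call iomap_finish_ioend_buffered() and fserror_report() from process context.]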

> > > > [  149.498163]  __handle_irq_event_percpu+0x8a/0x2e0
> > > > [  149.498163]  handle_irq_event+0x33/0x70
> > > > [  149.498163]  handle_edge_irq+0xdd/0x1e0
> > > > [  149.498163]  __common_interrupt+0x6f/0x180
> > > > [  149.498163]  common_interrupt+0xb7/0xe0
> > > 
> > > Hrmm, so we're calling fserror_report/igrab from an interrupt handler.
> > > The bio endio function is for writeback ioend completion.
> 
> Yup, this is one of the reasons writeback doesn't hold an inode
> reference over IO - we can't call iput() from an interrupt context.
> 
> > > igrab takes i_lock to check if the inode is in FREEING or WILL_FREE
> > > state.  However, the fact that it's in writeback presumably means that
> > > the vfs still holds an i_count on this inode,
> 
> Writeback holds an inode reference over submission only.
> 
> > > so the inode cannot be
> > > freed until iomap_finish_ioend_buffered completes.
> 
> iput()->iput_final()->evict will block in inode_wait_for_writeback()
> waiting for outstanding writeback to complete before it starts
> tearing down the inode. This isn't controlled by reference counts.
> 
> > /me hands himself another cup of coffee, changes that to:
> > 
> > 	/*
> > 	 * Can't iput from non-sleeping context, so grabbing another
> > 	 * reference to the inode must be the last thing before
> > 	 * submitting the event.  Open-code the igrab here to avoid
> > 	 * taking i_lock in interrupt context.
> > 	 */
> > 	if (inode) {
> > 		WARN_ON_ONCE(inode_unhashed(inode));
> > 		WARN_ON_ONCE(inode_state_read_once(inode) &
> > 					(I_NEW | I_FREEING | I_WILL_FREE));
> 
> It is valid for the inode to have a zero reference count and have either
> I_FREEING or I_WILL_FREE set here if another task has dropped the
> final inode reference while writeback IO is still in flight.
> 
> > 		if (!atomic_inc_not_zero(&inode->i_count))
> > 			goto lost_event;
> 
> Overall, I'm not sure using atomic_inc_not_zero() is safe here. It
> may be, but I don't think this is how the problem should be solved.

I /think/ it works because evict waits for writeback to end (so the
inode can't go away) and we never attach the inode to the error event if
the i_count already hit zero buuut this is a code smell anyway so I've
little interest in pursuing this part further.

> Punt ioend w/ IO errors to a work queue, and then nothing needs to
> change w.r.t. the fserror handling of the inodes. i.e. it will be
> safe to use inode->i_lock and hence igrab()...

<nod> Will test that out.  Thanks for the suggestion.

--D

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> dgc@kernel.org

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: inconsistent lock state in the new fserror code
  2026-02-13 22:38     ` Dave Chinner
  2026-02-14  5:55       ` Darrick J. Wong
@ 2026-02-17  5:47       ` Christoph Hellwig
  2026-02-18 19:00         ` Darrick J. Wong
  1 sibling, 1 reply; 9+ messages in thread
From: Christoph Hellwig @ 2026-02-17  5:47 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Darrick J. Wong, Christoph Hellwig, linux-xfs, linux-fsdevel,
	viro, brauner, jack

On Sat, Feb 14, 2026 at 09:38:26AM +1100, Dave Chinner wrote:
> Yeah, that seems like a bug that needs fixing in the
> ioend_writeback_end_bio() function - if there's an IO error, it
> needs to punt the processing of the ioend to a workqueue...

The iomap code doesn't have a workqueue currently.  The way we split
the code, we left the workqueue handling in XFS, because it is anchored
in the inode.  I've been wanting to have it generic, as it would help
with various other things, though.

For XFS we might be able to just always hook into our I/O completion
handler and shortcut the workqueue for pure overwrites without errors,
but that won't help other users like the block device code, zonefs and
gfs2.  Maybe we'll need an opt-in for the fserror reporting now to
exclude them?

On something related, if we require a user context for fserror_report
anyway, there is no need for the workqueue bouncing in it.



* Re: inconsistent lock state in the new fserror code
  2026-02-17  5:47       ` Christoph Hellwig
@ 2026-02-18 19:00         ` Darrick J. Wong
  2026-02-19  5:53           ` Christoph Hellwig
  0 siblings, 1 reply; 9+ messages in thread
From: Darrick J. Wong @ 2026-02-18 19:00 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Dave Chinner, linux-xfs, linux-fsdevel, viro, brauner, jack

On Mon, Feb 16, 2026 at 09:47:45PM -0800, Christoph Hellwig wrote:
> On Sat, Feb 14, 2026 at 09:38:26AM +1100, Dave Chinner wrote:
> > Yeah, that seems like a bug that needs fixing in the
> > ioend_writeback_end_bio() function - if there's an IO error, it
> > needs to punt the processing of the ioend to a workqueue...
> 
> The iomap code doesn't have a workqueue currently.  The way we split
> the code, we left the workqueue handling in XFS, because it is anchored
> in the inode.  I've been wanting to have it generic, as it would help
> with various other things, though.
> 
> For XFS we might be able to just always hook into our I/O completion
> handler and shortcut the workqueue for pure overwrites without errors,
> but that won't help other users like the block device code, zonefs and
> gfs2.  Maybe we'll need an opt-in for the fserror reporting now to
> exclude them?

<shrug> Assuming that file IO errors aren't a frequent occurrence, it's
easy enough to attach them to a global list and schedule_work() to
process the list when an error comes in.
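
[Editorial sketch of that global-list idea, as a userspace toy with invented names; a kernel version would presumably use llist_add(), which is safe from IRQ context without a lock, plus schedule_work().]

```c
#include <stddef.h>

/* Hypothetical error event, roughly what fserror_report() would queue. */
struct fserror_event_sketch {
	int error;
	struct fserror_event_sketch *next;
};

/* Global pending list; the kernel version would be an llist_head. */
static struct fserror_event_sketch *pending_events;
static int worker_scheduled;

/* Producer side, callable from completion/IRQ context: push the event
 * and kick the worker (schedule_work() in real code). */
void fserror_queue_sketch(struct fserror_event_sketch *ev)
{
	ev->next = pending_events;
	pending_events = ev;
	worker_scheduled = 1;
}

/* Worker side, running in process context where igrab()/i_lock and
 * fsnotify are safe; returns how many events it drained. */
int fserror_worker_sketch(void)
{
	int n = 0;

	while (pending_events) {
		struct fserror_event_sketch *ev = pending_events;

		pending_events = ev->next;
		n++;	/* real code: report via fsnotify, then iput() */
	}
	worker_scheduled = 0;
	return n;
}
```

[The single-threaded toy elides the atomicity that llist provides; the point is only the shape: IRQ-safe push on one side, all inode work on the draining side.]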

> On something related, if we require a user context for fserror_report
> anyway, there is no need for the workqueue bouncing in it.

Bouncing the fserror_event to an async kworker is useful for laundering
the inode locking context -- fsnotify and ->report_error know they're
running in process context without any filesystem locks held.

I tried getting rid of the wq bouncing and immediately ran into the same
lockdep complaint but on xfs_inode::i_flags_lock.

--D


* Re: inconsistent lock state in the new fserror code
  2026-02-18 19:00         ` Darrick J. Wong
@ 2026-02-19  5:53           ` Christoph Hellwig
  2026-02-19  5:59             ` Darrick J. Wong
  0 siblings, 1 reply; 9+ messages in thread
From: Christoph Hellwig @ 2026-02-19  5:53 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Christoph Hellwig, Dave Chinner, linux-xfs, linux-fsdevel, viro,
	brauner, jack

On Wed, Feb 18, 2026 at 11:00:39AM -0800, Darrick J. Wong wrote:
> > but that won't help other users like the block device code, zonefs and
> > gfs2.  Maybe we'll need an opt-in for the fserror reporting now to
> > exclude them?
> 
> <shrug> Assuming that file IO errors aren't a frequent occurrence, it's
> > easy enough to attach them to a global list and schedule_work() to
> process the list when an error comes in.

I'd rather not create random forests of workqueues if we can avoid it.
Let file systems opt into features when they provide the infrastructure,
and lift common enough infrastructure into common code as we usually do.

> 
> > On something related, if we require a user context for fserror_report
> > anyway, there is no need for the workqueue bouncing in it.
> 
> Bouncing the fserror_event to an async kworker is useful for laundering
> the inode locking context -- fsnotify and ->report_error know they're
> running in process context without any filesystem locks held.
> 
> I tried getting rid of the wq bouncing and immediately ran into the same
> lockdep complaint but on xfs_inode::i_flags_lock.

Ok.



* Re: inconsistent lock state in the new fserror code
  2026-02-19  5:53           ` Christoph Hellwig
@ 2026-02-19  5:59             ` Darrick J. Wong
  0 siblings, 0 replies; 9+ messages in thread
From: Darrick J. Wong @ 2026-02-19  5:59 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Dave Chinner, linux-xfs, linux-fsdevel, viro, brauner, jack

On Wed, Feb 18, 2026 at 09:53:03PM -0800, Christoph Hellwig wrote:
> On Wed, Feb 18, 2026 at 11:00:39AM -0800, Darrick J. Wong wrote:
> > > but that won't help other users like the block device code, zonefs and
> > > gfs2.  Maybe we'll need an opt-in for the fserror reporting now to
> > > exclude them?
> > 
> > <shrug> Assuming that file IO errors aren't a frequent occurrence, it's
> > > easy enough to attach them to a global list and schedule_work() to
> > process the list when an error comes in.
> 
> I'd rather not create random forests of workqueues if we can avoid it.
> Let file systems opt into features when they provide the infrastructure,
> and lift common enough infrastructure into common code as we usually do.

That /is/ a weird part about the fserror calls in iomap -- the
filesystem doesn't have to provide any infrastructure to get the
functionality.

I guess we could do something weird like add a flags field to iomap_ops
so that a filesystem could say that it wants fserror reporting; or plumb
a bunch more stuff through iomap.
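
[Editorial sketch of what such an opt-in could look like — entirely hypothetical, since iomap_ops has no flags field today and the flag name is invented.]

```c
/* Hypothetical opt-in bit; nothing like this exists in iomap today. */
#define IOMAP_OPS_FSERROR	(1u << 0)

struct iomap_ops_sketch {
	unsigned int flags;
	/* ->iomap_begin / ->iomap_end callbacks elided */
};

/*
 * iomap would only call fserror_report() for filesystems that set the
 * bit, i.e. those that guarantee error completions reach task context
 * (for example by punting erroring ioends to a workqueue).
 */
int want_fserror_report(const struct iomap_ops_sketch *ops)
{
	return (ops->flags & IOMAP_OPS_FSERROR) != 0;
}
```

[Filesystems that provide the completion infrastructure would set the bit; the block device code, zonefs, and gfs2 would be excluded by default.]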

<shrug> I think I'd rather just send my accumulated 7.0 fixes and we can
argue about the correct solution(s) with some real code. :)

--D

> > > On something related, if we require a user context for fserror_report
> > > anyway, there is no need for the workqueue bouncing in it.
> > 
> > Bouncing the fserror_event to an async kworker is useful for laundering
> > the inode locking context -- fsnotify and ->report_error know they're
> > running in process context without any filesystem locks held.
> > 
> > I tried getting rid of the wq bouncing and immediately ran into the same
> > lockdep complaint but on xfs_inode::i_flags_lock.
> 
> Ok.
> 
> 


end of thread, other threads:[~2026-02-19  5:59 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2026-02-13  6:15 inconsistent lock state in the new fserror code Christoph Hellwig
2026-02-13 16:00 ` Darrick J. Wong
2026-02-13 19:07   ` Darrick J. Wong
2026-02-13 22:38     ` Dave Chinner
2026-02-14  5:55       ` Darrick J. Wong
2026-02-17  5:47       ` Christoph Hellwig
2026-02-18 19:00         ` Darrick J. Wong
2026-02-19  5:53           ` Christoph Hellwig
2026-02-19  5:59             ` Darrick J. Wong

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox