Re: inconsistent lock state in the new fserror code

public inbox for linux-fsdevel@vger.kernel.org
 help / color / mirror / Atom feed

From: "Darrick J. Wong" <djwong@kernel.org>
To: Christoph Hellwig <hch@infradead.org>
Cc: linux-xfs@vger.kernel.org, linux-fsdevel@vger.kernel.org,
	viro@zeniv.linux.org.uk, brauner@kernel.org, jack@suse.cz
Subject: Re: inconsistent lock state in the new fserror code
Date: Fri, 13 Feb 2026 11:07:57 -0800	[thread overview]
Message-ID: <20260213190757.GJ7693@frogsfrogsfrogs> (raw)
In-Reply-To: <20260213160041.GT1535390@frogsfrogsfrogs>

On Fri, Feb 13, 2026 at 08:00:41AM -0800, Darrick J. Wong wrote:
> On Thu, Feb 12, 2026 at 10:15:57PM -0800, Christoph Hellwig wrote:
> > xfstests generic/108 makes lockdep unhappy with the new fserror code
> > in Linus' tree, see the trace below.  The problem seems to be that
> > igrab takes i_lock to protect against a inode that is beeing freed.
> > Error reporting doesn't care about that, but we don't really have
> > a good interface to just grab a reference.
> > 
> > [  149.494670] ================================
> > [  149.494871] WARNING: inconsistent lock state
> > [  149.495073] 6.19.0+ #4827 Tainted: G                 N 
> > [  149.495336] --------------------------------
> > [  149.495560] inconsistent {HARDIRQ-ON-W} -> {IN-HARDIRQ-W} usage.
> > [  149.495857] swapper/1/0 [HC1[1]:SC0[0]:HE0:SE1] takes:
> > [  149.496111] ffff88811ed1b140 (&sb->s_type->i_lock_key#33){?.+.}-{3:3}, at: igrab+0x1a/0xb0
> > [  149.496543] {HARDIRQ-ON-W} state was registered at:
> > [  149.496853]   lock_acquire+0xca/0x2c0
> > [  149.497057]   _raw_spin_lock+0x2e/0x40
> > [  149.497257]   unlock_new_inode+0x2c/0xc0
> > [  149.497460]   xfs_iget+0xcf4/0x1080
> > [  149.497643]   xfs_trans_metafile_iget+0x3d/0x100
> > [  149.497882]   xfs_metafile_iget+0x2b/0x50
> > [  149.498144]   xfs_mount_setup_metadir+0x20/0x60
> > [  149.498163]   xfs_mountfs+0x457/0xa60
> > [  149.498163]   xfs_fs_fill_super+0x6b3/0xa90
> > [  149.498163]   get_tree_bdev_flags+0x13c/0x1e0
> > [  149.498163]   vfs_get_tree+0x27/0xe0
> > [  149.498163]   vfs_cmd_create+0x54/0xe0
> > [  149.498163]   __do_sys_fsconfig+0x309/0x620
> > [  149.498163]   do_syscall_64+0x8b/0xf80
> > [  149.498163]   entry_SYSCALL_64_after_hwframe+0x76/0x7e
> > [  149.498163] irq event stamp: 139080
> > [  149.498163] hardirqs last  enabled at (139079): [<ffffffff813a923c>] do_idle+0x1ec/0x270
> > [  149.498163] hardirqs last disabled at (139080): [<ffffffff828a8d09>] common_interrupt+0x19/0xe0
> > [  149.498163] softirqs last  enabled at (139032): [<ffffffff8134a853>] __irq_exit_rcu+0xc3/0x120
> > [  149.498163] softirqs last disabled at (139025): [<ffffffff8134a853>] __irq_exit_rcu+0xc3/0x120
> > [  149.498163] 
> > [  149.498163] other info that might help us debug this:
> > [  149.498163]  Possible unsafe locking scenario:
> > [  149.498163] 
> > [  149.498163]        CPU0
> > [  149.498163]        ----
> > [  149.498163]   lock(&sb->s_type->i_lock_key#33);
> > [  149.498163]   <Interrupt>
> > [  149.498163]     lock(&sb->s_type->i_lock_key#33);
> 
> Er... is lockdep telling us here that we could take i_lock in
> unlock_new_inode, get interrupted, and then take another i_lock?
> 
> > [  149.498163] 
> > [  149.498163]  *** DEADLOCK ***
> > [  149.498163] 
> > [  149.498163] 1 lock held by swapper/1/0:
> > [  149.498163]  #0: ffff8881052c81a0 (&vblk->vqs[i].lock){-.-.}-{3:3}, at: virtblk_done+0x4b/0x110
> > [  149.498163] 
> > [  149.498163] stack backtrace:
> > [  149.498163] CPU: 1 UID: 0 PID: 0 Comm: swapper/1 Tainted: G                 N  6.19.0+ #4827 PREEMPT(full) 
> > [  149.498163] Tainted: [N]=TEST
> > [  149.498163] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.17.0-0-gb52ca86e094d-prebuilt.qemu.org 04/01/2014
> > [  149.498163] Call Trace:
> > [  149.498163]  <IRQ>
> > [  149.498163]  dump_stack_lvl+0x5b/0x80
> > [  149.498163]  print_usage_bug.part.0+0x22c/0x2c0
> > [  149.498163]  mark_lock+0xa6f/0xe90
> > [  149.498163]  ? mempool_alloc_noprof+0x91/0x130
> > [  149.498163]  ? set_track_prepare+0x39/0x60
> > [  149.498163]  ? mempool_alloc_noprof+0x91/0x130
> > [  149.498163]  ? fserror_report+0x8a/0x260
> > [  149.498163]  ? iomap_finish_ioend_buffered+0x170/0x210
> > [  149.498163]  ? clone_endio+0x8f/0x1c0
> > [  149.498163]  ? blk_update_request+0x1e4/0x4d0
> > [  149.498163]  ? blk_mq_end_request+0x1b/0x100
> > [  149.498163]  ? virtblk_done+0x6f/0x110
> > [  149.498163]  ? vring_interrupt+0x59/0x80
> > [  149.498163]  ? __handle_irq_event_percpu+0x8a/0x2e0
> > [  149.498163]  ? handle_irq_event+0x33/0x70
> > [  149.498163]  ? handle_edge_irq+0xdd/0x1e0
> > [  149.498163]  __lock_acquire+0x10b6/0x25e0
> > [  149.498163]  ? __pcs_replace_empty_main+0x369/0x510
> > [  149.498163]  ? __pcs_replace_empty_main+0x369/0x510
> > [  149.498163]  lock_acquire+0xca/0x2c0
> > [  149.498163]  ? igrab+0x1a/0xb0
> > [  149.498163]  ? rcu_is_watching+0x11/0x50
> > [  149.498163]  ? __kmalloc_noprof+0x3ab/0x5a0
> > [  149.498163]  _raw_spin_lock+0x2e/0x40
> > [  149.498163]  ? igrab+0x1a/0xb0
> > [  149.498163]  igrab+0x1a/0xb0
> > [  149.498163]  fserror_report+0x135/0x260
> > [  149.498163]  iomap_finish_ioend_buffered+0x170/0x210
> > [  149.498163]  ? __pfx_stripe_end_io+0x10/0x10
> > [  149.498163]  clone_endio+0x8f/0x1c0
> > [  149.498163]  blk_update_request+0x1e4/0x4d0
> > [  149.498163]  ? __pfx_sg_pool_free+0x10/0x10
> > [  149.498163]  ? mempool_free+0x3d/0x50
> > [  149.498163]  blk_mq_end_request+0x1b/0x100
> > [  149.498163]  virtblk_done+0x6f/0x110
> > [  149.498163]  vring_interrupt+0x59/0x80
> > [  149.498163]  __handle_irq_event_percpu+0x8a/0x2e0
> > [  149.498163]  handle_irq_event+0x33/0x70
> > [  149.498163]  handle_edge_irq+0xdd/0x1e0
> > [  149.498163]  __common_interrupt+0x6f/0x180
> > [  149.498163]  common_interrupt+0xb7/0xe0
> 
> Hrmm, so we're calling fserror_report/igrab from an interrupt handler.
> The bio endio function is for writeback ioend completion.
> 
> igrab takes i_lock to check if the inode is in FREEING or WILL_FREE
> state.  However, the fact that it's in writeback presumably means that
> the vfs still holds an i_count on this inode, so the inode cannot be
> freed until iomap_finish_ioend_buffered completes.  So perhaps instead
> of calling igrab directly, could we perhaps get away with:
> 
> 	/*
> 	 * Only referenced inodes may be passed into this function!
> 	 * This means they cannot be INEW, FREEING, or WILL_FREE.
> 	 *
> 	 * Can't iput from non-sleeping context, so grabbing another
> 	 * reference must be the last thing before submitting the event
> 	 */
> 	if (inode && !atomic_inc_not_zero(&inode->i_count)) {
> 		/* warn about precondition violation and lost error */
> 		goto lost_event;
> 	}
> 
> 	schedule_work(&event->work);
> 
> Hm?

/me hands himself another cup of coffee, changes that to:

	/*
	 * Can't iput from non-sleeping context, so grabbing another
	 * reference to the inode must be the last thing before
	 * submitting the event.  Open-code the igrab here to avoid
	 * taking i_lock in interrupt context.
	 */
	if (inode) {
		WARN_ON_ONCE(inode_unhashed(inode));
		WARN_ON_ONCE(inode_state_read_once(inode) &
					(I_NEW | I_FREEING | I_WILL_FREE));

		if (!atomic_inc_not_zero(&inode->i_count))
			goto lost_event;
		event->inode = inode;
	}

This seems to fix the lockdep report, will send that out for full
testing this evening.

--D

> It also occurred to me that we shouldn't be calling fserror_report on
> any of the metadata inodes, which means another fixpatch is needed for
> the callsites in xfs_health.c.
> 
> --D
> 
> > [  149.498163]  </IRQ>
> > [  149.498163]  <TASK>
> > [  149.498163]  asm_common_interrupt+0x26/0x40
> > [  149.498163] RIP: 0010:default_idle+0xf/0x20
> > [  149.498163] Code: 4c 01 c7 4c 29 c2 e9 6e ff ff ff 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 f3 0f 1e fa eb 07 0f 00 2d a9 88 15 00 fb f4 <fa> c3 cc cc cc cc 66 66 2e 0f 1f 84 00 00 00 00 00 90 90 90 90 90
> > [  149.498163] RSP: 0018:ffffc9000009fed8 EFLAGS: 00000206
> > [  149.498163] RAX: 0000000000021f47 RBX: ffff888100ad53c0 RCX: 0000000000000000
> > [  149.498163] RDX: 0000000000000000 RSI: ffffffff83278fb9 RDI: ffffffff8329090b
> > [  149.498163] RBP: 0000000000000001 R08: 0000000000000001 R09: 0000000000000000
> > [  149.498163] R10: 0000000000000001 R11: 0000000000000000 R12: 0000000000000000
> > [  149.498163] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
> > [  149.498163]  default_idle_call+0x7e/0x1a0
> > [  149.498163]  do_idle+0x1ec/0x270
> > [  149.498163]  cpu_startup_entry+0x24/0x30
> > [  149.498163]  start_secondary+0xf7/0x100
> > [  149.498163]  common_startup_64+0x13e/0x148
> > [  149.498163]  </TASK>
>

next prev parent reply	other threads:[~2026-02-13 19:07 UTC|newest]

Thread overview: 9+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-02-13  6:15 inconsistent lock state in the new fserror code Christoph Hellwig
2026-02-13 16:00 ` Darrick J. Wong
2026-02-13 19:07   ` Darrick J. Wong [this message]
2026-02-13 22:38     ` Dave Chinner
2026-02-14  5:55       ` Darrick J. Wong
2026-02-17  5:47       ` Christoph Hellwig
2026-02-18 19:00         ` Darrick J. Wong
2026-02-19  5:53           ` Christoph Hellwig
2026-02-19  5:59             ` Darrick J. Wong

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20260213190757.GJ7693@frogsfrogsfrogs \
    --to=djwong@kernel.org \
    --cc=brauner@kernel.org \
    --cc=hch@infradead.org \
    --cc=jack@suse.cz \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-xfs@vger.kernel.org \
    --cc=viro@zeniv.linux.org.uk \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox