Re: inconsistent lock state in the new fserror code

From: "Darrick J. Wong" <djwong@kernel.org>
To: Dave Chinner <dgc@kernel.org>
Cc: Christoph Hellwig <hch@infradead.org>,
	linux-xfs@vger.kernel.org, linux-fsdevel@vger.kernel.org,
	viro@zeniv.linux.org.uk, brauner@kernel.org, jack@suse.cz
Subject: Re: inconsistent lock state in the new fserror code
Date: Fri, 13 Feb 2026 21:55:36 -0800
Message-ID: <20260214055536.GW1535390@frogsfrogsfrogs>
In-Reply-To: <aY-n4leNi4NCzri1@dread>

On Sat, Feb 14, 2026 at 09:38:26AM +1100, Dave Chinner wrote:
> On Fri, Feb 13, 2026 at 11:07:57AM -0800, Darrick J. Wong wrote:
> > On Fri, Feb 13, 2026 at 08:00:41AM -0800, Darrick J. Wong wrote:
> > > On Thu, Feb 12, 2026 at 10:15:57PM -0800, Christoph Hellwig wrote:
> > > > [  149.498163] other info that might help us debug this:
> > > > [  149.498163]  Possible unsafe locking scenario:
> > > > [  149.498163] 
> > > > [  149.498163]        CPU0
> > > > [  149.498163]        ----
> > > > [  149.498163]   lock(&sb->s_type->i_lock_key#33);
> > > > [  149.498163]   <Interrupt>
> > > > [  149.498163]     lock(&sb->s_type->i_lock_key#33);
> > > 
> > > Er... is lockdep telling us here that we could take i_lock in
> > > unlock_new_inode, get interrupted, and then take another i_lock?
> 
> Yes.
> 
> > > > [  149.498163] 
> > > > [  149.498163]  *** DEADLOCK ***
> > > > [  149.498163] 
> > > > [  149.498163] 1 lock held by swapper/1/0:
> > > > [  149.498163]  #0: ffff8881052c81a0 (&vblk->vqs[i].lock){-.-.}-{3:3}, at: virtblk_done+0x4b/0x110
> > > > [  149.498163] 
> > > > [  149.498163] stack backtrace:
> > > > [  149.498163] CPU: 1 UID: 0 PID: 0 Comm: swapper/1 Tainted: G                 N  6.19.0+ #4827 PREEMPT(full) 
> > > > [  149.498163] Tainted: [N]=TEST
> > > > [  149.498163] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.17.0-0-gb52ca86e094d-prebuilt.qemu.org 04/01/2014
> > > > [  149.498163] Call Trace:
> > > > [  149.498163]  <IRQ>
> > > > [  149.498163]  dump_stack_lvl+0x5b/0x80
> > > > [  149.498163]  print_usage_bug.part.0+0x22c/0x2c0
> > > > [  149.498163]  mark_lock+0xa6f/0xe90
> > > > [  149.498163]  __lock_acquire+0x10b6/0x25e0
> > > > [  149.498163]  lock_acquire+0xca/0x2c0
> > > > [  149.498163]  _raw_spin_lock+0x2e/0x40
> > > > [  149.498163]  igrab+0x1a/0xb0
> > > > [  149.498163]  fserror_report+0x135/0x260
> > > > [  149.498163]  iomap_finish_ioend_buffered+0x170/0x210
> > > > [  149.498163]  clone_endio+0x8f/0x1c0
> > > > [  149.498163]  blk_update_request+0x1e4/0x4d0
> > > > [  149.498163]  blk_mq_end_request+0x1b/0x100
> > > > [  149.498163]  virtblk_done+0x6f/0x110
> > > > [  149.498163]  vring_interrupt+0x59/0x80
> 
> Ok, so why are we calling iomap_finish_ioend_buffered() from IRQ
> context? That looks like a bug because the only IO completion call
> chain that can get into iomap_finish_ioend_buffered() is supposedly:
> 
> iomap_finish_ioends
>   iomap_finish_ioend
>     iomap_finish_ioend_buffered
> 
> And the comment above iomap_finish_ioends() says:
> 
> /*
>  * Ioend completion routine for merged bios. This can only be called from task
>  * contexts as merged ioends can be of unbound length. Hence we have to break up
>  * the writeback completions into manageable chunks to avoid long scheduler
>  * holdoffs. We aim to keep scheduler holdoffs down below 10ms so that we get
>  * good batch processing throughput without creating adverse scheduler latency
>  * conditions.
>  */
> 
> Ah, there's the problem - pure buffered overwrites from XFS use
> ioend_writeback_end_bio(), not xfs_end_bio(). Hence the buffered
> write completion is not punted to a workqueue, and it calls
> iomap_finish_ioend_buffered() direct from the bio completion
> context.
> 
> Yeah, that seems like a bug that needs fixing in the
> ioend_writeback_end_bio() function - if there's an IO error, it
> needs to punt the processing of the ioend to a workqueue...

<nod> That's a much simpler approach, particularly if we're only bumping
to a workqueue to handle IO errors (which means there's no need for
merging).
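
A minimal userspace sketch of the "only punt errors" idea, not kernel
code: every name below (model_ioend, model_end_bio, model_run_deferred)
is hypothetical, a plain singly linked list stands in for whatever
queue plus workqueue a real fix would use, and the in_irq flag only
records which context ran the completion.  It assumes a single thread,
so the list has no locking.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for an ioend; not the kernel structure. */
struct model_ioend {
	int error;			/* 0 on success, negative errno on failure */
	bool finished;			/* completion processing has run */
	bool finished_in_irq;		/* context that ran the completion */
	struct model_ioend *next;	/* link on the deferred list */
};

/* Deferred list; real code would need an IRQ-safe lock or lockless list. */
static struct model_ioend *deferred_head;
static bool in_irq;

/* Models the completion work that may end up taking i_lock on error. */
static void model_finish_ioend(struct model_ioend *ioend)
{
	ioend->finished = true;
	ioend->finished_in_irq = in_irq;
}

/* Models the bio end_io callback, which runs in interrupt context. */
static void model_end_bio(struct model_ioend *ioend)
{
	in_irq = true;
	if (ioend->error) {
		/* Error: queue it so the reporting runs in task context. */
		ioend->next = deferred_head;
		deferred_head = ioend;
	} else {
		/* Success: finish inline, as pure overwrites do today. */
		model_finish_ioend(ioend);
	}
	in_irq = false;
}

/* Models the worker: drains the deferred list outside IRQ context. */
static int model_run_deferred(void)
{
	int nr = 0;

	while (deferred_head) {
		struct model_ioend *ioend = deferred_head;

		deferred_head = ioend->next;
		ioend->next = NULL;
		model_finish_ioend(ioend);
		nr++;
	}
	return nr;
}
```

Successful completions still finish inline; only the failed ioend waits
for the worker, and since each one is queued individually nothing here
tries to merge them.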

> > > > [  149.498163]  __handle_irq_event_percpu+0x8a/0x2e0
> > > > [  149.498163]  handle_irq_event+0x33/0x70
> > > > [  149.498163]  handle_edge_irq+0xdd/0x1e0
> > > > [  149.498163]  __common_interrupt+0x6f/0x180
> > > > [  149.498163]  common_interrupt+0xb7/0xe0
> > > 
> > > Hrmm, so we're calling fserror_report/igrab from an interrupt handler.
> > > The bio endio function is for writeback ioend completion.
> 
> Yup, this is one of the reasons writeback doesn't hold an inode
> reference over IO - we can't call iput() from an interrupt context.
> 
> > > igrab takes i_lock to check if the inode is in FREEING or WILL_FREE
> > > state.  However, the fact that it's in writeback presumably means that
> > > the vfs still holds an i_count on this inode,
> 
> Writeback holds an inode reference over submission only.
> 
> > > so the inode cannot be
> > > freed until iomap_finish_ioend_buffered completes.
> 
> iput()->iput_final()->evict will block in inode_wait_for_writeback()
> waiting for outstanding writeback to complete before it starts
> tearing down the inode. This isn't controlled by reference counts.
> 
> > /me hands himself another cup of coffee, changes that to:
> > 
> > 	/*
> > 	 * Can't iput from non-sleeping context, so grabbing another
> > 	 * reference to the inode must be the last thing before
> > 	 * submitting the event.  Open-code the igrab here to avoid
> > 	 * taking i_lock in interrupt context.
> > 	 */
> > 	if (inode) {
> > 		WARN_ON_ONCE(inode_unhashed(inode));
> > 		WARN_ON_ONCE(inode_state_read_once(inode) &
> > 					(I_NEW | I_FREEING | I_WILL_FREE));
> 
> It is valid for the inode to have a zero reference count and have either
> I_FREEING or I_WILL_FREE set here if another task has dropped the
> final inode reference while writeback IO is still in flight.
> 
> > 		if (!atomic_inc_not_zero(&inode->i_count))
> > 			goto lost_event;
> 
> Overall, I'm not sure using atomic_inc_not_zero() is safe here. It
> may be, but I don't think this is how the problem should be solved.

I /think/ it works because evict waits for writeback to end (so the
inode can't go away) and we never attach the inode to the error event if
the i_count already hit zero buuut this is a code smell anyway so I've
little interest in pursuing this part further.
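
For reference, a userspace sketch of the increment-unless-zero
semantics being discussed, written with C11 atomics rather than the
kernel's atomic_t API.  The helper name model_inc_not_zero is
hypothetical, and the sketch models only the counter: it says nothing
about evict waiting for writeback, which is the part the safety
argument above actually leans on.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/*
 * Take a reference only if the count is currently nonzero.
 * Returns true if the count was incremented, false if it was
 * already zero (the "lost_event" case in the snippet above).
 */
static bool model_inc_not_zero(atomic_int *count)
{
	int old = atomic_load(count);

	while (old != 0) {
		/* On failure the CAS reloads 'old' with the current value. */
		if (atomic_compare_exchange_weak(count, &old, old + 1))
			return true;
	}
	return false;
}
```

A count that has already dropped to zero is never resurrected, so a
caller that gets false has to give up on attaching the object.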

> Punt ioend w/ IO errors to a work queue, and then nothing needs to
> change w.r.t. the fserror handling of the inodes. i.e. it will be
> safe to use inode->i_lock and hence igrab()...

<nod> Will test that out.  Thanks for the suggestion.

--D

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> dgc@kernel.org
