Re: shared/298 lockdep splat?

public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed

* Re: shared/298 lockdep splat?
       [not found] <20170920211042.GH7112@magnolia>
@ 2017-09-20 22:22 ` Dave Chinner
  2017-09-21  8:47   ` Byungchul Park
  0 siblings, 1 reply; 4+ messages in thread
From: Dave Chinner @ 2017-09-20 22:22 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: xfs, Peter Zijlstra, Byungchul Park, linux-kernel

[cc lkml, PeterZ and Byungchul]

On Wed, Sep 20, 2017 at 02:10:42PM -0700, Darrick J. Wong wrote:
> Hello list,
> 
> Yesterday I tried setting up qemu 2.10 with some (fake) nvdimms backed
> by an on-disk file.  Midway through a -g auto xfstests run, shared/298
> produced the attached dmesg spew.  I'll try to have a look later, but
> in the meantime I'm doing the 'complain to list, see if anyone bites'
> thing. :)
> 
> The kernel is 4.14-rc1 without any patches applied.

tl;dr: it's a false positive because it's conflating the independent
locking/completion contexts of upper and lower filesystems when
the loop device is in use.

I've mangled the report in replying - Darrick posted it here, too:

https://pastebin.com/GshLpH7x

> ======================================================
> WARNING: possible circular locking dependency detected
> 4.14.0-rc1-fixes #1 Tainted: G        W      
> ------------------------------------------------------
> loop0/31693 is trying to acquire lock:
>  (&(&ip->i_mmaplock)->mr_lock){++++}, at: [<ffffffffa00f1b0c>] xfs_ilock+0x23c/0x330 [xfs]
> 
> but now in release context of a crosslock acquired at the following:
>  ((complete)&ret.event){+.+.}, at: [<ffffffff81326c1f>] submit_bio_wait+0x7f/0xb0
> 
> which lock already depends on the new lock.
> 
> 
> the existing dependency chain (in reverse order) is:
> 
> -> #2 ((complete)&ret.event){+.+.}:
>        lock_acquire+0xab/0x200
>        wait_for_completion_io+0x4e/0x1a0
>        submit_bio_wait+0x7f/0xb0
>        blkdev_issue_zeroout+0x71/0xa0
>        xfs_bmapi_convert_unwritten+0x11f/0x1d0 [xfs]
>        xfs_bmapi_write+0x374/0x11f0 [xfs]
>        xfs_iomap_write_direct+0x2ac/0x430 [xfs]
>        xfs_file_iomap_begin+0x20d/0xd50 [xfs]
>        iomap_apply+0x43/0xe0
>        dax_iomap_rw+0x89/0xf0
>        xfs_file_dax_write+0xcc/0x220 [xfs]
>        xfs_file_write_iter+0xf0/0x130 [xfs]
>        __vfs_write+0xd9/0x150
>        vfs_write+0xc8/0x1c0
>        SyS_write+0x45/0xa0
>        entry_SYSCALL_64_fastpath+0x1f/0xbe

This is on the lower filesystem - it's a write() syscall on a DAX
enable filesystem. Locking in this case is:

Context			function		lock
Lower filesystem	xfs_file_dax_write	XFS_IOLOCK_EXCL
Lower filesystem	xfs_file_iomap_begin	XFS_ILOCK_EXCL
Lower filesystem/
   lower blkdev		submit_bio_wait		wait for dax block
						device IO completion

Which gives: IOLOCK -> &xfs_nondir_ilock_class -> IO completion

In the lower filesystem.

> -> #1 (&xfs_nondir_ilock_class){++++}:
>        lock_acquire+0xab/0x200
>        down_write_nested+0x4a/0xb0
>        xfs_ilock+0x263/0x330 [xfs]
>        xfs_setattr_size+0x152/0x370 [xfs]
>        xfs_vn_setattr+0x6b/0x90 [xfs]
>        notify_change+0x27d/0x3f0
>        do_truncate+0x5b/0x90
>        path_openat+0x237/0xa90
>        do_filp_open+0x8a/0xf0
>        do_sys_open+0x11c/0x1f0
>        entry_SYSCALL_64_fastpath+0x1f/0xbe

Context			function		lock
Lower filesystem	xfs_vn_setattr		XFS_IOLOCK_EXCL
Lower filesystem	xfs_vn_setattr		XFS_MMAPLOCK_EXCL
Lower filesystem	xfs_setattr_size	XFS_ILOCK_EXCL

Which gives: IOLOCK -> MMAPLOCK -> &xfs_nondir_ilock_class

i.e.: IOLOCK -> MMAPLOCK -> &xfs_nondir_ilock_class -> IO completion

In the lower filesystem.

> 
> -> #0 (&(&ip->i_mmaplock)->mr_lock){++++}:
>        up_write+0x1c/0x40
>        xfs_iunlock+0x1d0/0x310 [xfs]
>        xfs_file_fallocate+0x8a/0x310 [xfs]
>        loop_queue_work+0xb7/0x8d0
>        kthread_worker_fn+0xb9/0x1f0

Context			function		lock
Lower filesystem	xfs_file_fallocate	XFS_IOLOCK_EXCL
Lower filesystem	xfs_file_fallocate	XFS_MMAPLOCK_EXCL
[Lower filesystem	whatever falloc calls	XFS_ILOCK_EXCL]
Upper filesystem/
   Upper blkd		loop_queue_work		blk_mq_complete_request

This is clearer when the reported stack is looked at:

> lock_commit_crosslock+0x3e9/0x5c0
> complete+0x24/0x60
> blk_update_request+0xc2/0x3e0
> blk_mq_end_request+0x18/0x80
> __blk_mq_complete_request+0x9f/0x170
> loop_queue_work+0x51/0x8d0
> kthread_worker_fn+0xb9/0x1f0
> kthread+0x148/0x180
> ret_from_fork+0x2a/0x40

So this is reporting as:

> Chain exists of:
>   &(&ip->i_mmaplock)->mr_lock --> &xfs_nondir_ilock_class --> (complete)&ret.event
>
>  Possible unsafe locking scenario by crosslock:
>
>        CPU0                    CPU1
>        ----                    ----
>   lock(&xfs_nondir_ilock_class);
>   lock((complete)&ret.event);
>                                lock(&(&ip->i_mmaplock)->mr_lock);
>                                unlock((complete)&ret.event);
>
>                *** DEADLOCK ***

Which effectively means:

	IO completion -> XFS_MMAPLOCK_EXCL .....

The thing is, this IO completion has nothing to do with the lower
filesystem - it's the IO completion for the filesystem on the loop
device (the upper filesystem) and is not in any way related to the
IO completion from the dax device the lower filesystem is waiting
on.

IOWs, this is a false positive.

Peter, this is the sort of false positive I mentioned were likely to
occur without some serious work to annotate the IO stack to prevent
them.  We can nest multiple layers of IO completions and locking in
the IO stack via things like loop and RAID devices.  They can be
nested to arbitrary depths, too (e.g. loop on fs on loop on fs on
dm-raid on n * (loop on fs) on bdev) so this new completion lockdep
checking is going to be a source of false positives until there is
an effective (and simple!) way of providing context based completion
annotations to avoid them...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: shared/298 lockdep splat?
  2017-09-20 22:22 ` shared/298 lockdep splat? Dave Chinner
@ 2017-09-21  8:47   ` Byungchul Park
  2017-09-26  3:51     ` Dave Chinner
  0 siblings, 1 reply; 4+ messages in thread
From: Byungchul Park @ 2017-09-21  8:47 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Darrick J. Wong, xfs, Peter Zijlstra, linux-kernel, kernel-team

On Thu, Sep 21, 2017 at 08:22:56AM +1000, Dave Chinner wrote:
> > ======================================================
> > WARNING: possible circular locking dependency detected
> > 4.14.0-rc1-fixes #1 Tainted: G        W      
> > ------------------------------------------------------
> > loop0/31693 is trying to acquire lock:
> >  (&(&ip->i_mmaplock)->mr_lock){++++}, at: [<ffffffffa00f1b0c>] xfs_ilock+0x23c/0x330 [xfs]
> > 
> > but now in release context of a crosslock acquired at the following:
> >  ((complete)&ret.event){+.+.}, at: [<ffffffff81326c1f>] submit_bio_wait+0x7f/0xb0
> > 
> > which lock already depends on the new lock.
> > 
> > 
> > the existing dependency chain (in reverse order) is:
> > 
> > -> #2 ((complete)&ret.event){+.+.}:
> >        lock_acquire+0xab/0x200
> >        wait_for_completion_io+0x4e/0x1a0
> >        submit_bio_wait+0x7f/0xb0
> >        blkdev_issue_zeroout+0x71/0xa0
> >        xfs_bmapi_convert_unwritten+0x11f/0x1d0 [xfs]
> >        xfs_bmapi_write+0x374/0x11f0 [xfs]
> >        xfs_iomap_write_direct+0x2ac/0x430 [xfs]
> >        xfs_file_iomap_begin+0x20d/0xd50 [xfs]
> >        iomap_apply+0x43/0xe0
> >        dax_iomap_rw+0x89/0xf0
> >        xfs_file_dax_write+0xcc/0x220 [xfs]
> >        xfs_file_write_iter+0xf0/0x130 [xfs]
> >        __vfs_write+0xd9/0x150
> >        vfs_write+0xc8/0x1c0
> >        SyS_write+0x45/0xa0
> >        entry_SYSCALL_64_fastpath+0x1f/0xbe
> 
> This is on the lower filesystem - it's a write() syscall on a DAX
> enable filesystem. Locking in this case is:
> 
> Context			function		lock
> Lower filesystem	xfs_file_dax_write	XFS_IOLOCK_EXCL
> Lower filesystem	xfs_file_iomap_begin	XFS_ILOCK_EXCL
> Lower filesystem/
>    lower blkdev		submit_bio_wait		wait for dax block
> 						device IO completion
> 
> Which gives: IOLOCK -> &xfs_nondir_ilock_class -> IO completion
> 
> In the lower filesystem.
> 
> > -> #1 (&xfs_nondir_ilock_class){++++}:
> >        lock_acquire+0xab/0x200
> >        down_write_nested+0x4a/0xb0
> >        xfs_ilock+0x263/0x330 [xfs]
> >        xfs_setattr_size+0x152/0x370 [xfs]
> >        xfs_vn_setattr+0x6b/0x90 [xfs]
> >        notify_change+0x27d/0x3f0
> >        do_truncate+0x5b/0x90
> >        path_openat+0x237/0xa90
> >        do_filp_open+0x8a/0xf0
> >        do_sys_open+0x11c/0x1f0
> >        entry_SYSCALL_64_fastpath+0x1f/0xbe
> 
> Context			function		lock
> Lower filesystem	xfs_vn_setattr		XFS_IOLOCK_EXCL
> Lower filesystem	xfs_vn_setattr		XFS_MMAPLOCK_EXCL
> Lower filesystem	xfs_setattr_size	XFS_ILOCK_EXCL
> 
> Which gives: IOLOCK -> MMAPLOCK -> &xfs_nondir_ilock_class
> 
> i.e.: IOLOCK -> MMAPLOCK -> &xfs_nondir_ilock_class -> IO completion
> 
> In the lower filesystem.
> 
> > 
> > -> #0 (&(&ip->i_mmaplock)->mr_lock){++++}:
> >        up_write+0x1c/0x40
> >        xfs_iunlock+0x1d0/0x310 [xfs]
> >        xfs_file_fallocate+0x8a/0x310 [xfs]
> >        loop_queue_work+0xb7/0x8d0
> >        kthread_worker_fn+0xb9/0x1f0
> 
> Context			function		lock
> Lower filesystem	xfs_file_fallocate	XFS_IOLOCK_EXCL
> Lower filesystem	xfs_file_fallocate	XFS_MMAPLOCK_EXCL
> [Lower filesystem	whatever falloc calls	XFS_ILOCK_EXCL]
> Upper filesystem/
>    Upper blkd		loop_queue_work		blk_mq_complete_request
> 
> This is clearer when the reported stack is looked at:
> 
> > lock_commit_crosslock+0x3e9/0x5c0
> > complete+0x24/0x60
> > blk_update_request+0xc2/0x3e0
> > blk_mq_end_request+0x18/0x80
> > __blk_mq_complete_request+0x9f/0x170
> > loop_queue_work+0x51/0x8d0
> > kthread_worker_fn+0xb9/0x1f0
> > kthread+0x148/0x180
> > ret_from_fork+0x2a/0x40
> 
> So this is reporting as:
> 
> > Chain exists of:
> >   &(&ip->i_mmaplock)->mr_lock --> &xfs_nondir_ilock_class --> (complete)&ret.event
> >
> >  Possible unsafe locking scenario by crosslock:
> >
> >        CPU0                    CPU1
> >        ----                    ----
> >   lock(&xfs_nondir_ilock_class);
> >   lock((complete)&ret.event);
> >                                lock(&(&ip->i_mmaplock)->mr_lock);
> >                                unlock((complete)&ret.event);
> >
> >                *** DEADLOCK ***
> 
> Which effectively means:
> 
> 	IO completion -> XFS_MMAPLOCK_EXCL .....
> 
> The thing is, this IO completion has nothing to do with the lower
> filesystem - it's the IO completion for the filesystem on the loop
> device (the upper filesystem) and is not in any way related to the
> IO completion from the dax device the lower filesystem is waiting
> on.
> 
> IOWs, this is a false positive.
> 
> Peter, this is the sort of false positive I mentioned were likely to
> occur without some serious work to annotate the IO stack to prevent
> them.  We can nest multiple layers of IO completions and locking in
> the IO stack via things like loop and RAID devices.  They can be
> nested to arbitrary depths, too (e.g. loop on fs on loop on fs on
> dm-raid on n * (loop on fs) on bdev) so this new completion lockdep
> checking is going to be a source of false positives until there is
> an effective (and simple!) way of providing context based completion
> annotations to avoid them...

Hello,

It looks caused by that &ret.event in submit_bio_wait() is initialized
with the same class for all layers. I mean that completion variables in
different layers should be initialized with different classes, as you do
for typical locks in xfs.

I am not sure if I understand how xfs works correctly. Right? If yes,
how can we distinguish between independent 'bio's in submit_bio_wait()?
You or I can make it work with the answer. No?

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: shared/298 lockdep splat?
  2017-09-21  8:47   ` Byungchul Park
@ 2017-09-26  3:51     ` Dave Chinner
  2017-09-26  7:04       ` Byungchul Park
  0 siblings, 1 reply; 4+ messages in thread
From: Dave Chinner @ 2017-09-26  3:51 UTC (permalink / raw)
  To: Byungchul Park
  Cc: Darrick J. Wong, xfs, Peter Zijlstra, linux-kernel, kernel-team

On Thu, Sep 21, 2017 at 05:47:14PM +0900, Byungchul Park wrote:
> On Thu, Sep 21, 2017 at 08:22:56AM +1000, Dave Chinner wrote:
> > Peter, this is the sort of false positive I mentioned were likely to
> > occur without some serious work to annotate the IO stack to prevent
> > them.  We can nest multiple layers of IO completions and locking in
> > the IO stack via things like loop and RAID devices.  They can be
> > nested to arbitrary depths, too (e.g. loop on fs on loop on fs on
> > dm-raid on n * (loop on fs) on bdev) so this new completion lockdep
> > checking is going to be a source of false positives until there is
> > an effective (and simple!) way of providing context based completion
> > annotations to avoid them...
> 
> Hello,
> 
> It looks caused by that &ret.event in submit_bio_wait() is initialized
> with the same class for all layers. I mean that completion variables in
> different layers should be initialized with different classes, as you do
> for typical locks in xfs.

Except that submit_bio_wait() is generic block layer functionality
and can be used by anyone. Whatever solution you decide on, it has
to be generic. And keep in mind that any code that submits a bio
themselves and waits on a completion event from the bio is going to
have to do their own annotations, which makes this a real PITA.

> I am not sure if I understand how xfs works correctly. Right? If yes,
> how can we distinguish between independent 'bio's in submit_bio_wait()?
> You or I can make it work with the answer. No?

Has nothing to do with XFS - it has no clue where it sits in the
block device stack and has no business screwing with bio internals
and stack layering to handle issues with stacked block devices....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: shared/298 lockdep splat?
  2017-09-26  3:51     ` Dave Chinner
@ 2017-09-26  7:04       ` Byungchul Park
  0 siblings, 0 replies; 4+ messages in thread
From: Byungchul Park @ 2017-09-26  7:04 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Darrick J. Wong, xfs, Peter Zijlstra, linux-kernel, kernel-team

On Tue, Sep 26, 2017 at 01:51:49PM +1000, Dave Chinner wrote:
> On Thu, Sep 21, 2017 at 05:47:14PM +0900, Byungchul Park wrote:
> > On Thu, Sep 21, 2017 at 08:22:56AM +1000, Dave Chinner wrote:
> > > Peter, this is the sort of false positive I mentioned were likely to
> > > occur without some serious work to annotate the IO stack to prevent
> > > them.  We can nest multiple layers of IO completions and locking in
> > > the IO stack via things like loop and RAID devices.  They can be
> > > nested to arbitrary depths, too (e.g. loop on fs on loop on fs on
> > > dm-raid on n * (loop on fs) on bdev) so this new completion lockdep
> > > checking is going to be a source of false positives until there is
> > > an effective (and simple!) way of providing context based completion
> > > annotations to avoid them...
> > 
> > Hello,
> > 
> > It looks caused by that &ret.event in submit_bio_wait() is initialized
> > with the same class for all layers. I mean that completion variables in
> > different layers should be initialized with different classes, as you do
> > for typical locks in xfs.
> 
> Except that submit_bio_wait() is generic block layer functionality
> and can be used by anyone. Whatever solution you decide on, it has
> to be generic. And keep in mind that any code that submits a bio
> themselves and waits on a completion event from the bio is going to
> have to do their own annotations, which makes this a real PITA.

Right. Agree. Let me think it more. As you said, it should be generic.

> > I am not sure if I understand how xfs works correctly. Right? If yes,
> > how can we distinguish between independent 'bio's in submit_bio_wait()?
> > You or I can make it work with the answer. No?
> 
> Has nothing to do with XFS - it has no clue where it sits in the
> block device stack and has no business screwing with bio internals
> and stack layering to handle issues with stacked block devices....

Ok. Thank you for replying.

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

^ permalink raw reply	[flat|nested] 4+ messages in thread

end of thread, other threads:[~2017-09-26  7:04 UTC | newest]

Thread overview: 4+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
     [not found] <20170920211042.GH7112@magnolia>
2017-09-20 22:22 ` shared/298 lockdep splat? Dave Chinner
2017-09-21  8:47   ` Byungchul Park
2017-09-26  3:51     ` Dave Chinner
2017-09-26  7:04       ` Byungchul Park

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox