* shared/298 lockdep splat?
From: Darrick J. Wong @ 2017-09-20 21:10 UTC
To: xfs
Hello list,
Yesterday I tried setting up qemu 2.10 with some (fake) nvdimms backed
by an on-disk file. Midway through a -g auto xfstests run, shared/298
produced the attached dmesg spew. I'll try to have a look later, but
in the meantime I'm doing the 'complain to list, see if anyone bites'
thing. :)
The kernel is 4.14-rc1 without any patches applied.
--D
======================================================
WARNING: possible circular locking dependency detected
4.14.0-rc1-fixes #1 Tainted: G W
------------------------------------------------------
loop0/31693 is trying to acquire lock:
(&(&ip->i_mmaplock)->mr_lock){++++}, at: [<ffffffffa00f1b0c>] xfs_ilock+0x23c/0x330 [xfs]
but now in release context of a crosslock acquired at the following:
((complete)&ret.event){+.+.}, at: [<ffffffff81326c1f>] submit_bio_wait+0x7f/0xb0
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #2 ((complete)&ret.event){+.+.}:
lock_acquire+0xab/0x200
wait_for_completion_io+0x4e/0x1a0
submit_bio_wait+0x7f/0xb0
blkdev_issue_zeroout+0x71/0xa0
xfs_bmapi_convert_unwritten+0x11f/0x1d0 [xfs]
xfs_bmapi_write+0x374/0x11f0 [xfs]
xfs_iomap_write_direct+0x2ac/0x430 [xfs]
xfs_file_iomap_begin+0x20d/0xd50 [xfs]
iomap_apply+0x43/0xe0
dax_iomap_rw+0x89/0xf0
xfs_file_dax_write+0xcc/0x220 [xfs]
xfs_file_write_iter+0xf0/0x130 [xfs]
__vfs_write+0xd9/0x150
vfs_write+0xc8/0x1c0
SyS_write+0x45/0xa0
entry_SYSCALL_64_fastpath+0x1f/0xbe
-> #1 (&xfs_nondir_ilock_class){++++}:
lock_acquire+0xab/0x200
down_write_nested+0x4a/0xb0
xfs_ilock+0x263/0x330 [xfs]
xfs_setattr_size+0x152/0x370 [xfs]
xfs_vn_setattr+0x6b/0x90 [xfs]
notify_change+0x27d/0x3f0
do_truncate+0x5b/0x90
path_openat+0x237/0xa90
do_filp_open+0x8a/0xf0
do_sys_open+0x11c/0x1f0
entry_SYSCALL_64_fastpath+0x1f/0xbe
-> #0 (&(&ip->i_mmaplock)->mr_lock){++++}:
up_write+0x1c/0x40
xfs_iunlock+0x1d0/0x310 [xfs]
xfs_file_fallocate+0x8a/0x310 [xfs]
loop_queue_work+0xb7/0x8d0
kthread_worker_fn+0xb9/0x1f0
other info that might help us debug this:
Chain exists of:
&(&ip->i_mmaplock)->mr_lock --> &xfs_nondir_ilock_class --> (complete)&ret.event
Possible unsafe locking scenario by crosslock:
CPU0 CPU1
---- ----
lock(&xfs_nondir_ilock_class);
lock((complete)&ret.event);
lock(&(&ip->i_mmaplock)->mr_lock);
unlock((complete)&ret.event);
*** DEADLOCK ***
1 lock held by loop0/31693:
#0: (&x->wait#16){-...}, at: [<ffffffff810d1858>] complete+0x18/0x60
stack backtrace:
CPU: 2 PID: 31693 Comm: loop0 Tainted: G W 4.14.0-rc1-fixes #1
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.10.2-1ubuntu1 04/01/2014
Call Trace:
dump_stack+0x7c/0xbe
print_circular_bug+0x204/0x310
? graph_unlock+0x70/0x70
check_prev_add+0x401/0x800
? __lock_acquire+0x72a/0x1100
? __lock_acquire+0x534/0x1100
? lock_commit_crosslock+0x3e9/0x5c0
lock_commit_crosslock+0x3e9/0x5c0
complete+0x24/0x60
blk_update_request+0xc2/0x3e0
blk_mq_end_request+0x18/0x80
__blk_mq_complete_request+0x9f/0x170
loop_queue_work+0x51/0x8d0
? kthread_worker_fn+0x96/0x1f0
kthread_worker_fn+0xb9/0x1f0
kthread+0x148/0x180
? loop_get_status64+0x80/0x80
? kthread_create_on_node+0x40/0x40
ret_from_fork+0x2a/0x40
XFS (loop0): EXPERIMENTAL reverse mapping btree feature enabled. Use at your own risk!
XFS (loop0): EXPERIMENTAL reflink feature enabled. Use at your own risk!
XFS (loop0): Mounting V5 Filesystem
XFS (loop0): Ending clean mount
XFS (loop0): Unmounting Filesystem
XFS (pmem3): Unmounting Filesystem
XFS (pmem3): EXPERIMENTAL reverse mapping btree feature enabled. Use at your own risk!
XFS (pmem3): EXPERIMENTAL reflink feature enabled. Use at your own risk!
XFS (pmem3): Mounting V5 Filesystem
XFS (pmem3): Ending clean mount
* Re: shared/298 lockdep splat?
From: Dave Chinner @ 2017-09-20 22:22 UTC
To: Darrick J. Wong; +Cc: xfs, Peter Zijlstra, Byungchul Park, linux-kernel
[cc lkml, PeterZ and Byungchul]
On Wed, Sep 20, 2017 at 02:10:42PM -0700, Darrick J. Wong wrote:
> Hello list,
>
> Yesterday I tried setting up qemu 2.10 with some (fake) nvdimms backed
> by an on-disk file. Midway through a -g auto xfstests run, shared/298
> produced the attached dmesg spew. I'll try to have a look later, but
> in the meantime I'm doing the 'complain to list, see if anyone bites'
> thing. :)
>
> The kernel is 4.14-rc1 without any patches applied.
tl;dr: it's a false positive because it's conflating the independent
locking/completion contexts of upper and lower filesystems when
the loop device is in use.
I've mangled the report in replying - Darrick posted it here, too:
https://pastebin.com/GshLpH7x
> ======================================================
> WARNING: possible circular locking dependency detected
> 4.14.0-rc1-fixes #1 Tainted: G W
> ------------------------------------------------------
> loop0/31693 is trying to acquire lock:
> (&(&ip->i_mmaplock)->mr_lock){++++}, at: [<ffffffffa00f1b0c>] xfs_ilock+0x23c/0x330 [xfs]
>
> but now in release context of a crosslock acquired at the following:
> ((complete)&ret.event){+.+.}, at: [<ffffffff81326c1f>] submit_bio_wait+0x7f/0xb0
>
> which lock already depends on the new lock.
>
>
> the existing dependency chain (in reverse order) is:
>
> -> #2 ((complete)&ret.event){+.+.}:
> lock_acquire+0xab/0x200
> wait_for_completion_io+0x4e/0x1a0
> submit_bio_wait+0x7f/0xb0
> blkdev_issue_zeroout+0x71/0xa0
> xfs_bmapi_convert_unwritten+0x11f/0x1d0 [xfs]
> xfs_bmapi_write+0x374/0x11f0 [xfs]
> xfs_iomap_write_direct+0x2ac/0x430 [xfs]
> xfs_file_iomap_begin+0x20d/0xd50 [xfs]
> iomap_apply+0x43/0xe0
> dax_iomap_rw+0x89/0xf0
> xfs_file_dax_write+0xcc/0x220 [xfs]
> xfs_file_write_iter+0xf0/0x130 [xfs]
> __vfs_write+0xd9/0x150
> vfs_write+0xc8/0x1c0
> SyS_write+0x45/0xa0
> entry_SYSCALL_64_fastpath+0x1f/0xbe
This is on the lower filesystem - it's a write() syscall on a DAX
enabled filesystem. Locking in this case is:
Context              function               lock
Lower filesystem     xfs_file_dax_write     XFS_IOLOCK_EXCL
Lower filesystem     xfs_file_iomap_begin   XFS_ILOCK_EXCL
Lower filesystem/
  lower blkdev       submit_bio_wait        wait for dax block
                                            device IO completion
Which gives: IOLOCK -> &xfs_nondir_ilock_class -> IO completion
In the lower filesystem.
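For reference, the completion being waited on here is the on-stack one
inside submit_bio_wait() - roughly the following, paraphrased from
block/bio.c of this era from memory, so the details may not be exact:

struct submit_bio_ret {
        struct completion event;
        int error;
};

static void submit_bio_wait_endio(struct bio *bio)
{
        struct submit_bio_ret *ret = bio->bi_private;

        ret->error = blk_status_to_errno(bio->bi_status);
        complete(&ret->event);
}

int submit_bio_wait(struct bio *bio)
{
        struct submit_bio_ret ret;

        /*
         * One init site for every caller on every layer of the block
         * stack, so the crossrelease code assigns a single class to
         * this completion - that is the "(complete)&ret.event" above.
         */
        init_completion(&ret.event);
        bio->bi_private = &ret;
        bio->bi_end_io = submit_bio_wait_endio;
        bio->bi_opf |= REQ_SYNC;
        submit_bio(bio);
        wait_for_completion_io(&ret.event);

        return ret.error;
}

The class is the same no matter which block device the bio is aimed at,
which is what allows the IO waits of two different filesystems to be
chained together below.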
> -> #1 (&xfs_nondir_ilock_class){++++}:
> lock_acquire+0xab/0x200
> down_write_nested+0x4a/0xb0
> xfs_ilock+0x263/0x330 [xfs]
> xfs_setattr_size+0x152/0x370 [xfs]
> xfs_vn_setattr+0x6b/0x90 [xfs]
> notify_change+0x27d/0x3f0
> do_truncate+0x5b/0x90
> path_openat+0x237/0xa90
> do_filp_open+0x8a/0xf0
> do_sys_open+0x11c/0x1f0
> entry_SYSCALL_64_fastpath+0x1f/0xbe
Context              function               lock
Lower filesystem     xfs_vn_setattr         XFS_IOLOCK_EXCL
Lower filesystem     xfs_vn_setattr         XFS_MMAPLOCK_EXCL
Lower filesystem     xfs_setattr_size       XFS_ILOCK_EXCL
Which gives: IOLOCK -> MMAPLOCK -> &xfs_nondir_ilock_class
i.e.: IOLOCK -> MMAPLOCK -> &xfs_nondir_ilock_class -> IO completion
In the lower filesystem.
>
> -> #0 (&(&ip->i_mmaplock)->mr_lock){++++}:
> up_write+0x1c/0x40
> xfs_iunlock+0x1d0/0x310 [xfs]
> xfs_file_fallocate+0x8a/0x310 [xfs]
> loop_queue_work+0xb7/0x8d0
> kthread_worker_fn+0xb9/0x1f0
Context              function               lock
Lower filesystem     xfs_file_fallocate     XFS_IOLOCK_EXCL
Lower filesystem     xfs_file_fallocate     XFS_MMAPLOCK_EXCL
[Lower filesystem    whatever falloc calls  XFS_ILOCK_EXCL]
Upper filesystem/
  upper blkdev       loop_queue_work        blk_mq_complete_request
This is clearer when the reported stack is looked at:
> lock_commit_crosslock+0x3e9/0x5c0
> complete+0x24/0x60
> blk_update_request+0xc2/0x3e0
> blk_mq_end_request+0x18/0x80
> __blk_mq_complete_request+0x9f/0x170
> loop_queue_work+0x51/0x8d0
> kthread_worker_fn+0xb9/0x1f0
> kthread+0x148/0x180
> ret_from_fork+0x2a/0x40
So this is reporting as:
> Chain exists of:
> &(&ip->i_mmaplock)->mr_lock --> &xfs_nondir_ilock_class --> (complete)&ret.event
>
> Possible unsafe locking scenario by crosslock:
>
> CPU0 CPU1
> ---- ----
> lock(&xfs_nondir_ilock_class);
> lock((complete)&ret.event);
> lock(&(&ip->i_mmaplock)->mr_lock);
> unlock((complete)&ret.event);
>
> *** DEADLOCK ***
Which effectively means:
IO completion -> XFS_MMAPLOCK_EXCL .....
The thing is, this IO completion has nothing to do with the lower
filesystem - it's the IO completion for the filesystem on the loop
device (the upper filesystem) and is not in any way related to the
IO completion from the dax device the lower filesystem is waiting
on.
IOWs, this is a false positive.
Peter, this is the sort of false positive I mentioned was likely to
occur without some serious work to annotate the IO stack to prevent
them. We can nest multiple layers of IO completions and locking in
the IO stack via things like loop and RAID devices. They can be
nested to arbitrary depths, too (e.g. loop on fs on loop on fs on
dm-raid on n * (loop on fs) on bdev) so this new completion lockdep
checking is going to be a source of false positives until there is
an effective (and simple!) way of providing context based completion
annotations to avoid them...
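To be concrete about what "context based completion annotations" might
look like: something along the lines of the sketch below, where the
waiter supplies a per-device lockdep map so a wait on the lower pmem
device and a wait against a loop device stacked on top of it land in
different classes. This is purely hypothetical - none of these names
exist in the kernel - it only shows the shape of the problem:

static void annotated_bio_endio(struct bio *bio)        /* hypothetical */
{
        complete(bio->bi_private);
}

/* hypothetical variant of submit_bio_wait() taking a caller's map */
int submit_bio_wait_annotated(struct bio *bio, struct lockdep_map *map)
{
        DECLARE_COMPLETION_ONSTACK(done);

        bio->bi_private = &done;
        bio->bi_end_io = annotated_bio_endio;
        bio->bi_opf |= REQ_SYNC;
        submit_bio(bio);

        lock_map_acquire(map);          /* per-device/per-layer class */
        wait_for_completion_io(&done);
        lock_map_release(map);

        return blk_status_to_errno(bio->bi_status);
}

/* and every device or layer has to set up and pass its own map: */
static struct lock_class_key pmem_bio_wait_key;         /* hypothetical */
static struct lockdep_map pmem_bio_wait_map;            /* hypothetical */

static void pmem_bio_wait_annot_init(void)              /* hypothetical */
{
        lockdep_init_map(&pmem_bio_wait_map, "pmem-bio-wait",
                         &pmem_bio_wait_key, 0);
}

Even that only works if every layer remembers to set up and pass the
right map, which is exactly the annotation burden I'm complaining
about.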
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
* Re: shared/298 lockdep splat?
From: Byungchul Park @ 2017-09-21 8:47 UTC
To: Dave Chinner
Cc: Darrick J. Wong, xfs, Peter Zijlstra, linux-kernel, kernel-team
On Thu, Sep 21, 2017 at 08:22:56AM +1000, Dave Chinner wrote:
> > ======================================================
> > WARNING: possible circular locking dependency detected
> > 4.14.0-rc1-fixes #1 Tainted: G W
> > ------------------------------------------------------
> > loop0/31693 is trying to acquire lock:
> > (&(&ip->i_mmaplock)->mr_lock){++++}, at: [<ffffffffa00f1b0c>] xfs_ilock+0x23c/0x330 [xfs]
> >
> > but now in release context of a crosslock acquired at the following:
> > ((complete)&ret.event){+.+.}, at: [<ffffffff81326c1f>] submit_bio_wait+0x7f/0xb0
> >
> > which lock already depends on the new lock.
> >
> >
> > the existing dependency chain (in reverse order) is:
> >
> > -> #2 ((complete)&ret.event){+.+.}:
> > lock_acquire+0xab/0x200
> > wait_for_completion_io+0x4e/0x1a0
> > submit_bio_wait+0x7f/0xb0
> > blkdev_issue_zeroout+0x71/0xa0
> > xfs_bmapi_convert_unwritten+0x11f/0x1d0 [xfs]
> > xfs_bmapi_write+0x374/0x11f0 [xfs]
> > xfs_iomap_write_direct+0x2ac/0x430 [xfs]
> > xfs_file_iomap_begin+0x20d/0xd50 [xfs]
> > iomap_apply+0x43/0xe0
> > dax_iomap_rw+0x89/0xf0
> > xfs_file_dax_write+0xcc/0x220 [xfs]
> > xfs_file_write_iter+0xf0/0x130 [xfs]
> > __vfs_write+0xd9/0x150
> > vfs_write+0xc8/0x1c0
> > SyS_write+0x45/0xa0
> > entry_SYSCALL_64_fastpath+0x1f/0xbe
>
> This is on the lower filesystem - it's a write() syscall on a DAX
> enabled filesystem. Locking in this case is:
>
> Context              function               lock
> Lower filesystem     xfs_file_dax_write     XFS_IOLOCK_EXCL
> Lower filesystem     xfs_file_iomap_begin   XFS_ILOCK_EXCL
> Lower filesystem/
>   lower blkdev       submit_bio_wait        wait for dax block
>                                             device IO completion
>
> Which gives: IOLOCK -> &xfs_nondir_ilock_class -> IO completion
>
> In the lower filesystem.
>
> > -> #1 (&xfs_nondir_ilock_class){++++}:
> > lock_acquire+0xab/0x200
> > down_write_nested+0x4a/0xb0
> > xfs_ilock+0x263/0x330 [xfs]
> > xfs_setattr_size+0x152/0x370 [xfs]
> > xfs_vn_setattr+0x6b/0x90 [xfs]
> > notify_change+0x27d/0x3f0
> > do_truncate+0x5b/0x90
> > path_openat+0x237/0xa90
> > do_filp_open+0x8a/0xf0
> > do_sys_open+0x11c/0x1f0
> > entry_SYSCALL_64_fastpath+0x1f/0xbe
>
> Context              function               lock
> Lower filesystem     xfs_vn_setattr         XFS_IOLOCK_EXCL
> Lower filesystem     xfs_vn_setattr         XFS_MMAPLOCK_EXCL
> Lower filesystem     xfs_setattr_size       XFS_ILOCK_EXCL
>
> Which gives: IOLOCK -> MMAPLOCK -> &xfs_nondir_ilock_class
>
> i.e.: IOLOCK -> MMAPLOCK -> &xfs_nondir_ilock_class -> IO completion
>
> In the lower filesystem.
>
> >
> > -> #0 (&(&ip->i_mmaplock)->mr_lock){++++}:
> > up_write+0x1c/0x40
> > xfs_iunlock+0x1d0/0x310 [xfs]
> > xfs_file_fallocate+0x8a/0x310 [xfs]
> > loop_queue_work+0xb7/0x8d0
> > kthread_worker_fn+0xb9/0x1f0
>
> Context              function               lock
> Lower filesystem     xfs_file_fallocate     XFS_IOLOCK_EXCL
> Lower filesystem     xfs_file_fallocate     XFS_MMAPLOCK_EXCL
> [Lower filesystem    whatever falloc calls  XFS_ILOCK_EXCL]
> Upper filesystem/
>   upper blkdev       loop_queue_work        blk_mq_complete_request
>
> This is clearer when the reported stack is looked at:
>
> > lock_commit_crosslock+0x3e9/0x5c0
> > complete+0x24/0x60
> > blk_update_request+0xc2/0x3e0
> > blk_mq_end_request+0x18/0x80
> > __blk_mq_complete_request+0x9f/0x170
> > loop_queue_work+0x51/0x8d0
> > kthread_worker_fn+0xb9/0x1f0
> > kthread+0x148/0x180
> > ret_from_fork+0x2a/0x40
>
> So this is reporting as:
>
> > Chain exists of:
> > &(&ip->i_mmaplock)->mr_lock --> &xfs_nondir_ilock_class --> (complete)&ret.event
> >
> > Possible unsafe locking scenario by crosslock:
> >
> > CPU0 CPU1
> > ---- ----
> > lock(&xfs_nondir_ilock_class);
> > lock((complete)&ret.event);
> > lock(&(&ip->i_mmaplock)->mr_lock);
> > unlock((complete)&ret.event);
> >
> > *** DEADLOCK ***
>
> Which effectively means:
>
> IO completion -> XFS_MMAPLOCK_EXCL .....
>
> The thing is, this IO completion has nothing to do with the lower
> filesystem - it's the IO completion for the filesystem on the loop
> device (the upper filesystem) and is not in any way related to the
> IO completion from the dax device the lower filesystem is waiting
> on.
>
> IOWs, this is a false positive.
>
> Peter, this is the sort of false positive I mentioned was likely to
> occur without some serious work to annotate the IO stack to prevent
> them. We can nest multiple layers of IO completions and locking in
> the IO stack via things like loop and RAID devices. They can be
> nested to arbitrary depths, too (e.g. loop on fs on loop on fs on
> dm-raid on n * (loop on fs) on bdev) so this new completion lockdep
> checking is going to be a source of false positives until there is
> an effective (and simple!) way of providing context based completion
> annotations to avoid them...
Hello,
It looks like this is caused by &ret.event in submit_bio_wait() being
initialized with the same lockdep class for all layers. What I mean is
that completion variables in different layers should be initialized with
different classes, as you do for typical locks in xfs.
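For example, xfs seems to assign per-type classes to its inode locks at
inode setup time, something like the following - this is my reading of
fs/xfs/xfs_iops.c, so I may be paraphrasing it slightly wrong:

static struct lock_class_key xfs_nondir_ilock_class;
static struct lock_class_key xfs_dir_ilock_class;

/* in xfs_setup_inode(): pick a lockdep class for this inode's ilock */
if (S_ISDIR(inode->i_mode))
        lockdep_set_class(&ip->i_lock.mr_lock, &xfs_dir_ilock_class);
else
        lockdep_set_class(&ip->i_lock.mr_lock, &xfs_nondir_ilock_class);

That is where the "&xfs_nondir_ilock_class" name in the report comes
from. I wonder if completions could be given per-layer keys in a
similar way.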
I am not sure I understand how xfs works correctly - is that right? If
so, how can we distinguish between independent 'bio's in
submit_bio_wait()? With that answer, you or I could make this work, no?
> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@fromorbit.com
* Re: shared/298 lockdep splat?
From: Dave Chinner @ 2017-09-26 3:51 UTC
To: Byungchul Park
Cc: Darrick J. Wong, xfs, Peter Zijlstra, linux-kernel, kernel-team
On Thu, Sep 21, 2017 at 05:47:14PM +0900, Byungchul Park wrote:
> On Thu, Sep 21, 2017 at 08:22:56AM +1000, Dave Chinner wrote:
> > Peter, this is the sort of false positive I mentioned was likely to
> > occur without some serious work to annotate the IO stack to prevent
> > them. We can nest multiple layers of IO completions and locking in
> > the IO stack via things like loop and RAID devices. They can be
> > nested to arbitrary depths, too (e.g. loop on fs on loop on fs on
> > dm-raid on n * (loop on fs) on bdev) so this new completion lockdep
> > checking is going to be a source of false positives until there is
> > an effective (and simple!) way of providing context based completion
> > annotations to avoid them...
>
> Hello,
>
> It looks like this is caused by &ret.event in submit_bio_wait() being
> initialized with the same lockdep class for all layers. What I mean is
> that completion variables in different layers should be initialized with
> different classes, as you do for typical locks in xfs.
Except that submit_bio_wait() is generic block layer functionality
and can be used by anyone. Whatever solution you decide on, it has
to be generic. And keep in mind that any code that submits a bio
itself and waits on a completion event from the bio is going to
have to do its own annotations, which makes this a real PITA.
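The pattern I mean is the usual open-coded "build a bio, submit it,
wait for the endio" sequence - sketched here with made-up helper names,
just to show what would need annotating in every such place:

static void my_sync_bio_endio(struct bio *bio)  /* made-up name */
{
        complete(bio->bi_private);
}

static int my_sync_bio(struct bio *bio)         /* made-up name */
{
        DECLARE_COMPLETION_ONSTACK(done);

        bio->bi_private = &done;
        bio->bi_end_io = my_sync_bio_endio;
        submit_bio(bio);
        /* same cross-layer conflation as submit_bio_wait() */
        wait_for_completion_io(&done);

        return blk_status_to_errno(bio->bi_status);
}

Every filesystem and driver that open codes this would have to carry
its own completion annotation, not just submit_bio_wait().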
> I am not sure I understand how xfs works correctly - is that right? If
> so, how can we distinguish between independent 'bio's in
> submit_bio_wait()? With that answer, you or I could make this work, no?
Has nothing to do with XFS - it has no clue where it sits in the
block device stack and has no business screwing with bio internals
and stack layering to handle issues with stacked block devices....
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
* Re: shared/298 lockdep splat?
From: Byungchul Park @ 2017-09-26 7:04 UTC
To: Dave Chinner
Cc: Darrick J. Wong, xfs, Peter Zijlstra, linux-kernel, kernel-team
On Tue, Sep 26, 2017 at 01:51:49PM +1000, Dave Chinner wrote:
> On Thu, Sep 21, 2017 at 05:47:14PM +0900, Byungchul Park wrote:
> > On Thu, Sep 21, 2017 at 08:22:56AM +1000, Dave Chinner wrote:
> > > Peter, this is the sort of false positive I mentioned was likely to
> > > occur without some serious work to annotate the IO stack to prevent
> > > them. We can nest multiple layers of IO completions and locking in
> > > the IO stack via things like loop and RAID devices. They can be
> > > nested to arbitrary depths, too (e.g. loop on fs on loop on fs on
> > > dm-raid on n * (loop on fs) on bdev) so this new completion lockdep
> > > checking is going to be a source of false positives until there is
> > > an effective (and simple!) way of providing context based completion
> > > annotations to avoid them...
> >
> > Hello,
> >
> > It looks like this is caused by &ret.event in submit_bio_wait() being
> > initialized with the same lockdep class for all layers. What I mean is
> > that completion variables in different layers should be initialized with
> > different classes, as you do for typical locks in xfs.
>
> Except that submit_bio_wait() is generic block layer functionality
> and can be used by anyone. Whatever solution you decide on, it has
> to be generic. And keep in mind that any code that submits a bio
> itself and waits on a completion event from the bio is going to
> have to do its own annotations, which makes this a real PITA.
Right, agreed. Let me think about it more. As you said, it should be generic.
> > I am not sure I understand how xfs works correctly - is that right? If
> > so, how can we distinguish between independent 'bio's in
> > submit_bio_wait()? With that answer, you or I could make this work, no?
>
> Has nothing to do with XFS - it has no clue where it sits in the
> block device stack and has no business screwing with bio internals
> and stack layering to handle issues with stacked block devices....
Ok. Thank you for replying.
> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@fromorbit.com