* Multi-CPU harmless lockdep on x86 while copying data
@ 2014-03-09 2:58 Michael L. Semon
2014-03-10 2:55 ` Dave Chinner
0 siblings, 1 reply; 13+ messages in thread
From: Michael L. Semon @ 2014-03-09 2:58 UTC
To: xfs-oss
Hi! I was playing a shell game with some precious backup data. It
went like this:
open a 36-GB source partition (DM linear, partitions from 2 drives,
v5-superblock XFS)
open a 32-GB aes-xts crypt (v4-superblock XFS)
`cp -av` from holding partition to crypt
It was during the cp operation that I got this multi-CPU version of
the harmless lockdep:
=========================================================
[ INFO: possible irq lock inversion dependency detected ]
3.14.0-rc5+ #6 Not tainted
---------------------------------------------------------
kswapd0/25 just changed the state of lock:
(&xfs_dir_ilock_class){++++-.}, at: [<791c09fa>] xfs_ilock+0xff/0x16c
but this lock took another, RECLAIM_FS-unsafe lock in the past:
(&mm->mmap_sem){++++++}
and interrupts could create inverse lock ordering between them.
other info that might help us debug this:
Possible interrupt unsafe locking scenario:
       CPU0                    CPU1
       ----                    ----
  lock(&mm->mmap_sem);
                               local_irq_disable();
                               lock(&xfs_dir_ilock_class);
                               lock(&mm->mmap_sem);
  <Interrupt>
    lock(&xfs_dir_ilock_class);
*** DEADLOCK ***
3 locks held by kswapd0/25:
#0: (shrinker_rwsem){++++..}, at: [<790c04b7>] shrink_slab+0x2a/0xda
#1: (&type->s_umount_key#20){++++.+}, at: [<790e55c3>] grab_super_passive+0x3b/0x75
#2: (&pag->pag_ici_reclaim_lock){+.+...}, at: [<7917809a>] xfs_reclaim_inodes_ag+0xb4/0x3a7
the shortest dependencies between 2nd lock and 1st lock:
-> (&mm->mmap_sem){++++++} ops: 146382 {
HARDIRQ-ON-W at:
[<79064553>] __lock_acquire+0x58a/0xa6c
[<7906510f>] lock_acquire+0x8b/0x101
[<792f6409>] down_write+0x4c/0xa3
[<790e96e5>] do_execve_common+0x2bf/0x5bf
[<790e99f2>] do_execve+0xd/0xf
[<79000341>] run_init_process+0x21/0x23
[<79000357>] try_to_run_init_process+0x14/0x4d
[<792ec5cf>] kernel_init+0x90/0xce
[<792f8537>] ret_from_kernel_thread+0x1b/0x28
HARDIRQ-ON-R at:
[<79064477>] __lock_acquire+0x4ae/0xa6c
[<7906510f>] lock_acquire+0x8b/0x101
[<790d1718>] might_fault+0x81/0xa9
[<7922fc80>] clear_user+0x13/0x46
[<79123c80>] padzero+0x22/0x2e
[<7912552b>] load_elf_binary+0x500/0xd35
[<790e93ec>] search_binary_handler+0x72/0xac
[<790e98c1>] do_execve_common+0x49b/0x5bf
[<790e99f2>] do_execve+0xd/0xf
[<79000341>] run_init_process+0x21/0x23
[<79000357>] try_to_run_init_process+0x14/0x4d
[<792ec5cf>] kernel_init+0x90/0xce
[<792f8537>] ret_from_kernel_thread+0x1b/0x28
SOFTIRQ-ON-W at:
[<79064579>] __lock_acquire+0x5b0/0xa6c
[<7906510f>] lock_acquire+0x8b/0x101
[<792f6409>] down_write+0x4c/0xa3
[<790e96e5>] do_execve_common+0x2bf/0x5bf
[<790e99f2>] do_execve+0xd/0xf
[<79000341>] run_init_process+0x21/0x23
[<79000357>] try_to_run_init_process+0x14/0x4d
[<792ec5cf>] kernel_init+0x90/0xce
[<792f8537>] ret_from_kernel_thread+0x1b/0x28
SOFTIRQ-ON-R at:
[<79064579>] __lock_acquire+0x5b0/0xa6c
[<7906510f>] lock_acquire+0x8b/0x101
[<790d1718>] might_fault+0x81/0xa9
[<7922fc80>] clear_user+0x13/0x46
[<79123c80>] padzero+0x22/0x2e
[<7912552b>] load_elf_binary+0x500/0xd35
[<790e93ec>] search_binary_handler+0x72/0xac
[<790e98c1>] do_execve_common+0x49b/0x5bf
[<790e99f2>] do_execve+0xd/0xf
[<79000341>] run_init_process+0x21/0x23
[<79000357>] try_to_run_init_process+0x14/0x4d
[<792ec5cf>] kernel_init+0x90/0xce
[<792f8537>] ret_from_kernel_thread+0x1b/0x28
RECLAIM_FS-ON-W at:
[<790620d7>] mark_held_locks+0x81/0xe7
[<79062aac>] lockdep_trace_alloc+0xa5/0xe6
[<790b97a5>] __alloc_pages_nodemask+0x6f/0x6be
[<790dfb14>] new_slab+0x5f/0x21c
[<792f1a88>] __slab_alloc.isra.59.constprop.67+0x25f/0x43d
[<790e0974>] kmem_cache_alloc+0x91/0xf9
[<790d35ed>] __split_vma.isra.34+0x28/0x141
[<790d4523>] do_munmap+0x234/0x2b0
[<790d5115>] vm_munmap+0x37/0x4a
[<790d513b>] SyS_munmap+0x13/0x15
[<792f85b8>] sysenter_do_call+0x12/0x36
RECLAIM_FS-ON-R at:
[<790620d7>] mark_held_locks+0x81/0xe7
[<79062aac>] lockdep_trace_alloc+0xa5/0xe6
[<790b97a5>] __alloc_pages_nodemask+0x6f/0x6be
[<790269fa>] pte_alloc_one+0x24/0x3c
[<790cead5>] __pte_alloc+0x1a/0x85
[<790d0d0c>] handle_mm_fault+0x5d8/0x604
[<79023d1f>] __do_page_fault+0x110/0x3bc
[<790240d3>] do_page_fault+0xd/0xf
[<792f832b>] error_code+0x5f/0x64
[<79123c80>] padzero+0x22/0x2e
[<7912552b>] load_elf_binary+0x500/0xd35
[<790e93ec>] search_binary_handler+0x72/0xac
[<790e98c1>] do_execve_common+0x49b/0x5bf
[<790e99f2>] do_execve+0xd/0xf
[<79000341>] run_init_process+0x21/0x23
[<79000357>] try_to_run_init_process+0x14/0x4d
[<792ec5cf>] kernel_init+0x90/0xce
[<792f8537>] ret_from_kernel_thread+0x1b/0x28
INITIAL USE at:
[<79064276>] __lock_acquire+0x2ad/0xa6c
[<7906510f>] lock_acquire+0x8b/0x101
[<792f6409>] down_write+0x4c/0xa3
[<790e96e5>] do_execve_common+0x2bf/0x5bf
[<790e99f2>] do_execve+0xd/0xf
[<79000341>] run_init_process+0x21/0x23
[<79000357>] try_to_run_init_process+0x14/0x4d
[<792ec5cf>] kernel_init+0x90/0xce
[<792f8537>] ret_from_kernel_thread+0x1b/0x28
}
... key at: [<795f1eec>] __key.44037+0x0/0x8
... acquired at:
[<79064360>] __lock_acquire+0x397/0xa6c
[<7906510f>] lock_acquire+0x8b/0x101
[<790d1718>] might_fault+0x81/0xa9
[<790f379a>] filldir64+0x92/0xe2
[<7916f079>] xfs_dir2_sf_getdents+0x1a0/0x44c
[<7917009e>] xfs_readdir+0xc4/0x126
[<79171b8b>] xfs_file_readdir+0x25/0x3e
[<790f385a>] iterate_dir+0x70/0x9b
[<790f3a31>] SyS_getdents64+0x6d/0xcc
[<792f85b8>] sysenter_do_call+0x12/0x36
-> (&xfs_dir_ilock_class){++++-.} ops: 11354 {
HARDIRQ-ON-W at:
[<79064553>] __lock_acquire+0x58a/0xa6c
[<7906510f>] lock_acquire+0x8b/0x101
[<7905e559>] down_write_nested+0x4f/0x9f
[<791c09fa>] xfs_ilock+0xff/0x16c
[<7917db25>] xfs_vn_update_time+0x6c/0x150
[<790f9302>] update_time+0x1e/0x9e
[<790faed8>] touch_atime+0xcd/0x101
[<790f3879>] iterate_dir+0x8f/0x9b
[<790f3a31>] SyS_getdents64+0x6d/0xcc
[<792f85b8>] sysenter_do_call+0x12/0x36
HARDIRQ-ON-R at:
[<79064477>] __lock_acquire+0x4ae/0xa6c
[<7906510f>] lock_acquire+0x8b/0x101
[<7905e409>] down_read_nested+0x4f/0x8a
[<791c09b4>] xfs_ilock+0xb9/0x16c
[<791c0a85>] xfs_ilock_data_map_shared+0x1e/0x3a
[<79172550>] xfs_dir_open+0x2e/0x64
[<790e1d86>] do_dentry_open.isra.26+0x115/0x221
[<790e2f06>] finish_open+0x1b/0x27
[<790ef0e6>] do_last.isra.60+0x89c/0xe6b
[<790ef75e>] path_openat+0xa9/0x4fe
[<790efbe4>] do_filp_open+0x31/0x72
[<790e339a>] do_sys_open+0x112/0x181
[<790e344d>] SyS_openat+0x20/0x22
[<792f85b8>] sysenter_do_call+0x12/0x36
SOFTIRQ-ON-W at:
[<79064579>] __lock_acquire+0x5b0/0xa6c
[<7906510f>] lock_acquire+0x8b/0x101
[<7905e559>] down_write_nested+0x4f/0x9f
[<791c09fa>] xfs_ilock+0xff/0x16c
[<7917db25>] xfs_vn_update_time+0x6c/0x150
[<790f9302>] update_time+0x1e/0x9e
[<790faed8>] touch_atime+0xcd/0x101
[<790f3879>] iterate_dir+0x8f/0x9b
[<790f3a31>] SyS_getdents64+0x6d/0xcc
[<792f85b8>] sysenter_do_call+0x12/0x36
SOFTIRQ-ON-R at:
[<79064579>] __lock_acquire+0x5b0/0xa6c
[<7906510f>] lock_acquire+0x8b/0x101
[<7905e409>] down_read_nested+0x4f/0x8a
[<791c09b4>] xfs_ilock+0xb9/0x16c
[<791c0a85>] xfs_ilock_data_map_shared+0x1e/0x3a
[<79172550>] xfs_dir_open+0x2e/0x64
[<790e1d86>] do_dentry_open.isra.26+0x115/0x221
[<790e2f06>] finish_open+0x1b/0x27
[<790ef0e6>] do_last.isra.60+0x89c/0xe6b
[<790ef75e>] path_openat+0xa9/0x4fe
[<790efbe4>] do_filp_open+0x31/0x72
[<790e339a>] do_sys_open+0x112/0x181
[<790e344d>] SyS_openat+0x20/0x22
[<792f85b8>] sysenter_do_call+0x12/0x36
IN-RECLAIM_FS-W at:
[<79064676>] __lock_acquire+0x6ad/0xa6c
[<7906510f>] lock_acquire+0x8b/0x101
[<7905e559>] down_write_nested+0x4f/0x9f
[<791c09fa>] xfs_ilock+0xff/0x16c
[<79177db4>] xfs_reclaim_inode+0xdb/0x30d
[<79178260>] xfs_reclaim_inodes_ag+0x27a/0x3a7
[<7917840b>] xfs_reclaim_inodes_nr+0x2d/0x33
[<791837d1>] xfs_fs_free_cached_objects+0x13/0x15
[<790e57f8>] super_cache_scan+0x129/0x12d
[<790befdc>] shrink_slab_node+0x125/0x270
[<790c04f0>] shrink_slab+0x63/0xda
[<790c273e>] kswapd+0x31d/0x770
[<790458fa>] kthread+0xa1/0xb6
[<792f8537>] ret_from_kernel_thread+0x1b/0x28
INITIAL USE at:
[<79064276>] __lock_acquire+0x2ad/0xa6c
[<7906510f>] lock_acquire+0x8b/0x101
[<7905e409>] down_read_nested+0x4f/0x8a
[<791c09b4>] xfs_ilock+0xb9/0x16c
[<791c0a85>] xfs_ilock_data_map_shared+0x1e/0x3a
[<79172550>] xfs_dir_open+0x2e/0x64
[<790e1d86>] do_dentry_open.isra.26+0x115/0x221
[<790e2f06>] finish_open+0x1b/0x27
[<790ef0e6>] do_last.isra.60+0x89c/0xe6b
[<790ef75e>] path_openat+0xa9/0x4fe
[<790efbe4>] do_filp_open+0x31/0x72
[<790e339a>] do_sys_open+0x112/0x181
[<790e344d>] SyS_openat+0x20/0x22
[<792f85b8>] sysenter_do_call+0x12/0x36
}
... key at: [<79bfe07c>] xfs_dir_ilock_class+0x0/0x8
... acquired at:
[<79060873>] check_usage_forwards+0xf8/0xfa
[<79061f62>] mark_lock+0x1b3/0x2a7
[<79064676>] __lock_acquire+0x6ad/0xa6c
[<7906510f>] lock_acquire+0x8b/0x101
[<7905e559>] down_write_nested+0x4f/0x9f
[<791c09fa>] xfs_ilock+0xff/0x16c
[<79177db4>] xfs_reclaim_inode+0xdb/0x30d
[<79178260>] xfs_reclaim_inodes_ag+0x27a/0x3a7
[<7917840b>] xfs_reclaim_inodes_nr+0x2d/0x33
[<791837d1>] xfs_fs_free_cached_objects+0x13/0x15
[<790e57f8>] super_cache_scan+0x129/0x12d
[<790befdc>] shrink_slab_node+0x125/0x270
[<790c04f0>] shrink_slab+0x63/0xda
[<790c273e>] kswapd+0x31d/0x770
[<790458fa>] kthread+0xa1/0xb6
[<792f8537>] ret_from_kernel_thread+0x1b/0x28
stack backtrace:
CPU: 0 PID: 25 Comm: kswapd0 Not tainted 3.14.0-rc5+ #6
Hardware name: Dell Computer Corporation Dimension 2350/07W080, BIOS A01 12/17/2002
00000000 00000000 781b1b6c 792f2495 781b1bb0 781b1b90 792efde6 793b75f7
793b7965 00000019 7819b1c0 00000000 7819b654 7819b1c0 781b1be0 79060873
7819b654 00000001 793b7965 781b1bac 79a83100 79a83100 79834600 7819b654
Call Trace:
[<792f2495>] dump_stack+0x48/0x60
[<792efde6>] print_irq_inversion_bug.part.36+0x173/0x17b
[<79060873>] check_usage_forwards+0xf8/0xfa
[<7906077b>] ? check_usage_backwards+0xfc/0xfc
[<79061f62>] mark_lock+0x1b3/0x2a7
[<79053956>] ? sched_clock_local+0x42/0x12e
[<7906077b>] ? check_usage_backwards+0xfc/0xfc
[<79064676>] __lock_acquire+0x6ad/0xa6c
[<7906510f>] lock_acquire+0x8b/0x101
[<791c09fa>] ? xfs_ilock+0xff/0x16c
[<7905e559>] down_write_nested+0x4f/0x9f
[<791c09fa>] ? xfs_ilock+0xff/0x16c
[<791c09fa>] xfs_ilock+0xff/0x16c
[<79177db4>] xfs_reclaim_inode+0xdb/0x30d
[<79178260>] xfs_reclaim_inodes_ag+0x27a/0x3a7
[<791780cc>] ? xfs_reclaim_inodes_ag+0xe6/0x3a7
[<7917840b>] xfs_reclaim_inodes_nr+0x2d/0x33
[<791837d1>] xfs_fs_free_cached_objects+0x13/0x15
[<790e57f8>] super_cache_scan+0x129/0x12d
[<790befdc>] shrink_slab_node+0x125/0x270
[<790c04b7>] ? shrink_slab+0x2a/0xda
[<790c04f0>] shrink_slab+0x63/0xda
[<790c273e>] kswapd+0x31d/0x770
[<790c2421>] ? try_to_free_pages+0x373/0x373
[<790458fa>] kthread+0xa1/0xb6
[<790624ff>] ? trace_hardirqs_on+0xb/0xd
[<792f8537>] ret_from_kernel_thread+0x1b/0x28
[<79045859>] ? __kthread_parkme+0x57/0x57
[sched_delayed] sched: RT throttling activated
cp (149) used greatest stack depth: 5152 bytes left
I call it harmless because its single-CPU variant can be reproduced
as far back as I could bisect in earlier testing (way before
kernel 2.6.20). However, such lockdep splats have never manifested
in a noticeable problem on production kernels on x86. Either
down_write_nested or down_read_nested is in all of them, depending
on which kernel version is in use. At least one reclaim-related
function is in there as well. vm_map_ram used to be in there, but Dave
took care of that (thanks!).
This particular splat has been showing up more often, though. It's not
tied to one particular commit, event, or change; just something that
has crept in gradually since maybe kernel 3.11. It's like watching
grass grow or watching paint dry.
Might somebody keep an eye on this? With enough debug enabled, it
might show on a large cp or `tar -x` operation. xfstests rarely
triggers this issue. It happens when there are enough inodes
and data flowing that RAM should be full. In many cases, more than one
partition is in play. Crypt partitions are not required.
The test system is an i686 Pentium 4 with 1280 MB of RAM, running a
stripped-down Slackware 14.1 with elfutils and test programs on top.
Thanks!
Michael
_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: Multi-CPU harmless lockdep on x86 while copying data
2014-03-09 2:58 Multi-CPU harmless lockdep on x86 while copying data Michael L. Semon
@ 2014-03-10 2:55 ` Dave Chinner
2014-03-10 10:37 ` Christoph Hellwig
2014-03-10 20:52 ` Ben Myers
0 siblings, 2 replies; 13+ messages in thread
From: Dave Chinner @ 2014-03-10 2:55 UTC
To: Michael L. Semon; +Cc: xfs-oss
On Sat, Mar 08, 2014 at 09:58:01PM -0500, Michael L. Semon wrote:
> Hi! I was playing a shell game with some precious backup data. It
> went like this:
>
> open a 36-GB source partition (DM linear, partitions from 2 drives,
> v5-superblock XFS)
>
> open a 32-GB aes-xts crypt (v4-superblock XFS)
>
> `cp -av` from holding partition to crypt
>
> It was during the cp operation that I got this multi-CPU version of
> the harmless lockdep:
>
> =========================================================
> [ INFO: possible irq lock inversion dependency detected ]
> 3.14.0-rc5+ #6 Not tainted
> ---------------------------------------------------------
> kswapd0/25 just changed the state of lock:
> (&xfs_dir_ilock_class){++++-.}, at: [<791c09fa>] xfs_ilock+0xff/0x16c
> but this lock took another, RECLAIM_FS-unsafe lock in the past:
> (&mm->mmap_sem){++++++}
>
> and interrupts could create inverse lock ordering between them.
>
>
> other info that might help us debug this:
> Possible interrupt unsafe locking scenario:
>
>        CPU0                    CPU1
>        ----                    ----
>   lock(&mm->mmap_sem);
>                                local_irq_disable();
>                                lock(&xfs_dir_ilock_class);
>                                lock(&mm->mmap_sem);
>   <Interrupt>
>     lock(&xfs_dir_ilock_class);
Well, that's rather interesting. I'd love to know where we take the
directory ilock with interrupts disabled - and then take a page
fault... :/
....
> }
> ... key at: [<795f1eec>] __key.44037+0x0/0x8
> ... acquired at:
> [<79064360>] __lock_acquire+0x397/0xa6c
> [<7906510f>] lock_acquire+0x8b/0x101
> [<790d1718>] might_fault+0x81/0xa9
> [<790f379a>] filldir64+0x92/0xe2
> [<7916f079>] xfs_dir2_sf_getdents+0x1a0/0x44c
> [<7917009e>] xfs_readdir+0xc4/0x126
> [<79171b8b>] xfs_file_readdir+0x25/0x3e
> [<790f385a>] iterate_dir+0x70/0x9b
> [<790f3a31>] SyS_getdents64+0x6d/0xcc
> [<792f85b8>] sysenter_do_call+0x12/0x36
All of these paths are from the syscall path, except for one in
kswapd context. I don't see any stack trace here that could result
in the above "deadlock". It looks to be yet another false
positive...
> I call it harmless because its single-CPU variant can be reproduced
> as far back as I could bisect in earlier testing (way before
> kernel 2.6.20). However, such lockdep splats have never manifested
> in a noticeable problem on production kernels on x86. Either
> down_write_nested or down_read_nested is in all of them, depending
> on which kernel version is in use. At least one reclaim-related
> function is in there as well. vm_map_ram used to be in there, but Dave
> took care of that (thanks!).
>
> This particular splat has been showing up more often, though. It's not
> tied to one particular commit, event, or change; just something that
> has crept in gradually since maybe kernel 3.11. It's like watching
> grass grow or watching paint dry.
The above reports all indicate that taking a page fault while
holding the directory ilock is problematic. So have all the other
new lockdep reports we've got that have been caused by that change.
Realistically, I think that the readdir locking change was
fundamentally broken. Why? Because taking page faults while holding
the ilock violates the lock order we have for inodes. For regular
files, the lock hierarchy is iolock -> page fault -> ilock, and we
got rid of the need for the iolock in the inode reclaim path because
of all the problems it caused lockdep. Now we've gone and
reintroduced that same set of problems - only worse - by allowing
the readdir code to take page faults while holding the ilock....
I'd like to revert the change that introduced the ilock into the
readdir path, but I suspect that would be an unpopular direction to
take. However, we need a solution that doesn't drive us insane with
lockdep false positives.
IOWs, we need to do this differently - readdir needs to be protected
by the iolock, not the ilock, and only the block mapping routines in
the readdir code should be protected by the ilock. i.e. the same
locking strategy that is used for regular files: iolock protects the
contents of the file, the ilock protects the inode metadata.
What I think we really need to do is drive the ilock all the way
into the block mapping functions of the directory code and to the
places where the inode metadata is actually going to be modified.
The iolock protects access to the directory data, and logging of
changes to those blocks is serialised by the buffer locks, so we
don't actually need the inode ilock to serialise access to the
directory data - we only need the ilock for modifications to the
inode and its block maps, same as for regular files.
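To make that concrete, the readdir path would end up shaped roughly
like this (a sketch only, distilled from the patch below: the loop
condition and the block-read/entry-emit helpers are placeholders, not
existing functions; xfs_ilock(), xfs_iunlock() and
xfs_ilock_data_map_shared() are the existing lock calls):

int
xfs_readdir_locking_sketch(
	struct xfs_inode	*dp,
	struct dir_context	*ctx)
{
	int			error = 0;

	/* iolock held shared for the whole readdir: excludes modification */
	xfs_ilock(dp, XFS_IOLOCK_SHARED);

	while (more_dir_blocks(dp, ctx)) {		/* placeholder */
		struct xfs_buf	*bp;
		int		lock_mode;

		/* ilock only around the bmbt lookup and buffer read */
		lock_mode = xfs_ilock_data_map_shared(dp);
		error = read_next_dir_block(dp, ctx, &bp); /* placeholder */
		xfs_iunlock(dp, lock_mode);
		if (error)
			break;

		/* filldir/copy_to_user can fault here: no ilock held */
		emit_dir_entries(bp, ctx);		/* placeholder */
	}

	xfs_iunlock(dp, XFS_IOLOCK_SHARED);
	return error;
}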
Changing the directory code to handle this sort of locking is going
to require a bit of surgery. However, I can see advantages to moving
directory data to the same locking strategy as regular file data -
locking hierarchies are identical, directory ilock hold times are
much reduced, we don't get lockdep whining about taking page faults
with the ilock held, etc.
A quick hack to demonstrate the high-level, initial step of using
the IOLOCK for readdir serialisation. I've done a little smoke
testing on it, so it won't die immediately. It should get rid of all
the nasty lockdep issues, but it doesn't start to address the deeper
restructuring that is needed.
Thoughts, comments? I'm especially interested in what SGI people
have to say here because this locking change was needed for CXFS,
and changing the directory locking hierarchy is probably going to
upset CXFS a lot...
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
xfs: holding ilock over readdir is wrong
From: Dave Chinner <dchinner@redhat.com>
The recent change to the readdir locking made in 40194ec ("xfs:
reinstate the ilock in xfs_readdir") for CXFS directory sanity was
fundamentally the wrong thing to do. Deep in the readdir code we
can take page faults in the filldir callback, and so taking a page
fault while holding an inode ilock creates a new set of locking
issues that lockdep warns all over the place about.
The locking order for regular inodes w.r.t. page faults is iolock
-> page fault -> ilock. The directory readdir code now
triggers ilock -> page fault -> mmap_sem. While we cannot deadlock
at this point, it inverts all the locking patterns that lockdep
normally sees on XFS inodes, and so triggers lockdep. We worked
around this with commit 93a8614 ("xfs: fix directory inode iolock
lockdep false positive"), but that then just moved the lockdep
warning deeper into the page fault path and triggered on security
inode locks.
Further, if we enter memory reclaim in a readdir path, we now get
lockdep warnings about potential deadlocks because the ilock is held
when we enter reclaim. This, again, is different to a regular file
in that we never allow memory reclaim to run while holding the
ilock for regular files. Hence lockdep now throws
ilock->kmalloc->reclaim->ilock warnings.
Basically, the problem is that the ilock is being used to protect
the directory data and the inode metadata, whereas for a regular
file the iolock protects the data and the ilock protects the
metadata. From the VFS perspective, the i_mutex serialises all
accesses to the directory data, and so not holding the ilock for
readdir doesn't matter. The issue is that CXFS doesn't access
directory data via the VFS, so it has no "data serialisation"
mechanism.
The simplest way to fix these problems would be to revert the
ilock addition made to xfs_readdir, but I figure that would be
unpopular with some people. Hence we need a different
solution: we should use the iolock to protect readdir against
modification races.
The ilock can then be used just when the extent list needs to be
read, just like we do for regular files. The directory modification
code can now take the iolock exclusive when the ilock is also taken,
and this then ensures that readdir is correctly excluded while
modifications are in progress.
This would be a straightforward change, except for two things:
filestreams and lockdep. The filestream allocator takes the
directory iolock and makes assumptions about parent->child locking
order of the iolock which will now be invalidated. Hence some
changes to the filestreams code are needed to ensure that it never
blocks on directory iolocks and deadlocks; instead it needs to fail
stream associations when such problems occur.
Lockdep is just plain annoying. We have a bug in our handling of
XFS_LOCK_PARENT when it comes to locking multiple inodes
with that flag set - lockdep classes are exclusive and can't be ORed
together to form new classes. And there are only 8 subclasses
available. Luckily, we only need the new annotations on the iolock,
and we never lock more than 2 of those at once, so there's space
available in the iolock lockdep mask to shoehorn in a separate
lockdep subclass map for parent based iolocks.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
fs/xfs/xfs_dir2.c | 3 +++
fs/xfs/xfs_dir2_readdir.c | 10 ++++++++--
fs/xfs/xfs_filestream.c | 26 +++++++++++++++---------
fs/xfs/xfs_inode.c | 50 +++++++++++++++++++++++++++++++++++------------
fs/xfs/xfs_inode.h | 7 ++++++-
fs/xfs/xfs_symlink.c | 7 ++++---
6 files changed, 75 insertions(+), 28 deletions(-)
diff --git a/fs/xfs/xfs_dir2.c b/fs/xfs/xfs_dir2.c
index 0c8ba87..b206da1 100644
--- a/fs/xfs/xfs_dir2.c
+++ b/fs/xfs/xfs_dir2.c
@@ -307,6 +307,7 @@ xfs_dir_lookup(
struct xfs_da_args *args;
int rval;
int v; /* type-checking value */
+ int lock_mode;
ASSERT(S_ISDIR(dp->i_d.di_mode));
XFS_STATS_INC(xs_dir_lookup);
@@ -331,6 +332,7 @@ xfs_dir_lookup(
if (ci_name)
args->op_flags |= XFS_DA_OP_CILOOKUP;
+ lock_mode = xfs_ilock_data_map_shared(dp);
if (dp->i_d.di_format == XFS_DINODE_FMT_LOCAL) {
rval = xfs_dir2_sf_lookup(args);
goto out_check_rval;
@@ -363,6 +365,7 @@ out_check_rval:
}
}
out_free:
+ xfs_iunlock(dp, lock_mode);
kmem_free(args);
return rval;
}
diff --git a/fs/xfs/xfs_dir2_readdir.c b/fs/xfs/xfs_dir2_readdir.c
index aead369..19863fe 100644
--- a/fs/xfs/xfs_dir2_readdir.c
+++ b/fs/xfs/xfs_dir2_readdir.c
@@ -190,6 +190,7 @@ xfs_dir2_block_getdents(
char *ptr; /* current data entry */
int wantoff; /* starting block offset */
xfs_off_t cook;
+ int lock_mode;
mp = dp->i_mount;
/*
@@ -198,7 +199,9 @@ xfs_dir2_block_getdents(
if (xfs_dir2_dataptr_to_db(mp, ctx->pos) > mp->m_dirdatablk)
return 0;
+ lock_mode = xfs_ilock_data_map_shared(dp);
error = xfs_dir3_block_read(NULL, dp, &bp);
+ xfs_iunlock(dp, lock_mode);
if (error)
return error;
@@ -552,9 +555,12 @@ xfs_dir2_leaf_getdents(
* current buffer, need to get another one.
*/
if (!bp || ptr >= (char *)bp->b_addr + mp->m_dirblksize) {
+ int lock_mode;
+ lock_mode = xfs_ilock_data_map_shared(dp);
error = xfs_dir2_leaf_readbuf(dp, bufsize, map_info,
&curoff, &bp);
+ xfs_iunlock(dp, lock_mode);
if (error || !map_info->map_valid)
break;
@@ -684,7 +690,7 @@ xfs_readdir(
ASSERT(S_ISDIR(dp->i_d.di_mode));
XFS_STATS_INC(xs_dir_getdents);
- lock_mode = xfs_ilock_data_map_shared(dp);
+ xfs_ilock(dp, XFS_IOLOCK_SHARED);
if (dp->i_d.di_format == XFS_DINODE_FMT_LOCAL)
rval = xfs_dir2_sf_getdents(dp, ctx);
else if ((rval = xfs_dir2_isblock(NULL, dp, &v)))
@@ -693,7 +699,7 @@ xfs_readdir(
rval = xfs_dir2_block_getdents(dp, ctx);
else
rval = xfs_dir2_leaf_getdents(dp, ctx, bufsize);
- xfs_iunlock(dp, lock_mode);
+ xfs_iunlock(dp, XFS_IOLOCK_SHARED);
return rval;
}
diff --git a/fs/xfs/xfs_filestream.c b/fs/xfs/xfs_filestream.c
index 12b6e77..a2b8b12 100644
--- a/fs/xfs/xfs_filestream.c
+++ b/fs/xfs/xfs_filestream.c
@@ -597,9 +597,6 @@ xfs_filestream_associate(
* waiting for the lock because someone else is waiting on the lock we
* hold and we cannot drop that as we are in a transaction here.
*
- * Lucky for us, this inversion is not a problem because it's a
- * directory inode that we are trying to lock here.
- *
* So, if we can't get the iolock without sleeping then just give up
*/
if (!xfs_ilock_nowait(pip, XFS_IOLOCK_EXCL))
@@ -746,13 +743,24 @@ xfs_filestream_new_ag(
* themselves and their parent to different AGs.
*
* Note that we lock the parent directory iolock inside the child
- * iolock here. That's fine as we never hold both parent and child
- * iolock in any other place. This is different from the ilock,
- * which requires locking of the child after the parent for namespace
- * operations.
+ * iolock here. That's fine as we never hold both parent and child
+ * iolock in any other place. However, we also hold the child ilock
+ * here, so we are locking inside that as well. This inverts directory
+ * iolock/child ilock order for operations like rename, and hence we
+ * cannot block in the parent iolock here.
+ *
+ * This is different from the ilock, which requires locking of the
+ * child after the parent for namespace operations. If we can't get
+ * the parent iolock, then just default to AG 0 and return.
*/
- if (pip)
- xfs_ilock(pip, XFS_IOLOCK_EXCL | XFS_IOLOCK_PARENT);
+ if (pip) {
+ if (!xfs_ilock_nowait(pip, XFS_IOLOCK_EXCL)) {
+ *agp = 0;
+ return 0;
+ }
+ }
/*
* A new AG needs to be found for the file. If the file's parent
diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index 7a1668b..422603d 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -321,10 +321,25 @@ int xfs_lock_delays;
static inline int
xfs_lock_inumorder(int lock_mode, int subclass)
{
- if (lock_mode & (XFS_IOLOCK_SHARED|XFS_IOLOCK_EXCL))
- lock_mode |= (subclass + XFS_LOCK_INUMORDER) << XFS_IOLOCK_SHIFT;
- if (lock_mode & (XFS_ILOCK_SHARED|XFS_ILOCK_EXCL))
- lock_mode |= (subclass + XFS_LOCK_INUMORDER) << XFS_ILOCK_SHIFT;
+ if (lock_mode & (XFS_IOLOCK_SHARED|XFS_IOLOCK_EXCL)) {
+ ASSERT(subclass <= XFS_IOLOCK_MAX_SUBCLASS);
+ subclass += XFS_IOLOCK_INUMORDER;
+ if (lock_mode & XFS_IOLOCK_PARENT) {
+ subclass += XFS_IOLOCK_INUMORDER_PARENT;
+ lock_mode &= ~XFS_IOLOCK_PARENT;
+ }
+ ASSERT(subclass < MAX_LOCKDEP_SUBCLASSES);
+ lock_mode |= subclass << XFS_IOLOCK_SHIFT;
+ }
+
+ if (lock_mode & (XFS_ILOCK_SHARED|XFS_ILOCK_EXCL)) {
+ ASSERT(subclass <= XFS_ILOCK_MAX_SUBCLASS);
+ ASSERT(!(lock_mode & XFS_ILOCK_PARENT));
+ subclass += XFS_ILOCK_INUMORDER;
+ ASSERT(subclass < MAX_LOCKDEP_SUBCLASSES);
+ lock_mode |= subclass << XFS_ILOCK_SHIFT;
+ }
+
return lock_mode;
}
@@ -584,9 +599,9 @@ xfs_lookup(
if (XFS_FORCED_SHUTDOWN(dp->i_mount))
return XFS_ERROR(EIO);
- lock_mode = xfs_ilock_data_map_shared(dp);
+ xfs_ilock(dp, XFS_IOLOCK_SHARED);
error = xfs_dir_lookup(NULL, dp, name, &inum, ci_name);
- xfs_iunlock(dp, lock_mode);
+ xfs_iunlock(dp, XFS_IOLOCK_SHARED);
if (error)
goto out;
@@ -1191,7 +1206,8 @@ xfs_create(
goto out_trans_cancel;
}
- xfs_ilock(dp, XFS_ILOCK_EXCL | XFS_ILOCK_PARENT);
+ xfs_ilock(dp, XFS_IOLOCK_EXCL | XFS_ILOCK_EXCL |
+ XFS_IOLOCK_PARENT | XFS_ILOCK_PARENT);
unlock_dp_on_error = true;
xfs_bmap_init(&free_list, &first_block);
@@ -1228,7 +1244,7 @@ xfs_create(
* the transaction cancel unlocking dp so don't do it explicitly in the
* error path.
*/
- xfs_trans_ijoin(tp, dp, XFS_ILOCK_EXCL);
+ xfs_trans_ijoin(tp, dp, XFS_IOLOCK_EXCL | XFS_ILOCK_EXCL);
unlock_dp_on_error = false;
error = xfs_dir_createname(tp, dp, name, ip->i_ino,
@@ -1301,7 +1317,7 @@ xfs_create(
xfs_qm_dqrele(pdqp);
if (unlock_dp_on_error)
- xfs_iunlock(dp, XFS_ILOCK_EXCL);
+ xfs_iunlock(dp, XFS_IOLOCK_EXCL | XFS_ILOCK_EXCL);
return error;
}
@@ -1348,10 +1364,11 @@ xfs_link(
goto error_return;
}
+ xfs_ilock(tdp, XFS_IOLOCK_EXCL | XFS_IOLOCK_PARENT);
xfs_lock_two_inodes(sip, tdp, XFS_ILOCK_EXCL);
xfs_trans_ijoin(tp, sip, XFS_ILOCK_EXCL);
- xfs_trans_ijoin(tp, tdp, XFS_ILOCK_EXCL);
+ xfs_trans_ijoin(tp, tdp, XFS_IOLOCK_EXCL | XFS_ILOCK_EXCL);
/*
* If we are using project inheritance, we only allow hard link
@@ -2454,9 +2471,10 @@ xfs_remove(
goto out_trans_cancel;
}
+ xfs_ilock(dp, XFS_IOLOCK_EXCL | XFS_IOLOCK_PARENT);
xfs_lock_two_inodes(dp, ip, XFS_ILOCK_EXCL);
- xfs_trans_ijoin(tp, dp, XFS_ILOCK_EXCL);
+ xfs_trans_ijoin(tp, dp, XFS_IOLOCK_EXCL | XFS_ILOCK_EXCL);
xfs_trans_ijoin(tp, ip, XFS_ILOCK_EXCL);
/*
@@ -2655,6 +2673,12 @@ xfs_rename(
* whether the target directory is the same as the source
* directory, we can lock from 2 to 4 inodes.
*/
+ if (!new_parent)
+ xfs_ilock(src_dp, XFS_IOLOCK_EXCL | XFS_IOLOCK_PARENT);
+ else
+ xfs_lock_two_inodes(src_dp, target_dp,
+ XFS_IOLOCK_EXCL | XFS_IOLOCK_PARENT);
+
xfs_lock_inodes(inodes, num_inodes, XFS_ILOCK_EXCL);
/*
@@ -2662,9 +2686,9 @@ xfs_rename(
* we can rely on either trans_commit or trans_cancel to unlock
* them.
*/
- xfs_trans_ijoin(tp, src_dp, XFS_ILOCK_EXCL);
+ xfs_trans_ijoin(tp, src_dp, XFS_IOLOCK_EXCL | XFS_ILOCK_EXCL);
if (new_parent)
- xfs_trans_ijoin(tp, target_dp, XFS_ILOCK_EXCL);
+ xfs_trans_ijoin(tp, target_dp, XFS_IOLOCK_EXCL | XFS_ILOCK_EXCL);
xfs_trans_ijoin(tp, src_ip, XFS_ILOCK_EXCL);
if (target_ip)
xfs_trans_ijoin(tp, target_ip, XFS_ILOCK_EXCL);
diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
index 5f421a1..f20d4d0 100644
--- a/fs/xfs/xfs_inode.h
+++ b/fs/xfs/xfs_inode.h
@@ -290,15 +290,20 @@ static inline int xfs_isiflocked(struct xfs_inode *ip)
#define XFS_LOCK_PARENT 1
#define XFS_LOCK_RTBITMAP 2
#define XFS_LOCK_RTSUM 3
-#define XFS_LOCK_INUMORDER 4
#define XFS_IOLOCK_SHIFT 16
#define XFS_IOLOCK_PARENT (XFS_LOCK_PARENT << XFS_IOLOCK_SHIFT)
+#define XFS_IOLOCK_INUMORDER (XFS_LOCK_PARENT + 1)
+#define XFS_IOLOCK_MAX_SUBCLASS 1
+#define XFS_IOLOCK_INUMORDER_PARENT \
+ (XFS_IOLOCK_INUMORDER + XFS_IOLOCK_MAX_SUBCLASS + 1)
#define XFS_ILOCK_SHIFT 24
#define XFS_ILOCK_PARENT (XFS_LOCK_PARENT << XFS_ILOCK_SHIFT)
#define XFS_ILOCK_RTBITMAP (XFS_LOCK_RTBITMAP << XFS_ILOCK_SHIFT)
#define XFS_ILOCK_RTSUM (XFS_LOCK_RTSUM << XFS_ILOCK_SHIFT)
+#define XFS_ILOCK_INUMORDER (XFS_LOCK_RTSUM + 1)
+#define XFS_ILOCK_MAX_SUBCLASS 3
#define XFS_IOLOCK_DEP_MASK 0x00ff0000
#define XFS_ILOCK_DEP_MASK 0xff000000
diff --git a/fs/xfs/xfs_symlink.c b/fs/xfs/xfs_symlink.c
index 5fda189..32c3278 100644
--- a/fs/xfs/xfs_symlink.c
+++ b/fs/xfs/xfs_symlink.c
@@ -249,7 +249,8 @@ xfs_symlink(
goto error_return;
}
- xfs_ilock(dp, XFS_ILOCK_EXCL | XFS_ILOCK_PARENT);
+ xfs_ilock(dp, XFS_IOLOCK_EXCL | XFS_ILOCK_EXCL |
+ XFS_IOLOCK_PARENT | XFS_ILOCK_PARENT);
unlock_dp_on_error = true;
/*
@@ -296,7 +297,7 @@ xfs_symlink(
* transaction cancel unlocking dp so don't do it explicitly in the
* error path.
*/
- xfs_trans_ijoin(tp, dp, XFS_ILOCK_EXCL);
+ xfs_trans_ijoin(tp, dp, XFS_IOLOCK_EXCL | XFS_ILOCK_EXCL);
unlock_dp_on_error = false;
/*
@@ -418,7 +419,7 @@ xfs_symlink(
xfs_qm_dqrele(pdqp);
if (unlock_dp_on_error)
- xfs_iunlock(dp, XFS_ILOCK_EXCL);
+ xfs_iunlock(dp, XFS_IOLOCK_EXCL | XFS_ILOCK_EXCL);
std_return:
return error;
}
* Re: Multi-CPU harmless lockdep on x86 while copying data
2014-03-10 2:55 ` Dave Chinner
@ 2014-03-10 10:37 ` Christoph Hellwig
2014-03-10 11:12 ` Christoph Hellwig
2014-03-10 20:46 ` Dave Chinner
2014-03-10 20:52 ` Ben Myers
1 sibling, 2 replies; 13+ messages in thread
From: Christoph Hellwig @ 2014-03-10 10:37 UTC
To: Dave Chinner; +Cc: Michael L. Semon, xfs-oss
On Mon, Mar 10, 2014 at 01:55:23PM +1100, Dave Chinner wrote:
> Changing the directory code to handle this sort of locking is going
> to require a bit of surgery. However, I can see advantages to moving
> directory data to the same locking strategy as regular file data -
> locking hierarchies are identical, directory ilock hold times are
> much reduced, we don't get lockdep whining about taking page faults
> with the ilock held, etc.
>
> A quick hack to demonstrate the high-level, initial step of using
> the IOLOCK for readdir serialisation. I've done a little smoke
> testing on it, so it won't die immediately. It should get rid of all
> the nasty lockdep issues, but it doesn't start to address the deeper
> restructuring that is needed.
What synchronization do we actually need from the iolock? Pushing the
ilock down to where it's actually needed is a good idea either way,
though.
> This would be a straightforward change, except for two things:
> filestreams and lockdep. The filestream allocator takes the
> directory iolock and makes assumptions about parent->child locking
> order of the iolock which will now be invalidated. Hence some
> changes to the filestreams code are needed to ensure that it never
> blocks on directory iolocks and deadlocks; instead it needs to fail
> stream associations when such problems occur.
I think the right fix is to stop abusing the iolock in filestreams.
To me it seems like a look inside fstrm_item_t should be fine
for what the filestreams code wants if I understand it correctly.
From looking over some of the filestreams code just for a few minutes
I get an urge to redo lots of it right now..
> @@ -1228,7 +1244,7 @@ xfs_create(
> * the transaction cancel unlocking dp so don't do it explicitly in the
> * error path.
> */
> - xfs_trans_ijoin(tp, dp, XFS_ILOCK_EXCL);
> + xfs_trans_ijoin(tp, dp, XFS_IOLOCK_EXCL | XFS_ILOCK_EXCL);
What do we need the iolock on these operations for?
* Re: Multi-CPU harmless lockdep on x86 while copying data
2014-03-10 10:37 ` Christoph Hellwig
@ 2014-03-10 11:12 ` Christoph Hellwig
2014-03-10 20:51 ` Dave Chinner
2014-03-10 20:46 ` Dave Chinner
1 sibling, 1 reply; 13+ messages in thread
From: Christoph Hellwig @ 2014-03-10 11:12 UTC
To: Dave Chinner; +Cc: Michael L. Semon, xfs-oss
On Mon, Mar 10, 2014 at 03:37:16AM -0700, Christoph Hellwig wrote:
> I think the right fix is to stop abusing the iolock in filestreams.
> To me it seems like a look inside fstrm_item_t should be fine
> for what the filestreams code wants if I understand it correctly.
Seems like the iolock could be removed fairly easily by using either of
the two options:
a) reference count fstrm_item, and just grab a reference to it for each
child as well as the parent and insert it multiple times. Kill
->pip.
b) only allocate and insert fstrm_items for directories. Find the
directory by grabbing an entry off inode->i_dentry and then grabbing
the parent. There should always be a dentry around when we allocate
blocks, and if there's none we can just skip out of the filestreams
allocator. For the cases that matter there is.
Both mean that the race it tries to protect against using the iolock is
removed entirely, and the code becomes more efficient as well. Option a)
seems simple to implement, but b) will save a lot more memory and
operations when using the filestreams allocator.
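For illustration, option a) amounts to little more than this (a
hypothetical sketch: the structure, field and helper names are made up
for this example and are not the existing fstrm_item layout):

#include <linux/atomic.h>
#include <linux/slab.h>

/*
 * Hypothetical sketch of option a): refcount the filestream item so it
 * can be hashed once per inode - parent and children alike - and the
 * ->pip back-pointer dropped.
 */
struct fstrm_item_sketch {
	atomic_t	ref;	/* one reference per hashed inode */
	int		ag;	/* AG the stream allocates from */
};

static void
fstrm_item_get(
	struct fstrm_item_sketch	*item)
{
	atomic_inc(&item->ref);
}

static void
fstrm_item_put(
	struct fstrm_item_sketch	*item)
{
	if (atomic_dec_and_test(&item->ref))
		kfree(item);
}

Each inode in the stream would then pin the same item and read the
same AG from it, so the parent lookup the iolock was guarding is no
longer needed.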
* Re: Multi-CPU harmless lockdep on x86 while copying data
2014-03-10 10:37 ` Christoph Hellwig
2014-03-10 11:12 ` Christoph Hellwig
@ 2014-03-10 20:46 ` Dave Chinner
2014-03-10 21:16 ` Ben Myers
1 sibling, 1 reply; 13+ messages in thread
From: Dave Chinner @ 2014-03-10 20:46 UTC
To: Christoph Hellwig; +Cc: Michael L. Semon, xfs-oss
On Mon, Mar 10, 2014 at 03:37:16AM -0700, Christoph Hellwig wrote:
> On Mon, Mar 10, 2014 at 01:55:23PM +1100, Dave Chinner wrote:
> > Changing the directory code to handle this sort of locking is going
> > to require a bit of surgery. However, I can see advantages to moving
> > directory data to the same locking strategy as regular file data -
> > locking hierarchies are identical, directory ilock hold times are
> > much reduced, we don't get lockdep whining about taking page faults
> > with the ilock held, etc.
> >
> > A quick hack to demonstrate the high-level, initial step of using
> > the IOLOCK for readdir serialisation. I've done a little smoke
> > testing on it, so it won't die immediately. It should get rid of all
> > the nasty lockdep issues, but it doesn't start to address the deeper
> > restructuring that is needed.
>
> What synchronization do we actually need from the iolock? Pushing the
> ilock down to where it's actually needed is a good idea either way,
> though.
The issue is that if we push the ilock down to just the block
mapping routines, the directory can be modified while the readdir is
in progress. That's the root problem that adding the ilock solved.
Now, just pushing the ilock down to protect the bmbt lookups might
result in a consistent lookup, but it won't serialise sanely against
modifications.
i.e. readdir only walks one dir block at a time, but it maps
multiple blocks for readahead and keeps them in a local array and
doesn't validate them again before issuing reads on those
buffers. Hence at a high level we currently have to serialise
readdir against all directory modifications.
The only other option we might have is to completely rewrite the
directory readahead code not to cache mappings. If we use the ilock
purely for bmbt lookup and buffer read, then the ilock will
serialise against modification, and the buffer lock will stabilise
the buffer until the readdir moves to the next buffer and picks the
ilock up again to read it.
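In outline, that rewrite would turn the readdir loop into something
like this (a sketch of the idea only: the loop condition and the
map-and-read/entry-emit helpers are placeholders, while
xfs_buf_relse() is the existing unlock-and-release call):

/*
 * No cached mapping array, and no iolock held at all.  Each directory
 * block is mapped and read under the ilock and comes back locked; the
 * buffer lock then keeps its contents stable while entries are emitted.
 */
int error = 0;

while (more_dir_blocks(dp, ctx)) {			/* placeholder */
	struct xfs_buf	*bp;
	int		lock_mode;

	lock_mode = xfs_ilock_data_map_shared(dp);
	error = map_and_read_dir_block(dp, ctx, &bp);	/* placeholder */
	xfs_iunlock(dp, lock_mode);
	if (error)
		break;

	emit_dir_entries(bp, ctx);	/* bp still locked: stable contents */
	xfs_buf_relse(bp);		/* unlock + release before next block */
}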
That would avoid the need for high level serialisation, but it's a
lot more work than using the iolock to provide the high level
serialisation and I'm still not sure it's 100% safe. And I've got no
idea if it would work for CXFS. Hopefully someone from SGI will
chime in here....
> > > This would be a straightforward change, except for two things:
> > filestreams and lockdep. The filestream allocator takes the
> > directory iolock and makes assumptions about parent->child locking
> > order of the iolock which will now be invalidated. Hence some
> > > changes to the filestreams code are needed to ensure that it never
> > > blocks on directory iolocks and deadlocks; instead it needs to fail
> > stream associations when such problems occur.
>
> I think the right fix is to stop abusing the iolock in filestreams.
> To me it seems like a look inside fstrm_item_t should be fine
> for what the filestreams code wants if I understand it correctly.
>
> From looking over some of the filestreams code just for a few minutes
> I get an urge to redo lots of it right now..
I get that urge from time to time, too. So far I've managed to avoid
it.
> > @@ -1228,7 +1244,7 @@ xfs_create(
> > * the transaction cancel unlocking dp so don't do it explicitly in the
> > * error path.
> > */
> > - xfs_trans_ijoin(tp, dp, XFS_ILOCK_EXCL);
> > + xfs_trans_ijoin(tp, dp, XFS_IOLOCK_EXCL | XFS_ILOCK_EXCL);
>
> What do we need the iolock on these operations for?
These are providing the high level readdir vs modification
serialisation protection. And we have to unlock it on transaction
commit, which is why it needs to be added to the xfs_trans_ijoin()
calls...
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
* Re: Multi-CPU harmless lockdep on x86 while copying data
2014-03-10 11:12 ` Christoph Hellwig
@ 2014-03-10 20:51 ` Dave Chinner
2014-03-11 16:48 ` Christoph Hellwig
0 siblings, 1 reply; 13+ messages in thread
From: Dave Chinner @ 2014-03-10 20:51 UTC
To: Christoph Hellwig; +Cc: Michael L. Semon, xfs-oss
On Mon, Mar 10, 2014 at 04:12:53AM -0700, Christoph Hellwig wrote:
> On Mon, Mar 10, 2014 at 03:37:16AM -0700, Christoph Hellwig wrote:
> > I think the right fix is to stop abusing the iolock in filestreams.
> > To me it seems like a look inside fstrm_item_t should be fine
> > for what the filestreams code wants if I understand it correctly.
>
> Seems like the iolock could be removed fairly easily by using either of
> the two options:
>
> a) reference count fstrm_item, and just grab a reference to it for each
> child as well as the parent and insert it multiple times. Kill
> ->pip.
> b) only allocate and insert fstrm_items for directories. Find the
> directory by grabbing an entry off inode->i_dentry and then grabbing
> the parent. There should always be a dentry around when we allocate
> blocks, and if there's none we can just skip out of the filestreams
> allocator. For the cases that matter there is.
>
> Both mean that the race it tries to protect against using the iolock is
> removed entirely, and the code becomes more efficient as well. Option a)
> seems simple to implement, but b) will save a lot more memory and
> operations when using the filestreams allocator.
Yeah, b) seems like the way to simplify it - the filestreams code
really only needs to track the parent/ag relationship rather than
the child/parent relationship if there is a reliable way of
determining the parent from the child.
What do we do with hardlinked files in this case? I'm happy to say
"too bad" for these files mainly because the filestream allocator is
aimed at associating multiple file creations together, so hard links
really don't matter AFAICT...
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
* Re: Multi-CPU harmless lockdep on x86 while copying data
2014-03-10 2:55 ` Dave Chinner
2014-03-10 10:37 ` Christoph Hellwig
@ 2014-03-10 20:52 ` Ben Myers
2014-03-10 21:20 ` Dave Chinner
1 sibling, 1 reply; 13+ messages in thread
From: Ben Myers @ 2014-03-10 20:52 UTC
To: Dave Chinner; +Cc: Michael L. Semon, xfs-oss
Hi Dave,
On Mon, Mar 10, 2014 at 01:55:23PM +1100, Dave Chinner wrote:
> On Sat, Mar 08, 2014 at 09:58:01PM -0500, Michael L. Semon wrote:
> > Hi! I was playing a shell game with some precious backup data. It
> > went like this:
> >
> > open a 36-GB source partition (DM linear, partitions from 2 drives,
> > v5-superblock XFS)
> >
> > open a 32-GB aes-xts crypt (v4-superblock XFS)
> >
> > `cp -av` from holding partition to crypt
> >
> > It was during the cp operation that I got this multi-CPU version of
> > the harmless lockdep:
> >
> > =========================================================
> > [ INFO: possible irq lock inversion dependency detected ]
> > 3.14.0-rc5+ #6 Not tainted
> > ---------------------------------------------------------
> > kswapd0/25 just changed the state of lock:
> > (&xfs_dir_ilock_class){++++-.}, at: [<791c09fa>] xfs_ilock+0xff/0x16c
> > but this lock took another, RECLAIM_FS-unsafe lock in the past:
> > (&mm->mmap_sem){++++++}
> >
> > and interrupts could create inverse lock ordering between them.
> >
> >
> > other info that might help us debug this:
> > Possible interrupt unsafe locking scenario:
> >
> > >        CPU0                    CPU1
> > >        ----                    ----
> > >   lock(&mm->mmap_sem);
> > >                                local_irq_disable();
> > >                                lock(&xfs_dir_ilock_class);
> > >                                lock(&mm->mmap_sem);
> > >   <Interrupt>
> > >     lock(&xfs_dir_ilock_class);
>
> Well, that's rather interesting. I'd love to know where we take the
> > directory ilock with interrupts disabled - and then take a page
> fault... :/
>
> ....
>
> > }
> > ... key at: [<795f1eec>] __key.44037+0x0/0x8
> > ... acquired at:
> > [<79064360>] __lock_acquire+0x397/0xa6c
> > [<7906510f>] lock_acquire+0x8b/0x101
> > [<790d1718>] might_fault+0x81/0xa9
> > [<790f379a>] filldir64+0x92/0xe2
> > [<7916f079>] xfs_dir2_sf_getdents+0x1a0/0x44c
> > [<7917009e>] xfs_readdir+0xc4/0x126
> > [<79171b8b>] xfs_file_readdir+0x25/0x3e
> > [<790f385a>] iterate_dir+0x70/0x9b
> > [<790f3a31>] SyS_getdents64+0x6d/0xcc
> > [<792f85b8>] sysenter_do_call+0x12/0x36
>
> All of these paths are from the syscall path, except for one in
> kswapd context. I don't see any stack trace here that could result
> in the above "deadlock". It looks to be yet another false
> positive...
>
> > I call it harmless because its single-CPU variant can be reproduced
> > as far back as I could bisect in earlier testing (way before
> > kernel 2.6.20). However, such lockdep splats have never manifested
> > in a noticeable problem on production kernels on x86. Either
> > down_write_nested or down_read_nested is in all of them, depending
> > on which kernel version is in use. At least one reclaim-related
> > function is in there as well. vm_map_ram used to be in there, but Dave
> > took care of that (thanks!).
> >
> > This particular splat has been showing up more often, though. It's not
> > tied to one particular commit, event, or change; just something that
> > has crept in gradually since maybe kernel 3.11. It's like watching
> > grass grow or watching paint dry.
>
> > The above reports all indicate that taking a page fault while
> holding the directory ilock is problematic. So have all the other
> new lockdep reports we've got that have been caused by that change.
>
> Realistically, I think that the readdir locking change was
> fundamentally broken. Why? Because taking page faults while holding
> the ilock violates the lock order we have for inodes. For regular
> > files, the lock hierarchy is iolock -> page fault -> ilock, and we
> got rid of the need for the iolock in the inode reclaim path because
> of all the problems it caused lockdep. Now we've gone and
> reintroduced that same set of problems - only worse - by allowing
> the readdir code to take page faults while holding the ilock....
>
> I'd like to revert the change that introduced the ilock into the
> > readdir path, but I suspect that would be an unpopular direction to
> take. However, we need a solution that doesn't drive us insane with
> lockdep false positives.
>
> IOWs, we need to do this differently - readdir needs to be protected
> by the iolock, not the ilock, and only the block mapping routines in
> the readdir code should be protected by the ilock. i.e. the same
> locking strategy that is used for regular files: iolock protects the
> contents of the file, the ilock protects the inode metadata.
>
> What I think we really need to do is drive the ilock all the way
> into the block mapping functions of the directory code and to the
> places where the inode metadata is actually going to be modified.
> The iolock protects access to the directory data, and logging of
> changes to those blocks is serialised by the buffer locks, so we
> don't actually need the inode ilock to serialise access to the
> directory data - we only need the ilock for modifications to the
> > inode and its block maps, same as for regular files.
>
> Changing the directory code to handle this sort of locking is going
> to require a bit of surgery. However, I can see advantages to moving
> directory data to the same locking strategy as regular file data -
> > locking hierarchies are identical, directory ilock hold times are
> much reduced, we don't get lockdep whining about taking page faults
> with the ilock held, etc.
>
> > A quick hack to demonstrate the high-level, initial step of using
> the IOLOCK for readdir serialisation. I've done a little smoke
> testing on it, so it won't die immediately. It should get rid of all
> the nasty lockdep issues, but it doesn't start to address the deeper
> > restructuring that is needed.
>
> Thoughts, comments? I'm especially interested in what SGI people
> have to say here because this locking change was needed for CXFS,
> > and changing the directory locking hierarchy is probably going to
> upset CXFS a lot...
Protecting directory contents with the iolock and pushing the ilock down
to protect the directory block mapping functions seems like a good
strategy to me. In this area I think it's good to treat directories the
same as regular files. I have not had a very close look yet... and
couldn't find where we take a fault with irqs disabled. Maybe something
out of a workqueue?
Thanks,
Ben
* Re: Multi-CPU harmless lockdep on x86 while copying data
2014-03-10 20:46 ` Dave Chinner
@ 2014-03-10 21:16 ` Ben Myers
2014-03-10 21:24 ` Dave Chinner
0 siblings, 1 reply; 13+ messages in thread
From: Ben Myers @ 2014-03-10 21:16 UTC
To: Dave Chinner; +Cc: Christoph Hellwig, Michael L. Semon, xfs-oss
Hi,
On Tue, Mar 11, 2014 at 07:46:47AM +1100, Dave Chinner wrote:
> On Mon, Mar 10, 2014 at 03:37:16AM -0700, Christoph Hellwig wrote:
> > On Mon, Mar 10, 2014 at 01:55:23PM +1100, Dave Chinner wrote:
> > > Changing the directory code to handle this sort of locking is going
> > > to require a bit of surgery. However, I can see advantages to moving
> > > directory data to the same locking strategy as regular file data -
> > > locking hierarchies are identical, directory ilock hold times are
> > > much reduced, we don't get lockdep whining about taking page faults
> > > with the ilock held, etc.
> > >
> > > A quick hack to demonstrate the high-level, initial step of using
> > > the IOLOCK for readdir serialisation. I've done a little smoke
> > > testing on it, so it won't die immediately. It should get rid of all
> > > the nasty lockdep issues, but it doesn't start to address the deeper
> > > restructuring that is needed.
> >
> > What synchronization do we actually need from the iolock? Pushing the
> > ilock down to where it's actually needed is a good idea either way,
> > though.
>
> The issue is that if we push the ilock down to just the block
> mapping routines, the directory can be modified while the readdir is
> in progress. That's the root problem that adding the ilock solved.
> Now, just pushing the ilock down to protect the bmbt lookups might
> result in a consistent lookup, but it won't serialise sanely against
> modifications.
>
> i.e. readdir only walks one dir block at a time, but it maps
> multiple blocks for readahead and keeps them in a local array and
> doesn't validate them again before issuing reads on those
> buffers. Hence at a high level we currently have to serialise
> readdir against all directory modifications.
>
> The only other option we might have is to completely rewrite the
> directory readahead code not to cache mappings. If we use the ilock
> purely for bmbt lookup and buffer read, then the ilock will
> serialise against modification, and the buffer lock will stabilise
> the buffer until the readdir moves to the next buffer and picks the
> ilock up again to read it.
>
> That would avoid the need for high level serialisation, but it's a
> lot more work than using the iolock to provide the high level
> serialisation and I'm still not sure it's 100% safe. And I've got no
> idea if it would work for CXFS. Hopefully someone from SGI will
> chime in here....
Also in leaf and node formats a single modification can change multiple
buffers, so I suspect the buffer lock isn't enough serialization to maintain a
consistent directory in the face of multiple readers and writers. The iolock
does resolve that issue.
> > > This would be a straightforward change, except for two things:
> > > filestreams and lockdep. The filestream allocator takes the
> > > directory iolock and makes assumptions about parent->child locking
> > > order of the iolock which will now be invalidated. Hence some
> > > changes to the filestreams code are needed to ensure that it never
> > > blocks on directory iolocks and deadlocks; instead it needs to fail
> > > stream associations when such problems occur.
> >
> > I think the right fix is to stop abusing the iolock in filestreams.
> > To me it seems like a look inside fstrm_item_t should be fine
> > for what the filestreams code wants if I understand it correctly.
> >
> > From looking over some of the filestreams code just for a few minutes
> > I get an urge to redo lots of it right now..
>
> I get that urge from time to time, too. So far I've managed to avoid
> it.
>
> > > @@ -1228,7 +1244,7 @@ xfs_create(
> > > * the transaction cancel unlocking dp so don't do it explicitly in the
> > > * error path.
> > > */
> > > - xfs_trans_ijoin(tp, dp, XFS_ILOCK_EXCL);
> > > + xfs_trans_ijoin(tp, dp, XFS_IOLOCK_EXCL | XFS_ILOCK_EXCL);
> >
> > What do we need the iolock on these operations for?
>
> These are providing the high level readdir vs modification
> serialisation protection. And we have to unlock it on transaction
> commit, which is why it needs to be added to the xfs_trans_ijoin()
> calls...
Makes sense, I think. I'm not sure what the changes to the directory code
would look like.
-Ben
* Re: Multi-CPU harmless lockdep on x86 while copying data
2014-03-10 20:52 ` Ben Myers
@ 2014-03-10 21:20 ` Dave Chinner
2014-03-10 21:30 ` Ben Myers
0 siblings, 1 reply; 13+ messages in thread
From: Dave Chinner @ 2014-03-10 21:20 UTC
To: Ben Myers; +Cc: Michael L. Semon, xfs-oss
On Mon, Mar 10, 2014 at 03:52:49PM -0500, Ben Myers wrote:
> Hi Dave,
>
> On Mon, Mar 10, 2014 at 01:55:23PM +1100, Dave Chinner wrote:
> > On Sat, Mar 08, 2014 at 09:58:01PM -0500, Michael L. Semon wrote:
> > > Hi! I was playing a shell game with some precious backup data. It
> > > went like this:
> > >
> > > open a 36-GB source partition (DM linear, partitions from 2 drives,
> > > v5-superblock XFS)
> > >
> > > open a 32-GB aes-xts crypt (v4-superblock XFS)
> > >
> > > `cp -av` from holding partition to crypt
> > >
> > > It was during the cp operation that I got this multi-CPU version of
> > > the harmless lockdep:
> > >
> > > =========================================================
> > > [ INFO: possible irq lock inversion dependency detected ]
> > > 3.14.0-rc5+ #6 Not tainted
> > > ---------------------------------------------------------
> > > kswapd0/25 just changed the state of lock:
> > > (&xfs_dir_ilock_class){++++-.}, at: [<791c09fa>] xfs_ilock+0xff/0x16c
> > > but this lock took another, RECLAIM_FS-unsafe lock in the past:
> > > (&mm->mmap_sem){++++++}
> > >
> > > and interrupts could create inverse lock ordering between them.
> > >
> > >
> > > other info that might help us debug this:
> > > Possible interrupt unsafe locking scenario:
> > >
> > >        CPU0                    CPU1
> > >        ----                    ----
> > >   lock(&mm->mmap_sem);
> > >                                local_irq_disable();
> > >                                lock(&xfs_dir_ilock_class);
> > >                                lock(&mm->mmap_sem);
> > >   <Interrupt>
> > >     lock(&xfs_dir_ilock_class);
> >
> > Well, that's rather interesting. I'd love to know where we take the
> > directory ilock with interrupts disabled - and then take a page
> > fault... :/
.....
> > Realistically, I think that the readdir locking change was
> > fundamentally broken. Why? Because taking page faults while holding
> > the ilock violates the lock order we have for inodes. For regular
> > files, the lock hierarchy is iolock -> page fault -> ilock, and we
> > got rid of the need for the iolock in the inode reclaim path because
> > of all the problems it caused lockdep. Now we've gone and
> > reintroduced that same set of problems - only worse - by allowing
> > the readdir code to take page faults while holding the ilock....
> >
> > I'd like to revert the change that introduced the ilock into the
> > readdir path, but I suspect that would be an unpopular direction to
> > take. However, we need a solution that doesn't drive us insane with
> > lockdep false positives.
> >
> > IOWs, we need to do this differently - readdir needs to be protected
> > by the iolock, not the ilock, and only the block mapping routines in
> > the readdir code should be protected by the ilock. i.e. the same
> > locking strategy that is used for regular files: iolock protects the
> > contents of the file, the ilock protects the inode metadata.
> >
> > What I think we really need to do is drive the ilock all the way
> > into the block mapping functions of the directory code and to the
> > places where the inode metadata is actually going to be modified.
> > The iolock protects access to the directory data, and logging of
> > changes to those blocks is serialised by the buffer locks, so we
> > don't actually need the inode ilock to serialise access to the
> > directory data - we only need the ilock for modifications to the
> > inode and its block maps, same as for regular files.
> >
> > Changing the directory code to handle this sort of locking is going
> > to require a bit of surgery. However, I can see advantages to moving
> > directory data to the same locking strategy as regular file data -
> > locking hierarchies are identical, directory ilock hold times are
> > much reduced, we don't get lockdep whining about taking page faults
> > with the ilock held, etc.
> >
> > A quick hack to demonstrate the high-level, initial step of using
> > the IOLOCK for readdir serialisation. I've done a little smoke
> > testing on it, so it won't die immediately. It should get rid of all
> > the nasty lockdep issues, but it doesn't start to address the deeper
> > restructuring that is needed.
> >
> > Thoughts, comments? I'm especially interested in what SGI people
> > have to say here because this locking change was needed for CXFS,
> > and changing the directory locking hierarchy is probably going to
> > upset CXFS a lot...
>
> Protecting directory contents with the iolock and pushing the ilock down
> to protect the directory block mapping functions seems like a good
> strategy to me.
Ok, it seems like a good strategy, but does it work for CXFS? What
about the other option of a combination of ilock for mapping and
read, and buffer locks for stabilisation?
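For concreteness, the iolock strategy amounts to little more than a
wrapper around the existing readdir. A sketch - not the actual patch,
and the bufsize calculation is simplified:

STATIC int
xfs_file_readdir(
	struct file		*file,
	struct dir_context	*ctx)
{
	struct xfs_inode	*ip = XFS_I(file_inode(file));
	size_t			bufsize = (size_t)XFS_ISIZE(ip);
	int			error;

	/* the iolock now protects the directory contents, exactly as
	 * it protects the data of a regular file */
	xfs_ilock(ip, XFS_IOLOCK_SHARED);
	error = xfs_readdir(ip, ctx, bufsize);
	xfs_iunlock(ip, XFS_IOLOCK_SHARED);
	return error;
}

The combination option would instead push the locking down into the
per-block read loop inside xfs_readdir itself.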
> In this area I think it's good to treat directories the
> same as regular files. I have not had a very close look yet... and
> couldn't find where we take a fault with irqs disabled. Maybe something
> out of a workqueue?
Workqueues don't run with interrupts disabled. Yes, we take inode
locks in workqueue context (e.g. AIL tail pushing), but that's not
what is being warned about here. And the page fault only comes from
the filldir callback which runs copy_to_user(), and that most
definitely does not occur in interrupt context.
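To be explicit about where the fault comes from - a simplified sketch
of what the filldir actor in fs/readdir.c does (filldir_sketch is a
stand-in; the dirent bookkeeping is elided):

static int filldir_sketch(void *buf, const char *name, int namlen,
			  loff_t offset, u64 ino, unsigned int d_type)
{
	struct linux_dirent __user *dirent = buf;

	/* copying the name out to the user buffer can page fault and
	 * take mmap_sem - always in process context */
	if (copy_to_user(dirent->d_name, name, namlen))
		return -EFAULT;
	return 0;
}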
FWIW, it's invalid to take the ilock in interrupt context at all,
because it's a sleeping lock - unless you use trylock semantics. With
"trylock" semantics lockdep doesn't complain, because a lock you never
block on can't deadlock. I *think* that lockdep is actually complaining
about reclaim contexts, not interrupts, and it's issuing this
"interrupts disabled" deadlock warning because the memory reclaim state
tracking is piggy-backed on the interrupt state tracking within
lockdep.
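In sketch form, the trylock pattern that lockdep tolerates looks like
this (illustrative fragment; xfs_ilock_nowait() is the real interface):

	if (!xfs_ilock_nowait(ip, XFS_ILOCK_EXCL))
		return EAGAIN;	/* back off - we never block, so no deadlock */
	/* work on the locked inode goes here */
	xfs_iunlock(ip, XFS_ILOCK_EXCL);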
Either way, though, lockdep is wrong.
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
* Re: Multi-CPU harmless lockdep on x86 while copying data
2014-03-10 21:16 ` Ben Myers
@ 2014-03-10 21:24 ` Dave Chinner
2014-03-10 22:10 ` Ben Myers
0 siblings, 1 reply; 13+ messages in thread
From: Dave Chinner @ 2014-03-10 21:24 UTC (permalink / raw)
To: Ben Myers; +Cc: Christoph Hellwig, Michael L. Semon, xfs-oss
On Mon, Mar 10, 2014 at 04:16:58PM -0500, Ben Myers wrote:
> Hi,
>
> On Tue, Mar 11, 2014 at 07:46:47AM +1100, Dave Chinner wrote:
> > On Mon, Mar 10, 2014 at 03:37:16AM -0700, Christoph Hellwig wrote:
> > > On Mon, Mar 10, 2014 at 01:55:23PM +1100, Dave Chinner wrote:
> > > > Changing the directory code to handle this sort of locking is going
> > > > to require a bit of surgery. However, I can see advantages to moving
> > > > directory data to the same locking strategy as regular file data -
> > > > locking hierarchies are identical, directory ilock hold times are
> > > > much reduced, we don't get lockdep whining about taking page faults
> > > > with the ilock held, etc.
> > > >
> > > > A quick hack to demonstrate the high-level, initial step of using
> > > > the IOLOCK for readdir serialisation. I've done a little smoke
> > > > testing on it, so it won't die immediately. It should get rid of all
> > > > the nasty lockdep issues, but it doesn't start to address the deeper
> > > > restructuring that is needed.
> > >
> > > What synchronization do we actually need from the iolock? Pushing the
> > > ilock down to where it's actually needed is a good idea either way,
> > > though.
> >
> > The issue is that if we push the ilock down to just the block
> > mapping routines, the directory can be modified while the readdir is
> > in progress. That's the root problem that adding the ilock solved.
> > Now, just pushing the ilock down to protect the bmbt lookups might
> > result in a consistent lookup, but it won't serialise sanely against
> > modifications.
> >
> > i.e. readdir only walks one dir block at a time but
> > it maps multiple blocks for readahead and keeps them in a local
> > array and doesn't validate them again before issuing reads on those
> > buffers. Hence at a high level we currently have to serialise
> > readdir against all directory modifications.
> >
> > The only other option we might have is to completely rewrite the
> > directory readahead code not to cache mappings. If we use the ilock
> > purely for bmbt lookup and buffer read, then the ilock will
> > serialise against modification, and the buffer lock will stabilise
> > the buffer until the readdir moves to the next buffer and picks the
> > ilock up again to read it.
> >
> > That would avoid the need for high level serialisation, but it's a
> > lot more work than using the iolock to provide the high level
> > serialisation and I'm still not sure it's 100% safe. And I've got no
> > idea if it would work for CXFS. Hopefully someone from SGI will
> > chime in here....
>
> Also in leaf and node formats a single modification can change multiple
> buffers, so I suspect the buffer lock isn't enough serialization to maintain a
> consistent directory in the face of multiple readers and writers. The iolock
> does resolve that issue.
Right, but we don't care about anything other than that the leaf block
we are currently reading is consistent when the read starts and stays
consistent across the entire processing. i.e. if the leaf is locked by
readdir, then the modification is completely stalled until the
readdir lets it go. And readdir then can't get the next buffer until
the modification is complete because it blocks on the ilock to get
the next mapping and buffer....
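In sketch form, with next_dirblk() and format_dirents() as placeholders
for the existing readdir guts:

	while (!error && next_dirblk(dp, ctx, &dirblk)) {
		xfs_ilock(dp, XFS_ILOCK_SHARED);	/* vs. modification */
		error = xfs_dir3_data_read(NULL, dp, dirblk, -1, &bp);
		xfs_iunlock(dp, XFS_ILOCK_SHARED);
		if (error)
			break;
		/* bp is locked: the block cannot change while we format
		 * its entries and copy them to userspace */
		error = format_dirents(ctx, bp);
		xfs_trans_brelse(NULL, bp);
	}

No readahead mappings survive across iterations, which is the rewrite
this option requires.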
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
* Re: Multi-CPU harmless lockdep on x86 while copying data
2014-03-10 21:20 ` Dave Chinner
@ 2014-03-10 21:30 ` Ben Myers
0 siblings, 0 replies; 13+ messages in thread
From: Ben Myers @ 2014-03-10 21:30 UTC (permalink / raw)
To: Dave Chinner; +Cc: Michael L. Semon, xfs-oss
Hi Dave,
On Tue, Mar 11, 2014 at 08:20:27AM +1100, Dave Chinner wrote:
> On Mon, Mar 10, 2014 at 03:52:49PM -0500, Ben Myers wrote:
> > Hi Dave,
> >
> > On Mon, Mar 10, 2014 at 01:55:23PM +1100, Dave Chinner wrote:
> > > On Sat, Mar 08, 2014 at 09:58:01PM -0500, Michael L. Semon wrote:
> > > > Hi! I was playing a shell game with some precious backup data. It
> > > > went like this:
> > > >
> > > > open a 36-GB source partition (DM linear, partitions from 2 drives,
> > > > v5-superblock XFS)
> > > >
> > > > open a 32-GB aes-xts crypt (v4-superblock XFS)
> > > >
> > > > `cp -av` from holding partition to crypt
> > > >
> > > > It was during the cp operation that I got this multi-CPU version of
> > > > the harmless lockdep:
> > > >
> > > > =========================================================
> > > > [ INFO: possible irq lock inversion dependency detected ]
> > > > 3.14.0-rc5+ #6 Not tainted
> > > > ---------------------------------------------------------
> > > > kswapd0/25 just changed the state of lock:
> > > > (&xfs_dir_ilock_class){++++-.}, at: [<791c09fa>] xfs_ilock+0xff/0x16c
> > > > but this lock took another, RECLAIM_FS-unsafe lock in the past:
> > > > (&mm->mmap_sem){++++++}
> > > >
> > > > and interrupts could create inverse lock ordering between them.
> > > >
> > > >
> > > > other info that might help us debug this:
> > > > Possible interrupt unsafe locking scenario:
> > > >
> > > > CPU0 CPU1
> > > > ---- ----
> > > > lock(&mm->mmap_sem);
> > > > local_irq_disable();
> > > > lock(&xfs_dir_ilock_class);
> > > > lock(&mm->mmap_sem);
> > > > <Interrupt>
> > > > lock(&xfs_dir_ilock_class);
> > >
> > > Well, that's rather interesting. I'd love to know where we take the
> > > directory ilock with interrupts disabled - and then take a page
> > > fault... :/
> .....
> > > Realistically, I think that the readdir locking change was
> > > fundamentally broken. Why? Because taking page faults while holding
> > > the ilock violates the lock order we have for inodes. For regular
> > > files, the lock hierarchy is iolock -> page fault -> ilock, and we
> > > got rid of the need for the iolock in the inode reclaim path because
> > > of all the problems it caused lockdep. Now we've gone and
> > > reintroduced that same set of problems - only worse - by allowing
> > > the readdir code to take page faults while holding the ilock....
> > >
> > > I'd like to revert the change that introduced the ilock into the
> > > readdir path, but I suspect that would be an unpopular direction to
> > > take. However, we need a solution that doesn't drive us insane with
> > > lockdep false positives.
> > >
> > > IOWs, we need to do this differently - readdir needs to be protected
> > > by the iolock, not the ilock, and only the block mapping routines in
> > > the readdir code should be protected by the ilock. i.e. the same
> > > locking strategy that is used for regular files: iolock protects the
> > > contents of the file, the ilock protects the inode metadata.
> > >
> > > What I think we really need to do is drive the ilock all the way
> > > into the block mapping functions of the directory code and to the
> > > places where the inode metadata is actually going to be modified.
> > > The iolock protects access to the directory data, and logging of
> > > changes to those blocks is serialised by the buffer locks, so we
> > > don't actually need the inode ilock to serialise access to the
> > > directory data - we only need the ilock for modifications to the
> > > inode and its block maps, same as for regular files.
> > >
> > > Changing the directory code to handle this sort of locking is going
> > > to require a bit of surgery. However, I can see advantages to moving
> > > directory data to the same locking strategy as regular file data -
> > > locking hierarchies are identical, directory ilock hold times are
> > > much reduced, we don't get lockdep whining about taking page faults
> > > with the ilock held, etc.
> > >
> > > A quick hack to demonstrate the high-level, initial step of using
> > > the IOLOCK for readdir serialisation. I've done a little smoke
> > > testing on it, so it won't die immediately. It should get rid of all
> > > the nasty lockdep issues, but it doesn't start to address the deeper
> > > restructuring that is needed.
> > >
> > > Thoughts, comments? I'm especially interested in what SGI people
> > > have to say here because this locking change was needed for CXFS,
> > > and changing the directory locking hierarchy is probably going to
> > > upset CXFS a lot...
> >
> > Protecting directory contents with the iolock and pushing the ilock down
> > to protect the directory block mapping functions seems like a good
> > strategy to me.
>
> Ok, it seems like a good strategy, but does it work for CXFS?
Yes, it should work for CXFS.
> What about the other option of a combination of ilock for mapping and
> read, and buffer locks for stabilisation?
I think this option probably won't work because of those cases where you
modify >1 buffer for a single directory modification, e.g. in a node
directory you might modify a buffer each for the data, freeindex, node,
and leaf blocks.
> > In this area I think it's good to treat directories the
> > same as regular files. I have not had a very close look yet... and
> > couldn't find where we take a fault with irqs disabled. Maybe something
> > out of a workqueue?
>
> Workqueues don't run with interrupts disabled. Yes, we take inode
> locks in workqueue context (e.g. AIL tail pushing), but that's not
> what is being warned about here. And the page fault only comes from
> the filldir callback which runs copy_to_user(), and that most
> definitely does not occur in interrupt context.
>
> FWIW, it's invalid to take the ilock in interrupt context at all,
> because it's a sleeping lock - unless you use trylock semantics. With
> "trylock" semantics lockdep doesn't complain, because a lock you never
> block on can't deadlock. I *think* that lockdep is actually complaining
> about reclaim contexts, not interrupts, and it's issuing this
> "interrupts disabled" deadlock warning because the memory reclaim state
> tracking is piggy-backed on the interrupt state tracking within
> lockdep.
> Either way, though, lockdep is wrong.
Yeah, that sounds plausible.
-Ben
* Re: Multi-CPU harmless lockdep on x86 while copying data
2014-03-10 21:24 ` Dave Chinner
@ 2014-03-10 22:10 ` Ben Myers
0 siblings, 0 replies; 13+ messages in thread
From: Ben Myers @ 2014-03-10 22:10 UTC (permalink / raw)
To: Dave Chinner; +Cc: Christoph Hellwig, Michael L. Semon, xfs-oss
On Tue, Mar 11, 2014 at 08:24:30AM +1100, Dave Chinner wrote:
> On Mon, Mar 10, 2014 at 04:16:58PM -0500, Ben Myers wrote:
> > Hi,
> >
> > On Tue, Mar 11, 2014 at 07:46:47AM +1100, Dave Chinner wrote:
> > > On Mon, Mar 10, 2014 at 03:37:16AM -0700, Christoph Hellwig wrote:
> > > > On Mon, Mar 10, 2014 at 01:55:23PM +1100, Dave Chinner wrote:
> > > > > Changing the directory code to handle this sort of locking is going
> > > > > to require a bit of surgery. However, I can see advantages to moving
> > > > > directory data to the same locking strategy as regular file data -
> > > > > locking hierarchies are identical, directory ilock hold times are
> > > > > much reduced, we don't get lockdep whining about taking page faults
> > > > > with the ilock held, etc.
> > > > >
> > > > > A quick hack to demonstrate the high-level, initial step of using
> > > > > the IOLOCK for readdir serialisation. I've done a little smoke
> > > > > testing on it, so it won't die immediately. It should get rid of all
> > > > > the nasty lockdep issues, but it doesn't start to address the deeper
> > > > > restructuring that is needed.
> > > >
> > > > What synchronization do we actually need from the iolock? Pushing the
> > > > ilock down to where it's actually needed is a good idea either way,
> > > > though.
> > >
> > > The issue is that if we push the ilock down to just the block
> > > mapping routines, the directory can be modified while the readdir is
> > > in progress. That's the root problem that adding the ilock solved.
> > > Now, just pushing the ilock down to protect the bmbt lookups might
> > > result in a consistent lookup, but it won't serialise sanely against
> > > modifications.
> > >
> > > i.e. readdir only walks one dir block at a time but
> > > it maps multiple blocks for readahead and keeps them in a local
> > > array and doesn't validate them again before issuing reads on those
> > > buffers. Hence at a high level we currently have to serialise
> > > readdir against all directory modifications.
> > >
> > > The only other option we might have is to completely rewrite the
> > > directory readahead code not to cache mappings. If we use the ilock
> > > purely for bmbt lookup and buffer read, then the ilock will
> > > serialise against modification, and the buffer lock will stabilise
> > > the buffer until the readdir moves to the next buffer and picks the
> > > ilock up again to read it.
> > >
> > > That would avoid the need for high level serialisation, but it's a
> > > lot more work than using the iolock to provide the high level
> > > serialisation and I'm still not sure it's 100% safe. And I've got no
> > > idea if it would work for CXFS. Hopefully someone from SGI will
> > > chime in here....
> >
> > Also in leaf and node formats a single modification can change multiple
> > buffers, so I suspect the buffer lock isn't enough serialization to maintain a
> > consistent directory in the face of multiple readers and writers. The iolock
> > does resolve that issue.
>
> Right, but we don't care about anything other than that the leaf block
> we are currently reading is consistent when the read starts and stays
> consistent across the entire processing. i.e. if the leaf is locked by
> readdir, then the modification is completely stalled until the
> readdir lets it go. And readdir then can't get the next buffer until
> the modification is complete because it blocks on the ilock to get
> the next mapping and buffer....
As long as (as you pointed out above) the readahead buffers aren't
cached, and all of the callers that do require the
data/freeindex/node/leaf blocks to be consistent continue to take the
ilock... yeah, I think that might work.
-Ben
* Re: Multi-CPU harmless lockdep on x86 while copying data
2014-03-10 20:51 ` Dave Chinner
@ 2014-03-11 16:48 ` Christoph Hellwig
0 siblings, 0 replies; 13+ messages in thread
From: Christoph Hellwig @ 2014-03-11 16:48 UTC (permalink / raw)
To: Dave Chinner; +Cc: Michael L. Semon, xfs-oss
On Tue, Mar 11, 2014 at 07:51:18AM +1100, Dave Chinner wrote:
> Yeah, b) seems like the way to simplify it - the filestreams code
> really only needs to track the parent/ag relationship rather than
> the child/parent relationship if there is a reliable way of
> determining the parent from the child.
>
> What do we do with hardlinked files in this case? I'm happy to say
> "too bad" for these files mainly because the filestream allocator is
> aimed at associating multiple file creations together, so hard links
> really don't matter AFAICT...
We could flatly refuse filestreams for files with i_nlink > 1, or just
grab the first parent we get. Refusing seems like the better
alternative to me.
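In sketch form, the check would sit at the top of the association path
(the exact hook point here is illustrative):

	/* hardlinked files have no unique parent - don't stream them */
	if (VFS_I(ip)->i_nlink > 1)
		return 0;

and the filestream allocator would simply fall back to the normal
allocation policy for anything with multiple links.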
Thread overview: 13 messages
2014-03-09 2:58 Multi-CPU harmless lockdep on x86 while copying data Michael L. Semon
2014-03-10 2:55 ` Dave Chinner
2014-03-10 10:37 ` Christoph Hellwig
2014-03-10 11:12 ` Christoph Hellwig
2014-03-10 20:51 ` Dave Chinner
2014-03-11 16:48 ` Christoph Hellwig
2014-03-10 20:46 ` Dave Chinner
2014-03-10 21:16 ` Ben Myers
2014-03-10 21:24 ` Dave Chinner
2014-03-10 22:10 ` Ben Myers
2014-03-10 20:52 ` Ben Myers
2014-03-10 21:20 ` Dave Chinner
2014-03-10 21:30 ` Ben Myers