* Multi-CPU harmless lockdep on x86 while copying data
@ 2014-03-09 2:58 Michael L. Semon
2014-03-10 2:55 ` Dave Chinner
0 siblings, 1 reply; 13+ messages in thread
From: Michael L. Semon @ 2014-03-09 2:58 UTC (permalink / raw)
To: xfs-oss
Hi! I was playing a shell game with some precious backup data. It
went like this:
open a 36-GB source partition (DM linear, partitions from 2 drives,
v5-superblock XFS)
open a 32-GB aes-xts crypt (v4-superblock XFS)
`cp -av` from holding partition to crypt
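In shell terms, the setup was roughly the following (device names are
placeholders, and the cryptsetup invocation shown is just one way to get
an aes-xts mapping; the kernel was built with CONFIG_PROVE_LOCKING=y so
lockdep was active):

# /dev/mapper/src36 : 36-GB DM-linear device carrying the v5 (crc=1) XFS
# /dev/sdb2         : 32-GB partition backing the aes-xts crypt (v4 XFS)
cryptsetup open --type plain --cipher aes-xts-plain64 /dev/sdb2 crypt32
mount /dev/mapper/src36 /mnt/src
mount /dev/mapper/crypt32 /mnt/dst
cp -av /mnt/src/. /mnt/dst/        # the splat fired during this copy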
It was during the cp operation that I got this multi-CPU version of
the harmless lockdep:
=========================================================
[ INFO: possible irq lock inversion dependency detected ]
3.14.0-rc5+ #6 Not tainted
---------------------------------------------------------
kswapd0/25 just changed the state of lock:
(&xfs_dir_ilock_class){++++-.}, at: [<791c09fa>] xfs_ilock+0xff/0x16c
but this lock took another, RECLAIM_FS-unsafe lock in the past:
(&mm->mmap_sem){++++++}
and interrupts could create inverse lock ordering between them.
other info that might help us debug this:
Possible interrupt unsafe locking scenario:
       CPU0                    CPU1
       ----                    ----
  lock(&mm->mmap_sem);
                               local_irq_disable();
                               lock(&xfs_dir_ilock_class);
                               lock(&mm->mmap_sem);
  <Interrupt>
    lock(&xfs_dir_ilock_class);
*** DEADLOCK ***
3 locks held by kswapd0/25:
#0: (shrinker_rwsem){++++..}, at: [<790c04b7>] shrink_slab+0x2a/0xda
#1: (&type->s_umount_key#20){++++.+}, at: [<790e55c3>] grab_super_passive+0x3b/0x75
#2: (&pag->pag_ici_reclaim_lock){+.+...}, at: [<7917809a>] xfs_reclaim_inodes_ag+0xb4/0x3a7
the shortest dependencies between 2nd lock and 1st lock:
-> (&mm->mmap_sem){++++++} ops: 146382 {
HARDIRQ-ON-W at:
[<79064553>] __lock_acquire+0x58a/0xa6c
[<7906510f>] lock_acquire+0x8b/0x101
[<792f6409>] down_write+0x4c/0xa3
[<790e96e5>] do_execve_common+0x2bf/0x5bf
[<790e99f2>] do_execve+0xd/0xf
[<79000341>] run_init_process+0x21/0x23
[<79000357>] try_to_run_init_process+0x14/0x4d
[<792ec5cf>] kernel_init+0x90/0xce
[<792f8537>] ret_from_kernel_thread+0x1b/0x28
HARDIRQ-ON-R at:
[<79064477>] __lock_acquire+0x4ae/0xa6c
[<7906510f>] lock_acquire+0x8b/0x101
[<790d1718>] might_fault+0x81/0xa9
[<7922fc80>] clear_user+0x13/0x46
[<79123c80>] padzero+0x22/0x2e
[<7912552b>] load_elf_binary+0x500/0xd35
[<790e93ec>] search_binary_handler+0x72/0xac
[<790e98c1>] do_execve_common+0x49b/0x5bf
[<790e99f2>] do_execve+0xd/0xf
[<79000341>] run_init_process+0x21/0x23
[<79000357>] try_to_run_init_process+0x14/0x4d
[<792ec5cf>] kernel_init+0x90/0xce
[<792f8537>] ret_from_kernel_thread+0x1b/0x28
SOFTIRQ-ON-W at:
[<79064579>] __lock_acquire+0x5b0/0xa6c
[<7906510f>] lock_acquire+0x8b/0x101
[<792f6409>] down_write+0x4c/0xa3
[<790e96e5>] do_execve_common+0x2bf/0x5bf
[<790e99f2>] do_execve+0xd/0xf
[<79000341>] run_init_process+0x21/0x23
[<79000357>] try_to_run_init_process+0x14/0x4d
[<792ec5cf>] kernel_init+0x90/0xce
[<792f8537>] ret_from_kernel_thread+0x1b/0x28
SOFTIRQ-ON-R at:
[<79064579>] __lock_acquire+0x5b0/0xa6c
[<7906510f>] lock_acquire+0x8b/0x101
[<790d1718>] might_fault+0x81/0xa9
[<7922fc80>] clear_user+0x13/0x46
[<79123c80>] padzero+0x22/0x2e
[<7912552b>] load_elf_binary+0x500/0xd35
[<790e93ec>] search_binary_handler+0x72/0xac
[<790e98c1>] do_execve_common+0x49b/0x5bf
[<790e99f2>] do_execve+0xd/0xf
[<79000341>] run_init_process+0x21/0x23
[<79000357>] try_to_run_init_process+0x14/0x4d
[<792ec5cf>] kernel_init+0x90/0xce
[<792f8537>] ret_from_kernel_thread+0x1b/0x28
RECLAIM_FS-ON-W at:
[<790620d7>] mark_held_locks+0x81/0xe7
[<79062aac>] lockdep_trace_alloc+0xa5/0xe6
[<790b97a5>] __alloc_pages_nodemask+0x6f/0x6be
[<790dfb14>] new_slab+0x5f/0x21c
[<792f1a88>] __slab_alloc.isra.59.constprop.67+0x25f/0x43d
[<790e0974>] kmem_cache_alloc+0x91/0xf9
[<790d35ed>] __split_vma.isra.34+0x28/0x141
[<790d4523>] do_munmap+0x234/0x2b0
[<790d5115>] vm_munmap+0x37/0x4a
[<790d513b>] SyS_munmap+0x13/0x15
[<792f85b8>] sysenter_do_call+0x12/0x36
RECLAIM_FS-ON-R at:
[<790620d7>] mark_held_locks+0x81/0xe7
[<79062aac>] lockdep_trace_alloc+0xa5/0xe6
[<790b97a5>] __alloc_pages_nodemask+0x6f/0x6be
[<790269fa>] pte_alloc_one+0x24/0x3c
[<790cead5>] __pte_alloc+0x1a/0x85
[<790d0d0c>] handle_mm_fault+0x5d8/0x604
[<79023d1f>] __do_page_fault+0x110/0x3bc
[<790240d3>] do_page_fault+0xd/0xf
[<792f832b>] error_code+0x5f/0x64
[<79123c80>] padzero+0x22/0x2e
[<7912552b>] load_elf_binary+0x500/0xd35
[<790e93ec>] search_binary_handler+0x72/0xac
[<790e98c1>] do_execve_common+0x49b/0x5bf
[<790e99f2>] do_execve+0xd/0xf
[<79000341>] run_init_process+0x21/0x23
[<79000357>] try_to_run_init_process+0x14/0x4d
[<792ec5cf>] kernel_init+0x90/0xce
[<792f8537>] ret_from_kernel_thread+0x1b/0x28
INITIAL USE at:
[<79064276>] __lock_acquire+0x2ad/0xa6c
[<7906510f>] lock_acquire+0x8b/0x101
[<792f6409>] down_write+0x4c/0xa3
[<790e96e5>] do_execve_common+0x2bf/0x5bf
[<790e99f2>] do_execve+0xd/0xf
[<79000341>] run_init_process+0x21/0x23
[<79000357>] try_to_run_init_process+0x14/0x4d
[<792ec5cf>] kernel_init+0x90/0xce
[<792f8537>] ret_from_kernel_thread+0x1b/0x28
}
... key at: [<795f1eec>] __key.44037+0x0/0x8
... acquired at:
[<79064360>] __lock_acquire+0x397/0xa6c
[<7906510f>] lock_acquire+0x8b/0x101
[<790d1718>] might_fault+0x81/0xa9
[<790f379a>] filldir64+0x92/0xe2
[<7916f079>] xfs_dir2_sf_getdents+0x1a0/0x44c
[<7917009e>] xfs_readdir+0xc4/0x126
[<79171b8b>] xfs_file_readdir+0x25/0x3e
[<790f385a>] iterate_dir+0x70/0x9b
[<790f3a31>] SyS_getdents64+0x6d/0xcc
[<792f85b8>] sysenter_do_call+0x12/0x36
-> (&xfs_dir_ilock_class){++++-.} ops: 11354 {
HARDIRQ-ON-W at:
[<79064553>] __lock_acquire+0x58a/0xa6c
[<7906510f>] lock_acquire+0x8b/0x101
[<7905e559>] down_write_nested+0x4f/0x9f
[<791c09fa>] xfs_ilock+0xff/0x16c
[<7917db25>] xfs_vn_update_time+0x6c/0x150
[<790f9302>] update_time+0x1e/0x9e
[<790faed8>] touch_atime+0xcd/0x101
[<790f3879>] iterate_dir+0x8f/0x9b
[<790f3a31>] SyS_getdents64+0x6d/0xcc
[<792f85b8>] sysenter_do_call+0x12/0x36
HARDIRQ-ON-R at:
[<79064477>] __lock_acquire+0x4ae/0xa6c
[<7906510f>] lock_acquire+0x8b/0x101
[<7905e409>] down_read_nested+0x4f/0x8a
[<791c09b4>] xfs_ilock+0xb9/0x16c
[<791c0a85>] xfs_ilock_data_map_shared+0x1e/0x3a
[<79172550>] xfs_dir_open+0x2e/0x64
[<790e1d86>] do_dentry_open.isra.26+0x115/0x221
[<790e2f06>] finish_open+0x1b/0x27
[<790ef0e6>] do_last.isra.60+0x89c/0xe6b
[<790ef75e>] path_openat+0xa9/0x4fe
[<790efbe4>] do_filp_open+0x31/0x72
[<790e339a>] do_sys_open+0x112/0x181
[<790e344d>] SyS_openat+0x20/0x22
[<792f85b8>] sysenter_do_call+0x12/0x36
SOFTIRQ-ON-W at:
[<79064579>] __lock_acquire+0x5b0/0xa6c
[<7906510f>] lock_acquire+0x8b/0x101
[<7905e559>] down_write_nested+0x4f/0x9f
[<791c09fa>] xfs_ilock+0xff/0x16c
[<7917db25>] xfs_vn_update_time+0x6c/0x150
[<790f9302>] update_time+0x1e/0x9e
[<790faed8>] touch_atime+0xcd/0x101
[<790f3879>] iterate_dir+0x8f/0x9b
[<790f3a31>] SyS_getdents64+0x6d/0xcc
[<792f85b8>] sysenter_do_call+0x12/0x36
SOFTIRQ-ON-R at:
[<79064579>] __lock_acquire+0x5b0/0xa6c
[<7906510f>] lock_acquire+0x8b/0x101
[<7905e409>] down_read_nested+0x4f/0x8a
[<791c09b4>] xfs_ilock+0xb9/0x16c
[<791c0a85>] xfs_ilock_data_map_shared+0x1e/0x3a
[<79172550>] xfs_dir_open+0x2e/0x64
[<790e1d86>] do_dentry_open.isra.26+0x115/0x221
[<790e2f06>] finish_open+0x1b/0x27
[<790ef0e6>] do_last.isra.60+0x89c/0xe6b
[<790ef75e>] path_openat+0xa9/0x4fe
[<790efbe4>] do_filp_open+0x31/0x72
[<790e339a>] do_sys_open+0x112/0x181
[<790e344d>] SyS_openat+0x20/0x22
[<792f85b8>] sysenter_do_call+0x12/0x36
IN-RECLAIM_FS-W at:
[<79064676>] __lock_acquire+0x6ad/0xa6c
[<7906510f>] lock_acquire+0x8b/0x101
[<7905e559>] down_write_nested+0x4f/0x9f
[<791c09fa>] xfs_ilock+0xff/0x16c
[<79177db4>] xfs_reclaim_inode+0xdb/0x30d
[<79178260>] xfs_reclaim_inodes_ag+0x27a/0x3a7
[<7917840b>] xfs_reclaim_inodes_nr+0x2d/0x33
[<791837d1>] xfs_fs_free_cached_objects+0x13/0x15
[<790e57f8>] super_cache_scan+0x129/0x12d
[<790befdc>] shrink_slab_node+0x125/0x270
[<790c04f0>] shrink_slab+0x63/0xda
[<790c273e>] kswapd+0x31d/0x770
[<790458fa>] kthread+0xa1/0xb6
[<792f8537>] ret_from_kernel_thread+0x1b/0x28
INITIAL USE at:
[<79064276>] __lock_acquire+0x2ad/0xa6c
[<7906510f>] lock_acquire+0x8b/0x101
[<7905e409>] down_read_nested+0x4f/0x8a
[<791c09b4>] xfs_ilock+0xb9/0x16c
[<791c0a85>] xfs_ilock_data_map_shared+0x1e/0x3a
[<79172550>] xfs_dir_open+0x2e/0x64
[<790e1d86>] do_dentry_open.isra.26+0x115/0x221
[<790e2f06>] finish_open+0x1b/0x27
[<790ef0e6>] do_last.isra.60+0x89c/0xe6b
[<790ef75e>] path_openat+0xa9/0x4fe
[<790efbe4>] do_filp_open+0x31/0x72
[<790e339a>] do_sys_open+0x112/0x181
[<790e344d>] SyS_openat+0x20/0x22
[<792f85b8>] sysenter_do_call+0x12/0x36
}
... key at: [<79bfe07c>] xfs_dir_ilock_class+0x0/0x8
... acquired at:
[<79060873>] check_usage_forwards+0xf8/0xfa
[<79061f62>] mark_lock+0x1b3/0x2a7
[<79064676>] __lock_acquire+0x6ad/0xa6c
[<7906510f>] lock_acquire+0x8b/0x101
[<7905e559>] down_write_nested+0x4f/0x9f
[<791c09fa>] xfs_ilock+0xff/0x16c
[<79177db4>] xfs_reclaim_inode+0xdb/0x30d
[<79178260>] xfs_reclaim_inodes_ag+0x27a/0x3a7
[<7917840b>] xfs_reclaim_inodes_nr+0x2d/0x33
[<791837d1>] xfs_fs_free_cached_objects+0x13/0x15
[<790e57f8>] super_cache_scan+0x129/0x12d
[<790befdc>] shrink_slab_node+0x125/0x270
[<790c04f0>] shrink_slab+0x63/0xda
[<790c273e>] kswapd+0x31d/0x770
[<790458fa>] kthread+0xa1/0xb6
[<792f8537>] ret_from_kernel_thread+0x1b/0x28
stack backtrace:
CPU: 0 PID: 25 Comm: kswapd0 Not tainted 3.14.0-rc5+ #6
Hardware name: Dell Computer Corporation Dimension 2350/07W080, BIOS A01 12/17/2002
00000000 00000000 781b1b6c 792f2495 781b1bb0 781b1b90 792efde6 793b75f7
793b7965 00000019 7819b1c0 00000000 7819b654 7819b1c0 781b1be0 79060873
7819b654 00000001 793b7965 781b1bac 79a83100 79a83100 79834600 7819b654
Call Trace:
[<792f2495>] dump_stack+0x48/0x60
[<792efde6>] print_irq_inversion_bug.part.36+0x173/0x17b
[<79060873>] check_usage_forwards+0xf8/0xfa
[<7906077b>] ? check_usage_backwards+0xfc/0xfc
[<79061f62>] mark_lock+0x1b3/0x2a7
[<79053956>] ? sched_clock_local+0x42/0x12e
[<7906077b>] ? check_usage_backwards+0xfc/0xfc
[<79064676>] __lock_acquire+0x6ad/0xa6c
[<7906510f>] lock_acquire+0x8b/0x101
[<791c09fa>] ? xfs_ilock+0xff/0x16c
[<7905e559>] down_write_nested+0x4f/0x9f
[<791c09fa>] ? xfs_ilock+0xff/0x16c
[<791c09fa>] xfs_ilock+0xff/0x16c
[<79177db4>] xfs_reclaim_inode+0xdb/0x30d
[<79178260>] xfs_reclaim_inodes_ag+0x27a/0x3a7
[<791780cc>] ? xfs_reclaim_inodes_ag+0xe6/0x3a7
[<7917840b>] xfs_reclaim_inodes_nr+0x2d/0x33
[<791837d1>] xfs_fs_free_cached_objects+0x13/0x15
[<790e57f8>] super_cache_scan+0x129/0x12d
[<790befdc>] shrink_slab_node+0x125/0x270
[<790c04b7>] ? shrink_slab+0x2a/0xda
[<790c04f0>] shrink_slab+0x63/0xda
[<790c273e>] kswapd+0x31d/0x770
[<790c2421>] ? try_to_free_pages+0x373/0x373
[<790458fa>] kthread+0xa1/0xb6
[<790624ff>] ? trace_hardirqs_on+0xb/0xd
[<792f8537>] ret_from_kernel_thread+0x1b/0x28
[<79045859>] ? __kthread_parkme+0x57/0x57
[sched_delayed] sched: RT throttling activated
cp (149) used greatest stack depth: 5152 bytes left
I call it harmless because its single-CPU variant can be reproduced
as far back as I could bisect in earlier testing (way before
kernel 2.6.20). However, such lockdep splats have never manifested
as a noticeable problem on production kernels on x86. Either
down_write_nested or down_read_nested is in all of them, depending
on which kernel version is in use. At least one reclaim-related
function is in there as well. vm_map_ram used to be in there, but Dave
took care of that (thanks!).
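(For context: the _nested variants show up because XFS packs a
lock-ordering subclass into its lock flags and hands it to
down_write_nested()/down_read_nested() from xfs_ilock(). A simplified
sketch, paraphrased from the 3.14-era fs/xfs/xfs_inode.c and
xfs_inode.h - not the verbatim source:)

/* Sketch: how a lockdep subclass is encoded into the XFS lock flags. */
#define XFS_IOLOCK_EXCL         (1 << 0)
#define XFS_IOLOCK_SHARED       (1 << 1)
#define XFS_ILOCK_EXCL          (1 << 2)
#define XFS_ILOCK_SHARED        (1 << 3)
#define XFS_LOCK_INUMORDER      4       /* first subclass for in-inode-order locking */
#define XFS_IOLOCK_SHIFT        16      /* iolock subclass lives in bits 16+ */
#define XFS_ILOCK_SHIFT         24      /* ilock subclass lives in bits 24+ */

static inline int
xfs_lock_inumorder(int lock_mode, int subclass)
{
        /*
         * xfs_ilock() later extracts these bits and passes them to
         * down_write_nested()/down_read_nested() as the lockdep subclass.
         */
        if (lock_mode & (XFS_IOLOCK_SHARED | XFS_IOLOCK_EXCL))
                lock_mode |= (subclass + XFS_LOCK_INUMORDER) << XFS_IOLOCK_SHIFT;
        if (lock_mode & (XFS_ILOCK_SHARED | XFS_ILOCK_EXCL))
                lock_mode |= (subclass + XFS_LOCK_INUMORDER) << XFS_ILOCK_SHIFT;
        return lock_mode;
}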
This particular splat has been showing up more often, though. It's not
tied to one particular commit, event, or change; just something that
has crept in gradually since maybe kernel 3.11. It's like watching
grass grow or watching paint dry.
Might somebody keep an eye on this? With enough debug enabled, it
might show up during a large cp or `tar -x` operation. xfstests rarely
triggers this issue. It happens when there are enough inodes
and data flowing that RAM should be full. In many cases, more than one
partition is in play. Crypt partitions are not required.
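For reference, "enough debug" here boils down to the lockdep options;
a minimal .config fragment that should be enough to surface these
reports (option names as of 3.14; CONFIG_PROVE_LOCKING selects LOCKDEP,
DEBUG_LOCK_ALLOC and TRACE_IRQFLAGS):

CONFIG_DEBUG_KERNEL=y
CONFIG_PROVE_LOCKING=y
# optional extra self-checking of the dependency engine itself:
CONFIG_DEBUG_LOCKDEP=y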
The test system is an i686 Pentium 4 with 1280 MB of RAM, running a
stripped-down Slackware 14.1 with elfutils and test programs on top.
Thanks!
Michael
* Re: Multi-CPU harmless lockdep on x86 while copying data
  2014-03-09  2:58 Multi-CPU harmless lockdep on x86 while copying data Michael L. Semon
@ 2014-03-10  2:55 ` Dave Chinner
  2014-03-10 10:37   ` Christoph Hellwig
  2014-03-10 20:52   ` Ben Myers
  0 siblings, 2 replies; 13+ messages in thread
From: Dave Chinner @ 2014-03-10  2:55 UTC (permalink / raw)
  To: Michael L. Semon; +Cc: xfs-oss

On Sat, Mar 08, 2014 at 09:58:01PM -0500, Michael L. Semon wrote:
> Hi! I was playing a shell game with some precious backup data. It
> went like this:
>
> open a 36-GB source partition (DM linear, partitions from 2 drives,
> v5-superblock XFS)
>
> open a 32-GB aes-xts crypt (v4-superblock XFS)
>
> `cp -av` from holding partition to crypt
>
> It was during the cp operation that I got this multi-CPU version of
> the harmless lockdep:
>
> =========================================================
> [ INFO: possible irq lock inversion dependency detected ]
> 3.14.0-rc5+ #6 Not tainted
> ---------------------------------------------------------
> kswapd0/25 just changed the state of lock:
>  (&xfs_dir_ilock_class){++++-.}, at: [<791c09fa>] xfs_ilock+0xff/0x16c
> but this lock took another, RECLAIM_FS-unsafe lock in the past:
>  (&mm->mmap_sem){++++++}
>
> and interrupts could create inverse lock ordering between them.
>
> other info that might help us debug this:
>  Possible interrupt unsafe locking scenario:
>
>        CPU0                    CPU1
>        ----                    ----
>   lock(&mm->mmap_sem);
>                                local_irq_disable();
>                                lock(&xfs_dir_ilock_class);
>                                lock(&mm->mmap_sem);
>   <Interrupt>
>     lock(&xfs_dir_ilock_class);

Well, that's rather interesting. I'd love to know where we take the
directory ilock with interrupts disabled - and then take a page
fault... :/

....

> }
> ... key at: [<795f1eec>] __key.44037+0x0/0x8
> ... acquired at:
>    [<79064360>] __lock_acquire+0x397/0xa6c
>    [<7906510f>] lock_acquire+0x8b/0x101
>    [<790d1718>] might_fault+0x81/0xa9
>    [<790f379a>] filldir64+0x92/0xe2
>    [<7916f079>] xfs_dir2_sf_getdents+0x1a0/0x44c
>    [<7917009e>] xfs_readdir+0xc4/0x126
>    [<79171b8b>] xfs_file_readdir+0x25/0x3e
>    [<790f385a>] iterate_dir+0x70/0x9b
>    [<790f3a31>] SyS_getdents64+0x6d/0xcc
>    [<792f85b8>] sysenter_do_call+0x12/0x36

All of these paths are from the syscall path, except for one in
kswapd context. I don't see any stack trace here that could result
in the above "deadlock". It looks to be yet another false
positive...

> I call it harmless because its single-CPU variant can be reproduced
> as far back as I could bisect in earlier testing (way before
> kernel 2.6.20). However, such lockdep splats have never manifested
> as a noticeable problem on production kernels on x86. Either
> down_write_nested or down_read_nested is in all of them, depending
> on which kernel version is in use. At least one reclaim-related
> function is in there as well. vm_map_ram used to be in there, but Dave
> took care of that (thanks!).
>
> This particular splat has been showing up more often, though. It's not
> tied to one particular commit, event, or change; just something that
> has crept in gradually since maybe kernel 3.11. It's like watching
> grass grow or watching paint dry.

The above reports all indicate that taking a page fault while
holding the directory ilock is problematic. So have all the other
new lockdep reports we've got that have been caused by that change.

Realistically, I think that the readdir locking change was
fundamentally broken. Why?
Because taking page faults while holding the ilock violates the lock
order we have for inodes. For regular files, the lock hierarchy is
iolock -> page fault -> ilock, and we got rid of the need for the
iolock in the inode reclaim path because of all the problems it
caused lockdep. Now we've gone and reintroduced that same set of
problems - only worse - by allowing the readdir code to take page
faults while holding the ilock....

I'd like to revert the change that introduced the ilock into the
readdir path, but I suspect that would be an unpopular direction to
take. However, we need a solution that doesn't drive us insane with
lockdep false positives.

IOWs, we need to do this differently - readdir needs to be protected
by the iolock, not the ilock, and only the block mapping routines in
the readdir code should be protected by the ilock. i.e. the same
locking strategy that is used for regular files: the iolock protects
the contents of the file, the ilock protects the inode metadata.

What I think we really need to do is drive the ilock all the way
into the block mapping functions of the directory code and to the
places where the inode metadata is actually going to be modified.
The iolock protects access to the directory data, and logging of
changes to those blocks is serialised by the buffer locks, so we
don't actually need the inode ilock to serialise access to the
directory data - we only need the ilock for modifications to the
inode and its blockmaps, same as for regular files.

Changing the directory code to handle this sort of locking is going
to require a bit of surgery. However, I can see advantages to moving
directory data to the same locking strategy as regular file data -
locking hierarchies are identical, directory ilock hold times are
much reduced, we don't get lockdep whining about taking page faults
with the ilock held, etc.

A quick hack to demonstrate the high level, initial step of using
the IOLOCK for readdir serialisation. I've done a little smoke
testing on it, so it won't die immediately. It should get rid of all
the nasty lockdep issues, but it doesn't start to address the deeper
restructuring that is needed.

Thoughts, comments? I'm especially interested in what SGI people
have to say here because this locking change was needed for CXFS,
and changing the directory locking hierarchy is probably going to
upset CXFS a lot...

Cheers,

Dave.
--
Dave Chinner
david@fromorbit.com

xfs: holding ilock over readdir is wrong

From: Dave Chinner <dchinner@redhat.com>

The recent change to the readdir locking made in 40194ec ("xfs:
reinstate the ilock in xfs_readdir") for CXFS directory sanity was
fundamentally the wrong thing to do. Deep in the readdir code we
can take page faults in the filldir callback, and so taking a page
fault while holding an inode ilock creates a new set of locking
issues that lockdep warns all over the place about.

The locking order for regular inodes w.r.t. page faults is
iolock -> page fault -> mmap_sem -> ilock. The directory readdir
code now triggers ilock -> page fault -> mmap_sem. While we cannot
deadlock at this point, it inverts all the locking patterns that
lockdep normally sees on XFS inodes, and so triggers lockdep. We
worked around this with commit 93a8614 ("xfs: fix directory inode
iolock lockdep false positive"), but that then just moved the
lockdep warning to deeper in the page fault path and triggered on
security inode locks.

Further, if we enter memory reclaim in a readdir path, we now get
lockdep warnings about potential deadlocks because the ilock is
held when we enter reclaim. This, again, is different to a regular
file in that we never allow memory reclaim to run while holding the
ilock for regular files. Hence lockdep now throws
ilock->kmalloc->reclaim->ilock warnings.

Basically, the problem is that the ilock is being used to protect
the directory data and the inode metadata, whereas for a regular
file the iolock protects the data and the ilock protects the
metadata. From the VFS perspective, the i_mutex serialises all
accesses to the directory data, and so not holding the ilock for
readdir doesn't matter. The issue is that CXFS doesn't access
directory data via the VFS, so it has no "data serialisation"
mechanism.

The simplest way to fix these problems would be to revert the ilock
addition made to xfs_readdir, but I figure that would be unpopular
with some people. Hence we need a different solution: we should use
the iolock to protect readdir against modification races. The ilock
can then be used just when the extent list needs to be read, just
like we do for regular files. The directory modification code can
now take the iolock exclusive when the ilock is also taken, and
this then ensures that readdir is correctly excluded while
modifications are in progress.

This would be a straightforward change, except for two things:
filestreams and lockdep. The filestream allocator takes the
directory iolock and makes assumptions about parent->child locking
order of the iolock which will now be invalidated. Hence some
changes to the filestreams code are needed to ensure that it never
blocks on directory iolocks and deadlocks. Instead, it needs to fail
stream associations when such problems occur.

Lockdep is just plain annoying. We have a bug in our handling of
XFS_LOCK_PARENT when it comes to locking multiple inodes with that
flag set - lockdep classes are exclusive and can't be ORed together
to form new classes. And there's only 8 subclasses available.
Luckily, we only need the new annotations on the iolock, and we
never lock more than 2 of those at once, so there's space available
in the iolock lockdep mask to shoehorn in a separate lockdep
subclass map for parent based iolocks.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_dir2.c         |  3 +++
 fs/xfs/xfs_dir2_readdir.c | 10 ++++++++--
 fs/xfs/xfs_filestream.c   | 26 +++++++++++++++---------
 fs/xfs/xfs_inode.c        | 50 +++++++++++++++++++++++++++++++++++------------
 fs/xfs/xfs_inode.h        |  7 ++++++-
 fs/xfs/xfs_symlink.c      |  7 ++++---
 6 files changed, 75 insertions(+), 28 deletions(-)

diff --git a/fs/xfs/xfs_dir2.c b/fs/xfs/xfs_dir2.c
index 0c8ba87..b206da1 100644
--- a/fs/xfs/xfs_dir2.c
+++ b/fs/xfs/xfs_dir2.c
@@ -307,6 +307,7 @@ xfs_dir_lookup(
 	struct xfs_da_args *args;
 	int		rval;
 	int		v;		/* type-checking value */
+	int		lock_mode;
 
 	ASSERT(S_ISDIR(dp->i_d.di_mode));
 	XFS_STATS_INC(xs_dir_lookup);
@@ -331,6 +332,7 @@ xfs_dir_lookup(
 	if (ci_name)
 		args->op_flags |= XFS_DA_OP_CILOOKUP;
 
+	lock_mode = xfs_ilock_data_map_shared(dp);
 	if (dp->i_d.di_format == XFS_DINODE_FMT_LOCAL) {
 		rval = xfs_dir2_sf_lookup(args);
 		goto out_check_rval;
@@ -363,6 +365,7 @@ out_check_rval:
 		}
 	}
 out_free:
+	xfs_iunlock(dp, lock_mode);
 	kmem_free(args);
 	return rval;
 }

diff --git a/fs/xfs/xfs_dir2_readdir.c b/fs/xfs/xfs_dir2_readdir.c
index aead369..19863fe 100644
--- a/fs/xfs/xfs_dir2_readdir.c
+++ b/fs/xfs/xfs_dir2_readdir.c
@@ -190,6 +190,7 @@ xfs_dir2_block_getdents(
 	char		*ptr;		/* current data entry */
 	int		wantoff;	/* starting block offset */
 	xfs_off_t	cook;
+	int		lock_mode;
 
 	mp = dp->i_mount;
 	/*
@@ -198,7 +199,9 @@ xfs_dir2_block_getdents(
 	if (xfs_dir2_dataptr_to_db(mp, ctx->pos) > mp->m_dirdatablk)
 		return 0;
 
+	lock_mode = xfs_ilock_data_map_shared(dp);
 	error = xfs_dir3_block_read(NULL, dp, &bp);
+	xfs_iunlock(dp, lock_mode);
 	if (error)
 		return error;
@@ -552,9 +555,12 @@ xfs_dir2_leaf_getdents(
 		 * current buffer, need to get another one.
 		 */
 		if (!bp || ptr >= (char *)bp->b_addr + mp->m_dirblksize) {
+			int	lock_mode;
 
+			lock_mode = xfs_ilock_data_map_shared(dp);
 			error = xfs_dir2_leaf_readbuf(dp, bufsize, map_info,
 						      &curoff, &bp);
+			xfs_iunlock(dp, lock_mode);
 			if (error || !map_info->map_valid)
 				break;
@@ -684,7 +690,7 @@ xfs_readdir(
 	ASSERT(S_ISDIR(dp->i_d.di_mode));
 	XFS_STATS_INC(xs_dir_getdents);
 
-	lock_mode = xfs_ilock_data_map_shared(dp);
+	xfs_ilock(dp, XFS_IOLOCK_SHARED);
 	if (dp->i_d.di_format == XFS_DINODE_FMT_LOCAL)
 		rval = xfs_dir2_sf_getdents(dp, ctx);
 	else if ((rval = xfs_dir2_isblock(NULL, dp, &v)))
 		;
 	else if (v)
 		rval = xfs_dir2_block_getdents(dp, ctx);
 	else
 		rval = xfs_dir2_leaf_getdents(dp, ctx, bufsize);
-	xfs_iunlock(dp, lock_mode);
+	xfs_iunlock(dp, XFS_IOLOCK_SHARED);
 
 	return rval;
 }

diff --git a/fs/xfs/xfs_filestream.c b/fs/xfs/xfs_filestream.c
index 12b6e77..a2b8b12 100644
--- a/fs/xfs/xfs_filestream.c
+++ b/fs/xfs/xfs_filestream.c
@@ -597,9 +597,6 @@ xfs_filestream_associate(
 	 * waiting for the lock because someone else is waiting on the lock we
 	 * hold and we cannot drop that as we are in a transaction here.
 	 *
-	 * Lucky for us, this inversion is not a problem because it's a
-	 * directory inode that we are trying to lock here.
-	 *
 	 * So, if we can't get the iolock without sleeping then just give up
 	 */
 	if (!xfs_ilock_nowait(pip, XFS_IOLOCK_EXCL))
@@ -746,13 +743,24 @@ xfs_filestream_new_ag(
 	 * themselves and their parent to different AGs.
 	 *
 	 * Note that we lock the parent directory iolock inside the child
-	 * iolock here. That's fine as we never hold both parent and child
-	 * iolock in any other place. This is different from the ilock,
-	 * which requires locking of the child after the parent for namespace
-	 * operations.
+	 * iolock here. That's fine as we never hold both parent and child
+	 * iolock in any other place. However, we also hold the child ilock
+	 * here, so we are locking inside that as well. This inverts the
+	 * directory iolock/child ilock order for operations like rename,
+	 * and hence we cannot block on the parent iolock here. If we can't
+	 * get the parent iolock, then just default to AG 0 and return.
 	 */
-	if (pip)
-		xfs_ilock(pip, XFS_IOLOCK_EXCL | XFS_IOLOCK_PARENT);
+	if (pip) {
+		if (!xfs_ilock_nowait(pip, XFS_IOLOCK_EXCL)) {
+			*agp = 0;
+			return 0;
+		}
+	}
 
 	/*
 	 * A new AG needs to be found for the file. If the file's parent

diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index 7a1668b..422603d 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -321,10 +321,25 @@ int xfs_lock_delays;
 static inline int
 xfs_lock_inumorder(int lock_mode, int subclass)
 {
-	if (lock_mode & (XFS_IOLOCK_SHARED|XFS_IOLOCK_EXCL))
-		lock_mode |= (subclass + XFS_LOCK_INUMORDER) << XFS_IOLOCK_SHIFT;
-	if (lock_mode & (XFS_ILOCK_SHARED|XFS_ILOCK_EXCL))
-		lock_mode |= (subclass + XFS_LOCK_INUMORDER) << XFS_ILOCK_SHIFT;
+	if (lock_mode & (XFS_IOLOCK_SHARED|XFS_IOLOCK_EXCL)) {
+		ASSERT(subclass <= XFS_IOLOCK_MAX_SUBCLASS);
+		subclass += XFS_IOLOCK_INUMORDER;
+		if (lock_mode & XFS_IOLOCK_PARENT) {
+			subclass += XFS_IOLOCK_INUMORDER_PARENT;
+			lock_mode &= ~XFS_IOLOCK_PARENT;
+		}
+		ASSERT(subclass < MAX_LOCKDEP_SUBCLASSES);
+		lock_mode |= subclass << XFS_IOLOCK_SHIFT;
+	}
+
+	if (lock_mode & (XFS_ILOCK_SHARED|XFS_ILOCK_EXCL)) {
+		ASSERT(subclass <= XFS_ILOCK_MAX_SUBCLASS);
+		ASSERT(!(lock_mode & XFS_ILOCK_PARENT));
+		subclass += XFS_ILOCK_INUMORDER;
+		ASSERT(subclass < MAX_LOCKDEP_SUBCLASSES);
+		lock_mode |= subclass << XFS_ILOCK_SHIFT;
+	}
+
 	return lock_mode;
 }
 
@@ -584,9 +599,9 @@ xfs_lookup(
 	if (XFS_FORCED_SHUTDOWN(dp->i_mount))
 		return XFS_ERROR(EIO);
 
-	lock_mode = xfs_ilock_data_map_shared(dp);
+	xfs_ilock(dp, XFS_IOLOCK_SHARED);
 	error = xfs_dir_lookup(NULL, dp, name, &inum, ci_name);
-	xfs_iunlock(dp, lock_mode);
+	xfs_iunlock(dp, XFS_IOLOCK_SHARED);
 
 	if (error)
 		goto out;
@@ -1191,7 +1206,8 @@ xfs_create(
 		goto out_trans_cancel;
 	}
 
-	xfs_ilock(dp, XFS_ILOCK_EXCL | XFS_ILOCK_PARENT);
+	xfs_ilock(dp, XFS_IOLOCK_EXCL | XFS_ILOCK_EXCL |
+		      XFS_IOLOCK_PARENT | XFS_ILOCK_PARENT);
 	unlock_dp_on_error = true;
 
 	xfs_bmap_init(&free_list, &first_block);
@@ -1228,7 +1244,7 @@ xfs_create(
 	 * the transaction cancel unlocking dp so don't do it explicitly in the
 	 * error path.
 	 */
-	xfs_trans_ijoin(tp, dp, XFS_ILOCK_EXCL);
+	xfs_trans_ijoin(tp, dp, XFS_IOLOCK_EXCL | XFS_ILOCK_EXCL);
 	unlock_dp_on_error = false;
 
 	error = xfs_dir_createname(tp, dp, name, ip->i_ino,
@@ -1301,7 +1317,7 @@ xfs_create(
 	xfs_qm_dqrele(pdqp);
 
 	if (unlock_dp_on_error)
-		xfs_iunlock(dp, XFS_ILOCK_EXCL);
+		xfs_iunlock(dp, XFS_IOLOCK_EXCL | XFS_ILOCK_EXCL);
 	return error;
 }
 
@@ -1348,10 +1364,11 @@ xfs_link(
 		goto error_return;
 	}
 
+	xfs_ilock(tdp, XFS_IOLOCK_EXCL | XFS_IOLOCK_PARENT);
 	xfs_lock_two_inodes(sip, tdp, XFS_ILOCK_EXCL);
 
 	xfs_trans_ijoin(tp, sip, XFS_ILOCK_EXCL);
-	xfs_trans_ijoin(tp, tdp, XFS_ILOCK_EXCL);
+	xfs_trans_ijoin(tp, tdp, XFS_IOLOCK_EXCL | XFS_ILOCK_EXCL);
 
 	/*
 	 * If we are using project inheritance, we only allow hard link
@@ -2454,9 +2471,10 @@ xfs_remove(
 		goto out_trans_cancel;
 	}
 
+	xfs_ilock(dp, XFS_IOLOCK_EXCL | XFS_IOLOCK_PARENT);
 	xfs_lock_two_inodes(dp, ip, XFS_ILOCK_EXCL);
 
-	xfs_trans_ijoin(tp, dp, XFS_ILOCK_EXCL);
+	xfs_trans_ijoin(tp, dp, XFS_IOLOCK_EXCL | XFS_ILOCK_EXCL);
 	xfs_trans_ijoin(tp, ip, XFS_ILOCK_EXCL);
 
 	/*
@@ -2655,6 +2673,12 @@ xfs_rename(
 	 * whether the target directory is the same as the source
 	 * directory, we can lock from 2 to 4 inodes.
 	 */
+	if (!new_parent)
+		xfs_ilock(src_dp, XFS_IOLOCK_EXCL | XFS_IOLOCK_PARENT);
+	else
+		xfs_lock_two_inodes(src_dp, target_dp,
+				    XFS_IOLOCK_EXCL | XFS_IOLOCK_PARENT);
+
 	xfs_lock_inodes(inodes, num_inodes, XFS_ILOCK_EXCL);
 
 	/*
@@ -2662,9 +2686,9 @@ xfs_rename(
 	 * we can rely on either trans_commit or trans_cancel to unlock
 	 * them.
 	 */
-	xfs_trans_ijoin(tp, src_dp, XFS_ILOCK_EXCL);
+	xfs_trans_ijoin(tp, src_dp, XFS_IOLOCK_EXCL | XFS_ILOCK_EXCL);
 	if (new_parent)
-		xfs_trans_ijoin(tp, target_dp, XFS_ILOCK_EXCL);
+		xfs_trans_ijoin(tp, target_dp, XFS_IOLOCK_EXCL | XFS_ILOCK_EXCL);
 	xfs_trans_ijoin(tp, src_ip, XFS_ILOCK_EXCL);
 	if (target_ip)
 		xfs_trans_ijoin(tp, target_ip, XFS_ILOCK_EXCL);

diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
index 5f421a1..f20d4d0 100644
--- a/fs/xfs/xfs_inode.h
+++ b/fs/xfs/xfs_inode.h
@@ -290,15 +290,20 @@ static inline int xfs_isiflocked(struct xfs_inode *ip)
 #define XFS_LOCK_PARENT		1
 #define XFS_LOCK_RTBITMAP	2
 #define XFS_LOCK_RTSUM		3
-#define XFS_LOCK_INUMORDER	4
 
 #define XFS_IOLOCK_SHIFT	16
 #define XFS_IOLOCK_PARENT	(XFS_LOCK_PARENT << XFS_IOLOCK_SHIFT)
+#define XFS_IOLOCK_INUMORDER	(XFS_LOCK_PARENT + 1)
+#define XFS_IOLOCK_MAX_SUBCLASS	1
+#define XFS_IOLOCK_INUMORDER_PARENT \
+		(XFS_IOLOCK_INUMORDER + XFS_IOLOCK_MAX_SUBCLASS + 1)
 
 #define XFS_ILOCK_SHIFT		24
 #define XFS_ILOCK_PARENT	(XFS_LOCK_PARENT << XFS_ILOCK_SHIFT)
 #define XFS_ILOCK_RTBITMAP	(XFS_LOCK_RTBITMAP << XFS_ILOCK_SHIFT)
 #define XFS_ILOCK_RTSUM		(XFS_LOCK_RTSUM << XFS_ILOCK_SHIFT)
+#define XFS_ILOCK_INUMORDER	(XFS_LOCK_RTSUM + 1)
+#define XFS_ILOCK_MAX_SUBCLASS	3
 
 #define XFS_IOLOCK_DEP_MASK	0x00ff0000
 #define XFS_ILOCK_DEP_MASK	0xff000000

diff --git a/fs/xfs/xfs_symlink.c b/fs/xfs/xfs_symlink.c
index 5fda189..32c3278 100644
--- a/fs/xfs/xfs_symlink.c
+++ b/fs/xfs/xfs_symlink.c
@@ -249,7 +249,8 @@ xfs_symlink(
 		goto error_return;
 	}
 
-	xfs_ilock(dp, XFS_ILOCK_EXCL | XFS_ILOCK_PARENT);
+	xfs_ilock(dp, XFS_IOLOCK_EXCL | XFS_ILOCK_EXCL |
+		      XFS_IOLOCK_PARENT | XFS_ILOCK_PARENT);
 	unlock_dp_on_error = true;
 
 	/*
@@ -296,7 +297,7 @@ xfs_symlink(
 	 * transaction cancel unlocking dp so don't do it explicitly in the
 	 * error path.
 	 */
-	xfs_trans_ijoin(tp, dp, XFS_ILOCK_EXCL);
+	xfs_trans_ijoin(tp, dp, XFS_IOLOCK_EXCL | XFS_ILOCK_EXCL);
 	unlock_dp_on_error = false;
 
 	/*
@@ -418,7 +419,7 @@ xfs_symlink(
 	xfs_qm_dqrele(pdqp);
 
 	if (unlock_dp_on_error)
-		xfs_iunlock(dp, XFS_ILOCK_EXCL);
+		xfs_iunlock(dp, XFS_IOLOCK_EXCL | XFS_ILOCK_EXCL);
 std_return:
 	return error;
 }
* Re: Multi-CPU harmless lockdep on x86 while copying data
  2014-03-10  2:55 ` Dave Chinner
@ 2014-03-10 10:37   ` Christoph Hellwig
  2014-03-10 11:12     ` Christoph Hellwig
  2014-03-10 20:46     ` Dave Chinner
  1 sibling, 2 replies; 13+ messages in thread
From: Christoph Hellwig @ 2014-03-10 10:37 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Michael L. Semon, xfs-oss

On Mon, Mar 10, 2014 at 01:55:23PM +1100, Dave Chinner wrote:
> Changing the directory code to handle this sort of locking is going
> to require a bit of surgery. However, I can see advantages to moving
> directory data to the same locking strategy as regular file data -
> locking hierarchies are identical, directory ilock hold times are
> much reduced, we don't get lockdep whining about taking page faults
> with the ilock held, etc.
>
> A quick hack to demonstrate the high level, initial step of using
> the IOLOCK for readdir serialisation. I've done a little smoke
> testing on it, so it won't die immediately. It should get rid of all
> the nasty lockdep issues, but it doesn't start to address the deeper
> restructuring that is needed.

What synchronization do we actually need from the iolock? Pushing the
ilock down to where it's actually needed is a good idea either way,
though.

> This would be a straightforward change, except for two things:
> filestreams and lockdep. The filestream allocator takes the
> directory iolock and makes assumptions about parent->child locking
> order of the iolock which will now be invalidated. Hence some
> changes to the filestreams code are needed to ensure that it never
> blocks on directory iolocks and deadlocks. Instead, it needs to fail
> stream associations when such problems occur.

I think the right fix is to stop abusing the iolock in filestreams.
To me it seems like a lock inside fstrm_item_t should be fine
for what the filestreams code wants, if I understand it correctly.

From looking over some of the filestreams code just for a few minutes
I get an urge to redo lots of it right now..

> @@ -1228,7 +1244,7 @@ xfs_create(
>  	 * the transaction cancel unlocking dp so don't do it explicitly in the
>  	 * error path.
>  	 */
> -	xfs_trans_ijoin(tp, dp, XFS_ILOCK_EXCL);
> +	xfs_trans_ijoin(tp, dp, XFS_IOLOCK_EXCL | XFS_ILOCK_EXCL);

What do we need the iolock on these operations for?
* Re: Multi-CPU harmless lockdep on x86 while copying data
  2014-03-10 10:37   ` Christoph Hellwig
@ 2014-03-10 11:12     ` Christoph Hellwig
  2014-03-10 20:51       ` Dave Chinner
  0 siblings, 1 reply; 13+ messages in thread
From: Christoph Hellwig @ 2014-03-10 11:12 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Michael L. Semon, xfs-oss

On Mon, Mar 10, 2014 at 03:37:16AM -0700, Christoph Hellwig wrote:
> I think the right fix is to stop abusing the iolock in filestreams.
> To me it seems like a lock inside fstrm_item_t should be fine
> for what the filestreams code wants, if I understand it correctly.

Seems like the iolock could be removed fairly easily by using either of
the two options:

 a) reference count fstrm_item, and just grab a reference to it for each
    child as well as the parent and insert it multiple times. Kill
    ->pip.
 b) only allocate and insert fstrm_items for directories. Find the
    directory by grabbing an entry off inode->i_dentry and then grabbing
    the parent. There always should be a dentry around when we allocate
    blocks, and if there isn't we can just skip out of the filestreams
    allocator. For the cases that matter there is.

Both mean that the race it tries to protect against using the iolock is
removed entirely, and the code becomes more efficient as well. Option a)
seems simple to implement, but b) will save a lot more memory and
operations when using the filestreams allocator.
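To make option b) concrete, a rough sketch of the dcache walk it implies
is below. This is an untested illustration only - xfs_filestream_get_parent()
is a made-up helper name, while d_find_alias(), dget_parent(), igrab() and
dput() are the stock VFS primitives it would lean on:

/*
 * Sketch of option b): find the parent directory of the inode we are
 * allocating blocks for via the dcache, instead of pinning it with
 * the parent's iolock held.
 */
static struct inode *
xfs_filestream_get_parent(
	struct xfs_inode	*ip)
{
	struct inode		*inode = VFS_I(ip);
	struct inode		*dir = NULL;
	struct dentry		*dentry, *parent;

	dentry = d_find_alias(inode);	/* takes a reference on the dentry */
	if (!dentry)
		return NULL;		/* no dentry: skip the filestreams allocator */

	parent = dget_parent(dentry);	/* stable reference to the parent dentry */
	dir = igrab(parent->d_inode);	/* may be NULL if the dir is being torn down */
	dput(parent);
	dput(dentry);
	return dir;			/* caller drops the reference with iput() */
}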
* Re: Multi-CPU harmless lockdep on x86 while copying data
  2014-03-10 11:12     ` Christoph Hellwig
@ 2014-03-10 20:51       ` Dave Chinner
  2014-03-11 16:48         ` Christoph Hellwig
  0 siblings, 1 reply; 13+ messages in thread
From: Dave Chinner @ 2014-03-10 20:51 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Michael L. Semon, xfs-oss

On Mon, Mar 10, 2014 at 04:12:53AM -0700, Christoph Hellwig wrote:
> On Mon, Mar 10, 2014 at 03:37:16AM -0700, Christoph Hellwig wrote:
> > I think the right fix is to stop abusing the iolock in filestreams.
> > To me it seems like a lock inside fstrm_item_t should be fine
> > for what the filestreams code wants, if I understand it correctly.
>
> Seems like the iolock could be removed fairly easily by using either of
> the two options:
>
>  a) reference count fstrm_item, and just grab a reference to it for each
>     child as well as the parent and insert it multiple times. Kill
>     ->pip.
>  b) only allocate and insert fstrm_items for directories. Find the
>     directory by grabbing an entry off inode->i_dentry and then grabbing
>     the parent. There always should be a dentry around when we allocate
>     blocks, and if there isn't we can just skip out of the filestreams
>     allocator. For the cases that matter there is.
>
> Both mean that the race it tries to protect against using the iolock is
> removed entirely, and the code becomes more efficient as well. Option a)
> seems simple to implement, but b) will save a lot more memory and
> operations when using the filestreams allocator.

Yeah, b) seems like the way to simplify it - the filestreams code
really only needs to track the parent/ag relationship rather than
the child/parent relationship if there is a reliable way of
determining the parent from the child.

What do we do with hardlinked files in this case? I'm happy to say
"too bad" for these files, mainly because the filestream allocator is
aimed at associating multiple file creations together, so hard links
really don't matter AFAICT...

Cheers,

Dave.
--
Dave Chinner
david@fromorbit.com
* Re: Multi-CPU harmless lockdep on x86 while copying data
  2014-03-10 20:51       ` Dave Chinner
@ 2014-03-11 16:48         ` Christoph Hellwig
  0 siblings, 0 replies; 13+ messages in thread
From: Christoph Hellwig @ 2014-03-11 16:48 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Michael L. Semon, xfs-oss

On Tue, Mar 11, 2014 at 07:51:18AM +1100, Dave Chinner wrote:
> Yeah, b) seems like the way to simplify it - the filestreams code
> really only needs to track the parent/ag relationship rather than
> the child/parent relationship if there is a reliable way of
> determining the parent from the child.
>
> What do we do with hardlinked files in this case? I'm happy to say
> "too bad" for these files, mainly because the filestream allocator is
> aimed at associating multiple file creations together, so hard links
> really don't matter AFAICT...

We could flatly refuse filestream associations for files with
i_nlink > 1, or just grab the first parent we get at. Refusing seems
like the better alternative to me.
* Re: Multi-CPU harmless lockdep on x86 while copying data
  2014-03-10 10:37   ` Christoph Hellwig
  2014-03-10 11:12     ` Christoph Hellwig
@ 2014-03-10 20:46     ` Dave Chinner
  2014-03-10 21:16       ` Ben Myers
  1 sibling, 1 reply; 13+ messages in thread
From: Dave Chinner @ 2014-03-10 20:46 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Michael L. Semon, xfs-oss

On Mon, Mar 10, 2014 at 03:37:16AM -0700, Christoph Hellwig wrote:
> On Mon, Mar 10, 2014 at 01:55:23PM +1100, Dave Chinner wrote:
> > Changing the directory code to handle this sort of locking is going
> > to require a bit of surgery. However, I can see advantages to moving
> > directory data to the same locking strategy as regular file data -
> > locking hierarchies are identical, directory ilock hold times are
> > much reduced, we don't get lockdep whining about taking page faults
> > with the ilock held, etc.
> >
> > A quick hack to demonstrate the high level, initial step of using
> > the IOLOCK for readdir serialisation. I've done a little smoke
> > testing on it, so it won't die immediately. It should get rid of all
> > the nasty lockdep issues, but it doesn't start to address the deeper
> > restructuring that is needed.
>
> What synchronization do we actually need from the iolock? Pushing the
> ilock down to where it's actually needed is a good idea either way,
> though.

The issue is that if we push the ilock down to just the block
mapping routines, the directory can be modified while the readdir is
in progress. That's the root problem that adding the ilock solved.
Now, just pushing the ilock down to protect the bmbt lookups might
result in a consistent lookup, but it won't serialise sanely against
modifications.

i.e. readdir only walks one dir block at a time, but it maps
multiple blocks for readahead and keeps them in a local array and
doesn't validate them again before issuing reads on those buffers.
Hence at a high level we currently have to serialise readdir against
all directory modifications.

The only other option we might have is to completely rewrite the
directory readahead code not to cache mappings. If we use the ilock
purely for bmbt lookup and buffer read, then the ilock will
serialise against modification, and the buffer lock will stabilise
the buffer until the readdir moves to the next buffer and picks the
ilock up again to read it.

That would avoid the need for high level serialisation, but it's a
lot more work than using the iolock to provide the high level
serialisation, and I'm still not sure it's 100% safe. And I've got no
idea if it would work for CXFS. Hopefully someone from SGI will
chime in here....

> > This would be a straightforward change, except for two things:
> > filestreams and lockdep. The filestream allocator takes the
> > directory iolock and makes assumptions about parent->child locking
> > order of the iolock which will now be invalidated. Hence some
> > changes to the filestreams code are needed to ensure that it never
> > blocks on directory iolocks and deadlocks. Instead, it needs to fail
> > stream associations when such problems occur.
>
> I think the right fix is to stop abusing the iolock in filestreams.
> To me it seems like a lock inside fstrm_item_t should be fine
> for what the filestreams code wants, if I understand it correctly.
>
> From looking over some of the filestreams code just for a few minutes
> I get an urge to redo lots of it right now..

I get that urge from time to time, too. So far I've managed to avoid
it.

> > @@ -1228,7 +1244,7 @@ xfs_create(
> >  	 * the transaction cancel unlocking dp so don't do it explicitly in the
> >  	 * error path.
> >  	 */
> > -	xfs_trans_ijoin(tp, dp, XFS_ILOCK_EXCL);
> > +	xfs_trans_ijoin(tp, dp, XFS_IOLOCK_EXCL | XFS_ILOCK_EXCL);
>
> What do we need the iolock on these operations for?

These are providing the high level readdir vs modification
serialisation protection. And we have to unlock it on transaction
commit, which is why it needs to be added to the xfs_trans_ijoin()
calls...

Cheers,

Dave.
--
Dave Chinner
david@fromorbit.com
* Re: Multi-CPU harmless lockdep on x86 while copying data
  2014-03-10 20:46     ` Dave Chinner
@ 2014-03-10 21:16       ` Ben Myers
  2014-03-10 21:24         ` Dave Chinner
  0 siblings, 1 reply; 13+ messages in thread
From: Ben Myers @ 2014-03-10 21:16 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Christoph Hellwig, Michael L. Semon, xfs-oss

Hi,

On Tue, Mar 11, 2014 at 07:46:47AM +1100, Dave Chinner wrote:
> On Mon, Mar 10, 2014 at 03:37:16AM -0700, Christoph Hellwig wrote:
> > On Mon, Mar 10, 2014 at 01:55:23PM +1100, Dave Chinner wrote:
> > > Changing the directory code to handle this sort of locking is going
> > > to require a bit of surgery. However, I can see advantages to moving
> > > directory data to the same locking strategy as regular file data -
> > > locking hierarchies are identical, directory ilock hold times are
> > > much reduced, we don't get lockdep whining about taking page faults
> > > with the ilock held, etc.
> > >
> > > A quick hack to demonstrate the high level, initial step of using
> > > the IOLOCK for readdir serialisation. I've done a little smoke
> > > testing on it, so it won't die immediately. It should get rid of all
> > > the nasty lockdep issues, but it doesn't start to address the deeper
> > > restructuring that is needed.
> >
> > What synchronization do we actually need from the iolock? Pushing the
> > ilock down to where it's actually needed is a good idea either way,
> > though.
>
> The issue is that if we push the ilock down to just the block
> mapping routines, the directory can be modified while the readdir is
> in progress. That's the root problem that adding the ilock solved.
> Now, just pushing the ilock down to protect the bmbt lookups might
> result in a consistent lookup, but it won't serialise sanely against
> modifications.
>
> i.e. readdir only walks one dir block at a time, but it maps
> multiple blocks for readahead and keeps them in a local array and
> doesn't validate them again before issuing reads on those buffers.
> Hence at a high level we currently have to serialise readdir against
> all directory modifications.
>
> The only other option we might have is to completely rewrite the
> directory readahead code not to cache mappings. If we use the ilock
> purely for bmbt lookup and buffer read, then the ilock will
> serialise against modification, and the buffer lock will stabilise
> the buffer until the readdir moves to the next buffer and picks the
> ilock up again to read it.
>
> That would avoid the need for high level serialisation, but it's a
> lot more work than using the iolock to provide the high level
> serialisation, and I'm still not sure it's 100% safe. And I've got no
> idea if it would work for CXFS. Hopefully someone from SGI will
> chime in here....

Also in leaf and node formats a single modification can change multiple
buffers, so I suspect the buffer lock isn't enough serialization to
maintain a consistent directory in the face of multiple readers and
writers. The iolock does resolve that issue.

> > > This would be a straightforward change, except for two things:
> > > filestreams and lockdep. The filestream allocator takes the
> > > directory iolock and makes assumptions about parent->child locking
> > > order of the iolock which will now be invalidated. Hence some
> > > changes to the filestreams code are needed to ensure that it never
> > > blocks on directory iolocks and deadlocks. Instead, it needs to fail
> > > stream associations when such problems occur.
> >
> > I think the right fix is to stop abusing the iolock in filestreams.
> > To me it seems like a lock inside fstrm_item_t should be fine
> > for what the filestreams code wants, if I understand it correctly.
> >
> > From looking over some of the filestreams code just for a few minutes
> > I get an urge to redo lots of it right now..
>
> I get that urge from time to time, too. So far I've managed to avoid
> it.
>
> > > @@ -1228,7 +1244,7 @@ xfs_create(
> > >  	 * the transaction cancel unlocking dp so don't do it explicitly in the
> > >  	 * error path.
> > >  	 */
> > > -	xfs_trans_ijoin(tp, dp, XFS_ILOCK_EXCL);
> > > +	xfs_trans_ijoin(tp, dp, XFS_IOLOCK_EXCL | XFS_ILOCK_EXCL);
> >
> > What do we need the iolock on these operations for?
>
> These are providing the high level readdir vs modification
> serialisation protection. And we have to unlock it on transaction
> commit, which is why it needs to be added to the xfs_trans_ijoin()
> calls...

Makes sense, I think. I'm not sure what the changes to the directory
code would look like.

-Ben
* Re: Multi-CPU harmless lockdep on x86 while copying data
  2014-03-10 21:16       ` Ben Myers
@ 2014-03-10 21:24         ` Dave Chinner
  2014-03-10 22:10           ` Ben Myers
  0 siblings, 1 reply; 13+ messages in thread
From: Dave Chinner @ 2014-03-10 21:24 UTC (permalink / raw)
  To: Ben Myers; +Cc: Christoph Hellwig, Michael L. Semon, xfs-oss

On Mon, Mar 10, 2014 at 04:16:58PM -0500, Ben Myers wrote:
> Hi,
>
> On Tue, Mar 11, 2014 at 07:46:47AM +1100, Dave Chinner wrote:
> > On Mon, Mar 10, 2014 at 03:37:16AM -0700, Christoph Hellwig wrote:
> > > On Mon, Mar 10, 2014 at 01:55:23PM +1100, Dave Chinner wrote:
> > > > Changing the directory code to handle this sort of locking is going
> > > > to require a bit of surgery. However, I can see advantages to moving
> > > > directory data to the same locking strategy as regular file data -
> > > > locking hierarchies are identical, directory ilock hold times are
> > > > much reduced, we don't get lockdep whining about taking page faults
> > > > with the ilock held, etc.
> > > >
> > > > A quick hack to demonstrate the high level, initial step of using
> > > > the IOLOCK for readdir serialisation. I've done a little smoke
> > > > testing on it, so it won't die immediately. It should get rid of all
> > > > the nasty lockdep issues, but it doesn't start to address the deeper
> > > > restructuring that is needed.
> > >
> > > What synchronization do we actually need from the iolock? Pushing the
> > > ilock down to where it's actually needed is a good idea either way,
> > > though.
> >
> > The issue is that if we push the ilock down to just the block
> > mapping routines, the directory can be modified while the readdir is
> > in progress. That's the root problem that adding the ilock solved.
> > Now, just pushing the ilock down to protect the bmbt lookups might
> > result in a consistent lookup, but it won't serialise sanely against
> > modifications.
> >
> > i.e. readdir only walks one dir block at a time, but it maps
> > multiple blocks for readahead and keeps them in a local array and
> > doesn't validate them again before issuing reads on those buffers.
> > Hence at a high level we currently have to serialise readdir against
> > all directory modifications.
> >
> > The only other option we might have is to completely rewrite the
> > directory readahead code not to cache mappings. If we use the ilock
> > purely for bmbt lookup and buffer read, then the ilock will
> > serialise against modification, and the buffer lock will stabilise
> > the buffer until the readdir moves to the next buffer and picks the
> > ilock up again to read it.
> >
> > That would avoid the need for high level serialisation, but it's a
> > lot more work than using the iolock to provide the high level
> > serialisation, and I'm still not sure it's 100% safe. And I've got no
> > idea if it would work for CXFS. Hopefully someone from SGI will
> > chime in here....
>
> Also in leaf and node formats a single modification can change multiple
> buffers, so I suspect the buffer lock isn't enough serialization to
> maintain a consistent directory in the face of multiple readers and
> writers. The iolock does resolve that issue.

Right, but we don't care about anything other than that the leaf
block we are currently reading is consistent when the read starts
and stays consistent across the entire processing. i.e. if the leaf
is locked by readdir, then the modification is completely stalled
until the readdir lets it go. And readdir then can't get the next
buffer until the modification is complete because it blocks on the
ilock to get the next mapping and buffer....

Cheers,

Dave.
--
Dave Chinner
david@fromorbit.com
* Re: Multi-CPU harmless lockdep on x86 while copying data
  2014-03-10 21:24         ` Dave Chinner
@ 2014-03-10 22:10           ` Ben Myers
  0 siblings, 0 replies; 13+ messages in thread
From: Ben Myers @ 2014-03-10 22:10 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Christoph Hellwig, Michael L. Semon, xfs-oss

On Tue, Mar 11, 2014 at 08:24:30AM +1100, Dave Chinner wrote:
> On Mon, Mar 10, 2014 at 04:16:58PM -0500, Ben Myers wrote:
> > Hi,
> >
> > On Tue, Mar 11, 2014 at 07:46:47AM +1100, Dave Chinner wrote:
> > > On Mon, Mar 10, 2014 at 03:37:16AM -0700, Christoph Hellwig wrote:
> > > > On Mon, Mar 10, 2014 at 01:55:23PM +1100, Dave Chinner wrote:
> > > > > Changing the directory code to handle this sort of locking is going
> > > > > to require a bit of surgery. However, I can see advantages to moving
> > > > > directory data to the same locking strategy as regular file data -
> > > > > locking hierarchies are identical, directory ilock hold times are
> > > > > much reduced, we don't get lockdep whining about taking page faults
> > > > > with the ilock held, etc.
> > > > >
> > > > > A quick hack to demonstrate the high level, initial step of using
> > > > > the IOLOCK for readdir serialisation. I've done a little smoke
> > > > > testing on it, so it won't die immediately. It should get rid of all
> > > > > the nasty lockdep issues, but it doesn't start to address the deeper
> > > > > restructuring that is needed.
> > > >
> > > > What synchronization do we actually need from the iolock? Pushing the
> > > > ilock down to where it's actually needed is a good idea either way,
> > > > though.
> > >
> > > The issue is that if we push the ilock down to just the block
> > > mapping routines, the directory can be modified while the readdir is
> > > in progress. That's the root problem that adding the ilock solved.
> > > Now, just pushing the ilock down to protect the bmbt lookups might
> > > result in a consistent lookup, but it won't serialise sanely against
> > > modifications.
> > >
> > > i.e. readdir only walks one dir block at a time, but it maps
> > > multiple blocks for readahead and keeps them in a local array and
> > > doesn't validate them again before issuing reads on those buffers.
> > > Hence at a high level we currently have to serialise readdir against
> > > all directory modifications.
> > >
> > > The only other option we might have is to completely rewrite the
> > > directory readahead code not to cache mappings. If we use the ilock
> > > purely for bmbt lookup and buffer read, then the ilock will
> > > serialise against modification, and the buffer lock will stabilise
> > > the buffer until the readdir moves to the next buffer and picks the
> > > ilock up again to read it.
> > >
> > > That would avoid the need for high level serialisation, but it's a
> > > lot more work than using the iolock to provide the high level
> > > serialisation, and I'm still not sure it's 100% safe. And I've got no
> > > idea if it would work for CXFS. Hopefully someone from SGI will
> > > chime in here....
> >
> > Also in leaf and node formats a single modification can change multiple
> > buffers, so I suspect the buffer lock isn't enough serialization to
> > maintain a consistent directory in the face of multiple readers and
> > writers. The iolock does resolve that issue.
>
> Right, but we don't care about anything other than that the leaf
> block we are currently reading is consistent when the read starts
> and stays consistent across the entire processing. i.e. if the leaf
> is locked by readdir, then the modification is completely stalled
> until the readdir lets it go. And readdir then can't get the next
> buffer until the modification is complete because it blocks on the
> ilock to get the next mapping and buffer....

As long as (as you pointed out above) the readahead buffers aren't
cached, and all of the callers who do require that
data/freeindex/node/leaf blocks be consistent continue to take the
ilock... Yeah, I think that might work.

-Ben
* Re: Multi-CPU harmless lockdep on x86 while copying data
2014-03-10 2:55 ` Dave Chinner
2014-03-10 10:37 ` Christoph Hellwig
@ 2014-03-10 20:52 ` Ben Myers
2014-03-10 21:20 ` Dave Chinner
1 sibling, 1 reply; 13+ messages in thread
From: Ben Myers @ 2014-03-10 20:52 UTC (permalink / raw)
To: Dave Chinner; +Cc: Michael L. Semon, xfs-oss

Hi Dave,

On Mon, Mar 10, 2014 at 01:55:23PM +1100, Dave Chinner wrote:
> On Sat, Mar 08, 2014 at 09:58:01PM -0500, Michael L. Semon wrote:
> > Hi! I was playing a shell game with some precious backup data. It
> > went like this:
> >
> > open a 36-GB source partition (DM linear, partitions from 2 drives,
> > v5-superblock XFS)
> >
> > open a 32-GB aes-xts crypt (v4-superblock XFS)
> >
> > `cp -av` from holding partition to crypt
> >
> > It was during the cp operation that I got this multi-CPU version of
> > the harmless lockdep:
> >
> > =========================================================
> > [ INFO: possible irq lock inversion dependency detected ]
> > 3.14.0-rc5+ #6 Not tainted
> > ---------------------------------------------------------
> > kswapd0/25 just changed the state of lock:
> > (&xfs_dir_ilock_class){++++-.}, at: [<791c09fa>] xfs_ilock+0xff/0x16c
> > but this lock took another, RECLAIM_FS-unsafe lock in the past:
> > (&mm->mmap_sem){++++++}
> >
> > and interrupts could create inverse lock ordering between them.
> >
> > other info that might help us debug this:
> > Possible interrupt unsafe locking scenario:
> >
> >        CPU0                    CPU1
> >        ----                    ----
> >   lock(&mm->mmap_sem);
> >                                local_irq_disable();
> >                                lock(&xfs_dir_ilock_class);
> >                                lock(&mm->mmap_sem);
> >   <Interrupt>
> >     lock(&xfs_dir_ilock_class);
>
> Well, that's rather interesting. I'd love to know where we take the
> directory ilock with interrupts disabled - and then take a page
> fault... :/
>
> ....
>
> > }
> > ... key at: [<795f1eec>] __key.44037+0x0/0x8
> > ... acquired at:
> >   [<79064360>] __lock_acquire+0x397/0xa6c
> >   [<7906510f>] lock_acquire+0x8b/0x101
> >   [<790d1718>] might_fault+0x81/0xa9
> >   [<790f379a>] filldir64+0x92/0xe2
> >   [<7916f079>] xfs_dir2_sf_getdents+0x1a0/0x44c
> >   [<7917009e>] xfs_readdir+0xc4/0x126
> >   [<79171b8b>] xfs_file_readdir+0x25/0x3e
> >   [<790f385a>] iterate_dir+0x70/0x9b
> >   [<790f3a31>] SyS_getdents64+0x6d/0xcc
> >   [<792f85b8>] sysenter_do_call+0x12/0x36
>
> All of these paths are from the syscall path, except for one in
> kswapd context. I don't see any stack trace here that could result
> in the above "deadlock". It looks to be yet another false
> positive...
>
> > I call it harmless because its single-CPU variant can be reproduced
> > as far back as I could bisect in earlier testing (way before
> > kernel 2.6.20). However, such lockdep splats have never manifested
> > in a noticeable problem on production kernels on x86. Either
> > down_write_nested or down_read_nested is in all of them, depending
> > on which kernel version is in use. At least one reclaim-related
> > function is in there as well. vm_map_ram used to be in there, but Dave
> > took care of that (thanks!).
> >
> > This particular splat has been showing up more often, though. It's not
> > tied to one particular commit, event, or change; just something that
> > has crept in gradually since maybe kernel 3.11. It's like watching
> > grass grow or watching paint dry.
>
> The above reports all indicate that taking a page fault while
> holding the directory ilock is problematic. So have all the other
> new lockdep reports we've got that have been caused by that change.
>
> Realistically, I think that the readdir locking change was
> fundamentally broken. Why? Because taking page faults while holding
> the ilock violates the lock order we have for inodes. For regular
> files, the lock hierarchy is iolock -> page fault -> ilock, and we
> got rid of the need for the iolock in the inode reclaim path because
> of all the problems it caused lockdep. Now we've gone and
> reintroduced that same set of problems - only worse - by allowing
> the readdir code to take page faults while holding the ilock....
>
> I'd like to revert the change that introduced the ilock into the
> readdir path, but I suspect that would be an unpopular direction to
> take. However, we need a solution that doesn't drive us insane with
> lockdep false positives.
>
> IOWs, we need to do this differently - readdir needs to be protected
> by the iolock, not the ilock, and only the block mapping routines in
> the readdir code should be protected by the ilock. i.e. the same
> locking strategy that is used for regular files: iolock protects the
> contents of the file, the ilock protects the inode metadata.
>
> What I think we really need to do is drive the ilock all the way
> into the block mapping functions of the directory code and to the
> places where the inode metadata is actually going to be modified.
> The iolock protects access to the directory data, and logging of
> changes to those blocks is serialised by the buffer locks, so we
> don't actually need the inode ilock to serialise access to the
> directory data - we only need the ilock for modifications to the
> inode and its blockmaps itself, same as for regular files.
>
> Changing the directory code to handle this sort of locking is going
> to require a bit of surgery. However, I can see advantages to moving
> directory data to the same locking strategy as regular file data -
> locking hierarchies are identical, directory ilock hold times are
> much reduced, we don't get lockdep whining about taking page faults
> with the ilock held, etc.
>
> A quick hack to demonstrate the high level, initial step of using
> the IOLOCK for readdir serialisation. I've done a little smoke
> testing on it, so it won't die immediately. It should get rid of all
> the nasty lockdep issues, but it doesn't start to address the deeper
> restructuring that is needed.
>
> Thoughts, comments? I'm especially interested in what SGI people
> have to say here because this locking change was needed for CXFS,
> and changing the directory locking hierarchy is probably going to
> upset CXFS a lot...

Protecting directory contents with the iolock and pushing the ilock down
to protect the directory block mapping functions seems like a good
strategy to me. In this area I think it's good to treat directories the
same as regular files. I have not had a very close look yet... and
couldn't find where we take a fault with irqs disabled. Maybe something
out of a workqueue?

Thanks,
Ben
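The initial step Dave is describing would look roughly like the sketch
below. It is modelled on the 3.14-era xfs_file_readdir() and uses the
real xfs_ilock()/xfs_iunlock() and XFS_IOLOCK_SHARED interfaces, but it
is a simplified illustration of the idea, not his actual patch:

/*
 * Sketch: serialise readdir against directory modification with the
 * iolock at the top level.  The deeper restructuring would then push
 * the ilock down so it is held only around the block mapping calls
 * inside xfs_readdir().
 */
STATIC int
xfs_file_readdir(
	struct file		*file,
	struct dir_context	*ctx)
{
	struct inode		*inode = file_inode(file);
	struct xfs_inode	*ip = XFS_I(inode);
	size_t			bufsize;
	int			error;

	bufsize = (size_t)min_t(loff_t, 32768, ip->i_d.di_size);

	xfs_ilock(ip, XFS_IOLOCK_SHARED);	/* directory contents */
	error = xfs_readdir(ip, ctx, bufsize);
	xfs_iunlock(ip, XFS_IOLOCK_SHARED);

	return error;
}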
* Re: Multi-CPU harmless lockdep on x86 while copying data
2014-03-10 20:52 ` Ben Myers
@ 2014-03-10 21:20 ` Dave Chinner
2014-03-10 21:30 ` Ben Myers
0 siblings, 1 reply; 13+ messages in thread
From: Dave Chinner @ 2014-03-10 21:20 UTC (permalink / raw)
To: Ben Myers; +Cc: Michael L. Semon, xfs-oss

On Mon, Mar 10, 2014 at 03:52:49PM -0500, Ben Myers wrote:
> Hi Dave,
>
> On Mon, Mar 10, 2014 at 01:55:23PM +1100, Dave Chinner wrote:
> > On Sat, Mar 08, 2014 at 09:58:01PM -0500, Michael L. Semon wrote:
> > > Hi! I was playing a shell game with some precious backup data.
.....
> > Well, that's rather interesting. I'd love to know where we take the
> > directory ilock with interrupts disabled - and then take a page
> > fault... :/
.....
> > IOWs, we need to do this differently - readdir needs to be protected
> > by the iolock, not the ilock, and only the block mapping routines in
> > the readdir code should be protected by the ilock. i.e. the same
> > locking strategy that is used for regular files: iolock protects the
> > contents of the file, the ilock protects the inode metadata.
.....
> > Thoughts, comments? I'm especially interested in what SGI people
> > have to say here because this locking change was needed for CXFS,
> > and changing the directory locking hierarchy is probably going to
> > upset CXFS a lot...
>
> Protecting directory contents with the iolock and pushing the ilock down
> to protect the directory block mapping functions seems like a good
> strategy to me.

Ok, it seems like a good strategy, but does it work for CXFS?

What about the other option of a combination of ilock for mapping and
read, and buffer locks for stabilisation?

> In this area I think it's good to treat directories the
> same as regular files. I have not had a very close look yet... and
> couldn't find where we take a fault with irqs disabled. Maybe something
> out of a workqueue?

Workqueues don't run with interrupts disabled. Yes, we take inode
locks in workqueue context (e.g. AIL tail pushing), but that's not
what is being warned about here. And the page fault only comes from
the filldir callback, which runs copy_to_user(), and that most
definitely does not occur in interrupt context.

FWIW, it's invalid to take the ilock in interrupt context because
it's a sleeping lock, unless trylock semantics are used. If you use
"trylock" semantics, then lockdep doesn't complain, because a trylock
can't deadlock. I *think* that lockdep is complaining about reclaim
contexts, not interrupts, and it's reporting an "interrupts disabled"
deadlock because the memory reclaim state tracking is piggy-backed on
the interrupt state tracking within lockdep. Either way, though,
lockdep is wrong.

Cheers,

Dave.
--
Dave Chinner
david@fromorbit.com
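The trylock semantics Dave refers to are what the inode reclaim path
already uses. Below is a condensed sketch based on the 3.14-era
xfs_reclaim_inode(); the real function's flushing and error handling
are omitted, and only xfs_ilock_nowait() - the real trylock variant of
xfs_ilock() - and xfs_iunlock() are actual XFS calls:

/*
 * Reclaim-context lock acquisition cannot deadlock because it only
 * ever uses trylock semantics: if the ilock is contended, back off
 * and let a later reclaim pass retry this inode.
 */
STATIC int
xfs_reclaim_inode_sketch(
	struct xfs_inode	*ip)
{
	if (!xfs_ilock_nowait(ip, XFS_ILOCK_EXCL))
		return 0;		/* contended - skip this inode */

	/* ... inode flush and reclaim work elided ... */

	xfs_iunlock(ip, XFS_ILOCK_EXCL);
	return 0;
}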
* Re: Multi-CPU harmless lockdep on x86 while copying data
2014-03-10 21:20 ` Dave Chinner
@ 2014-03-10 21:30 ` Ben Myers
0 siblings, 0 replies; 13+ messages in thread
From: Ben Myers @ 2014-03-10 21:30 UTC (permalink / raw)
To: Dave Chinner; +Cc: Michael L. Semon, xfs-oss

Hi Dave,

On Tue, Mar 11, 2014 at 08:20:27AM +1100, Dave Chinner wrote:
> On Mon, Mar 10, 2014 at 03:52:49PM -0500, Ben Myers wrote:
> > Hi Dave,
> >
> > On Mon, Mar 10, 2014 at 01:55:23PM +1100, Dave Chinner wrote:
.....
> > Protecting directory contents with the iolock and pushing the ilock down
> > to protect the directory block mapping functions seems like a good
> > strategy to me.
>
> Ok, it seems like a good strategy, but does it work for CXFS?

Yes, it should work for CXFS.

> What about the other option of a combination of ilock for mapping and
> read, and buffer locks for stabilisation?

I think this option probably won't work because of those cases where you
modify more than one buffer for a single directory modification, e.g. in
a node directory you might modify a buffer each for the data, freeindex,
node, and leaf blocks.

> > In this area I think it's good to treat directories the
> > same as regular files. I have not had a very close look yet... and
> > couldn't find where we take a fault with irqs disabled. Maybe something
> > out of a workqueue?
>
> Workqueues don't run with interrupts disabled. Yes, we take inode
> locks in workqueue context (e.g. AIL tail pushing), but that's not
> what is being warned about here. And the page fault only comes from
> the filldir callback, which runs copy_to_user(), and that most
> definitely does not occur in interrupt context.
>
> FWIW, it's invalid to take the ilock in interrupt context because
> it's a sleeping lock, unless trylock semantics are used. If you use
> "trylock" semantics, then lockdep doesn't complain, because a trylock
> can't deadlock. I *think* that lockdep is complaining about reclaim
> contexts, not interrupts, and it's reporting an "interrupts disabled"
> deadlock because the memory reclaim state tracking is piggy-backed on
> the interrupt state tracking within lockdep. Either way, though,
> lockdep is wrong.

Yeah, that sounds plausible.

-Ben
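Ben's objection can be shown schematically: one logical directory
operation dirties several buffers within a single transaction, so a
reader taking buffer locks one at a time can combine a pre-update view
of one block with a post-update view of another. All the lowercase
helpers below are invented for illustration; only xfs_trans_log_buf()
is the real buffer-logging call, and the logged byte ranges are made
up:

/*
 * Schematic node-format directory insert: one operation, several
 * buffers logged in one transaction.  Buffer locks alone cannot give
 * a reader a consistent snapshot across all of these blocks, which is
 * why callers that need data/freeindex/node/leaf consistency take the
 * ilock instead.
 */
static void
dir_node_insert_sketch(
	struct xfs_trans	*tp,
	struct xfs_buf		*data_bp,
	struct xfs_buf		*free_bp,
	struct xfs_buf		*leaf_bp)
{
	add_entry_to_data_block(data_bp);	/* invented helper */
	xfs_trans_log_buf(tp, data_bp, 0, 127);

	update_freeindex_block(free_bp);	/* invented helper */
	xfs_trans_log_buf(tp, free_bp, 0, 63);

	insert_leaf_entry(leaf_bp);		/* invented helper */
	xfs_trans_log_buf(tp, leaf_bp, 0, 63);
}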
end of thread, other threads: [~2014-03-11 16:48 UTC | newest]

Thread overview: 13+ messages
2014-03-09  2:58 Multi-CPU harmless lockdep on x86 while copying data Michael L. Semon
2014-03-10  2:55 ` Dave Chinner
2014-03-10 10:37 ` Christoph Hellwig
2014-03-10 11:12 ` Christoph Hellwig
2014-03-10 20:51 ` Dave Chinner
2014-03-11 16:48 ` Christoph Hellwig
2014-03-10 20:46 ` Dave Chinner
2014-03-10 21:16 ` Ben Myers
2014-03-10 21:24 ` Dave Chinner
2014-03-10 22:10 ` Ben Myers
2014-03-10 20:52 ` Ben Myers
2014-03-10 21:20 ` Dave Chinner
2014-03-10 21:30 ` Ben Myers