__sb_start_write() && force

linux-fsdevel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* __sb_start_write() && force_trylock hack
@ 2015-08-18 14:49 Oleg Nesterov
  2015-08-18 15:18 ` Oleg Nesterov
  2015-08-19  3:11 ` Dave Chinner
  0 siblings, 2 replies; 7+ messages in thread
From: Oleg Nesterov @ 2015-08-18 14:49 UTC (permalink / raw)
  To: Dave Chinner, Jan Kara; +Cc: linux-fsdevel, linux-kernel

Jan, Dave, perhaps you can take a look...

On 08/14, Oleg Nesterov wrote:
>
> Plus another patch which removes the "trylock"
> hack in __sb_start_write().

I meant the patch we already discussed (attached at the end). And yes,
previously I reported it passed the tests. However, I only ran the same
'grep -il freeze tests/*/???' tests. When I tried to run all tests, I
got the new reports from lockdep.

	[ 2098.200818] [ INFO: possible recursive locking detected ]
	[ 2098.206845] 4.2.0-rc6+ #27 Not tainted
	[ 2098.211026] ---------------------------------------------
	[ 2098.217050] fsstress/50971 is trying to acquire lock:
	[ 2098.222679]  (sb_internal#2){++++++}, at: [<ffffffff81248d32>] __sb_start_write+0x32/0x40
	[ 2098.231886]
		       but task is already holding lock:
	[ 2098.238394]  (sb_internal#2){++++++}, at: [<ffffffff81248d32>] __sb_start_write+0x32/0x40
	[ 2098.247578]
		       other info that might help us debug this:
	[ 2098.254852]  Possible unsafe locking scenario:

	[ 2098.261458]        CPU0
	[ 2098.264185]        ----
	[ 2098.266913]   lock(sb_internal#2);
	[ 2098.270740]   lock(sb_internal#2);
	[ 2098.274566]
			*** DEADLOCK ***

	[ 2098.281171]  May be due to missing lock nesting notation

	[ 2098.288744] 4 locks held by fsstress/50971:
	[ 2098.293408]  #0:  (sb_writers#13){++++++}, at: [<ffffffff81248d32>] __sb_start_write+0x32/0x40
	[ 2098.303085]  #1:  (&type->i_mutex_dir_key#4/1){+.+.+.}, at: [<ffffffff8125685f>] filename_create+0x7f/0x170
	[ 2098.314038]  #2:  (sb_internal#2){++++++}, at: [<ffffffff81248d32>] __sb_start_write+0x32/0x40
	[ 2098.323711]  #3:  (&type->s_umount_key#54){++++++}, at: [<ffffffffa05a638c>] xfs_flush_inodes+0x1c/0x40 [xfs]
	[ 2098.334898]
		       stack backtrace:
	[ 2098.339762] CPU: 3 PID: 50971 Comm: fsstress Not tainted 4.2.0-rc6+ #27
	[ 2098.347143] Hardware name: Intel Corporation S2600CP/S2600CP, BIOS RMLSDP.86I.R3.27.D685.1305151734 05/15/2013
	[ 2098.358303]  0000000000000000 00000000e70ee864 ffff880c05a2b9c8 ffffffff817ee692
	[ 2098.366603]  0000000000000000 ffffffff826f8030 ffff880c05a2bab8 ffffffff810f45be
	[ 2098.374900]  0000000000000000 ffff880c05a2bb20 0000000000000000 0000000000000000
	[ 2098.383197] Call Trace:
	[ 2098.385930]  [<ffffffff817ee692>] dump_stack+0x45/0x57
	[ 2098.391667]  [<ffffffff810f45be>] __lock_acquire+0x1e9e/0x2040
	[ 2098.398177]  [<ffffffff810f310d>] ? __lock_acquire+0x9ed/0x2040
	[ 2098.404787]  [<ffffffff811d4702>] ? pagevec_lookup_entries+0x22/0x30
	[ 2098.411879]  [<ffffffff811d5142>] ? truncate_inode_pages_range+0x2b2/0x7e0
	[ 2098.419551]  [<ffffffff810f542e>] lock_acquire+0xbe/0x150
	[ 2098.425566]  [<ffffffff81248d32>] ? __sb_start_write+0x32/0x40
	[ 2098.432079]  [<ffffffff810ede91>] percpu_down_read+0x51/0xa0
	[ 2098.438396]  [<ffffffff81248d32>] ? __sb_start_write+0x32/0x40
	[ 2098.444905]  [<ffffffff81248d32>] __sb_start_write+0x32/0x40
	[ 2098.451244]  [<ffffffffa05a7f84>] xfs_trans_alloc+0x24/0x40 [xfs]
	[ 2098.458076]  [<ffffffffa059e872>] xfs_inactive_truncate+0x32/0x110 [xfs]
	[ 2098.465580]  [<ffffffffa059f662>] xfs_inactive+0x102/0x120 [xfs]
	[ 2098.472308]  [<ffffffffa05a475b>] xfs_fs_evict_inode+0x7b/0xc0 [xfs]
	[ 2098.479401]  [<ffffffff812642ab>] evict+0xab/0x170
	[ 2098.484748]  [<ffffffff81264568>] iput+0x1a8/0x230
	[ 2098.490100]  [<ffffffff8127701c>] sync_inodes_sb+0x14c/0x1d0
	[ 2098.496439]  [<ffffffffa05a6398>] xfs_flush_inodes+0x28/0x40 [xfs]
	[ 2098.503361]  [<ffffffffa059e213>] xfs_create+0x5c3/0x6d0 [xfs]
	[ 2098.509871]  [<ffffffff8135e6bd>] ? avc_has_perm+0x2d/0x280
	[ 2098.516105]  [<ffffffffa059a33e>] xfs_generic_create+0x1de/0x2b0 [xfs]
	[ 2098.523412]  [<ffffffffa059a444>] xfs_vn_mknod+0x14/0x20 [xfs]
	[ 2098.529923]  [<ffffffff81251db5>] vfs_mknod+0xe5/0x160
	[ 2098.535657]  [<ffffffff81257619>] SyS_mknod+0x1a9/0x200
	[ 2098.541490]  [<ffffffff817f702e>] entry_SYSCALL_64_fastpath+0x12/0x76

	[ 2151.962337] run fstests generic/084 at 2015-08-17 10:39:08

So it seems that xfs still does recursive read_lock().

Then another one after I rebooted and restarted the test:

	[ 1835.388225] ======================================================
	[ 1835.395118] [ INFO: possible circular locking dependency detected ]
	[ 1835.402110] 4.2.0-rc6+ #27 Not tainted
	[ 1835.406288] -------------------------------------------------------
	[ 1835.413278] renameat2/42316 is trying to acquire lock:
	[ 1835.419008]  (sb_internal){++++..}, at: [<ffffffff81248d32>] __sb_start_write+0x32/0x40
	[ 1835.427982]
		       but task is already holding lock:
	[ 1835.434490]  (jbd2_handle){+.+...}, at: [<ffffffff81328ac0>] start_this_handle+0x1a0/0x600
	[ 1835.443753]
		       which lock already depends on the new lock.

	[ 1835.452874]
		       the existing dependency chain (in reverse order) is:
	[ 1835.461223]
		       -> #1 (jbd2_handle){+.+...}:
	[ 1835.465832]        [<ffffffff810f542e>] lock_acquire+0xbe/0x150
	[ 1835.472449]        [<ffffffff81328b1b>] start_this_handle+0x1fb/0x600
	[ 1835.479644]        [<ffffffff813291ce>] jbd2__journal_start+0xae/0x1b0
	[ 1835.486936]        [<ffffffff8130fa35>] __ext4_journal_start_sb+0x75/0x110
	[ 1835.494617]        [<ffffffff812e1c7a>] ext4_evict_inode+0x1aa/0x4c0
	[ 1835.501716]        [<ffffffff812642ab>] evict+0xab/0x170
	[ 1835.507651]        [<ffffffff81264568>] iput+0x1a8/0x230
	[ 1835.513585]        [<ffffffff81256d91>] do_unlinkat+0x1d1/0x2a0
	[ 1835.520200]        [<ffffffff81257896>] SyS_unlink+0x16/0x20
	[ 1835.526520]        [<ffffffff817f702e>] entry_SYSCALL_64_fastpath+0x12/0x76
	[ 1835.534301]
		       -> #0 (sb_internal){++++..}:
	[ 1835.538907]        [<ffffffff810f441f>] __lock_acquire+0x1cff/0x2040
	[ 1835.546004]        [<ffffffff810f542e>] lock_acquire+0xbe/0x150
	[ 1835.552617]        [<ffffffff810ede91>] percpu_down_read+0x51/0xa0
	[ 1835.559521]        [<ffffffff81248d32>] __sb_start_write+0x32/0x40
	[ 1835.566424]        [<ffffffff812e1c03>] ext4_evict_inode+0x133/0x4c0
	[ 1835.573521]        [<ffffffff812642ab>] evict+0xab/0x170
	[ 1835.579454]        [<ffffffff81264568>] iput+0x1a8/0x230
	[ 1835.585386]        [<ffffffff812eb61e>] ext4_rename+0x14e/0x8a0
	[ 1835.591999]        [<ffffffff812ebd8d>] ext4_rename2+0x1d/0x30
	[ 1835.598513]        [<ffffffff812524c2>] vfs_rename+0x552/0x880
	[ 1835.605029]        [<ffffffff81258417>] SyS_renameat2+0x537/0x590
	[ 1835.611836]        [<ffffffff817f702e>] entry_SYSCALL_64_fastpath+0x12/0x76
	[ 1835.619612]
		       other info that might help us debug this:
	[ 1835.628540]  Possible unsafe locking scenario:

	[ 1835.635143]        CPU0                    CPU1
	[ 1835.640193]        ----                    ----
	[ 1835.645245]   lock(jbd2_handle);
	[ 1835.648861]                                lock(sb_internal);
	[ 1835.655287]                                lock(jbd2_handle);
	[ 1835.661714]   lock(sb_internal);
	[ 1835.665330]
			*** DEADLOCK ***

	[ 1835.671932] 4 locks held by renameat2/42316:
	[ 1835.676691]  #0:  (sb_writers#8){++++.+}, at: [<ffffffff81248d32>] __sb_start_write+0x32/0x40
	[ 1835.686263]  #1:  (&type->i_mutex_dir_key#3/1){+.+.+.}, at: [<ffffffff81250e04>] lock_rename+0xe4/0x110
	[ 1835.696815]  #2:  (&type->i_mutex_dir_key#3){+.+.+.}, at: [<ffffffff81252306>] vfs_rename+0x396/0x880
	[ 1835.707165]  #3:  (jbd2_handle){+.+...}, at: [<ffffffff81328ac0>] start_this_handle+0x1a0/0x600
	...
	[ 1836.741417] run fstests generic/079 at 2015-08-17 09:48:19

This is the same thing. This lock inversion is safe but only because we
hold the lock on the higher layer.

Looks like, we can't remove this force_trylock hack. Probably the comment
should be updated...

This is 4.2.0-rc6 + Dave's xfs ILOCK fixes + "[PATCH v3 0/8] change sb_writers
to use percpu_rw_semaphore" I sent to Al.

Oleg.

-------------------------------------------------------------------------------
Subject: [PATCH] kill the "force_trylock" hack in __sb_start_write()

As Dave explains, the comment in __sb_start_write() is not right, xfs
no longer does read_lock() twice at the same level:

	when we duplicate a transaction for a rolling commit, we do it
	before committing the current transaction is committed. I think
	that used to take a second freeze reference, which only existed
	until the first transaction was committed. We do things a bit
	differently now - we hold a state flag on the transaction to
	indicate it needs to release the freeze reference when it is
	freed and we pass it to the new transaction so that the first
	transaction commit doesn't release it.

So we can always use percpu_down_read() if wait == T.

TODO: export percpu_down_read/etc and turn __sb_start_write() and
__sb_end_write() into the trivial inlines in include/linux/fs.h.

Suggested-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---
 fs/super.c |   30 +++---------------------------
 1 files changed, 3 insertions(+), 27 deletions(-)

diff --git a/fs/super.c b/fs/super.c
index 1170dec..0b06e36 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -1169,36 +1169,12 @@ EXPORT_SYMBOL(__sb_end_write);
  */
 int __sb_start_write(struct super_block *sb, int level, bool wait)
 {
-	bool force_trylock = false;
-	int ret = 1;
-
-#ifdef CONFIG_LOCKDEP
-	/*
-	 * We want lockdep to tell us about possible deadlocks with freezing
-	 * but it's it bit tricky to properly instrument it. Getting a freeze
-	 * protection works as getting a read lock but there are subtle
-	 * problems. XFS for example gets freeze protection on internal level
-	 * twice in some cases, which is OK only because we already hold a
-	 * freeze protection also on higher level. Due to these cases we have
-	 * to use wait == F (trylock mode) which must not fail.
-	 */
 	if (wait) {
-		int i;
-
-		for (i = 0; i < level - 1; i++)
-			if (percpu_rwsem_is_held(sb->s_writers.rw_sem + i)) {
-				force_trylock = true;
-				break;
-			}
-	}
-#endif
-	if (wait && !force_trylock)
 		percpu_down_read(sb->s_writers.rw_sem + level-1);
-	else
-		ret = percpu_down_read_trylock(sb->s_writers.rw_sem + level-1);
+		return 1;
+	}

-	WARN_ON(force_trylock & !ret);
-	return ret;
+	return percpu_down_read_trylock(sb->s_writers.rw_sem + level-1);
 }
 EXPORT_SYMBOL(__sb_start_write);

--
1.5.5.1

^ permalink raw reply related	[flat|nested] 7+ messages in thread

* Re: __sb_start_write() && force_trylock hack
  2015-08-18 14:49 __sb_start_write() && force_trylock hack Oleg Nesterov
@ 2015-08-18 15:18 ` Oleg Nesterov
  2015-08-19  3:25   ` Dave Chinner
  2015-08-19  3:11 ` Dave Chinner
  1 sibling, 1 reply; 7+ messages in thread
From: Oleg Nesterov @ 2015-08-18 15:18 UTC (permalink / raw)
  To: Dave Chinner, Jan Kara; +Cc: linux-fsdevel, linux-kernel

On 08/18, Oleg Nesterov wrote:
>
> When I tried to run all tests, I
> got the new reports from lockdep.

Just in case... when I run all tests I see misc failures (with or without
the changes above) which I didn't try to interpret. In particular xfs/073
just hangs, "shutdown -r" doesn't work, the serial console continues to
print

	[10784.605139] XFS (loop2): metadata I/O error: block 0x7d4948 ("xfs_buf_iodone_callbacks") error 5 numblks 8
	[10784.605207] loop: Write error at byte offset 26843578368, length 4096.
	[10784.605222] loop: Write error at byte offset 26843578368, length 4096.
	[10784.605235] loop: Write error at byte offset 26843578368, length 4096.
	[10784.605248] loop: Write error at byte offset 26843578368, length 4096.
	[10784.605261] loop: Write error at byte offset 26843578368, length 4096.
	[10784.605437] blk_update_request: I/O error, dev loop2, sector 52428864
	[10784.605642] blk_update_request: I/O error, dev loop2, sector 52429104
	[10784.605897] blk_update_request: I/O error, dev loop2, sector 104960064
	[10784.606028] blk_update_request: I/O error, dev loop2, sector 104960304
	[10784.606069] blk_update_request: I/O error, dev loop2, sector 104960416
	[10784.606307] blk_update_request: I/O error, dev loop2, sector 157286464
	[10784.606414] blk_update_request: I/O error, dev loop2, sector 157286704
	[10784.635455] XFS: Failing async write: 55598 callbacks suppressed
	[10784.635458] XFS (loop2): Failing async write on buffer block 0x3200070. Retrying async write.
	[10784.635460] XFS (loop2): Failing async write on buffer block 0x3200068. Retrying async write.
	[10784.635463] XFS (loop2): Failing async write on buffer block 0x9600060. Retrying async write.
	[10784.635465] XFS (loop2): Failing async write on buffer block 0x3200060. Retrying async write.
	[10784.635468] XFS (loop2): Failing async write on buffer block 0x9600050. Retrying async write.
	[10784.635469] XFS (loop2): Failing async write on buffer block 0x9600040. Retrying async write.
	[10784.635470] XFS (loop2): Failing async write on buffer block 0x6419050. Retrying async write.
	[10784.635472] XFS (loop2): Failing async write on buffer block 0x6419040. Retrying async write.
	[10784.635473] XFS (loop2): Failing async write on buffer block 0x3200050. Retrying async write.
	[10784.635474] XFS (loop2): Failing async write on buffer block 0x3200040. Retrying async write.
	[10789.577360] lo_write_bvec: 18978 callbacks suppressed
	[10789.583033] loop: Write error at byte offset 4203909120, length 4096.
	[10789.590232] blk_update_request: 10746 callbacks suppressed
	[10789.596370] blk_update_request: I/O error, dev loop2, sector 8210760
	[10789.603498] loop: Write error at byte offset 4208594944, length 4096.
	[10789.610704] blk_update_request: I/O error, dev loop2, sector 8219912
	[10789.617830] XFS (loop2): metadata I/O error: block 0x7d4948 ("xfs_buf_iodone_callbacks") error 5 numblks 8

again and again.

Plus some tests need a lot of time (for example generic/127 more than 70 minutes!),
so I never managed to run this test-suite completely.



Also, tests/generic/078 on 4.2.0-rc6 (without other changes) triggered

	[ 2100.404545] BUG: looking up invalid subclass: 8
	[ 2100.409600] turning off the locking correctness validator.
	[ 2100.415722] CPU: 11 PID: 45819 Comm: renameat2 Not tainted 4.2.0-rc6 #29
	[ 2100.423190] Hardware name: Intel Corporation S2600CP/S2600CP, BIOS RMLSDP.86I.R3.27.D685.1305151734 05/15/2013
	[ 2100.434349]  0000000000000000 000000002d12fba7 ffff880befce7a78 ffffffff817ee6f2
	[ 2100.442641]  0000000000000000 0000000000000296 ffff880befce7b68 ffffffff810f2e94
	[ 2100.450946]  ffffffff826f95e0 ffff880c08834380 ffff880befce7ab8 ffffffff8102611f
	[ 2100.459241] Call Trace:
	[ 2100.461971]  [<ffffffff817ee6f2>] dump_stack+0x45/0x57
	[ 2100.467705]  [<ffffffff810f2e94>] __lock_acquire+0xb14/0x2040
	[ 2100.474107]  [<ffffffff8102611f>] ? save_stack_trace+0x2f/0x50
	[ 2100.480614]  [<ffffffff810f2d6d>] ? __lock_acquire+0x9ed/0x2040
	[ 2100.487221]  [<ffffffff8110fa5d>] ? debug_lockdep_rcu_enabled+0x1d/0x20
	[ 2100.494600]  [<ffffffff810f508e>] lock_acquire+0xbe/0x150
	[ 2100.500648]  [<ffffffffa04edeee>] ? xfs_ilock+0x10e/0x160 [xfs]
	[ 2100.507266]  [<ffffffffa04ee941>] ? xfs_lock_inodes+0x131/0x1e0 [xfs]
	[ 2100.514454]  [<ffffffff810edbcc>] down_write_nested+0x3c/0x70
	[ 2100.520875]  [<ffffffffa04edeee>] ? xfs_ilock+0x10e/0x160 [xfs]
	[ 2100.527496]  [<ffffffffa04ff803>] ? xfs_log_reserve+0x173/0x1c0 [xfs]
	[ 2100.534692]  [<ffffffffa04edeee>] xfs_ilock+0x10e/0x160 [xfs]
	[ 2100.541104]  [<ffffffffa04ee941>] xfs_lock_inodes+0x131/0x1e0 [xfs]
	[ 2100.548108]  [<ffffffffa04f1c37>] xfs_rename+0x2d7/0x970 [xfs]
	[ 2100.554615]  [<ffffffff810f1de3>] ? mark_held_locks+0x73/0xb0
	[ 2100.561026]  [<ffffffff810f202d>] ? trace_hardirqs_on+0xd/0x10
	[ 2100.567546]  [<ffffffffa04ebc33>] xfs_vn_rename+0xb3/0xf0 [xfs]
	[ 2100.574150]  [<ffffffff81252512>] vfs_rename+0x552/0x880
	[ 2100.580077]  [<ffffffff81258467>] SyS_renameat2+0x537/0x590
	[ 2100.586293]  [<ffffffff817f70ae>] entry_SYSCALL_64_fastpath+0x12/0x76
	[ 2101.268858] XFS (loop0): Unmounting Filesystem
	[ 2101.615330] XFS (loop0): Mounting V4 Filesystem
	[ 2101.660285] XFS (loop0): Ending clean mount
	[ 2101.694933] run fstests generic/079 at 2015-08-18 08:31:55

perhaps this was fixed by Dave's ILOCK patches.

Oleg.

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: __sb_start_write() && force_trylock hack
  2015-08-18 14:49 __sb_start_write() && force_trylock hack Oleg Nesterov
  2015-08-18 15:18 ` Oleg Nesterov
@ 2015-08-19  3:11 ` Dave Chinner
  2015-08-19 15:00   ` Oleg Nesterov
  1 sibling, 1 reply; 7+ messages in thread
From: Dave Chinner @ 2015-08-19  3:11 UTC (permalink / raw)
  To: Oleg Nesterov; +Cc: Jan Kara, linux-fsdevel, linux-kernel

On Tue, Aug 18, 2015 at 04:49:00PM +0200, Oleg Nesterov wrote:
> Jan, Dave, perhaps you can take a look...
> 
> On 08/14, Oleg Nesterov wrote:
> >
> > Plus another patch which removes the "trylock"
> > hack in __sb_start_write().
> 
> I meant the patch we already discussed (attached at the end). And yes,
> previously I reported it passed the tests. However, I only ran the same
> 'grep -il freeze tests/*/???' tests. When I tried to run all tests, I
> got the new reports from lockdep.
> 
> 	[ 2098.281171]  May be due to missing lock nesting notation

<groan>

> 	[ 2098.288744] 4 locks held by fsstress/50971:
> 	[ 2098.293408]  #0:  (sb_writers#13){++++++}, at: [<ffffffff81248d32>] __sb_start_write+0x32/0x40
> 	[ 2098.303085]  #1:  (&type->i_mutex_dir_key#4/1){+.+.+.}, at: [<ffffffff8125685f>] filename_create+0x7f/0x170
> 	[ 2098.314038]  #2:  (sb_internal#2){++++++}, at: [<ffffffff81248d32>] __sb_start_write+0x32/0x40
> 	[ 2098.323711]  #3:  (&type->s_umount_key#54){++++++}, at: [<ffffffffa05a638c>] xfs_flush_inodes+0x1c/0x40 [xfs]
> 	[ 2098.334898]
> 		       stack backtrace:
> 	[ 2098.339762] CPU: 3 PID: 50971 Comm: fsstress Not tainted 4.2.0-rc6+ #27
> 	[ 2098.347143] Hardware name: Intel Corporation S2600CP/S2600CP, BIOS RMLSDP.86I.R3.27.D685.1305151734 05/15/2013
> 	[ 2098.358303]  0000000000000000 00000000e70ee864 ffff880c05a2b9c8 ffffffff817ee692
> 	[ 2098.366603]  0000000000000000 ffffffff826f8030 ffff880c05a2bab8 ffffffff810f45be
> 	[ 2098.374900]  0000000000000000 ffff880c05a2bb20 0000000000000000 0000000000000000
> 	[ 2098.383197] Call Trace:
> 	[ 2098.385930]  [<ffffffff817ee692>] dump_stack+0x45/0x57
> 	[ 2098.391667]  [<ffffffff810f45be>] __lock_acquire+0x1e9e/0x2040
> 	[ 2098.398177]  [<ffffffff810f310d>] ? __lock_acquire+0x9ed/0x2040
> 	[ 2098.404787]  [<ffffffff811d4702>] ? pagevec_lookup_entries+0x22/0x30
> 	[ 2098.411879]  [<ffffffff811d5142>] ? truncate_inode_pages_range+0x2b2/0x7e0
> 	[ 2098.419551]  [<ffffffff810f542e>] lock_acquire+0xbe/0x150
> 	[ 2098.425566]  [<ffffffff81248d32>] ? __sb_start_write+0x32/0x40
> 	[ 2098.432079]  [<ffffffff810ede91>] percpu_down_read+0x51/0xa0
> 	[ 2098.438396]  [<ffffffff81248d32>] ? __sb_start_write+0x32/0x40
> 	[ 2098.444905]  [<ffffffff81248d32>] __sb_start_write+0x32/0x40
> 	[ 2098.451244]  [<ffffffffa05a7f84>] xfs_trans_alloc+0x24/0x40 [xfs]
> 	[ 2098.458076]  [<ffffffffa059e872>] xfs_inactive_truncate+0x32/0x110 [xfs]
> 	[ 2098.465580]  [<ffffffffa059f662>] xfs_inactive+0x102/0x120 [xfs]
> 	[ 2098.472308]  [<ffffffffa05a475b>] xfs_fs_evict_inode+0x7b/0xc0 [xfs]
> 	[ 2098.479401]  [<ffffffff812642ab>] evict+0xab/0x170
> 	[ 2098.484748]  [<ffffffff81264568>] iput+0x1a8/0x230
> 	[ 2098.490100]  [<ffffffff8127701c>] sync_inodes_sb+0x14c/0x1d0
> 	[ 2098.496439]  [<ffffffffa05a6398>] xfs_flush_inodes+0x28/0x40 [xfs]
> 	[ 2098.503361]  [<ffffffffa059e213>] xfs_create+0x5c3/0x6d0 [xfs]

Another false positive.  We have a transaction handle here, but
we've failed to reserve space for it so it's not an active
transaction yet. i.e. the filesystem is operating at ENOSPC.

On an ENOSPC error, we flush out dirty data to free up reserved
delalloc space, which obviously is obviously then racing with an
unlink of an inode in another thread and so drops the final
reference to the inode, and it goes down the eviction path that runs
transactions.

But there is no deadlock here - while we have a transaction handle
allocated at xfs_create() context, it is not yet active and holds no
reservation, so it is safe for eviction here to run new transactions
as there is no active transaction in progress.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: __sb_start_write() && force_trylock hack
  2015-08-18 15:18 ` Oleg Nesterov
@ 2015-08-19  3:25   ` Dave Chinner
  0 siblings, 0 replies; 7+ messages in thread
From: Dave Chinner @ 2015-08-19  3:25 UTC (permalink / raw)
  To: Oleg Nesterov; +Cc: Jan Kara, linux-fsdevel, linux-kernel

On Tue, Aug 18, 2015 at 05:18:16PM +0200, Oleg Nesterov wrote:
> On 08/18, Oleg Nesterov wrote:
> >
> > When I tried to run all tests, I
> > got the new reports from lockdep.
> 
> Just in case... when I run all tests I see misc failures (with or without
> the changes above) which I didn't try to interpret. In particular xfs/073
> just hangs, "shutdown -r" doesn't work, the serial console continues to
> print
> 
> 	[10784.605139] XFS (loop2): metadata I/O error: block 0x7d4948 ("xfs_buf_iodone_callbacks") error 5 numblks 8

EIO from the loop device.

> 	[10784.605207] loop: Write error at byte offset 26843578368, length 4096.
> 	[10784.605222] loop: Write error at byte offset 26843578368, length 4096.
> 	[10784.605235] loop: Write error at byte offset 26843578368, length 4096.
> 	[10784.605248] loop: Write error at byte offset 26843578368, length 4096.
> 	[10784.605261] loop: Write error at byte offset 26843578368, length 4096.

That makes me think the underlying filesystem (i.e. the fs the loop
image file is hosted on) ran out of space. Everything will go to
shit if that happens.

> again and again.
> 
> Plus some tests need a lot of time (for example generic/127 more than 70 minutes!),

Yup, that one only does about 50,000 synchronous writes via fsx.

If you want to run fast, use a virtual machine with 10GB RAM and use
a pair of 4GB ramdisks as the storage.

Or expunge the long running tests (look up the -X option) so they
are skipped.

> Also, tests/generic/078 on 4.2.0-rc6 (without other changes) triggered
> 
> 	[ 2100.404545] BUG: looking up invalid subclass: 8
> 	[ 2100.409600] turning off the locking correctness validator.

That's another lockdep annotation limitation I've had to work
around. The RENAME_WHITEOUT changes caused that by introducing the
need to lock 5 inodes at the same time (src/dst dir, src/dst file,
whiteout inode) instead of only 4.

> perhaps this was fixed by Dave's ILOCK patches.

It greatly complicated those patches, you mean?

But, yes, those patches should have fixed it - I had to shoehorn
about 15 different XFS inode locking subclasses into the 8
subclasses that lockdep supports. And I used all 8 subclasses for
the ILOCK annotations, so if theres some other annotation we need to
add (e.g more than one level of ILOCK_PARENT lock nesting) then,
well, we will need at least 16 subclasses to annotate everything.

I hate lockdep.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: __sb_start_write() && force_trylock hack
  2015-08-19  3:11 ` Dave Chinner
@ 2015-08-19 15:00   ` Oleg Nesterov
  2015-08-19 23:26     ` Dave Chinner
  0 siblings, 1 reply; 7+ messages in thread
From: Oleg Nesterov @ 2015-08-19 15:00 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Jan Kara, linux-fsdevel, linux-kernel

On 08/19, Dave Chinner wrote:
>
> On Tue, Aug 18, 2015 at 04:49:00PM +0200, Oleg Nesterov wrote:
> > Jan, Dave, perhaps you can take a look...
> >
> > On 08/14, Oleg Nesterov wrote:
> > >
> > > Plus another patch which removes the "trylock"
> > > hack in __sb_start_write().
> >
> > I meant the patch we already discussed (attached at the end). And yes,
> > previously I reported it passed the tests. However, I only ran the same
> > 'grep -il freeze tests/*/???' tests. When I tried to run all tests, I
> > got the new reports from lockdep.
> >
> > 	[ 2098.281171]  May be due to missing lock nesting notation
>
> <groan>
>
> > 	[ 2098.288744] 4 locks held by fsstress/50971:
> > 	[ 2098.293408]  #0:  (sb_writers#13){++++++}, at: [<ffffffff81248d32>] __sb_start_write+0x32/0x40
> > 	[ 2098.303085]  #1:  (&type->i_mutex_dir_key#4/1){+.+.+.}, at: [<ffffffff8125685f>] filename_create+0x7f/0x170
> > 	[ 2098.314038]  #2:  (sb_internal#2){++++++}, at: [<ffffffff81248d32>] __sb_start_write+0x32/0x40
> > 	[ 2098.323711]  #3:  (&type->s_umount_key#54){++++++}, at: [<ffffffffa05a638c>] xfs_flush_inodes+0x1c/0x40 [xfs]
> > 	[ 2098.334898]
> > 		       stack backtrace:
> > 	[ 2098.339762] CPU: 3 PID: 50971 Comm: fsstress Not tainted 4.2.0-rc6+ #27
> > 	[ 2098.347143] Hardware name: Intel Corporation S2600CP/S2600CP, BIOS RMLSDP.86I.R3.27.D685.1305151734 05/15/2013
> > 	[ 2098.358303]  0000000000000000 00000000e70ee864 ffff880c05a2b9c8 ffffffff817ee692
> > 	[ 2098.366603]  0000000000000000 ffffffff826f8030 ffff880c05a2bab8 ffffffff810f45be
> > 	[ 2098.374900]  0000000000000000 ffff880c05a2bb20 0000000000000000 0000000000000000
> > 	[ 2098.383197] Call Trace:
> > 	[ 2098.385930]  [<ffffffff817ee692>] dump_stack+0x45/0x57
> > 	[ 2098.391667]  [<ffffffff810f45be>] __lock_acquire+0x1e9e/0x2040
> > 	[ 2098.398177]  [<ffffffff810f310d>] ? __lock_acquire+0x9ed/0x2040
> > 	[ 2098.404787]  [<ffffffff811d4702>] ? pagevec_lookup_entries+0x22/0x30
> > 	[ 2098.411879]  [<ffffffff811d5142>] ? truncate_inode_pages_range+0x2b2/0x7e0
> > 	[ 2098.419551]  [<ffffffff810f542e>] lock_acquire+0xbe/0x150
> > 	[ 2098.425566]  [<ffffffff81248d32>] ? __sb_start_write+0x32/0x40
> > 	[ 2098.432079]  [<ffffffff810ede91>] percpu_down_read+0x51/0xa0
> > 	[ 2098.438396]  [<ffffffff81248d32>] ? __sb_start_write+0x32/0x40
> > 	[ 2098.444905]  [<ffffffff81248d32>] __sb_start_write+0x32/0x40
> > 	[ 2098.451244]  [<ffffffffa05a7f84>] xfs_trans_alloc+0x24/0x40 [xfs]
> > 	[ 2098.458076]  [<ffffffffa059e872>] xfs_inactive_truncate+0x32/0x110 [xfs]
> > 	[ 2098.465580]  [<ffffffffa059f662>] xfs_inactive+0x102/0x120 [xfs]
> > 	[ 2098.472308]  [<ffffffffa05a475b>] xfs_fs_evict_inode+0x7b/0xc0 [xfs]
> > 	[ 2098.479401]  [<ffffffff812642ab>] evict+0xab/0x170
> > 	[ 2098.484748]  [<ffffffff81264568>] iput+0x1a8/0x230
> > 	[ 2098.490100]  [<ffffffff8127701c>] sync_inodes_sb+0x14c/0x1d0
> > 	[ 2098.496439]  [<ffffffffa05a6398>] xfs_flush_inodes+0x28/0x40 [xfs]
> > 	[ 2098.503361]  [<ffffffffa059e213>] xfs_create+0x5c3/0x6d0 [xfs]
>
> Another false positive.

Confused...

Dave, I removed your explanation which I can't understand anyway, let
me remind that I know absolutely nothing about filesystems.

> But there is no deadlock here

Yes, we hold SB_FREEZE_WRITE lock, so recursive SB_FREEZE_FS is safe.

But, this means that the comment in __sb_start_write() is still correct,
"XFS for example gets freeze protection on internal level twice" is true,
and we can not remove this force_trylock hack.

(Plus we have the jbd2_handle/sb_internal lock inversion but this is
 the same thing, safe because of the held lock on higher level).

Or I misunderstood?

Oleg.

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: __sb_start_write() && force_trylock hack
  2015-08-19 15:00   ` Oleg Nesterov
@ 2015-08-19 23:26     ` Dave Chinner
  2015-08-21 18:30       ` Oleg Nesterov
  0 siblings, 1 reply; 7+ messages in thread
From: Dave Chinner @ 2015-08-19 23:26 UTC (permalink / raw)
  To: Oleg Nesterov; +Cc: Jan Kara, linux-fsdevel, linux-kernel

On Wed, Aug 19, 2015 at 05:00:26PM +0200, Oleg Nesterov wrote:
> On 08/19, Dave Chinner wrote:
> >
> > On Tue, Aug 18, 2015 at 04:49:00PM +0200, Oleg Nesterov wrote:
> > > Jan, Dave, perhaps you can take a look...
> > >
> > > On 08/14, Oleg Nesterov wrote:
> > > >
> > > > Plus another patch which removes the "trylock"
> > > > hack in __sb_start_write().
> > >
> > > I meant the patch we already discussed (attached at the end). And yes,
> > > previously I reported it passed the tests. However, I only ran the same
> > > 'grep -il freeze tests/*/???' tests. When I tried to run all tests, I
> > > got the new reports from lockdep.
> > >
> > > 	[ 2098.281171]  May be due to missing lock nesting notation
> >
> > <groan>
> >
> > > 	[ 2098.288744] 4 locks held by fsstress/50971:
> > > 	[ 2098.293408]  #0:  (sb_writers#13){++++++}, at: [<ffffffff81248d32>] __sb_start_write+0x32/0x40
> > > 	[ 2098.303085]  #1:  (&type->i_mutex_dir_key#4/1){+.+.+.}, at: [<ffffffff8125685f>] filename_create+0x7f/0x170
> > > 	[ 2098.314038]  #2:  (sb_internal#2){++++++}, at: [<ffffffff81248d32>] __sb_start_write+0x32/0x40
> > > 	[ 2098.323711]  #3:  (&type->s_umount_key#54){++++++}, at: [<ffffffffa05a638c>] xfs_flush_inodes+0x1c/0x40 [xfs]
> > > 	[ 2098.334898]
> > > 		       stack backtrace:
> > > 	[ 2098.339762] CPU: 3 PID: 50971 Comm: fsstress Not tainted 4.2.0-rc6+ #27
> > > 	[ 2098.347143] Hardware name: Intel Corporation S2600CP/S2600CP, BIOS RMLSDP.86I.R3.27.D685.1305151734 05/15/2013
> > > 	[ 2098.358303]  0000000000000000 00000000e70ee864 ffff880c05a2b9c8 ffffffff817ee692
> > > 	[ 2098.366603]  0000000000000000 ffffffff826f8030 ffff880c05a2bab8 ffffffff810f45be
> > > 	[ 2098.374900]  0000000000000000 ffff880c05a2bb20 0000000000000000 0000000000000000
> > > 	[ 2098.383197] Call Trace:
> > > 	[ 2098.385930]  [<ffffffff817ee692>] dump_stack+0x45/0x57
> > > 	[ 2098.391667]  [<ffffffff810f45be>] __lock_acquire+0x1e9e/0x2040
> > > 	[ 2098.398177]  [<ffffffff810f310d>] ? __lock_acquire+0x9ed/0x2040
> > > 	[ 2098.404787]  [<ffffffff811d4702>] ? pagevec_lookup_entries+0x22/0x30
> > > 	[ 2098.411879]  [<ffffffff811d5142>] ? truncate_inode_pages_range+0x2b2/0x7e0
> > > 	[ 2098.419551]  [<ffffffff810f542e>] lock_acquire+0xbe/0x150
> > > 	[ 2098.425566]  [<ffffffff81248d32>] ? __sb_start_write+0x32/0x40
> > > 	[ 2098.432079]  [<ffffffff810ede91>] percpu_down_read+0x51/0xa0
> > > 	[ 2098.438396]  [<ffffffff81248d32>] ? __sb_start_write+0x32/0x40
> > > 	[ 2098.444905]  [<ffffffff81248d32>] __sb_start_write+0x32/0x40
> > > 	[ 2098.451244]  [<ffffffffa05a7f84>] xfs_trans_alloc+0x24/0x40 [xfs]
> > > 	[ 2098.458076]  [<ffffffffa059e872>] xfs_inactive_truncate+0x32/0x110 [xfs]
> > > 	[ 2098.465580]  [<ffffffffa059f662>] xfs_inactive+0x102/0x120 [xfs]
> > > 	[ 2098.472308]  [<ffffffffa05a475b>] xfs_fs_evict_inode+0x7b/0xc0 [xfs]
> > > 	[ 2098.479401]  [<ffffffff812642ab>] evict+0xab/0x170
> > > 	[ 2098.484748]  [<ffffffff81264568>] iput+0x1a8/0x230
> > > 	[ 2098.490100]  [<ffffffff8127701c>] sync_inodes_sb+0x14c/0x1d0
> > > 	[ 2098.496439]  [<ffffffffa05a6398>] xfs_flush_inodes+0x28/0x40 [xfs]
> > > 	[ 2098.503361]  [<ffffffffa059e213>] xfs_create+0x5c3/0x6d0 [xfs]
> >
> > Another false positive.
> 
> Confused...
> 
> Dave, I removed your explanation which I can't understand anyway, let
> me remind that I know absolutely nothing about filesystems.
> 
> > But there is no deadlock here
> 
> Yes, we hold SB_FREEZE_WRITE lock, so recursive SB_FREEZE_FS is safe.
> 
> But, this means that the comment in __sb_start_write() is still correct,
> "XFS for example gets freeze protection on internal level twice" is true,
> and we can not remove this force_trylock hack.

You've hit a very rare corner case of a rare corner case. Please
forgive me if I forget such paths - stuff that just works because it
was required to be deadlock free is easy to forget because, well, it
doesn't deadlock.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: __sb_start_write() && force_trylock hack
  2015-08-19 23:26     ` Dave Chinner
@ 2015-08-21 18:30       ` Oleg Nesterov
  0 siblings, 0 replies; 7+ messages in thread
From: Oleg Nesterov @ 2015-08-21 18:30 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Jan Kara, linux-fsdevel, linux-kernel

On 08/20, Dave Chinner wrote:
>
> On Wed, Aug 19, 2015 at 05:00:26PM +0200, Oleg Nesterov wrote:
> >
> > Yes, we hold SB_FREEZE_WRITE lock, so recursive SB_FREEZE_FS is safe.
> >
> > But, this means that the comment in __sb_start_write() is still correct,
> > "XFS for example gets freeze protection on internal level twice" is true,
> > and we can not remove this force_trylock hack.
>
> You've hit a very rare corner case of a rare corner case.

Yes, I see, thanks.

Just fyi, I ran the tests again with the stupid debugging patch below
and I do not see anything new in dmesg.

So perhaps xfs_create() which does the "recursive" xfs_trans_alloc()
is the only source of double-lock in fs/xfs/.

Oleg.
---

diff --git a/fs/super.c b/fs/super.c
index a38ad91..32b1846 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -1153,13 +1153,57 @@ out:
 	return ERR_PTR(error);
 }
 
+void __sb_writers_acquired(struct super_block *sb, int level)
+{
+	struct percpu_rw_semaphore *sem = sb->s_writers.rw_sem + level-1;
+	struct ltrace *lt = current->ltrace + level-1;
+
+	WARN_ON(percpu_rwsem_is_held(sem));
+
+	percpu_rwsem_acquire(sem, 1, _RET_IP_);
+
+	if (!lt->lock) {
+		lt->lock = sem;
+		save_stack_trace(&lt->trace);
+	}
+}
+EXPORT_SYMBOL(__sb_writers_acquired);
+
+void __sb_writers_release(struct super_block *sb, int level)
+{
+	struct percpu_rw_semaphore *sem = sb->s_writers.rw_sem + level-1;
+	struct ltrace *lt = current->ltrace + level-1;
+
+	WARN_ON(!lt->lock);
+
+	percpu_rwsem_release(sem, 1, _RET_IP_);
+
+	WARN_ON(percpu_rwsem_is_held(sem));
+	if (lt->lock == sem) {
+		lt->lock = NULL;
+		lt->trace.nr_entries = 0;
+	}
+}
+EXPORT_SYMBOL(__sb_writers_release);
+
+
 /*
  * This is an internal function, please use sb_end_{write,pagefault,intwrite}
  * instead.
  */
 void __sb_end_write(struct super_block *sb, int level)
 {
+	struct percpu_rw_semaphore *sem = sb->s_writers.rw_sem + level-1;
+	struct ltrace *lt = current->ltrace + level-1;
+
+	WARN_ON(!lt->lock);
+
 	percpu_up_read(sb->s_writers.rw_sem + level-1);
+
+	if (lt->lock == sem && !percpu_rwsem_is_held(sem)) {
+		lt->lock = NULL;
+		lt->trace.nr_entries = 0;
+	}
 }
 EXPORT_SYMBOL(__sb_end_write);
 
@@ -1169,10 +1213,22 @@ EXPORT_SYMBOL(__sb_end_write);
  */
 int __sb_start_write(struct super_block *sb, int level, bool wait)
 {
+	struct percpu_rw_semaphore *sem = sb->s_writers.rw_sem + level-1;
+	struct ltrace *lt = current->ltrace + level-1;
 	bool force_trylock = false;
 	int ret = 1;
 
+	WARN_ON(lt->lock == sem && !percpu_rwsem_is_held(sem));
+
 #ifdef CONFIG_LOCKDEP
+	if (wait && lt->lock == sem) {
+		pr_crit("XXXXX %s:%d lev=%d\n", current->comm, current->pid, level);
+		dump_stack();
+		debug_show_held_locks(current);
+		pr_crit("Prev Trace:\n");
+		print_stack_trace(&lt->trace, 0);
+	}
+
 	/*
 	 * We want lockdep to tell us about possible deadlocks with freezing
 	 * but it's it bit tricky to properly instrument it. Getting a freeze
@@ -1198,6 +1254,10 @@ int __sb_start_write(struct super_block *sb, int level, bool wait)
 		ret = percpu_down_read_trylock(sb->s_writers.rw_sem + level-1);
 
 	WARN_ON(force_trylock & !ret);
+	if (ret && !lt->lock) {
+		lt->lock = sem;
+		save_stack_trace(&lt->trace);
+	}
 	return ret;
 }
 EXPORT_SYMBOL(__sb_start_write);
diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
index 08d4fe4..33bf46a 100644
--- a/fs/xfs/xfs_log.c
+++ b/fs/xfs/xfs_log.c
@@ -614,7 +614,7 @@ xfs_log_mount(
 	int		min_logfsbs;
 
 	if (!(mp->m_flags & XFS_MOUNT_NORECOVERY)) {
-		xfs_notice(mp, "Mounting V%d Filesystem",
+		if (0) xfs_notice(mp, "Mounting V%d Filesystem",
 			   XFS_SB_VERSION_NUM(&mp->m_sb));
 	} else {
 		xfs_notice(mp,
diff --git a/fs/xfs/xfs_log_recover.c b/fs/xfs/xfs_log_recover.c
index 480ebba..ed241dd 100644
--- a/fs/xfs/xfs_log_recover.c
+++ b/fs/xfs/xfs_log_recover.c
@@ -4607,7 +4607,7 @@ xlog_recover_finish(
 						     : "internal");
 		log->l_flags &= ~XLOG_RECOVERY_NEEDED;
 	} else {
-		xfs_info(log->l_mp, "Ending clean mount");
+		if (0) xfs_info(log->l_mp, "Ending clean mount");
 	}
 	return 0;
 }
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index 1fb16562..f680f3c 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -1573,7 +1573,7 @@ xfs_fs_put_super(
 {
 	struct xfs_mount	*mp = XFS_M(sb);
 
-	xfs_notice(mp, "Unmounting Filesystem");
+	if (0) xfs_notice(mp, "Unmounting Filesystem");
 	xfs_filestream_unmount(mp);
 	xfs_unmountfs(mp);
 
diff --git a/include/linux/fs.h b/include/linux/fs.h
index ce356f6..8514e65 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1385,10 +1385,8 @@ extern struct timespec current_fs_time(struct super_block *sb);
 void


^ permalink raw reply related	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2015-08-21 18:33 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2015-08-18 14:49 __sb_start_write() && force_trylock hack Oleg Nesterov
2015-08-18 15:18 ` Oleg Nesterov
2015-08-19  3:25   ` Dave Chinner
2015-08-19  3:11 ` Dave Chinner
2015-08-19 15:00   ` Oleg Nesterov
2015-08-19 23:26     ` Dave Chinner
2015-08-21 18:30       ` Oleg Nesterov

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).