linux-fsdevel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* Avoiding the dentry d_lock on final dput(), part deux: transactional memory
@ 2013-09-30 19:29 Linus Torvalds
  2013-09-30 20:01 ` Waiman Long
                   ` (2 more replies)
  0 siblings, 3 replies; 26+ messages in thread
From: Linus Torvalds @ 2013-09-30 19:29 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Waiman Long, Benjamin Herrenschmidt
  Cc: Chandramouleeswaran, Aswin, Norton, Scott J, George Spelvin,
	Linux Kernel Mailing List, linux-fsdevel, ppc-dev

[-- Attachment #1: Type: text/plain, Size: 3956 bytes --]

So with all the lockref work, we now avoid the dentry d_lock for
almost all normal cases.

There is one single remaining common case, though: the final dput()
when the dentry count goes down to zero, and we need to check if we
are supposed to get rid of the dentry (or at least put it on the LRU
lists etc).

And that's something lockref itself cannot really help us with unless
we start adding status bits to it (eg some kind of "enable slow-case"
bit in the lock part of the lockref). Which sounds like a clever but
very fragile approach.

However, I did get myself a i7-4770S exactly because I was
forward-thinking, and wanted to try using transactional memory for
these kinds of things.

Quite frankly, from all I've seen so far, the kernel is not going to
have very good luck with things like lock elision, because we're
really fine-grained already, and at least the Intel lock-elision
(don't know about POWER8) basically requires software to do prediction
on whether the transaction will succeed or not, dynamically based on
aborts etc. And quite frankly, by the time you have to do things like
that, you've already lost. We're better off just using our normal
locks.

So as far as I'm concerned, transactional memory is going to be useful
- *if* it is useful - only for specialized code. Some of that might be
architecture-internal lock implementations, other things might be
exactly the dput() kind of situation.

And the thing is, *normally* dput() doesn't need to do anything at
all, except decrement the dentry reference count. However, for that
normal case to be true, we need to atomically check:

 - that the dentry lock isn't held (same as lockref)
 - that we are already on the LRU list and don't need to add ourselves to it
 - that we already have the DCACHE_REFERENCED bit set and don't need to set it
 - that the dentry isn't unhashed and needs to be killed.

Additionally, we need to check that it's not a dentry that has a
"d_delete()" operation, but that's a static attribute of a dentry, so
that's not something that we need to check atomically wrt the other
things.

ANYWAY. With all that out of the way, the basic point is that this is
really simple, and fits very well with even very limited transactional
memory. We literally need to do just a single write, and something
like three reads from memory. And we already have a trivial fallback,
namely the old code using the lockrefs.  IOW, it's literally ten
straight-line instructions between the xbegin and the xend for me.

So here's a patch that works for me. It requires gcc to know "asm
goto", and it requires binutils that know about xbegin/xabort. And it
requires a CPU that supports the intel RTM extensions.

But I'm cc'ing the POWER people, because I don't know the POWER8
interfaces, and I don't want to necessarily call this "xbegin"/"xend"
when I actually wrap it in some helper functions.

Anyway, profiles with this look beautiful (I'm using "make -j" on a
fully built allmodconfig kernel tree as the source of profile data).
There's no spinlocks from dput at all, and the dput() profile itself
shows basically 1% in anything but the fastpath (probably the _very_
occasional first accesses where we need to add things to the LRU
lists).

And the patch is small, but is obviously totally lacking any test for
CPU support or anything like that. Or portability. But I thought I'd
get the ball rolling, because I doubt the intel TSX patches are going
to be useful (if they were, Intel would be crowing about performance
numbers now that the CPU's are out, and they aren't).

I don't know if the people doing HP performance testing have
TSX-enabled machines, but hey, maybe. So you guys are cc'd too.

I also didn't actually check if performance is affected. I doubt it is
measurable on this machine, especially on "make -j" that spends 90% of
its time in user space. But the profile comparison really does make it
look good..

Comments?

                         Linus

[-- Attachment #2: patch.diff --]
[-- Type: application/octet-stream, Size: 1614 bytes --]

 fs/dcache.c | 32 ++++++++++++++++++++++++++++++++
 1 file changed, 32 insertions(+)

diff --git a/fs/dcache.c b/fs/dcache.c
index 41000305d716..c988806b941e 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -603,6 +603,9 @@ relock:
  * Real recursion would eat up our stack space.
  */
 
+#define is_simple_dput(dentry) \
+	(((dentry)->d_flags & (DCACHE_REFERENCED |DCACHE_LRU_LIST)) == (DCACHE_REFERENCED |DCACHE_LRU_LIST))
+
 /*
  * dput - release a dentry
  * @dentry: dentry to release 
@@ -617,6 +620,35 @@ void dput(struct dentry *dentry)
 	if (unlikely(!dentry))
 		return;
 
+	/*
+	 * Try RTM for the trivial - and common - case.
+	 *
+	 * We don't do this for DCACHE_OP_DELETE (which is a static flag,
+	 * so check it outside the transaction), and we require that the
+	 * dentry is already marked referenced and on the LRU list.
+	 *
+	 * If that is true, and the dentry is not locked, we can just
+	 * decrement the usage count.
+	 *
+	 * This is kind of a special super-case of lockref_put(), but
+	 * atomically testing the dentry flags to make sure that there
+	 * is nothing else we need to look at.
+	 */
+	if (unlikely(dentry->d_flags & DCACHE_OP_DELETE))
+		goto repeat;
+	asm goto("xbegin %l[repeat]": : :"memory","ax":repeat);
+	if (unlikely(d_unhashed(dentry)))
+		goto xabort;
+	if (unlikely(!is_simple_dput(dentry)))
+		goto xabort;
+	if (unlikely(!arch_spin_value_unlocked(dentry->d_lock.rlock.raw_lock)))
+		goto xabort;
+	dentry->d_lockref.count--;
+	asm volatile("xend");
+	return;
+
+xabort:
+	asm volatile("xabort $0");
 repeat:
 	if (lockref_put_or_lock(&dentry->d_lockref))
 		return;

^ permalink raw reply related	[flat|nested] 26+ messages in thread

end of thread, other threads:[~2013-10-02 14:56 UTC | newest]

Thread overview: 26+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2013-09-30 19:29 Avoiding the dentry d_lock on final dput(), part deux: transactional memory Linus Torvalds
2013-09-30 20:01 ` Waiman Long
2013-09-30 20:04   ` Linus Torvalds
2013-10-02 14:56     ` Andi Kleen
2013-09-30 22:52 ` Benjamin Herrenschmidt
2013-10-01  0:36   ` Michael Neuling
2013-10-01  0:56     ` Linus Torvalds
2013-10-01  2:05       ` Benjamin Herrenschmidt
2013-10-01  3:13         ` Paul E. McKenney
2013-10-01  4:52           ` Michael Neuling
2013-10-01 12:16             ` Paul E. McKenney
2013-10-01 13:42               ` Paul E. McKenney
2013-10-01  1:05 ` spinlock contention of files->file_lock Eric Dumazet
2013-10-01  1:44   ` Linus Torvalds
2013-10-01  2:18     ` Eric Dumazet
2013-10-01 21:41     ` Eric Dumazet
2013-10-01 22:04       ` Al Viro
2013-10-01 22:21         ` Eric Dumazet
2013-10-02  5:13         ` Ingo Molnar
2013-10-02 10:20           ` Al Viro
2013-10-02 10:56             ` Ingo Molnar
2013-10-01  1:53   ` Al Viro
2013-10-01  2:02     ` Linus Torvalds
2013-10-01  3:27       ` Al Viro
2013-10-01  3:36         ` Eric Dumazet
2013-10-01  5:12           ` Eric Dumazet

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).