Avoiding the dentry d_lock on final dput(), part deux: transactional memory

linux-fsdevel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* Avoiding the dentry d_lock on final dput(), part deux: transactional memory
@ 2013-09-30 19:29 Linus Torvalds
  2013-09-30 20:01 ` Waiman Long
                   ` (2 more replies)
  0 siblings, 3 replies; 26+ messages in thread
From: Linus Torvalds @ 2013-09-30 19:29 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Waiman Long, Benjamin Herrenschmidt
  Cc: Chandramouleeswaran, Aswin, Norton, Scott J, George Spelvin,
	Linux Kernel Mailing List, linux-fsdevel, ppc-dev

[-- Attachment #1: Type: text/plain, Size: 3956 bytes --]

So with all the lockref work, we now avoid the dentry d_lock for
almost all normal cases.

There is one single remaining common case, though: the final dput()
when the dentry count goes down to zero, and we need to check if we
are supposed to get rid of the dentry (or at least put it on the LRU
lists etc).

And that's something lockref itself cannot really help us with unless
we start adding status bits to it (eg some kind of "enable slow-case"
bit in the lock part of the lockref). Which sounds like a clever but
very fragile approach.

However, I did get myself a i7-4770S exactly because I was
forward-thinking, and wanted to try using transactional memory for
these kinds of things.

Quite frankly, from all I've seen so far, the kernel is not going to
have very good luck with things like lock elision, because we're
really fine-grained already, and at least the Intel lock-elision
(don't know about POWER8) basically requires software to do prediction
on whether the transaction will succeed or not, dynamically based on
aborts etc. And quite frankly, by the time you have to do things like
that, you've already lost. We're better off just using our normal
locks.

So as far as I'm concerned, transactional memory is going to be useful
- *if* it is useful - only for specialized code. Some of that might be
architecture-internal lock implementations, other things might be
exactly the dput() kind of situation.

And the thing is, *normally* dput() doesn't need to do anything at
all, except decrement the dentry reference count. However, for that
normal case to be true, we need to atomically check:

 - that the dentry lock isn't held (same as lockref)
 - that we are already on the LRU list and don't need to add ourselves to it
 - that we already have the DCACHE_REFERENCED bit set and don't need to set it
 - that the dentry isn't unhashed and needs to be killed.

Additionally, we need to check that it's not a dentry that has a
"d_delete()" operation, but that's a static attribute of a dentry, so
that's not something that we need to check atomically wrt the other
things.

ANYWAY. With all that out of the way, the basic point is that this is
really simple, and fits very well with even very limited transactional
memory. We literally need to do just a single write, and something
like three reads from memory. And we already have a trivial fallback,
namely the old code using the lockrefs.  IOW, it's literally ten
straight-line instructions between the xbegin and the xend for me.

So here's a patch that works for me. It requires gcc to know "asm
goto", and it requires binutils that know about xbegin/xabort. And it
requires a CPU that supports the intel RTM extensions.

But I'm cc'ing the POWER people, because I don't know the POWER8
interfaces, and I don't want to necessarily call this "xbegin"/"xend"
when I actually wrap it in some helper functions.

Anyway, profiles with this look beautiful (I'm using "make -j" on a
fully built allmodconfig kernel tree as the source of profile data).
There's no spinlocks from dput at all, and the dput() profile itself
shows basically 1% in anything but the fastpath (probably the _very_
occasional first accesses where we need to add things to the LRU
lists).

And the patch is small, but is obviously totally lacking any test for
CPU support or anything like that. Or portability. But I thought I'd
get the ball rolling, because I doubt the intel TSX patches are going
to be useful (if they were, Intel would be crowing about performance
numbers now that the CPU's are out, and they aren't).

I don't know if the people doing HP performance testing have
TSX-enabled machines, but hey, maybe. So you guys are cc'd too.

I also didn't actually check if performance is affected. I doubt it is
measurable on this machine, especially on "make -j" that spends 90% of
its time in user space. But the profile comparison really does make it
look good..

Comments?

                         Linus

[-- Attachment #2: patch.diff --]
[-- Type: application/octet-stream, Size: 1614 bytes --]

 fs/dcache.c | 32 ++++++++++++++++++++++++++++++++
 1 file changed, 32 insertions(+)

diff --git a/fs/dcache.c b/fs/dcache.c
index 41000305d716..c988806b941e 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -603,6 +603,9 @@ relock:
  * Real recursion would eat up our stack space.
  */

+#define is_simple_dput(dentry) \
+	(((dentry)->d_flags & (DCACHE_REFERENCED |DCACHE_LRU_LIST)) == (DCACHE_REFERENCED |DCACHE_LRU_LIST))
+
 /*
  * dput - release a dentry
  * @dentry: dentry to release 
@@ -617,6 +620,35 @@ void dput(struct dentry *dentry)
 	if (unlikely(!dentry))
 		return;

+	/*
+	 * Try RTM for the trivial - and common - case.
+	 *
+	 * We don't do this for DCACHE_OP_DELETE (which is a static flag,
+	 * so check it outside the transaction), and we require that the
+	 * dentry is already marked referenced and on the LRU list.
+	 *
+	 * If that is true, and the dentry is not locked, we can just
+	 * decrement the usage count.
+	 *
+	 * This is kind of a special super-case of lockref_put(), but
+	 * atomically testing the dentry flags to make sure that there
+	 * is nothing else we need to look at.
+	 */
+	if (unlikely(dentry->d_flags & DCACHE_OP_DELETE))
+		goto repeat;
+	asm goto("xbegin %l[repeat]": : :"memory","ax":repeat);
+	if (unlikely(d_unhashed(dentry)))
+		goto xabort;
+	if (unlikely(!is_simple_dput(dentry)))
+		goto xabort;
+	if (unlikely(!arch_spin_value_unlocked(dentry->d_lock.rlock.raw_lock)))
+		goto xabort;
+	dentry->d_lockref.count--;
+	asm volatile("xend");
+	return;
+
+xabort:
+	asm volatile("xabort $0");
 repeat:
 	if (lockref_put_or_lock(&dentry->d_lockref))
 		return;

^ permalink raw reply related	[flat|nested] 26+ messages in thread

* Re: Avoiding the dentry d_lock on final dput(), part deux: transactional memory
  2013-09-30 19:29 Avoiding the dentry d_lock on final dput(), part deux: transactional memory Linus Torvalds
@ 2013-09-30 20:01 ` Waiman Long
  2013-09-30 20:04   ` Linus Torvalds
  2013-09-30 22:52 ` Benjamin Herrenschmidt
  2013-10-01  1:05 ` spinlock contention of files->file_lock Eric Dumazet
  2 siblings, 1 reply; 26+ messages in thread
From: Waiman Long @ 2013-09-30 20:01 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Ingo Molnar, Peter Zijlstra, Benjamin Herrenschmidt,
	Chandramouleeswaran, Aswin, Norton, Scott J, George Spelvin,
	Linux Kernel Mailing List, linux-fsdevel, ppc-dev

On 09/30/2013 03:29 PM, Linus Torvalds wrote:
> So with all the lockref work, we now avoid the dentry d_lock for
> almost all normal cases.
>
> There is one single remaining common case, though: the final dput()
> when the dentry count goes down to zero, and we need to check if we
> are supposed to get rid of the dentry (or at least put it on the LRU
> lists etc).
>
> And that's something lockref itself cannot really help us with unless
> we start adding status bits to it (eg some kind of "enable slow-case"
> bit in the lock part of the lockref). Which sounds like a clever but
> very fragile approach.
>
> However, I did get myself a i7-4770S exactly because I was
> forward-thinking, and wanted to try using transactional memory for
> these kinds of things.
>
> Quite frankly, from all I've seen so far, the kernel is not going to
> have very good luck with things like lock elision, because we're
> really fine-grained already, and at least the Intel lock-elision
> (don't know about POWER8) basically requires software to do prediction
> on whether the transaction will succeed or not, dynamically based on
> aborts etc. And quite frankly, by the time you have to do things like
> that, you've already lost. We're better off just using our normal
> locks.
>
> So as far as I'm concerned, transactional memory is going to be useful
> - *if* it is useful - only for specialized code. Some of that might be
> architecture-internal lock implementations, other things might be
> exactly the dput() kind of situation.
>
> And the thing is, *normally* dput() doesn't need to do anything at
> all, except decrement the dentry reference count. However, for that
> normal case to be true, we need to atomically check:
>
>   - that the dentry lock isn't held (same as lockref)
>   - that we are already on the LRU list and don't need to add ourselves to it
>   - that we already have the DCACHE_REFERENCED bit set and don't need to set it
>   - that the dentry isn't unhashed and needs to be killed.
>
> Additionally, we need to check that it's not a dentry that has a
> "d_delete()" operation, but that's a static attribute of a dentry, so
> that's not something that we need to check atomically wrt the other
> things.
>
> ANYWAY. With all that out of the way, the basic point is that this is
> really simple, and fits very well with even very limited transactional
> memory. We literally need to do just a single write, and something
> like three reads from memory. And we already have a trivial fallback,
> namely the old code using the lockrefs.  IOW, it's literally ten
> straight-line instructions between the xbegin and the xend for me.
>
> So here's a patch that works for me. It requires gcc to know "asm
> goto", and it requires binutils that know about xbegin/xabort. And it
> requires a CPU that supports the intel RTM extensions.
>
> But I'm cc'ing the POWER people, because I don't know the POWER8
> interfaces, and I don't want to necessarily call this "xbegin"/"xend"
> when I actually wrap it in some helper functions.
>
> Anyway, profiles with this look beautiful (I'm using "make -j" on a
> fully built allmodconfig kernel tree as the source of profile data).
> There's no spinlocks from dput at all, and the dput() profile itself
> shows basically 1% in anything but the fastpath (probably the _very_
> occasional first accesses where we need to add things to the LRU
> lists).
>
> And the patch is small, but is obviously totally lacking any test for
> CPU support or anything like that. Or portability. But I thought I'd
> get the ball rolling, because I doubt the intel TSX patches are going
> to be useful (if they were, Intel would be crowing about performance
> numbers now that the CPU's are out, and they aren't).
>
> I don't know if the people doing HP performance testing have
> TSX-enabled machines, but hey, maybe. So you guys are cc'd too.

The Xeon class CPUs are typically about one year behind the consumer CPU 
chips. We are testing large NUMA machine with IvyBridge-EX CPUs right 
now. Haswell-EX has to be at least one year out. So we don't have the 
hardware to do the testing at the moment.

> I also didn't actually check if performance is affected. I doubt it is
> measurable on this machine, especially on "make -j" that spends 90% of
> its time in user space. But the profile comparison really does make it
> look good..
>
> Comments?
>
>                           Linus

I think this patch is worth a trial if relevant hardware is more widely 
available. The TSX code certainly need to be moved to an architecture 
specific area and should be runtime enabled using a static key. We also 
need more TSX support infrastructure in place first.

-Longman

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: Avoiding the dentry d_lock on final dput(), part deux: transactional memory
  2013-09-30 20:01 ` Waiman Long
@ 2013-09-30 20:04   ` Linus Torvalds
  2013-10-02 14:56     ` Andi Kleen
  0 siblings, 1 reply; 26+ messages in thread
From: Linus Torvalds @ 2013-09-30 20:04 UTC (permalink / raw)
  To: Waiman Long
  Cc: Ingo Molnar, Peter Zijlstra, Benjamin Herrenschmidt,
	Chandramouleeswaran, Aswin, Norton, Scott J, George Spelvin,
	Linux Kernel Mailing List, linux-fsdevel, ppc-dev

On Mon, Sep 30, 2013 at 1:01 PM, Waiman Long <waiman.long@hp.com> wrote:
>
> I think this patch is worth a trial if relevant hardware is more widely
> available. The TSX code certainly need to be moved to an architecture
> specific area and should be runtime enabled using a static key. We also need
> more TSX support infrastructure in place first.

I think we can pick that up from Andi's patches, he should have that.
Although that did have very x86-specific naming (ie "xbegin"). And I
don't think he used "asm goto" to quite the same advantage as this -
and I think we do want to make sure that the overhead is minimal.

                  Linus

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: Avoiding the dentry d_lock on final dput(), part deux: transactional memory
  2013-09-30 20:04   ` Linus Torvalds
@ 2013-10-02 14:56     ` Andi Kleen
  0 siblings, 0 replies; 26+ messages in thread
From: Andi Kleen @ 2013-10-02 14:56 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Waiman Long, Ingo Molnar, Peter Zijlstra, Benjamin Herrenschmidt,
	Chandramouleeswaran, Aswin, Norton, Scott J, George Spelvin,
	Linux Kernel Mailing List, linux-fsdevel, ppc-dev

Linus Torvalds <torvalds@linux-foundation.org> writes:

> On Mon, Sep 30, 2013 at 1:01 PM, Waiman Long <waiman.long@hp.com> wrote:
>>
>> I think this patch is worth a trial if relevant hardware is more widely
>> available. The TSX code certainly need to be moved to an architecture
>> specific area and should be runtime enabled using a static key. We also need
>> more TSX support infrastructure in place first.
>
> I think we can pick that up from Andi's patches, he should have that.
> Although that did have very x86-specific naming (ie "xbegin"). And I
> don't think he used "asm goto" to quite the same advantage as this -
> and I think we do want to make sure that the overhead is minimal.

FWIW my version #0 used asm goto directly, but I later switched to not
using it to support more compilers and higher level abstractions (locks
etc.) and use the same intrinsics as the user level guys are using.

The two extra instructions from not using asm goto for xbegin
don't matter all that much in the end.

That's the old asm goto stuff I wrote originally
(user level version):

https://github.com/andikleen/tsx-tools/blob/master/include/rtm-goto.h

There was also a kernel version of it that patched, right 
now this is done in the main TSX patchkit like this:

https://git.kernel.org/cgit/linux/kernel/git/ak/linux-misc.git/commit/?h=hle312/rtm-base&id=9190346d57a9bc89e746aee774d07e54cd1e6e75

Essentially without RTM it just becomes and unconditional jump
to the abort handler, xabort is a nop, and xtest always returns 0.

-Andi
-- 
ak@linux.intel.com -- Speaking for myself only

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: Avoiding the dentry d_lock on final dput(), part deux: transactional memory
  2013-09-30 19:29 Avoiding the dentry d_lock on final dput(), part deux: transactional memory Linus Torvalds
  2013-09-30 20:01 ` Waiman Long
@ 2013-09-30 22:52 ` Benjamin Herrenschmidt
  2013-10-01  0:36   ` Michael Neuling
  2013-10-01  1:05 ` spinlock contention of files->file_lock Eric Dumazet
  2 siblings, 1 reply; 26+ messages in thread
From: Benjamin Herrenschmidt @ 2013-09-30 22:52 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Ingo Molnar, Peter Zijlstra, Waiman Long,
	Chandramouleeswaran, Aswin, Norton, Scott J, George Spelvin,
	Linux Kernel Mailing List, linux-fsdevel, ppc-dev,
	Michael Neuling

On Mon, 2013-09-30 at 12:29 -0700, Linus Torvalds wrote:
> 
> But I'm cc'ing the POWER people, because I don't know the POWER8
> interfaces, and I don't want to necessarily call this "xbegin"/"xend"
> when I actually wrap it in some helper functions.

The main problem with powerpc TM is that we need to handle interrupts
happening while in transactional state. We currently only handle that
for userspace.

Mikey, how hard would it be to detect the in-kernel TM case and just
simulate an abort in the exception entry ? (return to tbegin basically)

Transactions in kernel make me nervous because of the PC jumping around
on aborts and how easy we can get that stuff wrong :-) They also have
interesting ordering semantics vs. locks, we need to be a tad careful
(as long as we don't access a lock variable transactionally we should be
ok. If we do, then spin_unlock needs a stronger barrier).

The basic semantic for us is tbegin. [...] tend instructions. If the
transaction fails, control returns to tbegin. (can happen at any point)
which returns a CC code indicating success or failure.

Things get really complicated if you take an interrupt however, the
transaction gets into some special "suspended" state, it doesn't just
die so we need to handle things in our interrupt entry (even if it's
just to make the transaction abort cleanly) and right now we don't when
coming from kernel space.

Cheers,
Ben.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: Avoiding the dentry d_lock on final dput(), part deux: transactional memory
  2013-09-30 22:52 ` Benjamin Herrenschmidt
@ 2013-10-01  0:36   ` Michael Neuling
  2013-10-01  0:56     ` Linus Torvalds
  0 siblings, 1 reply; 26+ messages in thread
From: Michael Neuling @ 2013-10-01  0:36 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Linus Torvalds, Ingo Molnar, Peter Zijlstra, Waiman Long,
	Chandramouleeswaran, Aswin, Norton, Scott J, George Spelvin,
	Linux Kernel Mailing List, linux-fsdevel, ppc-dev

Ben,

> On Mon, 2013-09-30 at 12:29 -0700, Linus Torvalds wrote:
> > 
> > But I'm cc'ing the POWER people, because I don't know the POWER8
> > interfaces, and I don't want to necessarily call this "xbegin"/"xend"
> > when I actually wrap it in some helper functions.
> 
> The main problem with powerpc TM is that we need to handle interrupts
> happening while in transactional state. We currently only handle that
> for userspace.

Yep.  

> Mikey, how hard would it be to detect the in-kernel TM case and just
> simulate an abort in the exception entry ? (return to tbegin basically)

It's doable.

The scary part is that we to make all register volatile.  You were not
that keen on doing this as there are a lot of places in exception
entry/exit where we only save/restore a subset of the registers.  We'd
need to catch all these.

> Transactions in kernel make me nervous because of the PC jumping
> around on aborts and how easy we can get that stuff wrong :-) 

The same applies for userspace.  We are pretty careful not to screw that
up though.

It's also one of the reason we don't do software rollback of userspace
transactions even in transactional (non-suspended) mode.  We always save
and restore all state and let the HW deal with the PC and state jumping
around.

> They also have interesting ordering semantics vs. locks, we need to be
> a tad careful (as long as we don't access a lock variable
> transactionally we should be ok. If we do, then spin_unlock needs a
> stronger barrier).

Yep.

> The basic semantic for us is tbegin. [...] tend instructions. If the
> transaction fails, control returns to tbegin. (can happen at any
> point) which returns a CC code indicating success or failure.

FWIW eg.

     tbegin
     beq abort /* passes first time through */
     ....
     transactional stuff
     ....
     tend
     b pass
     
abort:

pass:

> Things get really complicated if you take an interrupt however, the
> transaction gets into some special "suspended" state, it doesn't just
> die so we need to handle things in our interrupt entry (even if it's
> just to make the transaction abort cleanly) and right now we don't
> when coming from kernel space.

Yep, but we could check easily enough.

Mikey

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: Avoiding the dentry d_lock on final dput(), part deux: transactional memory
  2013-10-01  0:36   ` Michael Neuling
@ 2013-10-01  0:56     ` Linus Torvalds
  2013-10-01  2:05       ` Benjamin Herrenschmidt
  0 siblings, 1 reply; 26+ messages in thread
From: Linus Torvalds @ 2013-10-01  0:56 UTC (permalink / raw)
  To: Michael Neuling
  Cc: Waiman Long, Peter Zijlstra, George Spelvin,
	Linux Kernel Mailing List, Chandramouleeswaran, Aswin,
	Norton, Scott J, linux-fsdevel, ppc-dev, Ingo Molnar

On Mon, Sep 30, 2013 at 5:36 PM, Michael Neuling <mikey@neuling.org> wrote:
>
> The scary part is that we to make all register volatile.  You were not
> that keen on doing this as there are a lot of places in exception
> entry/exit where we only save/restore a subset of the registers.  We'd
> need to catch all these.

Ugh. It's very possible it's not worth using for the kernel then. The
example I posted is normally fine *without* any transactional support,
since it's a very local per-dentry lock, and since we only take that
lock when the last reference drops (so it's not some common directory
dentry, it's a end-point file dentry). In fact, on ARM they just made
the cmpxchg much faster by making it entirely non-serializing (since
it only updates a reference count, there is no locking involved apart
from checking that the lock state is unlocked)

So there is basically never any contention, and the transaction needs
to basically be pretty much the same cost as a "cmpxchg". It's not
clear if the intel TSX is good enough for that, and if you have to
save a lot of registers in order to use transactions on POWER8, I
doubt it's worthwhile.

We have very few - if any - locks where contention or even cache
bouncing is common or normal. Sure, we have a few particular loads
that can trigger it, but even that is becoming rare. So from a
performance standpoint, the target always needs to be "comparable to
hot spinlock in local cache".

>> They also have interesting ordering semantics vs. locks, we need to be
>> a tad careful (as long as we don't access a lock variable
>> transactionally we should be ok. If we do, then spin_unlock needs a
>> stronger barrier).
>
> Yep.

Well, just about any kernel transaction will at least read the state
of a lock. Without that, it's generally totally useless. My dput()
example sequence very much verified that the lock was not held, for
example.

I'm not sure how that affects anything. The actual transaction had
better not be visible inside the locked region (ie as far as any lock
users go, transactions better all happen fully before or after the
lock, if they read the lock and see it being unlocked).

That said, I cannot see how POWER8 could possibly violate that rule.
The whole "transactions are atomic" is kind of the whole and only
point of a transaction. So I'm not sure what odd lock restrictions
POWER8 could have.

> FWIW eg.
>
>      tbegin
>      beq abort /* passes first time through */
>      ....
>      transactional stuff
>      ....
>      tend
>      b pass
>
> abort:
>
> pass:

That's fine, and matches the x86 semantics fairly closely, except
"xbegin" kind of "contains" that "jump to abort address". But we could
definitely use the same models. Call it
"transaction_begin/abort/end()", and it should be architecture-neutral
naming-wise.

Of course, if tbegin then acts basically like some crazy
assembly-level setjmp (I'm guessing it does exactly, and presumably
precisely that kind of compiler support - ie a function with
"__attribute((returns_twice))" in gcc-speak), the overhead of doing it
may kill it.

            Linus

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: Avoiding the dentry d_lock on final dput(), part deux: transactional memory
  2013-10-01  0:56     ` Linus Torvalds
@ 2013-10-01  2:05       ` Benjamin Herrenschmidt
  2013-10-01  3:13         ` Paul E. McKenney
  0 siblings, 1 reply; 26+ messages in thread
From: Benjamin Herrenschmidt @ 2013-10-01  2:05 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Michael Neuling, Ingo Molnar, Peter Zijlstra, Waiman Long,
	Chandramouleeswaran, Aswin, Norton, Scott J, George Spelvin,
	Linux Kernel Mailing List, linux-fsdevel, ppc-dev,
	Paul E. McKenney

On Mon, 2013-09-30 at 17:56 -0700, Linus Torvalds wrote:
> On Mon, Sep 30, 2013 at 5:36 PM, Michael Neuling <mikey@neuling.org> wrote:
> >
> > The scary part is that we to make all register volatile.  You were not
> > that keen on doing this as there are a lot of places in exception
> > entry/exit where we only save/restore a subset of the registers.  We'd
> > need to catch all these.
> 
> Ugh. It's very possible it's not worth using for the kernel then. The
> example I posted is normally fine *without* any transactional support,
> since it's a very local per-dentry lock, and since we only take that
> lock when the last reference drops (so it's not some common directory
> dentry, it's a end-point file dentry). In fact, on ARM they just made
> the cmpxchg much faster by making it entirely non-serializing (since
> it only updates a reference count, there is no locking involved apart
> from checking that the lock state is unlocked)
> 
> So there is basically never any contention, and the transaction needs
> to basically be pretty much the same cost as a "cmpxchg". It's not
> clear if the intel TSX is good enough for that, and if you have to
> save a lot of registers in order to use transactions on POWER8, I
> doubt it's worthwhile.

Well we don't have to, I think Mikey wasn't totally clear about that
"making all registers volatile" business :-) This is just something we
need to handle in assembly if we are going to reclaim the suspended
transaction.

So basically, what we need is something along the lines of
enable_kernel_tm() which checks if there's a suspended user transaction
and if yes, kills/reclaims it.

Then we also need to handle in our interrupt handlers that we have an
active/suspended transaction from a kernel state, which we don't deal
with at this point, and do whatever has to be done to kill it... we
might get away with something simple if we can state that we only allow
kernel transactions at task level and not from interrupt/softirq
contexts, at least initially.

> We have very few - if any - locks where contention or even cache
> bouncing is common or normal. Sure, we have a few particular loads
> that can trigger it, but even that is becoming rare. So from a
> performance standpoint, the target always needs to be "comparable to
> hot spinlock in local cache".

I am not quite familiar with the performance profile of our
transactional hardware. I think we should definitely try to hack
something together for that dput() case and measure it.

> >> They also have interesting ordering semantics vs. locks, we need to be
> >> a tad careful (as long as we don't access a lock variable
> >> transactionally we should be ok. If we do, then spin_unlock needs a
> >> stronger barrier).
> >
> > Yep.
> 
> Well, just about any kernel transaction will at least read the state
> of a lock. Without that, it's generally totally useless. My dput()
> example sequence very much verified that the lock was not held, for
> example.
> 
> I'm not sure how that affects anything. The actual transaction had
> better not be visible inside the locked region (ie as far as any lock
> users go, transactions better all happen fully before or after the
> lock, if they read the lock and see it being unlocked).
> 
> That said, I cannot see how POWER8 could possibly violate that rule.
> The whole "transactions are atomic" is kind of the whole and only
> point of a transaction. So I'm not sure what odd lock restrictions
> POWER8 could have.

Has to do with the memory model :-(

I dug the whole story from my mbox and the situation is indeed as dire
as feared. If the transaction reads the lock, then the corresponding
spin_lock must have a full sync barrier in it instead of the current
lighter one.

Now I believe we are already "on the fence" with our locks today since
technically speaking, our unlock + lock sequence is *not* exactly a full
barrier (it is only if it's the same lock I think)

CC'ing Paul McKenney here who's been chasing that issue. In the end, we
might end up having to turn our locks into sync anyway

Yay ! The isanity^Wjoy of an OO memory model !

> > FWIW eg.
> >
> >      tbegin
> >      beq abort /* passes first time through */
> >      ....
> >      transactional stuff
> >      ....
> >      tend
> >      b pass
> >
> > abort:
> >
> > pass:
> 
> That's fine, and matches the x86 semantics fairly closely, except
> "xbegin" kind of "contains" that "jump to abort address". But we could
> definitely use the same models. Call it
> "transaction_begin/abort/end()", and it should be architecture-neutral
> naming-wise.

Right.

> Of course, if tbegin then acts basically like some crazy
> assembly-level setjmp (I'm guessing it does exactly, and presumably
> precisely that kind of compiler support - ie a function with
> "__attribute((returns_twice))" in gcc-speak), the overhead of doing it
> may kill it.

Well, all the registers are checkpointed so we *should* be ok but gcc
always makes me nervous in those circumstances ...

Ben.


>             Linus
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: Avoiding the dentry d_lock on final dput(), part deux: transactional memory
  2013-10-01  2:05       ` Benjamin Herrenschmidt
@ 2013-10-01  3:13         ` Paul E. McKenney
  2013-10-01  4:52           ` Michael Neuling
  0 siblings, 1 reply; 26+ messages in thread
From: Paul E. McKenney @ 2013-10-01  3:13 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Linus Torvalds, Michael Neuling, Ingo Molnar, Peter Zijlstra,
	Waiman Long, Chandramouleeswaran, Aswin, Norton, Scott J,
	George Spelvin, Linux Kernel Mailing List, linux-fsdevel, ppc-dev

On Tue, Oct 01, 2013 at 12:05:03PM +1000, Benjamin Herrenschmidt wrote:
> On Mon, 2013-09-30 at 17:56 -0700, Linus Torvalds wrote:
> > On Mon, Sep 30, 2013 at 5:36 PM, Michael Neuling <mikey@neuling.org> wrote:
> > >
> > > The scary part is that we to make all register volatile.  You were not
> > > that keen on doing this as there are a lot of places in exception
> > > entry/exit where we only save/restore a subset of the registers.  We'd
> > > need to catch all these.
> > 
> > Ugh. It's very possible it's not worth using for the kernel then. The
> > example I posted is normally fine *without* any transactional support,
> > since it's a very local per-dentry lock, and since we only take that
> > lock when the last reference drops (so it's not some common directory
> > dentry, it's a end-point file dentry). In fact, on ARM they just made
> > the cmpxchg much faster by making it entirely non-serializing (since
> > it only updates a reference count, there is no locking involved apart
> > from checking that the lock state is unlocked)

A memory-barrier-free cmpxchg() would be easy on Power as well.

> > So there is basically never any contention, and the transaction needs
> > to basically be pretty much the same cost as a "cmpxchg". It's not
> > clear if the intel TSX is good enough for that, and if you have to
> > save a lot of registers in order to use transactions on POWER8, I
> > doubt it's worthwhile.
> 
> Well we don't have to, I think Mikey wasn't totally clear about that
> "making all registers volatile" business :-) This is just something we
> need to handle in assembly if we are going to reclaim the suspended
> transaction.
> 
> So basically, what we need is something along the lines of
> enable_kernel_tm() which checks if there's a suspended user transaction
> and if yes, kills/reclaims it.
> 
> Then we also need to handle in our interrupt handlers that we have an
> active/suspended transaction from a kernel state, which we don't deal
> with at this point, and do whatever has to be done to kill it... we
> might get away with something simple if we can state that we only allow
> kernel transactions at task level and not from interrupt/softirq
> contexts, at least initially.

Call me a coward, but this is starting to sound a bit scary.  ;-)

> > We have very few - if any - locks where contention or even cache
> > bouncing is common or normal. Sure, we have a few particular loads
> > that can trigger it, but even that is becoming rare. So from a
> > performance standpoint, the target always needs to be "comparable to
> > hot spinlock in local cache".
> 
> I am not quite familiar with the performance profile of our
> transactional hardware. I think we should definitely try to hack
> something together for that dput() case and measure it.
> 
> > >> They also have interesting ordering semantics vs. locks, we need to be
> > >> a tad careful (as long as we don't access a lock variable
> > >> transactionally we should be ok. If we do, then spin_unlock needs a
> > >> stronger barrier).
> > >
> > > Yep.
> > 
> > Well, just about any kernel transaction will at least read the state
> > of a lock. Without that, it's generally totally useless. My dput()
> > example sequence very much verified that the lock was not held, for
> > example.
> > 
> > I'm not sure how that affects anything. The actual transaction had
> > better not be visible inside the locked region (ie as far as any lock
> > users go, transactions better all happen fully before or after the
> > lock, if they read the lock and see it being unlocked).
> > 
> > That said, I cannot see how POWER8 could possibly violate that rule.
> > The whole "transactions are atomic" is kind of the whole and only
> > point of a transaction. So I'm not sure what odd lock restrictions
> > POWER8 could have.
> 
> Has to do with the memory model :-(
> 
> I dug the whole story from my mbox and the situation is indeed as dire
> as feared. If the transaction reads the lock, then the corresponding
> spin_lock must have a full sync barrier in it instead of the current
> lighter one.
> 
> Now I believe we are already "on the fence" with our locks today since
> technically speaking, our unlock + lock sequence is *not* exactly a full
> barrier (it is only if it's the same lock I think)
> 
> CC'ing Paul McKenney here who's been chasing that issue. In the end, we
> might end up having to turn our locks into sync anyway

Well, there have been a lot of fixed software bugs since the last
suspicious sighting, but on the other hand, I am just now getting my
RCU testing going again on Power.  I would certainly feel better
about things if unlock-lock was really truly a full barrier, but
this clearly needs a clean sighting.

> Yay ! The isanity^Wjoy of an OO memory model !

;-) ;-) ;-)

							Thanx, Paul

> > > FWIW eg.
> > >
> > >      tbegin
> > >      beq abort /* passes first time through */
> > >      ....
> > >      transactional stuff
> > >      ....
> > >      tend
> > >      b pass
> > >
> > > abort:
> > >
> > > pass:
> > 
> > That's fine, and matches the x86 semantics fairly closely, except
> > "xbegin" kind of "contains" that "jump to abort address". But we could
> > definitely use the same models. Call it
> > "transaction_begin/abort/end()", and it should be architecture-neutral
> > naming-wise.
> 
> Right.
> 
> > Of course, if tbegin then acts basically like some crazy
> > assembly-level setjmp (I'm guessing it does exactly, and presumably
> > precisely that kind of compiler support - ie a function with
> > "__attribute((returns_twice))" in gcc-speak), the overhead of doing it
> > may kill it.
> 
> Well, all the registers are checkpointed so we *should* be ok but gcc
> always makes me nervous in those circumstances ...
> 
> Ben.
> 
> 
> >             Linus
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > Please read the FAQ at  http://www.tux.org/lkml/
> 
> 


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: Avoiding the dentry d_lock on final dput(), part deux: transactional memory
  2013-10-01  3:13         ` Paul E. McKenney
@ 2013-10-01  4:52           ` Michael Neuling
  2013-10-01 12:16             ` Paul E. McKenney
  0 siblings, 1 reply; 26+ messages in thread
From: Michael Neuling @ 2013-10-01  4:52 UTC (permalink / raw)
  To: paulmck
  Cc: Waiman Long, ppc-dev, Peter Zijlstra, George Spelvin,
	Linux Kernel Mailing List, Chandramouleeswaran, Aswin,
	Norton, Scott J, linux-fsdevel, Linus Torvalds, Ingo Molnar

>> Well we don't have to, I think Mikey wasn't totally clear about that
>> "making all registers volatile" business :-) This is just something we
>> need to handle in assembly if we are going to reclaim the suspended
>> transaction.

Yeah, sorry.  The slow path with all registers as volatile is only
needed if we get pre-empted during the transaction.

>>
>> So basically, what we need is something along the lines of
>> enable_kernel_tm() which checks if there's a suspended user transaction
>> and if yes, kills/reclaims it.
>>
>> Then we also need to handle in our interrupt handlers that we have an
>> active/suspended transaction from a kernel state, which we don't deal
>> with at this point, and do whatever has to be done to kill it... we
>> might get away with something simple if we can state that we only allow
>> kernel transactions at task level and not from interrupt/softirq
>> contexts, at least initially.
>
> Call me a coward, but this is starting to sound a bit scary.  ;-)

We are just wanting to prototype it for now to see if we could make it
go faster.  If it's worth it, then we'd consider the additional
complexity this would bring.

I don't think it'll be that bad, but I'd certainly want to make sure
it's worth it before trying :-)

Mikey

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: Avoiding the dentry d_lock on final dput(), part deux: transactional memory
  2013-10-01  4:52           ` Michael Neuling
@ 2013-10-01 12:16             ` Paul E. McKenney
  2013-10-01 13:42               ` Paul E. McKenney
  0 siblings, 1 reply; 26+ messages in thread
From: Paul E. McKenney @ 2013-10-01 12:16 UTC (permalink / raw)
  To: Michael Neuling
  Cc: Benjamin Herrenschmidt, Linus Torvalds, Ingo Molnar,
	Peter Zijlstra, Waiman Long, Chandramouleeswaran, Aswin,
	Norton, Scott J, George Spelvin, Linux Kernel Mailing List,
	linux-fsdevel, ppc-dev

On Tue, Oct 01, 2013 at 02:52:28PM +1000, Michael Neuling wrote:
> >> Well we don't have to, I think Mikey wasn't totally clear about that
> >> "making all registers volatile" business :-) This is just something we
> >> need to handle in assembly if we are going to reclaim the suspended
> >> transaction.
> 
> Yeah, sorry.  The slow path with all registers as volatile is only
> needed if we get pre-empted during the transaction.
> 
> >>
> >> So basically, what we need is something along the lines of
> >> enable_kernel_tm() which checks if there's a suspended user transaction
> >> and if yes, kills/reclaims it.
> >>
> >> Then we also need to handle in our interrupt handlers that we have an
> >> active/suspended transaction from a kernel state, which we don't deal
> >> with at this point, and do whatever has to be done to kill it... we
> >> might get away with something simple if we can state that we only allow
> >> kernel transactions at task level and not from interrupt/softirq
> >> contexts, at least initially.
> >
> > Call me a coward, but this is starting to sound a bit scary.  ;-)
> 
> We are just wanting to prototype it for now to see if we could make it
> go faster.  If it's worth it, then we'd consider the additional
> complexity this would bring.
> 
> I don't think it'll be that bad, but I'd certainly want to make sure
> it's worth it before trying :-)

OK, fair point.  ;-)

							Thanx, Paul

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: Avoiding the dentry d_lock on final dput(), part deux: transactional memory
  2013-10-01 12:16             ` Paul E. McKenney
@ 2013-10-01 13:42               ` Paul E. McKenney
  0 siblings, 0 replies; 26+ messages in thread
From: Paul E. McKenney @ 2013-10-01 13:42 UTC (permalink / raw)
  To: Michael Neuling
  Cc: Benjamin Herrenschmidt, Linus Torvalds, Ingo Molnar,
	Peter Zijlstra, Waiman Long, Chandramouleeswaran, Aswin,
	Norton, Scott J, George Spelvin, Linux Kernel Mailing List,
	linux-fsdevel, ppc-dev

On Tue, Oct 01, 2013 at 05:16:54AM -0700, Paul E. McKenney wrote:
> On Tue, Oct 01, 2013 at 02:52:28PM +1000, Michael Neuling wrote:
> > >> Well we don't have to, I think Mikey wasn't totally clear about that
> > >> "making all registers volatile" business :-) This is just something we
> > >> need to handle in assembly if we are going to reclaim the suspended
> > >> transaction.
> > 
> > Yeah, sorry.  The slow path with all registers as volatile is only
> > needed if we get pre-empted during the transaction.
> > 
> > >>
> > >> So basically, what we need is something along the lines of
> > >> enable_kernel_tm() which checks if there's a suspended user transaction
> > >> and if yes, kills/reclaims it.
> > >>
> > >> Then we also need to handle in our interrupt handlers that we have an
> > >> active/suspended transaction from a kernel state, which we don't deal
> > >> with at this point, and do whatever has to be done to kill it... we
> > >> might get away with something simple if we can state that we only allow
> > >> kernel transactions at task level and not from interrupt/softirq
> > >> contexts, at least initially.
> > >
> > > Call me a coward, but this is starting to sound a bit scary.  ;-)
> > 
> > We are just wanting to prototype it for now to see if we could make it
> > go faster.  If it's worth it, then we'd consider the additional
> > complexity this would bring.
> > 
> > I don't think it'll be that bad, but I'd certainly want to make sure
> > it's worth it before trying :-)
> 
> OK, fair point.  ;-)

That is, a fair point -assuming- that we also try the memory-barrier-free
cmpxchg that Linus suggested...

							Thanx, Paul

^ permalink raw reply	[flat|nested] 26+ messages in thread

* spinlock contention of files->file_lock
  2013-09-30 19:29 Avoiding the dentry d_lock on final dput(), part deux: transactional memory Linus Torvalds
  2013-09-30 20:01 ` Waiman Long
  2013-09-30 22:52 ` Benjamin Herrenschmidt
@ 2013-10-01  1:05 ` Eric Dumazet
  2013-10-01  1:44   ` Linus Torvalds
  2013-10-01  1:53   ` Al Viro
  2 siblings, 2 replies; 26+ messages in thread
From: Eric Dumazet @ 2013-10-01  1:05 UTC (permalink / raw)
  To: Linus Torvalds, Al Viro
  Cc: Ingo Molnar, Peter Zijlstra, Waiman Long, Benjamin Herrenschmidt,
	Chandramouleeswaran, Aswin, Linux Kernel Mailing List,
	linux-fsdevel

Speaking of spinlock contention, the files->file_lock is a good example.

Multi threaded programs hit this spinlock three times per fd : 

alloc_fd() / fd_install() / close() 

I think fd_install() could be done without this spinlock.

we want to protect fd_install() from a concurrent resize of the fd
table.

We could either :

- Add a spinlock used for resize, and fd_install().
  A bit suboptimal but easy to code and review.

- Add a seqcount, and cmpxchg() to synchronize fd_install() with the
potential resizer. (Might need additional RCU locking)

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: spinlock contention of files->file_lock
  2013-10-01  1:05 ` spinlock contention of files->file_lock Eric Dumazet
@ 2013-10-01  1:44   ` Linus Torvalds
  2013-10-01  2:18     ` Eric Dumazet
  2013-10-01 21:41     ` Eric Dumazet
  2013-10-01  1:53   ` Al Viro
  1 sibling, 2 replies; 26+ messages in thread
From: Linus Torvalds @ 2013-10-01  1:44 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Al Viro, Ingo Molnar, Peter Zijlstra, Waiman Long,
	Benjamin Herrenschmidt, Chandramouleeswaran, Aswin,
	Linux Kernel Mailing List, linux-fsdevel

On Mon, Sep 30, 2013 at 6:05 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> Speaking of spinlock contention, the files->file_lock is a good example.
>
> Multi threaded programs hit this spinlock three times per fd :

.. do you actually have programs that see contention?

> alloc_fd() / fd_install() / close()
>
> I think fd_install() could be done without this spinlock.

Yes, I haven't thought much about it, but from a quick look it should
be trivial to do fd_install(). We already free the fdtable using rcu
(because we look things up without locking), so we could just get the
rcu read-lock, do fd_install() without any locking at all, and then
verify that the fd-table is still the same. Rinse and repeat if not.

> - Add a seqcount, and cmpxchg() to synchronize fd_install() with the
> potential resizer. (Might need additional RCU locking)

I don't think you even need that. alloc_fd() has already reserved the
spot, so I think you really can just assign without any fancy cmpxchg
at all. If you write to the old fdtable, who cares? Probably just
something like

   do {
      fdt = files_fdtable(files);
      rcu_assign_pointer(fdt->fd[fd], file);
      smp_mb();
   } while (fdt != files_fdtable(files));

or similar. Under the RCU lock to guarantee the allocations don't
disappear from under you, but with no other locking.

Maybe I'm missing something, but it really doesn't look that bad.

Now, that only gets rid of fd_install(), but I suspect you could do
something similar for put_unused_fd() (that one does need cmpxchg for
the "next_fd" thing, though). We'd have to replace the non-atomic
bitops on open_fds[] with atomic ones, just to make sure adjacent bit
clearings don't screw up concurrent adjacent bit values, but that
looks fairly straightforward too.

Now, getting rid of the spinlock for alloc_fd() would be more work,
and you'd likely want to keep it for actually resizing (just to keep
that case more straightforward), but you could probably do the
non-resizing cases fairly easily there too without holding the lock.
Scan for a free bit optimistically and without any locking, and then
be somewhat careful with setting that open_fds[] bit - not only using
an atomic test_and_set() for it, but also do the same "let's check
that the fdtable base pointers haven't changed in the meantime and
repeat".

On the whole the fd table looks like if it really is a contention
point, it would be fairly straightforward to fix without any huge
contortions. Sure, lots of care and small details to look out for, but
nothing looks really all that fundamentally difficult.

But is it really a contention point on any real load?

                 Linus

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: spinlock contention of files->file_lock
  2013-10-01  1:44   ` Linus Torvalds
@ 2013-10-01  2:18     ` Eric Dumazet
  2013-10-01 21:41     ` Eric Dumazet
  1 sibling, 0 replies; 26+ messages in thread
From: Eric Dumazet @ 2013-10-01  2:18 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Al Viro, Ingo Molnar, Peter Zijlstra, Waiman Long,
	Benjamin Herrenschmidt, Chandramouleeswaran, Aswin,
	Linux Kernel Mailing List, linux-fsdevel

On Mon, 2013-09-30 at 18:44 -0700, Linus Torvalds wrote:
> On Mon, Sep 30, 2013 at 6:05 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> > Speaking of spinlock contention, the files->file_lock is a good example.
> >
> > Multi threaded programs hit this spinlock three times per fd :
> 
> .. do you actually have programs that see contention?

Yes, Google-Bug-Id 9072743 for Googlers ;)

Basically, we have (many) servers with 10,000,000+ sockets, where this
lock contention is a real problem.

We use a SOCK_FD_FASTALLOC accept4()/socket() flag to relax POSIX
requirements and not having to parse huge bitmap in
find_next_zero_bit(fdt->open_fds, fdt->max_fds, fd)  (O_FD_FASTALLOC
uses a random starting point)

I am writing a patch using cmpxchg() to see if I can solve the dup2()
'problem'



^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: spinlock contention of files->file_lock
  2013-10-01  1:44   ` Linus Torvalds
  2013-10-01  2:18     ` Eric Dumazet
@ 2013-10-01 21:41     ` Eric Dumazet
  2013-10-01 22:04       ` Al Viro
  1 sibling, 1 reply; 26+ messages in thread
From: Eric Dumazet @ 2013-10-01 21:41 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Al Viro, Ingo Molnar, Peter Zijlstra, Waiman Long,
	Benjamin Herrenschmidt, Chandramouleeswaran, Aswin,
	Linux Kernel Mailing List, linux-fsdevel

From: Eric Dumazet <edumazet@google.com>

On Mon, 2013-09-30 at 18:44 -0700, Linus Torvalds wrote:

> Now, that only gets rid of fd_install(), but I suspect you could do
> something similar for put_unused_fd() (that one does need cmpxchg for
> the "next_fd" thing, though). We'd have to replace the non-atomic
> bitops on open_fds[] with atomic ones, just to make sure adjacent bit
> clearings don't screw up concurrent adjacent bit values, but that
> looks fairly straightforward too.

While looking at this (exciting) stuff, I found following bug.

Maybe I am missing something obvious ?

Thanks

[PATCH] fs: fix a race in do_close_on_exec()

commit 6a6d27de ("take close-on-exec logics to fs/file.c, clean it up a
bit") added a possible race, as another thread could resize file table
once we released files->file_lock.

We must reload fdt after getting the lock.

Signed-off-by: Eric Dumazet <edumazet@google.com>
---
 fs/file.c |    5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/fs/file.c b/fs/file.c
index 4a78f98..b614f13 100644
--- a/fs/file.c
+++ b/fs/file.c
@@ -616,10 +616,10 @@ void do_close_on_exec(struct files_struct *files)
 
 	/* exec unshares first */
 	spin_lock(&files->file_lock);
+	fdt = files_fdtable(files);
 	for (i = 0; ; i++) {
 		unsigned long set;
 		unsigned fd = i * BITS_PER_LONG;
-		fdt = files_fdtable(files);
 		if (fd >= fdt->max_fds)
 			break;
 		set = fdt->close_on_exec[i];
@@ -639,6 +639,9 @@ void do_close_on_exec(struct files_struct *files)
 			filp_close(file, files);
 			cond_resched();
 			spin_lock(&files->file_lock);
+
+			/* We released files->file_lock, we must reload fdt */
+			fdt = files_fdtable(files);
 		}
 
 	}

^ permalink raw reply related	[flat|nested] 26+ messages in thread

* Re: spinlock contention of files->file_lock
  2013-10-01 21:41     ` Eric Dumazet
@ 2013-10-01 22:04       ` Al Viro
  2013-10-01 22:21         ` Eric Dumazet
  2013-10-02  5:13         ` Ingo Molnar
  0 siblings, 2 replies; 26+ messages in thread
From: Al Viro @ 2013-10-01 22:04 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Linus Torvalds, Ingo Molnar, Peter Zijlstra, Waiman Long,
	Benjamin Herrenschmidt, Chandramouleeswaran, Aswin,
	Linux Kernel Mailing List, linux-fsdevel

On Tue, Oct 01, 2013 at 02:41:58PM -0700, Eric Dumazet wrote:
> Maybe I am missing something obvious ?

Yes.  do_execve_common() starts with unshare_files(); there can be
no other thread capable of modifying that descriptor table.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: spinlock contention of files->file_lock
  2013-10-01 22:04       ` Al Viro
@ 2013-10-01 22:21         ` Eric Dumazet
  2013-10-02  5:13         ` Ingo Molnar
  1 sibling, 0 replies; 26+ messages in thread
From: Eric Dumazet @ 2013-10-01 22:21 UTC (permalink / raw)
  To: Al Viro
  Cc: Linus Torvalds, Ingo Molnar, Peter Zijlstra, Waiman Long,
	Benjamin Herrenschmidt, Chandramouleeswaran, Aswin,
	Linux Kernel Mailing List, linux-fsdevel

On Tue, 2013-10-01 at 23:04 +0100, Al Viro wrote:
> On Tue, Oct 01, 2013 at 02:41:58PM -0700, Eric Dumazet wrote:
> > Maybe I am missing something obvious ?
> 
> Yes.  do_execve_common() starts with unshare_files(); there can be
> no other thread capable of modifying that descriptor table.

Hmm, then what's the point of using spin_lock() here ?

This gives wrong hints ;)

diff --git a/fs/file.c b/fs/file.c
index 4a78f98..cdbca0d 100644
--- a/fs/file.c
+++ b/fs/file.c
@@ -615,11 +615,10 @@ void do_close_on_exec(struct files_struct *files)
 	struct fdtable *fdt;
 
 	/* exec unshares first */
-	spin_lock(&files->file_lock);
+	fdt = files_fdtable(files);
 	for (i = 0; ; i++) {
 		unsigned long set;
 		unsigned fd = i * BITS_PER_LONG;
-		fdt = files_fdtable(files);
 		if (fd >= fdt->max_fds)
 			break;
 		set = fdt->close_on_exec[i];
@@ -635,14 +634,11 @@ void do_close_on_exec(struct files_struct *files)
 				continue;
 			rcu_assign_pointer(fdt->fd[fd], NULL);
 			__put_unused_fd(files, fd);
-			spin_unlock(&files->file_lock);
 			filp_close(file, files);
 			cond_resched();
-			spin_lock(&files->file_lock);
 		}
 
 	}
-	spin_unlock(&files->file_lock);
 }
 
 struct file *fget(unsigned int fd)

^ permalink raw reply related	[flat|nested] 26+ messages in thread

* Re: spinlock contention of files->file_lock
  2013-10-01 22:04       ` Al Viro
  2013-10-01 22:21         ` Eric Dumazet
@ 2013-10-02  5:13         ` Ingo Molnar
  2013-10-02 10:20           ` Al Viro
  1 sibling, 1 reply; 26+ messages in thread
From: Ingo Molnar @ 2013-10-02  5:13 UTC (permalink / raw)
  To: Al Viro
  Cc: Eric Dumazet, Linus Torvalds, Peter Zijlstra, Waiman Long,
	Benjamin Herrenschmidt, Chandramouleeswaran, Aswin,
	Linux Kernel Mailing List, linux-fsdevel


* Al Viro <viro@ZenIV.linux.org.uk> wrote:

> On Tue, Oct 01, 2013 at 02:41:58PM -0700, Eric Dumazet wrote:
> > Maybe I am missing something obvious ?
> 
> Yes.  do_execve_common() starts with unshare_files(); there can be
> no other thread capable of modifying that descriptor table.

Btw., might the Android Binder:

  drivers/staging/android/binder.c:       struct files_struct *files = proc->files;
...
  drivers/staging/android/binder.c:               __fd_install(proc->files, fd, file);
...
  drivers/staging/android/binder.c:       retval = __close_fd(proc->files, fd);

violate that assumption?

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: spinlock contention of files->file_lock
  2013-10-02  5:13         ` Ingo Molnar
@ 2013-10-02 10:20           ` Al Viro
  2013-10-02 10:56             ` Ingo Molnar
  0 siblings, 1 reply; 26+ messages in thread
From: Al Viro @ 2013-10-02 10:20 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Eric Dumazet, Linus Torvalds, Peter Zijlstra, Waiman Long,
	Benjamin Herrenschmidt, Chandramouleeswaran, Aswin,
	Linux Kernel Mailing List, linux-fsdevel

On Wed, Oct 02, 2013 at 07:13:19AM +0200, Ingo Molnar wrote:
> 
> * Al Viro <viro@ZenIV.linux.org.uk> wrote:
> 
> > On Tue, Oct 01, 2013 at 02:41:58PM -0700, Eric Dumazet wrote:
> > > Maybe I am missing something obvious ?
> > 
> > Yes.  do_execve_common() starts with unshare_files(); there can be
> > no other thread capable of modifying that descriptor table.
> 
> Btw., might the Android Binder:
> 
>   drivers/staging/android/binder.c:       struct files_struct *files = proc->files;
> ...
>   drivers/staging/android/binder.c:               __fd_install(proc->files, fd, file);
> ...
>   drivers/staging/android/binder.c:       retval = __close_fd(proc->files, fd);
> 
> violate that assumption?

Not unless your thread has managed to call an ioctl between entering
do_execve_common() and calling do_close_on_exec() ;-)

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: spinlock contention of files->file_lock
  2013-10-02 10:20           ` Al Viro
@ 2013-10-02 10:56             ` Ingo Molnar
  0 siblings, 0 replies; 26+ messages in thread
From: Ingo Molnar @ 2013-10-02 10:56 UTC (permalink / raw)
  To: Al Viro
  Cc: Eric Dumazet, Linus Torvalds, Peter Zijlstra, Waiman Long,
	Benjamin Herrenschmidt, Chandramouleeswaran, Aswin,
	Linux Kernel Mailing List, linux-fsdevel


* Al Viro <viro@ZenIV.linux.org.uk> wrote:

> On Wed, Oct 02, 2013 at 07:13:19AM +0200, Ingo Molnar wrote:
> > 
> > * Al Viro <viro@ZenIV.linux.org.uk> wrote:
> > 
> > > On Tue, Oct 01, 2013 at 02:41:58PM -0700, Eric Dumazet wrote:
> > > > Maybe I am missing something obvious ?
> > > 
> > > Yes.  do_execve_common() starts with unshare_files(); there can be
> > > no other thread capable of modifying that descriptor table.
> > 
> > Btw., might the Android Binder:
> > 
> >   drivers/staging/android/binder.c:       struct files_struct *files = proc->files;
> > ...
> >   drivers/staging/android/binder.c:               __fd_install(proc->files, fd, file);
> > ...
> >   drivers/staging/android/binder.c:       retval = __close_fd(proc->files, fd);
> > 
> > violate that assumption?
> 
> Not unless your thread has managed to call an ioctl between entering
> do_execve_common() and calling do_close_on_exec() ;-)

Indeed - while the binder interface appears to allow the insertion of fds 
into other task's file tables, it refcounts its task->files access and 
only ever receives it via get_files_struct(current), so it cannot possibly 
interfere with a private file table resulting from unshare_files().

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: spinlock contention of files->file_lock
  2013-10-01  1:05 ` spinlock contention of files->file_lock Eric Dumazet
  2013-10-01  1:44   ` Linus Torvalds
@ 2013-10-01  1:53   ` Al Viro
  2013-10-01  2:02     ` Linus Torvalds
  1 sibling, 1 reply; 26+ messages in thread
From: Al Viro @ 2013-10-01  1:53 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Linus Torvalds, Ingo Molnar, Peter Zijlstra, Waiman Long,
	Benjamin Herrenschmidt, Chandramouleeswaran, Aswin,
	Linux Kernel Mailing List, linux-fsdevel

On Mon, Sep 30, 2013 at 06:05:03PM -0700, Eric Dumazet wrote:
> Speaking of spinlock contention, the files->file_lock is a good example.
> 
> Multi threaded programs hit this spinlock three times per fd : 
> 
> alloc_fd() / fd_install() / close() 
> 
> I think fd_install() could be done without this spinlock.

The problem is dup2(), not table resize...  I'm not saying it can't be
done, but it won't be trivial and you seem to be looking for potential
trouble in the wrong place.  dup2() semantics is the real bitch.
What we want is
	* dup2() over unused descriptor => insert and be done with that
	* dup2() over opened one => replace and do equivalent of close()
for what had been there before
The trouble is, what to do with dup2() over reserved, but still not
installed descriptor?  We handle that as -EBUSY and we use ->file_lock
to get atomicity wrt transitions between those states.

And yes, dup2() is a lousy API in multithreaded situation.  It's still
there...  See comments in do_dup2() for some amusement.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: spinlock contention of files->file_lock
  2013-10-01  1:53   ` Al Viro
@ 2013-10-01  2:02     ` Linus Torvalds
  2013-10-01  3:27       ` Al Viro
  0 siblings, 1 reply; 26+ messages in thread
From: Linus Torvalds @ 2013-10-01  2:02 UTC (permalink / raw)
  To: Al Viro
  Cc: Eric Dumazet, Ingo Molnar, Peter Zijlstra, Waiman Long,
	Benjamin Herrenschmidt, Chandramouleeswaran, Aswin,
	Linux Kernel Mailing List, linux-fsdevel

On Mon, Sep 30, 2013 at 6:53 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> The problem is dup2()

Shouldn't a cmpxchg() in just the dup2 code solve that?

If the old value was NULL, you'd have to repeat and go back and see if
the open_fds[] bit had been cleared in the meantime (ie it's NULL not
because somebody else is busy installing it, but because somebody just
uninstalled it).

But yeah, I do agree that that sounds nasty and a complication I
hadn't even thought about. dup2() does violate our normal "let's
pre-allocate the fd slot" rule. Ugh.

              Linus

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: spinlock contention of files->file_lock
  2013-10-01  2:02     ` Linus Torvalds
@ 2013-10-01  3:27       ` Al Viro
  2013-10-01  3:36         ` Eric Dumazet
  0 siblings, 1 reply; 26+ messages in thread
From: Al Viro @ 2013-10-01  3:27 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Eric Dumazet, Ingo Molnar, Peter Zijlstra, Waiman Long,
	Benjamin Herrenschmidt, Chandramouleeswaran, Aswin,
	Linux Kernel Mailing List, linux-fsdevel

On Mon, Sep 30, 2013 at 07:02:23PM -0700, Linus Torvalds wrote:

> Shouldn't a cmpxchg() in just the dup2 code solve that?
> 
> If the old value was NULL, you'd have to repeat and go back and see if
> the open_fds[] bit had been cleared in the meantime (ie it's NULL not
> because somebody else is busy installing it, but because somebody just
> uninstalled it).

Yechh...  Under ->file_lock (in do_dup2()), hopefully?  Or you'll get
all kinds of fun with close() thrown into the game, as well...

> But yeah, I do agree that that sounds nasty and a complication I
> hadn't even thought about. dup2() does violate our normal "let's
> pre-allocate the fd slot" rule. Ugh.

Hell knows...  Descriptor handling *is* pretty well isolated these
days, so it just might be doable without disrupting the living hell
out of anything else.  fs/file.c is pretty much it - everything else
goes through it.

I've enough on my plate at the moment with fs/namespace.c and fs/namei.c,
though, and praying hard fs/inode.c doesn't enter the game.  I _know_
that fs/notify will and I'm not enjoying that for a second.  BTW, has
eparis resurfaced with any fixes for *notify/umount races?  I don't seem
to have anything related in the mailbox, but...

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: spinlock contention of files->file_lock
  2013-10-01  3:27       ` Al Viro
@ 2013-10-01  3:36         ` Eric Dumazet
  2013-10-01  5:12           ` Eric Dumazet
  0 siblings, 1 reply; 26+ messages in thread
From: Eric Dumazet @ 2013-10-01  3:36 UTC (permalink / raw)
  To: Al Viro
  Cc: Linus Torvalds, Ingo Molnar, Peter Zijlstra, Waiman Long,
	Benjamin Herrenschmidt, Chandramouleeswaran, Aswin,
	Linux Kernel Mailing List, linux-fsdevel

On Tue, 2013-10-01 at 04:27 +0100, Al Viro wrote:
> On Mon, Sep 30, 2013 at 07:02:23PM -0700, Linus Torvalds wrote:
> 
> > Shouldn't a cmpxchg() in just the dup2 code solve that?
> > 
> > If the old value was NULL, you'd have to repeat and go back and see if
> > the open_fds[] bit had been cleared in the meantime (ie it's NULL not
> > because somebody else is busy installing it, but because somebody just
> > uninstalled it).
> 
> Yechh...  Under ->file_lock (in do_dup2()), hopefully?  Or you'll get
> all kinds of fun with close() thrown into the game, as well...
> 
> > But yeah, I do agree that that sounds nasty and a complication I
> > hadn't even thought about. dup2() does violate our normal "let's
> > pre-allocate the fd slot" rule. Ugh.
> 
> Hell knows...  Descriptor handling *is* pretty well isolated these
> days, so it just might be doable without disrupting the living hell
> out of anything else.  fs/file.c is pretty much it - everything else
> goes through it.

I have a patch mostly working here, and pretty short.

I'll do proper tests before posting it tomorrow.

 fs/fcntl.c              |    7 ++-----
 fs/file.c               |   17 +++++++++++++----
 fs/open.c               |   21 +++++++++++++++++----
 include/linux/fdtable.h |    1 +
 4 files changed, 33 insertions(+), 13 deletions(-)

Thanks !



^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: spinlock contention of files->file_lock
  2013-10-01  3:36         ` Eric Dumazet
@ 2013-10-01  5:12           ` Eric Dumazet
  0 siblings, 0 replies; 26+ messages in thread
From: Eric Dumazet @ 2013-10-01  5:12 UTC (permalink / raw)
  To: Al Viro
  Cc: Linus Torvalds, Ingo Molnar, Peter Zijlstra, Waiman Long,
	Benjamin Herrenschmidt, Chandramouleeswaran, Aswin,
	Linux Kernel Mailing List, linux-fsdevel

On Mon, 2013-09-30 at 20:36 -0700, Eric Dumazet wrote:

> I have a patch mostly working here, and pretty short.

Here is the RFC patch. Unfortunately I cant really give performance
numbers, as a patch like this would need weeks before being tested here.

 fs/file.c               |   66 ++++++++++++++++++++------------------
 include/linux/fdtable.h |    1 
 2 files changed, 37 insertions(+), 30 deletions(-)

diff --git a/fs/file.c b/fs/file.c
index 4a78f98..b511f14 100644
--- a/fs/file.c
+++ b/fs/file.c
@@ -62,15 +62,23 @@ static void free_fdtable_rcu(struct rcu_head *rcu)
  * Expand the fdset in the files_struct.  Called with the files spinlock
  * held for write.
  */
-static void copy_fdtable(struct fdtable *nfdt, struct fdtable *ofdt)
+static void copy_fdtable(struct fdtable *nfdt, struct fdtable *ofdt,
+			 bool shared)
 {
-	unsigned int cpy, set;
+	unsigned int cpy, set, i;
 
 	BUG_ON(nfdt->max_fds < ofdt->max_fds);
 
 	cpy = ofdt->max_fds * sizeof(struct file *);
 	set = (nfdt->max_fds - ofdt->max_fds) * sizeof(struct file *);
-	memcpy(nfdt->fd, ofdt->fd, cpy);
+
+	if (shared) {
+		/* see fd_install() why we need xchg() */
+		for (i = 0; i < ofdt->max_fds; i++)
+			nfdt->fd[i] = xchg(&ofdt->fd[i], NULL);
+	} else {
+		memcpy(nfdt->fd, ofdt->fd, cpy);
+	}
 	memset((char *)(nfdt->fd) + cpy, 0, set);
 
 	cpy = ofdt->max_fds / BITS_PER_BYTE;
@@ -167,8 +175,12 @@ static int expand_fdtable(struct files_struct *files, int nr)
 	cur_fdt = files_fdtable(files);
 	if (nr >= cur_fdt->max_fds) {
 		/* Continue as planned */
-		copy_fdtable(new_fdt, cur_fdt);
+
+		write_seqlock(&files->resize);
+		copy_fdtable(new_fdt, cur_fdt, atomic_read(&files->count) > 1);
 		rcu_assign_pointer(files->fdt, new_fdt);
+		write_sequnlock(&files->resize);
+
 		if (cur_fdt != &files->fdtab)
 			call_rcu(&cur_fdt->rcu, free_fdtable_rcu);
 	} else {
@@ -258,6 +270,7 @@ struct files_struct *dup_fd(struct files_struct *oldf, int *errorp)
 	atomic_set(&newf->count, 1);
 
 	spin_lock_init(&newf->file_lock);
+	seqlock_init(&newf->resize);
 	newf->next_fd = 0;
 	new_fdt = &newf->fdtab;
 	new_fdt->max_fds = NR_OPEN_DEFAULT;
@@ -453,6 +466,7 @@ struct files_struct init_files = {
 		.open_fds	= init_files.open_fds_init,
 	},
 	.file_lock	= __SPIN_LOCK_UNLOCKED(init_files.file_lock),
+	.resize		= __SEQLOCK_UNLOCKED(init_task.resize),
 };
 
 /*
@@ -569,11 +583,24 @@ void __fd_install(struct files_struct *files, unsigned int fd,
 		struct file *file)
 {
 	struct fdtable *fdt;
-	spin_lock(&files->file_lock);
+	struct file *old;
+	unsigned int seq;
+
+	rcu_read_lock();
+retry:
+	seq = read_seqbegin(&files->resize);
 	fdt = files_fdtable(files);
-	BUG_ON(fdt->fd[fd] != NULL);
-	rcu_assign_pointer(fdt->fd[fd], file);
-	spin_unlock(&files->file_lock);
+	old = xchg(&fdt->fd[fd], file);
+	if (unlikely(old)) {
+		/* dup2() race */
+		fput(old);
+	}
+	if (unlikely(read_seqretry(&files->resize, seq))) {
+		if (cmpxchg(&fdt->fd[fd], file, NULL) == file)
+			goto retry;
+		/* 'file' was consumed by resizer or dup2(), we are done */
+	}
+	rcu_read_unlock();
 }
 
 void fd_install(unsigned int fd, struct file *file)
@@ -783,26 +810,9 @@ static int do_dup2(struct files_struct *files,
 	struct file *tofree;
 	struct fdtable *fdt;
 
-	/*
-	 * We need to detect attempts to do dup2() over allocated but still
-	 * not finished descriptor.  NB: OpenBSD avoids that at the price of
-	 * extra work in their equivalent of fget() - they insert struct
-	 * file immediately after grabbing descriptor, mark it larval if
-	 * more work (e.g. actual opening) is needed and make sure that
-	 * fget() treats larval files as absent.  Potentially interesting,
-	 * but while extra work in fget() is trivial, locking implications
-	 * and amount of surgery on open()-related paths in VFS are not.
-	 * FreeBSD fails with -EBADF in the same situation, NetBSD "solution"
-	 * deadlocks in rather amusing ways, AFAICS.  All of that is out of
-	 * scope of POSIX or SUS, since neither considers shared descriptor
-	 * tables and this condition does not arise without those.
-	 */
 	fdt = files_fdtable(files);
-	tofree = fdt->fd[fd];
-	if (!tofree && fd_is_open(fd, fdt))
-		goto Ebusy;
 	get_file(file);
-	rcu_assign_pointer(fdt->fd[fd], file);
+	tofree = xchg(&fdt->fd[fd], file);
 	__set_open_fd(fd, fdt);
 	if (flags & O_CLOEXEC)
 		__set_close_on_exec(fd, fdt);
@@ -814,10 +824,6 @@ static int do_dup2(struct files_struct *files,
 		filp_close(tofree, files);
 
 	return fd;
-
-Ebusy:
-	spin_unlock(&files->file_lock);
-	return -EBUSY;
 }
 
 int replace_fd(unsigned fd, struct file *file, unsigned flags)
diff --git a/include/linux/fdtable.h b/include/linux/fdtable.h
index 085197b..893c60f 100644
--- a/include/linux/fdtable.h
+++ b/include/linux/fdtable.h
@@ -49,6 +49,7 @@ struct files_struct {
 	atomic_t count;
 	struct fdtable __rcu *fdt;
 	struct fdtable fdtab;
+	seqlock_t resize;
   /*
    * written part on a separate cache line in SMP
    */

^ permalink raw reply related	[flat|nested] 26+ messages in thread

end of thread, other threads:[~2013-10-02 14:56 UTC | newest]

Thread overview: 26+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2013-09-30 19:29 Avoiding the dentry d_lock on final dput(), part deux: transactional memory Linus Torvalds
2013-09-30 20:01 ` Waiman Long
2013-09-30 20:04   ` Linus Torvalds
2013-10-02 14:56     ` Andi Kleen
2013-09-30 22:52 ` Benjamin Herrenschmidt
2013-10-01  0:36   ` Michael Neuling
2013-10-01  0:56     ` Linus Torvalds
2013-10-01  2:05       ` Benjamin Herrenschmidt
2013-10-01  3:13         ` Paul E. McKenney
2013-10-01  4:52           ` Michael Neuling
2013-10-01 12:16             ` Paul E. McKenney
2013-10-01 13:42               ` Paul E. McKenney
2013-10-01  1:05 ` spinlock contention of files->file_lock Eric Dumazet
2013-10-01  1:44   ` Linus Torvalds
2013-10-01  2:18     ` Eric Dumazet
2013-10-01 21:41     ` Eric Dumazet
2013-10-01 22:04       ` Al Viro
2013-10-01 22:21         ` Eric Dumazet
2013-10-02  5:13         ` Ingo Molnar
2013-10-02 10:20           ` Al Viro
2013-10-02 10:56             ` Ingo Molnar
2013-10-01  1:53   ` Al Viro
2013-10-01  2:02     ` Linus Torvalds
2013-10-01  3:27       ` Al Viro
2013-10-01  3:36         ` Eric Dumazet
2013-10-01  5:12           ` Eric Dumazet

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).