linux-fsdevel.vger.kernel.org archive mirror
* Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
       [not found]                 ` <1445991236.7476.59.camel@edumazet-glaptop2.roam.corp.google.com>
@ 2015-10-28 12:35                   ` Al Viro
  2015-10-28 13:24                     ` Eric Dumazet
  0 siblings, 1 reply; 32+ messages in thread
From: Al Viro @ 2015-10-28 12:35 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David Miller, stephen, netdev, Linus Torvalds, dhowells,
	linux-fsdevel

[Linus and Dave added, Solaris and NetBSD folks dropped from Cc]

On Tue, Oct 27, 2015 at 05:13:56PM -0700, Eric Dumazet wrote:
> On Tue, 2015-10-27 at 23:17 +0000, Al Viro wrote:
> 
> > 	* [Linux-specific aside] our __alloc_fd() can degrade quite badly
> > with some use patterns.  The cacheline pingpong in the bitmap is probably
> > inevitable, unless we accept considerably heavier memory footprint,
> > but we also have a case when alloc_fd() takes O(n) and it's _not_ hard
> > to trigger - close(3);open(...); will have the next open() after that
> > scanning the entire in-use bitmap.  I think I see a way to improve it
> > without slowing the normal case down, but I'll need to experiment a
> > bit before I post patches.  Anybody with examples of real-world loads
> > that make our descriptor allocator to degrade is very welcome to post
> > the reproducers...
> 
> Well, I do have real-world loads, but quite hard to setup in a lab :(
> 
> Note that we also hit the 'struct cred'->usage refcount for every
> open()/close()/sock_alloc(), and simply moving uid/gid out of the first
> cache line really helps, as current_fsuid() and current_fsgid() no
> longer forces a pingpong.
> 
> I moved seldom used fields on the first cache line, so that overall
> memory usage did not change (192 bytes on 64 bit arches)

[snip]

Makes sense, but there's a funny thing about that refcount - the part
coming from ->f_cred is the most frequently changed *and* almost all
places using ->f_cred are just looking at its fields and do not manipulate
its refcount.  The only exception (do_process_acct()) is easy to eliminate
just by storing a separate reference to the current creds of acct(2) caller
and using it instead of looking at ->f_cred.  What's more, the place where we
grab what will be ->f_cred is guaranteed to have a non-f_cred reference *and*
most of the time such a reference is there for dropping ->f_cred (in
file_free()/file_free_rcu()).

With that change in kernel/acct.c done, we could do the following:
	a) split the cred refcount into the normal and percpu parts and
add a spinlock in there.
	b) have put_cred() do this:
		if (atomic_dec_and_test(&cred->usage)) {
			this_cpu_add(&cred->f_cred_usage, 1);
			call_rcu(&cred->rcu, put_f_cred_rcu);
		}
	c) have get_empty_filp() increment current_cred ->f_cred_usage with
this_cpu_add()
	d) have file_free() do
		percpu_counter_dec(&nr_files);
		rcu_read_lock();
		if (likely(atomic_read(&f->f_cred->usage))) {
			this_cpu_add(&f->f_cred->f_cred_usage, -1);
			rcu_read_unlock();
			call_rcu(&f->f_u.fu_rcuhead, file_free_rcu_light);
		} else {
			rcu_read_unlock();
			call_rcu(&f->f_u.fu_rcuhead, file_free_rcu);
		}
file_free_rcu() being
static void file_free_rcu(struct rcu_head *head)
{
        struct file *f = container_of(head, struct file, f_u.fu_rcuhead);
        put_f_cred(&f->f_cred->rcu);
        kmem_cache_free(filp_cachep, f);
}
and file_free_rcu_light() - the same sans put_f_cred();

with put_f_cred() doing
	spin_lock cred->lock
	this_cpu_add(&cred->f_cred_usage, -1);
	find the sum of cred->f_cred_usage
	spin_unlock cred->lock
	if the sum has not reached 0
		return
	current put_cred_rcu(cred)

IOW, let's try to get rid of cross-cpu stores in ->f_cred grabbing and
(most of) ->f_cred dropping.

Note that there are two paths leading to put_f_cred() in the above - via
call_rcu() on &cred->rcu and from file_free_rcu() called via call_rcu() on
&f->f_u.fu_rcuhead.  Both are RCU-delayed and they can happen in parallel -
different rcu_head are used.

atomic_read() check in file_free() might give false positives if it comes
just before put_cred() on another CPU kills the last non-f_cred reference.
It's not a problem, since put_f_cred() from that put_cred() won't be
executed until we drop rcu_read_lock(), so we can safely decrement the
cred->f_cred_usage without cred->lock here (and we are guaranteed that we won't
be dropping the last of that - the same put_cred() would've incremented
->f_cred_usage).
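The scheme above can be modelled in plain userland C. This is a sketch only: the "per-cpu" counters become an array indexed by a fake cpu id, the atomic and cred->lock become an int and a pthread mutex, call_rcu() stages collapse into direct calls, and the field names mirror the hypothetical ones used above.

```c
/* Userland model of the split cred refcount sketched above.
 * NR_CPUS, the array-based "per-cpu" counters and the direct calls in
 * place of call_rcu() are simplifications for illustration only. */
#include <pthread.h>
#include <stdbool.h>

#define NR_CPUS 4

struct cred {
	int usage;			/* non-f_cred references */
	int f_cred_usage[NR_CPUS];	/* per-cpu ->f_cred references */
	pthread_mutex_t lock;
	bool freed;			/* stands in for put_cred_rcu() */
};

/* get_empty_filp() side: a purely local increment, no cross-cpu store */
static void get_f_cred(struct cred *c, int cpu)
{
	c->f_cred_usage[cpu]++;		/* this_cpu_add(..., 1) */
}

/* put_f_cred(): drop one ->f_cred reference; free the cred only when
 * the sum over all cpus reaches zero (computed under cred->lock). */
static void put_f_cred(struct cred *c, int cpu)
{
	int i, sum = 0;

	pthread_mutex_lock(&c->lock);
	c->f_cred_usage[cpu]--;
	for (i = 0; i < NR_CPUS; i++)
		sum += c->f_cred_usage[i];
	pthread_mutex_unlock(&c->lock);
	if (sum == 0)
		c->freed = true;
}

/* put_cred(): the last normal reference is converted into one f_cred
 * reference, which is then dropped through the common slow path. */
static void put_cred(struct cred *c, int cpu)
{
	if (--c->usage == 0) {
		c->f_cred_usage[cpu]++;
		put_f_cred(c, cpu);	/* kernel: call_rcu(&cred->rcu, ...) */
	}
}
```

In the kernel version only the put_f_cred() slow path would take cred->lock; the common file_free() path decrements the local counter without it, which is the whole point of the split.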

Does anybody see problems with that approach?  I'm going to grab some sleep
(only a couple of hours so far tonight ;-/), will cook an incremental to Eric's
field-reordering patch when I get up...

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
  2015-10-28 12:35                   ` [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3) Al Viro
@ 2015-10-28 13:24                     ` Eric Dumazet
  2015-10-28 14:47                       ` Eric Dumazet
  0 siblings, 1 reply; 32+ messages in thread
From: Eric Dumazet @ 2015-10-28 13:24 UTC (permalink / raw)
  To: Al Viro
  Cc: David Miller, stephen, netdev, Linus Torvalds, dhowells,
	linux-fsdevel

On Wed, 2015-10-28 at 12:35 +0000, Al Viro wrote:
> [Linus and Dave added, Solaris and NetBSD folks dropped from Cc]
> 
> On Tue, Oct 27, 2015 at 05:13:56PM -0700, Eric Dumazet wrote:
> > On Tue, 2015-10-27 at 23:17 +0000, Al Viro wrote:
> > 
> > > 	* [Linux-specific aside] our __alloc_fd() can degrade quite badly
> > > with some use patterns.  The cacheline pingpong in the bitmap is probably
> > > inevitable, unless we accept considerably heavier memory footprint,
> > > but we also have a case when alloc_fd() takes O(n) and it's _not_ hard
> > > to trigger - close(3);open(...); will have the next open() after that
> > > scanning the entire in-use bitmap.  I think I see a way to improve it
> > > without slowing the normal case down, but I'll need to experiment a
> > > bit before I post patches.  Anybody with examples of real-world loads
> > > that make our descriptor allocator to degrade is very welcome to post
> > > the reproducers...
> > 
> > Well, I do have real-world loads, but quite hard to setup in a lab :(
> > 
> > Note that we also hit the 'struct cred'->usage refcount for every
> > open()/close()/sock_alloc(), and simply moving uid/gid out of the first
> > cache line really helps, as current_fsuid() and current_fsgid() no
> > longer forces a pingpong.
> > 
> > I moved seldom used fields on the first cache line, so that overall
> > memory usage did not change (192 bytes on 64 bit arches)
> 
> [snip]
> 
> Makes sense, but there's a funny thing about that refcount - the part
> coming from ->f_cred is the most frequently changed *and* almost all
> places using ->f_cred are just looking at its fields and do not manipulate
> its refcount.  The only exception (do_process_acct()) is easy to eliminate
> just by storing a separate reference to the current creds of acct(2) caller
> and using it instead of looking at ->f_cred.  What's more, the place where we
> grab what will be ->f_cred is guaranteed to have a non-f_cred reference *and*
> most of the time such a reference is there for dropping ->f_cred (in
> file_free()/file_free_rcu()).
> 
> With that change in kernel/acct.c done, we could do the following:
> 	a) split the cred refcount into the normal and percpu parts and
> add a spinlock in there.
> 	b) have put_cred() do this:
> 		if (atomic_dec_and_test(&cred->usage)) {
> 			this_cpu_add(&cred->f_cred_usage, 1);
> 			call_rcu(&cred->rcu, put_f_cred_rcu);
> 		}
> 	c) have get_empty_filp() increment current_cred ->f_cred_usage with
> this_cpu_add()
> 	d) have file_free() do
> 		percpu_counter_dec(&nr_files);
> 		rcu_read_lock();
> 		if (likely(atomic_read(&f->f_cred->usage))) {
> 			this_cpu_add(&f->f_cred->f_cred_usage, -1);
> 			rcu_read_unlock();
> 			call_rcu(&f->f_u.fu_rcuhead, file_free_rcu_light);
> 		} else {
> 			rcu_read_unlock();
> 			call_rcu(&f->f_u.fu_rcuhead, file_free_rcu);
> 		}
> file_free_rcu() being
> static void file_free_rcu(struct rcu_head *head)
> {
>         struct file *f = container_of(head, struct file, f_u.fu_rcuhead);
>         put_f_cred(&f->f_cred->rcu);
>         kmem_cache_free(filp_cachep, f);
> }
> and file_free_rcu_light() - the same sans put_f_cred();
> 
> with put_f_cred() doing
> 	spin_lock cred->lock
> 	this_cpu_add(&cred->f_cred_usage, -1);
> 	find the sum of cred->f_cred_usage
> 	spin_unlock cred->lock
> 	if the sum has not reached 0
> 		return
> 	current put_cred_rcu(cred)
> 
> IOW, let's try to get rid of cross-cpu stores in ->f_cred grabbing and
> (most of) ->f_cred dropping.
> 
> Note that there are two paths leading to put_f_cred() in the above - via
> call_rcu() on &cred->rcu and from file_free_rcu() called via call_rcu() on
> &f->f_u.fu_rcuhead.  Both are RCU-delayed and they can happen in parallel -
> different rcu_head are used.
> 
> atomic_read() check in file_free() might give false positives if it comes
> just before put_cred() on another CPU kills the last non-f_cred reference.
> It's not a problem, since put_f_cred() from that put_cred() won't be
> executed until we drop rcu_read_lock(), so we can safely decrement the
> cred->f_cred_usage without cred->lock here (and we are guaranteed that we won't
> be dropping the last of that - the same put_cred() would've incremented
> ->f_cred_usage).
> 
> Does anybody see problems with that approach?  I'm going to grab some sleep
> (only a couple of hours so far tonight ;-/), will cook an incremental to Eric's
> field-reordering patch when I get up...

Before I take a deep look at your suggestion, are you sure plain use of
include/linux/percpu-refcount.h infra is not possible for struct cred ?

Thanks !


* Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
  2015-10-28 13:24                     ` Eric Dumazet
@ 2015-10-28 14:47                       ` Eric Dumazet
  2015-10-28 21:13                         ` Al Viro
  0 siblings, 1 reply; 32+ messages in thread
From: Eric Dumazet @ 2015-10-28 14:47 UTC (permalink / raw)
  To: Al Viro
  Cc: David Miller, stephen, netdev, Linus Torvalds, dhowells,
	linux-fsdevel

On Wed, 2015-10-28 at 06:24 -0700, Eric Dumazet wrote:

> Before I take a deep look at your suggestion, are you sure plain use of
> include/linux/percpu-refcount.h infra is not possible for struct cred ?

BTW, I am not convinced we need to spend so much energy and per-cpu
memory for struct cred refcount.

The big problem is fd array spinlock of course and bitmap search for
POSIX compliance.

The cache line thrashing in struct cred is a minor one ;)


* Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
  2015-10-28 14:47                       ` Eric Dumazet
@ 2015-10-28 21:13                         ` Al Viro
  2015-10-28 21:44                           ` Eric Dumazet
  0 siblings, 1 reply; 32+ messages in thread
From: Al Viro @ 2015-10-28 21:13 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David Miller, stephen, netdev, Linus Torvalds, dhowells,
	linux-fsdevel

On Wed, Oct 28, 2015 at 07:47:57AM -0700, Eric Dumazet wrote:
> On Wed, 2015-10-28 at 06:24 -0700, Eric Dumazet wrote:
> 
> > Before I take a deep look at your suggestion, are you sure plain use of
> > include/linux/percpu-refcount.h infra is not possible for struct cred ?
> 
> BTW, I am not convinced we need to spend so much energy and per-cpu
> memory for struct cred refcount.
> 
> The big problem is fd array spinlock of course and bitmap search for
> POSIX compliance.
> 
> The cache line trashing in struct cred is a minor one ;)

percpu-refcount isn't convenient - the only such candidate for ref_kill in
there is "all other references are gone", and that can happen in
interesting locking environments.  I doubt that it would be a good fit, TBH...

Cacheline pingpong on the descriptors bitmap is probably inevitable, but
it's not the only problem in the existing implementation - close a small
descriptor when you've got a lot of them and look for the second open
after that.  _That_ can lead to thousands of cachelines being read through,
all under the table spinlock.  It's literally orders of magnitude worse.
And if the first open after that close happens to be for a short-living
descriptor, you'll get the same situation back in your face as soon as you
close it.

I think we can seriously improve that without screwing the fast path by
adding "summary" bitmaps once the primary grows past a cacheline worth
of bits.  Bits in the summary bitmap correspond to cacheline-sized
chunks of the primary, a bit being set iff all bits in the corresponding
chunk are set.  If the summary map grows larger than one cacheline, add a
second-order one (that happens at a quarter million descriptors and serves
until 128 million; adding a third-order map is probably worthless).

I want to maintain the same kind of "everything below this is known to be
in use" thing as we do now.  Allocation would start with looking into the
same place in primary bitmap where we'd looked now and similar search
forward for zero bit.  _However_, it would stop at cacheline boundary.
If nothing had been found, we look in the corresponding place in the
summary bitmap and search for zero bit there.  Again, no more than up
to the cacheline boundary.  If something is found, we've got a chunk in
the primary known to contain a zero bit; if not - go to the second-level
and search there, etc.

When a zero bit in the primary had been found, check if it's within the
rlimit (passed to __alloc_fd() explicitly) and either bugger off or set
that bit.  If there are zero bits left in the same word - we are done,
otherwise check the still unread words in the cacheline and see if all
of them are ~0UL.  If all of them are, set the bit in summary bitmap, etc.

Normal case is exactly the same as now - one cacheline accessed and modified.
We might end up touching more than that, but it's going to be rare and
the cases when it happens are very likely to lead to much worse amount of
memory traffic with the current code.

Freeing is done by zeroing the bit in primary, checking for other zero bits
nearby and buggering off if there are such.  If the entire cacheline used
to be all-bits-set, clear the bit in summary and, if there's a second-order
summary, get the bit in there clear as well - it's probably not worth
bothering with checking that all the cacheline in summary bitmap had been
all-bits-set.  Again, the normal case is the same as now.
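A toy userland model of the two-level scheme, with assumptions stated up front: the chunk is shrunk to a single 64-bit word instead of a 512-bit cacheline, and the "scan the current chunk first" step is folded into a direct summary lookup.  Lowest-free-fd semantics survive because the summary tracks full chunks exactly, so the first not-full chunk necessarily holds the lowest zero bit.

```c
/* Two-level free-fd bitmap, shrunk for illustration: NCHUNKS chunks of
 * CHUNK_BITS bits each; a summary bit is set iff its chunk is full. */
#include <stdint.h>

#define CHUNK_BITS 64	/* kernel sketch: 512 bits, one cacheline */
#define NCHUNKS 8

static uint64_t primary[NCHUNKS];	/* one bit per descriptor */
static uint64_t summary;		/* bit i set iff primary[i] full */

static int fd_alloc(void)
{
	int chunk, bit;

	if (summary == (~0ULL >> (64 - NCHUNKS)))
		return -1;			/* every chunk is full */
	/* the first not-full chunk necessarily holds the lowest 0 bit */
	chunk = __builtin_ctzll(~summary);
	bit = __builtin_ctzll(~primary[chunk]);
	primary[chunk] |= 1ULL << bit;
	if (primary[chunk] == ~0ULL)		/* chunk became full */
		summary |= 1ULL << chunk;
	return chunk * CHUNK_BITS + bit;
}

static void fd_free(int fd)
{
	int chunk = fd / CHUNK_BITS, bit = fd % CHUNK_BITS;

	if (primary[chunk] == ~0ULL)		/* was full: unmark it */
		summary &= ~(1ULL << chunk);
	primary[chunk] &= ~(1ULL << bit);
}
```

Closing a low fd after the first chunk has filled clears one summary bit and one primary bit, and the next fd_alloc() relocates chunk 0 through the summary in O(1) per level rather than rescanning the whole in-use bitmap.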

It'll need profiling and tuning, but AFAICS it's doable without making the
things worse than they are now, and it should get rid of those O(N) fetches
under spinlock cases.  And yes, those are triggerable and visible in
profiles.  IMO it's worth trying to fix...


* Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
  2015-10-28 21:13                         ` Al Viro
@ 2015-10-28 21:44                           ` Eric Dumazet
  2015-10-28 22:33                             ` Al Viro
  0 siblings, 1 reply; 32+ messages in thread
From: Eric Dumazet @ 2015-10-28 21:44 UTC (permalink / raw)
  To: Al Viro
  Cc: David Miller, stephen, netdev, Linus Torvalds, dhowells,
	linux-fsdevel

On Wed, 2015-10-28 at 21:13 +0000, Al Viro wrote:
> On Wed, Oct 28, 2015 at 07:47:57AM -0700, Eric Dumazet wrote:
> > On Wed, 2015-10-28 at 06:24 -0700, Eric Dumazet wrote:
> > 
> > > Before I take a deep look at your suggestion, are you sure plain use of
> > > include/linux/percpu-refcount.h infra is not possible for struct cred ?
> > 
> > BTW, I am not convinced we need to spend so much energy and per-cpu
> > memory for struct cred refcount.
> > 
> > The big problem is fd array spinlock of course and bitmap search for
> > POSIX compliance.
> > 
> > The cache line trashing in struct cred is a minor one ;)
> 
> percpu-refcount isn't convenient - the only such candidate for ref_kill in
> there is "all other references are gone", and that can happen in
> interesting locking environments.  I doubt that it would be a good fit, TBH...

OK then ...

> Cacheline pingpong on the descriptors bitmap is probably inevitable, but
> it's not the only problem in the existing implementation - close a small
> descriptor when you've got a lot of them and look for the second open
> after that.  _That_ can lead to thousands of cachelines being read through,
> all under the table spinlock.  It's literally orders of magnitude worse.
> And if the first open after that close happens to be for a short-living
> descriptor, you'll get the same situation back in your face as soon as you
> close it.
> 
> I think we can seriously improve that without screwing the fast path by
> adding "summary" bitmaps once the primary grows past the cacheline worth
> of bits.  With bits in the summary bitmap corresponding to cacheline-sized
> chunks of the primary, being set iff all bits in the corresponding chunk
> are set.  If the summary map grows larger than one cacheline, add the
> second-order one (that happens at quarter million descriptors and serves
> until 128 million; adding the third-order map is probably worthless).
> 
> I want to maintain the same kind of "everything below this is known to be
> in use" thing as we do now.  Allocation would start with looking into the
> same place in primary bitmap where we'd looked now and similar search
> forward for zero bit.  _However_, it would stop at cacheline boundary.
> If nothing had been found, we look in the corresponding place in the
> summary bitmap and search for zero bit there.  Again, no more than up
> to the cacheline boundary.  If something is found, we've got a chunk in
> the primary known to contain a zero bit; if not - go to the second-level
> and search there, etc.
> 
> When a zero bit in the primary had been found, check if it's within the
> rlimit (passed to __alloc_fd() explicitly) and either bugger off or set
> that bit.  If there are zero bits left in the same word - we are done,
> otherwise check the still unread words in the cacheline and see if all
> of them are ~0UL.  If all of them are, set the bit in summary bitmap, etc.
> 
> Normal case is exactly the same as now - one cacheline accessed and modified.
> We might end up touching more than that, but it's going to be rare and
> the cases when it happens are very likely to lead to much worse amount of
> memory traffic with the current code.
> 
> Freeing is done by zeroing the bit in primary, checking for other zero bits
> nearby and buggering off if there are such.  If the entire cacheline used
> to be all-bits-set, clear the bit in summary and, if there's a second-order
> summary, get the bit in there clear as well - it's probably not worth
> bothering with checking that all the cacheline in summary bitmap had been
> all-bits-set.  Again, the normal case is the same as now.
> 
> It'll need profiling and tuning, but AFAICS it's doable without making the
> things worse than they are now, and it should get rid of those O(N) fetches
> under spinlock cases.  And yes, those are triggerable and visible in
> profiles.  IMO it's worth trying to fix...
> 

Well, all this complexity goes away with an O_FD_FASTALLOC /
SOCK_FD_FASTALLOC bit in various fd allocations, which specifically
tells the kernel we do not care about getting the lowest possible fd as
POSIX mandates.

With this bit set, the bitmap search can start at a random point, and we
find a slot in O(1) : one cache line miss, if you have at least one free
bit/slot per 512 bits (64 bytes cache line).

#ifndef O_FD_FASTALLOC
#define O_FD_FASTALLOC 0x40000000
#endif

#ifndef SOCK_FD_FASTALLOC
#define SOCK_FD_FASTALLOC O_FD_FASTALLOC
#endif

... // active sockets
socket(AF_INET, SOCK_STREAM | SOCK_FD_FASTALLOC, 0);
... // passive sockets
accept4(sockfd, ..., SOCK_FD_FASTALLOC);
...

Except for legacy stuff and stdin/stdout/stderr games, I really doubt a
lot of applications absolutely rely on the POSIX thing...


* Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
  2015-10-28 21:44                           ` Eric Dumazet
@ 2015-10-28 22:33                             ` Al Viro
  2015-10-28 23:08                               ` Eric Dumazet
  0 siblings, 1 reply; 32+ messages in thread
From: Al Viro @ 2015-10-28 22:33 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David Miller, stephen, netdev, Linus Torvalds, dhowells,
	linux-fsdevel

On Wed, Oct 28, 2015 at 02:44:28PM -0700, Eric Dumazet wrote:

> Well, all this complexity goes away with a O_FD_FASTALLOC /
> SOCK_FD_FASTALLOC bit in various fd allocations, which specifically
> tells the kernel we do not care getting the lowest possible fd as POSIX
> mandates.

... which won't do a damn thing for existing userland.

> Except for legacy stuff and stdin/stdout/stderr games, I really doubt
> lot of applications absolutely rely on the POSIX thing...

We obviously can't turn that into default behaviour, though.  BTW, what
distribution do you have in mind for those random descriptors?  Uniform
on [0,INT_MAX] is a bad idea for obvious reasons - you'll blow the
memory footprint pretty soon...


* Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
  2015-10-28 22:33                             ` Al Viro
@ 2015-10-28 23:08                               ` Eric Dumazet
  2015-10-29  0:15                                 ` Al Viro
  0 siblings, 1 reply; 32+ messages in thread
From: Eric Dumazet @ 2015-10-28 23:08 UTC (permalink / raw)
  To: Al Viro
  Cc: David Miller, stephen, netdev, Linus Torvalds, dhowells,
	linux-fsdevel

On Wed, 2015-10-28 at 22:33 +0000, Al Viro wrote:
> On Wed, Oct 28, 2015 at 02:44:28PM -0700, Eric Dumazet wrote:
> 
> > Well, all this complexity goes away with a O_FD_FASTALLOC /
> > SOCK_FD_FASTALLOC bit in various fd allocations, which specifically
> > tells the kernel we do not care getting the lowest possible fd as POSIX
> > mandates.
> 
> ... which won't do a damn thing for existing userland.

For the userland that needs 5,000,000+ sockets, I can tell you they will
use this flag as soon as they are aware it exists ;)

> 
> > Except for legacy stuff and stdin/stdout/stderr games, I really doubt
> > lot of applications absolutely rely on the POSIX thing...
> 
> We obviously can't turn that into default behaviour, though.  BTW, what
> distribution do you have in mind for those random descriptors?  Uniform
> on [0,INT_MAX] is a bad idea for obvious reasons - you'll blow the
> memory footprint pretty soon...

Simply [0, fdt->max_fds] works well in most cases.

* Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
  2015-10-28 23:08                               ` Eric Dumazet
@ 2015-10-29  0:15                                 ` Al Viro
  2015-10-29  3:29                                   ` Eric Dumazet
  0 siblings, 1 reply; 32+ messages in thread
From: Al Viro @ 2015-10-29  0:15 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David Miller, stephen, netdev, Linus Torvalds, dhowells,
	linux-fsdevel

On Wed, Oct 28, 2015 at 04:08:29PM -0700, Eric Dumazet wrote:
> > > Except for legacy stuff and stdin/stdout/stderr games, I really doubt
> > > lot of applications absolutely rely on the POSIX thing...
> > 
> > We obviously can't turn that into default behaviour, though.  BTW, what
> > distribution do you have in mind for those random descriptors?  Uniform
> > on [0,INT_MAX] is a bad idea for obvious reasons - you'll blow the
> > memory footprint pretty soon...
> 
> Simply [0 , fdt->max_fds] is working well in most cases.

Umm...  So first you dup2() to establish the ->max_fds you want, then
do such opens?  What used/unused ratio do you expect to deal with?
And what kind of locking are you going to use?  Keep in mind that
e.g. dup2() is dependent on the lack of allocations while it's working,
so it's not as simple as "we don't need no stinkin' ->files_lock"...


* Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
  2015-10-29  0:15                                 ` Al Viro
@ 2015-10-29  3:29                                   ` Eric Dumazet
  2015-10-29  4:16                                     ` Al Viro
  0 siblings, 1 reply; 32+ messages in thread
From: Eric Dumazet @ 2015-10-29  3:29 UTC (permalink / raw)
  To: Al Viro
  Cc: David Miller, stephen, netdev, Linus Torvalds, dhowells,
	linux-fsdevel

On Thu, 2015-10-29 at 00:15 +0000, Al Viro wrote:
> On Wed, Oct 28, 2015 at 04:08:29PM -0700, Eric Dumazet wrote:
> > > > Except for legacy stuff and stdin/stdout/stderr games, I really doubt
> > > > lot of applications absolutely rely on the POSIX thing...
> > > 
> > > We obviously can't turn that into default behaviour, though.  BTW, what
> > > distribution do you have in mind for those random descriptors?  Uniform
> > > on [0,INT_MAX] is a bad idea for obvious reasons - you'll blow the
> > > memory footprint pretty soon...
> > 
> > Simply [0 , fdt->max_fds] is working well in most cases.
> 
> Umm...  So first you dup2() to establish the ->max_fds you want, then
> do such opens?

Yes, dup2() is done at program startup, knowing the expected max load
(in terms of concurrent fds) + ~10% (the actual fd array size can be more
than this because of the power-of-two rounding in alloc_fdtable())

But this is an optimization: if you do not use the initial dup2(), the
fd array is automatically expanded when needed (when all slots are in use)
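For illustration, the startup trick might be wrapped like this in userland (a sketch: presize_fd_table() is a made-up helper name, and the RLIMIT_NOFILE check is an added precaution not discussed in the thread):

```c
/* Pre-size the kernel fd table with one dup2() at startup, so that no
 * fd-array expansion latency spike happens once the process is loaded.
 * presize_fd_table() is a hypothetical helper name. */
#include <fcntl.h>
#include <unistd.h>
#include <sys/resource.h>

static int presize_fd_table(int target)
{
	struct rlimit rl;
	int fd, hi;

	if (getrlimit(RLIMIT_NOFILE, &rl) != 0 ||
	    (rlim_t)target >= rl.rlim_cur)
		return -1;			/* would exceed RLIMIT_NOFILE */
	fd = open("/dev/null", O_RDONLY);
	if (fd < 0)
		return -1;
	hi = dup2(fd, target);			/* forces the table to grow */
	close(fd);
	if (hi >= 0)
		close(hi);	/* the enlarged array stays allocated */
	return hi;
}
```

Because alloc_fdtable() rounds up to a power of two, asking for the expected load plus ~10% headroom typically leaves a bit more slack than requested.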

>   What used/unused ratio do you expect to deal with?
> And what kind of locking are you going to use?  Keep in mind that
> e.g. dup2() is dependent on the lack of allocations while it's working,
> so it's not as simple as "we don't need no stinkin' ->files_lock"...

No locking change. files->file_lock is still taken.

We only want to minimize time to find an empty slot.

The trick is to not start the bitmap search at files->next_fd, but at a
random point. This is a win if we assume there are enough holes.

low = start;
if (low < files->next_fd)
    low = files->next_fd;

res = -1;
if (flags & O_FD_FASTALLOC) {
	random_point = pick_random_between(low, fdt->max_fds);

	res = find_next_zero_bit(fdt->open_fds, fdt->max_fds,
				random_point);
	/* No empty slot found in [random_point, max_fds): wrap around
	 * and search [low, random_point) instead.  Note that
	 * find_next_zero_bit(addr, size, offset) scans [offset, size). */
	if (res >= fdt->max_fds) {
		res = find_next_zero_bit(fdt->open_fds,
				random_point, low);
		if (res >= random_point)
			res = -1;
	}
}
...
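The fragment above can be turned into a self-contained userland model (a sketch: find_next_zero_bit() is re-implemented bit-by-bit with the kernel's (addr, size, offset) argument order, and pick_random_between() becomes plain rand()):

```c
/* Userland model of the O_FD_FASTALLOC-style wrapping search: try
 * [random_point, max) first, then wrap around to [low, random_point). */
#include <stdint.h>
#include <stdlib.h>

/* kernel semantics: scan [offset, size), return size if no zero bit */
static unsigned find_next_zero_bit(const uint64_t *map, unsigned size,
				   unsigned offset)
{
	for (; offset < size; offset++)
		if (!(map[offset / 64] & (1ULL << (offset % 64))))
			return offset;
	return size;
}

static int fd_fastalloc(const uint64_t *map, unsigned low, unsigned max)
{
	unsigned random_point = low + (unsigned)rand() % (max - low);
	unsigned res = find_next_zero_bit(map, max, random_point);

	if (res >= max) {
		/* nothing above the random point: try the other range */
		res = find_next_zero_bit(map, random_point, low);
		if (res >= random_point)
			return -1;	/* every slot in use */
	}
	return (int)res;
}
```

Whatever rand() returns, a bitmap with at least one hole is resolved in at most two scans, and a full bitmap yields -1.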






* Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
  2015-10-29  3:29                                   ` Eric Dumazet
@ 2015-10-29  4:16                                     ` Al Viro
  2015-10-29 12:35                                       ` Eric Dumazet
  0 siblings, 1 reply; 32+ messages in thread
From: Al Viro @ 2015-10-29  4:16 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David Miller, stephen, netdev, Linus Torvalds, dhowells,
	linux-fsdevel

On Wed, Oct 28, 2015 at 08:29:41PM -0700, Eric Dumazet wrote:

> But this is an optimization : If you do not use the initial dup2(), the
> fd array can be automatically expanded if needed (all slots are in use)

Whee...

> No locking change. files->file_lock is still taken.
> 
> We only want to minimize time to find an empty slot.

Then I'd say that my variant is going to win.  It *will* lead to
cacheline pingpong in more cases than yours, but I'm quite sure that
it will be a win as far as the total number of cachelines accessed.

> The trick is to not start bitmap search at files->next_fd, but a random
> point. This is a win if we assume there are enough holes.
> 
> low = start;
> if (low < files->next_fd)
>     low = files->next_fd;
> 
> res = -1;
> if (flags & O_FD_FASTALLOC) {
> 	random_point = pick_random_between(low, fdt->max_fds);
> 
> 	res = find_next_zero_bit(fdt->open_fds, fdt->max_fds,
> 				random_point);
> 	/* No empty slot found, try the other range */
> 	if (res >= fdt->max_fds) {
> 		res = find_next_zero_bit(fdt->open_fds,
> 				random_point, low);
> 		if (res >= random_point)
> 			res = -1;
> 	}
> }

Have you tried to experiment with that in userland?  I mean, emulate that
thing in normal userland code, count the cacheline accesses and drive it
with the use patterns collected from actual applications.

I can sit down and play with math expectations, but I suspect that it's
easier to experiment.  It's nothing but an intuition (I hadn't seriously
done probability theory in quite a while, and my mathematical tastes run
more to geometry and topology anyway), but... I would expect it to degrade
badly when the bitmap is reasonably dense.

Note, BTW, that vmalloc'ed memory gets populated as you read it, and it's
not cheap - it's done via #PF triggered in kernel mode, with handler
noticing that the faulting address is in vmalloc range and doing the
right thing.  IOW, if your bitmap is very sparse, the price of page faults
needs to be taken into account.

AFAICS, the only benefit of that thing is keeping dirtied cachelines far
from each other.  Which might be a win overall, but I'm not convinced that
the rest won't offset the effect of that...


* Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
  2015-10-29  4:16                                     ` Al Viro
@ 2015-10-29 12:35                                       ` Eric Dumazet
  2015-10-29 13:48                                         ` Eric Dumazet
  2015-10-30 17:18                                         ` Linus Torvalds
  0 siblings, 2 replies; 32+ messages in thread
From: Eric Dumazet @ 2015-10-29 12:35 UTC (permalink / raw)
  To: Al Viro
  Cc: David Miller, stephen, netdev, Linus Torvalds, dhowells,
	linux-fsdevel

On Thu, 2015-10-29 at 04:16 +0000, Al Viro wrote:

> Have you tried to experiment with that in userland?  I mean, emulate that
> thing in normal userland code, count the cacheline accesses and drive it
> with the use patterns collected from actual applications.

Sure.

> 
> I can sit down and play with math expectations, but I suspect that it's
> easier to experiment.  It's nothing but an intuition (I hadn't seriously
> done probability theory in quite a while, and my mathematical tastes run
> more to geometry and topology anyway), but... I would expect it to degrade
> badly when the bitmap is reasonably dense.
> 
> Note, BTW, that vmalloc'ed memory gets populated as you read it, and it's
> not cheap - it's done via #PF triggered in kernel mode, with handler
> noticing that the faulting address is in vmalloc range and doing the
> right thing.  IOW, if your bitmap is very sparse, the price of page faults
> needs to be taken into account.

This vmalloc PF is pure noise.
This only matters for the very first allocations.

We target programs opening zillions of fd in their lifetime ;)

Not having to expand a 4,000,000-slot fd array while fully loaded also
removes a latency spike that is very often unacceptable.


> 
> AFAICS, the only benefit of that thing is keeping dirtied cachelines far
> from each other.  Which might be a win overall, but I'm not convinced that
> the rest won't offset the effect of that...

Well, I already tested the O_FD_FASTALLOC thing, and I can tell you
find_next_zero_bit() is nowhere to be found in kernel profiles anymore.
It also lowers the time we hold the fd array spinlock while doing fd alloc.

User land test program I wrote a few months back:

Current kernel :

    64.98%  [kernel]          [k] queued_spin_lock_slowpath    
    14.88%  opensock          [.] memset    // this part simulates user land actual work ;)                   
    11.15%  [kernel]          [k] _find_next_bit.part.0        
     0.69%  [kernel]          [k] _raw_spin_lock               
     0.46%  [kernel]          [k] memset_erms                  
     0.38%  [kernel]          [k] sk_alloc                     
     0.37%  [kernel]          [k] kmem_cache_alloc             
     0.33%  [kernel]          [k] get_empty_filp               
     0.31%  [kernel]          [k] kmem_cache_free              
     0.26%  [kernel]          [k] __alloc_fd                   
     0.26%  opensock          [.] child_function               
     0.18%  [kernel]          [k] inode_init_always            
     0.17%  opensock          [.] __random_r                   


/*
 * test for b/9072743 : fd scaling on gigantic process (with ~ 10,000,000 TCP sockets)
 * - Size fd arrays in kernel to avoid resizings that kill latencies.
 * - Then launch xx threads doing
 *    populate the fd array of the process, opening 'max' files.
 *    
 *    - Loop : close(randomfd()), socket(AF_INET, SOCK_STREAM, 0);
 *
 * Usage : opensock [ -n fds_count ] [ -t threads_count] [-f]
 */

#include <pthread.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <stdlib.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>
#include <stdlib.h>
#include <string.h>

unsigned int count;
int skflags;

#define NBTHREADS_MAX 4096
pthread_t tid[NBTHREADS_MAX];
int nbthreads;
int nbthreads_req = 24;
int stop_all;

#ifndef O_FD_FASTALLOC
#define O_FD_FASTALLOC 0x40000000
#endif

#ifndef SOCK_FD_FASTALLOC
#define SOCK_FD_FASTALLOC O_FD_FASTALLOC
#endif

/* expand kernel fd array for optimal perf.
 * This could be done by doing a loop on dup(),
 * or can be done using dup2()
 */
int expand_fd_array(int max)
{
	int target, res;
	int fd = socket(AF_INET, SOCK_STREAM, 0);

	if (fd == -1) {
		perror("socket()");
		return -1;
	}
	for (;;) {
		count = max;
		target = count;
		if (skflags & SOCK_FD_FASTALLOC)
			target += count/10;
		res = dup2(fd, target);
		if (res != -1) {
			close(res);
			break;
		}
		max -= max/10;
	}
	printf("count=%u (check/increase ulimit -n)\n", count);
	return 0;
}

static char state[32] = {
	0, 1, 2, 3, 4, 5, 6, 7,
	8, 9, 10, 11, 12, 13, 14, 15,
	16, 17, 18, 19, 20, 21, 22, 23,
	24, 25, 26, 27, 28, 29, 30, 31
};

/* each thread is using ~400 KB of data per unit of work */
#define WORKING_SET_SIZE 400000

static void *child_function(void *arg)
{
	unsigned int max = count / nbthreads_req;
	struct random_data buf;
	unsigned int idx;
	int *tab;
	unsigned long iter = 0;
	unsigned long *work_set = malloc(WORKING_SET_SIZE);
	int i;

	if (!work_set)
		return NULL;
	tab = malloc(max * sizeof(int));
	if (!tab) {
		free(work_set);
		return NULL;
	}
	memset(tab, 255, max * sizeof(int));
	
	initstate_r(getpid(), state, sizeof(state), &buf);

	tab[0] = socket(AF_INET, SOCK_STREAM | skflags, 0);
	for (i = 1; i < max; i++)
		tab[i] = dup(tab[0]);

	while (!stop_all) {
		random_r(&buf, &idx);
		idx = idx % max;
		close(tab[idx]);

		/* user space needs typically to use a bit of the memory. */
		memset(work_set, idx, WORKING_SET_SIZE);

		tab[idx] = socket(AF_INET, SOCK_STREAM | skflags, 0);
		if (tab[idx] == -1) {
			perror("socket");
			break;
		}
		iter++;
	}
	for (i = 0; i < max; i++)
		close(tab[i]);
	free(tab);
	free(work_set);
	printf("%lu\n", iter);
	return NULL;
}

static int launch_threads(void)
{
	int i, err;

	for (i = 0; i < nbthreads_req; i++) {
		err = pthread_create(&tid[i], NULL, child_function, NULL);
		if (err)
			return err;
		nbthreads++;
	}
	return 0;
}

static void wait_end(void)
{
	int i;
	for (i = 0; i < nbthreads; i++)
		pthread_join(tid[i], NULL);
}

static void usage(int code)
{
	fprintf(stderr, "Usage : opensock [ -n fds_count ] [ -t threads_count] [-f]\n");
	exit(code);
}

int main(int argc, char *argv[])
{
	int c;
	int max = 1000000;
	int duration = 10;

	while ((c = getopt(argc, argv, "fn:t:l:")) != -1) {
		switch (c) {
		case 'f':
			skflags = SOCK_FD_FASTALLOC;
			break;
		case 'n':
			max = atoi(optarg);
			break;
		case 't':
			nbthreads_req = atoi(optarg);
			if (nbthreads_req > NBTHREADS_MAX)
				usage(1);
			break;
		case 'l':
			duration = atoi(optarg);
			break;
		default:
			usage(1);
		}
	}
	system("sysctl -w fs.file-max=8000000");
	expand_fd_array(max);
	launch_threads();
	sleep(duration);
	stop_all = 1;
	wait_end();
}



^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
  2015-10-29 12:35                                       ` Eric Dumazet
@ 2015-10-29 13:48                                         ` Eric Dumazet
  2015-10-30 17:18                                         ` Linus Torvalds
  1 sibling, 0 replies; 32+ messages in thread
From: Eric Dumazet @ 2015-10-29 13:48 UTC (permalink / raw)
  To: Al Viro
  Cc: David Miller, stephen, netdev, Linus Torvalds, dhowells,
	linux-fsdevel

On Thu, 2015-10-29 at 05:35 -0700, Eric Dumazet wrote:

> Current kernel :
> 
>     64.98%  [kernel]          [k] queued_spin_lock_slowpath    
>     14.88%  opensock          [.] memset    // this part simulates user land actual work ;)                   
>     11.15%  [kernel]          [k] _find_next_bit.part.0        
>      0.69%  [kernel]          [k] _raw_spin_lock               
>      0.46%  [kernel]          [k] memset_erms                  
>      0.38%  [kernel]          [k] sk_alloc                     
>      0.37%  [kernel]          [k] kmem_cache_alloc             
>      0.33%  [kernel]          [k] get_empty_filp               
>      0.31%  [kernel]          [k] kmem_cache_free              
>      0.26%  [kernel]          [k] __alloc_fd                   
>      0.26%  opensock          [.] child_function               
>      0.18%  [kernel]          [k] inode_init_always            
>      0.17%  opensock          [.] __random_r                   

With attached prototype patch we get this profile instead :

You can see we no longer hit the spinlock issue and cache waste
in find_next_bit.

Userland can really progress _much_ faster.

    76.86%  opensock          [.] memset                        
     1.31%  [kernel]          [k] _raw_spin_lock                
     1.15%  assd              [.] 0x000000000056f32c            
     1.08%  [kernel]          [k] kmem_cache_free               
     0.97%  [kernel]          [k] kmem_cache_alloc              
     0.83%  [kernel]          [k] sk_alloc                      
     0.72%  [kernel]          [k] memset_erms                   
     0.70%  opensock          [.] child_function                
     0.67%  [kernel]          [k] get_empty_filp                
     0.65%  [kernel]          [k] __alloc_fd                    
     0.58%  [kernel]          [k] __close_fd                    
     0.49%  [kernel]          [k] queued_spin_lock_slowpath     

diff --git a/fs/file.c b/fs/file.c
index 6c672ad329e9..eabb9a626259 100644
--- a/fs/file.c
+++ b/fs/file.c
@@ -22,6 +22,7 @@
 #include <linux/spinlock.h>
 #include <linux/rcupdate.h>
 #include <linux/workqueue.h>
+#include <linux/random.h>
 
 int sysctl_nr_open __read_mostly = 1024*1024;
 int sysctl_nr_open_min = BITS_PER_LONG;
@@ -471,6 +472,19 @@ int __alloc_fd(struct files_struct *files,
 	spin_lock(&files->file_lock);
 repeat:
 	fdt = files_fdtable(files);
+
+	if (unlikely(flags & O_FD_FASTALLOC)) {
+		u32 rnd, limit = min(end, fdt->max_fds);
+
+		/*
+		 * Note: do not bother with files->next_fd,
+		 * this is for POSIX lovers...
+		 */
+		rnd = ((u64)prandom_u32() * limit) >> 32;
+		fd = find_next_zero_bit(fdt->open_fds, limit, rnd);
+		if (fd < limit)
+			goto ok;
+	}
 	fd = start;
 	if (fd < files->next_fd)
 		fd = files->next_fd;
@@ -499,7 +513,7 @@ repeat:
 
 	if (start <= files->next_fd)
 		files->next_fd = fd + 1;
-
+ok:
 	__set_open_fd(fd, fdt);
 	if (flags & O_CLOEXEC)
 		__set_close_on_exec(fd, fdt);
diff --git a/include/linux/net.h b/include/linux/net.h
index 70ac5e28e6b7..3823d082af4c 100644
--- a/include/linux/net.h
+++ b/include/linux/net.h
@@ -76,6 +76,7 @@ enum sock_type {
 #ifndef SOCK_NONBLOCK
 #define SOCK_NONBLOCK	O_NONBLOCK
 #endif
+#define SOCK_FD_FASTALLOC O_FD_FASTALLOC
 
 #endif /* ARCH_HAS_SOCKET_TYPES */
 
diff --git a/include/uapi/asm-generic/fcntl.h b/include/uapi/asm-generic/fcntl.h
index e063effe0cc1..badd421dd9f4 100644
--- a/include/uapi/asm-generic/fcntl.h
+++ b/include/uapi/asm-generic/fcntl.h
@@ -88,6 +88,10 @@
 #define __O_TMPFILE	020000000
 #endif
 
+#ifndef O_FD_FASTALLOC
+#define O_FD_FASTALLOC 0x40000000
+#endif
+
 /* a horrid kludge trying to make sure that this will fail on old kernels */
 #define O_TMPFILE (__O_TMPFILE | O_DIRECTORY)
 #define O_TMPFILE_MASK (__O_TMPFILE | O_DIRECTORY | O_CREAT)      
diff --git a/net/socket.c b/net/socket.c
index 9963a0b53a64..6dde02b2eaf9 100644
--- a/net/socket.c
+++ b/net/socket.c
@@ -1227,9 +1227,10 @@ SYSCALL_DEFINE3(socket, int, family, int, type, int, protocol)
 	BUILD_BUG_ON((SOCK_MAX | SOCK_TYPE_MASK) != SOCK_TYPE_MASK);
 	BUILD_BUG_ON(SOCK_CLOEXEC & SOCK_TYPE_MASK);
 	BUILD_BUG_ON(SOCK_NONBLOCK & SOCK_TYPE_MASK);
+	BUILD_BUG_ON(SOCK_FD_FASTALLOC & SOCK_TYPE_MASK);
 
 	flags = type & ~SOCK_TYPE_MASK;
-	if (flags & ~(SOCK_CLOEXEC | SOCK_NONBLOCK))
+	if (flags & ~(SOCK_CLOEXEC | SOCK_NONBLOCK | SOCK_FD_FASTALLOC))
 		return -EINVAL;
 	type &= SOCK_TYPE_MASK;
 
@@ -1240,7 +1241,7 @@ SYSCALL_DEFINE3(socket, int, family, int, type, int, protocol)
 	if (retval < 0)
 		goto out;
 
-	retval = sock_map_fd(sock, flags & (O_CLOEXEC | O_NONBLOCK));
+	retval = sock_map_fd(sock, flags & (O_CLOEXEC | O_NONBLOCK | O_FD_FASTALLOC));
 	if (retval < 0)
 		goto out_release;
 
@@ -1266,7 +1267,7 @@ SYSCALL_DEFINE4(socketpair, int, family, int, type, int, protocol,
 	int flags;
 
 	flags = type & ~SOCK_TYPE_MASK;
-	if (flags & ~(SOCK_CLOEXEC | SOCK_NONBLOCK))
+	if (flags & ~(SOCK_CLOEXEC | SOCK_NONBLOCK | SOCK_FD_FASTALLOC))
 		return -EINVAL;
 	type &= SOCK_TYPE_MASK;
 
@@ -1436,7 +1437,7 @@ SYSCALL_DEFINE4(accept4, int, fd, struct sockaddr __user *, upeer_sockaddr,
 	int err, len, newfd, fput_needed;
 	struct sockaddr_storage address;
 
-	if (flags & ~(SOCK_CLOEXEC | SOCK_NONBLOCK))
+	if (flags & ~(SOCK_CLOEXEC | SOCK_NONBLOCK | SOCK_FD_FASTALLOC))
 		return -EINVAL;
 
 	if (SOCK_NONBLOCK != O_NONBLOCK && (flags & SOCK_NONBLOCK))


* Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
  2015-10-29 12:35                                       ` Eric Dumazet
  2015-10-29 13:48                                         ` Eric Dumazet
@ 2015-10-30 17:18                                         ` Linus Torvalds
  2015-10-30 21:02                                           ` Al Viro
  1 sibling, 1 reply; 32+ messages in thread
From: Linus Torvalds @ 2015-10-30 17:18 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Al Viro, David Miller, Stephen Hemminger, Network Development,
	David Howells, linux-fsdevel

On Thu, Oct 29, 2015 at 5:35 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>
> Well, I already tested the O_FD_FASTALLOC thing, and I can tell you
> find_next_zero_bit() is nowhere to be found in kernel profiles anymore.
> It also lowers time we hold the fd array spinlock while doing fd alloc.

I do wonder if we couldn't just speed up the bitmap allocator by an
order of magnitude. It would be nicer to be able to make existing
loads faster without any new "don't bother with POSIX semantics" flag.

We could, for example, extend the "open_fds" bitmap with a
"second-level" bitmap that has information on when the first level is
full. We traverse the open_fd's bitmap one word at a time anyway, we
could have a second-level bitmap that has one bit per word to say
whether that word is already full.

So you'd basically have:

 - open_fds: array with the real bitmap (exactly like we do now)
 - full_fds_bits: array of bits that shows which of the "unsigned
longs" in 'open_fs' are already full.

and then you can basically walk 4096 fd's at a time by just checking
one single word in the full_fds_bits[] array.

So in __alloc_fd(), instead of using find_next_zero_bit(), you'd use
"find_next_fd()", which would do something like

    unsigned long find_next_fd(struct fdtable *fdt, unsigned long start)
    {
        unsigned long fd;

        while (start < fdt->max_fds) {
            unsigned long startbit = start / BITS_PER_LONG;
            unsigned long bitbit = startbit / BITS_PER_LONG;
            unsigned long bitmax = (bitbit+1) * BITS_PER_LONG * BITS_PER_LONG;

            if (bitmax > fdt->max_fds)
                bitmax = fdt->max_fds;

            // Is every word covered by this full_fds_bits word already full?
            if (!~fdt->full_fds_bits[bitbit]) {
                start = bitmax;
                continue;
            }

            fd = find_next_zero_bit(fdt->open_fds, bitmax, start);
            if (fd < bitmax)
                return fd;
            start = bitmax;
        }
        return fdt->max_fds;
    }

which really should cut down on the bit traversal by a factor of 64.

Of course, then you do have to maintain that bitmap-of-bits in
__set_open_fd() and __clear_open_fd(), but that should be pretty
trivial too:

 - __clear_open_fd() would become just

        __clear_bit(fd, fdt->open_fds);
        __clear_bit(fd / BITS_PER_LONG, fdt->full_fds_bits);

 - and __set_open_fd() would become

        __set_bit(fd, fdt->open_fds);
        fd /= BITS_PER_LONG;
        if (!~fdt->open_fds[fd])
            __set_bit(fd, fdt->full_fds_bits);

or something.

NOTE NOTE NOTE! The above is all very much written without any testing
in the email client, so while I tried to make it look like "real
code", consider the above just pseudo-code.

The advantage of the above is that it should just work for existing
binaries. It may not be quite as optimal as just introducing a new
"don't care about POSIX" feature, but quite frankly, if it cuts down
the bad case of "find_next_zero_bit()" by a factor of 64 (and then
adds a *small* expense factor on top of that), I suspect it should be
"good enough" even for your nasty case.

What do you think? Willing to try the above approach (with any
inevitable bug-fixes) and see how it compares?

Obviously in addition to any fixes to my pseudo-code above you'd need
to add the allocations for the new "full_fds_bits" etc, but I think it
should be easy to make the full_fds_bit allocation be *part* of the
"open_fds" allocation, so you wouldn't need a new allocation in
alloc_fdtable(). We already do that whole "use a single allocation" to
combine open_fds with close_on_exec into one single allocation.

                    Linus


* Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
  2015-10-30 17:18                                         ` Linus Torvalds
@ 2015-10-30 21:02                                           ` Al Viro
  2015-10-30 21:23                                             ` Linus Torvalds
  0 siblings, 1 reply; 32+ messages in thread
From: Al Viro @ 2015-10-30 21:02 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Eric Dumazet, David Miller, Stephen Hemminger,
	Network Development, David Howells, linux-fsdevel

On Fri, Oct 30, 2015 at 10:18:12AM -0700, Linus Torvalds wrote:

> I do wonder if we couldn't just speed up the bitmap allocator by an
> order of magnitude. It would be nicer to be able to make existing
> loads faster without any new "don't bother with POSIX semantics" flag.
> 
> We could, for example, extend the "open_fds" bitmap with a
> "second-level" bitmap that has information on when the first level is
> full. We traverse the open_fd's bitmap one word at a time anyway, we
> could have a second-level bitmap that has one bit per word to say
> whether that word is already full.

Your variant has 1:64 ratio; obviously better than now, but we can actually
do 1:bits-per-cacheline quite easily.

I've been playing with a variant that has more than two bitmaps, and
AFAICS it
	a) does not increase the amount of cacheline pulled and
	b) keeps it well-bound even in the absolutely worst case
(128M-odd descriptors filled, followed by close(0);dup2(1,0); -
in that case it ends up accessing 7 cachelines' worth of
bitmaps; your variant will read through 4000-odd cachelines of
the summary bitmap alone; the mainline is even worse).

> The advantage of the above is that it should just work for existing
> binaries. It may not be quite as optimal as just introducing a new
> "don't care about POSIX" feature, but quite frankly, if it cuts down
> the bad case of "find_next_zero_bit()" by a factror of 64 (and then
> adds a *small* expense factor on top of that), I suspect it should be
> "good enough" even for your nasty case.
> 
> What do you think? Willing to try the above approach (with any
> inevitable bug-fixes) and see how it compares?
> 
> Obviously in addition to any fixes to my pseudo-code above you'd need
> to add the allocations for the new "full_fds_bits" etc, but I think it
> should be easy to make the full_fds_bit allocation be *part* of the
> "open_fds" allocation, so you wouldn't need a new allocation in
> alloc_fdtable(). We already do that whole "use a single allocation" to
> combine open_fds with close_on_exec into one single allocation.

I'll finish testing what I've got and post it; it costs 3 extra pointers
in the files_struct and a bit fatter bitmap allocation (less than 0.2%
extra).  All the arguments regarding the unmodified binaries apply, of
course, and so far it looks fairly compact...


* Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
  2015-10-30 21:02                                           ` Al Viro
@ 2015-10-30 21:23                                             ` Linus Torvalds
  2015-10-30 21:50                                               ` Linus Torvalds
  0 siblings, 1 reply; 32+ messages in thread
From: Linus Torvalds @ 2015-10-30 21:23 UTC (permalink / raw)
  To: Al Viro
  Cc: Eric Dumazet, David Miller, Stephen Hemminger,
	Network Development, David Howells, linux-fsdevel

On Fri, Oct 30, 2015 at 2:02 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> Your variant has 1:64 ratio; obviously better than now, but we can actually
> do 1:bits-per-cacheline quite easily.

Ok, but in that case you end up needing a counter for each cacheline
too (to count how many bits, in order to know when to say "cacheline
is entirely full").

Because otherwise you'll end up having to check the whole cacheline
(rather than check just one word, or check the "bit counter") when you
set a bit.

So I suspect a "bitmap of full words" ends up being much simpler.  But
hey, if you can make your more complex thing work, go right ahead.

               Linus


* Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
  2015-10-30 21:23                                             ` Linus Torvalds
@ 2015-10-30 21:50                                               ` Linus Torvalds
  2015-10-30 22:33                                                 ` Al Viro
  2015-10-31  1:07                                                 ` Eric Dumazet
  0 siblings, 2 replies; 32+ messages in thread
From: Linus Torvalds @ 2015-10-30 21:50 UTC (permalink / raw)
  To: Al Viro
  Cc: Eric Dumazet, David Miller, Stephen Hemminger,
	Network Development, David Howells, linux-fsdevel

[-- Attachment #1: Type: text/plain, Size: 1508 bytes --]

On Fri, Oct 30, 2015 at 2:23 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Fri, Oct 30, 2015 at 2:02 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
>>
>> Your variant has 1:64 ratio; obviously better than now, but we can actually
>> do 1:bits-per-cacheline quite easily.
>
> Ok, but in that case you end up needing a counter for each cacheline
> too (to count how many bits, in order to know when to say "cacheline
> is entirely full").

So here's a largely untested version of my "one bit per word"
approach. It seems to work, but looking at it, I'm unhappy with a few
things:

 - using kmalloc() for the .full_fds_bits[] array is simple, but
disgusting, since 99% of all programs just have a single word.

   I know I talked about just adding the allocation to the same one
that allocates the bitmaps themselves, but I got lazy and didn't do
it. Especially since that code seems to try fairly hard to make the
allocations nice powers of two, according to the comments. That may
actually matter from an allocation standpoint.

 - Maybe we could just use that "full_fds_bits_init" field for when a
single word is sufficient, and avoid the kmalloc that way?

Anyway. This is a pretty simple patch, and I actually think that we
could just get rid of the "next_fd" logic entirely with this. That
would make this *patch* be more complicated, but it would make the
resulting *code* be simpler.

Hmm? Want to play with this? Eric, what does this do to your test-case?

                        Linus

[-- Attachment #2: patch.diff --]
[-- Type: text/plain, Size: 4966 bytes --]

 fs/file.c               | 49 ++++++++++++++++++++++++++++++++++++++++++++++---
 include/linux/fdtable.h |  2 ++
 2 files changed, 48 insertions(+), 3 deletions(-)

diff --git a/fs/file.c b/fs/file.c
index 6c672ad329e9..0b89a97cabb5 100644
--- a/fs/file.c
+++ b/fs/file.c
@@ -48,6 +48,7 @@ static void __free_fdtable(struct fdtable *fdt)
 {
 	kvfree(fdt->fd);
 	kvfree(fdt->open_fds);
+	kfree(fdt->full_fds_bits);
 	kfree(fdt);
 }
 
@@ -56,6 +57,9 @@ static void free_fdtable_rcu(struct rcu_head *rcu)
 	__free_fdtable(container_of(rcu, struct fdtable, rcu));
 }
 
+#define BITBIT_NR(nr)	BITS_TO_LONGS(BITS_TO_LONGS(nr))
+#define BITBIT_SIZE(nr)	(BITBIT_NR(nr) * sizeof(long))
+
 /*
  * Expand the fdset in the files_struct.  Called with the files spinlock
  * held for write.
@@ -77,6 +81,11 @@ static void copy_fdtable(struct fdtable *nfdt, struct fdtable *ofdt)
 	memset((char *)(nfdt->open_fds) + cpy, 0, set);
 	memcpy(nfdt->close_on_exec, ofdt->close_on_exec, cpy);
 	memset((char *)(nfdt->close_on_exec) + cpy, 0, set);
+
+	cpy = BITBIT_SIZE(ofdt->max_fds);
+	set = BITBIT_SIZE(nfdt->max_fds) - cpy;
+	memcpy(nfdt->full_fds_bits, ofdt->full_fds_bits, cpy);
+	memset(cpy+(char *)nfdt->full_fds_bits, 0, set);
 }
 
 static struct fdtable * alloc_fdtable(unsigned int nr)
@@ -122,8 +131,21 @@ static struct fdtable * alloc_fdtable(unsigned int nr)
 	data += nr / BITS_PER_BYTE;
 	fdt->close_on_exec = data;
 
+	/*
+	 * The "bitmap of bitmaps" has a bit set for each word in
+	 * the open_fds array that is full. We initialize it to
+	 * zero both at allocation and at copying, because it is
+	 * important that it never have a bit set for a non-full
+	 * word, but it doesn't have to be exact otherwise.
+	 */
+	fdt->full_fds_bits = kzalloc(BITBIT_SIZE(nr), GFP_KERNEL);
+	if (!fdt->full_fds_bits)
+		goto out_nofull;
+
 	return fdt;
 
+out_nofull:
+	kvfree(fdt->open_fds);
 out_arr:
 	kvfree(fdt->fd);
 out_fdt:
@@ -229,14 +251,18 @@ static inline void __clear_close_on_exec(int fd, struct fdtable *fdt)
 	__clear_bit(fd, fdt->close_on_exec);
 }
 
-static inline void __set_open_fd(int fd, struct fdtable *fdt)
+static inline void __set_open_fd(unsigned int fd, struct fdtable *fdt)
 {
 	__set_bit(fd, fdt->open_fds);
+	fd /= BITS_PER_LONG;
+	if (!~fdt->open_fds[fd])
+		__set_bit(fd, fdt->full_fds_bits);
 }
 
-static inline void __clear_open_fd(int fd, struct fdtable *fdt)
+static inline void __clear_open_fd(unsigned int fd, struct fdtable *fdt)
 {
 	__clear_bit(fd, fdt->open_fds);
+	__clear_bit(fd / BITS_PER_LONG, fdt->full_fds_bits);
 }
 
 static int count_open_files(struct fdtable *fdt)
@@ -280,6 +306,7 @@ struct files_struct *dup_fd(struct files_struct *oldf, int *errorp)
 	new_fdt->max_fds = NR_OPEN_DEFAULT;
 	new_fdt->close_on_exec = newf->close_on_exec_init;
 	new_fdt->open_fds = newf->open_fds_init;
+	new_fdt->full_fds_bits = newf->full_fds_bits_init;
 	new_fdt->fd = &newf->fd_array[0];
 
 	spin_lock(&oldf->file_lock);
@@ -323,6 +350,7 @@ struct files_struct *dup_fd(struct files_struct *oldf, int *errorp)
 
 	memcpy(new_fdt->open_fds, old_fdt->open_fds, open_files / 8);
 	memcpy(new_fdt->close_on_exec, old_fdt->close_on_exec, open_files / 8);
+	memcpy(new_fdt->full_fds_bits, old_fdt->full_fds_bits, BITBIT_SIZE(open_files));
 
 	for (i = open_files; i != 0; i--) {
 		struct file *f = *old_fds++;
@@ -454,10 +482,25 @@ struct files_struct init_files = {
 		.fd		= &init_files.fd_array[0],
 		.close_on_exec	= init_files.close_on_exec_init,
 		.open_fds	= init_files.open_fds_init,
+		.full_fds_bits	= init_files.full_fds_bits_init,
 	},
 	.file_lock	= __SPIN_LOCK_UNLOCKED(init_files.file_lock),
 };
 
+static unsigned long find_next_fd(struct fdtable *fdt, unsigned long start)
+{
+	unsigned long maxfd = fdt->max_fds;
+	unsigned long maxbit = maxfd / BITS_PER_LONG;
+	unsigned long bitbit = start / BITS_PER_LONG;
+
+	bitbit = find_next_zero_bit(fdt->full_fds_bits, maxbit, bitbit) * BITS_PER_LONG;
+	if (bitbit > maxfd)
+		return maxfd;
+	if (bitbit > start)
+		start = bitbit;
+	return find_next_zero_bit(fdt->open_fds, maxfd, start);
+}
+
 /*
  * allocate a file descriptor, mark it busy.
  */
@@ -476,7 +519,7 @@ repeat:
 		fd = files->next_fd;
 
 	if (fd < fdt->max_fds)
-		fd = find_next_zero_bit(fdt->open_fds, fdt->max_fds, fd);
+		fd = find_next_fd(fdt, fd);
 
 	/*
 	 * N.B. For clone tasks sharing a files structure, this test
diff --git a/include/linux/fdtable.h b/include/linux/fdtable.h
index 674e3e226465..5295535b60c6 100644
--- a/include/linux/fdtable.h
+++ b/include/linux/fdtable.h
@@ -26,6 +26,7 @@ struct fdtable {
 	struct file __rcu **fd;      /* current fd array */
 	unsigned long *close_on_exec;
 	unsigned long *open_fds;
+	unsigned long *full_fds_bits;
 	struct rcu_head rcu;
 };
 
@@ -59,6 +60,7 @@ struct files_struct {
 	int next_fd;
 	unsigned long close_on_exec_init[1];
 	unsigned long open_fds_init[1];
+	unsigned long full_fds_bits_init[1];
 	struct file __rcu * fd_array[NR_OPEN_DEFAULT];
 };
 


* Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
  2015-10-30 21:50                                               ` Linus Torvalds
@ 2015-10-30 22:33                                                 ` Al Viro
  2015-10-30 23:52                                                   ` Linus Torvalds
  2015-10-31  1:07                                                 ` Eric Dumazet
  1 sibling, 1 reply; 32+ messages in thread
From: Al Viro @ 2015-10-30 22:33 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Eric Dumazet, David Miller, Stephen Hemminger,
	Network Development, David Howells, linux-fsdevel

On Fri, Oct 30, 2015 at 02:50:46PM -0700, Linus Torvalds wrote:

> Anyway. This is a pretty simple patch, and I actually think that we
> could just get rid of the "next_fd" logic entirely with this. That
> would make this *patch* be more complicated, but it would make the
> resulting *code* be simpler.

Dropping next_fd would screw you in case of strictly sequential allocations...

Your point re power-of-two allocations is well-taken, but then I'm not
sure that kzalloc() is good enough here.  Look: you have a bit for every
64 descriptors, i.e. byte per 512.  On 10M case Eric had been talking
about that'll yield 32Kb worth of your secondary bitmap.  It's right on
the edge of the range where vmalloc() becomes attractive; for something
bigger it gets even worse...

Currently we go for vmalloc (on bitmaps) once we are past 128K descriptors
(two bitmaps packed together => 256Kbit = 32Kb).  kmalloc() is very
sensitive to size being a power of two, but IIRC vmalloc() isn't...


* Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
  2015-10-30 22:33                                                 ` Al Viro
@ 2015-10-30 23:52                                                   ` Linus Torvalds
  2015-10-31  0:09                                                     ` Al Viro
                                                                       ` (2 more replies)
  0 siblings, 3 replies; 32+ messages in thread
From: Linus Torvalds @ 2015-10-30 23:52 UTC (permalink / raw)
  To: Al Viro
  Cc: Eric Dumazet, David Miller, Stephen Hemminger,
	Network Development, David Howells, linux-fsdevel

[-- Attachment #1: Type: text/plain, Size: 2106 bytes --]

On Fri, Oct 30, 2015 at 3:33 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
> On Fri, Oct 30, 2015 at 02:50:46PM -0700, Linus Torvalds wrote:
>
>> Anyway. This is a pretty simple patch, and I actually think that we
>> could just get rid of the "next_fd" logic entirely with this. That
>> would make this *patch* be more complicated, but it would make the
>> resulting *code* be simpler.
>
> Dropping next_fd would screw you in case of strictly sequential allocations...

I don't think it would matter in real life, since I don't really think
you have lots of fd's with strictly sequential behavior.

That said, the trivial "open lots of fds" benchmark would show it, so
I guess we can just keep it. The next_fd logic is not expensive or
complex, after all.

> Your point re power-of-two allocations is well-taken, but then I'm not
> sure that kzalloc() is good enough here.

Attached is an updated patch that just uses the regular bitmap
allocator and extends it to also have the bitmap of bitmaps. It
actually simplifies the patch, so I guess it's better this way.

Anyway, I've tested it all a bit more, and for a trivial worst-case
stress program that explicitly kills the next_fd logic by doing

    for (i = 0; i < 1000000; i++) {
        close(3);
        dup2(0,3);
        if (dup(0) < 0)
            break;
    }

it takes it down from roughly 10s to 0.2s. So the patch is quite
noticeable on that kind of pattern.

NOTE! You'll obviously need to increase your limits to actually be
able to do the above with lots of file descriptors.

I ran Eric's test-program too, and find_next_zero_bit() dropped to a
fraction of a percent. It's not entirely gone, but it's down in the
noise.

I really suspect this patch is "good enough" in reality, and I would
*much* rather do something like this than add a new non-POSIX flag
that people have to update their binaries for. I agree with Eric that
*some* people will do so, but it's still the wrong thing to do. Let's
just make performance with the normal semantics be good enough that we
don't need to play odd special games.

Eric?

                       Linus

[-- Attachment #2: patch.diff --]
[-- Type: text/plain, Size: 4590 bytes --]

 fs/file.c               | 39 +++++++++++++++++++++++++++++++++++----
 include/linux/fdtable.h |  2 ++
 2 files changed, 37 insertions(+), 4 deletions(-)

diff --git a/fs/file.c b/fs/file.c
index 6c672ad329e9..6f6eb2b03af5 100644
--- a/fs/file.c
+++ b/fs/file.c
@@ -56,6 +56,9 @@ static void free_fdtable_rcu(struct rcu_head *rcu)
 	__free_fdtable(container_of(rcu, struct fdtable, rcu));
 }
 
+#define BITBIT_NR(nr)	BITS_TO_LONGS(BITS_TO_LONGS(nr))
+#define BITBIT_SIZE(nr)	(BITBIT_NR(nr) * sizeof(long))
+
 /*
  * Expand the fdset in the files_struct.  Called with the files spinlock
  * held for write.
@@ -77,6 +80,11 @@ static void copy_fdtable(struct fdtable *nfdt, struct fdtable *ofdt)
 	memset((char *)(nfdt->open_fds) + cpy, 0, set);
 	memcpy(nfdt->close_on_exec, ofdt->close_on_exec, cpy);
 	memset((char *)(nfdt->close_on_exec) + cpy, 0, set);
+
+	cpy = BITBIT_SIZE(ofdt->max_fds);
+	set = BITBIT_SIZE(nfdt->max_fds) - cpy;
+	memcpy(nfdt->full_fds_bits, ofdt->full_fds_bits, cpy);
+	memset(cpy+(char *)nfdt->full_fds_bits, 0, set);
 }
 
 static struct fdtable * alloc_fdtable(unsigned int nr)
@@ -115,12 +123,14 @@ static struct fdtable * alloc_fdtable(unsigned int nr)
 	fdt->fd = data;
 
 	data = alloc_fdmem(max_t(size_t,
-				 2 * nr / BITS_PER_BYTE, L1_CACHE_BYTES));
+				 2 * nr / BITS_PER_BYTE + BITBIT_SIZE(nr), L1_CACHE_BYTES));
 	if (!data)
 		goto out_arr;
 	fdt->open_fds = data;
 	data += nr / BITS_PER_BYTE;
 	fdt->close_on_exec = data;
+	data += nr / BITS_PER_BYTE;
+	fdt->full_fds_bits = data;
 
 	return fdt;
 
@@ -229,14 +239,18 @@ static inline void __clear_close_on_exec(int fd, struct fdtable *fdt)
 	__clear_bit(fd, fdt->close_on_exec);
 }
 
-static inline void __set_open_fd(int fd, struct fdtable *fdt)
+static inline void __set_open_fd(unsigned int fd, struct fdtable *fdt)
 {
 	__set_bit(fd, fdt->open_fds);
+	fd /= BITS_PER_LONG;
+	if (!~fdt->open_fds[fd])
+		__set_bit(fd, fdt->full_fds_bits);
 }
 
-static inline void __clear_open_fd(int fd, struct fdtable *fdt)
+static inline void __clear_open_fd(unsigned int fd, struct fdtable *fdt)
 {
 	__clear_bit(fd, fdt->open_fds);
+	__clear_bit(fd / BITS_PER_LONG, fdt->full_fds_bits);
 }
 
 static int count_open_files(struct fdtable *fdt)
@@ -280,6 +294,7 @@ struct files_struct *dup_fd(struct files_struct *oldf, int *errorp)
 	new_fdt->max_fds = NR_OPEN_DEFAULT;
 	new_fdt->close_on_exec = newf->close_on_exec_init;
 	new_fdt->open_fds = newf->open_fds_init;
+	new_fdt->full_fds_bits = newf->full_fds_bits_init;
 	new_fdt->fd = &newf->fd_array[0];
 
 	spin_lock(&oldf->file_lock);
@@ -323,6 +338,7 @@ struct files_struct *dup_fd(struct files_struct *oldf, int *errorp)
 
 	memcpy(new_fdt->open_fds, old_fdt->open_fds, open_files / 8);
 	memcpy(new_fdt->close_on_exec, old_fdt->close_on_exec, open_files / 8);
+	memcpy(new_fdt->full_fds_bits, old_fdt->full_fds_bits, BITBIT_SIZE(open_files));
 
 	for (i = open_files; i != 0; i--) {
 		struct file *f = *old_fds++;
@@ -454,10 +470,25 @@ struct files_struct init_files = {
 		.fd		= &init_files.fd_array[0],
 		.close_on_exec	= init_files.close_on_exec_init,
 		.open_fds	= init_files.open_fds_init,
+		.full_fds_bits	= init_files.full_fds_bits_init,
 	},
 	.file_lock	= __SPIN_LOCK_UNLOCKED(init_files.file_lock),
 };
 
+static unsigned long find_next_fd(struct fdtable *fdt, unsigned long start)
+{
+	unsigned long maxfd = fdt->max_fds;
+	unsigned long maxbit = maxfd / BITS_PER_LONG;
+	unsigned long bitbit = start / BITS_PER_LONG;
+
+	bitbit = find_next_zero_bit(fdt->full_fds_bits, maxbit, bitbit) * BITS_PER_LONG;
+	if (bitbit > maxfd)
+		return maxfd;
+	if (bitbit > start)
+		start = bitbit;
+	return find_next_zero_bit(fdt->open_fds, maxfd, start);
+}
+
 /*
  * allocate a file descriptor, mark it busy.
  */
@@ -476,7 +507,7 @@ repeat:
 		fd = files->next_fd;
 
 	if (fd < fdt->max_fds)
-		fd = find_next_zero_bit(fdt->open_fds, fdt->max_fds, fd);
+		fd = find_next_fd(fdt, fd);
 
 	/*
 	 * N.B. For clone tasks sharing a files structure, this test
diff --git a/include/linux/fdtable.h b/include/linux/fdtable.h
index 674e3e226465..5295535b60c6 100644
--- a/include/linux/fdtable.h
+++ b/include/linux/fdtable.h
@@ -26,6 +26,7 @@ struct fdtable {
 	struct file __rcu **fd;      /* current fd array */
 	unsigned long *close_on_exec;
 	unsigned long *open_fds;
+	unsigned long *full_fds_bits;
 	struct rcu_head rcu;
 };
 
@@ -59,6 +60,7 @@ struct files_struct {
 	int next_fd;
 	unsigned long close_on_exec_init[1];
 	unsigned long open_fds_init[1];
+	unsigned long full_fds_bits_init[1];
 	struct file __rcu * fd_array[NR_OPEN_DEFAULT];
 };
 

^ permalink raw reply related	[flat|nested] 32+ messages in thread

* Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
  2015-10-30 23:52                                                   ` Linus Torvalds
@ 2015-10-31  0:09                                                     ` Al Viro
  2015-10-31 15:59                                                     ` Eric Dumazet
  2015-10-31 19:34                                                     ` Al Viro
  2 siblings, 0 replies; 32+ messages in thread
From: Al Viro @ 2015-10-31  0:09 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Eric Dumazet, David Miller, Stephen Hemminger,
	Network Development, David Howells, linux-fsdevel

On Fri, Oct 30, 2015 at 04:52:41PM -0700, Linus Torvalds wrote:

> I really suspect this patch is "good enough" in reality, and I would
> *much* rather do something like this than add a new non-POSIX flag
> that people have to update their binaries for. I agree with Eric that
> *some* people will do so, but it's still the wrong thing to do. Let's
> just make performance with the normal semantics be good enough that we
> don't need to play odd special games.
> 
> Eric?

IIRC, at least a part of what Eric used to complain about was that in
seriously multithreaded processes doing a lot of e.g. socket(2) we end up
with a lot of bouncing of the cacheline containing the first free bits in
the bitmap.  But looking at the whole thing, I really wonder whether the
tons of threads asking for random bytes won't see at least as much cacheline
bouncing while getting said bytes, so I'm not sure that rationale has survived.

PS: this problem obviously exists in Linus' variant as well as in mine;
the question is whether Eric's approach manages to avoid it in the first place.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
  2015-10-30 21:50                                               ` Linus Torvalds
  2015-10-30 22:33                                                 ` Al Viro
@ 2015-10-31  1:07                                                 ` Eric Dumazet
  1 sibling, 0 replies; 32+ messages in thread
From: Eric Dumazet @ 2015-10-31  1:07 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Al Viro, David Miller, Stephen Hemminger, Network Development,
	David Howells, linux-fsdevel

On Fri, 2015-10-30 at 14:50 -0700, Linus Torvalds wrote:
> On Fri, Oct 30, 2015 at 2:23 PM, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> > On Fri, Oct 30, 2015 at 2:02 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
> >>
> >> Your variant has 1:64 ratio; obviously better than now, but we can actually
> >> do 1:bits-per-cacheline quite easily.
> >
> > Ok, but in that case you end up needing a counter for each cacheline
> > too (to count how many bits, in order to know when to say "cacheline
> > is entirely full").
> 
> So here's a largely untested version of my "one bit per word"
> approach. It seems to work, but looking at it, I'm unhappy with a few
> things:
> 
>  - using kmalloc() for the .full_fds_bits[] array is simple, but
> disgusting, since 99% of all programs just have a single word.
> 
>    I know I talked about just adding the allocation to the same one
> that allocates the bitmaps themselves, but I got lazy and didn't do
> it. Especially since that code seems to try fairly hard to make the
> allocations nice powers of two, according to the comments. That may
> actually matter from an allocation standpoint.
> 
>  - Maybe we could just use that "full_fds_bits_init" field for when a
> single word is sufficient, and avoid the kmalloc that way?

At least make sure the allocation uses a full cache line, so that multiple
processes do not share the same cache line for this possibly hot field:

fdt->full_fds_bits = kzalloc(max_t(size_t,
				   L1_CACHE_BYTES,
				   BITBIT_SIZE(nr)),
			     GFP_KERNEL);

> 
> Anyway. This is a pretty simple patch, and I actually think that we
> could just get rid of the "next_fd" logic entirely with this. That
> would make this *patch* be more complicated, but it would make the
> resulting *code* be simpler.
> 
> Hmm? Want to play with this? Eric, what does this do to your test-case?

Excellent results so far, Linus - a 500 % increase, thanks a lot!

Tested using 16 threads, 8 on Socket0, 8 on Socket1 

Before patch :

# ulimit -n 12000000
# taskset ff0ff ./opensock -t 16 -n 10000000 -l 10   
count=10000000 (check/increase ulimit -n)
total = 636870

After patch :

taskset ff0ff ./opensock -t 16 -n 10000000 -l 10
count=10000000 (check/increase ulimit -n)
total = 3845134   (6 times better)

Your patch out-performs the O_FD_FASTALLOC one on this particular test
by ~ 9 % :

taskset ff0ff ./opensock -t 16 -n 10000000 -l 10 -f
count=10000000 (check/increase ulimit -n)
total = 3505252

If I raise the thread count to 48, FAST_ALLOC wins by ~5 % (3752087 instead
of 3546666).

Oh, and 48 threads without any patches : 383027

-> the program runs one order of magnitude faster, congrats!

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
  2015-10-30 23:52                                                   ` Linus Torvalds
  2015-10-31  0:09                                                     ` Al Viro
@ 2015-10-31 15:59                                                     ` Eric Dumazet
  2015-10-31 19:34                                                     ` Al Viro
  2 siblings, 0 replies; 32+ messages in thread
From: Eric Dumazet @ 2015-10-31 15:59 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Al Viro, David Miller, Stephen Hemminger, Network Development,
	David Howells, linux-fsdevel

On Fri, 2015-10-30 at 16:52 -0700, Linus Torvalds wrote:
> sequential allocations...
> 
> I don't think it would matter in real life, since I don't really think
> you have lots of fd's with strictly sequential behavior.
> 
> That said, the trivial "open lots of fds" benchmark would show it, so
> I guess we can just keep it. The next_fd logic is not expensive or
> complex, after all.

+1


> Attached is an updated patch that just uses the regular bitmap
> allocator and extends it to also have the bitmap of bitmaps. It
> actually simplifies the patch, so I guess it's better this way.
> 
> Anyway, I've tested it all a bit more, and for a trivial worst-case
> stress program that explicitly kills the next_fd logic by doing
> 
>     for (i = 0; i < 1000000; i++) {
>         close(3);
>         dup2(0,3);
>         if (dup(0))
>             break;
>     }
> 
> it takes it down from roughly 10s to 0.2s. So the patch is quite
> noticeable on that kind of pattern.
> 
> NOTE! You'll obviously need to increase your limits to actually be
> able to do the above with lots of file descriptors.
> 
> I ran Eric's test-program too, and find_next_zero_bit() dropped to a
> fraction of a percent. It's not entirely gone, but it's down in the
> noise.
> 
> I really suspect this patch is "good enough" in reality, and I would
> *much* rather do something like this than add a new non-POSIX flag
> that people have to update their binaries for. I agree with Eric that
> *some* people will do so, but it's still the wrong thing to do. Let's
> just make performance with the normal semantics be good enough that we
> don't need to play odd special games.
> 
> Eric?

I absolutely agree a generic solution is far better, especially when
its performance is on par.

Tested-by: Eric Dumazet <edumazet@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>

Note that a non-POSIX flag (or a per-thread personality hint)
would still allow the kernel to do proper NUMA affinity placement: say
the fd_array and bitmaps are split across the 2 nodes (or more, but most
servers nowadays really have 2 sockets).

Then, at fd allocation time, we can prefer to pick an fd for which the
memory holding the various bits and the file pointer lives on the local node.

This speeds up subsequent fd system calls in programs that constantly
blow away CPU caches, saving QPI transactions.

Thanks a lot Linus.

lpaa24:~# taskset ff0ff ./opensock -t 16 -n 10000000 -l 10
count=10000000 (check/increase ulimit -n)
total = 3992764

lpaa24:~# ./opensock -t 48 -n 10000000 -l 10
count=10000000 (check/increase ulimit -n)
total = 3545249

Profile with 16 threads :

    69.55%  opensock          [.] memset                       
    11.83%  [kernel]          [k] queued_spin_lock_slowpath    
     1.91%  [kernel]          [k] _find_next_bit.part.0        
     1.68%  [kernel]          [k] _raw_spin_lock               
     0.99%  [kernel]          [k] kmem_cache_alloc             
     0.99%  [kernel]          [k] memset_erms                  
     0.95%  [kernel]          [k] get_empty_filp               
     0.82%  [kernel]          [k] __close_fd                   
     0.73%  [kernel]          [k] __alloc_fd                   
     0.65%  [kernel]          [k] sk_alloc                     
     0.63%  opensock          [.] child_function               
     0.56%  [kernel]          [k] fput                         
     0.35%  [kernel]          [k] sock_alloc                   
     0.31%  [kernel]          [k] kmem_cache_free              
     0.31%  [kernel]          [k] inode_init_always            
     0.28%  [kernel]          [k] d_set_d_op                   
     0.27%  [kernel]          [k] entry_SYSCALL_64_after_swapgs

Profile with 48 threads :

    57.92%  [kernel]          [k] queued_spin_lock_slowpath    
    32.14%  opensock          [.] memset                       
     0.81%  [kernel]          [k] _find_next_bit.part.0        
     0.51%  [kernel]          [k] _raw_spin_lock               
     0.45%  [kernel]          [k] kmem_cache_alloc             
     0.38%  [kernel]          [k] kmem_cache_free              
     0.34%  [kernel]          [k] __close_fd                   
     0.32%  [kernel]          [k] memset_erms                  
     0.25%  [kernel]          [k] __alloc_fd                   
     0.24%  [kernel]          [k] get_empty_filp               
     0.23%  opensock          [.] child_function               
     0.18%  [kernel]          [k] __d_alloc                    
     0.17%  [kernel]          [k] inode_init_always            
     0.16%  [kernel]          [k] sock_alloc                   
     0.16%  [kernel]          [k] del_timer                    
     0.15%  [kernel]          [k] entry_SYSCALL_64_after_swapgs
     0.15%  perf              [.] 0x000000000004d924           
     0.15%  [kernel]          [k] tcp_close                    

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
  2015-10-30 23:52                                                   ` Linus Torvalds
  2015-10-31  0:09                                                     ` Al Viro
  2015-10-31 15:59                                                     ` Eric Dumazet
@ 2015-10-31 19:34                                                     ` Al Viro
  2015-10-31 19:54                                                       ` Linus Torvalds
  2 siblings, 1 reply; 32+ messages in thread
From: Al Viro @ 2015-10-31 19:34 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Eric Dumazet, David Miller, Stephen Hemminger,
	Network Development, David Howells, linux-fsdevel

On Fri, Oct 30, 2015 at 04:52:41PM -0700, Linus Torvalds wrote:
> I really suspect this patch is "good enough" in reality, and I would
> *much* rather do something like this than add a new non-POSIX flag
> that people have to update their binaries for. I agree with Eric that
> *some* people will do so, but it's still the wrong thing to do. Let's
> just make performance with the normal semantics be good enough that we
> don't need to play odd special games.
> 
> Eric?

... and here's the current variant of mine.  FWIW, it seems to survive
LTP and xfstests + obvious "let's torture the allocator".  On the
"million dups" test it seems to be about 25% faster than the one Linus
had posted, and at ten million - about 80%.  On opensock, results seem to be
about 20% better than with the variant Linus has posted, but I'm not sure
the testbox behaves anywhere near what's expected, so I'd appreciate it if
you'd give it a spin on your setups.

It obviously needs saner comments, tuning, etc.  BTW, another obvious
low-hanging fruit with this data structure is count_open_files() (and
that goes for the 1:64 bitmap Linus uses) - dup2(0, 10000000); close(10000000);
fork(); and count_open_files() is chewing through the damn bitmap from
about 16M down to the low tens.  While holding ->files_lock, at that...
I'm not saying it's critical, and it's definitely followup-patch fodder
in either approach, but it's easy enough to do.

diff --git a/fs/file.c b/fs/file.c
index 6c672ad..fa43cbe 100644
--- a/fs/file.c
+++ b/fs/file.c
@@ -30,6 +30,8 @@ int sysctl_nr_open_min = BITS_PER_LONG;
 int sysctl_nr_open_max = __const_max(INT_MAX, ~(size_t)0/sizeof(void *)) &
 			 -BITS_PER_LONG;
 
+#define BITS_PER_CHUNK 512
+
 static void *alloc_fdmem(size_t size)
 {
 	/*
@@ -46,8 +48,10 @@ static void *alloc_fdmem(size_t size)
 
 static void __free_fdtable(struct fdtable *fdt)
 {
+	int i;
 	kvfree(fdt->fd);
-	kvfree(fdt->open_fds);
+	for (i = 0; i <= 3; i++)
+		kvfree(fdt->bitmaps[i]);
 	kfree(fdt);
 }
 
@@ -62,7 +66,7 @@ static void free_fdtable_rcu(struct rcu_head *rcu)
  */
 static void copy_fdtable(struct fdtable *nfdt, struct fdtable *ofdt)
 {
-	unsigned int cpy, set;
+	unsigned int cpy, set, to, from, level, n;
 
 	BUG_ON(nfdt->max_fds < ofdt->max_fds);
 
@@ -71,18 +75,53 @@ static void copy_fdtable(struct fdtable *nfdt, struct fdtable *ofdt)
 	memcpy(nfdt->fd, ofdt->fd, cpy);
 	memset((char *)(nfdt->fd) + cpy, 0, set);
 
-	cpy = ofdt->max_fds / BITS_PER_BYTE;
-	set = (nfdt->max_fds - ofdt->max_fds) / BITS_PER_BYTE;
-	memcpy(nfdt->open_fds, ofdt->open_fds, cpy);
-	memset((char *)(nfdt->open_fds) + cpy, 0, set);
+	cpy = ofdt->max_fds / 8;
+	set = (nfdt->max_fds - ofdt->max_fds) / 8;
 	memcpy(nfdt->close_on_exec, ofdt->close_on_exec, cpy);
 	memset((char *)(nfdt->close_on_exec) + cpy, 0, set);
+	if (likely(!nfdt->bitmaps[1])) {
+		// flat to flat
+		memcpy(nfdt->bitmaps[0], ofdt->bitmaps[0], cpy);
+		memset((char *)(nfdt->bitmaps[0]) + cpy, 0, set);
+		return;
+	}
+	to = round_up(nfdt->max_fds, BITS_PER_CHUNK);
+	set = (to - ofdt->max_fds) / 8;
+	// copy and pad the primary
+	memcpy(nfdt->bitmaps[0], ofdt->bitmaps[0], ofdt->max_fds / 8);
+	memset((char *)nfdt->bitmaps[0] + ofdt->max_fds / 8, 0, set);
+	// copy and pad the old secondaries
+	from = round_up(ofdt->max_fds, BITS_PER_CHUNK);
+	for (level = 1; level <= 3; level++) {
+		if (!ofdt->bitmaps[level])
+			break;
+		to = round_up(to / BITS_PER_CHUNK, BITS_PER_CHUNK);
+		from = round_up(from / BITS_PER_CHUNK, BITS_PER_CHUNK);
+		memcpy(nfdt->bitmaps[level], ofdt->bitmaps[level], from / 8);
+		memset((char *)nfdt->bitmaps[level] + from / 8, 0, (to - from) / 8);
+	}
+	// zero the new ones (if any)
+	for (n = level; n <= 3; n++) {
+		if (!nfdt->bitmaps[n])
+			break;
+		to = round_up(to / BITS_PER_CHUNK, BITS_PER_CHUNK);
+		memset(nfdt->bitmaps[n], 0, to / 8);
+	}
+	// and maybe adjust bit 0 in the first new one.
+	if (unlikely(n != level)) {
+		unsigned long *p = nfdt->bitmaps[level - 1];
+		for (n = 0; n < BITS_PER_CHUNK / BITS_PER_LONG; n++)
+			if (~p[n])
+				return;
+		__set_bit(0, nfdt->bitmaps[level]);
+	}
 }
 
 static struct fdtable * alloc_fdtable(unsigned int nr)
 {
 	struct fdtable *fdt;
 	void *data;
+	int level = 0;
 
 	/*
 	 * Figure out how many fds we actually want to support in this fdtable.
@@ -114,16 +153,28 @@ static struct fdtable * alloc_fdtable(unsigned int nr)
 		goto out_fdt;
 	fdt->fd = data;
 
+	if (nr > BITS_PER_CHUNK)
+		nr = round_up(nr, BITS_PER_CHUNK);
 	data = alloc_fdmem(max_t(size_t,
 				 2 * nr / BITS_PER_BYTE, L1_CACHE_BYTES));
 	if (!data)
 		goto out_arr;
-	fdt->open_fds = data;
+	fdt->bitmaps[0] = data;
 	data += nr / BITS_PER_BYTE;
 	fdt->close_on_exec = data;
-
+	fdt->bitmaps[1] = fdt->bitmaps[2] = fdt->bitmaps[3] = NULL;
+	while (unlikely(nr > BITS_PER_CHUNK)) {
+		nr = round_up(nr / BITS_PER_CHUNK, BITS_PER_CHUNK);
+		data = alloc_fdmem(nr);
+		if (!data)
+			goto out_bitmaps;
+		fdt->bitmaps[++level] = data;
+	}
 	return fdt;
 
+out_bitmaps:
+	while (level >= 0)
+		kvfree(fdt->bitmaps[level--]);
 out_arr:
 	kvfree(fdt->fd);
 out_fdt:
@@ -229,14 +280,47 @@ static inline void __clear_close_on_exec(int fd, struct fdtable *fdt)
 	__clear_bit(fd, fdt->close_on_exec);
 }
 
+static bool set_and_check(unsigned long *start, unsigned n)
+{
+	int i;
+	start += (n & -BITS_PER_CHUNK) / BITS_PER_LONG;
+	n %= BITS_PER_CHUNK;
+	__set_bit(n, start);
+	for (i = 0; i < BITS_PER_CHUNK / BITS_PER_LONG; i++)
+		if (~*start++)
+			return true;
+	return false;
+}
+
 static inline void __set_open_fd(int fd, struct fdtable *fdt)
 {
-	__set_bit(fd, fdt->open_fds);
+	int level;
+	for (level = 0; ; level++, fd /= BITS_PER_CHUNK) {
+		if (level == 3 || !fdt->bitmaps[level + 1]) {
+			__set_bit(fd, fdt->bitmaps[level]);
+			break;
+		}
+		if (likely(set_and_check(fdt->bitmaps[level], fd)))
+			break;
+	}
 }
 
 static inline void __clear_open_fd(int fd, struct fdtable *fdt)
 {
-	__clear_bit(fd, fdt->open_fds);
+	int level;
+	unsigned long *p = fdt->bitmaps[0] + fd / BITS_PER_LONG, v;
+	v = *p;
+	__clear_bit(fd % BITS_PER_LONG, p);
+	if (~v)		// quick test to avoid looking at other cachelines
+		return;
+	for (level = 1; level <= 3; level++) {
+		if (!fdt->bitmaps[level])
+			break;
+		fd /= BITS_PER_CHUNK;
+		if (!test_bit(fd, fdt->bitmaps[level]))
+			break;
+		__clear_bit(fd, fdt->bitmaps[level]);
+	}
 }
 
 static int count_open_files(struct fdtable *fdt)
@@ -246,7 +330,7 @@ static int count_open_files(struct fdtable *fdt)
 
 	/* Find the last open fd */
 	for (i = size / BITS_PER_LONG; i > 0; ) {
-		if (fdt->open_fds[--i])
+		if (fdt->bitmaps[0][--i])
 			break;
 	}
 	i = (i + 1) * BITS_PER_LONG;
@@ -262,7 +346,7 @@ struct files_struct *dup_fd(struct files_struct *oldf, int *errorp)
 {
 	struct files_struct *newf;
 	struct file **old_fds, **new_fds;
-	int open_files, size, i;
+	int open_files, size, i, n;
 	struct fdtable *old_fdt, *new_fdt;
 
 	*errorp = -ENOMEM;
@@ -279,7 +363,8 @@ struct files_struct *dup_fd(struct files_struct *oldf, int *errorp)
 	new_fdt = &newf->fdtab;
 	new_fdt->max_fds = NR_OPEN_DEFAULT;
 	new_fdt->close_on_exec = newf->close_on_exec_init;
-	new_fdt->open_fds = newf->open_fds_init;
+	new_fdt->bitmaps[0] = newf->open_fds_init;
+	new_fdt->bitmaps[1] = new_fdt->bitmaps[2] = new_fdt->bitmaps[3] = NULL;
 	new_fdt->fd = &newf->fd_array[0];
 
 	spin_lock(&oldf->file_lock);
@@ -321,8 +406,17 @@ struct files_struct *dup_fd(struct files_struct *oldf, int *errorp)
 	old_fds = old_fdt->fd;
 	new_fds = new_fdt->fd;
 
-	memcpy(new_fdt->open_fds, old_fdt->open_fds, open_files / 8);
 	memcpy(new_fdt->close_on_exec, old_fdt->close_on_exec, open_files / 8);
+	memcpy(new_fdt->bitmaps[0], old_fdt->bitmaps[0], open_files / 8);
+
+	n = round_up(open_files, BITS_PER_CHUNK);
+	for (i = 1; i <= 3; i++) {
+		if (!new_fdt->bitmaps[i])
+			break;
+		n /= BITS_PER_CHUNK;
+		n = round_up(n, BITS_PER_CHUNK);
+		memcpy(new_fdt->bitmaps[i], old_fdt->bitmaps[i], n / 8);
+	}
 
 	for (i = open_files; i != 0; i--) {
 		struct file *f = *old_fds++;
@@ -351,10 +445,24 @@ struct files_struct *dup_fd(struct files_struct *oldf, int *errorp)
 		int left = (new_fdt->max_fds - open_files) / 8;
 		int start = open_files / BITS_PER_LONG;
 
-		memset(&new_fdt->open_fds[start], 0, left);
-		memset(&new_fdt->close_on_exec[start], 0, left);
+		memset(new_fdt->close_on_exec + start, 0, left);
+		if (likely(!new_fdt->bitmaps[1])) {
+			memset(new_fdt->bitmaps[0] + start, 0, left);
+			goto done;
+		}
+		start = round_up(open_files, BITS_PER_CHUNK);
+		n = round_up(new_fdt->max_fds, BITS_PER_CHUNK);
+		for (i = 0 ; i <= 3; i++) {
+			char *p = (void *)new_fdt->bitmaps[i];
+			if (!p)
+				break;
+			n = round_up(n / BITS_PER_CHUNK, BITS_PER_CHUNK);
+			start = round_up(start / BITS_PER_CHUNK, BITS_PER_CHUNK);
+			memset(p + start / 8, 0, (n - start) / 8);
+		}
 	}
 
+done:
 	rcu_assign_pointer(newf->fdt, new_fdt);
 
 	return newf;
@@ -380,7 +488,7 @@ static struct fdtable *close_files(struct files_struct * files)
 		i = j * BITS_PER_LONG;
 		if (i >= fdt->max_fds)
 			break;
-		set = fdt->open_fds[j++];
+		set = fdt->bitmaps[0][j++];
 		while (set) {
 			if (set & 1) {
 				struct file * file = xchg(&fdt->fd[i], NULL);
@@ -453,70 +561,146 @@ struct files_struct init_files = {
 		.max_fds	= NR_OPEN_DEFAULT,
 		.fd		= &init_files.fd_array[0],
 		.close_on_exec	= init_files.close_on_exec_init,
-		.open_fds	= init_files.open_fds_init,
+		.bitmaps[0]	= init_files.open_fds_init,
 	},
 	.file_lock	= __SPIN_LOCK_UNLOCKED(init_files.file_lock),
 };
 
-/*
- * allocate a file descriptor, mark it busy.
- */
+/* search for the next zero bit in cacheline */
+static unsigned scan(unsigned long *start, unsigned size, unsigned from,
+		int check_zeroes)
+{
+	unsigned long *p = start + from / BITS_PER_LONG, *q = p, *end;
+	unsigned bit = from % BITS_PER_LONG, res;
+	unsigned long v = *p, w = v + (1UL<<bit);
+
+	start += (from & -BITS_PER_CHUNK) / BITS_PER_LONG;
+	end = start + size / BITS_PER_LONG;
+
+	if (unlikely(!(w & ~v))) {
+		while (likely(++q < end)) {
+			v = *q;
+			w = v + 1;
+			if (likely(w))
+				goto got_it;
+		}
+		return size;	// not in this chunk
+	}
+got_it:
+	res = __ffs(w & ~v);		// log2, really - it's a power of 2
+	res += (q - start) * BITS_PER_LONG;
+	if (!check_zeroes)
+		return res;
+	if (likely(~(v | w)))		// would zeroes remain in *q?
+		return res;
+	if (p == q || !bit)		// was *p fully checked?
+		p--;
+	while (++q < end)		// any zeros in the tail?
+		if (likely(~*q))
+			return res;
+	if (unlikely(check_zeroes > 1))
+		for (q = start; q <= p; q++)
+			if (~*q)
+				return res;
+	return res | (1U<<31);
+}
+
 int __alloc_fd(struct files_struct *files,
 	       unsigned start, unsigned end, unsigned flags)
 {
-	unsigned int fd;
+	unsigned int fd, base;
 	int error;
 	struct fdtable *fdt;
-
+	unsigned count, level, n;
+	int summary;
 	spin_lock(&files->file_lock);
 repeat:
 	fdt = files_fdtable(files);
+	count = fdt->max_fds;
+	summary = 2;
 	fd = start;
-	if (fd < files->next_fd)
+	if (likely(fd <= files->next_fd)) {
 		fd = files->next_fd;
-
-	if (fd < fdt->max_fds)
-		fd = find_next_zero_bit(fdt->open_fds, fdt->max_fds, fd);
-
-	/*
-	 * N.B. For clone tasks sharing a files structure, this test
-	 * will limit the total number of files that can be opened.
-	 */
-	error = -EMFILE;
-	if (fd >= end)
-		goto out;
-
-	error = expand_files(files, fd);
-	if (error < 0)
+		summary = 1;
+	}
+	base = fd;
+	if (unlikely(base >= count))
+		goto expand2;
+	if (likely(!fdt->bitmaps[1])) {
+		base = scan(fdt->bitmaps[0], count, base, 0);
+		if (unlikely(base == count))
+			goto expand;
+		if (unlikely(base >= end)) {
+			error = -EMFILE;
+			goto out;
+		}
+		fd = base;
+		__set_bit(fd, fdt->bitmaps[0]);
+		goto found;
+	}
+	n = scan(fdt->bitmaps[0], BITS_PER_CHUNK, base, summary);
+	base &= -BITS_PER_CHUNK;
+	base += n & ~(1U<<31);
+	if (unlikely(n == BITS_PER_CHUNK)) {
+		int bits[3];
+		level = 0;
+		do {
+			bits[level] = count;
+			count = DIV_ROUND_UP(count, BITS_PER_CHUNK);
+			base /= BITS_PER_CHUNK;
+			n = scan(fdt->bitmaps[++level], BITS_PER_CHUNK, base, 0);
+			base &= -BITS_PER_CHUNK;
+			base += n;
+			if (unlikely(base >= count))
+				goto expand;
+		} while (unlikely(n == BITS_PER_CHUNK));
+		while (level--) {
+			base *= BITS_PER_CHUNK;
+			n = scan(fdt->bitmaps[level], BITS_PER_CHUNK, base, !level);
+			if (WARN_ON(n == BITS_PER_CHUNK)) {
+				error = -EINVAL;
+				goto out;
+			}
+			base += n & ~(1U<<31);
+			if (unlikely(base >= bits[level]))
+				goto expand;
+		}
+	}
+	if (unlikely(base >= end)) {
+		error = -EMFILE;
 		goto out;
-
-	/*
-	 * If we needed to expand the fs array we
-	 * might have blocked - try again.
-	 */
-	if (error)
-		goto repeat;
-
-	if (start <= files->next_fd)
+	}
+	fd = base;
+	__set_bit(fd, fdt->bitmaps[0]);
+	if (unlikely(n & (1U << 31))) {
+		for (level = 1; ; level++) {
+			base /= BITS_PER_CHUNK;
+			if (level == 3 || !fdt->bitmaps[level + 1]) {
+				__set_bit(base, fdt->bitmaps[level]);
+				break;
+			}
+			if (likely(set_and_check(fdt->bitmaps[level], base)))
+				break;
+		}
+	}
+found:
+	if (summary == 1)
 		files->next_fd = fd + 1;
-
-	__set_open_fd(fd, fdt);
 	if (flags & O_CLOEXEC)
 		__set_close_on_exec(fd, fdt);
 	else
 		__clear_close_on_exec(fd, fdt);
 	error = fd;
-#if 1
-	/* Sanity check */
-	if (rcu_access_pointer(fdt->fd[fd]) != NULL) {
-		printk(KERN_WARNING "alloc_fd: slot %d not NULL!\n", fd);
-		rcu_assign_pointer(fdt->fd[fd], NULL);
-	}
-#endif
-
 out:
 	spin_unlock(&files->file_lock);
 	return error;
+expand:
+	base = fdt->max_fds;
+expand2:
+	error = expand_files(files, base);
+	if (error < 0)
+		goto out;
+	goto repeat;
 }
 
 static int alloc_fd(unsigned start, unsigned flags)
@@ -809,7 +993,8 @@ __releases(&files->file_lock)
 		goto Ebusy;
 	get_file(file);
 	rcu_assign_pointer(fdt->fd[fd], file);
-	__set_open_fd(fd, fdt);
+	if (!tofree)
+		__set_open_fd(fd, fdt);
 	if (flags & O_CLOEXEC)
 		__set_close_on_exec(fd, fdt);
 	else
diff --git a/fs/select.c b/fs/select.c
index 0155473..670f542 100644
--- a/fs/select.c
+++ b/fs/select.c
@@ -350,7 +350,7 @@ static int max_select_fd(unsigned long n, fd_set_bits *fds)
 	set = ~(~0UL << (n & (BITS_PER_LONG-1)));
 	n /= BITS_PER_LONG;
 	fdt = files_fdtable(current->files);
-	open_fds = fdt->open_fds + n;
+	open_fds = fdt->bitmaps[0] + n;
 	max = 0;
 	if (set) {
 		set &= BITS(fds, n);
diff --git a/include/linux/fdtable.h b/include/linux/fdtable.h
index 674e3e2..6ef5274 100644
--- a/include/linux/fdtable.h
+++ b/include/linux/fdtable.h
@@ -25,7 +25,7 @@ struct fdtable {
 	unsigned int max_fds;
 	struct file __rcu **fd;      /* current fd array */
 	unsigned long *close_on_exec;
-	unsigned long *open_fds;
+	unsigned long *bitmaps[4];
 	struct rcu_head rcu;
 };
 
@@ -36,7 +36,7 @@ static inline bool close_on_exec(int fd, const struct fdtable *fdt)
 
 static inline bool fd_is_open(int fd, const struct fdtable *fdt)
 {
-	return test_bit(fd, fdt->open_fds);
+	return test_bit(fd, fdt->bitmaps[0]);
 }
 
 /*

^ permalink raw reply related	[flat|nested] 32+ messages in thread

* Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
  2015-10-31 19:34                                                     ` Al Viro
@ 2015-10-31 19:54                                                       ` Linus Torvalds
  2015-10-31 20:29                                                         ` Al Viro
  2015-10-31 20:45                                                         ` Eric Dumazet
  0 siblings, 2 replies; 32+ messages in thread
From: Linus Torvalds @ 2015-10-31 19:54 UTC (permalink / raw)
  To: Al Viro
  Cc: Eric Dumazet, David Miller, Stephen Hemminger,
	Network Development, David Howells, linux-fsdevel

On Sat, Oct 31, 2015 at 12:34 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> ... and here's the current variant of mine.

Ugh. I really liked how simple mine ended up being. Yours is definitely not.

And based on the profiles from Eric, finding the fd is no longer the
problem even with my simpler patch. The problem ends up being the
contention on the file_lock spinlock.

Eric, I assume that's not "expand_fdtable", since your test-program
seems to expand the fd array at the beginning. So it's presumably all
from the __alloc_fd() use, but we should double-check.. Eric, can you
do a callgraph profile and see which caller is the hottest?

                 Linus

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
  2015-10-31 19:54                                                       ` Linus Torvalds
@ 2015-10-31 20:29                                                         ` Al Viro
  2015-11-02  0:24                                                           ` Al Viro
  2015-10-31 20:45                                                         ` Eric Dumazet
  1 sibling, 1 reply; 32+ messages in thread
From: Al Viro @ 2015-10-31 20:29 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Eric Dumazet, David Miller, Stephen Hemminger,
	Network Development, David Howells, linux-fsdevel

On Sat, Oct 31, 2015 at 12:54:50PM -0700, Linus Torvalds wrote:
> On Sat, Oct 31, 2015 at 12:34 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
> >
> > ... and here's the current variant of mine.
> 
> Ugh. I really liked how simple mine ended up being. Yours is definitely not.

Note that it's not the final variant - just something that should be
testable.  There are all kinds of things that still need cleaning/simplifying
in there - e.g. scan() is definitely more complex than needed (if nothing
else, the "small bitmap" case is simply find_next_zero_bit(), and the
rest all have size equal to full cacheline; moreover, I'd overdone the
"... and check if there are other zero bits left" thing - its callers
used to use that a lot, and with the exception of two of them it was
absolutely worthless).  So it ended up more generic than necessary and
I'm going to trim that crap down.

It's still going to end up more complex than yours, obviously, but not as
badly as it is now.  I'm not sure if the speedup will be worth the
extra complexity, thus asking for testing...  On _some_ loads it is
considerably faster (at least by a factor of 5), but whether those loads
resemble anything that occurs on real systems...

BTW, a considerable amount of unpleasantness is due to ragged-right-end kind
of problems - set /proc/sys/fs/nr-open to something other than a power of
two and a whole lot of fun issues start happening.  I went for "if there
are secondary bitmaps at all, pad all bitmaps to a multiple of cacheline",
which at least somewhat mitigates that mess; hell knows, there might be
a clever way to sidestep it entirely...
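The chunked-bitmap scheme being discussed can be illustrated with a small userspace toy (invented names, simplified to two levels; this is an illustration of the idea, not the kernel patch): the primary bitmap has one bit per fd, and a secondary bitmap has one bit per 512-bit primary chunk, set when that chunk is completely full, so a search can skip full cachelines without scanning them.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define BITS_PER_WORD	(8 * sizeof(unsigned long))
#define BITS_PER_CHUNK	512
#define WORDS_PER_CHUNK	(BITS_PER_CHUNK / BITS_PER_WORD)

struct two_level {
	unsigned long primary[4 * WORDS_PER_CHUNK];	/* 2048 fds */
	unsigned long secondary;			/* 1 bit per chunk */
};

static bool chunk_full(const unsigned long *p)
{
	for (unsigned i = 0; i < WORDS_PER_CHUNK; i++)
		if (~p[i])
			return false;
	return true;
}

static int alloc_bit(struct two_level *t)
{
	for (unsigned c = 0; c < 4; c++) {
		unsigned long *p;

		if (t->secondary & (1UL << c))	/* chunk known full: skip */
			continue;
		p = t->primary + c * WORDS_PER_CHUNK;
		for (unsigned w = 0; w < WORDS_PER_CHUNK; w++) {
			unsigned b;

			if (!~p[w])		/* this word is full */
				continue;
			b = __builtin_ctzl(~p[w]);
			p[w] |= 1UL << b;
			if (chunk_full(p))	/* propagate "full" upward */
				t->secondary |= 1UL << c;
			return c * BITS_PER_CHUNK + w * BITS_PER_WORD + b;
		}
	}
	return -1;				/* table full */
}

static void free_bit(struct two_level *t, int fd)
{
	unsigned c = fd / BITS_PER_CHUNK;

	t->primary[fd / BITS_PER_WORD] &= ~(1UL << (fd % BITS_PER_WORD));
	/* unconditionally mark the chunk as having a free bit again */
	t->secondary &= ~(1UL << c);
}
```

The padding issue mentioned above shows up here too: if the primary bitmap is not a whole number of chunks, the last chunk's tail bits would have to be pre-set (or special-cased) so a partial chunk is never mistaken for having free slots.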


* Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
  2015-10-31 19:54                                                       ` Linus Torvalds
  2015-10-31 20:29                                                         ` Al Viro
@ 2015-10-31 20:45                                                         ` Eric Dumazet
  2015-10-31 21:23                                                           ` Linus Torvalds
  1 sibling, 1 reply; 32+ messages in thread
From: Eric Dumazet @ 2015-10-31 20:45 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Al Viro, David Miller, Stephen Hemminger, Network Development,
	David Howells, linux-fsdevel

On Sat, 2015-10-31 at 12:54 -0700, Linus Torvalds wrote:
> On Sat, Oct 31, 2015 at 12:34 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
> >
> > ... and here's the current variant of mine.
> 
> Ugh. I really liked how simple mine ended up being. Yours is definitely not.
> 
> And based on the profiles from Eric, finding the fd is no longer the
> problem even with my simpler patch. The problem ends up being the
> contention on the file_lock spinlock.
> 
> Eric, I assume that's not "expand_fdtable", since your test-program
> seems to expand the fd array at the beginning. So it's presumably all
> from the __alloc_fd() use, but we should double-check. Eric, can you
> do a callgraph profile and see which caller is the hottest?

Sure: profile taken while the test runs using 16 threads (since this is
probably not too biased a micro benchmark...)

# hostname : lpaa24
# os release : 4.3.0-smp-DEV
# perf version : 3.12.0-6-GOOGLE
# arch : x86_64
# nrcpus online : 48
# nrcpus avail : 48
# cpudesc : Intel(R) Xeon(R) CPU E5-2696 v2 @ 2.50GHz
# cpuid : GenuineIntel,6,62,4
# total memory : 264126320 kB
# cmdline : /usr/bin/perf record -a -g sleep 4 
# event : name = cycles, type = 0, config = 0x0, config1 = 0x0, config2 = 0x0, excl_usr = 0, excl_kern = 0, excl_host = 0, excl_guest = 1, precise_ip = 0, att
# CPU_TOPOLOGY info available, use -I to display
# NUMA_TOPOLOGY info available, use -I to display
# pmu mappings: cpu = 4, msr = 38, uncore_cbox_10 = 35, uncore_cbox_11 = 36, software = 1, power = 7, uncore_irp = 8, uncore_pcu = 37, tracepoint = 2, uncore_
# Samples: 260K of event 'cycles'
# Event count (approx.): 196742182232
#
# Overhead        Command        Shared Object                                                                                                                
# ........  .............  ...................  ..............................................................................................................
#
    67.15%       opensock  opensock             [.] memset                                                                                                    
                 |
                 --- memset

    13.84%       opensock  [kernel.kallsyms]    [k] queued_spin_lock_slowpath                                                                                 
                 |
                 --- queued_spin_lock_slowpath
                    |          
                    |--99.97%-- _raw_spin_lock
                    |          |          
                    |          |--53.03%-- __close_fd
                    |          |          sys_close
                    |          |          entry_SYSCALL_64_fastpath
                    |          |          __libc_close
                    |          |          |          
                    |          |           --100.00%-- 0x0
                    |          |          
                    |          |--46.83%-- __alloc_fd
                    |          |          get_unused_fd_flags
                    |          |          sock_map_fd
                    |          |          sys_socket
                    |          |          entry_SYSCALL_64_fastpath
                    |          |          __socket
                    |          |          |          
                    |          |           --100.00%-- 0x0
                    |           --0.13%-- [...]
                     --0.03%-- [...]

     1.84%       opensock  [kernel.kallsyms]    [k] _find_next_bit.part.0                                                                                     
                 |
                 --- _find_next_bit.part.0
                    |          
                    |--65.97%-- find_next_zero_bit
                    |          __alloc_fd
                    |          get_unused_fd_flags
                    |          sock_map_fd
                    |          sys_socket
                    |          entry_SYSCALL_64_fastpath
                    |          __socket
                    |          
                    |--34.01%-- __alloc_fd
                    |          get_unused_fd_flags
                    |          sock_map_fd
                    |          sys_socket
                    |          entry_SYSCALL_64_fastpath
                    |          __socket
                    |          |          
                    |           --100.00%-- 0x0
                     --0.02%-- [...]

     1.59%       opensock  [kernel.kallsyms]    [k] _raw_spin_lock                                                                                            
                 |
                 --- _raw_spin_lock
                    |          
                    |--28.78%-- get_unused_fd_flags
                    |          sock_map_fd
                    |          sys_socket
                    |          entry_SYSCALL_64_fastpath
                    |          __socket
                    |          
                    |--26.53%-- sys_close
                    |          entry_SYSCALL_64_fastpath
                    |          __libc_close
                    |          
                    |--13.95%-- cache_alloc_refill
                    |          |          
                    |          |--99.48%-- kmem_cache_alloc
                    |          |          |          
                    |          |          |--81.20%-- sk_prot_alloc
                    |          |          |          sk_alloc
                    |          |          |          inet_create
                    |          |          |          __sock_create
                    |          |          |          sock_create
                    |          |          |          sys_socket
                    |          |          |          entry_SYSCALL_64_fastpath
                    |          |          |          __socket
                    |          |          |          
                    |          |          |--8.43%-- sock_alloc_inode
                    |          |          |          alloc_inode
                    |          |          |          new_inode_pseudo
                    |          |          |          sock_alloc
                    |          |          |          __sock_create
                    |          |          |          sock_create
                    |          |          |          sys_socket
                    |          |          |          entry_SYSCALL_64_fastpath
                    |          |          |          __socket
                    |          |          |          
                    |          |          |--5.80%-- __d_alloc
                    |          |          |          d_alloc_pseudo
                    |          |          |          sock_alloc_file
                    |          |          |          sock_map_fd
                    |          |          |          sys_socket




* Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
  2015-10-31 20:45                                                         ` Eric Dumazet
@ 2015-10-31 21:23                                                           ` Linus Torvalds
  2015-10-31 21:51                                                             ` Al Viro
  2015-10-31 22:34                                                             ` Eric Dumazet
  0 siblings, 2 replies; 32+ messages in thread
From: Linus Torvalds @ 2015-10-31 21:23 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Al Viro, David Miller, Stephen Hemminger, Network Development,
	David Howells, linux-fsdevel

On Sat, Oct 31, 2015 at 1:45 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>     13.84%       opensock  [kernel.kallsyms]    [k] queued_spin_lock_slowpath
>                  |
>                  --- queued_spin_lock_slowpath
>                     |
>                     |--99.97%-- _raw_spin_lock
>                     |          |
>                     |          |--53.03%-- __close_fd
>                     |          |
>                     |          |--46.83%-- __alloc_fd

Interesting. "__close_fd" actually looks more expensive than
allocation. They presumably get called equally often, so it's probably
some cache effect.

__close_fd() doesn't do anything even remotely interesting as far as I
can tell, but it strikes me that we probably take a *lot* of cache
misses on the stupid "close-on-exec" flags, which are probably always
zero anyway.

Mind testing something really stupid, and making the __clear_bit() in
__clear_close_on_exec() conditional, something like this:

     static inline void __clear_close_on_exec(int fd, struct fdtable *fdt)
     {
    -       __clear_bit(fd, fdt->close_on_exec);
    +       if (test_bit(fd, fdt->close_on_exec))
    +               __clear_bit(fd, fdt->close_on_exec);
     }

and see if it makes a difference.

This is the kind of thing that a single-threaded (or even
single-socket) test will never actually show, because it caches well
enough. But for two sockets, I could imagine the unnecessary dirtying
of cachelines and ping-pong being noticeable.
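The effect being described is observable in a userspace toy (illustrative names; the kernel obviously does not count stores, it just avoids dirtying a shared cacheline): both variants leave the bitmap in the same state, but the conditional one only writes the word when the bit was actually set, which is the rare case for close-on-exec flags.

```c
#include <assert.h>

#define BITS_PER_WORD (8 * sizeof(unsigned long))

/* Illustrative store counter so the difference is checkable. */
static unsigned long stores;

static void clear_bit_uncond(unsigned fd, unsigned long *map)
{
	map[fd / BITS_PER_WORD] &= ~(1UL << (fd % BITS_PER_WORD));
	stores++;			/* always writes the word back */
}

static void clear_bit_cond(unsigned fd, unsigned long *map)
{
	unsigned long mask = 1UL << (fd % BITS_PER_WORD);

	/* common case: bit already clear, so stay read-only */
	if (map[fd / BITS_PER_WORD] & mask) {
		map[fd / BITS_PER_WORD] &= ~mask;
		stores++;
	}
}
```

With two threads closing fds in different words of the same cacheline, the unconditional version keeps forcing the line into modified state on both CPUs; the conditional one keeps it shared.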

The other stuff we probably can't do all that much about. Unless we
decide to go for some complicated lockless optimistic file descriptor
allocation scheme with retry-on-failure instead of locks. Which I'm
sure is possible, but I'm equally sure is painful.

                    Linus


* Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
  2015-10-31 21:23                                                           ` Linus Torvalds
@ 2015-10-31 21:51                                                             ` Al Viro
  2015-10-31 22:34                                                             ` Eric Dumazet
  1 sibling, 0 replies; 32+ messages in thread
From: Al Viro @ 2015-10-31 21:51 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Eric Dumazet, David Miller, Stephen Hemminger,
	Network Development, David Howells, linux-fsdevel

On Sat, Oct 31, 2015 at 02:23:31PM -0700, Linus Torvalds wrote:

> The other stuff we probably can't do all that much about. Unless we
> decide to go for some complicated lockless optimistic file descriptor
> allocation scheme with retry-on-failure instead of locks. Which I'm
> sure is possible, but I'm equally sure is painful.

The interesting part is dup2() - we'd have to do something like
	serialize against other dup2
	was_claimed = atomically set and test bit in bitmap
	if was_claimed
		tofree = fdt->fd[fd];
		if (!tofree)
			fail with EBUSY
	install into ->fd[...]
	end of critical area
in there; __alloc_fd() could be made retry-on-failure, but I don't see
how to cope with dup2 vs. dup2 without an explicit exclusion.
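The allocation side of that lockless scheme can be sketched in userspace with C11 atomics (invented names, not kernel code): atomically claim the first zero bit via compare-exchange and retry on collision. As noted above, this only covers the __alloc_fd() path; dup2() vs. dup2() would still need explicit serialization on top of it.

```c
#include <assert.h>
#include <stdatomic.h>

#define WORD_BITS (8 * sizeof(unsigned long))

static int try_claim(_Atomic unsigned long *map, unsigned nwords)
{
	for (unsigned w = 0; w < nwords; w++) {
		unsigned long old = atomic_load(&map[w]);

		while (~old) {	/* some zero bit left in this word */
			unsigned b = __builtin_ctzl(~old);

			/* on CAS failure 'old' is refreshed and we retry
			 * within the same word */
			if (atomic_compare_exchange_weak(&map[w], &old,
							 old | (1UL << b)))
				return (int)(w * WORD_BITS + b);
		}
	}
	return -1;	/* all words full: caller would expand the table */
}
```

The failure mode Linus alludes to is visible here: two racing claimers pick the same bit, one loses the CAS and pays a retry, and under heavy contention the retries can cost as much as the lock they replaced.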


* Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
  2015-10-31 21:23                                                           ` Linus Torvalds
  2015-10-31 21:51                                                             ` Al Viro
@ 2015-10-31 22:34                                                             ` Eric Dumazet
  1 sibling, 0 replies; 32+ messages in thread
From: Eric Dumazet @ 2015-10-31 22:34 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Al Viro, David Miller, Stephen Hemminger, Network Development,
	David Howells, linux-fsdevel

On Sat, 2015-10-31 at 14:23 -0700, Linus Torvalds wrote:

> Mind testing something really stupid, and making the __clear_bit() in
> __clear_close_on_exec() conditional, something like this:
> 
>      static inline void __clear_close_on_exec(int fd, struct fdtable *fdt)
>      {
>     -       __clear_bit(fd, fdt->close_on_exec);
>     +       if (test_bit(fd, fdt->close_on_exec))
>     +               __clear_bit(fd, fdt->close_on_exec);
>      }
> 
> and see if it makes a difference.

It does ;)

About 4 % qps increase

3 runs : 
lpaa24:~# taskset ff0ff ./opensock -t 16 -n 10000000 -l 10
total = 4176651
total = 4178012
total = 4105226

instead of :
total = 3910620
total = 3874567
total = 3971028
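The opensock source isn't shown in this thread, but a minimal sketch of that kind of stressor (assumed shape, arbitrary thread/iteration counts) is just N threads each doing socket()/close() in a tight loop against the shared files_struct, so __alloc_fd() and __close_fd() fight over file_lock:

```c
#include <assert.h>
#include <pthread.h>
#include <sys/socket.h>
#include <unistd.h>

struct worker { long iters; long done; };

static void *loop(void *arg)
{
	struct worker *w = arg;

	for (long i = 0; i < w->iters; i++) {
		int fd = socket(AF_INET, SOCK_STREAM, 0);

		if (fd < 0)
			break;		/* EMFILE etc.: stop early */
		close(fd);
		w->done++;
	}
	return NULL;
}

static long run_stress(int nthreads, long iters)
{
	pthread_t tid[64];
	struct worker w[64];
	long total = 0;

	if (nthreads > 64)
		nthreads = 64;
	for (int i = 0; i < nthreads; i++) {
		w[i] = (struct worker){ .iters = iters, .done = 0 };
		pthread_create(&tid[i], NULL, loop, &w[i]);
	}
	for (int i = 0; i < nthreads; i++) {
		pthread_join(tid[i], NULL);
		total += w[i].done;
	}
	return total;	/* completed open/close pairs, i.e. the "total =" qps figure */
}
```

Because every thread frees and immediately reallocates the lowest fds, this load concentrates all allocations on the same bitmap words, which is roughly the worst case for both the spinlock and the cacheline ping-pong discussed above.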

Perf profile :

    69.12%       opensock  opensock             [.] memset                                       
                 |
                 --- memset

    12.37%       opensock  [kernel.kallsyms]    [k] queued_spin_lock_slowpath                    
                 |
                 --- queued_spin_lock_slowpath
                    |          
                    |--99.99%-- _raw_spin_lock
                    |          |          
                    |          |--51.99%-- __close_fd
                    |          |          sys_close
                    |          |          entry_SYSCALL_64_fastpath
                    |          |          __libc_close
                    |          |          |          
                    |          |           --100.00%-- 0x0
                    |          |          
                    |          |--47.79%-- __alloc_fd
                    |          |          get_unused_fd_flags
                    |          |          sock_map_fd
                    |          |          sys_socket
                    |          |          entry_SYSCALL_64_fastpath
                    |          |          __socket
                    |          |          |          
                    |          |           --100.00%-- 0x0
                    |           --0.21%-- [...]
                     --0.01%-- [...]

     1.92%       opensock  [kernel.kallsyms]    [k] _find_next_bit.part.0                        
                 |
                 --- _find_next_bit.part.0
                    |          
                    |--66.93%-- find_next_zero_bit
                    |          __alloc_fd
                    |          get_unused_fd_flags
                    |          sock_map_fd
                    |          sys_socket
                    |          entry_SYSCALL_64_fastpath
                    |          __socket
                    |          
                     --33.07%-- __alloc_fd
                               get_unused_fd_flags
                               sock_map_fd
                               sys_socket
                               entry_SYSCALL_64_fastpath
                               __socket
                               |          
                                --100.00%-- 0x0

     1.63%       opensock  [kernel.kallsyms]    [k] _raw_spin_lock                               
                 |
                 --- _raw_spin_lock
                    |          
                    |--28.66%-- get_unused_fd_flags
                    |          sock_map_fd




* Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
  2015-10-31 20:29                                                         ` Al Viro
@ 2015-11-02  0:24                                                           ` Al Viro
  2015-11-02  0:59                                                             ` Linus Torvalds
  2015-11-02  2:14                                                             ` Eric Dumazet
  0 siblings, 2 replies; 32+ messages in thread
From: Al Viro @ 2015-11-02  0:24 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Eric Dumazet, David Miller, Stephen Hemminger,
	Network Development, David Howells, linux-fsdevel

On Sat, Oct 31, 2015 at 08:29:29PM +0000, Al Viro wrote:
> On Sat, Oct 31, 2015 at 12:54:50PM -0700, Linus Torvalds wrote:
> > On Sat, Oct 31, 2015 at 12:34 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
> > >
> > > ... and here's the current variant of mine.
> > 
> > Ugh. I really liked how simple mine ended up being. Yours is definitely not.
> 
> Note that it's not the final variant - just something that should be
> testable.  There are all kinds of things that still need cleaning/simplifying
> in there - e.g. scan() is definitely more complex than needed (if nothing
> else, the "small bitmap" case is simply find_next_zero_bit(), and the
> rest all have size equal to full cacheline; moreover, I'd overdone the
> "... and check if there are other zero bits left" thing - its callers
> used to use that a lot, and with the exception of two of them it was
> absolutely worthless).  So it ended up more generic than necessary and
> I'm going to trim that crap down.
> 
> It's still going to end up more complex than yours, obviously, but not as
> badly as it is now.  I'm not sure if the speedup will be worth the
> extra complexity, thus asking for testing...  On _some_ loads it is
> considerably faster (at least by a factor of 5), but whether those loads
> resemble anything that occurs on real systems...

This ought to be a bit cleaner.  Eric, could you test the variant below on your
setup?

diff --git a/fs/file.c b/fs/file.c
index 6c672ad..0144920 100644
--- a/fs/file.c
+++ b/fs/file.c
@@ -30,6 +30,9 @@ int sysctl_nr_open_min = BITS_PER_LONG;
 int sysctl_nr_open_max = __const_max(INT_MAX, ~(size_t)0/sizeof(void *)) &
 			 -BITS_PER_LONG;
 
+#define BITS_PER_CHUNK 512
+#define BYTES_PER_CHUNK (BITS_PER_CHUNK / 8)
+
 static void *alloc_fdmem(size_t size)
 {
 	/*
@@ -46,8 +49,10 @@ static void *alloc_fdmem(size_t size)
 
 static void __free_fdtable(struct fdtable *fdt)
 {
+	int i;
 	kvfree(fdt->fd);
-	kvfree(fdt->open_fds);
+	for (i = 0; i <= 3; i++)
+		kvfree(fdt->bitmaps[i]);
 	kfree(fdt);
 }
 
@@ -56,13 +61,23 @@ static void free_fdtable_rcu(struct rcu_head *rcu)
 	__free_fdtable(container_of(rcu, struct fdtable, rcu));
 }
 
+static inline bool cacheline_full(unsigned long *p)
+{
+	int n;
+
+	for (n = 0; n < BITS_PER_CHUNK / BITS_PER_LONG; n++)
+		if (likely(~p[n]))
+			return false;
+	return true;
+}
+
 /*
  * Expand the fdset in the files_struct.  Called with the files spinlock
  * held for write.
  */
 static void copy_fdtable(struct fdtable *nfdt, struct fdtable *ofdt)
 {
-	unsigned int cpy, set;
+	unsigned int cpy, set, to, from, level, n;
 
 	BUG_ON(nfdt->max_fds < ofdt->max_fds);
 
@@ -71,18 +86,48 @@ static void copy_fdtable(struct fdtable *nfdt, struct fdtable *ofdt)
 	memcpy(nfdt->fd, ofdt->fd, cpy);
 	memset((char *)(nfdt->fd) + cpy, 0, set);
 
-	cpy = ofdt->max_fds / BITS_PER_BYTE;
-	set = (nfdt->max_fds - ofdt->max_fds) / BITS_PER_BYTE;
-	memcpy(nfdt->open_fds, ofdt->open_fds, cpy);
-	memset((char *)(nfdt->open_fds) + cpy, 0, set);
+	cpy = ofdt->max_fds / 8;
+	set = (nfdt->max_fds - ofdt->max_fds) / 8;
 	memcpy(nfdt->close_on_exec, ofdt->close_on_exec, cpy);
 	memset((char *)(nfdt->close_on_exec) + cpy, 0, set);
+	if (likely(!nfdt->bitmaps[1])) {
+		// flat to flat
+		memcpy(nfdt->bitmaps[0], ofdt->bitmaps[0], cpy);
+		memset((char *)(nfdt->bitmaps[0]) + cpy, 0, set);
+		return;
+	}
+	to = round_up(nfdt->max_fds, BITS_PER_CHUNK);
+	set = (to - ofdt->max_fds) / 8;
+	// copy and pad the primary
+	memcpy(nfdt->bitmaps[0], ofdt->bitmaps[0], ofdt->max_fds / 8);
+	memset((char *)nfdt->bitmaps[0] + ofdt->max_fds / 8, 0, set);
+	// copy and pad the old secondaries
+	from = round_up(ofdt->max_fds, BITS_PER_CHUNK);
+	for (level = 1; level <= 3; level++) {
+		if (!ofdt->bitmaps[level])
+			break;
+		to = round_up(to / BITS_PER_CHUNK, BITS_PER_CHUNK);
+		from = round_up(from / BITS_PER_CHUNK, BITS_PER_CHUNK);
+		memcpy(nfdt->bitmaps[level], ofdt->bitmaps[level], from / 8);
+		memset((char *)nfdt->bitmaps[level] + from / 8, 0, (to - from) / 8);
+	}
+	// zero the new ones (if any)
+	for (n = level; n <= 3; n++) {
+		if (!nfdt->bitmaps[n])
+			break;
+		to = round_up(to / BITS_PER_CHUNK, BITS_PER_CHUNK);
+		memset(nfdt->bitmaps[n], 0, to / 8);
+	}
+	// and maybe adjust bit 0 in the first new one.
+	if (unlikely(n != level && cacheline_full(nfdt->bitmaps[level - 1])))
+		__set_bit(0, nfdt->bitmaps[level]);
 }
 
 static struct fdtable * alloc_fdtable(unsigned int nr)
 {
 	struct fdtable *fdt;
 	void *data;
+	int level = 0;
 
 	/*
 	 * Figure out how many fds we actually want to support in this fdtable.
@@ -114,16 +159,28 @@ static struct fdtable * alloc_fdtable(unsigned int nr)
 		goto out_fdt;
 	fdt->fd = data;
 
+	if (nr > BITS_PER_CHUNK)
+		nr = round_up(nr, BITS_PER_CHUNK);
 	data = alloc_fdmem(max_t(size_t,
 				 2 * nr / BITS_PER_BYTE, L1_CACHE_BYTES));
 	if (!data)
 		goto out_arr;
-	fdt->open_fds = data;
+	fdt->bitmaps[0] = data;
 	data += nr / BITS_PER_BYTE;
 	fdt->close_on_exec = data;
-
+	fdt->bitmaps[1] = fdt->bitmaps[2] = fdt->bitmaps[3] = NULL;
+	while (unlikely(nr > BITS_PER_CHUNK)) {
+		nr = round_up(nr / BITS_PER_CHUNK, BITS_PER_CHUNK);
+		data = alloc_fdmem(nr);
+		if (!data)
+			goto out_bitmaps;
+		fdt->bitmaps[++level] = data;
+	}
 	return fdt;
 
+out_bitmaps:
+	while (level >= 0)
+		kvfree(fdt->bitmaps[level--]);
 out_arr:
 	kvfree(fdt->fd);
 out_fdt:
@@ -229,14 +286,41 @@ static inline void __clear_close_on_exec(int fd, struct fdtable *fdt)
 	__clear_bit(fd, fdt->close_on_exec);
 }
 
-static inline void __set_open_fd(int fd, struct fdtable *fdt)
+static inline void __set_open_fd(unsigned fd, struct fdtable *fdt)
 {
-	__set_bit(fd, fdt->open_fds);
+	int level;
+	for (level = 0; ; level++) {
+		unsigned long *p;
+
+		__set_bit(fd, fdt->bitmaps[level]);
+
+		if (level == 3 || !fdt->bitmaps[level + 1])
+			break;
+
+		fd /= BITS_PER_CHUNK;
+
+		p = fdt->bitmaps[level] + BIT_WORD(fd * BITS_PER_CHUNK);
+		if (likely(!cacheline_full(p)))
+			break;
+	}
 }
 
-static inline void __clear_open_fd(int fd, struct fdtable *fdt)
+static inline void __clear_open_fd(unsigned fd, struct fdtable *fdt)
 {
-	__clear_bit(fd, fdt->open_fds);
+	int level;
+	unsigned long *p = fdt->bitmaps[0] + BIT_WORD(fd);
+	unsigned long v = *p;
+	__clear_bit(fd % BITS_PER_LONG, p);
+	if (~v)		// quick test to avoid looking at other cachelines
+		return;
+	for (level = 1; level <= 3; level++) {
+		if (!fdt->bitmaps[level])
+			break;
+		fd /= BITS_PER_CHUNK;
+		if (!test_bit(fd, fdt->bitmaps[level]))
+			break;
+		__clear_bit(fd, fdt->bitmaps[level]);
+	}
 }
 
 static int count_open_files(struct fdtable *fdt)
@@ -246,7 +330,7 @@ static int count_open_files(struct fdtable *fdt)
 
 	/* Find the last open fd */
 	for (i = size / BITS_PER_LONG; i > 0; ) {
-		if (fdt->open_fds[--i])
+		if (fdt->bitmaps[0][--i])
 			break;
 	}
 	i = (i + 1) * BITS_PER_LONG;
@@ -262,7 +346,7 @@ struct files_struct *dup_fd(struct files_struct *oldf, int *errorp)
 {
 	struct files_struct *newf;
 	struct file **old_fds, **new_fds;
-	int open_files, size, i;
+	int open_files, size, i, n;
 	struct fdtable *old_fdt, *new_fdt;
 
 	*errorp = -ENOMEM;
@@ -279,7 +363,8 @@ struct files_struct *dup_fd(struct files_struct *oldf, int *errorp)
 	new_fdt = &newf->fdtab;
 	new_fdt->max_fds = NR_OPEN_DEFAULT;
 	new_fdt->close_on_exec = newf->close_on_exec_init;
-	new_fdt->open_fds = newf->open_fds_init;
+	new_fdt->bitmaps[0] = newf->open_fds_init;
+	new_fdt->bitmaps[1] = new_fdt->bitmaps[2] = new_fdt->bitmaps[3] = NULL;
 	new_fdt->fd = &newf->fd_array[0];
 
 	spin_lock(&oldf->file_lock);
@@ -321,8 +406,17 @@ struct files_struct *dup_fd(struct files_struct *oldf, int *errorp)
 	old_fds = old_fdt->fd;
 	new_fds = new_fdt->fd;
 
-	memcpy(new_fdt->open_fds, old_fdt->open_fds, open_files / 8);
 	memcpy(new_fdt->close_on_exec, old_fdt->close_on_exec, open_files / 8);
+	memcpy(new_fdt->bitmaps[0], old_fdt->bitmaps[0], open_files / 8);
+
+	n = round_up(open_files, BITS_PER_CHUNK);
+	for (i = 1; i <= 3; i++) {
+		if (!new_fdt->bitmaps[i])
+			break;
+		n /= BITS_PER_CHUNK;
+		n = round_up(n, BITS_PER_CHUNK);
+		memcpy(new_fdt->bitmaps[i], old_fdt->bitmaps[i], n / 8);
+	}
 
 	for (i = open_files; i != 0; i--) {
 		struct file *f = *old_fds++;
@@ -351,10 +445,24 @@ struct files_struct *dup_fd(struct files_struct *oldf, int *errorp)
 		int left = (new_fdt->max_fds - open_files) / 8;
 		int start = open_files / BITS_PER_LONG;
 
-		memset(&new_fdt->open_fds[start], 0, left);
-		memset(&new_fdt->close_on_exec[start], 0, left);
+		memset(new_fdt->close_on_exec + start, 0, left);
+		if (likely(!new_fdt->bitmaps[1])) {
+			memset(new_fdt->bitmaps[0] + start, 0, left);
+			goto done;
+		}
+		start = round_up(open_files, BITS_PER_CHUNK);
+		n = round_up(new_fdt->max_fds, BITS_PER_CHUNK);
+		for (i = 0 ; i <= 3; i++) {
+			char *p = (void *)new_fdt->bitmaps[i];
+			if (!p)
+				break;
+			n = round_up(n / BITS_PER_CHUNK, BITS_PER_CHUNK);
+			start = round_up(start / BITS_PER_CHUNK, BITS_PER_CHUNK);
+			memset(p + start / 8, 0, (n - start) / 8);
+		}
 	}
 
+done:
 	rcu_assign_pointer(newf->fdt, new_fdt);
 
 	return newf;
@@ -380,7 +488,7 @@ static struct fdtable *close_files(struct files_struct * files)
 		i = j * BITS_PER_LONG;
 		if (i >= fdt->max_fds)
 			break;
-		set = fdt->open_fds[j++];
+		set = fdt->bitmaps[0][j++];
 		while (set) {
 			if (set & 1) {
 				struct file * file = xchg(&fdt->fd[i], NULL);
@@ -453,70 +561,128 @@ struct files_struct init_files = {
 		.max_fds	= NR_OPEN_DEFAULT,
 		.fd		= &init_files.fd_array[0],
 		.close_on_exec	= init_files.close_on_exec_init,
-		.open_fds	= init_files.open_fds_init,
+		.bitmaps[0]	= init_files.open_fds_init,
 	},
 	.file_lock	= __SPIN_LOCK_UNLOCKED(init_files.file_lock),
 };
 
-/*
- * allocate a file descriptor, mark it busy.
- */
+/* search for the next zero bit in cacheline */
+#define NO_FREE (1ULL<<32)
+#define LAST_FREE (2ULL<<32)
+#define CHUNK_ALIGNED(p) IS_ALIGNED((uintptr_t)p, BYTES_PER_CHUNK)
+
+static __u64 scan(struct fdtable *fdt, int level, unsigned from)
+{
+	unsigned long *p = fdt->bitmaps[level] + BIT_WORD(from);
+	unsigned long v = *p, w = v + BIT_MASK(from);
+
+	from = round_down(from, BITS_PER_CHUNK);
+
+	if (unlikely(!(w & ~v))) {
+		while (!CHUNK_ALIGNED(++p)) {
+			v = *p;
+			w = v + 1;
+			if (likely(w))
+				goto got_it;
+		}
+		return NO_FREE | (from + BITS_PER_CHUNK);
+	}
+got_it:
+	from += __ffs(w & ~v);		// log2, really - it's a power of 2
+	from += 8 * ((uintptr_t)p % BYTES_PER_CHUNK);
+	if (level)			// don't bother with looking for more
+		return from;
+	if (likely(~(v | w)))		// would zeroes remain in *p?
+		return from;
+	while (!CHUNK_ALIGNED(++p))	// any zeros in the tail?
+		if (likely(~*p))
+			return from;
+	return LAST_FREE | from;
+}
+
 int __alloc_fd(struct files_struct *files,
 	       unsigned start, unsigned end, unsigned flags)
 {
 	unsigned int fd;
+	__u64 base;
 	int error;
 	struct fdtable *fdt;
+	unsigned count;
 
 	spin_lock(&files->file_lock);
 repeat:
 	fdt = files_fdtable(files);
+	count = fdt->max_fds;
 	fd = start;
 	if (fd < files->next_fd)
 		fd = files->next_fd;
-
-	if (fd < fdt->max_fds)
-		fd = find_next_zero_bit(fdt->open_fds, fdt->max_fds, fd);
-
-	/*
-	 * N.B. For clone tasks sharing a files structure, this test
-	 * will limit the total number of files that can be opened.
-	 */
-	error = -EMFILE;
-	if (fd >= end)
-		goto out;
-
-	error = expand_files(files, fd);
-	if (error < 0)
-		goto out;
-
-	/*
-	 * If we needed to expand the fs array we
-	 * might have blocked - try again.
-	 */
-	if (error)
+	if (unlikely(fd >= count)) {
+		error = expand_files(files, fd);
+		if (error < 0)
+			goto out;
 		goto repeat;
-
+	}
+	if (likely(!fdt->bitmaps[1])) {
+		base = find_next_zero_bit(fdt->bitmaps[0], count, fd);
+		if (unlikely(base == count))
+			goto expand;
+		if (unlikely(base >= end)) {
+			error = -EMFILE;
+			goto out;
+		}
+		fd = base;
+		__set_bit(fd, fdt->bitmaps[0]);
+		goto found;
+	}
+	base = scan(fdt, 0, fd);
+	if (unlikely(base & NO_FREE)) {
+		int bits[3];
+		int level = 0;
+		do {
+			if (unlikely((u32)base >= count))
+				goto expand;
+			bits[level] = count;
+			count = DIV_ROUND_UP(count, BITS_PER_CHUNK);
+			base = scan(fdt, ++level, (u32)base / BITS_PER_CHUNK);
+		} while (unlikely(base & NO_FREE));
+		while (level) {
+			if (unlikely((u32)base >= count))
+				goto expand;
+			base = scan(fdt, --level, (u32)base * BITS_PER_CHUNK);
+			if (WARN_ON(base & NO_FREE)) {
+				error = -EINVAL;
+				goto out;
+			}
+			count = bits[level];
+		}
+		if (unlikely((u32)base >= count))
+			goto expand;
+	}
+	fd = base;
+	if (unlikely(fd >= end)) {
+		error = -EMFILE;
+		goto out;
+	}
+	if (likely(!(base & LAST_FREE)))
+		__set_bit(fd, fdt->bitmaps[0]);
+	else
+		__set_open_fd(fd, fdt);
+found:
 	if (start <= files->next_fd)
 		files->next_fd = fd + 1;
-
-	__set_open_fd(fd, fdt);
 	if (flags & O_CLOEXEC)
 		__set_close_on_exec(fd, fdt);
 	else
 		__clear_close_on_exec(fd, fdt);
 	error = fd;
-#if 1
-	/* Sanity check */
-	if (rcu_access_pointer(fdt->fd[fd]) != NULL) {
-		printk(KERN_WARNING "alloc_fd: slot %d not NULL!\n", fd);
-		rcu_assign_pointer(fdt->fd[fd], NULL);
-	}
-#endif
-
 out:
 	spin_unlock(&files->file_lock);
 	return error;
+expand:
+	error = expand_files(files, fdt->max_fds);
+	if (error < 0)
+		goto out;
+	goto repeat;
 }
 
 static int alloc_fd(unsigned start, unsigned flags)
@@ -809,7 +975,8 @@ __releases(&files->file_lock)
 		goto Ebusy;
 	get_file(file);
 	rcu_assign_pointer(fdt->fd[fd], file);
-	__set_open_fd(fd, fdt);
+	if (!tofree)
+		__set_open_fd(fd, fdt);
 	if (flags & O_CLOEXEC)
 		__set_close_on_exec(fd, fdt);
 	else
diff --git a/fs/select.c b/fs/select.c
index 0155473..670f542 100644
--- a/fs/select.c
+++ b/fs/select.c
@@ -350,7 +350,7 @@ static int max_select_fd(unsigned long n, fd_set_bits *fds)
 	set = ~(~0UL << (n & (BITS_PER_LONG-1)));
 	n /= BITS_PER_LONG;
 	fdt = files_fdtable(current->files);
-	open_fds = fdt->open_fds + n;
+	open_fds = fdt->bitmaps[0] + n;
 	max = 0;
 	if (set) {
 		set &= BITS(fds, n);
diff --git a/include/linux/fdtable.h b/include/linux/fdtable.h
index 674e3e2..6ef5274 100644
--- a/include/linux/fdtable.h
+++ b/include/linux/fdtable.h
@@ -25,7 +25,7 @@ struct fdtable {
 	unsigned int max_fds;
 	struct file __rcu **fd;      /* current fd array */
 	unsigned long *close_on_exec;
-	unsigned long *open_fds;
+	unsigned long *bitmaps[4];
 	struct rcu_head rcu;
 };
 
@@ -36,7 +36,7 @@ static inline bool close_on_exec(int fd, const struct fdtable *fdt)
 
 static inline bool fd_is_open(int fd, const struct fdtable *fdt)
 {
-	return test_bit(fd, fdt->open_fds);
+	return test_bit(fd, fdt->bitmaps[0]);
 }
 
 /*
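The patch above replaces the flat `open_fds` bitmap with a small hierarchy of summary bitmaps (`bitmaps[0]` holds one bit per descriptor, higher levels one bit per full chunk below), so that a `close(3); open(...)` pattern no longer forces an O(n) scan of the whole in-use map. A minimal two-level user-space sketch of the idea, with fixed capacity, no locking, and all names hypothetical (this is not the kernel code):

```c
#include <assert.h>
#include <stdint.h>

#define CHUNK_BITS 64              /* bits per level-0 word */
#define NWORDS     4               /* sketch capacity: 256 descriptors */

/* level0: one bit per fd; level1: one bit per level-0 word, set when full */
static uint64_t level0[NWORDS];
static uint64_t level1;            /* summary fits in a single word here */

/* Allocate the lowest free fd; the summary word lets us skip full chunks
 * in O(1) instead of scanning every set bit in level0. */
static int alloc_fd_sketch(void)
{
    if (level1 == (1ULL << NWORDS) - 1)
        return -1;                         /* table full: kernel would expand */
    int word = __builtin_ctzll(~level1);   /* first not-entirely-full word */
    int bit  = __builtin_ctzll(~level0[word]);  /* first zero bit inside it */
    level0[word] |= 1ULL << bit;
    if (level0[word] == ~0ULL)             /* word just became full */
        level1 |= 1ULL << word;
    return word * CHUNK_BITS + bit;
}

static void free_fd_sketch(int fd)
{
    int word = fd / CHUNK_BITS, bit = fd % CHUNK_BITS;
    level0[word] &= ~(1ULL << bit);
    level1 &= ~(1ULL << word);     /* the word is no longer full */
}
```

The real patch generalizes this to four levels, expands the table on demand, and folds the `NO_FREE`/`LAST_FREE` flags into the scan result, but the cost model is the same: freeing fd 3 and allocating again touches one word per level rather than walking the whole bitmap.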


* Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
  2015-11-02  0:24                                                           ` Al Viro
@ 2015-11-02  0:59                                                             ` Linus Torvalds
  2015-11-02  2:14                                                             ` Eric Dumazet
  1 sibling, 0 replies; 32+ messages in thread
From: Linus Torvalds @ 2015-11-02  0:59 UTC (permalink / raw)
  To: Al Viro
  Cc: Eric Dumazet, David Miller, Stephen Hemminger,
	Network Development, David Howells, linux-fsdevel

On Sun, Nov 1, 2015 at 4:24 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> This ought to be a bit cleaner.  Eric, could you test the variant below on your
> setup?

I'd love to see the numbers, but at the same time I really can't say I
love your patch.

I've merged my own two patches for now (not actually pushed out - I
don't want to distract people from just testing 4.3 for a while),
because I felt that those had an unreasonably high bang-for-the-buck
(i.e. big speedups for something that still keeps the code very simple).

I'm definitely open to improving this further, even go as far as your
patch, but just looking at your version of __alloc_fd(), I just don't
get the warm and fuzzies.

                  Linus


* Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
  2015-11-02  0:24                                                           ` Al Viro
  2015-11-02  0:59                                                             ` Linus Torvalds
@ 2015-11-02  2:14                                                             ` Eric Dumazet
  2015-11-02  6:22                                                               ` Al Viro
  1 sibling, 1 reply; 32+ messages in thread
From: Eric Dumazet @ 2015-11-02  2:14 UTC (permalink / raw)
  To: Al Viro
  Cc: Linus Torvalds, David Miller, Stephen Hemminger,
	Network Development, David Howells, linux-fsdevel

On Mon, 2015-11-02 at 00:24 +0000, Al Viro wrote:

> This ought to be a bit cleaner.  Eric, could you test the variant below on your
> setup?

Sure!

5 runs of :
lpaa24:~# taskset ff0ff ./opensock -t 16 -n 10000000 -l 10

total = 4386311
total = 4560402
total = 4437309
total = 4516227
total = 4478778


With 48 threads :

./opensock -t 48 -n 10000000 -l 10
total = 4940245
total = 4848513
total = 4813153
total = 4813946
total = 5127804
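The opensock source isn't posted in the thread; judging from the flags and the call graphs below, it is a multithreaded socket()/close() stress loop. A minimal sketch of that kind of workload, where every cycle allocates and releases the lowest free descriptor so all threads contend on the shared fd-table state (names and structure hypothetical, not the actual benchmark):

```c
#include <assert.h>
#include <pthread.h>
#include <sys/socket.h>
#include <unistd.h>

struct job { long iters; long done; };

/* Each worker hammers socket()/close(); the close() immediately frees the
 * just-allocated fd, so the next socket() rescans from the same low slot. */
static void *worker(void *arg)
{
    struct job *j = arg;
    for (long i = 0; i < j->iters; i++) {
        int fd = socket(AF_INET, SOCK_STREAM, 0);
        if (fd >= 0) {
            close(fd);
            j->done++;
        }
    }
    return NULL;
}

/* Returns the total completed socket()/close() cycles across all threads. */
static long run_opensock(int nthreads, long iters)
{
    pthread_t tid[nthreads];
    struct job jobs[nthreads];
    long total = 0;

    for (int i = 0; i < nthreads; i++) {
        jobs[i].iters = iters;
        jobs[i].done = 0;
        pthread_create(&tid[i], NULL, worker, &jobs[i]);
    }
    for (int i = 0; i < nthreads; i++) {
        pthread_join(tid[i], NULL);
        total += jobs[i].done;
    }
    return total;
}
```

The real tool's `-l` duration flag and per-run "total" accounting are not reproduced here; the point is only that both the allocation (`__alloc_fd`) and release (`__close_fd`) paths take `files->file_lock`, which is exactly the contention the perf profiles below show.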

Perf output taken on the 16 threads run :

    74.71%       opensock  opensock           [.] memset                                    
                 |
                 --- memset

     5.64%       opensock  [kernel.kallsyms]  [k] queued_spin_lock_slowpath                 
                 |
                 --- queued_spin_lock_slowpath
                    |          
                    |--99.89%-- _raw_spin_lock
                    |          |          
                    |          |--52.74%-- __close_fd
                    |          |          sys_close
                    |          |          entry_SYSCALL_64_fastpath
                    |          |          __libc_close
                    |          |          |          
                    |          |           --100.00%-- 0x0
                    |          |          
                    |          |--46.97%-- __alloc_fd
                    |          |          get_unused_fd_flags
                    |          |          sock_map_fd
                    |          |          sys_socket
                    |          |          entry_SYSCALL_64_fastpath
                    |          |          __socket
                    |          |          |          
                    |          |           --100.00%-- 0x0
                    |           --0.30%-- [...]
                     --0.11%-- [...]

     1.69%       opensock  [kernel.kallsyms]  [k] _raw_spin_lock                            
                 |
                 --- _raw_spin_lock
                    |          
                    |--27.37%-- get_unused_fd_flags
                    |          sock_map_fd
                    |          sys_socket
                    |          entry_SYSCALL_64_fastpath
                    |          __socket
                    |          
                    |--22.40%-- sys_close
                    |          entry_SYSCALL_64_fastpath
                    |          __libc_close
                    |          
                    |--15.79%-- cache_alloc_refill
                    |          |          
                    |          |--99.27%-- kmem_cache_alloc
                    |          |          |          
                    |          |          |--81.25%-- sk_prot_alloc
                    |          |          |          sk_alloc
                    |          |          |          inet_create
                    |          |          |          __sock_create
                    |          |          |          sock_create
                    |          |          |          sys_socket
                    |          |          |          entry_SYSCALL_64_fastpath
                    |          |          |          __socket
                    |          |          |          
                    |          |          |--9.08%-- sock_alloc_inode
                    |          |          |          alloc_inode
                    |          |          |          new_inode_pseudo
                    |          |          |          sock_alloc
                    |          |          |          __sock_create
                    |          |          |          sock_create
                    |          |          |          sys_socket
                    |          |          |          entry_SYSCALL_64_fastpath
                    |          |          |          __socket
                    |          |          |          
                    |          |          |--4.98%-- __d_alloc
                    |          |          |          d_alloc_pseudo
                    |          |          |          sock_alloc_file
                    |          |          |          sock_map_fd
                    |          |          |          sys_socket
                    |          |          |          entry_SYSCALL_64_fastpath
                    |          |          |          __socket
                    |          |          |          
                    |          |           --4.69%-- get_empty_filp
                    |          |                     alloc_file
                    |          |                     sock_alloc_file
                    |          |                     sock_map_fd
                    |          |                     sys_socket
                    |          |                     entry_SYSCALL_64_fastpath
                    |          |                     __socket
                    |          |          
                    |           --0.73%-- kmem_cache_alloc_trace
                    |                     sock_alloc_inode
                    |                     alloc_inode
                    |                     new_inode_pseudo
                    |                     sock_alloc
                    |                     __sock_create
                    |                     sock_create
                    |                     sys_socket
                    |                     entry_SYSCALL_64_fastpath
                    |                     __socket
                    |          
                    |--10.80%-- sock_alloc_file
                    |          sock_map_fd
                    |          sys_socket
                    |          entry_SYSCALL_64_fastpath
                    |          __socket
                    |          
                    |--7.47%-- sock_alloc
                    |          __sock_create
                    |          sock_create
                    |          sys_socket
                    |          entry_SYSCALL_64_fastpath
                    |          __socket
                    |          
                    |--6.96%-- kmem_cache_alloc
                    |          |          
                    |          |--72.94%-- sk_prot_alloc
                    |          |          sk_alloc
                    |          |          inet_create
                    |          |          __sock_create
                    |          |          sock_create
                    |          |          sys_socket
                    |          |          entry_SYSCALL_64_fastpath
                    |          |          __socket
                    |          |          
                    |          |--15.51%-- sock_alloc_inode
                    |          |          alloc_inode
                    |          |          new_inode_pseudo
                    |          |          sock_alloc
                    |          |          __sock_create
                    |          |          sock_create
                    |          |          sys_socket
                    |          |          entry_SYSCALL_64_fastpath
                    |          |          __socket
                    |          |          
                    |          |--7.59%-- get_empty_filp
                    |          |          alloc_file
                    |          |          sock_alloc_file
                    |          |          sock_map_fd
                    |          |          sys_socket
                    |          |          entry_SYSCALL_64_fastpath
                    |          |          __socket
                    |          |          
                    |           --3.96%-- __d_alloc
                    |                     d_alloc_pseudo
                    |                     sock_alloc_file
                    |                     sock_map_fd
                    |                     sys_socket
                    |                     entry_SYSCALL_64_fastpath
                    |                     __socket
                    |          
                    |--3.74%-- d_instantiate
                    |          sock_alloc_file
                    |          sock_map_fd
                    |          sys_socket
                    |          entry_SYSCALL_64_fastpath
                    |          __socket
                    |          
                    |--2.03%-- iput
                    |          __dentry_kill
                    |          dput
                    |          __fput
                    |          ____fput
                    |          task_work_run
                    |          prepare_exit_to_usermode
                    |          syscall_return_slowpath
                    |          int_ret_from_sys_call
                    |          __libc_close
                    |          
                    |--0.60%-- __fsnotify_inode_delete
                    |          __destroy_inode
                    |          destroy_inode
                    |          evict
                    |          iput
                    |          __dentry_kill
                    |          dput
                    |          __fput
                    |          ____fput
                    |          task_work_run
                    |          prepare_exit_to_usermode
                    |          syscall_return_slowpath
                    |          int_ret_from_sys_call
                    |          __libc_close
                    |          
                    |--0.55%-- evict
                    |          iput
                    |          __dentry_kill
                    |          dput
                    |          __fput
                    |          ____fput
                    |          task_work_run
                    |          prepare_exit_to_usermode
                    |          syscall_return_slowpath
                    |          int_ret_from_sys_call
                    |          __libc_close
                    |          
                    |--0.53%-- inet_release
                    |          sock_release
                    |          sock_close
                    |          __fput
                    |          ____fput
                    |          task_work_run
                    |          prepare_exit_to_usermode
                    |          syscall_return_slowpath
                    |          int_ret_from_sys_call
                    |          __libc_close
                    |          
                    |--0.51%-- __fput
                    |          ____fput
                    |          task_work_run
                    |          prepare_exit_to_usermode
                    |          syscall_return_slowpath


Perf taken on the 48 threads run :

   48.34%       opensock  [kernel.kallsyms]   [k] queued_spin_lock_slowpath           
                 |
                 --- queued_spin_lock_slowpath
                    |          
                    |--99.93%-- _raw_spin_lock
                    |          |          
                    |          |--50.11%-- __close_fd
                    |          |          sys_close
                    |          |          entry_SYSCALL_64_fastpath
                    |          |          __libc_close
                    |          |          |          
                    |          |           --100.00%-- 0x0
                    |          |          
                    |          |--49.85%-- __alloc_fd
                    |          |          get_unused_fd_flags
                    |          |          sock_map_fd
                    |          |          sys_socket
                    |          |          entry_SYSCALL_64_fastpath
                    |          |          __socket
                    |          |          |          
                    |          |           --100.00%-- 0x0
                    |           --0.03%-- [...]
                     --0.07%-- [...]

    41.03%       opensock  opensock            [.] memset                              
                 |
                 --- memset

     0.69%       opensock  [kernel.kallsyms]   [k] _raw_spin_lock                      
                 |
                 --- _raw_spin_lock
                    |          
                    |--30.22%-- sys_close
                    |          entry_SYSCALL_64_fastpath
                    |          __libc_close
                    |          |          
                    |           --100.00%-- 0x0
                    |          
                    |--30.15%-- get_unused_fd_flags
                    |          sock_map_fd
                    |          sys_socket
                    |          entry_SYSCALL_64_fastpath
                    |          __socket
                    |          
                    |--14.61%-- cache_alloc_refill
                    |          |          


* Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
  2015-11-02  2:14                                                             ` Eric Dumazet
@ 2015-11-02  6:22                                                               ` Al Viro
  0 siblings, 0 replies; 32+ messages in thread
From: Al Viro @ 2015-11-02  6:22 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Linus Torvalds, David Miller, Stephen Hemminger,
	Network Development, David Howells, linux-fsdevel

On Sun, Nov 01, 2015 at 06:14:43PM -0800, Eric Dumazet wrote:
> On Mon, 2015-11-02 at 00:24 +0000, Al Viro wrote:
> 
> > This ought to be a bit cleaner.  Eric, could you test the variant below on your
> > setup?
> 
> Sure !
> 
> 5 runs of :
> lpaa24:~# taskset ff0ff ./opensock -t 16 -n 10000000 -l 10
> 
> total = 4386311
> total = 4560402
> total = 4437309
> total = 4516227
> total = 4478778

Umm...  With Linus' variant it was what, around 4000000?  +10% or so, then...

> With 48 threads :
> 
> ./opensock -t 48 -n 10000000 -l 10
> total = 4940245
> total = 4848513
> total = 4813153
> total = 4813946
> total = 5127804

And that - +40%?  Interesting...  And it looks like at 48 threads you are
still seeing arseloads of contention, but apparently less than with Linus'
variant...  What if you throw the __clear_close_on_exec() patch on
top of that?

Looks like it's spending less time under ->files_lock...  Could you get
information on fs/file.o hotspots?

