Re: [PATCH v3 6/7] fs: struct file move from call_rcu() to SLAB_DESTROY_BY

linux-fsdevel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* Re: [PATCH v3 6/7] fs: struct file move from call_rcu() to SLAB_DESTROY_BY_RCU
  2008-12-11 22:40                       ` [PATCH v3 6/7] fs: struct file move from call_rcu() to SLAB_DESTROY_BY_RCU Eric Dumazet
@ 2007-07-24  1:13                         ` Nick Piggin
  2008-12-12  2:50                           ` Nick Piggin
  2008-12-12  4:45                           ` Eric Dumazet
  0 siblings, 2 replies; 37+ messages in thread
From: Nick Piggin @ 2007-07-24  1:13 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Andrew Morton, Ingo Molnar, Christoph Hellwig, David Miller,
	Rafael J. Wysocki, linux-kernel,
	kernel-testers@vger.kernel.org >> Kernel Testers List,
	Mike Galbraith, Peter Zijlstra, Linux Netdev List,
	Christoph Lameter, linux-fsdevel, Al Viro, Paul E. McKenney

On Friday 12 December 2008 09:40, Eric Dumazet wrote:
> From: Christoph Lameter <cl@linux-foundation.org>
>
> [PATCH] fs: struct file move from call_rcu() to SLAB_DESTROY_BY_RCU
>
> Currently we schedule RCU frees for each file we free separately. That has
> several drawbacks against the earlier file handling (in 2.6.5 f.e.), which
> did not require RCU callbacks:
>
> 1. Excessive number of RCU callbacks can be generated causing long RCU
>   queues that in turn cause long latencies. We hit SLUB page allocation
>   more often than necessary.
>
> 2. The cache hot object is not preserved between free and realloc. A close
>   followed by another open is very fast with the RCUless approach because
>   the last freed object is returned by the slab allocator that is
>   still cache hot. RCU free means that the object is not immediately
>   available again. The new object is cache cold and therefore open/close
>   performance tests show a significant degradation with the RCU
>   implementation.
>
> One solution to this problem is to move the RCU freeing into the Slab
> allocator by specifying SLAB_DESTROY_BY_RCU as an option at slab creation
> time. The slab allocator will do RCU frees only when it is necessary
> to dispose of slabs of objects (rare). So with that approach we can cut
> out the RCU overhead significantly.
>
> However, the slab allocator may return the object for another use even
> before the RCU period has expired under SLAB_DESTROY_BY_RCU. This means
> there is the (unlikely) possibility that the object is going to be
> switched under us in sections protected by rcu_read_lock() and
> rcu_read_unlock(). So we need to verify that we have acquired the correct
> object after establishing a stable object reference (incrementing the
> refcounter does that).
>
>
> Signed-off-by: Christoph Lameter <cl@linux-foundation.org>
> Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> ---
>  Documentation/filesystems/files.txt |   21 ++++++++++++++--
>  fs/file_table.c                     |   33 ++++++++++++++++++--------
>  include/linux/fs.h                  |    5 ---
>  3 files changed, 42 insertions(+), 17 deletions(-)
>
> diff --git a/Documentation/filesystems/files.txt
> b/Documentation/filesystems/files.txt index ac2facc..6916baa 100644
> --- a/Documentation/filesystems/files.txt
> +++ b/Documentation/filesystems/files.txt
> @@ -78,13 +78,28 @@ the fdtable structure -
>     that look-up may race with the last put() operation on the
>     file structure. This is avoided using atomic_long_inc_not_zero()
>     on ->f_count :
> +   As file structures are allocated with SLAB_DESTROY_BY_RCU,
> +   they can also be freed before a RCU grace period, and reused,
> +   but still as a struct file.
> +   It is necessary to check again after getting
> +   a stable reference (ie after atomic_long_inc_not_zero()),
> +   that fcheck_files(files, fd) points to the same file.
>
>  	rcu_read_lock();
>  	file = fcheck_files(files, fd);
>  	if (file) {
> -		if (atomic_long_inc_not_zero(&file->f_count))
> +		if (atomic_long_inc_not_zero(&file->f_count)) {
>  			*fput_needed = 1;
> -		else
> +			/*
> +			 * Now we have a stable reference to an object.
> +			 * Check if other threads freed file and reallocated it.
> +			 */
> +			if (file != fcheck_files(files, fd)) {
> +				*fput_needed = 0;
> +				put_filp(file);
> +				file = NULL;
> +			}
> +		} else
>  		/* Didn't get the reference, someone's freed */
>  			file = NULL;
>  	}
> @@ -95,6 +110,8 @@ the fdtable structure -
>     atomic_long_inc_not_zero() detects if refcounts is already zero or
>     goes to zero during increment. If it does, we fail
>     fget()/fget_light().
> +   The second call to fcheck_files(files, fd) checks that this filp
> +   was not freed, then reused by an other thread.
>
>  6. Since both fdtable and file structures can be looked up
>     lock-free, they must be installed using rcu_assign_pointer()
> diff --git a/fs/file_table.c b/fs/file_table.c
> index a46e880..3e9259d 100644
> --- a/fs/file_table.c
> +++ b/fs/file_table.c
> @@ -37,17 +37,11 @@ static struct kmem_cache *filp_cachep __read_mostly;
>
>  static struct percpu_counter nr_files __cacheline_aligned_in_smp;
>
> -static inline void file_free_rcu(struct rcu_head *head)
> -{
> -	struct file *f =  container_of(head, struct file, f_u.fu_rcuhead);
> -	kmem_cache_free(filp_cachep, f);
> -}
> -
>  static inline void file_free(struct file *f)
>  {
>  	percpu_counter_dec(&nr_files);
>  	file_check_state(f);
> -	call_rcu(&f->f_u.fu_rcuhead, file_free_rcu);
> +	kmem_cache_free(filp_cachep, f);
>  }
>
>  /*
> @@ -306,6 +300,14 @@ struct file *fget(unsigned int fd)
>  			rcu_read_unlock();
>  			return NULL;
>  		}
> +		/*
> +		 * Now we have a stable reference to an object.
> +		 * Check if other threads freed file and re-allocated it.
> +		 */
> +		if (unlikely(file != fcheck_files(files, fd))) {
> +			put_filp(file);
> +			file = NULL;
> +		}

This is a non-trivial change, because that put_filp may drop the last
reference to the file. So now we have the case where we free the file
from a context in which it had never been allocated.

>From a quick glance though the callchains, I can't seen an obvious
problem. But it needs to have documentation in put_filp, or at least
a mention in the changelog, and also cc'ed to the security lists.

Also, it adds code and cost to the get/put path in return for
improvement in the free path. get/put is the more common path, but
it is a small loss for a big improvement. So it might be worth it. But
it is not justified by your microbenchmark. Do we have a more useful
case that it helps?

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v3 1/7] fs: Use a percpu_counter to track nr_dentry
  2008-12-11 22:38                     ` [PATCH v3 1/7] fs: Use a percpu_counter to track nr_dentry Eric Dumazet
@ 2007-07-24  1:24                       ` Nick Piggin
       [not found]                       ` <49419680.8010409-fPLkHRcR87vqlBn2x/YWAg@public.gmane.org>
  1 sibling, 0 replies; 37+ messages in thread
From: Nick Piggin @ 2007-07-24  1:24 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Andrew Morton, Ingo Molnar, Christoph Hellwig, David Miller,
	Rafael J. Wysocki, linux-kernel,
	kernel-testers@vger.kernel.org >> Kernel Testers List,
	Mike Galbraith, Peter Zijlstra, Linux Netdev List,
	Christoph Lameter, linux-fsdevel, Al Viro, Paul E. McKenney

On Friday 12 December 2008 09:38, Eric Dumazet wrote:
> Adding a percpu_counter nr_dentry avoids cache line ping pongs
> between cpus to maintain this metric, and dcache_lock is
> no more needed to protect dentry_stat.nr_dentry
>
> We centralize nr_dentry updates at the right place :
> - increments in d_alloc()
> - decrements in d_free()
>
> d_alloc() can avoid taking dcache_lock if parent is NULL
>
> ("socketallocbench -n8" result : 27.5s to 25s)

Seems like a good idea.


> @@ -696,7 +712,7 @@ static void shrink_dcache_for_umount_subtree(struct
> dentry *dentry) * otherwise we ascend to the parent and move to the
>  			 * next sibling if there is one */
>  			if (!parent)
> -				goto out;
> +				return;
>
>  			dentry = parent;
>

Andrew doesn't like return from middle of function.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v3 2/7] fs: Use a percpu_counter to track nr_inodes
       [not found]                       ` <4941968E.3020201-fPLkHRcR87vqlBn2x/YWAg@public.gmane.org>
@ 2007-07-24  1:30                         ` Nick Piggin
       [not found]                           ` <200707241130.56767.nickpiggin-/E1597aS9LT0CCvOHzKKcA@public.gmane.org>
  2008-12-16 21:10                         ` Paul E. McKenney
  1 sibling, 1 reply; 37+ messages in thread
From: Nick Piggin @ 2007-07-24  1:30 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Andrew Morton, Ingo Molnar, Christoph Hellwig, David Miller,
	Rafael J. Wysocki, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	kernel-testers-u79uwXL29TY76Z2rM5mHXA@public.gmane.org >> Kernel Testers List,
	Mike Galbraith, Peter Zijlstra, Linux Netdev List,
	Christoph Lameter, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Al Viro,
	Paul E. McKenney

On Friday 12 December 2008 09:39, Eric Dumazet wrote:
> Avoids cache line ping pongs between cpus and prepare next patch,
> because updates of nr_inodes dont need inode_lock anymore.
>
> (socket8 bench result : no difference at this point)

Looks good.

But.... If we never actually need fast access to the approximate
total, (which seems to apply to this and the previous patch) we
could use something much simpler which does not have the spinlock
or all this batching stuff that percpu counters have. I'd prefer
that because it will be faster in a straight line...

(BTW. percpu counters can't be used in interrupt context? That's
nice.)

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v3 3/7] fs: Introduce a per_cpu last_ino allocator
  2008-12-11 22:39                     ` [PATCH v3 3/7] fs: Introduce a per_cpu last_ino allocator Eric Dumazet
@ 2007-07-24  1:34                       ` Nick Piggin
  2008-12-16 21:26                       ` Paul E. McKenney
  1 sibling, 0 replies; 37+ messages in thread
From: Nick Piggin @ 2007-07-24  1:34 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Andrew Morton, Ingo Molnar, Christoph Hellwig, David Miller,
	Rafael J. Wysocki, linux-kernel,
	kernel-testers@vger.kernel.org >> Kernel Testers List,
	Mike Galbraith, Peter Zijlstra, Linux Netdev List,
	Christoph Lameter, linux-fsdevel, Al Viro, Paul E. McKenney

On Friday 12 December 2008 09:39, Eric Dumazet wrote:
> new_inode() dirties a contended cache line to get increasing
> inode numbers.
>
> Solve this problem by providing to each cpu a per_cpu variable,
> feeded by the shared last_ino, but once every 1024 allocations.
>
> This reduce contention on the shared last_ino, and give same
> spreading ino numbers than before.
> (same wraparound after 2^32 allocations)

I don't suppose this would cause any filesystems to do silly
things?

Seems like a good idea, if you could just add a #define instead
of 1024.

>
> Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
> ---
>  fs/inode.c |   35 ++++++++++++++++++++++++++++++++---
>  1 files changed, 32 insertions(+), 3 deletions(-)
>
> diff --git a/fs/inode.c b/fs/inode.c
> index f94f889..dc8e72a 100644
> --- a/fs/inode.c
> +++ b/fs/inode.c
> @@ -556,6 +556,36 @@ repeat:
>  	return node ? inode : NULL;
>  }
>
> +#ifdef CONFIG_SMP
> +/*
> + * Each cpu owns a range of 1024 numbers.
> + * 'shared_last_ino' is dirtied only once out of 1024 allocations,
> + * to renew the exhausted range.
> + */
> +static DEFINE_PER_CPU(int, last_ino);
> +
> +static int last_ino_get(void)
> +{
> +	static atomic_t shared_last_ino;
> +	int *p = &get_cpu_var(last_ino);
> +	int res = *p;
> +
> +	if (unlikely((res & 1023) == 0))
> +		res = atomic_add_return(1024, &shared_last_ino) - 1024;
> +
> +	*p = ++res;
> +	put_cpu_var(last_ino);
> +	return res;
> +}
> +#else
> +static int last_ino_get(void)
> +{
> +	static int last_ino;
> +
> +	return ++last_ino;
> +}
> +#endif
> +
>  /**
>   *	new_inode 	- obtain an inode
>   *	@sb: superblock
> @@ -575,7 +605,6 @@ struct inode *new_inode(struct super_block *sb)
>  	 * error if st_ino won't fit in target struct field. Use 32bit counter
>  	 * here to attempt to avoid that.
>  	 */
> -	static unsigned int last_ino;
>  	struct inode * inode;
>
>  	spin_lock_prefetch(&inode_lock);
> @@ -583,11 +612,11 @@ struct inode *new_inode(struct super_block *sb)
>  	inode = alloc_inode(sb);
>  	if (inode) {
>  		percpu_counter_inc(&nr_inodes);
> +		inode->i_state = 0;
> +		inode->i_ino = last_ino_get();
>  		spin_lock(&inode_lock);
>  		list_add(&inode->i_list, &inode_in_use);
>  		list_add(&inode->i_sb_list, &sb->s_inodes);
> -		inode->i_ino = ++last_ino;
> -		inode->i_state = 0;
>  		spin_unlock(&inode_lock);
>  	}
>  	return inode;

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH] fs: pipe/sockets/anon dentries should not have a parent
       [not found]         ` <4926D022.5060008@cosmosbay.com>
@ 2008-11-21 15:36           ` Christoph Hellwig
  2008-11-21 17:58             ` [PATCH] fs: pipe/sockets/anon dentries should have themselves as parent Eric Dumazet
       [not found]           ` <20081121152148.GA20388@elte.hu>
  1 sibling, 1 reply; 37+ messages in thread
From: Christoph Hellwig @ 2008-11-21 15:36 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David Miller, mingo, cl, rjw, linux-kernel, kernel-testers,
	efault, a.p.zijlstra, Linux Netdev List, viro, linux-fsdevel

On Fri, Nov 21, 2008 at 04:13:38PM +0100, Eric Dumazet wrote:
> [PATCH] fs: pipe/sockets/anon dentries should not have a parent
>
> Linking pipe/sockets/anon dentries to one root 'parent' has no functional
> impact at all, but a scalability one.
>
> We can avoid touching a cache line at allocation stage (inside d_alloc(), no need
> to touch root->d_count), but also at freeing time (in d_kill, decrementing d_count)
> We avoid an expensive atomic_dec_and_lock() call on the root dentry.
>
> If we correct dnotify_parent() and inotify_d_instantiate() to take into account
> a NULL d_parent, we can call d_alloc() with a NULL parent instead of root dentry.

Sorry folks, but a NULL d_parent is a no-go from the VFS perspective,
but you can set d_parent to the dentry itself which is the magic used
for root of tree dentries.  They should also be marked
DCACHE_DISCONNECTED to make sure this is not unexpected.

And this kind of stuff really needs to go through -fsdevel.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* [PATCH] fs: pipe/sockets/anon dentries should have themselves as parent
  2008-11-21 15:36           ` [PATCH] fs: pipe/sockets/anon dentries should not have a parent Christoph Hellwig
@ 2008-11-21 17:58             ` Eric Dumazet
       [not found]               ` <4926F6C5.9030108-fPLkHRcR87vqlBn2x/YWAg@public.gmane.org>
  0 siblings, 1 reply; 37+ messages in thread
From: Eric Dumazet @ 2008-11-21 17:58 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: David Miller, mingo, cl, rjw, linux-kernel, kernel-testers,
	efault, a.p.zijlstra, Linux Netdev List, viro, linux-fsdevel

[-- Attachment #1: Type: text/plain, Size: 5101 bytes --]

Christoph Hellwig a écrit :
> On Fri, Nov 21, 2008 at 04:13:38PM +0100, Eric Dumazet wrote:
>> [PATCH] fs: pipe/sockets/anon dentries should not have a parent
>>
>> Linking pipe/sockets/anon dentries to one root 'parent' has no functional
>> impact at all, but a scalability one.
>>
>> We can avoid touching a cache line at allocation stage (inside d_alloc(), no need
>> to touch root->d_count), but also at freeing time (in d_kill, decrementing d_count)
>> We avoid an expensive atomic_dec_and_lock() call on the root dentry.
>>
>> If we correct dnotify_parent() and inotify_d_instantiate() to take into account
>> a NULL d_parent, we can call d_alloc() with a NULL parent instead of root dentry.
> 
> Sorry folks, but a NULL d_parent is a no-go from the VFS perspective,
> but you can set d_parent to the dentry itself which is the magic used
> for root of tree dentries.  They should also be marked
> DCACHE_DISCONNECTED to make sure this is not unexpected.
> 
> And this kind of stuff really needs to go through -fsdevel.

Thanks Christoph for your review, sorry for fsdevel being forgotten.

d_alloc_root() is not an option here, since we also want such dentries
to be unhashed. So here is a second version, with the introduction
of a new helper, d_alloc_unhashed(), to be used by pipes, sockets and anon

I got even better numbers, probably because dnotify/inotify dont have
the NULL d_parent test anymore.

[PATCH] fs: pipe/sockets/anon dentries should have themselves as parent


Linking pipe/sockets/anon dentries to one root 'parent' has no functional
impact at all, but a scalability one.

We can avoid touching a cache line at allocation stage (inside d_alloc(), no need
to touch root->d_count), but also at freeing time (in d_kill, decrementing d_count)
We avoid an expensive atomic_dec_and_lock() call on the root dentry.

We add d_alloc_unhashed(const char *name, struct inode *inode) helper
to be used by pipes/socket/anon. This function is about the same as
d_alloc_root() but for unhashed entries.

Before patch, time to run 8 *  1 million of close(socket()) calls on 8 CPUS was :

real    0m27.496s
user    0m0.657s
sys     3m39.092s

After patch :

real    0m23.843s
user    0m0.616s
sys     3m9.732s


Old oprofile :
CPU: Core 2, speed 3000.11 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (Unhalted core cycles) count 100000
samples  cum. samples  %        cum. %     symbol name
164257   164257        11.0245  11.0245    init_file
155488   319745        10.4359  21.4604    d_alloc
151887   471632        10.1942  31.6547    _atomic_dec_and_lock
91620    563252         6.1493  37.8039    inet_create
74245    637497         4.9831  42.7871    kmem_cache_alloc
46702    684199         3.1345  45.9216    dentry_iput
46186    730385         3.0999  49.0215    tcp_close
42824    773209         2.8742  51.8957    kmem_cache_free
37275    810484         2.5018  54.3975    wake_up_inode
36553    847037         2.4533  56.8508    tcp_v4_init_sock
35661    882698         2.3935  59.2443    inotify_d_instantiate
32998    915696         2.2147  61.4590    sysenter_past_esp
31442    947138         2.1103  63.5693    d_instantiate
31303    978441         2.1010  65.6703    generic_forget_inode
27533    1005974        1.8479  67.5183    vfs_dq_drop
24237    1030211        1.6267  69.1450    sock_attach_fd
19290    1049501        1.2947  70.4397    __copy_from_user_ll


New oprofile :
CPU: Core 2, speed 3000.11 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (Unhalted core cycles) count 100000
samples  cum. samples  %        cum. %     symbol name
148703   148703        10.8581  10.8581    inet_create
116680   265383         8.5198  19.3779    new_inode
108912   374295         7.9526  27.3306    init_file
82911    457206         6.0541  33.3846    kmem_cache_alloc
65690    522896         4.7966  38.1812    wake_up_inode
53286    576182         3.8909  42.0721    _atomic_dec_and_lock
43814    619996         3.1992  45.2713    generic_forget_inode
41993    661989         3.0663  48.3376    d_alloc
41244    703233         3.0116  51.3492    kmem_cache_free
39244    742477         2.8655  54.2148    tcp_v4_init_sock
37402    779879         2.7310  56.9458    tcp_close
33336    813215         2.4342  59.3800    sysenter_past_esp
28596    841811         2.0880  61.4680    inode_has_buffers
25769    867580         1.8816  63.3496    d_kill
22606    890186         1.6507  65.0003    dentry_iput
20224    910410         1.4767  66.4770    vfs_dq_drop
19800    930210         1.4458  67.9228    __copy_from_user_ll

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
---
 fs/anon_inodes.c       |    9 +--------
 fs/dcache.c            |   31 +++++++++++++++++++++++++++++++
 fs/pipe.c              |   10 +---------
 include/linux/dcache.h |    1 +
 net/socket.c           |   10 +---------
 5 files changed, 35 insertions(+), 26 deletions(-)

[-- Attachment #2: d_alloc_unhashed.patch --]
[-- Type: text/plain, Size: 4728 bytes --]

diff --git a/fs/anon_inodes.c b/fs/anon_inodes.c
index 3662dd4..9fd0515 100644
--- a/fs/anon_inodes.c
+++ b/fs/anon_inodes.c
@@ -71,7 +71,6 @@ static struct dentry_operations anon_inodefs_dentry_operations = {
 int anon_inode_getfd(const char *name, const struct file_operations *fops,
 		     void *priv, int flags)
 {
-	struct qstr this;
 	struct dentry *dentry;
 	struct file *file;
 	int error, fd;
@@ -89,10 +88,7 @@ int anon_inode_getfd(const char *name, const struct file_operations *fops,
 	 * using the inode sequence number.
 	 */
 	error = -ENOMEM;
-	this.name = name;
-	this.len = strlen(name);
-	this.hash = 0;
-	dentry = d_alloc(anon_inode_mnt->mnt_sb->s_root, &this);
+	dentry = d_alloc_unhashed(name, anon_inode_inode);
 	if (!dentry)
 		goto err_put_unused_fd;
 
@@ -104,9 +100,6 @@ int anon_inode_getfd(const char *name, const struct file_operations *fops,
 	atomic_inc(&anon_inode_inode->i_count);
 
 	dentry->d_op = &anon_inodefs_dentry_operations;
-	/* Do not publish this dentry inside the global dentry hash table */
-	dentry->d_flags &= ~DCACHE_UNHASHED;
-	d_instantiate(dentry, anon_inode_inode);
 
 	error = -ENFILE;
 	file = alloc_file(anon_inode_mnt, dentry,
diff --git a/fs/dcache.c b/fs/dcache.c
index a1d86c7..a5477fd 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -1111,6 +1111,37 @@ struct dentry * d_alloc_root(struct inode * root_inode)
 	return res;
 }
 
+/**
+ * d_alloc_unhashed - allocate unhashed dentry
+ * @inode: inode to allocate the dentry for
+ * @name: dentry name
+ *
+ * Allocate an unhashed dentry for the inode given. The inode is
+ * instantiated and returned. %NULL is returned if there is insufficient
+ * memory. Unhashed dentries have themselves as a parent.
+ */
+ 
+struct dentry * d_alloc_unhashed(const char *name, struct inode *inode)
+{
+	struct qstr q = { .name = name, .len = strlen(name) };
+	struct dentry *res;
+
+	res = d_alloc(NULL, &q);
+	if (res) {
+		res->d_sb = inode->i_sb;
+		res->d_parent = res;
+		/*
+		 * We dont want to push this dentry into global dentry hash table.
+		 * We pretend dentry is already hashed, by unsetting DCACHE_UNHASHED
+		 * This permits a working /proc/$pid/fd/XXX on sockets,pipes,anon
+		 */
+		res->d_flags &= ~DCACHE_UNHASHED;
+		res->d_flags |= DCACHE_DISCONNECTED;
+		d_instantiate(res, inode);
+	}
+	return res;
+}
+
 static inline struct hlist_head *d_hash(struct dentry *parent,
 					unsigned long hash)
 {
diff --git a/fs/pipe.c b/fs/pipe.c
index 7aea8b8..29fcac2 100644
--- a/fs/pipe.c
+++ b/fs/pipe.c
@@ -918,7 +918,6 @@ struct file *create_write_pipe(int flags)
 	struct inode *inode;
 	struct file *f;
 	struct dentry *dentry;
-	struct qstr name = { .name = "" };
 
 	err = -ENFILE;
 	inode = get_pipe_inode();
@@ -926,18 +925,11 @@ struct file *create_write_pipe(int flags)
 		goto err;
 
 	err = -ENOMEM;
-	dentry = d_alloc(pipe_mnt->mnt_sb->s_root, &name);
+	dentry = d_alloc_unhashed("", inode);
 	if (!dentry)
 		goto err_inode;
 
 	dentry->d_op = &pipefs_dentry_operations;
-	/*
-	 * We dont want to publish this dentry into global dentry hash table.
-	 * We pretend dentry is already hashed, by unsetting DCACHE_UNHASHED
-	 * This permits a working /proc/$pid/fd/XXX on pipes
-	 */
-	dentry->d_flags &= ~DCACHE_UNHASHED;
-	d_instantiate(dentry, inode);
 
 	err = -ENFILE;
 	f = alloc_file(pipe_mnt, dentry, FMODE_WRITE, &write_pipefifo_fops);
diff --git a/include/linux/dcache.h b/include/linux/dcache.h
index a37359d..12438d6 100644
--- a/include/linux/dcache.h
+++ b/include/linux/dcache.h
@@ -238,6 +238,7 @@ extern int d_invalidate(struct dentry *);
 
 /* only used at mount-time */
 extern struct dentry * d_alloc_root(struct inode *);
+extern struct dentry * d_alloc_unhashed(const char *, struct inode *);
 
 /* <clickety>-<click> the ramfs-type tree */
 extern void d_genocide(struct dentry *);
diff --git a/net/socket.c b/net/socket.c
index e9d65ea..b659b5d 100644
--- a/net/socket.c
+++ b/net/socket.c
@@ -371,20 +371,12 @@ static int sock_alloc_fd(struct file **filep, int flags)
 static int sock_attach_fd(struct socket *sock, struct file *file, int flags)
 {
 	struct dentry *dentry;
-	struct qstr name = { .name = "" };
 
-	dentry = d_alloc(sock_mnt->mnt_sb->s_root, &name);
+	dentry = d_alloc_unhashed("", SOCK_INODE(sock));
 	if (unlikely(!dentry))
 		return -ENOMEM;
 
 	dentry->d_op = &sockfs_dentry_operations;
-	/*
-	 * We dont want to push this dentry into global dentry hash table.
-	 * We pretend dentry is already hashed, by unsetting DCACHE_UNHASHED
-	 * This permits a working /proc/$pid/fd/XXX on sockets
-	 */
-	dentry->d_flags &= ~DCACHE_UNHASHED;
-	d_instantiate(dentry, SOCK_INODE(sock));
 
 	sock->file = file;
 	init_file(file, sock_mnt, dentry, FMODE_READ | FMODE_WRITE,

^ permalink raw reply related	[flat|nested] 37+ messages in thread

* Re: [PATCH] fs: pipe/sockets/anon dentries should have themselves as parent
       [not found]               ` <4926F6C5.9030108-fPLkHRcR87vqlBn2x/YWAg@public.gmane.org>
@ 2008-11-21 18:43                 ` Matthew Wilcox
  2008-11-23  3:53                   ` Eric Dumazet
  0 siblings, 1 reply; 37+ messages in thread
From: Matthew Wilcox @ 2008-11-21 18:43 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Christoph Hellwig, David Miller, mingo-X9Un+BFzKDI,
	cl-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, rjw-KKrjLPT3xs0,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	kernel-testers-u79uwXL29TY76Z2rM5mHXA, efault-Mmb7MZpHnFY,
	a.p.zijlstra-/NLkJaSkS4VmR6Xm/wNWPw, Linux Netdev List,
	viro-RmSDqhL/yNMiFSDQTTA3OLVCufUGDwFn,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA

On Fri, Nov 21, 2008 at 06:58:29PM +0100, Eric Dumazet wrote:
> +/**
> + * d_alloc_unhashed - allocate unhashed dentry
> + * @inode: inode to allocate the dentry for
> + * @name: dentry name

It's normal to list the parameters in the order they're passed to the
function.  Not sure if we have a tool that checks for this or not --
Randy?

> + *
> + * Allocate an unhashed dentry for the inode given. The inode is
> + * instantiated and returned. %NULL is returned if there is insufficient
> + * memory. Unhashed dentries have themselves as a parent.
> + */
> + 
> +struct dentry * d_alloc_unhashed(const char *name, struct inode *inode)
> +{
> +	struct qstr q = { .name = name, .len = strlen(name) };
> +	struct dentry *res;
> +
> +	res = d_alloc(NULL, &q);
> +	if (res) {
> +		res->d_sb = inode->i_sb;
> +		res->d_parent = res;
> +		/*
> +		 * We dont want to push this dentry into global dentry hash table.
> +		 * We pretend dentry is already hashed, by unsetting DCACHE_UNHASHED
> +		 * This permits a working /proc/$pid/fd/XXX on sockets,pipes,anon
> +		 */

Line length ... as checkpatch would have warned you ;-)

And there are several other grammatical nitpicks with this comment.  Try
this:

		/*
		 * We don't want to put this dentry in the global dentry
		 * hash table, so we pretend the dentry is already hashed
		 * by unsetting DCACHE_UNHASHED.  This permits 
		 * /proc/$pid/fd/XXX t work for sockets, pipes and
		 * anonymous files (signalfd, timerfd, etc).
		 */

> +		res->d_flags &= ~DCACHE_UNHASHED;
> +		res->d_flags |= DCACHE_DISCONNECTED;

Is this really better than:

		res->d_flags = res->d_flags & ~DCACHE_UNHASHED |
						DCACHE_DISCONNECTED;

Anyway, nice cleanup.

-- 
Matthew Wilcox				Intel Open Source Technology Centre
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step."

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH] fs: pipe/sockets/anon dentries should have themselves as parent
  2008-11-21 18:43                 ` Matthew Wilcox
@ 2008-11-23  3:53                   ` Eric Dumazet
  0 siblings, 0 replies; 37+ messages in thread
From: Eric Dumazet @ 2008-11-23  3:53 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Christoph Hellwig, David Miller, mingo, cl, rjw, linux-kernel,
	kernel-testers, efault, a.p.zijlstra, Linux Netdev List, viro,
	linux-fsdevel

[-- Attachment #1: Type: text/plain, Size: 5733 bytes --]

Matthew Wilcox a écrit :
> On Fri, Nov 21, 2008 at 06:58:29PM +0100, Eric Dumazet wrote:
>> +/**
>> + * d_alloc_unhashed - allocate unhashed dentry
>> + * @inode: inode to allocate the dentry for
>> + * @name: dentry name
> 
> It's normal to list the parameters in the order they're passed to the
> function.  Not sure if we have a tool that checks for this or not --
> Randy?

Yes, no problem, better to have the same order.

> 
>> + *
>> + * Allocate an unhashed dentry for the inode given. The inode is
>> + * instantiated and returned. %NULL is returned if there is insufficient
>> + * memory. Unhashed dentries have themselves as a parent.
>> + */
>> + 
>> +struct dentry * d_alloc_unhashed(const char *name, struct inode *inode)
>> +{
>> +	struct qstr q = { .name = name, .len = strlen(name) };
>> +	struct dentry *res;
>> +
>> +	res = d_alloc(NULL, &q);
>> +	if (res) {
>> +		res->d_sb = inode->i_sb;
>> +		res->d_parent = res;
>> +		/*
>> +		 * We dont want to push this dentry into global dentry hash table.
>> +		 * We pretend dentry is already hashed, by unsetting DCACHE_UNHASHED
>> +		 * This permits a working /proc/$pid/fd/XXX on sockets,pipes,anon
>> +		 */
> 
> Line length ... as checkpatch would have warned you ;-)
> 
> And there are several other grammatical nitpicks with this comment.  Try
> this:
> 
> 		/*
> 		 * We don't want to put this dentry in the global dentry
> 		 * hash table, so we pretend the dentry is already hashed
> 		 * by unsetting DCACHE_UNHASHED.  This permits 
> 		 * /proc/$pid/fd/XXX t work for sockets, pipes and
> 		 * anonymous files (signalfd, timerfd, etc).
> 		 */

Yes, this is better.

> 
>> +		res->d_flags &= ~DCACHE_UNHASHED;
>> +		res->d_flags |= DCACHE_DISCONNECTED;
> 
> Is this really better than:
> 
> 		res->d_flags = res->d_flags & ~DCACHE_UNHASHED |
> 						DCACHE_DISCONNECTED;

Well, I personally prefer the two lines, intention is more readable :)

> 
> Anyway, nice cleanup.
> 

Thanks Matthew, here is an updated version of the patch.

[PATCH] fs: pipe/sockets/anon dentries should have themselves as parent


Linking pipe/sockets/anon dentries to one root 'parent' has no functional
impact at all, but a scalability one.

We can avoid touching a cache line at allocation stage (inside d_alloc(), no need
to touch root->d_count), but also at freeing time (in d_kill, decrementing d_count)
We avoid an expensive atomic_dec_and_lock() call on the root dentry.

We add d_alloc_unhashed(const char *name, struct inode *inode) helper
to be used by pipes/socket/anon. This function is about the same as
d_alloc_root() but for unhashed entries.

Before patch, time to run 8 *  1 million of close(socket()) calls on 8 CPUS was :

real    0m27.496s
user    0m0.657s
sys     3m39.092s

After patch :

real    0m23.843s
user    0m0.616s
sys     3m9.732s


Old oprofile :
CPU: Core 2, speed 3000.11 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (Unhalted core cycles) count 100000
samples  cum. samples  %        cum. %     symbol name
164257   164257        11.0245  11.0245    init_file
155488   319745        10.4359  21.4604    d_alloc
151887   471632        10.1942  31.6547    _atomic_dec_and_lock
91620    563252         6.1493  37.8039    inet_create
74245    637497         4.9831  42.7871    kmem_cache_alloc
46702    684199         3.1345  45.9216    dentry_iput
46186    730385         3.0999  49.0215    tcp_close
42824    773209         2.8742  51.8957    kmem_cache_free
37275    810484         2.5018  54.3975    wake_up_inode
36553    847037         2.4533  56.8508    tcp_v4_init_sock
35661    882698         2.3935  59.2443    inotify_d_instantiate
32998    915696         2.2147  61.4590    sysenter_past_esp
31442    947138         2.1103  63.5693    d_instantiate
31303    978441         2.1010  65.6703    generic_forget_inode
27533    1005974        1.8479  67.5183    vfs_dq_drop
24237    1030211        1.6267  69.1450    sock_attach_fd
19290    1049501        1.2947  70.4397    __copy_from_user_ll


New oprofile :
CPU: Core 2, speed 3000.11 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (Unhalted core cycles) count 100000
samples  cum. samples  %        cum. %     symbol name
148703   148703        10.8581  10.8581    inet_create
116680   265383         8.5198  19.3779    new_inode
108912   374295         7.9526  27.3306    init_file
82911    457206         6.0541  33.3846    kmem_cache_alloc
65690    522896         4.7966  38.1812    wake_up_inode
53286    576182         3.8909  42.0721    _atomic_dec_and_lock
43814    619996         3.1992  45.2713    generic_forget_inode
41993    661989         3.0663  48.3376    d_alloc
41244    703233         3.0116  51.3492    kmem_cache_free
39244    742477         2.8655  54.2148    tcp_v4_init_sock
37402    779879         2.7310  56.9458    tcp_close
33336    813215         2.4342  59.3800    sysenter_past_esp
28596    841811         2.0880  61.4680    inode_has_buffers
25769    867580         1.8816  63.3496    d_kill
22606    890186         1.6507  65.0003    dentry_iput
20224    910410         1.4767  66.4770    vfs_dq_drop
19800    930210         1.4458  67.9228    __copy_from_user_ll

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
---
 fs/anon_inodes.c       |    9 +--------
 fs/dcache.c            |   33 +++++++++++++++++++++++++++++++++
 fs/pipe.c              |   10 +---------
 include/linux/dcache.h |    1 +
 net/socket.c           |   10 +---------
 5 files changed, 37 insertions(+), 26 deletions(-)

[-- Attachment #2: d_alloc_unhashed2.patch --]
[-- Type: text/plain, Size: 4788 bytes --]

diff --git a/fs/anon_inodes.c b/fs/anon_inodes.c
index 3662dd4..9fd0515 100644
--- a/fs/anon_inodes.c
+++ b/fs/anon_inodes.c
@@ -71,7 +71,6 @@ static struct dentry_operations anon_inodefs_dentry_operations = {
 int anon_inode_getfd(const char *name, const struct file_operations *fops,
 		     void *priv, int flags)
 {
-	struct qstr this;
 	struct dentry *dentry;
 	struct file *file;
 	int error, fd;
@@ -89,10 +88,7 @@ int anon_inode_getfd(const char *name, const struct file_operations *fops,
 	 * using the inode sequence number.
 	 */
 	error = -ENOMEM;
-	this.name = name;
-	this.len = strlen(name);
-	this.hash = 0;
-	dentry = d_alloc(anon_inode_mnt->mnt_sb->s_root, &this);
+	dentry = d_alloc_unhashed(name, anon_inode_inode);
 	if (!dentry)
 		goto err_put_unused_fd;
 
@@ -104,9 +100,6 @@ int anon_inode_getfd(const char *name, const struct file_operations *fops,
 	atomic_inc(&anon_inode_inode->i_count);
 
 	dentry->d_op = &anon_inodefs_dentry_operations;
-	/* Do not publish this dentry inside the global dentry hash table */
-	dentry->d_flags &= ~DCACHE_UNHASHED;
-	d_instantiate(dentry, anon_inode_inode);
 
 	error = -ENFILE;
 	file = alloc_file(anon_inode_mnt, dentry,
diff --git a/fs/dcache.c b/fs/dcache.c
index a1d86c7..43ef88d 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -1111,6 +1111,39 @@ struct dentry * d_alloc_root(struct inode * root_inode)
 	return res;
 }
 
+/**
+ * d_alloc_unhashed - allocate unhashed dentry
+ * @name: dentry name
+ * @inode: inode to allocate the dentry for
+ *
+ * Allocate an unhashed dentry for the inode given. The inode is
+ * instantiated and returned. %NULL is returned if there is insufficient
+ * memory. Unhashed dentries have themselves as a parent.
+ */
+ 
+struct dentry * d_alloc_unhashed(const char *name, struct inode *inode)
+{
+	struct qstr q = { .name = name, .len = strlen(name) };
+	struct dentry *res;
+
+	res = d_alloc(NULL, &q);
+	if (res) {
+		res->d_sb = inode->i_sb;
+		res->d_parent = res;
+		/*
+		 * We dont want to push this dentry into global dentry
+		 * hash table, so we pretend the dentry is already hashed
+		 * by unsetting DCACHE_UNHASHED. This permits
+		 * /proc/$pid/fd/XXX to work for sockets, pipes, and
+		 * anonymous files (signalfd, timerfd, ...)
+		 */
+		res->d_flags &= ~DCACHE_UNHASHED;
+		res->d_flags |= DCACHE_DISCONNECTED;
+		d_instantiate(res, inode);
+	}
+	return res;
+}
+
 static inline struct hlist_head *d_hash(struct dentry *parent,
 					unsigned long hash)
 {
diff --git a/fs/pipe.c b/fs/pipe.c
index 7aea8b8..29fcac2 100644
--- a/fs/pipe.c
+++ b/fs/pipe.c
@@ -918,7 +918,6 @@ struct file *create_write_pipe(int flags)
 	struct inode *inode;
 	struct file *f;
 	struct dentry *dentry;
-	struct qstr name = { .name = "" };
 
 	err = -ENFILE;
 	inode = get_pipe_inode();
@@ -926,18 +925,11 @@ struct file *create_write_pipe(int flags)
 		goto err;
 
 	err = -ENOMEM;
-	dentry = d_alloc(pipe_mnt->mnt_sb->s_root, &name);
+	dentry = d_alloc_unhashed("", inode);
 	if (!dentry)
 		goto err_inode;
 
 	dentry->d_op = &pipefs_dentry_operations;
-	/*
-	 * We dont want to publish this dentry into global dentry hash table.
-	 * We pretend dentry is already hashed, by unsetting DCACHE_UNHASHED
-	 * This permits a working /proc/$pid/fd/XXX on pipes
-	 */
-	dentry->d_flags &= ~DCACHE_UNHASHED;
-	d_instantiate(dentry, inode);
 
 	err = -ENFILE;
 	f = alloc_file(pipe_mnt, dentry, FMODE_WRITE, &write_pipefifo_fops);
diff --git a/include/linux/dcache.h b/include/linux/dcache.h
index a37359d..12438d6 100644
--- a/include/linux/dcache.h
+++ b/include/linux/dcache.h
@@ -238,6 +238,7 @@ extern int d_invalidate(struct dentry *);
 
 /* only used at mount-time */
 extern struct dentry * d_alloc_root(struct inode *);
+extern struct dentry * d_alloc_unhashed(const char *, struct inode *);
 
 /* <clickety>-<click> the ramfs-type tree */
 extern void d_genocide(struct dentry *);
diff --git a/net/socket.c b/net/socket.c
index e9d65ea..b659b5d 100644
--- a/net/socket.c
+++ b/net/socket.c
@@ -371,20 +371,12 @@ static int sock_alloc_fd(struct file **filep, int flags)
 static int sock_attach_fd(struct socket *sock, struct file *file, int flags)
 {
 	struct dentry *dentry;
-	struct qstr name = { .name = "" };
 
-	dentry = d_alloc(sock_mnt->mnt_sb->s_root, &name);
+	dentry = d_alloc_unhashed("", SOCK_INODE(sock));
 	if (unlikely(!dentry))
 		return -ENOMEM;
 
 	dentry->d_op = &sockfs_dentry_operations;
-	/*
-	 * We dont want to push this dentry into global dentry hash table.
-	 * We pretend dentry is already hashed, by unsetting DCACHE_UNHASHED
-	 * This permits a working /proc/$pid/fd/XXX on sockets
-	 */
-	dentry->d_flags &= ~DCACHE_UNHASHED;
-	d_instantiate(dentry, SOCK_INODE(sock));
 
 	sock->file = file;
 	init_file(file, sock_mnt, dentry, FMODE_READ | FMODE_WRITE,

^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [PATCH v2 0/5] fs: Scalability of sockets/pipes allocation/deallocation on SMP
       [not found]                 ` <492DDB6A.8090806@cosmosbay.com>
@ 2008-11-29  8:43                   ` Eric Dumazet
  2008-12-11 22:38                     ` [PATCH v3 0/7] " Eric Dumazet
                                       ` (7 more replies)
       [not found]                   ` <492DDB6A.8090806-fPLkHRcR87vqlBn2x/YWAg@public.gmane.org>
  2008-11-29  8:44                   ` [PATCH v2 3/5] fs: Introduce a per_cpu last_ino allocator Eric Dumazet
  2 siblings, 8 replies; 37+ messages in thread
From: Eric Dumazet @ 2008-11-29  8:43 UTC (permalink / raw)
  To: Ingo Molnar, Christoph Hellwig
  Cc: David Miller, Rafael J. Wysocki, linux-kernel,
	kernel-testers@vger.kernel.org >> Kernel Testers List,
	Mike Galbraith, Peter Zijlstra, Linux Netdev List,
	Christoph Lameter, linux-fsdevel, Al Viro

Hi all

Short summary : Nice speedups for allocation/deallocation of sockets/pipes
(From 27.5 seconds to 2.9 seconds (2.3 seconds with SLUB tweaks))

Long version :

For this second version, I removed the mntput()/mntget() optimization
since most reviewers are not convinced it is usefull.
This is a four lines patch that can be reconsidered later.

I chose the name SINGLE instead of SPECIAL to name
isolated dentries (for sockets, pipes, anonymous fd) that
have no parent and no relationship in the vfs.

Thanks all

To allocate a socket or a pipe we :

0) Do the usual file table manipulation (pretty scalable these days,
but would be faster if 'struct files' were using SLAB_DESTROY_BY_RCU
and avoid call_rcu() cache killer)

1) allocate an inode with new_inode()
This function :
- locks inode_lock,
- dirties nr_inodes counter
- dirties inode_in_use list  (for sockets/pipes, this is useless)
- dirties superblock s_inodes.  - dirties last_ino counter
All these are in different cache lines unfortunatly.

2) allocate a dentry
d_alloc() takes dcache_lock,
insert dentry on its parent list (dirtying sock_mnt->mnt_sb->s_root)
dirties nr_dentry

3) d_instantiate() dentry  (dcache_lock taken again)

4) init_file() -> atomic_inc() on sock_mnt->refcount


At close() time, we must undo the things. Its even more expensive because
of the _atomic_dec_and_lock() that stress a lot, and because of two cache
lines that are touched when an element is deleted from a list
(previous and next items)

This is really bad, since sockets/pipes dont need to be visible in dcache
or an inode list per super block.

This patch series get rid of all but one contended cache lines for
sockets, pipes and anonymous fd  (signalfd, timerfd, ...)

Sample program :

for (i = 0; i < 1000000; i++)
	close(socket(AF_INET, SOCK_STREAM, 0));

Cost if one cpu runs the program :

real    1.561s
user    0.092s
sys     1.469s

Cost if 8 processes are launched on a 8 CPU machine
(benchmark named socket8) :

real    27.496s   <<<< !!!! >>>>
user    0.657s
sys     3m39.092s

Oprofile results (for the 8 process run, 3 times):

CPU: Core 2, speed 3000.03 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit
mask of 0x00 (Unhalted core cycles) count 100000
samples  cum. samples  %        cum. %     symbol name
3347352  3347352       28.0232  28.0232    _atomic_dec_and_lock
3301428  6648780       27.6388  55.6620    d_instantiate
2971130  9619910       24.8736  80.5355    d_alloc
241318   9861228        2.0203  82.5558    init_file
146190   10007418       1.2239  83.7797    __slab_free
144149   10151567       1.2068  84.9864    inotify_d_instantiate
143971   10295538       1.2053  86.1917    inet_create
137168   10432706       1.1483  87.3401    new_inode
117549   10550255       0.9841  88.3242    add_partial
110795   10661050       0.9275  89.2517    generic_drop_inode
107137   10768187       0.8969  90.1486    kmem_cache_alloc
94029    10862216       0.7872  90.9358    tcp_close
82837    10945053       0.6935  91.6293    dput
67486    11012539       0.5650  92.1943    dentry_iput
57751    11070290       0.4835  92.6778    iput
54327    11124617       0.4548  93.1326    tcp_v4_init_sock
49921    11174538       0.4179  93.5505    sysenter_past_esp
47616    11222154       0.3986  93.9491    kmem_cache_free
30792    11252946       0.2578  94.2069    clear_inode
27540    11280486       0.2306  94.4375    copy_from_user
26509    11306995       0.2219  94.6594    init_timer
26363    11333358       0.2207  94.8801    discard_slab
25284    11358642       0.2117  95.0918    __fput
22482    11381124       0.1882  95.2800    __percpu_counter_add
20369    11401493       0.1705  95.4505    sock_alloc
18501    11419994       0.1549  95.6054    inet_csk_destroy_sock
17923    11437917       0.1500  95.7555    sys_close


This patch serie avoids all contented cache lines and makes this "bench"
pretty fast.


New cost if run on one cpu :

real    1.325s   (instead of 1.561s)
user    0.091s
sys     1.234s


If run on 8 CPUS :

real    0m2.971s
user    0m0.726s
sys     0m21.310s

CPU: Core 2, speed 3000.04 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (Unhalted core cycles) count 100
000
samples  cum. samples  %        cum. %     symbol name
189772   189772        12.7205  12.7205    _atomic_dec_and_lock
140467   330239         9.4155  22.1360    __slab_free
128210   458449         8.5940  30.7300    add_partial
121578   580027         8.1494  38.8794    kmem_cache_alloc
72626    652653         4.8681  43.7475    init_file
62720    715373         4.2041  47.9517    __percpu_counter_add
51632    767005         3.4609  51.4126    sysenter_past_esp
49196    816201         3.2976  54.7102    tcp_close
47933    864134         3.2130  57.9231    kmem_cache_free
29628    893762         1.9860  59.9091    copy_from_user
28443    922205         1.9065  61.8157    init_timer
25602    947807         1.7161  63.5318    __slab_alloc
22139    969946         1.4840  65.0158    discard_slab
20428    990374         1.3693  66.3851    __call_rcu
18174    1008548        1.2182  67.6033    alloc_fd
17643    1026191        1.1826  68.7859    __fput
17374    1043565        1.1646  69.9505    d_alloc
17196    1060761        1.1527  71.1031    sys_close
17024    1077785        1.1411  72.2442    inet_create
15208    1092993        1.0194  73.2636    alloc_inode
12201    1105194        0.8178  74.0815    fd_install
12167    1117361        0.8156  74.8970    lock_sock_nested
12123    1129484        0.8126  75.7096    get_empty_filp
11648    1141132        0.7808  76.4904    release_sock
11509    1152641        0.7715  77.2619    dput
11335    1163976        0.7598  78.0216    sock_init_data
11038    1175014        0.7399  78.7615    inet_csk_destroy_sock
10880    1185894        0.7293  79.4908    drop_file_write_access
10083    1195977        0.6759  80.1667    inotify_d_instantiate
9216     1205193        0.6178  80.7844    local_bh_enable_ip
8881     1214074        0.5953  81.3797    sysenter_do_call
8759     1222833        0.5871  81.9668    setup_object
8489     1231322        0.5690  82.5359    iput_single

So we now hit mntput()/mntget() and SLUB.

The last point is about SLUB being hit hard, unless we
use slub_min_order=3 (or slub_min_objects=45) at boot,
or we use Christoph Lameter patch (struct file RCU optimizations)
http://thread.gmane.org/gmane.linux.kernel/418615

If we boot machine with slub_min_order=3, SLUB overhead disappears.

If run on 8 CPUS :

real    0m2.315s
user    0m0.752s
sys     0m17.324s

CPU: Core 2, speed 3000.15 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit
mask of 0x00 (Unhalted core cycles) count 100000
samples  cum. samples  %        cum. %     symbol name
199409   199409        15.6440  15.6440    _atomic_dec_and_lock    (mntput())
141606   341015        11.1092  26.7532    kmem_cache_alloc
76071    417086         5.9679  32.7211    init_file
70595    487681         5.5383  38.2595    __percpu_counter_add
51595    539276         4.0477  42.3072    sysenter_past_esp
49313    588589         3.8687  46.1759    tcp_close
45503    634092         3.5698  49.7457    kmem_cache_free
41413    675505         3.2489  52.9946    __slab_free
29911    705416         2.3466  55.3412    copy_from_user
28979    734395         2.2735  57.6146    init_timer
22251    756646         1.7456  59.3602    get_empty_filp
19942    776588         1.5645  60.9247    __call_rcu
18348    794936         1.4394  62.3642    __fput
18328    813264         1.4379  63.8020    alloc_fd
17395    830659         1.3647  65.1667    sys_close
17301    847960         1.3573  66.5240    d_alloc
16570    864530         1.2999  67.8239    inet_create
15522    880052         1.2177  69.0417    alloc_inode
13185    893237         1.0344  70.0761    setup_object
12359    905596         0.9696  71.0456    fd_install
12275    917871         0.9630  72.0086    lock_sock_nested
11924    929795         0.9355  72.9441    release_sock
11790    941585         0.9249  73.8690    sock_init_data
11310    952895         0.8873  74.7563    dput
10924    963819         0.8570  75.6133    drop_file_write_access
10903    974722         0.8554  76.4687    inet_csk_destroy_sock
10184    984906         0.7990  77.2676    inotify_d_instantiate
9372     994278         0.7353  78.0029    local_bh_enable_ip
8901     1003179        0.6983  78.7012    sysenter_do_call
8569     1011748        0.6723  79.3735    iput_single
8194     1019942        0.6428  80.0163    inet_release


This patch serie contains 5 patches, against net-next-2.6 tree
(because this tree already contains network improvement on this
subject, but should apply on other trees)

[PATCH 1/5] fs: Use a percpu_counter to track nr_dentry

Adding a percpu_counter nr_dentry avoids cache line ping pongs
between cpus to maintain this metric, and dcache_lock is
no more needed to protect dentry_stat.nr_dentry

We centralize nr_dentry updates at the right place :
- increments in d_alloc()
- decrements in d_free()

d_alloc() can avoid taking dcache_lock if parent is NULL

(socket8 bench result : 27.5s to 25s)

[PATCH 2/5] fs: Use a percpu_counter to track nr_inodes

Avoids cache line ping pongs between cpus and prepare next patch,
because updates of nr_inodes dont need inode_lock anymore.

(socket8 bench result : no difference at this point)

[PATCH 3/5] fs: Introduce a per_cpu last_ino allocator

new_inode() dirties a contended cache line to get increasing
inode numbers.

Solve this problem by providing to each cpu a per_cpu variable,
feeded by the shared last_ino, but once every 1024 allocations.

This reduce contention on the shared last_ino, and give same
spreading ino numbers than before.
(same wraparound after 2^32 allocations)

(socket8 bench result : no difference)


[PATCH 4/5] fs: Introduce SINGLE dentries for pipes, socket, anon fd

Sockets, pipes and anonymous fds have interesting properties.

Like other files, they use a dentry and an inode.

But dentries for these kind of files are not hashed into dcache,
since there is no way someone can lookup such a file in the vfs tree.
(/proc/{pid}/fd/{number} uses a different mechanism)

Still, allocating and freeing such dentries are expensive processes,
because we currently take dcache_lock inside d_alloc(), d_instantiate(),
and dput(). This lock is very contended on SMP machines.

This patch defines a new DCACHE_SINGLE flag, to mark a dentry as
a single one (for sockets, pipes, anonymous fd), and a new
d_alloc_single(const struct qstr *name, struct inode *inode)
method, called by the three subsystems.

Internally, dput() can take a fast path to dput_single() for
SINGLE dentries. No more atomic_dec_and_lock()
for such dentries.


Differences betwen an SINGLE dentry and a normal one are :

1) SINGLE dentry has the DCACHE_SINGLE flag
2) SINGLE dentry's parent is itself (DCACHE_DISCONNECTED)
This to avoid taking a reference on sb 'root' dentry, shared
by too many dentries.
3) They are not hashed into global hash table (DCACHE_UNHASHED)
4) Their d_alias list is empty

(socket8 bench result : from 25s to 19.9s)

[PATCH 5/5] fs: new_inode_single() and iput_single()

Goal of this patch is to not touch inode_lock for socket/pipes/anonfd
inodes allocation/freeing.

SINGLE dentries are attached to inodes that dont need to be linked
in a list of inodes, being "inode_in_use" or "sb->s_inodes"
As inode_lock was taken only to protect these lists, we avoid 
taking it as well.

Using iput_single() from dput_single() avoids taking inode_lock
at freeing time.

This patch has a very noticeable effect, because we avoid dirtying of 
three contended cache lines in new_inode(), and five cache lines
in iput()

(socket8 bench result : from 19.9s to 2.3s)


Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
---
Overall diffstat :

 fs/anon_inodes.c       |   18 ------
 fs/dcache.c            |  100 ++++++++++++++++++++++++++++++--------
 fs/fs-writeback.c      |    2
 fs/inode.c             |  101 +++++++++++++++++++++++++++++++--------
 fs/pipe.c              |   25 +--------
 include/linux/dcache.h |    9 +++
 include/linux/fs.h     |   17 ++++++
 kernel/sysctl.c        |    6 +-
 mm/page-writeback.c    |    2
 net/socket.c           |   26 +---------
 10 files changed, 200 insertions(+), 106 deletions(-)


^ permalink raw reply	[flat|nested] 37+ messages in thread

* [PATCH v2 1/5] fs: Use a percpu_counter to track nr_dentry
       [not found]                   ` <492DDB6A.8090806-fPLkHRcR87vqlBn2x/YWAg@public.gmane.org>
@ 2008-11-29  8:43                     ` Eric Dumazet
  2008-11-29  8:43                     ` [PATCH v2 2/5] fs: Use a percpu_counter to track nr_inodes Eric Dumazet
                                       ` (2 subsequent siblings)
  3 siblings, 0 replies; 37+ messages in thread
From: Eric Dumazet @ 2008-11-29  8:43 UTC (permalink / raw)
  To: Ingo Molnar, Christoph Hellwig
  Cc: David Miller, Rafael J. Wysocki,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	kernel-testers-u79uwXL29TY76Z2rM5mHXA@public.gmane.org >> Kernel Testers List,
	Mike Galbraith, Peter Zijlstra, Linux Netdev List,
	Christoph Lameter, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Al Viro

[-- Attachment #1: Type: text/plain, Size: 632 bytes --]

Adding a percpu_counter nr_dentry avoids cache line ping pongs
between cpus to maintain this metric, and dcache_lock is
no more needed to protect dentry_stat.nr_dentry

We centralize nr_dentry updates at the right place :
- increments in d_alloc()
- decrements in d_free()

d_alloc() can avoid taking dcache_lock if parent is NULL

(socket8 bench result : 27.5s to 25s)

Signed-off-by: Eric Dumazet <dada1-fPLkHRcR87vqlBn2x/YWAg@public.gmane.org>
---
 fs/dcache.c        |   49 +++++++++++++++++++++++++------------------
 include/linux/fs.h |    2 +
 kernel/sysctl.c    |    2 -
 3 files changed, 32 insertions(+), 21 deletions(-)

[-- Attachment #2: nr_dentry.patch --]
[-- Type: text/plain, Size: 4891 bytes --]

diff --git a/fs/dcache.c b/fs/dcache.c
index a1d86c7..46d5d1e 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -61,12 +61,31 @@ static struct kmem_cache *dentry_cache __read_mostly;
 static unsigned int d_hash_mask __read_mostly;
 static unsigned int d_hash_shift __read_mostly;
 static struct hlist_head *dentry_hashtable __read_mostly;
+static struct percpu_counter nr_dentry;
 
 /* Statistics gathering. */
 struct dentry_stat_t dentry_stat = {
 	.age_limit = 45,
 };
 
+/*
+ * Handle nr_dentry sysctl
+ */
+#if defined(CONFIG_SYSCTL) && defined(CONFIG_PROC_FS)
+int proc_nr_dentry(ctl_table *table, int write, struct file *filp,
+		   void __user *buffer, size_t *lenp, loff_t *ppos)
+{
+	dentry_stat.nr_dentry = percpu_counter_sum_positive(&nr_dentry);
+	return proc_dointvec(table, write, filp, buffer, lenp, ppos);
+}
+#else
+int proc_nr_dentry(ctl_table *table, int write, struct file *filp,
+		   void __user *buffer, size_t *lenp, loff_t *ppos)
+{
+	return -ENOSYS;
+}
+#endif
+
 static void __d_free(struct dentry *dentry)
 {
 	WARN_ON(!list_empty(&dentry->d_alias));
@@ -82,8 +101,7 @@ static void d_callback(struct rcu_head *head)
 }
 
 /*
- * no dcache_lock, please.  The caller must decrement dentry_stat.nr_dentry
- * inside dcache_lock.
+ * no dcache_lock, please.
  */
 static void d_free(struct dentry *dentry)
 {
@@ -94,6 +112,7 @@ static void d_free(struct dentry *dentry)
 		__d_free(dentry);
 	else
 		call_rcu(&dentry->d_u.d_rcu, d_callback);
+	percpu_counter_dec(&nr_dentry);
 }
 
 /*
@@ -172,7 +191,6 @@ static struct dentry *d_kill(struct dentry *dentry)
 	struct dentry *parent;
 
 	list_del(&dentry->d_u.d_child);
-	dentry_stat.nr_dentry--;	/* For d_free, below */
 	/*drops the locks, at that point nobody can reach this dentry */
 	dentry_iput(dentry);
 	if (IS_ROOT(dentry))
@@ -619,7 +637,6 @@ void shrink_dcache_sb(struct super_block * sb)
 static void shrink_dcache_for_umount_subtree(struct dentry *dentry)
 {
 	struct dentry *parent;
-	unsigned detached = 0;
 
 	BUG_ON(!IS_ROOT(dentry));
 
@@ -678,7 +695,6 @@ static void shrink_dcache_for_umount_subtree(struct dentry *dentry)
 			}
 
 			list_del(&dentry->d_u.d_child);
-			detached++;
 
 			inode = dentry->d_inode;
 			if (inode) {
@@ -696,7 +712,7 @@ static void shrink_dcache_for_umount_subtree(struct dentry *dentry)
 			 * otherwise we ascend to the parent and move to the
 			 * next sibling if there is one */
 			if (!parent)
-				goto out;
+				return;
 
 			dentry = parent;
 
@@ -705,11 +721,6 @@ static void shrink_dcache_for_umount_subtree(struct dentry *dentry)
 		dentry = list_entry(dentry->d_subdirs.next,
 				    struct dentry, d_u.d_child);
 	}
-out:
-	/* several dentries were freed, need to correct nr_dentry */
-	spin_lock(&dcache_lock);
-	dentry_stat.nr_dentry -= detached;
-	spin_unlock(&dcache_lock);
 }
 
 /*
@@ -943,8 +954,6 @@ struct dentry *d_alloc(struct dentry * parent, const struct qstr *name)
 	dentry->d_flags = DCACHE_UNHASHED;
 	spin_lock_init(&dentry->d_lock);
 	dentry->d_inode = NULL;
-	dentry->d_parent = NULL;
-	dentry->d_sb = NULL;
 	dentry->d_op = NULL;
 	dentry->d_fsdata = NULL;
 	dentry->d_mounted = 0;
@@ -959,16 +968,15 @@ struct dentry *d_alloc(struct dentry * parent, const struct qstr *name)
 	if (parent) {
 		dentry->d_parent = dget(parent);
 		dentry->d_sb = parent->d_sb;
+		spin_lock(&dcache_lock);
+		list_add(&dentry->d_u.d_child, &parent->d_subdirs);
+		spin_unlock(&dcache_lock);
 	} else {
+		dentry->d_parent = NULL;
+		dentry->d_sb = NULL;
 		INIT_LIST_HEAD(&dentry->d_u.d_child);
 	}
-
-	spin_lock(&dcache_lock);
-	if (parent)
-		list_add(&dentry->d_u.d_child, &parent->d_subdirs);
-	dentry_stat.nr_dentry++;
-	spin_unlock(&dcache_lock);
-
+	percpu_counter_inc(&nr_dentry);
 	return dentry;
 }
 
@@ -2282,6 +2290,7 @@ static void __init dcache_init(void)
 {
 	int loop;
 
+	percpu_counter_init(&nr_dentry, 0);
 	/* 
 	 * A constructor could be added for stable state like the lists,
 	 * but it is probably not worth it because of the cache nature
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 0dcdd94..c5e7aa5 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2216,6 +2216,8 @@ static inline void free_secdata(void *secdata)
 struct ctl_table;
 int proc_nr_files(struct ctl_table *table, int write, struct file *filp,
 		  void __user *buffer, size_t *lenp, loff_t *ppos);
+int proc_nr_dentry(struct ctl_table *table, int write, struct file *filp,
+		   void __user *buffer, size_t *lenp, loff_t *ppos);
 
 int get_filesystem_list(char * buf);
 
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 9d048fa..eebddef 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -1243,7 +1243,7 @@ static struct ctl_table fs_table[] = {
 		.data		= &dentry_stat,
 		.maxlen		= 6*sizeof(int),
 		.mode		= 0444,
-		.proc_handler	= &proc_dointvec,
+		.proc_handler	= &proc_nr_dentry,
 	},
 	{
 		.ctl_name	= FS_OVERFLOWUID,

^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [PATCH v2 2/5] fs: Use a percpu_counter to track nr_inodes
       [not found]                   ` <492DDB6A.8090806-fPLkHRcR87vqlBn2x/YWAg@public.gmane.org>
  2008-11-29  8:43                     ` [PATCH v2 1/5] fs: Use a percpu_counter to track nr_dentry Eric Dumazet
@ 2008-11-29  8:43                     ` Eric Dumazet
  2008-11-29  8:44                     ` [PATCH v2 4/5] fs: Introduce SINGLE dentries for pipes, socket, anon fd Eric Dumazet
  2008-11-29  8:45                     ` [PATCH v2 5/5] fs: new_inode_single() and iput_single() Eric Dumazet
  3 siblings, 0 replies; 37+ messages in thread
From: Eric Dumazet @ 2008-11-29  8:43 UTC (permalink / raw)
  To: Ingo Molnar, Christoph Hellwig
  Cc: David Miller, Rafael J. Wysocki,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	kernel-testers-u79uwXL29TY76Z2rM5mHXA@public.gmane.org >> Kernel Testers List,
	Mike Galbraith, Peter Zijlstra, Linux Netdev List,
	Christoph Lameter, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Al Viro

[-- Attachment #1: Type: text/plain, Size: 507 bytes --]

Avoids cache line ping pongs between cpus and prepare next patch,
because updates of nr_inodes dont need inode_lock anymore.

(socket8 bench result : no difference at this point)

Signed-off-by: Eric Dumazet <dada1-fPLkHRcR87vqlBn2x/YWAg@public.gmane.org>
---
 fs/fs-writeback.c   |    2 +-
 fs/inode.c          |   39 +++++++++++++++++++++++++++++++--------
 include/linux/fs.h  |    3 +++
 kernel/sysctl.c     |    4 ++--
 mm/page-writeback.c |    2 +-
 5 files changed, 38 insertions(+), 12 deletions(-)

[-- Attachment #2: nr_inodes.patch --]
[-- Type: text/plain, Size: 5626 bytes --]

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index d0ff0b8..b591cdd 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -608,7 +608,7 @@ void sync_inodes_sb(struct super_block *sb, int wait)
 	unsigned long nr_unstable = global_page_state(NR_UNSTABLE_NFS);
 
 	wbc.nr_to_write = nr_dirty + nr_unstable +
-			(inodes_stat.nr_inodes - inodes_stat.nr_unused) +
+			(get_nr_inodes() - inodes_stat.nr_unused) +
 			nr_dirty + nr_unstable;
 	wbc.nr_to_write += wbc.nr_to_write / 2;		/* Bit more for luck */
 	sync_sb_inodes(sb, &wbc);
diff --git a/fs/inode.c b/fs/inode.c
index 0487ddb..f94f889 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -96,9 +96,33 @@ static DEFINE_MUTEX(iprune_mutex);
  * Statistics gathering..
  */
 struct inodes_stat_t inodes_stat;
+static struct percpu_counter nr_inodes;
 
 static struct kmem_cache * inode_cachep __read_mostly;
 
+int get_nr_inodes(void)
+{
+	return percpu_counter_sum_positive(&nr_inodes);
+}
+
+/*
+ * Handle nr_dentry sysctl
+ */
+#if defined(CONFIG_SYSCTL) && defined(CONFIG_PROC_FS)
+int proc_nr_inodes(ctl_table *table, int write, struct file *filp,
+		   void __user *buffer, size_t *lenp, loff_t *ppos)
+{
+	inodes_stat.nr_inodes = get_nr_inodes();
+	return proc_dointvec(table, write, filp, buffer, lenp, ppos);
+}
+#else
+int proc_nr_inodes(ctl_table *table, int write, struct file *filp,
+		   void __user *buffer, size_t *lenp, loff_t *ppos)
+{
+	return -ENOSYS;
+}
+#endif
+
 static void wake_up_inode(struct inode *inode)
 {
 	/*
@@ -306,9 +330,7 @@ static void dispose_list(struct list_head *head)
 		destroy_inode(inode);
 		nr_disposed++;
 	}
-	spin_lock(&inode_lock);
-	inodes_stat.nr_inodes -= nr_disposed;
-	spin_unlock(&inode_lock);
+	percpu_counter_sub(&nr_inodes, nr_disposed);
 }
 
 /*
@@ -560,8 +582,8 @@ struct inode *new_inode(struct super_block *sb)
 	
 	inode = alloc_inode(sb);
 	if (inode) {
+		percpu_counter_inc(&nr_inodes);
 		spin_lock(&inode_lock);
-		inodes_stat.nr_inodes++;
 		list_add(&inode->i_list, &inode_in_use);
 		list_add(&inode->i_sb_list, &sb->s_inodes);
 		inode->i_ino = ++last_ino;
@@ -622,7 +644,7 @@ static struct inode * get_new_inode(struct super_block *sb, struct hlist_head *h
 			if (set(inode, data))
 				goto set_failed;
 
-			inodes_stat.nr_inodes++;
+			percpu_counter_inc(&nr_inodes);
 			list_add(&inode->i_list, &inode_in_use);
 			list_add(&inode->i_sb_list, &sb->s_inodes);
 			hlist_add_head(&inode->i_hash, head);
@@ -671,7 +693,7 @@ static struct inode * get_new_inode_fast(struct super_block *sb, struct hlist_he
 		old = find_inode_fast(sb, head, ino);
 		if (!old) {
 			inode->i_ino = ino;
-			inodes_stat.nr_inodes++;
+			percpu_counter_inc(&nr_inodes);
 			list_add(&inode->i_list, &inode_in_use);
 			list_add(&inode->i_sb_list, &sb->s_inodes);
 			hlist_add_head(&inode->i_hash, head);
@@ -1042,8 +1064,8 @@ void generic_delete_inode(struct inode *inode)
 	list_del_init(&inode->i_list);
 	list_del_init(&inode->i_sb_list);
 	inode->i_state |= I_FREEING;
-	inodes_stat.nr_inodes--;
 	spin_unlock(&inode_lock);
+	percpu_counter_dec(&nr_inodes);
 
 	security_inode_delete(inode);
 
@@ -1093,8 +1115,8 @@ static void generic_forget_inode(struct inode *inode)
 	list_del_init(&inode->i_list);
 	list_del_init(&inode->i_sb_list);
 	inode->i_state |= I_FREEING;
-	inodes_stat.nr_inodes--;
 	spin_unlock(&inode_lock);
+	percpu_counter_dec(&nr_inodes);
 	if (inode->i_data.nrpages)
 		truncate_inode_pages(&inode->i_data, 0);
 	clear_inode(inode);
@@ -1394,6 +1416,7 @@ void __init inode_init(void)
 {
 	int loop;
 
+	percpu_counter_init(&nr_inodes, 0);
 	/* inode slab cache */
 	inode_cachep = kmem_cache_create("inode_cache",
 					 sizeof(struct inode),
diff --git a/include/linux/fs.h b/include/linux/fs.h
index c5e7aa5..2482977 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -47,6 +47,7 @@ struct inodes_stat_t {
 	int dummy[5];		/* padding for sysctl ABI compatibility */
 };
 extern struct inodes_stat_t inodes_stat;
+extern int get_nr_inodes(void);
 
 extern int leases_enable, lease_break_time;
 
@@ -2218,6 +2219,8 @@ int proc_nr_files(struct ctl_table *table, int write, struct file *filp,
 		  void __user *buffer, size_t *lenp, loff_t *ppos);
 int proc_nr_dentry(struct ctl_table *table, int write, struct file *filp,
 		   void __user *buffer, size_t *lenp, loff_t *ppos);
+int proc_nr_inodes(struct ctl_table *table, int write, struct file *filp,
+		   void __user *buffer, size_t *lenp, loff_t *ppos);
 
 int get_filesystem_list(char * buf);
 
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index eebddef..eebed01 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -1202,7 +1202,7 @@ static struct ctl_table fs_table[] = {
 		.data		= &inodes_stat,
 		.maxlen		= 2*sizeof(int),
 		.mode		= 0444,
-		.proc_handler	= &proc_dointvec,
+		.proc_handler	= &proc_nr_inodes,
 	},
 	{
 		.ctl_name	= FS_STATINODE,
@@ -1210,7 +1210,7 @@ static struct ctl_table fs_table[] = {
 		.data		= &inodes_stat,
 		.maxlen		= 7*sizeof(int),
 		.mode		= 0444,
-		.proc_handler	= &proc_dointvec,
+		.proc_handler	= &proc_nr_inodes,
 	},
 	{
 		.procname	= "file-nr",
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 2970e35..a71a922 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -705,7 +705,7 @@ static void wb_kupdate(unsigned long arg)
 	next_jif = start_jif + dirty_writeback_interval;
 	nr_to_write = global_page_state(NR_FILE_DIRTY) +
 			global_page_state(NR_UNSTABLE_NFS) +
-			(inodes_stat.nr_inodes - inodes_stat.nr_unused);
+			(get_nr_inodes() - inodes_stat.nr_unused);
 	while (nr_to_write > 0) {
 		wbc.more_io = 0;
 		wbc.encountered_congestion = 0;

^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [PATCH v2 3/5] fs: Introduce a per_cpu last_ino allocator
       [not found]                 ` <492DDB6A.8090806@cosmosbay.com>
  2008-11-29  8:43                   ` [PATCH v2 0/5] fs: Scalability of sockets/pipes allocation/deallocation on SMP Eric Dumazet
       [not found]                   ` <492DDB6A.8090806-fPLkHRcR87vqlBn2x/YWAg@public.gmane.org>
@ 2008-11-29  8:44                   ` Eric Dumazet
  2 siblings, 0 replies; 37+ messages in thread
From: Eric Dumazet @ 2008-11-29  8:44 UTC (permalink / raw)
  To: Ingo Molnar, Christoph Hellwig
  Cc: David Miller, Rafael J. Wysocki, linux-kernel,
	kernel-testers@vger.kernel.org >> Kernel Testers List,
	Mike Galbraith, Peter Zijlstra, Linux Netdev List,
	Christoph Lameter, linux-fsdevel, Al Viro

[-- Attachment #1: Type: text/plain, Size: 505 bytes --]

new_inode() dirties a contended cache line to get increasing
inode numbers.

Solve this problem by providing to each cpu a per_cpu variable,
feeded by the shared last_ino, but once every 1024 allocations.

This reduce contention on the shared last_ino, and give same
spreading ino numbers than before.
(same wraparound after 2^32 allocations)

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
---
 fs/inode.c |   35 ++++++++++++++++++++++++++++++++---
 1 files changed, 32 insertions(+), 3 deletions(-)


[-- Attachment #2: last_ino.patch --]
[-- Type: text/plain, Size: 1511 bytes --]

diff --git a/fs/inode.c b/fs/inode.c
index f94f889..dc8e72a 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -556,6 +556,36 @@ repeat:
 	return node ? inode : NULL;
 }
 
+#ifdef CONFIG_SMP
+/*
+ * Each cpu owns a range of 1024 numbers.
+ * 'shared_last_ino' is dirtied only once out of 1024 allocations,
+ * to renew the exhausted range.
+ */
+static DEFINE_PER_CPU(int, last_ino);
+
+static int last_ino_get(void)
+{
+	static atomic_t shared_last_ino;
+	int *p = &get_cpu_var(last_ino);
+	int res = *p;
+
+	if (unlikely((res & 1023) == 0))
+		res = atomic_add_return(1024, &shared_last_ino) - 1024;
+
+	*p = ++res;
+	put_cpu_var(last_ino);
+	return res;
+}
+#else
+static int last_ino_get(void)
+{
+	static int last_ino;
+
+	return ++last_ino;
+}
+#endif
+
 /**
  *	new_inode 	- obtain an inode
  *	@sb: superblock
@@ -575,7 +605,6 @@ struct inode *new_inode(struct super_block *sb)
 	 * error if st_ino won't fit in target struct field. Use 32bit counter
 	 * here to attempt to avoid that.
 	 */
-	static unsigned int last_ino;
 	struct inode * inode;
 
 	spin_lock_prefetch(&inode_lock);
@@ -583,11 +612,11 @@ struct inode *new_inode(struct super_block *sb)
 	inode = alloc_inode(sb);
 	if (inode) {
 		percpu_counter_inc(&nr_inodes);
+		inode->i_state = 0;
+		inode->i_ino = last_ino_get();
 		spin_lock(&inode_lock);
 		list_add(&inode->i_list, &inode_in_use);
 		list_add(&inode->i_sb_list, &sb->s_inodes);
-		inode->i_ino = ++last_ino;
-		inode->i_state = 0;
 		spin_unlock(&inode_lock);
 	}
 	return inode;

^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [PATCH v2 4/5] fs: Introduce SINGLE dentries for pipes, socket, anon fd
       [not found]                   ` <492DDB6A.8090806-fPLkHRcR87vqlBn2x/YWAg@public.gmane.org>
  2008-11-29  8:43                     ` [PATCH v2 1/5] fs: Use a percpu_counter to track nr_dentry Eric Dumazet
  2008-11-29  8:43                     ` [PATCH v2 2/5] fs: Use a percpu_counter to track nr_inodes Eric Dumazet
@ 2008-11-29  8:44                     ` Eric Dumazet
       [not found]                       ` <493100E7.3030907-fPLkHRcR87vqlBn2x/YWAg@public.gmane.org>
  2008-11-29  8:45                     ` [PATCH v2 5/5] fs: new_inode_single() and iput_single() Eric Dumazet
  3 siblings, 1 reply; 37+ messages in thread
From: Eric Dumazet @ 2008-11-29  8:44 UTC (permalink / raw)
  To: Ingo Molnar, Christoph Hellwig
  Cc: David Miller, Rafael J. Wysocki,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	kernel-testers-u79uwXL29TY76Z2rM5mHXA@public.gmane.org >> Kernel Testers List,
	Mike Galbraith, Peter Zijlstra, Linux Netdev List,
	Christoph Lameter, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Al Viro

[-- Attachment #1: Type: text/plain, Size: 1628 bytes --]


Sockets, pipes and anonymous fds have interesting properties.

Like other files, they use a dentry and an inode.

But dentries for these kind of files are not hashed into dcache,
since there is no way someone can lookup such a file in the vfs tree.
(/proc/{pid}/fd/{number} uses a different mechanism)

Still, allocating and freeing such dentries are expensive processes,
because we currently take dcache_lock inside d_alloc(), d_instantiate(),
and dput(). This lock is very contended on SMP machines.

This patch defines a new DCACHE_SINGLE flag, to mark a dentry as
a single one (for sockets, pipes, anonymous fd), and a new
d_alloc_single(const struct qstr *name, struct inode *inode)
method, called by the three subsystems.

Internally, dput() can take a fast path to dput_single() for
SINGLE dentries. No more atomic_dec_and_lock()
for such dentries.


Differences betwen an SINGLE dentry and a normal one are :

1) SINGLE dentry has the DCACHE_SINGLE flag
2) SINGLE dentry's parent is itself (DCACHE_DISCONNECTED)
This to avoid taking a reference on sb 'root' dentry, shared
by too many dentries.
3) They are not hashed into global hash table (DCACHE_UNHASHED)
4) Their d_alias list is empty

(socket8 bench result : from 25s to 19.9s)

Signed-off-by: Eric Dumazet <dada1-fPLkHRcR87vqlBn2x/YWAg@public.gmane.org>
---
 fs/anon_inodes.c       |   16 ------------
 fs/dcache.c            |   51 +++++++++++++++++++++++++++++++++++++++
 fs/pipe.c              |   23 +----------------
 include/linux/dcache.h |    9 ++++++
 net/socket.c           |   24 +-----------------
 5 files changed, 65 insertions(+), 58 deletions(-)

[-- Attachment #2: dcache_single.patch --]
[-- Type: text/plain, Size: 7886 bytes --]

diff --git a/fs/anon_inodes.c b/fs/anon_inodes.c
index 3662dd4..8bf83cb 100644
--- a/fs/anon_inodes.c
+++ b/fs/anon_inodes.c
@@ -33,23 +33,12 @@ static int anon_inodefs_get_sb(struct file_system_type *fs_type, int flags,
 			     mnt);
 }
 
-static int anon_inodefs_delete_dentry(struct dentry *dentry)
-{
-	/*
-	 * We faked vfs to believe the dentry was hashed when we created it.
-	 * Now we restore the flag so that dput() will work correctly.
-	 */
-	dentry->d_flags |= DCACHE_UNHASHED;
-	return 1;
-}
-
 static struct file_system_type anon_inode_fs_type = {
 	.name		= "anon_inodefs",
 	.get_sb		= anon_inodefs_get_sb,
 	.kill_sb	= kill_anon_super,
 };
 static struct dentry_operations anon_inodefs_dentry_operations = {
-	.d_delete	= anon_inodefs_delete_dentry,
 };
 
 /**
@@ -92,7 +81,7 @@ int anon_inode_getfd(const char *name, const struct file_operations *fops,
 	this.name = name;
 	this.len = strlen(name);
 	this.hash = 0;
-	dentry = d_alloc(anon_inode_mnt->mnt_sb->s_root, &this);
+	dentry = d_alloc_single(&this, anon_inode_inode);
 	if (!dentry)
 		goto err_put_unused_fd;
 
@@ -104,9 +93,6 @@ int anon_inode_getfd(const char *name, const struct file_operations *fops,
 	atomic_inc(&anon_inode_inode->i_count);
 
 	dentry->d_op = &anon_inodefs_dentry_operations;
-	/* Do not publish this dentry inside the global dentry hash table */
-	dentry->d_flags &= ~DCACHE_UNHASHED;
-	d_instantiate(dentry, anon_inode_inode);
 
 	error = -ENFILE;
 	file = alloc_file(anon_inode_mnt, dentry,
diff --git a/fs/dcache.c b/fs/dcache.c
index 46d5d1e..35d4a25 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -219,6 +219,23 @@ static struct dentry *d_kill(struct dentry *dentry)
  */
 
 /*
+ * special version of dput() for pipes/sockets/anon.
+ * These dentries are not present in hash table, we can avoid
+ * taking/dirtying dcache_lock
+ */
+static void dput_single(struct dentry *dentry)
+{
+	struct inode *inode;
+
+	if (!atomic_dec_and_test(&dentry->d_count))
+		return;
+	inode = dentry->d_inode;
+	if (inode)
+		iput(inode);
+	d_free(dentry);
+}
+
+/*
  * dput - release a dentry
  * @dentry: dentry to release 
  *
@@ -234,6 +251,11 @@ void dput(struct dentry *dentry)
 {
 	if (!dentry)
 		return;
+	/*
+	 * single dentries (sockets/pipes/anon) fast path
+	 */
+	if (dentry->d_flags & DCACHE_SINGLE)
+		return dput_single(dentry);
 
 repeat:
 	if (atomic_read(&dentry->d_count) == 1)
@@ -1119,6 +1141,35 @@ struct dentry * d_alloc_root(struct inode * root_inode)
 	return res;
 }
 
+/**
+ * d_alloc_single - allocate SINGLE dentry
+ * @name: dentry name, given in a qstr structure
+ * @inode: inode to allocate the dentry for
+ *
+ * Allocate an SINGLE dentry for the inode given. The inode is
+ * instantiated and returned. %NULL is returned if there is insufficient
+ * memory.
+ * - SINGLE dentries have themselves as a parent.
+ * - SINGLE dentries are not hashed into global hash table
+ * - their d_alias list is empty
+ */
+struct dentry *d_alloc_single(const struct qstr *name, struct inode *inode)
+{
+	struct dentry *entry;
+
+	entry = d_alloc(NULL, name);
+	if (entry) {
+		entry->d_sb = inode->i_sb;
+		entry->d_parent = entry;
+		entry->d_flags |= DCACHE_SINGLE | DCACHE_DISCONNECTED;
+		entry->d_inode = inode;
+		fsnotify_d_instantiate(entry, inode);
+		security_d_instantiate(entry, inode);
+	}
+	return entry;
+}
+
+
 static inline struct hlist_head *d_hash(struct dentry *parent,
 					unsigned long hash)
 {
diff --git a/fs/pipe.c b/fs/pipe.c
index 7aea8b8..4de6dd5 100644
--- a/fs/pipe.c
+++ b/fs/pipe.c
@@ -849,17 +849,6 @@ void free_pipe_info(struct inode *inode)
 }
 
 static struct vfsmount *pipe_mnt __read_mostly;
-static int pipefs_delete_dentry(struct dentry *dentry)
-{
-	/*
-	 * At creation time, we pretended this dentry was hashed
-	 * (by clearing DCACHE_UNHASHED bit in d_flags)
-	 * At delete time, we restore the truth : not hashed.
-	 * (so that dput() can proceed correctly)
-	 */
-	dentry->d_flags |= DCACHE_UNHASHED;
-	return 0;
-}
 
 /*
  * pipefs_dname() is called from d_path().
@@ -871,7 +860,6 @@ static char *pipefs_dname(struct dentry *dentry, char *buffer, int buflen)
 }
 
 static struct dentry_operations pipefs_dentry_operations = {
-	.d_delete	= pipefs_delete_dentry,
 	.d_dname	= pipefs_dname,
 };
 
@@ -918,7 +906,7 @@ struct file *create_write_pipe(int flags)
 	struct inode *inode;
 	struct file *f;
 	struct dentry *dentry;
-	struct qstr name = { .name = "" };
+	static const struct qstr name = { .name = "" };
 
 	err = -ENFILE;
 	inode = get_pipe_inode();
@@ -926,18 +914,11 @@ struct file *create_write_pipe(int flags)
 		goto err;
 
 	err = -ENOMEM;
-	dentry = d_alloc(pipe_mnt->mnt_sb->s_root, &name);
+	dentry = d_alloc_single(&name, inode);
 	if (!dentry)
 		goto err_inode;
 
 	dentry->d_op = &pipefs_dentry_operations;
-	/*
-	 * We dont want to publish this dentry into global dentry hash table.
-	 * We pretend dentry is already hashed, by unsetting DCACHE_UNHASHED
-	 * This permits a working /proc/$pid/fd/XXX on pipes
-	 */
-	dentry->d_flags &= ~DCACHE_UNHASHED;
-	d_instantiate(dentry, inode);
 
 	err = -ENFILE;
 	f = alloc_file(pipe_mnt, dentry, FMODE_WRITE, &write_pipefifo_fops);
diff --git a/include/linux/dcache.h b/include/linux/dcache.h
index a37359d..ca8d269 100644
--- a/include/linux/dcache.h
+++ b/include/linux/dcache.h
@@ -176,6 +176,14 @@ d_iput:		no		no		no       yes
 #define DCACHE_UNHASHED		0x0010	
 
 #define DCACHE_INOTIFY_PARENT_WATCHED	0x0020 /* Parent inode is watched */
+#define DCACHE_SINGLE		0x0040
+	/*
+	 * socket, pipe or anonymous fd dentry
+	 * - SINGLE dentries have themselves as a parent.
+	 * - SINGLE dentries are not hashed into global hash table
+	 * - Their d_alias list is empty
+	 * - They dont need dcache_lock synchronization
+	 */
 
 extern spinlock_t dcache_lock;
 extern seqlock_t rename_lock;
@@ -235,6 +243,7 @@ extern void shrink_dcache_sb(struct super_block *);
 extern void shrink_dcache_parent(struct dentry *);
 extern void shrink_dcache_for_umount(struct super_block *);
 extern int d_invalidate(struct dentry *);
+extern struct dentry *d_alloc_single(const struct qstr *, struct inode *);
 
 /* only used at mount-time */
 extern struct dentry * d_alloc_root(struct inode *);
diff --git a/net/socket.c b/net/socket.c
index e9d65ea..231cd66 100644
--- a/net/socket.c
+++ b/net/socket.c
@@ -307,18 +307,6 @@ static struct file_system_type sock_fs_type = {
 	.kill_sb =	kill_anon_super,
 };
 
-static int sockfs_delete_dentry(struct dentry *dentry)
-{
-	/*
-	 * At creation time, we pretended this dentry was hashed
-	 * (by clearing DCACHE_UNHASHED bit in d_flags)
-	 * At delete time, we restore the truth : not hashed.
-	 * (so that dput() can proceed correctly)
-	 */
-	dentry->d_flags |= DCACHE_UNHASHED;
-	return 0;
-}
-
 /*
  * sockfs_dname() is called from d_path().
  */
@@ -329,7 +317,6 @@ static char *sockfs_dname(struct dentry *dentry, char *buffer, int buflen)
 }
 
 static struct dentry_operations sockfs_dentry_operations = {
-	.d_delete = sockfs_delete_dentry,
 	.d_dname  = sockfs_dname,
 };
 
@@ -371,20 +358,13 @@ static int sock_alloc_fd(struct file **filep, int flags)
 static int sock_attach_fd(struct socket *sock, struct file *file, int flags)
 {
 	struct dentry *dentry;
-	struct qstr name = { .name = "" };
+	static const struct qstr name = { .name = "" };
 
-	dentry = d_alloc(sock_mnt->mnt_sb->s_root, &name);
+	dentry = d_alloc_single(&name, SOCK_INODE(sock));
 	if (unlikely(!dentry))
 		return -ENOMEM;
 
 	dentry->d_op = &sockfs_dentry_operations;
-	/*
-	 * We dont want to push this dentry into global dentry hash table.
-	 * We pretend dentry is already hashed, by unsetting DCACHE_UNHASHED
-	 * This permits a working /proc/$pid/fd/XXX on sockets
-	 */
-	dentry->d_flags &= ~DCACHE_UNHASHED;
-	d_instantiate(dentry, SOCK_INODE(sock));
 
 	sock->file = file;
 	init_file(file, sock_mnt, dentry, FMODE_READ | FMODE_WRITE,

^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [PATCH v2 5/5] fs: new_inode_single() and iput_single()
       [not found]                   ` <492DDB6A.8090806-fPLkHRcR87vqlBn2x/YWAg@public.gmane.org>
                                       ` (2 preceding siblings ...)
  2008-11-29  8:44                     ` [PATCH v2 4/5] fs: Introduce SINGLE dentries for pipes, socket, anon fd Eric Dumazet
@ 2008-11-29  8:45                     ` Eric Dumazet
  2008-11-29 11:14                       ` Jörn Engel
  3 siblings, 1 reply; 37+ messages in thread
From: Eric Dumazet @ 2008-11-29  8:45 UTC (permalink / raw)
  To: Ingo Molnar, Christoph Hellwig
  Cc: David Miller, Rafael J. Wysocki,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	kernel-testers-u79uwXL29TY76Z2rM5mHXA@public.gmane.org >> Kernel Testers List,
	Mike Galbraith, Peter Zijlstra, Linux Netdev List,
	Christoph Lameter, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Al Viro

[-- Attachment #1: Type: text/plain, Size: 931 bytes --]

Goal of this patch is to not touch inode_lock for socket/pipes/anonfd
inodes allocation/freeing.

SINGLE dentries are attached to inodes that dont need to be linked
in a list of inodes, being "inode_in_use" or "sb->s_inodes"
As inode_lock was taken only to protect these lists, we avoid 
taking it as well.

Using iput_single() from dput_single() avoids taking inode_lock
at freeing time.

This patch has a very noticeable effect, because we avoid dirtying of 
three contended cache lines in new_inode(), and five cache lines
in iput()

(socket8 bench result : from 19.9s to 2.3s)

Signed-off-by: Eric Dumazet <dada1-fPLkHRcR87vqlBn2x/YWAg@public.gmane.org>
---
 fs/anon_inodes.c   |    2 +-
 fs/dcache.c        |    2 +-
 fs/inode.c         |   29 ++++++++++++++++++++---------
 fs/pipe.c          |    2 +-
 include/linux/fs.h |   12 +++++++++++-
 net/socket.c       |    2 +-
 6 files changed, 35 insertions(+), 14 deletions(-)

[-- Attachment #2: new_inode_single.patch --]
[-- Type: text/plain, Size: 4080 bytes --]

diff --git a/fs/anon_inodes.c b/fs/anon_inodes.c
index 8bf83cb..89fd36d 100644
--- a/fs/anon_inodes.c
+++ b/fs/anon_inodes.c
@@ -125,7 +125,7 @@ EXPORT_SYMBOL_GPL(anon_inode_getfd);
  */
 static struct inode *anon_inode_mkinode(void)
 {
-	struct inode *inode = new_inode(anon_inode_mnt->mnt_sb);
+	struct inode *inode = new_inode_single(anon_inode_mnt->mnt_sb);
 
 	if (!inode)
 		return ERR_PTR(-ENOMEM);
diff --git a/fs/dcache.c b/fs/dcache.c
index 35d4a25..3aa9ed5 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -231,7 +231,7 @@ static void dput_single(struct dentry *dentry)
 		return;
 	inode = dentry->d_inode;
 	if (inode)
-		iput(inode);
+		iput_single(inode);
 	d_free(dentry);
 }
 
diff --git a/fs/inode.c b/fs/inode.c
index dc8e72a..0fdfe1b 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -221,6 +221,13 @@ void destroy_inode(struct inode *inode)
 		kmem_cache_free(inode_cachep, (inode));
 }
 
+void iput_single(struct inode *inode)
+{
+	if (atomic_dec_and_test(&inode->i_count)) {
+		destroy_inode(inode);
+		percpu_counter_dec(&nr_inodes);
+	}
+}
 
 /*
  * These are initializations that only need to be done
@@ -587,8 +594,9 @@ static int last_ino_get(void)
 #endif
 
 /**
- *	new_inode 	- obtain an inode
+ *	__new_inode 	- obtain an inode
  *	@sb: superblock
+ *  @single: if true, dont link new inode in a list
  *
  *	Allocates a new inode for given superblock. The default gfp_mask
  *	for allocations related to inode->i_mapping is GFP_HIGHUSER_PAGECACHE.
@@ -598,7 +606,7 @@ static int last_ino_get(void)
  *	newly created inode's mapping
  *
  */
-struct inode *new_inode(struct super_block *sb)
+struct inode *__new_inode(struct super_block *sb, int single)
 {
 	/*
 	 * On a 32bit, non LFS stat() call, glibc will generate an EOVERFLOW
@@ -607,22 +615,25 @@ struct inode *new_inode(struct super_block *sb)
 	 */
 	struct inode * inode;
 
-	spin_lock_prefetch(&inode_lock);
-	
 	inode = alloc_inode(sb);
 	if (inode) {
 		percpu_counter_inc(&nr_inodes);
 		inode->i_state = 0;
 		inode->i_ino = last_ino_get();
-		spin_lock(&inode_lock);
-		list_add(&inode->i_list, &inode_in_use);
-		list_add(&inode->i_sb_list, &sb->s_inodes);
-		spin_unlock(&inode_lock);
+ 		if (single) {
+  			INIT_LIST_HEAD(&inode->i_list);
+  			INIT_LIST_HEAD(&inode->i_sb_list);
+ 		} else {
+			spin_lock(&inode_lock);
+			list_add(&inode->i_list, &inode_in_use);
+			list_add(&inode->i_sb_list, &sb->s_inodes);
+			spin_unlock(&inode_lock);
+		}
 	}
 	return inode;
 }
 
-EXPORT_SYMBOL(new_inode);
+EXPORT_SYMBOL(__new_inode);
 
 void unlock_new_inode(struct inode *inode)
 {
diff --git a/fs/pipe.c b/fs/pipe.c
index 4de6dd5..8c51a0d 100644
--- a/fs/pipe.c
+++ b/fs/pipe.c
@@ -865,7 +865,7 @@ static struct dentry_operations pipefs_dentry_operations = {
 
 static struct inode * get_pipe_inode(void)
 {
-	struct inode *inode = new_inode(pipe_mnt->mnt_sb);
+	struct inode *inode = new_inode_single(pipe_mnt->mnt_sb);
 	struct pipe_inode_info *pipe;
 
 	if (!inode)
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 2482977..b3daffc 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1898,7 +1898,17 @@ extern void __iget(struct inode * inode);
 extern void iget_failed(struct inode *);
 extern void clear_inode(struct inode *);
 extern void destroy_inode(struct inode *);
-extern struct inode *new_inode(struct super_block *);
+extern struct inode *__new_inode(struct super_block *, int);
+static inline struct inode *new_inode(struct super_block *sb)
+{
+	return __new_inode(sb, 0);
+}
+static inline struct inode *new_inode_single(struct super_block *sb)
+{
+	return __new_inode(sb, 1);
+}
+extern void iput_single(struct inode *);
+
 extern int should_remove_suid(struct dentry *);
 extern int file_remove_suid(struct file *);
 
diff --git a/net/socket.c b/net/socket.c
index 231cd66..f1e656c 100644
--- a/net/socket.c
+++ b/net/socket.c
@@ -463,7 +463,7 @@ static struct socket *sock_alloc(void)
 	struct inode *inode;
 	struct socket *sock;
 
-	inode = new_inode(sock_mnt->mnt_sb);
+	inode = new_inode_single(sock_mnt->mnt_sb);
 	if (!inode)
 		return NULL;
 

^ permalink raw reply related	[flat|nested] 37+ messages in thread

* Re: [PATCH v2 4/5] fs: Introduce SINGLE dentries for pipes, socket, anon fd
       [not found]                       ` <493100E7.3030907-fPLkHRcR87vqlBn2x/YWAg@public.gmane.org>
@ 2008-11-29 10:38                         ` Jörn Engel
       [not found]                           ` <20081129103836.GA11959-PCqxUs/MD9bYtjvyW6yDsg@public.gmane.org>
  0 siblings, 1 reply; 37+ messages in thread
From: Jörn Engel @ 2008-11-29 10:38 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Ingo Molnar, Christoph Hellwig, David Miller, Rafael J. Wysocki,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	kernel-testers-u79uwXL29TY76Z2rM5mHXA@public.gmane.org >> Kernel Testers List,
	Mike Galbraith, Peter Zijlstra, Linux Netdev List,
	Christoph Lameter, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Al Viro

On Sat, 29 November 2008 09:44:23 +0100, Eric Dumazet wrote:
>
> +struct dentry *d_alloc_single(const struct qstr *name, struct inode *inode)
> +{
> +	struct dentry *entry;
> +
> +	entry = d_alloc(NULL, name);
> +	if (entry) {
> +		entry->d_sb = inode->i_sb;
> +		entry->d_parent = entry;
> +		entry->d_flags |= DCACHE_SINGLE | DCACHE_DISCONNECTED;
> +		entry->d_inode = inode;
> +		fsnotify_d_instantiate(entry, inode);
> +		security_d_instantiate(entry, inode);
> +	}
> +	return entry;

Calling the struct dentry entry had me onfused a bit.  I believe
everyone else (including the code you removed) uses dentry.

> @@ -918,7 +906,7 @@ struct file *create_write_pipe(int flags)
>  	struct inode *inode;
>  	struct file *f;
>  	struct dentry *dentry;
> -	struct qstr name = { .name = "" };
> +	static const struct qstr name = { .name = "" };
>  
>  	err = -ENFILE;
>  	inode = get_pipe_inode();
...
> @@ -371,20 +358,13 @@ static int sock_alloc_fd(struct file **filep, int flags)
>  static int sock_attach_fd(struct socket *sock, struct file *file, int flags)
>  {
>  	struct dentry *dentry;
> -	struct qstr name = { .name = "" };
> +	static const struct qstr name = { .name = "" };

These two could even be combined.

And of course I realize that I comment on absolute trivialities.  On the
whole, I couldn't spot a real problem in your patches.

Jörn

-- 
Public Domain  - Free as in Beer
General Public - Free as in Speech
BSD License    - Free as in Enterprise
Shared Source  - Free as in "Work will make you..."

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v2 4/5] fs: Introduce SINGLE dentries for pipes, socket, anon fd
       [not found]                           ` <20081129103836.GA11959-PCqxUs/MD9bYtjvyW6yDsg@public.gmane.org>
@ 2008-11-29 11:14                             ` Eric Dumazet
  0 siblings, 0 replies; 37+ messages in thread
From: Eric Dumazet @ 2008-11-29 11:14 UTC (permalink / raw)
  To: Jörn Engel
  Cc: Ingo Molnar, Christoph Hellwig, David Miller, Rafael J. Wysocki,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	kernel-testers-u79uwXL29TY76Z2rM5mHXA@public.gmane.org >> Kernel Testers List,
	Mike Galbraith, Peter Zijlstra, Linux Netdev List,
	Christoph Lameter, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Al Viro

Jörn Engel a écrit :
> On Sat, 29 November 2008 09:44:23 +0100, Eric Dumazet wrote:
>> +struct dentry *d_alloc_single(const struct qstr *name, struct inode *inode)
>> +{
>> +	struct dentry *entry;
>> +
>> +	entry = d_alloc(NULL, name);
>> +	if (entry) {
>> +		entry->d_sb = inode->i_sb;
>> +		entry->d_parent = entry;
>> +		entry->d_flags |= DCACHE_SINGLE | DCACHE_DISCONNECTED;
>> +		entry->d_inode = inode;
>> +		fsnotify_d_instantiate(entry, inode);
>> +		security_d_instantiate(entry, inode);
>> +	}
>> +	return entry;
> 
> Calling the struct dentry entry had me onfused a bit.  I believe
> everyone else (including the code you removed) uses dentry.

Ah yes, it seems I took it from d_instantiate(), I guess a cleanup
patch would be nice.

> 
>> @@ -918,7 +906,7 @@ struct file *create_write_pipe(int flags)
>>  	struct inode *inode;
>>  	struct file *f;
>>  	struct dentry *dentry;
>> -	struct qstr name = { .name = "" };
>> +	static const struct qstr name = { .name = "" };
>>  
>>  	err = -ENFILE;
>>  	inode = get_pipe_inode();
> ...
>> @@ -371,20 +358,13 @@ static int sock_alloc_fd(struct file **filep, int flags)
>>  static int sock_attach_fd(struct socket *sock, struct file *file, int flags)
>>  {
>>  	struct dentry *dentry;
>> -	struct qstr name = { .name = "" };
>> +	static const struct qstr name = { .name = "" };
> 
> These two could even be combined.
> 
> And of course I realize that I comment on absolute trivialities.  On the
> whole, I couldn't spot a real problem in your patches.

Well, at least you reviewed it, it's the important point !

Thanks Jörn

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v2 5/5] fs: new_inode_single() and iput_single()
  2008-11-29  8:45                     ` [PATCH v2 5/5] fs: new_inode_single() and iput_single() Eric Dumazet
@ 2008-11-29 11:14                       ` Jörn Engel
  0 siblings, 0 replies; 37+ messages in thread
From: Jörn Engel @ 2008-11-29 11:14 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Ingo Molnar, Christoph Hellwig, David Miller, Rafael J. Wysocki,
	linux-kernel,
	kernel-testers@vger.kernel.org >> Kernel Testers List,
	Mike Galbraith, Peter Zijlstra, Linux Netdev List,
	Christoph Lameter, linux-fsdevel, Al Viro

On Sat, 29 November 2008 09:45:09 +0100, Eric Dumazet wrote:
>  
> +void iput_single(struct inode *inode)
> +{
> +	if (atomic_dec_and_test(&inode->i_count)) {
> +		destroy_inode(inode);
> +		percpu_counter_dec(&nr_inodes);
> +	}
> +}

I wonder if it is possible to avoid the atomic_dec_and_test() here, at
least in the common case, and combine it with the atomic_dec_and_test()
of the dentry.  A quick look at fs/inode.c indicates that inode->i_count
may never get changed for a SINGLE inode, except during creation or
deletion.

It might be worth to
- remove the conditional from iput_single() and measure that it makes a
  difference,
- poison SINGLE inodes with some value and
- put a BUG_ON() in __iget() that checks for the poison value.

I _think_ the BUG_ON() is unnecessary, but at least my brain is not
sufficient to convince me.  Can inotify somehow get a hold of a socket?
Or dquot (how insane would that be?)

Jörn

-- 
Mac is for working,
Linux is for Networking,
Windows is for Solitaire!
-- stolen from dc

^ permalink raw reply	[flat|nested] 37+ messages in thread

* [PATCH v3 0/7] fs: Scalability of sockets/pipes allocation/deallocation on SMP
  2008-11-29  8:43                   ` [PATCH v2 0/5] fs: Scalability of sockets/pipes allocation/deallocation on SMP Eric Dumazet
@ 2008-12-11 22:38                     ` Eric Dumazet
  2008-12-11 22:38                     ` [PATCH v3 1/7] fs: Use a percpu_counter to track nr_dentry Eric Dumazet
                                       ` (6 subsequent siblings)
  7 siblings, 0 replies; 37+ messages in thread
From: Eric Dumazet @ 2008-12-11 22:38 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Ingo Molnar, Christoph Hellwig, David Miller, Rafael J. Wysocki,
	linux-kernel,
	kernel-testers@vger.kernel.org >> Kernel Testers List,
	Mike Galbraith, Peter Zijlstra, Linux Netdev List,
	Christoph Lameter, linux-fsdevel, Al Viro, Paul E. McKenney

Hi Andrew

Take v2 of this patch serie got no new feedback, maybe its time for mm
inclusion for a while ?

In this third version I added last two patches, one intialy from Christoph
Lameter, and one to avoid dirtying mnt->mnt_count on hardwired fs.

Many thanks to Christoph and Paul for this SLAB_DESTROY_PER_RCU work done
on "struct file".

Thank you

Short summary : Nice speedups for allocation/deallocation of sockets/pipes
(From 27.5 seconds to 1.62 s, on a 8 cpus machine)

Long version :

To allocate a socket or a pipe we :

0) Do the usual file table manipulation (pretty scalable these days,
but would be faster if 'struct file' were using SLAB_DESTROY_BY_RCU
and avoid call_rcu() cache killer). This point is addressed by 6th
patch.

1) allocate an inode with new_inode()
This function :
- locks inode_lock,
- dirties nr_inodes counter
- dirties inode_in_use list  (for sockets/pipes, this is useless)
- dirties superblock s_inodes.  - dirties last_ino counter
All these are in different cache lines unfortunatly.

2) allocate a dentry
d_alloc() takes dcache_lock,
insert dentry on its parent list (dirtying sock_mnt->mnt_sb->s_root)
dirties nr_dentry

3) d_instantiate() dentry  (dcache_lock taken again)

4) init_file() -> atomic_inc() on sock_mnt->refcount

At close() time, we must undo the things. Its even more expensive because
of the _atomic_dec_and_lock() that stress a lot, and because of two cache
lines that are touched when an element is deleted from a list
(previous and next items)

This is really bad, since sockets/pipes dont need to be visible in dcache
or an inode list per super block.

This patch series get rid of all but one contended cache lines for
sockets, pipes and anonymous fd  (signalfd, timerfd, ...)

socketallocbench is a very simple program (attached to this mail) that makes
a loop :

for (i = 0; i < 1000000; i++)
    close(socket(AF_INET, SOCK_STREAM, 0));

Cost if one cpu runs the program :

real    1.561s
user    0.092s
sys     1.469s

Cost if 8 processes are launched on a 8 CPU machine
(socketallocbench -n 8) :

real    27.496s   <<<< !!!! >>>>
user    0.657s
sys     3m39.092s

Oprofile results (for the 8 process run, 3 times):

CPU: Core 2, speed 3000.03 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit
mask of 0x00 (Unhalted core cycles) count 100000
samples  cum. samples  %        cum. %     symbol name
3347352  3347352       28.0232  28.0232    _atomic_dec_and_lock
3301428  6648780       27.6388  55.6620    d_instantiate
2971130  9619910       24.8736  80.5355    d_alloc
241318   9861228        2.0203  82.5558    init_file
146190   10007418       1.2239  83.7797    __slab_free
144149   10151567       1.2068  84.9864    inotify_d_instantiate
143971   10295538       1.2053  86.1917    inet_create
137168   10432706       1.1483  87.3401    new_inode
117549   10550255       0.9841  88.3242    add_partial
110795   10661050       0.9275  89.2517    generic_drop_inode
107137   10768187       0.8969  90.1486    kmem_cache_alloc
94029    10862216       0.7872  90.9358    tcp_close
82837    10945053       0.6935  91.6293    dput
67486    11012539       0.5650  92.1943    dentry_iput
57751    11070290       0.4835  92.6778    iput
54327    11124617       0.4548  93.1326    tcp_v4_init_sock
49921    11174538       0.4179  93.5505    sysenter_past_esp
47616    11222154       0.3986  93.9491    kmem_cache_free
30792    11252946       0.2578  94.2069    clear_inode
27540    11280486       0.2306  94.4375    copy_from_user
26509    11306995       0.2219  94.6594    init_timer
26363    11333358       0.2207  94.8801    discard_slab
25284    11358642       0.2117  95.0918    __fput
22482    11381124       0.1882  95.2800    __percpu_counter_add
20369    11401493       0.1705  95.4505    sock_alloc
18501    11419994       0.1549  95.6054    inet_csk_destroy_sock
17923    11437917       0.1500  95.7555    sys_close

This patch serie avoids all contented cache lines and makes this "bench"
pretty fast.

New cost if run on one cpu :

real    1.245s (instead of 1.561s)
user    0.074s
sys     1.161s

If run on 8 CPUS :

real    1.624s
user    0.580s
sys     12.296s

On oprofile, we finally can see network stuff coming at the front of
expensive stuff. (with the exception of kmem_cache_[z]alloc(), because
it has to clear 192 bytes of file structures, this takes half of the time)

CPU: Core 2, speed 3000.09 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (Unhalted core cycles) count 100
000
samples  cum. samples  %        cum. %     symbol name
176586   176586        10.9376  10.9376    kmem_cache_alloc
169838   346424        10.5196  21.4572    tcp_close
105331   451755         6.5241  27.9813    tcp_v4_init_sock
105146   556901         6.5126  34.4939    tcp_v4_destroy_sock
83307    640208         5.1600  39.6539    sysenter_past_esp
80241    720449         4.9701  44.6239    inet_csk_destroy_sock
74263    794712         4.5998  49.2237    kmem_cache_free
56806    851518         3.5185  52.7422    __percpu_counter_add
48619    900137         3.0114  55.7536    copy_from_user
44803    944940         2.7751  58.5287    init_timer
28539    973479         1.7677  60.2964    d_alloc
27795    1001274        1.7216  62.0180    alloc_fd
26747    1028021        1.6567  63.6747    __fput
24312    1052333        1.5059  65.1805    sys_close
24205    1076538        1.4992  66.6798    inet_create
22409    1098947        1.3880  68.0677    alloc_inode
21359    1120306        1.3230  69.3907    release_sock
19865    1140171        1.2304  70.6211    fd_install
19472    1159643        1.2061  71.8272    lock_sock_nested
18956    1178599        1.1741  73.0013    sock_init_data
17301    1195900        1.0716  74.0729    drop_file_write_access
17113    1213013        1.0600  75.1329    inotify_d_instantiate
16384    1229397        1.0148  76.1477    dput
15173    1244570        0.9398  77.0875    local_bh_enable_ip
15017    1259587        0.9301  78.0176    local_bh_enable
13354    1272941        0.8271  78.8448    __sock_create
13139    1286080        0.8138  79.6586    inet_release
13062    1299142        0.8090  80.4676    sysenter_do_call
11935    1311077        0.7392  81.2069    iput_single

This patch serie contains 7 patches, against linux-2.6 tree,
plus one patch in mm (fs: filp_cachep can be static in fs/file_table.c)

[PATCH 1/7] fs: Use a percpu_counter to track nr_dentry

Adding a percpu_counter nr_dentry avoids cache line ping pongs
between cpus to maintain this metric, and dcache_lock is
no more needed to protect dentry_stat.nr_dentry

We centralize nr_dentry updates at the right place :
- increments in d_alloc()
- decrements in d_free()

d_alloc() can avoid taking dcache_lock if parent is NULL

("socketallocbench -n 8" bench result : 27.5s to 25s)

[PATCH 2/7] fs: Use a percpu_counter to track nr_inodes

Avoids cache line ping pongs between cpus and prepare next patch,
because updates of nr_inodes dont need inode_lock anymore.

("socketallocbench -n 8" bench result : no difference at this point)

[PATCH 3/7] fs: Introduce a per_cpu last_ino allocator

new_inode() dirties a contended cache line to get increasing
inode numbers.

Solve this problem by providing to each cpu a per_cpu variable,
feeded by the shared last_ino, but once every 1024 allocations.

This reduce contention on the shared last_ino, and give same
spreading ino numbers than before.
(same wraparound after 232 allocations)

("socketallocbench -n 8" result : no difference)

[PATCH 4/7] fs: Introduce SINGLE dentries for pipes, socket, anon fd

Sockets, pipes and anonymous fds have interesting properties.

Like other files, they use a dentry and an inode.

But dentries for these kind of files are not hashed into dcache,
since there is no way someone can lookup such a file in the vfs tree.
(/proc/{pid}/fd/{number} uses a different mechanism)

Still, allocating and freeing such dentries are expensive processes,
because we currently take dcache_lock inside d_alloc(), d_instantiate(),
and dput(). This lock is very contended on SMP machines.

This patch defines a new DCACHE_SINGLE flag, to mark a dentry as
a single one (for sockets, pipes, anonymous fd), and a new
d_alloc_single(const struct qstr *name, struct inode *inode)
method, called by the three subsystems.

Internally, dput() can take a fast path to dput_single() for
SINGLE dentries. No more atomic_dec_and_lock()
for such dentries.

Differences betwen an SINGLE dentry and a normal one are :

1) SINGLE dentry has the DCACHE_SINGLE flag
2) SINGLE dentry's parent is itself (DCACHE_DISCONNECTED)
This to avoid taking a reference on sb 'root' dentry, shared
by too many dentries.
3) They are not hashed into global hash table (DCACHE_UNHASHED)
4) Their d_alias list is empty

(socket8 bench result : from 25s to 19.9s)

[PATCH 5/7] fs: new_inode_single() and iput_single()

Goal of this patch is to not touch inode_lock for socket/pipes/anonfd
inodes allocation/freeing.

SINGLE dentries are attached to inodes that dont need to be linked
in a list of inodes, being "inode_in_use" or "sb->s_inodes"
As inode_lock was taken only to protect these lists, we avoid taking it as well.

Using iput_single() from dput_single() avoids taking inode_lock
at freeing time.

This patch has a very noticeable effect, because we avoid dirtying of three contended cache lines in new_inode(), and five cache lines
in iput()

("socketallocbench -n 8" result : from 19.9s to 3.01s)

[PATH 6/7] fs: struct file move from call_rcu() to SLAB_DESTROY_BY_RCU

From: Christoph Lameter <cl@linux-foundation.org>

Currently we schedule RCU frees for each file we free separately. That has
several drawbacks against the earlier file handling (in 2.6.5 f.e.), which
did not require RCU callbacks:

1. Excessive number of RCU callbacks can be generated causing long RCU
  queues that in turn cause long latencies. We hit SLUB page allocation
  more often than necessary.

2. The cache hot object is not preserved between free and realloc. A close
  followed by another open is very fast with the RCUless approach because
  the last freed object is returned by the slab allocator that is
  still cache hot. RCU free means that the object is not immediately
  available again. The new object is cache cold and therefore open/close
  performance tests show a significant degradation with the RCU
  implementation.

One solution to this problem is to move the RCU freeing into the Slab
allocator by specifying SLAB_DESTROY_BY_RCU as an option at slab creation
time. The slab allocator will do RCU frees only when it is necessary
to dispose of slabs of objects (rare). So with that approach we can cut
out the RCU overhead significantly.

However, the slab allocator may return the object for another use even
before the RCU period has expired under SLAB_DESTROY_BY_RCU. This means
there is the (unlikely) possibility that the object is going to be
switched under us in sections protected by rcu_read_lock() and
rcu_read_unlock(). So we need to verify that we have acquired the correct
object after establishing a stable object reference (incrementing the
refcounter does that).

Signed-off-by: Christoph Lameter <cl@linux-foundation.org>
Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>

("socketallocbench -n 8" result : from 3.01s to 2.20s)

[PATCH 7/7] fs: MS_NOREFCOUNT

Some fs are hardwired into kernel, and mntput()/mntget() hit a contended
cache line. We define a new superblock flag, MS_NOREFCOUNT, that is set
on socket, pipes and anonymous fd superblocks. mntput()/mntget() become
null ops on these fs.

("socketallocbench -n 8" result : from 2.20s to 1.64s)

cat socketallocbench.c
/*
 * socketallocbench benchmark
 *
 * Usage : socket [-n procs]  [-l loops]
 */
#include <sys/socket.h>
#include <unistd.h>
#include <stdlib.h>
#include <stdio.h>
#include <sys/wait.h>

void dowork(int loops)
{
        int i;

        for (i = 0; i < loops; i++)
                close(socket(AF_INET, SOCK_STREAM, 0));
}

int main(int argc, char *argv[])
{
        int i;
        int n = 1;
        int loops = 1000000;
        pid_t *pidtable;

        while ((i = getopt(argc, argv, "n:l:")) != EOF) {
                if (i == 'n')
                        n = atoi(optarg);
                if (i == 'l')
                        loops = atoi(optarg);
        }
        pidtable = malloc(n * sizeof(pid_t));
        for (i = 1; i < n; i++) {
                pidtable[i] = fork();
                if (pidtable[i] == 0) {
                        dowork(loops);
                        _exit(0);
                }
                if (pidtable[i] == -1) {
                        perror("fork");
                        n = i;
                        break;
                }
        }
        dowork(loops);
        for (i = 1; i < n; i++) {
                int status;

                wait(&status);
                }
        return 0;
}

^ permalink raw reply	[flat|nested] 37+ messages in thread

* [PATCH v3 1/7] fs: Use a percpu_counter to track nr_dentry
  2008-11-29  8:43                   ` [PATCH v2 0/5] fs: Scalability of sockets/pipes allocation/deallocation on SMP Eric Dumazet
  2008-12-11 22:38                     ` [PATCH v3 0/7] " Eric Dumazet
@ 2008-12-11 22:38                     ` Eric Dumazet
  2007-07-24  1:24                       ` Nick Piggin
       [not found]                       ` <49419680.8010409-fPLkHRcR87vqlBn2x/YWAg@public.gmane.org>
  2008-12-11 22:39                     ` [PATCH v3 2/7] fs: Use a percpu_counter to track nr_inodes Eric Dumazet
                                       ` (5 subsequent siblings)
  7 siblings, 2 replies; 37+ messages in thread
From: Eric Dumazet @ 2008-12-11 22:38 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Ingo Molnar, Christoph Hellwig, David Miller, Rafael J. Wysocki,
	linux-kernel,
	kernel-testers@vger.kernel.org >> Kernel Testers List,
	Mike Galbraith, Peter Zijlstra, Linux Netdev List,
	Christoph Lameter, linux-fsdevel, Al Viro, Paul E. McKenney

Adding a percpu_counter nr_dentry avoids cache line ping pongs
between cpus to maintain this metric, and dcache_lock is
no more needed to protect dentry_stat.nr_dentry

We centralize nr_dentry updates at the right place :
- increments in d_alloc()
- decrements in d_free()

d_alloc() can avoid taking dcache_lock if parent is NULL

("socketallocbench -n8" result : 27.5s to 25s)

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
---
 fs/dcache.c        |   49 +++++++++++++++++++++++++------------------
 include/linux/fs.h |    2 +
 kernel/sysctl.c    |    2 -
 3 files changed, 32 insertions(+), 21 deletions(-)

diff --git a/fs/dcache.c b/fs/dcache.c
index fa1ba03..f463a81 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -61,12 +61,31 @@ static struct kmem_cache *dentry_cache __read_mostly;
 static unsigned int d_hash_mask __read_mostly;
 static unsigned int d_hash_shift __read_mostly;
 static struct hlist_head *dentry_hashtable __read_mostly;
+static struct percpu_counter nr_dentry;
 
 /* Statistics gathering. */
 struct dentry_stat_t dentry_stat = {
 	.age_limit = 45,
 };
 
+/*
+ * Handle nr_dentry sysctl
+ */
+#if defined(CONFIG_SYSCTL) && defined(CONFIG_PROC_FS)
+int proc_nr_dentry(ctl_table *table, int write, struct file *filp,
+		   void __user *buffer, size_t *lenp, loff_t *ppos)
+{
+	dentry_stat.nr_dentry = percpu_counter_sum_positive(&nr_dentry);
+	return proc_dointvec(table, write, filp, buffer, lenp, ppos);
+}
+#else
+int proc_nr_dentry(ctl_table *table, int write, struct file *filp,
+		   void __user *buffer, size_t *lenp, loff_t *ppos)
+{
+	return -ENOSYS;
+}
+#endif
+
 static void __d_free(struct dentry *dentry)
 {
 	WARN_ON(!list_empty(&dentry->d_alias));
@@ -82,8 +101,7 @@ static void d_callback(struct rcu_head *head)
 }
 
 /*
- * no dcache_lock, please.  The caller must decrement dentry_stat.nr_dentry
- * inside dcache_lock.
+ * no dcache_lock, please.
  */
 static void d_free(struct dentry *dentry)
 {
@@ -94,6 +112,7 @@ static void d_free(struct dentry *dentry)
 		__d_free(dentry);
 	else
 		call_rcu(&dentry->d_u.d_rcu, d_callback);
+	percpu_counter_dec(&nr_dentry);
 }
 
 /*
@@ -172,7 +191,6 @@ static struct dentry *d_kill(struct dentry *dentry)
 	struct dentry *parent;
 
 	list_del(&dentry->d_u.d_child);
-	dentry_stat.nr_dentry--;	/* For d_free, below */
 	/*drops the locks, at that point nobody can reach this dentry */
 	dentry_iput(dentry);
 	if (IS_ROOT(dentry))
@@ -619,7 +637,6 @@ void shrink_dcache_sb(struct super_block * sb)
 static void shrink_dcache_for_umount_subtree(struct dentry *dentry)
 {
 	struct dentry *parent;
-	unsigned detached = 0;
 
 	BUG_ON(!IS_ROOT(dentry));
 
@@ -678,7 +695,6 @@ static void shrink_dcache_for_umount_subtree(struct dentry *dentry)
 			}
 
 			list_del(&dentry->d_u.d_child);
-			detached++;
 
 			inode = dentry->d_inode;
 			if (inode) {
@@ -696,7 +712,7 @@ static void shrink_dcache_for_umount_subtree(struct dentry *dentry)
 			 * otherwise we ascend to the parent and move to the
 			 * next sibling if there is one */
 			if (!parent)
-				goto out;
+				return;
 
 			dentry = parent;
 
@@ -705,11 +721,6 @@ static void shrink_dcache_for_umount_subtree(struct dentry *dentry)
 		dentry = list_entry(dentry->d_subdirs.next,
 				    struct dentry, d_u.d_child);
 	}
-out:
-	/* several dentries were freed, need to correct nr_dentry */
-	spin_lock(&dcache_lock);
-	dentry_stat.nr_dentry -= detached;
-	spin_unlock(&dcache_lock);
 }
 
 /*
@@ -943,8 +954,6 @@ struct dentry *d_alloc(struct dentry * parent, const struct qstr *name)
 	dentry->d_flags = DCACHE_UNHASHED;
 	spin_lock_init(&dentry->d_lock);
 	dentry->d_inode = NULL;
-	dentry->d_parent = NULL;
-	dentry->d_sb = NULL;
 	dentry->d_op = NULL;
 	dentry->d_fsdata = NULL;
 	dentry->d_mounted = 0;
@@ -959,16 +968,15 @@ struct dentry *d_alloc(struct dentry * parent, const struct qstr *name)
 	if (parent) {
 		dentry->d_parent = dget(parent);
 		dentry->d_sb = parent->d_sb;
+		spin_lock(&dcache_lock);
+		list_add(&dentry->d_u.d_child, &parent->d_subdirs);
+		spin_unlock(&dcache_lock);
 	} else {
+		dentry->d_parent = NULL;
+		dentry->d_sb = NULL;
 		INIT_LIST_HEAD(&dentry->d_u.d_child);
 	}
-
-	spin_lock(&dcache_lock);
-	if (parent)
-		list_add(&dentry->d_u.d_child, &parent->d_subdirs);
-	dentry_stat.nr_dentry++;
-	spin_unlock(&dcache_lock);
-
+	percpu_counter_inc(&nr_dentry);
 	return dentry;
 }
 
@@ -2282,6 +2290,7 @@ static void __init dcache_init(void)
 {
 	int loop;
 
+	percpu_counter_init(&nr_dentry, 0);
 	/* 
 	 * A constructor could be added for stable state like the lists,
 	 * but it is probably not worth it because of the cache nature
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 4a853ef..114cb65 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2217,6 +2217,8 @@ static inline void free_secdata(void *secdata)
 struct ctl_table;
 int proc_nr_files(struct ctl_table *table, int write, struct file *filp,
 		  void __user *buffer, size_t *lenp, loff_t *ppos);
+int proc_nr_dentry(struct ctl_table *table, int write, struct file *filp,
+		   void __user *buffer, size_t *lenp, loff_t *ppos);
 
 int get_filesystem_list(char * buf);
 
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 3d56fe7..777bee7 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -1246,7 +1246,7 @@ static struct ctl_table fs_table[] = {
 		.data		= &dentry_stat,
 		.maxlen		= 6*sizeof(int),
 		.mode		= 0444,
-		.proc_handler	= &proc_dointvec,
+		.proc_handler	= &proc_nr_dentry,
 	},
 	{
 		.ctl_name	= FS_OVERFLOWUID,

^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [PATCH v3 2/7] fs: Use a percpu_counter to track nr_inodes
  2008-11-29  8:43                   ` [PATCH v2 0/5] fs: Scalability of sockets/pipes allocation/deallocation on SMP Eric Dumazet
  2008-12-11 22:38                     ` [PATCH v3 0/7] " Eric Dumazet
  2008-12-11 22:38                     ` [PATCH v3 1/7] fs: Use a percpu_counter to track nr_dentry Eric Dumazet
@ 2008-12-11 22:39                     ` Eric Dumazet
       [not found]                       ` <4941968E.3020201-fPLkHRcR87vqlBn2x/YWAg@public.gmane.org>
  2008-12-11 22:39                     ` [PATCH v3 3/7] fs: Introduce a per_cpu last_ino allocator Eric Dumazet
                                       ` (4 subsequent siblings)
  7 siblings, 1 reply; 37+ messages in thread
From: Eric Dumazet @ 2008-12-11 22:39 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Ingo Molnar, Christoph Hellwig, David Miller, Rafael J. Wysocki,
	linux-kernel,
	kernel-testers@vger.kernel.org >> Kernel Testers List,
	Mike Galbraith, Peter Zijlstra, Linux Netdev List,
	Christoph Lameter, linux-fsdevel, Al Viro, Paul E. McKenney

Avoids cache line ping pongs between cpus and prepare next patch,
because updates of nr_inodes dont need inode_lock anymore.

(socket8 bench result : no difference at this point)

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
---
 fs/fs-writeback.c   |    2 +-
 fs/inode.c          |   39 +++++++++++++++++++++++++++++++--------
 include/linux/fs.h  |    3 +++
 kernel/sysctl.c     |    4 ++--
 mm/page-writeback.c |    2 +-
 5 files changed, 38 insertions(+), 12 deletions(-)


diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index d0ff0b8..b591cdd 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -608,7 +608,7 @@ void sync_inodes_sb(struct super_block *sb, int wait)
 	unsigned long nr_unstable = global_page_state(NR_UNSTABLE_NFS);
 
 	wbc.nr_to_write = nr_dirty + nr_unstable +
-			(inodes_stat.nr_inodes - inodes_stat.nr_unused) +
+			(get_nr_inodes() - inodes_stat.nr_unused) +
 			nr_dirty + nr_unstable;
 	wbc.nr_to_write += wbc.nr_to_write / 2;		/* Bit more for luck */
 	sync_sb_inodes(sb, &wbc);
diff --git a/fs/inode.c b/fs/inode.c
index 0487ddb..f94f889 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -96,9 +96,33 @@ static DEFINE_MUTEX(iprune_mutex);
  * Statistics gathering..
  */
 struct inodes_stat_t inodes_stat;
+static struct percpu_counter nr_inodes;
 
 static struct kmem_cache * inode_cachep __read_mostly;
 
+int get_nr_inodes(void)
+{
+	return percpu_counter_sum_positive(&nr_inodes);
+}
+
+/*
+ * Handle nr_dentry sysctl
+ */
+#if defined(CONFIG_SYSCTL) && defined(CONFIG_PROC_FS)
+int proc_nr_inodes(ctl_table *table, int write, struct file *filp,
+		   void __user *buffer, size_t *lenp, loff_t *ppos)
+{
+	inodes_stat.nr_inodes = get_nr_inodes();
+	return proc_dointvec(table, write, filp, buffer, lenp, ppos);
+}
+#else
+int proc_nr_inodes(ctl_table *table, int write, struct file *filp,
+		   void __user *buffer, size_t *lenp, loff_t *ppos)
+{
+	return -ENOSYS;
+}
+#endif
+
 static void wake_up_inode(struct inode *inode)
 {
 	/*
@@ -306,9 +330,7 @@ static void dispose_list(struct list_head *head)
 		destroy_inode(inode);
 		nr_disposed++;
 	}
-	spin_lock(&inode_lock);
-	inodes_stat.nr_inodes -= nr_disposed;
-	spin_unlock(&inode_lock);
+	percpu_counter_sub(&nr_inodes, nr_disposed);
 }
 
 /*
@@ -560,8 +582,8 @@ struct inode *new_inode(struct super_block *sb)
 	
 	inode = alloc_inode(sb);
 	if (inode) {
+		percpu_counter_inc(&nr_inodes);
 		spin_lock(&inode_lock);
-		inodes_stat.nr_inodes++;
 		list_add(&inode->i_list, &inode_in_use);
 		list_add(&inode->i_sb_list, &sb->s_inodes);
 		inode->i_ino = ++last_ino;
@@ -622,7 +644,7 @@ static struct inode * get_new_inode(struct super_block *sb, struct hlist_head *h
 			if (set(inode, data))
 				goto set_failed;
 
-			inodes_stat.nr_inodes++;
+			percpu_counter_inc(&nr_inodes);
 			list_add(&inode->i_list, &inode_in_use);
 			list_add(&inode->i_sb_list, &sb->s_inodes);
 			hlist_add_head(&inode->i_hash, head);
@@ -671,7 +693,7 @@ static struct inode * get_new_inode_fast(struct super_block *sb, struct hlist_he
 		old = find_inode_fast(sb, head, ino);
 		if (!old) {
 			inode->i_ino = ino;
-			inodes_stat.nr_inodes++;
+			percpu_counter_inc(&nr_inodes);
 			list_add(&inode->i_list, &inode_in_use);
 			list_add(&inode->i_sb_list, &sb->s_inodes);
 			hlist_add_head(&inode->i_hash, head);
@@ -1042,8 +1064,8 @@ void generic_delete_inode(struct inode *inode)
 	list_del_init(&inode->i_list);
 	list_del_init(&inode->i_sb_list);
 	inode->i_state |= I_FREEING;
-	inodes_stat.nr_inodes--;
 	spin_unlock(&inode_lock);
+	percpu_counter_dec(&nr_inodes);
 
 	security_inode_delete(inode);
 
@@ -1093,8 +1115,8 @@ static void generic_forget_inode(struct inode *inode)
 	list_del_init(&inode->i_list);
 	list_del_init(&inode->i_sb_list);
 	inode->i_state |= I_FREEING;
-	inodes_stat.nr_inodes--;
 	spin_unlock(&inode_lock);
+	percpu_counter_dec(&nr_inodes);
 	if (inode->i_data.nrpages)
 		truncate_inode_pages(&inode->i_data, 0);
 	clear_inode(inode);
@@ -1394,6 +1416,7 @@ void __init inode_init(void)
 {
 	int loop;
 
+	percpu_counter_init(&nr_inodes, 0);
 	/* inode slab cache */
 	inode_cachep = kmem_cache_create("inode_cache",
 					 sizeof(struct inode),
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 114cb65..a789346 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -47,6 +47,7 @@ struct inodes_stat_t {
 	int dummy[5];		/* padding for sysctl ABI compatibility */
 };
 extern struct inodes_stat_t inodes_stat;
+extern int get_nr_inodes(void);
 
 extern int leases_enable, lease_break_time;
 
@@ -2219,6 +2220,8 @@ int proc_nr_files(struct ctl_table *table, int write, struct file *filp,
 		  void __user *buffer, size_t *lenp, loff_t *ppos);
 int proc_nr_dentry(struct ctl_table *table, int write, struct file *filp,
 		   void __user *buffer, size_t *lenp, loff_t *ppos);
+int proc_nr_inodes(struct ctl_table *table, int write, struct file *filp,
+		   void __user *buffer, size_t *lenp, loff_t *ppos);
 
 int get_filesystem_list(char * buf);
 
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 777bee7..b705f3a 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -1205,7 +1205,7 @@ static struct ctl_table fs_table[] = {
 		.data		= &inodes_stat,
 		.maxlen		= 2*sizeof(int),
 		.mode		= 0444,
-		.proc_handler	= &proc_dointvec,
+		.proc_handler	= &proc_nr_inodes,
 	},
 	{
 		.ctl_name	= FS_STATINODE,
@@ -1213,7 +1213,7 @@ static struct ctl_table fs_table[] = {
 		.data		= &inodes_stat,
 		.maxlen		= 7*sizeof(int),
 		.mode		= 0444,
-		.proc_handler	= &proc_dointvec,
+		.proc_handler	= &proc_nr_inodes,
 	},
 	{
 		.procname	= "file-nr",
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 2970e35..a71a922 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -705,7 +705,7 @@ static void wb_kupdate(unsigned long arg)
 	next_jif = start_jif + dirty_writeback_interval;
 	nr_to_write = global_page_state(NR_FILE_DIRTY) +
 			global_page_state(NR_UNSTABLE_NFS) +
-			(inodes_stat.nr_inodes - inodes_stat.nr_unused);
+			(get_nr_inodes() - inodes_stat.nr_unused);
 	while (nr_to_write > 0) {
 		wbc.more_io = 0;
 		wbc.encountered_congestion = 0;

^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [PATCH v3 3/7] fs: Introduce a per_cpu last_ino allocator
  2008-11-29  8:43                   ` [PATCH v2 0/5] fs: Scalability of sockets/pipes allocation/deallocation on SMP Eric Dumazet
                                       ` (2 preceding siblings ...)
  2008-12-11 22:39                     ` [PATCH v3 2/7] fs: Use a percpu_counter to track nr_inodes Eric Dumazet
@ 2008-12-11 22:39                     ` Eric Dumazet
  2007-07-24  1:34                       ` Nick Piggin
  2008-12-16 21:26                       ` Paul E. McKenney
  2008-12-11 22:39                     ` [PATCH v3 4/7] fs: Introduce SINGLE dentries for pipes, socket, anon fd Eric Dumazet
                                       ` (3 subsequent siblings)
  7 siblings, 2 replies; 37+ messages in thread
From: Eric Dumazet @ 2008-12-11 22:39 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Ingo Molnar, Christoph Hellwig, David Miller, Rafael J. Wysocki,
	linux-kernel,
	kernel-testers@vger.kernel.org >> Kernel Testers List,
	Mike Galbraith, Peter Zijlstra, Linux Netdev List,
	Christoph Lameter, linux-fsdevel, Al Viro, Paul E. McKenney

new_inode() dirties a contended cache line to get increasing
inode numbers.

Solve this problem by providing to each cpu a per_cpu variable,
feeded by the shared last_ino, but once every 1024 allocations.

This reduce contention on the shared last_ino, and give same
spreading ino numbers than before.
(same wraparound after 2^32 allocations)

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
---
 fs/inode.c |   35 ++++++++++++++++++++++++++++++++---
 1 files changed, 32 insertions(+), 3 deletions(-)

diff --git a/fs/inode.c b/fs/inode.c
index f94f889..dc8e72a 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -556,6 +556,36 @@ repeat:
 	return node ? inode : NULL;
 }
 
+#ifdef CONFIG_SMP
+/*
+ * Each cpu owns a range of 1024 numbers.
+ * 'shared_last_ino' is dirtied only once out of 1024 allocations,
+ * to renew the exhausted range.
+ */
+static DEFINE_PER_CPU(int, last_ino);
+
+static int last_ino_get(void)
+{
+	static atomic_t shared_last_ino;
+	int *p = &get_cpu_var(last_ino);
+	int res = *p;
+
+	if (unlikely((res & 1023) == 0))
+		res = atomic_add_return(1024, &shared_last_ino) - 1024;
+
+	*p = ++res;
+	put_cpu_var(last_ino);
+	return res;
+}
+#else
+static int last_ino_get(void)
+{
+	static int last_ino;
+
+	return ++last_ino;
+}
+#endif
+
 /**
  *	new_inode 	- obtain an inode
  *	@sb: superblock
@@ -575,7 +605,6 @@ struct inode *new_inode(struct super_block *sb)
 	 * error if st_ino won't fit in target struct field. Use 32bit counter
 	 * here to attempt to avoid that.
 	 */
-	static unsigned int last_ino;
 	struct inode * inode;
 
 	spin_lock_prefetch(&inode_lock);
@@ -583,11 +612,11 @@ struct inode *new_inode(struct super_block *sb)
 	inode = alloc_inode(sb);
 	if (inode) {
 		percpu_counter_inc(&nr_inodes);
+		inode->i_state = 0;
+		inode->i_ino = last_ino_get();
 		spin_lock(&inode_lock);
 		list_add(&inode->i_list, &inode_in_use);
 		list_add(&inode->i_sb_list, &sb->s_inodes);
-		inode->i_ino = ++last_ino;
-		inode->i_state = 0;
 		spin_unlock(&inode_lock);
 	}
 	return inode;

^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [PATCH v3 4/7] fs: Introduce SINGLE dentries for pipes, socket, anon fd
  2008-11-29  8:43                   ` [PATCH v2 0/5] fs: Scalability of sockets/pipes allocation/deallocation on SMP Eric Dumazet
                                       ` (3 preceding siblings ...)
  2008-12-11 22:39                     ` [PATCH v3 3/7] fs: Introduce a per_cpu last_ino allocator Eric Dumazet
@ 2008-12-11 22:39                     ` Eric Dumazet
       [not found]                       ` <494196AA.6080002-fPLkHRcR87vqlBn2x/YWAg@public.gmane.org>
  2008-12-11 22:40                     ` [PATCH v3 5/7] fs: new_inode_single() and iput_single() Eric Dumazet
                                       ` (2 subsequent siblings)
  7 siblings, 1 reply; 37+ messages in thread
From: Eric Dumazet @ 2008-12-11 22:39 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Ingo Molnar, Christoph Hellwig, David Miller, Rafael J. Wysocki,
	linux-kernel,
	kernel-testers@vger.kernel.org >> Kernel Testers List,
	Mike Galbraith, Peter Zijlstra, Linux Netdev List,
	Christoph Lameter, linux-fsdevel, Al Viro, Paul E. McKenney

Sockets, pipes and anonymous fds have interesting properties.

Like other files, they use a dentry and an inode.

But dentries for these kind of files are not hashed into dcache,
since there is no way someone can lookup such a file in the vfs tree.
(/proc/{pid}/fd/{number} uses a different mechanism)

Still, allocating and freeing such dentries are expensive processes,
because we currently take dcache_lock inside d_alloc(), d_instantiate(),
and dput(). This lock is very contended on SMP machines.

This patch defines a new DCACHE_SINGLE flag, to mark a dentry as
a single one (for sockets, pipes, anonymous fd), and a new
d_alloc_single(const struct qstr *name, struct inode *inode)
method, called by the three subsystems.

Internally, dput() can take a fast path to dput_single() for
SINGLE dentries. No more atomic_dec_and_lock()
for such dentries.


Differences betwen an SINGLE dentry and a normal one are :

1) SINGLE dentry has the DCACHE_SINGLE flag
2) SINGLE dentry's parent is itself (DCACHE_DISCONNECTED)
This to avoid taking a reference on sb 'root' dentry, shared
by too many dentries.
3) They are not hashed into global hash table (DCACHE_UNHASHED)
4) Their d_alias list is empty

("socketallocbench -n 8" bench result : from 25s to 19.9s)

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
---
 fs/anon_inodes.c       |   16 ------------
 fs/dcache.c            |   51 +++++++++++++++++++++++++++++++++++++++
 fs/pipe.c              |   23 +----------------
 include/linux/dcache.h |    9 ++++++
 net/socket.c           |   24 +-----------------
 5 files changed, 65 insertions(+), 58 deletions(-)

diff --git a/fs/anon_inodes.c b/fs/anon_inodes.c
index 3662dd4..8bf83cb 100644
--- a/fs/anon_inodes.c
+++ b/fs/anon_inodes.c
@@ -33,23 +33,12 @@ static int anon_inodefs_get_sb(struct file_system_type *fs_type, int flags,
 			     mnt);
 }
 
-static int anon_inodefs_delete_dentry(struct dentry *dentry)
-{
-	/*
-	 * We faked vfs to believe the dentry was hashed when we created it.
-	 * Now we restore the flag so that dput() will work correctly.
-	 */
-	dentry->d_flags |= DCACHE_UNHASHED;
-	return 1;
-}
-
 static struct file_system_type anon_inode_fs_type = {
 	.name		= "anon_inodefs",
 	.get_sb		= anon_inodefs_get_sb,
 	.kill_sb	= kill_anon_super,
 };
 static struct dentry_operations anon_inodefs_dentry_operations = {
-	.d_delete	= anon_inodefs_delete_dentry,
 };
 
 /**
@@ -92,7 +81,7 @@ int anon_inode_getfd(const char *name, const struct file_operations *fops,
 	this.name = name;
 	this.len = strlen(name);
 	this.hash = 0;
-	dentry = d_alloc(anon_inode_mnt->mnt_sb->s_root, &this);
+	dentry = d_alloc_single(&this, anon_inode_inode);
 	if (!dentry)
 		goto err_put_unused_fd;
 
@@ -104,9 +93,6 @@ int anon_inode_getfd(const char *name, const struct file_operations *fops,
 	atomic_inc(&anon_inode_inode->i_count);
 
 	dentry->d_op = &anon_inodefs_dentry_operations;
-	/* Do not publish this dentry inside the global dentry hash table */
-	dentry->d_flags &= ~DCACHE_UNHASHED;
-	d_instantiate(dentry, anon_inode_inode);
 
 	error = -ENFILE;
 	file = alloc_file(anon_inode_mnt, dentry,
diff --git a/fs/dcache.c b/fs/dcache.c
index f463a81..af3bfb3 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -219,6 +219,23 @@ static struct dentry *d_kill(struct dentry *dentry)
  */
 
 /*
+ * special version of dput() for pipes/sockets/anon.
+ * These dentries are not present in hash table, we can avoid
+ * taking/dirtying dcache_lock
+ */
+static void dput_single(struct dentry *dentry)
+{
+	struct inode *inode;
+
+	if (!atomic_dec_and_test(&dentry->d_count))
+		return;
+	inode = dentry->d_inode;
+	if (inode)
+		iput(inode);
+	d_free(dentry);
+}
+
+/*
  * dput - release a dentry
  * @dentry: dentry to release 
  *
@@ -234,6 +251,11 @@ void dput(struct dentry *dentry)
 {
 	if (!dentry)
 		return;
+	/*
+	 * single dentries (sockets/pipes/anon) fast path
+	 */
+	if (dentry->d_flags & DCACHE_SINGLE)
+		return dput_single(dentry);
 
 repeat:
 	if (atomic_read(&dentry->d_count) == 1)
@@ -1119,6 +1141,35 @@ struct dentry * d_alloc_root(struct inode * root_inode)
 	return res;
 }
 
+/**
+ * d_alloc_single - allocate SINGLE dentry
+ * @name: dentry name, given in a qstr structure
+ * @inode: inode to allocate the dentry for
+ *
+ * Allocate an SINGLE dentry for the inode given. The inode is
+ * instantiated and returned. %NULL is returned if there is insufficient
+ * memory.
+ * - SINGLE dentries have themselves as a parent.
+ * - SINGLE dentries are not hashed into global hash table
+ * - their d_alias list is empty
+ */
+struct dentry *d_alloc_single(const struct qstr *name, struct inode *inode)
+{
+	struct dentry *entry;
+
+	entry = d_alloc(NULL, name);
+	if (entry) {
+		entry->d_sb = inode->i_sb;
+		entry->d_parent = entry;
+		entry->d_flags |= DCACHE_SINGLE | DCACHE_DISCONNECTED;
+		entry->d_inode = inode;
+		fsnotify_d_instantiate(entry, inode);
+		security_d_instantiate(entry, inode);
+	}
+	return entry;
+}
+
+
 static inline struct hlist_head *d_hash(struct dentry *parent,
 					unsigned long hash)
 {
diff --git a/fs/pipe.c b/fs/pipe.c
index 7aea8b8..4de6dd5 100644
--- a/fs/pipe.c
+++ b/fs/pipe.c
@@ -849,17 +849,6 @@ void free_pipe_info(struct inode *inode)
 }
 
 static struct vfsmount *pipe_mnt __read_mostly;
-static int pipefs_delete_dentry(struct dentry *dentry)
-{
-	/*
-	 * At creation time, we pretended this dentry was hashed
-	 * (by clearing DCACHE_UNHASHED bit in d_flags)
-	 * At delete time, we restore the truth : not hashed.
-	 * (so that dput() can proceed correctly)
-	 */
-	dentry->d_flags |= DCACHE_UNHASHED;
-	return 0;
-}
 
 /*
  * pipefs_dname() is called from d_path().
@@ -871,7 +860,6 @@ static char *pipefs_dname(struct dentry *dentry, char *buffer, int buflen)
 }
 
 static struct dentry_operations pipefs_dentry_operations = {
-	.d_delete	= pipefs_delete_dentry,
 	.d_dname	= pipefs_dname,
 };
 
@@ -918,7 +906,7 @@ struct file *create_write_pipe(int flags)
 	struct inode *inode;
 	struct file *f;
 	struct dentry *dentry;
-	struct qstr name = { .name = "" };
+	static const struct qstr name = { .name = "" };
 
 	err = -ENFILE;
 	inode = get_pipe_inode();
@@ -926,18 +914,11 @@ struct file *create_write_pipe(int flags)
 		goto err;
 
 	err = -ENOMEM;
-	dentry = d_alloc(pipe_mnt->mnt_sb->s_root, &name);
+	dentry = d_alloc_single(&name, inode);
 	if (!dentry)
 		goto err_inode;
 
 	dentry->d_op = &pipefs_dentry_operations;
-	/*
-	 * We dont want to publish this dentry into global dentry hash table.
-	 * We pretend dentry is already hashed, by unsetting DCACHE_UNHASHED
-	 * This permits a working /proc/$pid/fd/XXX on pipes
-	 */
-	dentry->d_flags &= ~DCACHE_UNHASHED;
-	d_instantiate(dentry, inode);
 
 	err = -ENFILE;
 	f = alloc_file(pipe_mnt, dentry, FMODE_WRITE, &write_pipefifo_fops);
diff --git a/include/linux/dcache.h b/include/linux/dcache.h
index a37359d..ca8d269 100644
--- a/include/linux/dcache.h
+++ b/include/linux/dcache.h
@@ -176,6 +176,14 @@ d_iput:		no		no		no       yes
 #define DCACHE_UNHASHED		0x0010	
 
 #define DCACHE_INOTIFY_PARENT_WATCHED	0x0020 /* Parent inode is watched */
+#define DCACHE_SINGLE		0x0040
+	/*
+	 * socket, pipe or anonymous fd dentry
+	 * - SINGLE dentries have themselves as a parent.
+	 * - SINGLE dentries are not hashed into global hash table
+	 * - Their d_alias list is empty
+	 * - They dont need dcache_lock synchronization
+	 */
 
 extern spinlock_t dcache_lock;
 extern seqlock_t rename_lock;
@@ -235,6 +243,7 @@ extern void shrink_dcache_sb(struct super_block *);
 extern void shrink_dcache_parent(struct dentry *);
 extern void shrink_dcache_for_umount(struct super_block *);
 extern int d_invalidate(struct dentry *);
+extern struct dentry *d_alloc_single(const struct qstr *, struct inode *);
 
 /* only used at mount-time */
 extern struct dentry * d_alloc_root(struct inode *);
diff --git a/net/socket.c b/net/socket.c
index 92764d8..353c928 100644
--- a/net/socket.c
+++ b/net/socket.c
@@ -308,18 +308,6 @@ static struct file_system_type sock_fs_type = {
 	.kill_sb =	kill_anon_super,
 };
 
-static int sockfs_delete_dentry(struct dentry *dentry)
-{
-	/*
-	 * At creation time, we pretended this dentry was hashed
-	 * (by clearing DCACHE_UNHASHED bit in d_flags)
-	 * At delete time, we restore the truth : not hashed.
-	 * (so that dput() can proceed correctly)
-	 */
-	dentry->d_flags |= DCACHE_UNHASHED;
-	return 0;
-}
-
 /*
  * sockfs_dname() is called from d_path().
  */
@@ -330,7 +318,6 @@ static char *sockfs_dname(struct dentry *dentry, char *buffer, int buflen)
 }
 
 static struct dentry_operations sockfs_dentry_operations = {
-	.d_delete = sockfs_delete_dentry,
 	.d_dname  = sockfs_dname,
 };
 
@@ -372,20 +359,13 @@ static int sock_alloc_fd(struct file **filep, int flags)
 static int sock_attach_fd(struct socket *sock, struct file *file, int flags)
 {
 	struct dentry *dentry;
-	struct qstr name = { .name = "" };
+	static const struct qstr name = { .name = "" };
 
-	dentry = d_alloc(sock_mnt->mnt_sb->s_root, &name);
+	dentry = d_alloc_single(&name, SOCK_INODE(sock));
 	if (unlikely(!dentry))
 		return -ENOMEM;
 
 	dentry->d_op = &sockfs_dentry_operations;
-	/*
-	 * We dont want to push this dentry into global dentry hash table.
-	 * We pretend dentry is already hashed, by unsetting DCACHE_UNHASHED
-	 * This permits a working /proc/$pid/fd/XXX on sockets
-	 */
-	dentry->d_flags &= ~DCACHE_UNHASHED;
-	d_instantiate(dentry, SOCK_INODE(sock));
 
 	sock->file = file;
 	init_file(file, sock_mnt, dentry, FMODE_READ | FMODE_WRITE,

^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [PATCH v3 5/7] fs: new_inode_single() and iput_single()
  2008-11-29  8:43                   ` [PATCH v2 0/5] fs: Scalability of sockets/pipes allocation/deallocation on SMP Eric Dumazet
                                       ` (4 preceding siblings ...)
  2008-12-11 22:39                     ` [PATCH v3 4/7] fs: Introduce SINGLE dentries for pipes, socket, anon fd Eric Dumazet
@ 2008-12-11 22:40                     ` Eric Dumazet
  2008-12-16 21:41                       ` Paul E. McKenney
       [not found]                     ` <493100B0.6090104-fPLkHRcR87vqlBn2x/YWAg@public.gmane.org>
  2008-12-11 22:41                     ` [PATCH v3 7/7] fs: MS_NOREFCOUNT Eric Dumazet
  7 siblings, 1 reply; 37+ messages in thread
From: Eric Dumazet @ 2008-12-11 22:40 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Ingo Molnar, Christoph Hellwig, David Miller, Rafael J. Wysocki,
	linux-kernel,
	kernel-testers@vger.kernel.org >> Kernel Testers List,
	Mike Galbraith, Peter Zijlstra, Linux Netdev List,
	Christoph Lameter, linux-fsdevel, Al Viro, Paul E. McKenney

Goal of this patch is to not touch inode_lock for socket/pipes/anonfd
inodes allocation/freeing.

SINGLE dentries are attached to inodes that dont need to be linked
in a list of inodes, being "inode_in_use" or "sb->s_inodes"
As inode_lock was taken only to protect these lists, we avoid taking it
as well.

Using iput_single() from dput_single() avoids taking inode_lock
at freeing time.

This patch has a very noticeable effect, because we avoid dirtying of
three contended cache lines in new_inode(), and five cache lines in iput()

("socketallocbench -n 8" result : from 19.9s to 3.01s)

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
---
 fs/anon_inodes.c   |    2 +-
 fs/dcache.c        |    2 +-
 fs/inode.c         |   29 ++++++++++++++++++++---------
 fs/pipe.c          |    2 +-
 include/linux/fs.h |   12 +++++++++++-
 net/socket.c       |    2 +-
 6 files changed, 35 insertions(+), 14 deletions(-)

diff --git a/fs/anon_inodes.c b/fs/anon_inodes.c
index 8bf83cb..89fd36d 100644
--- a/fs/anon_inodes.c
+++ b/fs/anon_inodes.c
@@ -125,7 +125,7 @@ EXPORT_SYMBOL_GPL(anon_inode_getfd);
  */
 static struct inode *anon_inode_mkinode(void)
 {
-	struct inode *inode = new_inode(anon_inode_mnt->mnt_sb);
+	struct inode *inode = new_inode_single(anon_inode_mnt->mnt_sb);
 
 	if (!inode)
 		return ERR_PTR(-ENOMEM);
diff --git a/fs/dcache.c b/fs/dcache.c
index af3bfb3..3363853 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -231,7 +231,7 @@ static void dput_single(struct dentry *dentry)
 		return;
 	inode = dentry->d_inode;
 	if (inode)
-		iput(inode);
+		iput_single(inode);
 	d_free(dentry);
 }
 
diff --git a/fs/inode.c b/fs/inode.c
index dc8e72a..0fdfe1b 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -221,6 +221,13 @@ void destroy_inode(struct inode *inode)
 		kmem_cache_free(inode_cachep, (inode));
 }
 
+void iput_single(struct inode *inode)
+{
+	if (atomic_dec_and_test(&inode->i_count)) {
+		destroy_inode(inode);
+		percpu_counter_dec(&nr_inodes);
+	}
+}
 
 /*
  * These are initializations that only need to be done
@@ -587,8 +594,9 @@ static int last_ino_get(void)
 #endif
 
 /**
- *	new_inode 	- obtain an inode
+ *	__new_inode 	- obtain an inode
  *	@sb: superblock
+ *  @single: if true, dont link new inode in a list
  *
  *	Allocates a new inode for given superblock. The default gfp_mask
  *	for allocations related to inode->i_mapping is GFP_HIGHUSER_PAGECACHE.
@@ -598,7 +606,7 @@ static int last_ino_get(void)
  *	newly created inode's mapping
  *
  */
-struct inode *new_inode(struct super_block *sb)
+struct inode *__new_inode(struct super_block *sb, int single)
 {
 	/*
 	 * On a 32bit, non LFS stat() call, glibc will generate an EOVERFLOW
@@ -607,22 +615,25 @@ struct inode *new_inode(struct super_block *sb)
 	 */
 	struct inode * inode;
 
-	spin_lock_prefetch(&inode_lock);
-	
 	inode = alloc_inode(sb);
 	if (inode) {
 		percpu_counter_inc(&nr_inodes);
 		inode->i_state = 0;
 		inode->i_ino = last_ino_get();
-		spin_lock(&inode_lock);
-		list_add(&inode->i_list, &inode_in_use);
-		list_add(&inode->i_sb_list, &sb->s_inodes);
-		spin_unlock(&inode_lock);
+ 		if (single) {
+  			INIT_LIST_HEAD(&inode->i_list);
+  			INIT_LIST_HEAD(&inode->i_sb_list);
+ 		} else {
+			spin_lock(&inode_lock);
+			list_add(&inode->i_list, &inode_in_use);
+			list_add(&inode->i_sb_list, &sb->s_inodes);
+			spin_unlock(&inode_lock);
+		}
 	}
 	return inode;
 }
 
-EXPORT_SYMBOL(new_inode);
+EXPORT_SYMBOL(__new_inode);
 
 void unlock_new_inode(struct inode *inode)
 {
diff --git a/fs/pipe.c b/fs/pipe.c
index 4de6dd5..8c51a0d 100644
--- a/fs/pipe.c
+++ b/fs/pipe.c
@@ -865,7 +865,7 @@ static struct dentry_operations pipefs_dentry_operations = {
 
 static struct inode * get_pipe_inode(void)
 {
-	struct inode *inode = new_inode(pipe_mnt->mnt_sb);
+	struct inode *inode = new_inode_single(pipe_mnt->mnt_sb);
 	struct pipe_inode_info *pipe;
 
 	if (!inode)
diff --git a/include/linux/fs.h b/include/linux/fs.h
index a789346..a702d81 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1899,7 +1899,17 @@ extern void __iget(struct inode * inode);
 extern void iget_failed(struct inode *);
 extern void clear_inode(struct inode *);
 extern void destroy_inode(struct inode *);
-extern struct inode *new_inode(struct super_block *);
+extern struct inode *__new_inode(struct super_block *, int);
+static inline struct inode *new_inode(struct super_block *sb)
+{
+	return __new_inode(sb, 0);
+}
+static inline struct inode *new_inode_single(struct super_block *sb)
+{
+	return __new_inode(sb, 1);
+}
+extern void iput_single(struct inode *);
+
 extern int should_remove_suid(struct dentry *);
 extern int file_remove_suid(struct file *);
 
diff --git a/net/socket.c b/net/socket.c
index 353c928..4017409 100644
--- a/net/socket.c
+++ b/net/socket.c
@@ -464,7 +464,7 @@ static struct socket *sock_alloc(void)
 	struct inode *inode;
 	struct socket *sock;
 
-	inode = new_inode(sock_mnt->mnt_sb);
+	inode = new_inode_single(sock_mnt->mnt_sb);
 	if (!inode)
 		return NULL;
 

^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [PATCH v3 6/7] fs: struct file move from call_rcu() to SLAB_DESTROY_BY_RCU
       [not found]                     ` <493100B0.6090104-fPLkHRcR87vqlBn2x/YWAg@public.gmane.org>
@ 2008-12-11 22:40                       ` Eric Dumazet
  2007-07-24  1:13                         ` Nick Piggin
  0 siblings, 1 reply; 37+ messages in thread
From: Eric Dumazet @ 2008-12-11 22:40 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Ingo Molnar, Christoph Hellwig, David Miller, Rafael J. Wysocki,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	kernel-testers-u79uwXL29TY76Z2rM5mHXA@public.gmane.org >> Kernel Testers List,
	Mike Galbraith, Peter Zijlstra, Linux Netdev List,
	Christoph Lameter, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Al Viro,
	Paul E. McKenney

From: Christoph Lameter <cl-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org>

[PATCH] fs: struct file move from call_rcu() to SLAB_DESTROY_BY_RCU

Currently we schedule RCU frees for each file we free separately. That has
several drawbacks against the earlier file handling (in 2.6.5 f.e.), which
did not require RCU callbacks:

1. Excessive number of RCU callbacks can be generated causing long RCU
  queues that in turn cause long latencies. We hit SLUB page allocation
  more often than necessary.

2. The cache hot object is not preserved between free and realloc. A close
  followed by another open is very fast with the RCUless approach because
  the last freed object is returned by the slab allocator that is
  still cache hot. RCU free means that the object is not immediately
  available again. The new object is cache cold and therefore open/close
  performance tests show a significant degradation with the RCU
  implementation.

One solution to this problem is to move the RCU freeing into the Slab
allocator by specifying SLAB_DESTROY_BY_RCU as an option at slab creation
time. The slab allocator will do RCU frees only when it is necessary
to dispose of slabs of objects (rare). So with that approach we can cut
out the RCU overhead significantly.

However, the slab allocator may return the object for another use even
before the RCU period has expired under SLAB_DESTROY_BY_RCU. This means
there is the (unlikely) possibility that the object is going to be
switched under us in sections protected by rcu_read_lock() and
rcu_read_unlock(). So we need to verify that we have acquired the correct
object after establishing a stable object reference (incrementing the
refcounter does that).


Signed-off-by: Christoph Lameter <cl-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org>
Signed-off-by: Eric Dumazet <dada1-fPLkHRcR87vqlBn2x/YWAg@public.gmane.org>
Signed-off-by: Paul E. McKenney <paulmck-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>
---
 Documentation/filesystems/files.txt |   21 ++++++++++++++--
 fs/file_table.c                     |   33 ++++++++++++++++++--------
 include/linux/fs.h                  |    5 ---
 3 files changed, 42 insertions(+), 17 deletions(-)

diff --git a/Documentation/filesystems/files.txt b/Documentation/filesystems/files.txt
index ac2facc..6916baa 100644
--- a/Documentation/filesystems/files.txt
+++ b/Documentation/filesystems/files.txt
@@ -78,13 +78,28 @@ the fdtable structure -
    that look-up may race with the last put() operation on the
    file structure. This is avoided using atomic_long_inc_not_zero()
    on ->f_count :
+   As file structures are allocated with SLAB_DESTROY_BY_RCU,
+   they can also be freed before a RCU grace period, and reused,
+   but still as a struct file.
+   It is necessary to check again after getting
+   a stable reference (ie after atomic_long_inc_not_zero()),
+   that fcheck_files(files, fd) points to the same file.
 
 	rcu_read_lock();
 	file = fcheck_files(files, fd);
 	if (file) {
-		if (atomic_long_inc_not_zero(&file->f_count))
+		if (atomic_long_inc_not_zero(&file->f_count)) {
 			*fput_needed = 1;
-		else
+			/*
+			 * Now we have a stable reference to an object.
+			 * Check if other threads freed file and reallocated it.
+			 */
+			if (file != fcheck_files(files, fd)) {
+				*fput_needed = 0;
+				put_filp(file);
+				file = NULL;
+			}
+		} else
 		/* Didn't get the reference, someone's freed */
 			file = NULL;
 	}
@@ -95,6 +110,8 @@ the fdtable structure -
    atomic_long_inc_not_zero() detects if refcounts is already zero or
    goes to zero during increment. If it does, we fail
    fget()/fget_light().
+   The second call to fcheck_files(files, fd) checks that this filp
+   was not freed, then reused by an other thread.
 
 6. Since both fdtable and file structures can be looked up
    lock-free, they must be installed using rcu_assign_pointer()
diff --git a/fs/file_table.c b/fs/file_table.c
index a46e880..3e9259d 100644
--- a/fs/file_table.c
+++ b/fs/file_table.c
@@ -37,17 +37,11 @@ static struct kmem_cache *filp_cachep __read_mostly;
 
 static struct percpu_counter nr_files __cacheline_aligned_in_smp;
 
-static inline void file_free_rcu(struct rcu_head *head)
-{
-	struct file *f =  container_of(head, struct file, f_u.fu_rcuhead);
-	kmem_cache_free(filp_cachep, f);
-}
-
 static inline void file_free(struct file *f)
 {
 	percpu_counter_dec(&nr_files);
 	file_check_state(f);
-	call_rcu(&f->f_u.fu_rcuhead, file_free_rcu);
+	kmem_cache_free(filp_cachep, f);
 }
 
 /*
@@ -306,6 +300,14 @@ struct file *fget(unsigned int fd)
 			rcu_read_unlock();
 			return NULL;
 		}
+		/*
+		 * Now we have a stable reference to an object.
+		 * Check if other threads freed file and re-allocated it.
+		 */
+		if (unlikely(file != fcheck_files(files, fd))) {
+			put_filp(file);
+			file = NULL;
+		}
 	}
 	rcu_read_unlock();
 
@@ -333,9 +335,19 @@ struct file *fget_light(unsigned int fd, int *fput_needed)
 		rcu_read_lock();
 		file = fcheck_files(files, fd);
 		if (file) {
-			if (atomic_long_inc_not_zero(&file->f_count))
+			if (atomic_long_inc_not_zero(&file->f_count)) {
 				*fput_needed = 1;
-			else
+				/*
+				 * Now we have a stable reference to an object.
+				 * Check if other threads freed this file and
+				 * re-allocated it.
+				 */
+				if (unlikely(file != fcheck_files(files, fd))) {
+					*fput_needed = 0;
+					put_filp(file);
+					file = NULL;
+				}
+			} else
 				/* Didn't get the reference, someone's freed */
 				file = NULL;
 		}
@@ -402,7 +414,8 @@ void __init files_init(unsigned long mempages)
 	int n; 
 
 	filp_cachep = kmem_cache_create("filp", sizeof(struct file), 0,
-			SLAB_HWCACHE_ALIGN | SLAB_PANIC, NULL);
+			SLAB_HWCACHE_ALIGN | SLAB_DESTROY_BY_RCU | SLAB_PANIC,
+			NULL);
 
 	/*
 	 * One file with associated inode and dcache is very roughly 1K. 
diff --git a/include/linux/fs.h b/include/linux/fs.h
index a702d81..a1f56d4 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -811,13 +811,8 @@ static inline int ra_has_index(struct file_ra_state *ra, pgoff_t index)
 #define FILE_MNT_WRITE_RELEASED	2
 
 struct file {
-	/*
-	 * fu_list becomes invalid after file_free is called and queued via
-	 * fu_rcuhead for RCU freeing
-	 */
 	union {
 		struct list_head	fu_list;
-		struct rcu_head 	fu_rcuhead;
 	} f_u;
 	struct path		f_path;
 #define f_dentry	f_path.dentry

^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [PATCH v3 7/7] fs: MS_NOREFCOUNT
  2008-11-29  8:43                   ` [PATCH v2 0/5] fs: Scalability of sockets/pipes allocation/deallocation on SMP Eric Dumazet
                                       ` (6 preceding siblings ...)
       [not found]                     ` <493100B0.6090104-fPLkHRcR87vqlBn2x/YWAg@public.gmane.org>
@ 2008-12-11 22:41                     ` Eric Dumazet
  7 siblings, 0 replies; 37+ messages in thread
From: Eric Dumazet @ 2008-12-11 22:41 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Ingo Molnar, Christoph Hellwig, David Miller, Rafael J. Wysocki,
	linux-kernel,
	kernel-testers@vger.kernel.org >> Kernel Testers List,
	Mike Galbraith, Peter Zijlstra, Linux Netdev List,
	Christoph Lameter, linux-fsdevel, Al Viro, Paul E. McKenney

Some fs are hardwired into kernel, and mntput()/mntget() hit a contended
cache line. We define a new superblock flag, MS_NOREFCOUNT, that is set
on socket, pipes and anonymous fd superblocks. mntput()/mntget() become
null ops on these fs.

("socketallocbench -n 8" result : from 2.20s to 1.64s)

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
---
 fs/anon_inodes.c      |    1 +
 fs/pipe.c             |    3 ++-
 include/linux/fs.h    |    2 ++
 include/linux/mount.h |    8 +++-----
 net/socket.c          |    1 +
 5 files changed, 9 insertions(+), 6 deletions(-)

diff --git a/fs/anon_inodes.c b/fs/anon_inodes.c
index 89fd36d..de0ec3b 100644
--- a/fs/anon_inodes.c
+++ b/fs/anon_inodes.c
@@ -158,6 +158,7 @@ static int __init anon_inode_init(void)
 		error = PTR_ERR(anon_inode_mnt);
 		goto err_unregister_filesystem;
 	}
+	anon_inode_mnt->mnt_sb->s_flags |= MS_NOREFCOUNT;
 	anon_inode_inode = anon_inode_mkinode();
 	if (IS_ERR(anon_inode_inode)) {
 		error = PTR_ERR(anon_inode_inode);
diff --git a/fs/pipe.c b/fs/pipe.c
index 8c51a0d..f547432 100644
--- a/fs/pipe.c
+++ b/fs/pipe.c
@@ -1078,7 +1078,8 @@ static int __init init_pipe_fs(void)
 		if (IS_ERR(pipe_mnt)) {
 			err = PTR_ERR(pipe_mnt);
 			unregister_filesystem(&pipe_fs_type);
-		}
+		} else
+			pipe_mnt->mnt_sb->s_flags |= MS_NOREFCOUNT;
 	}
 	return err;
 }
diff --git a/include/linux/fs.h b/include/linux/fs.h
index a1f56d4..11b0452 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -137,6 +137,8 @@ extern int dir_notify_enable;
 #define MS_RELATIME	(1<<21)	/* Update atime relative to mtime/ctime. */
 #define MS_KERNMOUNT	(1<<22) /* this is a kern_mount call */
 #define MS_I_VERSION	(1<<23) /* Update inode I_version field */
+
+#define MS_NOREFCOUNT	(1<<29) /* kernel static mnt : no refcounting needed */
 #define MS_ACTIVE	(1<<30)
 #define MS_NOUSER	(1<<31)
 
diff --git a/include/linux/mount.h b/include/linux/mount.h
index cab2a85..51418b5 100644
--- a/include/linux/mount.h
+++ b/include/linux/mount.h
@@ -14,10 +14,8 @@
 #include <linux/nodemask.h>
 #include <linux/spinlock.h>
 #include <asm/atomic.h>
+#include <linux/fs.h>
 
-struct super_block;
-struct vfsmount;
-struct dentry;
 struct mnt_namespace;
 
 #define MNT_NOSUID	0x01
@@ -73,7 +71,7 @@ struct vfsmount {
 
 static inline struct vfsmount *mntget(struct vfsmount *mnt)
 {
-	if (mnt)
+	if (mnt && !(mnt->mnt_sb->s_flags & MS_NOREFCOUNT))
 		atomic_inc(&mnt->mnt_count);
 	return mnt;
 }
@@ -87,7 +85,7 @@ extern int __mnt_is_readonly(struct vfsmount *mnt);
 
 static inline void mntput(struct vfsmount *mnt)
 {
-	if (mnt) {
+	if (mnt && !(mnt->mnt_sb->s_flags & MS_NOREFCOUNT)) {
 		mnt->mnt_expiry_mark = 0;
 		mntput_no_expire(mnt);
 	}
diff --git a/net/socket.c b/net/socket.c
index 4017409..2534dbc 100644
--- a/net/socket.c
+++ b/net/socket.c
@@ -2206,6 +2206,7 @@ static int __init sock_init(void)
 	init_inodecache();
 	register_filesystem(&sock_fs_type);
 	sock_mnt = kern_mount(&sock_fs_type);
+	sock_mnt->mnt_sb->s_flags |= MS_NOREFCOUNT;
 
 	/* The real protocol initialization is performed in later initcalls.
 	 */

^ permalink raw reply related	[flat|nested] 37+ messages in thread

* Re: [PATCH v3 6/7] fs: struct file move from call_rcu() to SLAB_DESTROY_BY_RCU
  2007-07-24  1:13                         ` Nick Piggin
@ 2008-12-12  2:50                           ` Nick Piggin
  2008-12-12  4:45                           ` Eric Dumazet
  1 sibling, 0 replies; 37+ messages in thread
From: Nick Piggin @ 2008-12-12  2:50 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Andrew Morton, Ingo Molnar, Christoph Hellwig, David Miller,
	Rafael J. Wysocki, linux-kernel,
	kernel-testers@vger.kernel.org >> Kernel Testers List,
	Mike Galbraith, Peter Zijlstra, Linux Netdev List,
	Christoph Lameter, linux-fsdevel, Al Viro, Paul E. McKenney

On Tuesday 24 July 2007 11:13, Nick Piggin wrote:
> On Friday 12 December 2008 09:40, Eric Dumazet wrote:
> > From: Christoph Lameter <cl@linux-foundation.org>
> >
> > [PATCH] fs: struct file move from call_rcu() to SLAB_DESTROY_BY_RCU
> >
> > Currently we schedule RCU frees for each file we free separately. That
> > has several drawbacks against the earlier file handling (in 2.6.5 f.e.),
> > which did not require RCU callbacks:
> >
> > 1. Excessive number of RCU callbacks can be generated causing long RCU
> >   queues that in turn cause long latencies. We hit SLUB page allocation
> >   more often than necessary.
> >
> > 2. The cache hot object is not preserved between free and realloc. A
> > close followed by another open is very fast with the RCUless approach
> > because the last freed object is returned by the slab allocator that is
> > still cache hot. RCU free means that the object is not immediately
> > available again. The new object is cache cold and therefore open/close
> > performance tests show a significant degradation with the RCU
> >   implementation.
> >
> > One solution to this problem is to move the RCU freeing into the Slab
> > allocator by specifying SLAB_DESTROY_BY_RCU as an option at slab creation
> > time. The slab allocator will do RCU frees only when it is necessary
> > to dispose of slabs of objects (rare). So with that approach we can cut
> > out the RCU overhead significantly.
> >
> > However, the slab allocator may return the object for another use even
> > before the RCU period has expired under SLAB_DESTROY_BY_RCU. This means
> > there is the (unlikely) possibility that the object is going to be
> > switched under us in sections protected by rcu_read_lock() and
> > rcu_read_unlock(). So we need to verify that we have acquired the correct
> > object after establishing a stable object reference (incrementing the
> > refcounter does that).
> >
> >
> > Signed-off-by: Christoph Lameter <cl@linux-foundation.org>
> > Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
> > Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> > ---
> >  Documentation/filesystems/files.txt |   21 ++++++++++++++--
> >  fs/file_table.c                     |   33 ++++++++++++++++++--------
> >  include/linux/fs.h                  |    5 ---
> >  3 files changed, 42 insertions(+), 17 deletions(-)
> >
> > diff --git a/Documentation/filesystems/files.txt
> > b/Documentation/filesystems/files.txt index ac2facc..6916baa 100644
> > --- a/Documentation/filesystems/files.txt
> > +++ b/Documentation/filesystems/files.txt
> > @@ -78,13 +78,28 @@ the fdtable structure -
> >     that look-up may race with the last put() operation on the
> >     file structure. This is avoided using atomic_long_inc_not_zero()
> >     on ->f_count :
> > +   As file structures are allocated with SLAB_DESTROY_BY_RCU,
> > +   they can also be freed before a RCU grace period, and reused,
> > +   but still as a struct file.
> > +   It is necessary to check again after getting
> > +   a stable reference (ie after atomic_long_inc_not_zero()),
> > +   that fcheck_files(files, fd) points to the same file.
> >
> >  	rcu_read_lock();
> >  	file = fcheck_files(files, fd);
> >  	if (file) {
> > -		if (atomic_long_inc_not_zero(&file->f_count))
> > +		if (atomic_long_inc_not_zero(&file->f_count)) {
> >  			*fput_needed = 1;
> > -		else
> > +			/*
> > +			 * Now we have a stable reference to an object.
> > +			 * Check if other threads freed file and reallocated it.
> > +			 */
> > +			if (file != fcheck_files(files, fd)) {
> > +				*fput_needed = 0;
> > +				put_filp(file);
> > +				file = NULL;
> > +			}
> > +		} else
> >  		/* Didn't get the reference, someone's freed */
> >  			file = NULL;
> >  	}
> > @@ -95,6 +110,8 @@ the fdtable structure -
> >     atomic_long_inc_not_zero() detects if refcounts is already zero or
> >     goes to zero during increment. If it does, we fail
> >     fget()/fget_light().
> > +   The second call to fcheck_files(files, fd) checks that this filp
> > +   was not freed, then reused by an other thread.
> >
> >  6. Since both fdtable and file structures can be looked up
> >     lock-free, they must be installed using rcu_assign_pointer()
> > diff --git a/fs/file_table.c b/fs/file_table.c
> > index a46e880..3e9259d 100644
> > --- a/fs/file_table.c
> > +++ b/fs/file_table.c
> > @@ -37,17 +37,11 @@ static struct kmem_cache *filp_cachep __read_mostly;
> >
> >  static struct percpu_counter nr_files __cacheline_aligned_in_smp;
> >
> > -static inline void file_free_rcu(struct rcu_head *head)
> > -{
> > -	struct file *f =  container_of(head, struct file, f_u.fu_rcuhead);
> > -	kmem_cache_free(filp_cachep, f);
> > -}
> > -
> >  static inline void file_free(struct file *f)
> >  {
> >  	percpu_counter_dec(&nr_files);
> >  	file_check_state(f);
> > -	call_rcu(&f->f_u.fu_rcuhead, file_free_rcu);
> > +	kmem_cache_free(filp_cachep, f);
> >  }
> >
> >  /*
> > @@ -306,6 +300,14 @@ struct file *fget(unsigned int fd)
> >  			rcu_read_unlock();
> >  			return NULL;
> >  		}
> > +		/*
> > +		 * Now we have a stable reference to an object.
> > +		 * Check if other threads freed file and re-allocated it.
> > +		 */
> > +		if (unlikely(file != fcheck_files(files, fd))) {
> > +			put_filp(file);
> > +			file = NULL;
> > +		}
>
> This is a non-trivial change, because that put_filp may drop the last
> reference to the file. So now we have the case where we free the file
> from a context in which it had never been allocated.
>
> From a quick glance though the callchains, I can't seen an obvious
> problem. But it needs to have documentation in put_filp, or at least
> a mention in the changelog, and also cc'ed to the security lists.
>
> Also, it adds code and cost to the get/put path in return for
> improvement in the free path. get/put is the more common path, but
> it is a small loss for a big improvement. So it might be worth it. But
> it is not justified by your microbenchmark. Do we have a more useful
> case that it helps?

Sorry, my clock screwed up and I didn't notice :(

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v3 6/7] fs: struct file move from call_rcu() to SLAB_DESTROY_BY_RCU
  2007-07-24  1:13                         ` Nick Piggin
  2008-12-12  2:50                           ` Nick Piggin
@ 2008-12-12  4:45                           ` Eric Dumazet
       [not found]                             ` <4941EC65.5040903-fPLkHRcR87vqlBn2x/YWAg@public.gmane.org>
  1 sibling, 1 reply; 37+ messages in thread
From: Eric Dumazet @ 2008-12-12  4:45 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Andrew Morton, Ingo Molnar, Christoph Hellwig, David Miller,
	Rafael J. Wysocki, linux-kernel,
	kernel-testers@vger.kernel.org >> Kernel Testers List,
	Mike Galbraith, Peter Zijlstra, Linux Netdev List,
	Christoph Lameter, linux-fsdevel, Al Viro, Paul E. McKenney

Nick Piggin a écrit :
> On Friday 12 December 2008 09:40, Eric Dumazet wrote:
>> From: Christoph Lameter <cl@linux-foundation.org>
>>
>> [PATCH] fs: struct file move from call_rcu() to SLAB_DESTROY_BY_RCU
>>
>> Currently we schedule RCU frees for each file we free separately. That has
>> several drawbacks against the earlier file handling (in 2.6.5 f.e.), which
>> did not require RCU callbacks:
>>
>> 1. Excessive number of RCU callbacks can be generated causing long RCU
>>   queues that in turn cause long latencies. We hit SLUB page allocation
>>   more often than necessary.
>>
>> 2. The cache hot object is not preserved between free and realloc. A close
>>   followed by another open is very fast with the RCUless approach because
>>   the last freed object is returned by the slab allocator that is
>>   still cache hot. RCU free means that the object is not immediately
>>   available again. The new object is cache cold and therefore open/close
>>   performance tests show a significant degradation with the RCU
>>   implementation.
>>
>> One solution to this problem is to move the RCU freeing into the Slab
>> allocator by specifying SLAB_DESTROY_BY_RCU as an option at slab creation
>> time. The slab allocator will do RCU frees only when it is necessary
>> to dispose of slabs of objects (rare). So with that approach we can cut
>> out the RCU overhead significantly.
>>
>> However, the slab allocator may return the object for another use even
>> before the RCU period has expired under SLAB_DESTROY_BY_RCU. This means
>> there is the (unlikely) possibility that the object is going to be
>> switched under us in sections protected by rcu_read_lock() and
>> rcu_read_unlock(). So we need to verify that we have acquired the correct
>> object after establishing a stable object reference (incrementing the
>> refcounter does that).
>>
>>
>> Signed-off-by: Christoph Lameter <cl@linux-foundation.org>
>> Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
>> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
>> ---
>>  Documentation/filesystems/files.txt |   21 ++++++++++++++--
>>  fs/file_table.c                     |   33 ++++++++++++++++++--------
>>  include/linux/fs.h                  |    5 ---
>>  3 files changed, 42 insertions(+), 17 deletions(-)
>>
>> diff --git a/Documentation/filesystems/files.txt
>> b/Documentation/filesystems/files.txt index ac2facc..6916baa 100644
>> --- a/Documentation/filesystems/files.txt
>> +++ b/Documentation/filesystems/files.txt
>> @@ -78,13 +78,28 @@ the fdtable structure -
>>     that look-up may race with the last put() operation on the
>>     file structure. This is avoided using atomic_long_inc_not_zero()
>>     on ->f_count :
>> +   As file structures are allocated with SLAB_DESTROY_BY_RCU,
>> +   they can also be freed before a RCU grace period, and reused,
>> +   but still as a struct file.
>> +   It is necessary to check again after getting
>> +   a stable reference (ie after atomic_long_inc_not_zero()),
>> +   that fcheck_files(files, fd) points to the same file.
>>
>>  	rcu_read_lock();
>>  	file = fcheck_files(files, fd);
>>  	if (file) {
>> -		if (atomic_long_inc_not_zero(&file->f_count))
>> +		if (atomic_long_inc_not_zero(&file->f_count)) {
>>  			*fput_needed = 1;
>> -		else
>> +			/*
>> +			 * Now we have a stable reference to an object.
>> +			 * Check if other threads freed file and reallocated it.
>> +			 */
>> +			if (file != fcheck_files(files, fd)) {
>> +				*fput_needed = 0;
>> +				put_filp(file);
>> +				file = NULL;
>> +			}
>> +		} else
>>  		/* Didn't get the reference, someone's freed */
>>  			file = NULL;
>>  	}
>> @@ -95,6 +110,8 @@ the fdtable structure -
>>     atomic_long_inc_not_zero() detects if refcounts is already zero or
>>     goes to zero during increment. If it does, we fail
>>     fget()/fget_light().
>> +   The second call to fcheck_files(files, fd) checks that this filp
>> +   was not freed, then reused by an other thread.
>>
>>  6. Since both fdtable and file structures can be looked up
>>     lock-free, they must be installed using rcu_assign_pointer()
>> diff --git a/fs/file_table.c b/fs/file_table.c
>> index a46e880..3e9259d 100644
>> --- a/fs/file_table.c
>> +++ b/fs/file_table.c
>> @@ -37,17 +37,11 @@ static struct kmem_cache *filp_cachep __read_mostly;
>>
>>  static struct percpu_counter nr_files __cacheline_aligned_in_smp;
>>
>> -static inline void file_free_rcu(struct rcu_head *head)
>> -{
>> -	struct file *f =  container_of(head, struct file, f_u.fu_rcuhead);
>> -	kmem_cache_free(filp_cachep, f);
>> -}
>> -
>>  static inline void file_free(struct file *f)
>>  {
>>  	percpu_counter_dec(&nr_files);
>>  	file_check_state(f);
>> -	call_rcu(&f->f_u.fu_rcuhead, file_free_rcu);
>> +	kmem_cache_free(filp_cachep, f);
>>  }
>>
>>  /*
>> @@ -306,6 +300,14 @@ struct file *fget(unsigned int fd)
>>  			rcu_read_unlock();
>>  			return NULL;
>>  		}
>> +		/*
>> +		 * Now we have a stable reference to an object.
>> +		 * Check if other threads freed file and re-allocated it.
>> +		 */
>> +		if (unlikely(file != fcheck_files(files, fd))) {
>> +			put_filp(file);
>> +			file = NULL;
>> +		}
> 
> This is a non-trivial change, because that put_filp may drop the last
> reference to the file. So now we have the case where we free the file
> from a context in which it had never been allocated.

If we got at this point, we :

Found a non NULL pointer in our fd table.
Then, another thread came, closed the file while we not yet added our reference.
This file was freed (kmem_cache_free(filp_cachep, file))
This file was reused and inserted on another thread fd table.
We added our reference on refcount.
We checked if this file is still ours (in our fd tab).
We found this file is not anymore the file we wanted.
Calling put_filp() here is our only choice to safely remove the reference on
a truly allocated file. At this point the file is
a truly allocated file but not anymore ours.
Unfortunatly we added a reference on it : we must release it.
If the other thread already called put_filp() because it wanted to close its new file,
we must see f_refcnt going to zero, and we must call __fput(), to perform
all the relevant file cleanup ourself.


> 
>>From a quick glance though the callchains, I can't seen an obvious
> problem. But it needs to have documentation in put_filp, or at least
> a mention in the changelog, and also cc'ed to the security lists.

I see your point. But currently, any thread can be "releasing the last
reference on a file". That is not always the thread that called close(fd)
We extend this to "any thread of any process", so it might have
a security effect you are absolutely right.

> 
> Also, it adds code and cost to the get/put path in return for
> improvement in the free path. get/put is the more common path, but
> it is a small loss for a big improvement. So it might be worth it. But
> it is not justified by your microbenchmark. Do we have a more useful
> case that it helps?

Any real world program that open and close files, or said better,
that close and open files :)

sizeof(struct file) is 192 bytes. Thats three cache lines.
Being able to reuse a hot "struct file" avoids three cache line misses.

Thats about 120 ns.

Then, using call_rcu() is also a latency killer, since we explicitly say :
I dont want to free this file right now, I delegate this job to another layer
in two or three milli second (or more)

A final point is that SLUB doesnt need to allocate or free a slab in many cases.
(This is probably why Christoph needed this patch in 2006 :) )
In my case, I need all these patches to speedup http servers.
They obviously open and close many files per second.

The added code has a cost of less than 3 ns, but I suspect we can cut it to less than 1ns
We prefered with Christoph and Paul to keep patch as short as possible to focus
on essential points.

               :c0287656:       mov    -0x14(%ebp),%esi
               :c0287659:       mov    -0x24(%ebp),%edi
               :c028765c:       mov    0x4(%esi),%eax
               :c028765f:       cmp    (%eax),%edi
               :c0287661:       jb     c0287678 <fget+0xc8>
               :c0287663:       mov    %ebx,%eax
               :c0287665:       xor    %ebx,%ebx
               :c0287667:       call   c0287450 <put_filp>
               :c028766c:       jmp    c02875ec <fget+0x3c>
               :c0287671:       lea    0x0(%esi,%eiz,1),%esi
               :c0287678:       mov    0x4(%eax),%edi
               :c028767b:       add    %edi,-0x10(%ebp)
               :c028767e:       mov    -0x10(%ebp),%edx
     1 8.8e-05 :c0287681:       mov    (%edx),%eax
               :c0287683:       cmp    %eax,%ebx
               :c0287685:       je     c02875ec <fget+0x3c>
               :c028768b:       jmp    c0287663 <fget+0xb3>

We could avoid doing the full test, because there is no way the files->max_fds could
become lower under us, or even fdt itself, and fdt->fd

So instead of using twice this function :

static inline struct file * fcheck_files(struct files_struct *files, unsigned int fd)
{
        struct file * file = NULL;
        struct fdtable *fdt = files_fdtable(files);

        if (fd < fdt->max_fds)
                file = rcu_dereference(fdt->fd[fd]);
        return file;
}

We could use the attached patch


This becomes a matter of three instructions, including a 99.99% predicted branch :

c0287646:       8b 03                   mov    (%ebx),%eax
c0287648:       39 45 e4                cmp    %eax,-0x1c(%ebp)
c028764b:       74 a1                   je     c02875ee <fget+0x3e>

c028764d:       8b 45 e4                mov    -0x1c(%ebp),%eax
c0287650:       e8 fb fd ff ff          call   c0287450 <put_filp>
c0287655:       31 c0                   xor    %eax,%eax
c0287657:       eb 98                   jmp    c02875f1 <fget+0x41>
	

At the time Christoph sent its patch (in 2006), nobody cared, because
we had no benchmark or real world workload that demonstrated the gain 
of his patch, only intuitions.
We had too many contended cache lines that slow down the whole process.

SLAB_DESTROY_BY_RCU is a must on current hardware, where memory cache line
misses costs become really problematic. This patch series clearly demonstrate
it.

Thanks Nick for your feedback and comments.

Eric

[PATCH] fs: optimize fget() & fget_light()

Instead of calling fcheck_files() a second time, we can take into account we
already did part of the job, in a rcu read locked section. We need a
struct file **filp pointer so that we only dereference it a second time.

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
---
 fs/file_table.c |   23 +++++++++++++++++------
 1 files changed, 17 insertions(+), 6 deletions(-)

diff --git a/fs/file_table.c b/fs/file_table.c
index 3e9259d..4bc019f 100644
--- a/fs/file_table.c
+++ b/fs/file_table.c
@@ -289,11 +289,16 @@ void __fput(struct file *file)
 
 struct file *fget(unsigned int fd)
 {
-	struct file *file;
+	struct file *file = NULL, **filp;
 	struct files_struct *files = current->files;
+	struct fdtable *fdt;
 
 	rcu_read_lock();
-	file = fcheck_files(files, fd);
+	fdt = files_fdtable(files);
+	if (likely(fd < fdt->max_fds)) {
+		filp = &fdt->fd[fd];
+		file = rcu_dereference(*filp);
+	}
 	if (file) {
 		if (!atomic_long_inc_not_zero(&file->f_count)) {
 			/* File object ref couldn't be taken */
@@ -304,7 +309,7 @@ struct file *fget(unsigned int fd)
 		 * Now we have a stable reference to an object.
 		 * Check if other threads freed file and re-allocated it.
 		 */
-		if (unlikely(file != fcheck_files(files, fd))) {
+		if (unlikely(file != rcu_dereference(*filp))) {
 			put_filp(file);
 			file = NULL;
 		}
@@ -325,15 +330,21 @@ EXPORT_SYMBOL(fget);
  */
 struct file *fget_light(unsigned int fd, int *fput_needed)
 {
-	struct file *file;
+	struct file *file, **filp;
 	struct files_struct *files = current->files;
+	struct fdtable *fdt;
 
 	*fput_needed = 0;
 	if (likely((atomic_read(&files->count) == 1))) {
 		file = fcheck_files(files, fd);
 	} else {
 		rcu_read_lock();
-		file = fcheck_files(files, fd);
+		fdt = files_fdtable(files);
+		file = NULL;
+		if (likely(fd < fdt->max_fds)) {
+			filp = &fdt->fd[fd];
+			file = rcu_dereference(*filp);
+		}
 		if (file) {
 			if (atomic_long_inc_not_zero(&file->f_count)) {
 				*fput_needed = 1;
@@ -342,7 +353,7 @@ struct file *fget_light(unsigned int fd, int *fput_needed)
 				 * Check if other threads freed this file and
 				 * re-allocated it.
 				 */
-				if (unlikely(file != fcheck_files(files, fd))) {
+				if (unlikely(file != rcu_dereference(*filp))) {
 					*fput_needed = 0;
 					put_filp(file);
 					file = NULL;


^ permalink raw reply related	[flat|nested] 37+ messages in thread

* Re: [PATCH v3 2/7] fs: Use a percpu_counter to track nr_inodes
       [not found]                           ` <200707241130.56767.nickpiggin-/E1597aS9LT0CCvOHzKKcA@public.gmane.org>
@ 2008-12-12  5:11                             ` Eric Dumazet
  0 siblings, 0 replies; 37+ messages in thread
From: Eric Dumazet @ 2008-12-12  5:11 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Andrew Morton, Ingo Molnar, Christoph Hellwig, David Miller,
	Rafael J. Wysocki, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	kernel-testers-u79uwXL29TY76Z2rM5mHXA@public.gmane.org >> Kernel Testers List,
	Mike Galbraith, Peter Zijlstra, Linux Netdev List,
	Christoph Lameter, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Al Viro,
	Paul E. McKenney

Nick Piggin a écrit :
> On Friday 12 December 2008 09:39, Eric Dumazet wrote:
>> Avoids cache line ping pongs between cpus and prepare next patch,
>> because updates of nr_inodes dont need inode_lock anymore.
>>
>> (socket8 bench result : no difference at this point)
> 
> Looks good.
> 
> But.... If we never actually need fast access to the approximate
> total, (which seems to apply to this and the previous patch) we
> could use something much simpler which does not have the spinlock
> or all this batching stuff that percpu counters have. I'd prefer
> that because it will be faster in a straight line...

Well, using a non batching mode could be real easy, just
call __percpu_counter_add(&counter, inc, 1<<30);

Or define a new percpu_counter_fastadd(&counter, inc);

percpu_counter are nice because handle the CPU hotplug problem,
if we want to use for_each_online_cpu() instead of
for_each_possible_cpu().

> 
> (BTW. percpu counters can't be used in interrupt context? That's
> nice.)
> 
> 

Not sure why you said this.

I would like to have a irqsafe percpu_counter, I was preparing such a
patch because we need it for net-next

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v3 6/7] fs: struct file move from call_rcu() to SLAB_DESTROY_BY_RCU
       [not found]                             ` <4941EC65.5040903-fPLkHRcR87vqlBn2x/YWAg@public.gmane.org>
@ 2008-12-12 16:48                               ` Eric Dumazet
       [not found]                                 ` <494295C6.2020906-fPLkHRcR87vqlBn2x/YWAg@public.gmane.org>
  2008-12-13  1:41                               ` Christoph Lameter
  1 sibling, 1 reply; 37+ messages in thread
From: Eric Dumazet @ 2008-12-12 16:48 UTC (permalink / raw)
  To: Christoph Lameter, Paul E. McKenney
  Cc: Nick Piggin, Andrew Morton, Ingo Molnar, Christoph Hellwig,
	David Miller, Rafael J. Wysocki,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	kernel-testers-u79uwXL29TY76Z2rM5mHXA@public.gmane.org >> Kernel Testers List,
	Mike Galbraith, Peter Zijlstra, Linux Netdev List,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Al Viro

Eric Dumazet a écrit :
> Nick Piggin a écrit :
>> On Friday 12 December 2008 09:40, Eric Dumazet wrote:
>>> From: Christoph Lameter <cl-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org>
>>>
>>> [PATCH] fs: struct file move from call_rcu() to SLAB_DESTROY_BY_RCU
>>>
>>> Currently we schedule RCU frees for each file we free separately. That has
>>> several drawbacks against the earlier file handling (in 2.6.5 f.e.), which
>>> did not require RCU callbacks:
>>>
>>> 1. Excessive number of RCU callbacks can be generated causing long RCU
>>>   queues that in turn cause long latencies. We hit SLUB page allocation
>>>   more often than necessary.
>>>
>>> 2. The cache hot object is not preserved between free and realloc. A close
>>>   followed by another open is very fast with the RCUless approach because
>>>   the last freed object is returned by the slab allocator that is
>>>   still cache hot. RCU free means that the object is not immediately
>>>   available again. The new object is cache cold and therefore open/close
>>>   performance tests show a significant degradation with the RCU
>>>   implementation.
>>>
>>> One solution to this problem is to move the RCU freeing into the Slab
>>> allocator by specifying SLAB_DESTROY_BY_RCU as an option at slab creation
>>> time. The slab allocator will do RCU frees only when it is necessary
>>> to dispose of slabs of objects (rare). So with that approach we can cut
>>> out the RCU overhead significantly.
>>>
>>> However, the slab allocator may return the object for another use even
>>> before the RCU period has expired under SLAB_DESTROY_BY_RCU. This means
>>> there is the (unlikely) possibility that the object is going to be
>>> switched under us in sections protected by rcu_read_lock() and
>>> rcu_read_unlock(). So we need to verify that we have acquired the correct
>>> object after establishing a stable object reference (incrementing the
>>> refcounter does that).
>>>
>>>
>>> Signed-off-by: Christoph Lameter <cl-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org>
>>> Signed-off-by: Eric Dumazet <dada1-fPLkHRcR87vqlBn2x/YWAg@public.gmane.org>
>>> Signed-off-by: Paul E. McKenney <paulmck-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>
>>> ---
>>>  Documentation/filesystems/files.txt |   21 ++++++++++++++--
>>>  fs/file_table.c                     |   33 ++++++++++++++++++--------
>>>  include/linux/fs.h                  |    5 ---
>>>  3 files changed, 42 insertions(+), 17 deletions(-)
>>>
>>> diff --git a/Documentation/filesystems/files.txt
>>> b/Documentation/filesystems/files.txt index ac2facc..6916baa 100644
>>> --- a/Documentation/filesystems/files.txt
>>> +++ b/Documentation/filesystems/files.txt
>>> @@ -78,13 +78,28 @@ the fdtable structure -
>>>     that look-up may race with the last put() operation on the
>>>     file structure. This is avoided using atomic_long_inc_not_zero()
>>>     on ->f_count :
>>> +   As file structures are allocated with SLAB_DESTROY_BY_RCU,
>>> +   they can also be freed before a RCU grace period, and reused,
>>> +   but still as a struct file.
>>> +   It is necessary to check again after getting
>>> +   a stable reference (ie after atomic_long_inc_not_zero()),
>>> +   that fcheck_files(files, fd) points to the same file.
>>>
>>>  	rcu_read_lock();
>>>  	file = fcheck_files(files, fd);
>>>  	if (file) {
>>> -		if (atomic_long_inc_not_zero(&file->f_count))
>>> +		if (atomic_long_inc_not_zero(&file->f_count)) {
>>>  			*fput_needed = 1;
>>> -		else
>>> +			/*
>>> +			 * Now we have a stable reference to an object.
>>> +			 * Check if other threads freed file and reallocated it.
>>> +			 */
>>> +			if (file != fcheck_files(files, fd)) {
>>> +				*fput_needed = 0;
>>> +				put_filp(file);
>>> +				file = NULL;
>>> +			}
>>> +		} else
>>>  		/* Didn't get the reference, someone's freed */
>>>  			file = NULL;
>>>  	}
>>> @@ -95,6 +110,8 @@ the fdtable structure -
>>>     atomic_long_inc_not_zero() detects if refcounts is already zero or
>>>     goes to zero during increment. If it does, we fail
>>>     fget()/fget_light().
>>> +   The second call to fcheck_files(files, fd) checks that this filp
>>> +   was not freed, then reused by an other thread.
>>>
>>>  6. Since both fdtable and file structures can be looked up
>>>     lock-free, they must be installed using rcu_assign_pointer()
>>> diff --git a/fs/file_table.c b/fs/file_table.c
>>> index a46e880..3e9259d 100644
>>> --- a/fs/file_table.c
>>> +++ b/fs/file_table.c
>>> @@ -37,17 +37,11 @@ static struct kmem_cache *filp_cachep __read_mostly;
>>>
>>>  static struct percpu_counter nr_files __cacheline_aligned_in_smp;
>>>
>>> -static inline void file_free_rcu(struct rcu_head *head)
>>> -{
>>> -	struct file *f =  container_of(head, struct file, f_u.fu_rcuhead);
>>> -	kmem_cache_free(filp_cachep, f);
>>> -}
>>> -
>>>  static inline void file_free(struct file *f)
>>>  {
>>>  	percpu_counter_dec(&nr_files);
>>>  	file_check_state(f);
>>> -	call_rcu(&f->f_u.fu_rcuhead, file_free_rcu);
>>> +	kmem_cache_free(filp_cachep, f);
>>>  }
>>>
>>>  /*
>>> @@ -306,6 +300,14 @@ struct file *fget(unsigned int fd)
>>>  			rcu_read_unlock();
>>>  			return NULL;
>>>  		}
>>> +		/*
>>> +		 * Now we have a stable reference to an object.
>>> +		 * Check if other threads freed file and re-allocated it.
>>> +		 */
>>> +		if (unlikely(file != fcheck_files(files, fd))) {
>>> +			put_filp(file);
>>> +			file = NULL;
>>> +		}
>> This is a non-trivial change, because that put_filp may drop the last
>> reference to the file. So now we have the case where we free the file
>> from a context in which it had never been allocated.
> 
> If we got at this point, we :
> 
> Found a non NULL pointer in our fd table.
> Then, another thread came, closed the file while we not yet added our reference.
> This file was freed (kmem_cache_free(filp_cachep, file))
> This file was reused and inserted on another thread fd table.
> We added our reference on refcount.
> We checked if this file is still ours (in our fd tab).
> We found this file is not anymore the file we wanted.
> Calling put_filp() here is our only choice to safely remove the reference on
> a truly allocated file. At this point the file is
> a truly allocated file but not anymore ours.
> Unfortunatly we added a reference on it : we must release it.
> If the other thread already called put_filp() because it wanted to close its new file,
> we must see f_refcnt going to zero, and we must call __fput(), to perform
> all the relevant file cleanup ourself.

Reading again this mail I realise we call put_filp(file), while this should
be fput(file) or put_filp(file), we dont know.

Damned, this patch is wrong as is.

Christoph, Paul, do you see the problem ?

In fget()/fget_light() we dont know if the other thread (the one who re-allocated the file,
and tried to close it while we got a reference on file) had to call put_filp() or fput()
to release its own reference. So we call atomic_long_dec_and_test() and cannot
take the appropriate action (calling the full __fput() version or the small one,
that some systems use to 'close' an not really opened file.

void put_filp(struct file *file)
{
        if (atomic_long_dec_and_test(&file->f_count)) {
                security_file_free(file);
                file_kill(file);
                file_free(file);
        }
}

void fput(struct file *file)
{
        if (atomic_long_dec_and_test(&file->f_count))
                __fput(file);
}

I believe put_filp() is only called on slowpath (error cases).

Should we just zap it and always call fput() ?

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v3 6/7] fs: struct file move from call_rcu() to SLAB_DESTROY_BY_RCU
       [not found]                             ` <4941EC65.5040903-fPLkHRcR87vqlBn2x/YWAg@public.gmane.org>
  2008-12-12 16:48                               ` Eric Dumazet
@ 2008-12-13  1:41                               ` Christoph Lameter
  1 sibling, 0 replies; 37+ messages in thread
From: Christoph Lameter @ 2008-12-13  1:41 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Nick Piggin, Andrew Morton, Ingo Molnar, Christoph Hellwig,
	David Miller, Rafael J. Wysocki,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	kernel-testers-u79uwXL29TY76Z2rM5mHXA@public.gmane.org >> Kernel Testers List,
	Mike Galbraith, Peter Zijlstra, Linux Netdev List,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Al Viro, Paul E. McKenney

On Fri, 12 Dec 2008, Eric Dumazet wrote:


> > This is a non-trivial change, because that put_filp may drop the last
> > reference to the file. So now we have the case where we free the file
> > from a context in which it had never been allocated.
>
> If we got at this point, we :
>
> Found a non NULL pointer in our fd table.
> Then, another thread came, closed the file while we not yet added our reference.
> This file was freed (kmem_cache_free(filp_cachep, file))
> This file was reused and inserted on another thread fd table.
> We added our reference on refcount.
> We checked if this file is still ours (in our fd tab).
> We found this file is not anymore the file we wanted.
> Calling put_filp() here is our only choice to safely remove the reference on
> a truly allocated file. At this point the file is
> a truly allocated file but not anymore ours.
> Unfortunatly we added a reference on it : we must release it.
> If the other thread already called put_filp() because it wanted to close its new file,
> we must see f_refcnt going to zero, and we must call __fput(), to perform
> all the relevant file cleanup ourself.

Correct. That was the idea.

> A final point is that SLUB doesnt need to allocate or free a slab in many cases.
> (This is probably why Christoph needed this patch in 2006 :) )

We needed this patch in 2006 because the AIM9 creat-clo test showed
regressions after the rcu free was put in (discovered during SLES11
verification cycle). All slab allocators do at least defer frees until all
objects in the page are freed if not longer.

> In my case, I need all these patches to speedup http servers.
> They obviously open and close many files per second.

Run AIM9 creat-close tests....

> SLAB_DESTROY_BY_RCU is a must on current hardware, where memory cache line
> misses costs become really problematic. This patch series clearly demonstrate
> it.

Well the issue becomes more severe as accesses to cold memory become more
extensive. Thanks for your work on this.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v3 6/7] fs: struct file move from call_rcu() to SLAB_DESTROY_BY_RCU
       [not found]                                 ` <494295C6.2020906-fPLkHRcR87vqlBn2x/YWAg@public.gmane.org>
@ 2008-12-13  2:07                                   ` Christoph Lameter
       [not found]                                     ` <Pine.LNX.4.64.0812121958470.15781-dRBSpnHQED8AvxtiuMwx3w@public.gmane.org>
  0 siblings, 1 reply; 37+ messages in thread
From: Christoph Lameter @ 2008-12-13  2:07 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Paul E. McKenney, Nick Piggin, Andrew Morton, Ingo Molnar,
	Christoph Hellwig, David Miller, Rafael J. Wysocki,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	kernel-testers-u79uwXL29TY76Z2rM5mHXA@public.gmane.org >> Kernel Testers List,
	Mike Galbraith, Peter Zijlstra, Linux Netdev List,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Al Viro

On Fri, 12 Dec 2008, Eric Dumazet wrote:

> > a truly allocated file. At this point the file is
> > a truly allocated file but not anymore ours.

Its a valid file. Does ownership matter here?

> Reading again this mail I realise we call put_filp(file), while this should
> be fput(file) or put_filp(file), we dont know.
>
> Damned, this patch is wrong as is.
>
> Christoph, Paul, do you see the problem ?

Yes.

> In fget()/fget_light() we dont know if the other thread (the one who re-allocated the file,
> and tried to close it while we got a reference on file) had to call put_filp() or fput()
> to release its own reference. So we call atomic_long_dec_and_test() and cannot
> take the appropriate action (calling the full __fput() version or the small one,
> that some systems use to 'close' an not really opened file.

The difference is mainly that fput() does full processing whereas
put_filp() is used when we know that the file was not fully operational.
If the checks in __fput are able to handle the put_filp() situation by not
releasing resources that were not allocated then we should be fine.

> I believe put_filp() is only called on slowpath (error cases).

Looks like it. It seems to assume that no dentry is associated.

> Should we just zap it and always call fput() ?

Only if fput() can handle partially setup files.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v3 1/7] fs: Use a percpu_counter to track nr_dentry
       [not found]                       ` <49419680.8010409-fPLkHRcR87vqlBn2x/YWAg@public.gmane.org>
@ 2008-12-16 21:04                         ` Paul E. McKenney
  0 siblings, 0 replies; 37+ messages in thread
From: Paul E. McKenney @ 2008-12-16 21:04 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Andrew Morton, Ingo Molnar, Christoph Hellwig, David Miller,
	Rafael J. Wysocki, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	kernel-testers-u79uwXL29TY76Z2rM5mHXA@public.gmane.org >> Kernel Testers List,
	Mike Galbraith, Peter Zijlstra, Linux Netdev List,
	Christoph Lameter, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Al Viro

On Thu, Dec 11, 2008 at 11:38:56PM +0100, Eric Dumazet wrote:
> Adding a percpu_counter nr_dentry avoids cache line ping pongs
> between cpus to maintain this metric, and dcache_lock is
> no more needed to protect dentry_stat.nr_dentry
> 
> We centralize nr_dentry updates at the right place :
> - increments in d_alloc()
> - decrements in d_free()
> 
> d_alloc() can avoid taking dcache_lock if parent is NULL
> 
> ("socketallocbench -n8" result : 27.5s to 25s)

Looks good!  (At least once I realised that nr_dentry was global rather
than per-dentry!!!)

Reviewed-by: Paul E. McKenney <paulmck-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>

> Signed-off-by: Eric Dumazet <dada1-fPLkHRcR87vqlBn2x/YWAg@public.gmane.org>
> ---
>  fs/dcache.c        |   49 +++++++++++++++++++++++++------------------
>  include/linux/fs.h |    2 +
>  kernel/sysctl.c    |    2 -
>  3 files changed, 32 insertions(+), 21 deletions(-)
> 
> diff --git a/fs/dcache.c b/fs/dcache.c
> index fa1ba03..f463a81 100644
> --- a/fs/dcache.c
> +++ b/fs/dcache.c
> @@ -61,12 +61,31 @@ static struct kmem_cache *dentry_cache __read_mostly;
>  static unsigned int d_hash_mask __read_mostly;
>  static unsigned int d_hash_shift __read_mostly;
>  static struct hlist_head *dentry_hashtable __read_mostly;
> +static struct percpu_counter nr_dentry;
> 
>  /* Statistics gathering. */
>  struct dentry_stat_t dentry_stat = {
>  	.age_limit = 45,
>  };
> 
> +/*
> + * Handle nr_dentry sysctl
> + */
> +#if defined(CONFIG_SYSCTL) && defined(CONFIG_PROC_FS)
> +int proc_nr_dentry(ctl_table *table, int write, struct file *filp,
> +		   void __user *buffer, size_t *lenp, loff_t *ppos)
> +{
> +	dentry_stat.nr_dentry = percpu_counter_sum_positive(&nr_dentry);
> +	return proc_dointvec(table, write, filp, buffer, lenp, ppos);
> +}
> +#else
> +int proc_nr_dentry(ctl_table *table, int write, struct file *filp,
> +		   void __user *buffer, size_t *lenp, loff_t *ppos)
> +{
> +	return -ENOSYS;
> +}
> +#endif
> +
>  static void __d_free(struct dentry *dentry)
>  {
>  	WARN_ON(!list_empty(&dentry->d_alias));
> @@ -82,8 +101,7 @@ static void d_callback(struct rcu_head *head)
>  }
> 
>  /*
> - * no dcache_lock, please.  The caller must decrement dentry_stat.nr_dentry
> - * inside dcache_lock.
> + * no dcache_lock, please.
>   */
>  static void d_free(struct dentry *dentry)
>  {
> @@ -94,6 +112,7 @@ static void d_free(struct dentry *dentry)
>  		__d_free(dentry);
>  	else
>  		call_rcu(&dentry->d_u.d_rcu, d_callback);
> +	percpu_counter_dec(&nr_dentry);
>  }
> 
>  /*
> @@ -172,7 +191,6 @@ static struct dentry *d_kill(struct dentry *dentry)
>  	struct dentry *parent;
> 
>  	list_del(&dentry->d_u.d_child);
> -	dentry_stat.nr_dentry--;	/* For d_free, below */
>  	/*drops the locks, at that point nobody can reach this dentry */
>  	dentry_iput(dentry);
>  	if (IS_ROOT(dentry))
> @@ -619,7 +637,6 @@ void shrink_dcache_sb(struct super_block * sb)
>  static void shrink_dcache_for_umount_subtree(struct dentry *dentry)
>  {
>  	struct dentry *parent;
> -	unsigned detached = 0;
> 
>  	BUG_ON(!IS_ROOT(dentry));
> 
> @@ -678,7 +695,6 @@ static void shrink_dcache_for_umount_subtree(struct dentry *dentry)
>  			}
> 
>  			list_del(&dentry->d_u.d_child);
> -			detached++;
> 
>  			inode = dentry->d_inode;
>  			if (inode) {
> @@ -696,7 +712,7 @@ static void shrink_dcache_for_umount_subtree(struct dentry *dentry)
>  			 * otherwise we ascend to the parent and move to the
>  			 * next sibling if there is one */
>  			if (!parent)
> -				goto out;
> +				return;
> 
>  			dentry = parent;
> 
> @@ -705,11 +721,6 @@ static void shrink_dcache_for_umount_subtree(struct dentry *dentry)
>  		dentry = list_entry(dentry->d_subdirs.next,
>  				    struct dentry, d_u.d_child);
>  	}
> -out:
> -	/* several dentries were freed, need to correct nr_dentry */
> -	spin_lock(&dcache_lock);
> -	dentry_stat.nr_dentry -= detached;
> -	spin_unlock(&dcache_lock);
>  }
> 
>  /*
> @@ -943,8 +954,6 @@ struct dentry *d_alloc(struct dentry * parent, const struct qstr *name)
>  	dentry->d_flags = DCACHE_UNHASHED;
>  	spin_lock_init(&dentry->d_lock);
>  	dentry->d_inode = NULL;
> -	dentry->d_parent = NULL;
> -	dentry->d_sb = NULL;
>  	dentry->d_op = NULL;
>  	dentry->d_fsdata = NULL;
>  	dentry->d_mounted = 0;
> @@ -959,16 +968,15 @@ struct dentry *d_alloc(struct dentry * parent, const struct qstr *name)
>  	if (parent) {
>  		dentry->d_parent = dget(parent);
>  		dentry->d_sb = parent->d_sb;
> +		spin_lock(&dcache_lock);
> +		list_add(&dentry->d_u.d_child, &parent->d_subdirs);
> +		spin_unlock(&dcache_lock);
>  	} else {
> +		dentry->d_parent = NULL;
> +		dentry->d_sb = NULL;
>  		INIT_LIST_HEAD(&dentry->d_u.d_child);
>  	}
> -
> -	spin_lock(&dcache_lock);
> -	if (parent)
> -		list_add(&dentry->d_u.d_child, &parent->d_subdirs);
> -	dentry_stat.nr_dentry++;
> -	spin_unlock(&dcache_lock);
> -
> +	percpu_counter_inc(&nr_dentry);
>  	return dentry;
>  }
> 
> @@ -2282,6 +2290,7 @@ static void __init dcache_init(void)
>  {
>  	int loop;
> 
> +	percpu_counter_init(&nr_dentry, 0);
>  	/* 
>  	 * A constructor could be added for stable state like the lists,
>  	 * but it is probably not worth it because of the cache nature
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 4a853ef..114cb65 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -2217,6 +2217,8 @@ static inline void free_secdata(void *secdata)
>  struct ctl_table;
>  int proc_nr_files(struct ctl_table *table, int write, struct file *filp,
>  		  void __user *buffer, size_t *lenp, loff_t *ppos);
> +int proc_nr_dentry(struct ctl_table *table, int write, struct file *filp,
> +		   void __user *buffer, size_t *lenp, loff_t *ppos);
> 
>  int get_filesystem_list(char * buf);
> 
> diff --git a/kernel/sysctl.c b/kernel/sysctl.c
> index 3d56fe7..777bee7 100644
> --- a/kernel/sysctl.c
> +++ b/kernel/sysctl.c
> @@ -1246,7 +1246,7 @@ static struct ctl_table fs_table[] = {
>  		.data		= &dentry_stat,
>  		.maxlen		= 6*sizeof(int),
>  		.mode		= 0444,
> -		.proc_handler	= &proc_dointvec,
> +		.proc_handler	= &proc_nr_dentry,
>  	},
>  	{
>  		.ctl_name	= FS_OVERFLOWUID,

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v3 2/7] fs: Use a percpu_counter to track nr_inodes
       [not found]                       ` <4941968E.3020201-fPLkHRcR87vqlBn2x/YWAg@public.gmane.org>
  2007-07-24  1:30                         ` Nick Piggin
@ 2008-12-16 21:10                         ` Paul E. McKenney
  1 sibling, 0 replies; 37+ messages in thread
From: Paul E. McKenney @ 2008-12-16 21:10 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Andrew Morton, Ingo Molnar, Christoph Hellwig, David Miller,
	Rafael J. Wysocki, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	kernel-testers-u79uwXL29TY76Z2rM5mHXA@public.gmane.org >> Kernel Testers List,
	Mike Galbraith, Peter Zijlstra, Linux Netdev List,
	Christoph Lameter, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Al Viro

On Thu, Dec 11, 2008 at 11:39:10PM +0100, Eric Dumazet wrote:
> Avoids cache line ping pongs between cpus and prepare next patch,
> because updates of nr_inodes dont need inode_lock anymore.
> 
> (socket8 bench result : no difference at this point)

I do like this per-CPU counter infrastructure!

One small comment change noted below.  Other than that:

Reviewed-by: Paul E. McKenney <paulmck-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>

> Signed-off-by: Eric Dumazet <dada1-fPLkHRcR87vqlBn2x/YWAg@public.gmane.org>
> ---
>  fs/fs-writeback.c   |    2 +-
>  fs/inode.c          |   39 +++++++++++++++++++++++++++++++--------
>  include/linux/fs.h  |    3 +++
>  kernel/sysctl.c     |    4 ++--
>  mm/page-writeback.c |    2 +-
>  5 files changed, 38 insertions(+), 12 deletions(-)
> 
> 
> diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> index d0ff0b8..b591cdd 100644
> --- a/fs/fs-writeback.c
> +++ b/fs/fs-writeback.c
> @@ -608,7 +608,7 @@ void sync_inodes_sb(struct super_block *sb, int wait)
>  	unsigned long nr_unstable = global_page_state(NR_UNSTABLE_NFS);
> 
>  	wbc.nr_to_write = nr_dirty + nr_unstable +
> -			(inodes_stat.nr_inodes - inodes_stat.nr_unused) +
> +			(get_nr_inodes() - inodes_stat.nr_unused) +
>  			nr_dirty + nr_unstable;
>  	wbc.nr_to_write += wbc.nr_to_write / 2;		/* Bit more for luck */
>  	sync_sb_inodes(sb, &wbc);
> diff --git a/fs/inode.c b/fs/inode.c
> index 0487ddb..f94f889 100644
> --- a/fs/inode.c
> +++ b/fs/inode.c
> @@ -96,9 +96,33 @@ static DEFINE_MUTEX(iprune_mutex);
>   * Statistics gathering..
>   */
>  struct inodes_stat_t inodes_stat;
> +static struct percpu_counter nr_inodes;
> 
>  static struct kmem_cache * inode_cachep __read_mostly;
> 
> +int get_nr_inodes(void)
> +{
> +	return percpu_counter_sum_positive(&nr_inodes);
> +}
> +
> +/*
> + * Handle nr_dentry sysctl

That would be "nr_inode", right?

> + */
> +#if defined(CONFIG_SYSCTL) && defined(CONFIG_PROC_FS)
> +int proc_nr_inodes(ctl_table *table, int write, struct file *filp,
> +		   void __user *buffer, size_t *lenp, loff_t *ppos)
> +{
> +	inodes_stat.nr_inodes = get_nr_inodes();
> +	return proc_dointvec(table, write, filp, buffer, lenp, ppos);
> +}
> +#else
> +int proc_nr_inodes(ctl_table *table, int write, struct file *filp,
> +		   void __user *buffer, size_t *lenp, loff_t *ppos)
> +{
> +	return -ENOSYS;
> +}
> +#endif
> +
>  static void wake_up_inode(struct inode *inode)
>  {
>  	/*
> @@ -306,9 +330,7 @@ static void dispose_list(struct list_head *head)
>  		destroy_inode(inode);
>  		nr_disposed++;
>  	}
> -	spin_lock(&inode_lock);
> -	inodes_stat.nr_inodes -= nr_disposed;
> -	spin_unlock(&inode_lock);
> +	percpu_counter_sub(&nr_inodes, nr_disposed);
>  }
> 
>  /*
> @@ -560,8 +582,8 @@ struct inode *new_inode(struct super_block *sb)
>  	
>  	inode = alloc_inode(sb);
>  	if (inode) {
> +		percpu_counter_inc(&nr_inodes);
>  		spin_lock(&inode_lock);
> -		inodes_stat.nr_inodes++;
>  		list_add(&inode->i_list, &inode_in_use);
>  		list_add(&inode->i_sb_list, &sb->s_inodes);
>  		inode->i_ino = ++last_ino;
> @@ -622,7 +644,7 @@ static struct inode * get_new_inode(struct super_block *sb, struct hlist_head *h
>  			if (set(inode, data))
>  				goto set_failed;
> 
> -			inodes_stat.nr_inodes++;
> +			percpu_counter_inc(&nr_inodes);
>  			list_add(&inode->i_list, &inode_in_use);
>  			list_add(&inode->i_sb_list, &sb->s_inodes);
>  			hlist_add_head(&inode->i_hash, head);
> @@ -671,7 +693,7 @@ static struct inode * get_new_inode_fast(struct super_block *sb, struct hlist_he
>  		old = find_inode_fast(sb, head, ino);
>  		if (!old) {
>  			inode->i_ino = ino;
> -			inodes_stat.nr_inodes++;
> +			percpu_counter_inc(&nr_inodes);
>  			list_add(&inode->i_list, &inode_in_use);
>  			list_add(&inode->i_sb_list, &sb->s_inodes);
>  			hlist_add_head(&inode->i_hash, head);
> @@ -1042,8 +1064,8 @@ void generic_delete_inode(struct inode *inode)
>  	list_del_init(&inode->i_list);
>  	list_del_init(&inode->i_sb_list);
>  	inode->i_state |= I_FREEING;
> -	inodes_stat.nr_inodes--;
>  	spin_unlock(&inode_lock);
> +	percpu_counter_dec(&nr_inodes);
> 
>  	security_inode_delete(inode);
> 
> @@ -1093,8 +1115,8 @@ static void generic_forget_inode(struct inode *inode)
>  	list_del_init(&inode->i_list);
>  	list_del_init(&inode->i_sb_list);
>  	inode->i_state |= I_FREEING;
> -	inodes_stat.nr_inodes--;
>  	spin_unlock(&inode_lock);
> +	percpu_counter_dec(&nr_inodes);
>  	if (inode->i_data.nrpages)
>  		truncate_inode_pages(&inode->i_data, 0);
>  	clear_inode(inode);
> @@ -1394,6 +1416,7 @@ void __init inode_init(void)
>  {
>  	int loop;
> 
> +	percpu_counter_init(&nr_inodes, 0);
>  	/* inode slab cache */
>  	inode_cachep = kmem_cache_create("inode_cache",
>  					 sizeof(struct inode),
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 114cb65..a789346 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -47,6 +47,7 @@ struct inodes_stat_t {
>  	int dummy[5];		/* padding for sysctl ABI compatibility */
>  };
>  extern struct inodes_stat_t inodes_stat;
> +extern int get_nr_inodes(void);
> 
>  extern int leases_enable, lease_break_time;
> 
> @@ -2219,6 +2220,8 @@ int proc_nr_files(struct ctl_table *table, int write, struct file *filp,
>  		  void __user *buffer, size_t *lenp, loff_t *ppos);
>  int proc_nr_dentry(struct ctl_table *table, int write, struct file *filp,
>  		   void __user *buffer, size_t *lenp, loff_t *ppos);
> +int proc_nr_inodes(struct ctl_table *table, int write, struct file *filp,
> +		   void __user *buffer, size_t *lenp, loff_t *ppos);
> 
>  int get_filesystem_list(char * buf);
> 
> diff --git a/kernel/sysctl.c b/kernel/sysctl.c
> index 777bee7..b705f3a 100644
> --- a/kernel/sysctl.c
> +++ b/kernel/sysctl.c
> @@ -1205,7 +1205,7 @@ static struct ctl_table fs_table[] = {
>  		.data		= &inodes_stat,
>  		.maxlen		= 2*sizeof(int),
>  		.mode		= 0444,
> -		.proc_handler	= &proc_dointvec,
> +		.proc_handler	= &proc_nr_inodes,
>  	},
>  	{
>  		.ctl_name	= FS_STATINODE,
> @@ -1213,7 +1213,7 @@ static struct ctl_table fs_table[] = {
>  		.data		= &inodes_stat,
>  		.maxlen		= 7*sizeof(int),
>  		.mode		= 0444,
> -		.proc_handler	= &proc_dointvec,
> +		.proc_handler	= &proc_nr_inodes,
>  	},
>  	{
>  		.procname	= "file-nr",
> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> index 2970e35..a71a922 100644
> --- a/mm/page-writeback.c
> +++ b/mm/page-writeback.c
> @@ -705,7 +705,7 @@ static void wb_kupdate(unsigned long arg)
>  	next_jif = start_jif + dirty_writeback_interval;
>  	nr_to_write = global_page_state(NR_FILE_DIRTY) +
>  			global_page_state(NR_UNSTABLE_NFS) +
> -			(inodes_stat.nr_inodes - inodes_stat.nr_unused);
> +			(get_nr_inodes() - inodes_stat.nr_unused);
>  	while (nr_to_write > 0) {
>  		wbc.more_io = 0;
>  		wbc.encountered_congestion = 0;

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v3 3/7] fs: Introduce a per_cpu last_ino allocator
  2008-12-11 22:39                     ` [PATCH v3 3/7] fs: Introduce a per_cpu last_ino allocator Eric Dumazet
  2007-07-24  1:34                       ` Nick Piggin
@ 2008-12-16 21:26                       ` Paul E. McKenney
  1 sibling, 0 replies; 37+ messages in thread
From: Paul E. McKenney @ 2008-12-16 21:26 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Andrew Morton, Ingo Molnar, Christoph Hellwig, David Miller,
	Rafael J. Wysocki, linux-kernel,
	kernel-testers@vger.kernel.org >> Kernel Testers List,
	Mike Galbraith, Peter Zijlstra, Linux Netdev List,
	Christoph Lameter, linux-fsdevel, Al Viro

On Thu, Dec 11, 2008 at 11:39:18PM +0100, Eric Dumazet wrote:
> new_inode() dirties a contended cache line to get increasing
> inode numbers.
> 
> Solve this problem by providing to each cpu a per_cpu variable,
> feeded by the shared last_ino, but once every 1024 allocations.
> 
> This reduce contention on the shared last_ino, and give same
> spreading ino numbers than before.
> (same wraparound after 2^32 allocations)

One question below, but just a clarification.  Works correctly as is,
though a bit strangely.

Reviewed-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>

> Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
> ---
>  fs/inode.c |   35 ++++++++++++++++++++++++++++++++---
>  1 files changed, 32 insertions(+), 3 deletions(-)
> 
> diff --git a/fs/inode.c b/fs/inode.c
> index f94f889..dc8e72a 100644
> --- a/fs/inode.c
> +++ b/fs/inode.c
> @@ -556,6 +556,36 @@ repeat:
>  	return node ? inode : NULL;
>  }
> 
> +#ifdef CONFIG_SMP
> +/*
> + * Each cpu owns a range of 1024 numbers.
> + * 'shared_last_ino' is dirtied only once out of 1024 allocations,
> + * to renew the exhausted range.
> + */
> +static DEFINE_PER_CPU(int, last_ino);
> +
> +static int last_ino_get(void)
> +{
> +	static atomic_t shared_last_ino;
> +	int *p = &get_cpu_var(last_ino);
> +	int res = *p;
> +
> +	if (unlikely((res & 1023) == 0))
> +		res = atomic_add_return(1024, &shared_last_ino) - 1024;
> +
> +	*p = ++res;

So the first CPU gets the range [1:1024], the second [1025:2048], and
so on, eventually wrapping to [4294966273:0].  Is that the intent?

(I don't see a problem with this, just seems a bit strange.)

> +	put_cpu_var(last_ino);
> +	return res;
> +}
> +#else
> +static int last_ino_get(void)
> +{
> +	static int last_ino;
> +
> +	return ++last_ino;
> +}
> +#endif
> +
>  /**
>   *	new_inode 	- obtain an inode
>   *	@sb: superblock
> @@ -575,7 +605,6 @@ struct inode *new_inode(struct super_block *sb)
>  	 * error if st_ino won't fit in target struct field. Use 32bit counter
>  	 * here to attempt to avoid that.
>  	 */
> -	static unsigned int last_ino;
>  	struct inode * inode;
> 
>  	spin_lock_prefetch(&inode_lock);
> @@ -583,11 +612,11 @@ struct inode *new_inode(struct super_block *sb)
>  	inode = alloc_inode(sb);
>  	if (inode) {
>  		percpu_counter_inc(&nr_inodes);
> +		inode->i_state = 0;
> +		inode->i_ino = last_ino_get();
>  		spin_lock(&inode_lock);
>  		list_add(&inode->i_list, &inode_in_use);
>  		list_add(&inode->i_sb_list, &sb->s_inodes);
> -		inode->i_ino = ++last_ino;
> -		inode->i_state = 0;
>  		spin_unlock(&inode_lock);
>  	}
>  	return inode;

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v3 4/7] fs: Introduce SINGLE dentries for pipes, socket, anon fd
       [not found]                       ` <494196AA.6080002-fPLkHRcR87vqlBn2x/YWAg@public.gmane.org>
@ 2008-12-16 21:40                         ` Paul E. McKenney
  0 siblings, 0 replies; 37+ messages in thread
From: Paul E. McKenney @ 2008-12-16 21:40 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Andrew Morton, Ingo Molnar, Christoph Hellwig, David Miller,
	Rafael J. Wysocki, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	kernel-testers-u79uwXL29TY76Z2rM5mHXA@public.gmane.org >> Kernel Testers List,
	Mike Galbraith, Peter Zijlstra, Linux Netdev List,
	Christoph Lameter, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Al Viro

On Thu, Dec 11, 2008 at 11:39:38PM +0100, Eric Dumazet wrote:
> Sockets, pipes and anonymous fds have interesting properties.
> 
> Like other files, they use a dentry and an inode.
> 
> But dentries for these kind of files are not hashed into dcache,
> since there is no way someone can lookup such a file in the vfs tree.
> (/proc/{pid}/fd/{number} uses a different mechanism)
> 
> Still, allocating and freeing such dentries are expensive processes,
> because we currently take dcache_lock inside d_alloc(), d_instantiate(),
> and dput(). This lock is very contended on SMP machines.
> 
> This patch defines a new DCACHE_SINGLE flag, to mark a dentry as
> a single one (for sockets, pipes, anonymous fd), and a new
> d_alloc_single(const struct qstr *name, struct inode *inode)
> method, called by the three subsystems.
> 
> Internally, dput() can take a fast path to dput_single() for
> SINGLE dentries. No more atomic_dec_and_lock()
> for such dentries.
> 
> 
> Differences betwen an SINGLE dentry and a normal one are :
> 
> 1) SINGLE dentry has the DCACHE_SINGLE flag
> 2) SINGLE dentry's parent is itself (DCACHE_DISCONNECTED)
> This to avoid taking a reference on sb 'root' dentry, shared
> by too many dentries.
> 3) They are not hashed into global hash table (DCACHE_UNHASHED)
> 4) Their d_alias list is empty
> 
> ("socketallocbench -n 8" bench result : from 25s to 19.9s)

Acked-by: Paul E. McKenney <paulmck-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>

> Signed-off-by: Eric Dumazet <dada1-fPLkHRcR87vqlBn2x/YWAg@public.gmane.org>
> ---
>  fs/anon_inodes.c       |   16 ------------
>  fs/dcache.c            |   51 +++++++++++++++++++++++++++++++++++++++
>  fs/pipe.c              |   23 +----------------
>  include/linux/dcache.h |    9 ++++++
>  net/socket.c           |   24 +-----------------
>  5 files changed, 65 insertions(+), 58 deletions(-)
> 
> diff --git a/fs/anon_inodes.c b/fs/anon_inodes.c
> index 3662dd4..8bf83cb 100644
> --- a/fs/anon_inodes.c
> +++ b/fs/anon_inodes.c
> @@ -33,23 +33,12 @@ static int anon_inodefs_get_sb(struct file_system_type *fs_type, int flags,
>  			     mnt);
>  }
> 
> -static int anon_inodefs_delete_dentry(struct dentry *dentry)
> -{
> -	/*
> -	 * We faked vfs to believe the dentry was hashed when we created it.
> -	 * Now we restore the flag so that dput() will work correctly.
> -	 */
> -	dentry->d_flags |= DCACHE_UNHASHED;
> -	return 1;
> -}
> -
>  static struct file_system_type anon_inode_fs_type = {
>  	.name		= "anon_inodefs",
>  	.get_sb		= anon_inodefs_get_sb,
>  	.kill_sb	= kill_anon_super,
>  };
>  static struct dentry_operations anon_inodefs_dentry_operations = {
> -	.d_delete	= anon_inodefs_delete_dentry,
>  };
> 
>  /**
> @@ -92,7 +81,7 @@ int anon_inode_getfd(const char *name, const struct file_operations *fops,
>  	this.name = name;
>  	this.len = strlen(name);
>  	this.hash = 0;
> -	dentry = d_alloc(anon_inode_mnt->mnt_sb->s_root, &this);
> +	dentry = d_alloc_single(&this, anon_inode_inode);
>  	if (!dentry)
>  		goto err_put_unused_fd;
> 
> @@ -104,9 +93,6 @@ int anon_inode_getfd(const char *name, const struct file_operations *fops,
>  	atomic_inc(&anon_inode_inode->i_count);
> 
>  	dentry->d_op = &anon_inodefs_dentry_operations;
> -	/* Do not publish this dentry inside the global dentry hash table */
> -	dentry->d_flags &= ~DCACHE_UNHASHED;
> -	d_instantiate(dentry, anon_inode_inode);
> 
>  	error = -ENFILE;
>  	file = alloc_file(anon_inode_mnt, dentry,
> diff --git a/fs/dcache.c b/fs/dcache.c
> index f463a81..af3bfb3 100644
> --- a/fs/dcache.c
> +++ b/fs/dcache.c
> @@ -219,6 +219,23 @@ static struct dentry *d_kill(struct dentry *dentry)
>   */
> 
>  /*
> + * special version of dput() for pipes/sockets/anon.
> + * These dentries are not present in hash table, we can avoid
> + * taking/dirtying dcache_lock
> + */
> +static void dput_single(struct dentry *dentry)
> +{
> +	struct inode *inode;
> +
> +	if (!atomic_dec_and_test(&dentry->d_count))
> +		return;
> +	inode = dentry->d_inode;
> +	if (inode)
> +		iput(inode);
> +	d_free(dentry);
> +}
> +
> +/*
>   * dput - release a dentry
>   * @dentry: dentry to release 
>   *
> @@ -234,6 +251,11 @@ void dput(struct dentry *dentry)
>  {
>  	if (!dentry)
>  		return;
> +	/*
> +	 * single dentries (sockets/pipes/anon) fast path
> +	 */
> +	if (dentry->d_flags & DCACHE_SINGLE)
> +		return dput_single(dentry);
> 
>  repeat:
>  	if (atomic_read(&dentry->d_count) == 1)
> @@ -1119,6 +1141,35 @@ struct dentry * d_alloc_root(struct inode * root_inode)
>  	return res;
>  }
> 
> +/**
> + * d_alloc_single - allocate SINGLE dentry
> + * @name: dentry name, given in a qstr structure
> + * @inode: inode to allocate the dentry for
> + *
> + * Allocate an SINGLE dentry for the inode given. The inode is
> + * instantiated and returned. %NULL is returned if there is insufficient
> + * memory.
> + * - SINGLE dentries have themselves as a parent.
> + * - SINGLE dentries are not hashed into global hash table
> + * - their d_alias list is empty
> + */
> +struct dentry *d_alloc_single(const struct qstr *name, struct inode *inode)
> +{
> +	struct dentry *entry;
> +
> +	entry = d_alloc(NULL, name);
> +	if (entry) {
> +		entry->d_sb = inode->i_sb;
> +		entry->d_parent = entry;
> +		entry->d_flags |= DCACHE_SINGLE | DCACHE_DISCONNECTED;
> +		entry->d_inode = inode;
> +		fsnotify_d_instantiate(entry, inode);
> +		security_d_instantiate(entry, inode);
> +	}
> +	return entry;
> +}
> +
> +
>  static inline struct hlist_head *d_hash(struct dentry *parent,
>  					unsigned long hash)
>  {
> diff --git a/fs/pipe.c b/fs/pipe.c
> index 7aea8b8..4de6dd5 100644
> --- a/fs/pipe.c
> +++ b/fs/pipe.c
> @@ -849,17 +849,6 @@ void free_pipe_info(struct inode *inode)
>  }
> 
>  static struct vfsmount *pipe_mnt __read_mostly;
> -static int pipefs_delete_dentry(struct dentry *dentry)
> -{
> -	/*
> -	 * At creation time, we pretended this dentry was hashed
> -	 * (by clearing DCACHE_UNHASHED bit in d_flags)
> -	 * At delete time, we restore the truth : not hashed.
> -	 * (so that dput() can proceed correctly)
> -	 */
> -	dentry->d_flags |= DCACHE_UNHASHED;
> -	return 0;
> -}
> 
>  /*
>   * pipefs_dname() is called from d_path().
> @@ -871,7 +860,6 @@ static char *pipefs_dname(struct dentry *dentry, char *buffer, int buflen)
>  }
> 
>  static struct dentry_operations pipefs_dentry_operations = {
> -	.d_delete	= pipefs_delete_dentry,
>  	.d_dname	= pipefs_dname,
>  };
> 
> @@ -918,7 +906,7 @@ struct file *create_write_pipe(int flags)
>  	struct inode *inode;
>  	struct file *f;
>  	struct dentry *dentry;
> -	struct qstr name = { .name = "" };
> +	static const struct qstr name = { .name = "" };
> 
>  	err = -ENFILE;
>  	inode = get_pipe_inode();
> @@ -926,18 +914,11 @@ struct file *create_write_pipe(int flags)
>  		goto err;
> 
>  	err = -ENOMEM;
> -	dentry = d_alloc(pipe_mnt->mnt_sb->s_root, &name);
> +	dentry = d_alloc_single(&name, inode);
>  	if (!dentry)
>  		goto err_inode;
> 
>  	dentry->d_op = &pipefs_dentry_operations;
> -	/*
> -	 * We dont want to publish this dentry into global dentry hash table.
> -	 * We pretend dentry is already hashed, by unsetting DCACHE_UNHASHED
> -	 * This permits a working /proc/$pid/fd/XXX on pipes
> -	 */
> -	dentry->d_flags &= ~DCACHE_UNHASHED;
> -	d_instantiate(dentry, inode);
> 
>  	err = -ENFILE;
>  	f = alloc_file(pipe_mnt, dentry, FMODE_WRITE, &write_pipefifo_fops);
> diff --git a/include/linux/dcache.h b/include/linux/dcache.h
> index a37359d..ca8d269 100644
> --- a/include/linux/dcache.h
> +++ b/include/linux/dcache.h
> @@ -176,6 +176,14 @@ d_iput:		no		no		no       yes
>  #define DCACHE_UNHASHED		0x0010	
> 
>  #define DCACHE_INOTIFY_PARENT_WATCHED	0x0020 /* Parent inode is watched */
> +#define DCACHE_SINGLE		0x0040
> +	/*
> +	 * socket, pipe or anonymous fd dentry
> +	 * - SINGLE dentries have themselves as a parent.
> +	 * - SINGLE dentries are not hashed into global hash table
> +	 * - Their d_alias list is empty
> +	 * - They dont need dcache_lock synchronization
> +	 */
> 
>  extern spinlock_t dcache_lock;
>  extern seqlock_t rename_lock;
> @@ -235,6 +243,7 @@ extern void shrink_dcache_sb(struct super_block *);
>  extern void shrink_dcache_parent(struct dentry *);
>  extern void shrink_dcache_for_umount(struct super_block *);
>  extern int d_invalidate(struct dentry *);
> +extern struct dentry *d_alloc_single(const struct qstr *, struct inode *);
> 
>  /* only used at mount-time */
>  extern struct dentry * d_alloc_root(struct inode *);
> diff --git a/net/socket.c b/net/socket.c
> index 92764d8..353c928 100644
> --- a/net/socket.c
> +++ b/net/socket.c
> @@ -308,18 +308,6 @@ static struct file_system_type sock_fs_type = {
>  	.kill_sb =	kill_anon_super,
>  };
> 
> -static int sockfs_delete_dentry(struct dentry *dentry)
> -{
> -	/*
> -	 * At creation time, we pretended this dentry was hashed
> -	 * (by clearing DCACHE_UNHASHED bit in d_flags)
> -	 * At delete time, we restore the truth : not hashed.
> -	 * (so that dput() can proceed correctly)
> -	 */
> -	dentry->d_flags |= DCACHE_UNHASHED;
> -	return 0;
> -}
> -
>  /*
>   * sockfs_dname() is called from d_path().
>   */
> @@ -330,7 +318,6 @@ static char *sockfs_dname(struct dentry *dentry, char *buffer, int buflen)
>  }
> 
>  static struct dentry_operations sockfs_dentry_operations = {
> -	.d_delete = sockfs_delete_dentry,
>  	.d_dname  = sockfs_dname,
>  };
> 
> @@ -372,20 +359,13 @@ static int sock_alloc_fd(struct file **filep, int flags)
>  static int sock_attach_fd(struct socket *sock, struct file *file, int flags)
>  {
>  	struct dentry *dentry;
> -	struct qstr name = { .name = "" };
> +	static const struct qstr name = { .name = "" };
> 
> -	dentry = d_alloc(sock_mnt->mnt_sb->s_root, &name);
> +	dentry = d_alloc_single(&name, SOCK_INODE(sock));
>  	if (unlikely(!dentry))
>  		return -ENOMEM;
> 
>  	dentry->d_op = &sockfs_dentry_operations;
> -	/*
> -	 * We dont want to push this dentry into global dentry hash table.
> -	 * We pretend dentry is already hashed, by unsetting DCACHE_UNHASHED
> -	 * This permits a working /proc/$pid/fd/XXX on sockets
> -	 */
> -	dentry->d_flags &= ~DCACHE_UNHASHED;
> -	d_instantiate(dentry, SOCK_INODE(sock));
> 
>  	sock->file = file;
>  	init_file(file, sock_mnt, dentry, FMODE_READ | FMODE_WRITE,
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v3 5/7] fs: new_inode_single() and iput_single()
  2008-12-11 22:40                     ` [PATCH v3 5/7] fs: new_inode_single() and iput_single() Eric Dumazet
@ 2008-12-16 21:41                       ` Paul E. McKenney
  0 siblings, 0 replies; 37+ messages in thread
From: Paul E. McKenney @ 2008-12-16 21:41 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Andrew Morton, Ingo Molnar, Christoph Hellwig, David Miller,
	Rafael J. Wysocki, linux-kernel,
	kernel-testers@vger.kernel.org >> Kernel Testers List,
	Mike Galbraith, Peter Zijlstra, Linux Netdev List,
	Christoph Lameter, linux-fsdevel, Al Viro

On Thu, Dec 11, 2008 at 11:40:07PM +0100, Eric Dumazet wrote:
> Goal of this patch is to not touch inode_lock for socket/pipes/anonfd
> inodes allocation/freeing.
> 
> SINGLE dentries are attached to inodes that dont need to be linked
> in a list of inodes, being "inode_in_use" or "sb->s_inodes"
> As inode_lock was taken only to protect these lists, we avoid taking it
> as well.
> 
> Using iput_single() from dput_single() avoids taking inode_lock
> at freeing time.
> 
> This patch has a very noticeable effect, because we avoid dirtying of
> three contended cache lines in new_inode(), and five cache lines in iput()
> 
> ("socketallocbench -n 8" result : from 19.9s to 3.01s)

Nice!

Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>

> Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
> ---
>  fs/anon_inodes.c   |    2 +-
>  fs/dcache.c        |    2 +-
>  fs/inode.c         |   29 ++++++++++++++++++++---------
>  fs/pipe.c          |    2 +-
>  include/linux/fs.h |   12 +++++++++++-
>  net/socket.c       |    2 +-
>  6 files changed, 35 insertions(+), 14 deletions(-)
> 
> diff --git a/fs/anon_inodes.c b/fs/anon_inodes.c
> index 8bf83cb..89fd36d 100644
> --- a/fs/anon_inodes.c
> +++ b/fs/anon_inodes.c
> @@ -125,7 +125,7 @@ EXPORT_SYMBOL_GPL(anon_inode_getfd);
>   */
>  static struct inode *anon_inode_mkinode(void)
>  {
> -	struct inode *inode = new_inode(anon_inode_mnt->mnt_sb);
> +	struct inode *inode = new_inode_single(anon_inode_mnt->mnt_sb);
> 
>  	if (!inode)
>  		return ERR_PTR(-ENOMEM);
> diff --git a/fs/dcache.c b/fs/dcache.c
> index af3bfb3..3363853 100644
> --- a/fs/dcache.c
> +++ b/fs/dcache.c
> @@ -231,7 +231,7 @@ static void dput_single(struct dentry *dentry)
>  		return;
>  	inode = dentry->d_inode;
>  	if (inode)
> -		iput(inode);
> +		iput_single(inode);
>  	d_free(dentry);
>  }
> 
> diff --git a/fs/inode.c b/fs/inode.c
> index dc8e72a..0fdfe1b 100644
> --- a/fs/inode.c
> +++ b/fs/inode.c
> @@ -221,6 +221,13 @@ void destroy_inode(struct inode *inode)
>  		kmem_cache_free(inode_cachep, (inode));
>  }
> 
> +void iput_single(struct inode *inode)
> +{
> +	if (atomic_dec_and_test(&inode->i_count)) {
> +		destroy_inode(inode);
> +		percpu_counter_dec(&nr_inodes);
> +	}
> +}
> 
>  /*
>   * These are initializations that only need to be done
> @@ -587,8 +594,9 @@ static int last_ino_get(void)
>  #endif
> 
>  /**
> - *	new_inode 	- obtain an inode
> + *	__new_inode 	- obtain an inode
>   *	@sb: superblock
> + *  @single: if true, dont link new inode in a list
>   *
>   *	Allocates a new inode for given superblock. The default gfp_mask
>   *	for allocations related to inode->i_mapping is GFP_HIGHUSER_PAGECACHE.
> @@ -598,7 +606,7 @@ static int last_ino_get(void)
>   *	newly created inode's mapping
>   *
>   */
> -struct inode *new_inode(struct super_block *sb)
> +struct inode *__new_inode(struct super_block *sb, int single)
>  {
>  	/*
>  	 * On a 32bit, non LFS stat() call, glibc will generate an EOVERFLOW
> @@ -607,22 +615,25 @@ struct inode *new_inode(struct super_block *sb)
>  	 */
>  	struct inode * inode;
> 
> -	spin_lock_prefetch(&inode_lock);
> -	
>  	inode = alloc_inode(sb);
>  	if (inode) {
>  		percpu_counter_inc(&nr_inodes);
>  		inode->i_state = 0;
>  		inode->i_ino = last_ino_get();
> -		spin_lock(&inode_lock);
> -		list_add(&inode->i_list, &inode_in_use);
> -		list_add(&inode->i_sb_list, &sb->s_inodes);
> -		spin_unlock(&inode_lock);
> + 		if (single) {
> +  			INIT_LIST_HEAD(&inode->i_list);
> +  			INIT_LIST_HEAD(&inode->i_sb_list);
> + 		} else {
> +			spin_lock(&inode_lock);
> +			list_add(&inode->i_list, &inode_in_use);
> +			list_add(&inode->i_sb_list, &sb->s_inodes);
> +			spin_unlock(&inode_lock);
> +		}
>  	}
>  	return inode;
>  }
> 
> -EXPORT_SYMBOL(new_inode);
> +EXPORT_SYMBOL(__new_inode);
> 
>  void unlock_new_inode(struct inode *inode)
>  {
> diff --git a/fs/pipe.c b/fs/pipe.c
> index 4de6dd5..8c51a0d 100644
> --- a/fs/pipe.c
> +++ b/fs/pipe.c
> @@ -865,7 +865,7 @@ static struct dentry_operations pipefs_dentry_operations = {
> 
>  static struct inode * get_pipe_inode(void)
>  {
> -	struct inode *inode = new_inode(pipe_mnt->mnt_sb);
> +	struct inode *inode = new_inode_single(pipe_mnt->mnt_sb);
>  	struct pipe_inode_info *pipe;
> 
>  	if (!inode)
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index a789346..a702d81 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -1899,7 +1899,17 @@ extern void __iget(struct inode * inode);
>  extern void iget_failed(struct inode *);
>  extern void clear_inode(struct inode *);
>  extern void destroy_inode(struct inode *);
> -extern struct inode *new_inode(struct super_block *);
> +extern struct inode *__new_inode(struct super_block *, int);
> +static inline struct inode *new_inode(struct super_block *sb)
> +{
> +	return __new_inode(sb, 0);
> +}
> +static inline struct inode *new_inode_single(struct super_block *sb)
> +{
> +	return __new_inode(sb, 1);
> +}
> +extern void iput_single(struct inode *);
> +
>  extern int should_remove_suid(struct dentry *);
>  extern int file_remove_suid(struct file *);
> 
> diff --git a/net/socket.c b/net/socket.c
> index 353c928..4017409 100644
> --- a/net/socket.c
> +++ b/net/socket.c
> @@ -464,7 +464,7 @@ static struct socket *sock_alloc(void)
>  	struct inode *inode;
>  	struct socket *sock;
> 
> -	inode = new_inode(sock_mnt->mnt_sb);
> +	inode = new_inode_single(sock_mnt->mnt_sb);
>  	if (!inode)
>  		return NULL;
> 
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v3 6/7] fs: struct file move from call_rcu() to SLAB_DESTROY_BY_RCU
       [not found]                                     ` <Pine.LNX.4.64.0812121958470.15781-dRBSpnHQED8AvxtiuMwx3w@public.gmane.org>
@ 2008-12-17 20:25                                       ` Eric Dumazet
  0 siblings, 0 replies; 37+ messages in thread
From: Eric Dumazet @ 2008-12-17 20:25 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Paul E. McKenney, Nick Piggin, Andrew Morton, Ingo Molnar,
	Christoph Hellwig, David Miller, Rafael J. Wysocki,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	kernel-testers-u79uwXL29TY76Z2rM5mHXA@public.gmane.org >> Kernel Testers List,
	Mike Galbraith, Peter Zijlstra, Linux Netdev List,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Al Viro

Christoph Lameter a écrit :
> On Fri, 12 Dec 2008, Eric Dumazet wrote:
> 
>>> a truly allocated file. At this point the file is
>>> a truly allocated file but not anymore ours.
> 
> Its a valid file. Does ownership matter here?
> 
>> Reading again this mail I realise we call put_filp(file), while this should
>> be fput(file) or put_filp(file), we dont know.
>>
>> Damned, this patch is wrong as is.
>>
>> Christoph, Paul, do you see the problem ?
> 
> Yes.
> 
>> In fget()/fget_light() we dont know if the other thread (the one who re-allocated the file,
>> and tried to close it while we got a reference on file) had to call put_filp() or fput()
>> to release its own reference. So we call atomic_long_dec_and_test() and cannot
>> take the appropriate action (calling the full __fput() version or the small one,
>> that some systems use to 'close' an not really opened file.
> 
> The difference is mainly that fput() does full processing whereas
> put_filp() is used when we know that the file was not fully operational.
> If the checks in __fput are able to handle the put_filp() situation by not
> releasing resources that were not allocated then we should be fine.
> 
>> I believe put_filp() is only called on slowpath (error cases).
> 
> Looks like it. It seems to assume that no dentry is associated.
> 
>> Should we just zap it and always call fput() ?
> 
> Only if fput() can handle partially setup files.

It can do that if we add a check for NULL dentry in __fput(), so put_filp() can disappear.

But there is a remaining point where we do an atomic_long_dec_and_test(&...->f_count),
in fs/aio.c, function __aio_put_req(). This one is tricky :(

^ permalink raw reply	[flat|nested] 37+ messages in thread

end of thread, other threads:[~2008-12-17 20:25 UTC | newest]

Thread overview: 37+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
     [not found] <Pine.LNX.4.64.0811201727070.9089@quilx.com>
     [not found] ` <20081121083044.GL16242@elte.hu>
     [not found]   ` <49267694.1030506@cosmosbay.com>
     [not found]     ` <20081121.010508.40225532.davem@davemloft.net>
     [not found]       ` <4926AEDB.10007@cosmosbay.com>
     [not found]         ` <4926D022.5060008@cosmosbay.com>
2008-11-21 15:36           ` [PATCH] fs: pipe/sockets/anon dentries should not have a parent Christoph Hellwig
2008-11-21 17:58             ` [PATCH] fs: pipe/sockets/anon dentries should have themselves as parent Eric Dumazet
     [not found]               ` <4926F6C5.9030108-fPLkHRcR87vqlBn2x/YWAg@public.gmane.org>
2008-11-21 18:43                 ` Matthew Wilcox
2008-11-23  3:53                   ` Eric Dumazet
     [not found]           ` <20081121152148.GA20388@elte.hu>
     [not found]             ` <4926D39D.9050603@cosmosbay.com>
     [not found]               ` <20081121153453.GA23713@elte.hu>
     [not found]                 ` <492DDB6A.8090806@cosmosbay.com>
2008-11-29  8:43                   ` [PATCH v2 0/5] fs: Scalability of sockets/pipes allocation/deallocation on SMP Eric Dumazet
2008-12-11 22:38                     ` [PATCH v3 0/7] " Eric Dumazet
2008-12-11 22:38                     ` [PATCH v3 1/7] fs: Use a percpu_counter to track nr_dentry Eric Dumazet
2007-07-24  1:24                       ` Nick Piggin
     [not found]                       ` <49419680.8010409-fPLkHRcR87vqlBn2x/YWAg@public.gmane.org>
2008-12-16 21:04                         ` Paul E. McKenney
2008-12-11 22:39                     ` [PATCH v3 2/7] fs: Use a percpu_counter to track nr_inodes Eric Dumazet
     [not found]                       ` <4941968E.3020201-fPLkHRcR87vqlBn2x/YWAg@public.gmane.org>
2007-07-24  1:30                         ` Nick Piggin
     [not found]                           ` <200707241130.56767.nickpiggin-/E1597aS9LT0CCvOHzKKcA@public.gmane.org>
2008-12-12  5:11                             ` Eric Dumazet
2008-12-16 21:10                         ` Paul E. McKenney
2008-12-11 22:39                     ` [PATCH v3 3/7] fs: Introduce a per_cpu last_ino allocator Eric Dumazet
2007-07-24  1:34                       ` Nick Piggin
2008-12-16 21:26                       ` Paul E. McKenney
2008-12-11 22:39                     ` [PATCH v3 4/7] fs: Introduce SINGLE dentries for pipes, socket, anon fd Eric Dumazet
     [not found]                       ` <494196AA.6080002-fPLkHRcR87vqlBn2x/YWAg@public.gmane.org>
2008-12-16 21:40                         ` Paul E. McKenney
2008-12-11 22:40                     ` [PATCH v3 5/7] fs: new_inode_single() and iput_single() Eric Dumazet
2008-12-16 21:41                       ` Paul E. McKenney
     [not found]                     ` <493100B0.6090104-fPLkHRcR87vqlBn2x/YWAg@public.gmane.org>
2008-12-11 22:40                       ` [PATCH v3 6/7] fs: struct file move from call_rcu() to SLAB_DESTROY_BY_RCU Eric Dumazet
2007-07-24  1:13                         ` Nick Piggin
2008-12-12  2:50                           ` Nick Piggin
2008-12-12  4:45                           ` Eric Dumazet
     [not found]                             ` <4941EC65.5040903-fPLkHRcR87vqlBn2x/YWAg@public.gmane.org>
2008-12-12 16:48                               ` Eric Dumazet
     [not found]                                 ` <494295C6.2020906-fPLkHRcR87vqlBn2x/YWAg@public.gmane.org>
2008-12-13  2:07                                   ` Christoph Lameter
     [not found]                                     ` <Pine.LNX.4.64.0812121958470.15781-dRBSpnHQED8AvxtiuMwx3w@public.gmane.org>
2008-12-17 20:25                                       ` Eric Dumazet
2008-12-13  1:41                               ` Christoph Lameter
2008-12-11 22:41                     ` [PATCH v3 7/7] fs: MS_NOREFCOUNT Eric Dumazet
     [not found]                   ` <492DDB6A.8090806-fPLkHRcR87vqlBn2x/YWAg@public.gmane.org>
2008-11-29  8:43                     ` [PATCH v2 1/5] fs: Use a percpu_counter to track nr_dentry Eric Dumazet
2008-11-29  8:43                     ` [PATCH v2 2/5] fs: Use a percpu_counter to track nr_inodes Eric Dumazet
2008-11-29  8:44                     ` [PATCH v2 4/5] fs: Introduce SINGLE dentries for pipes, socket, anon fd Eric Dumazet
     [not found]                       ` <493100E7.3030907-fPLkHRcR87vqlBn2x/YWAg@public.gmane.org>
2008-11-29 10:38                         ` Jörn Engel
     [not found]                           ` <20081129103836.GA11959-PCqxUs/MD9bYtjvyW6yDsg@public.gmane.org>
2008-11-29 11:14                             ` Eric Dumazet
2008-11-29  8:45                     ` [PATCH v2 5/5] fs: new_inode_single() and iput_single() Eric Dumazet
2008-11-29 11:14                       ` Jörn Engel
2008-11-29  8:44                   ` [PATCH v2 3/5] fs: Introduce a per_cpu last_ino allocator Eric Dumazet

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).