linux-fsdevel.vger.kernel.org archive mirror
From: Nick Piggin <nickpiggin@yahoo.com.au>
To: Eric Dumazet <dada1@cosmosbay.com>
Cc: Andrew Morton <akpm@linux-foundation.org>,
	Ingo Molnar <mingo@elte.hu>,
	Christoph Hellwig <hch@infradead.org>,
	David Miller <davem@davemloft.net>,
	"Rafael J. Wysocki" <rjw@sisk.pl>,
	linux-kernel@vger.kernel.org,
	"kernel-testers@vger.kernel.org >> Kernel Testers List"
	<kernel-testers@vger.kernel.org>, Mike Galbraith <efault@gmx.de>,
	Peter Zijlstra <a.p.zijlstra@chello.nl>,
	Linux Netdev List <netdev@vger.kernel.org>,
	Christoph Lameter <cl@linux-foundation.org>,
	linux-fsdevel@vger.kernel.org, Al Viro <viro@zeniv.linux.org.uk>,
	"Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Subject: Re: [PATCH v3 6/7] fs: struct file move from call_rcu() to SLAB_DESTROY_BY_RCU
Date: Fri, 12 Dec 2008 12:50:11 +1000
Message-ID: <200812121350.13291.nickpiggin@yahoo.com.au>
In-Reply-To: <200707241113.46834.nickpiggin@yahoo.com.au>

On Tuesday 24 July 2007 11:13, Nick Piggin wrote:
> On Friday 12 December 2008 09:40, Eric Dumazet wrote:
> > From: Christoph Lameter <cl@linux-foundation.org>
> >
> > [PATCH] fs: struct file move from call_rcu() to SLAB_DESTROY_BY_RCU
> >
> > Currently we schedule RCU frees for each file we free separately. That
> > has several drawbacks against the earlier file handling (in 2.6.5 f.e.),
> > which did not require RCU callbacks:
> >
> > 1. Excessive number of RCU callbacks can be generated causing long RCU
> >   queues that in turn cause long latencies. We hit SLUB page allocation
> >   more often than necessary.
> >
> > 2. The cache hot object is not preserved between free and realloc. A
> > close followed by another open is very fast with the RCUless approach
> > because the last freed object is returned by the slab allocator that is
> > still cache hot. RCU free means that the object is not immediately
> > available again. The new object is cache cold and therefore open/close
> > performance tests show a significant degradation with the RCU
> >   implementation.
> >
> > One solution to this problem is to move the RCU freeing into the Slab
> > allocator by specifying SLAB_DESTROY_BY_RCU as an option at slab creation
> > time. The slab allocator will do RCU frees only when it is necessary
> > to dispose of slabs of objects (rare). So with that approach we can cut
> > out the RCU overhead significantly.
> >
> > However, the slab allocator may return the object for another use even
> > before the RCU period has expired under SLAB_DESTROY_BY_RCU. This means
> > there is the (unlikely) possibility that the object is going to be
> > switched under us in sections protected by rcu_read_lock() and
> > rcu_read_unlock(). So we need to verify that we have acquired the correct
> > object after establishing a stable object reference (incrementing the
> > refcounter does that).
> >
> >
> > Signed-off-by: Christoph Lameter <cl@linux-foundation.org>
> > Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
> > Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> > ---
> >  Documentation/filesystems/files.txt |   21 ++++++++++++++--
> >  fs/file_table.c                     |   33 ++++++++++++++++++--------
> >  include/linux/fs.h                  |    5 ---
> >  3 files changed, 42 insertions(+), 17 deletions(-)
> >
> > diff --git a/Documentation/filesystems/files.txt
> > b/Documentation/filesystems/files.txt index ac2facc..6916baa 100644
> > --- a/Documentation/filesystems/files.txt
> > +++ b/Documentation/filesystems/files.txt
> > @@ -78,13 +78,28 @@ the fdtable structure -
> >     that look-up may race with the last put() operation on the
> >     file structure. This is avoided using atomic_long_inc_not_zero()
> >     on ->f_count :
> > +   As file structures are allocated with SLAB_DESTROY_BY_RCU,
> > +   they can also be freed before a RCU grace period, and reused,
> > +   but still as a struct file.
> > +   It is necessary to check again after getting
> > +   a stable reference (ie after atomic_long_inc_not_zero()),
> > +   that fcheck_files(files, fd) points to the same file.
> >
> >  	rcu_read_lock();
> >  	file = fcheck_files(files, fd);
> >  	if (file) {
> > -		if (atomic_long_inc_not_zero(&file->f_count))
> > +		if (atomic_long_inc_not_zero(&file->f_count)) {
> >  			*fput_needed = 1;
> > -		else
> > +			/*
> > +			 * Now we have a stable reference to an object.
> > +			 * Check if other threads freed file and reallocated it.
> > +			 */
> > +			if (file != fcheck_files(files, fd)) {
> > +				*fput_needed = 0;
> > +				put_filp(file);
> > +				file = NULL;
> > +			}
> > +		} else
> >  		/* Didn't get the reference, someone's freed */
> >  			file = NULL;
> >  	}
> > @@ -95,6 +110,8 @@ the fdtable structure -
> >     atomic_long_inc_not_zero() detects if refcounts is already zero or
> >     goes to zero during increment. If it does, we fail
> >     fget()/fget_light().
> > +   The second call to fcheck_files(files, fd) checks that this filp
> > +   was not freed, then reused by an other thread.
> >
> >  6. Since both fdtable and file structures can be looked up
> >     lock-free, they must be installed using rcu_assign_pointer()
> > diff --git a/fs/file_table.c b/fs/file_table.c
> > index a46e880..3e9259d 100644
> > --- a/fs/file_table.c
> > +++ b/fs/file_table.c
> > @@ -37,17 +37,11 @@ static struct kmem_cache *filp_cachep __read_mostly;
> >
> >  static struct percpu_counter nr_files __cacheline_aligned_in_smp;
> >
> > -static inline void file_free_rcu(struct rcu_head *head)
> > -{
> > -	struct file *f =  container_of(head, struct file, f_u.fu_rcuhead);
> > -	kmem_cache_free(filp_cachep, f);
> > -}
> > -
> >  static inline void file_free(struct file *f)
> >  {
> >  	percpu_counter_dec(&nr_files);
> >  	file_check_state(f);
> > -	call_rcu(&f->f_u.fu_rcuhead, file_free_rcu);
> > +	kmem_cache_free(filp_cachep, f);
> >  }
> >
> >  /*
> > @@ -306,6 +300,14 @@ struct file *fget(unsigned int fd)
> >  			rcu_read_unlock();
> >  			return NULL;
> >  		}
> > +		/*
> > +		 * Now we have a stable reference to an object.
> > +		 * Check if other threads freed file and re-allocated it.
> > +		 */
> > +		if (unlikely(file != fcheck_files(files, fd))) {
> > +			put_filp(file);
> > +			file = NULL;
> > +		}
>
> This is a non-trivial change, because that put_filp may drop the last
> reference to the file. So now we have the case where we free the file
> from a context in which it had never been allocated.
>
> From a quick glance through the call chains, I can't see an obvious
> problem. But it needs to have documentation in put_filp, or at least
> a mention in the changelog, and also cc'ed to the security lists.
>
> Also, it adds code and cost to the get/put path in return for
> improvement in the free path. get/put is the more common path, but
> it is a small loss for a big improvement. So it might be worth it. But
> it is not justified by your microbenchmark. Do we have a more useful
> case that it helps?

Sorry, my clock screwed up and I didn't notice :(
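
For anyone following the thread, here is a rough userspace sketch of the
lookup pattern the quoted patch depends on: take a reference only if the
count is non-zero, then re-read the slot to confirm the object was not
recycled in between. It is only an illustration under stated assumptions,
not the kernel code. struct obj, struct slot, obj_get_not_zero(), obj_put(),
obj_try_pin() and slot_lookup() are hypothetical names standing in for
struct file, the fd table entry, atomic_long_inc_not_zero(), put_filp() and
fget(). C11 atomics replace the kernel primitives, and no RCU is involved.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct file: only the refcount matters here. */
struct obj {
	atomic_long refcount;	/* 0 means "free, may be reused at any time" */
};

/* Hypothetical stand-in for one fd table slot. */
struct slot {
	_Atomic(struct obj *) ptr;
};

/* Take a reference unless the count is already zero. */
static bool obj_get_not_zero(struct obj *o)
{
	long old = atomic_load(&o->refcount);

	while (old != 0) {
		/* On failure, 'old' is refreshed with the current value. */
		if (atomic_compare_exchange_weak(&o->refcount, &old, old + 1))
			return true;
	}
	return false;
}

/* Drop a reference; returns true if this was the last one. */
static bool obj_put(struct obj *o)
{
	long old = atomic_fetch_sub(&o->refcount, 1);

	assert(old > 0);
	return old == 1;
}

/*
 * Pin 'o', which the caller previously read from 's', then revalidate.
 * In the kernel, rcu_read_lock() guarantees the memory is still some
 * struct file, but not that it is still the same file.
 */
static struct obj *obj_try_pin(struct slot *s, struct obj *o)
{
	if (!o || !obj_get_not_zero(o))
		return NULL;	/* empty slot, or object already freed */

	if (o != atomic_load(&s->ptr)) {
		/*
		 * Recycled under us: the reference we took belongs to
		 * someone else's object. This put may be the final one,
		 * and a real implementation would free the object here.
		 */
		(void)obj_put(o);
		return NULL;
	}
	return o;
}

/* Full lookup: read the slot, then pin and revalidate. */
static struct obj *slot_lookup(struct slot *s)
{
	return obj_try_pin(s, atomic_load(&s->ptr));
}
```

The mismatch branch in obj_try_pin() is the case raised in the review
above: the lookup path can end up dropping the last reference to an object
it never owned.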
