From: Nick Piggin <nickpiggin@yahoo.com.au>
To: Eric Dumazet <dada1@cosmosbay.com>
Cc: Andrew Morton <akpm@linux-foundation.org>,
Ingo Molnar <mingo@elte.hu>,
Christoph Hellwig <hch@infradead.org>,
David Miller <davem@davemloft.net>,
"Rafael J. Wysocki" <rjw@sisk.pl>,
linux-kernel@vger.kernel.org,
"kernel-testers@vger.kernel.org >> Kernel Testers List"
<kernel-testers@vger.kernel.org>, Mike Galbraith <efault@gmx.de>,
Peter Zijlstra <a.p.zijlstra@chello.nl>,
Linux Netdev List <netdev@vger.kernel.org>,
Christoph Lameter <cl@linux-foundation.org>,
linux-fsdevel@vger.kernel.org, Al Viro <viro@zeniv.linux.org.uk>,
"Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Subject: Re: [PATCH v3 6/7] fs: struct file move from call_rcu() to SLAB_DESTROY_BY_RCU
Date: Fri, 12 Dec 2008 12:50:11 +1000
Message-ID: <200812121350.13291.nickpiggin@yahoo.com.au>
In-Reply-To: <200707241113.46834.nickpiggin@yahoo.com.au>
On Tuesday 24 July 2007 11:13, Nick Piggin wrote:
> On Friday 12 December 2008 09:40, Eric Dumazet wrote:
> > From: Christoph Lameter <cl@linux-foundation.org>
> >
> > [PATCH] fs: struct file move from call_rcu() to SLAB_DESTROY_BY_RCU
> >
> > Currently we schedule RCU frees for each file we free separately. That
> > has several drawbacks against the earlier file handling (in 2.6.5 f.e.),
> > which did not require RCU callbacks:
> >
> > 1. Excessive number of RCU callbacks can be generated causing long RCU
> > queues that in turn cause long latencies. We hit SLUB page allocation
> > more often than necessary.
> >
> > 2. The cache hot object is not preserved between free and realloc. A
> > close followed by another open is very fast with the RCUless approach
> > because the last freed object is returned by the slab allocator that is
> > still cache hot. RCU free means that the object is not immediately
> > available again. The new object is cache cold and therefore open/close
> > performance tests show a significant degradation with the RCU
> > implementation.
> >
> > One solution to this problem is to move the RCU freeing into the Slab
> > allocator by specifying SLAB_DESTROY_BY_RCU as an option at slab creation
> > time. The slab allocator will do RCU frees only when it is necessary
> > to dispose of slabs of objects (rare). So with that approach we can cut
> > out the RCU overhead significantly.
> >
> > However, the slab allocator may return the object for another use even
> > before the RCU period has expired under SLAB_DESTROY_BY_RCU. This means
> > there is the (unlikely) possibility that the object is going to be
> > switched under us in sections protected by rcu_read_lock() and
> > rcu_read_unlock(). So we need to verify that we have acquired the correct
> > object after establishing a stable object reference (incrementing the
> > refcounter does that).
> >
> >
> > Signed-off-by: Christoph Lameter <cl@linux-foundation.org>
> > Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
> > Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> > ---
> > Documentation/filesystems/files.txt | 21 ++++++++++++++--
> > fs/file_table.c | 33 ++++++++++++++++++--------
> > include/linux/fs.h | 5 ---
> > 3 files changed, 42 insertions(+), 17 deletions(-)
> >
> > diff --git a/Documentation/filesystems/files.txt
> > b/Documentation/filesystems/files.txt index ac2facc..6916baa 100644
> > --- a/Documentation/filesystems/files.txt
> > +++ b/Documentation/filesystems/files.txt
> > @@ -78,13 +78,28 @@ the fdtable structure -
> > that look-up may race with the last put() operation on the
> > file structure. This is avoided using atomic_long_inc_not_zero()
> > on ->f_count :
> > + As file structures are allocated with SLAB_DESTROY_BY_RCU,
> > + they can also be freed before a RCU grace period, and reused,
> > + but still as a struct file.
> > + It is necessary to check again after getting
> > + a stable reference (ie after atomic_long_inc_not_zero()),
> > + that fcheck_files(files, fd) points to the same file.
> >
> > rcu_read_lock();
> > file = fcheck_files(files, fd);
> > if (file) {
> > - if (atomic_long_inc_not_zero(&file->f_count))
> > + if (atomic_long_inc_not_zero(&file->f_count)) {
> > *fput_needed = 1;
> > - else
> > + /*
> > + * Now we have a stable reference to an object.
> > + * Check if other threads freed file and reallocated it.
> > + */
> > + if (file != fcheck_files(files, fd)) {
> > + *fput_needed = 0;
> > + put_filp(file);
> > + file = NULL;
> > + }
> > + } else
> > /* Didn't get the reference, someone's freed */
> > file = NULL;
> > }
> > @@ -95,6 +110,8 @@ the fdtable structure -
> > atomic_long_inc_not_zero() detects if refcounts is already zero or
> > goes to zero during increment. If it does, we fail
> > fget()/fget_light().
> > + The second call to fcheck_files(files, fd) checks that this filp
> > + was not freed, then reused by another thread.
> >
> > 6. Since both fdtable and file structures can be looked up
> > lock-free, they must be installed using rcu_assign_pointer()
> > diff --git a/fs/file_table.c b/fs/file_table.c
> > index a46e880..3e9259d 100644
> > --- a/fs/file_table.c
> > +++ b/fs/file_table.c
> > @@ -37,17 +37,11 @@ static struct kmem_cache *filp_cachep __read_mostly;
> >
> > static struct percpu_counter nr_files __cacheline_aligned_in_smp;
> >
> > -static inline void file_free_rcu(struct rcu_head *head)
> > -{
> > - struct file *f = container_of(head, struct file, f_u.fu_rcuhead);
> > - kmem_cache_free(filp_cachep, f);
> > -}
> > -
> > static inline void file_free(struct file *f)
> > {
> > percpu_counter_dec(&nr_files);
> > file_check_state(f);
> > - call_rcu(&f->f_u.fu_rcuhead, file_free_rcu);
> > + kmem_cache_free(filp_cachep, f);
> > }
> >
> > /*
> > @@ -306,6 +300,14 @@ struct file *fget(unsigned int fd)
> > rcu_read_unlock();
> > return NULL;
> > }
> > + /*
> > + * Now we have a stable reference to an object.
> > + * Check if other threads freed file and re-allocated it.
> > + */
> > + if (unlikely(file != fcheck_files(files, fd))) {
> > + put_filp(file);
> > + file = NULL;
> > + }
>
> This is a non-trivial change, because that put_filp may drop the last
> reference to the file. So now we have the case where we free the file
> from a context in which it had never been allocated.
>
> From a quick glance through the callchains, I can't see an obvious
> problem. But it needs to have documentation in put_filp, or at least
> a mention in the changelog, and also cc'ed to the security lists.
>
> Also, it adds code and cost to the get/put path in return for
> improvement in the free path. get/put is the more common path, but
> it is a small loss for a big improvement. So it might be worth it. But
> it is not justified by your microbenchmark. Do we have a more useful
> case that it helps?
Sorry, my clock screwed up and I didn't notice :(
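
For readers following the thread, here is a minimal userspace sketch of the
lookup pattern the patch introduces and of the concern raised above. It is
not kernel code: struct obj, table, lookup(), put_obj(), the two writer_*
hooks and race_demo() are all hypothetical names, C11 atomics stand in for
the kernel primitives, and the "other thread" is simulated deterministically
by hooks called at the two race windows. It assumes type-stable memory (one
static object that is reused), which is the property SLAB_DESTROY_BY_RCU is
meant to provide.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct file: a refcount and a tag. */
struct obj {
	atomic_long count;
	int id;
};

#define NSLOTS 4
/* Hypothetical stand-in for the fd table. */
static _Atomic(struct obj *) table[NSLOTS];
/* One type-stable object that gets "freed" and reused as the same type. */
static struct obj pool_obj;
/* How many times the final reference was dropped from the lookup path. */
static int frees_from_lookup;

typedef void (*hook_fn)(void);

/* Increment unless zero, modelled on atomic_long_inc_not_zero(). */
static bool inc_not_zero(atomic_long *c)
{
	long old = atomic_load(c);

	while (old != 0) {
		/* On failure, old is reloaded with the current value. */
		if (atomic_compare_exchange_weak(c, &old, old + 1))
			return true;
	}
	return false;
}

/* Drop one reference; returns true if it was the last one. */
static bool put_obj(struct obj *o)
{
	long old = atomic_fetch_sub(&o->count, 1);

	assert(old > 0);
	return old == 1;
}

/* Lookup with revalidation; the hooks simulate a concurrent thread. */
static struct obj *lookup(unsigned int fd, hook_fn after_load, hook_fn after_inc)
{
	struct obj *o = atomic_load(&table[fd]);

	if (after_load)
		after_load();
	if (!o || !inc_not_zero(&o->count))
		return NULL;
	if (after_inc)
		after_inc();
	/* Stable reference held: is the slot still pointing at this object? */
	if (o != atomic_load(&table[fd])) {
		if (put_obj(o))
			frees_from_lookup++;	/* last put happened here */
		return NULL;
	}
	return o;
}

/* Simulated close of slot 0, then immediate reuse of the object in slot 1. */
static void writer_recycle(void)
{
	atomic_store(&table[0], NULL);
	put_obj(&pool_obj);			/* 1 -> 0: object is "freed" */
	pool_obj.id = 2;
	atomic_store(&pool_obj.count, 1);	/* reused as a new object */
	atomic_store(&table[1], &pool_obj);
}

/* Simulated close of slot 1 while the reader still holds its reference. */
static void writer_close_new(void)
{
	atomic_store(&table[1], NULL);
	put_obj(&pool_obj);			/* 2 -> 1: reader holds the last ref */
}

/* Runs the interleaving once; returns the number of frees from lookup(). */
static int race_demo(void)
{
	frees_from_lookup = 0;
	pool_obj.id = 1;
	atomic_store(&pool_obj.count, 1);
	atomic_store(&table[0], &pool_obj);
	atomic_store(&table[1], NULL);

	struct obj *o = lookup(0, writer_recycle, writer_close_new);

	assert(o == NULL);	/* the recheck rejects the recycled object */
	return frees_from_lookup;
}
```

Called from a main(), race_demo() returns 1 under this interleaving: the
recheck correctly rejects the recycled object, but the put on that path is
the one that drops the final reference, so the object is released from a
lookup on a descriptor that never referred to it. The sketch also makes the
cost question visible: every successful lookup now pays for a second load of
the table slot.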