From: Eric Dumazet <dada1@cosmosbay.com>
To: Andrew Morton <akpm@linux-foundation.org>
Cc: Ingo Molnar <mingo@elte.hu>,
	Christoph Hellwig <hch@infradead.org>,
	David Miller <davem@davemloft.net>,
	"Rafael J. Wysocki" <rjw@sisk.pl>,
	linux-kernel@vger.kernel.org,
	"kernel-testers@vger.kernel.org >> Kernel Testers List" 
	<kernel-testers@vger.kernel.org>, Mike Galbraith <efault@gmx.de>,
	Peter Zijlstra <a.p.zijlstra@chello.nl>,
	Linux Netdev List <netdev@vger.kernel.org>,
	Christoph Lameter <cl@linux-foundation.org>,
	linux-fsdevel@vger.kernel.org, Al Viro <viro@ZenIV.linux.org.uk>,
	"Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Subject: [PATCH v3 6/7] fs: struct file move from call_rcu() to SLAB_DESTROY_BY_RCU
Date: Thu, 11 Dec 2008 23:40:29 +0100
Message-ID: <494196DD.5070600@cosmosbay.com>
In-Reply-To: <493100B0.6090104@cosmosbay.com>

From: Christoph Lameter <cl@linux-foundation.org>

[PATCH] fs: struct file move from call_rcu() to SLAB_DESTROY_BY_RCU

Currently we schedule a separate RCU free for each file we free. That has
several drawbacks compared with the earlier file handling (e.g. in 2.6.5),
which did not require RCU callbacks:

1. An excessive number of RCU callbacks can be generated, causing long RCU
  queues that in turn cause long latencies. We hit SLUB page allocation
  more often than necessary.

2. The cache hot object is not preserved between free and realloc. A close
  followed by another open is very fast with the RCU-less approach because
  the slab allocator returns the last freed object, which is still cache
  hot. RCU free means that the object is not immediately available again.
  The new object is cache cold, and therefore open/close performance tests
  show a significant degradation with the RCU implementation.

One solution to this problem is to move the RCU freeing into the slab
allocator by specifying SLAB_DESTROY_BY_RCU as an option at slab creation
time. The slab allocator will then do RCU frees only when it has to
dispose of whole slabs of objects, which is rare. With that approach we
can cut out the RCU overhead significantly.

However, the slab allocator may return the object for another use even
before the RCU period has expired under SLAB_DESTROY_BY_RCU. This means
there is the (unlikely) possibility that the object is going to be
switched under us in sections protected by rcu_read_lock() and
rcu_read_unlock(). So we need to verify that we have acquired the correct
object after establishing a stable object reference (incrementing the
refcounter does that).
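
For illustration only (this is not part of the patch): a minimal user-space
sketch of the lookup-then-recheck pattern described above, written with C11
atomics. The names struct obj, table, ref_inc_not_zero(), lookup() and
simulate_replace are hypothetical stand-ins for struct file, the fd table,
atomic_long_inc_not_zero(), fget() and a concurrent close/open. The RCU
read-side section is not modelled; the objects are assumed to stay valid
memory for the duration of the lookup.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct file: only the refcount matters here. */
struct obj {
	atomic_long refcount;
};

/* Hypothetical stand-in for the fd table: one slot per descriptor. */
#define NSLOTS 4
static _Atomic(struct obj *) table[NSLOTS];

/*
 * Test-only knob (hypothetical): when non-NULL, lookup() installs it in
 * the slot right after taking the reference, imitating another thread
 * that closed the descriptor and reused it in that window.
 */
static struct obj *simulate_replace;

/* Take a reference unless the count is already zero (object being freed). */
static bool ref_inc_not_zero(atomic_long *cnt)
{
	long old = atomic_load(cnt);

	while (old != 0) {
		/* On failure, old is refreshed with the current value. */
		if (atomic_compare_exchange_weak(cnt, &old, old + 1))
			return true;
	}
	return false;
}

static void ref_put(struct obj *o)
{
	atomic_fetch_sub(&o->refcount, 1);
}

/* Look up a slot, pin the object, then verify the slot still names it. */
static struct obj *lookup(unsigned int slot)
{
	struct obj *o = atomic_load(&table[slot]);

	if (!o)
		return NULL;
	if (!ref_inc_not_zero(&o->refcount))
		return NULL;		/* refcount was zero: being freed */

	if (simulate_replace)
		atomic_store(&table[slot], simulate_replace);

	/* Stable reference held; recheck that we pinned the right object. */
	if (o != atomic_load(&table[slot])) {
		ref_put(o);
		return NULL;
	}
	return o;
}
```

The recheck only compares pointers. That is sufficient in this sketch
because the reference taken just before it keeps the object from being
freed and reused again while the comparison runs.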


Signed-off-by: Christoph Lameter <cl@linux-foundation.org>
Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
---
 Documentation/filesystems/files.txt |   21 ++++++++++++++--
 fs/file_table.c                     |   33 ++++++++++++++++++--------
 include/linux/fs.h                  |    5 ---
 3 files changed, 42 insertions(+), 17 deletions(-)

diff --git a/Documentation/filesystems/files.txt b/Documentation/filesystems/files.txt
index ac2facc..6916baa 100644
--- a/Documentation/filesystems/files.txt
+++ b/Documentation/filesystems/files.txt
@@ -78,13 +78,28 @@ the fdtable structure -
    that look-up may race with the last put() operation on the
    file structure. This is avoided using atomic_long_inc_not_zero()
    on ->f_count :
+   As file structures are allocated with SLAB_DESTROY_BY_RCU,
+   they can also be freed before an RCU grace period ends, and
+   reused, but still as a struct file.
+   It is necessary to check again after getting
+   a stable reference (i.e. after atomic_long_inc_not_zero()),
+   that fcheck_files(files, fd) points to the same file.
 
 	rcu_read_lock();
 	file = fcheck_files(files, fd);
 	if (file) {
-		if (atomic_long_inc_not_zero(&file->f_count))
+		if (atomic_long_inc_not_zero(&file->f_count)) {
 			*fput_needed = 1;
-		else
+			/*
+			 * Now we have a stable reference to an object.
+			 * Check if other threads freed file and reallocated it.
+			 */
+			if (file != fcheck_files(files, fd)) {
+				*fput_needed = 0;
+				put_filp(file);
+				file = NULL;
+			}
+		} else
 		/* Didn't get the reference, someone's freed */
 			file = NULL;
 	}
@@ -95,6 +110,8 @@ the fdtable structure -
    atomic_long_inc_not_zero() detects if refcounts is already zero or
    goes to zero during increment. If it does, we fail
    fget()/fget_light().
+   The second call to fcheck_files(files, fd) checks that this filp
+   was not freed, then reused by another thread.
 
 6. Since both fdtable and file structures can be looked up
    lock-free, they must be installed using rcu_assign_pointer()
diff --git a/fs/file_table.c b/fs/file_table.c
index a46e880..3e9259d 100644
--- a/fs/file_table.c
+++ b/fs/file_table.c
@@ -37,17 +37,11 @@ static struct kmem_cache *filp_cachep __read_mostly;
 
 static struct percpu_counter nr_files __cacheline_aligned_in_smp;
 
-static inline void file_free_rcu(struct rcu_head *head)
-{
-	struct file *f =  container_of(head, struct file, f_u.fu_rcuhead);
-	kmem_cache_free(filp_cachep, f);
-}
-
 static inline void file_free(struct file *f)
 {
 	percpu_counter_dec(&nr_files);
 	file_check_state(f);
-	call_rcu(&f->f_u.fu_rcuhead, file_free_rcu);
+	kmem_cache_free(filp_cachep, f);
 }
 
 /*
@@ -306,6 +300,14 @@ struct file *fget(unsigned int fd)
 			rcu_read_unlock();
 			return NULL;
 		}
+		/*
+		 * Now we have a stable reference to an object.
+		 * Check if other threads freed file and re-allocated it.
+		 */
+		if (unlikely(file != fcheck_files(files, fd))) {
+			put_filp(file);
+			file = NULL;
+		}
 	}
 	rcu_read_unlock();
 
@@ -333,9 +335,19 @@ struct file *fget_light(unsigned int fd, int *fput_needed)
 		rcu_read_lock();
 		file = fcheck_files(files, fd);
 		if (file) {
-			if (atomic_long_inc_not_zero(&file->f_count))
+			if (atomic_long_inc_not_zero(&file->f_count)) {
 				*fput_needed = 1;
-			else
+				/*
+				 * Now we have a stable reference to an object.
+				 * Check if other threads freed this file and
+				 * re-allocated it.
+				 */
+				if (unlikely(file != fcheck_files(files, fd))) {
+					*fput_needed = 0;
+					put_filp(file);
+					file = NULL;
+				}
+			} else
 				/* Didn't get the reference, someone's freed */
 				file = NULL;
 		}
@@ -402,7 +414,8 @@ void __init files_init(unsigned long mempages)
 	int n; 
 
 	filp_cachep = kmem_cache_create("filp", sizeof(struct file), 0,
-			SLAB_HWCACHE_ALIGN | SLAB_PANIC, NULL);
+			SLAB_HWCACHE_ALIGN | SLAB_DESTROY_BY_RCU | SLAB_PANIC,
+			NULL);
 
 	/*
 	 * One file with associated inode and dcache is very roughly 1K. 
diff --git a/include/linux/fs.h b/include/linux/fs.h
index a702d81..a1f56d4 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -811,13 +811,8 @@ static inline int ra_has_index(struct file_ra_state *ra, pgoff_t index)
 #define FILE_MNT_WRITE_RELEASED	2
 
 struct file {
-	/*
-	 * fu_list becomes invalid after file_free is called and queued via
-	 * fu_rcuhead for RCU freeing
-	 */
 	union {
 		struct list_head	fu_list;
-		struct rcu_head 	fu_rcuhead;
 	} f_u;
 	struct path		f_path;
 #define f_dentry	f_path.dentry
