* real_lookup() lock contention causes poor performance
@ 2010-01-25 17:09 Quentin Barnes
  2010-01-25 23:15 ` Andreas Dilger
  0 siblings, 1 reply; 3+ messages in thread
From: Quentin Barnes @ 2010-01-25 17:09 UTC (permalink / raw)
  To: linux-fsdevel

In house, we developed our own variation of Theodore Ts'o and
Robert Love's open-by-inode patch for the ext3 file system.  The
open-by-inode patch allows user space to open a file by its inode
number under a special directory directly under the mount point
called ".inode" (e.g.: "<>/.inode/12345").  I also adapted the
concept for the NFS file system for open-by-file-handle using a
magic directory ".file_handle".
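
For illustration, opening by inode number from user space is nothing
more exotic than the sketch below (helper name and error handling are
made up, not part of the actual patch):

#include <limits.h>
#include <stdio.h>
#include <fcntl.h>

/* Open the file whose inode number is 'ino' on the file system
 * mounted at 'mnt', via the magic ".inode" directory, without
 * knowing any of its path names. */
static int open_by_inode(const char *mnt, unsigned long ino, int flags)
{
	char path[PATH_MAX];

	snprintf(path, sizeof(path), "%s/.inode/%lu", mnt, ino);
	return open(path, flags);
}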

With the open-by-* extensions in use under load, we quickly
noticed a performance problem in the VFS layer.  We tracked it
down to processes blocking on the dir->i_mutex lock taken in
real_lookup().  That makes sense since the magic directories
constantly get hammered with opens for "new" files that aren't
yet in the dentry cache.  We create normal dentry entries for the
special files once opened, but there are so many unique files
in the file systems that dentry cache hits are rare and
real_lookup() gets called often enough to be a real problem.

As a hack around this bottleneck, the developer of the ext3 extension
creates 100 ".inodeXXX" directories that behave identically.
Then in an app he sequentially spreads the open-by-inode requests
among them.  That dilutes the directory lock contention in
real_lookup() enough for his situation.

Depending on the number of disks configured, his ".inodeXXX"
change gained him a considerable 5X-10X overall improvement
to his application.
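
His spreading logic is roughly the following (illustrative only --
the directory naming and the round-robin choice are my sketch of his
approach, not his actual code; same headers as the sketch above):

#define NDIRS 100	/* number of identical ".inodeXXX" directories */

static int open_by_inode_spread(const char *mnt, unsigned long ino,
				int flags)
{
	static unsigned int next;
	char path[PATH_MAX];

	/* Round-robin across the magic directories so no single
	 * dir->i_mutex takes all the real_lookup() traffic. */
	snprintf(path, sizeof(path), "%s/.inode%03u/%lu",
		 mnt, next++ % NDIRS, ino);
	return open(path, flags);
}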

For the NFS case, I took a different approach: I side-step the
lock.  I don't think the parent directory lock in real_lookup() on
the magic directory is particularly useful for these special-needs
cases.  The parent directory lock protects against the parent
directory being renamed (moved) or deleted while the lookup is in
progress.  (Is that correct?  What else does it do?)  What I did
instead is set a flag in the "i_flags" of the magic directory's
inode.  Then in real_lookup(), I test for that flag and use it to
bypass taking the dir->i_mutex lock.  Rather than create a new
flag, I hijacked S_IMMUTABLE since the magic directory is, in a
sense, conceptually immutable.  This solved the lock contention for
my NFS work, gaining me about a 2.5X improvement.
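
On the kernel side, the marking itself is trivial -- roughly the
sketch below (illustrative, not the exact NFS change; done once when
the magic directory's inode is first set up, before lookups can see
it):

static void mark_magic_dir(struct inode *dir)
{
	/* Reuse S_IMMUTABLE so real_lookup() can skip dir->i_mutex on
	 * this directory; a dedicated i_flags bit would be cleaner. */
	dir->i_flags |= S_IMMUTABLE;
}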

I consider what's been done so far just a quick and dirty way to
work around a problem.  What I'd like to discuss, though, is whether
there is any practical way to design a real change to the VFS layer,
acceptable to the community, that can either bypass the
dir->i_mutex lock for special cases like those mentioned above or
design out the bottleneck entirely.  Or is there a radically
different approach to implementing open-by-* functionality that
would bypass the real_lookup() bottleneck?  Any suggestions?


For illustrative purposes, below is my real_lookup() patch.  I
noticed that the S_IMMUTABLE flag is already being set on some
directories by the kernel in the "/proc" fs code.  Is there any
particular reason this was done?  Would my patch cause any bad
side-effects for the /proc file system, or would it actually help
its performance by bypassing the dir->i_mutex lock for those
directories?

Quentin


--- linux-2.6.32.5/fs/namei.c	2010-01-22 17:23:21.000000000 -0600
+++ linux-2.6.32.5-immdir/fs/namei.c	2010-01-24 23:10:43.000000000 -0600
@@ -478,8 +478,13 @@ static struct dentry * real_lookup(struc
 {
 	struct dentry * result;
 	struct inode *dir = parent->d_inode;
+	int dir_locked = 0;
+
+	if (likely(!IS_IMMUTABLE(dir))) {
+		mutex_lock(&dir->i_mutex);
+		dir_locked = 1;
+	}
 
-	mutex_lock(&dir->i_mutex);
 	/*
 	 * First re-do the cached lookup just in case it was created
 	 * while we waited for the directory semaphore..
@@ -513,7 +518,8 @@ static struct dentry * real_lookup(struc
 				result = dentry;
 		}
 out_unlock:
-		mutex_unlock(&dir->i_mutex);
+		if (likely(dir_locked))
+			mutex_unlock(&dir->i_mutex);
 		return result;
 	}
 
@@ -521,7 +527,8 @@ out_unlock:
 	 * Uhhuh! Nasty case: the cache was re-populated while
 	 * we waited on the semaphore. Need to revalidate.
 	 */
-	mutex_unlock(&dir->i_mutex);
+	if (likely(dir_locked))
+		mutex_unlock(&dir->i_mutex);
 	if (result->d_op && result->d_op->d_revalidate) {
 		result = do_revalidate(result, nd);
 		if (!result)
Quentin



* Re: real_lookup() lock contention causes poor performance
  2010-01-25 17:09 real_lookup() lock contention causes poor performance Quentin Barnes
@ 2010-01-25 23:15 ` Andreas Dilger
  2010-01-26  7:55   ` Christoph Hellwig
  0 siblings, 1 reply; 3+ messages in thread
From: Andreas Dilger @ 2010-01-25 23:15 UTC (permalink / raw)
  To: Quentin Barnes; +Cc: linux-fsdevel

On 2010-01-25, at 10:09, Quentin Barnes wrote:
> In house, we developed our own variation of Theodore Ts'o and
> Robert Love's open-by-inode patch for the ext3 file system.  The
> open-by-inode patch allows user space to open a file by its inode
> number under a special directory directly under the mount point
> called ".inode" (e.g.: "<>/.inode/12345").  I also adapted the
> concept for the NFS file system for open-by-file-handle using a
> magic directory ".file_handle".

We have had something similar for Lustre patched into ext3/ext4 for a
long time -- "__iopen__/{inum}" -- which depends on a mount option
"iopen" to allow this functionality, and "iopen_nopriv" to allow
access to non-root users.

The problem we had in the past is that this needed a bunch of changes
in ext3/4 and the VFS in order to handle subtle races in the dcache.
We're moving away from changing ext3/4 and are instead changing the
API to use export methods (i.e. ext4_get_dentry()) on each
filesystem, which should simplify this a lot, though I'm not sure it
will solve your scalability problem.

> With the open-by-* extensions in use under load, we quickly
> noticed a performance problem in the VFS layer.  We tracked it
> down to processes blocking on the dir->i_mutex lock taken in
> real_lookup().  That makes sense since the magic directories
> constantly get hammered with opens for "new" files that aren't
> yet in the dentry cache.  We create normal dentry entries for the
> special files once opened, but there are so many unique files
> in the file systems that dentry cache hits are rare and
> real_lookup() gets called often enough to be a real problem.
>
> As a hack around this bottleneck, the developer of the ext3 extension
> creates 100 ".inodeXXX" directories that behave identically.
> Then in an app he sequentially spreads the open-by-inode requests
> among them.  That dilutes the directory lock contention in
> real_lookup() enough for his situation.

We have a patch that avoids this problem, called dynlocks (previously  
pdirops) that allows locking only the hash of the filename instead of  
the whole directory.  For in-cache lookups this is perfectly  
sufficient (chance of hash collision is vanishingly small), but an  
additional patch is needed at the ext3/4 level to allow parallel  
lookup/add/remove in an htree directory.  Since the number of threads  
in the kernel at one time is relatively small, this is currently only  
a linked list of entries, but it would be trivial to also put the  
locks into a hashtable with 2*num_online_cpus() buckets and avoid most  
hash chains.
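
The core of the idea is roughly this (only a sketch, not the actual
dynlocks code; names and sizing are illustrative, hash_long() is from
<linux/hash.h>):

#define DIR_LOCK_BUCKETS 64	/* could instead scale with num_online_cpus() */

static struct mutex dir_lock_table[DIR_LOCK_BUCKETS];	/* mutex_init() at setup */

static struct mutex *dir_name_lock(struct inode *dir, const struct qstr *name)
{
	/* Serialize only lookups whose (dir, name) hashes collide,
	 * instead of taking the whole directory's i_mutex. */
	return &dir_lock_table[hash_long(name->hash ^ (unsigned long)dir,
					 ilog2(DIR_LOCK_BUCKETS))];
}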

This was implemented by having per-filesystem ->dir_lock() and
->dir_unlock() methods, defaulting to mutex_lock() and
mutex_unlock() if unspecified.
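
In real_lookup() terms, the VFS side then boils down to something
like the following (again just a sketch; ->dir_lock/->dir_unlock are
the proposed methods, they do not exist in inode_operations today):

static void vfs_dir_lock(struct inode *dir, struct dentry *dentry)
{
	if (dir->i_op->dir_lock)		/* hypothetical per-fs method */
		dir->i_op->dir_lock(dir, dentry);
	else
		mutex_lock(&dir->i_mutex);
}

static void vfs_dir_unlock(struct inode *dir, struct dentry *dentry)
{
	if (dir->i_op->dir_unlock)		/* hypothetical per-fs method */
		dir->i_op->dir_unlock(dir, dentry);
	else
		mutex_unlock(&dir->i_mutex);
}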

> For the NFS case, I took a different approach: I side-step the
> lock.  I don't think the parent directory lock in real_lookup() on
> the magic directory is particularly useful for these special-needs
> cases.  The parent directory lock protects against the parent
> directory being renamed (moved) or deleted while the lookup is in
> progress.  (Is that correct?  What else does it do?)  What I did
> instead is set a flag in the "i_flags" of the magic directory's
> inode.  Then in real_lookup(), I test for that flag and use it to
> bypass taking the dir->i_mutex lock.  Rather than create a new
> flag, I hijacked S_IMMUTABLE since the magic directory is, in a
> sense, conceptually immutable.  This solved the lock contention for
> my NFS work, gaining me about a 2.5X improvement.

The problem with using S_IMMUTABLE is that this can be changed by
userspace, and if that happens in the middle of a lookup you will
get imbalanced up/down calls and normal directories may become
corrupt and/or unusable.

> I consider what's been done so far just a quick and dirty way to
> work around a problem.  What I'd like to discuss, though, is whether
> there is any practical way to design a real change to the VFS layer,
> acceptable to the community, that can either bypass the
> dir->i_mutex lock for special cases like those mentioned above or
> design out the bottleneck entirely.  Or is there a radically
> different approach to implementing open-by-* functionality that
> would bypass the real_lookup() bottleneck?  Any suggestions?
>
>
> For illustrative purposes, below is my real_lookup() patch.  I
> noticed that the S_IMMUTABLE flag is already being set on some
> directories by the kernel in the "/proc" fs code.  Is there any
> particular reason this was done?  Would my patch cause any bad
> side-effects for the /proc file system, or would it actually help
> its performance by bypassing the dir->i_mutex lock for those
> directories?
>
> Quentin
>
>
> --- linux-2.6.32.5/fs/namei.c	2010-01-22 17:23:21.000000000 -0600
> +++ linux-2.6.32.5-immdir/fs/namei.c	2010-01-24 23:10:43.000000000 -0600
> @@ -478,8 +478,13 @@ static struct dentry * real_lookup(struc
> {
> 	struct dentry * result;
> 	struct inode *dir = parent->d_inode;
> +	int dir_locked = 0;
> +
> +	if (likely(!IS_IMMUTABLE(dir))) {
> +		mutex_lock(&dir->i_mutex);
> +		dir_locked = 1;
> +	}
>
> -	mutex_lock(&dir->i_mutex);
> 	/*
> 	 * First re-do the cached lookup just in case it was created
> 	 * while we waited for the directory semaphore..
> @@ -513,7 +518,8 @@ static struct dentry * real_lookup(struc
> 				result = dentry;
> 		}
> out_unlock:
> -		mutex_unlock(&dir->i_mutex);
> +		if (likely(dir_locked))
> +			mutex_unlock(&dir->i_mutex);
> 		return result;
> 	}
>
> @@ -521,7 +527,8 @@ out_unlock:
> 	 * Uhhuh! Nasty case: the cache was re-populated while
> 	 * we waited on the semaphore. Need to revalidate.
> 	 */
> -	mutex_unlock(&dir->i_mutex);
> +	if (likely(dir_locked))
> +		mutex_unlock(&dir->i_mutex);
> 	if (result->d_op && result->d_op->d_revalidate) {
> 		result = do_revalidate(result, nd);
> 		if (!result)
> Quentin
>


Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.



* Re: real_lookup() lock contention causes poor performance
  2010-01-25 23:15 ` Andreas Dilger
@ 2010-01-26  7:55   ` Christoph Hellwig
  0 siblings, 0 replies; 3+ messages in thread
From: Christoph Hellwig @ 2010-01-26  7:55 UTC (permalink / raw)
  To: Andreas Dilger; +Cc: Quentin Barnes, linux-fsdevel

On Mon, Jan 25, 2010 at 04:15:03PM -0700, Andreas Dilger wrote:
> We have had something similar for Lustre patched into ext3/ext4 for a
> long time -- "__iopen__/{inum}" -- which depends on a mount option
> "iopen" to allow this functionality, and "iopen_nopriv" to allow
> access to non-root users.
>
> The problem we had in the past is that this needed a bunch of changes in
> ext3/4 and the VFS in order to handle subtle races in the dcache.  We're
> moving away from changing ext3/4 and are instead changing the API to use
> export methods (i.e. ext4_get_dentry()) on each filesystem, which should
> simplify this a lot, though I'm not sure it will solve your scalability
> problem.

The get_dentry export operation is long gone.  Take a look at
xfs_handle_to_dentry() and friends for a semi-sane (not the horrible
legacy ABI, but the implementation) open-by-handle implementation
using the full exportfs infrastructure.  It's still going to hit the
dcache locking quite hard, though.
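
The shape of it is roughly this (just a sketch, helper names made up;
see xfs_handle_to_dentry() and fs/exportfs/expfs.c for the real code):

static int any_dentry_acceptable(void *context, struct dentry *d)
{
	return 1;	/* accept whatever dentry the fs hands back */
}

static struct dentry *handle_to_dentry(struct vfsmount *mnt,
				       struct fid *fid, int fh_len,
				       int fh_type)
{
	/* exportfs_decode_fh() calls the fs's ->fh_to_dentry() and
	 * handles reconnecting disconnected dentries to the tree. */
	return exportfs_decode_fh(mnt, fid, fh_len, fh_type,
				  any_dentry_acceptable, NULL);
}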


