* [patch 1/2] kernel: introduce brlock
From: Nick Piggin @ 2010-03-16 12:22 UTC (permalink / raw)
To: Al Viro, Frank Mayhar, John Stultz, Andi Kleen, linux-fsdevel
This second patchset scales the vfsmount lock. When it was last posted,
you were worried about the commenting of lock requirements and about the
impact on the slowpath. I have added comments and also done some slowpath
measurements.
--
brlock: introduce special brlocks
This patch introduces special brlocks. These can only be used as global
locks, and they use some preprocessor trickery to allow us to retain a more
optimal per-cpu lock implementation. We don't bother working around
lockdep yet.
The other thing we can do in future is a really neat atomic-free
implementation like Dave M did for the old brlocks, so we might actually
be able to speed up the single-thread path for these things.
Signed-off-by: Nick Piggin <npiggin@suse.de>
---
include/linux/brlock.h | 112 +++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 112 insertions(+)
Index: linux-2.6/include/linux/brlock.h
===================================================================
--- /dev/null
+++ linux-2.6/include/linux/brlock.h
@@ -0,0 +1,112 @@
+/*
+ * Specialised big-reader spinlock. Can only be declared as global variables
+ * to avoid overhead and keep things simple (and we don't want to start using
+ * these inside dynamically allocated structures).
+ *
+ * Copyright 2009, Nick Piggin, Novell Inc.
+ */
+#ifndef __LINUX_BRLOCK_H
+#define __LINUX_BRLOCK_H
+
+#include <linux/spinlock.h>
+#include <linux/percpu.h>
+#include <asm/atomic.h>
+
+#if defined(CONFIG_SMP) && !defined(CONFIG_LOCKDEP)
+#define DECLARE_BRLOCK(name) \
+ DECLARE_PER_CPU(spinlock_t, name##_lock); \
+ extern void name##_lock_init(void); \
+ static inline void name##_rlock(void) { \
+ spinlock_t *lock; \
+ lock = &get_cpu_var(name##_lock); \
+ spin_lock(lock); \
+ put_cpu_var(name##_lock); \
+ } \
+ static inline void name##_runlock(void) { \
+ spinlock_t *lock; \
+ lock = &__get_cpu_var(name##_lock); \
+ spin_unlock(lock); \
+ } \
+ extern void name##_wlock(void); \
+ extern void name##_wunlock(void); \
+ static inline int name##_atomic_dec_and_rlock(atomic_t *a) { \
+ int ret; \
+ spinlock_t *lock; \
+ lock = &get_cpu_var(name##_lock); \
+ ret = atomic_dec_and_lock(a, lock); \
+ put_cpu_var(name##_lock); \
+ return ret; \
+ } \
+ extern int name##_atomic_dec_and_wlock__failed(atomic_t *a); \
+ static inline int name##_atomic_dec_and_wlock(atomic_t *a) { \
+ if (atomic_add_unless(a, -1, 1)) \
+ return 0; \
+ return name##_atomic_dec_and_wlock__failed(a); \
+ }
+
+#define DEFINE_BRLOCK(name) \
+ DEFINE_PER_CPU(spinlock_t, name##_lock); \
+ void name##_lock_init(void) { \
+ int i; \
+ for_each_possible_cpu(i) { \
+ spinlock_t *lock; \
+ lock = &per_cpu(name##_lock, i); \
+ spin_lock_init(lock); \
+ } \
+ } \
+ void name##_wlock(void) { \
+ int i; \
+ for_each_online_cpu(i) { \
+ spinlock_t *lock; \
+ lock = &per_cpu(name##_lock, i); \
+ spin_lock(lock); \
+ } \
+ } \
+ void name##_wunlock(void) { \
+ int i; \
+ for_each_online_cpu(i) { \
+ spinlock_t *lock; \
+ lock = &per_cpu(name##_lock, i); \
+ spin_unlock(lock); \
+ } \
+ } \
+ int name##_atomic_dec_and_wlock__failed(atomic_t *a) { \
+ name##_wlock(); \
+ if (!atomic_dec_and_test(a)) { \
+ name##_wunlock(); \
+ return 0; \
+ } \
+ return 1; \
+ }
+
+#else
+
+#define DECLARE_BRLOCK(name) \
+ extern spinlock_t name##_lock; \
+ static inline void name##_lock_init(void) { \
+ spin_lock_init(&name##_lock); \
+ } \
+ static inline void name##_rlock(void) { \
+ spin_lock(&name##_lock); \
+ } \
+ static inline void name##_runlock(void) { \
+ spin_unlock(&name##_lock); \
+ } \
+ static inline void name##_wlock(void) { \
+ spin_lock(&name##_lock); \
+ } \
+ static inline void name##_wunlock(void) { \
+ spin_unlock(&name##_lock); \
+ } \
+ static inline int name##_atomic_dec_and_rlock(atomic_t *a) { \
+ return atomic_dec_and_lock(a, &name##_lock); \
+ } \
+ static inline int name##_atomic_dec_and_wlock(atomic_t *a) { \
+ return atomic_dec_and_lock(a, &name##_lock); \
+ }
+
+#define DEFINE_BRLOCK(name) \
+ spinlock_t name##_lock
+#endif
+
+#endif
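As a rough usage sketch (not part of the patch; the lock name "example" and
the call sites below are made up for illustration), a subsystem would declare
the lock in a header with DECLARE_BRLOCK() and define it in exactly one .c
file with DEFINE_BRLOCK():

	/* in a header: */
	DECLARE_BRLOCK(example);

	/* in one .c file: */
	DEFINE_BRLOCK(example);

	void example_subsys_init(void)
	{
		example_lock_init();	/* must run before first use */
	}

	void example_reader(void)
	{
		example_rlock();	/* SMP: spin_lock() on this cpu's lock */
		/* ... read-side critical section ... */
		example_runlock();
	}

	void example_writer(void)
	{
		example_wlock();	/* SMP: locks every online cpu's lock */
		/* ... write-side critical section ... */
		example_wunlock();
	}

On UP, or when lockdep is enabled, all of these collapse to plain
spin_lock/spin_unlock on a single global spinlock.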
* [patch 2/2] fs: scale vfsmount_lock
From: Nick Piggin @ 2010-03-16 12:23 UTC (permalink / raw)
To: Al Viro, Frank Mayhar, John Stultz, Andi Kleen, linux-fsdevel
fs: scale vfsmount_lock
Use a brlock for the vfsmount lock. It must be taken for write whenever
modifying the mount hash or associated fields, and may be taken for read when
performing mount hash lookups.
The slowpath will be made significantly slower due to the use of brlock. On a
64 core, 64 socket, 32 node Altix system (so a decent amount of latency to
remote nodes), a simple umount microbenchmark (mount --bind mnt mnt2 ; umount
mnt2, looped 1000 times) took 6.8s before this patch and 7.1s afterwards, for
about a 5% increase in elapsed time.
The number of atomics should remain the same for fastpath rlock cases, though
code will be slightly larger due to per-cpu access. Scalability will probably
not be much improved in common cases yet, due to other locks getting in the
way. However, independent path lookups over mountpoints should be one case
where scalability is improved.
Signed-off-by: Nick Piggin <npiggin@suse.de>
---
fs/dcache.c | 4 -
fs/namei.c | 13 +--
fs/namespace.c | 178 ++++++++++++++++++++++++++++-----------------
fs/pnode.c | 11 ++
fs/proc/base.c | 4 -
include/linux/mount.h | 4 -
kernel/audit_tree.c | 6 -
security/tomoyo/realpath.c | 4 -
8 files changed, 141 insertions(+), 83 deletions(-)
Index: linux-2.6/fs/dcache.c
===================================================================
--- linux-2.6.orig/fs/dcache.c
+++ linux-2.6/fs/dcache.c
@@ -1928,7 +1928,7 @@ char *__d_path(const struct path *path,
char *end = buffer + buflen;
char *retval;
- spin_lock(&vfsmount_lock);
+ vfsmount_rlock();
prepend(&end, &buflen, "\0", 1);
if (d_unlinked(dentry) &&
(prepend(&end, &buflen, " (deleted)", 10) != 0))
@@ -1964,7 +1964,7 @@ char *__d_path(const struct path *path,
}
out:
- spin_unlock(&vfsmount_lock);
+ vfsmount_runlock();
return retval;
global_root:
@@ -2195,11 +2195,12 @@ int path_is_under(struct path *path1, st
struct vfsmount *mnt = path1->mnt;
struct dentry *dentry = path1->dentry;
int res;
- spin_lock(&vfsmount_lock);
+
+ vfsmount_rlock();
if (mnt != path2->mnt) {
for (;;) {
if (mnt->mnt_parent == mnt) {
- spin_unlock(&vfsmount_lock);
+ vfsmount_runlock();
return 0;
}
if (mnt->mnt_parent == path2->mnt)
@@ -2209,7 +2210,7 @@ int path_is_under(struct path *path1, st
dentry = mnt->mnt_mountpoint;
}
res = is_subdir(dentry, path2->dentry);
- spin_unlock(&vfsmount_lock);
+ vfsmount_runlock();
return res;
}
EXPORT_SYMBOL(path_is_under);
Index: linux-2.6/fs/namei.c
===================================================================
--- linux-2.6.orig/fs/namei.c
+++ linux-2.6/fs/namei.c
@@ -600,15 +600,16 @@ int follow_up(struct path *path)
{
struct vfsmount *parent;
struct dentry *mountpoint;
- spin_lock(&vfsmount_lock);
+
+ vfsmount_rlock();
parent = path->mnt->mnt_parent;
if (parent == path->mnt) {
- spin_unlock(&vfsmount_lock);
+ vfsmount_runlock();
return 0;
}
mntget(parent);
mountpoint = dget(path->mnt->mnt_mountpoint);
- spin_unlock(&vfsmount_lock);
+ vfsmount_runlock();
dput(path->dentry);
path->dentry = mountpoint;
mntput(path->mnt);
Index: linux-2.6/fs/namespace.c
===================================================================
--- linux-2.6.orig/fs/namespace.c
+++ linux-2.6/fs/namespace.c
@@ -11,6 +11,8 @@
#include <linux/syscalls.h>
#include <linux/slab.h>
#include <linux/sched.h>
+#include <linux/spinlock.h>
+#include <linux/percpu.h>
#include <linux/smp_lock.h>
#include <linux/init.h>
#include <linux/kernel.h>
@@ -37,12 +39,10 @@
#define HASH_SHIFT ilog2(PAGE_SIZE / sizeof(struct list_head))
#define HASH_SIZE (1UL << HASH_SHIFT)
-/* spinlock for vfsmount related operations, inplace of dcache_lock */
-__cacheline_aligned_in_smp DEFINE_SPINLOCK(vfsmount_lock);
-
static int event;
static DEFINE_IDA(mnt_id_ida);
static DEFINE_IDA(mnt_group_ida);
+static DEFINE_SPINLOCK(mnt_id_lock);
static int mnt_id_start = 0;
static int mnt_group_start = 1;
@@ -54,6 +54,16 @@ static struct rw_semaphore namespace_sem
struct kobject *fs_kobj;
EXPORT_SYMBOL_GPL(fs_kobj);
+/*
+ * vfsmount lock may be taken for read to prevent changes to the
+ * vfsmount hash, ie. during mountpoint lookups or walking back
+ * up the tree.
+ *
+ * It should be taken for write in all cases where the vfsmount
+ * tree or hash is modified or when a vfsmount structure is modified.
+ */
+DEFINE_BRLOCK(vfsmount);
+
static inline unsigned long hash(struct vfsmount *mnt, struct dentry *dentry)
{
unsigned long tmp = ((unsigned long)mnt / L1_CACHE_BYTES);
@@ -64,18 +74,21 @@ static inline unsigned long hash(struct
#define MNT_WRITER_UNDERFLOW_LIMIT -(1<<16)
-/* allocation is serialized by namespace_sem */
+/*
+ * allocation is serialized by namespace_sem, but we need the spinlock to
+ * serialize with freeing.
+ */
static int mnt_alloc_id(struct vfsmount *mnt)
{
int res;
retry:
ida_pre_get(&mnt_id_ida, GFP_KERNEL);
- spin_lock(&vfsmount_lock);
+ spin_lock(&mnt_id_lock);
res = ida_get_new_above(&mnt_id_ida, mnt_id_start, &mnt->mnt_id);
if (!res)
mnt_id_start = mnt->mnt_id + 1;
- spin_unlock(&vfsmount_lock);
+ spin_unlock(&mnt_id_lock);
if (res == -EAGAIN)
goto retry;
@@ -85,11 +98,11 @@ retry:
static void mnt_free_id(struct vfsmount *mnt)
{
int id = mnt->mnt_id;
- spin_lock(&vfsmount_lock);
+ spin_lock(&mnt_id_lock);
ida_remove(&mnt_id_ida, id);
if (mnt_id_start > id)
mnt_id_start = id;
- spin_unlock(&vfsmount_lock);
+ spin_unlock(&mnt_id_lock);
}
/*
@@ -344,7 +357,7 @@ static int mnt_make_readonly(struct vfsm
{
int ret = 0;
- spin_lock(&vfsmount_lock);
+ vfsmount_wlock();
mnt->mnt_flags |= MNT_WRITE_HOLD;
/*
* After storing MNT_WRITE_HOLD, we'll read the counters. This store
@@ -378,15 +391,15 @@ static int mnt_make_readonly(struct vfsm
*/
smp_wmb();
mnt->mnt_flags &= ~MNT_WRITE_HOLD;
- spin_unlock(&vfsmount_lock);
+ vfsmount_wunlock();
return ret;
}
static void __mnt_unmake_readonly(struct vfsmount *mnt)
{
- spin_lock(&vfsmount_lock);
+ vfsmount_wlock();
mnt->mnt_flags &= ~MNT_READONLY;
- spin_unlock(&vfsmount_lock);
+ vfsmount_wunlock();
}
void simple_set_mnt(struct vfsmount *mnt, struct super_block *sb)
@@ -410,6 +423,7 @@ void free_vfsmnt(struct vfsmount *mnt)
/*
* find the first or last mount at @dentry on vfsmount @mnt depending on
* @dir. If @dir is set return the first mount else return the last mount.
+ * vfsmount_lock must be held for read or write.
*/
struct vfsmount *__lookup_mnt(struct vfsmount *mnt, struct dentry *dentry,
int dir)
@@ -439,10 +453,11 @@ struct vfsmount *__lookup_mnt(struct vfs
struct vfsmount *lookup_mnt(struct path *path)
{
struct vfsmount *child_mnt;
- spin_lock(&vfsmount_lock);
+
+ vfsmount_rlock();
if ((child_mnt = __lookup_mnt(path->mnt, path->dentry, 1)))
mntget(child_mnt);
- spin_unlock(&vfsmount_lock);
+ vfsmount_runlock();
return child_mnt;
}
@@ -451,6 +466,9 @@ static inline int check_mnt(struct vfsmo
return mnt->mnt_ns == current->nsproxy->mnt_ns;
}
+/*
+ * vfsmount lock must be held for write
+ */
static void touch_mnt_namespace(struct mnt_namespace *ns)
{
if (ns) {
@@ -459,6 +477,9 @@ static void touch_mnt_namespace(struct m
}
}
+/*
+ * vfsmount lock must be held for write
+ */
static void __touch_mnt_namespace(struct mnt_namespace *ns)
{
if (ns && ns->event != event) {
@@ -467,6 +488,9 @@ static void __touch_mnt_namespace(struct
}
}
+/*
+ * vfsmount lock must be held for write
+ */
static void detach_mnt(struct vfsmount *mnt, struct path *old_path)
{
old_path->dentry = mnt->mnt_mountpoint;
@@ -478,6 +502,9 @@ static void detach_mnt(struct vfsmount *
old_path->dentry->d_mounted--;
}
+/*
+ * vfsmount lock must be held for write
+ */
void mnt_set_mountpoint(struct vfsmount *mnt, struct dentry *dentry,
struct vfsmount *child_mnt)
{
@@ -486,6 +513,9 @@ void mnt_set_mountpoint(struct vfsmount
dentry->d_mounted++;
}
+/*
+ * vfsmount lock must be held for write
+ */
static void attach_mnt(struct vfsmount *mnt, struct path *path)
{
mnt_set_mountpoint(path->mnt, path->dentry, mnt);
@@ -495,7 +525,7 @@ static void attach_mnt(struct vfsmount *
}
/*
- * the caller must hold vfsmount_lock
+ * vfsmount lock must be held for write
*/
static void commit_tree(struct vfsmount *mnt)
{
@@ -618,40 +648,41 @@ static inline void __mntput(struct vfsmo
void mntput_no_expire(struct vfsmount *mnt)
{
repeat:
- if (atomic_dec_and_lock(&mnt->mnt_count, &vfsmount_lock)) {
- if (likely(!mnt->mnt_pinned)) {
- spin_unlock(&vfsmount_lock);
- __mntput(mnt);
- return;
- }
- atomic_add(mnt->mnt_pinned + 1, &mnt->mnt_count);
- mnt->mnt_pinned = 0;
- spin_unlock(&vfsmount_lock);
- acct_auto_close_mnt(mnt);
- security_sb_umount_close(mnt);
- goto repeat;
+ if (!vfsmount_atomic_dec_and_wlock(&mnt->mnt_count))
+ return;
+
+ if (likely(!mnt->mnt_pinned)) {
+ vfsmount_wunlock();
+ __mntput(mnt);
+ return;
}
+ atomic_add(mnt->mnt_pinned + 1, &mnt->mnt_count);
+ mnt->mnt_pinned = 0;
+ vfsmount_wunlock();
+ acct_auto_close_mnt(mnt);
+ security_sb_umount_close(mnt);
+ goto repeat;
}
EXPORT_SYMBOL(mntput_no_expire);
void mnt_pin(struct vfsmount *mnt)
{
- spin_lock(&vfsmount_lock);
+ vfsmount_wlock();
mnt->mnt_pinned++;
- spin_unlock(&vfsmount_lock);
+ vfsmount_wunlock();
}
EXPORT_SYMBOL(mnt_pin);
void mnt_unpin(struct vfsmount *mnt)
{
- spin_lock(&vfsmount_lock);
+ vfsmount_wlock();
if (mnt->mnt_pinned) {
atomic_inc(&mnt->mnt_count);
mnt->mnt_pinned--;
}
- spin_unlock(&vfsmount_lock);
+ vfsmount_wunlock();
}
EXPORT_SYMBOL(mnt_unpin);
@@ -742,12 +773,12 @@ int mnt_had_events(struct proc_mounts *p
struct mnt_namespace *ns = p->ns;
int res = 0;
- spin_lock(&vfsmount_lock);
+ vfsmount_rlock();
if (p->event != ns->event) {
p->event = ns->event;
res = 1;
}
- spin_unlock(&vfsmount_lock);
+ vfsmount_runlock();
return res;
}
@@ -949,12 +980,12 @@ int may_umount_tree(struct vfsmount *mnt
int minimum_refs = 0;
struct vfsmount *p;
- spin_lock(&vfsmount_lock);
+ vfsmount_rlock();
for (p = mnt; p; p = next_mnt(p, mnt)) {
actual_refs += atomic_read(&p->mnt_count);
minimum_refs += 2;
}
- spin_unlock(&vfsmount_lock);
+ vfsmount_runlock();
if (actual_refs > minimum_refs)
return 0;
@@ -981,10 +1012,10 @@ int may_umount(struct vfsmount *mnt)
{
int ret = 1;
down_read(&namespace_sem);
- spin_lock(&vfsmount_lock);
+ vfsmount_rlock();
if (propagate_mount_busy(mnt, 2))
ret = 0;
- spin_unlock(&vfsmount_lock);
+ vfsmount_runlock();
up_read(&namespace_sem);
return ret;
}
@@ -1000,13 +1031,14 @@ void release_mounts(struct list_head *he
if (mnt->mnt_parent != mnt) {
struct dentry *dentry;
struct vfsmount *m;
- spin_lock(&vfsmount_lock);
+
+ vfsmount_wlock();
dentry = mnt->mnt_mountpoint;
m = mnt->mnt_parent;
mnt->mnt_mountpoint = mnt->mnt_root;
mnt->mnt_parent = mnt;
m->mnt_ghosts--;
- spin_unlock(&vfsmount_lock);
+ vfsmount_wunlock();
dput(dentry);
mntput(m);
}
@@ -1014,6 +1046,10 @@ void release_mounts(struct list_head *he
}
}
+/*
+ * vfsmount lock must be held for write
+ * namespace_sem must be held for write
+ */
void umount_tree(struct vfsmount *mnt, int propagate, struct list_head *kill)
{
struct vfsmount *p;
@@ -1104,7 +1140,7 @@ static int do_umount(struct vfsmount *mn
}
down_write(&namespace_sem);
- spin_lock(&vfsmount_lock);
+ vfsmount_wlock();
event++;
if (!(flags & MNT_DETACH))
@@ -1116,7 +1152,7 @@ static int do_umount(struct vfsmount *mn
umount_tree(mnt, 1, &umount_list);
retval = 0;
}
- spin_unlock(&vfsmount_lock);
+ vfsmount_wunlock();
if (retval)
security_sb_umount_busy(mnt);
up_write(&namespace_sem);
@@ -1230,19 +1266,19 @@ struct vfsmount *copy_tree(struct vfsmou
q = clone_mnt(p, p->mnt_root, flag);
if (!q)
goto Enomem;
- spin_lock(&vfsmount_lock);
+ vfsmount_wlock();
list_add_tail(&q->mnt_list, &res->mnt_list);
attach_mnt(q, &path);
- spin_unlock(&vfsmount_lock);
+ vfsmount_wunlock();
}
}
return res;
Enomem:
if (res) {
LIST_HEAD(umount_list);
- spin_lock(&vfsmount_lock);
+ vfsmount_wlock();
umount_tree(res, 0, &umount_list);
- spin_unlock(&vfsmount_lock);
+ vfsmount_wunlock();
release_mounts(&umount_list);
}
return NULL;
@@ -1261,9 +1297,9 @@ void drop_collected_mounts(struct vfsmou
{
LIST_HEAD(umount_list);
down_write(&namespace_sem);
- spin_lock(&vfsmount_lock);
+ vfsmount_wlock();
umount_tree(mnt, 0, &umount_list);
- spin_unlock(&vfsmount_lock);
+ vfsmount_wunlock();
up_write(&namespace_sem);
release_mounts(&umount_list);
}
@@ -1391,7 +1427,7 @@ static int attach_recursive_mnt(struct v
if (err)
goto out_cleanup_ids;
- spin_lock(&vfsmount_lock);
+ vfsmount_wlock();
if (IS_MNT_SHARED(dest_mnt)) {
for (p = source_mnt; p; p = next_mnt(p, source_mnt))
@@ -1410,7 +1446,8 @@ static int attach_recursive_mnt(struct v
list_del_init(&child->mnt_hash);
commit_tree(child);
}
- spin_unlock(&vfsmount_lock);
+ vfsmount_wunlock();
+
return 0;
out_cleanup_ids:
@@ -1472,10 +1509,10 @@ static int do_change_type(struct path *p
goto out_unlock;
}
- spin_lock(&vfsmount_lock);
+ vfsmount_wlock();
for (m = mnt; m; m = (recurse ? next_mnt(m, mnt) : NULL))
change_mnt_propagation(m, type);
- spin_unlock(&vfsmount_lock);
+ vfsmount_wunlock();
out_unlock:
up_write(&namespace_sem);
@@ -1519,9 +1556,10 @@ static int do_loopback(struct path *path
err = graft_tree(mnt, path);
if (err) {
LIST_HEAD(umount_list);
- spin_lock(&vfsmount_lock);
+
+ vfsmount_wlock();
umount_tree(mnt, 0, &umount_list);
- spin_unlock(&vfsmount_lock);
+ vfsmount_wunlock();
release_mounts(&umount_list);
}
@@ -1574,18 +1612,18 @@ static int do_remount(struct path *path,
else
err = do_remount_sb(sb, flags, data, 0);
if (!err) {
- spin_lock(&vfsmount_lock);
+ vfsmount_wlock();
mnt_flags |= path->mnt->mnt_flags & MNT_PROPAGATION_MASK;
path->mnt->mnt_flags = mnt_flags;
- spin_unlock(&vfsmount_lock);
+ vfsmount_wunlock();
}
up_write(&sb->s_umount);
if (!err) {
security_sb_post_remount(path->mnt, flags, data);
- spin_lock(&vfsmount_lock);
+ vfsmount_wlock();
touch_mnt_namespace(path->mnt->mnt_ns);
- spin_unlock(&vfsmount_lock);
+ vfsmount_wunlock();
}
return err;
}
@@ -1762,7 +1800,7 @@ void mark_mounts_for_expiry(struct list_
return;
down_write(&namespace_sem);
- spin_lock(&vfsmount_lock);
+ vfsmount_wlock();
/* extract from the expiration list every vfsmount that matches the
* following criteria:
@@ -1781,7 +1819,7 @@ void mark_mounts_for_expiry(struct list_
touch_mnt_namespace(mnt->mnt_ns);
umount_tree(mnt, 1, &umounts);
}
- spin_unlock(&vfsmount_lock);
+ vfsmount_wunlock();
up_write(&namespace_sem);
release_mounts(&umounts);
@@ -1838,6 +1876,8 @@ resume:
/*
* process a list of expirable mountpoints with the intent of discarding any
* submounts of a specific parent mountpoint
+ *
+ * vfsmount_lock must be held for write
*/
static void shrink_submounts(struct vfsmount *mnt, struct list_head *umounts)
{
@@ -2056,9 +2096,9 @@ static struct mnt_namespace *dup_mnt_ns(
kfree(new_ns);
return ERR_PTR(-ENOMEM);
}
- spin_lock(&vfsmount_lock);
+ vfsmount_wlock();
list_add_tail(&new_ns->list, &new_ns->root->mnt_list);
- spin_unlock(&vfsmount_lock);
+ vfsmount_wunlock();
/*
* Second pass: switch the tsk->fs->* elements and mark new vfsmounts
@@ -2255,7 +2295,7 @@ SYSCALL_DEFINE2(pivot_root, const char _
goto out2; /* not attached */
/* make sure we can reach put_old from new_root */
tmp = old.mnt;
- spin_lock(&vfsmount_lock);
+ vfsmount_wlock();
if (tmp != new.mnt) {
for (;;) {
if (tmp->mnt_parent == tmp)
@@ -2275,7 +2315,7 @@ SYSCALL_DEFINE2(pivot_root, const char _
/* mount new_root on / */
attach_mnt(new.mnt, &root_parent);
touch_mnt_namespace(current->nsproxy->mnt_ns);
- spin_unlock(&vfsmount_lock);
+ vfsmount_wunlock();
chroot_fs_refs(&root, &new);
security_sb_post_pivotroot(&root, &new);
error = 0;
@@ -2291,7 +2331,7 @@ out1:
out0:
return error;
out3:
- spin_unlock(&vfsmount_lock);
+ vfsmount_wunlock();
goto out2;
}
@@ -2338,6 +2378,8 @@ void __init mnt_init(void)
for (u = 0; u < HASH_SIZE; u++)
INIT_LIST_HEAD(&mount_hashtable[u]);
+ vfsmount_lock_init();
+
err = sysfs_init();
if (err)
printk(KERN_WARNING "%s: sysfs_init error: %d\n",
@@ -2356,9 +2398,9 @@ void put_mnt_ns(struct mnt_namespace *ns
if (!atomic_dec_and_test(&ns->count))
return;
down_write(&namespace_sem);
- spin_lock(&vfsmount_lock);
+ vfsmount_wlock();
umount_tree(ns->root, 0, &umount_list);
- spin_unlock(&vfsmount_lock);
+ vfsmount_wunlock();
up_write(&namespace_sem);
release_mounts(&umount_list);
kfree(ns);
Index: linux-2.6/fs/pnode.c
===================================================================
--- linux-2.6.orig/fs/pnode.c
+++ linux-2.6/fs/pnode.c
@@ -126,6 +126,9 @@ static int do_make_slave(struct vfsmount
return 0;
}
+/*
+ * vfsmount lock must be held for write
+ */
void change_mnt_propagation(struct vfsmount *mnt, int type)
{
if (type == MS_SHARED) {
@@ -270,12 +273,12 @@ int propagate_mnt(struct vfsmount *dest_
prev_src_mnt = child;
}
out:
- spin_lock(&vfsmount_lock);
+ vfsmount_wlock();
while (!list_empty(&tmp_list)) {
child = list_first_entry(&tmp_list, struct vfsmount, mnt_hash);
umount_tree(child, 0, &umount_list);
}
- spin_unlock(&vfsmount_lock);
+ vfsmount_wunlock();
release_mounts(&umount_list);
return ret;
}
@@ -296,6 +299,8 @@ static inline int do_refcount_check(stru
* other mounts its parent propagates to.
* Check if any of these mounts that **do not have submounts**
* have more references than 'refcnt'. If so return busy.
+ *
+ * vfsmount lock must be held for read or write
*/
int propagate_mount_busy(struct vfsmount *mnt, int refcnt)
{
@@ -353,6 +358,8 @@ static void __propagate_umount(struct vf
* collect all mounts that receive propagation from the mount in @list,
* and return these additional mounts in the same list.
* @list: the list of mounts to be unmounted.
+ *
+ * vfsmount lock must be held for write
*/
int propagate_umount(struct list_head *list)
{
Index: linux-2.6/fs/internal.h
===================================================================
--- linux-2.6.orig/fs/internal.h
+++ linux-2.6/fs/internal.h
@@ -9,6 +9,8 @@
* 2 of the License, or (at your option) any later version.
*/
+#include <linux/brlock.h>
+
struct super_block;
struct linux_binprm;
struct path;
@@ -70,7 +72,8 @@ extern struct vfsmount *copy_tree(struct
extern void __init mnt_init(void);
-extern spinlock_t vfsmount_lock;
+DECLARE_BRLOCK(vfsmount);
+
/*
* fs_struct.c
* Re: [patch 2/2] fs: scale vfsmount_lock
From: Nick Piggin @ 2010-03-16 12:28 UTC (permalink / raw)
To: Al Viro, Frank Mayhar, John Stultz, Andi Kleen, linux-fsdevel
On Tue, Mar 16, 2010 at 11:23:20PM +1100, Nick Piggin wrote:
> fs: scale vfsmount_lock
>
> Use a brlock for the vfsmount lock. It must be taken for write whenever
> modifying the mount hash or associated fields, and may be taken for read when
> performing mount hash lookups.
>
> The slowpath will be made significantly slower due to the use of brlock. On a
> 64 core, 64 socket, 32 node Altix system (so a decent amount of latency to
> remote nodes), a simple umount microbenchmark (mount --bind mnt mnt2 ; umount
> mnt2, looped 1000 times) took 6.8s before this patch and 7.1s afterwards, for
> about a 5% increase in elapsed time.
>
> The number of atomics should remain the same for fastpath rlock cases, though
> code will be slightly larger due to per-cpu access. Scalability will probably
> not be much improved in common cases yet, due to other locks getting in the
> way. However, independent path lookups over mountpoints should be one case
> where scalability is improved.
>
> Signed-off-by: Nick Piggin <npiggin@suse.de>
> ---
> fs/dcache.c | 4 -
> fs/namei.c | 13 +--
> fs/namespace.c | 178 ++++++++++++++++++++++++++++-----------------
> fs/pnode.c | 11 ++
> fs/proc/base.c | 4 -
> include/linux/mount.h | 4 -
> kernel/audit_tree.c | 6 -
> security/tomoyo/realpath.c | 4 -
> 8 files changed, 141 insertions(+), 83 deletions(-)
This diffstat is obviously not refreshed since your vfsmount lock
cleanup, sorry.
* Re: [patch 1/2] kernel: introduce brlock
From: Andreas Dilger @ 2010-03-16 19:01 UTC (permalink / raw)
To: Nick Piggin; +Cc: Al Viro, Frank Mayhar, John Stultz, Andi Kleen, linux-fsdevel
On 2010-03-16, at 06:22, Nick Piggin wrote:
> +#define DEFINE_BRLOCK(name) \
> + DEFINE_PER_CPU(spinlock_t, name##_lock); \
> + void name##_lock_init(void) { \
> + void name##_wlock(void) { \
> + void name##_wunlock(void) { \
> + int name##_atomic_dec_and_wlock__failed(atomic_t *a) {
What makes these macros unpleasant is that it is no longer possible to
tag to the implementation to see what it does, since there is no real
declaration for these locks.
Is it possible to change the macros to take the lock name as a
parameter, like normal lock/unlock functions do, and then have a
single declaration for br_lock_init(), br_wlock(), etc. macros?
Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
* Re: [patch 1/2] kernel: introduce brlock
From: Frank Mayhar @ 2010-03-16 20:12 UTC (permalink / raw)
To: Andreas Dilger
Cc: Nick Piggin, Al Viro, John Stultz, Andi Kleen, linux-fsdevel
On Tue, 2010-03-16 at 13:01 -0600, Andreas Dilger wrote:
> On 2010-03-16, at 06:22, Nick Piggin wrote:
> > +#define DEFINE_BRLOCK(name) \
> > + DEFINE_PER_CPU(spinlock_t, name##_lock); \
> > + void name##_lock_init(void) { \
> > + void name##_wlock(void) { \
> > + void name##_wunlock(void) { \
> > + int name##_atomic_dec_and_wlock__failed(atomic_t *a) {
>
> What makes these macros unpleasant is that it is no longer possible to
> tag to the implementation to see what it does, since there is no real
> declaration for these locks.
>
> Is it possible to change the macros to take the lock name as a
> parameter, like normal lock/unlock functions do, and then have a
> single declaration for br_lock_init(), br_wlock(), etc. macros?
This gets my vote as well. (I've been repeatedly annoyed by some of the
buffer routines that are constructed this way.)
--
Frank Mayhar <fmayhar@google.com>
Google, Inc.
* Re: [patch 1/2] kernel: introduce brlock
From: Nick Piggin @ 2010-03-16 23:44 UTC (permalink / raw)
To: Andreas Dilger
Cc: Al Viro, Frank Mayhar, John Stultz, Andi Kleen, linux-fsdevel
On Tue, Mar 16, 2010 at 01:01:09PM -0600, Andreas Dilger wrote:
> On 2010-03-16, at 06:22, Nick Piggin wrote:
> >+#define DEFINE_BRLOCK(name) \
> >+ DEFINE_PER_CPU(spinlock_t, name##_lock); \
> >+ void name##_lock_init(void) { \
> >+ void name##_wlock(void) { \
> >+ void name##_wunlock(void) { \
> >+ int name##_atomic_dec_and_wlock__failed(atomic_t *a) {
>
> What makes these macros unpleasant is that it is no longer possible
> to tag to the implementation to see what it does, since there is no
> real declaration for these locks.
>
> Is it possible to change the macros to take the lock name as a
> parameter, like normal lock/unlock functions do, and then have a
> single declaration for br_lock_init(), br_wlock(), etc. macros?
The problem is that then you can't do out of line functions, and
things like wlock/wunlock are rather large.
What I think I can do is add macros in the brlock.h file
#define br_rlock(name) name##_rlock()
So the macro calls the right function and your tag should take
you pretty close to the right place.
Any better ideas how to implement this nicely would be welcome.
It must be as light-weight as possible in the rlock path though.
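Concretely, for a lock declared with DECLARE_BRLOCK(vfsmount) as in patch
2/2, the wrappers would just paste the name (only a sketch of the idea; the
call site below is hypothetical):

	#define br_rlock(name)		name##_rlock()
	#define br_runlock(name)	name##_runlock()

	static void example_lookup(void)
	{
		br_rlock(vfsmount);	/* expands to vfsmount_rlock() */
		/* ... mount hash lookup ... */
		br_runlock(vfsmount);	/* expands to vfsmount_runlock() */
	}

Since the expansion is still a direct call to the per-name function, the
read path stays exactly as cheap as calling vfsmount_rlock() directly.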
* Re: [patch 1/2] kernel: introduce brlock
From: Nick Piggin @ 2010-03-17 14:18 UTC (permalink / raw)
To: Andreas Dilger
Cc: Al Viro, Frank Mayhar, John Stultz, Andi Kleen, linux-fsdevel
On Wed, Mar 17, 2010 at 10:44:40AM +1100, Nick Piggin wrote:
> On Tue, Mar 16, 2010 at 01:01:09PM -0600, Andreas Dilger wrote:
> > On 2010-03-16, at 06:22, Nick Piggin wrote:
> > What makes these macros unpleasant is that it is no longer possible
> > to tag to the implementation to see what it does, since there is no
> > real declaration for these locks.
> >
> > Is it possible to change the macros to take the lock name as a
> > parameter, like normal lock/unlock functions do, and then have a
> > single declaration for br_lock_init(), br_wlock(), etc. macros?
>
> The problem is that then you can't do out of line functions, and
> things like wlock/wunlock are rather large.
>
> What I think I can do is add macros in the brlock.h file
>
> #define br_rlock(name) name##_rlock()
>
> So the macro calls the right function and your tag should take
> you pretty close to the right place.
>
> Any better ideas how to implement this nicely would be welcome.
> It must be as light-weight as possible in the rlock path though.
It looks like this. Is it better?
--
brlock: introduce special brlocks
This patch introduces special brlocks. These can only be used as global
locks, and they use some preprocessor trickery to allow us to retain a more
optimal per-cpu lock implementation. We don't bother working around
lockdep yet.
The other thing we can do in future is a really neat atomic-free
implementation like Dave M did for the old brlocks, so we might actually
be able to speed up the single-thread path for these things.
Signed-off-by: Nick Piggin <npiggin@suse.de>
---
include/linux/brlock.h | 120 +++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 120 insertions(+)
Index: linux-2.6/include/linux/brlock.h
===================================================================
--- /dev/null
+++ linux-2.6/include/linux/brlock.h
@@ -0,0 +1,120 @@
+/*
+ * Specialised big-reader spinlock. Can only be declared as global variables
+ * to avoid overhead and keep things simple (and we don't want to start using
+ * these inside dynamically allocated structures).
+ *
+ * Copyright 2009, Nick Piggin, Novell Inc.
+ */
+#ifndef __LINUX_BRLOCK_H
+#define __LINUX_BRLOCK_H
+
+#include <linux/spinlock.h>
+#include <linux/percpu.h>
+#include <asm/atomic.h>
+
+#define br_lock_init(name) name##_lock_init()
+#define br_read_lock(name) name##_read_lock()
+#define br_read_unlock(name) name##_read_unlock()
+#define br_write_lock(name) name##_write_lock()
+#define br_write_unlock(name) name##_write_unlock()
+#define atomic_dec_and_br_read_lock(atomic, name) name##_atomic_dec_and_read_lock(atomic)
+#define atomic_dec_and_br_write_lock(atomic, name) name##_atomic_dec_and_write_lock(atomic)
+
+#if defined(CONFIG_SMP) && !defined(CONFIG_LOCKDEP)
+#define DECLARE_BRLOCK(name) \
+ DECLARE_PER_CPU(spinlock_t, name##_lock); \
+ extern void name##_lock_init(void); \
+ static inline void name##_read_lock(void) { \
+ spinlock_t *lock; \
+ lock = &get_cpu_var(name##_lock); \
+ spin_lock(lock); \
+ put_cpu_var(name##_lock); \
+ } \
+ static inline void name##_read_unlock(void) { \
+ spinlock_t *lock; \
+ lock = &__get_cpu_var(name##_lock); \
+ spin_unlock(lock); \
+ } \
+ extern void name##_write_lock(void); \
+ extern void name##_write_unlock(void); \
+ static inline int name##_atomic_dec_and_read_lock(atomic_t *a) { \
+ int ret; \
+ spinlock_t *lock; \
+ lock = &get_cpu_var(name##_lock); \
+ ret = atomic_dec_and_lock(a, lock); \
+ put_cpu_var(name##_lock); \
+ return ret; \
+ } \
+ extern int name##_atomic_dec_and_write_lock__failed(atomic_t *a); \
+ static inline int name##_atomic_dec_and_write_lock(atomic_t *a) { \
+ if (atomic_add_unless(a, -1, 1)) \
+ return 0; \
+ return name##_atomic_dec_and_write_lock__failed(a); \
+ }
+
+#define DEFINE_BRLOCK(name) \
+ DEFINE_PER_CPU(spinlock_t, name##_lock); \
+ void name##_lock_init(void) { \
+ int i; \
+ for_each_possible_cpu(i) { \
+ spinlock_t *lock; \
+ lock = &per_cpu(name##_lock, i); \
+ spin_lock_init(lock); \
+ } \
+ } \
+ void name##_write_lock(void) { \
+ int i; \
+ for_each_online_cpu(i) { \
+ spinlock_t *lock; \
+ lock = &per_cpu(name##_lock, i); \
+ spin_lock(lock); \
+ } \
+ } \
+ void name##_write_unlock(void) { \
+ int i; \
+ for_each_online_cpu(i) { \
+ spinlock_t *lock; \
+ lock = &per_cpu(name##_lock, i); \
+ spin_unlock(lock); \
+ } \
+ } \
+ int name##_atomic_dec_and_write_lock__failed(atomic_t *a) { \
+ name##_write_lock(); \
+ if (!atomic_dec_and_test(a)) { \
+ name##_write_unlock(); \
+ return 0; \
+ } \
+ return 1; \
+ }
+
+#else
+
+#define DECLARE_BRLOCK(name) \
+ extern spinlock_t name##_lock; \
+ static inline void name##_lock_init(void) { \
+ spin_lock_init(&name##_lock); \
+ } \
+ static inline void name##_read_lock(void) { \
+ spin_lock(&name##_lock); \
+ } \
+ static inline void name##_read_unlock(void) { \
+ spin_unlock(&name##_lock); \
+ } \
+ static inline void name##_write_lock(void) { \
+ spin_lock(&name##_lock); \
+ } \
+ static inline void name##_write_unlock(void) { \
+ spin_unlock(&name##_lock); \
+ } \
+ static inline int name##_atomic_dec_and_read_lock(atomic_t *a) { \
+ return atomic_dec_and_lock(a, &name##_lock); \
+ } \
+ static inline int name##_atomic_dec_and_write_lock(atomic_t *a) { \
+ return atomic_dec_and_lock(a, &name##_lock); \
+ }
+
+#define DEFINE_BRLOCK(name) \
+ spinlock_t name##_lock
+#endif
+
+#endif
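As before, a rough usage sketch with the new spelling (the lock name
"example_lock" and the call sites are made up, not part of the patch):

	/* DECLARE_BRLOCK(example_lock) in a header, then in one .c file: */
	DEFINE_BRLOCK(example_lock);

	static void example_init(void)
	{
		br_lock_init(example_lock);	/* once, before first use */
	}

	static void example_reader(void)
	{
		br_read_lock(example_lock);	/* -> example_lock_read_lock() */
		/* ... read-side critical section ... */
		br_read_unlock(example_lock);
	}

	static void example_writer(void)
	{
		br_write_lock(example_lock);	/* SMP: locks every online cpu's lock */
		/* ... write-side critical section ... */
		br_write_unlock(example_lock);
	}

The br_*() wrappers only paste the name, so tagging or grepping for
br_read_lock lands in brlock.h, and from there the name##_read_lock()
definition is one step away.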
* Re: [patch 2/2] fs: scale vfsmount_lock
From: Nick Piggin @ 2010-03-17 14:20 UTC (permalink / raw)
To: Andreas Dilger, Al Viro, Frank Mayhar, John Stultz, Andi Kleen, linux-fsdevel
Here is patch 2 with the new syntax.
--
fs: brlock vfsmount_lock
Use a brlock for the vfsmount lock. It must be taken for write whenever
modifying the mount hash or associated fields, and may be taken for read when
performing mount hash lookups.
The slowpath will be made significantly slower due to the use of brlock. On a
64 core, 64 socket, 32 node Altix system (so a decent amount of latency to
remote nodes), a simple umount microbenchmark (mount --bind mnt mnt2 ; umount
mnt2, looped 1000 times) took 6.8s before this patch and 7.1s afterwards, for
about a 5% increase in elapsed time.
The number of atomics should remain the same for fastpath rlock cases, though
code will be slightly larger due to per-cpu access. Scalability will probably
not be much improved in common cases yet, due to other locks getting in the
way. However, independent path lookups over mountpoints should be one case
where scalability is improved.
Signed-off-by: Nick Piggin <npiggin@suse.de>
---
fs/dcache.c | 11 +--
fs/internal.h | 5 +
fs/namei.c | 7 +-
fs/namespace.c | 174 +++++++++++++++++++++++++++++++++++----------------------
fs/pnode.c | 11 ++-
5 files changed, 131 insertions(+), 77 deletions(-)
Index: linux-2.6/fs/dcache.c
===================================================================
--- linux-2.6.orig/fs/dcache.c
+++ linux-2.6/fs/dcache.c
@@ -1928,7 +1928,7 @@ char *__d_path(const struct path *path,
char *end = buffer + buflen;
char *retval;
- spin_lock(&vfsmount_lock);
+ br_read_lock(vfsmount_lock);
prepend(&end, &buflen, "\0", 1);
if (d_unlinked(dentry) &&
(prepend(&end, &buflen, " (deleted)", 10) != 0))
@@ -1964,7 +1964,7 @@ char *__d_path(const struct path *path,
}
out:
- spin_unlock(&vfsmount_lock);
+ br_read_unlock(vfsmount_lock);
return retval;
global_root:
@@ -2195,11 +2195,12 @@ int path_is_under(struct path *path1, st
struct vfsmount *mnt = path1->mnt;
struct dentry *dentry = path1->dentry;
int res;
- spin_lock(&vfsmount_lock);
+
+ br_read_lock(vfsmount_lock);
if (mnt != path2->mnt) {
for (;;) {
if (mnt->mnt_parent == mnt) {
- spin_unlock(&vfsmount_lock);
+ br_read_unlock(vfsmount_lock);
return 0;
}
if (mnt->mnt_parent == path2->mnt)
@@ -2209,7 +2210,7 @@ int path_is_under(struct path *path1, st
dentry = mnt->mnt_mountpoint;
}
res = is_subdir(dentry, path2->dentry);
- spin_unlock(&vfsmount_lock);
+ br_read_unlock(vfsmount_lock);
return res;
}
EXPORT_SYMBOL(path_is_under);
Index: linux-2.6/fs/namei.c
===================================================================
--- linux-2.6.orig/fs/namei.c
+++ linux-2.6/fs/namei.c
@@ -600,15 +600,16 @@ int follow_up(struct path *path)
{
struct vfsmount *parent;
struct dentry *mountpoint;
- spin_lock(&vfsmount_lock);
+
+ br_read_lock(vfsmount_lock);
parent = path->mnt->mnt_parent;
if (parent == path->mnt) {
- spin_unlock(&vfsmount_lock);
+ br_read_unlock(vfsmount_lock);
return 0;
}
mntget(parent);
mountpoint = dget(path->mnt->mnt_mountpoint);
- spin_unlock(&vfsmount_lock);
+ br_read_unlock(vfsmount_lock);
dput(path->dentry);
path->dentry = mountpoint;
mntput(path->mnt);
Index: linux-2.6/fs/namespace.c
===================================================================
--- linux-2.6.orig/fs/namespace.c
+++ linux-2.6/fs/namespace.c
@@ -11,6 +11,8 @@
#include <linux/syscalls.h>
#include <linux/slab.h>
#include <linux/sched.h>
+#include <linux/spinlock.h>
+#include <linux/percpu.h>
#include <linux/smp_lock.h>
#include <linux/init.h>
#include <linux/kernel.h>
@@ -37,12 +39,10 @@
#define HASH_SHIFT ilog2(PAGE_SIZE / sizeof(struct list_head))
#define HASH_SIZE (1UL << HASH_SHIFT)
-/* spinlock for vfsmount related operations, inplace of dcache_lock */
-__cacheline_aligned_in_smp DEFINE_SPINLOCK(vfsmount_lock);
-
static int event;
static DEFINE_IDA(mnt_id_ida);
static DEFINE_IDA(mnt_group_ida);
+static DEFINE_SPINLOCK(mnt_id_lock);
static int mnt_id_start = 0;
static int mnt_group_start = 1;
@@ -54,6 +54,16 @@ static struct rw_semaphore namespace_sem
struct kobject *fs_kobj;
EXPORT_SYMBOL_GPL(fs_kobj);
+/*
+ * vfsmount lock may be taken for read to prevent changes to the
+ * vfsmount hash, ie. during mountpoint lookups or walking back
+ * up the tree.
+ *
+ * It should be taken for write in all cases where the vfsmount
+ * tree or hash is modified or when a vfsmount structure is modified.
+ */
+DEFINE_BRLOCK(vfsmount_lock);
+
static inline unsigned long hash(struct vfsmount *mnt, struct dentry *dentry)
{
unsigned long tmp = ((unsigned long)mnt / L1_CACHE_BYTES);
@@ -64,18 +74,21 @@ static inline unsigned long hash(struct
#define MNT_WRITER_UNDERFLOW_LIMIT -(1<<16)
-/* allocation is serialized by namespace_sem */
+/*
+ * allocation is serialized by namespace_sem, but we need the spinlock to
+ * serialize with freeing.
+ */
static int mnt_alloc_id(struct vfsmount *mnt)
{
int res;
retry:
ida_pre_get(&mnt_id_ida, GFP_KERNEL);
- spin_lock(&vfsmount_lock);
+ spin_lock(&mnt_id_lock);
res = ida_get_new_above(&mnt_id_ida, mnt_id_start, &mnt->mnt_id);
if (!res)
mnt_id_start = mnt->mnt_id + 1;
- spin_unlock(&vfsmount_lock);
+ spin_unlock(&mnt_id_lock);
if (res == -EAGAIN)
goto retry;
@@ -85,11 +98,11 @@ retry:
static void mnt_free_id(struct vfsmount *mnt)
{
int id = mnt->mnt_id;
- spin_lock(&vfsmount_lock);
+ spin_lock(&mnt_id_lock);
ida_remove(&mnt_id_ida, id);
if (mnt_id_start > id)
mnt_id_start = id;
- spin_unlock(&vfsmount_lock);
+ spin_unlock(&mnt_id_lock);
}
/*
@@ -344,7 +357,7 @@ static int mnt_make_readonly(struct vfsm
{
int ret = 0;
- spin_lock(&vfsmount_lock);
+ br_write_lock(vfsmount_lock);
mnt->mnt_flags |= MNT_WRITE_HOLD;
/*
* After storing MNT_WRITE_HOLD, we'll read the counters. This store
@@ -378,15 +391,15 @@ static int mnt_make_readonly(struct vfsm
*/
smp_wmb();
mnt->mnt_flags &= ~MNT_WRITE_HOLD;
- spin_unlock(&vfsmount_lock);
+ br_write_unlock(vfsmount_lock);
return ret;
}
static void __mnt_unmake_readonly(struct vfsmount *mnt)
{
- spin_lock(&vfsmount_lock);
+ br_write_lock(vfsmount_lock);
mnt->mnt_flags &= ~MNT_READONLY;
- spin_unlock(&vfsmount_lock);
+ br_write_unlock(vfsmount_lock);
}
void simple_set_mnt(struct vfsmount *mnt, struct super_block *sb)
@@ -410,6 +423,7 @@ void free_vfsmnt(struct vfsmount *mnt)
/*
* find the first or last mount at @dentry on vfsmount @mnt depending on
* @dir. If @dir is set return the first mount else return the last mount.
+ * vfsmount_lock must be held for read or write.
*/
struct vfsmount *__lookup_mnt(struct vfsmount *mnt, struct dentry *dentry,
int dir)
@@ -439,10 +453,11 @@ struct vfsmount *__lookup_mnt(struct vfs
struct vfsmount *lookup_mnt(struct path *path)
{
struct vfsmount *child_mnt;
- spin_lock(&vfsmount_lock);
+
+ br_read_lock(vfsmount_lock);
if ((child_mnt = __lookup_mnt(path->mnt, path->dentry, 1)))
mntget(child_mnt);
- spin_unlock(&vfsmount_lock);
+ br_read_unlock(vfsmount_lock);
return child_mnt;
}
@@ -451,6 +466,9 @@ static inline int check_mnt(struct vfsmo
return mnt->mnt_ns == current->nsproxy->mnt_ns;
}
+/*
+ * vfsmount lock must be held for write
+ */
static void touch_mnt_namespace(struct mnt_namespace *ns)
{
if (ns) {
@@ -459,6 +477,9 @@ static void touch_mnt_namespace(struct m
}
}
+/*
+ * vfsmount lock must be held for write
+ */
static void __touch_mnt_namespace(struct mnt_namespace *ns)
{
if (ns && ns->event != event) {
@@ -467,6 +488,9 @@ static void __touch_mnt_namespace(struct
}
}
+/*
+ * vfsmount lock must be held for write
+ */
static void detach_mnt(struct vfsmount *mnt, struct path *old_path)
{
old_path->dentry = mnt->mnt_mountpoint;
@@ -478,6 +502,9 @@ static void detach_mnt(struct vfsmount *
old_path->dentry->d_mounted--;
}
+/*
+ * vfsmount lock must be held for write
+ */
void mnt_set_mountpoint(struct vfsmount *mnt, struct dentry *dentry,
struct vfsmount *child_mnt)
{
@@ -486,6 +513,9 @@ void mnt_set_mountpoint(struct vfsmount
dentry->d_mounted++;
}
+/*
+ * vfsmount lock must be held for write
+ */
static void attach_mnt(struct vfsmount *mnt, struct path *path)
{
mnt_set_mountpoint(path->mnt, path->dentry, mnt);
@@ -495,7 +525,7 @@ static void attach_mnt(struct vfsmount *
}
/*
- * the caller must hold vfsmount_lock
+ * vfsmount lock must be held for write
*/
static void commit_tree(struct vfsmount *mnt)
{
@@ -618,40 +648,41 @@ static inline void __mntput(struct vfsmo
void mntput_no_expire(struct vfsmount *mnt)
{
repeat:
- if (atomic_dec_and_lock(&mnt->mnt_count, &vfsmount_lock)) {
- if (likely(!mnt->mnt_pinned)) {
- spin_unlock(&vfsmount_lock);
- __mntput(mnt);
- return;
- }
- atomic_add(mnt->mnt_pinned + 1, &mnt->mnt_count);
- mnt->mnt_pinned = 0;
- spin_unlock(&vfsmount_lock);
- acct_auto_close_mnt(mnt);
- security_sb_umount_close(mnt);
- goto repeat;
+ if (!atomic_dec_and_br_write_lock(&mnt->mnt_count, vfsmount_lock))
+ return;
+
+ if (likely(!mnt->mnt_pinned)) {
+ br_write_unlock(vfsmount_lock);
+ __mntput(mnt);
+ return;
}
+ atomic_add(mnt->mnt_pinned + 1, &mnt->mnt_count);
+ mnt->mnt_pinned = 0;
+ br_write_unlock(vfsmount_lock);
+ acct_auto_close_mnt(mnt);
+ security_sb_umount_close(mnt);
+ goto repeat;
}
EXPORT_SYMBOL(mntput_no_expire);
void mnt_pin(struct vfsmount *mnt)
{
- spin_lock(&vfsmount_lock);
+ br_write_lock(vfsmount_lock);
mnt->mnt_pinned++;
- spin_unlock(&vfsmount_lock);
+ br_write_unlock(vfsmount_lock);
}
EXPORT_SYMBOL(mnt_pin);
void mnt_unpin(struct vfsmount *mnt)
{
- spin_lock(&vfsmount_lock);
+ br_write_lock(vfsmount_lock);
if (mnt->mnt_pinned) {
atomic_inc(&mnt->mnt_count);
mnt->mnt_pinned--;
}
- spin_unlock(&vfsmount_lock);
+ br_write_unlock(vfsmount_lock);
}
EXPORT_SYMBOL(mnt_unpin);
@@ -742,12 +773,12 @@ int mnt_had_events(struct proc_mounts *p
struct mnt_namespace *ns = p->ns;
int res = 0;
- spin_lock(&vfsmount_lock);
+ br_read_lock(vfsmount_lock);
if (p->event != ns->event) {
p->event = ns->event;
res = 1;
}
- spin_unlock(&vfsmount_lock);
+ br_read_unlock(vfsmount_lock);
return res;
}
@@ -949,12 +980,12 @@ int may_umount_tree(struct vfsmount *mnt
int minimum_refs = 0;
struct vfsmount *p;
- spin_lock(&vfsmount_lock);
+ br_read_lock(vfsmount_lock);
for (p = mnt; p; p = next_mnt(p, mnt)) {
actual_refs += atomic_read(&p->mnt_count);
minimum_refs += 2;
}
- spin_unlock(&vfsmount_lock);
+ br_read_unlock(vfsmount_lock);
if (actual_refs > minimum_refs)
return 0;
@@ -981,10 +1012,10 @@ int may_umount(struct vfsmount *mnt)
{
int ret = 1;
down_read(&namespace_sem);
- spin_lock(&vfsmount_lock);
+ br_read_lock(vfsmount_lock);
if (propagate_mount_busy(mnt, 2))
ret = 0;
- spin_unlock(&vfsmount_lock);
+ br_read_unlock(vfsmount_lock);
up_read(&namespace_sem);
return ret;
}
@@ -1000,13 +1031,14 @@ void release_mounts(struct list_head *he
if (mnt->mnt_parent != mnt) {
struct dentry *dentry;
struct vfsmount *m;
- spin_lock(&vfsmount_lock);
+
+ br_write_lock(vfsmount_lock);
dentry = mnt->mnt_mountpoint;
m = mnt->mnt_parent;
mnt->mnt_mountpoint = mnt->mnt_root;
mnt->mnt_parent = mnt;
m->mnt_ghosts--;
- spin_unlock(&vfsmount_lock);
+ br_write_unlock(vfsmount_lock);
dput(dentry);
mntput(m);
}
@@ -1014,6 +1046,10 @@ void release_mounts(struct list_head *he
}
}
+/*
+ * vfsmount lock must be held for write
+ * namespace_sem must be held for write
+ */
void umount_tree(struct vfsmount *mnt, int propagate, struct list_head *kill)
{
struct vfsmount *p;
@@ -1104,7 +1140,7 @@ static int do_umount(struct vfsmount *mn
}
down_write(&namespace_sem);
- spin_lock(&vfsmount_lock);
+ br_write_lock(vfsmount_lock);
event++;
if (!(flags & MNT_DETACH))
@@ -1116,7 +1152,7 @@ static int do_umount(struct vfsmount *mn
umount_tree(mnt, 1, &umount_list);
retval = 0;
}
- spin_unlock(&vfsmount_lock);
+ br_write_unlock(vfsmount_lock);
if (retval)
security_sb_umount_busy(mnt);
up_write(&namespace_sem);
@@ -1230,19 +1266,19 @@ struct vfsmount *copy_tree(struct vfsmou
q = clone_mnt(p, p->mnt_root, flag);
if (!q)
goto Enomem;
- spin_lock(&vfsmount_lock);
+ br_write_lock(vfsmount_lock);
list_add_tail(&q->mnt_list, &res->mnt_list);
attach_mnt(q, &path);
- spin_unlock(&vfsmount_lock);
+ br_write_unlock(vfsmount_lock);
}
}
return res;
Enomem:
if (res) {
LIST_HEAD(umount_list);
- spin_lock(&vfsmount_lock);
+ br_write_lock(vfsmount_lock);
umount_tree(res, 0, &umount_list);
- spin_unlock(&vfsmount_lock);
+ br_write_unlock(vfsmount_lock);
release_mounts(&umount_list);
}
return NULL;
@@ -1261,9 +1297,9 @@ void drop_collected_mounts(struct vfsmou
{
LIST_HEAD(umount_list);
down_write(&namespace_sem);
- spin_lock(&vfsmount_lock);
+ br_write_lock(vfsmount_lock);
umount_tree(mnt, 0, &umount_list);
- spin_unlock(&vfsmount_lock);
+ br_write_unlock(vfsmount_lock);
up_write(&namespace_sem);
release_mounts(&umount_list);
}
@@ -1391,7 +1427,7 @@ static int attach_recursive_mnt(struct v
if (err)
goto out_cleanup_ids;
- spin_lock(&vfsmount_lock);
+ br_write_lock(vfsmount_lock);
if (IS_MNT_SHARED(dest_mnt)) {
for (p = source_mnt; p; p = next_mnt(p, source_mnt))
@@ -1410,7 +1446,8 @@ static int attach_recursive_mnt(struct v
list_del_init(&child->mnt_hash);
commit_tree(child);
}
- spin_unlock(&vfsmount_lock);
+ br_write_unlock(vfsmount_lock);
+
return 0;
out_cleanup_ids:
@@ -1472,10 +1509,10 @@ static int do_change_type(struct path *p
goto out_unlock;
}
- spin_lock(&vfsmount_lock);
+ br_write_lock(vfsmount_lock);
for (m = mnt; m; m = (recurse ? next_mnt(m, mnt) : NULL))
change_mnt_propagation(m, type);
- spin_unlock(&vfsmount_lock);
+ br_write_unlock(vfsmount_lock);
out_unlock:
up_write(&namespace_sem);
@@ -1519,9 +1556,10 @@ static int do_loopback(struct path *path
err = graft_tree(mnt, path);
if (err) {
LIST_HEAD(umount_list);
- spin_lock(&vfsmount_lock);
+
+ br_write_lock(vfsmount_lock);
umount_tree(mnt, 0, &umount_list);
- spin_unlock(&vfsmount_lock);
+ br_write_unlock(vfsmount_lock);
release_mounts(&umount_list);
}
@@ -1574,18 +1612,18 @@ static int do_remount(struct path *path,
else
err = do_remount_sb(sb, flags, data, 0);
if (!err) {
- spin_lock(&vfsmount_lock);
+ br_write_lock(vfsmount_lock);
mnt_flags |= path->mnt->mnt_flags & MNT_PROPAGATION_MASK;
path->mnt->mnt_flags = mnt_flags;
- spin_unlock(&vfsmount_lock);
+ br_write_unlock(vfsmount_lock);
}
up_write(&sb->s_umount);
if (!err) {
security_sb_post_remount(path->mnt, flags, data);
- spin_lock(&vfsmount_lock);
+ br_write_lock(vfsmount_lock);
touch_mnt_namespace(path->mnt->mnt_ns);
- spin_unlock(&vfsmount_lock);
+ br_write_unlock(vfsmount_lock);
}
return err;
}
@@ -1762,7 +1800,7 @@ void mark_mounts_for_expiry(struct list_
return;
down_write(&namespace_sem);
- spin_lock(&vfsmount_lock);
+ br_write_lock(vfsmount_lock);
/* extract from the expiration list every vfsmount that matches the
* following criteria:
@@ -1781,7 +1819,7 @@ void mark_mounts_for_expiry(struct list_
touch_mnt_namespace(mnt->mnt_ns);
umount_tree(mnt, 1, &umounts);
}
- spin_unlock(&vfsmount_lock);
+ br_write_unlock(vfsmount_lock);
up_write(&namespace_sem);
release_mounts(&umounts);
@@ -1838,6 +1876,8 @@ resume:
/*
* process a list of expirable mountpoints with the intent of discarding any
* submounts of a specific parent mountpoint
+ *
+ * vfsmount_lock must be held for write
*/
static void shrink_submounts(struct vfsmount *mnt, struct list_head *umounts)
{
@@ -2056,9 +2096,9 @@ static struct mnt_namespace *dup_mnt_ns(
kfree(new_ns);
return ERR_PTR(-ENOMEM);
}
- spin_lock(&vfsmount_lock);
+ br_write_lock(vfsmount_lock);
list_add_tail(&new_ns->list, &new_ns->root->mnt_list);
- spin_unlock(&vfsmount_lock);
+ br_write_unlock(vfsmount_lock);
/*
* Second pass: switch the tsk->fs->* elements and mark new vfsmounts
@@ -2255,7 +2295,7 @@ SYSCALL_DEFINE2(pivot_root, const char _
goto out2; /* not attached */
/* make sure we can reach put_old from new_root */
tmp = old.mnt;
- spin_lock(&vfsmount_lock);
+ br_write_lock(vfsmount_lock);
if (tmp != new.mnt) {
for (;;) {
if (tmp->mnt_parent == tmp)
@@ -2275,7 +2315,7 @@ SYSCALL_DEFINE2(pivot_root, const char _
/* mount new_root on / */
attach_mnt(new.mnt, &root_parent);
touch_mnt_namespace(current->nsproxy->mnt_ns);
- spin_unlock(&vfsmount_lock);
+ br_write_unlock(vfsmount_lock);
chroot_fs_refs(&root, &new);
security_sb_post_pivotroot(&root, &new);
error = 0;
@@ -2291,7 +2331,7 @@ out1:
out0:
return error;
out3:
- spin_unlock(&vfsmount_lock);
+ br_write_unlock(vfsmount_lock);
goto out2;
}
@@ -2338,6 +2378,8 @@ void __init mnt_init(void)
for (u = 0; u < HASH_SIZE; u++)
INIT_LIST_HEAD(&mount_hashtable[u]);
+ br_lock_init(vfsmount_lock);
+
err = sysfs_init();
if (err)
printk(KERN_WARNING "%s: sysfs_init error: %d\n",
@@ -2356,9 +2398,9 @@ void put_mnt_ns(struct mnt_namespace *ns
if (!atomic_dec_and_test(&ns->count))
return;
down_write(&namespace_sem);
- spin_lock(&vfsmount_lock);
+ br_write_lock(vfsmount_lock);
umount_tree(ns->root, 0, &umount_list);
- spin_unlock(&vfsmount_lock);
+ br_write_unlock(vfsmount_lock);
up_write(&namespace_sem);
release_mounts(&umount_list);
kfree(ns);
Index: linux-2.6/fs/pnode.c
===================================================================
--- linux-2.6.orig/fs/pnode.c
+++ linux-2.6/fs/pnode.c
@@ -126,6 +126,9 @@ static int do_make_slave(struct vfsmount
return 0;
}
+/*
+ * vfsmount lock must be held for write
+ */
void change_mnt_propagation(struct vfsmount *mnt, int type)
{
if (type == MS_SHARED) {
@@ -270,12 +273,12 @@ int propagate_mnt(struct vfsmount *dest_
prev_src_mnt = child;
}
out:
- spin_lock(&vfsmount_lock);
+ br_write_lock(vfsmount_lock);
while (!list_empty(&tmp_list)) {
child = list_first_entry(&tmp_list, struct vfsmount, mnt_hash);
umount_tree(child, 0, &umount_list);
}
- spin_unlock(&vfsmount_lock);
+ br_write_unlock(vfsmount_lock);
release_mounts(&umount_list);
return ret;
}
@@ -296,6 +299,8 @@ static inline int do_refcount_check(stru
* other mounts its parent propagates to.
* Check if any of these mounts that **do not have submounts**
* have more references than 'refcnt'. If so return busy.
+ *
+ * vfsmount lock must be held for read or write
*/
int propagate_mount_busy(struct vfsmount *mnt, int refcnt)
{
@@ -353,6 +358,8 @@ static void __propagate_umount(struct vf
* collect all mounts that receive propagation from the mount in @list,
* and return these additional mounts in the same list.
* @list: the list of mounts to be unmounted.
+ *
+ * vfsmount lock must be held for write
*/
int propagate_umount(struct list_head *list)
{
Index: linux-2.6/fs/internal.h
===================================================================
--- linux-2.6.orig/fs/internal.h
+++ linux-2.6/fs/internal.h
@@ -9,6 +9,8 @@
* 2 of the License, or (at your option) any later version.
*/
+#include <linux/brlock.h>
+
struct super_block;
struct linux_binprm;
struct path;
@@ -70,7 +72,8 @@ extern struct vfsmount *copy_tree(struct
extern void __init mnt_init(void);
-extern spinlock_t vfsmount_lock;
+DECLARE_BRLOCK(vfsmount_lock);
+
/*
* fs_struct.c
* Re: [patch 2/2] fs: scale vfsmount_lock
From: Andreas Dilger @ 2010-03-17 20:33 UTC (permalink / raw)
To: Nick Piggin; +Cc: Al Viro, Frank Mayhar, John Stultz, Andi Kleen, linux-fsdevel
On 2010-03-17, at 08:20, Nick Piggin wrote:
> Here is patch 2 with the new syntax.
> --
> fs: brlock vfsmount_lock
>
> @@ -1928,7 +1928,7 @@ char *__d_path(const struct path *path,
> - spin_lock(&vfsmount_lock);
> + br_read_lock(vfsmount_lock);
> +DECLARE_BRLOCK(vfsmount_lock);
This is definitely a lot nicer to use. Thanks.
Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.