From: Dave Hansen <haveblue@us.ibm.com>
To: akpm@osdl.org
Cc: linux-kernel@vger.kernel.org, miklos@szeredi.hu,
hch@infradead.org, Dave Hansen <haveblue@us.ibm.com>
Subject: [PATCH 24/27] r-o-bind-mounts-track-number-of-mount-writers
Date: Thu, 01 Nov 2007 16:08:58 -0700 [thread overview]
Message-ID: <20071101230858.6F76E20C@kernel> (raw)
In-Reply-To: <20071101230826.9A4F6E00@kernel>
This is the real meat of the entire series. It actually implements the
tracking of the number of writers to a mount. However, it causes scalability
problems because there can be hundreds of cpus doing open()/close() on files
on the same mnt at the same time. Even an atomic_t in the mnt has massive
scalaing problems because the cacheline gets so terribly contended.
This uses a statically-allocated percpu variable. All operations are local to
a cpu as long that cpu operates on the same mount, and there are no writer
count imbalances. Writer count imbalances happen when a write is taken on one
cpu, and released on another, like when an open/close pair is performed on two
Upon a remount,ro request, all of the data from the percpu variables is
collected (expensive, but very rare) and we determine if there are any
outstanding writers to the mount.
I've written a little benchmark to sit in a loop for a couple of seconds in
several cpus in parallel doing open/write/close loops.
http://sr71.net/~dave/linux/openbench.c
The code in here is a a worst-possible case for this patch. It does opens on
a _pair_ of files in two different mounts in parallel. This should cause my
code to lose its "operate on the same mount" optimization completely. This
worst-case scenario causes a 3% degredation in the benchmark.
I could probably get rid of even this 3%, but it would be more complex than
what I have here, and I think this is getting into acceptable territory. In
practice, I expect writing more than 3 bytes to a file, as well as disk I/O to
mask any effects that this has.
(To get rid of that 3%, we could have an #defined number of mounts in the
percpu variable. So, instead of a CPU getting operate only on percpu data
when it accesses only one mount, it could stay on percpu data when it only
accesses N or fewer mounts.)
Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---
linux-2.6.git-dave/fs/namespace.c | 205 ++++++++++++++++++++++++++++---
linux-2.6.git-dave/include/linux/mount.h | 8 +
2 files changed, 198 insertions(+), 15 deletions(-)
diff -puN fs/namespace.c~r-o-bind-mounts-track-number-of-mount-writers fs/namespace.c
--- linux-2.6.git/fs/namespace.c~r-o-bind-mounts-track-number-of-mount-writers 2007-11-01 14:46:20.000000000 -0700
+++ linux-2.6.git-dave/fs/namespace.c 2007-11-01 14:46:20.000000000 -0700
@@ -17,6 +17,7 @@
#include <linux/quotaops.h>
#include <linux/acct.h>
#include <linux/capability.h>
+#include <linux/cpumask.h>
#include <linux/module.h>
#include <linux/sysfs.h>
#include <linux/seq_file.h>
@@ -52,6 +53,8 @@ static inline unsigned long hash(struct
return tmp & hash_mask;
}
+#define MNT_WRITER_UNDERFLOW_LIMIT -(1<<16)
+
struct vfsmount *alloc_vfsmnt(const char *name)
{
struct vfsmount *mnt = kmem_cache_zalloc(mnt_cache, GFP_KERNEL);
@@ -65,6 +68,7 @@ struct vfsmount *alloc_vfsmnt(const char
INIT_LIST_HEAD(&mnt->mnt_share);
INIT_LIST_HEAD(&mnt->mnt_slave_list);
INIT_LIST_HEAD(&mnt->mnt_slave);
+ atomic_set(&mnt->__mnt_writers, 0);
if (name) {
int size = strlen(name) + 1;
char *newname = kmalloc(size, GFP_KERNEL);
@@ -85,6 +89,84 @@ struct vfsmount *alloc_vfsmnt(const char
* we can determine when writes are able to occur to
* a filesystem.
*/
+/*
+ * __mnt_is_readonly: check whether a mount is read-only
+ * @mnt: the mount to check for its write status
+ *
+ * This shouldn't be used directly ouside of the VFS.
+ * It does not guarantee that the filesystem will stay
+ * r/w, just that it is right *now*. This can not and
+ * should not be used in place of IS_RDONLY(inode).
+ * mnt_want/drop_write() will _keep_ the filesystem
+ * r/w.
+ */
+int __mnt_is_readonly(struct vfsmount *mnt)
+{
+ return (mnt->mnt_sb->s_flags & MS_RDONLY);
+}
+EXPORT_SYMBOL_GPL(__mnt_is_readonly);
+
+struct mnt_writer {
+ /*
+ * If holding multiple instances of this lock, they
+ * must be ordered by cpu number.
+ */
+ spinlock_t lock;
+ unsigned long count;
+ struct vfsmount *mnt;
+} ____cacheline_aligned_in_smp;
+static DEFINE_PER_CPU(struct mnt_writer, mnt_writers);
+
+static int __init init_mnt_writers(void)
+{
+ int cpu;
+ for_each_possible_cpu(cpu) {
+ struct mnt_writer *writer = &per_cpu(mnt_writers, cpu);
+ spin_lock_init(&writer->lock);
+ writer->count = 0;
+ }
+ return 0;
+}
+fs_initcall(init_mnt_writers);
+
+static void mnt_unlock_cpus(void)
+{
+ int cpu;
+ struct mnt_writer *cpu_writer;
+
+ for_each_possible_cpu(cpu) {
+ cpu_writer = &per_cpu(mnt_writers, cpu);
+ spin_unlock(&cpu_writer->lock);
+ }
+}
+
+static inline void __clear_mnt_count(struct mnt_writer *cpu_writer)
+{
+ if (!cpu_writer->mnt)
+ return;
+ atomic_add(cpu_writer->count, &cpu_writer->mnt->__mnt_writers);
+ cpu_writer->count = 0;
+}
+ /*
+ * must hold cpu_writer->lock
+ */
+static inline void use_cpu_writer_for_mount(struct mnt_writer *cpu_writer,
+ struct vfsmount *mnt)
+{
+ if (cpu_writer->mnt == mnt)
+ return;
+ __clear_mnt_count(cpu_writer);
+ cpu_writer->mnt = mnt;
+}
+
+/*
+ * Most r/o checks on a fs are for operations that take
+ * discrete amounts of time, like a write() or unlink().
+ * We must keep track of when those operations start
+ * (for permission checks) and when they end, so that
+ * we can determine when writes are able to occur to
+ * a filesystem.
+ */
/**
* mnt_want_write - get write access to a mount
* @mnt: the mount on which to take a write
@@ -97,12 +179,58 @@ struct vfsmount *alloc_vfsmnt(const char
*/
int mnt_want_write(struct vfsmount *mnt)
{
- if (__mnt_is_readonly(mnt))
- return -EROFS;
- return 0;
+ int ret = 0;
+ struct mnt_writer *cpu_writer;
+
+ cpu_writer = &get_cpu_var(mnt_writers);
+ spin_lock(&cpu_writer->lock);
+ if (__mnt_is_readonly(mnt)) {
+ ret = -EROFS;
+ goto out;
+ }
+ use_cpu_writer_for_mount(cpu_writer, mnt);
+ cpu_writer->count++;
+out:
+ spin_unlock(&cpu_writer->lock);
+ put_cpu_var(mnt_writers);
+ return ret;
}
EXPORT_SYMBOL_GPL(mnt_want_write);
+static void lock_and_coalesce_cpu_mnt_writer_counts(void)
+{
+ int cpu;
+ struct mnt_writer *cpu_writer;
+
+ for_each_possible_cpu(cpu) {
+ cpu_writer = &per_cpu(mnt_writers, cpu);
+ spin_lock(&cpu_writer->lock);
+ __clear_mnt_count(cpu_writer);
+ cpu_writer->mnt = NULL;
+ }
+}
+
+/*
+ * These per-cpu write counts are not guaranteed to have
+ * matched increments and decrements on any given cpu.
+ * A file open()ed for write on one cpu and close()d on
+ * another cpu will imbalance this count. Make sure it
+ * does not get too far out of whack.
+ */
+static void handle_write_count_underflow(struct vfsmount *mnt)
+{
+ while (atomic_read(&mnt->__mnt_writers) <
+ MNT_WRITER_UNDERFLOW_LIMIT) {
+ /*
+ * It isn't necessary to hold all of the locks
+ * at the same time, but doing it this way makes
+ * us share a lot more code.
+ */
+ lock_and_coalesce_cpu_mnt_writer_counts();
+ mnt_unlock_cpus();
+ }
+}
+
/**
* mnt_drop_write - give up write access to a mount
* @mnt: the mount on which to give up write access
@@ -113,23 +241,61 @@ EXPORT_SYMBOL_GPL(mnt_want_write);
*/
void mnt_drop_write(struct vfsmount *mnt)
{
+ int must_check_underflow = 0;
+ struct mnt_writer *cpu_writer;
+
+ cpu_writer = &get_cpu_var(mnt_writers);
+ spin_lock(&cpu_writer->lock);
+
+ use_cpu_writer_for_mount(cpu_writer, mnt);
+ if (cpu_writer->count > 0) {
+ cpu_writer->count--;
+ } else {
+ must_check_underflow = 1;
+ atomic_dec(&mnt->__mnt_writers);
+ }
+
+ spin_unlock(&cpu_writer->lock);
+ /*
+ * Logically, we could call this each time,
+ * but the __mnt_writers cacheline tends to
+ * be cold, and makes this expensive.
+ */
+ if (must_check_underflow)
+ handle_write_count_underflow(mnt);
+ /*
+ * This could be done right after the spinlock
+ * is taken because the spinlock keeps us on
+ * the cpu, and disables preemption. However,
+ * putting it here bounds the amount that
+ * __mnt_writers can underflow. Without it,
+ * we could theoretically wrap __mnt_writers.
+ */
+ put_cpu_var(mnt_writers);
}
EXPORT_SYMBOL_GPL(mnt_drop_write);
-/*
- * __mnt_is_readonly: check whether a mount is read-only
- * @mnt: the mount to check for its write status
- *
- * This shouldn't be used directly ouside of the VFS.
- * It does not guarantee that the filesystem will stay
- * r/w, just that it is right *now*. This can not and
- * should not be used in place of IS_RDONLY(inode).
- */
-int __mnt_is_readonly(struct vfsmount *mnt)
+int mnt_make_readonly(struct vfsmount *mnt)
{
- return (mnt->mnt_sb->s_flags & MS_RDONLY);
+ int ret = 0;
+
+ lock_and_coalesce_cpu_mnt_writer_counts();
+ /*
+ * With all the locks held, this value is stable
+ */
+ if (atomic_read(&mnt->__mnt_writers) > 0) {
+ ret = -EBUSY;
+ goto out;
+ }
+ /*
+ * actually set mount's r/o flag here to make
+ * __mnt_is_readonly() true, which keeps anyone
+ * from doing a successful mnt_want_write().
+ */
+out:
+ mnt_unlock_cpus();
+ return ret;
}
-EXPORT_SYMBOL_GPL(__mnt_is_readonly);
int simple_set_mnt(struct vfsmount *mnt, struct super_block *sb)
{
@@ -325,6 +491,15 @@ static struct vfsmount *clone_mnt(struct
static inline void __mntput(struct vfsmount *mnt)
{
struct super_block *sb = mnt->mnt_sb;
+ lock_and_coalesce_cpu_mnt_writer_counts();
+ mnt_unlock_cpus();
+ /*
+ * This probably indicates that somebody messed
+ * up a mnt_want/drop_write() pair. If this
+ * happens, the filesystem was probably unable
+ * to make r/w->r/o transitions.
+ */
+ WARN_ON(atomic_read(&mnt->__mnt_writers));
dput(mnt->mnt_root);
free_vfsmnt(mnt);
deactivate_super(sb);
diff -puN include/linux/mount.h~r-o-bind-mounts-track-number-of-mount-writers include/linux/mount.h
--- linux-2.6.git/include/linux/mount.h~r-o-bind-mounts-track-number-of-mount-writers 2007-11-01 14:46:20.000000000 -0700
+++ linux-2.6.git-dave/include/linux/mount.h 2007-11-01 14:46:20.000000000 -0700
@@ -14,6 +14,7 @@
#include <linux/types.h>
#include <linux/list.h>
+#include <linux/nodemask.h>
#include <linux/spinlock.h>
#include <asm/atomic.h>
@@ -61,6 +62,13 @@ struct vfsmount {
atomic_t mnt_count;
int mnt_expiry_mark; /* true if marked for expiry */
int mnt_pinned;
+ /*
+ * This value is not stable unless all of the
+ * mnt_writers[] spinlocks are held, and all
+ * mnt_writer[]s on this mount have 0 as
+ * their ->count
+ */
+ atomic_t __mnt_writers;
};
static inline struct vfsmount *mntget(struct vfsmount *mnt)
_
next prev parent reply other threads:[~2007-11-01 23:17 UTC|newest]
Thread overview: 39+ messages / expand[flat|nested] mbox.gz Atom feed top
2007-11-01 23:08 [PATCH 00/27] Read-only bind mounts (-mm resend) Dave Hansen
2007-11-01 23:08 ` [PATCH 01/27] do namei_flags calculation inside open_namei() Dave Hansen
2007-11-01 23:08 ` [PATCH 02/27] make open_namei() return a filp Dave Hansen
2007-11-01 23:08 ` [PATCH 03/27] kill do_filp_open() Dave Hansen
2007-11-01 23:08 ` [PATCH 04/27] kill filp_open() Dave Hansen
2008-01-16 8:52 ` Andrew Morton
2008-01-16 17:04 ` Dave Hansen
2008-01-16 17:10 ` Christoph Hellwig
2008-01-16 17:41 ` Dave Hansen
2008-01-16 17:47 ` Christoph Hellwig
2008-01-16 17:12 ` Bryn M. Reeves
2007-11-01 23:08 ` [PATCH 05/27] rename open_namei() to open_pathname() Dave Hansen
2007-11-26 14:33 ` Christoph Hellwig
2007-11-01 23:08 ` [PATCH 06/27] r-o-bind-mounts-stub-functions Dave Hansen
2007-11-01 23:08 ` [PATCH 07/27] r-o-bind-mounts-do_rmdir-elevate-write-count Dave Hansen
2007-11-01 23:08 ` [PATCH 08/27] r-o-bind-mounts-elevate-mnt-writers-for-callers-of-vfs_mkdir Dave Hansen
2007-11-01 23:08 ` [PATCH 09/27] r-o-bind-mounts-elevate-mnt-writers-for-vfs_unlink-callers Dave Hansen
2007-11-01 23:08 ` [PATCH 10/27] r-o-bind-mounts-elevate-mount-count-for-extended-attributes Dave Hansen
2007-11-01 23:08 ` [PATCH 11/27] r-o-bind-mounts-elevate-write-count-during-entire-ncp_ioctl Dave Hansen
2007-11-01 23:08 ` [PATCH 12/27] r-o-bind-mounts-elevate-write-count-for-do_sys_utime-and-touch_atime Dave Hansen
2007-11-01 23:08 ` [PATCH 13/27] r-o-bind-mounts-elevate-write-count-for-do_utimes Dave Hansen
2007-11-01 23:08 ` [PATCH 14/27] r-o-bind-mounts-elevate-write-count-for-file_update_time Dave Hansen
2007-11-01 23:08 ` [PATCH 15/27] r-o-bind-mounts-elevate-write-count-for-link-and-symlink-calls Dave Hansen
2007-11-01 23:08 ` [PATCH 16/27] r-o-bind-mounts-elevate-write-count-for-some-ioctls Dave Hansen
2007-11-05 23:23 ` Andrew Morton
2007-11-06 9:01 ` Jan Kara
2007-11-06 9:12 ` Andrew Morton
2007-11-01 23:08 ` [PATCH 17/27] r-o-bind-mounts-elevate-write-count-opend-files Dave Hansen
2007-11-01 23:08 ` [PATCH 18/27] r-o-bind-mounts-elevate-write-count-over-calls-to-vfs_rename Dave Hansen
2007-11-01 23:08 ` [PATCH 19/27] r-o-bind-mounts-elevate-writer-count-for-chown-and-friends Dave Hansen
2007-11-01 23:08 ` [PATCH 20/27] r-o-bind-mounts-elevate-writer-count-for-do_sys_truncate Dave Hansen
2007-11-01 23:08 ` [PATCH 21/27] r-o-bind-mounts-make-access-use-mnt-check Dave Hansen
2007-11-01 23:08 ` [PATCH 22/27] r-o-bind-mounts-nfs-check-mnt-instead-of-superblock-directly Dave Hansen
2007-11-01 23:08 ` [PATCH 23/27] r-o-bind-mounts-sys_mknodat-elevate-write-count-for-vfs_mknod-create Dave Hansen
2007-11-01 23:08 ` Dave Hansen [this message]
2007-11-01 23:09 ` [PATCH 25/27] r-o-bind-mounts-track-number-of-mount-writers-make-lockdep-happy-with-r-o-bind-mounts Dave Hansen
2007-11-05 23:35 ` Andrew Morton
2007-11-01 23:09 ` [PATCH 26/27] r-o-bind-mounts-honor-r-w-changes-at-do_remount-time Dave Hansen
2007-11-01 23:09 ` [PATCH 27/27] keep track of mnt_writer state of struct file Dave Hansen
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20071101230858.6F76E20C@kernel \
--to=haveblue@us.ibm.com \
--cc=akpm@osdl.org \
--cc=hch@infradead.org \
--cc=linux-kernel@vger.kernel.org \
--cc=miklos@szeredi.hu \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.