linux-fsdevel.vger.kernel.org archive mirror
* [RFC][PATCH 0/13] Per-container dcache management (and a bit more)
@ 2011-05-03 12:14 Pavel Emelyanov
  2011-05-03 12:15 ` [PATCH 1/13] vfs: Lighten r/o rename_lock lockers Pavel Emelyanov
                   ` (14 more replies)
  0 siblings, 15 replies; 23+ messages in thread
From: Pavel Emelyanov @ 2011-05-03 12:14 UTC (permalink / raw)
  To: Hugh Dickins, Nick Piggin, Andrea Arcangeli, Rik van Riel,
	Dave Hansen, Alexa
  Cc: linux-fsdevel

Hi.

Following the "release early, release often" strategy :) I'm
glad to propose this scratch implementation of what I was talking
about at LSF -- a way to limit dcache growth on both containerized
and non-containerized systems (the set applies to 2.6.38).

The first 5 patches are preparations for this; descriptive (I hope)
comments are inside them.

The general idea of this set is to make dentry subtrees limited in
size and to shrink them as they hit the configured limit.

Why subtrees? Because this allows having the [dentry -> group] reference
without a reference count of its own, letting the [dentry -> parent]
one handle this.
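
To illustrate the point: since a child dentry always pins its parent,
a per-subtree pointer on each dentry stays valid without any refcounting
of its own. A rough sketch of the idea (the actual field is added later
in the set, in patch 8/13):

	struct dentry {
		...
		struct dentry_mob *d_mob;	/* mob of the subtree root; kept
						 * valid by the d_parent chain,
						 * so no extra refcount needed */
	};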

Why limited? For containers the answer is simple -- a container
should not be allowed to consume too much of the host memory. For
non-containerized systems the answer is to protect the kernel from
unprivileged attacks on the dcache memory such as the
"while :; do mkdir x; cd x; done" one and similar.

What isn't in this patch set yet, but should be done after the discussion:

* API. I haven't managed to invent any perfect solution and would
really like to have it discussed. To be able to play with the set,
ioctls plus a proc file for listing are proposed for now.

* New mounts management. Right now, if you mount some new FS on a
dentry which belongs to some managed set (I named it a "mob" in this
patchset), the new mount is managed with the system settings. This is
not OK; the new mount should be managed with the settings of the
mountpoint's mob.

* Elegant shrink_dcache_memory on global memory shortage. For now the
code walks the mobs and shrinks an equal number of dentries from each.
A better shrinking policy can and probably should be implemented.

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>


* [PATCH 1/13] vfs: Lighten r/o rename_lock lockers
  2011-05-03 12:14 [RFC][PATCH 0/13] Per-container dcache management (and a bit more) Pavel Emelyanov
@ 2011-05-03 12:15 ` Pavel Emelyanov
  2011-05-03 12:15 ` [PATCH 2/13] vfs: Factor out rename_lock locking Pavel Emelyanov
                   ` (13 subsequent siblings)
  14 siblings, 0 replies; 23+ messages in thread
From: Pavel Emelyanov @ 2011-05-03 12:15 UTC (permalink / raw)
  To: Hugh Dickins, Nick Piggin, Andrea Arcangeli, Rik van Riel,
	Dave Hansen, Alexa
  Cc: linux-fsdevel

There are some places that need to _lock_ the rename_lock, but they
do it only to read the dentry tree. By taking the write_seqlock they
force the existing "willing to restart" readers to restart, which is
not actually required.

Introduce two more seqlock helpers that lock the lock but do not
update the sequence count.
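
I.e. a read-only tree walker that still needs to block renames can now
do, schematically:

	read_seqlock(&rename_lock);	/* takes the spinlock, seq count untouched */
	/* ... walk the dentry tree; concurrent d_move() is excluded ... */
	read_sequnlock(&rename_lock);

and the lockless readers that use read_seqbegin()/read_seqretry() are
not forced to restart, since the sequence count is not bumped.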

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>

---
 fs/dcache.c             |   40 ++++++++++++++++++++--------------------
 include/linux/seqlock.h |    9 +++++++++
 2 files changed, 29 insertions(+), 20 deletions(-)

diff --git a/fs/dcache.c b/fs/dcache.c
index 611ffe9..c31c149 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -1091,18 +1091,18 @@ resume:
 	if (!locked && read_seqretry(&rename_lock, seq))
 		goto rename_retry;
 	if (locked)
-		write_sequnlock(&rename_lock);
+		read_sequnlock(&rename_lock);
 	return 0; /* No mount points found in tree */
 positive:
 	if (!locked && read_seqretry(&rename_lock, seq))
 		goto rename_retry;
 	if (locked)
-		write_sequnlock(&rename_lock);
+		read_sequnlock(&rename_lock);
 	return 1;
 
 rename_retry:
 	locked = 1;
-	write_seqlock(&rename_lock);
+	read_seqlock(&rename_lock);
 	goto again;
 }
 EXPORT_SYMBOL(have_submounts);
@@ -1207,14 +1207,14 @@ out:
 	if (!locked && read_seqretry(&rename_lock, seq))
 		goto rename_retry;
 	if (locked)
-		write_sequnlock(&rename_lock);
+		read_sequnlock(&rename_lock);
 	return found;
 
 rename_retry:
 	if (found)
 		return found;
 	locked = 1;
-	write_seqlock(&rename_lock);
+	read_seqlock(&rename_lock);
 	goto again;
 }
 
@@ -2572,9 +2572,9 @@ char *__d_path(const struct path *path, struct path *root,
 	int error;
 
 	prepend(&res, &buflen, "\0", 1);
-	write_seqlock(&rename_lock);
+	read_seqlock(&rename_lock);
 	error = prepend_path(path, root, &res, &buflen);
-	write_sequnlock(&rename_lock);
+	read_sequnlock(&rename_lock);
 
 	if (error)
 		return ERR_PTR(error);
@@ -2636,12 +2636,12 @@ char *d_path(const struct path *path, char *buf, int buflen)
 		return path->dentry->d_op->d_dname(path->dentry, buf, buflen);
 
 	get_fs_root(current->fs, &root);
-	write_seqlock(&rename_lock);
+	read_seqlock(&rename_lock);
 	tmp = root;
 	error = path_with_deleted(path, &tmp, &res, &buflen);
 	if (error)
 		res = ERR_PTR(error);
-	write_sequnlock(&rename_lock);
+	read_sequnlock(&rename_lock);
 	path_put(&root);
 	return res;
 }
@@ -2667,12 +2667,12 @@ char *d_path_with_unreachable(const struct path *path, char *buf, int buflen)
 		return path->dentry->d_op->d_dname(path->dentry, buf, buflen);
 
 	get_fs_root(current->fs, &root);
-	write_seqlock(&rename_lock);
+	read_seqlock(&rename_lock);
 	tmp = root;
 	error = path_with_deleted(path, &tmp, &res, &buflen);
 	if (!error && !path_equal(&tmp, &root))
 		error = prepend_unreachable(&res, &buflen);
-	write_sequnlock(&rename_lock);
+	read_sequnlock(&rename_lock);
 	path_put(&root);
 	if (error)
 		res =  ERR_PTR(error);
@@ -2739,9 +2739,9 @@ char *dentry_path_raw(struct dentry *dentry, char *buf, int buflen)
 {
 	char *retval;
 
-	write_seqlock(&rename_lock);
+	read_seqlock(&rename_lock);
 	retval = __dentry_path(dentry, buf, buflen);
-	write_sequnlock(&rename_lock);
+	read_sequnlock(&rename_lock);
 
 	return retval;
 }
@@ -2752,7 +2752,7 @@ char *dentry_path(struct dentry *dentry, char *buf, int buflen)
 	char *p = NULL;
 	char *retval;
 
-	write_seqlock(&rename_lock);
+	read_seqlock(&rename_lock);
 	if (d_unlinked(dentry)) {
 		p = buf + buflen;
 		if (prepend(&p, &buflen, "//deleted", 10) != 0)
@@ -2760,7 +2760,7 @@ char *dentry_path(struct dentry *dentry, char *buf, int buflen)
 		buflen++;
 	}
 	retval = __dentry_path(dentry, buf, buflen);
-	write_sequnlock(&rename_lock);
+	read_sequnlock(&rename_lock);
 	if (!IS_ERR(retval) && p)
 		*p = '/';	/* restore '/' overriden with '\0' */
 	return retval;
@@ -2798,7 +2798,7 @@ SYSCALL_DEFINE2(getcwd, char __user *, buf, unsigned long, size)
 	get_fs_root_and_pwd(current->fs, &root, &pwd);
 
 	error = -ENOENT;
-	write_seqlock(&rename_lock);
+	read_seqlock(&rename_lock);
 	if (!d_unlinked(pwd.dentry)) {
 		unsigned long len;
 		struct path tmp = root;
@@ -2807,7 +2807,7 @@ SYSCALL_DEFINE2(getcwd, char __user *, buf, unsigned long, size)
 
 		prepend(&cwd, &buflen, "\0", 1);
 		error = prepend_path(&pwd, &tmp, &cwd, &buflen);
-		write_sequnlock(&rename_lock);
+		read_sequnlock(&rename_lock);
 
 		if (error)
 			goto out;
@@ -2827,7 +2827,7 @@ SYSCALL_DEFINE2(getcwd, char __user *, buf, unsigned long, size)
 				error = -EFAULT;
 		}
 	} else {
-		write_sequnlock(&rename_lock);
+		read_sequnlock(&rename_lock);
 	}
 
 out:
@@ -2971,12 +2971,12 @@ resume:
 	if (!locked && read_seqretry(&rename_lock, seq))
 		goto rename_retry;
 	if (locked)
-		write_sequnlock(&rename_lock);
+		read_sequnlock(&rename_lock);
 	return;
 
 rename_retry:
 	locked = 1;
-	write_seqlock(&rename_lock);
+	read_seqlock(&rename_lock);
 	goto again;
 }
 
diff --git a/include/linux/seqlock.h b/include/linux/seqlock.h
index e98cd2e..808e124 100644
--- a/include/linux/seqlock.h
+++ b/include/linux/seqlock.h
@@ -110,6 +110,15 @@ static __always_inline int read_seqretry(const seqlock_t *sl, unsigned start)
 	return unlikely(sl->sequence != start);
 }
 
+static __always_inline void read_seqlock(seqlock_t *sl)
+{
+	spin_lock(&sl->lock);
+}
+
+static __always_inline void read_sequnlock(seqlock_t *sl)
+{
+	spin_unlock(&sl->lock);
+}
 
 /*
  * Version using sequence counter only.
-- 
1.5.5.6


* [PATCH 2/13] vfs: Factor out rename_lock locking
  2011-05-03 12:14 [RFC][PATCH 0/13] Per-container dcache management (and a bit more) Pavel Emelyanov
  2011-05-03 12:15 ` [PATCH 1/13] vfs: Lighten r/o rename_lock lockers Pavel Emelyanov
@ 2011-05-03 12:15 ` Pavel Emelyanov
  2011-05-03 12:16 ` [PATCH 3/13] vfs: Make the rename_lock per-sb Pavel Emelyanov
                   ` (12 subsequent siblings)
  14 siblings, 0 replies; 23+ messages in thread
From: Pavel Emelyanov @ 2011-05-03 12:15 UTC (permalink / raw)
  To: Hugh Dickins, Nick Piggin, Andrea Arcangeli, Rik van Riel,
	Dave Hansen, Alexa
  Cc: linux-fsdevel

The set of routines that get the dentry path in its various flavours
ends up in two "core" routines -- prepend_path and __dentry_path.

Since the rename_lock only protects the path integrity, it's OK
to move the rename_lock locking inside these two.
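
With the locking inside the two core routines the callers become
trivial; e.g. dentry_path_raw() (as in the hunk below) reduces to:

	char *dentry_path_raw(struct dentry *dentry, char *buf, int buflen)
	{
		/* rename_lock handling now lives inside __dentry_path() */
		return __dentry_path(dentry, buf, buflen);
	}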

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>

---
 fs/dcache.c |   27 ++++++---------------------
 1 files changed, 6 insertions(+), 21 deletions(-)

diff --git a/fs/dcache.c b/fs/dcache.c
index c31c149..485fc4a 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -2499,6 +2499,7 @@ static int prepend_path(const struct path *path, struct path *root,
 	bool slash = false;
 	int error = 0;
 
+	read_seqlock(&rename_lock);
 	br_read_lock(vfsmount_lock);
 	while (dentry != root->dentry || vfsmnt != root->mnt) {
 		struct dentry * parent;
@@ -2531,6 +2532,7 @@ out:
 		error = prepend(buffer, buflen, "/", 1);
 
 	br_read_unlock(vfsmount_lock);
+	read_sequnlock(&rename_lock);
 	return error;
 
 global_root:
@@ -2572,10 +2574,7 @@ char *__d_path(const struct path *path, struct path *root,
 	int error;
 
 	prepend(&res, &buflen, "\0", 1);
-	read_seqlock(&rename_lock);
 	error = prepend_path(path, root, &res, &buflen);
-	read_sequnlock(&rename_lock);
-
 	if (error)
 		return ERR_PTR(error);
 	return res;
@@ -2636,12 +2635,10 @@ char *d_path(const struct path *path, char *buf, int buflen)
 		return path->dentry->d_op->d_dname(path->dentry, buf, buflen);
 
 	get_fs_root(current->fs, &root);
-	read_seqlock(&rename_lock);
 	tmp = root;
 	error = path_with_deleted(path, &tmp, &res, &buflen);
 	if (error)
 		res = ERR_PTR(error);
-	read_sequnlock(&rename_lock);
 	path_put(&root);
 	return res;
 }
@@ -2667,12 +2664,10 @@ char *d_path_with_unreachable(const struct path *path, char *buf, int buflen)
 		return path->dentry->d_op->d_dname(path->dentry, buf, buflen);
 
 	get_fs_root(current->fs, &root);
-	read_seqlock(&rename_lock);
 	tmp = root;
 	error = path_with_deleted(path, &tmp, &res, &buflen);
 	if (!error && !path_equal(&tmp, &root))
 		error = prepend_unreachable(&res, &buflen);
-	read_sequnlock(&rename_lock);
 	path_put(&root);
 	if (error)
 		res =  ERR_PTR(error);
@@ -2716,6 +2711,7 @@ static char *__dentry_path(struct dentry *dentry, char *buf, int buflen)
 	retval = end-1;
 	*retval = '/';
 
+	read_seqlock(&rename_lock);
 	while (!IS_ROOT(dentry)) {
 		struct dentry *parent = dentry->d_parent;
 		int error;
@@ -2730,20 +2726,16 @@ static char *__dentry_path(struct dentry *dentry, char *buf, int buflen)
 		retval = end;
 		dentry = parent;
 	}
+	read_sequnlock(&rename_lock);
 	return retval;
 Elong:
+	read_sequnlock(&rename_lock);
 	return ERR_PTR(-ENAMETOOLONG);
 }
 
 char *dentry_path_raw(struct dentry *dentry, char *buf, int buflen)
 {
-	char *retval;
-
-	read_seqlock(&rename_lock);
-	retval = __dentry_path(dentry, buf, buflen);
-	read_sequnlock(&rename_lock);
-
-	return retval;
+	return __dentry_path(dentry, buf, buflen);
 }
 EXPORT_SYMBOL(dentry_path_raw);
 
@@ -2752,7 +2744,6 @@ char *dentry_path(struct dentry *dentry, char *buf, int buflen)
 	char *p = NULL;
 	char *retval;
 
-	read_seqlock(&rename_lock);
 	if (d_unlinked(dentry)) {
 		p = buf + buflen;
 		if (prepend(&p, &buflen, "//deleted", 10) != 0)
@@ -2760,7 +2751,6 @@ char *dentry_path(struct dentry *dentry, char *buf, int buflen)
 		buflen++;
 	}
 	retval = __dentry_path(dentry, buf, buflen);
-	read_sequnlock(&rename_lock);
 	if (!IS_ERR(retval) && p)
 		*p = '/';	/* restore '/' overriden with '\0' */
 	return retval;
@@ -2798,7 +2788,6 @@ SYSCALL_DEFINE2(getcwd, char __user *, buf, unsigned long, size)
 	get_fs_root_and_pwd(current->fs, &root, &pwd);
 
 	error = -ENOENT;
-	read_seqlock(&rename_lock);
 	if (!d_unlinked(pwd.dentry)) {
 		unsigned long len;
 		struct path tmp = root;
@@ -2807,7 +2796,6 @@ SYSCALL_DEFINE2(getcwd, char __user *, buf, unsigned long, size)
 
 		prepend(&cwd, &buflen, "\0", 1);
 		error = prepend_path(&pwd, &tmp, &cwd, &buflen);
-		read_sequnlock(&rename_lock);
 
 		if (error)
 			goto out;
@@ -2826,10 +2814,7 @@ SYSCALL_DEFINE2(getcwd, char __user *, buf, unsigned long, size)
 			if (copy_to_user(buf, cwd, len))
 				error = -EFAULT;
 		}
-	} else {
-		read_sequnlock(&rename_lock);
 	}
-
 out:
 	path_put(&pwd);
 	path_put(&root);
-- 
1.5.5.6


* [PATCH 3/13] vfs: Make the rename_lock per-sb
  2011-05-03 12:14 [RFC][PATCH 0/13] Per-container dcache management (and a bit more) Pavel Emelyanov
  2011-05-03 12:15 ` [PATCH 1/13] vfs: Lighten r/o rename_lock lockers Pavel Emelyanov
  2011-05-03 12:15 ` [PATCH 2/13] vfs: Factor out rename_lock locking Pavel Emelyanov
@ 2011-05-03 12:16 ` Pavel Emelyanov
  2011-05-03 12:16 ` [PATCH 4/13] vfs: Factor out tree (of four) shrinkers code Pavel Emelyanov
                   ` (11 subsequent siblings)
  14 siblings, 0 replies; 23+ messages in thread
From: Pavel Emelyanov @ 2011-05-03 12:16 UTC (permalink / raw)
  To: Hugh Dickins, Nick Piggin, Andrea Arcangeli, Rik van Riel,
	Dave Hansen, Alexa
  Cc: linux-fsdevel

There are two things I try to achieve with this patch.

The first one -- renames on different sb-s will not depend on each other.
The second one -- I plan to factor out the dcache shrinking code (see
further patches). In order to make shrink_dcache_for_umount work OK
on a dying sb it's better to have this lock non-global.

There's some trickery in prepend_path -- it has to relock the rename
lock as it ascends the vfsmount tree. This relock seems OK, as the
dentry path within each mount remains solid, and even if we "race"
with a rename on one of the super blocks, the overall path we get will
be either the one we had before the rename or the one we had after it,
i.e. some existing one.
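
Schematically, when crossing a mountpoint prepend_path now does (see
the hunk below; vfsmount_lock is held for reading around the whole
walk, so the mount tree itself can't change under us):

	dentry = vfsmnt->mnt_mountpoint;
	read_sequnlock(&vfsmnt->mnt_sb->s_rename_lock);
	vfsmnt = vfsmnt->mnt_parent;
	read_seqlock(&vfsmnt->mnt_sb->s_rename_lock);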

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>

---
 drivers/staging/pohmelfs/path_entry.c |    4 +-
 fs/autofs4/waitq.c                    |    6 ++--
 fs/dcache.c                           |   64 ++++++++++++++++----------------
 fs/nfs/namespace.c                    |    6 ++--
 fs/super.c                            |    1 +
 include/linux/dcache.h                |    2 -
 include/linux/fs.h                    |    1 +
 kernel/auditsc.c                      |    4 +-
 8 files changed, 44 insertions(+), 44 deletions(-)

diff --git a/drivers/staging/pohmelfs/path_entry.c b/drivers/staging/pohmelfs/path_entry.c
index 400a9fc..d3de52e 100644
--- a/drivers/staging/pohmelfs/path_entry.c
+++ b/drivers/staging/pohmelfs/path_entry.c
@@ -99,7 +99,7 @@ int pohmelfs_path_length(struct pohmelfs_inode *pi)
 rename_retry:
 	len = 1; /* Root slash */
 	d = first;
-	seq = read_seqbegin(&rename_lock);
+	seq = read_seqbegin(&root->d_sb->s_rename_lock);
 	rcu_read_lock();
 
 	if (!IS_ROOT(d) && d_unhashed(d))
@@ -110,7 +110,7 @@ rename_retry:
 		d = d->d_parent;
 	}
 	rcu_read_unlock();
-	if (read_seqretry(&rename_lock, seq))
+	if (read_seqretry(&root->d_sb->s_rename_lock, seq))
 		goto rename_retry;
 
 	dput(root);
diff --git a/fs/autofs4/waitq.c b/fs/autofs4/waitq.c
index 5601005..920db0b 100644
--- a/fs/autofs4/waitq.c
+++ b/fs/autofs4/waitq.c
@@ -195,7 +195,7 @@ rename_retry:
 	buf = *name;
 	len = 0;
 
-	seq = read_seqbegin(&rename_lock);
+	seq = read_seqbegin(&sbi->sb->s_rename_lock);
 	rcu_read_lock();
 	spin_lock(&autofs4_lock);
 	for (tmp = dentry ; tmp != root ; tmp = tmp->d_parent)
@@ -204,7 +204,7 @@ rename_retry:
 	if (!len || --len > NAME_MAX) {
 		spin_unlock(&autofs4_lock);
 		rcu_read_unlock();
-		if (read_seqretry(&rename_lock, seq))
+		if (read_seqretry(&sbi->sb->s_rename_lock, seq))
 			goto rename_retry;
 		return 0;
 	}
@@ -220,7 +220,7 @@ rename_retry:
 	}
 	spin_unlock(&autofs4_lock);
 	rcu_read_unlock();
-	if (read_seqretry(&rename_lock, seq))
+	if (read_seqretry(&sbi->sb->s_rename_lock, seq))
 		goto rename_retry;
 
 	return len;
diff --git a/fs/dcache.c b/fs/dcache.c
index 485fc4a..976a917 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -79,10 +79,6 @@ int sysctl_vfs_cache_pressure __read_mostly = 100;
 EXPORT_SYMBOL_GPL(sysctl_vfs_cache_pressure);
 
 static __cacheline_aligned_in_smp DEFINE_SPINLOCK(dcache_lru_lock);
-__cacheline_aligned_in_smp DEFINE_SEQLOCK(rename_lock);
-
-EXPORT_SYMBOL(rename_lock);
-
 static struct kmem_cache *dentry_cache __read_mostly;
 
 /*
@@ -1031,7 +1027,7 @@ int have_submounts(struct dentry *parent)
 	unsigned seq;
 	int locked = 0;
 
-	seq = read_seqbegin(&rename_lock);
+	seq = read_seqbegin(&parent->d_sb->s_rename_lock);
 again:
 	this_parent = parent;
 
@@ -1078,7 +1074,7 @@ resume:
 		/* might go back up the wrong parent if we have had a rename
 		 * or deletion */
 		if (this_parent != child->d_parent ||
-			 (!locked && read_seqretry(&rename_lock, seq))) {
+			 (!locked && read_seqretry(&parent->d_sb->s_rename_lock, seq))) {
 			spin_unlock(&this_parent->d_lock);
 			rcu_read_unlock();
 			goto rename_retry;
@@ -1088,21 +1084,21 @@ resume:
 		goto resume;
 	}
 	spin_unlock(&this_parent->d_lock);
-	if (!locked && read_seqretry(&rename_lock, seq))
+	if (!locked && read_seqretry(&parent->d_sb->s_rename_lock, seq))
 		goto rename_retry;
 	if (locked)
-		read_sequnlock(&rename_lock);
+		read_sequnlock(&parent->d_sb->s_rename_lock);
 	return 0; /* No mount points found in tree */
 positive:
-	if (!locked && read_seqretry(&rename_lock, seq))
+	if (!locked && read_seqretry(&parent->d_sb->s_rename_lock, seq))
 		goto rename_retry;
 	if (locked)
-		read_sequnlock(&rename_lock);
+		read_sequnlock(&parent->d_sb->s_rename_lock);
 	return 1;
 
 rename_retry:
 	locked = 1;
-	read_seqlock(&rename_lock);
+	read_seqlock(&parent->d_sb->s_rename_lock);
 	goto again;
 }
 EXPORT_SYMBOL(have_submounts);
@@ -1129,7 +1125,7 @@ static int select_parent(struct dentry * parent)
 	int found = 0;
 	int locked = 0;
 
-	seq = read_seqbegin(&rename_lock);
+	seq = read_seqbegin(&parent->d_sb->s_rename_lock);
 again:
 	this_parent = parent;
 	spin_lock(&this_parent->d_lock);
@@ -1193,7 +1189,7 @@ resume:
 		/* might go back up the wrong parent if we have had a rename
 		 * or deletion */
 		if (this_parent != child->d_parent ||
-			(!locked && read_seqretry(&rename_lock, seq))) {
+			(!locked && read_seqretry(&parent->d_sb->s_rename_lock, seq))) {
 			spin_unlock(&this_parent->d_lock);
 			rcu_read_unlock();
 			goto rename_retry;
@@ -1204,17 +1200,17 @@ resume:
 	}
 out:
 	spin_unlock(&this_parent->d_lock);
-	if (!locked && read_seqretry(&rename_lock, seq))
+	if (!locked && read_seqretry(&parent->d_sb->s_rename_lock, seq))
 		goto rename_retry;
 	if (locked)
-		read_sequnlock(&rename_lock);
+		read_sequnlock(&parent->d_sb->s_rename_lock);
 	return found;
 
 rename_retry:
 	if (found)
 		return found;
 	locked = 1;
-	read_seqlock(&rename_lock);
+	read_seqlock(&parent->d_sb->s_rename_lock);
 	goto again;
 }
 
@@ -1871,11 +1867,11 @@ struct dentry *d_lookup(struct dentry *parent, struct qstr *name)
 	unsigned seq;
 
         do {
-                seq = read_seqbegin(&rename_lock);
+                seq = read_seqbegin(&parent->d_sb->s_rename_lock);
                 dentry = __d_lookup(parent, name);
                 if (dentry)
 			break;
-	} while (read_seqretry(&rename_lock, seq));
+	} while (read_seqretry(&parent->d_sb->s_rename_lock, seq));
 	return dentry;
 }
 EXPORT_SYMBOL(d_lookup);
@@ -2237,8 +2233,9 @@ void d_move(struct dentry * dentry, struct dentry * target)
 
 	BUG_ON(d_ancestor(dentry, target));
 	BUG_ON(d_ancestor(target, dentry));
+	BUG_ON(dentry->d_sb != target->d_sb);
 
-	write_seqlock(&rename_lock);
+	write_seqlock(&dentry->d_sb->s_rename_lock);
 
 	dentry_lock_for_move(dentry, target);
 
@@ -2285,7 +2282,7 @@ void d_move(struct dentry * dentry, struct dentry * target)
 	spin_unlock(&target->d_lock);
 	fsnotify_d_move(dentry);
 	spin_unlock(&dentry->d_lock);
-	write_sequnlock(&rename_lock);
+	write_sequnlock(&dentry->d_sb->s_rename_lock);
 }
 EXPORT_SYMBOL(d_move);
 
@@ -2499,8 +2496,8 @@ static int prepend_path(const struct path *path, struct path *root,
 	bool slash = false;
 	int error = 0;
 
-	read_seqlock(&rename_lock);
 	br_read_lock(vfsmount_lock);
+	read_seqlock(&vfsmnt->mnt_sb->s_rename_lock);
 	while (dentry != root->dentry || vfsmnt != root->mnt) {
 		struct dentry * parent;
 
@@ -2510,7 +2507,9 @@ static int prepend_path(const struct path *path, struct path *root,
 				goto global_root;
 			}
 			dentry = vfsmnt->mnt_mountpoint;
+			read_sequnlock(&vfsmnt->mnt_sb->s_rename_lock);
 			vfsmnt = vfsmnt->mnt_parent;
+			read_seqlock(&vfsmnt->mnt_sb->s_rename_lock);
 			continue;
 		}
 		parent = dentry->d_parent;
@@ -2531,8 +2530,8 @@ out:
 	if (!error && !slash)
 		error = prepend(buffer, buflen, "/", 1);
 
+	read_sequnlock(&vfsmnt->mnt_sb->s_rename_lock);
 	br_read_unlock(vfsmount_lock);
-	read_sequnlock(&rename_lock);
 	return error;
 
 global_root:
@@ -2703,6 +2702,7 @@ static char *__dentry_path(struct dentry *dentry, char *buf, int buflen)
 {
 	char *end = buf + buflen;
 	char *retval;
+	struct super_block *sb = dentry->d_sb;
 
 	prepend(&end, &buflen, "\0", 1);
 	if (buflen < 1)
@@ -2711,7 +2711,7 @@ static char *__dentry_path(struct dentry *dentry, char *buf, int buflen)
 	retval = end-1;
 	*retval = '/';
 
-	read_seqlock(&rename_lock);
+	read_seqlock(&sb->s_rename_lock);
 	while (!IS_ROOT(dentry)) {
 		struct dentry *parent = dentry->d_parent;
 		int error;
@@ -2726,10 +2726,10 @@ static char *__dentry_path(struct dentry *dentry, char *buf, int buflen)
 		retval = end;
 		dentry = parent;
 	}
-	read_sequnlock(&rename_lock);
+	read_sequnlock(&sb->s_rename_lock);
 	return retval;
 Elong:
-	read_sequnlock(&rename_lock);
+	read_sequnlock(&sb->s_rename_lock);
 	return ERR_PTR(-ENAMETOOLONG);
 }
 
@@ -2848,7 +2848,7 @@ int is_subdir(struct dentry *new_dentry, struct dentry *old_dentry)
 
 	do {
 		/* for restarting inner loop in case of seq retry */
-		seq = read_seqbegin(&rename_lock);
+		seq = read_seqbegin(&old_dentry->d_sb->s_rename_lock);
 		/*
 		 * Need rcu_readlock to protect against the d_parent trashing
 		 * due to d_move
@@ -2859,7 +2859,7 @@ int is_subdir(struct dentry *new_dentry, struct dentry *old_dentry)
 		else
 			result = 0;
 		rcu_read_unlock();
-	} while (read_seqretry(&rename_lock, seq));
+	} while (read_seqretry(&old_dentry->d_sb->s_rename_lock, seq));
 
 	return result;
 }
@@ -2896,7 +2896,7 @@ void d_genocide(struct dentry *root)
 	unsigned seq;
 	int locked = 0;
 
-	seq = read_seqbegin(&rename_lock);
+	seq = read_seqbegin(&root->d_sb->s_rename_lock);
 again:
 	this_parent = root;
 	spin_lock(&this_parent->d_lock);
@@ -2943,7 +2943,7 @@ resume:
 		/* might go back up the wrong parent if we have had a rename
 		 * or deletion */
 		if (this_parent != child->d_parent ||
-			 (!locked && read_seqretry(&rename_lock, seq))) {
+			 (!locked && read_seqretry(&root->d_sb->s_rename_lock, seq))) {
 			spin_unlock(&this_parent->d_lock);
 			rcu_read_unlock();
 			goto rename_retry;
@@ -2953,15 +2953,15 @@ resume:
 		goto resume;
 	}
 	spin_unlock(&this_parent->d_lock);
-	if (!locked && read_seqretry(&rename_lock, seq))
+	if (!locked && read_seqretry(&root->d_sb->s_rename_lock, seq))
 		goto rename_retry;
 	if (locked)
-		read_sequnlock(&rename_lock);
+		read_sequnlock(&root->d_sb->s_rename_lock);
 	return;
 
 rename_retry:
 	locked = 1;
-	read_seqlock(&rename_lock);
+	read_seqlock(&root->d_sb->s_rename_lock);
 	goto again;
 }
 
diff --git a/fs/nfs/namespace.c b/fs/nfs/namespace.c
index f32b860..bca6732 100644
--- a/fs/nfs/namespace.c
+++ b/fs/nfs/namespace.c
@@ -58,7 +58,7 @@ rename_retry:
 	*--end = '\0';
 	buflen--;
 
-	seq = read_seqbegin(&rename_lock);
+	seq = read_seqbegin(&droot->d_sb->s_rename_lock);
 	rcu_read_lock();
 	while (!IS_ROOT(dentry) && dentry != droot) {
 		namelen = dentry->d_name.len;
@@ -71,7 +71,7 @@ rename_retry:
 		dentry = dentry->d_parent;
 	}
 	rcu_read_unlock();
-	if (read_seqretry(&rename_lock, seq))
+	if (read_seqretry(&droot->d_sb->s_rename_lock, seq))
 		goto rename_retry;
 	if (*end != '/') {
 		if (--buflen < 0)
@@ -90,7 +90,7 @@ rename_retry:
 	return end;
 Elong_unlock:
 	rcu_read_unlock();
-	if (read_seqretry(&rename_lock, seq))
+	if (read_seqretry(&droot->d_sb->s_rename_lock, seq))
 		goto rename_retry;
 Elong:
 	return ERR_PTR(-ENAMETOOLONG);
diff --git a/fs/super.c b/fs/super.c
index 7e9dd4c..0293e6e 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -103,6 +103,7 @@ static struct super_block *alloc_super(struct file_system_type *type)
 		s->s_count = 1;
 		atomic_set(&s->s_active, 1);
 		mutex_init(&s->s_vfs_rename_mutex);
+		seqlock_init(&s->s_rename_lock);
 		lockdep_set_class(&s->s_vfs_rename_mutex, &type->s_vfs_rename_key);
 		mutex_init(&s->s_dquot.dqio_mutex);
 		mutex_init(&s->s_dquot.dqonoff_mutex);
diff --git a/include/linux/dcache.h b/include/linux/dcache.h
index f958c19..8834f58 100644
--- a/include/linux/dcache.h
+++ b/include/linux/dcache.h
@@ -219,8 +219,6 @@ struct dentry_operations {
 #define DCACHE_MANAGED_DENTRY \
 	(DCACHE_MOUNTED|DCACHE_NEED_AUTOMOUNT|DCACHE_MANAGE_TRANSIT)
 
-extern seqlock_t rename_lock;
-
 static inline int dname_external(struct dentry *dentry)
 {
 	return dentry->d_name.name != dentry->d_iname;
diff --git a/include/linux/fs.h b/include/linux/fs.h
index e38b50a..63b4b7a 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1414,6 +1414,7 @@ struct super_block {
 	 * even looking at it. You had been warned.
 	 */
 	struct mutex s_vfs_rename_mutex;	/* Kludge */
+	seqlock_t s_rename_lock;
 
 	/*
 	 * Filesystem subtype.  If non-empty the filesystem type field
diff --git a/kernel/auditsc.c b/kernel/auditsc.c
index f49a031..a5e26bf 100644
--- a/kernel/auditsc.c
+++ b/kernel/auditsc.c
@@ -1772,7 +1772,7 @@ retry:
 	drop = NULL;
 	d = dentry;
 	rcu_read_lock();
-	seq = read_seqbegin(&rename_lock);
+	seq = read_seqbegin(&dentry->d_sb->s_rename_lock);
 	for(;;) {
 		struct inode *inode = d->d_inode;
 		if (inode && unlikely(!hlist_empty(&inode->i_fsnotify_marks))) {
@@ -1790,7 +1790,7 @@ retry:
 			break;
 		d = parent;
 	}
-	if (unlikely(read_seqretry(&rename_lock, seq) || drop)) {  /* in this order */
+	if (unlikely(read_seqretry(&dentry->d_sb->s_rename_lock, seq) || drop)) {  /* in this order */
 		rcu_read_unlock();
 		if (!drop) {
 			/* just a race with rename */
-- 
1.5.5.6


* [PATCH 4/13] vfs: Factor out tree (of four) shrinkers code
  2011-05-03 12:14 [RFC][PATCH 0/13] Per-container dcache management (and a bit more) Pavel Emelyanov
                   ` (2 preceding siblings ...)
  2011-05-03 12:16 ` [PATCH 3/13] vfs: Make the rename_lock per-sb Pavel Emelyanov
@ 2011-05-03 12:16 ` Pavel Emelyanov
  2011-05-03 12:17 ` [PATCH 5/13] vfs: Make dentry LRU list global Pavel Emelyanov
                   ` (10 subsequent siblings)
  14 siblings, 0 replies; 23+ messages in thread
From: Pavel Emelyanov @ 2011-05-03 12:16 UTC (permalink / raw)
  To: Hugh Dickins, Nick Piggin, Andrea Arcangeli, Rik van Riel,
	Dave Hansen, Alexa
  Cc: linux-fsdevel

There are 4 dcache shrinkers in dcache.c now:

* prune_dcache, which shrinks dentries from the LRU list;
* shrink_dcache_parent, which tries to remove unused dentries from a
subtree starting at a given point;
* shrink_dcache_sb, which wants to achieve the same result as the
previous one, but starting from the given sb root;
* shrink_dcache_for_umount, which also kills the dcache of a super
block, but assumes that the whole subtree is unreachable and unused
and thus uses lighter locking and stricter checks.

That said, the last 3 shrinkers do essentially the same thing and
deserve to be factored out.

shrink_dcache_parent uses the same tree walk as I propose to shrink
the dentry subtree from the leaves to the roots, so things do not get
worse after this change.

shrink_dcache_for_umount_subtree does the same walk as the one I
propose; the difference is that the new code will take the rename_lock.
This is OK, since the rename_lock is now per-sb and is taken on a dead
super block, and thus isn't contended.

shrink_dcache_sb is O(n) before and after the change and seems OK
as well.

The walker code I use was copy-and-pasted from the have_submounts
routine. Yes, it's worth merging them (and d_genocide ;) ).
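
With that in place the remaining shrinkers become thin wrappers around
the new helper, roughly:

	shrink_dcache_parent(parent)  -> shrink_dcache_subtree(parent, 0);
	shrink_dcache_sb(sb)          -> shrink_dcache_subtree(sb->s_root, 0);
	shrink_dcache_for_umount(sb)  -> shrink_dcache_subtree(root, 1)
	                                 for s_root and each s_anon root;

where with_root == 1 means the root itself must already be unused and
gets isolated and killed together with its subtree.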

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>

---
 fs/dcache.c |  365 +++++++++++++++++++++--------------------------------------
 1 files changed, 130 insertions(+), 235 deletions(-)

diff --git a/fs/dcache.c b/fs/dcache.c
index 976a917..67d6590 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -262,19 +262,6 @@ static void dentry_lru_del(struct dentry *dentry)
 	}
 }
 
-static void dentry_lru_move_tail(struct dentry *dentry)
-{
-	spin_lock(&dcache_lru_lock);
-	if (list_empty(&dentry->d_lru)) {
-		list_add_tail(&dentry->d_lru, &dentry->d_sb->s_dentry_lru);
-		dentry->d_sb->s_nr_dentry_unused++;
-		dentry_stat.nr_unused++;
-	} else {
-		list_move_tail(&dentry->d_lru, &dentry->d_sb->s_dentry_lru);
-	}
-	spin_unlock(&dcache_lru_lock);
-}
-
 /**
  * d_kill - kill dentry and return parent
  * @dentry: dentry to kill
@@ -649,6 +636,132 @@ restart:
 }
 EXPORT_SYMBOL(d_prune_aliases);
 
+static void isolate_shrinkable(struct dentry *dentry, struct list_head *list)
+{
+	if (!dentry->d_count) {
+		if (!IS_ROOT(dentry)) {
+			__d_drop(dentry);
+			list_del_init(&dentry->d_u.d_child);
+			dentry->d_parent->d_count--;
+			dentry->d_parent = dentry;
+		}
+
+		spin_lock(&dcache_lru_lock);
+		list_move_tail(&dentry->d_lru, list);
+		spin_unlock(&dcache_lru_lock);
+	} else
+		dentry_lru_del(dentry);
+}
+
+static void collect_shrinkable(struct dentry *root, struct list_head *list)
+{
+	struct dentry *this_parent;
+	struct list_head *next;
+	unsigned seq;
+	int locked = 0;
+
+	seq = read_seqbegin(&root->d_sb->s_rename_lock);
+again:
+	this_parent = root;
+	spin_lock(&this_parent->d_lock);
+repeat:
+	next = this_parent->d_subdirs.next;
+resume:
+	while (next != &this_parent->d_subdirs) {
+		struct list_head *tmp = next;
+		struct dentry *dentry = list_entry(tmp, struct dentry, d_u.d_child);
+		next = tmp->next;
+
+		spin_lock_nested(&dentry->d_lock, DENTRY_D_LOCK_NESTED);
+		/*
+		 * Descend a level if the d_subdirs list is non-empty.
+		 */
+		if (!list_empty(&dentry->d_subdirs)) {
+			spin_unlock(&this_parent->d_lock);
+			spin_release(&dentry->d_lock.dep_map, 1, _RET_IP_);
+			this_parent = dentry;
+			spin_acquire(&this_parent->d_lock.dep_map, 0, 1, _RET_IP_);
+			goto repeat;
+		}
+
+		isolate_shrinkable(dentry, list);
+
+		spin_unlock(&dentry->d_lock);
+	}
+	/*
+	 * All done at this level ... ascend and resume the search.
+	 */
+	if (this_parent != root) {
+		struct dentry *tmp;
+		struct dentry *child;
+
+		tmp = this_parent->d_parent;
+		rcu_read_lock();
+		spin_unlock(&this_parent->d_lock);
+		child = this_parent;
+		this_parent = tmp;
+		spin_lock(&this_parent->d_lock);
+		/* might go back up the wrong parent if we have had a rename
+		 * or deletion */
+		if (this_parent != child->d_parent ||
+			(!locked && read_seqretry(&root->d_sb->s_rename_lock, seq))) {
+			spin_unlock(&this_parent->d_lock);
+			rcu_read_unlock();
+			goto rename_retry;
+		}
+		rcu_read_unlock();
+		next = child->d_u.d_child.next;
+
+		spin_lock_nested(&child->d_lock, DENTRY_D_LOCK_NESTED);
+		isolate_shrinkable(child, list);
+		spin_unlock(&child->d_lock);
+
+		goto resume;
+	}
+
+	spin_unlock(&this_parent->d_lock);
+	if (!locked && read_seqretry(&root->d_sb->s_rename_lock, seq))
+		goto rename_retry;
+	if (locked)
+		read_sequnlock(&root->d_sb->s_rename_lock);
+	return;
+
+rename_retry:
+	locked = 1;
+	read_seqlock(&root->d_sb->s_rename_lock);
+	goto again;
+}
+
+static void shrink_dcache_subtree(struct dentry *root, int with_root)
+{
+	LIST_HEAD(tmp);
+	struct dentry *de;
+
+	collect_shrinkable(root, &tmp);
+
+	if (with_root) {
+		if (root->d_count) {
+			printk("Root count is %d, kids:\n", root->d_count);
+			list_for_each_entry(de, &root->d_subdirs, d_u.d_child)
+				printk("  %s [%p, %d, %p, %s]\n",
+						de->d_name.name, de, de->d_count,
+						de->d_parent, de->d_parent->d_name.name);
+			BUG();
+		}
+
+		isolate_shrinkable(root, &tmp);
+	}
+
+	while (!list_empty(&tmp)) {
+		de = list_first_entry(&tmp, struct dentry, d_lru);
+		do {
+			spin_lock(&de->d_lock);
+			BUG_ON(de->d_count);
+			de = dentry_kill(de, 0);
+		} while (de);
+	}
+}
+
 /*
  * Try to throw away a dentry - free the inode, dput the parent.
  * Requires dentry->d_lock is held, and dentry->d_count == 0.
@@ -865,16 +978,7 @@ static void prune_dcache(int count)
  */
 void shrink_dcache_sb(struct super_block *sb)
 {
-	LIST_HEAD(tmp);
-
-	spin_lock(&dcache_lru_lock);
-	while (!list_empty(&sb->s_dentry_lru)) {
-		list_splice_init(&sb->s_dentry_lru, &tmp);
-		spin_unlock(&dcache_lru_lock);
-		shrink_dentry_list(&tmp);
-		spin_lock(&dcache_lru_lock);
-	}
-	spin_unlock(&dcache_lru_lock);
+	shrink_dcache_subtree(sb->s_root, 0);
 }
 EXPORT_SYMBOL(shrink_dcache_sb);
 
@@ -883,100 +987,6 @@ EXPORT_SYMBOL(shrink_dcache_sb);
  * - see the comments on shrink_dcache_for_umount() for a description of the
  *   locking
  */
-static void shrink_dcache_for_umount_subtree(struct dentry *dentry)
-{
-	struct dentry *parent;
-	unsigned detached = 0;
-
-	BUG_ON(!IS_ROOT(dentry));
-
-	/* detach this root from the system */
-	spin_lock(&dentry->d_lock);
-	dentry_lru_del(dentry);
-	__d_drop(dentry);
-	spin_unlock(&dentry->d_lock);
-
-	for (;;) {
-		/* descend to the first leaf in the current subtree */
-		while (!list_empty(&dentry->d_subdirs)) {
-			struct dentry *loop;
-
-			/* this is a branch with children - detach all of them
-			 * from the system in one go */
-			spin_lock(&dentry->d_lock);
-			list_for_each_entry(loop, &dentry->d_subdirs,
-					    d_u.d_child) {
-				spin_lock_nested(&loop->d_lock,
-						DENTRY_D_LOCK_NESTED);
-				dentry_lru_del(loop);
-				__d_drop(loop);
-				spin_unlock(&loop->d_lock);
-			}
-			spin_unlock(&dentry->d_lock);
-
-			/* move to the first child */
-			dentry = list_entry(dentry->d_subdirs.next,
-					    struct dentry, d_u.d_child);
-		}
-
-		/* consume the dentries from this leaf up through its parents
-		 * until we find one with children or run out altogether */
-		do {
-			struct inode *inode;
-
-			if (dentry->d_count != 0) {
-				printk(KERN_ERR
-				       "BUG: Dentry %p{i=%lx,n=%s}"
-				       " still in use (%d)"
-				       " [unmount of %s %s]\n",
-				       dentry,
-				       dentry->d_inode ?
-				       dentry->d_inode->i_ino : 0UL,
-				       dentry->d_name.name,
-				       dentry->d_count,
-				       dentry->d_sb->s_type->name,
-				       dentry->d_sb->s_id);
-				BUG();
-			}
-
-			if (IS_ROOT(dentry)) {
-				parent = NULL;
-				list_del(&dentry->d_u.d_child);
-			} else {
-				parent = dentry->d_parent;
-				spin_lock(&parent->d_lock);
-				parent->d_count--;
-				list_del(&dentry->d_u.d_child);
-				spin_unlock(&parent->d_lock);
-			}
-
-			detached++;
-
-			inode = dentry->d_inode;
-			if (inode) {
-				dentry->d_inode = NULL;
-				list_del_init(&dentry->d_alias);
-				if (dentry->d_op && dentry->d_op->d_iput)
-					dentry->d_op->d_iput(dentry, inode);
-				else
-					iput(inode);
-			}
-
-			d_free(dentry);
-
-			/* finished when we fall off the top of the tree,
-			 * otherwise we ascend to the parent and move to the
-			 * next sibling if there is one */
-			if (!parent)
-				return;
-			dentry = parent;
-		} while (list_empty(&dentry->d_subdirs));
-
-		dentry = list_entry(dentry->d_subdirs.next,
-				    struct dentry, d_u.d_child);
-	}
-}
-
 /*
  * destroy the dentries attached to a superblock on unmounting
  * - we don't need to use dentry->d_lock because:
@@ -999,11 +1009,11 @@ void shrink_dcache_for_umount(struct super_block *sb)
 	spin_lock(&dentry->d_lock);
 	dentry->d_count--;
 	spin_unlock(&dentry->d_lock);
-	shrink_dcache_for_umount_subtree(dentry);
+	shrink_dcache_subtree(dentry, 1);
 
 	while (!hlist_bl_empty(&sb->s_anon)) {
 		dentry = hlist_bl_entry(hlist_bl_first(&sb->s_anon), struct dentry, d_hash);
-		shrink_dcache_for_umount_subtree(dentry);
+		shrink_dcache_subtree(dentry, 1);
 	}
 }
 
@@ -1103,117 +1113,6 @@ rename_retry:
 }
 EXPORT_SYMBOL(have_submounts);
 
-/*
- * Search the dentry child list for the specified parent,
- * and move any unused dentries to the end of the unused
- * list for prune_dcache(). We descend to the next level
- * whenever the d_subdirs list is non-empty and continue
- * searching.
- *
- * It returns zero iff there are no unused children,
- * otherwise  it returns the number of children moved to
- * the end of the unused list. This may not be the total
- * number of unused children, because select_parent can
- * drop the lock and return early due to latency
- * constraints.
- */
-static int select_parent(struct dentry * parent)
-{
-	struct dentry *this_parent;
-	struct list_head *next;
-	unsigned seq;
-	int found = 0;
-	int locked = 0;
-
-	seq = read_seqbegin(&parent->d_sb->s_rename_lock);
-again:
-	this_parent = parent;
-	spin_lock(&this_parent->d_lock);
-repeat:
-	next = this_parent->d_subdirs.next;
-resume:
-	while (next != &this_parent->d_subdirs) {
-		struct list_head *tmp = next;
-		struct dentry *dentry = list_entry(tmp, struct dentry, d_u.d_child);
-		next = tmp->next;
-
-		spin_lock_nested(&dentry->d_lock, DENTRY_D_LOCK_NESTED);
-
-		/* 
-		 * move only zero ref count dentries to the end 
-		 * of the unused list for prune_dcache
-		 */
-		if (!dentry->d_count) {
-			dentry_lru_move_tail(dentry);
-			found++;
-		} else {
-			dentry_lru_del(dentry);
-		}
-
-		/*
-		 * We can return to the caller if we have found some (this
-		 * ensures forward progress). We'll be coming back to find
-		 * the rest.
-		 */
-		if (found && need_resched()) {
-			spin_unlock(&dentry->d_lock);
-			goto out;
-		}
-
-		/*
-		 * Descend a level if the d_subdirs list is non-empty.
-		 */
-		if (!list_empty(&dentry->d_subdirs)) {
-			spin_unlock(&this_parent->d_lock);
-			spin_release(&dentry->d_lock.dep_map, 1, _RET_IP_);
-			this_parent = dentry;
-			spin_acquire(&this_parent->d_lock.dep_map, 0, 1, _RET_IP_);
-			goto repeat;
-		}
-
-		spin_unlock(&dentry->d_lock);
-	}
-	/*
-	 * All done at this level ... ascend and resume the search.
-	 */
-	if (this_parent != parent) {
-		struct dentry *tmp;
-		struct dentry *child;
-
-		tmp = this_parent->d_parent;
-		rcu_read_lock();
-		spin_unlock(&this_parent->d_lock);
-		child = this_parent;
-		this_parent = tmp;
-		spin_lock(&this_parent->d_lock);
-		/* might go back up the wrong parent if we have had a rename
-		 * or deletion */
-		if (this_parent != child->d_parent ||
-			(!locked && read_seqretry(&parent->d_sb->s_rename_lock, seq))) {
-			spin_unlock(&this_parent->d_lock);
-			rcu_read_unlock();
-			goto rename_retry;
-		}
-		rcu_read_unlock();
-		next = child->d_u.d_child.next;
-		goto resume;
-	}
-out:
-	spin_unlock(&this_parent->d_lock);
-	if (!locked && read_seqretry(&parent->d_sb->s_rename_lock, seq))
-		goto rename_retry;
-	if (locked)
-		read_sequnlock(&parent->d_sb->s_rename_lock);
-	return found;
-
-rename_retry:
-	if (found)
-		return found;
-	locked = 1;
-	read_seqlock(&parent->d_sb->s_rename_lock);
-	goto again;
-}
-
 /**
  * shrink_dcache_parent - prune dcache
  * @parent: parent of entries to prune
@@ -1223,11 +1122,7 @@ rename_retry:
  
 void shrink_dcache_parent(struct dentry * parent)
 {
-	struct super_block *sb = parent->d_sb;
-	int found;
-
-	while ((found = select_parent(parent)) != 0)
-		__shrink_dcache_sb(sb, &found, 0);
+	shrink_dcache_subtree(parent, 0);
 }
 EXPORT_SYMBOL(shrink_dcache_parent);
 
-- 
1.5.5.6


* [PATCH 5/13] vfs: Make dentry LRU list global
  2011-05-03 12:14 [RFC][PATCH 0/13] Per-container dcache management (and a bit more) Pavel Emelyanov
                   ` (3 preceding siblings ...)
  2011-05-03 12:16 ` [PATCH 4/13] vfs: Factor out tree (of four) shrinkers code Pavel Emelyanov
@ 2011-05-03 12:17 ` Pavel Emelyanov
  2011-05-03 12:17 ` [PATCH 6/13] vfs: Turn the nr_dentry into percpu_counter Pavel Emelyanov
                   ` (9 subsequent siblings)
  14 siblings, 0 replies; 23+ messages in thread
From: Pavel Emelyanov @ 2011-05-03 12:17 UTC (permalink / raw)
  To: Hugh Dickins, Nick Piggin, Andrea Arcangeli, Rik van Riel,
	Dave Hansen, Alexa
  Cc: linux-fsdevel

The main intention of the per-sb LRU list was to make umount faster
(commit da3bbdd4: fix soft lock up at NFS mount via per-SB LRU-list
of unused dentries). Now that the umount-time shrinker no longer
depends on this list, the per-sb LRU is not required.

This change is also helpful for the further patches, where I try to
make the dentry LRU lists "per-container".

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>

---
 fs/dcache.c        |  110 +++++++--------------------------------------------
 fs/super.c         |    1 -
 include/linux/fs.h |    4 --
 3 files changed, 15 insertions(+), 100 deletions(-)

diff --git a/fs/dcache.c b/fs/dcache.c
index 67d6590..1c56593 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -232,6 +232,7 @@ static void dentry_unlink_inode(struct dentry * dentry)
 		iput(inode);
 }
 
+static LIST_HEAD(dentry_lru);
 /*
  * dentry_lru_(add|del|move_tail) must be called with d_lock held.
  */
@@ -239,8 +240,7 @@ static void dentry_lru_add(struct dentry *dentry)
 {
 	if (list_empty(&dentry->d_lru)) {
 		spin_lock(&dcache_lru_lock);
-		list_add(&dentry->d_lru, &dentry->d_sb->s_dentry_lru);
-		dentry->d_sb->s_nr_dentry_unused++;
+		list_add(&dentry->d_lru, &dentry_lru);
 		dentry_stat.nr_unused++;
 		spin_unlock(&dcache_lru_lock);
 	}
@@ -249,7 +249,6 @@ static void dentry_lru_add(struct dentry *dentry)
 static void __dentry_lru_del(struct dentry *dentry)
 {
 	list_del_init(&dentry->d_lru);
-	dentry->d_sb->s_nr_dentry_unused--;
 	dentry_stat.nr_unused--;
 }
 
@@ -839,27 +838,26 @@ static void shrink_dentry_list(struct list_head *list)
 }
 
 /**
- * __shrink_dcache_sb - shrink the dentry LRU on a given superblock
- * @sb:		superblock to shrink dentry LRU.
- * @count:	number of entries to prune
- * @flags:	flags to control the dentry processing
+ * prune_dcache - shrink the dcache
+ * @count: number of entries to try to free
+ *
+ * Shrink the dcache. This is done when we need more memory, or simply when we
+ * need to unmount something (at which point we need to unuse all dentries).
  *
- * If flags contains DCACHE_REFERENCED reference dentries will not be pruned.
+ * This function may fail to free any resources if all the dentries are in use.
  */
-static void __shrink_dcache_sb(struct super_block *sb, int *count, int flags)
+
+static void prune_dcache(int count)
 {
 	/* called from prune_dcache() and shrink_dcache_parent() */
 	struct dentry *dentry;
 	LIST_HEAD(referenced);
 	LIST_HEAD(tmp);
-	int cnt = *count;
 
 relock:
 	spin_lock(&dcache_lru_lock);
-	while (!list_empty(&sb->s_dentry_lru)) {
-		dentry = list_entry(sb->s_dentry_lru.prev,
-				struct dentry, d_lru);
-		BUG_ON(dentry->d_sb != sb);
+	while (!list_empty(&dentry_lru)) {
+		dentry = list_entry(dentry_lru.prev, struct dentry, d_lru);
 
 		if (!spin_trylock(&dentry->d_lock)) {
 			spin_unlock(&dcache_lru_lock);
@@ -872,101 +870,23 @@ relock:
 		 * dentry has this flag set, don't free it.  Clear the flag
 		 * and put it back on the LRU.
 		 */
-		if (flags & DCACHE_REFERENCED &&
-				dentry->d_flags & DCACHE_REFERENCED) {
+		if (dentry->d_flags & DCACHE_REFERENCED) {
 			dentry->d_flags &= ~DCACHE_REFERENCED;
 			list_move(&dentry->d_lru, &referenced);
 			spin_unlock(&dentry->d_lock);
 		} else {
 			list_move_tail(&dentry->d_lru, &tmp);
 			spin_unlock(&dentry->d_lock);
-			if (!--cnt)
+			if (!--count)
 				break;
 		}
 		cond_resched_lock(&dcache_lru_lock);
 	}
 	if (!list_empty(&referenced))
-		list_splice(&referenced, &sb->s_dentry_lru);
+		list_splice(&referenced, &dentry_lru);
 	spin_unlock(&dcache_lru_lock);
 
 	shrink_dentry_list(&tmp);
-
-	*count = cnt;
-}
-
-/**
- * prune_dcache - shrink the dcache
- * @count: number of entries to try to free
- *
- * Shrink the dcache. This is done when we need more memory, or simply when we
- * need to unmount something (at which point we need to unuse all dentries).
- *
- * This function may fail to free any resources if all the dentries are in use.
- */
-static void prune_dcache(int count)
-{
-	struct super_block *sb, *p = NULL;
-	int w_count;
-	int unused = dentry_stat.nr_unused;
-	int prune_ratio;
-	int pruned;
-
-	if (unused == 0 || count == 0)
-		return;
-	if (count >= unused)
-		prune_ratio = 1;
-	else
-		prune_ratio = unused / count;
-	spin_lock(&sb_lock);
-	list_for_each_entry(sb, &super_blocks, s_list) {
-		if (list_empty(&sb->s_instances))
-			continue;
-		if (sb->s_nr_dentry_unused == 0)
-			continue;
-		sb->s_count++;
-		/* Now, we reclaim unused dentrins with fairness.
-		 * We reclaim them same percentage from each superblock.
-		 * We calculate number of dentries to scan on this sb
-		 * as follows, but the implementation is arranged to avoid
-		 * overflows:
-		 * number of dentries to scan on this sb =
-		 * count * (number of dentries on this sb /
-		 * number of dentries in the machine)
-		 */
-		spin_unlock(&sb_lock);
-		if (prune_ratio != 1)
-			w_count = (sb->s_nr_dentry_unused / prune_ratio) + 1;
-		else
-			w_count = sb->s_nr_dentry_unused;
-		pruned = w_count;
-		/*
-		 * We need to be sure this filesystem isn't being unmounted,
-		 * otherwise we could race with generic_shutdown_super(), and
-		 * end up holding a reference to an inode while the filesystem
-		 * is unmounted.  So we try to get s_umount, and make sure
-		 * s_root isn't NULL.
-		 */
-		if (down_read_trylock(&sb->s_umount)) {
-			if ((sb->s_root != NULL) &&
-			    (!list_empty(&sb->s_dentry_lru))) {
-				__shrink_dcache_sb(sb, &w_count,
-						DCACHE_REFERENCED);
-				pruned -= w_count;
-			}
-			up_read(&sb->s_umount);
-		}
-		spin_lock(&sb_lock);
-		if (p)
-			__put_super(p);
-		count -= pruned;
-		p = sb;
-		/* more work left to do? */
-		if (count <= 0)
-			break;
-	}
-	if (p)
-		__put_super(p);
-	spin_unlock(&sb_lock);
 }
 
 /**
diff --git a/fs/super.c b/fs/super.c
index 0293e6e..e598558 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -74,7 +74,6 @@ static struct super_block *alloc_super(struct file_system_type *type)
 		INIT_LIST_HEAD(&s->s_instances);
 		INIT_HLIST_BL_HEAD(&s->s_anon);
 		INIT_LIST_HEAD(&s->s_inodes);
-		INIT_LIST_HEAD(&s->s_dentry_lru);
 		init_rwsem(&s->s_umount);
 		mutex_init(&s->s_lock);
 		lockdep_set_class(&s->s_umount, &type->s_umount_key);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 63b4b7a..a236ea3 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1387,10 +1387,6 @@ struct super_block {
 #else
 	struct list_head	s_files;
 #endif
-	/* s_dentry_lru, s_nr_dentry_unused protected by dcache.c lru locks */
-	struct list_head	s_dentry_lru;	/* unused dentry lru */
-	int			s_nr_dentry_unused;	/* # of dentry on lru */
-
 	struct block_device	*s_bdev;
 	struct backing_dev_info *s_bdi;
 	struct mtd_info		*s_mtd;
-- 
1.5.5.6


* [PATCH 6/13] vfs: Turn the nr_dentry into percpu_counter
  2011-05-03 12:14 [RFC][PATCH 0/13] Per-container dcache management (and a bit more) Pavel Emelyanov
                   ` (4 preceding siblings ...)
  2011-05-03 12:17 ` [PATCH 5/13] vfs: Make dentry LRU list global Pavel Emelyanov
@ 2011-05-03 12:17 ` Pavel Emelyanov
  2011-05-03 12:18 ` [PATCH 7/13] vfs: Limit the number of dentries globally Pavel Emelyanov
                   ` (8 subsequent siblings)
  14 siblings, 0 replies; 23+ messages in thread
From: Pavel Emelyanov @ 2011-05-03 12:17 UTC (permalink / raw)
  To: Hugh Dickins, Nick Piggin, Andrea Arcangeli, Rik van Riel,
	Dave Hansen, Alexa
  Cc: linux-fsdevel

The plain per-cpu variable was OK when we only needed to account for
the number of dentries allocated. Now we will have to check that the
number of dentries is kept below some limit, and a percpu_counter is
more suitable for this.
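
The pattern (used by the next patch) is a cheap approximate check on
the fast path and an exact sum only when we look close to the limit:

	/* fast path: approximate per-cpu read, no cross-CPU traffic */
	if (percpu_counter_read_positive(&nr_dentry) <= nr_dentry_max)
		return 0;
	/* slow path: precise (summed) value */
	if (percpu_counter_sum_positive(&nr_dentry) <= nr_dentry_max)
		return 0;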

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>

---
 fs/dcache.c            |   18 +++++-------------
 include/linux/dcache.h |    1 +
 2 files changed, 6 insertions(+), 13 deletions(-)

diff --git a/fs/dcache.c b/fs/dcache.c
index 1c56593..fa5b7fa 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -123,22 +123,13 @@ struct dentry_stat_t dentry_stat = {
 	.age_limit = 45,
 };
 
-static DEFINE_PER_CPU(unsigned int, nr_dentry);
+static struct percpu_counter nr_dentry __cacheline_aligned_in_smp;
 
 #if defined(CONFIG_SYSCTL) && defined(CONFIG_PROC_FS)
-static int get_nr_dentry(void)
-{
-	int i;
-	int sum = 0;
-	for_each_possible_cpu(i)
-		sum += per_cpu(nr_dentry, i);
-	return sum < 0 ? 0 : sum;
-}
-
 int proc_nr_dentry(ctl_table *table, int write, void __user *buffer,
 		   size_t *lenp, loff_t *ppos)
 {
-	dentry_stat.nr_dentry = get_nr_dentry();
+	dentry_stat.nr_dentry = percpu_counter_sum_positive(&nr_dentry);
 	return proc_dointvec(table, write, buffer, lenp, ppos);
 }
 #endif
@@ -159,7 +150,7 @@ static void __d_free(struct rcu_head *head)
 static void d_free(struct dentry *dentry)
 {
 	BUG_ON(dentry->d_count);
-	this_cpu_dec(nr_dentry);
+	percpu_counter_dec(&nr_dentry);
 	if (dentry->d_op && dentry->d_op->d_release)
 		dentry->d_op->d_release(dentry);
 
@@ -1138,7 +1129,7 @@ struct dentry *d_alloc(struct dentry * parent, const struct qstr *name)
 		spin_unlock(&parent->d_lock);
 	}
 
-	this_cpu_inc(nr_dentry);
+	percpu_counter_inc(&nr_dentry);
 
 	return dentry;
 }
@@ -2847,6 +2838,7 @@ static void __init dcache_init(void)
 {
 	int loop;
 
+	percpu_counter_init(&nr_dentry, 0);
 	/* 
 	 * A constructor could be added for stable state like the lists,
 	 * but it is probably not worth it because of the cache nature
diff --git a/include/linux/dcache.h b/include/linux/dcache.h
index 8834f58..64848dd 100644
--- a/include/linux/dcache.h
+++ b/include/linux/dcache.h
@@ -9,6 +9,7 @@
 #include <linux/seqlock.h>
 #include <linux/cache.h>
 #include <linux/rcupdate.h>
+#include <linux/percpu_counter.h>
 
 struct nameidata;
 struct path;
-- 
1.5.5.6


* [PATCH 7/13] vfs: Limit the number of dentries globally
  2011-05-03 12:14 [RFC][PATCH 0/13] Per-container dcache management (and a bit more) Pavel Emelyanov
                   ` (5 preceding siblings ...)
  2011-05-03 12:17 ` [PATCH 6/13] vfs: Turn the nr_dentry into percpu_counter Pavel Emelyanov
@ 2011-05-03 12:18 ` Pavel Emelyanov
  2011-05-03 12:18 ` [PATCH 8/13] vfs: Introduce the dentry mobs Pavel Emelyanov
                   ` (7 subsequent siblings)
  14 siblings, 0 replies; 23+ messages in thread
From: Pavel Emelyanov @ 2011-05-03 12:18 UTC (permalink / raw)
  To: Hugh Dickins, Nick Piggin, Andrea Arcangeli, Rik van Riel,
	Dave Hansen, Alexa
  Cc: linux-fsdevel

When we try to allocate more than the limit allows (within the
percpu_counter accuracy) the dcache is shrunk. If no dentries can be
shrunk, the allocation fails.

This is just to introduce the dcache_mem_check routine. The actual
limit imposed on the overall dcache size is set to 80MB. The API for
configuring the limit is still to be discussed, so I made it hard
coded rather than introduce some fancy API that will have to be
changed later.

The number of dentries to shrink before retrying the allocation is
hard coded to limit/64, again to avoid introducing the not yet
discussed configuration API.
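
Just to give a feeling for the numbers: assuming a struct dentry of
roughly 200 bytes, the 80MB cap corresponds to about
80 * 1024 * 1024 / 200 ~= 420 thousand dentries, and each prune pass
tries to shrink about limit/64 ~= 6.5 thousand of them before the
allocation is retried.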

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>

---
 fs/dcache.c |   41 +++++++++++++++++++++++++++++++++--------
 1 files changed, 33 insertions(+), 8 deletions(-)

diff --git a/fs/dcache.c b/fs/dcache.c
index fa5b7fa..c5179c3 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -124,6 +124,7 @@ struct dentry_stat_t dentry_stat = {
 };
 
 static struct percpu_counter nr_dentry __cacheline_aligned_in_smp;
+static unsigned long nr_dentry_max;
 
 #if defined(CONFIG_SYSCTL) && defined(CONFIG_PROC_FS)
 int proc_nr_dentry(ctl_table *table, int write, void __user *buffer,
@@ -759,7 +760,7 @@ static void shrink_dcache_subtree(struct dentry *root, int with_root)
  *
  * This may fail if locks cannot be acquired no problem, just try again.
  */
-static void try_prune_one_dentry(struct dentry *dentry)
+static int try_prune_one_dentry(struct dentry *dentry)
 	__releases(dentry->d_lock)
 {
 	struct dentry *parent;
@@ -776,9 +777,9 @@ static void try_prune_one_dentry(struct dentry *dentry)
 	 * fragmentation.
 	 */
 	if (!parent)
-		return;
+		return 1;
 	if (parent == dentry)
-		return;
+		return 0;
 
 	/* Prune ancestors. */
 	dentry = parent;
@@ -787,15 +788,18 @@ static void try_prune_one_dentry(struct dentry *dentry)
 		if (dentry->d_count > 1) {
 			dentry->d_count--;
 			spin_unlock(&dentry->d_lock);
-			return;
+			return 1;
 		}
 		dentry = dentry_kill(dentry, 1);
 	}
+
+	return 1;
 }
 
-static void shrink_dentry_list(struct list_head *list)
+static int shrink_dentry_list(struct list_head *list)
 {
 	struct dentry *dentry;
+	int did_progress = 0;
 
 	rcu_read_lock();
 	for (;;) {
@@ -821,11 +825,13 @@ static void shrink_dentry_list(struct list_head *list)
 
 		rcu_read_unlock();
 
-		try_prune_one_dentry(dentry);
+		did_progress |= try_prune_one_dentry(dentry);
 
 		rcu_read_lock();
 	}
 	rcu_read_unlock();
+
+	return did_progress;
 }
 
 /**
@@ -838,7 +844,7 @@ static void shrink_dentry_list(struct list_head *list)
  * This function may fail to free any resources if all the dentries are in use.
  */
 
-static void prune_dcache(int count)
+static int prune_dcache(int count)
 {
 	/* called from prune_dcache() and shrink_dcache_parent() */
 	struct dentry *dentry;
@@ -877,7 +883,7 @@ relock:
 		list_splice(&referenced, &dentry_lru);
 	spin_unlock(&dcache_lru_lock);
 
-	shrink_dentry_list(&tmp);
+	return shrink_dentry_list(&tmp);
 }
 
 /**
@@ -1065,6 +1071,19 @@ static struct shrinker dcache_shrinker = {
 	.seeks = DEFAULT_SEEKS,
 };
 
+static int dcache_mem_check(void)
+{
+	if (percpu_counter_read_positive(&nr_dentry) <= nr_dentry_max)
+		return 0;
+
+	do {
+		if (percpu_counter_sum_positive(&nr_dentry) <= nr_dentry_max)
+			return 0;
+	} while (prune_dcache(nr_dentry_max >> 6) > 0);
+
+	return -ENOMEM;
+}
+
 /**
  * d_alloc	-	allocate a dcache entry
  * @parent: parent of entry to allocate
@@ -1080,6 +1099,9 @@ struct dentry *d_alloc(struct dentry * parent, const struct qstr *name)
 	struct dentry *dentry;
 	char *dname;
 
+	if (dcache_mem_check())
+		return NULL;
+
 	dentry = kmem_cache_alloc(dentry_cache, GFP_KERNEL);
 	if (!dentry)
 		return NULL;
@@ -2839,6 +2861,9 @@ static void __init dcache_init(void)
 	int loop;
 
 	percpu_counter_init(&nr_dentry, 0);
+	nr_dentry_max = 80 * 1024 * 1024 / sizeof(struct dentry);
+	printk("nr_dentry_max = %lu\n", nr_dentry_max);
+
 	/* 
 	 * A constructor could be added for stable state like the lists,
 	 * but it is probably not worth it because of the cache nature
-- 
1.5.5.6


* [PATCH 8/13] vfs: Introduce the dentry mobs
  2011-05-03 12:14 [RFC][PATCH 0/13] Per-container dcache management (and a bit more) Pavel Emelyanov
                   ` (6 preceding siblings ...)
  2011-05-03 12:18 ` [PATCH 7/13] vfs: Limit the number of dentries globally Pavel Emelyanov
@ 2011-05-03 12:18 ` Pavel Emelyanov
  2011-06-18 13:40   ` Andrea Arcangeli
  2011-05-03 12:18 ` [PATCH 9/13] vfs: More than one mob management Pavel Emelyanov
                   ` (6 subsequent siblings)
  14 siblings, 1 reply; 23+ messages in thread
From: Pavel Emelyanov @ 2011-05-03 12:18 UTC (permalink / raw)
  To: Hugh Dickins, Nick Piggin, Andrea Arcangeli, Rik van Riel,
	Dave Hansen, Alexa
  Cc: linux-fsdevel

The dentry mobs are sets of dentries that

* belong to one subtree growing from some dentry;
* are limited in size per-mob;
* have their unused dentries stored in the mob's LRU list.

At this point all this patch does is move the nr_dentry, nr_dentry_max
and dentry_lru onto a static init_dentry_mob structure. The ability to
create more mobs will be introduced in the next patches.
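
Schematically, the structure these fields move onto looks like this (a
sketch based on how the fields are used below; the exact layout is in
the dcache.h part of the patch):

	struct dentry_mob {
		struct percpu_counter	nr_dentry;	/* dentries in the mob */
		unsigned long		nr_dentry_max;	/* per-mob limit */
		struct list_head	dentry_lru;	/* unused dentries */
	};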

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>

---
 fs/dcache.c            |   46 ++++++++++++++++++++++++++--------------------
 include/linux/dcache.h |    7 +++++++
 2 files changed, 33 insertions(+), 20 deletions(-)

diff --git a/fs/dcache.c b/fs/dcache.c
index c5179c3..bfe047d 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -123,14 +123,14 @@ struct dentry_stat_t dentry_stat = {
 	.age_limit = 45,
 };
 
-static struct percpu_counter nr_dentry __cacheline_aligned_in_smp;
-static unsigned long nr_dentry_max;
+static struct dentry_mob init_dentry_mob;
 
 #if defined(CONFIG_SYSCTL) && defined(CONFIG_PROC_FS)
 int proc_nr_dentry(ctl_table *table, int write, void __user *buffer,
 		   size_t *lenp, loff_t *ppos)
 {
-	dentry_stat.nr_dentry = percpu_counter_sum_positive(&nr_dentry);
+	dentry_stat.nr_dentry =
+		percpu_counter_sum_positive(&init_dentry_mob.nr_dentry);
 	return proc_dointvec(table, write, buffer, lenp, ppos);
 }
 #endif
@@ -151,7 +151,7 @@ static void __d_free(struct rcu_head *head)
 static void d_free(struct dentry *dentry)
 {
 	BUG_ON(dentry->d_count);
-	percpu_counter_dec(&nr_dentry);
+	percpu_counter_dec(&dentry->d_mob->nr_dentry);
 	if (dentry->d_op && dentry->d_op->d_release)
 		dentry->d_op->d_release(dentry);
 
@@ -224,7 +224,6 @@ static void dentry_unlink_inode(struct dentry * dentry)
 		iput(inode);
 }
 
-static LIST_HEAD(dentry_lru);
 /*
  * dentry_lru_(add|del|move_tail) must be called with d_lock held.
  */
@@ -232,7 +231,7 @@ static void dentry_lru_add(struct dentry *dentry)
 {
 	if (list_empty(&dentry->d_lru)) {
 		spin_lock(&dcache_lru_lock);
-		list_add(&dentry->d_lru, &dentry_lru);
+		list_add(&dentry->d_lru, &dentry->d_mob->dentry_lru);
 		dentry_stat.nr_unused++;
 		spin_unlock(&dcache_lru_lock);
 	}
@@ -844,7 +843,7 @@ static int shrink_dentry_list(struct list_head *list)
  * This function may fail to free any resources if all the dentries are in use.
  */
 
-static int prune_dcache(int count)
+static int prune_dcache(struct dentry_mob *mob, int count)
 {
 	/* called from prune_dcache() and shrink_dcache_parent() */
 	struct dentry *dentry;
@@ -853,8 +852,8 @@ static int prune_dcache(int count)
 
 relock:
 	spin_lock(&dcache_lru_lock);
-	while (!list_empty(&dentry_lru)) {
-		dentry = list_entry(dentry_lru.prev, struct dentry, d_lru);
+	while (!list_empty(&mob->dentry_lru)) {
+		dentry = list_entry(mob->dentry_lru.prev, struct dentry, d_lru);
 
 		if (!spin_trylock(&dentry->d_lock)) {
 			spin_unlock(&dcache_lru_lock);
@@ -880,7 +879,7 @@ relock:
 		cond_resched_lock(&dcache_lru_lock);
 	}
 	if (!list_empty(&referenced))
-		list_splice(&referenced, &dentry_lru);
+		list_splice(&referenced, &mob->dentry_lru);
 	spin_unlock(&dcache_lru_lock);
 
 	return shrink_dentry_list(&tmp);
@@ -1060,7 +1059,7 @@ static int shrink_dcache_memory(struct shrinker *shrink, int nr, gfp_t gfp_mask)
 	if (nr) {
 		if (!(gfp_mask & __GFP_FS))
 			return -1;
-		prune_dcache(nr);
+		prune_dcache(&init_dentry_mob, nr);
 	}
 
 	return (dentry_stat.nr_unused / 100) * sysctl_vfs_cache_pressure;
@@ -1071,15 +1070,17 @@ static struct shrinker dcache_shrinker = {
 	.seeks = DEFAULT_SEEKS,
 };
 
-static int dcache_mem_check(void)
+static int dcache_mem_check(struct dentry_mob *dmob)
 {
-	if (percpu_counter_read_positive(&nr_dentry) <= nr_dentry_max)
+	if (percpu_counter_read_positive(&dmob->nr_dentry) <=
+			dmob->nr_dentry_max)
 		return 0;
 
 	do {
-		if (percpu_counter_sum_positive(&nr_dentry) <= nr_dentry_max)
+		if (percpu_counter_sum_positive(&dmob->nr_dentry) <=
+				dmob->nr_dentry_max)
 			return 0;
-	} while (prune_dcache(nr_dentry_max >> 6) > 0);
+	} while (prune_dcache(dmob, dmob->nr_dentry_max >> 6) > 0);
 
 	return -ENOMEM;
 }
@@ -1098,8 +1099,10 @@ struct dentry *d_alloc(struct dentry * parent, const struct qstr *name)
 {
 	struct dentry *dentry;
 	char *dname;
+	struct dentry_mob *dmob;
 
-	if (dcache_mem_check())
+	dmob = &init_dentry_mob;
+	if (dcache_mem_check(dmob))
 		return NULL;
 
 	dentry = kmem_cache_alloc(dentry_cache, GFP_KERNEL);
@@ -1136,6 +1139,7 @@ struct dentry *d_alloc(struct dentry * parent, const struct qstr *name)
 	INIT_LIST_HEAD(&dentry->d_subdirs);
 	INIT_LIST_HEAD(&dentry->d_alias);
 	INIT_LIST_HEAD(&dentry->d_u.d_child);
+	dentry->d_mob = dmob;
 
 	if (parent) {
 		spin_lock(&parent->d_lock);
@@ -1151,7 +1155,7 @@ struct dentry *d_alloc(struct dentry * parent, const struct qstr *name)
 		spin_unlock(&parent->d_lock);
 	}
 
-	percpu_counter_inc(&nr_dentry);
+	percpu_counter_inc(&dmob->nr_dentry);
 
 	return dentry;
 }
@@ -2860,9 +2864,11 @@ static void __init dcache_init(void)
 {
 	int loop;
 
-	percpu_counter_init(&nr_dentry, 0);
-	nr_dentry_max = 80 * 1024 * 1024 / sizeof(struct dentry);
-	printk("nr_dentry_max = %lu\n", nr_dentry_max);
+	percpu_counter_init(&init_dentry_mob.nr_dentry, 0);
+	init_dentry_mob.nr_dentry_max = 80 * 1024 * 1024 / sizeof(struct dentry);
+	INIT_LIST_HEAD(&init_dentry_mob.dentry_lru);
+
+	printk("nr_dentry_max = %lu\n", init_dentry_mob.nr_dentry_max);
 
 	/* 
 	 * A constructor could be added for stable state like the lists,
diff --git a/include/linux/dcache.h b/include/linux/dcache.h
index 64848dd..80bb9e4 100644
--- a/include/linux/dcache.h
+++ b/include/linux/dcache.h
@@ -48,6 +48,12 @@ struct dentry_stat_t {
 };
 extern struct dentry_stat_t dentry_stat;
 
+struct dentry_mob {
+	struct percpu_counter nr_dentry;
+	unsigned long nr_dentry_max;
+	struct list_head dentry_lru;
+};
+
 /*
  * Compare 2 name strings, return 0 if they match, otherwise non-zero.
  * The strings are both count bytes long, and count is non-zero.
@@ -143,6 +149,7 @@ struct dentry {
 	} d_u;
 	struct list_head d_subdirs;	/* our children */
 	struct list_head d_alias;	/* inode alias list */
+	struct dentry_mob *d_mob;
 };
 
 /*
-- 
1.5.5.6

^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH 9/13] vfs: More than one mob management
  2011-05-03 12:14 [RFC][PATCH 0/13] Per-container dcache management (and a bit more) Pavel Emelyanov
                   ` (7 preceding siblings ...)
  2011-05-03 12:18 ` [PATCH 8/13] vfs: Introduce the dentry mobs Pavel Emelyanov
@ 2011-05-03 12:18 ` Pavel Emelyanov
  2011-05-03 12:19 ` [PATCH 10/13] vfs: Routines for setting mob size and getting stats Pavel Emelyanov
                   ` (5 subsequent siblings)
  14 siblings, 0 replies; 23+ messages in thread
From: Pavel Emelyanov @ 2011-05-03 12:18 UTC (permalink / raw)
  To: Hugh Dickins, Nick Piggin, Andrea Arcangeli, Rik van Riel,
	Dave Hansen, Alexa
  Cc: linux-fsdevel

This management includes

* a new dentry attached to some parent inherits the mob from it
* on rename from one mob to another, the subtree is reattached to
the new mob (see comment below)
* the mob root is marked with a flag and on its death the mob is
killed

Some more words about the rename. This recharge is slow, but that is OK,
since the main use case for mobs is per-container dcache management,
and moving some subdir from one container to another is actually
a rare operation which is not expected to be fast.

Moreover, in OpenVZ each container's root is a bind-mount, so if one
tries to do "mv one_ct_root/x other_ct_root/" the rename check for
vfsmnt equality will fail and a real copy will occur.

One bad thing about this approach: when we mount some new filesystem
on a non-init mob-ed dentry, the new mount will be attached to the init
mob. The mob change needs to be done when we attach an fs to a mountpoint.

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>

---
 fs/dcache.c            |  182 +++++++++++++++++++++++++++++++++++++++++++++++-
 include/linux/dcache.h |    5 ++
 2 files changed, 186 insertions(+), 1 deletions(-)

diff --git a/fs/dcache.c b/fs/dcache.c
index bfe047d..51fb998 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -145,6 +145,8 @@ static void __d_free(struct rcu_head *head)
 	kmem_cache_free(dentry_cache, dentry); 
 }
 
+static void destroy_mob(struct dentry_mob *mob);
+
 /*
  * no locks, please.
  */
@@ -154,6 +156,8 @@ static void d_free(struct dentry *dentry)
 	percpu_counter_dec(&dentry->d_mob->nr_dentry);
 	if (dentry->d_op && dentry->d_op->d_release)
 		dentry->d_op->d_release(dentry);
+	if (dentry->d_flags & DCACHE_MOB_ROOT)
+		destroy_mob(dentry->d_mob);
 
 	/* if dentry was never inserted into hash, immediate free is OK */
 	if (hlist_bl_unhashed(&dentry->d_hash))
@@ -1101,7 +1105,11 @@ struct dentry *d_alloc(struct dentry * parent, const struct qstr *name)
 	char *dname;
 	struct dentry_mob *dmob;
 
-	dmob = &init_dentry_mob;
+	if (parent)
+		dmob = parent->d_mob;
+	else
+		dmob = &init_dentry_mob;
+
 	if (dcache_mem_check(dmob))
 		return NULL;
 
@@ -2039,6 +2047,23 @@ static void dentry_unlock_parents_for_move(struct dentry *dentry,
 		spin_unlock(&target->d_parent->d_lock);
 }
 
+static void dcache_move_to_new_mob(struct dentry *root, struct dentry_mob *dmob);
+static void switch_mobs(struct dentry *dentry, struct dentry *target)
+{
+	if (dentry->d_mob == target->d_mob)
+		return;
+
+	if (dentry->d_flags & DCACHE_MOB_ROOT)
+		return;
+
+	dcache_move_to_new_mob(dentry, target->d_mob);
+	if (target->d_flags & DCACHE_MOB_ROOT) {
+		spin_lock(&dentry->d_lock);
+		dentry->d_flags |= DCACHE_MOB_ROOT;
+		spin_unlock(&dentry->d_lock);
+	}
+}
+
 /*
  * When switching names, the actual string doesn't strictly have to
  * be preserved in the target - because we're dropping the target
@@ -2115,6 +2140,8 @@ void d_move(struct dentry * dentry, struct dentry * target)
 	fsnotify_d_move(dentry);
 	spin_unlock(&dentry->d_lock);
 	write_sequnlock(&dentry->d_sb->s_rename_lock);
+
+	switch_mobs(dentry, target);
 }
 EXPORT_SYMBOL(d_move);
 
@@ -2826,6 +2853,159 @@ ino_t find_inode_number(struct dentry *dir, struct qstr *name)
 }
 EXPORT_SYMBOL(find_inode_number);
 
+static struct dentry_mob *create_mob(struct dentry_mob *cur)
+{
+	struct dentry_mob *dmob;
+
+	dmob = kmalloc(sizeof(struct dentry_mob), GFP_KERNEL);
+	if (dmob == NULL)
+		return NULL;
+
+	if (percpu_counter_init(&dmob->nr_dentry, 0) < 0) {
+		kfree(dmob);
+		return NULL;
+	}
+
+	dmob->nr_dentry_max = cur->nr_dentry_max;
+	INIT_LIST_HEAD(&dmob->dentry_lru);
+
+	return dmob;
+}
+
+static void destroy_mob(struct dentry_mob *mob)
+{
+	if (percpu_counter_sum(&mob->nr_dentry) != 0)
+		BUG();
+	if (!list_empty(&mob->dentry_lru))
+		BUG();
+
+	percpu_counter_destroy(&mob->nr_dentry);
+	kfree(mob);
+}
+
+static void dentry_move_to_mob(struct dentry *de, struct dentry_mob *dmob)
+{
+	percpu_counter_dec(&de->d_mob->nr_dentry);
+	if (!list_empty(&de->d_lru)) {
+		spin_lock(&dcache_lru_lock);
+		list_del_init(&de->d_lru);
+		spin_unlock(&dcache_lru_lock);
+	}
+
+	de->d_mob = dmob;
+
+	percpu_counter_inc(&dmob->nr_dentry);
+	if (!de->d_count) {
+		spin_lock(&dcache_lru_lock);
+		list_add_tail(&de->d_lru, &dmob->dentry_lru);
+		spin_unlock(&dcache_lru_lock);
+	}
+}
+
+static void dcache_move_to_new_mob(struct dentry *root, struct dentry_mob *dmob)
+{
+	struct dentry *this_parent;
+	struct list_head *next;
+	unsigned seq;
+	int locked = 0;
+
+	seq = read_seqbegin(&root->d_sb->s_rename_lock);
+again:
+	this_parent = root;
+	spin_lock(&this_parent->d_lock);
+repeat:
+	next = this_parent->d_subdirs.next;
+resume:
+	while (next != &this_parent->d_subdirs) {
+		struct list_head *tmp = next;
+		struct dentry *dentry = list_entry(tmp, struct dentry, d_u.d_child);
+		next = tmp->next;
+
+		spin_lock_nested(&dentry->d_lock, DENTRY_D_LOCK_NESTED);
+		/*
+		 * Descend a level if the d_subdirs list is non-empty.
+		 */
+		if (!list_empty(&dentry->d_subdirs)) {
+			spin_unlock(&this_parent->d_lock);
+			spin_release(&dentry->d_lock.dep_map, 1, _RET_IP_);
+			this_parent = dentry;
+			spin_acquire(&this_parent->d_lock.dep_map, 0, 1, _RET_IP_);
+			goto repeat;
+		}
+
+		dentry_move_to_mob(dentry, dmob);
+
+		spin_unlock(&dentry->d_lock);
+	}
+	/*
+	 * All done at this level ... ascend and resume the search.
+	 */
+	if (this_parent != root) {
+		struct dentry *tmp;
+		struct dentry *child;
+
+		tmp = this_parent->d_parent;
+		rcu_read_lock();
+		spin_unlock(&this_parent->d_lock);
+		child = this_parent;
+		this_parent = tmp;
+		spin_lock(&this_parent->d_lock);
+		/* might go back up the wrong parent if we have had a rename
+		 * or deletion */
+		if (this_parent != child->d_parent ||
+			(!locked && read_seqretry(&root->d_sb->s_rename_lock, seq))) {
+			spin_unlock(&this_parent->d_lock);
+			rcu_read_unlock();
+			goto rename_retry;
+		}
+		rcu_read_unlock();
+		next = child->d_u.d_child.next;
+
+		spin_lock_nested(&child->d_lock, DENTRY_D_LOCK_NESTED);
+		dentry_move_to_mob(child, dmob);
+		spin_unlock(&child->d_lock);
+
+		goto resume;
+	}
+
+	spin_unlock(&this_parent->d_lock);
+	if (!locked && read_seqretry(&root->d_sb->s_rename_lock, seq))
+		goto rename_retry;
+	if (locked)
+		read_sequnlock(&root->d_sb->s_rename_lock);
+	return;
+
+rename_retry:
+	locked = 1;
+	read_seqlock(&root->d_sb->s_rename_lock);
+	goto again;
+}
+
+int dcache_new_mob(struct dentry *root)
+{
+	struct dentry_mob *dmob, *old = NULL;
+
+	if (root->d_flags & DCACHE_MOB_ROOT)
+		old = root->d_mob;
+
+	dmob = create_mob(root->d_mob);
+	if (dmob == NULL)
+		return -ENOMEM;
+
+	dcache_move_to_new_mob(root, dmob);
+
+	spin_lock(&root->d_lock);
+	root->d_flags |= DCACHE_MOB_ROOT;
+	root->d_mob = dmob;
+	percpu_counter_inc(&dmob->nr_dentry);
+	spin_unlock(&root->d_lock);
+
+	if (old != NULL)
+		destroy_mob(old);
+
+	return 0;
+}
+
 static __initdata unsigned long dhash_entries;
 static int __init set_dhash_entries(char *str)
 {
diff --git a/include/linux/dcache.h b/include/linux/dcache.h
index 80bb9e4..3681307 100644
--- a/include/linux/dcache.h
+++ b/include/linux/dcache.h
@@ -54,6 +54,9 @@ struct dentry_mob {
 	struct list_head dentry_lru;
 };
 
+struct dentry;
+int dcache_new_mob(struct dentry *root);
+
 /*
  * Compare 2 name strings, return 0 if they match, otherwise non-zero.
  * The strings are both count bytes long, and count is non-zero.
@@ -227,6 +230,8 @@ struct dentry_operations {
 #define DCACHE_MANAGED_DENTRY \
 	(DCACHE_MOUNTED|DCACHE_NEED_AUTOMOUNT|DCACHE_MANAGE_TRANSIT)
 
+#define DCACHE_MOB_ROOT		0x80000
+
 static inline int dname_external(struct dentry *dentry)
 {
 	return dentry->d_name.name != dentry->d_iname;
-- 
1.5.5.6

^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH 10/13] vfs: Routines for setting mob size and getting stats
  2011-05-03 12:14 [RFC][PATCH 0/13] Per-container dcache management (and a bit more) Pavel Emelyanov
                   ` (8 preceding siblings ...)
  2011-05-03 12:18 ` [PATCH 9/13] vfs: More than one mob management Pavel Emelyanov
@ 2011-05-03 12:19 ` Pavel Emelyanov
  2011-05-03 12:19 ` [PATCH 11/13] vfs: Make shrink_dcache_memory prune dcache from all mobs Pavel Emelyanov
                   ` (4 subsequent siblings)
  14 siblings, 0 replies; 23+ messages in thread
From: Pavel Emelyanov @ 2011-05-03 12:19 UTC (permalink / raw)
  To: Hugh Dickins, Nick Piggin, Andrea Arcangeli, Rik van Riel,
	Dave Hansen, Alexa
  Cc: linux-fsdevel

Very simple ones, just to make the further patching simpler.

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>

---
 fs/dcache.c            |   30 ++++++++++++++++++++++++++++++
 include/linux/dcache.h |    7 +++++++
 2 files changed, 37 insertions(+), 0 deletions(-)

diff --git a/fs/dcache.c b/fs/dcache.c
index 51fb998..bdacea3 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -3006,6 +3006,36 @@ int dcache_new_mob(struct dentry *root)
 	return 0;
 }
 
+int dcache_set_mob_size(struct dentry *de, unsigned long size)
+{
+	struct dentry_mob *mob = de->d_mob;
+	unsigned long usage;
+
+	if (mob == &init_dentry_mob) /* for safety */
+		return 0;
+
+	mob->nr_dentry_max = size;
+	usage = percpu_counter_sum_positive(&mob->nr_dentry);
+	if (usage > mob->nr_dentry_max)
+		prune_dcache(mob, usage - mob->nr_dentry_max);
+
+	return 0;
+}
+
+int dcache_get_mob_stat(struct dentry *de, struct dentry_mob_stat __user *stat)
+{
+	struct dentry_mob *mob = de->d_mob;
+	struct dentry_mob_stat kstat;
+
+	kstat.nr_dentry = percpu_counter_sum_positive(&mob->nr_dentry);
+	kstat.nr_dentry_max = mob->nr_dentry_max;
+
+	if (copy_to_user(stat, &kstat, sizeof(kstat)))
+		return -EFAULT;
+
+	return 0;
+}
+
 static __initdata unsigned long dhash_entries;
 static int __init set_dhash_entries(char *str)
 {
diff --git a/include/linux/dcache.h b/include/linux/dcache.h
index 3681307..0d13966 100644
--- a/include/linux/dcache.h
+++ b/include/linux/dcache.h
@@ -55,7 +55,14 @@ struct dentry_mob {
 };
 
 struct dentry;
+struct dentry_mob_stat {
+	unsigned long nr_dentry;
+	unsigned long nr_dentry_max;
+};
+
 int dcache_new_mob(struct dentry *root);
+int dcache_set_mob_size(struct dentry *de, unsigned long size);
+int dcache_get_mob_stat(struct dentry *de, struct dentry_mob_stat __user *stat);
 
 /*
  * Compare 2 name strings, return 0 if they match, otherwise non-zero.
-- 
1.5.5.6

^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH 11/13] vfs: Make shrink_dcache_memory prune dcache from all mobs
  2011-05-03 12:14 [RFC][PATCH 0/13] Per-container dcache management (and a bit more) Pavel Emelyanov
                   ` (9 preceding siblings ...)
  2011-05-03 12:19 ` [PATCH 10/13] vfs: Routines for setting mob size and getting stats Pavel Emelyanov
@ 2011-05-03 12:19 ` Pavel Emelyanov
  2011-05-03 12:20 ` [PATCH 12/13] vfs: Mobs creation and mgmt API Pavel Emelyanov
                   ` (3 subsequent siblings)
  14 siblings, 0 replies; 23+ messages in thread
From: Pavel Emelyanov @ 2011-05-03 12:19 UTC (permalink / raw)
  To: Hugh Dickins, Nick Piggin, Andrea Arcangeli, Rik van Riel,
	Dave Hansen, Alexa
  Cc: linux-fsdevel

For that, a list of mobs and their number is introduced. When
shrink_dcache_memory is called it will shrink the dcache from each
mob in the system.

For now I shrink an equal count of dentries from each mob, but I'm open
to discussing better policies.

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>

---
 fs/dcache.c            |   35 ++++++++++++++++++++++++++++++++++-
 include/linux/dcache.h |    1 +
 2 files changed, 35 insertions(+), 1 deletions(-)

diff --git a/fs/dcache.c b/fs/dcache.c
index bdacea3..01e3464 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -124,6 +124,9 @@ struct dentry_stat_t dentry_stat = {
 };
 
 static struct dentry_mob init_dentry_mob;
+static LIST_HEAD(dentry_mobs);
+static unsigned int dentry_mobs_nr = 1;
+static DEFINE_SPINLOCK(dcache_mobs_lock);
 
 #if defined(CONFIG_SYSCTL) && defined(CONFIG_PROC_FS)
 int proc_nr_dentry(ctl_table *table, int write, void __user *buffer,
@@ -889,6 +892,25 @@ relock:
 	return shrink_dentry_list(&tmp);
 }
 
+static void prune_dcache_mobs(int nr)
+{
+	int nr_per_mob;
+	struct dentry_mob *mob;
+
+	nr_per_mob = nr / dentry_mobs_nr;
+	if (nr_per_mob == 0)
+		nr_per_mob = 16;
+
+	while (nr--) {
+		spin_lock(&dcache_mobs_lock);
+		mob = list_first_entry(&dentry_mobs, struct dentry_mob, mobs);
+		list_move_tail(&mob->mobs, &dentry_mobs);
+		spin_unlock(&dcache_mobs_lock);
+
+		prune_dcache(mob, nr_per_mob);
+	}
+}
+
 /**
  * shrink_dcache_sb - shrink dcache for a superblock
  * @sb: superblock
@@ -1063,7 +1085,7 @@ static int shrink_dcache_memory(struct shrinker *shrink, int nr, gfp_t gfp_mask)
 	if (nr) {
 		if (!(gfp_mask & __GFP_FS))
 			return -1;
-		prune_dcache(&init_dentry_mob, nr);
+		prune_dcache_mobs(nr);
 	}
 
 	return (dentry_stat.nr_unused / 100) * sysctl_vfs_cache_pressure;
@@ -2874,6 +2896,11 @@ static struct dentry_mob *create_mob(struct dentry_mob *cur)
 
 static void destroy_mob(struct dentry_mob *mob)
 {
+	spin_lock(&dcache_mobs_lock);
+	list_del(&mob->mobs);
+	dentry_mobs_nr--;
+	spin_unlock(&dcache_mobs_lock);
+
 	if (percpu_counter_sum(&mob->nr_dentry) != 0)
 		BUG();
 	if (!list_empty(&mob->dentry_lru))
@@ -3000,6 +3027,11 @@ int dcache_new_mob(struct dentry *root)
 	percpu_counter_inc(&dmob->nr_dentry);
 	spin_unlock(&root->d_lock);
 
+	spin_lock(&dcache_mobs_lock);
+	list_add(&dmob->mobs, &dentry_mobs);
+	dentry_mobs_nr++;
+	spin_unlock(&dcache_mobs_lock);
+
 	if (old != NULL)
 		destroy_mob(old);
 
@@ -3077,6 +3109,7 @@ static void __init dcache_init(void)
 	percpu_counter_init(&init_dentry_mob.nr_dentry, 0);
 	init_dentry_mob.nr_dentry_max = 80 * 1024 * 1024 / sizeof(struct dentry);
 	INIT_LIST_HEAD(&init_dentry_mob.dentry_lru);
+	list_add_tail(&init_dentry_mob.mobs, &dentry_mobs);
 
 	printk("nr_dentry_max = %lu\n", init_dentry_mob.nr_dentry_max);
 
diff --git a/include/linux/dcache.h b/include/linux/dcache.h
index 0d13966..7b1fdc1 100644
--- a/include/linux/dcache.h
+++ b/include/linux/dcache.h
@@ -52,6 +52,7 @@ struct dentry_mob {
 	struct percpu_counter nr_dentry;
 	unsigned long nr_dentry_max;
 	struct list_head dentry_lru;
+	struct list_head mobs;
 };
 
 struct dentry;
-- 
1.5.5.6

^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH 12/13] vfs: Mobs creation and mgmt API
  2011-05-03 12:14 [RFC][PATCH 0/13] Per-container dcache management (and a bit more) Pavel Emelyanov
                   ` (10 preceding siblings ...)
  2011-05-03 12:19 ` [PATCH 11/13] vfs: Make shrink_dcache_memory prune dcache from all mobs Pavel Emelyanov
@ 2011-05-03 12:20 ` Pavel Emelyanov
  2011-05-03 12:20 ` [PATCH 13/13] vfs: Dentry mobs listing in proc Pavel Emelyanov
                   ` (2 subsequent siblings)
  14 siblings, 0 replies; 23+ messages in thread
From: Pavel Emelyanov @ 2011-05-03 12:20 UTC (permalink / raw)
  To: Hugh Dickins, Nick Piggin, Andrea Arcangeli, Rik van Riel,
	Dave Hansen, Alexa
  Cc: linux-fsdevel

These are ioctls, sorry. I've spent a lot of time thinking about which
API would be better for this, but didn't find any perfect solution, so
it is to be discussed.

Cgroups as they are implemented now are not suitable, since they work
for controllers that control per-task-group resources. The dcache is
obviously not such a resource.
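
To make the proposed interface a bit more concrete, here is a rough
userspace sketch of how these ioctls could be driven. It is for
illustration only: the FIMOB* values and struct dentry_mob_stat are
copied from this patch (nothing is exported to a uapi header yet),
and the container path below is made up.

#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/ioctl.h>

/* Copied from this patch for illustration; not in any released header. */
struct dentry_mob_stat {
	unsigned long nr_dentry;
	unsigned long nr_dentry_max;
};

#define FIMOBROOT	_IO('X', 122)
#define FIMOBSIZE	_IOW('X', 123, unsigned long)
#define FIMOBSTAT	_IOR('X', 124, struct dentry_mob_stat)

int main(void)
{
	struct dentry_mob_stat st;
	/* hypothetical container root directory */
	int fd = open("/vz/private/101", O_RDONLY | O_DIRECTORY);

	if (fd < 0)
		return 1;

	/* turn this subtree into a mob of its own */
	if (ioctl(fd, FIMOBROOT) < 0)
		perror("FIMOBROOT");
	/* the limit is passed by value, not by pointer */
	if (ioctl(fd, FIMOBSIZE, 100000UL) < 0)
		perror("FIMOBSIZE");
	/* read back current usage and the configured limit */
	if (ioctl(fd, FIMOBSTAT, &st) == 0)
		printf("%lu of %lu dentries used\n",
		       st.nr_dentry, st.nr_dentry_max);

	close(fd);
	return 0;
}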

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>

---
 fs/ioctl.c         |    9 +++++++++
 include/linux/fs.h |    3 +++
 2 files changed, 12 insertions(+), 0 deletions(-)

diff --git a/fs/ioctl.c b/fs/ioctl.c
index 1eebeb7..e0c1e77 100644
--- a/fs/ioctl.c
+++ b/fs/ioctl.c
@@ -15,6 +15,7 @@
 #include <linux/writeback.h>
 #include <linux/buffer_head.h>
 #include <linux/falloc.h>
+#include <linux/dcache.h>
 
 #include <asm/ioctls.h>
 
@@ -596,6 +597,14 @@ int do_vfs_ioctl(struct file *filp, unsigned int fd, unsigned int cmd,
 		return put_user(inode->i_sb->s_blocksize, p);
 	}
 
+	case FIMOBROOT:
+		return dcache_new_mob(filp->f_path.dentry);
+	case FIMOBSIZE:
+		return dcache_set_mob_size(filp->f_path.dentry, arg);
+	case FIMOBSTAT:
+		return dcache_get_mob_stat(filp->f_path.dentry,
+				(struct dentry_mob_stat __user *)arg);
+
 	default:
 		if (S_ISREG(filp->f_path.dentry->d_inode->i_mode))
 			error = file_ioctl(filp, cmd, arg);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index a236ea3..4bf54bb 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -327,6 +327,9 @@ struct inodes_stat_t {
 #define FIFREEZE	_IOWR('X', 119, int)	/* Freeze */
 #define FITHAW		_IOWR('X', 120, int)	/* Thaw */
 #define FITRIM		_IOWR('X', 121, struct fstrim_range)	/* Trim */
+#define FIMOBROOT	_IO('X', 122)
+#define FIMOBSIZE	_IOW('X', 123, unsigned long)
+#define FIMOBSTAT	_IOR('X', 124, struct dentry_mob_stat)
 
 #define	FS_IOC_GETFLAGS			_IOR('f', 1, long)
 #define	FS_IOC_SETFLAGS			_IOW('f', 2, long)
-- 
1.5.5.6

^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH 13/13] vfs: Dentry mobs listing in proc
  2011-05-03 12:14 [RFC][PATCH 0/13] Per-container dcache management (and a bit more) Pavel Emelyanov
                   ` (11 preceding siblings ...)
  2011-05-03 12:20 ` [PATCH 12/13] vfs: Mobs creation and mgmt API Pavel Emelyanov
@ 2011-05-03 12:20 ` Pavel Emelyanov
  2011-05-06  1:05 ` [RFC][PATCH 0/13] Per-container dcache management (and a bit more) Dave Chinner
  2011-05-23  6:43 ` Pavel Emelyanov
  14 siblings, 0 replies; 23+ messages in thread
From: Pavel Emelyanov @ 2011-05-03 12:20 UTC (permalink / raw)
  To: Hugh Dickins, Nick Piggin, Andrea Arcangeli, Rik van Riel,
	Dave Hansen, Alexa
  Cc: linux-fsdevel

The /proc/dentry_mobs file lists the device the mob is located on,
the path to the dentry (starting from the device root), its usage and
its size.

I tried to make it very simple, as this API is not perfect from my
POV and will most likely be rewritten. For the same reason (simplicity)
the dentry_path is used, not the full path to the root.
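
As a rough illustration of the format produced by the seq_printf calls
below (all device numbers, paths and counters here are made up):

# cat /proc/dentry_mobs
/ 23410 409600
253:1 /private/101 48210 100000
253:1 /private/102/var/www 1204 50000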

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>

---
 fs/dcache.c            |   58 ++++++++++++++++++++++++++++++++++++++++++++++++
 fs/proc/root.c         |    3 ++
 include/linux/dcache.h |    2 +
 3 files changed, 63 insertions(+), 0 deletions(-)

diff --git a/fs/dcache.c b/fs/dcache.c
index 01e3464..c1120b3 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -3028,6 +3028,7 @@ int dcache_new_mob(struct dentry *root)
 	spin_unlock(&root->d_lock);
 
 	spin_lock(&dcache_mobs_lock);
+	dmob->mob_root = root;
 	list_add(&dmob->mobs, &dentry_mobs);
 	dentry_mobs_nr++;
 	spin_unlock(&dcache_mobs_lock);
@@ -3068,6 +3069,63 @@ int dcache_get_mob_stat(struct dentry *de, struct dentry_mob_stat __user *stat)
 	return 0;
 }
 
+static int dentry_mobs_show(struct seq_file *f, void *v)
+{
+	struct dentry_mob *dmob;
+
+	dmob = list_entry(v, struct dentry_mob, mobs);
+	if (dmob == &init_dentry_mob)
+		seq_putc(f, '/');
+	else {
+		struct dentry *root = dmob->mob_root;
+
+		seq_printf(f, "%u:%u ", MAJOR(root->d_sb->s_dev),
+				MINOR(root->d_sb->s_dev));
+		seq_dentry(f, root, " \t\n\\");
+	}
+
+	seq_printf(f, " %lu %lu\n",
+			(unsigned long)percpu_counter_sum_positive(&dmob->nr_dentry),
+			dmob->nr_dentry_max);
+
+	return 0;
+}
+
+static void *dentry_mobs_start(struct seq_file *f, loff_t *pos)
+{
+	spin_lock(&dcache_mobs_lock);
+	return seq_list_start(&dentry_mobs, *pos);
+}
+
+static void *dentry_mobs_next(struct seq_file *f, void *v, loff_t *pos)
+{
+	return seq_list_next(v, &dentry_mobs, pos);
+}
+
+static void dentry_mobs_stop(struct seq_file *f, void *v)
+{
+	spin_unlock(&dcache_mobs_lock);
+}
+
+static const struct seq_operations dentry_mobs_ops = {
+	.start = dentry_mobs_start,
+	.next  = dentry_mobs_next,
+	.stop  = dentry_mobs_stop,
+	.show  = dentry_mobs_show
+};
+
+static int dentry_mobs_open(struct inode *inode, struct file *filp)
+{
+	return seq_open(filp, &dentry_mobs_ops);
+}
+
+const struct file_operations proc_dentry_mobs_fops = {
+	.open		= dentry_mobs_open,
+	.read		= seq_read,
+	.llseek		= seq_lseek,
+	.release	= seq_release,
+};
+
 static __initdata unsigned long dhash_entries;
 static int __init set_dhash_entries(char *str)
 {
diff --git a/fs/proc/root.c b/fs/proc/root.c
index ef9fa8e..e4d4983 100644
--- a/fs/proc/root.c
+++ b/fs/proc/root.c
@@ -18,6 +18,7 @@
 #include <linux/bitops.h>
 #include <linux/mount.h>
 #include <linux/pid_namespace.h>
+#include <linux/dcache.h>
 
 #include "internal.h"
 
@@ -133,6 +134,8 @@ void __init proc_root_init(void)
 	proc_device_tree_init();
 #endif
 	proc_mkdir("bus", NULL);
+
+	proc_create("dentry_mobs", S_IRUGO, NULL, &proc_dentry_mobs_fops);
 	proc_sys_init();
 }
 
diff --git a/include/linux/dcache.h b/include/linux/dcache.h
index 7b1fdc1..8a9a5a5 100644
--- a/include/linux/dcache.h
+++ b/include/linux/dcache.h
@@ -53,6 +53,7 @@ struct dentry_mob {
 	unsigned long nr_dentry_max;
 	struct list_head dentry_lru;
 	struct list_head mobs;
+	struct dentry *mob_root;
 };
 
 struct dentry;
@@ -64,6 +65,7 @@ struct dentry_mob_stat {
 int dcache_new_mob(struct dentry *root);
 int dcache_set_mob_size(struct dentry *de, unsigned long size);
 int dcache_get_mob_stat(struct dentry *de, struct dentry_mob_stat __user *stat);
+extern const struct file_operations proc_dentry_mobs_fops;
 
 /*
  * Compare 2 name strings, return 0 if they match, otherwise non-zero.
-- 
1.5.5.6

^ permalink raw reply related	[flat|nested] 23+ messages in thread

* Re: [RFC][PATCH 0/13] Per-container dcache management (and a bit more)
  2011-05-03 12:14 [RFC][PATCH 0/13] Per-container dcache management (and a bit more) Pavel Emelyanov
                   ` (12 preceding siblings ...)
  2011-05-03 12:20 ` [PATCH 13/13] vfs: Dentry mobs listing in proc Pavel Emelyanov
@ 2011-05-06  1:05 ` Dave Chinner
  2011-05-06 12:15   ` Pavel Emelyanov
  2011-05-23  6:43 ` Pavel Emelyanov
  14 siblings, 1 reply; 23+ messages in thread
From: Dave Chinner @ 2011-05-06  1:05 UTC (permalink / raw)
  To: Pavel Emelyanov
  Cc: Hugh Dickins, Nick Piggin, Andrea Arcangeli, Rik van Riel,
	Dave Hansen, Alexander Viro, linux-fsdevel

On Tue, May 03, 2011 at 04:14:37PM +0400, Pavel Emelyanov wrote:
> Hi.
> 
> According to the "release early, release often" strategy :) I'm
> glad to propose this scratch implementation of what I was talking
> about at the LSF - the way to limit the dcache grow for both
> containerized and not systems (the set applies to 2.6.38).

dcache growth is rarely the memory consumption problem in systems -
it's inode cache growth that is the issue. Each inode consumes 4-5x
as much memory as a dentry, and the dentry lifecycle is a subset of
the inode lifecycle.  Limiting the number of dentries will do very
little to relieve memory problems because of this.

Indeed, I actually get a request from embedded folks every so often
to limit the size of the inode cache - they never have troubles with
the size of the dentry cache (and I do ask) - so perhaps you need to
consider this aspect of the problem a bit more.

FWIW, I often see machines during tests where the dentry cache is
empty, yet there are millions of inodes cached on the inode LRU
consuming gigabytes of memory. e.g a snapshot from my 4GB RAM test
VM right now:

  OBJS ACTIVE  USE OBJ SIZE  SLABS OBJ/SLAB CACHE SIZE NAME
2180754 2107387  96%    0.21K 121153       18    484612K xfs_ili
2132964 2107389  98%    1.00K 533241        4   2132964K xfs_inode
1625922 944034   58%    0.06K  27558       59    110232K size-64
415320 415301    99%    0.19K  20766       20     83064K dentry

You see 400k active dentries consume 83MB of ram, yet 2.1M active
inodes consuming ~2.6GB of RAM. We've already reclaimed the dentry
cache down quite small, while the inode cache remains the dominant
memory consumer.....

I'm also concerned about the scalability issues - moving back to
global lists and locks for LRU, shrinker and mob management is the
opposite direction we are taking - we want to make the LRUs more
fine-grained and more closely related to the MM structures,
shrinkers confined to per-sb context (no more lifecycle issues,
ever) and operate per-node/-zone rather than globally, etc.  It
seems to me that this containerisation will make much of that work
difficult to achieve effectively because it doesn't take any of this
ongoing scalability work into account.

> The first 5 patches are preparations for this, descriptive (I hope)
> comments are inside them.
> 
> The general idea of this set is -- make the dentries subtrees be
> limited in size and shrink them as they hit the configured limit.

And what if the inode cache does not shrink with it?

> Why subtrees? Because this lets having the [dentry -> group] reference
> without the reference count, letting the [dentry -> parent] one handle
> this.
> 
> Why limited? For containers the answer is simple -- a container
> should not be allowed to consume too much of the host memory. For
> non-containerized systems the answer is -- to protect the kernel
> from the non-privileged attacks on the dcache memory like the 
> "while :; do mkdir x; cd x; done" one and similar.

Which will stop as soon as the path gets too long. And if this is
really a problem on your systems, quotas can prevent this from ever
being an issue....

> What isn't in this patch yet, but should be done after the discussion
> 
> * API. I haven't managed to invent any perfect solution, and would
> really like to have it discussed. In order to be able to play with it 
> the ioctls + proc for listing are proposed.
> 
> * New mounts management. Right now if you mount some new FS to a
> dentry which belongs to some managed set (I named it "mob" in this
> patchset), the new mount is managed with the system settings. This is
> not OK, the new mount should be managed with the settings of the
> mountpoint's mob.
> 
> * Elegant shrink_dcache_memory on global memory shortage. By now the
> code walks the mobs and shinks some equal amount of dentries from them.
> Better shrinking policy can and probably should be implemented.

See above.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC][PATCH 0/13] Per-container dcache management (and a bit more)
  2011-05-06  1:05 ` [RFC][PATCH 0/13] Per-container dcache management (and a bit more) Dave Chinner
@ 2011-05-06 12:15   ` Pavel Emelyanov
  2011-05-07  0:01     ` Dave Chinner
  0 siblings, 1 reply; 23+ messages in thread
From: Pavel Emelyanov @ 2011-05-06 12:15 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Hugh Dickins, Nick Piggin, Andrea Arcangeli, Rik van Riel,
	Dave Hansen, Alexander Viro, linux-fsdevel

On 05/06/2011 05:05 AM, Dave Chinner wrote:
> On Tue, May 03, 2011 at 04:14:37PM +0400, Pavel Emelyanov wrote:
>> Hi.
>>
>> According to the "release early, release often" strategy :) I'm
>> glad to propose this scratch implementation of what I was talking
>> about at the LSF - the way to limit the dcache grow for both
>> containerized and not systems (the set applies to 2.6.38).
> 
> dcache growth is rarely the memory consumption problem in systems -
> it's inode cache growth that is the issue. Each inodes consumes 4-5x
> as much memory as a dentry, and the dentry lifecycle is a subset of
> the inode lifecycle.  Limiting the number of dentries will do very
> little to relieve memory problems because of this.

No, you don't take into account that once we have the dentry cache shrunk
the inode cache can also be shrunk (since there are no objects other than
dentries that hold inodes in cache), but not vice versa. That said --
if we keep the dentry cache from growing it becomes possible to keep the
inode cache from growing.

> Indeed, I actually get a request from embedded folks every so often
> to limit the size of the inode cache - they never have troubles with
> the size of the dentry cache (and I do ask) - so perhaps you need to
> consider this aspect of the problem a bit more.
> 
> FWIW, I often see machines during tests where the dentry cache is
> empty, yet there are millions of inodes cached on the inode LRU
> consuming gigabytes of memory. e.g a snapshot from my 4GB RAM test
> VM right now:
> 
>   OBJS ACTIVE  USE OBJ SIZE  SLABS OBJ/SLAB CACHE SIZE NAME
> 2180754 2107387  96%    0.21K 121153       18    484612K xfs_ili
> 2132964 2107389  98%    1.00K 533241        4   2132964K xfs_inode
> 1625922 944034   58%    0.06K  27558       59    110232K size-64
> 415320 415301    99%    0.19K  20766       20     83064K dentry
> 
> You see 400k active dentries consume 83MB of ram, yet 2.1M active
> inodes consuming ~2.6GB of RAM. We've already reclaimed the dentry
> cache down quite small, while the inode cache remains the dominant
> memory consumer.....

Same here - this 2.6GB of RAM is shrinkable memory (unless xfs inode
references are leaked).

> I'm also concerned about the scalability issues - moving back to
> global lists and locks for LRU, shrinker and mob management is the
> opposite direction we are taking - we want to make the LRUs more
> fine-grained and more closely related to the MM structures,
> shrinkers confined to per-sb context (no more lifecycle issues,
> ever) and operate per-node/-zone rather than globally, etc.  It
> seems to me that this containerisation will make much of that work
> difficult to acheive effectively because it doesn't take any of this
> ongoing scalability work into account.

Two things from my side on this:

1. Can you be more specific on this - which parts of VFS suffer from the
LRU being global? The only thing I found was the problem with shrinking
the dcache for some sb on umount, but in my patch #4 I made both routines
that do it work on the dentry tree, not the LRU list, and thus the global
LRU is no longer an issue at this point.

2. If for any reason you do need to keep the LRU per super block (please share
that reason if you do) we can create mobs per super block :) In other words -
with mobs we're much more flexible in how to manage dentry LRUs than with
per-sb LRUs.

>> The first 5 patches are preparations for this, descriptive (I hope)
>> comments are inside them.
>>
>> The general idea of this set is -- make the dentries subtrees be
>> limited in size and shrink them as they hit the configured limit.
> 
> And if the inode cache that does not shrink with it?

Yet again - that's not a big deal. Once we kill the dentries, the inodes are
no longer pinned in memory and the very first try_to_free_pages can free them.

>> Why subtrees? Because this lets having the [dentry -> group] reference
>> without the reference count, letting the [dentry -> parent] one handle
>> this.
>>
>> Why limited? For containers the answer is simple -- a container
>> should not be allowed to consume too much of the host memory. For
>> non-containerized systems the answer is -- to protect the kernel
>> from the non-privileged attacks on the dcache memory like the 
>> "while :; do mkdir x; cd x; done" one and similar.
> 
> Which will stop as soon as the path gets too long. 

No, it will *not*! Bash will start complaining that it won't be able to set
the CWD env. variable, but once you turn this into a C program...

> And if this is really a problem on your systems, quotas can prevent this from 
> ever being an issue....

Disagree.

Let's take the minimal CentOS 5.5 container. It contains ~30K files, but in this
container there's no data like web server static pages/scripts, databases,
devel tools, etc. Thus we cannot configure the quota for this container with a
smaller limit. I'd assume that 50K inodes is the minimum we should set (for
the record - the default quota size for this in OpenVZ is 200000, but people
most often increase it).

Given that on x86_64 one dentry takes ~200 bytes and one ext4 inode takes ~1K,
we give this container the ability to lock 50K * (200 + 1K) ~ 60M of RAM.

As our experience shows, if you have a node with e.g. 2G of RAM you can easily
host up to 20 containers with a LAMP stack (you can host more, but it will be
notably slow).

Thus, by trying to handle the issue with disk quotas you are giving your
containers the ability to lock up to 1.2GB of RAM with dcache + icache. This
is way too much.

>> What isn't in this patch yet, but should be done after the discussion
>>
>> * API. I haven't managed to invent any perfect solution, and would
>> really like to have it discussed. In order to be able to play with it 
>> the ioctls + proc for listing are proposed.
>>
>> * New mounts management. Right now if you mount some new FS to a
>> dentry which belongs to some managed set (I named it "mob" in this
>> patchset), the new mount is managed with the system settings. This is
>> not OK, the new mount should be managed with the settings of the
>> mountpoint's mob.
>>
>> * Elegant shrink_dcache_memory on global memory shortage. By now the
>> code walks the mobs and shinks some equal amount of dentries from them.
>> Better shrinking policy can and probably should be implemented.
> 
> See above.
> 
> Cheers,
> 
> Dave.

Thanks,
Pavel

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC][PATCH 0/13] Per-container dcache management (and a bit more)
  2011-05-06 12:15   ` Pavel Emelyanov
@ 2011-05-07  0:01     ` Dave Chinner
  2011-05-10 11:18       ` Pavel Emelyanov
  2011-06-18 13:30       ` Andrea Arcangeli
  0 siblings, 2 replies; 23+ messages in thread
From: Dave Chinner @ 2011-05-07  0:01 UTC (permalink / raw)
  To: Pavel Emelyanov
  Cc: Hugh Dickins, Nick Piggin, Andrea Arcangeli, Rik van Riel,
	Dave Hansen, Alexander Viro, linux-fsdevel

On Fri, May 06, 2011 at 04:15:50PM +0400, Pavel Emelyanov wrote:
> On 05/06/2011 05:05 AM, Dave Chinner wrote:
> > On Tue, May 03, 2011 at 04:14:37PM +0400, Pavel Emelyanov wrote:
> >> Hi.
> >>
> >> According to the "release early, release often" strategy :) I'm
> >> glad to propose this scratch implementation of what I was talking
> >> about at the LSF - the way to limit the dcache grow for both
> >> containerized and not systems (the set applies to 2.6.38).
> > 
> > dcache growth is rarely the memory consumption problem in systems -
> > it's inode cache growth that is the issue. Each inodes consumes 4-5x
> > as much memory as a dentry, and the dentry lifecycle is a subset of
> > the inode lifecycle.  Limiting the number of dentries will do very
> > little to relieve memory problems because of this.
> 
> No, you don't take into account that once we have the dentry cache shrunk
> the inode cache can be also shrunk (since there's no objects other than
> dentries, that hold inodes in cache), but not the vice versa. That said -- 
> if we keep the dentry cache from growing it becomes possible to keep the 
> inode cache from growing.

That's a fairly naive view of the way the caches interact. Unlike
dentries, inodes can be dirtied and can't be reclaimed until they
are clean. That requires IO. Hence the inode cache can't be
reclaimed as easily as the dentry cache, nor can controlling the
size of the dentry cache control the size of the inode
cache. At best, it's a second order effect.

Effectively, what you have is:

	L1 cache = dentry cache
	L2 cache = VFS inode cache,
			pinned by L1,
			pinned by dirty state
	L3 cache = 1st level FS inode cache,
			pinned by L2
			pinned by dirty state

None of the cache sizes are fixed, and overall size is limited only
by RAM, so you will always tend to have the L3 cache dominate memory
usage because:

	a) they are the largest objects in the hierarchy; and
	b) they are pinned by the L1 and L2 caches and need to be
	freed from those caches first.

If you limit the size of the L2/L3 inode cache, you immediately
limit the size of the dentry cache for everything but heavy users of
hard links. If you can't allocate more inodes, you can't allocate a
new dentry.

> > Indeed, I actually get a request from embedded folks every so often
> > to limit the size of the inode cache - they never have troubles with
> > the size of the dentry cache (and I do ask) - so perhaps you need to
> > consider this aspect of the problem a bit more.
> > 
> > FWIW, I often see machines during tests where the dentry cache is
> > empty, yet there are millions of inodes cached on the inode LRU
> > consuming gigabytes of memory. e.g a snapshot from my 4GB RAM test
> > VM right now:
> > 
> >   OBJS ACTIVE  USE OBJ SIZE  SLABS OBJ/SLAB CACHE SIZE NAME
> > 2180754 2107387  96%    0.21K 121153       18    484612K xfs_ili
> > 2132964 2107389  98%    1.00K 533241        4   2132964K xfs_inode
> > 1625922 944034   58%    0.06K  27558       59    110232K size-64
> > 415320 415301    99%    0.19K  20766       20     83064K dentry
> > 
> > You see 400k active dentries consume 83MB of ram, yet 2.1M active
> > inodes consuming ~2.6GB of RAM. We've already reclaimed the dentry
> > cache down quite small, while the inode cache remains the dominant
> > memory consumer.....
> 
> Same here - this 2.6GB of RAM is shrinkable memory (unless xfs inodes 
> references are leaked).

They aren't immediately reclaimable - they are all still pinned by
the VFS inode (L2) cache, and will be dirtied by having to truncate
away speculative allocation beyond EOF when the VFS inode cache
frees them. So there is IO required on all of those inodes before
they can be reclaimed. That's why the caches have ended up with this
size ratio, and that's from a long running, steady-state workload.
Controlling the dentry cache size won't help reduce that inode cache
size one bit on such workloads....

> > global lists and locks for LRU, shrinker and mob management is the
> > opposite direction we are taking - we want to make the LRUs more
> > fine-grained and more closely related to the MM structures,
> > shrinkers confined to per-sb context (no more lifecycle issues,
> > ever) and operate per-node/-zone rather than globally, etc.  It
> > seems to me that this containerisation will make much of that work
> > difficult to acheive effectively because it doesn't take any of this
> > ongoing scalability work into account.
> 
> Two things from my side on this:
> 
> 1. Can you be more specific on this - which parts of VFS suffer from the
> LRU being global?

Performance. It doesn't scale beyond a few CPUs before lock
contention becomes the limiting factor.

> The only thing I found was the problem with shrinking
> the dcache for some sb on umount, but in my patch #4 I made both routines
> doing it work on dentry tree, not the LRU list and thus the global LRU is
> no longer an issue at this point.

Actually, it is, because you've still got to remove the dentry from
the LRU to free it, which means traversing the global lock.

> 2. If for any reason you do need to keep LRU per super block (please share
> one if you do) we can create mobs per super block :) In other words - with
> mobs we're much more flexible with how to manage dentry LRU-s rather than
> with per-sb LRU-s.

Because of the hierarchical nature of the caches, and the fact that
we've got to jump through hoops to make sure the superblock doesn't
go away while we are doing a shrinker walk (the s_umount lock
problem). Moving to a per-sb shrinker means the shrinker callback has
the same life cycle as the superblock, and we no longer have a big
mess of locking and lifecycle concerns in memory reclaim.

On top of that, a single shrinker callout that shrinks the dentry,
VFS inode and FS inode caches in a single call means we do larger
chunks of work on each superblock at a time instead of a small
handful of dentries or inodes per shrinker call as the current
"proportion across all sbs" code currently works. That will give
reclaim a smaller CPU cache footprint with higher hit rates, so
should significantly reduce the CPU usage of shrinking the caches as
well.

Not to mention having a per-sb shrinker means that you can call the
shrinker from inode allocation when you run out of inodes, and it
will shrink the dentry cache, the VFS inode cache and the FS inode
cache in the correct order to free up inodes as quickly as possible
to allow the new inode allocation to occur....

> >> The first 5 patches are preparations for this, descriptive (I hope)
> >> comments are inside them.
> >>
> >> The general idea of this set is -- make the dentries subtrees be
> >> limited in size and shrink them as they hit the configured limit.
> > 
> > And if the inode cache that does not shrink with it?
> 
> Yet again - that's not a big deal. Once we killed dentries, the inodes are
> no longer pinned in memory and the very first try_to_free_pages can free them.

See above - the inode cache does not shrink in proportion with the
dentry cache.

> >> Why subtrees? Because this lets having the [dentry -> group] reference
> >> without the reference count, letting the [dentry -> parent] one handle
> >> this.
> >>
> >> Why limited? For containers the answer is simple -- a container
> >> should not be allowed to consume too much of the host memory. For
> >> non-containerized systems the answer is -- to protect the kernel
> >> from the non-privileged attacks on the dcache memory like the 
> >> "while :; do mkdir x; cd x; done" one and similar.
> > 
> > Which will stop as soon as the path gets too long. 
> 
> No, it will *not*! Bash will start complaining, that the won't be able to set
> the CWD env. variable, but once you turn this into a C program...

Sounds like a bug in bash ;)

> > And if this is really a problem on your systems, quotas can prevent this from 
> > ever being an issue....
> 
> Disagree.
> 
> Let's take the minimal CentOS5.5 container. It contains ~30K files, but in this
> container there's no data like web server static pages/scripts, databases,
> devel tools, etc. Thus we cannot configure the quota for this container with
> less limit. I'd assume that 50K inodes is the minimal what we should set (for
> the record - default quota size for this in OpenVZ is 200000, but people most
> often increase one).
>
> Having on x64_64 one dentry take ~200 bytes and one ext4 inode take ~1K we
> give this container the ability to lock 50K * (200 + 1K) ~ 60M of RAM.
>
> As our experience shows if you have a node with e.g. 2G of RAM you can easily host
> up to 20 containers with LAMP stack (you can host more, but this will be notably slow).
> 
> Thus, trying to handle the issue with disk quota you are giving your containers
> the ability to lock up to 1.2Gb of RAM with dcache + icache. This is way too many.

<sigh>

I never implied quotas were for limiting cache usage. I only
suggested they were the solution to your DOS example by preventing
unbound numbers of inodes from being created by an unprivileged
user.

To me, it sounds like you overprovision your servers and then
have major troubles when everyone tries to use what you supplied
them with simultaneously. There is a simple solution to that. ;)
Otherwise, I think you need to directly limit the size of the inode
caches, not try to do it implicitly via 2nd and 3rd order side
effects of controlling the size of the dentry cache.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC][PATCH 0/13] Per-container dcache management (and a bit more)
  2011-05-07  0:01     ` Dave Chinner
@ 2011-05-10 11:18       ` Pavel Emelyanov
  2011-06-18 13:30       ` Andrea Arcangeli
  1 sibling, 0 replies; 23+ messages in thread
From: Pavel Emelyanov @ 2011-05-10 11:18 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Hugh Dickins, Nick Piggin, Andrea Arcangeli, Rik van Riel,
	Dave Hansen, Alexander Viro, linux-fsdevel

>> No, you don't take into account that once we have the dentry cache shrunk
>> the inode cache can be also shrunk (since there's no objects other than
>> dentries, that hold inodes in cache), but not the vice versa. That said -- 
>> if we keep the dentry cache from growing it becomes possible to keep the 
>> inode cache from growing.
> 
> That's a fairly naive view of the way the caches interact. Unlike
> dentries, inodes can be dirtied and can't be reclaimed until they
> are clean. That requires IO. 

The dirty set ITSELF creates big problems, regardless of the inode cache.

Besides, a dirty page *can* be cleaned, thus releasing the inode, while a
dentry held by an open file can *not*.

Moreover, the dirty set cannot grow infinitely, thus the amount of these "hard
to reclaim" inodes is limited as well. And the scales of the two are typically
not comparable... It's a matter of policy, not inability.

> Hence the inode cache can't be
> reclaimed as easily as the dentry cache, nor can controlling the
> size of the dentry cache control the size of the inode
> cache. At best, it's a second order effect.

I'm not talking about control, I'm talking about whether you will be able to
free the memory *at* *all* or not.

> Effectively, what you have is:
> 
> 	L1 cache = dentry cache
> 	L2 cache = VFS inode cache,
> 			pinned by L1,
> 			pinned by dirty state
> 	L3 cache = 1st level FS inode cache,
> 			pinned by L2
> 			pinned by dirty state
> 
> None of the cache sizes are fixed, and overall size is limited only
> by RAM, so you will always tend to have the L3 cache dominate memory
> usage because:
> 
> 	a) they are the largest objects in the heirarchy; and
> 	b) they are pinned by the L1 and L2 caches and need to be
> 	freed from those caches first.
> 
> If you limit the size of the L2/L3 inode cache, you immediately
> limit the size of the dentry cache for everything but heavy users of
> hard links. If you can't allocate more inodes, you can't allocate a
> new dentry.

Hm! You propose to manage the Linux page cache by managing the number of
cached inodes? Sounds like a plan ;)

>> 1. Can you be more specific on this - which parts of VFS suffer from the
>> LRU being global?
> 
> Performance. It doesn't scale beyond a few CPUs before lock
> contention becomes the limiting factor.

1. The LRU lock is global now, regardless of whether the dentries are stored
   on a global list or a per-sb list;
2. If we're talking about "the possibility to", then a per-sb lock is
   significantly less optimization-friendly compared to a per-abstract-mob
   one. Create mobs per-sb and be happy (again).

>> The only thing I found was the problem with shrinking
>> the dcache for some sb on umount, but in my patch #4 I made both routines
>> doing it work on dentry tree, not the LRU list and thus the global LRU is
>> no longer an issue at this point.
> 
> Actually, it is, because you've still got to remove the dentry from
> the LRU to free it, which means traversing the global lock.

You still have to remove the dentry from the per-sb LRU to free it, which
means traversing the global lock. So, where's the catch?

>> 2. If for any reason you do need to keep LRU per super block (please share
>> one if you do) we can create mobs per super block :) In other words - with
>> mobs we're much more flexible with how to manage dentry LRU-s rather than
>> with per-sb LRU-s.
> 
> Because of the heirarchical nature of the caches, and the fact that
> we've got to jump through hoops to make sure the superblock doesn't
> go away while we are doing a shrinker walk (the s_umount lock
> problem). Move to a per-sb shrinker means the shrinker callback has
> the same life cycle as the superblock, and we no longer have a big
> mess of locking and lifecycle concerns in memory reclaim.
> 
> On top of that, a single shrinker callout that shrinkers the dentry,
> VFS inode and FS inode caches in a single call means we do larger
> chunks of work on each superblock at a time instead of a small
> handful of dentries or inodes per shrinker call as the current
> "proportion across all sbs" code currently works. That will give
> reclaim a smaller CPU cache footprint with higher hit rates, so
> should significantly reduce the CPU usage of shrinking the caches as
> well.
> 
> Not to mention having a per-sb shrinker means that you can call the
> shrinker from inode allocation when you run out of inodes, and it
> will shrink the dentry cache, the VFS inode cache and the FS inode
> cache in the correct order to free up inodes as quickly as possible
> to allow the new inode allocation to occur....

Yet again - this all sounds great, but why didn't you consider my proposal to
create mobs per super block? This would solve not only my problems, but also
all of *yours*.

>>>> The first 5 patches are preparations for this, descriptive (I hope)
>>>> comments are inside them.
>>>>
>>>> The general idea of this set is -- make the dentries subtrees be
>>>> limited in size and shrink them as they hit the configured limit.
>>>
>>> And if the inode cache that does not shrink with it?
>>
>> Yet again - that's not a big deal. Once we killed dentries, the inodes are
>> no longer pinned in memory and the very first try_to_free_pages can free them.
> 
> See above - the inode cache does not shrink in proportion with the
> dentry cache.

See above - having a big amount of pinned dentries gives you no chance to
free any inode, not vice versa.

> <sigh>
> 
> I never implied quotas were for limiting cache usage. I only
> suggested they were the solution to your DOS example by preventing
> unbound numbers of inodes from being created by an unprivileged
> user.

As I've shown above, this does NOT prevent the DoS. This ... trick with
quotas cannot be considered to solve *any* memory problems.

> To me, it sounds like you overprovision your servers and then

Not *my* servers. But this doesn't matter.

> have major troubles when everyone tries to use what you supplied
> them with simultaneously. There is a simple solution to that. ;)

That's how people use Linux (and containers) - they do overcommit
resources, and hard-limiting everything to the physical capacity of a
host is not always the best way to go.

> Otherwise, I think you need to directly limit the size of the inode
> caches, not try to do it implicitly via 2nd and 3rd order side
> effects of controlling the size of the dentry cache.

As I stated above - limiting the inode cache size will have an uncontrollable
effect on Linux's ability to manage the page cache.

> Cheers,
> 
> Dave.


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC][PATCH 0/13] Per-container dcache management (and a bit more)
  2011-05-03 12:14 [RFC][PATCH 0/13] Per-container dcache management (and a bit more) Pavel Emelyanov
                   ` (13 preceding siblings ...)
  2011-05-06  1:05 ` [RFC][PATCH 0/13] Per-container dcache management (and a bit more) Dave Chinner
@ 2011-05-23  6:43 ` Pavel Emelyanov
  14 siblings, 0 replies; 23+ messages in thread
From: Pavel Emelyanov @ 2011-05-23  6:43 UTC (permalink / raw)
  To: Hugh Dickins, Nick Piggin, Andrea Arcangeli, Rik van Riel,
	Dave Hansen, Alexa
  Cc: linux-fsdevel

Gentlemen,

I'm sorry to bother you given the merge window rush, but may I ask you
to find some time in your schedules for this set, please?

Thanks,
Pavel

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC][PATCH 0/13] Per-container dcache management (and a bit more)
  2011-05-07  0:01     ` Dave Chinner
  2011-05-10 11:18       ` Pavel Emelyanov
@ 2011-06-18 13:30       ` Andrea Arcangeli
  2011-06-20  0:49         ` Dave Chinner
  1 sibling, 1 reply; 23+ messages in thread
From: Andrea Arcangeli @ 2011-06-18 13:30 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Pavel Emelyanov, Hugh Dickins, Nick Piggin, Rik van Riel,
	Dave Hansen, Alexander Viro, linux-fsdevel

Hi everyone,

I would suggest re-submitting the first few locking improvements that
are independent of the per-container dentry limit. Bumping the seqlock
sequence count when there is no modification to the structure is
unnecessary; the change looks nice and we don't want it lost if it is a
valid micro-optimization. And the patchset size to discuss will decrease
too ;).
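
For reference, the idea is roughly this kind of helper (a minimal sketch
against the 2.6.38-era seqlock_t; the names are illustrative and not
necessarily the ones used in the posted patch):

	/*
	 * Take the lock exclusively for a read-only tree walk without
	 * bumping the sequence count, so lockless readers using
	 * read_seqbegin()/read_seqretry() are not forced to restart.
	 */
	static inline void read_seqlock_excl(seqlock_t *sl)
	{
		spin_lock(&sl->lock);
	}

	static inline void read_sequnlock_excl(seqlock_t *sl)
	{
		spin_unlock(&sl->lock);
	}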

On Sat, May 07, 2011 at 10:01:08AM +1000, Dave Chinner wrote:
> They aren't immediately reclaimable - they are all still pinned by
> the VFS inode (L2) cache, and will be dirtied by having to truncate
> away speculative allocation beyond EOF when the VFS inode cache
> frees them. So there is IO required on all of those inodes before
> they can be reclaimed. That's why the caches have ended up with this
> size ratio, and that's from a long running, steady-state workload.
> Controlling the dentry cache size won't help reduce that inode cache
> size one bit on such workloads....

Certainly, opening a flood of inodes, changing some attribute and
writing one page to disk while reusing the same dentry wouldn't have a
very nice effect, but from the container point of view it would still
be better than an unlimited number of simultaneously pinned inodes,
which makes a DoS far too easy. Maybe the next step would be to require
some other logic to limit the number of dirty inodes that can be opened
by a container. And waiting on inode writeback, pagecache writeback
and shrinkage during open(2) wouldn't even fail with an error; it would
just wait, so it would be more graceful than the effect of too many
dentries. That would likely be a lot more complex than a dentry limit
though... so if that is the next thing to expect, we should take it
into account from a complexity point of view.

Overall, -ENOMEM failures from d_alloc() in open(2) aren't so nice for
apps, which is why I'm not so fond of container virt compared with a
virt where the guest manages its own memory: there, no DoS issue like
this one can ever materialize for the host, and no added complexity is
required in the host. The container approach will never be as reliable
as a guest OS at avoiding these issues, so maybe we shouldn't complain
that this solution isn't perfect for the inode cache, when it will
clearly help their use case.

> > > global lists and locks for LRU, shrinker and mob management is the
> > > opposite direction we are taking - we want to make the LRUs more
> > > fine-grained and more closely related to the MM structures,
> > > shrinkers confined to per-sb context (no more lifecycle issues,
> > > ever) and operate per-node/-zone rather than globally, etc.  It
> > > seems to me that this containerisation will make much of that work
> > > difficult to achieve effectively because it doesn't take any of this
> > > ongoing scalability work into account.
> > 
> > Two things from my side on this:
> > 
> > 1. Can you be more specific on this - which parts of VFS suffer from the
> > LRU being global?
> 
> Performance. It doesn't scale beyond a few CPUs before lock
> contention becomes the limiting factor.

The global vs per-zone/NUMA dentry LRU question seems an interesting
point. Probably here we have two different points of view that push
development in two different directions because of different
production objectives and priorities. This won't be easy to get
agreement on.

Maybe I remember wrong, but I seem to recall Nick was proposing to
split the VFS LRUs per-zone/NUMA too, and Christoph was against it (or
maybe it was the other way around :). But it wasn't discussed in a
container context and I don't remember exactly what the cons were.
Global usually provides better LRU behaviour, and splitting arbitrarily
among zones/nodes tends to be a scalability boost, but if we only see
it as a lock-scalability improvement it becomes a tradeoff between
better LRU information and better scalability, and then we could
arbitrarily split the LRU without regard for the actual zone/node
size. It is much better to split LRUs on zone/node boundaries when
there's a real need to shrink specific zones/nodes from a reclaim
point of view, not just for better scalability when taking the lock.

We obviously use that zone/node LRU split in the pagecache LRU, and
clearly when we have to shrink a single node it helps with more than
just lock scalability, so per-node LRUs are certainly needed in NUMA
setups with HARDWALL NUMA pins, as they save a ton of CPU and avoid
global LRU churning as well. So maybe a per-zone/node LRU would
provide similar benefits for the VFS caches and would indeed be the
right direction from the MM point of view (ignoring this very issue of
containers). Right now we do blind VFS shrinks when we could do
selective zone/node ones, as the VFS shrinker caller already has the
zone/node info. Maybe whoever was against it (regardless of this
container dentry limit discussion) should point out what the cons are.
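
As a rough illustration of that direction (purely a sketch, not code
from any posted patch; the layout and helper names here are made up):

	/* One dentry LRU per NUMA node instead of one global list. */
	struct dentry_lru {
		spinlock_t		lock;
		struct list_head	list;
		unsigned long		nr_items;
	};

	static struct dentry_lru dentry_lrus[MAX_NUMNODES];	/* init omitted */

	/* Put a dentry on the LRU of the node its memory lives on. */
	static void dentry_lru_add_node(struct dentry *dentry)
	{
		struct dentry_lru *lru;

		lru = &dentry_lrus[page_to_nid(virt_to_page(dentry))];
		spin_lock(&lru->lock);
		list_add(&dentry->d_lru, &lru->list);
		lru->nr_items++;
		spin_unlock(&lru->lock);
	}

A node-aware shrinker could then walk only dentry_lrus[nid] for the
node that is actually under pressure, instead of churning a global list.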

> I never implied quotas were for limiting cache usage. I only
> suggested they were the solution to your DOS example by preventing
> unbound numbers of inodes from being created by an unprivileged
> user.
> 
> To me, it sounds like you overprovision your servers and then
> have major troubles when everyone tries to use what you supplied
> them with simultaneously. There is a simple solution to that. ;)
> Otherwise, I think you need to directly limit the size of the inode
> caches, not try to do it implicitly via 2nd and 3rd order side
> effects of controlling the size of the dentry cache.

They want to limit the amount of simultaneously pinned kernel RAM
structures, while still allowing a huge number of files in the
filesystem to make life simple during install
etc... So you can untar a backup of whatever size into the container
regardless of quotas, and if only a part of the unpacked data (the
common case) is used by the apps, it just works. Again, I don't think
the objective is perfect accounting, just something that happens to
work better; if one wants perfect accounting of the memory and bytes
used by the on-disk image, there are other types of virt available.

Yet another approach would be to account for how much kernel data each
process keeps pinned and unfreeable, and add that to the process RAM
during the OOM-killer decision, but that wouldn't be a hard
per-container limit, and it sounds way too CPU-costly to account for
the VFS-pinned RAM every time somebody calls open() or chdir(); it
would require counting too many things up towards the dentry root.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 8/13] vfs: Introduce the dentry mobs
  2011-05-03 12:18 ` [PATCH 8/13] vfs: Introduce the dentry mobs Pavel Emelyanov
@ 2011-06-18 13:40   ` Andrea Arcangeli
  0 siblings, 0 replies; 23+ messages in thread
From: Andrea Arcangeli @ 2011-06-18 13:40 UTC (permalink / raw)
  To: Pavel Emelyanov
  Cc: Hugh Dickins, Nick Piggin, Rik van Riel, Dave Hansen,
	Alexander Viro, linux-fsdevel

On Tue, May 03, 2011 at 04:18:33PM +0400, Pavel Emelyanov wrote:
> diff --git a/include/linux/dcache.h b/include/linux/dcache.h
> index 64848dd..80bb9e4 100644
> --- a/include/linux/dcache.h
> +++ b/include/linux/dcache.h
> @@ -48,6 +48,12 @@ struct dentry_stat_t {
>  };
>  extern struct dentry_stat_t dentry_stat;
>  
> +struct dentry_mob {
> +	struct percpu_counter nr_dentry;
> +	unsigned long nr_dentry_max;
> +	struct list_head dentry_lru;
> +};
> +
>  /*
>   * Compare 2 name strings, return 0 if they match, otherwise non-zero.
>   * The strings are both count bytes long, and count is non-zero.
> @@ -143,6 +149,7 @@ struct dentry {
>  	} d_u;
>  	struct list_head d_subdirs;	/* our children */
>  	struct list_head d_alias;	/* inode alias list */
> +	struct dentry_mob *d_mob;
>  };

I'd prefer it if this were under a config option, considering it wastes
8 bytes of memory for every dentry in the system if you never use it
because you use other types of virt that solve guest memory
partitioning without any change to the host OS. The same may apply to
other parts of the patch (yeah, maybe SLAB_HWCACHE_ALIGN won't change
the memory footprint of the dcache today, I didn't actually count, but
the dentry may change tomorrow and require one more cacheline, which is
not cheap at all in memory-footprint terms). I don't think all distros
enable all the host support you need in the first place, so I don't
think it should be a problem for you if this goes under a config
option; I'd surely be more comfortable with that.
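
Something along these lines is what I mean (sketch only;
CONFIG_DCACHE_MOB is a hypothetical option name, not taken from the
posted patches):

	#ifdef CONFIG_DCACHE_MOB
	struct dentry_mob {
		struct percpu_counter	nr_dentry;
		unsigned long		nr_dentry_max;
		struct list_head	dentry_lru;
	};
	#endif

	struct dentry {
		/* ... existing fields ... */
	#ifdef CONFIG_DCACHE_MOB
		struct dentry_mob *d_mob;	/* costs one pointer per dentry
						 * only when configured in */
	#endif
	};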

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC][PATCH 0/13] Per-container dcache management (and a bit more)
  2011-06-18 13:30       ` Andrea Arcangeli
@ 2011-06-20  0:49         ` Dave Chinner
  2011-07-04  5:32           ` Pavel Emelyanov
  0 siblings, 1 reply; 23+ messages in thread
From: Dave Chinner @ 2011-06-20  0:49 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Pavel Emelyanov, Hugh Dickins, Nick Piggin, Rik van Riel,
	Dave Hansen, Alexander Viro, linux-fsdevel

On Sat, Jun 18, 2011 at 03:30:38PM +0200, Andrea Arcangeli wrote:
> Hi everyone,
> 
> I would suggest re-submitting the first few locking improvements that
> are independent of the per-container dentry limit. Bumping the seqlock
> sequence count when there is no modification to the structure is
> unnecessary; the change looks nice and we don't want it lost if it is a
> valid micro-optimization. And the patchset size to discuss will decrease
> too ;).
> 
> On Sat, May 07, 2011 at 10:01:08AM +1000, Dave Chinner wrote:
> > They aren't immediately reclaimable - they are all still pinned by
> > the VFS inode (L2) cache, and will be dirtied by having to truncate
> > away speculative allocation beyond EOF when the VFS inode cache
> > frees them. So there is IO required on all of those inodes before
> > they can be reclaim. That's why the caches have ended up with this
> > size ratio, and that's from a long running, steady-state workload.
> > Controlling the dentry cache size won't help reduce that inode cache
> > size one bit on such workloads....
> 
> Certainly, opening a flood of inodes, changing some attribute and
> writing one page to disk while reusing the same dentry wouldn't have a
> very nice effect, but from the container point of view it would still
> be better than an unlimited number of simultaneously pinned inodes,
> which makes a DoS far too easy.

Perhaps you haven't understood how VFS cache reclaim works? The
dentry cache shrinker runs before the inode cache shrinker, and
hence we unpin inodes before trying to reclaim them. So you can't
"DoS" the system by reading dentries and using them to pin
inodes....
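
In rough terms (a simplified sketch, not the literal kernel code;
prune_dcache() and prune_icache() are static and are driven through the
registered shrinkers in that order):

	static void shrink_vfs_caches(int nr)
	{
		/* freeing unused dentries drops their inode references... */
		prune_dcache(nr);
		/* ...so the inode shrinker can then free the unpinned inodes */
		prune_icache(nr);
	}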

> Maybe the next step would be to require some other logic to limit the
> number of dirty inodes that can be opened by a container.

You don't "open" dirty inodes. Inodes are clean until they are
marked dirty as a result of some other operation. Hence a limit on
the number of dirty inodes in a container can only be done via a
limit on the total number of inodes.

> And waiting on inode writeback, pagecache writeback
> and shrinkage during open(2)

What makes you think open(2) is the only consumer of inodes? e.g.
"ls -l" causes the dentry/inode cache to be populated via
lstat(2). Another example: NFS file handle lookup....

> wouldn't even fail with an error; it would just wait, so it would be
> more graceful than the effect of too many dentries. That would likely
> be a lot more complex than a dentry limit though... so if that is the
> next thing to expect, we should take it into account from a complexity
> point of view.
> 
> Overall, -ENOMEM failures from d_alloc() in open(2) aren't so nice for
> apps, which is why I'm not so fond of container virt compared with a
> virt where the guest manages its own memory: there, no DoS issue like
> this one can ever materialize for the host, and no added complexity is
> required in the host. The container approach will never be as reliable
> as a guest OS at avoiding these issues, so maybe we shouldn't complain
> that this solution isn't perfect for the inode cache, when it will
> clearly help their use case.

I'm not sure what point you are trying to get across here? What do
containers have to do with how applications handle errors that can
already occur?

> > > > global lists and locks for LRU, shrinker and mob management is the
> > > > opposite direction we are taking - we want to make the LRUs more
> > > > fine-grained and more closely related to the MM structures,
> > > > shrinkers confined to per-sb context (no more lifecycle issues,
> > > > ever) and operate per-node/-zone rather than globally, etc.  It
> > > > seems to me that this containerisation will make much of that work
> > > > difficult to achieve effectively because it doesn't take any of this
> > > > ongoing scalability work into account.
> > > 
> > > Two things from my side on this:
> > > 
> > > 1. Can you be more specific on this - which parts of VFS suffer from the
> > > LRU being global?
> > 
> > Performance. It doesn't scale beyond a few CPUs before lock
> > contention becomes the limiting factor.
> 
> The global vs per-zone/NUMA dentry LRU question seems an interesting
> point. Probably here we have two different points of view that push
> development in two different directions because of different
> production objectives and priorities. This won't be easy to get
> agreement on.
> 
> Maybe I remember wrong, but I seem to recall Nick was proposing to
> split the VFS LRUs per-zone/NUMA too, and Christoph was against it (or
> maybe it was the other way around :).

It was Nick and myself that differed in opinion, not Christoph ;)

> But it wasn't discussed in a
> container context and I don't remember exactly what the cons were.
> Global usually provides better LRU behaviour, and splitting arbitrarily
> among zones/nodes tends to be a scalability boost, but if we only see
> it as a lock-scalability improvement it becomes a tradeoff between
> better LRU information and better scalability, and then we could
> arbitrarily split the LRU without regard for the actual zone/node
> size.

Which was my argument - that filesystems scale along different axes
than the VM, so tightly integrating the LRU implementation at a high
level with the MM architecture is not necessarily the right thing to
do in all cases.  If we tie everything to the MM architecture,
anything that scales along a different axis to the MM subsystem is
stuck with a nasty impedance mismatch.

Case in point: XFS scales its internal inode cache via per
allocation group structures, not per-NUMA zone. It does this to
provide optimal IO scheduling when reclaiming inodes because the
cost of making bad IO decisions is orders of magnitude worse than
reclaiming an object that the VM doesn't consider necessary for
reclaim...

IOWs, XFS optimises inode reclaim for high IO throughput as IO has
much higher cost than spending CPU time scanning lists. Hence
filesystems often have a fundamentally different memory reclaim
scalability problem to the MM subsystem, but Nick considered that
problem irrelevant to the architecture of the VFS cache reclaim
subsystem. That was the basis of our disagreement....

> It is much better to split LRUs on zone/node boundaries when
> there's a real need to shrink specific zones/nodes from a reclaim
> point of view, not just for better scalability when taking the lock.
>
> We obviously use that zone/node LRU split in the pagecache LRU, and
> clearly when we have to shrink a single node it helps with more than
> just lock scalability, so per-node LRUs are certainly needed in NUMA
> setups with HARDWALL NUMA pins, as they save a ton of CPU and avoid
> global LRU churning as well.

Which was Nick's argument - it should all be done according to how
the MM subsystem sees the world....

> So maybe a per-zone/node LRU would
> provide similar benefits for the VFS caches and would indeed be the
> right direction from the MM point of view (ignoring this very issue of
> containers). Right now we do blind VFS shrinks when we could do
> selective zone/node ones, as the VFS shrinker caller already has the
> zone/node info. Maybe whoever was against it (regardless of this
> container dentry limit discussion) should point out what the cons are.

Nick's implementation of the per-mm-zone LRUs was the biggest
problem - it tightly coupled the VFS cache LRUs directly to the
struct zone in the mm.  This meant that any subsystem that wanted to
use the same per-node LRU + shrinker infrastructure needed to tie
deeply into the MM architecture.  IOWs, it didn't provide any
abstraction from the VM, nor provide the necessary flexibility for
subsystems to use their own LRUs or object reclaim tracking
infrastructure....

> > I never implied quotas were for limiting cache usage. I only
> > suggested they were the solution to your DOS example by preventing
> > unbound numbers of inodes from being created by an unprivileged
> > user.
> > 
> > To me, it sounds like you overprovision your servers and then
> > have major troubles when everyone tries to use what you supplied
> > them with simultaneously. There is a simple solution to that. ;)
> > Otherwise, I think you need to directly limit the size of the inode
> > caches, not try to do it implicitly via 2nd and 3rd order side
> > effects of controlling the size of the dentry cache.
> 
> They want to limit the amount of simultaneously pinned kernel RAM
> structures, while still allowing a huge number of files in the
> filesystem to make life simple during install

But we don't pin the memory in the vfs caches forever - it gets
freed when we run out of free memory. The caches only grow large
when there isn't any other selection pressure (i.e. application or
page cache) to cause the VFS caches to shrink. In reality, this is a
generic problem that people have been hitting for years, and it is
not specific to containerised configurations.

> etc... So you can untar a backup of whatever size into the container
> regardless of quotas, and if only a part of the unpacked data (the
> common case) is used by the apps, it just works.

Well, it does just work in most cases, containerised or not. It's
when you push the boundaries (effectively overcommit resources) that
the current cache reclaim algorithms fail, containerised or not.

The point I'm trying to get across is that this problem is not
unique to containerised systems, so a solution that is tailored to a
specific containerised system implementation does not solve the
generic problem that is the root cause.

> Again, I don't think the objective is perfect accounting, just
> something that happens to work better; if one wants perfect accounting
> of the memory and bytes used by the on-disk image, there are other
> types of virt available.

IOWs, they want per-container resource limits. I'd suggest that any
solution along these lines needs to use existing infrastructure (i.e
cgroups) to control the resource usage of a given container...

FYI, the way I am trying to solve this problem is as follows:

	1. Encode the reclaim dependency in the VFS cache memory
	   reclaim implementation.

		-> per-sb shrinker implementation
			-> per-sb LRU lists
			-> per-sb locking
			-> binds dentry, inode and filesystem inode
			   cache reclaim together
			-> allows LRU scalability to be addressed
			   independently at a future time
		-> patches already out for review

	2. provide global cache limiting at inode/dentry allocation
	   time

		-> calls per-sb shrinker to free inodes on same sb
		-> can be done asynchronously
		-> no new locking/lifecycle issues
		-> no cross-sb reference/locking issues

	3. add cache size limiting to cgroup infrastructure
		-> just another level of abstraction to existing
		   infrastructure
		-> ties in with existing resource limiting
		   mechanisms in the kernel
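
To sketch what 1. and 2. above could look like in code (illustrative
only; the field and function names here are not necessarily those in
the posted patches):

	struct super_block {
		/* ... existing fields ... */
		struct shrinker		s_shrink;	/* same lifetime as the sb */
		struct list_head	s_dentry_lru;
		struct list_head	s_inode_lru;
		long			s_nr_dentry_unused;
		long			s_nr_inodes_unused;
	};

	static int prune_super(struct shrinker *shrink, int nr_to_scan,
			       gfp_t gfp_mask)
	{
		struct super_block *sb;

		sb = container_of(shrink, struct super_block, s_shrink);

		if (nr_to_scan) {
			/* dentries first: dropping them unpins their inodes */
			prune_dcache_sb(sb, nr_to_scan);
			/* then the VFS inode cache for this sb */
			prune_icache_sb(sb, nr_to_scan);
			/* finally let the fs trim its own inode cache */
			if (sb->s_op->free_cached_objects)
				sb->s_op->free_cached_objects(sb, nr_to_scan);
		}

		/* estimate of what is left to reclaim on this sb */
		return sb->s_nr_dentry_unused + sb->s_nr_inodes_unused;
	}

For 2., d_alloc()/new_inode() can then call the same per-sb callback
directly when a configured limit is exceeded, with no cross-sb
references or locking involved.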

Basically the concept of "mobs" ends up being subsumed by
cgroups, but only at the LRU level and has no hooks into the dentry
cache hierarchy at all. i.e. reclaim works just like it does now,
but it is simply container-aware. We're already having to solve
these issues for cgroup-aware dirty page writeback (i.e. making the
bdi-flusher infrastructure cgroup aware), so this is not as big a
leap as you might think. It also avoids the need for a one-off
configuration ABI just for controlling the dentry cache....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC][PATCH 0/13] Per-container dcache management (and a bit more)
  2011-06-20  0:49         ` Dave Chinner
@ 2011-07-04  5:32           ` Pavel Emelyanov
  0 siblings, 0 replies; 23+ messages in thread
From: Pavel Emelyanov @ 2011-07-04  5:32 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Andrea Arcangeli, Hugh Dickins, Nick Piggin, Rik van Riel,
	Dave Hansen, Alexander Viro, linux-fsdevel

>> Again, I don't think the objective is perfect accounting, just
>> something that happens to work better; if one wants perfect accounting
>> of the memory and bytes used by the on-disk image, there are other
>> types of virt available.
> 
> IOWs, they want per-container resource limits. I'd suggest that any
> solution along these lines needs to use existing infrastructure (i.e
> cgroups) to control the resource usage of a given container...

The problem is that dcache memory is not tied to ANY group of tasks, while
the cgroups engine's idea is to control the resource consumption of a task
group. Thus the cgroup engine is not suitable for this.

Besides, the dcache is likely to get shared between several tasks (task groups),
and thus we'll face the same problems we're now having with the memory controller.

However, I'm open to discussion if you propose some way of extending it
to resources that are not directly related to tasks.

> FYI, the way I am trying to solve this problem is as follows:
> 
> 	1. Encode the reclaim dependency in the VFS cache memory
> 	   reclaim implementation.
> 
> 		-> per-sb shrinker implementation
> 			-> per-sb LRU lists
> 			-> per-sb locking
> 			-> binds dentry, inode and filesystem inode
> 			   cache reclaim together
> 			-> allows LRU scalability to be addressed
> 			   independently at a future time
> 		-> patches already out for review
> 
> 	2. provide global cache limiting at inode/dentry allocation
> 	   time
> 
> 		-> calls per-sb shrinker to free inodes on same sb
> 		-> can be done asynchronously
> 		-> no new locking/lifecycle issues
> 		-> no cross-sb reference/locking issues

For 1. and 2., do s/per-sb/per-mob/.
So how does this differ from the mobs concept?

> 	3. add cache size limiting to cgroup infrastructure
> 		-> just another level of abstraction to existing
> 		   infrastructure
> 		-> ties in with existing resource limiting
> 		   mechanisms in the kernel
> 
> Basically the concept of "mobs" ends up being subsumed by cgroups,

Nope, the mobs are independent from cgroups.

> but only at the LRU level and has no hooks into the dentry
> cache hierarchy at all. 

What do you mean by this? The mob is tightly coupled with the dentry cache!

> i.e. reclaim works just like it does now, but it is simply container-aware. 

No, it is not - it just breaks the global dentry LRU into smaller ones.
There are no containers at that point at all.

> We're already having to solve
> these issues for cgroup-aware dirty page writeback (i.e. making the
> bdi-flusher infrastructure cgroup aware), so this is not as big a
> leap as you might think. It also avoids the need for a one-off
> configuration ABI just for controlling the dentry cache....

Controlling the dirty sets (of both pages and inodes) is an important task,
of course, but making dcache management an essential part of it is the
wrong way to go, methinks.

> Cheers,
> 
> Dave.


^ permalink raw reply	[flat|nested] 23+ messages in thread

end of thread, other threads:[~2011-07-04  5:34 UTC | newest]

Thread overview: 23+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2011-05-03 12:14 [RFC][PATCH 0/13] Per-container dcache management (and a bit more) Pavel Emelyanov
2011-05-03 12:15 ` [PATCH 1/13] vfs: Lighten r/o rename_lock lockers Pavel Emelyanov
2011-05-03 12:15 ` [PATCH 2/13] vfs: Factor out rename_lock locking Pavel Emelyanov
2011-05-03 12:16 ` [PATCH 3/13] vfs: Make the rename_lock per-sb Pavel Emelyanov
2011-05-03 12:16 ` [PATCH 4/13] vfs: Factor out tree (of four) shrinkers code Pavel Emelyanov
2011-05-03 12:17 ` [PATCH 5/13] vfs: Make dentry LRU list global Pavel Emelyanov
2011-05-03 12:17 ` [PATCH 6/13] vfs: Turn the nr_dentry into percpu_counter Pavel Emelyanov
2011-05-03 12:18 ` [PATCH 7/13] vfs: Limit the number of dentries globally Pavel Emelyanov
2011-05-03 12:18 ` [PATCH 8/13] vfs: Introduce the dentry mobs Pavel Emelyanov
2011-06-18 13:40   ` Andrea Arcangeli
2011-05-03 12:18 ` [PATCH 9/13] vfs: More than one mob management Pavel Emelyanov
2011-05-03 12:19 ` [PATCH 10/13] vfs: Routnes for setting mob size and getting stats Pavel Emelyanov
2011-05-03 12:19 ` [PATCH 11/13] vfs: Make shrink_dcache_memory prune dcache from all mobs Pavel Emelyanov
2011-05-03 12:20 ` [PATCH 12/13] vfs: Mobs creation and mgmt API Pavel Emelyanov
2011-05-03 12:20 ` [PATCH 13/13] vfs: Dentry mobs listing in proc Pavel Emelyanov
2011-05-06  1:05 ` [RFC][PATCH 0/13] Per-container dcache management (and a bit more) Dave Chinner
2011-05-06 12:15   ` Pavel Emelyanov
2011-05-07  0:01     ` Dave Chinner
2011-05-10 11:18       ` Pavel Emelyanov
2011-06-18 13:30       ` Andrea Arcangeli
2011-06-20  0:49         ` Dave Chinner
2011-07-04  5:32           ` Pavel Emelyanov
2011-05-23  6:43 ` Pavel Emelyanov
