[PATCH 0/4] Per-superblock dcache limitation

linux-fsdevel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH 0/4] Per-superblock dcache limitation
@ 2011-07-29 13:44 Glauber Costa
  2011-07-29 13:44 ` [PATCH 1/4] Keep nr_dentry per super block Glauber Costa
                   ` (3 more replies)
  0 siblings, 4 replies; 13+ messages in thread
From: Glauber Costa @ 2011-07-29 13:44 UTC (permalink / raw)
  To: linux-kernel-u79uwXL29TY76Z2rM5mHXA
  Cc: Andrea Arcangeli, Rik van Riel, Nick Piggin, Pavel Emelyanov,
	David Chinner,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Hugh Dickins, James Bottomley, Dave Hansen, Al Viro,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA

Hello,

Now that we've got per-sb dcache lru lists, I am proposing
through theses patches that each of those be size-limited.
The main use case I am targetting is containers, where we want
a container not to use more than a certain specified amount
of non-shrikable kernel memory. 

Comments welcome

Glauber Costa (4):
  Keep nr_dentry per super block
  limit nr_dentries per superblock
  dcache set size
  parse options in the vfs level

 fs/dcache.c            |   75 +++++++++++++++++++++++++++++++---------
 fs/namespace.c         |   89 ++++++++++++++++++++++++++++++++++++++++++++++++
 fs/super.c             |    3 ++
 include/linux/dcache.h |    1 +
 include/linux/fs.h     |    3 ++
 5 files changed, 155 insertions(+), 16 deletions(-)

-- 
1.7.6

^ permalink raw reply	[flat|nested] 13+ messages in thread

* [PATCH 1/4] Keep nr_dentry per super block
  2011-07-29 13:44 [PATCH 0/4] Per-superblock dcache limitation Glauber Costa
@ 2011-07-29 13:44 ` Glauber Costa
  2011-07-31  0:50   ` Dave Chinner
  2011-07-29 13:44 ` [PATCH 2/4] limit nr_dentries per superblock Glauber Costa
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 13+ messages in thread
From: Glauber Costa @ 2011-07-29 13:44 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-fsdevel, containers, Pavel Emelyanov, Al Viro, Hugh Dickins,
	Nick Piggin, Andrea Arcangeli, Rik van Riel, Dave Hansen,
	James Bottomley, David Chinner, Glauber Costa

Now that we have per-sb shrinkers, it makes sense to have nr_dentries
stored per sb as well. We turn them into per-cpu counters so we can
keep acessing them without locking.

Signed-off-by: Glauber Costa <glommer@parallels.com>
CC: Dave Chinner <david@fromorbit.com>
---
 fs/dcache.c        |   18 ++++++++++--------
 fs/super.c         |    2 ++
 include/linux/fs.h |    2 ++
 3 files changed, 14 insertions(+), 8 deletions(-)

diff --git a/fs/dcache.c b/fs/dcache.c
index b05aac3..9cb6395 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -115,16 +115,18 @@ struct dentry_stat_t dentry_stat = {
 	.age_limit = 45,
 };
 
-static DEFINE_PER_CPU(unsigned int, nr_dentry);
-
 #if defined(CONFIG_SYSCTL) && defined(CONFIG_PROC_FS)
+static void super_nr_dentry(struct super_block *sb, void *arg)
+{
+	int *dentries = arg;
+	*dentries += percpu_counter_sum_positive(&sb->s_nr_dentry);
+}
+
 static int get_nr_dentry(void)
 {
-	int i;
 	int sum = 0;
-	for_each_possible_cpu(i)
-		sum += per_cpu(nr_dentry, i);
-	return sum < 0 ? 0 : sum;
+	iterate_supers(super_nr_dentry, &sum);
+	return sum;
 }
 
 int proc_nr_dentry(ctl_table *table, int write, void __user *buffer,
@@ -151,7 +153,7 @@ static void __d_free(struct rcu_head *head)
 static void d_free(struct dentry *dentry)
 {
 	BUG_ON(dentry->d_count);
-	this_cpu_dec(nr_dentry);
+	percpu_counter_dec(&dentry->d_sb->s_nr_dentry);
 	if (dentry->d_op && dentry->d_op->d_release)
 		dentry->d_op->d_release(dentry);
 
@@ -1225,7 +1227,7 @@ struct dentry *__d_alloc(struct super_block *sb, const struct qstr *name)
 	INIT_LIST_HEAD(&dentry->d_u.d_child);
 	d_set_d_op(dentry, dentry->d_sb->s_d_op);
 
-	this_cpu_inc(nr_dentry);
+	percpu_counter_inc(&dentry->d_sb->s_nr_dentry);
 
 	return dentry;
 }
diff --git a/fs/super.c b/fs/super.c
index 3f56a26..b16d8e8 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -183,6 +183,8 @@ static struct super_block *alloc_super(struct file_system_type *type)
 		s->s_shrink.seeks = DEFAULT_SEEKS;
 		s->s_shrink.shrink = prune_super;
 		s->s_shrink.batch = 1024;
+
+		percpu_counter_init(&s->s_nr_dentry, 0);
 	}
 out:
 	return s;
diff --git a/include/linux/fs.h b/include/linux/fs.h
index f23bcb7..8150f52 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1399,6 +1399,8 @@ struct super_block {
 	struct list_head	s_dentry_lru;	/* unused dentry lru */
 	int			s_nr_dentry_unused;	/* # of dentry on lru */
 
+	struct percpu_counter 	s_nr_dentry;		/* # of dentry on this sb */
+
 	/* s_inode_lru_lock protects s_inode_lru and s_nr_inodes_unused */
 	spinlock_t		s_inode_lru_lock ____cacheline_aligned_in_smp;
 	struct list_head	s_inode_lru;		/* unused inode lru */
-- 
1.7.6

^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: [PATCH 1/4] Keep nr_dentry per super block
  2011-07-29 13:44 ` [PATCH 1/4] Keep nr_dentry per super block Glauber Costa
@ 2011-07-31  0:50   ` Dave Chinner
  0 siblings, 0 replies; 13+ messages in thread
From: Dave Chinner @ 2011-07-31  0:50 UTC (permalink / raw)
  To: Glauber Costa
  Cc: linux-kernel, linux-fsdevel, containers, Pavel Emelyanov, Al Viro,
	Hugh Dickins, Nick Piggin, Andrea Arcangeli, Rik van Riel,
	Dave Hansen, James Bottomley

On Fri, Jul 29, 2011 at 05:44:16PM +0400, Glauber Costa wrote:
> Now that we have per-sb shrinkers, it makes sense to have nr_dentries
> stored per sb as well. We turn them into per-cpu counters so we can
> keep acessing them without locking.

Comments below.

> Signed-off-by: Glauber Costa <glommer@parallels.com>
> CC: Dave Chinner <david@fromorbit.com>
> ---
>  fs/dcache.c        |   18 ++++++++++--------
>  fs/super.c         |    2 ++
>  include/linux/fs.h |    2 ++
>  3 files changed, 14 insertions(+), 8 deletions(-)
> 
> diff --git a/fs/dcache.c b/fs/dcache.c
> index b05aac3..9cb6395 100644
> --- a/fs/dcache.c
> +++ b/fs/dcache.c
> @@ -115,16 +115,18 @@ struct dentry_stat_t dentry_stat = {
>  	.age_limit = 45,
>  };
>  
> -static DEFINE_PER_CPU(unsigned int, nr_dentry);
> -
>  #if defined(CONFIG_SYSCTL) && defined(CONFIG_PROC_FS)
> +static void super_nr_dentry(struct super_block *sb, void *arg)
> +{
> +	int *dentries = arg;
> +	*dentries += percpu_counter_sum_positive(&sb->s_nr_dentry);
> +}
> +
>  static int get_nr_dentry(void)
>  {
> -	int i;
>  	int sum = 0;
> -	for_each_possible_cpu(i)
> -		sum += per_cpu(nr_dentry, i);
> -	return sum < 0 ? 0 : sum;
> +	iterate_supers(super_nr_dentry, &sum);
> +	return sum;
>  }

That is rather expensive for large CPU count machines. Think of what
happens now when someone now reads nr_dentrys on a 4096 CPU machine
with a couple of hundred active superblocks.

If you are going to use the struct percpu_counter (see below,
however), then we coul dprobably just get away with a
percpu_counter_read_positive() call as this summation is used only
by /proc readers.

However, I'd suggest that you just leave the existing global counter
alone - it has very little overhead and avoids the need for per-sb,
per-cpu iteration explosions.

>  int proc_nr_dentry(ctl_table *table, int write, void __user *buffer,
> @@ -151,7 +153,7 @@ static void __d_free(struct rcu_head *head)
>  static void d_free(struct dentry *dentry)
>  {
>  	BUG_ON(dentry->d_count);
> -	this_cpu_dec(nr_dentry);
> +	percpu_counter_dec(&dentry->d_sb->s_nr_dentry);
>  	if (dentry->d_op && dentry->d_op->d_release)
>  		dentry->d_op->d_release(dentry);
>  
> @@ -1225,7 +1227,7 @@ struct dentry *__d_alloc(struct super_block *sb, const struct qstr *name)
>  	INIT_LIST_HEAD(&dentry->d_u.d_child);
>  	d_set_d_op(dentry, dentry->d_sb->s_d_op);
>  
> -	this_cpu_inc(nr_dentry);
> +	percpu_counter_inc(&dentry->d_sb->s_nr_dentry);
>  
>  	return dentry;
>  }
> diff --git a/fs/super.c b/fs/super.c
> index 3f56a26..b16d8e8 100644
> --- a/fs/super.c
> +++ b/fs/super.c
> @@ -183,6 +183,8 @@ static struct super_block *alloc_super(struct file_system_type *type)
>  		s->s_shrink.seeks = DEFAULT_SEEKS;
>  		s->s_shrink.shrink = prune_super;
>  		s->s_shrink.batch = 1024;
> +
> +		percpu_counter_init(&s->s_nr_dentry, 0);
>  	}
>  out:
>  	return s;
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index f23bcb7..8150f52 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -1399,6 +1399,8 @@ struct super_block {
>  	struct list_head	s_dentry_lru;	/* unused dentry lru */
>  	int			s_nr_dentry_unused;	/* # of dentry on lru */
>  
> +	struct percpu_counter 	s_nr_dentry;		/* # of dentry on this sb */
> +

I got well and truly beaten down for trying to use struct
percpu_counter counters in the inode and dentry cache because "they
have way too much overhead for fast path operations" compared to
this_cpu_inc() and this_cpu_dec(). That requires more work to set
up, though, for embedded structures like this (i.e. needs it's own
initialisation via alloc_percpu(), IIRC), but should result in an
implementation with no additional overhead.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 13+ messages in thread

* [PATCH 2/4] limit nr_dentries per superblock
  2011-07-29 13:44 [PATCH 0/4] Per-superblock dcache limitation Glauber Costa
  2011-07-29 13:44 ` [PATCH 1/4] Keep nr_dentry per super block Glauber Costa
@ 2011-07-29 13:44 ` Glauber Costa
  2011-07-31  1:15   ` Dave Chinner
  2011-07-29 13:44 ` [PATCH 3/4] dcache set size Glauber Costa
  2011-07-29 13:44 ` [PATCH 4/4] parse options in the vfs level Glauber Costa
  3 siblings, 1 reply; 13+ messages in thread
From: Glauber Costa @ 2011-07-29 13:44 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-fsdevel, containers, Pavel Emelyanov, Al Viro, Hugh Dickins,
	Nick Piggin, Andrea Arcangeli, Rik van Riel, Dave Hansen,
	James Bottomley, David Chinner, Glauber Costa

This patch lays the foundation for us to limit the dcache size.
Each super block can have only a maximum amount of dentries under its
sub-tree. Allocation fails if we we're over limit and the cache
can't be pruned to free up space for the newcomers.

Signed-off-by: Glauber Costa <glommer@parallels.com>
CC: Dave Chinner <david@fromorbit.com>
---
 fs/dcache.c        |   45 +++++++++++++++++++++++++++++++++++++--------
 fs/super.c         |    1 +
 include/linux/fs.h |    1 +
 3 files changed, 39 insertions(+), 8 deletions(-)

diff --git a/fs/dcache.c b/fs/dcache.c
index 9cb6395..3bdb106 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -663,7 +663,7 @@ EXPORT_SYMBOL(d_prune_aliases);
  *
  * This may fail if locks cannot be acquired no problem, just try again.
  */
-static void try_prune_one_dentry(struct dentry *dentry)
+static int try_prune_one_dentry(struct dentry *dentry)
 	__releases(dentry->d_lock)
 {
 	struct dentry *parent;
@@ -680,9 +680,9 @@ static void try_prune_one_dentry(struct dentry *dentry)
 	 * fragmentation.
 	 */
 	if (!parent)
-		return;
+		return 1;
 	if (parent == dentry)
-		return;
+		return 0;
 
 	/* Prune ancestors. */
 	dentry = parent;
@@ -691,15 +691,17 @@ static void try_prune_one_dentry(struct dentry *dentry)
 		if (dentry->d_count > 1) {
 			dentry->d_count--;
 			spin_unlock(&dentry->d_lock);
-			return;
+			return 1;
 		}
 		dentry = dentry_kill(dentry, 1);
 	}
+	return 1;
 }
 
-static void shrink_dentry_list(struct list_head *list)
+static int shrink_dentry_list(struct list_head *list)
 {
 	struct dentry *dentry;
+	int pruned = 0;
 
 	rcu_read_lock();
 	for (;;) {
@@ -725,11 +727,13 @@ static void shrink_dentry_list(struct list_head *list)
 
 		rcu_read_unlock();
 
-		try_prune_one_dentry(dentry);
+		pruned += try_prune_one_dentry(dentry);
 
 		rcu_read_lock();
 	}
 	rcu_read_unlock();
+
+	return pruned;
 }
 
 /**
@@ -740,7 +744,7 @@ static void shrink_dentry_list(struct list_head *list)
  *
  * If flags contains DCACHE_REFERENCED reference dentries will not be pruned.
  */
-static void __shrink_dcache_sb(struct super_block *sb, int count, int flags)
+static int __shrink_dcache_sb(struct super_block *sb, int count, int flags)
 {
 	struct dentry *dentry;
 	LIST_HEAD(referenced);
@@ -781,7 +785,7 @@ relock:
 		list_splice(&referenced, &sb->s_dentry_lru);
 	spin_unlock(&dcache_lru_lock);
 
-	shrink_dentry_list(&tmp);
+	return shrink_dentry_list(&tmp);
 }
 
 /**
@@ -1176,6 +1180,28 @@ void shrink_dcache_parent(struct dentry * parent)
 }
 EXPORT_SYMBOL(shrink_dcache_parent);
 
+static int dcache_mem_check(struct super_block *sb)
+{
+	int nr_dentry;
+	int count;
+
+	if (percpu_counter_read_positive(&sb->s_nr_dentry) <=
+						sb->s_nr_dentry_max)
+		return 0;
+
+	do {
+		nr_dentry = percpu_counter_sum_positive(&sb->s_nr_dentry);
+
+		if (nr_dentry < sb->s_nr_dentry_max)
+			return 0;
+
+		count = (nr_dentry - sb->s_nr_dentry_max) << 1;
+
+	} while (__shrink_dcache_sb(sb, count, DCACHE_REFERENCED) > 0);
+
+	return -ENOMEM;
+}
+
 /**
  * __d_alloc	-	allocate a dcache entry
  * @sb: filesystem it will belong to
@@ -1191,6 +1217,9 @@ struct dentry *__d_alloc(struct super_block *sb, const struct qstr *name)
 	struct dentry *dentry;
 	char *dname;
 
+	if (dcache_mem_check(sb))
+		return NULL;
+
 	dentry = kmem_cache_alloc(dentry_cache, GFP_KERNEL);
 	if (!dentry)
 		return NULL;
diff --git a/fs/super.c b/fs/super.c
index b16d8e8..d4ad43a 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -185,6 +185,7 @@ static struct super_block *alloc_super(struct file_system_type *type)
 		s->s_shrink.batch = 1024;
 
 		percpu_counter_init(&s->s_nr_dentry, 0);
+		s->s_nr_dentry_max = INT_MAX;
 	}
 out:
 	return s;
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 8150f52..1dfff2e 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1399,6 +1399,7 @@ struct super_block {
 	struct list_head	s_dentry_lru;	/* unused dentry lru */
 	int			s_nr_dentry_unused;	/* # of dentry on lru */
 
+	int 			s_nr_dentry_max;	/* max # of dentry on this sb*/
 	struct percpu_counter 	s_nr_dentry;		/* # of dentry on this sb */
 
 	/* s_inode_lru_lock protects s_inode_lru and s_nr_inodes_unused */
-- 
1.7.6


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: [PATCH 2/4] limit nr_dentries per superblock
  2011-07-29 13:44 ` [PATCH 2/4] limit nr_dentries per superblock Glauber Costa
@ 2011-07-31  1:15   ` Dave Chinner
  2011-08-02 12:46     ` Glauber Costa
  0 siblings, 1 reply; 13+ messages in thread
From: Dave Chinner @ 2011-07-31  1:15 UTC (permalink / raw)
  To: Glauber Costa
  Cc: linux-kernel, linux-fsdevel, containers, Pavel Emelyanov, Al Viro,
	Hugh Dickins, Nick Piggin, Andrea Arcangeli, Rik van Riel,
	Dave Hansen, James Bottomley

On Fri, Jul 29, 2011 at 05:44:17PM +0400, Glauber Costa wrote:
> This patch lays the foundation for us to limit the dcache size.
> Each super block can have only a maximum amount of dentries under its
> sub-tree. Allocation fails if we we're over limit and the cache
> can't be pruned to free up space for the newcomers.
> 
> Signed-off-by: Glauber Costa <glommer@parallels.com>
> CC: Dave Chinner <david@fromorbit.com>
> ---
>  fs/dcache.c        |   45 +++++++++++++++++++++++++++++++++++++--------
>  fs/super.c         |    1 +
>  include/linux/fs.h |    1 +
>  3 files changed, 39 insertions(+), 8 deletions(-)
> 
> diff --git a/fs/dcache.c b/fs/dcache.c
> index 9cb6395..3bdb106 100644
> --- a/fs/dcache.c
> +++ b/fs/dcache.c
> @@ -663,7 +663,7 @@ EXPORT_SYMBOL(d_prune_aliases);
>   *
>   * This may fail if locks cannot be acquired no problem, just try again.
>   */
> -static void try_prune_one_dentry(struct dentry *dentry)
> +static int try_prune_one_dentry(struct dentry *dentry)
>  	__releases(dentry->d_lock)
>  {
>  	struct dentry *parent;

Comment explaining the return value?

[...]
> @@ -740,7 +744,7 @@ static void shrink_dentry_list(struct list_head *list)
>   *
>   * If flags contains DCACHE_REFERENCED reference dentries will not be pruned.
>   */
> -static void __shrink_dcache_sb(struct super_block *sb, int count, int flags)
> +static int __shrink_dcache_sb(struct super_block *sb, int count, int flags)
>  {
>  	struct dentry *dentry;
>  	LIST_HEAD(referenced);

What's the new return value do?

> @@ -781,7 +785,7 @@ relock:
>  		list_splice(&referenced, &sb->s_dentry_lru);
>  	spin_unlock(&dcache_lru_lock);
>  
> -	shrink_dentry_list(&tmp);
> +	return shrink_dentry_list(&tmp);
>  }
>  
>  /**
> @@ -1176,6 +1180,28 @@ void shrink_dcache_parent(struct dentry * parent)
>  }
>  EXPORT_SYMBOL(shrink_dcache_parent);
>  
> +static int dcache_mem_check(struct super_block *sb)
> +{
> +	int nr_dentry;
> +	int count;
> +
> +	if (percpu_counter_read_positive(&sb->s_nr_dentry) <=
> +						sb->s_nr_dentry_max)
> +		return 0;
> +
> +	do {
> +		nr_dentry = percpu_counter_sum_positive(&sb->s_nr_dentry);
> +
> +		if (nr_dentry < sb->s_nr_dentry_max)
> +			return 0;
> +
> +		count = (nr_dentry - sb->s_nr_dentry_max) << 1;
> +
> +	} while (__shrink_dcache_sb(sb, count, DCACHE_REFERENCED) > 0);
> +
> +	return -ENOMEM;
> +}


I don't like the idea of trying to be exact on how many we shrink.
percpu_counter_read_positive() will always have an error in it:
+/-(N cpus * 31) by default. Statistically speaking 50% of the cases
it passes will then get rejected on the expensive sum operation. And
then we shrink only 2x the difference between the counter and the
max value, which will leave it inside the error margin of the
counter anyway. Hence once we hit the threshold, we'll continually
take expensive sum operations, many for no reason.  There needs to
be some hysteresis here to avoid simply hammering the percpu counter
on every single __d_alloc call. Perhaps per-cpu counters are not
suited for this purpose?

Indeed, what happens if one caller saw the read and sum over the
limit, then starts shrinking but other callers don't see the read
over the limit but keep allocating dentries and hence keeping the
shrinking caller's sum over the limit and hence forever shrinking?
That seems like a unprivileged random process DOS, yes?

So at minimum, the loop should not recalculate how much to shrink -
calculate the amount to shrink, the run a loop to shrink that much
and only that much. If there are other concurrent callers, then
they'll also do a shrink, but nothing will get stuck in a loop.

Hmmm, what happens when the size of the cache and/or the max value
are less than the error margin of the counter?

Also, this doesn't address the problem I originally commented on -
it only shrinks the dcache, not the inode or filesystem caches that
hold all the memory.  I mentioned the new per-sb shrinkers for a
reason - you should be calling them, not __shrink_dcache_sb(). i.e.
building a scan_control structure and calling
sb->shrinker.shrink(&sb->shrinker, &sc);

I'm seriously starting to think that this needs to share an
implementation with the mm shrinker code, because it already
solves most of these problems...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 2/4] limit nr_dentries per superblock
  2011-07-31  1:15   ` Dave Chinner
@ 2011-08-02 12:46     ` Glauber Costa
  0 siblings, 0 replies; 13+ messages in thread
From: Glauber Costa @ 2011-08-02 12:46 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Andrea Arcangeli, Rik van Riel, Nick Piggin, Pavel Emelyanov,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Hugh Dickins, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	James Bottomley, Dave Hansen, Al Viro,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA

Hi David,

Thank you for your comments. I actually agree with most of the things 
you pointed out, so I'll focus my reply here only.

> Also, this doesn't address the problem I originally commented on -
> it only shrinks the dcache, not the inode or filesystem caches that
> hold all the memory.

I understand that the icache is a greater memory user than the dcache.
I also understand that the icache can be dirty due to pending IO.
I am also okay to shrink some of the icache every time we shrink the 
dcache - and see below, this might happen if I change they way I am 
calling the shrinker.

However, since the dcache precedes the icache in the cache hierarchy,
and is the actual entity pinning the icache into memory, I don't see a 
reason to *limit* the icache size, since once the dcache is limited,
anything holding the icache into memory will go away eventually, and the 
shrinkers that run during memory pressure will be able to get rid of it.

> I mentioned the new per-sb shrinkers for a
> reason - you should be calling them, not __shrink_dcache_sb(). i.e.
> building a scan_control structure and calling
> sb->shrinker.shrink(&sb->shrinker,&sc);
>

Would you be okay with having prune_super() being called whenever the
nr_dentries_max limit is reached? It will effectively shrink both
the icache and the dcache, using the dentry limit as a trigger only.

> I'm seriously starting to think that this needs to share an
> implementation with the mm shrinker code, because it already
> solves most of these problems...

I have to think more about that. Hold on!

^ permalink raw reply	[flat|nested] 13+ messages in thread

* [PATCH 3/4] dcache set size
  2011-07-29 13:44 [PATCH 0/4] Per-superblock dcache limitation Glauber Costa
  2011-07-29 13:44 ` [PATCH 1/4] Keep nr_dentry per super block Glauber Costa
  2011-07-29 13:44 ` [PATCH 2/4] limit nr_dentries per superblock Glauber Costa
@ 2011-07-29 13:44 ` Glauber Costa
  2011-07-31  1:38   ` Dave Chinner
  2011-07-29 13:44 ` [PATCH 4/4] parse options in the vfs level Glauber Costa
  3 siblings, 1 reply; 13+ messages in thread
From: Glauber Costa @ 2011-07-29 13:44 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-fsdevel, containers, Pavel Emelyanov, Al Viro, Hugh Dickins,
	Nick Piggin, Andrea Arcangeli, Rik van Riel, Dave Hansen,
	James Bottomley, David Chinner, Glauber Costa

Simple patch that provides a function allowing its caller to set
the maximum number of dentries that can exist at the same time
at a given super block. Will be used in a later patch.

Signed-off-by: Glauber Costa <glommer@parallels.com>
CC: Dave Chinner <david@fromorbit.com>
---
 fs/dcache.c            |   12 ++++++++++++
 include/linux/dcache.h |    1 +
 2 files changed, 13 insertions(+), 0 deletions(-)

diff --git a/fs/dcache.c b/fs/dcache.c
index 3bdb106..4369aa2 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -1202,6 +1202,18 @@ static int dcache_mem_check(struct super_block *sb)
 	return -ENOMEM;
 }
 
+int vfs_set_dcache_size(struct super_block *sb, int size)
+{
+	unsigned long dr = percpu_counter_sum_positive(&sb->s_nr_dentry); 
+
+	if (size != sb->s_nr_dentry_max)
+		prune_dcache_sb(sb, dr - size);
+
+	sb->s_nr_dentry_max = size;
+
+	return 0;
+}
+
 /**
  * __d_alloc	-	allocate a dcache entry
  * @sb: filesystem it will belong to
diff --git a/include/linux/dcache.h b/include/linux/dcache.h
index d37d2a7..cd1ab71 100644
--- a/include/linux/dcache.h
+++ b/include/linux/dcache.h
@@ -251,6 +251,7 @@ extern int d_invalidate(struct dentry *);
 
 /* only used at mount-time */
 extern struct dentry * d_alloc_root(struct inode *);
+extern int vfs_set_dcache_size(struct super_block *sb, int size);
 
 /* <clickety>-<click> the ramfs-type tree */
 extern void d_genocide(struct dentry *);
-- 
1.7.6

^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: [PATCH 3/4] dcache set size
  2011-07-29 13:44 ` [PATCH 3/4] dcache set size Glauber Costa
@ 2011-07-31  1:38   ` Dave Chinner
  0 siblings, 0 replies; 13+ messages in thread
From: Dave Chinner @ 2011-07-31  1:38 UTC (permalink / raw)
  To: Glauber Costa
  Cc: linux-kernel, linux-fsdevel, containers, Pavel Emelyanov, Al Viro,
	Hugh Dickins, Nick Piggin, Andrea Arcangeli, Rik van Riel,
	Dave Hansen, James Bottomley

On Fri, Jul 29, 2011 at 05:44:18PM +0400, Glauber Costa wrote:
> Simple patch that provides a function allowing its caller to set
> the maximum number of dentries that can exist at the same time
> at a given super block. Will be used in a later patch.

Pretty simple - probably better to put it in the patch that uses it,
I think.

> Signed-off-by: Glauber Costa <glommer@parallels.com>
> CC: Dave Chinner <david@fromorbit.com>
> ---
>  fs/dcache.c            |   12 ++++++++++++
>  include/linux/dcache.h |    1 +
>  2 files changed, 13 insertions(+), 0 deletions(-)
> 
> diff --git a/fs/dcache.c b/fs/dcache.c
> index 3bdb106..4369aa2 100644
> --- a/fs/dcache.c
> +++ b/fs/dcache.c
> @@ -1202,6 +1202,18 @@ static int dcache_mem_check(struct super_block *sb)
>  	return -ENOMEM;
>  }
>  
> +int vfs_set_dcache_size(struct super_block *sb, int size)
> +{
> +	unsigned long dr = percpu_counter_sum_positive(&sb->s_nr_dentry); 
> +
> +	if (size != sb->s_nr_dentry_max)
> +		prune_dcache_sb(sb, dr - size);

And why shrink if the new size > the current sb->s_nr_dentry_max?

And what happens when size > dr?

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 13+ messages in thread

* [PATCH 4/4] parse options in the vfs level
  2011-07-29 13:44 [PATCH 0/4] Per-superblock dcache limitation Glauber Costa
                   ` (2 preceding siblings ...)
  2011-07-29 13:44 ` [PATCH 3/4] dcache set size Glauber Costa
@ 2011-07-29 13:44 ` Glauber Costa
  2011-07-31  1:34   ` Dave Chinner
  3 siblings, 1 reply; 13+ messages in thread
From: Glauber Costa @ 2011-07-29 13:44 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-fsdevel, containers, Pavel Emelyanov, Al Viro, Hugh Dickins,
	Nick Piggin, Andrea Arcangeli, Rik van Riel, Dave Hansen,
	James Bottomley, David Chinner, Glauber Costa

This patch introduces a simple generic vfs option parser.
Right now, the only option we have is to limit the size of the dcache.

So any user that wants to have a dcache entries limit, can specify:

  mount -o whatever_options,vfs_dcache_size=XXX <dev> <mntpoint>

It is supposed to work well with remounts, allowing it to change
multiple over the course of the filesystem's lifecycle.

I find mount a natural interface for handling filesystem options,
so that's what I've choosen. Feel free to yell at it at will if
you disagree.

Signed-off-by: Glauber Costa <glommer@parallels.com>
CC: Dave Chinner <david@fromorbit.com>
---
 fs/namespace.c |   89 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 files changed, 89 insertions(+), 0 deletions(-)

diff --git a/fs/namespace.c b/fs/namespace.c
index 22bfe82..11ce45d 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -31,6 +31,7 @@
 #include <linux/idr.h>
 #include <linux/fs_struct.h>
 #include <linux/fsnotify.h>
+#include <linux/parser.h>
 #include <asm/uaccess.h>
 #include <asm/unistd.h>
 #include "pnode.h"
@@ -2271,6 +2272,82 @@ int copy_mount_string(const void __user *data, char **where)
 	return 0;
 }
 
+static const match_table_t tokens = {
+	{1, "vfs_dcache_size=%u"},
+};
+
+struct vfs_options {
+	unsigned long vfs_dcache_size;
+};
+
+/**
+ * Generic option parsing for the VFS.
+ *
+ * Since most of the filesystems already do their own option parsing, and with
+ * very few code shared between them, this function strips out any options that
+ * we succeed in parsing ourselves. Passing them forward would just give the
+ * underlying fs an option it does not expect, leading it to fail.
+ *
+ * We don't yet have a pointer to the super block as well, since this is
+ * pre-mount. We accumulate in struct vfs_options whatever data we collected,
+ * and act on it later.
+ */
+static int vfs_parse_options(char *options, struct vfs_options *ops)
+{
+	substring_t args[MAX_OPT_ARGS];
+	unsigned int option;
+	char *p;
+	char *opt;
+	char *start = NULL;
+	int ret;
+	
+	if (!options)
+		return 0;
+
+	opt = kstrdup(options, GFP_KERNEL);
+	if (!opt)
+		return 1;
+	
+	ret = 1;
+
+	start = opt;
+	while ((p = strsep(&opt, ",")) != NULL) {
+		int token;
+		if (!*p)
+			continue;
+
+		/*
+		 * Initialize args struct so we know whether arg was
+		 * found; some options take optional arguments.
+		 */
+		args[0].to = args[0].from = 0;
+		token = match_token(p, tokens, args);
+		switch (token) {
+		case 1:
+			if (!args[0].from)
+				break;
+
+			if (match_int(&args[0], &option))
+				break;
+			else
+				ops->vfs_dcache_size = option;
+
+			ret = 0;
+			if (!opt) /* it is the last option listed */
+				*(options + (p - start)) = '\0';
+			else
+				strcpy(options + (p - start), opt);
+			break;
+		default:
+			ret = 0;
+			break;
+		}
+	}
+
+	kfree(start);
+	return ret;
+}
+
 /*
  * Flags is a 32-bit value that allows up to 31 non-fs dependent flags to
  * be given to the mount() call (ie: read-only, no-dev, no-suid etc).
@@ -2291,6 +2368,7 @@ long do_mount(char *dev_name, char *dir_name, char *type_page,
 	struct path path;
 	int retval = 0;
 	int mnt_flags = 0;
+	struct vfs_options vfs_options;
 
 	/* Discard magic */
 	if ((flags & MS_MGC_MSK) == MS_MGC_VAL)
@@ -2318,6 +2396,12 @@ long do_mount(char *dev_name, char *dir_name, char *type_page,
 	if (!(flags & MS_NOATIME))
 		mnt_flags |= MNT_RELATIME;
 
+
+	vfs_options.vfs_dcache_size = INT_MAX;
+	retval = vfs_parse_options(data_page, &vfs_options);
+	if (retval)
+		goto dput_out;
+
 	/* Separate the per-mountpoint flags */
 	if (flags & MS_NOSUID)
 		mnt_flags |= MNT_NOSUID;
@@ -2350,6 +2434,11 @@ long do_mount(char *dev_name, char *dir_name, char *type_page,
 	else
 		retval = do_new_mount(&path, type_page, flags, mnt_flags,
 				      dev_name, data_page);
+
+	if (!retval)
+		vfs_set_dcache_size(path.mnt->mnt_sb,
+				    vfs_options.vfs_dcache_size);
+			
 dput_out:
 	path_put(&path);
 	return retval;
-- 
1.7.6

^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: [PATCH 4/4] parse options in the vfs level
  2011-07-29 13:44 ` [PATCH 4/4] parse options in the vfs level Glauber Costa
@ 2011-07-31  1:34   ` Dave Chinner
  2011-08-02 13:04     ` Glauber Costa
  0 siblings, 1 reply; 13+ messages in thread
From: Dave Chinner @ 2011-07-31  1:34 UTC (permalink / raw)
  To: Glauber Costa
  Cc: linux-kernel, linux-fsdevel, containers, Pavel Emelyanov, Al Viro,
	Hugh Dickins, Nick Piggin, Andrea Arcangeli, Rik van Riel,
	Dave Hansen, James Bottomley

On Fri, Jul 29, 2011 at 05:44:19PM +0400, Glauber Costa wrote:
> This patch introduces a simple generic vfs option parser.
> Right now, the only option we have is to limit the size of the dcache.
> 
> So any user that wants to have a dcache entries limit, can specify:
> 
>   mount -o whatever_options,vfs_dcache_size=XXX <dev> <mntpoint>
> 
> It is supposed to work well with remounts, allowing it to change
> multiple over the course of the filesystem's lifecycle.
> 
> I find mount a natural interface for handling filesystem options,
> so that's what I've choosen. Feel free to yell at it at will if
> you disagree.

IMO, the whole point of having a configurable cache size maximum is
that is can be changed at runtime. Tying it to mount options is a
painful way to acheive that because the only way to change it would
be via a remount command.

I'm not sure what the best API is, but I'd prefer something that is
specific to a superblock, not a vfs mount. Perhaps something in
/sys/fs?

> Signed-off-by: Glauber Costa <glommer@parallels.com>
> CC: Dave Chinner <david@fromorbit.com>
> ---
>  fs/namespace.c |   89 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>  1 files changed, 89 insertions(+), 0 deletions(-)
> 
> diff --git a/fs/namespace.c b/fs/namespace.c
> index 22bfe82..11ce45d 100644
> --- a/fs/namespace.c
> +++ b/fs/namespace.c
> @@ -31,6 +31,7 @@
>  #include <linux/idr.h>
>  #include <linux/fs_struct.h>
>  #include <linux/fsnotify.h>
> +#include <linux/parser.h>
>  #include <asm/uaccess.h>
>  #include <asm/unistd.h>
>  #include "pnode.h"
> @@ -2271,6 +2272,82 @@ int copy_mount_string(const void __user *data, char **where)
>  	return 0;
>  }
>  
> +static const match_table_t tokens = {
> +	{1, "vfs_dcache_size=%u"},
> +};
> +
> +struct vfs_options {
> +	unsigned long vfs_dcache_size;
> +};
> +
> +/**
> + * Generic option parsing for the VFS.
> + *
> + * Since most of the filesystems already do their own option parsing, and with
> + * very few code shared between them, this function strips out any options that
> + * we succeed in parsing ourselves. Passing them forward would just give the
> + * underlying fs an option it does not expect, leading it to fail.
> + *
> + * We don't yet have a pointer to the super block as well, since this is
> + * pre-mount. We accumulate in struct vfs_options whatever data we collected,
> + * and act on it later.
> + */
> +static int vfs_parse_options(char *options, struct vfs_options *ops)
> +{
> +	substring_t args[MAX_OPT_ARGS];
> +	unsigned int option;
> +	char *p;
> +	char *opt;
> +	char *start = NULL;
> +	int ret;
> +	
> +	if (!options)
> +		return 0;
> +
> +	opt = kstrdup(options, GFP_KERNEL);
> +	if (!opt)
> +		return 1;
> +	
> +	ret = 1;
> +
> +	start = opt;
> +	while ((p = strsep(&opt, ",")) != NULL) {
> +		int token;
> +		if (!*p)
> +			continue;
> +
> +		/*
> +		 * Initialize args struct so we know whether arg was
> +		 * found; some options take optional arguments.
> +		 */
> +		args[0].to = args[0].from = 0;
> +		token = match_token(p, tokens, args);
> +		switch (token) {
> +		case 1:
> +			if (!args[0].from)
> +				break;
> +
> +			if (match_int(&args[0], &option))
> +				break;
> +			else

No need fo the else.

> +				ops->vfs_dcache_size = option;

Bounds checking? What are valid values? e.g. setting it to a
negative number would be bad, as would a number that is too small
(e.g. 1)....

> +
> +			ret = 0;
> +			if (!opt) /* it is the last option listed */
> +				*(options + (p - start)) = '\0';
> +			else
> +				strcpy(options + (p - start), opt);

What's this for? I don't see any of the other mount option code
doing this sort of thing...

> @@ -2350,6 +2434,11 @@ long do_mount(char *dev_name, char *dir_name, char *type_page,
>  	else
>  		retval = do_new_mount(&path, type_page, flags, mnt_flags,
>  				      dev_name, data_page);
> +
> +	if (!retval)
> +		vfs_set_dcache_size(path.mnt->mnt_sb,
> +				    vfs_options.vfs_dcache_size);
> +			

Hmmmm -  doesn't that mean bind mounts will override the value for
the original, underlying filesytem mount? Isn't that a bad thing to
do? i.e. the limiting is supposed to be per-sb, not per-vfsmnt?

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 4/4] parse options in the vfs level
  2011-07-31  1:34   ` Dave Chinner
@ 2011-08-02 13:04     ` Glauber Costa
  2011-08-02 14:18       ` Al Viro
  0 siblings, 1 reply; 13+ messages in thread
From: Glauber Costa @ 2011-08-02 13:04 UTC (permalink / raw)
  To: Dave Chinner
  Cc: linux-kernel, linux-fsdevel, containers, Pavel Emelyanov, Al Viro,
	Hugh Dickins, Nick Piggin, Andrea Arcangeli, Rik van Riel,
	Dave Hansen, James Bottomley

On 07/30/2011 10:34 PM, Dave Chinner wrote:
> On Fri, Jul 29, 2011 at 05:44:19PM +0400, Glauber Costa wrote:
>> This patch introduces a simple generic vfs option parser.
>> Right now, the only option we have is to limit the size of the dcache.
>>
>> So any user that wants to have a dcache entries limit, can specify:
>>
>>    mount -o whatever_options,vfs_dcache_size=XXX<dev>  <mntpoint>
>>
>> It is supposed to work well with remounts, allowing it to change
>> multiple over the course of the filesystem's lifecycle.
>>
>> I find mount a natural interface for handling filesystem options,
>> so that's what I've choosen. Feel free to yell at it at will if
>> you disagree.
>
> IMO, the whole point of having a configurable cache size maximum is
> that is can be changed at runtime. Tying it to mount options is a
> painful way to acheive that because the only way to change it would
> be via a remount command.
And what's wrong with a remount command? It's a quite natural operation.
Furthermore, changing it at runtime is important - and as you noted, 
quite doable, but it is not "the whole point of it".
The whole point of it is to allow a piece of the fs hierarchy to have
a limit on the cache sizes. So I expect the most common usage to be
at mount itself. Specifically for the use case I have in mind, when
a new container is created.

> I'm not sure what the best API is, but I'd prefer something that is
> specific to a superblock, not a vfs mount. Perhaps something in
> /sys/fs?

I am not sure either, but I still believe my proposal is superior to 
write-to-a-file specifically. Writing to a file, be it in proc, sys, or 
wherever, leaves a window of opportunity open between mounting a 
filesystem and limiting its caches. Doing it on mount is atomic.

Effectively, I see this limit as a property of a particular instance of 
a mounted filesystem. Since all properties of a filesystem are specified 
during mount, this becomes a natural extension.

Let me know what you think



^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 4/4] parse options in the vfs level
  2011-08-02 13:04     ` Glauber Costa
@ 2011-08-02 14:18       ` Al Viro
  2011-08-02 14:43         ` Glauber Costa
  0 siblings, 1 reply; 13+ messages in thread
From: Al Viro @ 2011-08-02 14:18 UTC (permalink / raw)
  To: Glauber Costa
  Cc: Dave Chinner, linux-kernel, linux-fsdevel, containers,
	Pavel Emelyanov, Hugh Dickins, Nick Piggin, Andrea Arcangeli,
	Rik van Riel, Dave Hansen, James Bottomley

On Tue, Aug 02, 2011 at 10:04:06AM -0300, Glauber Costa wrote:

> I am not sure either, but I still believe my proposal is superior to
> write-to-a-file specifically. Writing to a file, be it in proc, sys,
> or wherever, leaves a window of opportunity open between mounting a
> filesystem and limiting its caches. Doing it on mount is atomic.
> 
> Effectively, I see this limit as a property of a particular instance
> of a mounted filesystem. Since all properties of a filesystem are
> specified during mount, this becomes a natural extension.

The trouble is, dentry tree is fundamentally a property of superblock.
It's shared between *all* instances of that fs in all mount trees...

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 4/4] parse options in the vfs level
  2011-08-02 14:18       ` Al Viro
@ 2011-08-02 14:43         ` Glauber Costa
  0 siblings, 0 replies; 13+ messages in thread
From: Glauber Costa @ 2011-08-02 14:43 UTC (permalink / raw)
  To: Al Viro
  Cc: Dave Chinner, linux-kernel, linux-fsdevel, containers,
	Pavel Emelyanov, Hugh Dickins, Nick Piggin, Andrea Arcangeli,
	Rik van Riel, Dave Hansen, James Bottomley

On 08/02/2011 11:18 AM, Al Viro wrote:
> On Tue, Aug 02, 2011 at 10:04:06AM -0300, Glauber Costa wrote:
>
>> I am not sure either, but I still believe my proposal is superior to
>> write-to-a-file specifically. Writing to a file, be it in proc, sys,
>> or wherever, leaves a window of opportunity open between mounting a
>> filesystem and limiting its caches. Doing it on mount is atomic.
>>
>> Effectively, I see this limit as a property of a particular instance
>> of a mounted filesystem. Since all properties of a filesystem are
>> specified during mount, this becomes a natural extension.
>
> The trouble is, dentry tree is fundamentally a property of superblock.
> It's shared between *all* instances of that fs in all mount trees...

And how is it different from any fs-specific options, like the ones extX 
have, for instance ? Many of them seem to operate on a superblock.

If you mount a superblock somewhere, you can tweak specifics about its 
operation. If you mount it somewhere else, the assumption is you know 
what you're doing.



^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2011-08-02 14:44 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2011-07-29 13:44 [PATCH 0/4] Per-superblock dcache limitation Glauber Costa
2011-07-29 13:44 ` [PATCH 1/4] Keep nr_dentry per super block Glauber Costa
2011-07-31  0:50   ` Dave Chinner
2011-07-29 13:44 ` [PATCH 2/4] limit nr_dentries per superblock Glauber Costa
2011-07-31  1:15   ` Dave Chinner
2011-08-02 12:46     ` Glauber Costa
2011-07-29 13:44 ` [PATCH 3/4] dcache set size Glauber Costa
2011-07-31  1:38   ` Dave Chinner
2011-07-29 13:44 ` [PATCH 4/4] parse options in the vfs level Glauber Costa
2011-07-31  1:34   ` Dave Chinner
2011-08-02 13:04     ` Glauber Costa
2011-08-02 14:18       ` Al Viro
2011-08-02 14:43         ` Glauber Costa

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).