* [PATCH v2 00/28] memcg-aware slab shrinking
@ 2013-03-29  9:13 Glauber Costa
  2013-03-29  9:13 ` [PATCH v2 01/28] super: fix calculation of shrinkable objects for small numbers Glauber Costa
                   ` (28 more replies)
  0 siblings, 29 replies; 97+ messages in thread
From: Glauber Costa @ 2013-03-29  9:13 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-fsdevel, containers, Michal Hocko, Johannes Weiner,
	kamezawa.hiroyu, Andrew Morton, Dave Shrinnker, Greg Thelen,
	hughd, yinghan
Hi,
Notes:
======
This is v2 of memcg-aware LRU shrinking. I've been testing it extensively
and it behaves well, at least from the isolation point of view. However,
I feel some more testing is needed before we commit to it. Still, this is
doing the job fairly well. Comments welcome.
Base work:
==========
Please note that this builds upon the recent work from Dave Chinner that
sanitizes the LRU shrinking API and makes the shrinkers node aware. Node
awareness is not *strictly* needed for my work, but I still perceive it
as an advantage. The API unification is a major need, and I build upon it
heavily. That allows us to manipulate the LRUs without knowledge of the
underlying objects with ease. This time, I am including that work here as
a baseline.
Description:
============
This patchset implements targeted shrinking for memcg when kmem limits are
present. So far, we've been accounting kernel objects but failing allocations
when short of memory. This is because our only option would be to call the
global shrinker, depleting objects from all caches and breaking isolation.
The main idea is to associate per-memcg lists with each of the LRUs. The main
LRU still provides a single entry point and when adding or removing an element
from the LRU, we use the page information to figure out which memcg it belongs
to and relay it to the right list.
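The idea above can be sketched in plain userspace C. This is only an
illustration under stated assumptions, not the code from this series: the
names struct item, struct lru, lru_list_for(), lru_add() and
lru_count_memcg() are hypothetical, MAX_MEMCGS is an arbitrary bound, and an
explicit memcg_id field stands in for the lookup that the real patches do
through the page information.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_MEMCGS 4		/* hypothetical upper bound for the sketch */

struct item {
	int memcg_id;		/* -1: not charged to any memcg (assumption) */
	struct item *next;
};

struct lru {
	struct item *global;			/* unaccounted objects */
	struct item *per_memcg[MAX_MEMCGS];	/* one list per memcg */
	long nr_items;				/* total across all lists */
};

/* Single entry point: pick the list the item belongs to. */
static struct item **lru_list_for(struct lru *lru, const struct item *it)
{
	if (it->memcg_id < 0)
		return &lru->global;
	assert(it->memcg_id < MAX_MEMCGS);
	return &lru->per_memcg[it->memcg_id];
}

/* Callers only see the main LRU; the relay to a memcg list is internal. */
static void lru_add(struct lru *lru, struct item *it)
{
	struct item **head = lru_list_for(lru, it);

	it->next = *head;
	*head = it;
	lru->nr_items++;
}

/* Targeted reclaim would walk only the list of the memcg under pressure. */
static long lru_count_memcg(const struct lru *lru, int memcg_id)
{
	const struct item *it;
	long n = 0;

	assert(memcg_id >= 0 && memcg_id < MAX_MEMCGS);
	for (it = lru->per_memcg[memcg_id]; it; it = it->next)
		n++;
	return n;
}
```

Callers add and remove through the same entry point as before, so they need
no knowledge of memcg. Only the shrinker has to choose which list to walk.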
Patches:
========
1 and 2: improve handling of small number of shrinkable objects.
	 This is a scenario that is way more likely to appear under memcg,
         although it is not memcg-specific. Already sent separately, but not
         yet merged.
3 to 20: Dave's work to unify the LRU API and make it per-node. I had to make
         minor changes to the patches to reflect new code in the tree and due to
         build problems. I tried to keep them as unchanged as possible.
21 to 28: memcg targeted shrinking.
Main changes from v1:
* merged comments from the mailing list
* reworked lru-memcg API
* effective proportional shrinking
* sanitized locking on the memcg side
* bill user memory first when kmem == umem
* various bugfixes
TODO:
* shrink dead memcgs when global pressure kicks in. (minor)
Dave Chinner (17):
  dcache: convert dentry_stat.nr_unused to per-cpu counters
  dentry: move to per-sb LRU locks
  dcache: remove dentries from LRU before putting on dispose list
  mm: new shrinker API
  shrinker: convert superblock shrinkers to new API
  list: add a new LRU list type
  inode: convert inode lru list to generic lru list code.
  dcache: convert to use new lru list infrastructure
  list_lru: per-node list infrastructure
  shrinker: add node awareness
  fs: convert inode and dentry shrinking to be node aware
  xfs: convert buftarg LRU to generic code
  xfs: convert dquot cache lru to list_lru
  fs: convert fs shrinkers to new scan/count API
  drivers: convert shrinkers to new count/scan API
  shrinker: convert remaining shrinkers to count/scan API
  shrinker: Kill old ->shrink API.
Glauber Costa (11):
  super: fix calculation of shrinkable objects for small numbers
  vmscan: take at least one pass with shrinkers
  hugepage: convert huge zero page shrinker to new shrinker API
  vmscan: also shrink slab in memcg pressure
  memcg,list_lru: duplicate LRUs upon kmemcg creation
  lru: add an element to a memcg list
  list_lru: also include memcg lists in counts and scans
  list_lru: per-memcg walks
  memcg: per-memcg kmem shrinking
  list_lru: reclaim proportionaly between memcgs and nodes
  super: targeted memcg reclaim
 arch/x86/kvm/mmu.c                         |  35 ++-
 drivers/gpu/drm/i915/i915_dma.c            |   4 +-
 drivers/gpu/drm/i915/i915_drv.h            |   2 +-
 drivers/gpu/drm/i915/i915_gem.c            |  69 +++--
 drivers/gpu/drm/i915/i915_gem_evict.c      |  10 +-
 drivers/gpu/drm/i915/i915_gem_execbuffer.c |   2 +-
 drivers/gpu/drm/ttm/ttm_page_alloc.c       |  48 ++-
 drivers/gpu/drm/ttm/ttm_page_alloc_dma.c   |  55 ++--
 drivers/md/dm-bufio.c                      |  65 ++--
 drivers/staging/android/ashmem.c           |  44 ++-
 drivers/staging/android/lowmemorykiller.c  |  40 ++-
 drivers/staging/zcache/zcache-main.c       |  29 +-
 fs/dcache.c                                | 215 +++++++------
 fs/drop_caches.c                           |   1 +
 fs/ext4/extents_status.c                   |  30 +-
 fs/gfs2/glock.c                            |  30 +-
 fs/gfs2/main.c                             |   3 +-
 fs/gfs2/quota.c                            |  14 +-
 fs/gfs2/quota.h                            |   4 +-
 fs/inode.c                                 | 174 +++++------
 fs/internal.h                              |   5 +
 fs/mbcache.c                               |  53 ++--
 fs/nfs/dir.c                               |  20 +-
 fs/nfs/internal.h                          |   4 +-
 fs/nfs/super.c                             |   3 +-
 fs/nfsd/nfscache.c                         |  31 +-
 fs/quota/dquot.c                           |  39 ++-
 fs/super.c                                 | 107 ++++---
 fs/ubifs/shrinker.c                        |  20 +-
 fs/ubifs/super.c                           |   3 +-
 fs/ubifs/ubifs.h                           |   3 +-
 fs/xfs/xfs_buf.c                           | 167 +++++-----
 fs/xfs/xfs_buf.h                           |   5 +-
 fs/xfs/xfs_dquot.c                         |   7 +-
 fs/xfs/xfs_icache.c                        |   4 +-
 fs/xfs/xfs_icache.h                        |   2 +-
 fs/xfs/xfs_qm.c                            | 274 +++++++++--------
 fs/xfs/xfs_qm.h                            |   4 +-
 fs/xfs/xfs_super.c                         |  12 +-
 include/linux/dcache.h                     |   4 +
 include/linux/fs.h                         |  25 +-
 include/linux/list_lru.h                   | 112 +++++++
 include/linux/memcontrol.h                 |  41 +++
 include/linux/shrinker.h                   |  45 ++-
 include/linux/swap.h                       |   2 +
 include/trace/events/vmscan.h              |   4 +-
 lib/Makefile                               |   2 +-
 lib/list_lru.c                             | 474 +++++++++++++++++++++++++++++
 mm/huge_memory.c                           |  18 +-
 mm/memcontrol.c                            | 373 ++++++++++++++++++++---
 mm/memory-failure.c                        |   2 +
 mm/slab_common.c                           |   1 -
 mm/vmscan.c                                | 142 ++++++---
 net/sunrpc/auth.c                          |  45 ++-
 54 files changed, 2091 insertions(+), 836 deletions(-)
 create mode 100644 include/linux/list_lru.h
 create mode 100644 lib/list_lru.c
-- 
1.8.1.4
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply	[flat|nested] 97+ messages in thread
* [PATCH v2 01/28] super: fix calculation of shrinkable objects for small numbers
  2013-03-29  9:13 [PATCH v2 00/28] memcg-aware slab shrinking Glauber Costa
@ 2013-03-29  9:13 ` Glauber Costa
  2013-04-01  7:16   ` Kamezawa Hiroyuki
  2013-03-29  9:13 ` [PATCH v2 02/28] vmscan: take at least one pass with shrinkers Glauber Costa
                   ` (27 subsequent siblings)
  28 siblings, 1 reply; 97+ messages in thread
From: Glauber Costa @ 2013-03-29  9:13 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-fsdevel, containers, Michal Hocko, Johannes Weiner,
	kamezawa.hiroyu, Andrew Morton, Dave Shrinnker, Greg Thelen,
	hughd, yinghan, Glauber Costa, Theodore Ts'o, Al Viro
The sysctl knob sysctl_vfs_cache_pressure is used to determine which
percentage of the shrinkable objects in our cache we should actively try
to shrink.
It works great in situations in which we have many objects (at least
more than 100), because the approximation errors will be negligible. But
if this is not the case, especially when total_objects < 100, we may end
up concluding that we have no objects at all (total / 100 = 0, if total
< 100).
This is certainly not the biggest killer in the world, but may matter in
very low kernel memory situations.
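To make the rounding problem concrete, here is a small userspace sketch. It
assumes a pressure value of 100 (the default) and reimplements the arithmetic
of the kernel's mult_frac() as an ordinary function. The names
pressure_div_first(), mult_frac_ul() and pressure_ratio() are made up for
this example and are not part of the patch.

```c
#include <assert.h>

/* Userspace stand-in for the sysctl; 100 is the default value. */
static int sysctl_vfs_cache_pressure = 100;

/* Old form used by several shrinkers: divide first, then multiply. */
static unsigned long pressure_div_first(unsigned long val)
{
	return (val / 100) * sysctl_vfs_cache_pressure;
}

/*
 * Same arithmetic as mult_frac(val, numer, denom): split val into quotient
 * and remainder so the remainder still contributes to the result.
 */
static unsigned long mult_frac_ul(unsigned long val, unsigned long numer,
				  unsigned long denom)
{
	unsigned long quot, rem;

	assert(denom != 0);
	quot = val / denom;
	rem = val % denom;
	return quot * numer + (rem * numer) / denom;
}

/* Userspace equivalent of the helper this patch introduces. */
static unsigned long pressure_ratio(unsigned long val)
{
	return mult_frac_ul(val, sysctl_vfs_cache_pressure, 100);
}
```

With 99 objects, the divide-first form reports 0 while the mult_frac form
reports 99. With 250 objects, the results are 200 and 250 respectively.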
[ v2: fix it for all occurrences of sysctl_vfs_cache_pressure ]
Signed-off-by: Glauber Costa <glommer@parallels.com>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
CC: Dave Chinner <david@fromorbit.com>
CC: "Theodore Ts'o" <tytso@mit.edu>
CC: Al Viro <viro@zeniv.linux.org.uk>
---
 fs/gfs2/glock.c        |  2 +-
 fs/gfs2/quota.c        |  2 +-
 fs/mbcache.c           |  2 +-
 fs/nfs/dir.c           |  2 +-
 fs/quota/dquot.c       |  5 ++---
 fs/super.c             | 14 +++++++-------
 fs/xfs/xfs_qm.c        |  2 +-
 include/linux/dcache.h |  4 ++++
 8 files changed, 18 insertions(+), 15 deletions(-)
diff --git a/fs/gfs2/glock.c b/fs/gfs2/glock.c
index cf35155..078daa5 100644
--- a/fs/gfs2/glock.c
+++ b/fs/gfs2/glock.c
@@ -1476,7 +1476,7 @@ static int gfs2_shrink_glock_memory(struct shrinker *shrink,
 		gfs2_scan_glock_lru(sc->nr_to_scan);
 	}
 
-	return (atomic_read(&lru_count) / 100) * sysctl_vfs_cache_pressure;
+	return vfs_pressure_ratio(atomic_read(&lru_count));
 }
 
 static struct shrinker glock_shrinker = {
diff --git a/fs/gfs2/quota.c b/fs/gfs2/quota.c
index c7c840e..5c14206 100644
--- a/fs/gfs2/quota.c
+++ b/fs/gfs2/quota.c
@@ -114,7 +114,7 @@ int gfs2_shrink_qd_memory(struct shrinker *shrink, struct shrink_control *sc)
 	spin_unlock(&qd_lru_lock);
 
 out:
-	return (atomic_read(&qd_lru_count) * sysctl_vfs_cache_pressure) / 100;
+	return vfs_pressure_ratio(atomic_read(&qd_lru_count));
 }
 
 static u64 qd2index(struct gfs2_quota_data *qd)
diff --git a/fs/mbcache.c b/fs/mbcache.c
index 8c32ef3..5eb0476 100644
--- a/fs/mbcache.c
+++ b/fs/mbcache.c
@@ -189,7 +189,7 @@ mb_cache_shrink_fn(struct shrinker *shrink, struct shrink_control *sc)
 	list_for_each_entry_safe(entry, tmp, &free_list, e_lru_list) {
 		__mb_cache_entry_forget(entry, gfp_mask);
 	}
-	return (count / 100) * sysctl_vfs_cache_pressure;
+	return vfs_pressure_ratio(count);
 }
 
 
diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c
index f23f455..197bfff 100644
--- a/fs/nfs/dir.c
+++ b/fs/nfs/dir.c
@@ -1996,7 +1996,7 @@ remove_lru_entry:
 	}
 	spin_unlock(&nfs_access_lru_lock);
 	nfs_access_free_list(&head);
-	return (atomic_long_read(&nfs_access_nr_entries) / 100) * sysctl_vfs_cache_pressure;
+	return vfs_pressure_ratio(atomic_long_read(&nfs_access_nr_entries));
 }
 
 static void __nfs_access_zap_cache(struct nfs_inode *nfsi, struct list_head *head)
diff --git a/fs/quota/dquot.c b/fs/quota/dquot.c
index 3e64169..762b09c 100644
--- a/fs/quota/dquot.c
+++ b/fs/quota/dquot.c
@@ -719,9 +719,8 @@ static int shrink_dqcache_memory(struct shrinker *shrink,
 		prune_dqcache(nr);
 		spin_unlock(&dq_list_lock);
 	}
-	return ((unsigned)
-		percpu_counter_read_positive(&dqstats.counter[DQST_FREE_DQUOTS])
-		/100) * sysctl_vfs_cache_pressure;
+	return vfs_pressure_ratio(
+	percpu_counter_read_positive(&dqstats.counter[DQST_FREE_DQUOTS]));
 }
 
 static struct shrinker dqcache_shrinker = {
diff --git a/fs/super.c b/fs/super.c
index 7465d43..2a37fd6 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -82,13 +82,13 @@ static int prune_super(struct shrinker *shrink, struct shrink_control *sc)
 		int	inodes;
 
 		/* proportion the scan between the caches */
-		dentries = (sc->nr_to_scan * sb->s_nr_dentry_unused) /
-							total_objects;
-		inodes = (sc->nr_to_scan * sb->s_nr_inodes_unused) /
-							total_objects;
+		dentries = mult_frac(sc->nr_to_scan, sb->s_nr_dentry_unused,
+							total_objects);
+		inodes = mult_frac(sc->nr_to_scan, sb->s_nr_inodes_unused,
+							total_objects);
 		if (fs_objects)
-			fs_objects = (sc->nr_to_scan * fs_objects) /
-							total_objects;
+			fs_objects = mult_frac(sc->nr_to_scan, fs_objects,
+							total_objects);
 		/*
 		 * prune the dcache first as the icache is pinned by it, then
 		 * prune the icache, followed by the filesystem specific caches
@@ -104,7 +104,7 @@ static int prune_super(struct shrinker *shrink, struct shrink_control *sc)
 				sb->s_nr_inodes_unused + fs_objects;
 	}
 
-	total_objects = (total_objects / 100) * sysctl_vfs_cache_pressure;
+	total_objects = vfs_pressure_ratio(total_objects);
 	drop_super(sb);
 	return total_objects;
 }
diff --git a/fs/xfs/xfs_qm.c b/fs/xfs/xfs_qm.c
index e5b5cf9..305f4e5 100644
--- a/fs/xfs/xfs_qm.c
+++ b/fs/xfs/xfs_qm.c
@@ -1568,7 +1568,7 @@ xfs_qm_shake(
 	}
 
 out:
-	return (qi->qi_lru_count / 100) * sysctl_vfs_cache_pressure;
+	return vfs_pressure_ratio(qi->qi_lru_count);
 }
 
 /*
diff --git a/include/linux/dcache.h b/include/linux/dcache.h
index 1a6bb81..4d24a12 100644
--- a/include/linux/dcache.h
+++ b/include/linux/dcache.h
@@ -411,4 +411,8 @@ static inline bool d_mountpoint(struct dentry *dentry)
 
 extern int sysctl_vfs_cache_pressure;
 
+static inline unsigned long vfs_pressure_ratio(unsigned long val)
+{
+	return mult_frac(val, sysctl_vfs_cache_pressure, 100);
+}
 #endif	/* __LINUX_DCACHE_H */
-- 
1.8.1.4
^ permalink raw reply related	[flat|nested] 97+ messages in thread
* [PATCH v2 02/28] vmscan: take at least one pass with shrinkers
  2013-03-29  9:13 [PATCH v2 00/28] memcg-aware slab shrinking Glauber Costa
  2013-03-29  9:13 ` [PATCH v2 01/28] super: fix calculation of shrinkable objects for small numbers Glauber Costa
@ 2013-03-29  9:13 ` Glauber Costa
  2013-04-01  7:26   ` Kamezawa Hiroyuki
  2013-04-08  8:42   ` Joonsoo Kim
  2013-03-29  9:13 ` [PATCH v2 03/28] dcache: convert dentry_stat.nr_unused to per-cpu counters Glauber Costa
                   ` (26 subsequent siblings)
  28 siblings, 2 replies; 97+ messages in thread
From: Glauber Costa @ 2013-03-29  9:13 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-fsdevel, containers, Michal Hocko, Johannes Weiner,
	kamezawa.hiroyu, Andrew Morton, Dave Shrinnker, Greg Thelen,
	hughd, yinghan, Glauber Costa, Theodore Ts'o, Al Viro
In very low free kernel memory situations, it may be the case that we
have fewer objects to free than our initial batch size. If this is the
case, it is better to shrink those, and open space for the new workload,
than to keep them and fail the new allocations.
More specifically, this happens because we encode this in a loop with
the condition: "while (total_scan >= batch_size)". So if we are in such
a case, we'll not even enter the loop.
This patch turns it into a do {} while () loop, which guarantees
that we scan at least once, while keeping the behaviour exactly the
same for the cases in which total_scan >= batch_size.
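The difference between the two loop forms can be seen in isolation with the
sketch below. It only counts passes and is not the shrink_slab() code; the
helper names passes_while() and passes_do_while() are invented for the
example.

```c
#include <assert.h>

/* Pre-checked loop: runs zero times when total_scan < batch_size. */
static int passes_while(long total_scan, long batch_size)
{
	int passes = 0;

	assert(batch_size > 0);
	while (total_scan >= batch_size) {
		total_scan -= batch_size;
		passes++;
	}
	return passes;
}

/* Post-checked loop: always runs once, then behaves like the loop above. */
static int passes_do_while(long total_scan, long batch_size)
{
	int passes = 0;

	assert(batch_size > 0);
	do {
		total_scan -= batch_size;
		passes++;
	} while (total_scan >= batch_size);
	return passes;
}
```

For total_scan = 50 and batch_size = 128, the first form makes no pass and
the second makes one. For total_scan = 300, both make two passes.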
Signed-off-by: Glauber Costa <glommer@parallels.com>
Reviewed-by: Dave Chinner <david@fromorbit.com>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
CC: "Theodore Ts'o" <tytso@mit.edu>
CC: Al Viro <viro@zeniv.linux.org.uk>
---
 mm/vmscan.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 88c5fed..fc6d45a 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -280,7 +280,7 @@ unsigned long shrink_slab(struct shrink_control *shrink,
 					nr_pages_scanned, lru_pages,
 					max_pass, delta, total_scan);
 
-		while (total_scan >= batch_size) {
+		do {
 			int nr_before;
 
 			nr_before = do_shrinker_shrink(shrinker, shrink, 0);
@@ -294,7 +294,7 @@ unsigned long shrink_slab(struct shrink_control *shrink,
 			total_scan -= batch_size;
 
 			cond_resched();
-		}
+		} while (total_scan >= batch_size);
 
 		/*
 		 * move the unused scan count back into the shrinker in a
-- 
1.8.1.4
^ permalink raw reply related	[flat|nested] 97+ messages in thread
* [PATCH v2 03/28] dcache: convert dentry_stat.nr_unused to per-cpu counters
  2013-03-29  9:13 [PATCH v2 00/28] memcg-aware slab shrinking Glauber Costa
  2013-03-29  9:13 ` [PATCH v2 01/28] super: fix calculation of shrinkable objects for small numbers Glauber Costa
  2013-03-29  9:13 ` [PATCH v2 02/28] vmscan: take at least one pass with shrinkers Glauber Costa
@ 2013-03-29  9:13 ` Glauber Costa
  2013-04-05  1:09   ` Greg Thelen
  2013-03-29  9:13 ` [PATCH v2 04/28] dentry: move to per-sb LRU locks Glauber Costa
                   ` (25 subsequent siblings)
  28 siblings, 1 reply; 97+ messages in thread
From: Glauber Costa @ 2013-03-29  9:13 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-fsdevel, containers, Michal Hocko, Johannes Weiner,
	kamezawa.hiroyu, Andrew Morton, Dave Shrinnker, Greg Thelen,
	hughd, yinghan, Dave Chinner
From: Dave Chinner <dchinner@redhat.com>
Before we split up the dcache_lru_lock, the unused dentry counter
needs to be made independent of the global dcache_lru_lock. Convert
it to per-cpu counters to do this.
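As a rough userspace illustration of the pattern (not the kernel's per-cpu
machinery), each CPU gets its own slot, updates touch only the local slot,
and readers sum all slots. NR_CPUS_SKETCH, nr_unused[], unused_inc(),
unused_dec() and get_nr_unused() are hypothetical names for this sketch.

```c
#include <assert.h>

#define NR_CPUS_SKETCH 4	/* hypothetical fixed CPU count */

static int nr_unused[NR_CPUS_SKETCH];	/* one slot per CPU */

/* An update touches only the slot of the CPU doing it; no shared lock. */
static void unused_inc(int cpu)
{
	assert(cpu >= 0 && cpu < NR_CPUS_SKETCH);
	nr_unused[cpu]++;
}

static void unused_dec(int cpu)
{
	assert(cpu >= 0 && cpu < NR_CPUS_SKETCH);
	nr_unused[cpu]--;
}

/*
 * An individual slot may go negative when an object is added on one CPU
 * and removed on another; only the sum is meaningful. Clamp at zero in
 * case the sum is read while updates are in flight.
 */
static int get_nr_unused(void)
{
	int i, sum = 0;

	for (i = 0; i < NR_CPUS_SKETCH; i++)
		sum += nr_unused[i];
	return sum < 0 ? 0 : sum;
}
```

The cost moves from the update path, which no longer needs the global lock,
to the read path, which has to visit every slot.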
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 fs/dcache.c | 17 ++++++++++++++---
 1 file changed, 14 insertions(+), 3 deletions(-)
diff --git a/fs/dcache.c b/fs/dcache.c
index fbfae008..f1196f2 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -118,6 +118,7 @@ struct dentry_stat_t dentry_stat = {
 };
 
 static DEFINE_PER_CPU(unsigned int, nr_dentry);
+static DEFINE_PER_CPU(unsigned int, nr_dentry_unused);
 
 #if defined(CONFIG_SYSCTL) && defined(CONFIG_PROC_FS)
 static int get_nr_dentry(void)
@@ -129,10 +130,20 @@ static int get_nr_dentry(void)
 	return sum < 0 ? 0 : sum;
 }
 
+static int get_nr_dentry_unused(void)
+{
+	int i;
+	int sum = 0;
+	for_each_possible_cpu(i)
+		sum += per_cpu(nr_dentry_unused, i);
+	return sum < 0 ? 0 : sum;
+}
+
 int proc_nr_dentry(ctl_table *table, int write, void __user *buffer,
 		   size_t *lenp, loff_t *ppos)
 {
 	dentry_stat.nr_dentry = get_nr_dentry();
+	dentry_stat.nr_unused = get_nr_dentry_unused();
 	return proc_dointvec(table, write, buffer, lenp, ppos);
 }
 #endif
@@ -312,7 +323,7 @@ static void dentry_lru_add(struct dentry *dentry)
 		spin_lock(&dcache_lru_lock);
 		list_add(&dentry->d_lru, &dentry->d_sb->s_dentry_lru);
 		dentry->d_sb->s_nr_dentry_unused++;
-		dentry_stat.nr_unused++;
+		this_cpu_inc(nr_dentry_unused);
 		spin_unlock(&dcache_lru_lock);
 	}
 }
@@ -322,7 +333,7 @@ static void __dentry_lru_del(struct dentry *dentry)
 	list_del_init(&dentry->d_lru);
 	dentry->d_flags &= ~DCACHE_SHRINK_LIST;
 	dentry->d_sb->s_nr_dentry_unused--;
-	dentry_stat.nr_unused--;
+	this_cpu_dec(nr_dentry_unused);
 }
 
 /*
@@ -360,7 +371,7 @@ static void dentry_lru_move_list(struct dentry *dentry, struct list_head *list)
 	if (list_empty(&dentry->d_lru)) {
 		list_add_tail(&dentry->d_lru, list);
 		dentry->d_sb->s_nr_dentry_unused++;
-		dentry_stat.nr_unused++;
+		this_cpu_inc(nr_dentry_unused);
 	} else {
 		list_move_tail(&dentry->d_lru, list);
 	}
-- 
1.8.1.4
^ permalink raw reply related	[flat|nested] 97+ messages in thread
* [PATCH v2 04/28] dentry: move to per-sb LRU locks
  2013-03-29  9:13 [PATCH v2 00/28] memcg-aware slab shrinking Glauber Costa
                   ` (2 preceding siblings ...)
  2013-03-29  9:13 ` [PATCH v2 03/28] dcache: convert dentry_stat.nr_unused to per-cpu counters Glauber Costa
@ 2013-03-29  9:13 ` Glauber Costa
  2013-03-29  9:13 ` [PATCH v2 05/28] dcache: remove dentries from LRU before putting on dispose list Glauber Costa
                   ` (24 subsequent siblings)
  28 siblings, 0 replies; 97+ messages in thread
From: Glauber Costa @ 2013-03-29  9:13 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-fsdevel, containers, Michal Hocko, Johannes Weiner,
	kamezawa.hiroyu, Andrew Morton, Dave Shrinnker, Greg Thelen,
	hughd, yinghan, Dave Chinner
From: Dave Chinner <dchinner@redhat.com>
With the dentry LRUs being per-sb structures, there is no real need
for a global dcache_lru_lock. The locking can be made more
fine-grained by moving to a per-sb LRU lock, isolating the LRU
operations of different filesystems completely from each other.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 fs/dcache.c        | 37 ++++++++++++++++++-------------------
 fs/super.c         |  1 +
 include/linux/fs.h |  4 +++-
 3 files changed, 22 insertions(+), 20 deletions(-)
diff --git a/fs/dcache.c b/fs/dcache.c
index f1196f2..0a1d7b3 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -48,7 +48,7 @@
  *   - the dcache hash table
  * s_anon bl list spinlock protects:
  *   - the s_anon list (see __d_drop)
- * dcache_lru_lock protects:
+ * dentry->d_sb->s_dentry_lru_lock protects:
  *   - the dcache lru lists and counters
  * d_lock protects:
  *   - d_flags
@@ -63,7 +63,7 @@
  * Ordering:
  * dentry->d_inode->i_lock
  *   dentry->d_lock
- *     dcache_lru_lock
+ *     dentry->d_sb->s_dentry_lru_lock
  *     dcache_hash_bucket lock
  *     s_anon lock
  *
@@ -81,7 +81,6 @@
 int sysctl_vfs_cache_pressure __read_mostly = 100;
 EXPORT_SYMBOL_GPL(sysctl_vfs_cache_pressure);
 
-static __cacheline_aligned_in_smp DEFINE_SPINLOCK(dcache_lru_lock);
 __cacheline_aligned_in_smp DEFINE_SEQLOCK(rename_lock);
 
 EXPORT_SYMBOL(rename_lock);
@@ -320,11 +319,11 @@ static void dentry_unlink_inode(struct dentry * dentry)
 static void dentry_lru_add(struct dentry *dentry)
 {
 	if (list_empty(&dentry->d_lru)) {
-		spin_lock(&dcache_lru_lock);
+		spin_lock(&dentry->d_sb->s_dentry_lru_lock);
 		list_add(&dentry->d_lru, &dentry->d_sb->s_dentry_lru);
 		dentry->d_sb->s_nr_dentry_unused++;
 		this_cpu_inc(nr_dentry_unused);
-		spin_unlock(&dcache_lru_lock);
+		spin_unlock(&dentry->d_sb->s_dentry_lru_lock);
 	}
 }
 
@@ -342,9 +341,9 @@ static void __dentry_lru_del(struct dentry *dentry)
 static void dentry_lru_del(struct dentry *dentry)
 {
 	if (!list_empty(&dentry->d_lru)) {
-		spin_lock(&dcache_lru_lock);
+		spin_lock(&dentry->d_sb->s_dentry_lru_lock);
 		__dentry_lru_del(dentry);
-		spin_unlock(&dcache_lru_lock);
+		spin_unlock(&dentry->d_sb->s_dentry_lru_lock);
 	}
 }
 
@@ -359,15 +358,15 @@ static void dentry_lru_prune(struct dentry *dentry)
 		if (dentry->d_flags & DCACHE_OP_PRUNE)
 			dentry->d_op->d_prune(dentry);
 
-		spin_lock(&dcache_lru_lock);
+		spin_lock(&dentry->d_sb->s_dentry_lru_lock);
 		__dentry_lru_del(dentry);
-		spin_unlock(&dcache_lru_lock);
+		spin_unlock(&dentry->d_sb->s_dentry_lru_lock);
 	}
 }
 
 static void dentry_lru_move_list(struct dentry *dentry, struct list_head *list)
 {
-	spin_lock(&dcache_lru_lock);
+	spin_lock(&dentry->d_sb->s_dentry_lru_lock);
 	if (list_empty(&dentry->d_lru)) {
 		list_add_tail(&dentry->d_lru, list);
 		dentry->d_sb->s_nr_dentry_unused++;
@@ -375,7 +374,7 @@ static void dentry_lru_move_list(struct dentry *dentry, struct list_head *list)
 	} else {
 		list_move_tail(&dentry->d_lru, list);
 	}
-	spin_unlock(&dcache_lru_lock);
+	spin_unlock(&dentry->d_sb->s_dentry_lru_lock);
 }
 
 /**
@@ -853,14 +852,14 @@ void prune_dcache_sb(struct super_block *sb, int count)
 	LIST_HEAD(tmp);
 
 relock:
-	spin_lock(&dcache_lru_lock);
+	spin_lock(&sb->s_dentry_lru_lock);
 	while (!list_empty(&sb->s_dentry_lru)) {
 		dentry = list_entry(sb->s_dentry_lru.prev,
 				struct dentry, d_lru);
 		BUG_ON(dentry->d_sb != sb);
 
 		if (!spin_trylock(&dentry->d_lock)) {
-			spin_unlock(&dcache_lru_lock);
+			spin_unlock(&sb->s_dentry_lru_lock);
 			cpu_relax();
 			goto relock;
 		}
@@ -876,11 +875,11 @@ relock:
 			if (!--count)
 				break;
 		}
-		cond_resched_lock(&dcache_lru_lock);
+		cond_resched_lock(&sb->s_dentry_lru_lock);
 	}
 	if (!list_empty(&referenced))
 		list_splice(&referenced, &sb->s_dentry_lru);
-	spin_unlock(&dcache_lru_lock);
+	spin_unlock(&sb->s_dentry_lru_lock);
 
 	shrink_dentry_list(&tmp);
 }
@@ -896,14 +895,14 @@ void shrink_dcache_sb(struct super_block *sb)
 {
 	LIST_HEAD(tmp);
 
-	spin_lock(&dcache_lru_lock);
+	spin_lock(&sb->s_dentry_lru_lock);
 	while (!list_empty(&sb->s_dentry_lru)) {
 		list_splice_init(&sb->s_dentry_lru, &tmp);
-		spin_unlock(&dcache_lru_lock);
+		spin_unlock(&sb->s_dentry_lru_lock);
 		shrink_dentry_list(&tmp);
-		spin_lock(&dcache_lru_lock);
+		spin_lock(&sb->s_dentry_lru_lock);
 	}
-	spin_unlock(&dcache_lru_lock);
+	spin_unlock(&sb->s_dentry_lru_lock);
 }
 EXPORT_SYMBOL(shrink_dcache_sb);
 
diff --git a/fs/super.c b/fs/super.c
index 2a37fd6..0be75fb 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -182,6 +182,7 @@ static struct super_block *alloc_super(struct file_system_type *type, int flags)
 		INIT_HLIST_BL_HEAD(&s->s_anon);
 		INIT_LIST_HEAD(&s->s_inodes);
 		INIT_LIST_HEAD(&s->s_dentry_lru);
+		spin_lock_init(&s->s_dentry_lru_lock);
 		INIT_LIST_HEAD(&s->s_inode_lru);
 		spin_lock_init(&s->s_inode_lru_lock);
 		INIT_LIST_HEAD(&s->s_mounts);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 2c28271..02934f5 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1261,7 +1261,9 @@ struct super_block {
 	struct list_head	s_files;
 #endif
 	struct list_head	s_mounts;	/* list of mounts; _not_ for fs use */
-	/* s_dentry_lru, s_nr_dentry_unused protected by dcache.c lru locks */
+
+	/* s_dentry_lru_lock protects s_dentry_lru and s_nr_dentry_unused */
+	spinlock_t		s_dentry_lru_lock ____cacheline_aligned_in_smp;
 	struct list_head	s_dentry_lru;	/* unused dentry lru */
 	int			s_nr_dentry_unused;	/* # of dentry on lru */
 
-- 
1.8.1.4
^ permalink raw reply related	[flat|nested] 97+ messages in thread
* [PATCH v2 05/28] dcache: remove dentries from LRU before putting on dispose list
  2013-03-29  9:13 [PATCH v2 00/28] memcg-aware slab shrinking Glauber Costa
                   ` (3 preceding siblings ...)
  2013-03-29  9:13 ` [PATCH v2 04/28] dentry: move to per-sb LRU locks Glauber Costa
@ 2013-03-29  9:13 ` Glauber Costa
  2013-04-03  6:51   ` Sha Zhengju
  2013-03-29  9:13 ` [PATCH v2 06/28] mm: new shrinker API Glauber Costa
                   ` (23 subsequent siblings)
  28 siblings, 1 reply; 97+ messages in thread
From: Glauber Costa @ 2013-03-29  9:13 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-fsdevel, containers, Michal Hocko, Johannes Weiner,
	kamezawa.hiroyu, Andrew Morton, Dave Shrinnker, Greg Thelen,
	hughd, yinghan, Dave Chinner
From: Dave Chinner <dchinner@redhat.com>
One of the big problems with modifying the way the dcache shrinker
and LRU implementation works is that the LRU is abused in several
ways. One of these is shrink_dentry_list().
Basically, we can move a dentry off the LRU onto a different list
without doing any accounting changes, and then use dentry_lru_prune()
to remove it from whatever list it is now on to do the LRU
accounting at that point.
This makes it -really hard- to change the LRU implementation. The
use of the per-sb LRU lock serialises movement of the dentries
between the different lists and the removal of them, and this is the
only reason that it works. If we want to break up the dentry LRU
lock and lists into, say, per-node lists, we remove the only
serialisation that allows this lru list/dispose list abuse to work.
To make this work effectively, the dispose list has to be isolated
from the LRU list - dentries have to be removed from the LRU
*before* being placed on the dispose list. This means that the LRU
accounting and isolation is completed before disposal is started,
and that means we can change the LRU implementation freely in
future.
This means that dentries *must* be marked with DCACHE_SHRINK_LIST
when they are placed on the dispose list so that we don't think that
parent dentries found in try_prune_one_dentry() are on the LRU when
they are actually on the dispose list. This would result in
accounting the dentry to the LRU a second time. Hence
dentry_lru_prune() has to handle the DCACHE_SHRINK_LIST case
differently because the dentry isn't on the LRU list.
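The accounting rule can be modelled with a few flags and one counter. This
is a simplified userspace sketch, not the dcache code: struct obj, lru_add(),
move_to_dispose() and prune() are hypothetical names, and the on_dispose flag
stands in for DCACHE_SHRINK_LIST.

```c
#include <assert.h>
#include <stdbool.h>

struct obj {
	bool on_lru;		/* linked on the accounted LRU */
	bool on_dispose;	/* stands in for DCACHE_SHRINK_LIST */
};

static long nr_unused;		/* covers objects with on_lru set only */

static void lru_add(struct obj *o)
{
	assert(!o->on_lru && !o->on_dispose);
	o->on_lru = true;
	nr_unused++;
}

/* Isolate: leave the LRU and fix the accounting before disposal starts. */
static void move_to_dispose(struct obj *o)
{
	assert(o->on_lru && !o->on_dispose);
	o->on_lru = false;
	o->on_dispose = true;
	nr_unused--;
}

/* Prune: touch the LRU counter only if the object is still on the LRU. */
static void prune(struct obj *o)
{
	if (o->on_dispose) {
		o->on_dispose = false;	/* was unaccounted at isolation */
	} else if (o->on_lru) {
		o->on_lru = false;
		nr_unused--;
	}
}
```

Because the counter is adjusted at isolation time, pruning an object found
on the dispose list does not change it again, which is the double accounting
the flag is meant to prevent.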
Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/dcache.c | 73 ++++++++++++++++++++++++++++++++++++++++++++++++++++---------
 1 file changed, 63 insertions(+), 10 deletions(-)
diff --git a/fs/dcache.c b/fs/dcache.c
index 0a1d7b3..d15420b 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -330,7 +330,6 @@ static void dentry_lru_add(struct dentry *dentry)
 static void __dentry_lru_del(struct dentry *dentry)
 {
 	list_del_init(&dentry->d_lru);
-	dentry->d_flags &= ~DCACHE_SHRINK_LIST;
 	dentry->d_sb->s_nr_dentry_unused--;
 	this_cpu_dec(nr_dentry_unused);
 }
@@ -340,6 +339,8 @@ static void __dentry_lru_del(struct dentry *dentry)
  */
 static void dentry_lru_del(struct dentry *dentry)
 {
+	BUG_ON(dentry->d_flags & DCACHE_SHRINK_LIST);
+
 	if (!list_empty(&dentry->d_lru)) {
 		spin_lock(&dentry->d_sb->s_dentry_lru_lock);
 		__dentry_lru_del(dentry);
@@ -351,28 +352,42 @@ static void dentry_lru_del(struct dentry *dentry)
  * Remove a dentry that is unreferenced and about to be pruned
  * (unhashed and destroyed) from the LRU, and inform the file system.
  * This wrapper should be called _prior_ to unhashing a victim dentry.
+ *
+ * Check that the dentry really is on the LRU as it may be on a private dispose
+ * list and in that case we do not want to call the generic LRU removal
+ * functions. This typically happens when shrink_dcache_sb() clears the LRU in
+ * one go and then try_prune_one_dentry() walks back up the parent chain finding
+ * dentries that are also on the dispose list.
  */
 static void dentry_lru_prune(struct dentry *dentry)
 {
 	if (!list_empty(&dentry->d_lru)) {
+
 		if (dentry->d_flags & DCACHE_OP_PRUNE)
 			dentry->d_op->d_prune(dentry);
 
-		spin_lock(&dentry->d_sb->s_dentry_lru_lock);
-		__dentry_lru_del(dentry);
-		spin_unlock(&dentry->d_sb->s_dentry_lru_lock);
+		if ((dentry->d_flags & DCACHE_SHRINK_LIST))
+			list_del_init(&dentry->d_lru);
+		else {
+			spin_lock(&dentry->d_sb->s_dentry_lru_lock);
+			__dentry_lru_del(dentry);
+			spin_unlock(&dentry->d_sb->s_dentry_lru_lock);
+		}
+		dentry->d_flags &= ~DCACHE_SHRINK_LIST;
 	}
 }
 
 static void dentry_lru_move_list(struct dentry *dentry, struct list_head *list)
 {
+	BUG_ON(dentry->d_flags & DCACHE_SHRINK_LIST);
+
 	spin_lock(&dentry->d_sb->s_dentry_lru_lock);
 	if (list_empty(&dentry->d_lru)) {
 		list_add_tail(&dentry->d_lru, list);
-		dentry->d_sb->s_nr_dentry_unused++;
-		this_cpu_inc(nr_dentry_unused);
 	} else {
 		list_move_tail(&dentry->d_lru, list);
+		dentry->d_sb->s_nr_dentry_unused--;
+		this_cpu_dec(nr_dentry_unused);
 	}
 	spin_unlock(&dentry->d_sb->s_dentry_lru_lock);
 }
@@ -814,12 +829,18 @@ static void shrink_dentry_list(struct list_head *list)
 		}
 
 		/*
+		 * The dispose list is isolated and dentries are not accounted
+		 * to the LRU here, so we can simply remove it from the list
+		 * here regardless of whether it is referenced or not.
+		 */
+		list_del_init(&dentry->d_lru);
+
+		/*
 		 * We found an inuse dentry which was not removed from
-		 * the LRU because of laziness during lookup.  Do not free
-		 * it - just keep it off the LRU list.
+		 * the LRU because of laziness during lookup. Do not free it.
 		 */
 		if (dentry->d_count) {
-			dentry_lru_del(dentry);
+			dentry->d_flags &= ~DCACHE_SHRINK_LIST;
 			spin_unlock(&dentry->d_lock);
 			continue;
 		}
@@ -871,6 +892,8 @@ relock:
 		} else {
 			list_move_tail(&dentry->d_lru, &tmp);
 			dentry->d_flags |= DCACHE_SHRINK_LIST;
+			this_cpu_dec(nr_dentry_unused);
+			sb->s_nr_dentry_unused--;
 			spin_unlock(&dentry->d_lock);
 			if (!--count)
 				break;
@@ -884,6 +907,28 @@ relock:
 	shrink_dentry_list(&tmp);
 }
 
+/*
+ * Mark all the dentries as being on the dispose list so we don't think they are
+ * still on the LRU if we try to kill them from ascending the parent chain in
+ * try_prune_one_dentry() rather than directly from the dispose list.
+ */
+static void
+shrink_dcache_list(
+	struct list_head *dispose)
+{
+	struct dentry *dentry;
+
+	rcu_read_lock();
+	list_for_each_entry_rcu(dentry, dispose, d_lru) {
+		spin_lock(&dentry->d_lock);
+		dentry->d_flags |= DCACHE_SHRINK_LIST;
+		this_cpu_dec(nr_dentry_unused);
+		spin_unlock(&dentry->d_lock);
+	}
+	rcu_read_unlock();
+	shrink_dentry_list(dispose);
+}
+
 /**
  * shrink_dcache_sb - shrink dcache for a superblock
  * @sb: superblock
@@ -898,8 +943,16 @@ void shrink_dcache_sb(struct super_block *sb)
 	spin_lock(&sb->s_dentry_lru_lock);
 	while (!list_empty(&sb->s_dentry_lru)) {
 		list_splice_init(&sb->s_dentry_lru, &tmp);
+
+		/*
+		 * account for removal here so we don't need to handle it later
+		 * even though the dentry is no longer on the lru list.
+		 */
+		this_cpu_sub(nr_dentry_unused, sb->s_nr_dentry_unused);
+		sb->s_nr_dentry_unused = 0;
+
 		spin_unlock(&sb->s_dentry_lru_lock);
-		shrink_dentry_list(&tmp);
+		shrink_dcache_list(&tmp);
 		spin_lock(&sb->s_dentry_lru_lock);
 	}
 	spin_unlock(&sb->s_dentry_lru_lock);
-- 
1.8.1.4
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply related	[flat|nested] 97+ messages in thread
* [PATCH v2 06/28] mm: new shrinker API
  2013-03-29  9:13 [PATCH v2 00/28] memcg-aware slab shrinking Glauber Costa
                   ` (4 preceding siblings ...)
  2013-03-29  9:13 ` [PATCH v2 05/28] dcache: remove dentries from LRU before putting on dispose list Glauber Costa
@ 2013-03-29  9:13 ` Glauber Costa
  2013-04-05  1:09   ` Greg Thelen
  2013-03-29  9:13 ` [PATCH v2 07/28] shrinker: convert superblock shrinkers to new API Glauber Costa
                   ` (22 subsequent siblings)
  28 siblings, 1 reply; 97+ messages in thread
From: Glauber Costa @ 2013-03-29  9:13 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-fsdevel, containers, Michal Hocko, Johannes Weiner,
	kamezawa.hiroyu, Andrew Morton, Dave Shrinnker, Greg Thelen,
	hughd, yinghan, Dave Chinner
From: Dave Chinner <dchinner@redhat.com>
The current shrinker callout API uses a single shrinker call for
multiple functions. To determine the function, a special magical
value is passed in a parameter to change the behaviour. This
complicates the implementation and return value specification for
the different behaviours.
Separate the two different behaviours into separate operations, one
to return a count of freeable objects in the cache, and another to
scan a certain number of objects in the cache for freeing. In
defining these new operations, ensure the return values and
resultant behaviours are clearly defined and documented.
Modify shrink_slab() to use the new API and implement the callouts
for all the existing shrinkers.
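As a rough illustration of the count/scan split described above, here is a userspace sketch, not kernel code. The structures are cut-down stand-ins for struct shrinker and struct shrink_control, and demo_cache, DEMO_GFP_FS, demo_count(), demo_scan(), demo_cache_init() and demo_shrink() are hypothetical names invented for this example. The caller loop is a simplification and omits the delta and nr_in_batch bookkeeping that shrink_slab() performs.

```c
#include <assert.h>
#include <stddef.h>

/* Userspace stand-ins for the kernel structures; only the fields used here. */
struct shrink_control {
	unsigned int gfp_mask;	/* stand-in for gfp_t */
	long nr_to_scan;
};

struct shrinker {
	long (*count_objects)(struct shrinker *, struct shrink_control *sc);
	long (*scan_objects)(struct shrinker *, struct shrink_control *sc);
	long batch;
};

#define DEMO_GFP_FS 0x1u	/* hypothetical flag, not the real __GFP_FS value */

/* Hypothetical cache whose only state is a count of freeable objects. */
struct demo_cache {
	struct shrinker shrinker;
	long nr_freeable;
};

static struct demo_cache *to_demo_cache(struct shrinker *s)
{
	/* open-coded container_of() */
	return (struct demo_cache *)((char *)s -
				     offsetof(struct demo_cache, shrinker));
}

/* Count callback: cheap, no deadlock checks, never negative. */
static long demo_count(struct shrinker *s, struct shrink_control *sc)
{
	(void)sc;
	return to_demo_cache(s)->nr_freeable;
}

/* Scan callback: returns objects freed, or -1 if it must not run now. */
static long demo_scan(struct shrinker *s, struct shrink_control *sc)
{
	struct demo_cache *c = to_demo_cache(s);
	long freed;

	if (!(sc->gfp_mask & DEMO_GFP_FS))
		return -1;

	freed = sc->nr_to_scan < c->nr_freeable ? sc->nr_to_scan
						: c->nr_freeable;
	c->nr_freeable -= freed;
	return freed;
}

void demo_cache_init(struct demo_cache *c, long nr_freeable, long batch)
{
	assert(batch > 0);
	c->nr_freeable = nr_freeable;
	c->shrinker.count_objects = demo_count;
	c->shrinker.scan_objects = demo_scan;
	c->shrinker.batch = batch;
}

/* Simplified caller: count once, then scan in batches until -1 or done. */
long demo_shrink(struct shrinker *s, struct shrink_control *sc,
		 long total_scan)
{
	long freed = 0;

	if (s->count_objects(s, sc) <= 0)
		return 0;

	while (total_scan >= s->batch) {
		long ret;

		sc->nr_to_scan = s->batch;
		ret = s->scan_objects(s, sc);
		if (ret == -1)
			break;
		freed += ret;
		total_scan -= s->batch;
	}
	return freed;
}
```

The point of the sketch is that the count callback carries no magic nr_to_scan == 0 case, and the scan callback reports objects freed directly, so the caller no longer has to diff two count calls to work out progress.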
Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 include/linux/shrinker.h | 37 +++++++++++++++++++++++++----------
 mm/vmscan.c              | 51 +++++++++++++++++++++++++++++++-----------------
 2 files changed, 60 insertions(+), 28 deletions(-)
diff --git a/include/linux/shrinker.h b/include/linux/shrinker.h
index ac6b8ee..4f59615 100644
--- a/include/linux/shrinker.h
+++ b/include/linux/shrinker.h
@@ -4,31 +4,47 @@
 /*
  * This struct is used to pass information from page reclaim to the shrinkers.
  * We consolidate the values for easier extention later.
+ *
+ * The 'gfpmask' refers to the allocation we are currently trying to
+ * fulfil.
+ *
+ * Note that 'shrink' will be passed nr_to_scan == 0 when the VM is
+ * querying the cache size, so a fastpath for that case is appropriate.
  */
 struct shrink_control {
 	gfp_t gfp_mask;
 
 	/* How many slab objects shrinker() should scan and try to reclaim */
-	unsigned long nr_to_scan;
+	long nr_to_scan;
 };
 
 /*
  * A callback you can register to apply pressure to ageable caches.
  *
- * 'sc' is passed shrink_control which includes a count 'nr_to_scan'
- * and a 'gfpmask'.  It should look through the least-recently-used
- * 'nr_to_scan' entries and attempt to free them up.  It should return
- * the number of objects which remain in the cache.  If it returns -1, it means
- * it cannot do any scanning at this time (eg. there is a risk of deadlock).
+ * @shrink() should look through the least-recently-used 'nr_to_scan' entries
+ * and attempt to free them up.  It should return the number of objects which
+ * remain in the cache.  If it returns -1, it means it cannot do any scanning at
+ * this time (eg. there is a risk of deadlock).
  *
- * The 'gfpmask' refers to the allocation we are currently trying to
- * fulfil.
+ * @count_objects should return the number of freeable items in the cache. If
+ * there are no objects to free or the number of freeable items cannot be
+ * determined, it should return 0. No deadlock checks should be done during the
+ * count callback - the shrinker relies on aggregating scan counts that couldn't
+ * be executed due to potential deadlocks to be run at a later call when the
+ * deadlock condition is no longer pending.
  *
- * Note that 'shrink' will be passed nr_to_scan == 0 when the VM is
- * querying the cache size, so a fastpath for that case is appropriate.
+ * @scan_objects will only be called if @count_objects returned a positive
+ * value for the number of freeable objects. The callout should scan the cache
+ * and attempt to free items from the cache. It should then return the number of
+ * objects freed during the scan, or -1 if progress cannot be made due to
+ * potential deadlocks. If -1 is returned, then no further attempts to call the
+ * @scan_objects will be made from the current reclaim context.
  */
 struct shrinker {
 	int (*shrink)(struct shrinker *, struct shrink_control *sc);
+	long (*count_objects)(struct shrinker *, struct shrink_control *sc);
+	long (*scan_objects)(struct shrinker *, struct shrink_control *sc);
+
 	int seeks;	/* seeks to recreate an obj */
 	long batch;	/* reclaim batch size, 0 = default */
 
@@ -36,6 +52,7 @@ struct shrinker {
 	struct list_head list;
 	atomic_long_t nr_in_batch; /* objs pending delete */
 };
+
 #define DEFAULT_SEEKS 2 /* A good number if you don't know better. */
 extern void register_shrinker(struct shrinker *);
 extern void unregister_shrinker(struct shrinker *);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index fc6d45a..64b0157 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -204,19 +204,19 @@ static inline int do_shrinker_shrink(struct shrinker *shrinker,
  *
  * Returns the number of slab objects which we shrunk.
  */
-unsigned long shrink_slab(struct shrink_control *shrink,
+unsigned long shrink_slab(struct shrink_control *sc,
 			  unsigned long nr_pages_scanned,
 			  unsigned long lru_pages)
 {
 	struct shrinker *shrinker;
-	unsigned long ret = 0;
+	unsigned long freed = 0;
 
 	if (nr_pages_scanned == 0)
 		nr_pages_scanned = SWAP_CLUSTER_MAX;
 
 	if (!down_read_trylock(&shrinker_rwsem)) {
 		/* Assume we'll be able to shrink next time */
-		ret = 1;
+		freed = 1;
 		goto out;
 	}
 
@@ -224,13 +224,16 @@ unsigned long shrink_slab(struct shrink_control *shrink,
 		unsigned long long delta;
 		long total_scan;
 		long max_pass;
-		int shrink_ret = 0;
 		long nr;
 		long new_nr;
 		long batch_size = shrinker->batch ? shrinker->batch
 						  : SHRINK_BATCH;
 
-		max_pass = do_shrinker_shrink(shrinker, shrink, 0);
+		if (shrinker->scan_objects) {
+			max_pass = shrinker->count_objects(shrinker, sc);
+			WARN_ON(max_pass < 0);
+		} else
+			max_pass = do_shrinker_shrink(shrinker, sc, 0);
 		if (max_pass <= 0)
 			continue;
 
@@ -247,8 +250,8 @@ unsigned long shrink_slab(struct shrink_control *shrink,
 		do_div(delta, lru_pages + 1);
 		total_scan += delta;
 		if (total_scan < 0) {
-			printk(KERN_ERR "shrink_slab: %pF negative objects to "
-			       "delete nr=%ld\n",
+			printk(KERN_ERR
+			"shrink_slab: %pF negative objects to delete nr=%ld\n",
 			       shrinker->shrink, total_scan);
 			total_scan = max_pass;
 		}
@@ -276,20 +279,32 @@ unsigned long shrink_slab(struct shrink_control *shrink,
 		if (total_scan > max_pass * 2)
 			total_scan = max_pass * 2;
 
-		trace_mm_shrink_slab_start(shrinker, shrink, nr,
+		trace_mm_shrink_slab_start(shrinker, sc, nr,
 					nr_pages_scanned, lru_pages,
 					max_pass, delta, total_scan);
 
 		do {
-			int nr_before;
+			long ret;
+
+			if (shrinker->scan_objects) {
+				sc->nr_to_scan = batch_size;
+				ret = shrinker->scan_objects(shrinker, sc);
+
+				if (ret == -1)
+					break;
+				freed += ret;
+			} else {
+				int nr_before;
+
+				nr_before = do_shrinker_shrink(shrinker, sc, 0);
+				ret = do_shrinker_shrink(shrinker, sc,
+								batch_size);
+				if (ret == -1)
+					break;
+				if (ret < nr_before)
+					freed += nr_before - ret;
+			}
 
-			nr_before = do_shrinker_shrink(shrinker, shrink, 0);
-			shrink_ret = do_shrinker_shrink(shrinker, shrink,
-							batch_size);
-			if (shrink_ret == -1)
-				break;
-			if (shrink_ret < nr_before)
-				ret += nr_before - shrink_ret;
 			count_vm_events(SLABS_SCANNED, batch_size);
 			total_scan -= batch_size;
 
@@ -307,12 +322,12 @@ unsigned long shrink_slab(struct shrink_control *shrink,
 		else
 			new_nr = atomic_long_read(&shrinker->nr_in_batch);
 
-		trace_mm_shrink_slab_end(shrinker, shrink_ret, nr, new_nr);
+		trace_mm_shrink_slab_end(shrinker, freed, nr, new_nr);
 	}
 	up_read(&shrinker_rwsem);
 out:
 	cond_resched();
-	return ret;
+	return freed;
 }
 
 static inline int is_page_cache_freeable(struct page *page)
-- 
1.8.1.4
^ permalink raw reply related	[flat|nested] 97+ messages in thread
* [PATCH v2 07/28] shrinker: convert superblock shrinkers to new API
  2013-03-29  9:13 [PATCH v2 00/28] memcg-aware slab shrinking Glauber Costa
                   ` (5 preceding siblings ...)
  2013-03-29  9:13 ` [PATCH v2 06/28] mm: new shrinker API Glauber Costa
@ 2013-03-29  9:13 ` Glauber Costa
  2013-03-29  9:13 ` [PATCH v2 08/28] list: add a new LRU list type Glauber Costa
                   ` (21 subsequent siblings)
  28 siblings, 0 replies; 97+ messages in thread
From: Glauber Costa @ 2013-03-29  9:13 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-fsdevel, containers, Michal Hocko, Johannes Weiner,
	kamezawa.hiroyu, Andrew Morton, Dave Shrinnker, Greg Thelen,
	hughd, yinghan, Dave Chinner, Glauber Costa
From: Dave Chinner <dchinner@redhat.com>
Convert superblock shrinker to use the new count/scan API, and
propagate the API changes through to the filesystem callouts. The
filesystem callouts already use a count/scan API, so it's just
changing counters to longs to match the VM API.
This requires the dentry and inode shrinker callouts to be converted
to the count/scan API. This is mainly a mechanical change.
[ glommer: use mult_frac for fractional proportions, build fixes ]
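To make the proportioning concrete, here is a small userspace sketch of how a scan budget could be split between the dentry, inode and filesystem-private caches. demo_mult_frac() is a plain-function model of the kernel's mult_frac() macro, and demo_split and demo_split_scan() are hypothetical names used only for this illustration; the numbers in the usage note are made up.

```c
#include <assert.h>

/*
 * Userspace model of mult_frac(): computes x * numer / denom by splitting
 * x into quotient and remainder first, which reduces the risk of overflow
 * in the x * numer product.
 */
long demo_mult_frac(long x, long numer, long denom)
{
	long quot, rem;

	assert(denom > 0);
	quot = x / denom;
	rem = x % denom;
	return quot * numer + (rem * numer) / denom;
}

/* Hypothetical result type: per-cache share of the scan budget. */
struct demo_split {
	long dentries;
	long inodes;
	long fs_objects;
};

struct demo_split demo_split_scan(long nr_to_scan, long nr_dentry_unused,
				  long nr_inodes_unused, long fs_objects)
{
	/*
	 * The "+ 1" mirrors super_cache_scan(); in this model it also keeps
	 * the divisor non-zero when every cache is empty.
	 */
	long total = nr_dentry_unused + nr_inodes_unused + fs_objects + 1;
	struct demo_split s;

	s.dentries = demo_mult_frac(nr_to_scan, nr_dentry_unused, total);
	s.inodes = demo_mult_frac(nr_to_scan, nr_inodes_unused, total);
	s.fs_objects = demo_mult_frac(nr_to_scan, fs_objects, total);
	return s;
}
```

With nr_to_scan = 1024 and 3000 unused dentries, 1000 unused inodes and 95 filesystem objects, the total is 4096 and the shares come out as 750, 250 and 23. The shares round down, so they may sum to slightly less than nr_to_scan.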
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Glauber Costa <glommer@parallels.com>
---
 fs/dcache.c         | 10 +++++---
 fs/inode.c          |  7 +++--
 fs/internal.h       |  2 ++
 fs/super.c          | 74 ++++++++++++++++++++++++++++++++---------------------
 fs/xfs/xfs_icache.c |  4 +--
 fs/xfs/xfs_icache.h |  2 +-
 fs/xfs/xfs_super.c  |  8 +++---
 include/linux/fs.h  |  8 ++----
 8 files changed, 67 insertions(+), 48 deletions(-)
diff --git a/fs/dcache.c b/fs/dcache.c
index d15420b..2c9fcd6 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -866,11 +866,12 @@ static void shrink_dentry_list(struct list_head *list)
  * This function may fail to free any resources if all the dentries are in
  * use.
  */
-void prune_dcache_sb(struct super_block *sb, int count)
+long prune_dcache_sb(struct super_block *sb, long nr_to_scan)
 {
 	struct dentry *dentry;
 	LIST_HEAD(referenced);
 	LIST_HEAD(tmp);
+	long freed = 0;
 
 relock:
 	spin_lock(&sb->s_dentry_lru_lock);
@@ -895,7 +896,8 @@ relock:
 			this_cpu_dec(nr_dentry_unused);
 			sb->s_nr_dentry_unused--;
 			spin_unlock(&dentry->d_lock);
-			if (!--count)
+			freed++;
+			if (!--nr_to_scan)
 				break;
 		}
 		cond_resched_lock(&sb->s_dentry_lru_lock);
@@ -905,6 +907,7 @@ relock:
 	spin_unlock(&sb->s_dentry_lru_lock);
 
 	shrink_dentry_list(&tmp);
+	return freed;
 }
 
 /*
@@ -1291,9 +1294,8 @@ rename_retry:
 void shrink_dcache_parent(struct dentry * parent)
 {
 	LIST_HEAD(dispose);
-	int found;
 
-	while ((found = select_parent(parent, &dispose)) != 0)
+	while (select_parent(parent, &dispose))
 		shrink_dentry_list(&dispose);
 }
 EXPORT_SYMBOL(shrink_dcache_parent);
diff --git a/fs/inode.c b/fs/inode.c
index f5f7c06..1dd8908 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -704,10 +704,11 @@ static int can_unuse(struct inode *inode)
  * LRU does not have strict ordering. Hence we don't want to reclaim inodes
  * with this flag set because they are the inodes that are out of order.
  */
-void prune_icache_sb(struct super_block *sb, int nr_to_scan)
+long prune_icache_sb(struct super_block *sb, long nr_to_scan)
 {
 	LIST_HEAD(freeable);
-	int nr_scanned;
+	long nr_scanned;
+	long freed = 0;
 	unsigned long reap = 0;
 
 	spin_lock(&sb->s_inode_lru_lock);
@@ -777,6 +778,7 @@ void prune_icache_sb(struct super_block *sb, int nr_to_scan)
 		list_move(&inode->i_lru, &freeable);
 		sb->s_nr_inodes_unused--;
 		this_cpu_dec(nr_unused);
+		freed++;
 	}
 	if (current_is_kswapd())
 		__count_vm_events(KSWAPD_INODESTEAL, reap);
@@ -787,6 +789,7 @@ void prune_icache_sb(struct super_block *sb, int nr_to_scan)
 		current->reclaim_state->reclaimed_slab += reap;
 
 	dispose_list(&freeable);
+	return freed;
 }
 
 static void __wait_on_freeing_inode(struct inode *inode);
diff --git a/fs/internal.h b/fs/internal.h
index 507141f..5099f87 100644
--- a/fs/internal.h
+++ b/fs/internal.h
@@ -110,6 +110,7 @@ extern int open_check_o_direct(struct file *f);
  * inode.c
  */
 extern spinlock_t inode_sb_list_lock;
+extern long prune_icache_sb(struct super_block *sb, long nr_to_scan);
 extern void inode_add_lru(struct inode *inode);
 
 /*
@@ -125,3 +126,4 @@ extern int invalidate_inodes(struct super_block *, bool);
  * dcache.c
  */
 extern struct dentry *__d_alloc(struct super_block *, const struct qstr *);
+extern long prune_dcache_sb(struct super_block *sb, long nr_to_scan);
diff --git a/fs/super.c b/fs/super.c
index 0be75fb..9d2f2e9 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -53,11 +53,14 @@ static char *sb_writers_name[SB_FREEZE_LEVELS] = {
  * shrinker path and that leads to deadlock on the shrinker_rwsem. Hence we
  * take a passive reference to the superblock to avoid this from occurring.
  */
-static int prune_super(struct shrinker *shrink, struct shrink_control *sc)
+static long super_cache_scan(struct shrinker *shrink, struct shrink_control *sc)
 {
 	struct super_block *sb;
-	int	fs_objects = 0;
-	int	total_objects;
+	long	fs_objects = 0;
+	long	total_objects;
+	long	freed = 0;
+	long	dentries;
+	long	inodes;
 
 	sb = container_of(shrink, struct super_block, s_shrink);
 
@@ -65,7 +68,7 @@ static int prune_super(struct shrinker *shrink, struct shrink_control *sc)
 	 * Deadlock avoidance.  We may hold various FS locks, and we don't want
 	 * to recurse into the FS that called us in clear_inode() and friends..
 	 */
-	if (sc->nr_to_scan && !(sc->gfp_mask & __GFP_FS))
+	if (!(sc->gfp_mask & __GFP_FS))
 		return -1;
 
 	if (!grab_super_passive(sb))
@@ -77,33 +80,45 @@ static int prune_super(struct shrinker *shrink, struct shrink_control *sc)
 	total_objects = sb->s_nr_dentry_unused +
 			sb->s_nr_inodes_unused + fs_objects + 1;
 
-	if (sc->nr_to_scan) {
-		int	dentries;
-		int	inodes;
-
-		/* proportion the scan between the caches */
-		dentries = mult_frac(sc->nr_to_scan, sb->s_nr_dentry_unused,
-							total_objects);
-		inodes = mult_frac(sc->nr_to_scan, sb->s_nr_inodes_unused,
-							total_objects);
-		if (fs_objects)
-			fs_objects = mult_frac(sc->nr_to_scan, fs_objects,
-							total_objects);
-		/*
-		 * prune the dcache first as the icache is pinned by it, then
-		 * prune the icache, followed by the filesystem specific caches
-		 */
-		prune_dcache_sb(sb, dentries);
-		prune_icache_sb(sb, inodes);
+	/* proportion the scan between the caches */
+	dentries = mult_frac(sc->nr_to_scan, sb->s_nr_dentry_unused,
+								total_objects);
+	inodes = mult_frac(sc->nr_to_scan, sb->s_nr_inodes_unused,
+								total_objects);
 
-		if (fs_objects && sb->s_op->free_cached_objects) {
-			sb->s_op->free_cached_objects(sb, fs_objects);
-			fs_objects = sb->s_op->nr_cached_objects(sb);
-		}
-		total_objects = sb->s_nr_dentry_unused +
-				sb->s_nr_inodes_unused + fs_objects;
+	/*
+	 * prune the dcache first as the icache is pinned by it, then
+	 * prune the icache, followed by the filesystem specific caches
+	 */
+	freed = prune_dcache_sb(sb, dentries);
+	freed += prune_icache_sb(sb, inodes);
+
+	if (fs_objects) {
+		fs_objects = mult_frac(sc->nr_to_scan, fs_objects,
+								total_objects);
+		freed += sb->s_op->free_cached_objects(sb, fs_objects);
 	}
 
+	drop_super(sb);
+	return freed;
+}
+
+static long super_cache_count(struct shrinker *shrink, struct shrink_control *sc)
+{
+	struct super_block *sb;
+	long	total_objects = 0;
+
+	sb = container_of(shrink, struct super_block, s_shrink);
+
+	if (!grab_super_passive(sb))
+		return 0;
+
+	if (sb->s_op && sb->s_op->nr_cached_objects)
+		total_objects = sb->s_op->nr_cached_objects(sb);
+
+	total_objects += sb->s_nr_dentry_unused;
+	total_objects += sb->s_nr_inodes_unused;
+
 	total_objects = vfs_pressure_ratio(total_objects);
 	drop_super(sb);
 	return total_objects;
@@ -217,7 +232,8 @@ static struct super_block *alloc_super(struct file_system_type *type, int flags)
 		s->cleancache_poolid = -1;
 
 		s->s_shrink.seeks = DEFAULT_SEEKS;
-		s->s_shrink.shrink = prune_super;
+		s->s_shrink.scan_objects = super_cache_scan;
+		s->s_shrink.count_objects = super_cache_count;
 		s->s_shrink.batch = 1024;
 	}
 out:
diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
index 96e344e..b35c311 100644
--- a/fs/xfs/xfs_icache.c
+++ b/fs/xfs/xfs_icache.c
@@ -1164,7 +1164,7 @@ xfs_reclaim_inodes(
  * them to be cleaned, which we hope will not be very long due to the
  * background walker having already kicked the IO off on those dirty inodes.
  */
-void
+long
 xfs_reclaim_inodes_nr(
 	struct xfs_mount	*mp,
 	int			nr_to_scan)
@@ -1173,7 +1173,7 @@ xfs_reclaim_inodes_nr(
 	xfs_reclaim_work_queue(mp);
 	xfs_ail_push_all(mp->m_ail);
 
-	xfs_reclaim_inodes_ag(mp, SYNC_TRYLOCK | SYNC_WAIT, &nr_to_scan);
+	return xfs_reclaim_inodes_ag(mp, SYNC_TRYLOCK | SYNC_WAIT, &nr_to_scan);
 }
 
 /*
diff --git a/fs/xfs/xfs_icache.h b/fs/xfs/xfs_icache.h
index e0f138c..2d6d2d3 100644
--- a/fs/xfs/xfs_icache.h
+++ b/fs/xfs/xfs_icache.h
@@ -31,7 +31,7 @@ void xfs_reclaim_worker(struct work_struct *work);
 
 int xfs_reclaim_inodes(struct xfs_mount *mp, int mode);
 int xfs_reclaim_inodes_count(struct xfs_mount *mp);
-void xfs_reclaim_inodes_nr(struct xfs_mount *mp, int nr_to_scan);
+long xfs_reclaim_inodes_nr(struct xfs_mount *mp, int nr_to_scan);
 
 void xfs_inode_set_reclaim_tag(struct xfs_inode *ip);
 
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index ea341ce..1ff991b 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -1523,19 +1523,19 @@ xfs_fs_mount(
 	return mount_bdev(fs_type, flags, dev_name, data, xfs_fs_fill_super);
 }
 
-static int
+static long
 xfs_fs_nr_cached_objects(
 	struct super_block	*sb)
 {
 	return xfs_reclaim_inodes_count(XFS_M(sb));
 }
 
-static void
+static long
 xfs_fs_free_cached_objects(
 	struct super_block	*sb,
-	int			nr_to_scan)
+	long			nr_to_scan)
 {
-	xfs_reclaim_inodes_nr(XFS_M(sb), nr_to_scan);
+	return xfs_reclaim_inodes_nr(XFS_M(sb), nr_to_scan);
 }
 
 static const struct super_operations xfs_super_operations = {
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 02934f5..a49fe84 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1324,10 +1324,6 @@ struct super_block {
 	int s_readonly_remount;
 };
 
-/* superblock cache pruning functions */
-extern void prune_icache_sb(struct super_block *sb, int nr_to_scan);
-extern void prune_dcache_sb(struct super_block *sb, int nr_to_scan);
-
 extern struct timespec current_fs_time(struct super_block *sb);
 
 /*
@@ -1614,8 +1610,8 @@ struct super_operations {
 	ssize_t (*quota_write)(struct super_block *, int, const char *, size_t, loff_t);
 #endif
 	int (*bdev_try_to_free_page)(struct super_block*, struct page*, gfp_t);
-	int (*nr_cached_objects)(struct super_block *);
-	void (*free_cached_objects)(struct super_block *, int);
+	long (*nr_cached_objects)(struct super_block *);
+	long (*free_cached_objects)(struct super_block *, long);
 };
 
 /*
-- 
1.8.1.4
^ permalink raw reply related	[flat|nested] 97+ messages in thread
* [PATCH v2 08/28] list: add a new LRU list type
  2013-03-29  9:13 [PATCH v2 00/28] memcg-aware slab shrinking Glauber Costa
                   ` (6 preceding siblings ...)
  2013-03-29  9:13 ` [PATCH v2 07/28] shrinker: convert superblock shrinkers to new API Glauber Costa
@ 2013-03-29  9:13 ` Glauber Costa
  2013-04-04 21:53   ` Greg Thelen
  2013-03-29  9:13 ` [PATCH v2 09/28] inode: convert inode lru list to generic lru list code Glauber Costa
                   ` (20 subsequent siblings)
  28 siblings, 1 reply; 97+ messages in thread
From: Glauber Costa @ 2013-03-29  9:13 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-fsdevel, containers, Michal Hocko, Johannes Weiner,
	kamezawa.hiroyu, Andrew Morton, Dave Shrinnker, Greg Thelen,
	hughd, yinghan, Dave Chinner
From: Dave Chinner <dchinner@redhat.com>
Several subsystems use the same construct for LRU lists - a list
head, a spin lock and an item count. They also use exactly the same
code for adding and removing items from the LRU. Create a generic
type for these LRU lists.
This is the beginning of generic, node aware LRUs for shrinkers to
work with.
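For readers who want to see the construct in isolation, the following userspace sketch models the add/del semantics: an item is added only if it is not already on a list, removed only if it is, and the return value tells the caller whether the count changed. All demo_* names are hypothetical, the list helpers are minimal re-implementations rather than <linux/list.h>, and the spin lock is left out because the model is single-threaded.

```c
#include <assert.h>

/* Minimal circular doubly linked list, modelled on struct list_head. */
struct demo_list_head {
	struct demo_list_head *next, *prev;
};

void demo_init_list_head(struct demo_list_head *h)
{
	h->next = h;
	h->prev = h;
}

static int demo_list_empty(const struct demo_list_head *h)
{
	return h->next == h;
}

static void demo_list_add_tail(struct demo_list_head *item,
			       struct demo_list_head *head)
{
	item->prev = head->prev;
	item->next = head;
	head->prev->next = item;
	head->prev = item;
}

static void demo_list_del_init(struct demo_list_head *item)
{
	item->prev->next = item->next;
	item->next->prev = item->prev;
	demo_init_list_head(item);
}

/* Model of struct list_lru without the lock (single-threaded sketch). */
struct demo_list_lru {
	struct demo_list_head list;
	long nr_items;
};

void demo_lru_init(struct demo_list_lru *lru)
{
	demo_init_list_head(&lru->list);
	lru->nr_items = 0;
}

/* Returns 1 if the item was added, 0 if it was already on a list. */
int demo_lru_add(struct demo_list_lru *lru, struct demo_list_head *item)
{
	if (!demo_list_empty(item))
		return 0;
	demo_list_add_tail(item, &lru->list);
	lru->nr_items++;
	return 1;
}

/* Returns 1 if the item was removed, 0 if it was not on a list. */
int demo_lru_del(struct demo_list_lru *lru, struct demo_list_head *item)
{
	if (demo_list_empty(item))
		return 0;
	assert(lru->nr_items > 0);
	demo_list_del_init(item);
	lru->nr_items--;
	return 1;
}

long demo_lru_count(const struct demo_list_lru *lru)
{
	return lru->nr_items;
}
```

The 1/0 return value is what lets a caller keep its own global counter in step with the list, adjusting it only when the add or delete actually happened.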
Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 include/linux/list_lru.h |  36 +++++++++++++++
 lib/Makefile             |   2 +-
 lib/list_lru.c           | 117 +++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 154 insertions(+), 1 deletion(-)
 create mode 100644 include/linux/list_lru.h
 create mode 100644 lib/list_lru.c
diff --git a/include/linux/list_lru.h b/include/linux/list_lru.h
new file mode 100644
index 0000000..3423949
--- /dev/null
+++ b/include/linux/list_lru.h
@@ -0,0 +1,36 @@
+/*
+ * Copyright (c) 2010-2012 Red Hat, Inc. All rights reserved.
+ * Author: David Chinner
+ *
+ * Generic LRU infrastructure
+ */
+#ifndef _LRU_LIST_H
+#define _LRU_LIST_H 0
+
+#include <linux/list.h>
+
+struct list_lru {
+	spinlock_t		lock;
+	struct list_head	list;
+	long			nr_items;
+};
+
+int list_lru_init(struct list_lru *lru);
+int list_lru_add(struct list_lru *lru, struct list_head *item);
+int list_lru_del(struct list_lru *lru, struct list_head *item);
+
+static inline long list_lru_count(struct list_lru *lru)
+{
+	return lru->nr_items;
+}
+
+typedef int (*list_lru_walk_cb)(struct list_head *item, spinlock_t *lock,
+				void *cb_arg);
+typedef void (*list_lru_dispose_cb)(struct list_head *dispose_list);
+
+long list_lru_walk(struct list_lru *lru, list_lru_walk_cb isolate,
+		   void *cb_arg, long nr_to_walk);
+
+long list_lru_dispose_all(struct list_lru *lru, list_lru_dispose_cb dispose);
+
+#endif /* _LRU_LIST_H */
diff --git a/lib/Makefile b/lib/Makefile
index d7946ff..f14abd9 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -13,7 +13,7 @@ lib-y := ctype.o string.o vsprintf.o cmdline.o \
 	 sha1.o md5.o irq_regs.o reciprocal_div.o argv_split.o \
 	 proportions.o flex_proportions.o prio_heap.o ratelimit.o show_mem.o \
 	 is_single_threaded.o plist.o decompress.o kobject_uevent.o \
-	 earlycpio.o
+	 earlycpio.o list_lru.o
 
 lib-$(CONFIG_MMU) += ioremap.o
 lib-$(CONFIG_SMP) += cpumask.o
diff --git a/lib/list_lru.c b/lib/list_lru.c
new file mode 100644
index 0000000..475d0e9
--- /dev/null
+++ b/lib/list_lru.c
@@ -0,0 +1,117 @@
+/*
+ * Copyright (c) 2010-2012 Red Hat, Inc. All rights reserved.
+ * Author: David Chinner
+ *
+ * Generic LRU infrastructure
+ */
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/list_lru.h>
+
+int
+list_lru_add(
+	struct list_lru	*lru,
+	struct list_head *item)
+{
+	spin_lock(&lru->lock);
+	if (list_empty(item)) {
+		list_add_tail(item, &lru->list);
+		lru->nr_items++;
+		spin_unlock(&lru->lock);
+		return 1;
+	}
+	spin_unlock(&lru->lock);
+	return 0;
+}
+EXPORT_SYMBOL_GPL(list_lru_add);
+
+int
+list_lru_del(
+	struct list_lru	*lru,
+	struct list_head *item)
+{
+	spin_lock(&lru->lock);
+	if (!list_empty(item)) {
+		list_del_init(item);
+		lru->nr_items--;
+		spin_unlock(&lru->lock);
+		return 1;
+	}
+	spin_unlock(&lru->lock);
+	return 0;
+}
+EXPORT_SYMBOL_GPL(list_lru_del);
+
+long
+list_lru_walk(
+	struct list_lru *lru,
+	list_lru_walk_cb isolate,
+	void		*cb_arg,
+	long		nr_to_walk)
+{
+	struct list_head *item, *n;
+	long removed = 0;
+restart:
+	spin_lock(&lru->lock);
+	list_for_each_safe(item, n, &lru->list) {
+		int ret;
+
+		if (nr_to_walk-- < 0)
+			break;
+
+		ret = isolate(item, &lru->lock, cb_arg);
+		switch (ret) {
+		case 0:	/* item removed from list */
+			lru->nr_items--;
+			removed++;
+			break;
+		case 1: /* item referenced, give another pass */
+			list_move_tail(item, &lru->list);
+			break;
+		case 2: /* item cannot be locked, skip */
+			break;
+		case 3: /* item not freeable, lock dropped */
+			goto restart;
+		default:
+			BUG();
+		}
+	}
+	spin_unlock(&lru->lock);
+	return removed;
+}
+EXPORT_SYMBOL_GPL(list_lru_walk);
+
+long
+list_lru_dispose_all(
+	struct list_lru *lru,
+	list_lru_dispose_cb dispose)
+{
+	long disposed = 0;
+	LIST_HEAD(dispose_list);
+
+	spin_lock(&lru->lock);
+	while (!list_empty(&lru->list)) {
+		list_splice_init(&lru->list, &dispose_list);
+		disposed += lru->nr_items;
+		lru->nr_items = 0;
+		spin_unlock(&lru->lock);
+
+		dispose(&dispose_list);
+
+		spin_lock(&lru->lock);
+	}
+	spin_unlock(&lru->lock);
+	return disposed;
+}
+
+int
+list_lru_init(
+	struct list_lru	*lru)
+{
+	spin_lock_init(&lru->lock);
+	INIT_LIST_HEAD(&lru->list);
+	lru->nr_items = 0;
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(list_lru_init);
-- 
1.8.1.4
^ permalink raw reply related	[flat|nested] 97+ messages in thread
* [PATCH v2 09/28] inode: convert inode lru list to generic lru list code.
  2013-03-29  9:13 [PATCH v2 00/28] memcg-aware slab shrinking Glauber Costa
                   ` (7 preceding siblings ...)
  2013-03-29  9:13 ` [PATCH v2 08/28] list: add a new LRU list type Glauber Costa
@ 2013-03-29  9:13 ` Glauber Costa
  2013-03-29  9:13 ` [PATCH v2 10/28] dcache: convert to use new lru list infrastructure Glauber Costa
                   ` (19 subsequent siblings)
  28 siblings, 0 replies; 97+ messages in thread
From: Glauber Costa @ 2013-03-29  9:13 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-fsdevel, containers, Michal Hocko, Johannes Weiner,
	kamezawa.hiroyu, Andrew Morton, Dave Shrinnker, Greg Thelen,
	hughd, yinghan, Dave Chinner
From: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/inode.c         | 174 +++++++++++++++++++++--------------------------------
 fs/super.c         |  12 ++--
 include/linux/fs.h |   6 +-
 3 files changed, 76 insertions(+), 116 deletions(-)
diff --git a/fs/inode.c b/fs/inode.c
index 1dd8908..18505c5 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -17,6 +17,7 @@
 #include <linux/prefetch.h>
 #include <linux/buffer_head.h> /* for inode_has_buffers */
 #include <linux/ratelimit.h>
+#include <linux/list_lru.h>
 #include "internal.h"
 
 /*
@@ -24,7 +25,7 @@
  *
  * inode->i_lock protects:
  *   inode->i_state, inode->i_hash, __iget()
- * inode->i_sb->s_inode_lru_lock protects:
+ * Inode LRU list locks protect:
  *   inode->i_sb->s_inode_lru, inode->i_lru
  * inode_sb_list_lock protects:
  *   sb->s_inodes, inode->i_sb_list
@@ -37,7 +38,7 @@
  *
  * inode_sb_list_lock
  *   inode->i_lock
- *     inode->i_sb->s_inode_lru_lock
+ *     Inode LRU list locks
  *
  * bdi->wb.list_lock
  *   inode->i_lock
@@ -399,13 +400,8 @@ EXPORT_SYMBOL(ihold);
 
 static void inode_lru_list_add(struct inode *inode)
 {
-	spin_lock(&inode->i_sb->s_inode_lru_lock);
-	if (list_empty(&inode->i_lru)) {
-		list_add(&inode->i_lru, &inode->i_sb->s_inode_lru);
-		inode->i_sb->s_nr_inodes_unused++;
+	if (list_lru_add(&inode->i_sb->s_inode_lru, &inode->i_lru))
 		this_cpu_inc(nr_unused);
-	}
-	spin_unlock(&inode->i_sb->s_inode_lru_lock);
 }
 
 /*
@@ -423,13 +419,9 @@ void inode_add_lru(struct inode *inode)
 
 static void inode_lru_list_del(struct inode *inode)
 {
-	spin_lock(&inode->i_sb->s_inode_lru_lock);
-	if (!list_empty(&inode->i_lru)) {
-		list_del_init(&inode->i_lru);
-		inode->i_sb->s_nr_inodes_unused--;
+
+	if (list_lru_del(&inode->i_sb->s_inode_lru, &inode->i_lru))
 		this_cpu_dec(nr_unused);
-	}
-	spin_unlock(&inode->i_sb->s_inode_lru_lock);
 }
 
 /**
@@ -673,24 +665,8 @@ int invalidate_inodes(struct super_block *sb, bool kill_dirty)
 	return busy;
 }
 
-static int can_unuse(struct inode *inode)
-{
-	if (inode->i_state & ~I_REFERENCED)
-		return 0;
-	if (inode_has_buffers(inode))
-		return 0;
-	if (atomic_read(&inode->i_count))
-		return 0;
-	if (inode->i_data.nrpages)
-		return 0;
-	return 1;
-}
-
 /*
- * Walk the superblock inode LRU for freeable inodes and attempt to free them.
- * This is called from the superblock shrinker function with a number of inodes
- * to trim from the LRU. Inodes to be freed are moved to a temporary list and
- * then are freed outside inode_lock by dispose_list().
+ * Isolate the inode from the LRU in preparation for freeing it.
  *
  * Any inodes which are pinned purely because of attached pagecache have their
  * pagecache removed.  If the inode has metadata buffers attached to
@@ -704,90 +680,78 @@ static int can_unuse(struct inode *inode)
  * LRU does not have strict ordering. Hence we don't want to reclaim inodes
  * with this flag set because they are the inodes that are out of order.
  */
-long prune_icache_sb(struct super_block *sb, long nr_to_scan)
+static int inode_lru_isolate(struct list_head *item, spinlock_t *lru_lock,
+				void *arg)
 {
-	LIST_HEAD(freeable);
-	long nr_scanned;
-	long freed = 0;
-	unsigned long reap = 0;
+	struct list_head *freeable = arg;
+	struct inode	*inode = container_of(item, struct inode, i_lru);
 
-	spin_lock(&sb->s_inode_lru_lock);
-	for (nr_scanned = nr_to_scan; nr_scanned >= 0; nr_scanned--) {
-		struct inode *inode;
+	/*
+	 * we are inverting the lru lock/inode->i_lock here, so use a trylock.
+	 * If we fail to get the lock, just skip it.
+	 */
+	if (!spin_trylock(&inode->i_lock))
+		return 2;
 
-		if (list_empty(&sb->s_inode_lru))
-			break;
+	/*
+	 * Referenced or dirty inodes are still in use. Give them another pass
+	 * through the LRU as we cannot reclaim them now.
+	 */
+	if (atomic_read(&inode->i_count) ||
+	    (inode->i_state & ~I_REFERENCED)) {
+		list_del_init(&inode->i_lru);
+		spin_unlock(&inode->i_lock);
+		this_cpu_dec(nr_unused);
+		return 0;
+	}
 
-		inode = list_entry(sb->s_inode_lru.prev, struct inode, i_lru);
+	/* recently referenced inodes get one more pass */
+	if (inode->i_state & I_REFERENCED) {
+		inode->i_state &= ~I_REFERENCED;
+		spin_unlock(&inode->i_lock);
+		return 1;
+	}
 
-		/*
-		 * we are inverting the sb->s_inode_lru_lock/inode->i_lock here,
-		 * so use a trylock. If we fail to get the lock, just move the
-		 * inode to the back of the list so we don't spin on it.
-		 */
-		if (!spin_trylock(&inode->i_lock)) {
-			list_move_tail(&inode->i_lru, &sb->s_inode_lru);
-			continue;
+	if (inode_has_buffers(inode) || inode->i_data.nrpages) {
+		__iget(inode);
+		spin_unlock(&inode->i_lock);
+		spin_unlock(lru_lock);
+		if (remove_inode_buffers(inode)) {
+			unsigned long reap;
+			reap = invalidate_mapping_pages(&inode->i_data, 0, -1);
+			if (current_is_kswapd())
+				__count_vm_events(KSWAPD_INODESTEAL, reap);
+			else
+				__count_vm_events(PGINODESTEAL, reap);
+			if (current->reclaim_state)
+				current->reclaim_state->reclaimed_slab += reap;
 		}
+		iput(inode);
+		return 3;
+	}
 
-		/*
-		 * Referenced or dirty inodes are still in use. Give them
-		 * another pass through the LRU as we canot reclaim them now.
-		 */
-		if (atomic_read(&inode->i_count) ||
-		    (inode->i_state & ~I_REFERENCED)) {
-			list_del_init(&inode->i_lru);
-			spin_unlock(&inode->i_lock);
-			sb->s_nr_inodes_unused--;
-			this_cpu_dec(nr_unused);
-			continue;
-		}
+	WARN_ON(inode->i_state & I_NEW);
+	inode->i_state |= I_FREEING;
+	spin_unlock(&inode->i_lock);
 
-		/* recently referenced inodes get one more pass */
-		if (inode->i_state & I_REFERENCED) {
-			inode->i_state &= ~I_REFERENCED;
-			list_move(&inode->i_lru, &sb->s_inode_lru);
-			spin_unlock(&inode->i_lock);
-			continue;
-		}
-		if (inode_has_buffers(inode) || inode->i_data.nrpages) {
-			__iget(inode);
-			spin_unlock(&inode->i_lock);
-			spin_unlock(&sb->s_inode_lru_lock);
-			if (remove_inode_buffers(inode))
-				reap += invalidate_mapping_pages(&inode->i_data,
-								0, -1);
-			iput(inode);
-			spin_lock(&sb->s_inode_lru_lock);
-
-			if (inode != list_entry(sb->s_inode_lru.next,
-						struct inode, i_lru))
-				continue;	/* wrong inode or list_empty */
-			/* avoid lock inversions with trylock */
-			if (!spin_trylock(&inode->i_lock))
-				continue;
-			if (!can_unuse(inode)) {
-				spin_unlock(&inode->i_lock);
-				continue;
-			}
-		}
-		WARN_ON(inode->i_state & I_NEW);
-		inode->i_state |= I_FREEING;
-		spin_unlock(&inode->i_lock);
+	list_move(&inode->i_lru, freeable);
+	this_cpu_dec(nr_unused);
+	return 0;
+}
 
-		list_move(&inode->i_lru, &freeable);
-		sb->s_nr_inodes_unused--;
-		this_cpu_dec(nr_unused);
-		freed++;
-	}
-	if (current_is_kswapd())
-		__count_vm_events(KSWAPD_INODESTEAL, reap);
-	else
-		__count_vm_events(PGINODESTEAL, reap);
-	spin_unlock(&sb->s_inode_lru_lock);
-	if (current->reclaim_state)
-		current->reclaim_state->reclaimed_slab += reap;
+/*
+ * Walk the superblock inode LRU for freeable inodes and attempt to free them.
+ * This is called from the superblock shrinker function with a number of inodes
+ * to trim from the LRU. Inodes to be freed are moved to a temporary list and
+ * then are freed outside inode_lock by dispose_list().
+ */
+long prune_icache_sb(struct super_block *sb, long nr_to_scan)
+{
+	LIST_HEAD(freeable);
+	long freed;
 
+	freed = list_lru_walk(&sb->s_inode_lru, inode_lru_isolate,
+						&freeable, nr_to_scan);
 	dispose_list(&freeable);
 	return freed;
 }
diff --git a/fs/super.c b/fs/super.c
index 9d2f2e9..9049110 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -77,14 +77,13 @@ static long super_cache_scan(struct shrinker *shrink, struct shrink_control *sc)
 	if (sb->s_op && sb->s_op->nr_cached_objects)
 		fs_objects = sb->s_op->nr_cached_objects(sb);
 
-	total_objects = sb->s_nr_dentry_unused +
-			sb->s_nr_inodes_unused + fs_objects + 1;
+	inodes = list_lru_count(&sb->s_inode_lru);
+	total_objects = sb->s_nr_dentry_unused + inodes + fs_objects + 1;
 
 	/* proportion the scan between the caches */
 	dentries = mult_frac(sc->nr_to_scan, sb->s_nr_dentry_unused,
 								total_objects);
-	inodes = mult_frac(sc->nr_to_scan, sb->s_nr_inodes_unused,
-								total_objects);
+	inodes = mult_frac(sc->nr_to_scan, inodes, total_objects);
 
 	/*
 	 * prune the dcache first as the icache is pinned by it, then
@@ -117,7 +116,7 @@ static long super_cache_count(struct shrinker *shrink, struct shrink_control *sc
 		total_objects = sb->s_op->nr_cached_objects(sb);
 
 	total_objects += sb->s_nr_dentry_unused;
-	total_objects += sb->s_nr_inodes_unused;
+	total_objects += list_lru_count(&sb->s_inode_lru);
 
 	total_objects = vfs_pressure_ratio(total_objects);
 	drop_super(sb);
@@ -198,8 +197,7 @@ static struct super_block *alloc_super(struct file_system_type *type, int flags)
 		INIT_LIST_HEAD(&s->s_inodes);
 		INIT_LIST_HEAD(&s->s_dentry_lru);
 		spin_lock_init(&s->s_dentry_lru_lock);
-		INIT_LIST_HEAD(&s->s_inode_lru);
-		spin_lock_init(&s->s_inode_lru_lock);
+		list_lru_init(&s->s_inode_lru);
 		INIT_LIST_HEAD(&s->s_mounts);
 		init_rwsem(&s->s_umount);
 		lockdep_set_class(&s->s_umount, &type->s_umount_key);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index a49fe84..fdeaca1 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -10,6 +10,7 @@
 #include <linux/stat.h>
 #include <linux/cache.h>
 #include <linux/list.h>
+#include <linux/list_lru.h>
 #include <linux/radix-tree.h>
 #include <linux/rbtree.h>
 #include <linux/init.h>
@@ -1267,10 +1268,7 @@ struct super_block {
 	struct list_head	s_dentry_lru;	/* unused dentry lru */
 	int			s_nr_dentry_unused;	/* # of dentry on lru */
 
-	/* s_inode_lru_lock protects s_inode_lru and s_nr_inodes_unused */
-	spinlock_t		s_inode_lru_lock ____cacheline_aligned_in_smp;
-	struct list_head	s_inode_lru;		/* unused inode lru */
-	int			s_nr_inodes_unused;	/* # of inodes on lru */
+	struct list_lru		s_inode_lru ____cacheline_aligned_in_smp;
 
 	struct block_device	*s_bdev;
 	struct backing_dev_info *s_bdi;
-- 
1.8.1.4
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
* [PATCH v2 10/28] dcache: convert to use new lru list infrastructure
  2013-03-29  9:13 [PATCH v2 00/28] memcg-aware slab shrinking Glauber Costa
                   ` (8 preceding siblings ...)
  2013-03-29  9:13 ` [PATCH v2 09/28] inode: convert inode lru list to generic lru list code Glauber Costa
@ 2013-03-29  9:13 ` Glauber Costa
  2013-04-08 13:14   ` Glauber Costa
  2013-03-29  9:13 ` [PATCH v2 11/28] list_lru: per-node " Glauber Costa
                   ` (18 subsequent siblings)
  28 siblings, 1 reply; 97+ messages in thread
From: Glauber Costa @ 2013-03-29  9:13 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-fsdevel, containers, Michal Hocko, Johannes Weiner,
	kamezawa.hiroyu, Andrew Morton, Dave Shrinnker, Greg Thelen,
	hughd, yinghan, Dave Chinner
From: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/dcache.c        | 171 ++++++++++++++++++++++-------------------------------
 fs/super.c         |  11 ++--
 include/linux/fs.h |  15 +++--
 3 files changed, 82 insertions(+), 115 deletions(-)
diff --git a/fs/dcache.c b/fs/dcache.c
index 2c9fcd6..b59d341 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -37,6 +37,7 @@
 #include <linux/rculist_bl.h>
 #include <linux/prefetch.h>
 #include <linux/ratelimit.h>
+#include <linux/list_lru.h>
 #include "internal.h"
 #include "mount.h"
 
@@ -318,20 +319,8 @@ static void dentry_unlink_inode(struct dentry * dentry)
  */
 static void dentry_lru_add(struct dentry *dentry)
 {
-	if (list_empty(&dentry->d_lru)) {
-		spin_lock(&dentry->d_sb->s_dentry_lru_lock);
-		list_add(&dentry->d_lru, &dentry->d_sb->s_dentry_lru);
-		dentry->d_sb->s_nr_dentry_unused++;
+	if (list_lru_add(&dentry->d_sb->s_dentry_lru, &dentry->d_lru))
 		this_cpu_inc(nr_dentry_unused);
-		spin_unlock(&dentry->d_sb->s_dentry_lru_lock);
-	}
-}
-
-static void __dentry_lru_del(struct dentry *dentry)
-{
-	list_del_init(&dentry->d_lru);
-	dentry->d_sb->s_nr_dentry_unused--;
-	this_cpu_dec(nr_dentry_unused);
 }
 
 /*
@@ -341,11 +330,8 @@ static void dentry_lru_del(struct dentry *dentry)
 {
 	BUG_ON(dentry->d_flags & DCACHE_SHRINK_LIST);
 
-	if (!list_empty(&dentry->d_lru)) {
-		spin_lock(&dentry->d_sb->s_dentry_lru_lock);
-		__dentry_lru_del(dentry);
-		spin_unlock(&dentry->d_sb->s_dentry_lru_lock);
-	}
+	if (list_lru_del(&dentry->d_sb->s_dentry_lru, &dentry->d_lru))
+		this_cpu_dec(nr_dentry_unused);
 }
 
 /*
@@ -361,35 +347,19 @@ static void dentry_lru_del(struct dentry *dentry)
  */
 static void dentry_lru_prune(struct dentry *dentry)
 {
-	if (!list_empty(&dentry->d_lru)) {
+	int prune = dentry->d_flags & DCACHE_OP_PRUNE;
 
-		if (dentry->d_flags & DCACHE_OP_PRUNE)
-			dentry->d_op->d_prune(dentry);
-
-		if ((dentry->d_flags & DCACHE_SHRINK_LIST))
-			list_del_init(&dentry->d_lru);
-		else {
-			spin_lock(&dentry->d_sb->s_dentry_lru_lock);
-			__dentry_lru_del(dentry);
-			spin_unlock(&dentry->d_sb->s_dentry_lru_lock);
-		}
-		dentry->d_flags &= ~DCACHE_SHRINK_LIST;
-	}
-}
-
-static void dentry_lru_move_list(struct dentry *dentry, struct list_head *list)
-{
-	BUG_ON(dentry->d_flags & DCACHE_SHRINK_LIST);
-
-	spin_lock(&dentry->d_sb->s_dentry_lru_lock);
-	if (list_empty(&dentry->d_lru)) {
-		list_add_tail(&dentry->d_lru, list);
-	} else {
-		list_move_tail(&dentry->d_lru, list);
-		dentry->d_sb->s_nr_dentry_unused--;
+	if (!list_empty(&dentry->d_lru) &&
+	    (dentry->d_flags & DCACHE_SHRINK_LIST))
+		list_del_init(&dentry->d_lru);
+	else if (list_lru_del(&dentry->d_sb->s_dentry_lru, &dentry->d_lru))
 		this_cpu_dec(nr_dentry_unused);
-	}
-	spin_unlock(&dentry->d_sb->s_dentry_lru_lock);
+	else
+		prune = 0;
+
+	dentry->d_flags &= ~DCACHE_SHRINK_LIST;
+	if (prune)
+		dentry->d_op->d_prune(dentry);
 }
 
 /**
@@ -854,6 +824,51 @@ static void shrink_dentry_list(struct list_head *list)
 	rcu_read_unlock();
 }
 
+static int dentry_lru_isolate(struct list_head *item, spinlock_t *lru_lock,
+				void *arg)
+{
+	struct list_head *freeable = arg;
+	struct dentry	*dentry = container_of(item, struct dentry, d_lru);
+
+
+	/*
+	 * we are inverting the lru lock/dentry->d_lock here,
+	 * so use a trylock. If we fail to get the lock, just skip
+	 * it
+	 */
+	if (!spin_trylock(&dentry->d_lock))
+		return 2;
+
+	/*
+	 * Referenced dentries are still in use. If they have active
+	 * counts, just remove them from the LRU. Otherwise give them
+	 * another pass through the LRU.
+	 */
+	if (dentry->d_count) {
+		list_del_init(&dentry->d_lru);
+		spin_unlock(&dentry->d_lock);
+		return 0;
+	}
+
+	if (dentry->d_flags & DCACHE_REFERENCED) {
+		dentry->d_flags &= ~DCACHE_REFERENCED;
+		spin_unlock(&dentry->d_lock);
+
+		/*
+		 * XXX: this list move should be done under d_lock. Need to
+		 * determine if it is safe just to do it under the lru lock.
+		 */
+		return 1;
+	}
+
+	dentry->d_flags |= DCACHE_SHRINK_LIST;
+	list_move_tail(&dentry->d_lru, freeable);
+	this_cpu_dec(nr_dentry_unused);
+	spin_unlock(&dentry->d_lock);
+
+	return 0;
+}
+
 /**
  * prune_dcache_sb - shrink the dcache
  * @sb: superblock
@@ -868,45 +883,12 @@ static void shrink_dentry_list(struct list_head *list)
  */
 long prune_dcache_sb(struct super_block *sb, long nr_to_scan)
 {
-	struct dentry *dentry;
-	LIST_HEAD(referenced);
-	LIST_HEAD(tmp);
-	long freed = 0;
-
-relock:
-	spin_lock(&sb->s_dentry_lru_lock);
-	while (!list_empty(&sb->s_dentry_lru)) {
-		dentry = list_entry(sb->s_dentry_lru.prev,
-				struct dentry, d_lru);
-		BUG_ON(dentry->d_sb != sb);
-
-		if (!spin_trylock(&dentry->d_lock)) {
-			spin_unlock(&sb->s_dentry_lru_lock);
-			cpu_relax();
-			goto relock;
-		}
-
-		if (dentry->d_flags & DCACHE_REFERENCED) {
-			dentry->d_flags &= ~DCACHE_REFERENCED;
-			list_move(&dentry->d_lru, &referenced);
-			spin_unlock(&dentry->d_lock);
-		} else {
-			list_move_tail(&dentry->d_lru, &tmp);
-			dentry->d_flags |= DCACHE_SHRINK_LIST;
-			this_cpu_dec(nr_dentry_unused);
-			sb->s_nr_dentry_unused--;
-			spin_unlock(&dentry->d_lock);
-			freed++;
-			if (!--nr_to_scan)
-				break;
-		}
-		cond_resched_lock(&sb->s_dentry_lru_lock);
-	}
-	if (!list_empty(&referenced))
-		list_splice(&referenced, &sb->s_dentry_lru);
-	spin_unlock(&sb->s_dentry_lru_lock);
+	LIST_HEAD(dispose);
+	long freed;
 
-	shrink_dentry_list(&tmp);
+	freed = list_lru_walk(&sb->s_dentry_lru, dentry_lru_isolate,
+			      &dispose, nr_to_scan);
+	shrink_dentry_list(&dispose);
 	return freed;
 }
 
@@ -941,24 +923,10 @@ shrink_dcache_list(
  */
 void shrink_dcache_sb(struct super_block *sb)
 {
-	LIST_HEAD(tmp);
-
-	spin_lock(&sb->s_dentry_lru_lock);
-	while (!list_empty(&sb->s_dentry_lru)) {
-		list_splice_init(&sb->s_dentry_lru, &tmp);
-
-		/*
-		 * account for removal here so we don't need to handle it later
-		 * even though the dentry is no longer on the lru list.
-		 */
-		this_cpu_sub(nr_dentry_unused, sb->s_nr_dentry_unused);
-		sb->s_nr_dentry_unused = 0;
+	long freed;
 
-		spin_unlock(&sb->s_dentry_lru_lock);
-		shrink_dcache_list(&tmp);
-		spin_lock(&sb->s_dentry_lru_lock);
-	}
-	spin_unlock(&sb->s_dentry_lru_lock);
+	freed = list_lru_dispose_all(&sb->s_dentry_lru, shrink_dcache_list);
+	this_cpu_sub(nr_dentry_unused, freed);
 }
 EXPORT_SYMBOL(shrink_dcache_sb);
 
@@ -1229,7 +1197,8 @@ resume:
 		if (dentry->d_count) {
 			dentry_lru_del(dentry);
 		} else if (!(dentry->d_flags & DCACHE_SHRINK_LIST)) {
-			dentry_lru_move_list(dentry, dispose);
+			dentry_lru_del(dentry);
+			list_add_tail(&dentry->d_lru, dispose);
 			dentry->d_flags |= DCACHE_SHRINK_LIST;
 			found++;
 		}
diff --git a/fs/super.c b/fs/super.c
index 9049110..66f5cde 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -78,11 +78,11 @@ static long super_cache_scan(struct shrinker *shrink, struct shrink_control *sc)
 		fs_objects = sb->s_op->nr_cached_objects(sb);
 
 	inodes = list_lru_count(&sb->s_inode_lru);
-	total_objects = sb->s_nr_dentry_unused + inodes + fs_objects + 1;
+	dentries = list_lru_count(&sb->s_dentry_lru);
+	total_objects = dentries + inodes + fs_objects + 1;
 
 	/* proportion the scan between the caches */
-	dentries = mult_frac(sc->nr_to_scan, sb->s_nr_dentry_unused,
-								total_objects);
+	dentries = mult_frac(sc->nr_to_scan, dentries, total_objects);
 	inodes = mult_frac(sc->nr_to_scan, inodes, total_objects);
 
 	/*
@@ -115,7 +115,7 @@ static long super_cache_count(struct shrinker *shrink, struct shrink_control *sc
 	if (sb->s_op && sb->s_op->nr_cached_objects)
 		total_objects = sb->s_op->nr_cached_objects(sb);
 
-	total_objects += sb->s_nr_dentry_unused;
+	total_objects += list_lru_count(&sb->s_dentry_lru);
 	total_objects += list_lru_count(&sb->s_inode_lru);
 
 	total_objects = vfs_pressure_ratio(total_objects);
@@ -195,8 +195,7 @@ static struct super_block *alloc_super(struct file_system_type *type, int flags)
 		INIT_HLIST_NODE(&s->s_instances);
 		INIT_HLIST_BL_HEAD(&s->s_anon);
 		INIT_LIST_HEAD(&s->s_inodes);
-		INIT_LIST_HEAD(&s->s_dentry_lru);
-		spin_lock_init(&s->s_dentry_lru_lock);
+		list_lru_init(&s->s_dentry_lru);
 		list_lru_init(&s->s_inode_lru);
 		INIT_LIST_HEAD(&s->s_mounts);
 		init_rwsem(&s->s_umount);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index fdeaca1..8b25de0 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1262,14 +1262,6 @@ struct super_block {
 	struct list_head	s_files;
 #endif
 	struct list_head	s_mounts;	/* list of mounts; _not_ for fs use */
-
-	/* s_dentry_lru_lock protects s_dentry_lru and s_nr_dentry_unused */
-	spinlock_t		s_dentry_lru_lock ____cacheline_aligned_in_smp;
-	struct list_head	s_dentry_lru;	/* unused dentry lru */
-	int			s_nr_dentry_unused;	/* # of dentry on lru */
-
-	struct list_lru		s_inode_lru ____cacheline_aligned_in_smp;
-
 	struct block_device	*s_bdev;
 	struct backing_dev_info *s_bdi;
 	struct mtd_info		*s_mtd;
@@ -1320,6 +1312,13 @@ struct super_block {
 
 	/* Being remounted read-only */
 	int s_readonly_remount;
+
+	/*
+	 * Keep the lru lists last in the structure so they always sit on their
+	 * own individual cachelines.
+	 */
+	struct list_lru		s_dentry_lru ____cacheline_aligned_in_smp;
+	struct list_lru		s_inode_lru ____cacheline_aligned_in_smp;
 };
 
 extern struct timespec current_fs_time(struct super_block *sb);
-- 
1.8.1.4
* [PATCH v2 11/28] list_lru: per-node list infrastructure
  2013-03-29  9:13 [PATCH v2 00/28] memcg-aware slab shrinking Glauber Costa
                   ` (9 preceding siblings ...)
  2013-03-29  9:13 ` [PATCH v2 10/28] dcache: convert to use new lru list infrastructure Glauber Costa
@ 2013-03-29  9:13 ` Glauber Costa
  2013-03-29  9:13 ` [PATCH v2 12/28] shrinker: add node awareness Glauber Costa
                   ` (17 subsequent siblings)
  28 siblings, 0 replies; 97+ messages in thread
From: Glauber Costa @ 2013-03-29  9:13 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-fsdevel, containers, Michal Hocko, Johannes Weiner,
	kamezawa.hiroyu, Andrew Morton, Dave Shrinnker, Greg Thelen,
	hughd, yinghan, Dave Chinner
From: Dave Chinner <dchinner@redhat.com>
Now that we have an LRU list API, we can start to enhance the
implementation.  This splits the single LRU list into per-node lists
and locks to enhance scalability. Items are placed on lists
according to the node the memory belongs to. To make scanning the
lists efficient, also track whether the per-node lists have entries
in them in an active nodemask.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
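Note (illustration only, not part of the patch): the bookkeeping described
above can be sketched in userspace C roughly as follows. Every toy_ name is
hypothetical. The node id is stored in the item as a stand-in for
page_to_nid(virt_to_page(item)), an unsigned long replaces nodemask_t, and
the per-node spinlocks are left out, so this models only the list and
active-mask accounting, not the locking.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_MAX_NODES 4		/* stand-in for MAX_NUMNODES */

struct toy_item {
	struct toy_item *prev, *next;	/* self-linked when not on a list */
	int nid;	/* assumption: replaces page_to_nid(virt_to_page()) */
};

struct toy_lru_node {
	struct toy_item head;		/* list sentinel */
	long nr_items;
};

struct toy_lru {
	struct toy_lru_node node[TOY_MAX_NODES];
	unsigned long active_nodes;	/* bit n set <=> node[n] non-empty */
};

static void toy_item_init(struct toy_item *it, int nid)
{
	it->prev = it->next = it;
	it->nid = nid;
}

static void toy_lru_init(struct toy_lru *lru)
{
	lru->active_nodes = 0;
	for (int i = 0; i < TOY_MAX_NODES; i++) {
		toy_item_init(&lru->node[i].head, i);
		lru->node[i].nr_items = 0;
	}
}

/* Returns true if the item was added, false if it was already listed. */
static bool toy_lru_add(struct toy_lru *lru, struct toy_item *it)
{
	struct toy_lru_node *nlru = &lru->node[it->nid];

	if (it->next != it)
		return false;
	/* link at the tail of the per-node list */
	it->prev = nlru->head.prev;
	it->next = &nlru->head;
	nlru->head.prev->next = it;
	nlru->head.prev = it;
	/* first item on this node: mark the node active */
	if (nlru->nr_items++ == 0)
		lru->active_nodes |= 1UL << it->nid;
	return true;
}

/* Returns true if the item was removed, false if it was not listed. */
static bool toy_lru_del(struct toy_lru *lru, struct toy_item *it)
{
	struct toy_lru_node *nlru = &lru->node[it->nid];

	if (it->next == it)
		return false;
	it->prev->next = it->next;
	it->next->prev = it->prev;
	it->prev = it->next = it;
	/* last item left this node: mark the node inactive */
	if (--nlru->nr_items == 0)
		lru->active_nodes &= ~(1UL << it->nid);
	assert(nlru->nr_items >= 0);
	return true;
}

/* Sum the counts of active nodes only; empty nodes are never visited. */
static long toy_lru_count(const struct toy_lru *lru)
{
	long count = 0;

	for (int nid = 0; nid < TOY_MAX_NODES; nid++)
		if (lru->active_nodes & (1UL << nid))
			count += lru->node[nid].nr_items;
	return count;
}
```

The point of the mask is that counting and walking only touch nodes that
hold items, which matters when MAX_NUMNODES is large and most nodes are
empty.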
 include/linux/list_lru.h |  14 +++--
 lib/list_lru.c           | 160 +++++++++++++++++++++++++++++++++++------------
 2 files changed, 129 insertions(+), 45 deletions(-)
diff --git a/include/linux/list_lru.h b/include/linux/list_lru.h
index 3423949..b0e3ba25 100644
--- a/include/linux/list_lru.h
+++ b/include/linux/list_lru.h
@@ -8,21 +8,23 @@
 #define _LRU_LIST_H 0
 
 #include <linux/list.h>
+#include <linux/nodemask.h>
 
-struct list_lru {
+struct list_lru_node {
 	spinlock_t		lock;
 	struct list_head	list;
 	long			nr_items;
+} ____cacheline_aligned_in_smp;
+
+struct list_lru {
+	struct list_lru_node	node[MAX_NUMNODES];
+	nodemask_t		active_nodes;
 };
 
 int list_lru_init(struct list_lru *lru);
 int list_lru_add(struct list_lru *lru, struct list_head *item);
 int list_lru_del(struct list_lru *lru, struct list_head *item);
-
-static inline long list_lru_count(struct list_lru *lru)
-{
-	return lru->nr_items;
-}
+long list_lru_count(struct list_lru *lru);
 
 typedef int (*list_lru_walk_cb)(struct list_head *item, spinlock_t *lock,
 				void *cb_arg);
diff --git a/lib/list_lru.c b/lib/list_lru.c
index 475d0e9..881e342 100644
--- a/lib/list_lru.c
+++ b/lib/list_lru.c
@@ -6,6 +6,7 @@
  */
 #include <linux/kernel.h>
 #include <linux/module.h>
+#include <linux/mm.h>
 #include <linux/list_lru.h>
 
 int
@@ -13,14 +14,19 @@ list_lru_add(
 	struct list_lru	*lru,
 	struct list_head *item)
 {
-	spin_lock(&lru->lock);
+	int nid = page_to_nid(virt_to_page(item));
+	struct list_lru_node *nlru = &lru->node[nid];
+
+	spin_lock(&nlru->lock);
+	BUG_ON(nlru->nr_items < 0);
 	if (list_empty(item)) {
-		list_add_tail(item, &lru->list);
-		lru->nr_items++;
-		spin_unlock(&lru->lock);
+		list_add_tail(item, &nlru->list);
+		if (nlru->nr_items++ == 0)
+			node_set(nid, lru->active_nodes);
+		spin_unlock(&nlru->lock);
 		return 1;
 	}
-	spin_unlock(&lru->lock);
+	spin_unlock(&nlru->lock);
 	return 0;
 }
 EXPORT_SYMBOL_GPL(list_lru_add);
@@ -30,43 +36,72 @@ list_lru_del(
 	struct list_lru	*lru,
 	struct list_head *item)
 {
-	spin_lock(&lru->lock);
+	int nid = page_to_nid(virt_to_page(item));
+	struct list_lru_node *nlru = &lru->node[nid];
+
+	spin_lock(&nlru->lock);
 	if (!list_empty(item)) {
 		list_del_init(item);
-		lru->nr_items--;
-		spin_unlock(&lru->lock);
+		if (--nlru->nr_items == 0)
+			node_clear(nid, lru->active_nodes);
+		BUG_ON(nlru->nr_items < 0);
+		spin_unlock(&nlru->lock);
 		return 1;
 	}
-	spin_unlock(&lru->lock);
+	spin_unlock(&nlru->lock);
 	return 0;
 }
 EXPORT_SYMBOL_GPL(list_lru_del);
 
 long
-list_lru_walk(
-	struct list_lru *lru,
-	list_lru_walk_cb isolate,
-	void		*cb_arg,
-	long		nr_to_walk)
+list_lru_count(
+	struct list_lru *lru)
 {
+	long count = 0;
+	int nid;
+
+	for_each_node_mask(nid, lru->active_nodes) {
+		struct list_lru_node *nlru = &lru->node[nid];
+
+		spin_lock(&nlru->lock);
+		BUG_ON(nlru->nr_items < 0);
+		count += nlru->nr_items;
+		spin_unlock(&nlru->lock);
+	}
+
+	return count;
+}
+EXPORT_SYMBOL_GPL(list_lru_count);
+
+static long
+list_lru_walk_node(
+	struct list_lru		*lru,
+	int			nid,
+	list_lru_walk_cb	isolate,
+	void			*cb_arg,
+	long			*nr_to_walk)
+{
+	struct list_lru_node	*nlru = &lru->node[nid];
 	struct list_head *item, *n;
-	long removed = 0;
+	long isolated = 0;
 restart:
-	spin_lock(&lru->lock);
-	list_for_each_safe(item, n, &lru->list) {
+	spin_lock(&nlru->lock);
+	list_for_each_safe(item, n, &nlru->list) {
 		int ret;
 
-		if (nr_to_walk-- < 0)
+		if ((*nr_to_walk)-- < 0)
 			break;
 
-		ret = isolate(item, &lru->lock, cb_arg);
+		ret = isolate(item, &nlru->lock, cb_arg);
 		switch (ret) {
 		case 0:	/* item removed from list */
-			lru->nr_items--;
-			removed++;
+			if (--nlru->nr_items == 0)
+				node_clear(nid, lru->active_nodes);
+			BUG_ON(nlru->nr_items < 0);
+			isolated++;
 			break;
 		case 1: /* item referenced, give another pass */
-			list_move_tail(item, &lru->list);
+			list_move_tail(item, &nlru->list);
 			break;
 		case 2: /* item cannot be locked, skip */
 			break;
@@ -76,42 +111,89 @@ restart:
 			BUG();
 		}
 	}
-	spin_unlock(&lru->lock);
-	return removed;
+	spin_unlock(&nlru->lock);
+	return isolated;
+}
+
+long
+list_lru_walk(
+	struct list_lru	*lru,
+	list_lru_walk_cb isolate,
+	void		*cb_arg,
+	long		nr_to_walk)
+{
+	long isolated = 0;
+	int nid;
+
+	for_each_node_mask(nid, lru->active_nodes) {
+		isolated += list_lru_walk_node(lru, nid, isolate,
+					       cb_arg, &nr_to_walk);
+		if (nr_to_walk <= 0)
+			break;
+	}
+	return isolated;
 }
 EXPORT_SYMBOL_GPL(list_lru_walk);
 
 long
-list_lru_dispose_all(
-	struct list_lru *lru,
-	list_lru_dispose_cb dispose)
+list_lru_dispose_all_node(
+	struct list_lru		*lru,
+	int			nid,
+	list_lru_dispose_cb	dispose)
 {
-	long disposed = 0;
+	struct list_lru_node	*nlru = &lru->node[nid];
 	LIST_HEAD(dispose_list);
+	long disposed = 0;
 
-	spin_lock(&lru->lock);
-	while (!list_empty(&lru->list)) {
-		list_splice_init(&lru->list, &dispose_list);
-		disposed += lru->nr_items;
-		lru->nr_items = 0;
-		spin_unlock(&lru->lock);
+	spin_lock(&nlru->lock);
+	while (!list_empty(&nlru->list)) {
+		list_splice_init(&nlru->list, &dispose_list);
+		disposed += nlru->nr_items;
+		nlru->nr_items = 0;
+		node_clear(nid, lru->active_nodes);
+		spin_unlock(&nlru->lock);
 
 		dispose(&dispose_list);
 
-		spin_lock(&lru->lock);
+		spin_lock(&nlru->lock);
 	}
-	spin_unlock(&lru->lock);
+	spin_unlock(&nlru->lock);
 	return disposed;
 }
 
+long
+list_lru_dispose_all(
+	struct list_lru		*lru,
+	list_lru_dispose_cb	dispose)
+{
+	long disposed;
+	long total = 0;
+	int nid;
+
+	do {
+		disposed = 0;
+		for_each_node_mask(nid, lru->active_nodes) {
+			disposed += list_lru_dispose_all_node(lru, nid,
+							      dispose);
+		}
+		total += disposed;
+	} while (disposed != 0);
+
+	return total;
+}
+
 int
 list_lru_init(
 	struct list_lru	*lru)
 {
-	spin_lock_init(&lru->lock);
-	INIT_LIST_HEAD(&lru->list);
-	lru->nr_items = 0;
+	int i;
 
+	nodes_clear(lru->active_nodes);
+	for (i = 0; i < MAX_NUMNODES; i++) {
+		spin_lock_init(&lru->node[i].lock);
+		INIT_LIST_HEAD(&lru->node[i].list);
+		lru->node[i].nr_items = 0;
+	}
 	return 0;
 }
 EXPORT_SYMBOL_GPL(list_lru_init);
-- 
1.8.1.4
* [PATCH v2 12/28] shrinker: add node awareness
  2013-03-29  9:13 [PATCH v2 00/28] memcg-aware slab shrinking Glauber Costa
                   ` (10 preceding siblings ...)
  2013-03-29  9:13 ` [PATCH v2 11/28] list_lru: per-node " Glauber Costa
@ 2013-03-29  9:13 ` Glauber Costa
  2013-03-29  9:13 ` [PATCH v2 13/28] fs: convert inode and dentry shrinking to be node aware Glauber Costa
                   ` (16 subsequent siblings)
  28 siblings, 0 replies; 97+ messages in thread
From: Glauber Costa @ 2013-03-29  9:13 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-fsdevel, containers, Michal Hocko, Johannes Weiner,
	kamezawa.hiroyu, Andrew Morton, Dave Shrinnker, Greg Thelen,
	hughd, yinghan, Dave Chinner
From: Dave Chinner <dchinner@redhat.com>
Pass the node of the current zone being reclaimed to shrink_slab(),
allowing the shrinker control nodemask to be set appropriately for
node aware shrinkers.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
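Note (illustration only, not part of the patch): a userspace sketch of how
the three kinds of caller touched below might fill in the mask. All toy_
names are hypothetical, an unsigned long stands in for nodemask_t, and the
zonelist is reduced to a plain array of node ids.

```c
#include <assert.h>

#define TOY_MAX_NODES 8		/* stand-in for MAX_NUMNODES */

struct toy_shrink_control {
	long nr_to_scan;
	unsigned long nodes_to_scan;	/* stand-in for nodemask_t */
};

/* drop_slab()-style caller: every node is a candidate (nodes_setall). */
static void toy_scan_all_nodes(struct toy_shrink_control *sc)
{
	sc->nodes_to_scan = (1UL << TOY_MAX_NODES) - 1;
}

/* kswapd / zone-reclaim-style caller: only the node being reclaimed. */
static void toy_scan_one_node(struct toy_shrink_control *sc, int nid)
{
	sc->nodes_to_scan = 0;			/* nodes_clear() */
	sc->nodes_to_scan |= 1UL << nid;	/* node_set() */
}

/* direct-reclaim-style caller: union of the nodes of all eligible zones. */
static void toy_scan_zonelist(struct toy_shrink_control *sc,
			      const int *zone_nids, int nr_zones)
{
	sc->nodes_to_scan = 0;
	for (int i = 0; i < nr_zones; i++)
		sc->nodes_to_scan |= 1UL << zone_nids[i];
}

/* A node-aware shrinker would test the mask before touching a node. */
static int toy_node_selected(const struct toy_shrink_control *sc, int nid)
{
	return (sc->nodes_to_scan >> nid) & 1;
}
```

Shrinkers that are not node aware can simply ignore the new field, which is
why adding it here does not require converting any shrinker in this patch.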
 fs/drop_caches.c         |  1 +
 include/linux/shrinker.h |  3 +++
 mm/memory-failure.c      |  2 ++
 mm/vmscan.c              | 12 +++++++++---
 4 files changed, 15 insertions(+), 3 deletions(-)
diff --git a/fs/drop_caches.c b/fs/drop_caches.c
index c00e055..9fd702f 100644
--- a/fs/drop_caches.c
+++ b/fs/drop_caches.c
@@ -44,6 +44,7 @@ static void drop_slab(void)
 		.gfp_mask = GFP_KERNEL,
 	};
 
+	nodes_setall(shrink.nodes_to_scan);
 	do {
 		nr_objects = shrink_slab(&shrink, 1000, 1000);
 	} while (nr_objects > 10);
diff --git a/include/linux/shrinker.h b/include/linux/shrinker.h
index 4f59615..e71286f 100644
--- a/include/linux/shrinker.h
+++ b/include/linux/shrinker.h
@@ -16,6 +16,9 @@ struct shrink_control {
 
 	/* How many slab objects shrinker() should scan and try to reclaim */
 	long nr_to_scan;
+
+	/* shrink from these nodes */
+	nodemask_t nodes_to_scan;
 };
 
 /*
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index df0694c..857377e 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -248,10 +248,12 @@ void shake_page(struct page *p, int access)
 	 */
 	if (access) {
 		int nr;
+		int nid = page_to_nid(p);
 		do {
 			struct shrink_control shrink = {
 				.gfp_mask = GFP_KERNEL,
 			};
+			node_set(nid, shrink.nodes_to_scan);
 
 			nr = shrink_slab(&shrink, 1000, 1000);
 			if (page_count(p) == 1)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 64b0157..6926e09 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2191,15 +2191,20 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
 		 */
 		if (global_reclaim(sc)) {
 			unsigned long lru_pages = 0;
+
+			nodes_clear(shrink->nodes_to_scan);
 			for_each_zone_zonelist(zone, z, zonelist,
 					gfp_zone(sc->gfp_mask)) {
 				if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL))
 					continue;
 
 				lru_pages += zone_reclaimable_pages(zone);
+				node_set(zone_to_nid(zone),
+					 shrink->nodes_to_scan);
 			}
 
 			shrink_slab(shrink, sc->nr_scanned, lru_pages);
+
 			if (reclaim_state) {
 				sc->nr_reclaimed += reclaim_state->reclaimed_slab;
 				reclaim_state->reclaimed_slab = 0;
@@ -2778,6 +2783,8 @@ loop_again:
 				shrink_zone(zone, &sc);
 
 				reclaim_state->reclaimed_slab = 0;
+				nodes_clear(shrink.nodes_to_scan);
+				node_set(zone_to_nid(zone), shrink.nodes_to_scan);
 				nr_slab = shrink_slab(&shrink, sc.nr_scanned, lru_pages);
 				sc.nr_reclaimed += reclaim_state->reclaimed_slab;
 				total_scanned += sc.nr_scanned;
@@ -3364,10 +3371,9 @@ static int __zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
 		 * number of slab pages and shake the slab until it is reduced
 		 * by the same nr_pages that we used for reclaiming unmapped
 		 * pages.
-		 *
-		 * Note that shrink_slab will free memory on all zones and may
-		 * take a long time.
 		 */
+		nodes_clear(shrink.nodes_to_scan);
+		node_set(zone_to_nid(zone), shrink.nodes_to_scan);
 		for (;;) {
 			unsigned long lru_pages = zone_reclaimable_pages(zone);
 
-- 
1.8.1.4
* [PATCH v2 13/28] fs: convert inode and dentry shrinking to be node aware
  2013-03-29  9:13 [PATCH v2 00/28] memcg-aware slab shrinking Glauber Costa
                   ` (11 preceding siblings ...)
  2013-03-29  9:13 ` [PATCH v2 12/28] shrinker: add node awareness Glauber Costa
@ 2013-03-29  9:13 ` Glauber Costa
  2013-03-29  9:13 ` [PATCH v2 14/28] xfs: convert buftarg LRU to generic code Glauber Costa
                   ` (15 subsequent siblings)
  28 siblings, 0 replies; 97+ messages in thread
From: Glauber Costa @ 2013-03-29  9:13 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-fsdevel, containers, Michal Hocko, Johannes Weiner,
	kamezawa.hiroyu, Andrew Morton, Dave Shrinnker, Greg Thelen,
	hughd, yinghan, Dave Chinner
From: Dave Chinner <dchinner@redhat.com>
Now that the shrinker is passing a nodemask in the scan control
structure, we can pass this to the generic LRU list code to
isolate reclaim to the lists on matching nodes.
This requires a small amount of refactoring of the LRU list API,
which might be best split out into a separate patch.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
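Note (illustration only, not part of the patch): a guess at the shape of the
nodemask-restricted count and walk, modelled in userspace C. The toy_ names
are hypothetical and do not reproduce list_lru_count_nodemask() or
list_lru_walk_nodemask(). Lists are reduced to per-node counters, every
object is assumed freeable, and an unsigned long stands in for nodemask_t.

```c
#include <assert.h>

#define TOY_MAX_NODES 4		/* stand-in for MAX_NUMNODES */

struct toy_lru {
	long nr_items[TOY_MAX_NODES];	/* per-node object counts */
	unsigned long active_nodes;	/* bit n set <=> nr_items[n] > 0 */
};

/* Count objects only on the nodes named in nodes_to_count. */
static long toy_lru_count_nodemask(const struct toy_lru *lru,
				   unsigned long nodes_to_count)
{
	long count = 0;

	for (int nid = 0; nid < TOY_MAX_NODES; nid++)
		if (nodes_to_count & (1UL << nid))
			count += lru->nr_items[nid];
	return count;
}

/*
 * Take up to nr_to_walk objects, visiting only nodes that are both
 * requested and active. The budget is shared across nodes.
 */
static long toy_lru_walk_nodemask(struct toy_lru *lru, long nr_to_walk,
				  unsigned long nodes_to_walk)
{
	long isolated = 0;

	for (int nid = 0; nid < TOY_MAX_NODES && nr_to_walk > 0; nid++) {
		long n;

		if (!(nodes_to_walk & lru->active_nodes & (1UL << nid)))
			continue;
		n = lru->nr_items[nid] < nr_to_walk ?
			lru->nr_items[nid] : nr_to_walk;
		lru->nr_items[nid] -= n;
		if (lru->nr_items[nid] == 0)
			lru->active_nodes &= ~(1UL << nid);
		nr_to_walk -= n;
		isolated += n;
	}
	return isolated;
}
```

In this model the count and the walk take the same mask, so the proportion
computed in super_cache_scan() is based only on objects that the subsequent
walk is allowed to reclaim.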
 fs/dcache.c              |  7 ++++---
 fs/inode.c               |  7 ++++---
 fs/internal.h            |  6 ++++--
 fs/super.c               | 22 +++++++++++++---------
 fs/xfs/xfs_super.c       |  6 ++++--
 include/linux/fs.h       |  4 ++--
 include/linux/list_lru.h | 19 ++++++++++++++++---
 lib/list_lru.c           | 18 ++++++++++--------
 8 files changed, 57 insertions(+), 32 deletions(-)
diff --git a/fs/dcache.c b/fs/dcache.c
index b59d341..79f6820 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -881,13 +881,14 @@ static int dentry_lru_isolate(struct list_head *item, spinlock_t *lru_lock,
  * This function may fail to free any resources if all the dentries are in
  * use.
  */
-long prune_dcache_sb(struct super_block *sb, long nr_to_scan)
+long prune_dcache_sb(struct super_block *sb, long nr_to_scan,
+		     nodemask_t *nodes_to_walk)
 {
 	LIST_HEAD(dispose);
 	long freed;
 
-	freed = list_lru_walk(&sb->s_dentry_lru, dentry_lru_isolate,
-			      &dispose, nr_to_scan);
+	freed = list_lru_walk_nodemask(&sb->s_dentry_lru, dentry_lru_isolate,
+				       &dispose, nr_to_scan, nodes_to_walk);
 	shrink_dentry_list(&dispose);
 	return freed;
 }
diff --git a/fs/inode.c b/fs/inode.c
index 18505c5..1332eef 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -745,13 +745,14 @@ static int inode_lru_isolate(struct list_head *item, spinlock_t *lru_lock,
  * to trim from the LRU. Inodes to be freed are moved to a temporary list and
  * then are freed outside inode_lock by dispose_list().
  */
-long prune_icache_sb(struct super_block *sb, long nr_to_scan)
+long prune_icache_sb(struct super_block *sb, long nr_to_scan,
+		     nodemask_t *nodes_to_walk)
 {
 	LIST_HEAD(freeable);
 	long freed;
 
-	freed = list_lru_walk(&sb->s_inode_lru, inode_lru_isolate,
-						&freeable, nr_to_scan);
+	freed = list_lru_walk_nodemask(&sb->s_inode_lru, inode_lru_isolate,
+				       &freeable, nr_to_scan, nodes_to_walk);
 	dispose_list(&freeable);
 	return freed;
 }
diff --git a/fs/internal.h b/fs/internal.h
index 5099f87..ed6944e 100644
--- a/fs/internal.h
+++ b/fs/internal.h
@@ -110,7 +110,8 @@ extern int open_check_o_direct(struct file *f);
  * inode.c
  */
 extern spinlock_t inode_sb_list_lock;
-extern long prune_icache_sb(struct super_block *sb, long nr_to_scan);
+extern long prune_icache_sb(struct super_block *sb, long nr_to_scan,
+			    nodemask_t *nodes_to_scan);
 extern void inode_add_lru(struct inode *inode);
 
 /*
@@ -126,4 +127,5 @@ extern int invalidate_inodes(struct super_block *, bool);
  * dcache.c
  */
 extern struct dentry *__d_alloc(struct super_block *, const struct qstr *);
-extern long prune_dcache_sb(struct super_block *sb, long nr_to_scan);
+extern long prune_dcache_sb(struct super_block *sb, long nr_to_scan,
+			    nodemask_t *nodes_to_scan);
diff --git a/fs/super.c b/fs/super.c
index 66f5cde..5c7b879 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -75,10 +75,10 @@ static long super_cache_scan(struct shrinker *shrink, struct shrink_control *sc)
 		return -1;
 
 	if (sb->s_op && sb->s_op->nr_cached_objects)
-		fs_objects = sb->s_op->nr_cached_objects(sb);
+		fs_objects = sb->s_op->nr_cached_objects(sb, &sc->nodes_to_scan);
 
-	inodes = list_lru_count(&sb->s_inode_lru);
-	dentries = list_lru_count(&sb->s_dentry_lru);
+	inodes = list_lru_count_nodemask(&sb->s_inode_lru, &sc->nodes_to_scan);
+	dentries = list_lru_count_nodemask(&sb->s_dentry_lru, &sc->nodes_to_scan);
 	total_objects = dentries + inodes + fs_objects + 1;
 
 	/* proportion the scan between the caches */
@@ -89,13 +89,14 @@ static long super_cache_scan(struct shrinker *shrink, struct shrink_control *sc)
 	 * prune the dcache first as the icache is pinned by it, then
 	 * prune the icache, followed by the filesystem specific caches
 	 */
-	freed = prune_dcache_sb(sb, dentries);
-	freed += prune_icache_sb(sb, inodes);
+	freed = prune_dcache_sb(sb, dentries, &sc->nodes_to_scan);
+	freed += prune_icache_sb(sb, inodes, &sc->nodes_to_scan);
 
 	if (fs_objects) {
 		fs_objects = mult_frac(sc->nr_to_scan, fs_objects,
 								total_objects);
-		freed += sb->s_op->free_cached_objects(sb, fs_objects);
+		freed += sb->s_op->free_cached_objects(sb, fs_objects,
+						       &sc->nodes_to_scan);
 	}
 
 	drop_super(sb);
@@ -113,10 +114,13 @@ static long super_cache_count(struct shrinker *shrink, struct shrink_control *sc
 		return -1;
 
 	if (sb->s_op && sb->s_op->nr_cached_objects)
-		total_objects = sb->s_op->nr_cached_objects(sb);
+		total_objects = sb->s_op->nr_cached_objects(sb,
+						 &sc->nodes_to_scan);
 
-	total_objects += list_lru_count(&sb->s_dentry_lru);
-	total_objects += list_lru_count(&sb->s_inode_lru);
+	total_objects += list_lru_count_nodemask(&sb->s_dentry_lru,
+						 &sc->nodes_to_scan);
+	total_objects += list_lru_count_nodemask(&sb->s_inode_lru,
+						 &sc->nodes_to_scan);
 
 	total_objects = vfs_pressure_ratio(total_objects);
 	drop_super(sb);
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index 1ff991b..7fa6021 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -1525,7 +1525,8 @@ xfs_fs_mount(
 
 static long
 xfs_fs_nr_cached_objects(
-	struct super_block	*sb)
+	struct super_block	*sb,
+	nodemask_t		*nodes_to_count)
 {
 	return xfs_reclaim_inodes_count(XFS_M(sb));
 }
@@ -1533,7 +1534,8 @@ xfs_fs_nr_cached_objects(
 static long
 xfs_fs_free_cached_objects(
 	struct super_block	*sb,
-	long			nr_to_scan)
+	long			nr_to_scan,
+	nodemask_t		*nodes_to_scan)
 {
 	return xfs_reclaim_inodes_nr(XFS_M(sb), nr_to_scan);
 }
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 8b25de0..306c83e 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1607,8 +1607,8 @@ struct super_operations {
 	ssize_t (*quota_write)(struct super_block *, int, const char *, size_t, loff_t);
 #endif
 	int (*bdev_try_to_free_page)(struct super_block*, struct page*, gfp_t);
-	long (*nr_cached_objects)(struct super_block *);
-	long (*free_cached_objects)(struct super_block *, long);
+	long (*nr_cached_objects)(struct super_block *, nodemask_t *);
+	long (*free_cached_objects)(struct super_block *, long, nodemask_t *);
 };
 
 /*
diff --git a/include/linux/list_lru.h b/include/linux/list_lru.h
index b0e3ba25..02796da 100644
--- a/include/linux/list_lru.h
+++ b/include/linux/list_lru.h
@@ -24,14 +24,27 @@ struct list_lru {
 int list_lru_init(struct list_lru *lru);
 int list_lru_add(struct list_lru *lru, struct list_head *item);
 int list_lru_del(struct list_lru *lru, struct list_head *item);
-long list_lru_count(struct list_lru *lru);
+long list_lru_count_nodemask(struct list_lru *lru, nodemask_t *nodes_to_count);
+
+static inline long list_lru_count(struct list_lru *lru)
+{
+	return list_lru_count_nodemask(lru, &lru->active_nodes);
+}
+
 
 typedef int (*list_lru_walk_cb)(struct list_head *item, spinlock_t *lock,
 				void *cb_arg);
 typedef void (*list_lru_dispose_cb)(struct list_head *dispose_list);
 
-long list_lru_walk(struct list_lru *lru, list_lru_walk_cb isolate,
-		   void *cb_arg, long nr_to_walk);
+long list_lru_walk_nodemask(struct list_lru *lru, list_lru_walk_cb isolate,
+		   void *cb_arg, long nr_to_walk, nodemask_t *nodes_to_walk);
+
+static inline long list_lru_walk(struct list_lru *lru, list_lru_walk_cb isolate,
+				 void *cb_arg, long nr_to_walk)
+{
+	return list_lru_walk_nodemask(lru, isolate, cb_arg, nr_to_walk,
+				      &lru->active_nodes);
+}
 
 long list_lru_dispose_all(struct list_lru *lru, list_lru_dispose_cb dispose);
 
diff --git a/lib/list_lru.c b/lib/list_lru.c
index 881e342..0f08ed6 100644
--- a/lib/list_lru.c
+++ b/lib/list_lru.c
@@ -54,13 +54,14 @@ list_lru_del(
 EXPORT_SYMBOL_GPL(list_lru_del);
 
 long
-list_lru_count(
-	struct list_lru *lru)
+list_lru_count_nodemask(
+	struct list_lru *lru,
+	nodemask_t	*nodes_to_count)
 {
 	long count = 0;
 	int nid;
 
-	for_each_node_mask(nid, lru->active_nodes) {
+	for_each_node_mask(nid, *nodes_to_count) {
 		struct list_lru_node *nlru = &lru->node[nid];
 
 		spin_lock(&nlru->lock);
@@ -71,7 +72,7 @@ list_lru_count(
 
 	return count;
 }
-EXPORT_SYMBOL_GPL(list_lru_count);
+EXPORT_SYMBOL_GPL(list_lru_count_nodemask);
 
 static long
 list_lru_walk_node(
@@ -116,16 +117,17 @@ restart:
 }
 
 long
-list_lru_walk(
+list_lru_walk_nodemask(
 	struct list_lru	*lru,
 	list_lru_walk_cb isolate,
 	void		*cb_arg,
-	long		nr_to_walk)
+	long		nr_to_walk,
+	nodemask_t	*nodes_to_walk)
 {
 	long isolated = 0;
 	int nid;
 
-	for_each_node_mask(nid, lru->active_nodes) {
+	for_each_node_mask(nid, *nodes_to_walk) {
 		isolated += list_lru_walk_node(lru, nid, isolate,
 					       cb_arg, &nr_to_walk);
 		if (nr_to_walk <= 0)
@@ -133,7 +135,7 @@ list_lru_walk(
 	}
 	return isolated;
 }
-EXPORT_SYMBOL_GPL(list_lru_walk);
+EXPORT_SYMBOL_GPL(list_lru_walk_nodemask);
 
 long
 list_lru_dispose_all_node(
-- 
1.8.1.4
^ permalink raw reply related	[flat|nested] 97+ messages in thread
* [PATCH v2 14/28] xfs: convert buftarg LRU to generic code
  2013-03-29  9:13 [PATCH v2 00/28] memcg-aware slab shrinking Glauber Costa
                   ` (12 preceding siblings ...)
  2013-03-29  9:13 ` [PATCH v2 13/28] fs: convert inode and dentry shrinking to be node aware Glauber Costa
@ 2013-03-29  9:13 ` Glauber Costa
  2013-03-29  9:13 ` [PATCH v2 15/28] xfs: convert dquot cache lru to list_lru Glauber Costa
                   ` (14 subsequent siblings)
  28 siblings, 0 replies; 97+ messages in thread
From: Glauber Costa @ 2013-03-29  9:13 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-fsdevel, containers, Michal Hocko, Johannes Weiner,
	kamezawa.hiroyu, Andrew Morton, Dave Shrinnker, Greg Thelen,
	hughd, yinghan, Glauber Costa, Dave Chinner
From: Dave Chinner <dchinner@redhat.com>
Convert the buftarg LRU to use the new generic LRU list and take
advantage of the functionality it supplies to make the buffer cache
shrinker node aware.
Signed-off-by: Glauber Costa <glommer@parallels.com>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Conflicts with 3b19034d4f:
	fs/xfs/xfs_buf.c
---
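For readers new to the node-aware LRU this patch relies on, here is a
minimal userspace sketch of counting restricted to a node mask. It is only
a model: toy_lru, TOY_MAX_NODES and the toy_* helpers are hypothetical
names, and a plain unsigned long stands in for nodemask_t. The real
list_lru also takes a per-node lock, which is omitted here.

```c
#include <stddef.h>

#define TOY_MAX_NODES 4

/* Hypothetical stand-in for struct list_lru: one item counter per node. */
struct toy_lru {
	long nr_items[TOY_MAX_NODES];
	unsigned long active_nodes;	/* bit n set: node n holds items */
};

/* Add one item on node nid and mark that node active. */
static void toy_lru_add(struct toy_lru *lru, int nid)
{
	lru->nr_items[nid]++;
	lru->active_nodes |= 1UL << nid;
}

/* Count only the nodes selected by the mask. */
static long toy_lru_count_nodemask(const struct toy_lru *lru,
				   unsigned long nodes_to_count)
{
	long count = 0;
	int nid;

	for (nid = 0; nid < TOY_MAX_NODES; nid++)
		if (nodes_to_count & (1UL << nid))
			count += lru->nr_items[nid];
	return count;
}

/* Whole-LRU count: the same walk, restricted to the active nodes. */
static long toy_lru_count(const struct toy_lru *lru)
{
	return toy_lru_count_nodemask(lru, lru->active_nodes);
}
```

A shrinker that is handed a mask of nodes under pressure would call the
_nodemask variant with that mask, while callers that want everything keep
using the plain wrapper.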
 fs/xfs/xfs_buf.c | 167 +++++++++++++++++++++++++------------------------------
 fs/xfs/xfs_buf.h |   5 +-
 2 files changed, 79 insertions(+), 93 deletions(-)
diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
index 8459b5d..4cc6632 100644
--- a/fs/xfs/xfs_buf.c
+++ b/fs/xfs/xfs_buf.c
@@ -85,20 +85,14 @@ xfs_buf_vmap_len(
  * The LRU takes a new reference to the buffer so that it will only be freed
  * once the shrinker takes the buffer off the LRU.
  */
-STATIC void
+static void
 xfs_buf_lru_add(
 	struct xfs_buf	*bp)
 {
-	struct xfs_buftarg *btp = bp->b_target;
-
-	spin_lock(&btp->bt_lru_lock);
-	if (list_empty(&bp->b_lru)) {
-		atomic_inc(&bp->b_hold);
-		list_add_tail(&bp->b_lru, &btp->bt_lru);
-		btp->bt_lru_nr++;
+	if (list_lru_add(&bp->b_target->bt_lru, &bp->b_lru)) {
 		bp->b_lru_flags &= ~_XBF_LRU_DISPOSE;
+		atomic_inc(&bp->b_hold);
 	}
-	spin_unlock(&btp->bt_lru_lock);
 }
 
 /*
@@ -107,24 +101,13 @@ xfs_buf_lru_add(
  * The unlocked check is safe here because it only occurs when there are not
  * b_lru_ref counts left on the inode under the pag->pag_buf_lock. it is there
  * to optimise the shrinker removing the buffer from the LRU and calling
- * xfs_buf_free(). i.e. it removes an unnecessary round trip on the
- * bt_lru_lock.
+ * xfs_buf_free().
  */
-STATIC void
+static void
 xfs_buf_lru_del(
 	struct xfs_buf	*bp)
 {
-	struct xfs_buftarg *btp = bp->b_target;
-
-	if (list_empty(&bp->b_lru))
-		return;
-
-	spin_lock(&btp->bt_lru_lock);
-	if (!list_empty(&bp->b_lru)) {
-		list_del_init(&bp->b_lru);
-		btp->bt_lru_nr--;
-	}
-	spin_unlock(&btp->bt_lru_lock);
+	list_lru_del(&bp->b_target->bt_lru, &bp->b_lru);
 }
 
 /*
@@ -151,18 +134,10 @@ xfs_buf_stale(
 	bp->b_flags &= ~_XBF_DELWRI_Q;
 
 	atomic_set(&(bp)->b_lru_ref, 0);
-	if (!list_empty(&bp->b_lru)) {
-		struct xfs_buftarg *btp = bp->b_target;
-
-		spin_lock(&btp->bt_lru_lock);
-		if (!list_empty(&bp->b_lru) &&
-		    !(bp->b_lru_flags & _XBF_LRU_DISPOSE)) {
-			list_del_init(&bp->b_lru);
-			btp->bt_lru_nr--;
-			atomic_dec(&bp->b_hold);
-		}
-		spin_unlock(&btp->bt_lru_lock);
-	}
+	if (!(bp->b_lru_flags & _XBF_LRU_DISPOSE) &&
+	    (list_lru_del(&bp->b_target->bt_lru, &bp->b_lru)))
+		atomic_dec(&bp->b_hold);
+
 	ASSERT(atomic_read(&bp->b_hold) >= 1);
 }
 
@@ -1498,83 +1473,95 @@ xfs_buf_iomove(
  * returned. These buffers will have an elevated hold count, so wait on those
  * while freeing all the buffers only held by the LRU.
  */
-void
-xfs_wait_buftarg(
-	struct xfs_buftarg	*btp)
+static int
+xfs_buftarg_wait_rele(
+	struct list_head	*item,
+	spinlock_t		*lru_lock,
+	void			*arg)
+
 {
-	struct xfs_buf		*bp;
+	struct xfs_buf		*bp = container_of(item, struct xfs_buf, b_lru);
 
-restart:
-	spin_lock(&btp->bt_lru_lock);
-	while (!list_empty(&btp->bt_lru)) {
-		bp = list_first_entry(&btp->bt_lru, struct xfs_buf, b_lru);
-		if (atomic_read(&bp->b_hold) > 1) {
-			trace_xfs_buf_wait_buftarg(bp, _RET_IP_);
-			list_move_tail(&bp->b_lru, &btp->bt_lru);
-			spin_unlock(&btp->bt_lru_lock);
-			delay(100);
-			goto restart;
-		}
+	if (atomic_read(&bp->b_hold) > 1) {
+		/* need to wait */
+		trace_xfs_buf_wait_buftarg(bp, _RET_IP_);
+		spin_unlock(lru_lock);
+		delay(100);
+	} else {
 		/*
 		 * clear the LRU reference count so the buffer doesn't get
 		 * ignored in xfs_buf_rele().
 		 */
 		atomic_set(&bp->b_lru_ref, 0);
-		spin_unlock(&btp->bt_lru_lock);
+		spin_unlock(lru_lock);
 		xfs_buf_rele(bp);
-		spin_lock(&btp->bt_lru_lock);
 	}
-	spin_unlock(&btp->bt_lru_lock);
+	return 3;
 }
 
-int
-xfs_buftarg_shrink(
+void
+xfs_wait_buftarg(
+	struct xfs_buftarg	*btp)
+{
+	while (list_lru_count(&btp->bt_lru))
+		list_lru_walk(&btp->bt_lru, xfs_buftarg_wait_rele,
+			      NULL, LONG_MAX);
+}
+
+static int
+xfs_buftarg_isolate(
+	struct list_head	*item,
+	spinlock_t		*lru_lock,
+	void			*arg)
+{
+	struct xfs_buf		*bp = container_of(item, struct xfs_buf, b_lru);
+	struct list_head	*dispose = arg;
+
+	/*
+	 * Decrement the b_lru_ref count unless the value is already
+	 * zero. If the value is already zero, we need to reclaim the
+	 * buffer, otherwise it gets another trip through the LRU.
+	 */
+	if (!atomic_add_unless(&bp->b_lru_ref, -1, 0))
+		return 1;
+
+	bp->b_lru_flags |= _XBF_LRU_DISPOSE;
+	list_move(item, dispose);
+	return 0;
+}
+
+static long
+xfs_buftarg_shrink_scan(
 	struct shrinker		*shrink,
 	struct shrink_control	*sc)
 {
 	struct xfs_buftarg	*btp = container_of(shrink,
 					struct xfs_buftarg, bt_shrinker);
-	struct xfs_buf		*bp;
-	int nr_to_scan = sc->nr_to_scan;
 	LIST_HEAD(dispose);
+	long			freed;
 
-	if (!nr_to_scan)
-		return btp->bt_lru_nr;
-
-	spin_lock(&btp->bt_lru_lock);
-	while (!list_empty(&btp->bt_lru)) {
-		if (nr_to_scan-- <= 0)
-			break;
-
-		bp = list_first_entry(&btp->bt_lru, struct xfs_buf, b_lru);
-
-		/*
-		 * Decrement the b_lru_ref count unless the value is already
-		 * zero. If the value is already zero, we need to reclaim the
-		 * buffer, otherwise it gets another trip through the LRU.
-		 */
-		if (!atomic_add_unless(&bp->b_lru_ref, -1, 0)) {
-			list_move_tail(&bp->b_lru, &btp->bt_lru);
-			continue;
-		}
-
-		/*
-		 * remove the buffer from the LRU now to avoid needing another
-		 * lock round trip inside xfs_buf_rele().
-		 */
-		list_move(&bp->b_lru, &dispose);
-		btp->bt_lru_nr--;
-		bp->b_lru_flags |= _XBF_LRU_DISPOSE;
-	}
-	spin_unlock(&btp->bt_lru_lock);
+	freed = list_lru_walk_nodemask(&btp->bt_lru, xfs_buftarg_isolate,
+				       &dispose, sc->nr_to_scan,
+				       &sc->nodes_to_scan);
 
 	while (!list_empty(&dispose)) {
+		struct xfs_buf *bp;
 		bp = list_first_entry(&dispose, struct xfs_buf, b_lru);
 		list_del_init(&bp->b_lru);
 		xfs_buf_rele(bp);
 	}
 
-	return btp->bt_lru_nr;
+	return freed;
+}
+
+static long
+xfs_buftarg_shrink_count(
+	struct shrinker		*shrink,
+	struct shrink_control	*sc)
+{
+	struct xfs_buftarg	*btp = container_of(shrink,
+					struct xfs_buftarg, bt_shrinker);
+	return list_lru_count_nodemask(&btp->bt_lru, &sc->nodes_to_scan);
 }
 
 void
@@ -1656,11 +1643,11 @@ xfs_alloc_buftarg(
 	if (!btp->bt_bdi)
 		goto error;
 
-	INIT_LIST_HEAD(&btp->bt_lru);
-	spin_lock_init(&btp->bt_lru_lock);
+	list_lru_init(&btp->bt_lru);
 	if (xfs_setsize_buftarg_early(btp, bdev))
 		goto error;
-	btp->bt_shrinker.shrink = xfs_buftarg_shrink;
+	btp->bt_shrinker.count_objects = xfs_buftarg_shrink_count;
+	btp->bt_shrinker.scan_objects = xfs_buftarg_shrink_scan;
 	btp->bt_shrinker.seeks = DEFAULT_SEEKS;
 	register_shrinker(&btp->bt_shrinker);
 	return btp;
diff --git a/fs/xfs/xfs_buf.h b/fs/xfs/xfs_buf.h
index 433a12e..5ec7d35 100644
--- a/fs/xfs/xfs_buf.h
+++ b/fs/xfs/xfs_buf.h
@@ -25,6 +25,7 @@
 #include <linux/fs.h>
 #include <linux/buffer_head.h>
 #include <linux/uio.h>
+#include <linux/list_lru.h>
 
 /*
  *	Base types
@@ -92,9 +93,7 @@ typedef struct xfs_buftarg {
 
 	/* LRU control structures */
 	struct shrinker		bt_shrinker;
-	struct list_head	bt_lru;
-	spinlock_t		bt_lru_lock;
-	unsigned int		bt_lru_nr;
+	struct list_lru		bt_lru;
 } xfs_buftarg_t;
 
 struct xfs_buf;
-- 
1.8.1.4
^ permalink raw reply related	[flat|nested] 97+ messages in thread
* [PATCH v2 15/28] xfs: convert dquot cache lru to list_lru
  2013-03-29  9:13 [PATCH v2 00/28] memcg-aware slab shrinking Glauber Costa
                   ` (13 preceding siblings ...)
  2013-03-29  9:13 ` [PATCH v2 14/28] xfs: convert buftarg LRU to generic code Glauber Costa
@ 2013-03-29  9:13 ` Glauber Costa
  2013-03-29  9:13 ` [PATCH v2 16/28] fs: convert fs shrinkers to new scan/count API Glauber Costa
                   ` (13 subsequent siblings)
  28 siblings, 0 replies; 97+ messages in thread
From: Glauber Costa @ 2013-03-29  9:13 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-fsdevel, containers, Michal Hocko, Johannes Weiner,
	kamezawa.hiroyu, Andrew Morton, Dave Shrinnker, Greg Thelen,
	hughd, yinghan, Dave Chinner, Glauber Costa
From: Dave Chinner <dchinner@redhat.com>
Convert the XFS dquot LRU to use the list_lru construct and make
the shrinker node aware.
[ glommer: edited for conflicts ]
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Glauber Costa <glommer@parallels.com>
---
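The conversion follows an isolate-then-dispose pattern: the walker visits
items and a callback decides, per item, whether it leaves the LRU. The
sketch below models that in userspace under stated assumptions. All toy_*
names are hypothetical, an array stands in for the list, and the two
verdicts are illustrative only (the numeric return codes used in the patch
are defined by the list_lru code and are not reproduced here).

```c
#include <stddef.h>

/* Hypothetical callback verdicts; the numeric codes in the patch differ. */
enum toy_verdict { TOY_REMOVED, TOY_SKIP };

struct toy_item {
	int busy;		/* non-zero: cannot be reclaimed right now */
	int disposed;		/* set once the item has left the LRU */
};

typedef enum toy_verdict (*toy_isolate_cb)(struct toy_item *item, void *arg);

/* Walk up to nr_to_walk items; return how many the callback removed. */
static long toy_lru_walk(struct toy_item *items, size_t nr_items,
			 toy_isolate_cb isolate, void *arg, long nr_to_walk)
{
	long isolated = 0;
	size_t i;

	for (i = 0; i < nr_items && nr_to_walk > 0; i++, nr_to_walk--)
		if (isolate(&items[i], arg) == TOY_REMOVED)
			isolated++;
	return isolated;
}

/* Example callback: busy or already disposed items are skipped. */
static enum toy_verdict toy_isolate(struct toy_item *item, void *arg)
{
	long *nr_disposed = arg;

	if (item->busy || item->disposed)
		return TOY_SKIP;
	item->disposed = 1;
	(*nr_disposed)++;
	return TOY_REMOVED;
}
```

The point of the split is that the walker owns iteration and the scan
budget, while the callback owns the object-specific checks; the actual
freeing happens afterwards, on the dispose list, outside the LRU lock.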
 fs/xfs/xfs_dquot.c |   7 +-
 fs/xfs/xfs_qm.c    | 274 +++++++++++++++++++++++++++--------------------------
 fs/xfs/xfs_qm.h    |   4 +-
 3 files changed, 141 insertions(+), 144 deletions(-)
diff --git a/fs/xfs/xfs_dquot.c b/fs/xfs/xfs_dquot.c
index 8025eb2..c15e7f9 100644
--- a/fs/xfs/xfs_dquot.c
+++ b/fs/xfs/xfs_dquot.c
@@ -837,13 +837,8 @@ xfs_qm_dqput_final(
 
 	trace_xfs_dqput_free(dqp);
 
-	mutex_lock(&qi->qi_lru_lock);
-	if (list_empty(&dqp->q_lru)) {
-		list_add_tail(&dqp->q_lru, &qi->qi_lru_list);
-		qi->qi_lru_count++;
+	if (list_lru_add(&qi->qi_lru, &dqp->q_lru))
 		XFS_STATS_INC(xs_qm_dquot_unused);
-	}
-	mutex_unlock(&qi->qi_lru_lock);
 
 	/*
 	 * If we just added a udquot to the freelist, then we want to release
diff --git a/fs/xfs/xfs_qm.c b/fs/xfs/xfs_qm.c
index 305f4e5..0084362 100644
--- a/fs/xfs/xfs_qm.c
+++ b/fs/xfs/xfs_qm.c
@@ -50,8 +50,9 @@
  */
 STATIC int	xfs_qm_init_quotainos(xfs_mount_t *);
 STATIC int	xfs_qm_init_quotainfo(xfs_mount_t *);
-STATIC int	xfs_qm_shake(struct shrinker *, struct shrink_control *);
 
+
+STATIC void	xfs_qm_dqfree_one(struct xfs_dquot *dqp);
 /*
  * We use the batch lookup interface to iterate over the dquots as it
  * currently is the only interface into the radix tree code that allows
@@ -196,12 +197,9 @@ xfs_qm_dqpurge(
 	 * We move dquots to the freelist as soon as their reference count
 	 * hits zero, so it really should be on the freelist here.
 	 */
-	mutex_lock(&qi->qi_lru_lock);
 	ASSERT(!list_empty(&dqp->q_lru));
-	list_del_init(&dqp->q_lru);
-	qi->qi_lru_count--;
+	list_lru_del(&qi->qi_lru, &dqp->q_lru);
 	XFS_STATS_DEC(xs_qm_dquot_unused);
-	mutex_unlock(&qi->qi_lru_lock);
 
 	xfs_qm_dqdestroy(dqp);
 
@@ -617,6 +615,139 @@ xfs_qm_dqdetach(
 	}
 }
 
+struct xfs_qm_isolate {
+	struct list_head	buffers;
+	struct list_head	dispose;
+};
+
+static int
+xfs_qm_dquot_isolate(
+	struct list_head	*item,
+	spinlock_t		*lru_lock,
+	void			*arg)
+{
+	struct xfs_dquot	*dqp = container_of(item,
+						struct xfs_dquot, q_lru);
+	struct xfs_qm_isolate	*isol = arg;
+
+	if (!xfs_dqlock_nowait(dqp))
+		goto out_miss_busy;
+
+	/*
+	 * This dquot has acquired a reference in the meantime; remove it from
+	 * the freelist and try again.
+	 */
+	if (dqp->q_nrefs) {
+		xfs_dqunlock(dqp);
+		XFS_STATS_INC(xs_qm_dqwants);
+
+		trace_xfs_dqreclaim_want(dqp);
+		list_del_init(&dqp->q_lru);
+		XFS_STATS_DEC(xs_qm_dquot_unused);
+		return 0;
+	}
+
+	/*
+	 * If the dquot is dirty, flush it. If it's already being flushed, just
+	 * skip it so there is time for the IO to complete before we try to
+	 * reclaim it again on the next LRU pass.
+	 */
+	if (!xfs_dqflock_nowait(dqp)) {
+		xfs_dqunlock(dqp);
+		goto out_miss_busy;
+	}
+	if (XFS_DQ_IS_DIRTY(dqp)) {
+		struct xfs_buf	*bp = NULL;
+		int		error;
+
+		trace_xfs_dqreclaim_dirty(dqp);
+
+		/* we have to drop the LRU lock to flush the dquot */
+		spin_unlock(lru_lock);
+
+		error = xfs_qm_dqflush(dqp, &bp);
+		if (error) {
+			xfs_warn(dqp->q_mount, "%s: dquot %p flush failed",
+				 __func__, dqp);
+			goto out_unlock_dirty;
+		}
+
+		xfs_buf_delwri_queue(bp, &isol->buffers);
+		xfs_buf_relse(bp);
+		goto out_unlock_dirty;
+	}
+	xfs_dqfunlock(dqp);
+
+	/*
+	 * Prevent lookups now that we are past the point of no return.
+	 */
+	dqp->dq_flags |= XFS_DQ_FREEING;
+	xfs_dqunlock(dqp);
+
+	ASSERT(dqp->q_nrefs == 0);
+	list_move_tail(&dqp->q_lru, &isol->dispose);
+	XFS_STATS_DEC(xs_qm_dquot_unused);
+	trace_xfs_dqreclaim_done(dqp);
+	XFS_STATS_INC(xs_qm_dqreclaims);
+	return 0;
+
+out_miss_busy:
+	trace_xfs_dqreclaim_busy(dqp);
+	XFS_STATS_INC(xs_qm_dqreclaim_misses);
+	return 2;
+
+out_unlock_dirty:
+	trace_xfs_dqreclaim_busy(dqp);
+	XFS_STATS_INC(xs_qm_dqreclaim_misses);
+	return 3;
+}
+
+static long
+xfs_qm_shrink_scan(
+	struct shrinker		*shrink,
+	struct shrink_control	*sc)
+{
+	struct xfs_quotainfo	*qi = container_of(shrink,
+					struct xfs_quotainfo, qi_shrinker);
+	struct xfs_qm_isolate	isol;
+	long			freed;
+	int			error;
+
+	if ((sc->gfp_mask & (__GFP_FS|__GFP_WAIT)) != (__GFP_FS|__GFP_WAIT))
+		return 0;
+
+	INIT_LIST_HEAD(&isol.buffers);
+	INIT_LIST_HEAD(&isol.dispose);
+
+	freed = list_lru_walk_nodemask(&qi->qi_lru, xfs_qm_dquot_isolate, &isol,
+					sc->nr_to_scan, &sc->nodes_to_scan);
+
+	error = xfs_buf_delwri_submit(&isol.buffers);
+	if (error)
+		xfs_warn(NULL, "%s: dquot reclaim failed", __func__);
+
+	while (!list_empty(&isol.dispose)) {
+		struct xfs_dquot	*dqp;
+
+		dqp = list_first_entry(&isol.dispose, struct xfs_dquot, q_lru);
+		list_del_init(&dqp->q_lru);
+		xfs_qm_dqfree_one(dqp);
+	}
+
+	return freed;
+}
+
+static long
+xfs_qm_shrink_count(
+	struct shrinker		*shrink,
+	struct shrink_control	*sc)
+{
+	struct xfs_quotainfo	*qi = container_of(shrink,
+					struct xfs_quotainfo, qi_shrinker);
+
+	return list_lru_count_nodemask(&qi->qi_lru, &sc->nodes_to_scan);
+}
+
 /*
  * This initializes all the quota information that's kept in the
  * mount structure
@@ -647,9 +778,7 @@ xfs_qm_init_quotainfo(
 	INIT_RADIX_TREE(&qinf->qi_gquota_tree, GFP_NOFS);
 	mutex_init(&qinf->qi_tree_lock);
 
-	INIT_LIST_HEAD(&qinf->qi_lru_list);
-	qinf->qi_lru_count = 0;
-	mutex_init(&qinf->qi_lru_lock);
+	list_lru_init(&qinf->qi_lru);
 
 	/* mutex used to serialize quotaoffs */
 	mutex_init(&qinf->qi_quotaofflock);
@@ -716,7 +845,8 @@ xfs_qm_init_quotainfo(
 		qinf->qi_rtbwarnlimit = XFS_QM_RTBWARNLIMIT;
 	}
 
-	qinf->qi_shrinker.shrink = xfs_qm_shake;
+	qinf->qi_shrinker.count_objects = xfs_qm_shrink_count;
+	qinf->qi_shrinker.scan_objects = xfs_qm_shrink_scan;
 	qinf->qi_shrinker.seeks = DEFAULT_SEEKS;
 	register_shrinker(&qinf->qi_shrinker);
 	return 0;
@@ -1445,132 +1575,6 @@ xfs_qm_dqfree_one(
 	xfs_qm_dqdestroy(dqp);
 }
 
-STATIC void
-xfs_qm_dqreclaim_one(
-	struct xfs_dquot	*dqp,
-	struct list_head	*buffer_list,
-	struct list_head	*dispose_list)
-{
-	struct xfs_mount	*mp = dqp->q_mount;
-	struct xfs_quotainfo	*qi = mp->m_quotainfo;
-	int			error;
-
-	if (!xfs_dqlock_nowait(dqp))
-		goto out_move_tail;
-
-	/*
-	 * This dquot has acquired a reference in the meantime remove it from
-	 * the freelist and try again.
-	 */
-	if (dqp->q_nrefs) {
-		xfs_dqunlock(dqp);
-
-		trace_xfs_dqreclaim_want(dqp);
-		XFS_STATS_INC(xs_qm_dqwants);
-
-		list_del_init(&dqp->q_lru);
-		qi->qi_lru_count--;
-		XFS_STATS_DEC(xs_qm_dquot_unused);
-		return;
-	}
-
-	/*
-	 * Try to grab the flush lock. If this dquot is in the process of
-	 * getting flushed to disk, we don't want to reclaim it.
-	 */
-	if (!xfs_dqflock_nowait(dqp))
-		goto out_unlock_move_tail;
-
-	if (XFS_DQ_IS_DIRTY(dqp)) {
-		struct xfs_buf	*bp = NULL;
-
-		trace_xfs_dqreclaim_dirty(dqp);
-
-		error = xfs_qm_dqflush(dqp, &bp);
-		if (error) {
-			xfs_warn(mp, "%s: dquot %p flush failed",
-				 __func__, dqp);
-			goto out_unlock_move_tail;
-		}
-
-		xfs_buf_delwri_queue(bp, buffer_list);
-		xfs_buf_relse(bp);
-		/*
-		 * Give the dquot another try on the freelist, as the
-		 * flushing will take some time.
-		 */
-		goto out_unlock_move_tail;
-	}
-	xfs_dqfunlock(dqp);
-
-	/*
-	 * Prevent lookups now that we are past the point of no return.
-	 */
-	dqp->dq_flags |= XFS_DQ_FREEING;
-	xfs_dqunlock(dqp);
-
-	ASSERT(dqp->q_nrefs == 0);
-	list_move_tail(&dqp->q_lru, dispose_list);
-	qi->qi_lru_count--;
-	XFS_STATS_DEC(xs_qm_dquot_unused);
-
-	trace_xfs_dqreclaim_done(dqp);
-	XFS_STATS_INC(xs_qm_dqreclaims);
-	return;
-
-	/*
-	 * Move the dquot to the tail of the list so that we don't spin on it.
-	 */
-out_unlock_move_tail:
-	xfs_dqunlock(dqp);
-out_move_tail:
-	list_move_tail(&dqp->q_lru, &qi->qi_lru_list);
-	trace_xfs_dqreclaim_busy(dqp);
-	XFS_STATS_INC(xs_qm_dqreclaim_misses);
-}
-
-STATIC int
-xfs_qm_shake(
-	struct shrinker		*shrink,
-	struct shrink_control	*sc)
-{
-	struct xfs_quotainfo	*qi =
-		container_of(shrink, struct xfs_quotainfo, qi_shrinker);
-	int			nr_to_scan = sc->nr_to_scan;
-	LIST_HEAD		(buffer_list);
-	LIST_HEAD		(dispose_list);
-	struct xfs_dquot	*dqp;
-	int			error;
-
-	if ((sc->gfp_mask & (__GFP_FS|__GFP_WAIT)) != (__GFP_FS|__GFP_WAIT))
-		return 0;
-	if (!nr_to_scan)
-		goto out;
-
-	mutex_lock(&qi->qi_lru_lock);
-	while (!list_empty(&qi->qi_lru_list)) {
-		if (nr_to_scan-- <= 0)
-			break;
-		dqp = list_first_entry(&qi->qi_lru_list, struct xfs_dquot,
-				       q_lru);
-		xfs_qm_dqreclaim_one(dqp, &buffer_list, &dispose_list);
-	}
-	mutex_unlock(&qi->qi_lru_lock);
-
-	error = xfs_buf_delwri_submit(&buffer_list);
-	if (error)
-		xfs_warn(NULL, "%s: dquot reclaim failed", __func__);
-
-	while (!list_empty(&dispose_list)) {
-		dqp = list_first_entry(&dispose_list, struct xfs_dquot, q_lru);
-		list_del_init(&dqp->q_lru);
-		xfs_qm_dqfree_one(dqp);
-	}
-
-out:
-	return vfs_pressure_ratio(qi->qi_lru_count);
-}
-
 /*
  * Start a transaction and write the incore superblock changes to
  * disk. flags parameter indicates which fields have changed.
diff --git a/fs/xfs/xfs_qm.h b/fs/xfs/xfs_qm.h
index 44b858b..d08b72d 100644
--- a/fs/xfs/xfs_qm.h
+++ b/fs/xfs/xfs_qm.h
@@ -47,9 +47,7 @@ typedef struct xfs_quotainfo {
 	struct mutex qi_tree_lock;
 	xfs_inode_t	*qi_uquotaip;	 /* user quota inode */
 	xfs_inode_t	*qi_gquotaip;	 /* group quota inode */
-	struct list_head qi_lru_list;
-	struct mutex	 qi_lru_lock;
-	int		 qi_lru_count;
+	struct list_lru	 qi_lru;
 	int		 qi_dquots;
 	time_t		 qi_btimelimit;	 /* limit for blks timer */
 	time_t		 qi_itimelimit;	 /* limit for inodes timer */
-- 
1.8.1.4
^ permalink raw reply related	[flat|nested] 97+ messages in thread
* [PATCH v2 16/28] fs: convert fs shrinkers to new scan/count API
  2013-03-29  9:13 [PATCH v2 00/28] memcg-aware slab shrinking Glauber Costa
                   ` (14 preceding siblings ...)
  2013-03-29  9:13 ` [PATCH v2 15/28] xfs: convert dquot cache lru to list_lru Glauber Costa
@ 2013-03-29  9:13 ` Glauber Costa
  2013-03-29  9:13 ` [PATCH v2 17/28] drivers: convert shrinkers to new count/scan API Glauber Costa
                   ` (12 subsequent siblings)
  28 siblings, 0 replies; 97+ messages in thread
From: Glauber Costa @ 2013-03-29  9:13 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-fsdevel, containers, Michal Hocko, Johannes Weiner,
	kamezawa.hiroyu, Andrew Morton, Dave Shrinnker, Greg Thelen,
	hughd, yinghan, Dave Chinner, Glauber Costa
From: Dave Chinner <dchinner@redhat.com>
Convert the filesystem shrinkers to use the new API, and standardise
some of the behaviours of the shrinkers at the same time. For
example, nr_to_scan means the number of objects to scan, not the
number of objects to free.
I refactored the CIFS idmap shrinker a little. It really needs to
be broken up into one shrinker per tree, with an item count kept at
the tree root, so that we don't need to walk the tree every time the
shrinker needs to count the objects in it (i.e. all the time under
memory pressure).
[ glommer: fixes for ext4, ubifs, nfs, cifs and glock. Fixes are
  needed mainly due to new code merged in the tree ]
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Glauber Costa <glommer@parallels.com>
---
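To illustrate the count/scan split and the "nr_to_scan is objects to scan,
not objects to free" rule, here is a small userspace model. Everything in
it is hypothetical (toy_cache, toy_shrinker and friends are not kernel
types), and it assumes for simplicity that pinned objects are always
encountered first during a scan.

```c
#include <stddef.h>

/* Hypothetical cache: nr_cached objects, some of which are pinned. */
struct toy_cache {
	long nr_cached;
	long nr_pinned;		/* pinned objects are scanned but not freed */
};

struct toy_shrink_control {
	long nr_to_scan;	/* objects to look at, not objects to free */
};

struct toy_shrinker {
	long (*count_objects)(struct toy_cache *, struct toy_shrink_control *);
	long (*scan_objects)(struct toy_cache *, struct toy_shrink_control *);
};

/* Report how many objects are cached; never frees anything. */
static long toy_count(struct toy_cache *c, struct toy_shrink_control *sc)
{
	(void)sc;
	return c->nr_cached;
}

/* Scan up to nr_to_scan objects and return how many were freed. */
static long toy_scan(struct toy_cache *c, struct toy_shrink_control *sc)
{
	long scanned = sc->nr_to_scan < c->nr_cached ?
		       sc->nr_to_scan : c->nr_cached;
	long freed = scanned > c->nr_pinned ? scanned - c->nr_pinned : 0;

	c->nr_cached -= freed;
	return freed;
}
```

With the old single callback, the return value was the remaining object
count and a zero nr_to_scan meant "just count". Splitting the two makes
each callback's return value unambiguous.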
 fs/ext4/extents_status.c | 30 ++++++++++++++++------------
 fs/gfs2/glock.c          | 28 +++++++++++++++-----------
 fs/gfs2/main.c           |  3 ++-
 fs/gfs2/quota.c          | 12 +++++++-----
 fs/gfs2/quota.h          |  4 +++-
 fs/mbcache.c             | 51 ++++++++++++++++++++++++++++--------------------
 fs/nfs/dir.c             | 18 ++++++++++++++---
 fs/nfs/internal.h        |  4 +++-
 fs/nfs/super.c           |  3 ++-
 fs/nfsd/nfscache.c       | 31 ++++++++++++++++++++---------
 fs/quota/dquot.c         | 34 +++++++++++++++-----------------
 fs/ubifs/shrinker.c      | 20 +++++++++++--------
 fs/ubifs/super.c         |  3 ++-
 fs/ubifs/ubifs.h         |  3 ++-
 14 files changed, 151 insertions(+), 93 deletions(-)
diff --git a/fs/ext4/extents_status.c b/fs/ext4/extents_status.c
index fe3337a..7120f31 100644
--- a/fs/ext4/extents_status.c
+++ b/fs/ext4/extents_status.c
@@ -871,20 +871,26 @@ int ext4_es_zeroout(struct inode *inode, struct ext4_extent *ex)
 				     EXTENT_STATUS_WRITTEN);
 }
 
-static int ext4_es_shrink(struct shrinker *shrink, struct shrink_control *sc)
+
+static long ext4_es_count(struct shrinker *shrink, struct shrink_control *sc)
+{
+	long nr;
+	struct ext4_sb_info *sbi = container_of(shrink,
+					struct ext4_sb_info, s_es_shrinker);
+
+	nr = percpu_counter_read_positive(&sbi->s_extent_cache_cnt);
+	trace_ext4_es_shrink_enter(sbi->s_sb, sc->nr_to_scan, nr);
+	return nr;
+}
+
+static long ext4_es_scan(struct shrinker *shrink, struct shrink_control *sc)
 {
 	struct ext4_sb_info *sbi = container_of(shrink,
 					struct ext4_sb_info, s_es_shrinker);
 	struct ext4_inode_info *ei;
 	struct list_head *cur, *tmp, scanned;
 	int nr_to_scan = sc->nr_to_scan;
-	int ret, nr_shrunk = 0;
-
-	ret = percpu_counter_read_positive(&sbi->s_extent_cache_cnt);
-	trace_ext4_es_shrink_enter(sbi->s_sb, nr_to_scan, ret);
-
-	if (!nr_to_scan)
-		return ret;
+	int ret = 0, nr_shrunk = 0;
 
 	INIT_LIST_HEAD(&scanned);
 
@@ -913,9 +919,8 @@ static int ext4_es_shrink(struct shrinker *shrink, struct shrink_control *sc)
 	list_splice_tail(&scanned, &sbi->s_es_lru);
 	spin_unlock(&sbi->s_es_lru_lock);
 
-	ret = percpu_counter_read_positive(&sbi->s_extent_cache_cnt);
 	trace_ext4_es_shrink_exit(sbi->s_sb, nr_shrunk, ret);
-	return ret;
+	return nr_shrunk;
 }
 
 void ext4_es_register_shrinker(struct super_block *sb)
@@ -925,7 +930,8 @@ void ext4_es_register_shrinker(struct super_block *sb)
 	sbi = EXT4_SB(sb);
 	INIT_LIST_HEAD(&sbi->s_es_lru);
 	spin_lock_init(&sbi->s_es_lru_lock);
-	sbi->s_es_shrinker.shrink = ext4_es_shrink;
+	sbi->s_es_shrinker.scan_objects = ext4_es_scan;
+	sbi->s_es_shrinker.count_objects = ext4_es_count;
 	sbi->s_es_shrinker.seeks = DEFAULT_SEEKS;
 	register_shrinker(&sbi->s_es_shrinker);
 }
@@ -966,7 +972,7 @@ static int __es_try_to_reclaim_extents(struct ext4_inode_info *ei,
 	struct ext4_es_tree *tree = &ei->i_es_tree;
 	struct rb_node *node;
 	struct extent_status *es;
-	int nr_shrunk = 0;
+	long nr_shrunk = 0;
 
 	if (ei->i_es_lru_nr == 0)
 		return 0;
diff --git a/fs/gfs2/glock.c b/fs/gfs2/glock.c
index 078daa5..d2df2fd 100644
--- a/fs/gfs2/glock.c
+++ b/fs/gfs2/glock.c
@@ -1441,21 +1441,22 @@ __acquires(&lru_lock)
  * gfs2_dispose_glock_lru() above.
  */
 
-static void gfs2_scan_glock_lru(int nr)
+static long gfs2_scan_glock_lru(int nr)
 {
 	struct gfs2_glock *gl;
 	LIST_HEAD(skipped);
 	LIST_HEAD(dispose);
+	long freed = 0;
 
 	spin_lock(&lru_lock);
-	while(nr && !list_empty(&lru_list)) {
+	while ((nr-- >= 0) && !list_empty(&lru_list)) {
 		gl = list_entry(lru_list.next, struct gfs2_glock, gl_lru);
 
 		/* Test for being demotable */
 		if (!test_and_set_bit(GLF_LOCK, &gl->gl_flags)) {
 			list_move(&gl->gl_lru, &dispose);
 			atomic_dec(&lru_count);
-			nr--;
+			freed++;
 			continue;
 		}
 
@@ -1465,23 +1466,28 @@ static void gfs2_scan_glock_lru(int nr)
 	if (!list_empty(&dispose))
 		gfs2_dispose_glock_lru(&dispose);
 	spin_unlock(&lru_lock);
+
+	return freed;
 }
 
-static int gfs2_shrink_glock_memory(struct shrinker *shrink,
-				    struct shrink_control *sc)
+static long gfs2_glock_shrink_scan(struct shrinker *shrink,
+				   struct shrink_control *sc)
 {
-	if (sc->nr_to_scan) {
-		if (!(sc->gfp_mask & __GFP_FS))
-			return -1;
-		gfs2_scan_glock_lru(sc->nr_to_scan);
-	}
+	if (!(sc->gfp_mask & __GFP_FS))
+		return -1;
+	return gfs2_scan_glock_lru(sc->nr_to_scan);
+}
 
+static long gfs2_glock_shrink_count(struct shrinker *shrink,
+				    struct shrink_control *sc)
+{
 	return vfs_pressure_ratio(atomic_read(&lru_count));
 }
 
 static struct shrinker glock_shrinker = {
-	.shrink = gfs2_shrink_glock_memory,
 	.seeks = DEFAULT_SEEKS,
+	.count_objects = gfs2_glock_shrink_count,
+	.scan_objects = gfs2_glock_shrink_scan,
 };
 
 /**
diff --git a/fs/gfs2/main.c b/fs/gfs2/main.c
index e04d0e0..a105d84 100644
--- a/fs/gfs2/main.c
+++ b/fs/gfs2/main.c
@@ -32,7 +32,8 @@
 struct workqueue_struct *gfs2_control_wq;
 
 static struct shrinker qd_shrinker = {
-	.shrink = gfs2_shrink_qd_memory,
+	.count_objects = gfs2_qd_shrink_count,
+	.scan_objects = gfs2_qd_shrink_scan,
 	.seeks = DEFAULT_SEEKS,
 };
 
diff --git a/fs/gfs2/quota.c b/fs/gfs2/quota.c
index 5c14206..77b82f6 100644
--- a/fs/gfs2/quota.c
+++ b/fs/gfs2/quota.c
@@ -75,14 +75,12 @@ static LIST_HEAD(qd_lru_list);
 static atomic_t qd_lru_count = ATOMIC_INIT(0);
 static DEFINE_SPINLOCK(qd_lru_lock);
 
-int gfs2_shrink_qd_memory(struct shrinker *shrink, struct shrink_control *sc)
+long gfs2_qd_shrink_scan(struct shrinker *shrink, struct shrink_control *sc)
 {
 	struct gfs2_quota_data *qd;
 	struct gfs2_sbd *sdp;
 	int nr_to_scan = sc->nr_to_scan;
-
-	if (nr_to_scan == 0)
-		goto out;
+	long freed = 0;
 
 	if (!(sc->gfp_mask & __GFP_FS))
 		return -1;
@@ -110,10 +108,14 @@ int gfs2_shrink_qd_memory(struct shrinker *shrink, struct shrink_control *sc)
 		kmem_cache_free(gfs2_quotad_cachep, qd);
 		spin_lock(&qd_lru_lock);
 		nr_to_scan--;
+		freed++;
 	}
 	spin_unlock(&qd_lru_lock);
+	return freed;
+}
 
-out:
+long gfs2_qd_shrink_count(struct shrinker *shrink, struct shrink_control *sc)
+{
 	return vfs_pressure_ratio(atomic_read(&qd_lru_count));
 }
 
diff --git a/fs/gfs2/quota.h b/fs/gfs2/quota.h
index 4f5e6e4..4f61708 100644
--- a/fs/gfs2/quota.h
+++ b/fs/gfs2/quota.h
@@ -53,7 +53,9 @@ static inline int gfs2_quota_lock_check(struct gfs2_inode *ip)
 	return ret;
 }
 
-extern int gfs2_shrink_qd_memory(struct shrinker *shrink,
+extern long gfs2_qd_shrink_count(struct shrinker *shrink,
+				 struct shrink_control *sc);
+extern long gfs2_qd_shrink_scan(struct shrinker *shrink,
 				 struct shrink_control *sc);
 extern const struct quotactl_ops gfs2_quotactl_ops;
 
diff --git a/fs/mbcache.c b/fs/mbcache.c
index 5eb0476..009a463 100644
--- a/fs/mbcache.c
+++ b/fs/mbcache.c
@@ -86,18 +86,6 @@ static LIST_HEAD(mb_cache_list);
 static LIST_HEAD(mb_cache_lru_list);
 static DEFINE_SPINLOCK(mb_cache_spinlock);
 
-/*
- * What the mbcache registers as to get shrunk dynamically.
- */
-
-static int mb_cache_shrink_fn(struct shrinker *shrink,
-			      struct shrink_control *sc);
-
-static struct shrinker mb_cache_shrinker = {
-	.shrink = mb_cache_shrink_fn,
-	.seeks = DEFAULT_SEEKS,
-};
-
 static inline int
 __mb_cache_entry_is_hashed(struct mb_cache_entry *ce)
 {
@@ -151,7 +139,7 @@ forget:
 
 
 /*
- * mb_cache_shrink_fn()  memory pressure callback
+ * mb_cache_shrink_scan()  memory pressure callback
  *
  * This function is called by the kernel memory management when memory
  * gets low.
@@ -159,17 +147,18 @@ forget:
  * @shrink: (ignored)
  * @sc: shrink_control passed from reclaim
  *
- * Returns the number of objects which are present in the cache.
+ * Returns the number of objects freed.
  */
-static int
-mb_cache_shrink_fn(struct shrinker *shrink, struct shrink_control *sc)
+static long
+mb_cache_shrink_scan(
+	struct shrinker		*shrink,
+	struct shrink_control	*sc)
 {
 	LIST_HEAD(free_list);
-	struct mb_cache *cache;
 	struct mb_cache_entry *entry, *tmp;
-	int count = 0;
 	int nr_to_scan = sc->nr_to_scan;
 	gfp_t gfp_mask = sc->gfp_mask;
+	long freed = 0;
 
 	mb_debug("trying to free %d entries", nr_to_scan);
 	spin_lock(&mb_cache_spinlock);
@@ -179,19 +168,39 @@ mb_cache_shrink_fn(struct shrinker *shrink, struct shrink_control *sc)
 				   struct mb_cache_entry, e_lru_list);
 		list_move_tail(&ce->e_lru_list, &free_list);
 		__mb_cache_entry_unhash(ce);
+		freed++;
+	}
+	spin_unlock(&mb_cache_spinlock);
+	list_for_each_entry_safe(entry, tmp, &free_list, e_lru_list) {
+		__mb_cache_entry_forget(entry, gfp_mask);
 	}
+	return freed;
+}
+
+static long
+mb_cache_shrink_count(
+	struct shrinker		*shrink,
+	struct shrink_control	*sc)
+{
+	struct mb_cache *cache;
+	long count = 0;
+
+	spin_lock(&mb_cache_spinlock);
 	list_for_each_entry(cache, &mb_cache_list, c_cache_list) {
 		mb_debug("cache %s (%d)", cache->c_name,
 			  atomic_read(&cache->c_entry_count));
 		count += atomic_read(&cache->c_entry_count);
 	}
 	spin_unlock(&mb_cache_spinlock);
-	list_for_each_entry_safe(entry, tmp, &free_list, e_lru_list) {
-		__mb_cache_entry_forget(entry, gfp_mask);
-	}
+
 	return vfs_pressure_ratio(count);
 }
 
+static struct shrinker mb_cache_shrinker = {
+	.count_objects = mb_cache_shrink_count,
+	.scan_objects = mb_cache_shrink_scan,
+	.seeks = DEFAULT_SEEKS,
+};
 
 /*
  * mb_cache_create()  create a new cache
diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c
index 197bfff..e04f4fe 100644
--- a/fs/nfs/dir.c
+++ b/fs/nfs/dir.c
@@ -1956,17 +1956,20 @@ static void nfs_access_free_list(struct list_head *head)
 	}
 }
 
-int nfs_access_cache_shrinker(struct shrinker *shrink,
-			      struct shrink_control *sc)
+long
+nfs_access_cache_scan(
+	struct shrinker		*shrink,
+	struct shrink_control	*sc)
 {
 	LIST_HEAD(head);
 	struct nfs_inode *nfsi, *next;
 	struct nfs_access_entry *cache;
 	int nr_to_scan = sc->nr_to_scan;
 	gfp_t gfp_mask = sc->gfp_mask;
+	long freed = 0;
 
 	if ((gfp_mask & GFP_KERNEL) != GFP_KERNEL)
-		return (nr_to_scan == 0) ? 0 : -1;
+		return -1;
 
 	spin_lock(&nfs_access_lru_lock);
 	list_for_each_entry_safe(nfsi, next, &nfs_access_lru_list, access_cache_inode_lru) {
@@ -1982,6 +1985,7 @@ int nfs_access_cache_shrinker(struct shrinker *shrink,
 				struct nfs_access_entry, lru);
 		list_move(&cache->lru, &head);
 		rb_erase(&cache->rb_node, &nfsi->access_cache);
+		freed++;
 		if (!list_empty(&nfsi->access_cache_entry_lru))
 			list_move_tail(&nfsi->access_cache_inode_lru,
 					&nfs_access_lru_list);
@@ -1996,6 +2000,14 @@ remove_lru_entry:
 	}
 	spin_unlock(&nfs_access_lru_lock);
 	nfs_access_free_list(&head);
+	return freed;
+}
+
+long
+nfs_access_cache_count(
+	struct shrinker		*shrink,
+	struct shrink_control	*sc)
+{
 	return vfs_pressure_ratio(atomic_long_read(&nfs_access_nr_entries));
 }
 
diff --git a/fs/nfs/internal.h b/fs/nfs/internal.h
index 541c9eb..eafb056 100644
--- a/fs/nfs/internal.h
+++ b/fs/nfs/internal.h
@@ -262,7 +262,9 @@ extern struct nfs_client *nfs_init_client(struct nfs_client *clp,
 			   const char *ip_addr, rpc_authflavor_t authflavour);
 
 /* dir.c */
-extern int nfs_access_cache_shrinker(struct shrinker *shrink,
+extern long nfs_access_cache_count(struct shrinker *shrink,
+					struct shrink_control *sc);
+extern long nfs_access_cache_scan(struct shrinker *shrink,
 					struct shrink_control *sc);
 struct dentry *nfs_lookup(struct inode *, struct dentry *, unsigned int);
 int nfs_create(struct inode *, struct dentry *, umode_t, bool);
diff --git a/fs/nfs/super.c b/fs/nfs/super.c
index 2f8a29d..5301056 100644
--- a/fs/nfs/super.c
+++ b/fs/nfs/super.c
@@ -359,7 +359,8 @@ static void unregister_nfs4_fs(void)
 #endif
 
 static struct shrinker acl_shrinker = {
-	.shrink		= nfs_access_cache_shrinker,
+	.count_objects	= nfs_access_cache_count,
+	.scan_objects	= nfs_access_cache_scan,
 	.seeks		= DEFAULT_SEEKS,
 };
 
diff --git a/fs/nfsd/nfscache.c b/fs/nfsd/nfscache.c
index 62c1ee1..9675be6 100644
--- a/fs/nfsd/nfscache.c
+++ b/fs/nfsd/nfscache.c
@@ -38,11 +38,14 @@ static inline u32 request_hash(u32 xid)
 
 static int	nfsd_cache_append(struct svc_rqst *rqstp, struct kvec *vec);
 static void	cache_cleaner_func(struct work_struct *unused);
-static int 	nfsd_reply_cache_shrink(struct shrinker *shrink,
-					struct shrink_control *sc);
+static long	nfsd_reply_cache_count(struct shrinker *shrink,
+				       struct shrink_control *sc);
+static long	nfsd_reply_cache_scan(struct shrinker *shrink,
+				      struct shrink_control *sc);
 
 struct shrinker nfsd_reply_cache_shrinker = {
-	.shrink	= nfsd_reply_cache_shrink,
+	.scan_objects = nfsd_reply_cache_scan,
+	.count_objects = nfsd_reply_cache_count,
 	.seeks	= 1,
 };
 
@@ -193,16 +196,18 @@ nfsd_cache_entry_expired(struct svc_cacherep *rp)
  * Walk the LRU list and prune off entries that are older than RC_EXPIRE.
  * Also prune the oldest ones when the total exceeds the max number of entries.
  */
-static void
+static long
 prune_cache_entries(void)
 {
 	struct svc_cacherep *rp, *tmp;
+	long freed = 0;
 
 	list_for_each_entry_safe(rp, tmp, &lru_head, c_lru) {
 		if (!nfsd_cache_entry_expired(rp) &&
 		    num_drc_entries <= max_drc_entries)
 			break;
 		nfsd_reply_cache_free_locked(rp);
+		freed++;
 	}
 
 	/*
@@ -215,6 +220,7 @@ prune_cache_entries(void)
 		cancel_delayed_work(&cache_cleaner);
 	else
 		mod_delayed_work(system_wq, &cache_cleaner, RC_EXPIRE);
+	return freed;
 }
 
 static void
@@ -225,20 +231,27 @@ cache_cleaner_func(struct work_struct *unused)
 	spin_unlock(&cache_lock);
 }
 
-static int
-nfsd_reply_cache_shrink(struct shrinker *shrink, struct shrink_control *sc)
+static long
+nfsd_reply_cache_count(struct shrinker *shrink, struct shrink_control *sc)
 {
-	unsigned int num;
+	long num;
 
 	spin_lock(&cache_lock);
-	if (sc->nr_to_scan)
-		prune_cache_entries();
 	num = num_drc_entries;
 	spin_unlock(&cache_lock);
 
 	return num;
 }
 
+static long
+nfsd_reply_cache_scan(struct shrinker *shrink, struct shrink_control *sc)
+{
+	long freed;
+	spin_lock(&cache_lock);
+	freed = prune_cache_entries();
+	spin_unlock(&cache_lock);
+	return freed;
+}
 /*
  * Walk an xdr_buf and get a CRC for at most the first RC_CSUMLEN bytes
  */
diff --git a/fs/quota/dquot.c b/fs/quota/dquot.c
index 762b09c..fd6b762 100644
--- a/fs/quota/dquot.c
+++ b/fs/quota/dquot.c
@@ -687,44 +687,42 @@ int dquot_quota_sync(struct super_block *sb, int type)
 }
 EXPORT_SYMBOL(dquot_quota_sync);
 
-/* Free unused dquots from cache */
-static void prune_dqcache(int count)
+static long
+dqcache_shrink_scan(
+	struct shrinker		*shrink,
+	struct shrink_control	*sc)
 {
 	struct list_head *head;
 	struct dquot *dquot;
+	long freed = 0;
 
 	head = free_dquots.prev;
-	while (head != &free_dquots && count) {
+	while (head != &free_dquots && sc->nr_to_scan) {
 		dquot = list_entry(head, struct dquot, dq_free);
 		remove_dquot_hash(dquot);
 		remove_free_dquot(dquot);
 		remove_inuse(dquot);
 		do_destroy_dquot(dquot);
-		count--;
+		sc->nr_to_scan--;
+		freed++;
 		head = free_dquots.prev;
 	}
+	return freed;
 }
 
-/*
- * This is called from kswapd when we think we need some
- * more memory
- */
-static int shrink_dqcache_memory(struct shrinker *shrink,
-				 struct shrink_control *sc)
-{
-	int nr = sc->nr_to_scan;
+static long
+dqcache_shrink_count(
+	struct shrinker		*shrink,
+	struct shrink_control	*sc)
 
-	if (nr) {
-		spin_lock(&dq_list_lock);
-		prune_dqcache(nr);
-		spin_unlock(&dq_list_lock);
-	}
+{
 	return vfs_pressure_ratio(
 	percpu_counter_read_positive(&dqstats.counter[DQST_FREE_DQUOTS]));
 }
 
 static struct shrinker dqcache_shrinker = {
-	.shrink = shrink_dqcache_memory,
+	.count_objects = dqcache_shrink_count,
+	.scan_objects = dqcache_shrink_scan,
 	.seeks = DEFAULT_SEEKS,
 };
 
diff --git a/fs/ubifs/shrinker.c b/fs/ubifs/shrinker.c
index 9e1d056..669d8c0 100644
--- a/fs/ubifs/shrinker.c
+++ b/fs/ubifs/shrinker.c
@@ -277,19 +277,23 @@ static int kick_a_thread(void)
 	return 0;
 }
 
-int ubifs_shrinker(struct shrinker *shrink, struct shrink_control *sc)
+long ubifs_shrink_count(struct shrinker *shrink, struct shrink_control *sc)
+{
+	long clean_zn_cnt = atomic_long_read(&ubifs_clean_zn_cnt);
+
+	/*
+	 * Due to the way UBIFS updates the clean znode counter it may
+	 * temporarily be negative.
+	 */
+	return clean_zn_cnt >= 0 ? clean_zn_cnt : 1;
+}
+
+long ubifs_shrink_scan(struct shrinker *shrink, struct shrink_control *sc)
 {
 	int nr = sc->nr_to_scan;
 	int freed, contention = 0;
 	long clean_zn_cnt = atomic_long_read(&ubifs_clean_zn_cnt);
 
-	if (nr == 0)
-		/*
-		 * Due to the way UBIFS updates the clean znode counter it may
-		 * temporarily be negative.
-		 */
-		return clean_zn_cnt >= 0 ? clean_zn_cnt : 1;
-
 	if (!clean_zn_cnt) {
 		/*
 		 * No clean znodes, nothing to reap. All we can do in this case
diff --git a/fs/ubifs/super.c b/fs/ubifs/super.c
index ac838b8..c87b890 100644
--- a/fs/ubifs/super.c
+++ b/fs/ubifs/super.c
@@ -49,7 +49,8 @@ struct kmem_cache *ubifs_inode_slab;
 
 /* UBIFS TNC shrinker description */
 static struct shrinker ubifs_shrinker_info = {
-	.shrink = ubifs_shrinker,
+	.scan_objects = ubifs_shrink_scan,
+	.count_objects = ubifs_shrink_count,
 	.seeks = DEFAULT_SEEKS,
 };
 
diff --git a/fs/ubifs/ubifs.h b/fs/ubifs/ubifs.h
index b2babce..bcdafcc 100644
--- a/fs/ubifs/ubifs.h
+++ b/fs/ubifs/ubifs.h
@@ -1624,7 +1624,8 @@ int ubifs_tnc_start_commit(struct ubifs_info *c, struct ubifs_zbranch *zroot);
 int ubifs_tnc_end_commit(struct ubifs_info *c);
 
 /* shrinker.c */
-int ubifs_shrinker(struct shrinker *shrink, struct shrink_control *sc);
+long ubifs_shrink_scan(struct shrinker *shrink, struct shrink_control *sc);
+long ubifs_shrink_count(struct shrinker *shrink, struct shrink_control *sc);
 
 /* commit.c */
 int ubifs_bg_thread(void *info);
-- 
1.8.1.4
* [PATCH v2 17/28] drivers: convert shrinkers to new count/scan API
From: Glauber Costa @ 2013-03-29  9:13 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-fsdevel, containers, Michal Hocko, Johannes Weiner,
	kamezawa.hiroyu, Andrew Morton, Dave Shrinnker, Greg Thelen,
	hughd, yinghan, Dave Chinner, Glauber Costa
From: Dave Chinner <dchinner@redhat.com>
Convert the driver shrinkers to the new API. Most changes are
compile tested only because I either don't have the hardware or it's
staging stuff.
FWIW, the md and android code is pretty good, but the rest of it
makes me want to claw my eyes out.  The amount of broken code I just
encountered is mind boggling.  I've added comments explaining what
is broken, but I fear that some of the code would be best dealt with
by being dragged behind the bike shed, buried in mud up to its
neck and then run over repeatedly with a blunt lawn mower.
Special mention goes to the zcache/zcache2 drivers. They can't
co-exist in the build at the same time, they are under different
menu options in menuconfig, they only show up when you've got the
right set of mm subsystem options configured and so even compile
testing is an exercise in pulling teeth.  And that doesn't even take
into account the horrible, broken code...
[ glommer: fixes for i915, android lowmem, zcache ]
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Glauber Costa <glommer@parallels.com>
---
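For anyone skimming the diff below, the conversion applied to each driver
follows roughly one pattern: the old ->shrink callback, which both counted
and freed depending on nr_to_scan, is split into a ->count_objects callback
that only reports and a ->scan_objects callback that only frees and returns
the number freed (or -1 when it cannot proceed in the given context). The
following is a minimal userspace sketch of that split, not kernel code. The
struct definitions are cut-down stand-ins, and demo_cache, DEMO_GFP_FS and
DEMO_SEEKS are hypothetical names invented for the illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Userspace stand-ins for the kernel structures; only the fields used here. */
struct shrink_control {
	unsigned long nr_to_scan;	/* upper bound on objects to examine */
	unsigned int gfp_mask;		/* allocation context of the caller */
};

struct shrinker {
	long (*count_objects)(struct shrinker *, struct shrink_control *);
	long (*scan_objects)(struct shrinker *, struct shrink_control *);
	int seeks;
};

#define DEMO_GFP_FS	0x1u	/* hypothetical, stands in for __GFP_FS */
#define DEMO_SEEKS	2	/* hypothetical, stands in for DEFAULT_SEEKS */

/* Hypothetical cache: a bare counter of freeable objects. */
struct demo_cache {
	struct shrinker shrinker;
	long nr_cached;
};

/* Open-coded container_of(): recover the cache from its embedded shrinker. */
static struct demo_cache *to_demo_cache(struct shrinker *shrink)
{
	return (struct demo_cache *)((char *)shrink -
				     offsetof(struct demo_cache, shrinker));
}

/* Query only: report how many objects could be freed, free nothing. */
static long demo_cache_count(struct shrinker *shrink,
			     struct shrink_control *sc)
{
	(void)sc;
	return to_demo_cache(shrink)->nr_cached;
}

/* Reclaim only: free up to nr_to_scan objects, return how many were freed. */
static long demo_cache_scan(struct shrinker *shrink,
			    struct shrink_control *sc)
{
	struct demo_cache *cache = to_demo_cache(shrink);
	unsigned long nr = sc->nr_to_scan;
	long freed = 0;

	/* Cannot make progress safely in this context: say so with -1. */
	if (!(sc->gfp_mask & DEMO_GFP_FS))
		return -1;

	while (nr > 0 && cache->nr_cached > 0) {
		cache->nr_cached--;	/* stands in for freeing one object */
		freed++;
		nr--;
	}
	return freed;
}

/* Wire up both callbacks, as the converted drivers do at registration. */
static void demo_cache_init(struct demo_cache *cache, long nr_cached)
{
	assert(nr_cached >= 0);
	cache->nr_cached = nr_cached;
	cache->shrinker.count_objects = demo_cache_count;
	cache->shrinker.scan_objects = demo_cache_scan;
	cache->shrinker.seeks = DEMO_SEEKS;
}
```

With 10 cached objects, a count call returns 10, a scan with nr_to_scan = 4
returns 4, and the next count returns 6. Note that the count callback no
longer has to special-case nr_to_scan == 0, which is where most of the
deleted lines in the diff come from.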
 drivers/gpu/drm/i915/i915_dma.c            |  4 +-
 drivers/gpu/drm/i915/i915_drv.h            |  2 +-
 drivers/gpu/drm/i915/i915_gem.c            | 69 ++++++++++++++++++++++--------
 drivers/gpu/drm/i915/i915_gem_evict.c      | 10 +++--
 drivers/gpu/drm/i915/i915_gem_execbuffer.c |  2 +-
 drivers/gpu/drm/ttm/ttm_page_alloc.c       | 48 ++++++++++++++-------
 drivers/gpu/drm/ttm/ttm_page_alloc_dma.c   | 55 ++++++++++++++++--------
 drivers/md/dm-bufio.c                      | 65 ++++++++++++++++++----------
 drivers/staging/android/ashmem.c           | 44 +++++++++++++------
 drivers/staging/android/lowmemorykiller.c  | 40 ++++++++++-------
 drivers/staging/zcache/zcache-main.c       | 29 ++++++++-----
 11 files changed, 242 insertions(+), 126 deletions(-)
diff --git a/drivers/gpu/drm/i915/i915_dma.c b/drivers/gpu/drm/i915/i915_dma.c
index 4fa6beb..e3e3d13 100644
--- a/drivers/gpu/drm/i915/i915_dma.c
+++ b/drivers/gpu/drm/i915/i915_dma.c
@@ -1646,7 +1646,7 @@ int i915_driver_load(struct drm_device *dev, unsigned long flags)
 	return 0;
 
 out_gem_unload:
-	if (dev_priv->mm.inactive_shrinker.shrink)
+	if (dev_priv->mm.inactive_shrinker.scan_objects)
 		unregister_shrinker(&dev_priv->mm.inactive_shrinker);
 
 	if (dev->pdev->msi_enabled)
@@ -1683,7 +1683,7 @@ int i915_driver_unload(struct drm_device *dev)
 
 	i915_teardown_sysfs(dev);
 
-	if (dev_priv->mm.inactive_shrinker.shrink)
+	if (dev_priv->mm.inactive_shrinker.scan_objects)
 		unregister_shrinker(&dev_priv->mm.inactive_shrinker);
 
 	mutex_lock(&dev->struct_mutex);
diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
index e95337c..321f297 100644
--- a/drivers/gpu/drm/i915/i915_drv.h
+++ b/drivers/gpu/drm/i915/i915_drv.h
@@ -1708,7 +1708,7 @@ int __must_check i915_gem_evict_something(struct drm_device *dev, int min_size,
 					  unsigned cache_level,
 					  bool mappable,
 					  bool nonblock);
-int i915_gem_evict_everything(struct drm_device *dev);
+long i915_gem_evict_everything(struct drm_device *dev);
 
 /* i915_gem_stolen.c */
 int i915_gem_init_stolen(struct drm_device *dev);
diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
index 0e207e6..7852632 100644
--- a/drivers/gpu/drm/i915/i915_gem.c
+++ b/drivers/gpu/drm/i915/i915_gem.c
@@ -53,10 +53,12 @@ static void i915_gem_object_update_fence(struct drm_i915_gem_object *obj,
 					 struct drm_i915_fence_reg *fence,
 					 bool enable);
 
-static int i915_gem_inactive_shrink(struct shrinker *shrinker,
+static long i915_gem_inactive_count(struct shrinker *shrinker,
 				    struct shrink_control *sc);
+static long i915_gem_inactive_scan(struct shrinker *shrinker,
+				   struct shrink_control *sc);
 static long i915_gem_purge(struct drm_i915_private *dev_priv, long target);
-static void i915_gem_shrink_all(struct drm_i915_private *dev_priv);
+static long i915_gem_shrink_all(struct drm_i915_private *dev_priv);
 static void i915_gem_object_truncate(struct drm_i915_gem_object *obj);
 
 static inline void i915_gem_object_fence_lost(struct drm_i915_gem_object *obj)
@@ -1738,15 +1740,20 @@ i915_gem_purge(struct drm_i915_private *dev_priv, long target)
 	return __i915_gem_shrink(dev_priv, target, true);
 }
 
-static void
+static long
 i915_gem_shrink_all(struct drm_i915_private *dev_priv)
 {
 	struct drm_i915_gem_object *obj, *next;
+	long freed = 0;
 
-	i915_gem_evict_everything(dev_priv->dev);
+	freed += i915_gem_evict_everything(dev_priv->dev);
 
-	list_for_each_entry_safe(obj, next, &dev_priv->mm.unbound_list, gtt_list)
+	list_for_each_entry_safe(obj, next, &dev_priv->mm.unbound_list, gtt_list) {
+		if (obj->pages_pin_count == 0)
+			freed += obj->base.size >> PAGE_SHIFT;
 		i915_gem_object_put_pages(obj);
+	}
+	return freed;
 }
 
 static int
@@ -4158,7 +4165,8 @@ i915_gem_load(struct drm_device *dev)
 
 	dev_priv->mm.interruptible = true;
 
-	dev_priv->mm.inactive_shrinker.shrink = i915_gem_inactive_shrink;
+	dev_priv->mm.inactive_shrinker.scan_objects = i915_gem_inactive_scan;
+	dev_priv->mm.inactive_shrinker.count_objects = i915_gem_inactive_count;
 	dev_priv->mm.inactive_shrinker.seeks = DEFAULT_SEEKS;
 	register_shrinker(&dev_priv->mm.inactive_shrinker);
 }
@@ -4381,8 +4389,8 @@ static bool mutex_is_locked_by(struct mutex *mutex, struct task_struct *task)
 #endif
 }
 
-static int
-i915_gem_inactive_shrink(struct shrinker *shrinker, struct shrink_control *sc)
+static long
+i915_gem_inactive_count(struct shrinker *shrinker, struct shrink_control *sc)
 {
 	struct drm_i915_private *dev_priv =
 		container_of(shrinker,
@@ -4390,9 +4398,8 @@ i915_gem_inactive_shrink(struct shrinker *shrinker, struct shrink_control *sc)
 			     mm.inactive_shrinker);
 	struct drm_device *dev = dev_priv->dev;
 	struct drm_i915_gem_object *obj;
-	int nr_to_scan = sc->nr_to_scan;
 	bool unlock = true;
-	int cnt;
+	long cnt;
 
 	if (!mutex_trylock(&dev->struct_mutex)) {
 		if (!mutex_is_locked_by(&dev->struct_mutex, current))
@@ -4404,15 +4411,6 @@ i915_gem_inactive_shrink(struct shrinker *shrinker, struct shrink_control *sc)
 		unlock = false;
 	}
 
-	if (nr_to_scan) {
-		nr_to_scan -= i915_gem_purge(dev_priv, nr_to_scan);
-		if (nr_to_scan > 0)
-			nr_to_scan -= __i915_gem_shrink(dev_priv, nr_to_scan,
-							false);
-		if (nr_to_scan > 0)
-			i915_gem_shrink_all(dev_priv);
-	}
-
 	cnt = 0;
 	list_for_each_entry(obj, &dev_priv->mm.unbound_list, gtt_list)
 		if (obj->pages_pin_count == 0)
@@ -4425,3 +4423,36 @@ i915_gem_inactive_shrink(struct shrinker *shrinker, struct shrink_control *sc)
 		mutex_unlock(&dev->struct_mutex);
 	return cnt;
 }
+static long
+i915_gem_inactive_scan(struct shrinker *shrinker, struct shrink_control *sc)
+{
+	struct drm_i915_private *dev_priv =
+		container_of(shrinker,
+			     struct drm_i915_private,
+			     mm.inactive_shrinker);
+	struct drm_device *dev = dev_priv->dev;
+	int nr_to_scan = sc->nr_to_scan;
+	long freed;
+	bool unlock = true;
+
+	if (!mutex_trylock(&dev->struct_mutex)) {
+		if (!mutex_is_locked_by(&dev->struct_mutex, current))
+			return 0;
+
+		if (dev_priv->mm.shrinker_no_lock_stealing)
+			return 0;
+
+		unlock = false;
+	}
+
+	freed = i915_gem_purge(dev_priv, nr_to_scan);
+	if (freed < nr_to_scan)
+		freed += __i915_gem_shrink(dev_priv, nr_to_scan - freed,
+							false);
+	if (freed < nr_to_scan)
+		freed += i915_gem_shrink_all(dev_priv);
+
+	if (unlock)
+		mutex_unlock(&dev->struct_mutex);
+	return freed;
+}
diff --git a/drivers/gpu/drm/i915/i915_gem_evict.c b/drivers/gpu/drm/i915/i915_gem_evict.c
index c86d5d9..e379340 100644
--- a/drivers/gpu/drm/i915/i915_gem_evict.c
+++ b/drivers/gpu/drm/i915/i915_gem_evict.c
@@ -150,13 +150,13 @@ found:
 	return ret;
 }
 
-int
+long
 i915_gem_evict_everything(struct drm_device *dev)
 {
 	drm_i915_private_t *dev_priv = dev->dev_private;
 	struct drm_i915_gem_object *obj, *next;
 	bool lists_empty;
-	int ret;
+	long ret = 0;
 
 	lists_empty = (list_empty(&dev_priv->mm.inactive_list) &&
 		       list_empty(&dev_priv->mm.active_list));
@@ -178,8 +178,10 @@ i915_gem_evict_everything(struct drm_device *dev)
 	/* Having flushed everything, unbind() should never raise an error */
 	list_for_each_entry_safe(obj, next,
 				 &dev_priv->mm.inactive_list, mm_list)
-		if (obj->pin_count == 0)
+		if (obj->pin_count == 0) {
+			ret += obj->base.size >> PAGE_SHIFT;
 			WARN_ON(i915_gem_object_unbind(obj));
+		}
 
-	return 0;
+	return ret;
 }
diff --git a/drivers/gpu/drm/i915/i915_gem_execbuffer.c b/drivers/gpu/drm/i915/i915_gem_execbuffer.c
index 3b11ab0..c0ad264 100644
--- a/drivers/gpu/drm/i915/i915_gem_execbuffer.c
+++ b/drivers/gpu/drm/i915/i915_gem_execbuffer.c
@@ -568,7 +568,7 @@ err:		/* Decrement pin count for bound objects */
 			return ret;
 
 		ret = i915_gem_evict_everything(ring->dev);
-		if (ret)
+		if (ret < 0)
 			return ret;
 	} while (1);
 }
diff --git a/drivers/gpu/drm/ttm/ttm_page_alloc.c b/drivers/gpu/drm/ttm/ttm_page_alloc.c
index bd2a3b4..83058a2 100644
--- a/drivers/gpu/drm/ttm/ttm_page_alloc.c
+++ b/drivers/gpu/drm/ttm/ttm_page_alloc.c
@@ -377,28 +377,28 @@ out:
 	return nr_free;
 }
 
-/* Get good estimation how many pages are free in pools */
-static int ttm_pool_get_num_unused_pages(void)
-{
-	unsigned i;
-	int total = 0;
-	for (i = 0; i < NUM_POOLS; ++i)
-		total += _manager->pools[i].npages;
-
-	return total;
-}
-
 /**
  * Callback for mm to request pool to reduce number of page held.
+ *
+ * XXX: (dchinner) Deadlock warning!
+ *
+ * ttm_page_pool_free() does memory allocation using GFP_KERNEL.  That means
+ * this can deadlock when called with a sc->gfp_mask that is not equal to
+ * GFP_KERNEL.
+ *
+ * This code is crying out for a shrinker per pool....
  */
-static int ttm_pool_mm_shrink(struct shrinker *shrink,
-			      struct shrink_control *sc)
+static long
+ttm_pool_shrink_scan(
+	struct shrinker		*shrink,
+	struct shrink_control	*sc)
 {
 	static atomic_t start_pool = ATOMIC_INIT(0);
 	unsigned i;
 	unsigned pool_offset = atomic_add_return(1, &start_pool);
 	struct ttm_page_pool *pool;
 	int shrink_pages = sc->nr_to_scan;
+	long freed = 0;
 
 	pool_offset = pool_offset % NUM_POOLS;
 	/* select start pool in round robin fashion */
@@ -408,14 +408,30 @@ static int ttm_pool_mm_shrink(struct shrinker *shrink,
 			break;
 		pool = &_manager->pools[(i + pool_offset)%NUM_POOLS];
 		shrink_pages = ttm_page_pool_free(pool, nr_free);
+		freed += nr_free - shrink_pages;
 	}
-	/* return estimated number of unused pages in pool */
-	return ttm_pool_get_num_unused_pages();
+	return freed;
+}
+
+
+static long
+ttm_pool_shrink_count(
+	struct shrinker		*shrink,
+	struct shrink_control	*sc)
+{
+	unsigned i;
+	long count = 0;
+
+	for (i = 0; i < NUM_POOLS; ++i)
+		count += _manager->pools[i].npages;
+
+	return count;
 }
 
 static void ttm_pool_mm_shrink_init(struct ttm_pool_manager *manager)
 {
-	manager->mm_shrink.shrink = &ttm_pool_mm_shrink;
+	manager->mm_shrink.count_objects = &ttm_pool_shrink_count;
+	manager->mm_shrink.scan_objects = &ttm_pool_shrink_scan;
 	manager->mm_shrink.seeks = 1;
 	register_shrinker(&manager->mm_shrink);
 }
diff --git a/drivers/gpu/drm/ttm/ttm_page_alloc_dma.c b/drivers/gpu/drm/ttm/ttm_page_alloc_dma.c
index b8b3943..b3b4f99 100644
--- a/drivers/gpu/drm/ttm/ttm_page_alloc_dma.c
+++ b/drivers/gpu/drm/ttm/ttm_page_alloc_dma.c
@@ -918,19 +918,6 @@ int ttm_dma_populate(struct ttm_dma_tt *ttm_dma, struct device *dev)
 }
 EXPORT_SYMBOL_GPL(ttm_dma_populate);
 
-/* Get good estimation how many pages are free in pools */
-static int ttm_dma_pool_get_num_unused_pages(void)
-{
-	struct device_pools *p;
-	unsigned total = 0;
-
-	mutex_lock(&_manager->lock);
-	list_for_each_entry(p, &_manager->pools, pools)
-		total += p->pool->npages_free;
-	mutex_unlock(&_manager->lock);
-	return total;
-}
-
 /* Put all pages in pages list to correct pool to wait for reuse */
 void ttm_dma_unpopulate(struct ttm_dma_tt *ttm_dma, struct device *dev)
 {
@@ -1002,18 +989,31 @@ EXPORT_SYMBOL_GPL(ttm_dma_unpopulate);
 
 /**
  * Callback for mm to request pool to reduce number of page held.
+ *
+ * XXX: (dchinner) Deadlock warning!
+ *
+ * ttm_dma_page_pool_free() does GFP_KERNEL memory allocation, and so attention
+ * needs to be paid to sc->gfp_mask to determine if this can be done or not.
+ * GFP_KERNEL memory allocation in a GFP_ATOMIC reclaim context would be really
+ * bad.
+ *
+ * I'm getting sadder as I hear more pathetic whimpers about needing per-pool
+ * shrinkers
  */
-static int ttm_dma_pool_mm_shrink(struct shrinker *shrink,
-				  struct shrink_control *sc)
+static long
+ttm_dma_pool_shrink_scan(
+	struct shrinker		*shrink,
+	struct shrink_control	*sc)
 {
 	static atomic_t start_pool = ATOMIC_INIT(0);
 	unsigned idx = 0;
 	unsigned pool_offset = atomic_add_return(1, &start_pool);
 	unsigned shrink_pages = sc->nr_to_scan;
 	struct device_pools *p;
+	long freed = 0;
 
 	if (list_empty(&_manager->pools))
-		return 0;
+		return -1;
 
 	mutex_lock(&_manager->lock);
 	pool_offset = pool_offset % _manager->npools;
@@ -1029,18 +1029,35 @@ static int ttm_dma_pool_mm_shrink(struct shrinker *shrink,
 			continue;
 		nr_free = shrink_pages;
 		shrink_pages = ttm_dma_page_pool_free(p->pool, nr_free);
+		freed += nr_free - shrink_pages;
+
 		pr_debug("%s: (%s:%d) Asked to shrink %d, have %d more to go\n",
 			 p->pool->dev_name, p->pool->name, current->pid,
 			 nr_free, shrink_pages);
 	}
 	mutex_unlock(&_manager->lock);
-	/* return estimated number of unused pages in pool */
-	return ttm_dma_pool_get_num_unused_pages();
+	return freed;
+}
+
+static long
+ttm_dma_pool_shrink_count(
+	struct shrinker		*shrink,
+	struct shrink_control	*sc)
+{
+	struct device_pools *p;
+	long count = 0;
+
+	mutex_lock(&_manager->lock);
+	list_for_each_entry(p, &_manager->pools, pools)
+		count += p->pool->npages_free;
+	mutex_unlock(&_manager->lock);
+	return count;
 }
 
 static void ttm_dma_pool_mm_shrink_init(struct ttm_pool_manager *manager)
 {
-	manager->mm_shrink.shrink = &ttm_dma_pool_mm_shrink;
+	manager->mm_shrink.count_objects = &ttm_dma_pool_shrink_count;
+	manager->mm_shrink.scan_objects = &ttm_dma_pool_shrink_scan;
 	manager->mm_shrink.seeks = 1;
 	register_shrinker(&manager->mm_shrink);
 }
diff --git a/drivers/md/dm-bufio.c b/drivers/md/dm-bufio.c
index c608313..b615e12 100644
--- a/drivers/md/dm-bufio.c
+++ b/drivers/md/dm-bufio.c
@@ -1360,62 +1360,80 @@ static int __cleanup_old_buffer(struct dm_buffer *b, gfp_t gfp,
 				unsigned long max_jiffies)
 {
 	if (jiffies - b->last_accessed < max_jiffies)
-		return 1;
+		return 0;
 
 	if (!(gfp & __GFP_IO)) {
 		if (test_bit(B_READING, &b->state) ||
 		    test_bit(B_WRITING, &b->state) ||
 		    test_bit(B_DIRTY, &b->state))
-			return 1;
+			return 0;
 	}
 
 	if (b->hold_count)
-		return 1;
+		return 0;
 
 	__make_buffer_clean(b);
 	__unlink_buffer(b);
 	__free_buffer_wake(b);
 
-	return 0;
+	return 1;
 }
 
-static void __scan(struct dm_bufio_client *c, unsigned long nr_to_scan,
-		   struct shrink_control *sc)
+static long __scan(struct dm_bufio_client *c, unsigned long nr_to_scan,
+		   gfp_t gfp_mask)
 {
 	int l;
 	struct dm_buffer *b, *tmp;
+	long freed = 0;
 
 	for (l = 0; l < LIST_SIZE; l++) {
-		list_for_each_entry_safe_reverse(b, tmp, &c->lru[l], lru_list)
-			if (!__cleanup_old_buffer(b, sc->gfp_mask, 0) &&
-			    !--nr_to_scan)
-				return;
+		list_for_each_entry_safe_reverse(b, tmp, &c->lru[l], lru_list) {
+			freed += __cleanup_old_buffer(b, gfp_mask, 0);
+			if (!--nr_to_scan)
+				return freed;
+		}
 		dm_bufio_cond_resched();
 	}
+	return freed;
 }
 
-static int shrink(struct shrinker *shrinker, struct shrink_control *sc)
+static long
+dm_bufio_shrink_scan(
+	struct shrinker		*shrink,
+	struct shrink_control	*sc)
 {
 	struct dm_bufio_client *c =
-	    container_of(shrinker, struct dm_bufio_client, shrinker);
-	unsigned long r;
-	unsigned long nr_to_scan = sc->nr_to_scan;
+	    container_of(shrink, struct dm_bufio_client, shrinker);
+	long freed;
 
 	if (sc->gfp_mask & __GFP_IO)
 		dm_bufio_lock(c);
 	else if (!dm_bufio_trylock(c))
-		return !nr_to_scan ? 0 : -1;
+		return -1;
 
-	if (nr_to_scan)
-		__scan(c, nr_to_scan, sc);
+	freed = __scan(c, sc->nr_to_scan, sc->gfp_mask);
+	dm_bufio_unlock(c);
+	return freed;
+}
 
-	r = c->n_buffers[LIST_CLEAN] + c->n_buffers[LIST_DIRTY];
-	if (r > INT_MAX)
-		r = INT_MAX;
+static long
+dm_bufio_shrink_count(
+	struct shrinker		*shrink,
+	struct shrink_control	*sc)
+{
+	struct dm_bufio_client *c =
+	    container_of(shrink, struct dm_bufio_client, shrinker);
+	long count;
+
+	if (sc->gfp_mask & __GFP_IO)
+		dm_bufio_lock(c);
+	else if (!dm_bufio_trylock(c))
+		return 0;
 
+	count = c->n_buffers[LIST_CLEAN] + c->n_buffers[LIST_DIRTY];
 	dm_bufio_unlock(c);
+	return count;
 
-	return r;
 }
 
 /*
@@ -1517,7 +1535,8 @@ struct dm_bufio_client *dm_bufio_client_create(struct block_device *bdev, unsign
 	__cache_size_refresh();
 	mutex_unlock(&dm_bufio_clients_lock);
 
-	c->shrinker.shrink = shrink;
+	c->shrinker.count_objects = dm_bufio_shrink_count;
+	c->shrinker.scan_objects = dm_bufio_shrink_scan;
 	c->shrinker.seeks = 1;
 	c->shrinker.batch = 0;
 	register_shrinker(&c->shrinker);
@@ -1604,7 +1623,7 @@ static void cleanup_old_buffers(void)
 			struct dm_buffer *b;
 			b = list_entry(c->lru[LIST_CLEAN].prev,
 				       struct dm_buffer, lru_list);
-			if (__cleanup_old_buffer(b, 0, max_age * HZ))
+			if (!__cleanup_old_buffer(b, 0, max_age * HZ))
 				break;
 			dm_bufio_cond_resched();
 		}
diff --git a/drivers/staging/android/ashmem.c b/drivers/staging/android/ashmem.c
index 634b9ae..30f9f8e 100644
--- a/drivers/staging/android/ashmem.c
+++ b/drivers/staging/android/ashmem.c
@@ -341,27 +341,28 @@ out:
 /*
  * ashmem_shrink - our cache shrinker, called from mm/vmscan.c :: shrink_slab
  *
- * 'nr_to_scan' is the number of objects (pages) to prune, or 0 to query how
- * many objects (pages) we have in total.
+ * 'nr_to_scan' is the number of objects to scan for freeing.
  *
  * 'gfp_mask' is the mask of the allocation that got us into this mess.
  *
- * Return value is the number of objects (pages) remaining, or -1 if we cannot
+ * Return value is the number of objects freed or -1 if we cannot
  * proceed without risk of deadlock (due to gfp_mask).
  *
  * We approximate LRU via least-recently-unpinned, jettisoning unpinned partial
  * chunks of ashmem regions LRU-wise one-at-a-time until we hit 'nr_to_scan'
  * pages freed.
  */
-static int ashmem_shrink(struct shrinker *s, struct shrink_control *sc)
+static long
+ashmem_shrink_scan(
+	struct shrinker		*shrink,
+	struct shrink_control	*sc)
 {
 	struct ashmem_range *range, *next;
+	long freed = 0;
 
 	/* We might recurse into filesystem code, so bail out if necessary */
-	if (sc->nr_to_scan && !(sc->gfp_mask & __GFP_FS))
+	if (!(sc->gfp_mask & __GFP_FS))
 		return -1;
-	if (!sc->nr_to_scan)
-		return lru_count;
 
 	mutex_lock(&ashmem_mutex);
 	list_for_each_entry_safe(range, next, &ashmem_lru_list, lru) {
@@ -374,17 +375,34 @@ static int ashmem_shrink(struct shrinker *s, struct shrink_control *sc)
 		range->purged = ASHMEM_WAS_PURGED;
 		lru_del(range);
 
-		sc->nr_to_scan -= range_size(range);
-		if (sc->nr_to_scan <= 0)
+		freed += range_size(range);
+		if (--sc->nr_to_scan <= 0)
 			break;
 	}
 	mutex_unlock(&ashmem_mutex);
+	return freed;
+}
 
+static long
+ashmem_shrink_count(
+	struct shrinker		*shrink,
+	struct shrink_control	*sc)
+{
+	/*
+	 * note that lru_count is count of pages on the lru, not a count of
+	 * objects on the list. This means the scan function needs to return the
+	 * number of pages freed, not the number of objects scanned.
+	 */
 	return lru_count;
 }
 
 static struct shrinker ashmem_shrinker = {
-	.shrink = ashmem_shrink,
+	.count_objects = ashmem_shrink_count,
+	.scan_objects = ashmem_shrink_scan,
+	/*
+	 * XXX (dchinner): I wish people would comment on why they need such
+	 * significant changes to the default value here
+	 */
 	.seeks = DEFAULT_SEEKS * 4,
 };
 
@@ -671,11 +689,9 @@ static long ashmem_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
 		if (capable(CAP_SYS_ADMIN)) {
 			struct shrink_control sc = {
 				.gfp_mask = GFP_KERNEL,
-				.nr_to_scan = 0,
+				.nr_to_scan = LONG_MAX,
 			};
-			ret = ashmem_shrink(&ashmem_shrinker, &sc);
-			sc.nr_to_scan = ret;
-			ashmem_shrink(&ashmem_shrinker, &sc);
+			ashmem_shrink_scan(&ashmem_shrinker, &sc);
 		}
 		break;
 	}
diff --git a/drivers/staging/android/lowmemorykiller.c b/drivers/staging/android/lowmemorykiller.c
index 3b91b0f..98a9a89 100644
--- a/drivers/staging/android/lowmemorykiller.c
+++ b/drivers/staging/android/lowmemorykiller.c
@@ -63,7 +63,15 @@ static unsigned long lowmem_deathpending_timeout;
 			printk(x);			\
 	} while (0)
 
-static int lowmem_shrink(struct shrinker *s, struct shrink_control *sc)
+static long lowmem_count(struct shrinker *s, struct shrink_control *sc)
+{
+	return global_page_state(NR_ACTIVE_ANON) +
+		global_page_state(NR_ACTIVE_FILE) +
+		global_page_state(NR_INACTIVE_ANON) +
+		global_page_state(NR_INACTIVE_FILE);
+}
+
+static long lowmem_scan(struct shrinker *s, struct shrink_control *sc)
 {
 	struct task_struct *tsk;
 	struct task_struct *selected = NULL;
@@ -89,19 +97,17 @@ static int lowmem_shrink(struct shrinker *s, struct shrink_control *sc)
 			break;
 		}
 	}
-	if (sc->nr_to_scan > 0)
-		lowmem_print(3, "lowmem_shrink %lu, %x, ofree %d %d, ma %hd\n",
-				sc->nr_to_scan, sc->gfp_mask, other_free,
-				other_file, min_score_adj);
-	rem = global_page_state(NR_ACTIVE_ANON) +
-		global_page_state(NR_ACTIVE_FILE) +
-		global_page_state(NR_INACTIVE_ANON) +
-		global_page_state(NR_INACTIVE_FILE);
-	if (sc->nr_to_scan <= 0 || min_score_adj == OOM_SCORE_ADJ_MAX + 1) {
-		lowmem_print(5, "lowmem_shrink %lu, %x, return %d\n",
-			     sc->nr_to_scan, sc->gfp_mask, rem);
-		return rem;
+
+	lowmem_print(3, "lowmem_scan %lu, %x, ofree %d %d, ma %hd\n",
+			sc->nr_to_scan, sc->gfp_mask, other_free,
+			other_file, min_score_adj);
+
+	if (min_score_adj == OOM_SCORE_ADJ_MAX + 1) {
+		lowmem_print(5, "lowmem_scan %lu, %x, return 0\n",
+			     sc->nr_to_scan, sc->gfp_mask);
+		return 0;
 	}
+
 	selected_oom_score_adj = min_score_adj;
 
 	rcu_read_lock();
@@ -151,16 +157,18 @@ static int lowmem_shrink(struct shrinker *s, struct shrink_control *sc)
 		lowmem_deathpending_timeout = jiffies + HZ;
 		send_sig(SIGKILL, selected, 0);
 		set_tsk_thread_flag(selected, TIF_MEMDIE);
-		rem -= selected_tasksize;
+		rem += selected_tasksize;
 	}
-	lowmem_print(4, "lowmem_shrink %lu, %x, return %d\n",
+
+	lowmem_print(4, "lowmem_scan %lu, %x, return %d\n",
 		     sc->nr_to_scan, sc->gfp_mask, rem);
 	rcu_read_unlock();
 	return rem;
 }
 
 static struct shrinker lowmem_shrinker = {
-	.shrink = lowmem_shrink,
+	.scan_objects = lowmem_scan,
+	.count_objects = lowmem_count,
 	.seeks = DEFAULT_SEEKS * 16
 };
 
diff --git a/drivers/staging/zcache/zcache-main.c b/drivers/staging/zcache/zcache-main.c
index 328898e..eae6b0a 100644
--- a/drivers/staging/zcache/zcache-main.c
+++ b/drivers/staging/zcache/zcache-main.c
@@ -1252,23 +1252,19 @@ static bool zcache_freeze;
  * pageframes in use.  FIXME POLICY: Probably the writeback should only occur
  * if the eviction doesn't free enough pages.
  */
-static int shrink_zcache_memory(struct shrinker *shrink,
-				struct shrink_control *sc)
+static long scan_zcache_memory(struct shrinker *shrink,
+			       struct shrink_control *sc)
 {
 	static bool in_progress;
-	int ret = -1;
-	int nr = sc->nr_to_scan;
 	int nr_evict = 0;
 	int nr_writeback = 0;
 	struct page *page;
 	int  file_pageframes_inuse, anon_pageframes_inuse;
-
-	if (nr <= 0)
-		goto skip_evict;
+	long freed = 0;
 
 	/* don't allow more than one eviction thread at a time */
 	if (in_progress)
-		goto skip_evict;
+		return 0;
 
 	in_progress = true;
 
@@ -1288,6 +1284,7 @@ static int shrink_zcache_memory(struct shrinker *shrink,
 		if (page == NULL)
 			break;
 		zcache_free_page(page);
+		freed++;
 	}
 
 	zcache_last_active_anon_pageframes =
@@ -1304,13 +1301,22 @@ static int shrink_zcache_memory(struct shrinker *shrink,
 #ifdef CONFIG_ZCACHE_WRITEBACK
 		int writeback_ret;
 		writeback_ret = zcache_frontswap_writeback();
-		if (writeback_ret == -ENOMEM)
+		if (writeback_ret != -ENOMEM)
+			freed++;
+		else
 #endif
 			break;
 	}
 	in_progress = false;
 
-skip_evict:
+	return freed;
+}
+
+static long count_zcache_memory(struct shrinker *shrink,
+				struct shrink_control *sc)
+{
+	int ret = -1;
+
 	/* resample: has changed, but maybe not all the way yet */
 	zcache_last_active_file_pageframes =
 		global_page_state(NR_LRU_BASE + LRU_ACTIVE_FILE);
@@ -1324,7 +1330,8 @@ skip_evict:
 }
 
 static struct shrinker zcache_shrinker = {
-	.shrink = shrink_zcache_memory,
+	.scan_objects = scan_zcache_memory,
+	.count_objects = count_zcache_memory,
 	.seeks = DEFAULT_SEEKS,
 };
 
-- 
1.8.1.4
* [PATCH v2 18/28] shrinker: convert remaining shrinkers to count/scan API
  2013-03-29  9:13 [PATCH v2 00/28] memcg-aware slab shrinking Glauber Costa
                   ` (16 preceding siblings ...)
  2013-03-29  9:13 ` [PATCH v2 17/28] drivers: convert shrinkers to new count/scan API Glauber Costa
@ 2013-03-29  9:14 ` Glauber Costa
  2013-03-29  9:14 ` [PATCH v2 19/28] hugepage: convert huge zero page shrinker to new shrinker API Glauber Costa
                   ` (10 subsequent siblings)
  28 siblings, 0 replies; 97+ messages in thread
From: Glauber Costa @ 2013-03-29  9:14 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-fsdevel, containers, Michal Hocko, Johannes Weiner,
	kamezawa.hiroyu, Andrew Morton, Dave Shrinnker, Greg Thelen,
	hughd, yinghan, Dave Chinner
From: Dave Chinner <dchinner@redhat.com>
Convert the remaining couple of random shrinkers in the tree to the
new API.
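
As a rough illustration of the count/scan split these conversions follow, here is a minimal userspace sketch. It is not kernel code: the demo_* names and the may_block flag (standing in for the gfp_mask checks) are assumptions made up for this example.

```c
#include <assert.h>

/* Hypothetical userspace stand-ins; these are not the kernel definitions. */
struct demo_shrink_control {
	long nr_to_scan;	/* how many objects scan may look at */
	int may_block;		/* stand-in for the gfp_mask checks */
};

struct demo_cache {
	long nr_objects;	/* freeable objects currently cached */
};

/* count: report how many objects could be freed; never returns -1. */
static long demo_count(struct demo_cache *c, struct demo_shrink_control *sc)
{
	(void)sc;
	return c->nr_objects;
}

/*
 * scan: free up to nr_to_scan objects and return the number freed,
 * or -1 if no progress can be made from this context.
 */
static long demo_scan(struct demo_cache *c, struct demo_shrink_control *sc)
{
	long freed = 0;

	assert(sc->nr_to_scan >= 0);
	if (!sc->may_block)
		return -1;

	while (sc->nr_to_scan > 0 && c->nr_objects > 0) {
		c->nr_objects--;
		sc->nr_to_scan--;
		freed++;
	}
	return freed;
}
```

The point of the split is that the return value of scan no longer doubles as a cache size, so the caller never has to diff two counts to learn how much was freed.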
Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 arch/x86/kvm/mmu.c | 35 +++++++++++++++++++++++++----------
 net/sunrpc/auth.c  | 45 +++++++++++++++++++++++++++++++--------------
 2 files changed, 56 insertions(+), 24 deletions(-)
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 956ca35..bebb8b6 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -4185,26 +4185,28 @@ restart:
 	spin_unlock(&kvm->mmu_lock);
 }
 
-static void kvm_mmu_remove_some_alloc_mmu_pages(struct kvm *kvm,
+static long kvm_mmu_remove_some_alloc_mmu_pages(struct kvm *kvm,
 						struct list_head *invalid_list)
 {
 	struct kvm_mmu_page *page;
 
 	if (list_empty(&kvm->arch.active_mmu_pages))
-		return;
+		return 0;
 
 	page = container_of(kvm->arch.active_mmu_pages.prev,
 			    struct kvm_mmu_page, link);
-	kvm_mmu_prepare_zap_page(kvm, page, invalid_list);
+	return kvm_mmu_prepare_zap_page(kvm, page, invalid_list);
 }
 
-static int mmu_shrink(struct shrinker *shrink, struct shrink_control *sc)
+
+static long
+mmu_shrink_scan(
+	struct shrinker		*shrink,
+	struct shrink_control	*sc)
 {
 	struct kvm *kvm;
 	int nr_to_scan = sc->nr_to_scan;
-
-	if (nr_to_scan == 0)
-		goto out;
+	long freed = 0;
 
 	raw_spin_lock(&kvm_lock);
 
@@ -4232,24 +4234,37 @@ static int mmu_shrink(struct shrinker *shrink, struct shrink_control *sc)
 		idx = srcu_read_lock(&kvm->srcu);
 		spin_lock(&kvm->mmu_lock);
 
-		kvm_mmu_remove_some_alloc_mmu_pages(kvm, &invalid_list);
+		freed += kvm_mmu_remove_some_alloc_mmu_pages(kvm, &invalid_list);
 		kvm_mmu_commit_zap_page(kvm, &invalid_list);
 
 		spin_unlock(&kvm->mmu_lock);
 		srcu_read_unlock(&kvm->srcu, idx);
 
+		/*
+		 * unfair on small ones
+		 * per-vm shrinkers cry out
+		 * sadness comes quickly
+		 */
 		list_move_tail(&kvm->vm_list, &vm_list);
 		break;
 	}
 
 	raw_spin_unlock(&kvm_lock);
+	return freed;
 
-out:
+}
+
+static long
+mmu_shrink_count(
+	struct shrinker		*shrink,
+	struct shrink_control	*sc)
+{
 	return percpu_counter_read_positive(&kvm_total_used_mmu_pages);
 }
 
 static struct shrinker mmu_shrinker = {
-	.shrink = mmu_shrink,
+	.count_objects = mmu_shrink_count,
+	.scan_objects = mmu_shrink_scan,
 	.seeks = DEFAULT_SEEKS * 10,
 };
 
diff --git a/net/sunrpc/auth.c b/net/sunrpc/auth.c
index f529404..f340090 100644
--- a/net/sunrpc/auth.c
+++ b/net/sunrpc/auth.c
@@ -340,12 +340,13 @@ EXPORT_SYMBOL_GPL(rpcauth_destroy_credcache);
 /*
  * Remove stale credentials. Avoid sleeping inside the loop.
  */
-static int
+static long
 rpcauth_prune_expired(struct list_head *free, int nr_to_scan)
 {
 	spinlock_t *cache_lock;
 	struct rpc_cred *cred, *next;
 	unsigned long expired = jiffies - RPC_AUTH_EXPIRY_MORATORIUM;
+	long freed = 0;
 
 	list_for_each_entry_safe(cred, next, &cred_unused, cr_lru) {
 
@@ -357,10 +358,11 @@ rpcauth_prune_expired(struct list_head *free, int nr_to_scan)
 		 */
 		if (time_in_range(cred->cr_expire, expired, jiffies) &&
 		    test_bit(RPCAUTH_CRED_HASHED, &cred->cr_flags) != 0)
-			return 0;
+			break;
 
 		list_del_init(&cred->cr_lru);
 		number_cred_unused--;
+		freed++;
 		if (atomic_read(&cred->cr_count) != 0)
 			continue;
 
@@ -373,29 +375,43 @@ rpcauth_prune_expired(struct list_head *free, int nr_to_scan)
 		}
 		spin_unlock(cache_lock);
 	}
-	return (number_cred_unused / 100) * sysctl_vfs_cache_pressure;
+	return freed;
 }
 
 /*
  * Run memory cache shrinker.
  */
-static int
-rpcauth_cache_shrinker(struct shrinker *shrink, struct shrink_control *sc)
+static long
+rpcauth_cache_shrink_scan(
+	struct shrinker		*shrink,
+	struct shrink_control	*sc)
+
 {
 	LIST_HEAD(free);
-	int res;
-	int nr_to_scan = sc->nr_to_scan;
-	gfp_t gfp_mask = sc->gfp_mask;
+	long freed;
+
+	if ((sc->gfp_mask & GFP_KERNEL) != GFP_KERNEL)
+		return -1;
 
-	if ((gfp_mask & GFP_KERNEL) != GFP_KERNEL)
-		return (nr_to_scan == 0) ? 0 : -1;
+	/* nothing left, don't come back */
 	if (list_empty(&cred_unused))
-		return 0;
+		return -1;
+
 	spin_lock(&rpc_credcache_lock);
-	res = rpcauth_prune_expired(&free, nr_to_scan);
+	freed = rpcauth_prune_expired(&free, sc->nr_to_scan);
 	spin_unlock(&rpc_credcache_lock);
 	rpcauth_destroy_credlist(&free);
-	return res;
+
+	return freed;
+}
+
+static long
+rpcauth_cache_shrink_count(
+	struct shrinker		*shrink,
+	struct shrink_control	*sc)
+
+{
+	return (number_cred_unused / 100) * sysctl_vfs_cache_pressure;
 }
 
 /*
@@ -711,7 +727,8 @@ rpcauth_uptodatecred(struct rpc_task *task)
 }
 
 static struct shrinker rpc_cred_shrinker = {
-	.shrink = rpcauth_cache_shrinker,
+	.count_objects = rpcauth_cache_shrink_count,
+	.scan_objects = rpcauth_cache_shrink_scan,
 	.seeks = DEFAULT_SEEKS,
 };
 
-- 
1.8.1.4
* [PATCH v2 19/28] hugepage: convert huge zero page shrinker to new shrinker API
  2013-03-29  9:13 [PATCH v2 00/28] memcg-aware slab shrinking Glauber Costa
                   ` (17 preceding siblings ...)
  2013-03-29  9:14 ` [PATCH v2 18/28] shrinker: convert remaining shrinkers to " Glauber Costa
@ 2013-03-29  9:14 ` Glauber Costa
  2013-03-29  9:14 ` [PATCH v2 20/28] shrinker: Kill old ->shrink API Glauber Costa
                   ` (9 subsequent siblings)
  28 siblings, 0 replies; 97+ messages in thread
From: Glauber Costa @ 2013-03-29  9:14 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-fsdevel, containers, Michal Hocko, Johannes Weiner,
	kamezawa.hiroyu, Andrew Morton, Dave Shrinnker, Greg Thelen,
	hughd, yinghan, Glauber Costa, Dave Chinner
Convert the huge zero page shrinker to the count/scan shrinker API.
The conversion consists of:
* returning long instead of int
* separating count from scan
* returning the number of freed entities in scan
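
The three points above can be sketched in userspace with C11 atomics. This is only an approximation of the idea: demo_refcount and DEMO_PAGES_PER_HUGE are hypothetical stand-ins for huge_zero_refcount and HPAGE_PMD_NR, and no page is actually freed.

```c
#include <assert.h>
#include <stdatomic.h>

#define DEMO_PAGES_PER_HUGE 512	/* hypothetical stand-in for HPAGE_PMD_NR */

/* A value of 1 means only the last reference remains. */
static atomic_int demo_refcount;

/* count: the page is freeable only if the last reference remains. */
static long demo_zero_page_count(void)
{
	assert(atomic_load(&demo_refcount) >= 0);
	return atomic_load(&demo_refcount) == 1 ? DEMO_PAGES_PER_HUGE : 0;
}

/* scan: drop the last reference atomically and report what was freed. */
static long demo_zero_page_scan(void)
{
	int expected = 1;

	/* Succeeds only if nobody took a reference since count ran. */
	if (atomic_compare_exchange_strong(&demo_refcount, &expected, 0))
		return DEMO_PAGES_PER_HUGE;	/* the real code frees the page here */
	return 0;
}
```

Because count and scan are separate calls, the state can change between them; the compare-and-exchange in scan is what keeps the result honest in that case.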
Signed-off-by: Glauber Costa <glommer@parallels.com>
CC: Dave Chinner <dchinner@redhat.com>
---
 mm/huge_memory.c | 18 ++++++++++++------
 1 file changed, 12 insertions(+), 6 deletions(-)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index e2f7f5aa..8bf43d3 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -212,24 +212,30 @@ static void put_huge_zero_page(void)
 	BUG_ON(atomic_dec_and_test(&huge_zero_refcount));
 }
 
-static int shrink_huge_zero_page(struct shrinker *shrink,
-		struct shrink_control *sc)
+
+static long shrink_huge_zero_page_count(struct shrinker *shrink,
+					struct shrink_control *sc)
 {
-	if (!sc->nr_to_scan)
-		/* we can free zero page only if last reference remains */
-		return atomic_read(&huge_zero_refcount) == 1 ? HPAGE_PMD_NR : 0;
+	/* we can free zero page only if last reference remains */
+	return atomic_read(&huge_zero_refcount) == 1 ? HPAGE_PMD_NR : 0;
+}
 
+static long shrink_huge_zero_page_scan(struct shrinker *shrink,
+				       struct shrink_control *sc)
+{
 	if (atomic_cmpxchg(&huge_zero_refcount, 1, 0) == 1) {
 		unsigned long zero_pfn = xchg(&huge_zero_pfn, 0);
 		BUG_ON(zero_pfn == 0);
 		__free_page(__pfn_to_page(zero_pfn));
+		return HPAGE_PMD_NR;
 	}
 
 	return 0;
 }
 
 static struct shrinker huge_zero_page_shrinker = {
-	.shrink = shrink_huge_zero_page,
+	.scan_objects = shrink_huge_zero_page_scan,
+	.count_objects = shrink_huge_zero_page_count,
 	.seeks = DEFAULT_SEEKS,
 };
 
-- 
1.8.1.4
* [PATCH v2 20/28] shrinker: Kill old ->shrink API.
  2013-03-29  9:13 [PATCH v2 00/28] memcg-aware slab shrinking Glauber Costa
                   ` (18 preceding siblings ...)
  2013-03-29  9:14 ` [PATCH v2 19/28] hugepage: convert huge zero page shrinker to new shrinker API Glauber Costa
@ 2013-03-29  9:14 ` Glauber Costa
  2013-03-29  9:14 ` [PATCH v2 21/28] vmscan: also shrink slab in memcg pressure Glauber Costa
                   ` (8 subsequent siblings)
  28 siblings, 0 replies; 97+ messages in thread
From: Glauber Costa @ 2013-03-29  9:14 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-fsdevel, containers, Michal Hocko, Johannes Weiner,
	kamezawa.hiroyu, Andrew Morton, Dave Shrinnker, Greg Thelen,
	hughd, yinghan, Dave Chinner
From: Dave Chinner <dchinner@redhat.com>
There are no more users of this API, so kill it dead, dead, dead and
quietly bury the corpse in a shallow, unmarked grave in a dark
forest deep in the hills...
[ glommer: added flowers to the grave ]
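
With the old callback gone, the caller's loop reduces to roughly the following. This is a userspace sketch, not the code in mm/vmscan.c: the demo_* types are invented here, and the do/while loop condition is an assumption about how batches are consumed.

```c
#include <assert.h>

struct demo_control {
	long nr_to_scan;
};

struct demo_shrinker {
	long (*count_objects)(struct demo_shrinker *, struct demo_control *);
	long (*scan_objects)(struct demo_shrinker *, struct demo_control *);
	long batch;		/* 0 = use the default batch size */
	long nr_objects;	/* demo-only cache state */
};

#define DEMO_BATCH 128

/* Run one shrinker in batches; returns the total number of objects freed. */
static long demo_run_shrinker(struct demo_shrinker *s,
			      struct demo_control *sc, long total_scan)
{
	long batch = s->batch ? s->batch : DEMO_BATCH;
	long freed = 0;

	assert(s->count_objects && s->scan_objects);
	if (s->count_objects(s, sc) <= 0)
		return 0;

	do {
		long ret;

		sc->nr_to_scan = batch;	/* reset before every call */
		ret = s->scan_objects(s, sc);
		if (ret == -1)		/* callee cannot make progress */
			break;
		freed += ret;
		total_scan -= batch;
	} while (total_scan >= batch);

	return freed;
}

/* Demo callbacks operating on the nr_objects field. */
static long demo_count_cb(struct demo_shrinker *s, struct demo_control *sc)
{
	(void)sc;
	return s->nr_objects;
}

static long demo_scan_cb(struct demo_shrinker *s, struct demo_control *sc)
{
	long n = sc->nr_to_scan < s->nr_objects ? sc->nr_to_scan
						: s->nr_objects;

	s->nr_objects -= n;
	return n;
}

static long demo_scan_stuck(struct demo_shrinker *s, struct demo_control *sc)
{
	(void)s;
	(void)sc;
	return -1;
}
```

Note that there is no "before" and "after" sampling of the cache size any more: the freed total is simply the sum of what each scan call reports.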
Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 include/linux/shrinker.h      | 15 +++++----------
 include/trace/events/vmscan.h |  4 ++--
 mm/vmscan.c                   | 40 ++++++++--------------------------------
 3 files changed, 15 insertions(+), 44 deletions(-)
diff --git a/include/linux/shrinker.h b/include/linux/shrinker.h
index e71286f..d4636a0 100644
--- a/include/linux/shrinker.h
+++ b/include/linux/shrinker.h
@@ -7,14 +7,15 @@
  *
  * The 'gfpmask' refers to the allocation we are currently trying to
  * fulfil.
- *
- * Note that 'shrink' will be passed nr_to_scan == 0 when the VM is
- * querying the cache size, so a fastpath for that case is appropriate.
  */
 struct shrink_control {
 	gfp_t gfp_mask;
 
-	/* How many slab objects shrinker() should scan and try to reclaim */
+	/*
+	 * How many objects scan_objects should scan and try to reclaim.
+	 * This is reset before every call, so it is safe for callees
+	 * to modify.
+	 */
 	long nr_to_scan;
 
 	/* shrink from these nodes */
@@ -24,11 +25,6 @@ struct shrink_control {
 /*
  * A callback you can register to apply pressure to ageable caches.
  *
- * @shrink() should look through the least-recently-used 'nr_to_scan' entries
- * and attempt to free them up.  It should return the number of objects which
- * remain in the cache.  If it returns -1, it means it cannot do any scanning at
- * this time (eg. there is a risk of deadlock).
- *
  * @count_objects should return the number of freeable items in the cache. If
  * there are no objects to free or the number of freeable items cannot be
  * determined, it should return 0. No deadlock checks should be done during the
@@ -44,7 +40,6 @@ struct shrink_control {
  * @scan_objects will be made from the current reclaim context.
  */
 struct shrinker {
-	int (*shrink)(struct shrinker *, struct shrink_control *sc);
 	long (*count_objects)(struct shrinker *, struct shrink_control *sc);
 	long (*scan_objects)(struct shrinker *, struct shrink_control *sc);
 
diff --git a/include/trace/events/vmscan.h b/include/trace/events/vmscan.h
index 63cfccc..132a985 100644
--- a/include/trace/events/vmscan.h
+++ b/include/trace/events/vmscan.h
@@ -202,7 +202,7 @@ TRACE_EVENT(mm_shrink_slab_start,
 
 	TP_fast_assign(
 		__entry->shr = shr;
-		__entry->shrink = shr->shrink;
+		__entry->shrink = shr->scan_objects;
 		__entry->nr_objects_to_shrink = nr_objects_to_shrink;
 		__entry->gfp_flags = sc->gfp_mask;
 		__entry->pgs_scanned = pgs_scanned;
@@ -241,7 +241,7 @@ TRACE_EVENT(mm_shrink_slab_end,
 
 	TP_fast_assign(
 		__entry->shr = shr;
-		__entry->shrink = shr->shrink;
+		__entry->shrink = shr->scan_objects;
 		__entry->unused_scan = unused_scan_cnt;
 		__entry->new_scan = new_scan_cnt;
 		__entry->retval = shrinker_retval;
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 6926e09..232dfcb 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -176,14 +176,6 @@ void unregister_shrinker(struct shrinker *shrinker)
 }
 EXPORT_SYMBOL(unregister_shrinker);
 
-static inline int do_shrinker_shrink(struct shrinker *shrinker,
-				     struct shrink_control *sc,
-				     unsigned long nr_to_scan)
-{
-	sc->nr_to_scan = nr_to_scan;
-	return (*shrinker->shrink)(shrinker, sc);
-}
-
 #define SHRINK_BATCH 128
 /*
  * Call the shrink functions to age shrinkable caches
@@ -229,11 +221,8 @@ unsigned long shrink_slab(struct shrink_control *sc,
 		long batch_size = shrinker->batch ? shrinker->batch
 						  : SHRINK_BATCH;
 
-		if (shrinker->scan_objects) {
-			max_pass = shrinker->count_objects(shrinker, sc);
-			WARN_ON(max_pass < 0);
-		} else
-			max_pass = do_shrinker_shrink(shrinker, sc, 0);
+		max_pass = shrinker->count_objects(shrinker, sc);
+		WARN_ON(max_pass < 0);
 		if (max_pass <= 0)
 			continue;
 
@@ -252,7 +241,7 @@ unsigned long shrink_slab(struct shrink_control *sc,
 		if (total_scan < 0) {
 			printk(KERN_ERR
 			"shrink_slab: %pF negative objects to delete nr=%ld\n",
-			       shrinker->shrink, total_scan);
+			       shrinker->scan_objects, total_scan);
 			total_scan = max_pass;
 		}
 
@@ -286,24 +275,11 @@ unsigned long shrink_slab(struct shrink_control *sc,
 		do {
 			long ret;
 
-			if (shrinker->scan_objects) {
-				sc->nr_to_scan = batch_size;
-				ret = shrinker->scan_objects(shrinker, sc);
-
-				if (ret == -1)
-					break;
-				freed += ret;
-			} else {
-				int nr_before;
-
-				nr_before = do_shrinker_shrink(shrinker, sc, 0);
-				ret = do_shrinker_shrink(shrinker, sc,
-								batch_size);
-				if (ret == -1)
-					break;
-				if (ret < nr_before)
-					freed += nr_before - ret;
-			}
+			sc->nr_to_scan = batch_size;
+			ret = shrinker->scan_objects(shrinker, sc);
+			if (ret == -1)
+				break;
+			freed += ret;
 
 			count_vm_events(SLABS_SCANNED, batch_size);
 			total_scan -= batch_size;
-- 
1.8.1.4
* [PATCH v2 21/28] vmscan: also shrink slab in memcg pressure
  2013-03-29  9:13 [PATCH v2 00/28] memcg-aware slab shrinking Glauber Costa
                   ` (19 preceding siblings ...)
  2013-03-29  9:14 ` [PATCH v2 20/28] shrinker: Kill old ->shrink API Glauber Costa
@ 2013-03-29  9:14 ` Glauber Costa
  2013-04-01  7:46   ` Kamezawa Hiroyuki
  2013-04-03 10:11   ` Sha Zhengju
  2013-03-29  9:14 ` [PATCH v2 22/28] memcg,list_lru: duplicate LRUs upon kmemcg creation Glauber Costa
                   ` (7 subsequent siblings)
  28 siblings, 2 replies; 97+ messages in thread
From: Glauber Costa @ 2013-03-29  9:14 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-fsdevel, containers, Michal Hocko, Johannes Weiner,
	kamezawa.hiroyu, Andrew Morton, Dave Shrinnker, Greg Thelen,
	hughd, yinghan, Glauber Costa, Dave Chinner, Mel Gorman,
	Rik van Riel
Without the surrounding infrastructure, this patch is a bit of a hammer:
it will basically shrink objects from all memcgs under memcg pressure.
At least, however, we will keep the scan limited to the shrinkers marked
as per-memcg.
Future patches will implement the in-shrinker logic to filter objects
based on their memcg association.
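
The filtering rule described above can be sketched in userspace as follows. The field names mirror the ones this patch adds (target_mem_cgroup, memcg_shrinker), but the demo_* types and helper functions are hypothetical and exist only for this illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct demo_memcg {
	int id;
};

struct demo_control {
	struct demo_memcg *target_mem_cgroup;	/* NULL means global reclaim */
};

struct demo_shrinker {
	const char *name;
	bool memcg_shrinker;	/* marked as memcg-aware */
};

/* Global reclaim runs every shrinker; targeted reclaim only memcg-aware ones. */
static bool demo_should_run(const struct demo_shrinker *s,
			    const struct demo_control *sc)
{
	assert(s && sc);
	if (sc->target_mem_cgroup && !s->memcg_shrinker)
		return false;
	return true;
}

/* Count how many shrinkers in the array would be called for this pass. */
static size_t demo_count_runnable(const struct demo_shrinker *v, size_t n,
				  const struct demo_control *sc)
{
	size_t i, hits = 0;

	for (i = 0; i < n; i++)
		if (demo_should_run(&v[i], sc))
			hits++;
	return hits;
}
```

This only decides which shrinkers are called. Until the shrinkers themselves filter by memcg, a shrinker that passes the check still frees objects belonging to every group, which is the "hammer" behaviour mentioned above.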
Signed-off-by: Glauber Costa <glommer@parallels.com>
Cc: Dave Chinner <dchinner@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Hugh Dickins <hughd@google.com>
Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
---
 include/linux/memcontrol.h | 17 +++++++++++++++++
 include/linux/shrinker.h   |  4 ++++
 mm/memcontrol.c            | 16 +++++++++++++++-
 mm/vmscan.c                | 46 +++++++++++++++++++++++++++++++++++++++++++---
 4 files changed, 79 insertions(+), 4 deletions(-)
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index d6183f0..4c24249 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -199,6 +199,9 @@ void mem_cgroup_split_huge_fixup(struct page *head);
 bool mem_cgroup_bad_page_check(struct page *page);
 void mem_cgroup_print_bad_page(struct page *page);
 #endif
+
+unsigned long
+memcg_zone_reclaimable_pages(struct mem_cgroup *memcg, struct zone *zone);
 #else /* CONFIG_MEMCG */
 struct mem_cgroup;
 
@@ -377,6 +380,12 @@ static inline void mem_cgroup_replace_page_cache(struct page *oldpage,
 				struct page *newpage)
 {
 }
+
+static inline unsigned long
+memcg_zone_reclaimable_pages(struct mem_cgroup *memcg, struct zone *zone)
+{
+	return 0;
+}
 #endif /* CONFIG_MEMCG */
 
 #if !defined(CONFIG_MEMCG) || !defined(CONFIG_DEBUG_VM)
@@ -429,6 +438,8 @@ static inline bool memcg_kmem_enabled(void)
 	return static_key_false(&memcg_kmem_enabled_key);
 }
 
+bool memcg_kmem_is_active(struct mem_cgroup *memcg);
+
 /*
  * In general, we'll do everything in our power to not incur in any overhead
  * for non-memcg users for the kmem functions. Not even a function call, if we
@@ -562,6 +573,12 @@ memcg_kmem_get_cache(struct kmem_cache *cachep, gfp_t gfp)
 	return __memcg_kmem_get_cache(cachep, gfp);
 }
 #else
+
+static inline bool memcg_kmem_is_active(struct mem_cgroup *memcg)
+{
+	return false;
+}
+
 #define for_each_memcg_cache_index(_idx)	\
 	for (; NULL; )
 
diff --git a/include/linux/shrinker.h b/include/linux/shrinker.h
index d4636a0..4e9e53b 100644
--- a/include/linux/shrinker.h
+++ b/include/linux/shrinker.h
@@ -20,6 +20,9 @@ struct shrink_control {
 
 	/* shrink from these nodes */
 	nodemask_t nodes_to_scan;
+
+	/* reclaim from this memcg only (if not NULL) */
+	struct mem_cgroup *target_mem_cgroup;
 };
 
 /*
@@ -45,6 +48,7 @@ struct shrinker {
 
 	int seeks;	/* seeks to recreate an obj */
 	long batch;	/* reclaim batch size, 0 = default */
+	bool memcg_shrinker; /* memcg-aware shrinker */
 
 	/* These are for internal use */
 	struct list_head list;
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 2b55222..ecdae39 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -386,7 +386,7 @@ static inline void memcg_kmem_set_active(struct mem_cgroup *memcg)
 	set_bit(KMEM_ACCOUNTED_ACTIVE, &memcg->kmem_account_flags);
 }
 
-static bool memcg_kmem_is_active(struct mem_cgroup *memcg)
+bool memcg_kmem_is_active(struct mem_cgroup *memcg)
 {
 	return test_bit(KMEM_ACCOUNTED_ACTIVE, &memcg->kmem_account_flags);
 }
@@ -942,6 +942,20 @@ mem_cgroup_zone_nr_lru_pages(struct mem_cgroup *memcg, int nid, int zid,
 	return ret;
 }
 
+unsigned long
+memcg_zone_reclaimable_pages(struct mem_cgroup *memcg, struct zone *zone)
+{
+	int nid = zone_to_nid(zone);
+	int zid = zone_idx(zone);
+	unsigned long val;
+
+	val = mem_cgroup_zone_nr_lru_pages(memcg, nid, zid, LRU_ALL_FILE);
+	if (do_swap_account)
+		val += mem_cgroup_zone_nr_lru_pages(memcg, nid, zid,
+						    LRU_ALL_ANON);
+	return val;
+}
+
 static unsigned long
 mem_cgroup_node_nr_lru_pages(struct mem_cgroup *memcg,
 			int nid, unsigned int lru_mask)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 232dfcb..43928fd 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -138,11 +138,42 @@ static bool global_reclaim(struct scan_control *sc)
 {
 	return !sc->target_mem_cgroup;
 }
+
+/*
+ * kmem reclaim should usually not be triggered when we are doing targeted
+ * reclaim. It is only valid when global reclaim is triggered, or when the
+ * underlying memcg has kmem objects.
+ */
+static bool has_kmem_reclaim(struct scan_control *sc)
+{
+	return !sc->target_mem_cgroup ||
+		memcg_kmem_is_active(sc->target_mem_cgroup);
+}
+
+static unsigned long
+zone_nr_reclaimable_pages(struct scan_control *sc, struct zone *zone)
+{
+	if (global_reclaim(sc))
+		return zone_reclaimable_pages(zone);
+	return memcg_zone_reclaimable_pages(sc->target_mem_cgroup, zone);
+}
+
 #else
 static bool global_reclaim(struct scan_control *sc)
 {
 	return true;
 }
+
+static bool has_kmem_reclaim(struct scan_control *sc)
+{
+	return true;
+}
+
+static unsigned long
+zone_nr_reclaimable_pages(struct scan_control *sc, struct zone *zone)
+{
+	return zone_reclaimable_pages(zone);
+}
 #endif
 
 static unsigned long get_lru_size(struct lruvec *lruvec, enum lru_list lru)
@@ -221,6 +252,14 @@ unsigned long shrink_slab(struct shrink_control *sc,
 		long batch_size = shrinker->batch ? shrinker->batch
 						  : SHRINK_BATCH;
 
+		/*
+		 * If we don't have a target mem cgroup, we scan them all.
+		 * Otherwise we will limit our scan to shrinkers marked as
+		 * memcg aware
+		 */
+		if (sc->target_mem_cgroup && !shrinker->memcg_shrinker)
+			continue;
+
 		max_pass = shrinker->count_objects(shrinker, sc);
 		WARN_ON(max_pass < 0);
 		if (max_pass <= 0)
@@ -2163,9 +2202,9 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
 
 		/*
 		 * Don't shrink slabs when reclaiming memory from
-		 * over limit cgroups
+		 * over limit cgroups, unless we know they have kmem objects
 		 */
-		if (global_reclaim(sc)) {
+		if (has_kmem_reclaim(sc)) {
 			unsigned long lru_pages = 0;
 
 			nodes_clear(shrink->nodes_to_scan);
@@ -2174,7 +2213,7 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
 				if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL))
 					continue;
 
-				lru_pages += zone_reclaimable_pages(zone);
+				lru_pages += zone_nr_reclaimable_pages(sc, zone);
 				node_set(zone_to_nid(zone),
 					 shrink->nodes_to_scan);
 			}
@@ -2443,6 +2482,7 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
 	};
 	struct shrink_control shrink = {
 		.gfp_mask = sc.gfp_mask,
+		.target_mem_cgroup = memcg,
 	};
 
 	/*
-- 
1.8.1.4
* [PATCH v2 22/28] memcg,list_lru: duplicate LRUs upon kmemcg creation
  2013-03-29  9:13 [PATCH v2 00/28] memcg-aware slab shrinking Glauber Costa
                   ` (20 preceding siblings ...)
  2013-03-29  9:14 ` [PATCH v2 21/28] vmscan: also shrink slab in memcg pressure Glauber Costa
@ 2013-03-29  9:14 ` Glauber Costa
  2013-04-01  8:05   ` Kamezawa Hiroyuki
  2013-03-29  9:14 ` [PATCH v2 23/28] lru: add an element to a memcg list Glauber Costa
                   ` (6 subsequent siblings)
  28 siblings, 1 reply; 97+ messages in thread
From: Glauber Costa @ 2013-03-29  9:14 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-fsdevel, containers, Michal Hocko, Johannes Weiner,
	kamezawa.hiroyu, Andrew Morton, Dave Shrinnker, Greg Thelen,
	hughd, yinghan, Glauber Costa, Dave Chinner, Mel Gorman,
	Rik van Riel
When a new memcg is created, we need to open up room for its descriptors
in all of the list_lrus that are marked per-memcg. The process is quite
similar to the one we are using for the kmem caches: we initialize the
new structures in an array indexed by kmemcg_id, and grow the array if
needed. Key data like the size of the array will be shared between the
kmem cache code and the list_lru code, since they basically describe the
same thing.
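
The "array indexed by kmemcg_id, grown if needed" scheme can be pictured with a small userspace sketch. Everything here is an assumption for illustration: the demo_* names, the doubling growth policy and the initial size of 4 are not taken from the patch, and the real code has to deal with locking and concurrent readers, which this sketch ignores.

```c
#include <assert.h>
#include <stdlib.h>

struct demo_lru_array {
	long nr_items;
};

struct demo_lru {
	struct demo_lru_array **memcg_lrus;	/* indexed by memcg id */
	size_t size;				/* number of slots allocated */
};

/* Make sure slot 'id' exists and is populated; 0 on success, -1 on failure. */
static int demo_lru_add_memcg(struct demo_lru *lru, size_t id)
{
	assert(lru != NULL);

	if (id >= lru->size) {
		size_t new_size = lru->size ? lru->size : 4;
		struct demo_lru_array **tmp;
		size_t i;

		while (new_size <= id)
			new_size *= 2;
		tmp = realloc(lru->memcg_lrus, new_size * sizeof(*tmp));
		if (!tmp)
			return -1;
		/* Newly added slots start out empty. */
		for (i = lru->size; i < new_size; i++)
			tmp[i] = NULL;
		lru->memcg_lrus = tmp;
		lru->size = new_size;
	}

	if (!lru->memcg_lrus[id]) {
		lru->memcg_lrus[id] = calloc(1, sizeof(**lru->memcg_lrus));
		if (!lru->memcg_lrus[id])
			return -1;
	}
	return 0;
}

/* Free every per-memcg entry and the index array itself. */
static void demo_lru_destroy(struct demo_lru *lru)
{
	size_t i;

	for (i = 0; i < lru->size; i++)
		free(lru->memcg_lrus[i]);
	free(lru->memcg_lrus);
	lru->memcg_lrus = NULL;
	lru->size = 0;
}
```

Growing the index array moves only pointers, so entries that already exist keep their addresses across a resize.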
Signed-off-by: Glauber Costa <glommer@parallels.com>
Cc: Dave Chinner <dchinner@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Hugh Dickins <hughd@google.com>
Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
---
 include/linux/list_lru.h   |  37 ++++++++++-
 include/linux/memcontrol.h |  12 ++++
 lib/list_lru.c             | 101 +++++++++++++++++++++++++++---
 mm/memcontrol.c            | 151 +++++++++++++++++++++++++++++++++++++++++++--
 mm/slab_common.c           |   1 -
 5 files changed, 285 insertions(+), 17 deletions(-)
diff --git a/include/linux/list_lru.h b/include/linux/list_lru.h
index 02796da..d6cf126 100644
--- a/include/linux/list_lru.h
+++ b/include/linux/list_lru.h
@@ -16,12 +16,47 @@ struct list_lru_node {
 	long			nr_items;
 } ____cacheline_aligned_in_smp;
 
+/*
+ * This is supposed to be M x N matrix, where M is kmem-limited memcg,
+ * and N is the number of nodes.
+ */
+struct list_lru_array {
+	struct list_lru_node node[1];
+};
+
 struct list_lru {
 	struct list_lru_node	node[MAX_NUMNODES];
 	nodemask_t		active_nodes;
+#ifdef CONFIG_MEMCG_KMEM
+	struct list_head	lrus;
+	struct list_lru_array	**memcg_lrus;
+#endif
 };
 
-int list_lru_init(struct list_lru *lru);
+struct mem_cgroup;
+#ifdef CONFIG_MEMCG_KMEM
+struct list_lru_array *lru_alloc_array(void);
+int memcg_update_all_lrus(unsigned long num);
+void list_lru_destroy(struct list_lru *lru);
+void list_lru_destroy_memcg(struct mem_cgroup *memcg);
+int __memcg_init_lru(struct list_lru *lru);
+#else
+static inline void list_lru_destroy(struct list_lru *lru)
+{
+}
+#endif
+
+int __list_lru_init(struct list_lru *lru, bool memcg_enabled);
+static inline int list_lru_init(struct list_lru *lru)
+{
+	return __list_lru_init(lru, false);
+}
+
+static inline int list_lru_init_memcg(struct list_lru *lru)
+{
+	return __list_lru_init(lru, true);
+}
+
 int list_lru_add(struct list_lru *lru, struct list_head *item);
 int list_lru_del(struct list_lru *lru, struct list_head *item);
 long list_lru_count_nodemask(struct list_lru *lru, nodemask_t *nodes_to_count);
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 4c24249..ee3199d 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -23,6 +23,7 @@
 #include <linux/vm_event_item.h>
 #include <linux/hardirq.h>
 #include <linux/jump_label.h>
+#include <linux/list_lru.h>
 
 struct mem_cgroup;
 struct page_cgroup;
@@ -469,6 +470,12 @@ void memcg_update_array_size(int num_groups);
 struct kmem_cache *
 __memcg_kmem_get_cache(struct kmem_cache *cachep, gfp_t gfp);
 
+int memcg_new_lru(struct list_lru *lru);
+int memcg_init_lru(struct list_lru *lru);
+
+int memcg_kmem_update_lru_size(struct list_lru *lru, int num_groups,
+			       bool new_lru);
+
 void mem_cgroup_destroy_cache(struct kmem_cache *cachep);
 void kmem_cache_destroy_memcg_children(struct kmem_cache *s);
 
@@ -632,6 +639,11 @@ memcg_kmem_get_cache(struct kmem_cache *cachep, gfp_t gfp)
 static inline void kmem_cache_destroy_memcg_children(struct kmem_cache *s)
 {
 }
+
+static inline int memcg_init_lru(struct list_lru *lru)
+{
+	return 0;
+}
 #endif /* CONFIG_MEMCG_KMEM */
 #endif /* _LINUX_MEMCONTROL_H */
 
diff --git a/lib/list_lru.c b/lib/list_lru.c
index 0f08ed6..a9616a0 100644
--- a/lib/list_lru.c
+++ b/lib/list_lru.c
@@ -8,6 +8,7 @@
 #include <linux/module.h>
 #include <linux/mm.h>
 #include <linux/list_lru.h>
+#include <linux/memcontrol.h>
 
 int
 list_lru_add(
@@ -184,18 +185,100 @@ list_lru_dispose_all(
 	return total;
 }
 
-int
-list_lru_init(
-	struct list_lru	*lru)
+/*
+ * This protects the list of all LRUs in the system. One only needs
+ * to take it when registering an LRU, or when duplicating the list of LRUs.
+ * Traversing an LRU can and should be done outside the lock.
+ */
+static DEFINE_MUTEX(all_memcg_lrus_mutex);
+static LIST_HEAD(all_memcg_lrus);
+
+static void list_lru_init_one(struct list_lru_node *lru)
 {
+	spin_lock_init(&lru->lock);
+	INIT_LIST_HEAD(&lru->list);
+	lru->nr_items = 0;
+}
+
+struct list_lru_array *lru_alloc_array(void)
+{
+	struct list_lru_array *lru_array;
 	int i;
 
-	nodes_clear(lru->active_nodes);
-	for (i = 0; i < MAX_NUMNODES; i++) {
-		spin_lock_init(&lru->node[i].lock);
-		INIT_LIST_HEAD(&lru->node[i].list);
-		lru->node[i].nr_items = 0;
+	lru_array = kzalloc(nr_node_ids * sizeof(struct list_lru_node),
+				GFP_KERNEL);
+	if (!lru_array)
+		return NULL;
+
+	for (i = 0; i < nr_node_ids ; i++)
+		list_lru_init_one(&lru_array->node[i]);
+
+	return lru_array;
+}
+
+#ifdef CONFIG_MEMCG_KMEM
+int __memcg_init_lru(struct list_lru *lru)
+{
+	int ret;
+
+	INIT_LIST_HEAD(&lru->lrus);
+	mutex_lock(&all_memcg_lrus_mutex);
+	list_add(&lru->lrus, &all_memcg_lrus);
+	ret = memcg_new_lru(lru);
+	mutex_unlock(&all_memcg_lrus_mutex);
+	return ret;
+}
+
+int memcg_update_all_lrus(unsigned long num)
+{
+	int ret = 0;
+	struct list_lru *lru;
+
+	mutex_lock(&all_memcg_lrus_mutex);
+	list_for_each_entry(lru, &all_memcg_lrus, lrus) {
+		ret = memcg_kmem_update_lru_size(lru, num, false);
+		if (ret)
+			goto out;
+	}
+out:
+	mutex_unlock(&all_memcg_lrus_mutex);
+	return ret;
+}
+
+void list_lru_destroy(struct list_lru *lru)
+{
+	if (!lru->memcg_lrus)
+		return;
+
+	mutex_lock(&all_memcg_lrus_mutex);
+	list_del(&lru->lrus);
+	mutex_unlock(&all_memcg_lrus_mutex);
+}
+
+void list_lru_destroy_memcg(struct mem_cgroup *memcg)
+{
+	struct list_lru *lru;
+	mutex_lock(&all_memcg_lrus_mutex);
+	list_for_each_entry(lru, &all_memcg_lrus, lrus) {
+		kfree(lru->memcg_lrus[memcg_cache_id(memcg)]);
+		lru->memcg_lrus[memcg_cache_id(memcg)] = NULL;
+		/* everybody must be aware that this memcg is no longer valid */
+		wmb();
 	}
+	mutex_unlock(&all_memcg_lrus_mutex);
+}
+#endif
+
+int __list_lru_init(struct list_lru *lru, bool memcg_enabled)
+{
+	int i;
+
+	nodes_clear(lru->active_nodes);
+	for (i = 0; i < MAX_NUMNODES; i++)
+		list_lru_init_one(&lru->node[i]);
+
+	if (memcg_enabled)
+		return memcg_init_lru(lru);
 	return 0;
 }
-EXPORT_SYMBOL_GPL(list_lru_init);
+EXPORT_SYMBOL_GPL(__list_lru_init);
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index ecdae39..c6c90d8 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2988,16 +2988,30 @@ int memcg_update_cache_sizes(struct mem_cgroup *memcg)
 	memcg_kmem_set_activated(memcg);
 
 	ret = memcg_update_all_caches(num+1);
-	if (ret) {
-		ida_simple_remove(&kmem_limited_groups, num);
-		memcg_kmem_clear_activated(memcg);
-		return ret;
-	}
+	if (ret)
+		goto out;
+
+	/*
+	 * We should make sure that the array size is not updated until we are
+	 * done; otherwise we have no easy way to know whether or not we should
+	 * grow the array.
+	 */
+	ret = memcg_update_all_lrus(num + 1);
+	if (ret)
+		goto out;
 
 	memcg->kmemcg_id = num;
+
+	memcg_update_array_size(num + 1);
+
 	INIT_LIST_HEAD(&memcg->memcg_slab_caches);
 	mutex_init(&memcg->slab_caches_mutex);
+
 	return 0;
+out:
+	ida_simple_remove(&kmem_limited_groups, num);
+	memcg_kmem_clear_activated(memcg);
+	return ret;
 }
 
 static size_t memcg_caches_array_size(int num_groups)
@@ -3081,6 +3095,129 @@ int memcg_update_cache_size(struct kmem_cache *s, int num_groups)
 	return 0;
 }
 
+/*
+ * memcg_kmem_update_lru_size - fill in kmemcg info into a list_lru
+ *
+ * @lru: the lru we are operating with
+ * @num_groups: how many kmem-limited cgroups we have
+ * @new_lru: true if this is a new_lru being created, false if this
+ * was triggered from the memcg side
+ *
+ * Returns 0 on success, and an error code otherwise.
+ *
+ * This function can be called either when a new kmem-limited memcg appears,
+ * or when a new list_lru is created. The work is roughly the same in both
+ * cases, but in the latter we never have to expand the array size.
+ *
+ * This is always protected by all_memcg_lrus_mutex from the list_lru side. But
+ * a race can still exist if a new memcg becomes kmem limited at the same time
+ * that we are registering a new LRU. Memcg creation is protected by
+ * memcg_create_mutex, so the creation of a new LRU has to be protected by that
+ * as well.
+ *
+ * The lock ordering is that the memcg_mutex needs to be acquired before the
+ * lru-side mutex.
+ */
+int memcg_kmem_update_lru_size(struct list_lru *lru, int num_groups,
+			       bool new_lru)
+{
+	struct list_lru_array **new_lru_array;
+	struct list_lru_array *lru_array;
+
+	lru_array = lru_alloc_array();
+	if (!lru_array)
+		return -ENOMEM;
+
+	/*
+	 * When a new LRU is created, we still need to update all data for that
+	 * LRU. The procedure for late LRUs and new memcgs are quite similar, we
+	 * only need to make sure we get into the loop even if num_groups <
+	 * memcg_limited_groups_array_size.
+	 */
+	if ((num_groups > memcg_limited_groups_array_size) || new_lru) {
+		int i;
+		struct list_lru_array **old_array;
+		size_t size = memcg_caches_array_size(num_groups);
+		int num_memcgs = memcg_limited_groups_array_size;
+
+		new_lru_array = kzalloc(size * sizeof(void *), GFP_KERNEL);
+		if (!new_lru_array) {
+			kfree(lru_array);
+			return -ENOMEM;
+		}
+
+		for (i = 0; lru->memcg_lrus && (i < num_memcgs); i++) {
+			if (lru->memcg_lrus && !lru->memcg_lrus[i])
+				continue;
+			new_lru_array[i] =  lru->memcg_lrus[i];
+		}
+
+		old_array = lru->memcg_lrus;
+		lru->memcg_lrus = new_lru_array;
+		/*
+		 * We don't need a barrier here because we are just copying
+		 * information over. Anybody operating in memcg_lrus will
+		 * either follow the new array or the old one and they contain
+		 * exactly the same information. The new space in the end is
+		 * always empty anyway.
+		 */
+		if (lru->memcg_lrus)
+			kfree(old_array);
+	}
+
+	if (lru->memcg_lrus) {
+		lru->memcg_lrus[num_groups - 1] = lru_array;
+		/*
+		 * Here we do need the barrier, because of the state transition
+		 * implied by the assignment of the array. All users should be
+		 * able to see it
+		 */
+		wmb();
+	}
+	return 0;
+}
+
+/*
+ * This is called with the LRU-mutex being held.
+ */
+int memcg_new_lru(struct list_lru *lru)
+{
+	struct mem_cgroup *iter;
+
+	if (!memcg_kmem_enabled())
+		return 0;
+
+	for_each_mem_cgroup(iter) {
+		int ret;
+		int memcg_id = memcg_cache_id(iter);
+		if (memcg_id < 0)
+			continue;
+
+		ret = memcg_kmem_update_lru_size(lru, memcg_id + 1, true);
+		if (ret) {
+			mem_cgroup_iter_break(root_mem_cgroup, iter);
+			return ret;
+		}
+	}
+	return 0;
+}
+
+/*
+ * We need to call back and forth from memcg to LRU because of the lock
+ * ordering.  This complicates the flow a little bit, but since the memcg mutex
+ * is held through the whole duration of memcg creation, we need to hold it
+ * before we hold the LRU-side mutex in the case of a new list creation as
+ * well.
+ */
+int memcg_init_lru(struct list_lru *lru)
+{
+	int ret;
+	mutex_lock(&memcg_create_mutex);
+	ret = __memcg_init_lru(lru);
+	mutex_unlock(&memcg_create_mutex);
+	return ret;
+}
+
 int memcg_register_cache(struct mem_cgroup *memcg, struct kmem_cache *s,
 			 struct kmem_cache *root_cache)
 {
@@ -5775,8 +5912,10 @@ static void kmem_cgroup_destroy(struct mem_cgroup *memcg)
 	 * possible that the charges went down to 0 between mark_dead and the
 	 * res_counter read, so in that case, we don't need the put
 	 */
-	if (memcg_kmem_test_and_clear_dead(memcg))
+	if (memcg_kmem_test_and_clear_dead(memcg)) {
+		list_lru_destroy_memcg(memcg);
 		mem_cgroup_put(memcg);
+	}
 }
 #else
 static int memcg_init_kmem(struct mem_cgroup *memcg, struct cgroup_subsys *ss)
diff --git a/mm/slab_common.c b/mm/slab_common.c
index 3f3cd97..2470d11 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -102,7 +102,6 @@ int memcg_update_all_caches(int num_memcgs)
 			goto out;
 	}
 
-	memcg_update_array_size(num_memcgs);
 out:
 	mutex_unlock(&slab_mutex);
 	return ret;
-- 
1.8.1.4
* [PATCH v2 23/28] lru: add an element to a memcg list
  2013-03-29  9:13 [PATCH v2 00/28] memcg-aware slab shrinking Glauber Costa
                   ` (21 preceding siblings ...)
  2013-03-29  9:14 ` [PATCH v2 22/28] memcg,list_lru: duplicate LRUs upon kmemcg creation Glauber Costa
@ 2013-03-29  9:14 ` Glauber Costa
  2013-04-01  8:18   ` Kamezawa Hiroyuki
  2013-03-29  9:14 ` [PATCH v2 24/28] list_lru: also include memcg lists in counts and scans Glauber Costa
                   ` (5 subsequent siblings)
  28 siblings, 1 reply; 97+ messages in thread
From: Glauber Costa @ 2013-03-29  9:14 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-fsdevel, containers, Michal Hocko, Johannes Weiner,
	kamezawa.hiroyu, Andrew Morton, Dave Shrinnker, Greg Thelen,
	hughd, yinghan, Glauber Costa, Dave Chinner, Mel Gorman,
	Rik van Riel
With the infrastructure we now have, we can add an element to a memcg
LRU list instead of the global list. The memcg lists are still
per-node.
Technically, we will never trigger per-node shrinking if the memcg is
short of memory. Therefore an alternative to this would be to add the
element to *both* a single-node memcg array and a per-node global array.
There are two main reasons for choosing the current design instead:
1) adding an extra list_head to each of the objects would waste 16 bytes
per object, always remembering that we are talking about 1 dentry + 1
inode in the common case. This means a close to 10 % increase in the
dentry size, and a lower yet significant increase in the inode size. In
terms of total memory, this design pays 32 bytes per superblock per node
(size of struct list_lru_node), which means that in any scenario where
we have more than 10 dentries + inodes, we would already be paying more
memory in the two-list-heads approach than we will here with 1 node x 10
superblocks. The turning point of course depends on the workload, but I
hope the figures above would convince you that the memory footprint is
on my side in any workload that matters.
2) The main drawback of this, namely, that we lose global LRU order, is
not really seen by me as a disadvantage: if we are using memcg to
isolate the workloads, global pressure should try to balance the amount
reclaimed from all memcgs the same way the shrinkers will already
naturally balance the amount reclaimed from each superblock. (This
patchset needs some love in this regard, btw).
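
As a rough check of the figures in point 1, the two costs can be written down
directly. The byte counts are the ones quoted in the message (real sizes
depend on architecture and config), the helper names are hypothetical, and
reading "10 dentries + inodes" as 10 of each (20 objects) is an assumption.

```c
#include <assert.h>

/* Figures quoted in the message above; treat them as illustrative. */
#define EXTRA_LIST_HEAD_BYTES	16L	/* one extra list_head per object */
#define LRU_NODE_BYTES		32L	/* one list_lru_node per sb per node */

/* Cost of the rejected alternative: a second list_head in every object. */
static long two_list_heads_cost(long nr_objects)
{
	return nr_objects * EXTRA_LIST_HEAD_BYTES;
}

/* Cost of this design for one memcg: one list_lru_node per sb per node. */
static long per_memcg_lists_cost(long nr_superblocks, long nr_nodes)
{
	return nr_superblocks * nr_nodes * LRU_NODE_BYTES;
}
```

With 1 node and 10 superblocks the per-memcg lists cost 320 bytes, which is
what 20 objects would pay for an extra list_head each; beyond that the
two-list-heads approach is the more expensive one.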
To help us easily track down which nodes have and which nodes don't
have elements in the list, we will rely on an auxiliary node bitmap at
the global level.
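
The bookkeeping described above can be sketched in userspace like this. It is
a simplified model, not the kernel code: struct lru_totals_model and the
model_*() helpers are hypothetical, plain longs replace atomic_long_t, an
unsigned int replaces nodemask_t, and no locking is shown. The point is that
a node's bit follows the per-node total across all lists, not the count of
any single list.

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_NODES	4	/* hypothetical node count */
#define MODEL_LISTS	3	/* global list + two memcg lists (hypothetical) */

struct lru_totals_model {
	long nr_items[MODEL_LISTS][MODEL_NODES];	/* per-list, per-node */
	long node_totals[MODEL_NODES];			/* sum over all lists */
	unsigned int active_nodes;			/* bit n: node n non-empty */
};

/* Add one item to `list` on node `nid`; set the bit on the 0 -> 1 transition. */
static void model_add(struct lru_totals_model *m, int list, int nid)
{
	m->nr_items[list][nid]++;
	if (++m->node_totals[nid] == 1)
		m->active_nodes |= 1u << nid;
}

/* Remove one item; clear the bit only when the node total drops to 0. */
static void model_del(struct lru_totals_model *m, int list, int nid)
{
	assert(m->nr_items[list][nid] > 0);
	m->nr_items[list][nid]--;
	if (--m->node_totals[nid] == 0)
		m->active_nodes &= ~(1u << nid);
}

static bool model_node_active(const struct lru_totals_model *m, int nid)
{
	return (m->active_nodes & (1u << nid)) != 0;
}
```

Emptying one memcg's list on a node leaves the node marked active as long as
any other list on that node still holds items.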
Signed-off-by: Glauber Costa <glommer@parallels.com>
Cc: Dave Chinner <dchinner@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Hugh Dickins <hughd@google.com>
Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
---
 include/linux/list_lru.h   | 10 +++++++
 include/linux/memcontrol.h | 10 +++++++
 lib/list_lru.c             | 68 +++++++++++++++++++++++++++++++++++++++-------
 mm/memcontrol.c            | 38 +++++++++++++++++++++++++-
 4 files changed, 115 insertions(+), 11 deletions(-)
diff --git a/include/linux/list_lru.h b/include/linux/list_lru.h
index d6cf126..0856899 100644
--- a/include/linux/list_lru.h
+++ b/include/linux/list_lru.h
@@ -26,6 +26,7 @@ struct list_lru_array {
 
 struct list_lru {
 	struct list_lru_node	node[MAX_NUMNODES];
+	atomic_long_t		node_totals[MAX_NUMNODES];
 	nodemask_t		active_nodes;
 #ifdef CONFIG_MEMCG_KMEM
 	struct list_head	lrus;
@@ -40,10 +41,19 @@ int memcg_update_all_lrus(unsigned long num);
 void list_lru_destroy(struct list_lru *lru);
 void list_lru_destroy_memcg(struct mem_cgroup *memcg);
 int __memcg_init_lru(struct list_lru *lru);
+struct list_lru_node *
+lru_node_of_index(struct list_lru *lru, int index, int nid);
 #else
 static inline void list_lru_destroy(struct list_lru *lru)
 {
 }
+
+static inline struct list_lru_node *
+lru_node_of_index(struct list_lru *lru, int index, int nid)
+{
+	BUG_ON(index < 0); /* index != -1 with !MEMCG_KMEM. Impossible */
+	return &lru->node[nid];
+}
 #endif
 
 int __list_lru_init(struct list_lru *lru, bool memcg_enabled);
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index ee3199d..f55f875 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -24,6 +24,7 @@
 #include <linux/hardirq.h>
 #include <linux/jump_label.h>
 #include <linux/list_lru.h>
+#include <linux/mm.h>
 
 struct mem_cgroup;
 struct page_cgroup;
@@ -473,6 +474,9 @@ __memcg_kmem_get_cache(struct kmem_cache *cachep, gfp_t gfp);
 int memcg_new_lru(struct list_lru *lru);
 int memcg_init_lru(struct list_lru *lru);
 
+struct list_lru_node *
+memcg_kmem_lru_of_page(struct list_lru *lru, struct page *page);
+
 int memcg_kmem_update_lru_size(struct list_lru *lru, int num_groups,
 			       bool new_lru);
 
@@ -644,6 +648,12 @@ static inline int memcg_init_lru(struct list_lru *lru)
 {
 	return 0;
 }
+
+static inline struct list_lru_node *
+memcg_kmem_lru_of_page(struct list_lru *lru, struct page *page)
+{
+	return &lru->node[page_to_nid(page)];
+}
 #endif /* CONFIG_MEMCG_KMEM */
 #endif /* _LINUX_MEMCONTROL_H */
 
diff --git a/lib/list_lru.c b/lib/list_lru.c
index a9616a0..734ff91 100644
--- a/lib/list_lru.c
+++ b/lib/list_lru.c
@@ -15,14 +15,22 @@ list_lru_add(
 	struct list_lru	*lru,
 	struct list_head *item)
 {
-	int nid = page_to_nid(virt_to_page(item));
-	struct list_lru_node *nlru = &lru->node[nid];
+	struct page *page = virt_to_page(item);
+	struct list_lru_node *nlru;
+	int nid = page_to_nid(page);
+
+	nlru = memcg_kmem_lru_of_page(lru, page);
 
 	spin_lock(&nlru->lock);
 	BUG_ON(nlru->nr_items < 0);
 	if (list_empty(item)) {
 		list_add_tail(item, &nlru->list);
-		if (nlru->nr_items++ == 0)
+		nlru->nr_items++;
+		/*
+		 * We only consider a node active or inactive based on the
+		 * total figure for all involved children.
+		 */
+		if (atomic_long_add_return(1, &lru->node_totals[nid]) == 1)
 			node_set(nid, lru->active_nodes);
 		spin_unlock(&nlru->lock);
 		return 1;
@@ -37,14 +45,20 @@ list_lru_del(
 	struct list_lru	*lru,
 	struct list_head *item)
 {
-	int nid = page_to_nid(virt_to_page(item));
-	struct list_lru_node *nlru = &lru->node[nid];
+	struct page *page = virt_to_page(item);
+	struct list_lru_node *nlru;
+	int nid = page_to_nid(page);
+
+	nlru = memcg_kmem_lru_of_page(lru, page);
 
 	spin_lock(&nlru->lock);
 	if (!list_empty(item)) {
 		list_del_init(item);
-		if (--nlru->nr_items == 0)
+		nlru->nr_items--;
+
+		if (atomic_long_dec_and_test(&lru->node_totals[nid]))
 			node_clear(nid, lru->active_nodes);
+
 		BUG_ON(nlru->nr_items < 0);
 		spin_unlock(&nlru->lock);
 		return 1;
@@ -97,7 +111,9 @@ restart:
 		ret = isolate(item, &nlru->lock, cb_arg);
 		switch (ret) {
 		case 0:	/* item removed from list */
-			if (--nlru->nr_items == 0)
+			nlru->nr_items--;
+
+			if (atomic_long_dec_and_test(&lru->node_totals[nid]))
 				node_clear(nid, lru->active_nodes);
 			BUG_ON(nlru->nr_items < 0);
 			isolated++;
@@ -245,11 +261,41 @@ out:
 	return ret;
 }
 
-void list_lru_destroy(struct list_lru *lru)
+struct list_lru_node *
+lru_node_of_index(struct list_lru *lru, int index, int nid)
 {
+	struct list_lru_node *nlru;
+
+	if (index < 0)
+		return &lru->node[nid];
+
 	if (!lru->memcg_lrus)
-		return;
+		return NULL;
+
+	/*
+	 * because we will only ever free the memcg_lrus after synchronize_rcu,
+	 * we are safe with the rcu lock here: even if we are operating in the
+	 * stale version of the array, the data is still valid and we are not
+	 * risking anything.
+	 *
+	 * The read barrier is needed to make sure that we see the pointer
+	 * assignment for the specific memcg
+	 */
+	rcu_read_lock();
+	rmb();
+	/* The array exists, but the particular memcg does not */
+	if (!lru->memcg_lrus[index]) {
+		nlru = NULL;
+		goto out;
+	}
+	nlru = &lru->memcg_lrus[index]->node[nid];
+out:
+	rcu_read_unlock();
+	return nlru;
+}
 
+void list_lru_destroy(struct list_lru *lru)
+{
 	mutex_lock(&all_memcg_lrus_mutex);
 	list_del(&lru->lrus);
 	mutex_unlock(&all_memcg_lrus_mutex);
@@ -274,8 +320,10 @@ int __list_lru_init(struct list_lru *lru, bool memcg_enabled)
 	int i;
 
 	nodes_clear(lru->active_nodes);
-	for (i = 0; i < MAX_NUMNODES; i++)
+	for (i = 0; i < MAX_NUMNODES; i++) {
 		list_lru_init_one(&lru->node[i]);
+		atomic_long_set(&lru->node_totals[i], 0);
+	}
 
 	if (memcg_enabled)
 		return memcg_init_lru(lru);
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index c6c90d8..89b7ffb 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3160,9 +3160,15 @@ int memcg_kmem_update_lru_size(struct list_lru *lru, int num_groups,
 		 * either follow the new array or the old one and they contain
 		 * exactly the same information. The new space in the end is
 		 * always empty anyway.
+		 *
+		 * We do have to make sure that no more users of the old
+		 * memcg_lrus array exist before we free, and this is achieved
+		 * by the synchronize_rcu below.
 		 */
-		if (lru->memcg_lrus)
+		if (lru->memcg_lrus) {
+			synchronize_rcu();
 			kfree(old_array);
+		}
 	}
 
 	if (lru->memcg_lrus) {
@@ -3306,6 +3312,36 @@ static inline void memcg_resume_kmem_account(void)
 	current->memcg_kmem_skip_account--;
 }
 
+static struct mem_cgroup *mem_cgroup_from_kmem_page(struct page *page)
+{
+	struct page_cgroup *pc;
+	struct mem_cgroup *memcg = NULL;
+
+	pc = lookup_page_cgroup(page);
+	if (!PageCgroupUsed(pc))
+		return NULL;
+
+	lock_page_cgroup(pc);
+	if (PageCgroupUsed(pc))
+		memcg = pc->mem_cgroup;
+	unlock_page_cgroup(pc);
+	return memcg;
+}
+
+struct list_lru_node *
+memcg_kmem_lru_of_page(struct list_lru *lru, struct page *page)
+{
+	struct mem_cgroup *memcg = mem_cgroup_from_kmem_page(page);
+	int nid = page_to_nid(page);
+	int memcg_id;
+
+	if (!memcg_kmem_enabled())
+		return &lru->node[nid];
+
+	memcg_id = memcg_cache_id(memcg);
+	return lru_node_of_index(lru, memcg_id, nid);
+}
+
 static void kmem_cache_destroy_work_func(struct work_struct *w)
 {
 	struct kmem_cache *cachep;
-- 
1.8.1.4
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply related	[flat|nested] 97+ messages in thread
* [PATCH v2 24/28] list_lru: also include memcg lists in counts and scans
  2013-03-29  9:13 [PATCH v2 00/28] memcg-aware slab shrinking Glauber Costa
                   ` (22 preceding siblings ...)
  2013-03-29  9:14 ` [PATCH v2 23/28] lru: add an element to a memcg list Glauber Costa
@ 2013-03-29  9:14 ` Glauber Costa
  2013-03-29  9:14 ` [PATCH v2 25/28] list_lru: per-memcg walks Glauber Costa
                   ` (4 subsequent siblings)
  28 siblings, 0 replies; 97+ messages in thread
From: Glauber Costa @ 2013-03-29  9:14 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-fsdevel, containers, Michal Hocko, Johannes Weiner,
	kamezawa.hiroyu, Andrew Morton, Dave Shrinnker, Greg Thelen,
	hughd, yinghan, Glauber Costa, Dave Chinner, Mel Gorman,
	Rik van Riel
As elements are added to per-memcg lists, they will be invisible to
global reclaimers. This patch mainly modifies list_lru walk and count
functions to take that into account.
Counting is very simple: since we already have total figures for the
node, which we use to figure out when to set or clear the node in the
bitmap, we can just use that.
For walking, we need to walk the memcg lists as well as the global list.
To achieve that, this patch introduces the helper macro
for_each_memcg_lru_index. Locking semantics are simple, since
introducing a new LRU in the list does not influence the memcg walkers.
The only operation we race against is memcg creation and teardown.  For
those, barriers should be enough to guarantee that we are seeing
up-to-date information and not accessing invalid pointers.
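
The walk order can be illustrated with a small userspace model. Everything
named model_* or MODEL_* below is hypothetical: a fixed bound stands in for
memcg_limited_groups_array_size, a pointer to a count stands in for a list,
and neither locking nor barriers are modelled. Index -1 selects the global
list, indexes 0 and up select memcg lists, and unpopulated slots are skipped.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_GROUPS	3	/* stand-in for memcg_limited_groups_array_size */

struct walk_model {
	long global_items;			/* reached through index -1 */
	const long *memcg_items[MODEL_GROUPS];	/* NULL: slot not populated */
};

/* Visit the global list first (index -1), then every memcg index. */
#define model_for_each_lru_index(idx) \
	for ((idx) = -1; (idx) < MODEL_GROUPS; (idx)++)

/* Return the item count behind an index, or NULL if the memcg is absent. */
static const long *model_list_of_index(const struct walk_model *m, int idx)
{
	if (idx < 0)
		return &m->global_items;
	return m->memcg_items[idx];
}

/* Sum every list a global reclaimer would have to walk. */
static long model_walk_all(const struct walk_model *m)
{
	long seen = 0;
	int idx;

	model_for_each_lru_index(idx) {
		const long *items = model_list_of_index(m, idx);

		if (!items)
			continue;	/* skip memcgs that do not exist */
		seen += *items;
	}
	return seen;
}
```

When the bound is 0, the loop body still runs once for index -1, which
matches the behaviour the patch relies on when memcg kmem accounting is
compiled out.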
Signed-off-by: Glauber Costa <glommer@parallels.com>
Cc: Dave Chinner <dchinner@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Hugh Dickins <hughd@google.com>
Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
---
 include/linux/memcontrol.h |  2 ++
 lib/list_lru.c             | 90 ++++++++++++++++++++++++++++++++++------------
 2 files changed, 69 insertions(+), 23 deletions(-)
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index f55f875..99b36fe 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -593,6 +593,8 @@ static inline bool memcg_kmem_is_active(struct mem_cgroup *memcg)
 #define for_each_memcg_cache_index(_idx)	\
 	for (; NULL; )
 
+#define memcg_limited_groups_array_size 0
+
 static inline bool memcg_kmem_enabled(void)
 {
 	return false;
diff --git a/lib/list_lru.c b/lib/list_lru.c
index 734ff91..e8d04a1 100644
--- a/lib/list_lru.c
+++ b/lib/list_lru.c
@@ -10,6 +10,23 @@
 #include <linux/list_lru.h>
 #include <linux/memcontrol.h>
 
+/*
+ * This helper will loop through all node-data in the LRU, either global or
+ * per-memcg.  If memcg is either not present or not used,
+ * memcg_limited_groups_array_size will be 0. _idx starts at -1, and it will
+ * still be allowed to execute once.
+ *
+ * We convention that for _idx = -1, the global node info should be used.
+ * After that, we will go through each of the memcgs, starting at 0.
+ *
+ * We don't need any kind of locking for the loop because
+ * memcg_limited_groups_array_size can only grow, gaining new fields at the
+ * end. The old ones are just copied, and any interesting manipulation happen
+ * in the node list itself, and we already lock the list.
+ */
+#define for_each_memcg_lru_index(_idx)	\
+	for ((_idx) = -1; ((_idx) < memcg_limited_groups_array_size); (_idx)++)
+
 int
 list_lru_add(
 	struct list_lru	*lru,
@@ -77,12 +94,12 @@ list_lru_count_nodemask(
 	int nid;
 
 	for_each_node_mask(nid, *nodes_to_count) {
-		struct list_lru_node *nlru = &lru->node[nid];
-
-		spin_lock(&nlru->lock);
-		BUG_ON(nlru->nr_items < 0);
-		count += nlru->nr_items;
-		spin_unlock(&nlru->lock);
+		/*
+		 * We don't need to loop through all memcgs here, because we
+		 * have the node_totals information for the node. If we hadn't,
+		 * this would still be achieavable by a loop-over-all-groups
+		 */
+		count += atomic_long_read(&lru->node_totals[nid]);
 	}
 
 	return count;
@@ -92,12 +109,12 @@ EXPORT_SYMBOL_GPL(list_lru_count_nodemask);
 static long
 list_lru_walk_node(
 	struct list_lru		*lru,
+	struct list_lru_node	*nlru,
 	int			nid,
 	list_lru_walk_cb	isolate,
 	void			*cb_arg,
 	long			*nr_to_walk)
 {
-	struct list_lru_node	*nlru = &lru->node[nid];
 	struct list_head *item, *n;
 	long isolated = 0;
 restart:
@@ -143,12 +160,28 @@ list_lru_walk_nodemask(
 {
 	long isolated = 0;
 	int nid;
+	nodemask_t nodes;
+	int idx;
+	struct list_lru_node *nlru;
 
-	for_each_node_mask(nid, *nodes_to_walk) {
-		isolated += list_lru_walk_node(lru, nid, isolate,
-					       cb_arg, &nr_to_walk);
-		if (nr_to_walk <= 0)
-			break;
+	/*
+	 * Conservative code can call this setting nodes with node_setall.
+	 * This will generate an out of bound access for memcg.
+	 */
+	nodes_and(nodes, *nodes_to_walk, node_online_map);
+
+	for_each_node_mask(nid, nodes) {
+		for_each_memcg_lru_index(idx) {
+
+			nlru = lru_node_of_index(lru, idx, nid);
+			if (!nlru)
+				continue;
+
+			isolated += list_lru_walk_node(lru, nlru, nid, isolate,
+						       cb_arg, &nr_to_walk);
+			if (nr_to_walk <= 0)
+				break;
+		}
 	}
 	return isolated;
 }
@@ -160,23 +193,34 @@ list_lru_dispose_all_node(
 	int			nid,
 	list_lru_dispose_cb	dispose)
 {
-	struct list_lru_node	*nlru = &lru->node[nid];
+	struct list_lru_node *nlru;
 	LIST_HEAD(dispose_list);
 	long disposed = 0;
+	int idx;
 
-	spin_lock(&nlru->lock);
-	while (!list_empty(&nlru->list)) {
-		list_splice_init(&nlru->list, &dispose_list);
-		disposed += nlru->nr_items;
-		nlru->nr_items = 0;
-		node_clear(nid, lru->active_nodes);
-		spin_unlock(&nlru->lock);
-
-		dispose(&dispose_list);
+	for_each_memcg_lru_index(idx) {
+		nlru = lru_node_of_index(lru, idx, nid);
+		if (!nlru)
+			continue;
 
 		spin_lock(&nlru->lock);
+		while (!list_empty(&nlru->list)) {
+			list_splice_init(&nlru->list, &dispose_list);
+
+			if (atomic_long_sub_and_test(nlru->nr_items,
+							&lru->node_totals[nid]))
+				node_clear(nid, lru->active_nodes);
+			disposed += nlru->nr_items;
+			nlru->nr_items = 0;
+			spin_unlock(&nlru->lock);
+
+			dispose(&dispose_list);
+
+			spin_lock(&nlru->lock);
+		}
+		spin_unlock(&nlru->lock);
 	}
-	spin_unlock(&nlru->lock);
+
 	return disposed;
 }
 
-- 
1.8.1.4
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply related	[flat|nested] 97+ messages in thread
* [PATCH v2 25/28] list_lru: per-memcg walks
  2013-03-29  9:13 [PATCH v2 00/28] memcg-aware slab shrinking Glauber Costa
                   ` (23 preceding siblings ...)
  2013-03-29  9:14 ` [PATCH v2 24/28] list_lru: also include memcg lists in counts and scans Glauber Costa
@ 2013-03-29  9:14 ` Glauber Costa
  2013-03-29  9:14 ` [PATCH v2 26/28] memcg: per-memcg kmem shrinking Glauber Costa
                   ` (3 subsequent siblings)
  28 siblings, 0 replies; 97+ messages in thread
From: Glauber Costa @ 2013-03-29  9:14 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-fsdevel, containers, Michal Hocko, Johannes Weiner,
	kamezawa.hiroyu, Andrew Morton, Dave Shrinnker, Greg Thelen,
	hughd, yinghan, Glauber Costa, Dave Chinner, Mel Gorman,
	Rik van Riel
This patch extends the list_lru interfaces to allow for a memcg
parameter. Because most of its users won't need it, instead of
modifying the function signatures we create a new set of _memcg()
functions and write the old API on top of that.
At this point, the infrastructure is mostly in place. We already walk
the nodes using all memcg indexes, so we just need to make sure we skip
all but the one we're interested in. We could just go directly to the
memcg of interest, but I am assuming that given the gained simplicity,
spending a few cycles here won't hurt *that* much (but that can be
improved if needed, of course).
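
The "skip all but the one we're interested in" behaviour can be checked with
a userspace copy of the loop shape. This is a sketch: MODEL_GROUPS is a
hypothetical fixed bound replacing memcg_limited_groups_array_size, and
model_visit() is a hypothetical helper that only records which indexes the
loop reaches. A negative memcg_id is assumed to mean "no memcg given", in
which case every index from -1 upward is visited.

```c
#include <assert.h>

#define MODEL_GROUPS	4	/* hypothetical array size */

/* Same shape as the macro in the patch, with a fixed hypothetical bound. */
#define model_for_each_lru_index(idx, memcg_id)			\
	for ((idx) = ((memcg_id) >= 0) ? (memcg_id) : -1;		\
	     (((memcg_id) < 0) || ((idx) <= (memcg_id))) &&		\
	     ((idx) < MODEL_GROUPS);					\
	     (idx)++)

/*
 * Record which indexes get visited for a given memcg_id.
 * visited[0] stands for index -1 (the global list).
 * Returns the number of iterations.
 */
static int model_visit(int memcg_id, int visited[MODEL_GROUPS + 1])
{
	int idx, n = 0;

	model_for_each_lru_index(idx, memcg_id) {
		visited[idx + 1] = 1;
		n++;
	}
	return n;
}
```

A non-negative memcg_id gives exactly one iteration for that index, and an
index at or beyond the bound gives none, so a stale id never reaches past the
end of the array in this model.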
Signed-off-by: Glauber Costa <glommer@parallels.com>
Cc: Dave Chinner <dchinner@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Hugh Dickins <hughd@google.com>
Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
---
 include/linux/list_lru.h | 24 +++++++++++++++++----
 lib/list_lru.c           | 56 ++++++++++++++++++++++++++++++++++++------------
 2 files changed, 62 insertions(+), 18 deletions(-)
diff --git a/include/linux/list_lru.h b/include/linux/list_lru.h
index 0856899..2481756 100644
--- a/include/linux/list_lru.h
+++ b/include/linux/list_lru.h
@@ -69,20 +69,36 @@ static inline int list_lru_init_memcg(struct list_lru *lru)
 
 int list_lru_add(struct list_lru *lru, struct list_head *item);
 int list_lru_del(struct list_lru *lru, struct list_head *item);
-long list_lru_count_nodemask(struct list_lru *lru, nodemask_t *nodes_to_count);
+
+long list_lru_count_nodemask_memcg(struct list_lru *lru,
+			nodemask_t *nodes_to_count, struct mem_cgroup *memcg);
+
+static inline long
+list_lru_count_nodemask(struct list_lru *lru, nodemask_t *nodes_to_count)
+{
+	return list_lru_count_nodemask_memcg(lru, nodes_to_count, NULL);
+}
 
 static inline long list_lru_count(struct list_lru *lru)
 {
 	return list_lru_count_nodemask(lru, &lru->active_nodes);
 }
 
-
 typedef int (*list_lru_walk_cb)(struct list_head *item, spinlock_t *lock,
 				void *cb_arg);
 typedef void (*list_lru_dispose_cb)(struct list_head *dispose_list);
 
-long list_lru_walk_nodemask(struct list_lru *lru, list_lru_walk_cb isolate,
-		   void *cb_arg, long nr_to_walk, nodemask_t *nodes_to_walk);
+long list_lru_walk_nodemask_memcg(struct list_lru *lru,
+	list_lru_walk_cb isolate, void *cb_arg, long nr_to_walk,
+	nodemask_t *nodes_to_walk, struct mem_cgroup *memcg);
+
+static inline long list_lru_walk_nodemask(struct list_lru *lru,
+	list_lru_walk_cb isolate, void *cb_arg, long nr_to_walk,
+	nodemask_t *nodes_to_walk)
+{
+	return list_lru_walk_nodemask_memcg(lru, isolate, cb_arg, nr_to_walk,
+					    nodes_to_walk, NULL);
+}
 
 static inline long list_lru_walk(struct list_lru *lru, list_lru_walk_cb isolate,
 				 void *cb_arg, long nr_to_walk)
diff --git a/lib/list_lru.c b/lib/list_lru.c
index e8d04a1..a49a9b5 100644
--- a/lib/list_lru.c
+++ b/lib/list_lru.c
@@ -16,6 +16,11 @@
  * memcg_limited_groups_array_size will be 0. _idx starts at -1, and it will
  * still be allowed to execute once.
  *
+ * If a memcg is specified at memcg_id, we will make sure that the loop only
+ * has one iteration, corresponding to that memcg. This makes sure that the
+ * interface is kept for both cases and there is no need for separate code to
+ * handle that case, at the price of complicating the macro a bit.
+ *
  * We convention that for _idx = -1, the global node info should be used.
  * After that, we will go through each of the memcgs, starting at 0.
  *
@@ -24,8 +29,11 @@
  * end. The old ones are just copied, and any interesting manipulation happen
  * in the node list itself, and we already lock the list.
  */
-#define for_each_memcg_lru_index(_idx)	\
-	for ((_idx) = -1; ((_idx) < memcg_limited_groups_array_size); (_idx)++)
+#define for_each_memcg_lru_index(_idx, memcg_id)		\
+	for ((_idx) = ((memcg_id) >= 0) ? memcg_id : -1;	\
+	     ((memcg_id < 0) || ((_idx) <= (memcg_id))) &&	\
+	     ((_idx) < memcg_limited_groups_array_size);	\
+	     (_idx)++)
 
 int
 list_lru_add(
@@ -86,25 +94,44 @@ list_lru_del(
 EXPORT_SYMBOL_GPL(list_lru_del);
 
 long
-list_lru_count_nodemask(
+list_lru_count_nodemask_memcg(
 	struct list_lru *lru,
-	nodemask_t	*nodes_to_count)
+	nodemask_t	*nodes_to_count,
+	struct mem_cgroup *memcg)
 {
 	long count = 0;
 	int nid;
+	nodemask_t nodes;
+	struct list_lru_node *nlru;
+	int memcg_id = memcg_cache_id(memcg);
+
+	/*
+	 * Conservative code can call this setting nodes with node_setall.
+	 * This will generate an out of bound access for memcg.
+	 */
+	nodes_and(nodes, *nodes_to_count, node_online_map);
 
-	for_each_node_mask(nid, *nodes_to_count) {
+	for_each_node_mask(nid, nodes) {
 		/*
 		 * We don't need to loop through all memcgs here, because we
 		 * have the node_totals information for the node. If we hadn't,
 		 * this would still be achieavable by a loop-over-all-groups
 		 */
-		count += atomic_long_read(&lru->node_totals[nid]);
-	}
+		if (!memcg)
+			count += atomic_long_read(&lru->node_totals[nid]);
+		else {
+			nlru = lru_node_of_index(lru, memcg_id, nid);
+			WARN_ON(!nlru);
 
+			spin_lock(&nlru->lock);
+			BUG_ON(nlru->nr_items < 0);
+			count += nlru->nr_items;
+			spin_unlock(&nlru->lock);
+		}
+	}
 	return count;
 }
-EXPORT_SYMBOL_GPL(list_lru_count_nodemask);
+EXPORT_SYMBOL_GPL(list_lru_count_nodemask_memcg);
 
 static long
 list_lru_walk_node(
@@ -151,16 +178,18 @@ restart:
 }
 
 long
-list_lru_walk_nodemask(
+list_lru_walk_nodemask_memcg(
 	struct list_lru	*lru,
 	list_lru_walk_cb isolate,
 	void		*cb_arg,
 	long		nr_to_walk,
-	nodemask_t	*nodes_to_walk)
+	nodemask_t	*nodes_to_walk,
+	struct mem_cgroup *memcg)
 {
 	long isolated = 0;
 	int nid;
 	nodemask_t nodes;
+	int memcg_id = memcg_cache_id(memcg);
 	int idx;
 	struct list_lru_node *nlru;
 
@@ -171,8 +200,7 @@ list_lru_walk_nodemask(
 	nodes_and(nodes, *nodes_to_walk, node_online_map);
 
 	for_each_node_mask(nid, nodes) {
-		for_each_memcg_lru_index(idx) {
-
+		for_each_memcg_lru_index(idx, memcg_id) {
 			nlru = lru_node_of_index(lru, idx, nid);
 			if (!nlru)
 				continue;
@@ -185,7 +213,7 @@ list_lru_walk_nodemask(
 	}
 	return isolated;
 }
-EXPORT_SYMBOL_GPL(list_lru_walk_nodemask);
+EXPORT_SYMBOL_GPL(list_lru_walk_nodemask_memcg);
 
 long
 list_lru_dispose_all_node(
@@ -198,7 +226,7 @@ list_lru_dispose_all_node(
 	long disposed = 0;
 	int idx;
 
-	for_each_memcg_lru_index(idx) {
+	for_each_memcg_lru_index(idx, -1) {
 		nlru = lru_node_of_index(lru, idx, nid);
 		if (!nlru)
 			continue;
-- 
1.8.1.4
^ permalink raw reply related	[flat|nested] 97+ messages in thread
* [PATCH v2 26/28] memcg: per-memcg kmem shrinking
  2013-03-29  9:13 [PATCH v2 00/28] memcg-aware slab shrinking Glauber Costa
                   ` (24 preceding siblings ...)
  2013-03-29  9:14 ` [PATCH v2 25/28] list_lru: per-memcg walks Glauber Costa
@ 2013-03-29  9:14 ` Glauber Costa
  2013-04-01  8:31   ` Kamezawa Hiroyuki
  2013-03-29  9:14 ` [PATCH v2 27/28] list_lru: reclaim proportionaly between memcgs and nodes Glauber Costa
                   ` (2 subsequent siblings)
  28 siblings, 1 reply; 97+ messages in thread
From: Glauber Costa @ 2013-03-29  9:14 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-fsdevel, containers, Michal Hocko, Johannes Weiner,
	kamezawa.hiroyu, Andrew Morton, Dave Shrinnker, Greg Thelen,
	hughd, yinghan, Glauber Costa, Dave Chinner, Mel Gorman,
	Rik van Riel
If the kernel limit is smaller than the user limit, we will have
situations in which our allocations fail but freeing user pages will buy
us nothing.  In those, we would like to call a specialized memcg
reclaimer that only frees kernel memory and leaves the user memory alone.
Those allocations are also expected to fail when we account memcg->kmem,
instead of when we account memcg->res. Based on that, this patch
implements a memcg-specific reclaimer that only shrinks kernel objects,
without touching user pages.
There might be situations in which there are plenty of objects to
shrink, but we can't do it because the __GFP_FS flag is not set.
Although they can happen with user pages, they are a lot more common
with fs-metadata: this is the case with almost all inode allocation.
Those allocations are, however, capable of waiting.  So we can just spawn
a worker, let it finish its job and proceed with the allocation. As slow
as it is, at this point we are already past any hopes anyway.
Signed-off-by: Glauber Costa <glommer@parallels.com>
Cc: Dave Chinner <dchinner@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Hugh Dickins <hughd@google.com>
Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
---
 include/linux/swap.h |   2 +
 mm/memcontrol.c      | 182 ++++++++++++++++++++++++++++++++++++++++-----------
 mm/vmscan.c          |  37 ++++++++++-
 3 files changed, 183 insertions(+), 38 deletions(-)
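Reviewer note, not part of the patch: the charge-and-retry flow described
above can be modelled in userspace as a minimal sketch. Everything here is
an assumption made for illustration: struct counter and struct cgroup_model
are toy stand-ins for res_counter and mem_cgroup, shrink_kmem() pretends to
be the memcg-aware shrinker, RECLAIM_RETRIES stands in for
MEM_CGROUP_RECLAIM_RETRIES with an arbitrary value, and the worker handoff
for the !__GFP_FS case is left out.

```c
#include <errno.h>
#include <stdbool.h>

#define RECLAIM_RETRIES 5	/* assumed value, for illustration only */

struct counter {		/* toy stand-in for struct res_counter */
	unsigned long usage;
	unsigned long limit;
};

struct cgroup_model {		/* toy stand-in for struct mem_cgroup */
	struct counter kmem;
	unsigned long reclaimable;	/* cached objects a shrinker could free */
};

/* kmem-only reclaim makes sense only when the kmem limit is the tighter one */
static bool kmem_may_shrink(unsigned long kmem_limit, unsigned long mem_limit)
{
	return kmem_limit < mem_limit;
}

static bool counter_charge(struct counter *c, unsigned long size)
{
	if (c->usage + size > c->limit)
		return false;
	c->usage += size;
	return true;
}

/* Pretend shrinker: frees up to 'want' units of cached kernel objects. */
static unsigned long shrink_kmem(struct cgroup_model *cg, unsigned long want)
{
	unsigned long freed = want < cg->reclaimable ? want : cg->reclaimable;

	if (freed > cg->kmem.usage)
		freed = cg->kmem.usage;
	cg->reclaimable -= freed;
	cg->kmem.usage -= freed;
	return freed;
}

/* Try to charge; on failure, shrink and retry if the caller may wait. */
static int try_charge_kmem(struct cgroup_model *cg, unsigned long size,
			   bool can_wait)
{
	int retries = RECLAIM_RETRIES;

	do {
		if (counter_charge(&cg->kmem, size))
			return 0;
		if (!can_wait)
			return -ENOMEM;
		shrink_kmem(cg, size);
	} while (retries--);

	return -ENOMEM;
}
```

In this model, a charge that does not fit succeeds after one shrink pass
when enough objects are reclaimable, fails at once when the caller cannot
wait, and gives up after the retry budget when nothing can be freed.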
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 2818a12..80f6635 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -268,6 +268,8 @@ extern unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
 extern int __isolate_lru_page(struct page *page, isolate_mode_t mode);
 extern unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *mem,
 						  gfp_t gfp_mask, bool noswap);
+extern unsigned long try_to_free_mem_cgroup_kmem(struct mem_cgroup *mem,
+						 gfp_t gfp_mask);
 extern unsigned long mem_cgroup_shrink_node_zone(struct mem_cgroup *mem,
 						gfp_t gfp_mask, bool noswap,
 						struct zone *zone,
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 89b7ffb..a5a0f39 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -353,6 +353,8 @@ struct mem_cgroup {
 	atomic_t	numainfo_events;
 	atomic_t	numainfo_updating;
 #endif
+	/* when kmem shrinkers can sleep but can't proceed due to context */
+	struct work_struct kmemcg_shrink_work;
 	/*
 	 * Per cgroup active and inactive list, similar to the
 	 * per zone LRU lists.
@@ -369,11 +371,14 @@ static size_t memcg_size(void)
 		nr_node_ids * sizeof(struct mem_cgroup_per_node);
 }
 
+static DEFINE_MUTEX(set_limit_mutex);
+
 /* internal only representation about the status of kmem accounting. */
 enum {
 	KMEM_ACCOUNTED_ACTIVE = 0, /* accounted by this cgroup itself */
 	KMEM_ACCOUNTED_ACTIVATED, /* static key enabled. */
 	KMEM_ACCOUNTED_DEAD, /* dead memcg with pending kmem charges */
+	KMEM_MAY_SHRINK, /* kmem limit < mem limit, shrink kmem only */
 };
 
 /* We account when limit is on, but only after call sites are patched */
@@ -412,6 +417,31 @@ static bool memcg_kmem_test_and_clear_dead(struct mem_cgroup *memcg)
 	return test_and_clear_bit(KMEM_ACCOUNTED_DEAD,
 				  &memcg->kmem_account_flags);
 }
+
+/*
+ * If the kernel limit is smaller than the user limit, we will have situations
+ * in which our allocations fail but freeing user pages will buy us nothing.
+ * In those, we would like to call a specialized memcg reclaimer that only
+ * frees kernel memory and leaves the user memory alone.
+ *
+ * This test exists so we can differentiate between those. Every time one of
+ * the limits is updated, we need to run it. The set_limit_mutex must be held,
+ * so they don't change again.
+ */
+static void memcg_update_shrink_status(struct mem_cgroup *memcg)
+{
+	mutex_lock(&set_limit_mutex);
+	if (res_counter_read_u64(&memcg->kmem, RES_LIMIT) <
+		res_counter_read_u64(&memcg->res, RES_LIMIT))
+		set_bit(KMEM_MAY_SHRINK, &memcg->kmem_account_flags);
+	else
+		clear_bit(KMEM_MAY_SHRINK, &memcg->kmem_account_flags);
+	mutex_unlock(&set_limit_mutex);
+}
+#else
+static void memcg_update_shrink_status(struct mem_cgroup *memcg)
+{
+}
 #endif
 
 /* Stuffs for move charges at task migration. */
@@ -2838,8 +2868,6 @@ static void __mem_cgroup_commit_charge(struct mem_cgroup *memcg,
 	memcg_check_events(memcg, page);
 }
 
-static DEFINE_MUTEX(set_limit_mutex);
-
 #ifdef CONFIG_MEMCG_KMEM
 static inline bool memcg_can_account_kmem(struct mem_cgroup *memcg)
 {
@@ -2881,16 +2909,92 @@ static int mem_cgroup_slabinfo_read(struct cgroup *cont, struct cftype *cft,
 }
 #endif
 
+/*
+ * During the creation a new cache, we need to disable our accounting mechanism
+ * altogether. This is true even if we are not creating, but rather just
+ * enqueing new caches to be created.
+ *
+ * This is because that process will trigger allocations; some visible, like
+ * explicit kmallocs to auxiliary data structures, name strings and internal
+ * cache structures; some well concealed, like INIT_WORK() that can allocate
+ * objects during debug.
+ *
+ * If any allocation happens during memcg_kmem_get_cache, we will recurse back
+ * to it. This may not be a bounded recursion: since the first cache creation
+ * failed to complete (waiting on the allocation), we'll just try to create the
+ * cache again, failing at the same point.
+ *
+ * memcg_kmem_get_cache is prepared to abort after seeing a positive count of
+ * memcg_kmem_skip_account. So we enclose anything that might allocate memory
+ * inside the following two functions.
+ */
+static inline void memcg_stop_kmem_account(void)
+{
+	VM_BUG_ON(!current->mm);
+	current->memcg_kmem_skip_account++;
+}
+
+static inline void memcg_resume_kmem_account(void)
+{
+	VM_BUG_ON(!current->mm);
+	current->memcg_kmem_skip_account--;
+}
+
+static int memcg_try_charge_kmem(struct mem_cgroup *memcg, gfp_t gfp, u64 size)
+{
+	int retries = MEM_CGROUP_RECLAIM_RETRIES;
+	struct res_counter *fail_res;
+	int ret;
+
+	do {
+		ret = res_counter_charge(&memcg->kmem, size, &fail_res);
+		if (!ret)
+			return ret;
+
+		if (!(gfp & __GFP_WAIT))
+			return ret;
+
+		/*
+		 * We will try to shrink kernel memory present in caches. We
+		 * are sure that we can wait, so we will. The duration of our
+		 * wait is determined by congestion, the same way as vmscan.c
+		 *
+		 * If we are in FS context, though, then although we can wait,
+		 * we cannot call the shrinkers. Most fs shrinkers (which
+		 * comprises most of our kmem data) will not run without
+		 * __GFP_FS since they can deadlock. The solution is to
+		 * synchronously run that in a different context.
+		 */
+		if (!(gfp & __GFP_FS)) {
+			/*
+			 * we are already short on memory, every queue
+			 * allocation is likely to fail
+			 */
+			memcg_stop_kmem_account();
+			schedule_work(&memcg->kmemcg_shrink_work);
+			flush_work(&memcg->kmemcg_shrink_work);
+			memcg_resume_kmem_account();
+		} else if (!try_to_free_mem_cgroup_kmem(memcg, gfp))
+			congestion_wait(BLK_RW_ASYNC, HZ/10);
+
+	} while (retries--);
+
+	return ret;
+}
+
 static int memcg_charge_kmem(struct mem_cgroup *memcg, gfp_t gfp, u64 size)
 {
 	struct res_counter *fail_res;
 	struct mem_cgroup *_memcg;
 	int ret = 0;
 	bool may_oom;
+	bool kmem_first = test_bit(KMEM_MAY_SHRINK, &memcg->kmem_account_flags);
 
-	ret = res_counter_charge(&memcg->kmem, size, &fail_res);
-	if (ret)
-		return ret;
+	if (kmem_first) {
+		ret = memcg_try_charge_kmem(memcg, gfp, size);
+		if (ret)
+			return ret;
+	}
 
 	/*
 	 * Conditions under which we can wait for the oom_killer. Those are
@@ -2923,12 +3027,43 @@ static int memcg_charge_kmem(struct mem_cgroup *memcg, gfp_t gfp, u64 size)
 			res_counter_charge_nofail(&memcg->memsw, size,
 						  &fail_res);
 		ret = 0;
-	} else if (ret)
+		if (!kmem_first)
+			res_counter_charge_nofail(&memcg->kmem, size, &fail_res);
+	} else if (ret && kmem_first)
 		res_counter_uncharge(&memcg->kmem, size);
 
+	if (!kmem_first) {
+		ret = memcg_try_charge_kmem(memcg, gfp, size);
+		if (!ret)
+			return ret;
+
+		res_counter_uncharge(&memcg->res, size);
+		if (do_swap_account)
+			res_counter_uncharge(&memcg->memsw, size);
+	}
+
 	return ret;
 }
 
+/*
+ * There might be situations in which there are plenty of objects to shrink,
+ * but we can't do it because the __GFP_FS flag is not set.  This is the case
+ * with almost all inode allocation. They are, however, capable of waiting.
+ * So we can just spawn a worker, let it finish its job and proceed with the
+ * allocation. As slow as it is, at this point we are already past any hopes
+ * anyway.
+ */
+static void kmemcg_shrink_work_fn(struct work_struct *w)
+{
+	struct mem_cgroup *memcg;
+
+	memcg = container_of(w, struct mem_cgroup, kmemcg_shrink_work);
+
+	if (!try_to_free_mem_cgroup_kmem(memcg, GFP_KERNEL))
+		congestion_wait(BLK_RW_ASYNC, HZ/10);
+}
+
+
 static void memcg_uncharge_kmem(struct mem_cgroup *memcg, u64 size)
 {
 	res_counter_uncharge(&memcg->res, size);
@@ -3005,6 +3140,7 @@ int memcg_update_cache_sizes(struct mem_cgroup *memcg)
 	memcg_update_array_size(num + 1);
 
 	INIT_LIST_HEAD(&memcg->memcg_slab_caches);
+	INIT_WORK(&memcg->kmemcg_shrink_work, kmemcg_shrink_work_fn);
 	mutex_init(&memcg->slab_caches_mutex);
 
 	return 0;
@@ -3281,37 +3417,6 @@ out:
 	kfree(s->memcg_params);
 }
 
-/*
- * During the creation a new cache, we need to disable our accounting mechanism
- * altogether. This is true even if we are not creating, but rather just
- * enqueing new caches to be created.
- *
- * This is because that process will trigger allocations; some visible, like
- * explicit kmallocs to auxiliary data structures, name strings and internal
- * cache structures; some well concealed, like INIT_WORK() that can allocate
- * objects during debug.
- *
- * If any allocation happens during memcg_kmem_get_cache, we will recurse back
- * to it. This may not be a bounded recursion: since the first cache creation
- * failed to complete (waiting on the allocation), we'll just try to create the
- * cache again, failing at the same point.
- *
- * memcg_kmem_get_cache is prepared to abort after seeing a positive count of
- * memcg_kmem_skip_account. So we enclose anything that might allocate memory
- * inside the following two functions.
- */
-static inline void memcg_stop_kmem_account(void)
-{
-	VM_BUG_ON(!current->mm);
-	current->memcg_kmem_skip_account++;
-}
-
-static inline void memcg_resume_kmem_account(void)
-{
-	VM_BUG_ON(!current->mm);
-	current->memcg_kmem_skip_account--;
-}
-
 static struct mem_cgroup *mem_cgroup_from_kmem_page(struct page *page)
 {
 	struct page_cgroup *pc;
@@ -5292,6 +5397,9 @@ static int mem_cgroup_write(struct cgroup *cont, struct cftype *cft,
 			ret = memcg_update_kmem_limit(cont, val);
 		else
 			return -EINVAL;
+
+		if (!ret)
+			memcg_update_shrink_status(memcg);
 		break;
 	case RES_SOFT_LIMIT:
 		ret = res_counter_memparse_write_strategy(buffer, &val);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 43928fd..dd235e6 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2504,7 +2504,42 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
 
 	return nr_reclaimed;
 }
-#endif
+
+#ifdef CONFIG_MEMCG_KMEM
+/*
+ * This function is called when we are under kmem-specific pressure.  It will
+ * only trigger in environments with kmem.limit_in_bytes < limit_in_bytes, IOW,
+ * with a lower kmem allowance than the memory allowance.
+ *
+ * In this situation, freeing user pages from the cgroup won't do us any good.
+ * What we really need is to call the memcg-aware shrinkers, in the hope of
+ * freeing pages holding kmem objects. It may also be that we won't be able to
+ * free any pages, but will get rid of old objects opening up space for new
+ * ones.
+ */
+unsigned long try_to_free_mem_cgroup_kmem(struct mem_cgroup *memcg,
+					  gfp_t gfp_mask)
+{
+	struct shrink_control shrink = {
+		.gfp_mask = gfp_mask,
+		.target_mem_cgroup = memcg,
+	};
+
+	if (!(gfp_mask & __GFP_WAIT))
+		return 0;
+
+	nodes_setall(shrink.nodes_to_scan);
+
+	/*
+	 * We haven't scanned any user LRU, so we basically come up with
+	 * crafted values of nr_scanned and LRU page (1 and 0 respectively).
+	 * This should be enough to tell shrink_slab that the freeing
+	 * responsibility is entirely its own.
+	 */
+	return shrink_slab(&shrink, 1, 0);
+}
+#endif /* CONFIG_MEMCG_KMEM */
+#endif /* CONFIG_MEMCG */
 
 static void age_active_anon(struct zone *zone, struct scan_control *sc)
 {
-- 
1.8.1.4
^ permalink raw reply related	[flat|nested] 97+ messages in thread
* [PATCH v2 27/28] list_lru: reclaim proportionaly between memcgs and nodes
  2013-03-29  9:13 [PATCH v2 00/28] memcg-aware slab shrinking Glauber Costa
                   ` (25 preceding siblings ...)
  2013-03-29  9:14 ` [PATCH v2 26/28] memcg: per-memcg kmem shrinking Glauber Costa
@ 2013-03-29  9:14 ` Glauber Costa
  2013-03-29  9:14 ` [PATCH v2 28/28] super: targeted memcg reclaim Glauber Costa
  2013-04-01 12:38 ` [PATCH v2 00/28] memcg-aware slab shrinking Serge Hallyn
  28 siblings, 0 replies; 97+ messages in thread
From: Glauber Costa @ 2013-03-29  9:14 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-fsdevel, containers, Michal Hocko, Johannes Weiner,
	kamezawa.hiroyu, Andrew Morton, Dave Shrinnker, Greg Thelen,
	hughd, yinghan, Glauber Costa, Dave Chinner, Mel Gorman,
	Rik van Riel
The current list_lru code will try to scan objects until nr_to_walk is
reached, and then stop. This number can be different from the total
number of objects we have as returned by our count function. This is
because the main shrinker driver is the one ultimately responsible for
determining how many objects to shrink from each shrinker.
Especially when this number is lower than the number of objects, and
because we always traverse the list in the same order, the last node
and/or the last memcg can end up consistently less penalized than
the others.
My proposed solution is to introduce some metric of proportionality
based on the total number of objects per node and then scan all nodes
and memcgs up until their share is reached.
Signed-off-by: Glauber Costa <glommer@parallels.com>
Cc: Dave Chinner <dchinner@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Hugh Dickins <hughd@google.com>
Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
---
 lib/list_lru.c | 96 ++++++++++++++++++++++++++++++++++++++++++++++++++--------
 1 file changed, 83 insertions(+), 13 deletions(-)
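Reviewer note, not part of the patch: the per-node share described above,
n_node = nr_to_scan * total_node / total, can be tried in userspace with
the minimal sketch below. mult_frac_ul() mirrors the arithmetic of the
kernel's mult_frac() macro for unsigned long, node_share() is a
hypothetical helper, and the guard against total == 0 is my addition.

```c
/*
 * x * numer / denom, computed as the kernel's mult_frac() does it:
 * split x into quotient and remainder first so the intermediate
 * product stays small.
 */
static unsigned long mult_frac_ul(unsigned long x, unsigned long numer,
				  unsigned long denom)
{
	unsigned long quot = x / denom;
	unsigned long rem = x % denom;

	return quot * numer + (rem * numer) / denom;
}

/*
 * Hypothetical helper: how many objects to scan on one node.
 * total_node: objects on this node; total: objects on all nodes;
 * nr_to_walk: what the shrinker core asked us to scan overall.
 */
static unsigned long node_share(unsigned long total_node,
				unsigned long nr_to_walk,
				unsigned long total)
{
	unsigned long n_node;

	if (!total_node || !total)
		return 0;	/* nothing to scan on this node */

	n_node = mult_frac_ul(total_node, nr_to_walk, total);
	if (!n_node)
		n_node = total_node;	/* tiny share: scan the few we have */

	return n_node;
}
```

With 900, 90 and 10 objects on three nodes and nr_to_walk = 100, the shares
come out as 90, 9 and 1. A node holding 3 objects out of 1000 with
nr_to_walk = 10 would round down to zero, so it falls back to scanning all
3. The same formula is applied one level down to split n_node among the
memcg lists of a node.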
diff --git a/lib/list_lru.c b/lib/list_lru.c
index a49a9b5..af67725 100644
--- a/lib/list_lru.c
+++ b/lib/list_lru.c
@@ -177,6 +177,43 @@ restart:
 	return isolated;
 }
 
+static long
+memcg_isolate_lru(
+	struct list_lru	*lru,
+	list_lru_walk_cb isolate,
+	void		*cb_arg,
+	long		nr_to_walk,
+	struct mem_cgroup *memcg,
+	int nid, unsigned long total_node)
+{
+	int memcg_id = memcg_cache_id(memcg);
+	unsigned long nr_to_walk_this;
+	long isolated = 0;
+	int idx;
+	struct list_lru_node *nlru;
+
+	for_each_memcg_lru_index(idx, memcg_id) {
+		nlru = lru_node_of_index(lru, idx, nid);
+		if (!nlru || !nlru->nr_items)
+			continue;
+
+		/*
+		 * no memcg: walk every memcg proportionally.
+		 * memcg case: scan everything (total_node)
+		 */
+		if (!memcg)
+			nr_to_walk_this = mult_frac(nlru->nr_items, nr_to_walk,
+						    total_node);
+		else
+			nr_to_walk_this = total_node;
+
+		isolated += list_lru_walk_node(lru, nlru, nid, isolate,
+				       cb_arg, &nr_to_walk_this);
+	}
+
+	return isolated;
+}
+
 long
 list_lru_walk_nodemask_memcg(
 	struct list_lru	*lru,
@@ -189,9 +226,7 @@ list_lru_walk_nodemask_memcg(
 	long isolated = 0;
 	int nid;
 	nodemask_t nodes;
-	int memcg_id = memcg_cache_id(memcg);
-	int idx;
-	struct list_lru_node *nlru;
+	unsigned long n_node, total_node, total = 0;
 
 	/*
 	 * Conservative code can call this setting nodes with node_setall.
@@ -199,17 +234,52 @@ list_lru_walk_nodemask_memcg(
 	 */
 	nodes_and(nodes, *nodes_to_walk, node_online_map);
 
+	/*
+	 * We will first find out how many objects there are in the LRU, in
+	 * total. We could store that in a per-LRU counter as well, the same
+	 * way we store it in a per-NLRU. But lru_add and lru_del are way more
+	 * frequent operations, so it is better to pay the price here.
+	 *
+	 * Once we have that number, we will try to scan the nodes
+	 * proportionally to the amount of objects they have. The main shrinker
+	 * driver in vmscan.c will often ask us to shrink a quantity different
+	 * from the total quantity we reported in the count function (usually
+	 * less). This means that not scanning proportionally may leave nodes
+	 * (usually the last), unfairly charged.
+	 *
+	 * The final number we want is
+	 *
+	 * n_node = nr_to_scan * total_node / total
+	 */
+	for_each_node_mask(nid, nodes)
+		total += atomic_long_read(&lru->node_totals[nid]);
+
 	for_each_node_mask(nid, nodes) {
-		for_each_memcg_lru_index(idx, memcg_id) {
-			nlru = lru_node_of_index(lru, idx, nid);
-			if (!nlru)
-				continue;
-
-			isolated += list_lru_walk_node(lru, nlru, nid, isolate,
-						       cb_arg, &nr_to_walk);
-			if (nr_to_walk <= 0)
-				break;
-		}
+		total_node = atomic_long_read(&lru->node_totals[nid]);
+		if (!total_node)
+			continue;
+
+		 /*
+		  * There are items, but in less proportion. Because we have no
+		  * information about where exactly the pressure originates
+		  * from, it is better to try shrinking the few we have than to
+		  * skip it.  It might very well be that this node is under
+		  * pressure and any help would be welcome.
+		  */
+		n_node = mult_frac(total_node, nr_to_walk, total);
+		if (!n_node)
+			n_node = total_node;
+
+		/*
+		 * We will now scan all memcg-like entities (which includes the
+		 * global LRU, of index -1), and also try to maintain
+		 * proportionality among them.
+		 *
+		 * We will try to isolate:
+		 *	nr_memcg = n_node * nr_memcg_lru / total_node
+		 */
+		isolated += memcg_isolate_lru(lru, isolate, cb_arg,
+				      n_node, memcg, nid, total_node);
 	}
 	return isolated;
 }
-- 
1.8.1.4
^ permalink raw reply related	[flat|nested] 97+ messages in thread
* [PATCH v2 28/28] super: targeted memcg reclaim
  2013-03-29  9:13 [PATCH v2 00/28] memcg-aware slab shrinking Glauber Costa
                   ` (26 preceding siblings ...)
  2013-03-29  9:14 ` [PATCH v2 27/28] list_lru: reclaim proportionaly between memcgs and nodes Glauber Costa
@ 2013-03-29  9:14 ` Glauber Costa
  2013-04-01 12:38 ` [PATCH v2 00/28] memcg-aware slab shrinking Serge Hallyn
  28 siblings, 0 replies; 97+ messages in thread
From: Glauber Costa @ 2013-03-29  9:14 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-fsdevel, containers, Michal Hocko, Johannes Weiner,
	kamezawa.hiroyu, Andrew Morton, Dave Shrinnker, Greg Thelen,
	hughd, yinghan, Glauber Costa, Dave Chinner, Mel Gorman,
	Rik van Riel
We now have all our dentries and inodes placed in memcg-specific LRU
lists. All we have to do is restrict the reclaim to the said lists in
case of memcg pressure.
That can't be done so easily for the fs_objects part of the equation,
since this is heavily fs-specific. What we do is pass on the context,
and let the filesystems decide whether and how to make use of it. At
this time, we just don't shrink them under memcg pressure (no filesystem
supports it yet), leaving that for global pressure only.
Marking the superblock shrinker and its LRUs as memcg-aware will
guarantee that the shrinkers will get invoked during targeted reclaim.
Signed-off-by: Glauber Costa <glommer@parallels.com>
Cc: Dave Chinner <dchinner@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Hugh Dickins <hughd@google.com>
Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
---
 fs/dcache.c   |  6 +++---
 fs/inode.c    |  6 +++---
 fs/internal.h |  5 +++--
 fs/super.c    | 39 +++++++++++++++++++++++++++------------
 4 files changed, 36 insertions(+), 20 deletions(-)
diff --git a/fs/dcache.c b/fs/dcache.c
index 79f6820..e56291a 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -882,13 +882,13 @@ static int dentry_lru_isolate(struct list_head *item, spinlock_t *lru_lock,
  * use.
  */
 long prune_dcache_sb(struct super_block *sb, long nr_to_scan,
-		     nodemask_t *nodes_to_walk)
+		     nodemask_t *nodes_to_walk, struct mem_cgroup *memcg)
 {
 	LIST_HEAD(dispose);
 	long freed;
 
-	freed = list_lru_walk_nodemask(&sb->s_dentry_lru, dentry_lru_isolate,
-				       &dispose, nr_to_scan, nodes_to_walk);
+	freed = list_lru_walk_nodemask_memcg(&sb->s_dentry_lru,
+		dentry_lru_isolate, &dispose, nr_to_scan, nodes_to_walk, memcg);
 	shrink_dentry_list(&dispose);
 	return freed;
 }
diff --git a/fs/inode.c b/fs/inode.c
index 1332eef..291423c 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -746,13 +746,13 @@ static int inode_lru_isolate(struct list_head *item, spinlock_t *lru_lock,
  * then are freed outside inode_lock by dispose_list().
  */
 long prune_icache_sb(struct super_block *sb, long nr_to_scan,
-		     nodemask_t *nodes_to_walk)
+			nodemask_t *nodes_to_walk, struct mem_cgroup *memcg)
 {
 	LIST_HEAD(freeable);
 	long freed;
 
-	freed = list_lru_walk_nodemask(&sb->s_inode_lru, inode_lru_isolate,
-				       &freeable, nr_to_scan, nodes_to_walk);
+	freed = list_lru_walk_nodemask_memcg(&sb->s_inode_lru,
+		inode_lru_isolate, &freeable, nr_to_scan, nodes_to_walk, memcg);
 	dispose_list(&freeable);
 	return freed;
 }
diff --git a/fs/internal.h b/fs/internal.h
index ed6944e..88b292e 100644
--- a/fs/internal.h
+++ b/fs/internal.h
@@ -16,6 +16,7 @@ struct file_system_type;
 struct linux_binprm;
 struct path;
 struct mount;
+struct mem_cgroup;
 
 /*
  * block_dev.c
@@ -111,7 +112,7 @@ extern int open_check_o_direct(struct file *f);
  */
 extern spinlock_t inode_sb_list_lock;
 extern long prune_icache_sb(struct super_block *sb, long nr_to_scan,
-			    nodemask_t *nodes_to_scan);
+		    nodemask_t *nodes_to_scan, struct mem_cgroup *memcg);
 extern void inode_add_lru(struct inode *inode);
 
 /*
@@ -128,4 +129,4 @@ extern int invalidate_inodes(struct super_block *, bool);
  */
 extern struct dentry *__d_alloc(struct super_block *, const struct qstr *);
 extern long prune_dcache_sb(struct super_block *sb, long nr_to_scan,
-			    nodemask_t *nodes_to_scan);
+		    nodemask_t *nodes_to_scan, struct mem_cgroup *memcg);
diff --git a/fs/super.c b/fs/super.c
index 5c7b879..e92ebcb 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -34,6 +34,7 @@
 #include <linux/cleancache.h>
 #include <linux/fsnotify.h>
 #include <linux/lockdep.h>
+#include <linux/memcontrol.h>
 #include "internal.h"
 
 
@@ -56,6 +57,7 @@ static char *sb_writers_name[SB_FREEZE_LEVELS] = {
 static long super_cache_scan(struct shrinker *shrink, struct shrink_control *sc)
 {
 	struct super_block *sb;
+	struct mem_cgroup *memcg = sc->target_mem_cgroup;
 	long	fs_objects = 0;
 	long	total_objects;
 	long	freed = 0;
@@ -74,11 +76,13 @@ static long super_cache_scan(struct shrinker *shrink, struct shrink_control *sc)
 	if (!grab_super_passive(sb))
 		return -1;
 
-	if (sb->s_op && sb->s_op->nr_cached_objects)
+	if (sb->s_op && sb->s_op->nr_cached_objects && !memcg)
 		fs_objects = sb->s_op->nr_cached_objects(sb, &sc->nodes_to_scan);
 
-	inodes = list_lru_count_nodemask(&sb->s_inode_lru, &sc->nodes_to_scan);
-	dentries = list_lru_count_nodemask(&sb->s_dentry_lru, &sc->nodes_to_scan);
+	inodes = list_lru_count_nodemask_memcg(&sb->s_inode_lru,
+					 &sc->nodes_to_scan, memcg);
+	dentries = list_lru_count_nodemask_memcg(&sb->s_dentry_lru,
+					   &sc->nodes_to_scan, memcg);
 	total_objects = dentries + inodes + fs_objects + 1;
 
 	/* proportion the scan between the caches */
@@ -89,8 +93,8 @@ static long super_cache_scan(struct shrinker *shrink, struct shrink_control *sc)
 	 * prune the dcache first as the icache is pinned by it, then
 	 * prune the icache, followed by the filesystem specific caches
 	 */
-	freed = prune_dcache_sb(sb, dentries, &sc->nodes_to_scan);
-	freed += prune_icache_sb(sb, inodes, &sc->nodes_to_scan);
+	freed = prune_dcache_sb(sb, dentries, &sc->nodes_to_scan, memcg);
+	freed += prune_icache_sb(sb, inodes, &sc->nodes_to_scan, memcg);
 
 	if (fs_objects) {
 		fs_objects = mult_frac(sc->nr_to_scan, fs_objects,
@@ -107,20 +111,26 @@ static long super_cache_count(struct shrinker *shrink, struct shrink_control *sc
 {
 	struct super_block *sb;
 	long	total_objects = 0;
+	struct mem_cgroup *memcg = sc->target_mem_cgroup;
 
 	sb = container_of(shrink, struct super_block, s_shrink);
 
 	if (!grab_super_passive(sb))
 		return -1;
 
-	if (sb->s_op && sb->s_op->nr_cached_objects)
+	/*
+	 * Ideally we would pass memcg to nr_cached_objects, and
+	 * let the underlying filesystem decide. Most likely the
+	 * path will be if (!memcg) return;, but even then.
+	 */
+	if (sb->s_op && sb->s_op->nr_cached_objects && !memcg)
 		total_objects = sb->s_op->nr_cached_objects(sb,
 						 &sc->nodes_to_scan);
 
-	total_objects += list_lru_count_nodemask(&sb->s_dentry_lru,
-						 &sc->nodes_to_scan);
-	total_objects += list_lru_count_nodemask(&sb->s_inode_lru,
-						 &sc->nodes_to_scan);
+	total_objects += list_lru_count_nodemask_memcg(&sb->s_dentry_lru,
+					 &sc->nodes_to_scan, memcg);
+	total_objects += list_lru_count_nodemask_memcg(&sb->s_inode_lru,
+					 &sc->nodes_to_scan, memcg);
 
 	total_objects = vfs_pressure_ratio(total_objects);
 	drop_super(sb);
@@ -199,8 +209,10 @@ static struct super_block *alloc_super(struct file_system_type *type, int flags)
 		INIT_HLIST_NODE(&s->s_instances);
 		INIT_HLIST_BL_HEAD(&s->s_anon);
 		INIT_LIST_HEAD(&s->s_inodes);
-		list_lru_init(&s->s_dentry_lru);
-		list_lru_init(&s->s_inode_lru);
+
+		list_lru_init_memcg(&s->s_dentry_lru);
+		list_lru_init_memcg(&s->s_inode_lru);
+
 		INIT_LIST_HEAD(&s->s_mounts);
 		init_rwsem(&s->s_umount);
 		lockdep_set_class(&s->s_umount, &type->s_umount_key);
@@ -236,6 +248,7 @@ static struct super_block *alloc_super(struct file_system_type *type, int flags)
 		s->s_shrink.scan_objects = super_cache_scan;
 		s->s_shrink.count_objects = super_cache_count;
 		s->s_shrink.batch = 1024;
+		s->s_shrink.memcg_shrinker = true;
 	}
 out:
 	return s;
@@ -318,6 +331,8 @@ void deactivate_locked_super(struct super_block *s)
 
 		/* caches are now gone, we can safely kill the shrinker now */
 		unregister_shrinker(&s->s_shrink);
+		list_lru_destroy(&s->s_dentry_lru);
+		list_lru_destroy(&s->s_inode_lru);
 		put_filesystem(fs);
 		put_super(s);
 	} else {
-- 
1.8.1.4
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply related	[flat|nested] 97+ messages in thread
* Re: [PATCH v2 01/28] super: fix calculation of shrinkable objects for small numbers
  2013-03-29  9:13 ` [PATCH v2 01/28] super: fix calculation of shrinkable objects for small numbers Glauber Costa
@ 2013-04-01  7:16   ` Kamezawa Hiroyuki
  0 siblings, 0 replies; 97+ messages in thread
From: Kamezawa Hiroyuki @ 2013-04-01  7:16 UTC (permalink / raw)
  To: Glauber Costa
  Cc: linux-mm, linux-fsdevel, containers, Michal Hocko,
	Johannes Weiner, Andrew Morton, Dave Shrinnker, Greg Thelen,
	hughd, yinghan, Theodore Ts'o, Al Viro
(2013/03/29 18:13), Glauber Costa wrote:
> The sysctl knob sysctl_vfs_cache_pressure is used to determine which
> percentage of the shrinkable objects in our cache we should actively try
> to shrink.
> 
> It works great in situations in which we have many objects (at least
> more than 100), because the approximation errors will be negligible. But
> if this is not the case, especially when total_objects < 100, we may end
> up concluding that we have no objects at all (total / 100 = 0,  if total
> < 100).
> 
> This is certainly not the biggest killer in the world, but may matter in
> very low kernel memory situations.
> 
> [ v2: fix it for all occurrences of sysctl_vfs_cache_pressure ]
> 
> Signed-off-by: Glauber Costa <glommer@parallels.com>
> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
> CC: Dave Chinner <david@fromorbit.com>
> CC: "Theodore Ts'o" <tytso@mit.edu>
> CC: Al Viro <viro@zeniv.linux.org.uk>
I think this is reasonable.
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
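
For readers following along, the truncation described in the quoted changelog can be reproduced in userspace. This is only a sketch: pressure_naive() and pressure_frac() are hypothetical names, and the second one assumes the fix scales the remainder separately, the way the kernel's mult_frac() helper does.

```c
#include <assert.h>

/* Divide first: any total below 100 truncates to zero objects. */
static long pressure_naive(long total, long pressure)
{
	assert(total >= 0 && pressure >= 0);
	return (total / 100) * pressure;
}

/*
 * Split the total into quotient and remainder and scale both, so
 * small totals still contribute (assumed to match mult_frac()).
 */
static long pressure_frac(long total, long pressure)
{
	long quot, rem;

	assert(total >= 0 && pressure >= 0);
	quot = total / 100;
	rem = total % 100;
	return quot * pressure + (rem * pressure) / 100;
}
```

With a pressure of 100, pressure_naive(50, 100) reports no objects at all, while pressure_frac(50, 100) reports all 50. For large totals the two differ only by the rounding of the last partial hundred.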
^ permalink raw reply	[flat|nested] 97+ messages in thread
* Re: [PATCH v2 02/28] vmscan: take at least one pass with shrinkers
  2013-03-29  9:13 ` [PATCH v2 02/28] vmscan: take at least one pass with shrinkers Glauber Costa
@ 2013-04-01  7:26   ` Kamezawa Hiroyuki
  2013-04-01  8:10     ` Glauber Costa
  2013-04-08  8:42   ` Joonsoo Kim
  1 sibling, 1 reply; 97+ messages in thread
From: Kamezawa Hiroyuki @ 2013-04-01  7:26 UTC (permalink / raw)
  To: Glauber Costa
  Cc: linux-mm, linux-fsdevel, containers, Michal Hocko,
	Johannes Weiner, Andrew Morton, Dave Shrinnker, Greg Thelen,
	hughd, yinghan, Theodore Ts'o, Al Viro
(2013/03/29 18:13), Glauber Costa wrote:
> In very low free kernel memory situations, it may be the case that we
> have fewer objects to free than our initial batch size. If this is the
> case, it is better to shrink those, and open space for the new workload
> than to keep them and fail the new allocations.
> 
> More specifically, this happens because we encode this in a loop with
> the condition: "while (total_scan >= batch_size)". So if we are in such
> a case, we'll not even enter the loop.
> 
> This patch turns it into a do {} while () loop, which will
> guarantee that we scan it at least once, while keeping the behaviour
> exactly the same for the cases in which total_scan > batch_size.
> 
> Signed-off-by: Glauber Costa <glommer@parallels.com>
> Reviewed-by: Dave Chinner <david@fromorbit.com>
> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
> CC: "Theodore Ts'o" <tytso@mit.edu>
> CC: Al Viro <viro@zeniv.linux.org.uk>
> ---
>   mm/vmscan.c | 4 ++--
>   1 file changed, 2 insertions(+), 2 deletions(-)
> 
Doesn't this break
==
                /*
                 * copy the current shrinker scan count into a local variable
                 * and zero it so that other concurrent shrinker invocations
                 * don't also do this scanning work.
                 */
                nr = atomic_long_xchg(&shrinker->nr_in_batch, 0);
==
this xchg magic?
Thanks,
-Kame
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 88c5fed..fc6d45a 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -280,7 +280,7 @@ unsigned long shrink_slab(struct shrink_control *shrink,
>   					nr_pages_scanned, lru_pages,
>   					max_pass, delta, total_scan);
>   
> -		while (total_scan >= batch_size) {
> +		do {
>   			int nr_before;
>   
>   			nr_before = do_shrinker_shrink(shrinker, shrink, 0);
> @@ -294,7 +294,7 @@ unsigned long shrink_slab(struct shrink_control *shrink,
>   			total_scan -= batch_size;
>   
>   			cond_resched();
> -		}
> +		} while (total_scan >= batch_size);
>   
>   		/*
>   		 * move the unused scan count back into the shrinker in a
> 
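
The hand-off that the quoted comment describes can be modelled in userspace with C11 atomics. This is a minimal sketch, not the kernel code: struct toy_shrinker, claim_pending() and return_unused() are hypothetical names, and atomic_exchange() stands in for atomic_long_xchg().

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical model of the shrinker's deferred-work counter. */
struct toy_shrinker {
	atomic_long nr_in_batch;
};

/* Take all pending work in one step; the next caller sees zero. */
static long claim_pending(struct toy_shrinker *s)
{
	return atomic_exchange(&s->nr_in_batch, 0L);
}

/* Hand back whatever was not scanned so a later pass can pick it up. */
static void return_unused(struct toy_shrinker *s, long unused)
{
	assert(unused >= 0);
	if (unused > 0)
		atomic_fetch_add(&s->nr_in_batch, unused);
}
```

Whether the loop that follows the exchange is a while or a do-while does not change this part: the counter is zeroed before any scanning starts, and only the unused remainder goes back.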
^ permalink raw reply	[flat|nested] 97+ messages in thread
* Re: [PATCH v2 21/28] vmscan: also shrink slab in memcg pressure
  2013-03-29  9:14 ` [PATCH v2 21/28] vmscan: also shrink slab in memcg pressure Glauber Costa
@ 2013-04-01  7:46   ` Kamezawa Hiroyuki
  2013-04-01  8:51     ` Glauber Costa
  2013-04-03 10:11   ` Sha Zhengju
  1 sibling, 1 reply; 97+ messages in thread
From: Kamezawa Hiroyuki @ 2013-04-01  7:46 UTC (permalink / raw)
  To: Glauber Costa
  Cc: linux-mm, linux-fsdevel, containers, Michal Hocko,
	Johannes Weiner, Andrew Morton, Dave Shrinnker, Greg Thelen,
	hughd, yinghan, Dave Chinner, Mel Gorman, Rik van Riel
(2013/03/29 18:14), Glauber Costa wrote:
> Without the surrounding infrastructure, this patch is a bit of a hammer:
> it will basically shrink objects from all memcgs under memcg pressure.
> At least, however, we will keep the scan limited to the shrinkers marked
> as per-memcg.
> 
> Future patches will implement the in-shrinker logic to filter objects
> based on its memcg association.
> 
> Signed-off-by: Glauber Costa <glommer@parallels.com>
> Cc: Dave Chinner <dchinner@redhat.com>
> Cc: Mel Gorman <mgorman@suse.de>
> Cc: Rik van Riel <riel@redhat.com>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Michal Hocko <mhocko@suse.cz>
> Cc: Hugh Dickins <hughd@google.com>
> Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> ---
>   include/linux/memcontrol.h | 17 +++++++++++++++++
>   include/linux/shrinker.h   |  4 ++++
>   mm/memcontrol.c            | 16 +++++++++++++++-
>   mm/vmscan.c                | 46 +++++++++++++++++++++++++++++++++++++++++++---
>   4 files changed, 79 insertions(+), 4 deletions(-)
> 
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index d6183f0..4c24249 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -199,6 +199,9 @@ void mem_cgroup_split_huge_fixup(struct page *head);
>   bool mem_cgroup_bad_page_check(struct page *page);
>   void mem_cgroup_print_bad_page(struct page *page);
>   #endif
> +
> +unsigned long
> +memcg_zone_reclaimable_pages(struct mem_cgroup *memcg, struct zone *zone);
>   #else /* CONFIG_MEMCG */
>   struct mem_cgroup;
>   
> @@ -377,6 +380,12 @@ static inline void mem_cgroup_replace_page_cache(struct page *oldpage,
>   				struct page *newpage)
>   {
>   }
> +
> +static inline unsigned long
> +memcg_zone_reclaimable_pages(struct mem_cgroup *memcg, struct zone *zone)
> +{
> +	return 0;
> +}
>   #endif /* CONFIG_MEMCG */
>   
>   #if !defined(CONFIG_MEMCG) || !defined(CONFIG_DEBUG_VM)
> @@ -429,6 +438,8 @@ static inline bool memcg_kmem_enabled(void)
>   	return static_key_false(&memcg_kmem_enabled_key);
>   }
>   
> +bool memcg_kmem_is_active(struct mem_cgroup *memcg);
> +
>   /*
>    * In general, we'll do everything in our power to not incur in any overhead
>    * for non-memcg users for the kmem functions. Not even a function call, if we
> @@ -562,6 +573,12 @@ memcg_kmem_get_cache(struct kmem_cache *cachep, gfp_t gfp)
>   	return __memcg_kmem_get_cache(cachep, gfp);
>   }
>   #else
> +
> +static inline bool memcg_kmem_is_active(struct mem_cgroup *memcg)
> +{
> +	return false;
> +}
> +
>   #define for_each_memcg_cache_index(_idx)	\
>   	for (; NULL; )
>   
> diff --git a/include/linux/shrinker.h b/include/linux/shrinker.h
> index d4636a0..4e9e53b 100644
> --- a/include/linux/shrinker.h
> +++ b/include/linux/shrinker.h
> @@ -20,6 +20,9 @@ struct shrink_control {
>   
>   	/* shrink from these nodes */
>   	nodemask_t nodes_to_scan;
> +
> +	/* reclaim from this memcg only (if not NULL) */
> +	struct mem_cgroup *target_mem_cgroup;
>   };
Does this work only with kmem? If so, please rename it to something
more explicit for now, e.g. shrink_slab_memcg_target.
>   
>   /*
> @@ -45,6 +48,7 @@ struct shrinker {
>   
>   	int seeks;	/* seeks to recreate an obj */
>   	long batch;	/* reclaim batch size, 0 = default */
> +	bool memcg_shrinker; /* memcg-aware shrinker */
>   
>   	/* These are for internal use */
>   	struct list_head list;
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 2b55222..ecdae39 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -386,7 +386,7 @@ static inline void memcg_kmem_set_active(struct mem_cgroup *memcg)
>   	set_bit(KMEM_ACCOUNTED_ACTIVE, &memcg->kmem_account_flags);
>   }
>   
> -static bool memcg_kmem_is_active(struct mem_cgroup *memcg)
> +bool memcg_kmem_is_active(struct mem_cgroup *memcg)
>   {
>   	return test_bit(KMEM_ACCOUNTED_ACTIVE, &memcg->kmem_account_flags);
>   }
> @@ -942,6 +942,20 @@ mem_cgroup_zone_nr_lru_pages(struct mem_cgroup *memcg, int nid, int zid,
>   	return ret;
>   }
>   
> +unsigned long
> +memcg_zone_reclaimable_pages(struct mem_cgroup *memcg, struct zone *zone)
> +{
> +	int nid = zone_to_nid(zone);
> +	int zid = zone_idx(zone);
> +	unsigned long val;
> +
> +	val = mem_cgroup_zone_nr_lru_pages(memcg, nid, zid, LRU_ALL_FILE);
> +	if (do_swap_account)
> +		val += mem_cgroup_zone_nr_lru_pages(memcg, nid, zid,
> +						    LRU_ALL_ANON);
> +	return val;
> +}
> +
>   static unsigned long
>   mem_cgroup_node_nr_lru_pages(struct mem_cgroup *memcg,
>   			int nid, unsigned int lru_mask)
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 232dfcb..43928fd 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -138,11 +138,42 @@ static bool global_reclaim(struct scan_control *sc)
>   {
>   	return !sc->target_mem_cgroup;
>   }
> +
> +/*
> + * kmem reclaim should usually not be triggered when we are doing targetted
> + * reclaim. It is only valid when global reclaim is triggered, or when the
> + * underlying memcg has kmem objects.
> + */
> +static bool has_kmem_reclaim(struct scan_control *sc)
> +{
> +	return !sc->target_mem_cgroup ||
> +		memcg_kmem_is_active(sc->target_mem_cgroup);
> +}
Is this test hierarchy aware?
For example, in the following case,
  A      no kmem limit
   \
    B    kmem limit=XXX
     \
      C  kmem limit=XXX
what happens when A is the target?
Thanks,
-Kame
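
To make the question concrete, here is a userspace sketch of the A/B/C chain. struct toy_memcg and both functions are hypothetical. The first predicate has the same shape as has_kmem_reclaim() in the quoted patch; the second is only an illustration of what a hierarchy-aware test could look like and is not something proposed in this thread.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical memcg model; one child per level is enough for the chain. */
struct toy_memcg {
	struct toy_memcg *child;
	bool kmem_active;
};

/* Same shape as the patch: only the target itself is examined. */
static bool has_kmem_reclaim_flat(const struct toy_memcg *target)
{
	return !target || target->kmem_active;
}

/* Illustrative variant: true if the target or any descendant is active. */
static bool has_kmem_reclaim_tree(const struct toy_memcg *target)
{
	const struct toy_memcg *pos;

	if (!target)
		return true;	/* global reclaim */
	for (pos = target; pos; pos = pos->child)
		if (pos->kmem_active)
			return true;
	return false;
}
```

With A as the target, the flat test returns false even though B and C are kmem-limited, while the tree variant returns true.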
^ permalink raw reply	[flat|nested] 97+ messages in thread
* Re: [PATCH v2 22/28] memcg,list_lru: duplicate LRUs upon kmemcg creation
  2013-03-29  9:14 ` [PATCH v2 22/28] memcg,list_lru: duplicate LRUs upon kmemcg creation Glauber Costa
@ 2013-04-01  8:05   ` Kamezawa Hiroyuki
  2013-04-01  8:22     ` Glauber Costa
  0 siblings, 1 reply; 97+ messages in thread
From: Kamezawa Hiroyuki @ 2013-04-01  8:05 UTC (permalink / raw)
  To: Glauber Costa
  Cc: linux-mm, linux-fsdevel, containers, Michal Hocko,
	Johannes Weiner, Andrew Morton, Dave Shrinnker, Greg Thelen,
	hughd, yinghan, Dave Chinner, Mel Gorman, Rik van Riel
(2013/03/29 18:14), Glauber Costa wrote:
> When a new memcg is created, we need to open up room for its descriptors
> in all of the list_lrus that are marked per-memcg. The process is quite
> similar to the one we are using for the kmem caches: we initialize the
> new structures in an array indexed by kmemcg_id, and grow the array if
> needed. Key data like the size of the array will be shared between the
> kmem cache code and the list_lru code (they basically describe the same
> thing)
> 
> Signed-off-by: Glauber Costa <glommer@parallels.com>
> Cc: Dave Chinner <dchinner@redhat.com>
> Cc: Mel Gorman <mgorman@suse.de>
> Cc: Rik van Riel <riel@redhat.com>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Michal Hocko <mhocko@suse.cz>
> Cc: Hugh Dickins <hughd@google.com>
> Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> ---
>   include/linux/list_lru.h   |  37 ++++++++++-
>   include/linux/memcontrol.h |  12 ++++
>   lib/list_lru.c             | 101 +++++++++++++++++++++++++++---
>   mm/memcontrol.c            | 151 +++++++++++++++++++++++++++++++++++++++++++--
>   mm/slab_common.c           |   1 -
>   5 files changed, 285 insertions(+), 17 deletions(-)
> 
> diff --git a/include/linux/list_lru.h b/include/linux/list_lru.h
> index 02796da..d6cf126 100644
> --- a/include/linux/list_lru.h
> +++ b/include/linux/list_lru.h
> @@ -16,12 +16,47 @@ struct list_lru_node {
>   	long			nr_items;
>   } ____cacheline_aligned_in_smp;
>   
> +/*
> + * This is supposed to be M x N matrix, where M is kmem-limited memcg,
> + * and N is the number of nodes.
> + */
Could you add a comment that M can change and the array can be resized?
> +struct list_lru_array {
> +	struct list_lru_node node[1];
> +};
> +
>   struct list_lru {
>   	struct list_lru_node	node[MAX_NUMNODES];
>   	nodemask_t		active_nodes;
> +#ifdef CONFIG_MEMCG_KMEM
> +	struct list_head	lrus;
> +	struct list_lru_array	**memcg_lrus;
> +#endif
Please add comments explaining what these fields are for.
>   };
>   
> -int list_lru_init(struct list_lru *lru);
> +struct mem_cgroup;
> +#ifdef CONFIG_MEMCG_KMEM
> +struct list_lru_array *lru_alloc_array(void);
> +int memcg_update_all_lrus(unsigned long num);
> +void list_lru_destroy(struct list_lru *lru);
> +void list_lru_destroy_memcg(struct mem_cgroup *memcg);
> +int __memcg_init_lru(struct list_lru *lru);
> +#else
> +static inline void list_lru_destroy(struct list_lru *lru)
> +{
> +}
> +#endif
> +
> +int __list_lru_init(struct list_lru *lru, bool memcg_enabled);
> +static inline int list_lru_init(struct list_lru *lru)
> +{
> +	return __list_lru_init(lru, false);
> +}
> +
> +static inline int list_lru_init_memcg(struct list_lru *lru)
> +{
> +	return __list_lru_init(lru, true);
> +}
> +
>   int list_lru_add(struct list_lru *lru, struct list_head *item);
>   int list_lru_del(struct list_lru *lru, struct list_head *item);
>   long list_lru_count_nodemask(struct list_lru *lru, nodemask_t *nodes_to_count);
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 4c24249..ee3199d 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -23,6 +23,7 @@
>   #include <linux/vm_event_item.h>
>   #include <linux/hardirq.h>
>   #include <linux/jump_label.h>
> +#include <linux/list_lru.h>
>   
>   struct mem_cgroup;
>   struct page_cgroup;
> @@ -469,6 +470,12 @@ void memcg_update_array_size(int num_groups);
>   struct kmem_cache *
>   __memcg_kmem_get_cache(struct kmem_cache *cachep, gfp_t gfp);
>   
> +int memcg_new_lru(struct list_lru *lru);
> +int memcg_init_lru(struct list_lru *lru);
> +
> +int memcg_kmem_update_lru_size(struct list_lru *lru, int num_groups,
> +			       bool new_lru);
> +
>   void mem_cgroup_destroy_cache(struct kmem_cache *cachep);
>   void kmem_cache_destroy_memcg_children(struct kmem_cache *s);
>   
> @@ -632,6 +639,11 @@ memcg_kmem_get_cache(struct kmem_cache *cachep, gfp_t gfp)
>   static inline void kmem_cache_destroy_memcg_children(struct kmem_cache *s)
>   {
>   }
> +
> +static inline int memcg_init_lru(struct list_lru *lru)
> +{
> +	return 0;
> +}
>   #endif /* CONFIG_MEMCG_KMEM */
>   #endif /* _LINUX_MEMCONTROL_H */
>   
> diff --git a/lib/list_lru.c b/lib/list_lru.c
> index 0f08ed6..a9616a0 100644
> --- a/lib/list_lru.c
> +++ b/lib/list_lru.c
> @@ -8,6 +8,7 @@
>   #include <linux/module.h>
>   #include <linux/mm.h>
>   #include <linux/list_lru.h>
> +#include <linux/memcontrol.h>
>   
>   int
>   list_lru_add(
> @@ -184,18 +185,100 @@ list_lru_dispose_all(
>   	return total;
>   }
>   
> -int
> -list_lru_init(
> -	struct list_lru	*lru)
> +/*
> + * This protects the list of all LRU in the system. One only needs
> + * to take when registering an LRU, or when duplicating the list of lrus.
> + * Transversing an LRU can and should be done outside the lock
> + */
> +static DEFINE_MUTEX(all_memcg_lrus_mutex);
> +static LIST_HEAD(all_memcg_lrus);
> +
> +static void list_lru_init_one(struct list_lru_node *lru)
>   {
> +	spin_lock_init(&lru->lock);
> +	INIT_LIST_HEAD(&lru->list);
> +	lru->nr_items = 0;
> +}
> +
> +struct list_lru_array *lru_alloc_array(void)
> +{
> +	struct list_lru_array *lru_array;
>   	int i;
>   
> -	nodes_clear(lru->active_nodes);
> -	for (i = 0; i < MAX_NUMNODES; i++) {
> -		spin_lock_init(&lru->node[i].lock);
> -		INIT_LIST_HEAD(&lru->node[i].list);
> -		lru->node[i].nr_items = 0;
> +	lru_array = kzalloc(nr_node_ids * sizeof(struct list_lru_node),
> +				GFP_KERNEL);
A nitpick: you can use kmalloc() here. All fields will be overwritten.
> +	if (!lru_array)
> +		return NULL;
> +
> +	for (i = 0; i < nr_node_ids ; i++)
> +		list_lru_init_one(&lru_array->node[i]);
> +
> +	return lru_array;
> +}
> +
> +#ifdef CONFIG_MEMCG_KMEM
> +int __memcg_init_lru(struct list_lru *lru)
> +{
> +	int ret;
> +
> +	INIT_LIST_HEAD(&lru->lrus);
> +	mutex_lock(&all_memcg_lrus_mutex);
> +	list_add(&lru->lrus, &all_memcg_lrus);
> +	ret = memcg_new_lru(lru);
> +	mutex_unlock(&all_memcg_lrus_mutex);
> +	return ret;
> +}
Does this return 0 on success? What kind of error can be returned here?
> +
> +int memcg_update_all_lrus(unsigned long num)
> +{
> +	int ret = 0;
> +	struct list_lru *lru;
> +
> +	mutex_lock(&all_memcg_lrus_mutex);
> +	list_for_each_entry(lru, &all_memcg_lrus, lrus) {
> +		ret = memcg_kmem_update_lru_size(lru, num, false);
> +		if (ret)
> +			goto out;
> +	}
> +out:
> +	mutex_unlock(&all_memcg_lrus_mutex);
> +	return ret;
> +}
> +
> +void list_lru_destroy(struct list_lru *lru)
> +{
> +	if (!lru->memcg_lrus)
> +		return;
> +
> +	mutex_lock(&all_memcg_lrus_mutex);
> +	list_del(&lru->lrus);
> +	mutex_unlock(&all_memcg_lrus_mutex);
> +}
> +
> +void list_lru_destroy_memcg(struct mem_cgroup *memcg)
> +{
> +	struct list_lru *lru;
> +	mutex_lock(&all_memcg_lrus_mutex);
> +	list_for_each_entry(lru, &all_memcg_lrus, lrus) {
> +		kfree(lru->memcg_lrus[memcg_cache_id(memcg)]);
> +		lru->memcg_lrus[memcg_cache_id(memcg)] = NULL;
> +		/* everybody must beaware that this memcg is no longer valid */
> +		wmb();
>   	}
> +	mutex_unlock(&all_memcg_lrus_mutex);
> +}
> +#endif
> +
> +int __list_lru_init(struct list_lru *lru, bool memcg_enabled)
> +{
> +	int i;
> +
> +	nodes_clear(lru->active_nodes);
> +	for (i = 0; i < MAX_NUMNODES; i++)
> +		list_lru_init_one(&lru->node[i]);
> +
> +	if (memcg_enabled)
> +		return memcg_init_lru(lru);
>   	return 0;
>   }
> -EXPORT_SYMBOL_GPL(list_lru_init);
> +EXPORT_SYMBOL_GPL(__list_lru_init);
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index ecdae39..c6c90d8 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -2988,16 +2988,30 @@ int memcg_update_cache_sizes(struct mem_cgroup *memcg)
>   	memcg_kmem_set_activated(memcg);
>   
>   	ret = memcg_update_all_caches(num+1);
> -	if (ret) {
> -		ida_simple_remove(&kmem_limited_groups, num);
> -		memcg_kmem_clear_activated(memcg);
> -		return ret;
> -	}
> +	if (ret)
> +		goto out;
> +
> +	/*
> +	 * We should make sure that the array size is not updated until we are
> +	 * done; otherwise we have no easy way to know whether or not we should
> +	 * grow the array.
> +	 */
> +	ret = memcg_update_all_lrus(num + 1);
> +	if (ret)
> +		goto out;
>   
>   	memcg->kmemcg_id = num;
> +
> +	memcg_update_array_size(num + 1);
> +
>   	INIT_LIST_HEAD(&memcg->memcg_slab_caches);
>   	mutex_init(&memcg->slab_caches_mutex);
> +
>   	return 0;
> +out:
> +	ida_simple_remove(&kmem_limited_groups, num);
> +	memcg_kmem_clear_activated(memcg);
> +	return ret;
When can this failure happen? Does it happen only when the user
tries to set kmem_limit, without affecting kernel internal logic?
Thanks,
-Kame
^ permalink raw reply	[flat|nested] 97+ messages in thread
* Re: [PATCH v2 02/28] vmscan: take at least one pass with shrinkers
  2013-04-01  7:26   ` Kamezawa Hiroyuki
@ 2013-04-01  8:10     ` Glauber Costa
  2013-04-10  5:09       ` Ric Mason
  0 siblings, 1 reply; 97+ messages in thread
From: Glauber Costa @ 2013-04-01  8:10 UTC (permalink / raw)
  To: Kamezawa Hiroyuki
  Cc: linux-mm, linux-fsdevel, containers, Michal Hocko,
	Johannes Weiner, Andrew Morton, Dave Shrinnker, Greg Thelen,
	hughd, yinghan, Theodore Ts'o, Al Viro
Hi Kame,
> Doesn't this break
> 
> ==
>                 /*
>                  * copy the current shrinker scan count into a local variable
>                  * and zero it so that other concurrent shrinker invocations
>                  * don't also do this scanning work.
>                  */
>                 nr = atomic_long_xchg(&shrinker->nr_in_batch, 0);
> ==
> 
> this xchg magic?
> 
> Thanks,
> -Kame
This is done before the actual reclaim attempt, and all it does is
indicate to other concurrent shrinkers that "I've got it", so they
should not attempt to shrink.
Even before my change, this quantity represents the number of
entities we will try to shrink. We will not necessarily succeed. What my
patch does is try at least once if the number is too small.
Before it, we would try to shrink 512 objects and succeed at 0 (because
batch is 1024). After it, we will try to free 512 objects and succeed
at an undefined quantity between 0 and 512.
In both cases, we will zero out nr_in_batch in the shrinker structure to
notify other shrinkers that we are the ones shrinking.
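
The 512-versus-1024 case can be checked with a small userspace model of the two loop shapes. This is a sketch under stated assumptions: scan_once() is a hypothetical stand-in for one shrinker pass and frees everything it scans, whereas a real shrinker may free anything between 0 and that number.

```c
#include <assert.h>

/* Hypothetical cache model: one pass frees up to `batch` objects. */
static long scan_once(long *objects, long batch)
{
	long freed = *objects < batch ? *objects : batch;

	*objects -= freed;
	return freed;
}

/* Old shape: never entered when total_scan is below the batch size. */
static long shrink_while(long total_scan, long batch, long objects)
{
	long freed = 0;

	assert(batch > 0);
	while (total_scan >= batch) {
		freed += scan_once(&objects, batch);
		total_scan -= batch;
	}
	return freed;
}

/* New shape: always takes at least one pass. */
static long shrink_do_while(long total_scan, long batch, long objects)
{
	long freed = 0;

	assert(batch > 0);
	do {
		freed += scan_once(&objects, batch);
		total_scan -= batch;
	} while (total_scan >= batch);
	return freed;
}
```

With total_scan = 512 and batch = 1024, shrink_while() frees nothing and shrink_do_while() frees up to 512 objects. Once total_scan reaches the batch size, both take the same number of passes. Note that the do-while form in this model also takes one pass when total_scan is 0.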
^ permalink raw reply	[flat|nested] 97+ messages in thread
* Re: [PATCH v2 23/28] lru: add an element to a memcg list
  2013-03-29  9:14 ` [PATCH v2 23/28] lru: add an element to a memcg list Glauber Costa
@ 2013-04-01  8:18   ` Kamezawa Hiroyuki
  2013-04-01  8:29     ` Glauber Costa
  0 siblings, 1 reply; 97+ messages in thread
From: Kamezawa Hiroyuki @ 2013-04-01  8:18 UTC (permalink / raw)
  To: Glauber Costa
  Cc: linux-mm, linux-fsdevel, containers, Michal Hocko,
	Johannes Weiner, Andrew Morton, Dave Shrinnker, Greg Thelen,
	hughd, yinghan, Dave Chinner, Mel Gorman, Rik van Riel
(2013/03/29 18:14), Glauber Costa wrote:
> With the infrastructure we now have, we can add an element to a memcg
> LRU list instead of the global list. The memcg lists are still
> per-node.
> 
> Technically, we will never trigger per-node shrinking if the memcg is
> short of memory. Therefore an alternative to this would be to add the
> element to *both* a single-node memcg array and a per-node global array.
> 
Per-node shrinking by memcg pressure is not important, I think.
> There are two main reasons for this design choice:
> 
> 1) adding an extra list_head to each of the objects would waste 16-bytes
> per object, always remembering that we are talking about 1 dentry + 1
> inode in the common case. This means a close to 10 % increase in the
> dentry size, and a lower yet significant increase in the inode size. In
> terms of total memory, this design pays 32-byte per-superblock-per-node
> (size of struct list_lru_node), which means that in any scenario where
> we have more than 10 dentries + inodes, we would already be paying more
> memory in the two-list-heads approach than we will here with 1 node x 10
> superblocks. The turning point of course depends on the workload, but I
> hope the figures above would convince you that the memory footprint is
> on my side in any workload that matters.
> 
> 2) The main drawback of this, namely, that we lose global LRU order, is
> not really seen by me as a disadvantage: if we are using memcg to
> isolate the workloads, global pressure should try to balance the amount
> reclaimed from all memcgs the same way the shrinkers will already
> naturally balance the amount reclaimed from each superblock. (This
> patchset needs some love in this regard, btw).
> 
> To help us easily track down which nodes have and which nodes don't
> have elements in the list, we will count on an auxiliary node bitmap in
> the global level.
> 
> Signed-off-by: Glauber Costa <glommer@parallels.com>
> Cc: Dave Chinner <dchinner@redhat.com>
> Cc: Mel Gorman <mgorman@suse.de>
> Cc: Rik van Riel <riel@redhat.com>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Michal Hocko <mhocko@suse.cz>
> Cc: Hugh Dickins <hughd@google.com>
> Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> ---
>   include/linux/list_lru.h   | 10 +++++++
>   include/linux/memcontrol.h | 10 +++++++
>   lib/list_lru.c             | 68 +++++++++++++++++++++++++++++++++++++++-------
>   mm/memcontrol.c            | 38 +++++++++++++++++++++++++-
>   4 files changed, 115 insertions(+), 11 deletions(-)
> 
> diff --git a/include/linux/list_lru.h b/include/linux/list_lru.h
> index d6cf126..0856899 100644
> --- a/include/linux/list_lru.h
> +++ b/include/linux/list_lru.h
> @@ -26,6 +26,7 @@ struct list_lru_array {
>   
>   struct list_lru {
>   	struct list_lru_node	node[MAX_NUMNODES];
> +	atomic_long_t		node_totals[MAX_NUMNODES];
Some comments would be helpful.
>   	nodemask_t		active_nodes;
>   #ifdef CONFIG_MEMCG_KMEM
>   	struct list_head	lrus;
> @@ -40,10 +41,19 @@ int memcg_update_all_lrus(unsigned long num);
>   void list_lru_destroy(struct list_lru *lru);
>   void list_lru_destroy_memcg(struct mem_cgroup *memcg);
>   int __memcg_init_lru(struct list_lru *lru);
> +struct list_lru_node *
> +lru_node_of_index(struct list_lru *lru, int index, int nid);
>   #else
>   static inline void list_lru_destroy(struct list_lru *lru)
>   {
>   }
> +
> +static inline struct list_lru_node *
> +lru_node_of_index(struct list_lru *lru, int index, int nid)
> +{
> +	BUG_ON(index < 0); /* index != -1 with !MEMCG_KMEM. Impossible */
> +	return &lru->node[nid];
> +}
>   #endif
I'm sorry, but what does "lru_node_of_index" mean? What is the "index"?
Thanks,
-Kame
^ permalink raw reply	[flat|nested] 97+ messages in thread
* Re: [PATCH v2 22/28] memcg,list_lru: duplicate LRUs upon kmemcg creation
  2013-04-01  8:05   ` Kamezawa Hiroyuki
@ 2013-04-01  8:22     ` Glauber Costa
  0 siblings, 0 replies; 97+ messages in thread
From: Glauber Costa @ 2013-04-01  8:22 UTC (permalink / raw)
  To: Kamezawa Hiroyuki
  Cc: linux-mm, linux-fsdevel, containers, Michal Hocko,
	Johannes Weiner, Andrew Morton, Dave Shrinnker, Greg Thelen,
	hughd, yinghan, Dave Chinner, Mel Gorman, Rik van Riel
Hi Kame,
>>   
>> +/*
>> + * This is supposed to be M x N matrix, where M is kmem-limited memcg,
>> + * and N is the number of nodes.
>> + */
> 
> Could you add a comment that M can be changed and the array can be resized.
Yes, I can.
> 
>> +struct list_lru_array {
>> +	struct list_lru_node node[1];
>> +};
>> +
>>   struct list_lru {
>>   	struct list_lru_node	node[MAX_NUMNODES];
>>   	nodemask_t		active_nodes;
>> +#ifdef CONFIG_MEMCG_KMEM
>> +	struct list_head	lrus;
>> +	struct list_lru_array	**memcg_lrus;
>> +#endif
> 
> please add comments, for what ....
> 
ok.
>> +struct list_lru_array *lru_alloc_array(void)
>> +{
>> +	struct list_lru_array *lru_array;
>>   	int i;
>>   
>> -	nodes_clear(lru->active_nodes);
>> -	for (i = 0; i < MAX_NUMNODES; i++) {
>> -		spin_lock_init(&lru->node[i].lock);
>> -		INIT_LIST_HEAD(&lru->node[i].list);
>> -		lru->node[i].nr_items = 0;
>> +	lru_array = kzalloc(nr_node_ids * sizeof(struct list_lru_node),
>> +				GFP_KERNEL);
> 
> A nitpick...you can use kmalloc() here. All field will be overwritten.
It is, however, not future-proof: someone may add more fields and
forget to zero out the structure. If you really feel strongly about
kmalloc I can change it. But I don't see this as a big issue, especially
since this is not a fast path.
>> +#ifdef CONFIG_MEMCG_KMEM
>> +int __memcg_init_lru(struct list_lru *lru)
>> +{
>> +	int ret;
>> +
>> +	INIT_LIST_HEAD(&lru->lrus);
>> +	mutex_lock(&all_memcg_lrus_mutex);
>> +	list_add(&lru->lrus, &all_memcg_lrus);
>> +	ret = memcg_new_lru(lru);
>> +	mutex_unlock(&all_memcg_lrus_mutex);
>> +	return ret;
>> +}
> 
> returns 0 at success ? what kind of error can be shown here ?
> 
memcg_new_lru will allocate memory, and therefore can fail with ENOMEM.
It already returns 0 on success, so just forwarding its
return value is enough.
>> -EXPORT_SYMBOL_GPL(list_lru_init);
>> +EXPORT_SYMBOL_GPL(__list_lru_init);
>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>> index ecdae39..c6c90d8 100644
>> --- a/mm/memcontrol.c
>> +++ b/mm/memcontrol.c
>> @@ -2988,16 +2988,30 @@ int memcg_update_cache_sizes(struct mem_cgroup *memcg)
>>   	memcg_kmem_set_activated(memcg);
>>   
>>   	ret = memcg_update_all_caches(num+1);
>> -	if (ret) {
>> -		ida_simple_remove(&kmem_limited_groups, num);
>> -		memcg_kmem_clear_activated(memcg);
>> -		return ret;
>> -	}
>> +	if (ret)
>> +		goto out;
>> +
>> +	/*
>> +	 * We should make sure that the array size is not updated until we are
>> +	 * done; otherwise we have no easy way to know whether or not we should
>> +	 * grow the array.
>> +	 */
>> +	ret = memcg_update_all_lrus(num + 1);
>> +	if (ret)
>> +		goto out;
>>   
>>   	memcg->kmemcg_id = num;
>> +
>> +	memcg_update_array_size(num + 1);
>> +
>>   	INIT_LIST_HEAD(&memcg->memcg_slab_caches);
>>   	mutex_init(&memcg->slab_caches_mutex);
>> +
>>   	return 0;
>> +out:
>> +	ida_simple_remove(&kmem_limited_groups, num);
>> +	memcg_kmem_clear_activated(memcg);
>> +	return ret;
> 
> When this failure can happens ? This happens only when the user
> tries to set kmem_limit and doesn't affect kernel internal logic ?
> 
There are 2 points of failure for this:
1) setting a kmem limit where none was set before,
2) creating a new child memcg, as a child of a kmem-limited memcg.
Those are the same as for the slab caches, and indeed the LRU update is
attempted right after the slab one.
LRU initialization failures can still exist in a 3rd way, when a new LRU
is created and we have no memory available to hold its structures. But
this will not be called from here.
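
The grow-on-demand step that can fail with ENOMEM follows a common pattern, sketched below in userspace. The body of memcg_kmem_update_lru_size() is not shown in this part of the thread, so struct toy_lru and toy_update_lru_size() are hypothetical and only illustrate the general idea of a slot array indexed by kmemcg_id.

```c
#include <errno.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical per-LRU table: one slot per kmem-limited group. */
struct toy_lru {
	void **memcg_lrus;
	int nr_slots;
};

/*
 * Grow the slot array to hold `num_groups` entries. Existing slots
 * are preserved, new ones start out NULL. Returns 0 or -ENOMEM.
 */
static int toy_update_lru_size(struct toy_lru *lru, int num_groups)
{
	void **grown;

	if (num_groups <= lru->nr_slots)
		return 0;	/* already large enough */

	grown = calloc((size_t)num_groups, sizeof(*grown));
	if (!grown)
		return -ENOMEM;

	if (lru->nr_slots)
		memcpy(grown, lru->memcg_lrus,
		       (size_t)lru->nr_slots * sizeof(*grown));
	free(lru->memcg_lrus);
	lru->memcg_lrus = grown;
	lru->nr_slots = num_groups;
	return 0;
}
```

In this model the old array stays valid until the new one is fully populated, so a failed allocation leaves the table untouched and the caller can simply propagate the error.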
^ permalink raw reply	[flat|nested] 97+ messages in thread
* Re: [PATCH v2 23/28] lru: add an element to a memcg list
  2013-04-01  8:18   ` Kamezawa Hiroyuki
@ 2013-04-01  8:29     ` Glauber Costa
  0 siblings, 0 replies; 97+ messages in thread
From: Glauber Costa @ 2013-04-01  8:29 UTC (permalink / raw)
  To: Kamezawa Hiroyuki
  Cc: linux-mm, linux-fsdevel, containers, Michal Hocko,
	Johannes Weiner, Andrew Morton, Dave Shrinnker, Greg Thelen,
	hughd, yinghan, Dave Chinner, Mel Gorman, Rik van Riel
On 04/01/2013 12:18 PM, Kamezawa Hiroyuki wrote:
> (2013/03/29 18:14), Glauber Costa wrote:
>> With the infrastructure we now have, we can add an element to a memcg
>> LRU list instead of the global list. The memcg lists are still
>> per-node.
>>
>> Technically, we will never trigger per-node shrinking in the memcg is
>> short of memory. Therefore an alternative to this would be to add the
>> element to *both* a single-node memcg array and a per-node global array.
>>
> 
> per-node shrinking by memcg pressure is not imporant, I think.
> 
No, it is not. And this is precisely what I've stated: "we will never
trigger per-node shrinking [if] the memcg is short of memory."
This is to clarify that this design decision does not come from the need
to do that, which we don't have, but rather to save memory. Keeping
memcg objects per-node is less memory-expensive than adding an extra LRU
to the dentries and inodes. Therefore I do that, and when global
pressure kicks in I will scan all memcgs that belong to that node.
This will break global LRU order, but will help maintain fairness among
different memcgs.
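
A rough userspace model of that trade-off, under loud assumptions: the demo_* names, the fixed array sizes, and the meaning given to node_totals (one running count per node covering the global list and every memcg list) are all illustrative guesses, not the patch's actual layout. The sketch shows global pressure on one node visiting every memcg list of that node and taking a bounded share from each, which gives up strict global LRU order in exchange for fairness.

```c
#include <assert.h>

#define DEMO_NODES  2	/* hypothetical node count */
#define DEMO_MEMCGS 3	/* hypothetical number of kmem-limited memcgs */

struct demo_lru {
	long global[DEMO_NODES];		/* objects owned by no memcg */
	long memcg[DEMO_MEMCGS][DEMO_NODES];	/* objects per memcg, per node */
	long node_totals[DEMO_NODES];		/* assumed: all objects on a node */
};

/* memcg_idx < 0 means "no memcg": the object goes on the global list. */
void demo_lru_add(struct demo_lru *lru, int memcg_idx, int nid)
{
	assert(nid >= 0 && nid < DEMO_NODES && memcg_idx < DEMO_MEMCGS);
	if (memcg_idx < 0)
		lru->global[nid]++;
	else
		lru->memcg[memcg_idx][nid]++;
	lru->node_totals[nid]++;
}

/* Global pressure on one node: take up to per_list objects from each list. */
long demo_shrink_node(struct demo_lru *lru, int nid, long per_list)
{
	long freed = 0, n;
	int i;

	assert(nid >= 0 && nid < DEMO_NODES && per_list >= 0);

	n = lru->global[nid] < per_list ? lru->global[nid] : per_list;
	lru->global[nid] -= n;
	freed += n;

	for (i = 0; i < DEMO_MEMCGS; i++) {
		n = lru->memcg[i][nid] < per_list ? lru->memcg[i][nid] : per_list;
		lru->memcg[i][nid] -= n;
		freed += n;
	}

	lru->node_totals[nid] -= freed;
	return freed;
}
```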
>>   
>>   struct list_lru {
>>   	struct list_lru_node	node[MAX_NUMNODES];
>> +	atomic_long_t		node_totals[MAX_NUMNODES];
> 
> some comments will be helpful. 
> 
Yes, they will!
>> +
>> +static inline struct list_lru_node *
>> +lru_node_of_index(struct list_lru *lru, int index, int nid)
>> +{
>> +	BUG_ON(index < 0); /* index != -1 with !MEMCG_KMEM. Impossible */
>> +	return &lru->node[nid];
>> +}
>>   #endif
> 
> I'm sorry ...what "lru_node_of_index" means ? What is the "index" ?
There is extensive documentation for this above the
for_each_memcg_lru_index macro, so I didn't bother rewriting it here. But I
can add pointers like "see more at for_each..."
Basically, this will be either the memcg index if we want memcg reclaim,
or -1 for the global LRU. Since it is not always a memcg index, I called
it just "index".
IOW, it selects the memcg array if index >= 0, or the global
array if index < 0.
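
As a sketch of that convention for the CONFIG_MEMCG_KMEM case (the quoted hunk only shows the !MEMCG_KMEM stub), assuming a hypothetical fixed-size layout: the demo_* names and the two-dimensional memcg_node array are invented for illustration and are not the structures from the series.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_NODES  2	/* hypothetical node count */
#define DEMO_MEMCGS 4	/* hypothetical size of the memcg array */

struct demo_lru_node {
	long nr_items;
};

struct demo_list_lru {
	struct demo_lru_node node[DEMO_NODES];			/* global lists */
	struct demo_lru_node memcg_node[DEMO_MEMCGS][DEMO_NODES];	/* per-memcg lists */
};

/* index < 0 selects the global per-node list, index >= 0 a memcg's list. */
struct demo_lru_node *
demo_lru_node_of_index(struct demo_list_lru *lru, int index, int nid)
{
	assert(lru != NULL);
	assert(nid >= 0 && nid < DEMO_NODES);
	assert(index < DEMO_MEMCGS);

	if (index < 0)
		return &lru->node[nid];
	return &lru->memcg_node[index][nid];
}
```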
^ permalink raw reply	[flat|nested] 97+ messages in thread
* Re: [PATCH v2 26/28] memcg: per-memcg kmem shrinking
  2013-03-29  9:14 ` [PATCH v2 26/28] memcg: per-memcg kmem shrinking Glauber Costa
@ 2013-04-01  8:31   ` Kamezawa Hiroyuki
  2013-04-01  8:48     ` Glauber Costa
  0 siblings, 1 reply; 97+ messages in thread
From: Kamezawa Hiroyuki @ 2013-04-01  8:31 UTC (permalink / raw)
  To: Glauber Costa
  Cc: linux-mm, linux-fsdevel, containers, Michal Hocko,
	Johannes Weiner, Andrew Morton, Dave Shrinnker, Greg Thelen,
	hughd, yinghan, Dave Chinner, Mel Gorman, Rik van Riel
(2013/03/29 18:14), Glauber Costa wrote:
> If the kernel limit is smaller than the user limit, we will have
> situations in which our allocations fail but freeing user pages will buy
> us nothing.  In those, we would like to call a specialized memcg
> reclaimer that only frees kernel memory and leave the user memory alone.
> Those are also expected to fail when we account memcg->kmem, instead of
> when we account memcg->res. Based on that, this patch implements a
> memcg-specific reclaimer, that only shrinks kernel objects, withouth
> touching user pages.
> 
> There might be situations in which there are plenty of objects to
> shrink, but we can't do it because the __GFP_FS flag is not set.
> Although they can happen with user pages, they are a lot more common
> with fs-metadata: this is the case with almost all inode allocation.
> 
> Those allocations are, however, capable of waiting.  So we can just span
> a worker, let it finish its job and proceed with the allocation. As slow
> as it is, at this point we are already past any hopes anyway.
> 
> Signed-off-by: Glauber Costa <glommer@parallels.com>
> Cc: Dave Chinner <dchinner@redhat.com>
> Cc: Mel Gorman <mgorman@suse.de>
> Cc: Rik van Riel <riel@redhat.com>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Michal Hocko <mhocko@suse.cz>
> Cc: Hugh Dickins <hughd@google.com>
> Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> ---
>   include/linux/swap.h |   2 +
>   mm/memcontrol.c      | 182 ++++++++++++++++++++++++++++++++++++++++-----------
>   mm/vmscan.c          |  37 ++++++++++-
>   3 files changed, 183 insertions(+), 38 deletions(-)
> 
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index 2818a12..80f6635 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -268,6 +268,8 @@ extern unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
>   extern int __isolate_lru_page(struct page *page, isolate_mode_t mode);
>   extern unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *mem,
>   						  gfp_t gfp_mask, bool noswap);
> +extern unsigned long try_to_free_mem_cgroup_kmem(struct mem_cgroup *mem,
> +						 gfp_t gfp_mask);
>   extern unsigned long mem_cgroup_shrink_node_zone(struct mem_cgroup *mem,
>   						gfp_t gfp_mask, bool noswap,
>   						struct zone *zone,
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 89b7ffb..a5a0f39 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -353,6 +353,8 @@ struct mem_cgroup {
>   	atomic_t	numainfo_events;
>   	atomic_t	numainfo_updating;
>   #endif
> +	/* when kmem shrinkers can sleep but can't proceed due to context */
> +	struct work_struct kmemcg_shrink_work;
>   	/*
>   	 * Per cgroup active and inactive list, similar to the
>   	 * per zone LRU lists.
> @@ -369,11 +371,14 @@ static size_t memcg_size(void)
>   		nr_node_ids * sizeof(struct mem_cgroup_per_node);
>   }
>   
> +static DEFINE_MUTEX(set_limit_mutex);
> +
>   /* internal only representation about the status of kmem accounting. */
>   enum {
>   	KMEM_ACCOUNTED_ACTIVE = 0, /* accounted by this cgroup itself */
>   	KMEM_ACCOUNTED_ACTIVATED, /* static key enabled. */
>   	KMEM_ACCOUNTED_DEAD, /* dead memcg with pending kmem charges */
> +	KMEM_MAY_SHRINK, /* kmem limit < mem limit, shrink kmem only */
>   };
>   
>   /* We account when limit is on, but only after call sites are patched */
> @@ -412,6 +417,31 @@ static bool memcg_kmem_test_and_clear_dead(struct mem_cgroup *memcg)
>   	return test_and_clear_bit(KMEM_ACCOUNTED_DEAD,
>   				  &memcg->kmem_account_flags);
>   }
> +
> +/*
> + * If the kernel limit is smaller than the user limit, we will have situations
> + * in which our allocations fail but freeing user pages will buy us nothing.
> + * In those, we would like to call a specialized memcg reclaimer that only
> + * frees kernel memory and leave the user memory alone.
> + *
> + * This test exists so we can differentiate between those. Everytime one of the
> + * limits is updated, we need to run it. The set_limit_mutex must be held, so
> + * they don't change again.
> + */
> +static void memcg_update_shrink_status(struct mem_cgroup *memcg)
> +{
> +	mutex_lock(&set_limit_mutex);
> +	if (res_counter_read_u64(&memcg->kmem, RES_LIMIT) <
> +		res_counter_read_u64(&memcg->res, RES_LIMIT))
> +		set_bit(KMEM_MAY_SHRINK, &memcg->kmem_account_flags);
> +	else
> +		clear_bit(KMEM_MAY_SHRINK, &memcg->kmem_account_flags);
> +	mutex_unlock(&set_limit_mutex);
> +}
> +#else
> +static void memcg_update_shrink_status(struct mem_cgroup *memcg)
> +{
> +}
>   #endif
>   
>   /* Stuffs for move charges at task migration. */
> @@ -2838,8 +2868,6 @@ static void __mem_cgroup_commit_charge(struct mem_cgroup *memcg,
>   	memcg_check_events(memcg, page);
>   }
>   
> -static DEFINE_MUTEX(set_limit_mutex);
> -
>   #ifdef CONFIG_MEMCG_KMEM
>   static inline bool memcg_can_account_kmem(struct mem_cgroup *memcg)
>   {
> @@ -2881,16 +2909,92 @@ static int mem_cgroup_slabinfo_read(struct cgroup *cont, struct cftype *cft,
>   }
>   #endif
>   
> +/*
> + * During the creation a new cache, we need to disable our accounting mechanism
> + * altogether. This is true even if we are not creating, but rather just
> + * enqueing new caches to be created.
> + *
> + * This is because that process will trigger allocations; some visible, like
> + * explicit kmallocs to auxiliary data structures, name strings and internal
> + * cache structures; some well concealed, like INIT_WORK() that can allocate
> + * objects during debug.
> + *
> + * If any allocation happens during memcg_kmem_get_cache, we will recurse back
> + * to it. This may not be a bounded recursion: since the first cache creation
> + * failed to complete (waiting on the allocation), we'll just try to create the
> + * cache again, failing at the same point.
> + *
> + * memcg_kmem_get_cache is prepared to abort after seeing a positive count of
> + * memcg_kmem_skip_account. So we enclose anything that might allocate memory
> + * inside the following two functions.
> + */
> +static inline void memcg_stop_kmem_account(void)
> +{
> +	VM_BUG_ON(!current->mm);
> +	current->memcg_kmem_skip_account++;
> +}
> +
> +static inline void memcg_resume_kmem_account(void)
> +{
> +	VM_BUG_ON(!current->mm);
> +	current->memcg_kmem_skip_account--;
> +}
> +
> +static int memcg_try_charge_kmem(struct mem_cgroup *memcg, gfp_t gfp, u64 size)
> +{
> +	int retries = MEM_CGROUP_RECLAIM_RETRIES;
I'm not sure this retry count, which was chosen for the anon/file LRUs, is suitable for kmem.
> +	struct res_counter *fail_res;
> +	int ret;
> +
> +	do {
> +		ret = res_counter_charge(&memcg->kmem, size, &fail_res);
> +		if (!ret)
> +			return ret;
> +
> +		if (!(gfp & __GFP_WAIT))
> +			return ret;
> +
> +		/*
> +		 * We will try to shrink kernel memory present in caches. We
> +		 * are sure that we can wait, so we will. The duration of our
> +		 * wait is determined by congestion, the same way as vmscan.c
> +		 *
> +		 * If we are in FS context, though, then although we can wait,
> +		 * we cannot call the shrinkers. Most fs shrinkers (which
> +		 * comprises most of our kmem data) will not run without
> +		 * __GFP_FS since they can deadlock. The solution is to
> +		 * synchronously run that in a different context.
> +		 */
> +		if (!(gfp & __GFP_FS)) {
> +			/*
> +			 * we are already short on memory, every queue
> +			 * allocation is likely to fail
> +			 */
> +			memcg_stop_kmem_account();
> +			schedule_work(&memcg->kmemcg_shrink_work);
> +			flush_work(&memcg->kmemcg_shrink_work);
> +			memcg_resume_kmem_account();
> +		} else if (!try_to_free_mem_cgroup_kmem(memcg, gfp))
> +			congestion_wait(BLK_RW_ASYNC, HZ/10);
Why congestion_wait()? I think calling congestion_wait() in vmscan.c is
part of the memory-reclaim logic, but I don't think the caller should do
this kind of voluntary wait without a good reason.
> +
> +	} while (retries--);
> +
> +	return ret;
> +}
> +
>   static int memcg_charge_kmem(struct mem_cgroup *memcg, gfp_t gfp, u64 size)
>   {
>   	struct res_counter *fail_res;
>   	struct mem_cgroup *_memcg;
>   	int ret = 0;
>   	bool may_oom;
> +	bool kmem_first = test_bit(KMEM_MAY_SHRINK, &memcg->kmem_account_flags);
>   
> -	ret = res_counter_charge(&memcg->kmem, size, &fail_res);
> -	if (ret)
> -		return ret;
> +	if (kmem_first) {
> +		ret = memcg_try_charge_kmem(memcg, gfp, size);
> +		if (ret)
> +			return ret;
> +	}
>   
>   	/*
>   	 * Conditions under which we can wait for the oom_killer. Those are
> @@ -2923,12 +3027,43 @@ static int memcg_charge_kmem(struct mem_cgroup *memcg, gfp_t gfp, u64 size)
>   			res_counter_charge_nofail(&memcg->memsw, size,
>   						  &fail_res);
>   		ret = 0;
> -	} else if (ret)
> +		if (!kmem_first)
> +			res_counter_charge_nofail(&memcg->kmem, size, &fail_res);
> +	} else if (ret && kmem_first)
>   		res_counter_uncharge(&memcg->kmem, size);
>   
> +	if (!kmem_first) {
> +		ret = memcg_try_charge_kmem(memcg, gfp, size);
> +		if (!ret)
> +			return ret;
> +
> +		res_counter_uncharge(&memcg->res, size);
> +		if (do_swap_account)
> +			res_counter_uncharge(&memcg->memsw, size);
> +	}
> +
>   	return ret;
>   }
>   
> +/*
> + * There might be situations in which there are plenty of objects to shrink,
> + * but we can't do it because the __GFP_FS flag is not set.  This is the case
> + * with almost all inode allocation. They do are, however, capable of waiting.
> + * So we can just span a worker, let it finish its job and proceed with the
> + * allocation. As slow as it is, at this point we are already past any hopes
> + * anyway.
> + */
> +static void kmemcg_shrink_work_fn(struct work_struct *w)
> +{
> +	struct mem_cgroup *memcg;
> +
> +	memcg = container_of(w, struct mem_cgroup, kmemcg_shrink_work);
> +
> +	if (!try_to_free_mem_cgroup_kmem(memcg, GFP_KERNEL))
> +		congestion_wait(BLK_RW_ASYNC, HZ/10);
ditto..
> +}
> +
> +
>   static void memcg_uncharge_kmem(struct mem_cgroup *memcg, u64 size)
>   {
>   	res_counter_uncharge(&memcg->res, size);
> @@ -3005,6 +3140,7 @@ int memcg_update_cache_sizes(struct mem_cgroup *memcg)
>   	memcg_update_array_size(num + 1);
>   
>   	INIT_LIST_HEAD(&memcg->memcg_slab_caches);
> +	INIT_WORK(&memcg->kmemcg_shrink_work, kmemcg_shrink_work_fn);
>   	mutex_init(&memcg->slab_caches_mutex);
>   
>   	return 0;
> @@ -3281,37 +3417,6 @@ out:
>   	kfree(s->memcg_params);
>   }
>   
> -/*
> - * During the creation a new cache, we need to disable our accounting mechanism
> - * altogether. This is true even if we are not creating, but rather just
> - * enqueing new caches to be created.
> - *
> - * This is because that process will trigger allocations; some visible, like
> - * explicit kmallocs to auxiliary data structures, name strings and internal
> - * cache structures; some well concealed, like INIT_WORK() that can allocate
> - * objects during debug.
> - *
> - * If any allocation happens during memcg_kmem_get_cache, we will recurse back
> - * to it. This may not be a bounded recursion: since the first cache creation
> - * failed to complete (waiting on the allocation), we'll just try to create the
> - * cache again, failing at the same point.
> - *
> - * memcg_kmem_get_cache is prepared to abort after seeing a positive count of
> - * memcg_kmem_skip_account. So we enclose anything that might allocate memory
> - * inside the following two functions.
> - */
> -static inline void memcg_stop_kmem_account(void)
> -{
> -	VM_BUG_ON(!current->mm);
> -	current->memcg_kmem_skip_account++;
> -}
> -
> -static inline void memcg_resume_kmem_account(void)
> -{
> -	VM_BUG_ON(!current->mm);
> -	current->memcg_kmem_skip_account--;
> -}
> -
>   static struct mem_cgroup *mem_cgroup_from_kmem_page(struct page *page)
>   {
>   	struct page_cgroup *pc;
> @@ -5292,6 +5397,9 @@ static int mem_cgroup_write(struct cgroup *cont, struct cftype *cft,
>   			ret = memcg_update_kmem_limit(cont, val);
>   		else
>   			return -EINVAL;
> +
> +		if (!ret)
> +			memcg_update_shrink_status(memcg);
>   		break;
>   	case RES_SOFT_LIMIT:
>   		ret = res_counter_memparse_write_strategy(buffer, &val);
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 43928fd..dd235e6 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2504,7 +2504,42 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
>   
>   	return nr_reclaimed;
>   }
> -#endif
> +
> +#ifdef CONFIG_MEMCG_KMEM
> +/*
> + * This function is called when we are under kmem-specific pressure.  It will
> + * only trigger in environments with kmem.limit_in_bytes < limit_in_bytes, IOW,
> + * with a lower kmem allowance than the memory allowance.
> + *
> + * In this situation, freeing user pages from the cgroup won't do us any good.
> + * What we really need is to call the memcg-aware shrinkers, in the hope of
> + * freeing pages holding kmem objects. It may also be that we won't be able to
> + * free any pages, but will get rid of old objects opening up space for new
> + * ones.
> + */
> +unsigned long try_to_free_mem_cgroup_kmem(struct mem_cgroup *memcg,
> +					  gfp_t gfp_mask)
> +{
> +	struct shrink_control shrink = {
> +		.gfp_mask = gfp_mask,
> +		.target_mem_cgroup = memcg,
> +	};
> +
> +	if (!(gfp_mask & __GFP_WAIT))
> +		return 0;
> +
> +	nodes_setall(shrink.nodes_to_scan);
> +
> +	/*
> +	 * We haven't scanned any user LRU, so we basically come up with
> +	 * crafted values of nr_scanned and LRU page (1 and 0 respectively).
> +	 * This should be enough to tell shrink_slab that the freeing
> +	 * responsibility is all on himself.
> +	 */
> +	return shrink_slab(&shrink, 1, 0);
> +}
> +#endif /* CONFIG_MEMCG_KMEM */
> +#endif /* CONFIG_MEMCG */
>   
>   static void age_active_anon(struct zone *zone, struct scan_control *sc)
>   {
> 
Thanks,
-Kame
^ permalink raw reply	[flat|nested] 97+ messages in thread
* Re: [PATCH v2 26/28] memcg: per-memcg kmem shrinking
  2013-04-01  8:31   ` Kamezawa Hiroyuki
@ 2013-04-01  8:48     ` Glauber Costa
  2013-04-01  9:01       ` Kamezawa Hiroyuki
  0 siblings, 1 reply; 97+ messages in thread
From: Glauber Costa @ 2013-04-01  8:48 UTC (permalink / raw)
  To: Kamezawa Hiroyuki
  Cc: linux-mm, linux-fsdevel, containers, Michal Hocko,
	Johannes Weiner, Andrew Morton, Dave Shrinnker, Greg Thelen,
	hughd, yinghan, Dave Chinner, Mel Gorman, Rik van Riel
>> +static int memcg_try_charge_kmem(struct mem_cgroup *memcg, gfp_t gfp, u64 size)
>> +{
>> +	int retries = MEM_CGROUP_RECLAIM_RETRIES;
> 
> I'm not sure this retry numbers, for anon/file LRUs is suitable for kmem.
> 
Suggestions ?
>> +	struct res_counter *fail_res;
>> +	int ret;
>> +
>> +	do {
>> +		ret = res_counter_charge(&memcg->kmem, size, &fail_res);
>> +		if (!ret)
>> +			return ret;
>> +
>> +		if (!(gfp & __GFP_WAIT))
>> +			return ret;
>> +
>> +		/*
>> +		 * We will try to shrink kernel memory present in caches. We
>> +		 * are sure that we can wait, so we will. The duration of our
>> +		 * wait is determined by congestion, the same way as vmscan.c
>> +		 *
>> +		 * If we are in FS context, though, then although we can wait,
>> +		 * we cannot call the shrinkers. Most fs shrinkers (which
>> +		 * comprises most of our kmem data) will not run without
>> +		 * __GFP_FS since they can deadlock. The solution is to
>> +		 * synchronously run that in a different context.
>> +		 */
>> +		if (!(gfp & __GFP_FS)) {
>> +			/*
>> +			 * we are already short on memory, every queue
>> +			 * allocation is likely to fail
>> +			 */
>> +			memcg_stop_kmem_account();
>> +			schedule_work(&memcg->kmemcg_shrink_work);
>> +			flush_work(&memcg->kmemcg_shrink_work);
>> +			memcg_resume_kmem_account();
>> +		} else if (!try_to_free_mem_cgroup_kmem(memcg, gfp))
>> +			congestion_wait(BLK_RW_ASYNC, HZ/10);
> 
> Why congestion_wait() ? I think calling congestion_wait() in vmscan.c is
> a part of memory-reclaim logic but I don't think the caller should do
> this kind of voluteer wait without good reason..
> 
> 
Although it is not the case with dentries (or inodes, since only
non-dirty inodes go to the LRU list), some objects we are freeing may
need time to be written back to disk. This is the case, for instance, with
buffer heads and bios. They will not be actively shrunk by the
shrinkers, but it is my understanding that they will be released. Inodes,
as well, may get time to be written back and become non-dirty.
In practice, in my tests, the retry would almost always fail if
we don't wait, and almost always succeed if we do wait.
Am I missing something in this interpretation?
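
The shape of the loop under discussion can be modelled in userspace as follows. This is a sketch under assumptions: DEMO_RETRIES stands in for MEM_CGROUP_RECLAIM_RETRIES with an arbitrary value, the demo_* names are hypothetical, and "waiting" is reduced to a counter plus a pool of bytes that only become free once the wait has happened.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define DEMO_RETRIES 5	/* stand-in for MEM_CGROUP_RECLAIM_RETRIES */

struct demo_counter {
	unsigned long usage;
	unsigned long limit;
	unsigned long reclaimable;	/* hypothetical: freed only after a wait */
	int waits;			/* number of reclaim-and-wait rounds */
};

static int demo_charge(struct demo_counter *c, unsigned long size)
{
	if (c->usage + size > c->limit)
		return -12;	/* -ENOMEM */
	c->usage += size;
	return 0;
}

/* Models "shrink, then wait": objects are released once writeback had time. */
static void demo_reclaim_and_wait(struct demo_counter *c, unsigned long chunk)
{
	unsigned long n = c->reclaimable < chunk ? c->reclaimable : chunk;

	if (n > c->usage)
		n = c->usage;
	c->reclaimable -= n;
	c->usage -= n;
	c->waits++;
}

int demo_try_charge(struct demo_counter *c, unsigned long size, bool can_wait)
{
	int retries = DEMO_RETRIES;
	int ret;

	assert(c != NULL);
	do {
		ret = demo_charge(c, size);
		if (!ret || !can_wait)
			return ret;
		demo_reclaim_and_wait(c, size);
	} while (retries--);	/* post-decrement: DEMO_RETRIES + 1 rounds */

	return ret;
}
```

Note that, as in the patch, the post-decrement in the loop condition means the body runs one more time than the retry constant suggests.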
^ permalink raw reply	[flat|nested] 97+ messages in thread
* Re: [PATCH v2 21/28] vmscan: also shrink slab in memcg pressure
  2013-04-01  7:46   ` Kamezawa Hiroyuki
@ 2013-04-01  8:51     ` Glauber Costa
  0 siblings, 0 replies; 97+ messages in thread
From: Glauber Costa @ 2013-04-01  8:51 UTC (permalink / raw)
  To: Kamezawa Hiroyuki
  Cc: linux-mm, linux-fsdevel, containers, Michal Hocko,
	Johannes Weiner, Andrew Morton, Dave Shrinnker, Greg Thelen,
	hughd, yinghan, Dave Chinner, Mel Gorman, Rik van Riel
Hi Kame,
>>   /*
>>    * In general, we'll do everything in our power to not incur in any overhead
>>    * for non-memcg users for the kmem functions. Not even a function call, if we
>> @@ -562,6 +573,12 @@ memcg_kmem_get_cache(struct kmem_cache *cachep, gfp_t gfp)
>>   	return __memcg_kmem_get_cache(cachep, gfp);
>>   }
>>   #else
>> +
>> +static inline bool memcg_kmem_is_active(struct mem_cgroup *memcg)
>> +{
>> +	return false;
>> +}
>> +
>>   #define for_each_memcg_cache_index(_idx)	\
>>   	for (; NULL; )
>>   
>> diff --git a/include/linux/shrinker.h b/include/linux/shrinker.h
>> index d4636a0..4e9e53b 100644
>> --- a/include/linux/shrinker.h
>> +++ b/include/linux/shrinker.h
>> @@ -20,6 +20,9 @@ struct shrink_control {
>>   
>>   	/* shrink from these nodes */
>>   	nodemask_t nodes_to_scan;
>> +
>> +	/* reclaim from this memcg only (if not NULL) */
>> +	struct mem_cgroup *target_mem_cgroup;
>>   };
> 
> Does this works only with kmem ? If so, please rename to some explicit
> name for now.
> 
>   shrink_slab_memcg_target or some ?
No, this is not kmem specific. It will be used (so far) to determine
which shrinkers to shrink from, but since we are now including
shrink_slab in user pressure as well, this can very well be filled by
user memory pressure code. (This will be the case, for instance, if umem
== kmem)
Therefore, it is the same target_mem_cgroup context we are already
passing around in other vmscan functions. But shrink_control had none,
and now we are attaching it there.
So I would like to keep the name neutral, referring just to the memcg.
> 
>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>> index 2b55222..ecdae39 100644
>> --- a/mm/memcontrol.c
>> +++ b/mm/memcontrol.c
>> @@ -386,7 +386,7 @@ static inline void memcg_kmem_set_active(struct mem_cgroup *memcg)
>>   	set_bit(KMEM_ACCOUNTED_ACTIVE, &memcg->kmem_account_flags);
>>   }
>>   
>> -static bool memcg_kmem_is_active(struct mem_cgroup *memcg)
>> +bool memcg_kmem_is_active(struct mem_cgroup *memcg)
>>   {
>>   	return test_bit(KMEM_ACCOUNTED_ACTIVE, &memcg->kmem_account_flags);
>>   }
>> @@ -942,6 +942,20 @@ mem_cgroup_zone_nr_lru_pages(struct mem_cgroup *memcg, int nid, int zid,
>>   	return ret;
>>   }
>>   
>> +unsigned long
>> +memcg_zone_reclaimable_pages(struct mem_cgroup *memcg, struct zone *zone)
>> +{
>> +	int nid = zone_to_nid(zone);
>> +	int zid = zone_idx(zone);
>> +	unsigned long val;
>> +
>> +	val = mem_cgroup_zone_nr_lru_pages(memcg, nid, zid, LRU_ALL_FILE);
>> +	if (do_swap_account)
>> +		val += mem_cgroup_zone_nr_lru_pages(memcg, nid, zid,
>> +						    LRU_ALL_ANON);
>> +	return val;
>> +}
>> +
>>   static unsigned long
>>   mem_cgroup_node_nr_lru_pages(struct mem_cgroup *memcg,
>>   			int nid, unsigned int lru_mask)
>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>> index 232dfcb..43928fd 100644
>> --- a/mm/vmscan.c
>> +++ b/mm/vmscan.c
>> @@ -138,11 +138,42 @@ static bool global_reclaim(struct scan_control *sc)
>>   {
>>   	return !sc->target_mem_cgroup;
>>   }
>> +
>> +/*
>> + * kmem reclaim should usually not be triggered when we are doing targetted
>> + * reclaim. It is only valid when global reclaim is triggered, or when the
>> + * underlying memcg has kmem objects.
>> + */
>> +static bool has_kmem_reclaim(struct scan_control *sc)
>> +{
>> +	return !sc->target_mem_cgroup ||
>> +		memcg_kmem_is_active(sc->target_mem_cgroup);
>> +}
> 
> Is this test hierarchy aware ?
> 
> For example, in following case,
> 
>   A      no kmem limit
>    \
>     B    kmem limit=XXX
>      \
>       C  kmem limit=XXX
> 
> what happens when A is the target.
> 
When A is under pressure, we won't scan A. I coded it like this because
the slabs are local, even if the charges are not.
In other words, because I won't scan the memcgs hierarchically, I didn't
bother checking their kmem awareness hierarchically.
But I am still thinking about that, and your input is very welcome.
On one hand, A won't have a kmem res_counter, so we won't be able to
uncharge anything from it. On the other hand, the charges are also
accumulated on the user res_counter of A. Under user pressure, it may be
important to free this memory. So I am inclined to change that.
Do you agree?
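
One possible shape for a hierarchy-aware version of the test, purely as a sketch of the change being considered: the demo_* tree with a fixed children array is a hypothetical model, not how memcgs are linked in the kernel, and the series itself does not contain this function.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define DEMO_MAX_CHILDREN 4

struct demo_memcg {
	bool kmem_active;	/* models memcg_kmem_is_active() */
	struct demo_memcg *children[DEMO_MAX_CHILDREN];
	int nr_children;
};

/* True if the group itself, or any descendant, has kmem accounting active. */
bool demo_subtree_has_kmem(const struct demo_memcg *memcg)
{
	int i;

	assert(memcg != NULL);
	if (memcg->kmem_active)
		return true;
	for (i = 0; i < memcg->nr_children; i++)
		if (demo_subtree_has_kmem(memcg->children[i]))
			return true;
	return false;
}

/* target == NULL models global reclaim, which always considers slab. */
bool demo_has_kmem_reclaim(const struct demo_memcg *target)
{
	return target == NULL || demo_subtree_has_kmem(target);
}
```

In Kame's A/B/C example, A has no kmem limit but B and C do, so this variant would report that slab reclaim is worthwhile when A is the target.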
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply	[flat|nested] 97+ messages in thread
* Re: [PATCH v2 26/28] memcg: per-memcg kmem shrinking
  2013-04-01  8:48     ` Glauber Costa
@ 2013-04-01  9:01       ` Kamezawa Hiroyuki
  2013-04-01  9:14         ` Glauber Costa
  2013-04-01  9:35         ` Kamezawa Hiroyuki
  0 siblings, 2 replies; 97+ messages in thread
From: Kamezawa Hiroyuki @ 2013-04-01  9:01 UTC (permalink / raw)
  To: Glauber Costa
  Cc: linux-mm, linux-fsdevel, containers, Michal Hocko,
	Johannes Weiner, Andrew Morton, Dave Shrinnker, Greg Thelen,
	hughd, yinghan, Dave Chinner, Mel Gorman, Rik van Riel
(2013/04/01 17:48), Glauber Costa wrote:
>>> +static int memcg_try_charge_kmem(struct mem_cgroup *memcg, gfp_t gfp, u64 size)
>>> +{
>>> +	int retries = MEM_CGROUP_RECLAIM_RETRIES;
>>
>> I'm not sure this retry numbers, for anon/file LRUs is suitable for kmem.
>>
> Suggestions ?
> 
I think you did tests.
>>> +	struct res_counter *fail_res;
>>> +	int ret;
>>> +
>>> +	do {
>>> +		ret = res_counter_charge(&memcg->kmem, size, &fail_res);
>>> +		if (!ret)
>>> +			return ret;
>>> +
>>> +		if (!(gfp & __GFP_WAIT))
>>> +			return ret;
>>> +
>>> +		/*
>>> +		 * We will try to shrink kernel memory present in caches. We
>>> +		 * are sure that we can wait, so we will. The duration of our
>>> +		 * wait is determined by congestion, the same way as vmscan.c
>>> +		 *
>>> +		 * If we are in FS context, though, then although we can wait,
>>> +		 * we cannot call the shrinkers. Most fs shrinkers (which
>>> +		 * comprises most of our kmem data) will not run without
>>> +		 * __GFP_FS since they can deadlock. The solution is to
>>> +		 * synchronously run that in a different context.
>>> +		 */
>>> +		if (!(gfp & __GFP_FS)) {
>>> +			/*
>>> +			 * we are already short on memory, every queue
>>> +			 * allocation is likely to fail
>>> +			 */
>>> +			memcg_stop_kmem_account();
>>> +			schedule_work(&memcg->kmemcg_shrink_work);
>>> +			flush_work(&memcg->kmemcg_shrink_work);
>>> +			memcg_resume_kmem_account();
>>> +		} else if (!try_to_free_mem_cgroup_kmem(memcg, gfp))
>>> +			congestion_wait(BLK_RW_ASYNC, HZ/10);
>>
>> Why congestion_wait() ? I think calling congestion_wait() in vmscan.c is
>> a part of memory-reclaim logic but I don't think the caller should do
>> this kind of voluteer wait without good reason..
>>
>>
> 
> Although it is not the case with dentries (or inodes, since only
> non-dirty inodes goes to the lru list), some objects we are freeing may
> need time to be written back to disk. This is the case for instance with
> the buffer heads and bio's. They will not be actively shrunk in
> shrinkers, but it is my understanding that they will be released. Inodes
> as well, may have time to be written back and become non-dirty.
> 
> In practice, in my tests, this would almost-always fail after a retry if
> we don't wait, and almost always succeed in a retry if we do wait.
> 
> Am I missing something in this interpretation ?
> 
Ah, sorry. Can't we put this wait into try_to_free_mem_cgroup_kmem()?
Thanks,
-Kame
^ permalink raw reply	[flat|nested] 97+ messages in thread
* Re: [PATCH v2 26/28] memcg: per-memcg kmem shrinking
  2013-04-01  9:01       ` Kamezawa Hiroyuki
@ 2013-04-01  9:14         ` Glauber Costa
  2013-04-01  9:35         ` Kamezawa Hiroyuki
  1 sibling, 0 replies; 97+ messages in thread
From: Glauber Costa @ 2013-04-01  9:14 UTC (permalink / raw)
  To: Kamezawa Hiroyuki
  Cc: linux-mm, linux-fsdevel, containers, Michal Hocko,
	Johannes Weiner, Andrew Morton, Dave Shrinnker, Greg Thelen,
	hughd, yinghan, Dave Chinner, Mel Gorman, Rik van Riel
On 04/01/2013 01:01 PM, Kamezawa Hiroyuki wrote:
> (2013/04/01 17:48), Glauber Costa wrote:
>>>> +static int memcg_try_charge_kmem(struct mem_cgroup *memcg, gfp_t gfp, u64 size)
>>>> +{
>>>> +	int retries = MEM_CGROUP_RECLAIM_RETRIES;
>>>
>>> I'm not sure this retry numbers, for anon/file LRUs is suitable for kmem.
>>>
>> Suggestions ?
>>
> 
> I think you did tests.
> 
Indeed. And in my tests, 2 or 3 retries are already enough to seal the
fate of this.
I thought it was safer to go with the same number, though, precisely so as
not to be too biased by my specific test environments.
I am fine with >= 3.
Michal, do you have input here?
>>>> +	struct res_counter *fail_res;
>>>> +	int ret;
>>>> +
>>>> +	do {
>>>> +		ret = res_counter_charge(&memcg->kmem, size, &fail_res);
>>>> +		if (!ret)
>>>> +			return ret;
>>>> +
>>>> +		if (!(gfp & __GFP_WAIT))
>>>> +			return ret;
>>>> +
>>>> +		/*
>>>> +		 * We will try to shrink kernel memory present in caches. We
>>>> +		 * are sure that we can wait, so we will. The duration of our
>>>> +		 * wait is determined by congestion, the same way as vmscan.c
>>>> +		 *
>>>> +		 * If we are in FS context, though, then although we can wait,
>>>> +		 * we cannot call the shrinkers. Most fs shrinkers (which
>>>> +		 * comprises most of our kmem data) will not run without
>>>> +		 * __GFP_FS since they can deadlock. The solution is to
>>>> +		 * synchronously run that in a different context.
>>>> +		 */
>>>> +		if (!(gfp & __GFP_FS)) {
>>>> +			/*
>>>> +			 * we are already short on memory, every queue
>>>> +			 * allocation is likely to fail
>>>> +			 */
>>>> +			memcg_stop_kmem_account();
>>>> +			schedule_work(&memcg->kmemcg_shrink_work);
>>>> +			flush_work(&memcg->kmemcg_shrink_work);
>>>> +			memcg_resume_kmem_account();
>>>> +		} else if (!try_to_free_mem_cgroup_kmem(memcg, gfp))
>>>> +			congestion_wait(BLK_RW_ASYNC, HZ/10);
>>>
>>> Why congestion_wait() ? I think calling congestion_wait() in vmscan.c is
>>> a part of memory-reclaim logic but I don't think the caller should do
>>> this kind of voluteer wait without good reason..
>>>
>>>
>>
>> Although it is not the case with dentries (or inodes, since only
>> non-dirty inodes goes to the lru list), some objects we are freeing may
>> need time to be written back to disk. This is the case for instance with
>> the buffer heads and bio's. They will not be actively shrunk in
>> shrinkers, but it is my understanding that they will be released. Inodes
>> as well, may have time to be written back and become non-dirty.
>>
>> In practice, in my tests, this would almost-always fail after a retry if
>> we don't wait, and almost always succeed in a retry if we do wait.
>>
>> Am I missing something in this interpretation ?
>>
> 
> Ah, sorry. Can't we put this wait into try_to_free_mem_cgroup_kmem().
> 
That I believe we can easily do.
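
A sketch of what that move could look like, again as a userspace model with hypothetical demo_* names: the reclaim helper itself backs off when it freed nothing, so callers no longer need their own wait. demo_wait() only counts calls; in the kernel this is where congestion_wait() would sit.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct demo_group {
	unsigned long freeable;	/* hypothetical: objects shrinkers can free now */
	int waits;		/* how many times the helper backed off */
};

/* Stand-in for congestion_wait(); a real kernel would sleep here. */
static void demo_wait(struct demo_group *g)
{
	g->waits++;
}

/* Frees up to nr objects; if nothing could be freed, waits before returning. */
unsigned long demo_try_to_free_kmem(struct demo_group *g, unsigned long nr,
				    bool can_wait)
{
	unsigned long freed;

	assert(g != NULL);
	if (!can_wait)
		return 0;	/* mirrors the early return without __GFP_WAIT */

	freed = g->freeable < nr ? g->freeable : nr;
	g->freeable -= freed;
	if (!freed)
		demo_wait(g);	/* the wait now lives inside the reclaimer */
	return freed;
}
```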
^ permalink raw reply	[flat|nested] 97+ messages in thread
* Re: [PATCH v2 26/28] memcg: per-memcg kmem shrinking
  2013-04-01  9:01       ` Kamezawa Hiroyuki
  2013-04-01  9:14         ` Glauber Costa
@ 2013-04-01  9:35         ` Kamezawa Hiroyuki
  1 sibling, 0 replies; 97+ messages in thread
From: Kamezawa Hiroyuki @ 2013-04-01  9:35 UTC (permalink / raw)
  To: Glauber Costa
  Cc: linux-mm, linux-fsdevel, containers, Michal Hocko,
	Johannes Weiner, Andrew Morton, Dave Shrinnker, Greg Thelen,
	hughd, yinghan, Dave Chinner, Mel Gorman, Rik van Riel
(2013/04/01 18:01), Kamezawa Hiroyuki wrote:
> (2013/04/01 17:48), Glauber Costa wrote:
>>>> +static int memcg_try_charge_kmem(struct mem_cgroup *memcg, gfp_t gfp, u64 size)
>>>> +{
>>>> +	int retries = MEM_CGROUP_RECLAIM_RETRIES;
>>>
>>> I'm not sure this retry count, meant for anon/file LRUs, is suitable for kmem.
>>>
>> Suggestions ?
>>
> 
> I think you did tests.
Sorry, I sent that too early.
I think you ran tests and know from them which number works well.
If it turns out to be the same number as MEM_CGROUP_RECLAIM_RETRIES, I have no objections.
I just think that picking a value without a reason is bad.
Thanks,
-Kame
^ permalink raw reply	[flat|nested] 97+ messages in thread
* Re: [PATCH v2 00/28] memcg-aware slab shrinking
  2013-03-29  9:13 [PATCH v2 00/28] memcg-aware slab shrinking Glauber Costa
                   ` (27 preceding siblings ...)
  2013-03-29  9:14 ` [PATCH v2 28/28] super: targeted memcg reclaim Glauber Costa
@ 2013-04-01 12:38 ` Serge Hallyn
  2013-04-01 12:45   ` Glauber Costa
  2013-04-02  4:58   ` Dave Chinner
  28 siblings, 2 replies; 97+ messages in thread
From: Serge Hallyn @ 2013-04-01 12:38 UTC (permalink / raw)
  To: Glauber Costa
  Cc: linux-mm, hughd, containers, Dave Shrinnker, Michal Hocko,
	Johannes Weiner, linux-fsdevel, Andrew Morton
Quoting Glauber Costa (glommer@parallels.com):
> Hi,
> 
> Notes:
> ======
> 
> This is v2 of memcg-aware LRU shrinking. I've been testing it extensively
> and it behaves well, at least from the isolation point of view. However,
> I feel some more testing is needed before we commit to it. Still, this is
> doing the job fairly well. Comments welcome.
Do you have any performance tests (preferably with enough runs with and
without this patchset to show a 95% confidence interval) to show the
impact this has?  Certainly the feature sounds worthwhile, but I'm
curious about the cost of maintaining this extra state.
-serge
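For reference, the interval asked for above can be computed from a handful of benchmark runs roughly as follows. This is a minimal sketch, assuming the runs are independent and roughly normally distributed; the function names are illustrative, and the square root is a small Newton iteration only to keep the example free of a libm dependency.

```c
#include <assert.h>
#include <stddef.h>

/* Two-sided 95% Student-t critical values for 1..20 degrees of freedom. */
static const double t95[] = {
	12.706, 4.303, 3.182, 2.776, 2.571, 2.447, 2.365, 2.306, 2.262, 2.228,
	2.201, 2.179, 2.160, 2.145, 2.131, 2.120, 2.110, 2.101, 2.093, 2.086,
};

/* Newton's method square root, used to avoid linking against libm. */
static double toy_sqrt(double v)
{
	double r = v > 1.0 ? v : 1.0;
	int i;

	if (v <= 0.0)
		return 0.0;
	for (i = 0; i < 60; i++)
		r = 0.5 * (r + v / r);
	return r;
}

static double mean(const double *x, size_t n)
{
	double sum = 0.0;
	size_t i;

	for (i = 0; i < n; i++)
		sum += x[i];
	return sum / (double)n;
}

/* Half-width of the 95% confidence interval of the mean; needs n >= 2.
 * Beyond 20 degrees of freedom the df=20 value is reused, which makes
 * the interval slightly wider than the exact one. */
static double ci95_halfwidth(const double *x, size_t n)
{
	double m = mean(x, n), ss = 0.0;
	size_t i, df = n - 1;

	for (i = 0; i < n; i++)
		ss += (x[i] - m) * (x[i] - m);
	if (df > 20)
		df = 20;
	/* t * sqrt(sample variance / n) */
	return t95[df - 1] * toy_sqrt(ss / (double)(n - 1) / (double)n);
}
```

Run once on timings from the unpatched kernel and once on timings from the patched one. As a rule of thumb, if the two intervals overlap heavily, the runs do not show a measurable cost.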
^ permalink raw reply	[flat|nested] 97+ messages in thread
* Re: [PATCH v2 00/28] memcg-aware slab shrinking
  2013-04-01 12:38 ` [PATCH v2 00/28] memcg-aware slab shrinking Serge Hallyn
@ 2013-04-01 12:45   ` Glauber Costa
  2013-04-01 14:12     ` Serge Hallyn
  2013-04-02  4:58   ` Dave Chinner
  1 sibling, 1 reply; 97+ messages in thread
From: Glauber Costa @ 2013-04-01 12:45 UTC (permalink / raw)
  To: Serge Hallyn
  Cc: linux-mm, hughd, containers, Dave Shrinnker, Michal Hocko,
	Johannes Weiner, linux-fsdevel, Andrew Morton
On 04/01/2013 04:38 PM, Serge Hallyn wrote:
> Quoting Glauber Costa (glommer@parallels.com):
>> Hi,
>>
>> Notes:
>> ======
>>
>> This is v2 of memcg-aware LRU shrinking. I've been testing it extensively
>> and it behaves well, at least from the isolation point of view. However,
>> I feel some more testing is needed before we commit to it. Still, this is
>> doing the job fairly well. Comments welcome.
> 
> Do you have any performance tests (preferably with enough runs with and
> without this patchset to show 95% confidence interval) to show the
> impact this has?  Certainly the feature sounds worthwhile, but I'm
> curious about the cost of maintaining this extra state.
> 
> -serge
> 
Not yet. I intend to include them in my next run. I haven't yet decided
on a set of tests to run (maybe just a memcg-contained kernel compile?).
So if you have suggestions for what I could run to show this, feel free
to lay them out here.
^ permalink raw reply	[flat|nested] 97+ messages in thread
* Re: [PATCH v2 00/28] memcg-aware slab shrinking
  2013-04-01 12:45   ` Glauber Costa
@ 2013-04-01 14:12     ` Serge Hallyn
  2013-04-08  8:11       ` Glauber Costa
  0 siblings, 1 reply; 97+ messages in thread
From: Serge Hallyn @ 2013-04-01 14:12 UTC (permalink / raw)
  To: Glauber Costa
  Cc: linux-mm, hughd, containers, Dave Shrinnker, Michal Hocko,
	Johannes Weiner, linux-fsdevel, Andrew Morton
Quoting Glauber Costa (glommer@parallels.com):
> On 04/01/2013 04:38 PM, Serge Hallyn wrote:
> > Quoting Glauber Costa (glommer@parallels.com):
> >> Hi,
> >>
> >> Notes:
> >> ======
> >>
> >> This is v2 of memcg-aware LRU shrinking. I've been testing it extensively
> >> and it behaves well, at least from the isolation point of view. However,
> >> I feel some more testing is needed before we commit to it. Still, this is
> >> doing the job fairly well. Comments welcome.
> > 
> > Do you have any performance tests (preferably with enough runs with and
> > without this patchset to show 95% confidence interval) to show the
> > impact this has?  Certainly the feature sounds worthwhile, but I'm
> > curious about the cost of maintaining this extra state.
> > 
> > -serge
> > 
> Not yet. I intend to include them in my next run. I haven't yet decided
> on a set of tests to run (maybe just a memcg-contained kernel compile?)
> 
> So if you have suggestions of what I could run to show this, feel free
> to lay them down here.
Perhaps mount a 4G tmpfs, copy the kernel tree there, and build the kernel on
that tmpfs?
-serge
^ permalink raw reply	[flat|nested] 97+ messages in thread
* Re: [PATCH v2 00/28] memcg-aware slab shrinking
  2013-04-01 12:38 ` [PATCH v2 00/28] memcg-aware slab shrinking Serge Hallyn
  2013-04-01 12:45   ` Glauber Costa
@ 2013-04-02  4:58   ` Dave Chinner
  2013-04-02  7:55     ` Glauber Costa
  1 sibling, 1 reply; 97+ messages in thread
From: Dave Chinner @ 2013-04-02  4:58 UTC (permalink / raw)
  To: Serge Hallyn
  Cc: Glauber Costa, linux-mm, hughd, containers, Michal Hocko,
	Johannes Weiner, linux-fsdevel, Andrew Morton
On Mon, Apr 01, 2013 at 07:38:43AM -0500, Serge Hallyn wrote:
> Quoting Glauber Costa (glommer@parallels.com):
> > Hi,
> > 
> > Notes:
> > ======
> > 
> > This is v2 of memcg-aware LRU shrinking. I've been testing it extensively
> > and it behaves well, at least from the isolation point of view. However,
> > I feel some more testing is needed before we commit to it. Still, this is
> > doing the job fairly well. Comments welcome.
> 
> Do you have any performance tests (preferably with enough runs with and
> without this patchset to show 95% confidence interval) to show the
> impact this has?  Certainly the feature sounds worthwhile, but I'm
> curious about the cost of maintaining this extra state.
The reason for the node-aware LRUs in the first place is
performance, i.e. to remove the global LRU locks from the shrinkers
and LRU list operations. For XFS (at least) the VFS LRU operations
are significant sources of contention at 16p, and at high CPU counts
they can basically cause spinlock meltdown.
I've done performance testing on them on 16p machines with
fake-numa=4 under such contention-generating workloads (e.g. 16-way
concurrent fsmark workloads) and seen that the LRU locks disappear
from the profiles. Performance improvement at this size of machine
under these workloads is still within the run-to-run variance of the
benchmarks I've run, but the fact that the lock is no longer in the
profiles at all suggests that scalability for larger machines will be
significantly improved.
As for the memcg side of things, I'll leave that to Glauber....
Cheers,
Dave.
-- 
Dave Chinner
david@fromorbit.com
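The locking change described above can be pictured with a small user-space model: instead of one list guarded by one lock, each node gets its own list, lock and counter, so operations on different nodes never touch the same lock. This is a sketch only. The toy_ names are hypothetical, a C11 atomic_flag stands in for the kernel spinlock, and the lock_acquisitions field exists purely to make the effect visible.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

#define TOY_MAX_NODES 4

struct toy_item {
	struct toy_item *next;
	int nid;		/* NUMA node the object's memory lives on */
};

/* One list, one lock and one counter per node. */
struct toy_lru_node {
	atomic_flag lock;
	struct toy_item *head;
	unsigned long nr_items;
	unsigned long lock_acquisitions;	/* for illustration only */
};

struct toy_lru {
	struct toy_lru_node node[TOY_MAX_NODES];
};

static void toy_lru_init(struct toy_lru *lru)
{
	int i;

	for (i = 0; i < TOY_MAX_NODES; i++) {
		atomic_flag_clear(&lru->node[i].lock);
		lru->node[i].head = NULL;
		lru->node[i].nr_items = 0;
		lru->node[i].lock_acquisitions = 0;
	}
}

/* Adding an item only takes the lock of the node it belongs to. */
static void toy_lru_add(struct toy_lru *lru, struct toy_item *item)
{
	struct toy_lru_node *nlru = &lru->node[item->nid];

	while (atomic_flag_test_and_set(&nlru->lock))
		;	/* spin until the per-node lock is free */
	item->next = nlru->head;
	nlru->head = item;
	nlru->nr_items++;
	nlru->lock_acquisitions++;
	atomic_flag_clear(&nlru->lock);
}

/* A shrinker pass restricted to one node touches only that node's lock. */
static unsigned long toy_lru_shrink_node(struct toy_lru *lru, int nid)
{
	struct toy_lru_node *nlru = &lru->node[nid];
	unsigned long freed;

	while (atomic_flag_test_and_set(&nlru->lock))
		;	/* spin until the per-node lock is free */
	freed = nlru->nr_items;
	nlru->head = NULL;
	nlru->nr_items = 0;
	nlru->lock_acquisitions++;
	atomic_flag_clear(&nlru->lock);
	return freed;
}
```

Adding two items on node 0 and one on node 3, then shrinking node 0, leaves node 3's list and lock untouched by the shrinker, and nodes 1 and 2 never see a lock acquisition at all.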
^ permalink raw reply	[flat|nested] 97+ messages in thread
* Re: [PATCH v2 00/28] memcg-aware slab shrinking
  2013-04-02  4:58   ` Dave Chinner
@ 2013-04-02  7:55     ` Glauber Costa
  0 siblings, 0 replies; 97+ messages in thread
From: Glauber Costa @ 2013-04-02  7:55 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Serge Hallyn, linux-mm, hughd, containers, Michal Hocko,
	Johannes Weiner, linux-fsdevel, Andrew Morton
On 04/02/2013 08:58 AM, Dave Chinner wrote:
> On Mon, Apr 01, 2013 at 07:38:43AM -0500, Serge Hallyn wrote:
>> Quoting Glauber Costa (glommer@parallels.com):
>>> Hi,
>>>
>>> Notes:
>>> ======
>>>
>>> This is v2 of memcg-aware LRU shrinking. I've been testing it extensively
>>> and it behaves well, at least from the isolation point of view. However,
>>> I feel some more testing is needed before we commit to it. Still, this is
>>> doing the job fairly well. Comments welcome.
>>
>> Do you have any performance tests (preferably with enough runs with and
>> without this patchset to show 95% confidence interval) to show the
>> impact this has?  Certainly the feature sounds worthwhile, but I'm
>> curious about the cost of maintaining this extra state.
> 
> The reason for the node-aware LRUs in the first place is
> performance. i.e. to remove the global LRU locks from the shrinkers
> and LRU list operations. For XFS (at least) the VFS LRU operations
> are significant sources of contention at 16p, and at high CPU counts
> they can basically cause spinlock meltdown.
> 
> I've done performance testing on them on 16p machines with
> fake-numa=4 under such contention generating workloads (e.g. 16-way
> concurrent fsmark workloads) and seen that the LRU locks disappear
> from the profiles. Performance improvement at this size of machine
> under these workloads is still within the run-to-run variance of the
> benchmarks I've run, but the fact the lock is no longer in the
> profiles at all suggest that scalability for larger machines will be
> significantly improved.
> 
> As for the memcg side of things, I'll leave that to Glauber....
> 
I will chime in here about the per-node thing on the memcg side, because
I believe this is the most important design point, and one I'd like to
reach quick agreement on.
From the memcg PoV, all pressure is global, and the per-node LRU does not buy us
anything. This is because of the nature of memcg: all we care about is
being above or below a certain soft or hard limit. Where the memory is
located is the concern of other subsystems (like cpuset) and has no
place in memcg's interests.
As for the underlying LRU, I fully trust Dave in what he says: we may
not be able to detect it in our mortal setups, but less lock contention
is very likely to be a clear win in big, contended, multi-node scenarios.
By design, you may as well want to shrink per-node, so not disposing of
objects on other nodes is also a qualitative win.
In memcg, there are two options: we either stick a new list_lru in the
objects (dentries and inodes), or we keep it per-node as well. The first
one would allow us to keep global LRU order for global pressure walks,
and memcg LRU order for memcg walks.
Two lists seem simpler at first, but they also have an interesting
effect: on global reclaim, we will break fairness among memcgs. Old
memcgs are likely to be penalized regardless of any other
considerations. Since memcg is heavily concerned with isolation between
workloads, it would be preferable to spread global reclaim among them.
The other reason I am keeping per-node memcg lists is memory footprint. I
have 24 bytes per LRU, which are constant and likely to exist in any
scenario (the list head for child LRUs (16 bytes) and a memcg pointer (8 bytes)).
Then we have a 32-byte structure that represents the LRU itself (if I
got the math right for the usual, non-debug spinlock size).
That will be per-node, and each LRU will have one copy of that
per memcg. So the extra size is 32 * nodes * lrus * memcgs bytes.
Keeping extra state in the object means an extra list_head per object.
This is 16 * objects bytes.
The memcg-per-lru approach starts to win when we reach a
number of objects o = 2 * nodes * lrus * memcgs.
Using some numbers, nodes = 4, lrus = 100, memcgs = 100 (and this can
already be considered damn big), we have 2 * 4 * 100 * 100 = 80000 objects as the
threshold point. My Fedora laptop, doing nothing other than April Fools' jokes and
answering your e-mails, has 26k dentries stored in the slab.
This means that, aside from the more isolated behavior, our memory footprint
is way, way smaller by keeping the memcgs per-lru.
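The footprint comparison above reduces to two products and one division. The sketch below uses the 32-byte and 16-byte sizes quoted in the message and, like the formula, leaves out the constant 24 bytes per LRU. The function names are illustrative only. With nodes = 4, lrus = 100 and memcgs = 100, the formula o = 2 * nodes * lrus * memcgs evaluates to 80000 objects.

```c
#include <assert.h>

/* Sizes quoted in the message (64-bit, non-debug spinlock). */
#define LRU_STRUCT_BYTES	32UL	/* per node, per LRU, per memcg */
#define LIST_HEAD_BYTES		16UL	/* extra list_head per object */

/* Extra memory when each LRU keeps a per-node copy for every memcg. */
static unsigned long per_lru_overhead(unsigned long nodes, unsigned long lrus,
				      unsigned long memcgs)
{
	return LRU_STRUCT_BYTES * nodes * lrus * memcgs;
}

/* Extra memory when every object carries a second list_head. */
static unsigned long per_object_overhead(unsigned long objects)
{
	return LIST_HEAD_BYTES * objects;
}

/* Object count at which both schemes cost the same. */
static unsigned long break_even_objects(unsigned long nodes, unsigned long lrus,
					unsigned long memcgs)
{
	return per_lru_overhead(nodes, lrus, memcgs) / LIST_HEAD_BYTES;
}
```

The threshold scales linearly with each factor, so a smaller setup such as one node, 100 LRUs and 10 memcgs breaks even at 2000 objects.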
^ permalink raw reply	[flat|nested] 97+ messages in thread
* Re: [PATCH v2 05/28] dcache: remove dentries from LRU before putting on dispose list
  2013-03-29  9:13 ` [PATCH v2 05/28] dcache: remove dentries from LRU before putting on dispose list Glauber Costa
@ 2013-04-03  6:51   ` Sha Zhengju
  2013-04-03  8:55     ` Glauber Costa
  2013-04-04  6:19     ` Dave Chinner
  0 siblings, 2 replies; 97+ messages in thread
From: Sha Zhengju @ 2013-04-03  6:51 UTC (permalink / raw)
  To: Glauber Costa
  Cc: linux-mm@kvack.org, Hugh Dickins, containers, Dave Chinner,
	Dave Shrinnker, Michal Hocko, Johannes Weiner, linux-fsdevel,
	Andrew Morton
On Fri, Mar 29, 2013 at 5:13 PM, Glauber Costa <glommer@parallels.com> wrote:
> From: Dave Chinner <dchinner@redhat.com>
>
> One of the big problems with modifying the way the dcache shrinker
> and LRU implementation works is that the LRU is abused in several
> ways. One of these is shrink_dentry_list().
>
> Basically, we can move a dentry off the LRU onto a different list
> without doing any accounting changes, and then use dentry_lru_prune()
> to remove it from whatever list it is now on to do the LRU
> accounting at that point.
>
> This makes it -really hard- to change the LRU implementation. The
> use of the per-sb LRU lock serialises movement of the dentries
> between the different lists and the removal of them, and this is the
> only reason that it works. If we want to break up the dentry LRU
> lock and lists into, say, per-node lists, we remove the only
> serialisation that allows this lru list/dispose list abuse to work.
>
> To make this work effectively, the dispose list has to be isolated
> from the LRU list - dentries have to be removed from the LRU
> *before* being placed on the dispose list. This means that the LRU
> accounting and isolation is completed before disposal is started,
> and that means we can change the LRU implementation freely in
> future.
>
> This means that dentries *must* be marked with DCACHE_SHRINK_LIST
> when they are placed on the dispose list so that we don't think that
> parent dentries found in try_prune_one_dentry() are on the LRU when
> they are actually on the dispose list. This would result in
> accounting the dentry to the LRU a second time. Hence
> dentry_lru_prune() has to handle the DCACHE_SHRINK_LIST case
> differently because the dentry isn't on the LRU list.
>
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/dcache.c | 73 ++++++++++++++++++++++++++++++++++++++++++++++++++++---------
>  1 file changed, 63 insertions(+), 10 deletions(-)
>
> diff --git a/fs/dcache.c b/fs/dcache.c
> index 0a1d7b3..d15420b 100644
> --- a/fs/dcache.c
> +++ b/fs/dcache.c
> @@ -330,7 +330,6 @@ static void dentry_lru_add(struct dentry *dentry)
>  static void __dentry_lru_del(struct dentry *dentry)
>  {
>         list_del_init(&dentry->d_lru);
> -       dentry->d_flags &= ~DCACHE_SHRINK_LIST;
>         dentry->d_sb->s_nr_dentry_unused--;
>         this_cpu_dec(nr_dentry_unused);
>  }
> @@ -340,6 +339,8 @@ static void __dentry_lru_del(struct dentry *dentry)
>   */
>  static void dentry_lru_del(struct dentry *dentry)
>  {
> +       BUG_ON(dentry->d_flags & DCACHE_SHRINK_LIST);
> +
>         if (!list_empty(&dentry->d_lru)) {
>                 spin_lock(&dentry->d_sb->s_dentry_lru_lock);
>                 __dentry_lru_del(dentry);
> @@ -351,28 +352,42 @@ static void dentry_lru_del(struct dentry *dentry)
>   * Remove a dentry that is unreferenced and about to be pruned
>   * (unhashed and destroyed) from the LRU, and inform the file system.
>   * This wrapper should be called _prior_ to unhashing a victim dentry.
> + *
> + * Check that the dentry really is on the LRU as it may be on a private dispose
> + * list and in that case we do not want to call the generic LRU removal
> + * functions. This typically happens when shrink_dcache_sb() clears the LRU in
> + * one go and then try_prune_one_dentry() walks back up the parent chain finding
> + * dentries that are also on the dispose list.
>   */
>  static void dentry_lru_prune(struct dentry *dentry)
>  {
>         if (!list_empty(&dentry->d_lru)) {
> +
>                 if (dentry->d_flags & DCACHE_OP_PRUNE)
>                         dentry->d_op->d_prune(dentry);
>
> -               spin_lock(&dentry->d_sb->s_dentry_lru_lock);
> -               __dentry_lru_del(dentry);
> -               spin_unlock(&dentry->d_sb->s_dentry_lru_lock);
> +               if ((dentry->d_flags & DCACHE_SHRINK_LIST))
> +                       list_del_init(&dentry->d_lru);
> +               else {
> +                       spin_lock(&dentry->d_sb->s_dentry_lru_lock);
> +                       __dentry_lru_del(dentry);
> +                       spin_unlock(&dentry->d_sb->s_dentry_lru_lock);
> +               }
> +               dentry->d_flags &= ~DCACHE_SHRINK_LIST;
>         }
>  }
>
>  static void dentry_lru_move_list(struct dentry *dentry, struct list_head *list)
>  {
> +       BUG_ON(dentry->d_flags & DCACHE_SHRINK_LIST);
> +
>         spin_lock(&dentry->d_sb->s_dentry_lru_lock);
>         if (list_empty(&dentry->d_lru)) {
>                 list_add_tail(&dentry->d_lru, list);
> -               dentry->d_sb->s_nr_dentry_unused++;
> -               this_cpu_inc(nr_dentry_unused);
>         } else {
>                 list_move_tail(&dentry->d_lru, list);
> +               dentry->d_sb->s_nr_dentry_unused--;
> +               this_cpu_dec(nr_dentry_unused);
>         }
>         spin_unlock(&dentry->d_sb->s_dentry_lru_lock);
>  }
> @@ -814,12 +829,18 @@ static void shrink_dentry_list(struct list_head *list)
>                 }
>
>                 /*
> +                * The dispose list is isolated and dentries are not accounted
> +                * to the LRU here, so we can simply remove it from the list
> +                * here regardless of whether it is referenced or not.
> +                */
> +               list_del_init(&dentry->d_lru);
> +
> +               /*
>                  * We found an inuse dentry which was not removed from
> -                * the LRU because of laziness during lookup.  Do not free
> -                * it - just keep it off the LRU list.
> +                * the LRU because of laziness during lookup. Do not free it.
>                  */
>                 if (dentry->d_count) {
> -                       dentry_lru_del(dentry);
> +                       dentry->d_flags &= ~DCACHE_SHRINK_LIST;
>                         spin_unlock(&dentry->d_lock);
>                         continue;
>                 }
> @@ -871,6 +892,8 @@ relock:
>                 } else {
>                         list_move_tail(&dentry->d_lru, &tmp);
>                         dentry->d_flags |= DCACHE_SHRINK_LIST;
> +                       this_cpu_dec(nr_dentry_unused);
> +                       sb->s_nr_dentry_unused--;
>                         spin_unlock(&dentry->d_lock);
>                         if (!--count)
>                                 break;
> @@ -884,6 +907,28 @@ relock:
>         shrink_dentry_list(&tmp);
>  }
>
> +/*
> + * Mark all the dentries as being on the dispose list so we don't think they are
> + * still on the LRU if we try to kill them from ascending the parent chain in
> + * try_prune_one_dentry() rather than directly from the dispose list.
> + */
> +static void
> +shrink_dcache_list(
> +       struct list_head *dispose)
> +{
> +       struct dentry *dentry;
> +
> +       rcu_read_lock();
> +       list_for_each_entry_rcu(dentry, dispose, d_lru) {
> +               spin_lock(&dentry->d_lock);
> +               dentry->d_flags |= DCACHE_SHRINK_LIST;
> +               this_cpu_dec(nr_dentry_unused);
>
Why is nr_dentry_unused decremented again here? Hasn't it already been
decremented in shrink_dcache_sb() below?
> +               spin_unlock(&dentry->d_lock);
> +       }
> +       rcu_read_unlock();
> +       shrink_dentry_list(dispose);
> +}
> +
>  /**
>   * shrink_dcache_sb - shrink dcache for a superblock
>   * @sb: superblock
> @@ -898,8 +943,16 @@ void shrink_dcache_sb(struct super_block *sb)
>         spin_lock(&sb->s_dentry_lru_lock);
>         while (!list_empty(&sb->s_dentry_lru)) {
>                 list_splice_init(&sb->s_dentry_lru, &tmp);
> +
> +               /*
> +                * account for removal here so we don't need to handle it later
> +                * even though the dentry is no longer on the lru list.
> +                */
> +               this_cpu_sub(nr_dentry_unused, sb->s_nr_dentry_unused);
> +               sb->s_nr_dentry_unused = 0;
> +
>                 spin_unlock(&sb->s_dentry_lru_lock);
> -               shrink_dentry_list(&tmp);
> +               shrink_dcache_list(&tmp);
>                 spin_lock(&sb->s_dentry_lru_lock);
>         }
>         spin_unlock(&sb->s_dentry_lru_lock);
>
>
-- 
Thanks,
Sha
^ permalink raw reply	[flat|nested] 97+ messages in thread
* Re: [PATCH v2 05/28] dcache: remove dentries from LRU before putting on dispose list
  2013-04-03  6:51   ` Sha Zhengju
@ 2013-04-03  8:55     ` Glauber Costa
  2013-04-04  6:19     ` Dave Chinner
  1 sibling, 0 replies; 97+ messages in thread
From: Glauber Costa @ 2013-04-03  8:55 UTC (permalink / raw)
  To: Sha Zhengju
  Cc: linux-mm@kvack.org, Hugh Dickins, containers, Dave Chinner,
	Dave Shrinnker, Michal Hocko, Johannes Weiner, linux-fsdevel,
	Andrew Morton
On 04/03/2013 10:51 AM, Sha Zhengju wrote:
>     +static void
>     +shrink_dcache_list(
>     +       struct list_head *dispose)
>     +{
>     +       struct dentry *dentry;
>     +
>     +       rcu_read_lock();
>     +       list_for_each_entry_rcu(dentry, dispose, d_lru) {
>     +               spin_lock(&dentry->d_lock);
>     +               dentry->d_flags |= DCACHE_SHRINK_LIST;
>     +               this_cpu_dec(nr_dentry_unused);
> 
> 
> Why is nr_dentry_unused decremented again here? Hasn't it already been
> decremented in shrink_dcache_sb() below?
Your analysis seems to be correct, and the decrement in shrink_dcache_sb
seems not to be needed.
Dave, do you have any comments on this?
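The double accounting under discussion can be reproduced with a toy model of the two counters. This is a user-space sketch, not the dcache code: toy_splice_and_unaccount() stands in for the splice in shrink_dcache_sb(), toy_mark_dispose_list() for the marking loop in shrink_dcache_list(), and all names are hypothetical.

```c
#include <assert.h>

/* Toy stand-ins for the per-sb and global unused-dentry counters. */
struct toy_sb {
	long s_nr_dentry_unused;
};

static long nr_dentry_unused;

/* Models the splice in shrink_dcache_sb(): the whole LRU moves to a
 * private list and is unaccounted in one step. Returns the number of
 * dentries now on the dispose list. */
static long toy_splice_and_unaccount(struct toy_sb *sb)
{
	long moved = sb->s_nr_dentry_unused;

	nr_dentry_unused -= moved;
	sb->s_nr_dentry_unused = 0;
	return moved;
}

/* Models the marking loop in shrink_dcache_list(). With dec_again set
 * it also decrements the global counter once per dentry, as the quoted
 * hunk does. */
static void toy_mark_dispose_list(long nr_on_list, int dec_again)
{
	long i;

	for (i = 0; i < nr_on_list; i++) {
		/* dentry->d_flags |= DCACHE_SHRINK_LIST would go here */
		if (dec_again)
			nr_dentry_unused--;
	}
}
```

Starting from 100 unused dentries, doing both decrements leaves the global counter at -100. Dropping either one of them brings it back to the expected 0.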
^ permalink raw reply	[flat|nested] 97+ messages in thread
* Re: [PATCH v2 21/28] vmscan: also shrink slab in memcg pressure
  2013-03-29  9:14 ` [PATCH v2 21/28] vmscan: also shrink slab in memcg pressure Glauber Costa
  2013-04-01  7:46   ` Kamezawa Hiroyuki
@ 2013-04-03 10:11   ` Sha Zhengju
  2013-04-03 10:43     ` Glauber Costa
  1 sibling, 1 reply; 97+ messages in thread
From: Sha Zhengju @ 2013-04-03 10:11 UTC (permalink / raw)
  To: Glauber Costa
  Cc: linux-mm@kvack.org, Rik van Riel, Hugh Dickins, containers,
	Dave Chinner, Dave Shrinnker, Michal Hocko, Mel Gorman,
	Johannes Weiner, linux-fsdevel, Andrew Morton
Hi Glauber,
On Fri, Mar 29, 2013 at 5:14 PM, Glauber Costa <glommer@parallels.com> wrote:
> Without the surrounding infrastructure, this patch is a bit of a hammer:
> it will basically shrink objects from all memcgs under memcg pressure.
> At least, however, we will keep the scan limited to the shrinkers marked
> as per-memcg.
>
> Future patches will implement the in-shrinker logic to filter objects
> based on its memcg association.
>
> Signed-off-by: Glauber Costa <glommer@parallels.com>
> Cc: Dave Chinner <dchinner@redhat.com>
> Cc: Mel Gorman <mgorman@suse.de>
> Cc: Rik van Riel <riel@redhat.com>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Michal Hocko <mhocko@suse.cz>
> Cc: Hugh Dickins <hughd@google.com>
> Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> ---
>  include/linux/memcontrol.h | 17 +++++++++++++++++
>  include/linux/shrinker.h   |  4 ++++
>  mm/memcontrol.c            | 16 +++++++++++++++-
>  mm/vmscan.c                | 46 +++++++++++++++++++++++++++++++++++++++++++---
>  4 files changed, 79 insertions(+), 4 deletions(-)
>
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index d6183f0..4c24249 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -199,6 +199,9 @@ void mem_cgroup_split_huge_fixup(struct page *head);
>  bool mem_cgroup_bad_page_check(struct page *page);
>  void mem_cgroup_print_bad_page(struct page *page);
>  #endif
> +
> +unsigned long
> +memcg_zone_reclaimable_pages(struct mem_cgroup *memcg, struct zone *zone);
>  #else /* CONFIG_MEMCG */
>  struct mem_cgroup;
>
> @@ -377,6 +380,12 @@ static inline void mem_cgroup_replace_page_cache(struct page *oldpage,
>                                 struct page *newpage)
>  {
>  }
> +
> +static inline unsigned long
> +memcg_zone_reclaimable_pages(struct mem_cgroup *memcg, struct zone *zone)
> +{
> +       return 0;
> +}
>  #endif /* CONFIG_MEMCG */
>
>  #if !defined(CONFIG_MEMCG) || !defined(CONFIG_DEBUG_VM)
> @@ -429,6 +438,8 @@ static inline bool memcg_kmem_enabled(void)
>         return static_key_false(&memcg_kmem_enabled_key);
>  }
>
> +bool memcg_kmem_is_active(struct mem_cgroup *memcg);
> +
>  /*
>   * In general, we'll do everything in our power to not incur in any overhead
>   * for non-memcg users for the kmem functions. Not even a function call, if we
> @@ -562,6 +573,12 @@ memcg_kmem_get_cache(struct kmem_cache *cachep, gfp_t gfp)
>         return __memcg_kmem_get_cache(cachep, gfp);
>  }
>  #else
> +
> +static inline bool memcg_kmem_is_active(struct mem_cgroup *memcg)
> +{
> +       return false;
> +}
> +
>  #define for_each_memcg_cache_index(_idx)       \
>         for (; NULL; )
>
> diff --git a/include/linux/shrinker.h b/include/linux/shrinker.h
> index d4636a0..4e9e53b 100644
> --- a/include/linux/shrinker.h
> +++ b/include/linux/shrinker.h
> @@ -20,6 +20,9 @@ struct shrink_control {
>
>         /* shrink from these nodes */
>         nodemask_t nodes_to_scan;
> +
> +       /* reclaim from this memcg only (if not NULL) */
> +       struct mem_cgroup *target_mem_cgroup;
>  };
>
>  /*
> @@ -45,6 +48,7 @@ struct shrinker {
>
>         int seeks;      /* seeks to recreate an obj */
>         long batch;     /* reclaim batch size, 0 = default */
> +       bool memcg_shrinker; /* memcg-aware shrinker */
>
>         /* These are for internal use */
>         struct list_head list;
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 2b55222..ecdae39 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -386,7 +386,7 @@ static inline void memcg_kmem_set_active(struct mem_cgroup *memcg)
>         set_bit(KMEM_ACCOUNTED_ACTIVE, &memcg->kmem_account_flags);
>  }
>
> -static bool memcg_kmem_is_active(struct mem_cgroup *memcg)
> +bool memcg_kmem_is_active(struct mem_cgroup *memcg)
>  {
>         return test_bit(KMEM_ACCOUNTED_ACTIVE, &memcg->kmem_account_flags);
>  }
> @@ -942,6 +942,20 @@ mem_cgroup_zone_nr_lru_pages(struct mem_cgroup *memcg, int nid, int zid,
>         return ret;
>  }
>
> +unsigned long
> +memcg_zone_reclaimable_pages(struct mem_cgroup *memcg, struct zone *zone)
> +{
> +       int nid = zone_to_nid(zone);
> +       int zid = zone_idx(zone);
> +       unsigned long val;
> +
> +       val = mem_cgroup_zone_nr_lru_pages(memcg, nid, zid, LRU_ALL_FILE);
> +       if (do_swap_account)
IMHO, might get_nr_swap_pages() be more appropriate here?
> +               val += mem_cgroup_zone_nr_lru_pages(memcg, nid, zid,
> +                                                   LRU_ALL_ANON);
> +       return val;
> +}
> +
>  static unsigned long
>  mem_cgroup_node_nr_lru_pages(struct mem_cgroup *memcg,
>                         int nid, unsigned int lru_mask)
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 232dfcb..43928fd 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -138,11 +138,42 @@ static bool global_reclaim(struct scan_control *sc)
>  {
>         return !sc->target_mem_cgroup;
>  }
> +
> +/*
> + * kmem reclaim should usually not be triggered when we are doing targeted
> + * reclaim. It is only valid when global reclaim is triggered, or when the
> + * underlying memcg has kmem objects.
> + */
> +static bool has_kmem_reclaim(struct scan_control *sc)
> +{
> +       return !sc->target_mem_cgroup ||
> +               memcg_kmem_is_active(sc->target_mem_cgroup);
> +}
> +
> +static unsigned long
> +zone_nr_reclaimable_pages(struct scan_control *sc, struct zone *zone)
> +{
> +       if (global_reclaim(sc))
> +               return zone_reclaimable_pages(zone);
> +       return memcg_zone_reclaimable_pages(sc->target_mem_cgroup, zone);
> +}
> +
>  #else
>  static bool global_reclaim(struct scan_control *sc)
>  {
>         return true;
>  }
> +
> +static bool has_kmem_reclaim(struct scan_control *sc)
> +{
> +       return true;
> +}
> +
> +static unsigned long
> +zone_nr_reclaimable_pages(struct scan_control *sc, struct zone *zone)
> +{
> +       return zone_reclaimable_pages(zone);
> +}
>  #endif
>
>  static unsigned long get_lru_size(struct lruvec *lruvec, enum lru_list lru)
> @@ -221,6 +252,14 @@ unsigned long shrink_slab(struct shrink_control *sc,
>                 long batch_size = shrinker->batch ? shrinker->batch
>                                                   : SHRINK_BATCH;
>
> +               /*
> +                * If we don't have a target mem cgroup, we scan them all.
> +                * Otherwise we will limit our scan to shrinkers marked as
> +                * memcg aware
> +                */
> +               if (sc->target_mem_cgroup && !shrinker->memcg_shrinker)
> +                       continue;
> +
>                 max_pass = shrinker->count_objects(shrinker, sc);
>                 WARN_ON(max_pass < 0);
>                 if (max_pass <= 0)
> @@ -2163,9 +2202,9 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
>
>                 /*
>                  * Don't shrink slabs when reclaiming memory from
> -                * over limit cgroups
> +                * over limit cgroups, unless we know they have kmem objects
>                  */
> -               if (global_reclaim(sc)) {
> +               if (has_kmem_reclaim(sc)) {
>                         unsigned long lru_pages = 0;
>
>                         nodes_clear(shrink->nodes_to_scan);
> @@ -2174,7 +2213,7 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
>                                 if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL))
>                                         continue;
>
> -                               lru_pages += zone_reclaimable_pages(zone);
> +                               lru_pages += zone_nr_reclaimable_pages(sc, zone);
>                                 node_set(zone_to_nid(zone),
>                                          shrink->nodes_to_scan);
>                         }
> @@ -2443,6 +2482,7 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
>         };
>         struct shrink_control shrink = {
>                 .gfp_mask = sc.gfp_mask,
> +               .target_mem_cgroup = memcg,
>         };
>
>         /*
> --
> 1.8.1.4
>
> _______________________________________________
> Containers mailing list
> Containers@lists.linux-foundation.org
> https://lists.linuxfoundation.org/mailman/listinfo/containers
--
Thanks,
Sha
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply	[flat|nested] 97+ messages in thread
* Re: [PATCH v2 21/28] vmscan: also shrink slab in memcg pressure
  2013-04-03 10:11   ` Sha Zhengju
@ 2013-04-03 10:43     ` Glauber Costa
  2013-04-04  9:35       ` Sha Zhengju
  0 siblings, 1 reply; 97+ messages in thread
From: Glauber Costa @ 2013-04-03 10:43 UTC (permalink / raw)
  To: Sha Zhengju
  Cc: linux-mm@kvack.org, Rik van Riel, Hugh Dickins, containers,
	Dave Chinner, Dave Shrinnker, Michal Hocko, Mel Gorman,
	Johannes Weiner, linux-fsdevel, Andrew Morton
On 04/03/2013 02:11 PM, Sha Zhengju wrote:
>> > +unsigned long
>> > +memcg_zone_reclaimable_pages(struct mem_cgroup *memcg, struct zone *zone)
>> > +{
>> > +       int nid = zone_to_nid(zone);
>> > +       int zid = zone_idx(zone);
>> > +       unsigned long val;
>> > +
>> > +       val = mem_cgroup_zone_nr_lru_pages(memcg, nid, zid, LRU_ALL_FILE);
>> > +       if (do_swap_account)
> IMHO, might get_nr_swap_pages() be more appropriate here?
> 
This is a per-memcg number, how would get_nr_swap_pages() help us here?
^ permalink raw reply	[flat|nested] 97+ messages in thread
* Re: [PATCH v2 05/28] dcache: remove dentries from LRU before putting on dispose list
  2013-04-03  6:51   ` Sha Zhengju
  2013-04-03  8:55     ` Glauber Costa
@ 2013-04-04  6:19     ` Dave Chinner
  2013-04-04  6:56       ` Glauber Costa
  1 sibling, 1 reply; 97+ messages in thread
From: Dave Chinner @ 2013-04-04  6:19 UTC (permalink / raw)
  To: Sha Zhengju
  Cc: Glauber Costa, linux-mm@kvack.org, Hugh Dickins, containers,
	Dave Chinner, Michal Hocko, Johannes Weiner, linux-fsdevel,
	Andrew Morton
On Wed, Apr 03, 2013 at 02:51:43PM +0800, Sha Zhengju wrote:
> On Fri, Mar 29, 2013 at 5:13 PM, Glauber Costa <glommer@parallels.com> wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> > @@ -884,6 +907,28 @@ relock:
> >         shrink_dentry_list(&tmp);
> >  }
> >
> > +/*
> > + * Mark all the dentries as being on the dispose list so we don't think they
> > + * are still on the LRU if we try to kill them from ascending the parent chain
> > + * in try_prune_one_dentry() rather than directly from the dispose list.
> > + */
> > +static void
> > +shrink_dcache_list(
> > +       struct list_head *dispose)
> > +{
> > +       struct dentry *dentry;
> > +
> > +       rcu_read_lock();
> > +       list_for_each_entry_rcu(dentry, dispose, d_lru) {
> > +               spin_lock(&dentry->d_lock);
> > +               dentry->d_flags |= DCACHE_SHRINK_LIST;
> > +               this_cpu_dec(nr_dentry_unused);
> >
> 
> Why is nr_dentry_unused decremented again here? Hasn't it already been decreased
> in the shrink_dcache_sb() code that follows?
You are right, that's a bug, as we've already accounted for the
dentry being pulled off the LRU list. Good catch.
Cheers,
Dave.
-- 
Dave Chinner
david@fromorbit.com
^ permalink raw reply	[flat|nested] 97+ messages in thread
* Re: [PATCH v2 05/28] dcache: remove dentries from LRU before putting on dispose list
  2013-04-04  6:19     ` Dave Chinner
@ 2013-04-04  6:56       ` Glauber Costa
  0 siblings, 0 replies; 97+ messages in thread
From: Glauber Costa @ 2013-04-04  6:56 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Sha Zhengju, linux-mm@kvack.org, Hugh Dickins, containers,
	Dave Chinner, Michal Hocko, Johannes Weiner, linux-fsdevel,
	Andrew Morton
On 04/04/2013 10:19 AM, Dave Chinner wrote:
> On Wed, Apr 03, 2013 at 02:51:43PM +0800, Sha Zhengju wrote:
>> On Fri, Mar 29, 2013 at 5:13 PM, Glauber Costa <glommer@parallels.com> wrote:
>>> From: Dave Chinner <dchinner@redhat.com>
>>> @@ -884,6 +907,28 @@ relock:
>>>         shrink_dentry_list(&tmp);
>>>  }
>>>
>>> +/*
>>> + * Mark all the dentries as being on the dispose list so we don't think they
>>> + * are still on the LRU if we try to kill them from ascending the parent chain
>>> + * in try_prune_one_dentry() rather than directly from the dispose list.
>>> + */
>>> +static void
>>> +shrink_dcache_list(
>>> +       struct list_head *dispose)
>>> +{
>>> +       struct dentry *dentry;
>>> +
>>> +       rcu_read_lock();
>>> +       list_for_each_entry_rcu(dentry, dispose, d_lru) {
>>> +               spin_lock(&dentry->d_lock);
>>> +               dentry->d_flags |= DCACHE_SHRINK_LIST;
>>> +               this_cpu_dec(nr_dentry_unused);
>>>
>>
>> Why is nr_dentry_unused decremented again here? Hasn't it already been decreased
>> in the shrink_dcache_sb() code that follows?
> 
> You are right, that's a bug, as we've already accounted for the
> dentry being pulled off the LRU list. Good catch.
> 
OK, I folded it into the original patch, with due credit, for better
bisection.
^ permalink raw reply	[flat|nested] 97+ messages in thread
* Re: [PATCH v2 21/28] vmscan: also shrink slab in memcg pressure
  2013-04-03 10:43     ` Glauber Costa
@ 2013-04-04  9:35       ` Sha Zhengju
  2013-04-05  8:25         ` Glauber Costa
  0 siblings, 1 reply; 97+ messages in thread
From: Sha Zhengju @ 2013-04-04  9:35 UTC (permalink / raw)
  To: Glauber Costa
  Cc: linux-mm@kvack.org, Rik van Riel, Hugh Dickins, containers,
	Dave Chinner, Dave Shrinnker, Michal Hocko, Mel Gorman,
	Johannes Weiner, linux-fsdevel, Andrew Morton
On Wed, Apr 3, 2013 at 6:43 PM, Glauber Costa <glommer@parallels.com> wrote:
> On 04/03/2013 02:11 PM, Sha Zhengju wrote:
>>> > +unsigned long
>>> > +memcg_zone_reclaimable_pages(struct mem_cgroup *memcg, struct zone *zone)
>>> > +{
>>> > +       int nid = zone_to_nid(zone);
>>> > +       int zid = zone_idx(zone);
>>> > +       unsigned long val;
>>> > +
>>> > +       val = mem_cgroup_zone_nr_lru_pages(memcg, nid, zid, LRU_ALL_FILE);
>>> > +       if (do_swap_account)
>> IMHO, might get_nr_swap_pages() be more appropriate here?
>>
>
> This is a per-memcg number, how would get_nr_swap_pages() help us here?
>
I meant adding get_nr_swap_pages() to the if condition, that is:
   if (do_swap_account && get_nr_swap_pages())
       ....
since anon pages become unreclaimable once swap space is used up.
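A minimal userspace sketch of the check proposed above, for illustration only. The struct, the function name and the parameters are hypothetical stand-ins, and the assumption that the elided branch adds the anon LRU pages is mine; the quoted hunk does not show it.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-memcg, per-zone LRU page counts. */
struct fake_memcg_zone {
	unsigned long nr_file;	/* pages on the file LRUs */
	unsigned long nr_anon;	/* pages on the anon LRUs */
};

/*
 * File pages always count as reclaimable. Anon pages count only when
 * swap accounting is enabled and some swap space is still free
 * (assumption: this is what the elided branch is meant to do).
 */
static unsigned long
reclaimable_pages(const struct fake_memcg_zone *z, bool do_swap_account,
		  long nr_swap_pages)
{
	unsigned long val = z->nr_file;

	assert(nr_swap_pages >= 0);
	if (do_swap_account && nr_swap_pages > 0)
		val += z->nr_anon;
	return val;
}
```

With swap exhausted (nr_swap_pages == 0), the anon pages drop out of the total, so the caller sees less reclaimable memory.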
Thanks,
Sha
^ permalink raw reply	[flat|nested] 97+ messages in thread
* Re: [PATCH v2 08/28] list: add a new LRU list type
  2013-03-29  9:13 ` [PATCH v2 08/28] list: add a new LRU list type Glauber Costa
@ 2013-04-04 21:53   ` Greg Thelen
  2013-04-05  1:20     ` Dave Chinner
  0 siblings, 1 reply; 97+ messages in thread
From: Greg Thelen @ 2013-04-04 21:53 UTC (permalink / raw)
  To: Glauber Costa
  Cc: linux-mm, linux-fsdevel, containers, Michal Hocko,
	Johannes Weiner, kamezawa.hiroyu, Andrew Morton, Dave Shrinnker,
	hughd, yinghan, Dave Chinner
On Fri, Mar 29 2013, Glauber Costa wrote:
> From: Dave Chinner <dchinner@redhat.com>
>
> Several subsystems use the same construct for LRU lists - a list
> head, a spin lock and an item count. They also use exactly the same
> code for adding and removing items from the LRU. Create a generic
> type for these LRU lists.
>
> This is the beginning of generic, node aware LRUs for shrinkers to
> work with.
>
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  include/linux/list_lru.h |  36 +++++++++++++++
>  lib/Makefile             |   2 +-
>  lib/list_lru.c           | 117 +++++++++++++++++++++++++++++++++++++++++++++++
>  3 files changed, 154 insertions(+), 1 deletion(-)
>  create mode 100644 include/linux/list_lru.h
>  create mode 100644 lib/list_lru.c
>
> diff --git a/include/linux/list_lru.h b/include/linux/list_lru.h
> new file mode 100644
> index 0000000..3423949
> --- /dev/null
> +++ b/include/linux/list_lru.h
> @@ -0,0 +1,36 @@
> +/*
> + * Copyright (c) 2010-2012 Red Hat, Inc. All rights reserved.
> + * Author: David Chinner
> + *
> + * Generic LRU infrastructure
> + */
> +#ifndef _LRU_LIST_H
> +#define _LRU_LIST_H 0
Utter nitpicking, but all other .h files in this directory use the form:
  #define _LRU_LIST_H
Note the removed trailing 0.
> +
> +#include <linux/list.h>
> +
> +struct list_lru {
> +	spinlock_t		lock;
> +	struct list_head	list;
> +	long			nr_items;
> +};
> +
> +int list_lru_init(struct list_lru *lru);
> +int list_lru_add(struct list_lru *lru, struct list_head *item);
> +int list_lru_del(struct list_lru *lru, struct list_head *item);
> +
> +static inline long list_lru_count(struct list_lru *lru)
> +{
> +	return lru->nr_items;
> +}
> +
> +typedef int (*list_lru_walk_cb)(struct list_head *item, spinlock_t *lock,
> +				void *cb_arg);
> +typedef void (*list_lru_dispose_cb)(struct list_head *dispose_list);
> +
> +long list_lru_walk(struct list_lru *lru, list_lru_walk_cb isolate,
> +		   void *cb_arg, long nr_to_walk);
> +
> +long list_lru_dispose_all(struct list_lru *lru, list_lru_dispose_cb dispose);
> +
> +#endif /* _LRU_LIST_H */
> diff --git a/lib/Makefile b/lib/Makefile
> index d7946ff..f14abd9 100644
> --- a/lib/Makefile
> +++ b/lib/Makefile
> @@ -13,7 +13,7 @@ lib-y := ctype.o string.o vsprintf.o cmdline.o \
>  	 sha1.o md5.o irq_regs.o reciprocal_div.o argv_split.o \
>  	 proportions.o flex_proportions.o prio_heap.o ratelimit.o show_mem.o \
>  	 is_single_threaded.o plist.o decompress.o kobject_uevent.o \
> -	 earlycpio.o
> +	 earlycpio.o list_lru.o
>  
>  lib-$(CONFIG_MMU) += ioremap.o
>  lib-$(CONFIG_SMP) += cpumask.o
> diff --git a/lib/list_lru.c b/lib/list_lru.c
> new file mode 100644
> index 0000000..475d0e9
> --- /dev/null
> +++ b/lib/list_lru.c
> @@ -0,0 +1,117 @@
> +/*
> + * Copyright (c) 2010-2012 Red Hat, Inc. All rights reserved.
> + * Author: David Chinner
> + *
> + * Generic LRU infrastructure
> + */
> +#include <linux/kernel.h>
> +#include <linux/module.h>
> +#include <linux/list_lru.h>
> +
> +int
> +list_lru_add(
> +	struct list_lru	*lru,
> +	struct list_head *item)
> +{
> +	spin_lock(&lru->lock);
> +	if (list_empty(item)) {
> +		list_add_tail(item, &lru->list);
> +		lru->nr_items++;
> +		spin_unlock(&lru->lock);
> +		return 1;
> +	}
> +	spin_unlock(&lru->lock);
> +	return 0;
> +}
> +EXPORT_SYMBOL_GPL(list_lru_add);
> +
> +int
> +list_lru_del(
> +	struct list_lru	*lru,
> +	struct list_head *item)
> +{
> +	spin_lock(&lru->lock);
> +	if (!list_empty(item)) {
> +		list_del_init(item);
> +		lru->nr_items--;
> +		spin_unlock(&lru->lock);
> +		return 1;
> +	}
> +	spin_unlock(&lru->lock);
> +	return 0;
> +}
> +EXPORT_SYMBOL_GPL(list_lru_del);
> +
> +long
> +list_lru_walk(
> +	struct list_lru *lru,
> +	list_lru_walk_cb isolate,
> +	void		*cb_arg,
> +	long		nr_to_walk)
> +{
> +	struct list_head *item, *n;
> +	long removed = 0;
> +restart:
> +	spin_lock(&lru->lock);
> +	list_for_each_safe(item, n, &lru->list) {
> +		int ret;
> +
> +		if (nr_to_walk-- < 0)
> +			break;
> +
> +		ret = isolate(item, &lru->lock, cb_arg);
> +		switch (ret) {
> +		case 0:	/* item removed from list */
> +			lru->nr_items--;
> +			removed++;
> +			break;
> +		case 1: /* item referenced, give another pass */
> +			list_move_tail(item, &lru->list);
> +			break;
> +		case 2: /* item cannot be locked, skip */
> +			break;
> +		case 3: /* item not freeable, lock dropped */
> +			goto restart;
These four magic return values might benefit from an enum (or #define)
for clarity.
Maybe the names would be LRU_OK, LRU_REMOVED, LRU_ROTATE, LRU_RETRY.
> +		default:
> +			BUG();
> +		}
> +	}
> +	spin_unlock(&lru->lock);
> +	return removed;
> +}
> +EXPORT_SYMBOL_GPL(list_lru_walk);
> +
> +long
> +list_lru_dispose_all(
> +	struct list_lru *lru,
> +	list_lru_dispose_cb dispose)
> +{
> +	long disposed = 0;
> +	LIST_HEAD(dispose_list);
> +
> +	spin_lock(&lru->lock);
> +	while (!list_empty(&lru->list)) {
> +		list_splice_init(&lru->list, &dispose_list);
> +		disposed += lru->nr_items;
> +		lru->nr_items = 0;
> +		spin_unlock(&lru->lock);
> +
> +		dispose(&dispose_list);
> +
> +		spin_lock(&lru->lock);
> +	}
> +	spin_unlock(&lru->lock);
> +	return disposed;
> +}
> +
> +int
> +list_lru_init(
> +	struct list_lru	*lru)
> +{
> +	spin_lock_init(&lru->lock);
> +	INIT_LIST_HEAD(&lru->list);
> +	lru->nr_items = 0;
> +
> +	return 0;
> +}
> +EXPORT_SYMBOL_GPL(list_lru_init);
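The enum suggested above for the isolate callback's return values could look roughly like the sketch below. The names are hypothetical (LRU_SKIP is chosen here to match the "skip" comment on case 2), and the walker is a userspace stand-in over an array, not the locked list walk from the patch. In particular, it only counts retries instead of restarting the walk.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical names for the four values returned by the isolate callback. */
enum lru_status {
	LRU_REMOVED = 0,	/* item removed from list */
	LRU_ROTATE  = 1,	/* item referenced, give another pass */
	LRU_SKIP    = 2,	/* item cannot be locked, skip */
	LRU_RETRY   = 3,	/* item not freeable, lock dropped */
};

typedef enum lru_status (*walk_cb)(int item, void *cb_arg);

struct walk_stats {
	long removed, rotated, skipped, retries;
};

/* Demo callback: derives a status from a non-negative item (item % 4). */
static enum lru_status demo_isolate(int item, void *cb_arg)
{
	(void)cb_arg;
	return (enum lru_status)(item % 4);
}

/* Stand-in for the switch in list_lru_walk(); tallies each outcome. */
static struct walk_stats
walk_items(const int *items, size_t n, walk_cb isolate, void *cb_arg)
{
	struct walk_stats st = { 0, 0, 0, 0 };

	for (size_t i = 0; i < n; i++) {
		switch (isolate(items[i], cb_arg)) {
		case LRU_REMOVED:
			st.removed++;
			break;
		case LRU_ROTATE:
			st.rotated++;
			break;
		case LRU_SKIP:
			st.skipped++;
			break;
		case LRU_RETRY:
			st.retries++;	/* the real code would restart the walk */
			break;
		default:
			assert(0 && "unknown lru_status");
		}
	}
	return st;
}
```

With named values, the switch documents itself and the per-case comments in the walker become redundant.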
^ permalink raw reply	[flat|nested] 97+ messages in thread
* Re: [PATCH v2 03/28] dcache: convert dentry_stat.nr_unused to per-cpu counters
  2013-03-29  9:13 ` [PATCH v2 03/28] dcache: convert dentry_stat.nr_unused to per-cpu counters Glauber Costa
@ 2013-04-05  1:09   ` Greg Thelen
  2013-04-05  1:15     ` Dave Chinner
  0 siblings, 1 reply; 97+ messages in thread
From: Greg Thelen @ 2013-04-05  1:09 UTC (permalink / raw)
  To: Glauber Costa
  Cc: linux-mm, linux-fsdevel, containers, Michal Hocko,
	Johannes Weiner, kamezawa.hiroyu, Andrew Morton, Dave Shrinnker,
	hughd, yinghan, Dave Chinner
On Fri, Mar 29 2013, Glauber Costa wrote:
> From: Dave Chinner <dchinner@redhat.com>
>
> Before we split up the dcache_lru_lock, the unused dentry counter
> needs to be made independent of the global dcache_lru_lock. Convert
> it to per-cpu counters to do this.
>
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> Reviewed-by: Christoph Hellwig <hch@lst.de>
> ---
>  fs/dcache.c | 17 ++++++++++++++---
>  1 file changed, 14 insertions(+), 3 deletions(-)
>
> diff --git a/fs/dcache.c b/fs/dcache.c
> index fbfae008..f1196f2 100644
> --- a/fs/dcache.c
> +++ b/fs/dcache.c
> @@ -118,6 +118,7 @@ struct dentry_stat_t dentry_stat = {
>  };
>  
>  static DEFINE_PER_CPU(unsigned int, nr_dentry);
> +static DEFINE_PER_CPU(unsigned int, nr_dentry_unused);
>  
>  #if defined(CONFIG_SYSCTL) && defined(CONFIG_PROC_FS)
>  static int get_nr_dentry(void)
> @@ -129,10 +130,20 @@ static int get_nr_dentry(void)
>  	return sum < 0 ? 0 : sum;
>  }
>  
> +static int get_nr_dentry_unused(void)
> +{
> +	int i;
> +	int sum = 0;
> +	for_each_possible_cpu(i)
> +		sum += per_cpu(nr_dentry_unused, i);
> +	return sum < 0 ? 0 : sum;
> +}
Just checking...  If cpu x is removed, then its per cpu nr_dentry_unused
count survives so we don't leak nr_dentry_unused.  Right?  I see code in
percpu_counter_sum_positive() to explicitly handle this case and I want
to make sure we don't need it here.
[snip]
^ permalink raw reply	[flat|nested] 97+ messages in thread
* Re: [PATCH v2 06/28] mm: new shrinker API
  2013-03-29  9:13 ` [PATCH v2 06/28] mm: new shrinker API Glauber Costa
@ 2013-04-05  1:09   ` Greg Thelen
  0 siblings, 0 replies; 97+ messages in thread
From: Greg Thelen @ 2013-04-05  1:09 UTC (permalink / raw)
  To: Glauber Costa
  Cc: linux-mm, linux-fsdevel, containers, Michal Hocko,
	Johannes Weiner, kamezawa.hiroyu, Andrew Morton, Dave Shrinnker,
	hughd, yinghan, Dave Chinner
On Fri, Mar 29 2013, Glauber Costa wrote:
> From: Dave Chinner <dchinner@redhat.com>
>
> The current shrinker callout API uses a single shrinker call for
> multiple functions. To determine the function, a special magical
> value is passed in a parameter to change the behaviour. This
> complicates the implementation and return value specification for
> the different behaviours.
>
> Separate the two different behaviours into separate operations, one
> to return a count of freeable objects in the cache, and another to
> scan a certain number of objects in the cache for freeing. In
> defining these new operations, ensure the return values and
> resultant behaviours are clearly defined and documented.
>
> Modify shrink_slab() to use the new API and implement the callouts
> for all the existing shrinkers.
>
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  include/linux/shrinker.h | 37 +++++++++++++++++++++++++----------
>  mm/vmscan.c              | 51 +++++++++++++++++++++++++++++++-----------------
>  2 files changed, 60 insertions(+), 28 deletions(-)
>
> diff --git a/include/linux/shrinker.h b/include/linux/shrinker.h
> index ac6b8ee..4f59615 100644
> --- a/include/linux/shrinker.h
> +++ b/include/linux/shrinker.h
> @@ -4,31 +4,47 @@
>  /*
>   * This struct is used to pass information from page reclaim to the shrinkers.
>   * We consolidate the values for easier extention later.
> + *
> + * The 'gfpmask' refers to the allocation we are currently trying to
> + * fulfil.
> + *
> + * Note that 'shrink' will be passed nr_to_scan == 0 when the VM is
> + * querying the cache size, so a fastpath for that case is appropriate.
>   */
>  struct shrink_control {
>  	gfp_t gfp_mask;
>  
>  	/* How many slab objects shrinker() should scan and try to reclaim */
> -	unsigned long nr_to_scan;
> +	long nr_to_scan;
Why convert from unsigned?  What's a poor shrinker to do with a negative
to-scan request?
[snip]
^ permalink raw reply	[flat|nested] 97+ messages in thread
* Re: [PATCH v2 03/28] dcache: convert dentry_stat.nr_unused to per-cpu counters
  2013-04-05  1:09   ` Greg Thelen
@ 2013-04-05  1:15     ` Dave Chinner
  2013-04-08  9:14       ` Glauber Costa
  0 siblings, 1 reply; 97+ messages in thread
From: Dave Chinner @ 2013-04-05  1:15 UTC (permalink / raw)
  To: Greg Thelen
  Cc: Glauber Costa, linux-mm, linux-fsdevel, containers, Michal Hocko,
	Johannes Weiner, kamezawa.hiroyu, Andrew Morton, hughd, yinghan,
	Dave Chinner
On Thu, Apr 04, 2013 at 06:09:31PM -0700, Greg Thelen wrote:
> On Fri, Mar 29 2013, Glauber Costa wrote:
> 
> > From: Dave Chinner <dchinner@redhat.com>
> >
> > Before we split up the dcache_lru_lock, the unused dentry counter
> > needs to be made independent of the global dcache_lru_lock. Convert
> > it to per-cpu counters to do this.
> >
> > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > Reviewed-by: Christoph Hellwig <hch@lst.de>
> > ---
> >  fs/dcache.c | 17 ++++++++++++++---
> >  1 file changed, 14 insertions(+), 3 deletions(-)
> >
> > diff --git a/fs/dcache.c b/fs/dcache.c
> > index fbfae008..f1196f2 100644
> > --- a/fs/dcache.c
> > +++ b/fs/dcache.c
> > @@ -118,6 +118,7 @@ struct dentry_stat_t dentry_stat = {
> >  };
> >  
> >  static DEFINE_PER_CPU(unsigned int, nr_dentry);
> > +static DEFINE_PER_CPU(unsigned int, nr_dentry_unused);
> >  
> >  #if defined(CONFIG_SYSCTL) && defined(CONFIG_PROC_FS)
> >  static int get_nr_dentry(void)
> > @@ -129,10 +130,20 @@ static int get_nr_dentry(void)
> >  	return sum < 0 ? 0 : sum;
> >  }
> >  
> > +static int get_nr_dentry_unused(void)
> > +{
> > +	int i;
> > +	int sum = 0;
> > +	for_each_possible_cpu(i)
> > +		sum += per_cpu(nr_dentry_unused, i);
> > +	return sum < 0 ? 0 : sum;
> > +}
> 
> Just checking...  If cpu x is removed, then its per cpu nr_dentry_unused
> count survives so we don't leak nr_dentry_unused.  Right?  I see code in
> percpu_counter_sum_positive() to explicitly handle this case and I want
> to make sure we don't need it here.
DEFINE_PER_CPU() gives a variable per possible CPU, and we sum for
all possible CPUs. Therefore online/offline CPUs just don't matter.
The percpu_counter code uses for_each_online_cpu(), and so it has to
be aware of hotplug operations so that it doesn't leak counts.
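The difference can be illustrated with a small userspace model. This is only a sketch: the struct, the fixed CPU count and the helper names are hypothetical, and real per-cpu variables and hotplug handling are far more involved.

```c
#include <assert.h>
#include <stdbool.h>

#define NR_POSSIBLE_CPUS 4	/* hypothetical machine size */

/* Userspace stand-in for a DEFINE_PER_CPU() counter plus an online mask. */
struct fake_percpu {
	int  count[NR_POSSIBLE_CPUS];	/* one slot per possible CPU */
	bool online[NR_POSSIBLE_CPUS];
};

/* Mirrors get_nr_dentry_unused(): sum every possible CPU, clamp at 0. */
static int sum_possible(const struct fake_percpu *pc)
{
	int sum = 0;

	for (int cpu = 0; cpu < NR_POSSIBLE_CPUS; cpu++)
		sum += pc->count[cpu];
	return sum < 0 ? 0 : sum;
}

/* Sums online CPUs only; an offlined CPU's slot is silently ignored. */
static int sum_online(const struct fake_percpu *pc)
{
	int sum = 0;

	for (int cpu = 0; cpu < NR_POSSIBLE_CPUS; cpu++)
		if (pc->online[cpu])
			sum += pc->count[cpu];
	return sum < 0 ? 0 : sum;
}

/* Take a CPU offline without folding its count anywhere. */
static void cpu_offline(struct fake_percpu *pc, int cpu)
{
	assert(cpu >= 0 && cpu < NR_POSSIBLE_CPUS);
	pc->online[cpu] = false;
}
```

In this model, taking a CPU offline leaves sum_possible() unchanged, while sum_online() loses that CPU's contribution unless something folds it into another slot first.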
Cheers,
Dave.
-- 
Dave Chinner
david@fromorbit.com
^ permalink raw reply	[flat|nested] 97+ messages in thread
* Re: [PATCH v2 08/28] list: add a new LRU list type
  2013-04-04 21:53   ` Greg Thelen
@ 2013-04-05  1:20     ` Dave Chinner
  2013-04-05  8:01       ` Glauber Costa
  0 siblings, 1 reply; 97+ messages in thread
From: Dave Chinner @ 2013-04-05  1:20 UTC (permalink / raw)
  To: Greg Thelen
  Cc: Glauber Costa, linux-mm, linux-fsdevel, containers, Michal Hocko,
	Johannes Weiner, kamezawa.hiroyu, Andrew Morton, hughd, yinghan,
	Dave Chinner
On Thu, Apr 04, 2013 at 02:53:49PM -0700, Greg Thelen wrote:
> On Fri, Mar 29 2013, Glauber Costa wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> > +long
> > +list_lru_walk(
> > +	struct list_lru *lru,
> > +	list_lru_walk_cb isolate,
> > +	void		*cb_arg,
> > +	long		nr_to_walk)
> > +{
> > +	struct list_head *item, *n;
> > +	long removed = 0;
> > +restart:
> > +	spin_lock(&lru->lock);
> > +	list_for_each_safe(item, n, &lru->list) {
> > +		int ret;
> > +
> > +		if (nr_to_walk-- < 0)
> > +			break;
> > +
> > +		ret = isolate(item, &lru->lock, cb_arg);
> > +		switch (ret) {
> > +		case 0:	/* item removed from list */
> > +			lru->nr_items--;
> > +			removed++;
> > +			break;
> > +		case 1: /* item referenced, give another pass */
> > +			list_move_tail(item, &lru->list);
> > +			break;
> > +		case 2: /* item cannot be locked, skip */
> > +			break;
> > +		case 3: /* item not freeable, lock dropped */
> > +			goto restart;
> 
> These four magic return values might benefit from an enum (or #define)
> for clarity.
Obviously, and it was stated that this needed to be done by myself
when I last posted the patch set many months ago. I've been rather
busy since then, and so haven't had time to do anything with it.
> Maybe the names would be LRU_OK, LRU_REMOVED, LRU_ROTATE, LRU_RETRY.
Something like that...
Cheers,
Dave.
-- 
Dave Chinner
david@fromorbit.com
^ permalink raw reply	[flat|nested] 97+ messages in thread
* Re: [PATCH v2 08/28] list: add a new LRU list type
  2013-04-05  1:20     ` Dave Chinner
@ 2013-04-05  8:01       ` Glauber Costa
  2013-04-06  0:04         ` Dave Chinner
  0 siblings, 1 reply; 97+ messages in thread
From: Glauber Costa @ 2013-04-05  8:01 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Greg Thelen, linux-mm, linux-fsdevel, containers, Michal Hocko,
	Johannes Weiner, kamezawa.hiroyu, Andrew Morton, hughd, yinghan,
	Dave Chinner
On 04/05/2013 05:20 AM, Dave Chinner wrote:
> On Thu, Apr 04, 2013 at 02:53:49PM -0700, Greg Thelen wrote:
>> On Fri, Mar 29 2013, Glauber Costa wrote:
>>> From: Dave Chinner <dchinner@redhat.com>
>>> +long
>>> +list_lru_walk(
>>> +	struct list_lru *lru,
>>> +	list_lru_walk_cb isolate,
>>> +	void		*cb_arg,
>>> +	long		nr_to_walk)
>>> +{
>>> +	struct list_head *item, *n;
>>> +	long removed = 0;
>>> +restart:
>>> +	spin_lock(&lru->lock);
>>> +	list_for_each_safe(item, n, &lru->list) {
>>> +		int ret;
>>> +
>>> +		if (nr_to_walk-- < 0)
>>> +			break;
>>> +
>>> +		ret = isolate(item, &lru->lock, cb_arg);
>>> +		switch (ret) {
>>> +		case 0:	/* item removed from list */
>>> +			lru->nr_items--;
>>> +			removed++;
>>> +			break;
>>> +		case 1: /* item referenced, give another pass */
>>> +			list_move_tail(item, &lru->list);
>>> +			break;
>>> +		case 2: /* item cannot be locked, skip */
>>> +			break;
>>> +		case 3: /* item not freeable, lock dropped */
>>> +			goto restart;
>>
>> These four magic return values might benefit from an enum (or #define)
>> for clarity.
> 
> Obviously, and it was stated that this needed to be done by myself
> when I last posted the patch set many months ago. I've been rather
> busy since then, and so haven't had time to do anything with it.
> 
>> Maybe the names would be LRU_OK, LRU_REMOVED, LRU_ROTATE, LRU_RETRY.
> 
> Something like that...
> 
I can handle that and fold it with credits as usual if you don't mind.
^ permalink raw reply	[flat|nested] 97+ messages in thread
* Re: [PATCH v2 21/28] vmscan: also shrink slab in memcg pressure
  2013-04-04  9:35       ` Sha Zhengju
@ 2013-04-05  8:25         ` Glauber Costa
  0 siblings, 0 replies; 97+ messages in thread
From: Glauber Costa @ 2013-04-05  8:25 UTC (permalink / raw)
  To: Sha Zhengju
  Cc: linux-mm@kvack.org, Rik van Riel, Hugh Dickins, containers,
	Dave Chinner, Dave Shrinnker, Michal Hocko, Mel Gorman,
	Johannes Weiner, linux-fsdevel, Andrew Morton
On 04/04/2013 01:35 PM, Sha Zhengju wrote:
> On Wed, Apr 3, 2013 at 6:43 PM, Glauber Costa <glommer@parallels.com> wrote:
>> On 04/03/2013 02:11 PM, Sha Zhengju wrote:
>>>>> +unsigned long
>>>>> +memcg_zone_reclaimable_pages(struct mem_cgroup *memcg, struct zone *zone)
>>>>> +{
>>>>> +       int nid = zone_to_nid(zone);
>>>>> +       int zid = zone_idx(zone);
>>>>> +       unsigned long val;
>>>>> +
>>>>> +       val = mem_cgroup_zone_nr_lru_pages(memcg, nid, zid, LRU_ALL_FILE);
>>>>> +       if (do_swap_account)
>>> IMHO, might get_nr_swap_pages() be more appropriate here?
>>>
>>
>> This is a per-memcg number, how would get_nr_swap_pages() help us here?
>>
> 
> I meant adding get_nr_swap_pages() to the if condition, that is:
>    if (do_swap_account && get_nr_swap_pages())
>        ....
> since anon pages become unreclaimable once swap space is used up.
> 
> 
Well, I believe this is doable.
^ permalink raw reply	[flat|nested] 97+ messages in thread
* Re: [PATCH v2 08/28] list: add a new LRU list type
  2013-04-05  8:01       ` Glauber Costa
@ 2013-04-06  0:04         ` Dave Chinner
  0 siblings, 0 replies; 97+ messages in thread
From: Dave Chinner @ 2013-04-06  0:04 UTC (permalink / raw)
  To: Glauber Costa
  Cc: Greg Thelen, linux-mm, linux-fsdevel, containers, Michal Hocko,
	Johannes Weiner, kamezawa.hiroyu, Andrew Morton, hughd, yinghan,
	Dave Chinner
On Fri, Apr 05, 2013 at 12:01:01PM +0400, Glauber Costa wrote:
> On 04/05/2013 05:20 AM, Dave Chinner wrote:
> > On Thu, Apr 04, 2013 at 02:53:49PM -0700, Greg Thelen wrote:
> >> On Fri, Mar 29 2013, Glauber Costa wrote:
> >>> From: Dave Chinner <dchinner@redhat.com>
> >>> +long
> >>> +list_lru_walk(
> >>> +	struct list_lru *lru,
> >>> +	list_lru_walk_cb isolate,
> >>> +	void		*cb_arg,
> >>> +	long		nr_to_walk)
> >>> +{
> >>> +	struct list_head *item, *n;
> >>> +	long removed = 0;
> >>> +restart:
> >>> +	spin_lock(&lru->lock);
> >>> +	list_for_each_safe(item, n, &lru->list) {
> >>> +		int ret;
> >>> +
> >>> +		if (nr_to_walk-- < 0)
> >>> +			break;
> >>> +
> >>> +		ret = isolate(item, &lru->lock, cb_arg);
> >>> +		switch (ret) {
> >>> +		case 0:	/* item removed from list */
> >>> +			lru->nr_items--;
> >>> +			removed++;
> >>> +			break;
> >>> +		case 1: /* item referenced, give another pass */
> >>> +			list_move_tail(item, &lru->list);
> >>> +			break;
> >>> +		case 2: /* item cannot be locked, skip */
> >>> +			break;
> >>> +		case 3: /* item not freeable, lock dropped */
> >>> +			goto restart;
> >>
> >> These four magic return values might benefit from an enum (or #define)
> >> for clarity.
> > 
> > Obviously, and it was stated that this needed to be done by myself
> > when I last posted the patch set many months ago. I've been rather
> > busy since then, and so haven't had time to do anything with it.
> > 
> >> Maybe the names would be LRU_OK, LRU_REMOVED, LRU_ROTATE, LRU_RETRY.
> > 
> > Something like that...
> > 
> I can handle that and fold it with credits as usual if you don't mind.
Sure, I'm happy for you to do that, along with any other cleanups
and fixes that are needed...
Cheers,
Dave.
-- 
Dave Chinner
david@fromorbit.com
^ permalink raw reply	[flat|nested] 97+ messages in thread
* Re: [PATCH v2 00/28] memcg-aware slab shrinking
  2013-04-01 14:12     ` Serge Hallyn
@ 2013-04-08  8:11       ` Glauber Costa
  0 siblings, 0 replies; 97+ messages in thread
From: Glauber Costa @ 2013-04-08  8:11 UTC (permalink / raw)
  To: Serge Hallyn
  Cc: linux-mm, hughd, containers, Dave Shrinnker, Michal Hocko,
	Johannes Weiner, linux-fsdevel, Andrew Morton
On 04/01/2013 06:12 PM, Serge Hallyn wrote:
> Quoting Glauber Costa (glommer@parallels.com):
>> On 04/01/2013 04:38 PM, Serge Hallyn wrote:
>>> Quoting Glauber Costa (glommer@parallels.com):
>>>> Hi,
>>>>
>>>> Notes:
>>>> ======
>>>>
>>>> This is v2 of memcg-aware LRU shrinking. I've been testing it extensively
>>>> and it behaves well, at least from the isolation point of view. However,
>>>> I feel some more testing is needed before we commit to it. Still, this is
>>>> doing the job fairly well. Comments welcome.
>>>
>>> Do you have any performance tests (preferably with enough runs with and
>>> without this patchset to show 95% confidence interval) to show the
>>> impact this has?  Certainly the feature sounds worthwhile, but I'm
>>> curious about the cost of maintaining this extra state.
>>>
>>> -serge
>>>
>> Not yet. I intend to include them in my next run. I haven't yet decided
>> on a set of tests to run (maybe just a memcg-contained kernel compile?)
>>
>> So if you have suggestions of what I could run to show this, feel free
>> to lay them down here.
> 
> Perhaps mount a 4G tmpfs, copy kernel tree there, and build kernel on
> that tmpfs?
> 
I've just run kernbench with 2GB setups, on 3 different kernels. I
will include all this data in my opening letter for the next submission,
but wanted to give a heads-up here:
Kernels
========
base: the current -mm
davelru: that + Dave's patches applied
fulllru: that + my patches applied.
I've run all of them in a first-level cgroup. Please note that the first
two kernels are not capable of shrinking metadata, so I had to select a
size that is enough to keep them under relatively constant pressure, but
without that pressure coming exclusively from kernel memory.
2GB did the job. This is a 2-node, 24-way machine. My access to it is
very limited, and I have no idea when I'll be able to get my hands on
it again.
Results:
Base
====
Average Optimal load -j 24 Run (std deviation):
Elapsed Time 415.988 (8.37909)
User Time 4142 (759.964)
System Time 418.483 (62.0377)
Percent CPU 1030.7 (267.462)
Context Switches 391509 (268361)
Sleeps 738483 (149934)
Dave
====
Average Optimal load -j 24 Run (std deviation):
Elapsed Time 424.486 (16.7365) (+2 % vs base)
User Time 4146.8 (764.012) (+0.84 % vs base)
System Time 419.24 (62.4507) (+0.18 % vs base)
Percent CPU 1012.1 (264.558) (-1.8 % vs base)
Context Switches 393363 (268899) (+0.47 % vs base)
Sleeps 739905 (147344) (+0.19 % vs base)
Full
=====
Average Optimal load -j 24 Run (std deviation):
Elapsed Time 456.644 (15.3567) (+9.7 % vs base)
User Time 4036.3 (645.261) (-2.5 % vs base)
System Time 438.134 (82.251) (+4.7 % vs base)
Percent CPU 973 (168.581) (-5.6 % vs base)
Context Switches 350796 (229700) (-10 % vs base)
Sleeps 728156 (138808) (-1.4 % vs base)
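For anyone who wants to re-check the percentages in parentheses, here is a
minimal sketch, assuming they are plain relative changes of each mean against
the base mean. pct_vs_base() is a hypothetical helper name, not something
kernbench provides.

```c
#include <assert.h>

/*
 * Relative change of `value` against `base`, in percent.
 * Positive means the figure grew compared to the base kernel.
 */
double pct_vs_base(double base, double value)
{
	assert(base != 0.0);	/* a zero base has no meaningful percentage */
	return (value - base) / base * 100.0;
}
```

With the Elapsed Time means above, pct_vs_base(415.988, 456.644) comes out
between 9.7 and 9.8, in line with the figure quoted for the Full kernel.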
Discussion:
===========
First-level analysis: All figures fall within the std dev, except for
Full LRU wall time. It does fall within 2 std devs, though.
On the other hand, Full LRU kernel leads to better cpu utilization and
greater efficiency.
Details: The reclaim patterns in the three kernels are expected to be
different. User memory will always be the main driver, but in case of
pressure the first two kernels will shrink it while keeping the metadata
intact. This should lead to smaller system time figures at the expense of
bigger user time figures, since user pages will be evicted more often.
This is consistent with the figures I've found.
Full LRU kernels have 2.5 % better user time, with 5.6 %
less CPU consumed and 10 % fewer context switches.
This comes at the expense of 4.7 % more system time. Because we
will have to bring more dentry and inode objects back from caches, we
will stress the slab code more.
Because this is a benchmark that stresses a lot of metadata, it is
expected that this increase affects the final wall time proportionally.
We notice that the mere introduction of LRU code (Dave's kernel) does
not affect the final wall time outside the standard deviation.
Shrinking those objects, however, will lead to bigger wall times. This
is expected. No one would ever argue that the right kernel
behavior for all cases is to keep the metadata in memory at the expense of
user memory (and even if we should, we should do it the same way for the
cgroups).
My final conclusion is that, performance-wise, the work is sound and
operates within expectations.
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply	[flat|nested] 97+ messages in thread
* Re: [PATCH v2 02/28] vmscan: take at least one pass with shrinkers
  2013-03-29  9:13 ` [PATCH v2 02/28] vmscan: take at least one pass with shrinkers Glauber Costa
  2013-04-01  7:26   ` Kamezawa Hiroyuki
@ 2013-04-08  8:42   ` Joonsoo Kim
  2013-04-08  8:47     ` Glauber Costa
  1 sibling, 1 reply; 97+ messages in thread
From: Joonsoo Kim @ 2013-04-08  8:42 UTC (permalink / raw)
  To: Glauber Costa
  Cc: linux-mm, linux-fsdevel, containers, Michal Hocko,
	Johannes Weiner, kamezawa.hiroyu, Andrew Morton, Dave Shrinnker,
	Greg Thelen, hughd, yinghan, Theodore Ts'o, Al Viro
Hello, Glauber.
On Fri, Mar 29, 2013 at 01:13:44PM +0400, Glauber Costa wrote:
> In very low free kernel memory situations, it may be the case that we
> have less objects to free than our initial batch size. If this is the
> case, it is better to shrink those, and open space for the new workload
> then to keep them and fail the new allocations.
> 
> More specifically, this happens because we encode this in a loop with
> the condition: "while (total_scan >= batch_size)". So if we are in such
> a case, we'll not even enter the loop.
> 
> This patch modifies turns it into a do () while {} loop, that will
> guarantee that we scan it at least once, while keeping the behaviour
> exactly the same for the cases in which total_scan > batch_size.
Current users of the shrinker interface not only use their own conditions,
but also rely on batch_size and seeks to throttle their behavior. So IMHO,
this behavior change is very dangerous to some users.
For example, think of lowmemorykiller.
With this patch, it always kills some process whenever shrink_slab() is
called and its low memory condition is satisfied.
Before this, total_scan also prevented us from going into lowmemorykiller, so
killing innocent processes was limited as much as possible.
IMHO, at the least, we need an acknowledgement from the users of shrink_slab()
about this change.
Thanks.
> 
> Signed-off-by: Glauber Costa <glommer@parallels.com>
> Reviewed-by: Dave Chinner <david@fromorbit.com>
> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
> CC: "Theodore Ts'o" <tytso@mit.edu>
> CC: Al Viro <viro@zeniv.linux.org.uk>
> ---
>  mm/vmscan.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 88c5fed..fc6d45a 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -280,7 +280,7 @@ unsigned long shrink_slab(struct shrink_control *shrink,
>  					nr_pages_scanned, lru_pages,
>  					max_pass, delta, total_scan);
>  
> -		while (total_scan >= batch_size) {
> +		do {
>  			int nr_before;
>  
>  			nr_before = do_shrinker_shrink(shrinker, shrink, 0);
> @@ -294,7 +294,7 @@ unsigned long shrink_slab(struct shrink_control *shrink,
>  			total_scan -= batch_size;
>  
>  			cond_resched();
> -		}
> +		} while (total_scan >= batch_size);
>  
>  		/*
>  		 * move the unused scan count back into the shrinker in a
> -- 
> 1.8.1.4
> 
^ permalink raw reply	[flat|nested] 97+ messages in thread
* Re: [PATCH v2 02/28] vmscan: take at least one pass with shrinkers
  2013-04-08  8:42   ` Joonsoo Kim
@ 2013-04-08  8:47     ` Glauber Costa
  2013-04-08  9:01       ` Joonsoo Kim
  0 siblings, 1 reply; 97+ messages in thread
From: Glauber Costa @ 2013-04-08  8:47 UTC (permalink / raw)
  To: Joonsoo Kim
  Cc: linux-mm, linux-fsdevel, containers, Michal Hocko,
	Johannes Weiner, kamezawa.hiroyu, Andrew Morton, Dave Shrinnker,
	Greg Thelen, hughd, yinghan, Theodore Ts'o, Al Viro
On 04/08/2013 12:42 PM, Joonsoo Kim wrote:
> Hello, Glauber.
> 
> On Fri, Mar 29, 2013 at 01:13:44PM +0400, Glauber Costa wrote:
>> In very low free kernel memory situations, it may be the case that we
>> have less objects to free than our initial batch size. If this is the
>> case, it is better to shrink those, and open space for the new workload
>> then to keep them and fail the new allocations.
>>
>> More specifically, this happens because we encode this in a loop with
>> the condition: "while (total_scan >= batch_size)". So if we are in such
>> a case, we'll not even enter the loop.
>>
>> This patch modifies turns it into a do () while {} loop, that will
>> guarantee that we scan it at least once, while keeping the behaviour
>> exactly the same for the cases in which total_scan > batch_size.
> 
> Current user of shrinker not only use their own condition, but also
> use batch_size and seeks to throttle their behavior. So IMHO,
> this behavior change is very dangerous to some users.
> 
> For example, think lowmemorykiller.
> With this patch, he always kill some process whenever shrink_slab() is
> called and their low memory condition is satisfied.
> Before this, total_scan also prevent us to go into lowmemorykiller, so
> killing innocent process is limited as much as possible.
> 
Shrinking is part of the normal operation of the Linux kernel and
happens all the time. Not only the call to shrink_slab, but the actual
shrinking of unused objects.
I therefore don't know of any code that would kill a process only
because it has reached shrink_slab.
In normal systems, this loop will be executed many, many times. So we're
not shrinking *more*, we're just guaranteeing that at least one pass
will be made.
Also, anyone looking at this to see if we should kill processes, is a
lot more likely to kill something if we tried to shrink but didn't, than
if we successfully shrunk something.
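To make the difference between the two loop forms concrete, here is a
user-space sketch that only counts how many times the loop body would run.
The function names are made up for illustration, and total_scan is modelled
as a signed long; neither is taken from mm/vmscan.c, and the sketch does not
model what a shrinker actually frees.

```c
#include <assert.h>

/* Old form: the body is skipped entirely when total_scan < batch_size. */
unsigned long passes_while(long total_scan, long batch_size)
{
	unsigned long passes = 0;

	assert(batch_size > 0);
	while (total_scan >= batch_size) {
		passes++;		/* stands in for one shrinker call */
		total_scan -= batch_size;
	}
	return passes;
}

/* Patched form: the body always runs at least once. */
unsigned long passes_do_while(long total_scan, long batch_size)
{
	unsigned long passes = 0;

	assert(batch_size > 0);
	do {
		passes++;		/* stands in for one shrinker call */
		total_scan -= batch_size;
	} while (total_scan >= batch_size);
	return passes;
}
```

With total_scan = 100 and batch_size = 128 the first form makes zero passes
and the second makes one. Once total_scan >= batch_size, both forms make the
same number of passes.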
^ permalink raw reply	[flat|nested] 97+ messages in thread
* Re: [PATCH v2 02/28] vmscan: take at least one pass with shrinkers
  2013-04-08  8:47     ` Glauber Costa
@ 2013-04-08  9:01       ` Joonsoo Kim
  2013-04-08  9:05         ` Glauber Costa
  0 siblings, 1 reply; 97+ messages in thread
From: Joonsoo Kim @ 2013-04-08  9:01 UTC (permalink / raw)
  To: Glauber Costa
  Cc: linux-mm, linux-fsdevel, containers, Michal Hocko,
	Johannes Weiner, kamezawa.hiroyu, Andrew Morton, Dave Shrinnker,
	Greg Thelen, hughd, yinghan, Theodore Ts'o, Al Viro
On Mon, Apr 08, 2013 at 12:47:14PM +0400, Glauber Costa wrote:
> On 04/08/2013 12:42 PM, Joonsoo Kim wrote:
> > Hello, Glauber.
> > 
> > On Fri, Mar 29, 2013 at 01:13:44PM +0400, Glauber Costa wrote:
> >> In very low free kernel memory situations, it may be the case that we
> >> have less objects to free than our initial batch size. If this is the
> >> case, it is better to shrink those, and open space for the new workload
> >> then to keep them and fail the new allocations.
> >>
> >> More specifically, this happens because we encode this in a loop with
> >> the condition: "while (total_scan >= batch_size)". So if we are in such
> >> a case, we'll not even enter the loop.
> >>
> >> This patch modifies turns it into a do () while {} loop, that will
> >> guarantee that we scan it at least once, while keeping the behaviour
> >> exactly the same for the cases in which total_scan > batch_size.
> > 
> > Current user of shrinker not only use their own condition, but also
> > use batch_size and seeks to throttle their behavior. So IMHO,
> > this behavior change is very dangerous to some users.
> > 
> > For example, think lowmemorykiller.
> > With this patch, he always kill some process whenever shrink_slab() is
> > called and their low memory condition is satisfied.
> > Before this, total_scan also prevent us to go into lowmemorykiller, so
> > killing innocent process is limited as much as possible.
> > 
> shrinking is part of the normal operation of the Linux kernel and
> happens all the time. Not only the call to shrink_slab, but actual
> shrinking of unused objects.
> 
> I don't know therefore about any code that would kill process only
> because they have reached shrink_slab.
> 
> In normal systems, this loop will be executed many, many times. So we're
> not shrinking *more*, we're just guaranteeing that at least one pass
> will be made.
This one-pass guarantee is a problem for the lowmemory killer.
> Also, anyone looking at this to see if we should kill processes, is a
> lot more likely to kill something if we tried to shrink but didn't, than
> if we successfully shrunk something.
The lowmemory killer is a hacky user of the shrink_slab interface. It kills a
process if system memory goes under some level. It checks that condition every
time it is called. Without this patch, it cannot check that condition
if total_scan < batch_size. But with this patch, it checks that condition
more frequently, so there will be side effects.
Thanks.
> 
^ permalink raw reply	[flat|nested] 97+ messages in thread
* Re: [PATCH v2 02/28] vmscan: take at least one pass with shrinkers
  2013-04-08  9:01       ` Joonsoo Kim
@ 2013-04-08  9:05         ` Glauber Costa
  2013-04-09  0:55           ` Joonsoo Kim
  0 siblings, 1 reply; 97+ messages in thread
From: Glauber Costa @ 2013-04-08  9:05 UTC (permalink / raw)
  To: Joonsoo Kim
  Cc: linux-mm, linux-fsdevel, containers, Michal Hocko,
	Johannes Weiner, kamezawa.hiroyu, Andrew Morton, Dave Shrinnker,
	Greg Thelen, hughd, yinghan, Theodore Ts'o, Al Viro
On 04/08/2013 01:01 PM, Joonsoo Kim wrote:
> On Mon, Apr 08, 2013 at 12:47:14PM +0400, Glauber Costa wrote:
>> On 04/08/2013 12:42 PM, Joonsoo Kim wrote:
>>> Hello, Glauber.
>>>
>>> On Fri, Mar 29, 2013 at 01:13:44PM +0400, Glauber Costa wrote:
>>>> In very low free kernel memory situations, it may be the case that we
>>>> have less objects to free than our initial batch size. If this is the
>>>> case, it is better to shrink those, and open space for the new workload
>>>> then to keep them and fail the new allocations.
>>>>
>>>> More specifically, this happens because we encode this in a loop with
>>>> the condition: "while (total_scan >= batch_size)". So if we are in such
>>>> a case, we'll not even enter the loop.
>>>>
>>>> This patch modifies turns it into a do () while {} loop, that will
>>>> guarantee that we scan it at least once, while keeping the behaviour
>>>> exactly the same for the cases in which total_scan > batch_size.
>>>
>>> Current user of shrinker not only use their own condition, but also
>>> use batch_size and seeks to throttle their behavior. So IMHO,
>>> this behavior change is very dangerous to some users.
>>>
>>> For example, think lowmemorykiller.
>>> With this patch, he always kill some process whenever shrink_slab() is
>>> called and their low memory condition is satisfied.
>>> Before this, total_scan also prevent us to go into lowmemorykiller, so
>>> killing innocent process is limited as much as possible.
>>>
>> shrinking is part of the normal operation of the Linux kernel and
>> happens all the time. Not only the call to shrink_slab, but actual
>> shrinking of unused objects.
>>
>> I don't know therefore about any code that would kill process only
>> because they have reached shrink_slab.
>>
>> In normal systems, this loop will be executed many, many times. So we're
>> not shrinking *more*, we're just guaranteeing that at least one pass
>> will be made.
> 
> This one pass guarantee is a problem for lowmemory killer.
> 
>> Also, anyone looking at this to see if we should kill processes, is a
>> lot more likely to kill something if we tried to shrink but didn't, than
>> if we successfully shrunk something.
> 
> lowmemory killer is hacky user of shrink_slab interface.
Well, that says it all =)
In particular, I really can't see how, hacky or not, it makes sense to kill
a process if we *actually* shrunk memory.
Moreover, I don't see the code in drivers/staging/android/lowmemorykiller.c
doing anything even remotely close to that. Could you point me to some
code that does it?
^ permalink raw reply	[flat|nested] 97+ messages in thread
* Re: [PATCH v2 03/28] dcache: convert dentry_stat.nr_unused to per-cpu counters
  2013-04-05  1:15     ` Dave Chinner
@ 2013-04-08  9:14       ` Glauber Costa
  2013-04-08 13:18         ` Glauber Costa
  2013-04-08 23:26         ` Dave Chinner
  0 siblings, 2 replies; 97+ messages in thread
From: Glauber Costa @ 2013-04-08  9:14 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Greg Thelen, linux-mm, linux-fsdevel, containers, Michal Hocko,
	Johannes Weiner, kamezawa.hiroyu, Andrew Morton, hughd, yinghan,
	Dave Chinner
On 04/05/2013 05:15 AM, Dave Chinner wrote:
> On Thu, Apr 04, 2013 at 06:09:31PM -0700, Greg Thelen wrote:
>> On Fri, Mar 29 2013, Glauber Costa wrote:
>>
>>> From: Dave Chinner <dchinner@redhat.com>
>>>
>>> Before we split up the dcache_lru_lock, the unused dentry counter
>>> needs to be made independent of the global dcache_lru_lock. Convert
>>> it to per-cpu counters to do this.
>>>
>>> Signed-off-by: Dave Chinner <dchinner@redhat.com>
>>> Reviewed-by: Christoph Hellwig <hch@lst.de>
>>> ---
>>>  fs/dcache.c | 17 ++++++++++++++---
>>>  1 file changed, 14 insertions(+), 3 deletions(-)
>>>
>>> diff --git a/fs/dcache.c b/fs/dcache.c
>>> index fbfae008..f1196f2 100644
>>> --- a/fs/dcache.c
>>> +++ b/fs/dcache.c
>>> @@ -118,6 +118,7 @@ struct dentry_stat_t dentry_stat = {
>>>  };
>>>  
>>>  static DEFINE_PER_CPU(unsigned int, nr_dentry);
>>> +static DEFINE_PER_CPU(unsigned int, nr_dentry_unused);
>>>  
>>>  #if defined(CONFIG_SYSCTL) && defined(CONFIG_PROC_FS)
>>>  static int get_nr_dentry(void)
>>> @@ -129,10 +130,20 @@ static int get_nr_dentry(void)
>>>  	return sum < 0 ? 0 : sum;
>>>  }
>>>  
>>> +static int get_nr_dentry_unused(void)
>>> +{
>>> +	int i;
>>> +	int sum = 0;
>>> +	for_each_possible_cpu(i)
>>> +		sum += per_cpu(nr_dentry_unused, i);
>>> +	return sum < 0 ? 0 : sum;
>>> +}
>>
>> Just checking...  If cpu x is removed, then its per cpu nr_dentry_unused
>> count survives so we don't leak nr_dentry_unused.  Right?  I see code in
>> percpu_counter_sum_positive() to explicitly handle this case and I want
>> to make sure we don't need it here.
> 
> DEFINE_PER_CPU() gives a variable per possible CPU, and we sum for
> all possible CPUs. Therefore online/offline CPUs just don't matter.
> 
> The percpu_counter code uses for_each_online_cpu(), and so it has to
> be aware of hotplug operations so taht it doesn't leak counts.
> 
It is an unsigned quantity, however. Can't we go negative if it becomes
unused in one cpu, but used in another?
Ex:
nr_unused/0: 0
nr_unused/1: 0
dentry goes to the LRU at cpu 1:
nr_unused/0: 0
nr_unused/1: 1
CPU 1 goes down:
nr_unused/0: 0
dentry goes out of the LRU at cpu 0:
nr_unused/0: 1 << 32.
That would easily be fixed by using a normal signed long, and is in fact
what the percpu code does in its internal operations.
Any reason not to do it? Something I am not seeing?
^ permalink raw reply	[flat|nested] 97+ messages in thread
* Re: [PATCH v2 10/28] dcache: convert to use new lru list infrastructure
  2013-03-29  9:13 ` [PATCH v2 10/28] dcache: convert to use new lru list infrastructure Glauber Costa
@ 2013-04-08 13:14   ` Glauber Costa
  2013-04-08 23:28     ` Dave Chinner
  0 siblings, 1 reply; 97+ messages in thread
From: Glauber Costa @ 2013-04-08 13:14 UTC (permalink / raw)
  To: linux-mm
  Cc: hughd, containers, Dave Chinner, Dave Shrinnker, Michal Hocko,
	Johannes Weiner, linux-fsdevel, Andrew Morton
[-- Attachment #1: Type: text/plain, Size: 536 bytes --]
On 03/29/2013 01:13 PM, Glauber Costa wrote:
> +	if (dentry->d_flags & DCACHE_REFERENCED) {
> +		dentry->d_flags &= ~DCACHE_REFERENCED;
> +		spin_unlock(&dentry->d_lock);
> +
> +		/*
> +		 * XXX: this list move should be be done under d_lock. Need to
> +		 * determine if it is safe just to do it under the lru lock.
> +		 */
> +		return 1;
> +	}
I've carefully audited the list manipulations in dcache and determined
this is safe. I've replaced the fixme string with the following text. Let
me know if you believe this is not right.
[-- Attachment #2: comment --]
[-- Type: text/plain, Size: 1302 bytes --]
diff --git a/fs/dcache.c b/fs/dcache.c
index a2fc76e..8e166a4 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -855,8 +855,23 @@ dentry_lru_isolate(struct list_head *item, spinlock_t *lru_lock, void *arg)
 		spin_unlock(&dentry->d_lock);
 
 		/*
-		 * XXX: this list move should be be done under d_lock. Need to
-		 * determine if it is safe just to do it under the lru lock.
+		 * The list move itself will be made by the common LRU code. At
+		 * this point, we've dropped the dentry->d_lock but keep the
+		 * lru lock. This is safe to do, since every list movement is
+		 * protected by the lru lock even if both locks are held.
+		 *
+		 * This is guaranteed by the fact that all LRU management
+		 * functions are intermediated by the LRU API calls like
+		 * list_lru_add and list_lru_del. List movement in this file
+		 * only ever occur through this functions or through callbacks
+		 * like this one, that are called from the LRU API.
+		 *
+		 * The only exceptions to this are functions like
+		 * shrink_dentry_list, and code that first checks for the
+		 * DCACHE_SHRINK_LIST flag.  Those are guaranteed to be
+		 * operating only with stack provided lists after they are
+		 * properly isolated from the main list.  It is thus, always a
+		 * local access.
 		 */
 		return LRU_ROTATE;
 	}
^ permalink raw reply related	[flat|nested] 97+ messages in thread
* Re: [PATCH v2 03/28] dcache: convert dentry_stat.nr_unused to per-cpu counters
  2013-04-08  9:14       ` Glauber Costa
@ 2013-04-08 13:18         ` Glauber Costa
  2013-04-08 23:26         ` Dave Chinner
  1 sibling, 0 replies; 97+ messages in thread
From: Glauber Costa @ 2013-04-08 13:18 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Greg Thelen, linux-mm, linux-fsdevel, containers, Michal Hocko,
	Johannes Weiner, kamezawa.hiroyu, Andrew Morton, hughd, yinghan,
	Dave Chinner
[-- Attachment #1: Type: text/plain, Size: 2565 bytes --]
On 04/08/2013 01:14 PM, Glauber Costa wrote:
> On 04/05/2013 05:15 AM, Dave Chinner wrote:
>> On Thu, Apr 04, 2013 at 06:09:31PM -0700, Greg Thelen wrote:
>>> On Fri, Mar 29 2013, Glauber Costa wrote:
>>>
>>>> From: Dave Chinner <dchinner@redhat.com>
>>>>
>>>> Before we split up the dcache_lru_lock, the unused dentry counter
>>>> needs to be made independent of the global dcache_lru_lock. Convert
>>>> it to per-cpu counters to do this.
>>>>
>>>> Signed-off-by: Dave Chinner <dchinner@redhat.com>
>>>> Reviewed-by: Christoph Hellwig <hch@lst.de>
>>>> ---
>>>>  fs/dcache.c | 17 ++++++++++++++---
>>>>  1 file changed, 14 insertions(+), 3 deletions(-)
>>>>
>>>> diff --git a/fs/dcache.c b/fs/dcache.c
>>>> index fbfae008..f1196f2 100644
>>>> --- a/fs/dcache.c
>>>> +++ b/fs/dcache.c
>>>> @@ -118,6 +118,7 @@ struct dentry_stat_t dentry_stat = {
>>>>  };
>>>>  
>>>>  static DEFINE_PER_CPU(unsigned int, nr_dentry);
>>>> +static DEFINE_PER_CPU(unsigned int, nr_dentry_unused);
>>>>  
>>>>  #if defined(CONFIG_SYSCTL) && defined(CONFIG_PROC_FS)
>>>>  static int get_nr_dentry(void)
>>>> @@ -129,10 +130,20 @@ static int get_nr_dentry(void)
>>>>  	return sum < 0 ? 0 : sum;
>>>>  }
>>>>  
>>>> +static int get_nr_dentry_unused(void)
>>>> +{
>>>> +	int i;
>>>> +	int sum = 0;
>>>> +	for_each_possible_cpu(i)
>>>> +		sum += per_cpu(nr_dentry_unused, i);
>>>> +	return sum < 0 ? 0 : sum;
>>>> +}
>>>
>>> Just checking...  If cpu x is removed, then its per cpu nr_dentry_unused
>>> count survives so we don't leak nr_dentry_unused.  Right?  I see code in
>>> percpu_counter_sum_positive() to explicitly handle this case and I want
>>> to make sure we don't need it here.
>>
>> DEFINE_PER_CPU() gives a variable per possible CPU, and we sum for
>> all possible CPUs. Therefore online/offline CPUs just don't matter.
>>
>> The percpu_counter code uses for_each_online_cpu(), and so it has to
>> be aware of hotplug operations so taht it doesn't leak counts.
>>
> 
> It is an unsigned quantity, however. Can't we go negative if it becomes
> unused in one cpu, but used in another?
> 
> Ex:
> 
> nr_unused/0: 0
> nr_unused/1: 0
> 
> dentry goes to the LRU at cpu 1:
> nr_unused/0: 0
> nr_unused/1: 1
> 
> CPU 1 goes down:
> nr_unused/0: 0
> 
> dentry goes out of the LRU at cpu 0:
> nr_unused/0: 1 << 32.
> 
> That would easily be fixed by using a normal signed long, and is in fact
> what the percpu code does in its internal operations.
> 
> Any reason not to do it? Something I am not seeing?
Unless you have objections, I will fold the following patch into this one:
[-- Attachment #2: signed --]
[-- Type: text/plain, Size: 792 bytes --]
diff --git a/fs/dcache.c b/fs/dcache.c
index 8e166a4..c7cd9ee 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -118,7 +118,14 @@ struct dentry_stat_t dentry_stat = {
 };
 
 static DEFINE_PER_CPU(unsigned int, nr_dentry);
-static DEFINE_PER_CPU(unsigned int, nr_dentry_unused);
+/*
+ * The total counts for nr_dentry_unused are hotplug-safe, since we always loop
+ * through all possible cpus. It is quite possible, though, that the counters
+ * go negative.  That could easily happen for a dentry that is marked unused in
+ * one CPU but decrements that count after being preempted to another CPU.
+ * Therefore, we must use a signed quantity in here.
+ */
+static DEFINE_PER_CPU(long , nr_dentry_unused);
 
 #if defined(CONFIG_SYSCTL) && defined(CONFIG_PROC_FS)
 static int get_nr_dentry(void)
^ permalink raw reply related	[flat|nested] 97+ messages in thread
* Re: [PATCH v2 03/28] dcache: convert dentry_stat.nr_unused to per-cpu counters
  2013-04-08  9:14       ` Glauber Costa
  2013-04-08 13:18         ` Glauber Costa
@ 2013-04-08 23:26         ` Dave Chinner
  2013-04-09  8:02           ` Glauber Costa
  1 sibling, 1 reply; 97+ messages in thread
From: Dave Chinner @ 2013-04-08 23:26 UTC (permalink / raw)
  To: Glauber Costa
  Cc: Greg Thelen, linux-mm, linux-fsdevel, containers, Michal Hocko,
	Johannes Weiner, kamezawa.hiroyu, Andrew Morton, hughd, yinghan,
	Dave Chinner
On Mon, Apr 08, 2013 at 01:14:48PM +0400, Glauber Costa wrote:
> On 04/05/2013 05:15 AM, Dave Chinner wrote:
> > On Thu, Apr 04, 2013 at 06:09:31PM -0700, Greg Thelen wrote:
> >> On Fri, Mar 29 2013, Glauber Costa wrote:
> >>
> >>> From: Dave Chinner <dchinner@redhat.com>
> >>>
> >>> Before we split up the dcache_lru_lock, the unused dentry counter
> >>> needs to be made independent of the global dcache_lru_lock. Convert
> >>> it to per-cpu counters to do this.
> >>>
> >>> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> >>> Reviewed-by: Christoph Hellwig <hch@lst.de>
> >>> ---
> >>>  fs/dcache.c | 17 ++++++++++++++---
> >>>  1 file changed, 14 insertions(+), 3 deletions(-)
> >>>
> >>> diff --git a/fs/dcache.c b/fs/dcache.c
> >>> index fbfae008..f1196f2 100644
> >>> --- a/fs/dcache.c
> >>> +++ b/fs/dcache.c
> >>> @@ -118,6 +118,7 @@ struct dentry_stat_t dentry_stat = {
> >>>  };
> >>>  
> >>>  static DEFINE_PER_CPU(unsigned int, nr_dentry);
> >>> +static DEFINE_PER_CPU(unsigned int, nr_dentry_unused);
> >>>  
> >>>  #if defined(CONFIG_SYSCTL) && defined(CONFIG_PROC_FS)
> >>>  static int get_nr_dentry(void)
> >>> @@ -129,10 +130,20 @@ static int get_nr_dentry(void)
> >>>  	return sum < 0 ? 0 : sum;
> >>>  }
> >>>  
> >>> +static int get_nr_dentry_unused(void)
> >>> +{
> >>> +	int i;
> >>> +	int sum = 0;
> >>> +	for_each_possible_cpu(i)
> >>> +		sum += per_cpu(nr_dentry_unused, i);
> >>> +	return sum < 0 ? 0 : sum;
> >>> +}
> >>
> >> Just checking...  If cpu x is removed, then its per cpu nr_dentry_unused
> >> count survives so we don't leak nr_dentry_unused.  Right?  I see code in
> >> percpu_counter_sum_positive() to explicitly handle this case and I want
> >> to make sure we don't need it here.
> > 
> > DEFINE_PER_CPU() gives a variable per possible CPU, and we sum for
> > all possible CPUs. Therefore online/offline CPUs just don't matter.
> > 
> > The percpu_counter code uses for_each_online_cpu(), and so it has to
> > be aware of hotplug operations so taht it doesn't leak counts.
> 
> It is an unsigned quantity, however. Can't we go negative if it becomes
> unused in one cpu, but used in another?
Sure, but it's unsigned for the purposes of summing, not for the
purposes of having pos/neg values - they are just delta counters.
I'm just copying the code from fs/inode.c. I originally implemented
the fs/inode.c code using generic per-cpu counters, but there was
a hissy fit over "too much overhead" and so someone implemented
their own lightweight version. I've just copied the existing code
because I don't care to revisit this....
> Ex:
> 
> nr_unused/0: 0
> nr_unused/1: 0
> 
> dentry goes to the LRU at cpu 1:
> nr_unused/0: 0
> nr_unused/1: 1
> 
> CPU 1 goes down:
> nr_unused/0: 0
why?
> dentry goes out of the LRU at cpu 0:
> nr_unused/0: 1 << 32.
Sorry, where does that shift come from? Pulling from the LRU is just
a simple subtraction. (i.e. 0 - 1 = 0xffffffff), and so
when we sum them all up:
nr_unused/1: 1
nr_unused/0: -1 (0xffffffff)
sum = 1 + 0xffffffff = 0
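The wraparound argument can be checked in user space. This is only a sketch
under assumptions: two possible CPUs, uint32_t standing in for the per-cpu
unsigned int, and an unsigned accumulator with an explicit clamp in place of
the kernel's int sum and "sum < 0 ? 0 : sum". sum_deltas() and example_sum()
are hypothetical names, not kernel functions.

```c
#include <assert.h>
#include <stdint.h>

#define NR_CPUS_MODEL 2	/* assumption: two possible CPUs */

/*
 * Sum per-cpu delta counters over all possible CPUs. Unsigned addition
 * wraps modulo 2^32, so a counter that went "negative" on one CPU cancels
 * against the matching increment on another.
 */
int sum_deltas(const uint32_t *percpu, int nr_cpus)
{
	uint32_t sum = 0;
	int i;

	assert(nr_cpus > 0);
	for (i = 0; i < nr_cpus; i++)
		sum += percpu[i];

	/* Report 0 for totals that would read as negative when signed. */
	return sum > INT32_MAX ? 0 : (int)sum;
}

/* The scenario from this thread: added on CPU 1, removed on CPU 0. */
int example_sum(void)
{
	uint32_t percpu[NR_CPUS_MODEL] = { 0, 0 };

	percpu[1] += 1;	/* dentry goes onto the LRU on CPU 1 */
	percpu[0] -= 1;	/* dentry leaves the LRU on CPU 0: 0xffffffff */

	return sum_deltas(percpu, NR_CPUS_MODEL);
}
```

example_sum() returns 0, matching the sum worked out above. Individual
counters may hold wrapped values, but only the total is ever reported.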
> That would easily be fixed by using a normal signed long, and is in fact
> what the percpu code does in its internal operations.
Changing it to a long means it becomes a 64 bit value on 64 bit
machines (doubling memory usage), and now you're summing 64 bit
values into a 32 bit integer. Something else to go wrong....
> Any reason not to do it? Something I am not seeing?
It's a direct copy of the counting code in fs/inode.c. That has not
demonstrated any problems in all my monitoring for the past couple
of years (these are userspace visible stats) so AFAICT this code
is just fine...
Cheers,
Dave.
-- 
Dave Chinner
david@fromorbit.com
^ permalink raw reply	[flat|nested] 97+ messages in thread
* Re: [PATCH v2 10/28] dcache: convert to use new lru list infrastructure
  2013-04-08 13:14   ` Glauber Costa
@ 2013-04-08 23:28     ` Dave Chinner
  0 siblings, 0 replies; 97+ messages in thread
From: Dave Chinner @ 2013-04-08 23:28 UTC (permalink / raw)
  To: Glauber Costa
  Cc: linux-mm, hughd, containers, Dave Chinner, Michal Hocko,
	Johannes Weiner, linux-fsdevel, Andrew Morton
On Mon, Apr 08, 2013 at 05:14:44PM +0400, Glauber Costa wrote:
> On 03/29/2013 01:13 PM, Glauber Costa wrote:
> > +	if (dentry->d_flags & DCACHE_REFERENCED) {
> > +		dentry->d_flags &= ~DCACHE_REFERENCED;
> > +		spin_unlock(&dentry->d_lock);
> > +
> > +		/*
> > +		 * XXX: this list move should be be done under d_lock. Need to
> > +		 * determine if it is safe just to do it under the lru lock.
> > +		 */
> > +		return 1;
> > +	}
> 
> I've carefully audited the list manipulations in dcache and determined
> this is safe. I've replaced the fixme string for the following text. Let
> me know if you believe this is not right.
....
> diff --git a/fs/dcache.c b/fs/dcache.c
> index a2fc76e..8e166a4 100644
> --- a/fs/dcache.c
> +++ b/fs/dcache.c
> @@ -855,8 +855,23 @@ dentry_lru_isolate(struct list_head *item, spinlock_t *lru_lock, void *arg)
>  		spin_unlock(&dentry->d_lock);
>  
>  		/*
> -		 * XXX: this list move should be be done under d_lock. Need to
> -		 * determine if it is safe just to do it under the lru lock.
> +		 * The list move itself will be made by the common LRU code. At
> +		 * this point, we've dropped the dentry->d_lock but keep the
> +		 * lru lock. This is safe to do, since every list movement is
> +		 * protected by the lru lock even if both locks are held.
> +		 *
> +		 * This is guaranteed by the fact that all LRU management
> +		 * functions are intermediated by the LRU API calls like
> +		 * list_lru_add and list_lru_del. List movement in this file
> +		 * only ever occur through this functions or through callbacks
> +		 * like this one, that are called from the LRU API.
> +		 *
> +		 * The only exceptions to this are functions like
> +		 * shrink_dentry_list, and code that first checks for the
> +		 * DCACHE_SHRINK_LIST flag.  Those are guaranteed to be
> +		 * operating only with stack provided lists after they are
> +		 * properly isolated from the main list.  It is thus, always a
> +		 * local access.
>  		 */
>  		return LRU_ROTATE;
It looks correct - I just never got around to doing the audit to
determine it was. Thanks!
Cheers,
Dave.
-- 
Dave Chinner
david@fromorbit.com
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply	[flat|nested] 97+ messages in thread
* Re: [PATCH v2 02/28] vmscan: take at least one pass with shrinkers
  2013-04-08  9:05         ` Glauber Costa
@ 2013-04-09  0:55           ` Joonsoo Kim
  2013-04-09  1:29             ` Dave Chinner
  0 siblings, 1 reply; 97+ messages in thread
From: Joonsoo Kim @ 2013-04-09  0:55 UTC (permalink / raw)
  To: Glauber Costa
  Cc: linux-mm, linux-fsdevel, containers, Michal Hocko,
	Johannes Weiner, kamezawa.hiroyu, Andrew Morton, Dave Shrinnker,
	Greg Thelen, hughd, yinghan, Theodore Ts'o, Al Viro
Hello, Glauber.
On Mon, Apr 08, 2013 at 01:05:59PM +0400, Glauber Costa wrote:
> On 04/08/2013 01:01 PM, Joonsoo Kim wrote:
> > On Mon, Apr 08, 2013 at 12:47:14PM +0400, Glauber Costa wrote:
> >> On 04/08/2013 12:42 PM, Joonsoo Kim wrote:
> >>> Hello, Glauber.
> >>>
> >>> On Fri, Mar 29, 2013 at 01:13:44PM +0400, Glauber Costa wrote:
> >>>> In very low free kernel memory situations, it may be the case that we
> >>>> have less objects to free than our initial batch size. If this is the
> >>>> case, it is better to shrink those, and open space for the new workload
> >>>> then to keep them and fail the new allocations.
> >>>>
> >>>> More specifically, this happens because we encode this in a loop with
> >>>> the condition: "while (total_scan >= batch_size)". So if we are in such
> >>>> a case, we'll not even enter the loop.
> >>>>
> >>>> This patch modifies turns it into a do () while {} loop, that will
> >>>> guarantee that we scan it at least once, while keeping the behaviour
> >>>> exactly the same for the cases in which total_scan > batch_size.
> >>>
> >>> Current user of shrinker not only use their own condition, but also
> >>> use batch_size and seeks to throttle their behavior. So IMHO,
> >>> this behavior change is very dangerous to some users.
> >>>
> >>> For example, think lowmemorykiller.
> >>> With this patch, he always kill some process whenever shrink_slab() is
> >>> called and their low memory condition is satisfied.
> >>> Before this, total_scan also prevent us to go into lowmemorykiller, so
> >>> killing innocent process is limited as much as possible.
> >>>
> >> shrinking is part of the normal operation of the Linux kernel and
> >> happens all the time. Not only the call to shrink_slab, but actual
> >> shrinking of unused objects.
> >>
> >> I don't know therefore about any code that would kill process only
> >> because they have reached shrink_slab.
> >>
> >> In normal systems, this loop will be executed many, many times. So we're
> >> not shrinking *more*, we're just guaranteeing that at least one pass
> >> will be made.
> > 
> > This one pass guarantee is a problem for lowmemory killer.
> > 
> >> Also, anyone looking at this to see if we should kill processes, is a
> >> lot more likely to kill something if we tried to shrink but didn't, than
> >> if we successfully shrunk something.
> > 
> > lowmemory killer is hacky user of shrink_slab interface.
> 
> Well, it says it all =)
> 
> In special, I really can't see how, hacky or not, it makes sense to kill
> a process if we *actually* shrunk memory.
> 
> Moreover, I don't see the code in drivers/staging/android/lowmemory.c
> doing anything even remotely close to that. Could you point me to some
> code that does it ?
Sorry for the late reply. :)
lowmemkiller frees up memory by killing a task.
Below is code from lowmem_shrink() in lowmemorykiller.c:
        for (i = 0; i < array_size; i++) {
                if (other_free < lowmem_minfree[i] &&
                    other_file < lowmem_minfree[i]) {
                        min_score_adj = lowmem_adj[i];
                        break;
                }   
        } 
lowmemkiller kills a process if min_score_adj is assigned.
It then walks the for_each_process() loop to select a target task,
and executes the code below.
        if (selected) {
		...
                send_sig(SIGKILL, selected, 0);
                set_tsk_thread_flag(selected, TIF_MEMDIE);
		...
        }
lowmemkiller only checks whether sc->nr_to_scan is 0 or not, and it performs
no further check on it. So if we run do_shrinker_shrink() at least once
without checking batch_size, there will be a side effect.
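The loop change under discussion can be illustrated with a minimal userspace sketch. The helper names passes_while() and passes_do_while() are hypothetical and are not kernel code; they only count how often the scan callback would be invoked for a given total_scan, assuming the loop body subtracts batch_size once per pass.

```c
#include <assert.h>

/*
 * Hypothetical helpers (not kernel code): count how many times the scan
 * callback would be invoked for a given total_scan and batch_size.
 */

/* Loop shape before the patch: nothing runs when total_scan < batch_size. */
static int passes_while(long total_scan, long batch_size)
{
	int passes = 0;

	assert(batch_size > 0);
	while (total_scan >= batch_size) {
		passes++;		/* stands in for do_shrinker_shrink() */
		total_scan -= batch_size;
	}
	return passes;
}

/* Loop shape after the patch: the body always runs at least once. */
static int passes_do_while(long total_scan, long batch_size)
{
	int passes = 0;

	assert(batch_size > 0);
	do {
		passes++;		/* stands in for do_shrinker_shrink() */
		total_scan -= batch_size;
	} while (total_scan >= batch_size);
	return passes;
}
```

With total_scan = 127 and batch_size = 128 the first helper makes no pass and the second makes one; for any total_scan at or above batch_size the two agree.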
Thanks.
^ permalink raw reply	[flat|nested] 97+ messages in thread
* Re: [PATCH v2 02/28] vmscan: take at least one pass with shrinkers
  2013-04-09  0:55           ` Joonsoo Kim
@ 2013-04-09  1:29             ` Dave Chinner
  2013-04-09  2:05               ` Joonsoo Kim
  0 siblings, 1 reply; 97+ messages in thread
From: Dave Chinner @ 2013-04-09  1:29 UTC (permalink / raw)
  To: Joonsoo Kim
  Cc: Glauber Costa, linux-mm, linux-fsdevel, containers, Michal Hocko,
	Johannes Weiner, kamezawa.hiroyu, Andrew Morton, Greg Thelen,
	hughd, yinghan, Theodore Ts'o, Al Viro
On Tue, Apr 09, 2013 at 09:55:47AM +0900, Joonsoo Kim wrote:
> Hello, Glauber.
> 
> On Mon, Apr 08, 2013 at 01:05:59PM +0400, Glauber Costa wrote:
> > On 04/08/2013 01:01 PM, Joonsoo Kim wrote:
> > > On Mon, Apr 08, 2013 at 12:47:14PM +0400, Glauber Costa wrote:
> > >> On 04/08/2013 12:42 PM, Joonsoo Kim wrote:
> > >>> Hello, Glauber.
> > >>>
> > >>> On Fri, Mar 29, 2013 at 01:13:44PM +0400, Glauber Costa wrote:
> > >>>> In very low free kernel memory situations, it may be the case that we
> > >>>> have less objects to free than our initial batch size. If this is the
> > >>>> case, it is better to shrink those, and open space for the new workload
> > >>>> then to keep them and fail the new allocations.
> > >>>>
> > >>>> More specifically, this happens because we encode this in a loop with
> > >>>> the condition: "while (total_scan >= batch_size)". So if we are in such
> > >>>> a case, we'll not even enter the loop.
> > >>>>
> > >>>> This patch modifies turns it into a do () while {} loop, that will
> > >>>> guarantee that we scan it at least once, while keeping the behaviour
> > >>>> exactly the same for the cases in which total_scan > batch_size.
> > >>>
> > >>> Current user of shrinker not only use their own condition, but also
> > >>> use batch_size and seeks to throttle their behavior. So IMHO,
> > >>> this behavior change is very dangerous to some users.
> > >>>
> > >>> For example, think lowmemorykiller.
> > >>> With this patch, he always kill some process whenever shrink_slab() is
> > >>> called and their low memory condition is satisfied.
> > >>> Before this, total_scan also prevent us to go into lowmemorykiller, so
> > >>> killing innocent process is limited as much as possible.
> > >>>
> > >> shrinking is part of the normal operation of the Linux kernel and
> > >> happens all the time. Not only the call to shrink_slab, but actual
> > >> shrinking of unused objects.
> > >>
> > >> I don't know therefore about any code that would kill process only
> > >> because they have reached shrink_slab.
> > >>
> > >> In normal systems, this loop will be executed many, many times. So we're
> > >> not shrinking *more*, we're just guaranteeing that at least one pass
> > >> will be made.
> > > 
> > > This one pass guarantee is a problem for lowmemory killer.
> > > 
> > >> Also, anyone looking at this to see if we should kill processes, is a
> > >> lot more likely to kill something if we tried to shrink but didn't, than
> > >> if we successfully shrunk something.
> > > 
> > > lowmemory killer is hacky user of shrink_slab interface.
> > 
> > Well, it says it all =)
> > 
> > In special, I really can't see how, hacky or not, it makes sense to kill
> > a process if we *actually* shrunk memory.
> > 
> > Moreover, I don't see the code in drivers/staging/android/lowmemory.c
> > doing anything even remotely close to that. Could you point me to some
> > code that does it ?
> 
> Sorry for late. :)
> 
> lowmemkiller makes spare memory via killing a task.
> 
> Below is code from lowmem_shrink() in lowmemorykiller.c
> 
>         for (i = 0; i < array_size; i++) {
>                 if (other_free < lowmem_minfree[i] &&
>                     other_file < lowmem_minfree[i]) {
>                         min_score_adj = lowmem_adj[i];
>                         break;
>                 }   
>         } 
I don't think you understand what the current lowmemkiller shrinker
hackery actually does.
        rem = global_page_state(NR_ACTIVE_ANON) +
                global_page_state(NR_ACTIVE_FILE) +
                global_page_state(NR_INACTIVE_ANON) +
                global_page_state(NR_INACTIVE_FILE);
        if (sc->nr_to_scan <= 0 || min_score_adj == OOM_SCORE_ADJ_MAX + 1) {
                lowmem_print(5, "lowmem_shrink %lu, %x, return %d\n",
                             sc->nr_to_scan, sc->gfp_mask, rem);
                return rem;
        }
So, when nr_to_scan == 0 (i.e. the count phase), the shrinker is
going to return a count of active/inactive pages in the cache. That
is almost always going to be non-zero, and almost always be > 1000
because of the minimum working set needed to run the system.
Even after applying the seek count adjustment, total_scan is almost
always going to be larger than the shrinker default batch size of
128, and that means this shrinker will almost always run at least
once per shrink_slab() call.
And, interestingly enough, when the file cache has been pruned down
to its smallest possible size, that's when the shrinker *won't run*
because that's when total_scan will be smaller than the
batch size and hence the shrinker won't get called.
The shrinker is hacky, abuses the shrinker API, and doesn't appear
to do what it is intended to do.  You need to fix the shrinker, not
use its brokenness as an excuse to hold up a long overdue shrinker
rework.
Cheers,
Dave.
-- 
Dave Chinner
david@fromorbit.com
^ permalink raw reply	[flat|nested] 97+ messages in thread
* Re: [PATCH v2 02/28] vmscan: take at least one pass with shrinkers
  2013-04-09  1:29             ` Dave Chinner
@ 2013-04-09  2:05               ` Joonsoo Kim
  2013-04-09  7:43                 ` Glauber Costa
  2013-04-09 12:30                 ` Dave Chinner
  0 siblings, 2 replies; 97+ messages in thread
From: Joonsoo Kim @ 2013-04-09  2:05 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Glauber Costa, linux-mm, linux-fsdevel, containers, Michal Hocko,
	Johannes Weiner, kamezawa.hiroyu, Andrew Morton, Greg Thelen,
	hughd, yinghan, Theodore Ts'o, Al Viro
Hello, Dave.
On Tue, Apr 09, 2013 at 11:29:31AM +1000, Dave Chinner wrote:
> On Tue, Apr 09, 2013 at 09:55:47AM +0900, Joonsoo Kim wrote:
> > Hello, Glauber.
> > 
> > On Mon, Apr 08, 2013 at 01:05:59PM +0400, Glauber Costa wrote:
> > > On 04/08/2013 01:01 PM, Joonsoo Kim wrote:
> > > > On Mon, Apr 08, 2013 at 12:47:14PM +0400, Glauber Costa wrote:
> > > >> On 04/08/2013 12:42 PM, Joonsoo Kim wrote:
> > > >>> Hello, Glauber.
> > > >>>
> > > >>> On Fri, Mar 29, 2013 at 01:13:44PM +0400, Glauber Costa wrote:
> > > >>>> In very low free kernel memory situations, it may be the case that we
> > > >>>> have less objects to free than our initial batch size. If this is the
> > > >>>> case, it is better to shrink those, and open space for the new workload
> > > >>>> then to keep them and fail the new allocations.
> > > >>>>
> > > >>>> More specifically, this happens because we encode this in a loop with
> > > >>>> the condition: "while (total_scan >= batch_size)". So if we are in such
> > > >>>> a case, we'll not even enter the loop.
> > > >>>>
> > > >>>> This patch modifies turns it into a do () while {} loop, that will
> > > >>>> guarantee that we scan it at least once, while keeping the behaviour
> > > >>>> exactly the same for the cases in which total_scan > batch_size.
> > > >>>
> > > >>> Current user of shrinker not only use their own condition, but also
> > > >>> use batch_size and seeks to throttle their behavior. So IMHO,
> > > >>> this behavior change is very dangerous to some users.
> > > >>>
> > > >>> For example, think lowmemorykiller.
> > > >>> With this patch, he always kill some process whenever shrink_slab() is
> > > >>> called and their low memory condition is satisfied.
> > > >>> Before this, total_scan also prevent us to go into lowmemorykiller, so
> > > >>> killing innocent process is limited as much as possible.
> > > >>>
> > > >> shrinking is part of the normal operation of the Linux kernel and
> > > >> happens all the time. Not only the call to shrink_slab, but actual
> > > >> shrinking of unused objects.
> > > >>
> > > >> I don't know therefore about any code that would kill process only
> > > >> because they have reached shrink_slab.
> > > >>
> > > >> In normal systems, this loop will be executed many, many times. So we're
> > > >> not shrinking *more*, we're just guaranteeing that at least one pass
> > > >> will be made.
> > > > 
> > > > This one pass guarantee is a problem for lowmemory killer.
> > > > 
> > > >> Also, anyone looking at this to see if we should kill processes, is a
> > > >> lot more likely to kill something if we tried to shrink but didn't, than
> > > >> if we successfully shrunk something.
> > > > 
> > > > lowmemory killer is hacky user of shrink_slab interface.
> > > 
> > > Well, it says it all =)
> > > 
> > > In special, I really can't see how, hacky or not, it makes sense to kill
> > > a process if we *actually* shrunk memory.
> > > 
> > > Moreover, I don't see the code in drivers/staging/android/lowmemory.c
> > > doing anything even remotely close to that. Could you point me to some
> > > code that does it ?
> > 
> > Sorry for late. :)
> > 
> > lowmemkiller makes spare memory via killing a task.
> > 
> > Below is code from lowmem_shrink() in lowmemorykiller.c
> > 
> >         for (i = 0; i < array_size; i++) {
> >                 if (other_free < lowmem_minfree[i] &&
> >                     other_file < lowmem_minfree[i]) {
> >                         min_score_adj = lowmem_adj[i];
> >                         break;
> >                 }   
> >         } 
> 
> I don't think you understand what the current lowmemkiller shrinker
> hackery actually does.
> 
>         rem = global_page_state(NR_ACTIVE_ANON) +
>                 global_page_state(NR_ACTIVE_FILE) +
>                 global_page_state(NR_INACTIVE_ANON) +
>                 global_page_state(NR_INACTIVE_FILE);
>         if (sc->nr_to_scan <= 0 || min_score_adj == OOM_SCORE_ADJ_MAX + 1) {
>                 lowmem_print(5, "lowmem_shrink %lu, %x, return %d\n",
>                              sc->nr_to_scan, sc->gfp_mask, rem);
>                 return rem;
>         }
> 
> So, when nr_to_scan == 0 (i.e. the count phase), the shrinker is
> going to return a count of active/inactive pages in the cache. That
> is almost always going to be non-zero, and almost always be > 1000
> because of the minimum working set needed to run the system.
> Even after applying the seek count adjustment, total_scan is almost
> always going to be larger than the shrinker default batch size of
> 128, and that means this shrinker will almost always run at least
> once per shrink_slab() call.
I don't think so.
Yes, lowmem_shrink() returns the number of (in)active LRU pages
when nr_to_scan is 0. And in shrink_slab(), we divide it by lru_pages.
lru_pages can vary depending on where shrink_slab() is called from, so
perhaps this logic brings total_scan below 128.
> 
> And, interestingly enough, when the file cache has been pruned down
> to it's smallest possible size, that's when the shrinker *won't run*
> because the that's when the total_scan will be smaller than the
> batch size and hence shrinker won't get called.
> 
> The shrinker is hacky, abuses the shrinker API, and doesn't appear
> to do what it is intended to do.  You need to fix the shrinker, not
> use it's brokenness as an excuse to hold up a long overdue shrinker
> rework.
Agreed. I also think the shrinker rework is valuable and I don't want
to block this change. But, IMHO, we should at least notify users of
the shrinker API about how its behavior has changed, because this is an
unexpected behavior change for them.
When they wrote code against this API, they could assume that it is
possible to control the logic with seeks and the return value (when
nr_to_scan=0), but with this patch, this assumption is broken.
Thanks.
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
> 
^ permalink raw reply	[flat|nested] 97+ messages in thread
* Re: [PATCH v2 02/28] vmscan: take at least one pass with shrinkers
  2013-04-09  2:05               ` Joonsoo Kim
@ 2013-04-09  7:43                 ` Glauber Costa
  2013-04-09  9:08                   ` Joonsoo Kim
  2013-04-09 12:30                 ` Dave Chinner
  1 sibling, 1 reply; 97+ messages in thread
From: Glauber Costa @ 2013-04-09  7:43 UTC (permalink / raw)
  To: Joonsoo Kim
  Cc: Dave Chinner, linux-mm, linux-fsdevel, containers, Michal Hocko,
	Johannes Weiner, kamezawa.hiroyu, Andrew Morton, Greg Thelen,
	hughd, yinghan, Theodore Ts'o, Al Viro
On 04/09/2013 06:05 AM, Joonsoo Kim wrote:
> I don't think so.
> Yes, lowmem_shrink() return number of (in)active lru pages
> when nr_to_scan is 0. And in shrink_slab(), we divide it by lru_pages.
> lru_pages can vary where shrink_slab() is called, anyway, perhaps this
> logic makes total_scan below 128.
> 
You may benefit from looking at the lowmemory patches in this patchset
itself. We modified the shrinker API to separate the count and scan
phases. With this, the whole nr_to_scan == 0 special case disappears and
the code gets easier to follow.
>> > 
>> > And, interestingly enough, when the file cache has been pruned down
>> > to it's smallest possible size, that's when the shrinker *won't run*
>> > because the that's when the total_scan will be smaller than the
>> > batch size and hence shrinker won't get called.
>> > 
>> > The shrinker is hacky, abuses the shrinker API, and doesn't appear
>> > to do what it is intended to do.  You need to fix the shrinker, not
>> > use it's brokenness as an excuse to hold up a long overdue shrinker
>> > rework.
> Agreed. I also think shrinker rework is valuable and I don't want
> to become a stopper for this change. But, IMHO, at least, we should
> notify users of shrinker API to know how shrinker API behavior changed,
Except that the behavior didn't change.
> because this is unexpected behavior change when they used this API.
> When they used this API, they can assume that it is possible to control
> logic with seeks and return value(when nr_to_scan=0), but with this patch,
> this assumption is broken.
> 
Joonsoo, you are still missing the point. nr_to_scan=0 has nothing to do
with this, or with this patch. nr_to_scan will reach 0 ANYWAY if you
shrink all objects you have to shrink, which is a *very* common thing to
happen.
The only case changed here is when this happens while attempting to
shrink a number of objects that is smaller than the batch size.
Also, again, the nr_to_scan=0 checks in the shrinker calls have nothing
to do with that. They reflect the situation *BEFORE* the shrinker was
called. So how many objects we shrink afterwards has zero to do with
it. This is just the shrinker API using the magic value of 0 to mean:
"don't shrink, just tell me how many you have", vs a positive number
meaning "try to shrink as many as nr_to_scan objects".
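The two meanings of nr_to_scan described above can be sketched in a few lines of userspace C. Everything here (toy_shrink_control, toy_cache, toy_shrink) is a hypothetical stand-in, not the kernel's struct shrink_control or any real shrinker; it only assumes the old combined callback convention of returning the number of objects left in the cache.

```c
#include <assert.h>

/* Hypothetical stand-ins for struct shrink_control and a cache. */
struct toy_shrink_control {
	unsigned long nr_to_scan;	/* 0 means "count only" */
};

struct toy_cache {
	unsigned long nr_objects;	/* freeable objects currently cached */
};

/*
 * Old-style combined callback: returns the number of freeable objects
 * left in the cache. With nr_to_scan == 0 it only reports; otherwise it
 * frees up to nr_to_scan objects first.
 */
static unsigned long toy_shrink(struct toy_cache *cache,
				const struct toy_shrink_control *sc)
{
	unsigned long to_free;

	assert(cache && sc);
	if (sc->nr_to_scan == 0)
		return cache->nr_objects;	/* count phase: no side effects */

	to_free = sc->nr_to_scan < cache->nr_objects ?
		  sc->nr_to_scan : cache->nr_objects;
	cache->nr_objects -= to_free;		/* scan phase: actually free */
	return cache->nr_objects;
}
```

The zero check looks only at the request that came in, so it says nothing about how many objects a later scan call ends up freeing.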
^ permalink raw reply	[flat|nested] 97+ messages in thread
* Re: [PATCH v2 03/28] dcache: convert dentry_stat.nr_unused to per-cpu counters
  2013-04-08 23:26         ` Dave Chinner
@ 2013-04-09  8:02           ` Glauber Costa
  2013-04-09 12:47             ` Dave Chinner
  0 siblings, 1 reply; 97+ messages in thread
From: Glauber Costa @ 2013-04-09  8:02 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Greg Thelen, linux-mm, linux-fsdevel, containers, Michal Hocko,
	Johannes Weiner, kamezawa.hiroyu, Andrew Morton, hughd, yinghan,
	Dave Chinner
On 04/09/2013 03:26 AM, Dave Chinner wrote:
> On Mon, Apr 08, 2013 at 01:14:48PM +0400, Glauber Costa wrote:
>> On 04/05/2013 05:15 AM, Dave Chinner wrote:
>>> On Thu, Apr 04, 2013 at 06:09:31PM -0700, Greg Thelen wrote:
>>>> On Fri, Mar 29 2013, Glauber Costa wrote:
>>>>
>>>>> From: Dave Chinner <dchinner@redhat.com>
>>>>>
>>>>> Before we split up the dcache_lru_lock, the unused dentry counter
>>>>> needs to be made independent of the global dcache_lru_lock. Convert
>>>>> it to per-cpu counters to do this.
>>>>>
>>>>> Signed-off-by: Dave Chinner <dchinner@redhat.com>
>>>>> Reviewed-by: Christoph Hellwig <hch@lst.de>
>>>>> ---
>>>>>  fs/dcache.c | 17 ++++++++++++++---
>>>>>  1 file changed, 14 insertions(+), 3 deletions(-)
>>>>>
>>>>> diff --git a/fs/dcache.c b/fs/dcache.c
>>>>> index fbfae008..f1196f2 100644
>>>>> --- a/fs/dcache.c
>>>>> +++ b/fs/dcache.c
>>>>> @@ -118,6 +118,7 @@ struct dentry_stat_t dentry_stat = {
>>>>>  };
>>>>>  
>>>>>  static DEFINE_PER_CPU(unsigned int, nr_dentry);
>>>>> +static DEFINE_PER_CPU(unsigned int, nr_dentry_unused);
>>>>>  
>>>>>  #if defined(CONFIG_SYSCTL) && defined(CONFIG_PROC_FS)
>>>>>  static int get_nr_dentry(void)
>>>>> @@ -129,10 +130,20 @@ static int get_nr_dentry(void)
>>>>>  	return sum < 0 ? 0 : sum;
>>>>>  }
>>>>>  
>>>>> +static int get_nr_dentry_unused(void)
>>>>> +{
>>>>> +	int i;
>>>>> +	int sum = 0;
>>>>> +	for_each_possible_cpu(i)
>>>>> +		sum += per_cpu(nr_dentry_unused, i);
>>>>> +	return sum < 0 ? 0 : sum;
>>>>> +}
>>>>
>>>> Just checking...  If cpu x is removed, then its per cpu nr_dentry_unused
>>>> count survives so we don't leak nr_dentry_unused.  Right?  I see code in
>>>> percpu_counter_sum_positive() to explicitly handle this case and I want
>>>> to make sure we don't need it here.
>>>
>>> DEFINE_PER_CPU() gives a variable per possible CPU, and we sum for
>>> all possible CPUs. Therefore online/offline CPUs just don't matter.
>>>
>>> The percpu_counter code uses for_each_online_cpu(), and so it has to
>>> be aware of hotplug operations so that it doesn't leak counts.
>>
>> It is an unsigned quantity, however. Can't we go negative if it becomes
>> unused in one cpu, but used in another?
> 
> Sure, but it's unsigned for the purposes of summing, not for the
> purposes of having pos/neg values - they are just delta counters.
> 
> I'm just copying the code from fs/inode.c. I originally implemented
> the fs/inode.c code using generic per-cpu counters, but there was
> a hissy fit over "too much overhead" and so someone implemented
> their own lightweight version. I've just copied the existing code to
> code because I don't care to revisit this....
> 
Funnily enough, we reimplement per-cpu counters in memcg as well.
This is mostly related to overhead and the counters' cache layout. Maybe
it is time for a better percpu counter? (not that I have the time for it...)
>> Ex:
>>
>> nr_unused/0: 0
>> nr_unused/1: 0
>>
>> dentry goes to the LRU at cpu 1:
>> nr_unused/0: 0
>> nr_unused/1: 1
>>
>> CPU 1 goes down:
>> nr_unused/0: 0
> 
> why?
> 
>> dentry goes out of the LRU at cpu 0:
>> nr_unused/0: 1 << 32.
> 
> Sorry, where does that shift come from? Pulling from the LRU is just
> a simple subtraction. (i.e. 0 - 1 = 0xffffffff), and so
> when we sum them all up:
> 
> nr_unused/1: 1
> nr_unused/0: -1 (0xffffffff)
> 
> sum = 1 + 0xffffffff = 0
> 
>> That would easily be fixed by using a normal signed long, and is in fact
>> what the percpu code does in its internal operations.
> 
> Changing it to a long means it becomes at 64 bit value on 64 bit
> machines (doubling memory usage), and now you're summing a 64 bit
> values into a 32 bit integer. Something else to go wrong....
> 
>> Any reason not to do it? Something I am not seeing?
> 
> It's a direct copy of the counting code in fs/inode.c. That has not
> demonstrated any problems in all my monitoring for the past couple
> of years  (these are userspace visible stats) so AFAICT this code
> is just fine...
> 
Well, in this case I can revert that.
My main concern is us being caught by overflows and stuff like that. But
I trust the "it's working for x years" more than my eyes.
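The wraparound argument quoted above can be checked with a small userspace sketch. The names toy_nr_unused, toy_lru_add, toy_lru_del and toy_get_nr_unused are hypothetical, a plain array stands in for DEFINE_PER_CPU, and the sum is kept in an unsigned int before the final conversion; this is an illustration of the delta-counter idea, not the fs/dcache.c code.

```c
#include <assert.h>

#define NR_TOY_CPUS 2	/* hypothetical CPU count for the sketch */

/* One unsigned delta counter per CPU; a plain array stands in for per-cpu data. */
static unsigned int toy_nr_unused[NR_TOY_CPUS];

/* A dentry goes onto the LRU on this CPU. */
static void toy_lru_add(int cpu)
{
	assert(cpu >= 0 && cpu < NR_TOY_CPUS);
	toy_nr_unused[cpu]++;
}

/* A dentry leaves the LRU on this CPU; 0 - 1 wraps to UINT_MAX. */
static void toy_lru_del(int cpu)
{
	assert(cpu >= 0 && cpu < NR_TOY_CPUS);
	toy_nr_unused[cpu]--;
}

/* Sum every CPU's delta; unsigned arithmetic is modular, so wraps cancel. */
static int toy_get_nr_unused(void)
{
	unsigned int sum = 0;
	int i;

	for (i = 0; i < NR_TOY_CPUS; i++)
		sum += toy_nr_unused[i];
	return (int)sum < 0 ? 0 : (int)sum;
}
```

Adding on CPU 1 and removing on CPU 0 leaves one counter at 1 and the other at UINT_MAX, and the modular sum is 0, which matches the "1 + 0xffffffff = 0" step in the quoted reply.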
^ permalink raw reply	[flat|nested] 97+ messages in thread
* Re: [PATCH v2 02/28] vmscan: take at least one pass with shrinkers
  2013-04-09  7:43                 ` Glauber Costa
@ 2013-04-09  9:08                   ` Joonsoo Kim
  0 siblings, 0 replies; 97+ messages in thread
From: Joonsoo Kim @ 2013-04-09  9:08 UTC (permalink / raw)
  To: Glauber Costa
  Cc: Dave Chinner, linux-mm, linux-fsdevel, containers, Michal Hocko,
	Johannes Weiner, kamezawa.hiroyu, Andrew Morton, Greg Thelen,
	hughd, yinghan, Theodore Ts'o, Al Viro
Hello, Glauber.
On Tue, Apr 09, 2013 at 11:43:33AM +0400, Glauber Costa wrote:
> On 04/09/2013 06:05 AM, Joonsoo Kim wrote:
> > I don't think so.
> > Yes, lowmem_shrink() return number of (in)active lru pages
> > when nr_to_scan is 0. And in shrink_slab(), we divide it by lru_pages.
> > lru_pages can vary where shrink_slab() is called, anyway, perhaps this
> > logic makes total_scan below 128.
> > 
> You may benefit from looking at the lowmemory patches in this patchset
> itself. We modified the shrinker API to separate the count and scan
> phases. With this, the whole nr_to_scan == 0 disappears and the code
> gets easier to follow.
> 
> >> > 
> >> > And, interestingly enough, when the file cache has been pruned down
> >> > to it's smallest possible size, that's when the shrinker *won't run*
> >> > because the that's when the total_scan will be smaller than the
> >> > batch size and hence shrinker won't get called.
> >> > 
> >> > The shrinker is hacky, abuses the shrinker API, and doesn't appear
> >> > to do what it is intended to do.  You need to fix the shrinker, not
> >> > use it's brokenness as an excuse to hold up a long overdue shrinker
> >> > rework.
> > Agreed. I also think shrinker rework is valuable and I don't want
> > to become a stopper for this change. But, IMHO, at least, we should
> > notify users of shrinker API to know how shrinker API behavior changed,
> 
> Except that the behavior didn't change.
> 
> > because this is unexpected behavior change when they used this API.
> > When they used this API, they can assume that it is possible to control
> > logic with seeks and return value(when nr_to_scan=0), but with this patch,
> > this assumption is broken.
> > 
> 
> Jonsoo, you are still missing the point. nr_to_scan=0 has nothing to do
> with this, or with this patch. nr_to_scan will reach 0 ANYWAY if you
> shrink all objects you have to shrink, which is a *very* common thing to
> happen.
> 
> The only case changed here is where this happen when attempting to
> shrink a small number of objects that is smaller than the batch size.
> 
> Also, again, the nr_to_scan=0 checks in the shrinker calls have nothing
> to do with that. They reflect the situation *BEFORE* the shrinker was
> called. So how many objects we shrunk afterwards have zero to do with
> it. This is just the shrinker API using the magic value of 0 to mean :
> "don't shrink, just tell me how much do you have", vs a positive number
> meaning "try to shrink as much as nr_to_scan objects".
Yes, I know that :)
It seems that I misled you and you misunderstood what I wanted to say.
Sorry for my poor English.
I mean to say that changing what happens when we attempt to shrink a
small number of objects (below the batch size) can affect some users of
the API and their systems. Maybe they assume that if they have only a
few objects, the shrinker will not call do_shrinker_shrink(). But with
this patch, even though they have only a few objects, the shrinker calls
do_shrinker_shrink() at least once.
Thanks.
^ permalink raw reply	[flat|nested] 97+ messages in thread
* Re: [PATCH v2 02/28] vmscan: take at least one pass with shrinkers
  2013-04-09  2:05               ` Joonsoo Kim
  2013-04-09  7:43                 ` Glauber Costa
@ 2013-04-09 12:30                 ` Dave Chinner
  2013-04-10  2:51                   ` Joonsoo Kim
  1 sibling, 1 reply; 97+ messages in thread
From: Dave Chinner @ 2013-04-09 12:30 UTC (permalink / raw)
  To: Joonsoo Kim
  Cc: Glauber Costa, linux-mm, linux-fsdevel, containers, Michal Hocko,
	Johannes Weiner, kamezawa.hiroyu, Andrew Morton, Greg Thelen,
	hughd, yinghan, Theodore Ts'o, Al Viro
On Tue, Apr 09, 2013 at 11:05:05AM +0900, Joonsoo Kim wrote:
> On Tue, Apr 09, 2013 at 11:29:31AM +1000, Dave Chinner wrote:
> > On Tue, Apr 09, 2013 at 09:55:47AM +0900, Joonsoo Kim wrote:
> > > lowmemkiller makes spare memory via killing a task.
> > > 
> > > Below is code from lowmem_shrink() in lowmemorykiller.c
> > > 
> > >         for (i = 0; i < array_size; i++) {
> > >                 if (other_free < lowmem_minfree[i] &&
> > >                     other_file < lowmem_minfree[i]) {
> > >                         min_score_adj = lowmem_adj[i];
> > >                         break;
> > >                 }   
> > >         } 
> > 
> > I don't think you understand what the current lowmemkiller shrinker
> > hackery actually does.
> > 
> >         rem = global_page_state(NR_ACTIVE_ANON) +
> >                 global_page_state(NR_ACTIVE_FILE) +
> >                 global_page_state(NR_INACTIVE_ANON) +
> >                 global_page_state(NR_INACTIVE_FILE);
> >         if (sc->nr_to_scan <= 0 || min_score_adj == OOM_SCORE_ADJ_MAX + 1) {
> >                 lowmem_print(5, "lowmem_shrink %lu, %x, return %d\n",
> >                              sc->nr_to_scan, sc->gfp_mask, rem);
> >                 return rem;
> >         }
> > 
> > So, when nr_to_scan == 0 (i.e. the count phase), the shrinker is
> > going to return a count of active/inactive pages in the cache. That
> > is almost always going to be non-zero, and almost always be > 1000
> > because of the minimum working set needed to run the system.
> > Even after applying the seek count adjustment, total_scan is almost
> > always going to be larger than the shrinker default batch size of
> > 128, and that means this shrinker will almost always run at least
> > once per shrink_slab() call.
> 
> I don't think so.
> Yes, lowmem_shrink() return number of (in)active lru pages
> when nr_to_scan is 0. And in shrink_slab(), we divide it by lru_pages.
> lru_pages can vary where shrink_slab() is called, anyway, perhaps this
> logic makes total_scan below 128.
"perhaps"
There is no "perhaps" here - there is *zero* guarantee of the
behaviour you are claiming the lowmem killer shrinker is dependent
on with the existing shrinker infrastructure. So, lets say we have:
	nr_pages_scanned = 1024
	lru_pages = 100,000
Your shrinker is going to return 100,000 when nr_to_scan = 0. So,
we have:
	batch_size = SHRINK_BATCH = 128
	max_pass= 100,000
	total_scan = shrinker->nr_in_batch = 0
	delta = 4 * 1024 / 32 = 128
	delta = 128 * 100,000 = 12,800,000
	delta = 12,800,000 / 100,001 = 127
	total_scan += delta = 127
Assuming the LRU pages count does not change(*), nr_pages_scanned is
irrelevant and delta always comes in 1 count below the batch size,
and the shrinker is not called. The remainder is then:
	shrinker->nr_in_batch += total_scan = 127
(*) the lru page count will change, because reclaim and shrinkers
run concurrently, and so we can't even make a simple contrived case
where delta is consistently < batch_size here.
Anyway, the next time the shrinker is entered, we start with:
	total_scan = shrinker->nr_in_batch = 127
	.....
	total_scan += delta = 254
	<shrink once, total scan -= batch_size = 126>
	shrinker->nr_in_batch += total_scan = 126
And so on for all the subsequent shrink_slab calls....
IOWs, this algorithm effectively causes the shrinker to be called
127 times out of 128 in this arbitrary scenario. It does not behave
as you are assuming it to, and as such any code based on those
assumptions is broken....
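A rough userspace model of the accounting walked through above, for anyone who wants to replay the numbers. This is only a sketch: the struct and function names are made up for illustration, the clamping that the real shrink_slab() applies to total_scan is left out, and the LRU count is assumed to stay fixed at 100,000. nr_pages_scanned = 1024 is used so that 4 * nr_pages_scanned / 32 is exactly 128 under integer division (with 1000 it truncates to 125).

```c
/* Simplified model of the per-shrinker scan accounting. Illustrative
 * names only; this is not the kernel's shrink_slab(). */
struct model_shrinker {
	long nr_in_batch;	/* work carried over between passes */
	long seeks;		/* 32 in the lowmem killer example */
	long batch;		/* SHRINK_BATCH = 128 in the example */
};

/* One shrink_slab()-style pass. Returns how many times the shrinker
 * callback would have been invoked during this pass. */
static long model_shrink_slab(struct model_shrinker *s, long nr_pages_scanned,
			      long lru_pages, long max_pass)
{
	long long delta;
	long total_scan = s->nr_in_batch;
	long calls = 0;

	s->nr_in_batch = 0;
	delta = (4LL * nr_pages_scanned) / s->seeks;
	delta *= max_pass;
	delta /= lru_pages + 1;
	total_scan += (long)delta;

	while (total_scan >= s->batch) {
		calls++;
		total_scan -= s->batch;
	}
	s->nr_in_batch += total_scan;	/* remainder is deferred */
	return calls;
}

/* Count how many of n consecutive passes invoke the shrinker at all,
 * with lru_pages and max_pass both fixed at 100,000. */
static long passes_that_shrink(long n, long nr_pages_scanned)
{
	struct model_shrinker s = { 0, 32, 128 };
	long i, hits = 0;

	for (i = 0; i < n; i++)
		if (model_shrink_slab(&s, nr_pages_scanned, 100000, 100000) > 0)
			hits++;
	return hits;
}
```

Under these assumptions, passes_that_shrink(128, 1024) comes out at 127: only the pass that starts from an empty remainder skips the shrinker.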
Cheers,
Dave.
-- 
Dave Chinner
david@fromorbit.com
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply	[flat|nested] 97+ messages in thread
* Re: [PATCH v2 03/28] dcache: convert dentry_stat.nr_unused to per-cpu counters
  2013-04-09  8:02           ` Glauber Costa
@ 2013-04-09 12:47             ` Dave Chinner
  0 siblings, 0 replies; 97+ messages in thread
From: Dave Chinner @ 2013-04-09 12:47 UTC (permalink / raw)
  To: Glauber Costa
  Cc: Greg Thelen, linux-mm, linux-fsdevel, containers, Michal Hocko,
	Johannes Weiner, kamezawa.hiroyu, Andrew Morton, hughd, yinghan,
	Dave Chinner
On Tue, Apr 09, 2013 at 12:02:21PM +0400, Glauber Costa wrote:
> On 04/09/2013 03:26 AM, Dave Chinner wrote:
> > On Mon, Apr 08, 2013 at 01:14:48PM +0400, Glauber Costa wrote:
> >> On 04/05/2013 05:15 AM, Dave Chinner wrote:
> >>> On Thu, Apr 04, 2013 at 06:09:31PM -0700, Greg Thelen wrote:
> >>>> On Fri, Mar 29 2013, Glauber Costa wrote:
> >>>>
> >>>>> From: Dave Chinner <dchinner@redhat.com>
> >>>>>
> >>>>> Before we split up the dcache_lru_lock, the unused dentry counter
> >>>>> needs to be made independent of the global dcache_lru_lock. Convert
> >>>>> it to per-cpu counters to do this.
> >>>>>
> >>>>> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> >>>>> Reviewed-by: Christoph Hellwig <hch@lst.de>
> >>>>> ---
> >>>>>  fs/dcache.c | 17 ++++++++++++++---
> >>>>>  1 file changed, 14 insertions(+), 3 deletions(-)
> >>>>>
> >>>>> diff --git a/fs/dcache.c b/fs/dcache.c
> >>>>> index fbfae008..f1196f2 100644
> >>>>> --- a/fs/dcache.c
> >>>>> +++ b/fs/dcache.c
> >>>>> @@ -118,6 +118,7 @@ struct dentry_stat_t dentry_stat = {
> >>>>>  };
> >>>>>  
> >>>>>  static DEFINE_PER_CPU(unsigned int, nr_dentry);
> >>>>> +static DEFINE_PER_CPU(unsigned int, nr_dentry_unused);
> >>>>>  
> >>>>>  #if defined(CONFIG_SYSCTL) && defined(CONFIG_PROC_FS)
> >>>>>  static int get_nr_dentry(void)
> >>>>> @@ -129,10 +130,20 @@ static int get_nr_dentry(void)
> >>>>>  	return sum < 0 ? 0 : sum;
> >>>>>  }
> >>>>>  
> >>>>> +static int get_nr_dentry_unused(void)
> >>>>> +{
> >>>>> +	int i;
> >>>>> +	int sum = 0;
> >>>>> +	for_each_possible_cpu(i)
> >>>>> +		sum += per_cpu(nr_dentry_unused, i);
> >>>>> +	return sum < 0 ? 0 : sum;
> >>>>> +}
> >>>>
> >>>> Just checking...  If cpu x is removed, then its per cpu nr_dentry_unused
> >>>> count survives so we don't leak nr_dentry_unused.  Right?  I see code in
> >>>> percpu_counter_sum_positive() to explicitly handle this case and I want
> >>>> to make sure we don't need it here.
> >>>
> >>> DEFINE_PER_CPU() gives a variable per possible CPU, and we sum for
> >>> all possible CPUs. Therefore online/offline CPUs just don't matter.
> >>>
> >>> The percpu_counter code uses for_each_online_cpu(), and so it has to
> >>> be aware of hotplug operations so that it doesn't leak counts.
> >>
> >> It is an unsigned quantity, however. Can't we go negative if it becomes
> >> unused in one cpu, but used in another?
> > 
> > Sure, but it's unsigned for the purposes of summing, not for the
> > purposes of having pos/neg values - they are just delta counters.
> > 
> > I'm just copying the code from fs/inode.c. I originally implemented
> > the fs/inode.c code using generic per-cpu counters, but there was
> > a hissy fit over "too much overhead" and so someone implemented
> > their own lightweight version. I've just copied the existing code to
> > code because I don't care to revisit this....
> > 
> 
> Funnily enough, we reimplement per-cpu counters in memcg as well.
> This is mostly related to overhead and counter cache layout. Maybe it is time
> for a better percpu counter? (not that I have the time for it...)
Word.
I've just given up trying to convince people to use the generic code
when they are set on micro-optimising code. The "I can trim 3
instructions from every increment and decrement" argument seems to
win every time over "we know the generic counters work"....
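For reference, the delta-counter behaviour discussed above can be modelled in userspace along these lines. It is a sketch under assumptions: a fixed hypothetical CPU count stands in for the possible-CPU mask, a plain array stands in for DEFINE_PER_CPU(), and the sum is accumulated as unsigned and then cast, which relies on the usual two's complement conversion.

```c
#define MODEL_NR_CPUS 4	/* hypothetical CPU count for this sketch */

/* One slot per possible CPU, standing in for
 * DEFINE_PER_CPU(unsigned int, nr_dentry_unused). */
static unsigned int model_nr_unused[MODEL_NR_CPUS];

static void model_inc(int cpu) { model_nr_unused[cpu]++; }
static void model_dec(int cpu) { model_nr_unused[cpu]--; }	/* may wrap below 0 */

/* Sum every slot, online or not. A slot that wrapped below zero cancels
 * against the increments recorded on other CPUs (modulo 2^32), so only
 * the total is meaningful, not any individual slot. */
static int model_get_nr_unused(void)
{
	int i;
	unsigned int sum = 0;

	for (i = 0; i < MODEL_NR_CPUS; i++)
		sum += model_nr_unused[i];
	return (int)sum < 0 ? 0 : (int)sum;
}
```

With two increments on CPU 0 and one decrement on CPU 1, the slot for CPU 1 holds a wrapped value, but the sum is still 1. Because every slot is always included in the sum, no hotplug handling is needed in this model.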
Cheers,
Dave.
-- 
Dave Chinner
david@fromorbit.com
^ permalink raw reply	[flat|nested] 97+ messages in thread
* Re: [PATCH v2 02/28] vmscan: take at least one pass with shrinkers
  2013-04-09 12:30                 ` Dave Chinner
@ 2013-04-10  2:51                   ` Joonsoo Kim
  2013-04-10  7:30                     ` Glauber Costa
                                       ` (3 more replies)
  0 siblings, 4 replies; 97+ messages in thread
From: Joonsoo Kim @ 2013-04-10  2:51 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Glauber Costa, linux-mm, linux-fsdevel, containers, Michal Hocko,
	Johannes Weiner, kamezawa.hiroyu, Andrew Morton, Greg Thelen,
	hughd, yinghan, Theodore Ts'o, Al Viro
Hello, Dave.
On Tue, Apr 09, 2013 at 10:30:08PM +1000, Dave Chinner wrote:
> On Tue, Apr 09, 2013 at 11:05:05AM +0900, Joonsoo Kim wrote:
> > On Tue, Apr 09, 2013 at 11:29:31AM +1000, Dave Chinner wrote:
> > > On Tue, Apr 09, 2013 at 09:55:47AM +0900, Joonsoo Kim wrote:
> > > > lowmemkiller makes spare memory via killing a task.
> > > > 
> > > > Below is code from lowmem_shrink() in lowmemorykiller.c
> > > > 
> > > >         for (i = 0; i < array_size; i++) {
> > > >                 if (other_free < lowmem_minfree[i] &&
> > > >                     other_file < lowmem_minfree[i]) {
> > > >                         min_score_adj = lowmem_adj[i];
> > > >                         break;
> > > >                 }   
> > > >         } 
> > > 
> > > I don't think you understand what the current lowmemkiller shrinker
> > > hackery actually does.
> > > 
> > >         rem = global_page_state(NR_ACTIVE_ANON) +
> > >                 global_page_state(NR_ACTIVE_FILE) +
> > >                 global_page_state(NR_INACTIVE_ANON) +
> > >                 global_page_state(NR_INACTIVE_FILE);
> > >         if (sc->nr_to_scan <= 0 || min_score_adj == OOM_SCORE_ADJ_MAX + 1) {
> > >                 lowmem_print(5, "lowmem_shrink %lu, %x, return %d\n",
> > >                              sc->nr_to_scan, sc->gfp_mask, rem);
> > >                 return rem;
> > >         }
> > > 
> > > So, when nr_to_scan == 0 (i.e. the count phase), the shrinker is
> > > going to return a count of active/inactive pages in the cache. That
> > > is almost always going to be non-zero, and almost always be > 1000
> > > because of the minimum working set needed to run the system.
> > > Even after applying the seek count adjustment, total_scan is almost
> > > always going to be larger than the shrinker default batch size of
> > > 128, and that means this shrinker will almost always run at least
> > > once per shrink_slab() call.
> > 
> > I don't think so.
> > Yes, lowmem_shrink() return number of (in)active lru pages
> > when nr_to_scan is 0. And in shrink_slab(), we divide it by lru_pages.
> > lru_pages can vary where shrink_slab() is called, anyway, perhaps this
> > logic makes total_scan below 128.
> 
> "perhaps"
> 
> 
> There is no "perhaps" here - there is *zero* guarantee of the
> behaviour you are claiming the lowmem killer shrinker is dependent
> on with the existing shrinker infrastructure. So, lets say we have:
> 
> 	nr_pages_scanned = 1024
> 	lru_pages = 100,000
> 
> Your shrinker is going to return 100,000 when nr_to_scan = 0. So,
> we have:
> 
> 	batch_size = SHRINK_BATCH = 128
> 	max_pass= 100,000
> 
> 	total_scan = shrinker->nr_in_batch = 0
> 	delta = 4 * 1024 / 32 = 128
> 	delta = 128 * 100,000 = 12,800,000
> 	delta = 12,800,000 / 100,001 = 127
> 	total_scan += delta = 127
> 
> Assuming the LRU pages count does not change(*), nr_pages_scanned is
> irrelevant and delta always comes in 1 count below the batch size,
> and the shrinker is not called. The remainder is then:
> 
> 	shrinker->nr_in_batch += total_scan = 127
> 
> (*) the lru page count will change, because reclaim and shrinkers
> run concurrently, and so we can't even make a simple contrived case
> where delta is consistently < batch_size here.
> 
> Anyway, the next time the shrinker is entered, we start with:
> 
> 	total_scan = shrinker->nr_in_batch = 127
> 	.....
> 	total_scan += delta = 254
> 
> 	<shrink once, total scan -= batch_size = 126>
> 
> 	shrinker->nr_in_batch += total_scan = 126
> 
> And so on for all the subsequent shrink_slab calls....
> 
> IOWs, this algorithm effectively causes the shrinker to be called
> 127 times out of 128 in this arbitrary scenario. It does not behave
> as you are assuming it to, and as such any code based on those
> assumptions is broken....
Thanks for the good example. I got your point :)
But my concern is not entirely resolved, because this is not a problem
just for the lowmem killer, and I can think of a counter-example. Other
drivers may also suffer from this change.
I looked at the code for "huge_zero_page_shrinker".
It returns HPAGE_PMD_NR if there is a shrinkable object.
Let me borrow your example for this case.
 	nr_pages_scanned = 1,000
 	lru_pages = 100,000
 	batch_size = SHRINK_BATCH = 128
 	max_pass= 512 (HPAGE_PMD_NR)
 
 	total_scan = shrinker->nr_in_batch = 0
 	delta = 4 * 1,000 / 2 = 2,000
 	delta = 2,000 * 512 = 1,024,000
 	delta = 1,024,000 / 100,001 = 10
 	total_scan += delta = 10
As you can see, before this patch, do_shrinker_shrink() for
"huge_zero_page_shrinker" is not called until we have called shrink_slab()
13 times. The *frequency* with which we call do_shrinker_shrink() is
very different from before. With this patch, we call
do_shrinker_shrink() for "huge_zero_page_shrinker" 12 more times
than before. Can we be confident that this will cause no problems?
This is why I worry about this change.
Am I worrying too much? :)
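The count of passes before the first shrink can be checked with a small helper. This is a sketch: the function names are illustrative, the seeks value of 2 is taken from the arithmetic above, and a fixed delta per pass is assumed.

```c
/* Per-pass delta from the worked example, using integer arithmetic. */
static long scan_delta(long nr_pages_scanned, long seeks,
		       long max_pass, long lru_pages)
{
	long long delta = (4LL * nr_pages_scanned) / seeks;

	delta *= max_pass;
	delta /= lru_pages + 1;
	return (long)delta;
}

/* Number of shrink_slab()-style passes until the accumulated total first
 * reaches batch_size, assuming the same delta on every pass.
 * Returns -1 if delta is not positive, since the total then never grows. */
static long passes_until_first_shrink(long delta, long batch_size)
{
	if (delta <= 0)
		return -1;
	return (batch_size + delta - 1) / delta;	/* ceil(batch_size / delta) */
}
```

For the numbers above, scan_delta(1000, 2, 512, 100000) is 10 and passes_until_first_shrink(10, 128) is 13, so without the patch the shrinker first runs on the 13th pass.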
Here is another scenario I have in mind for the lowmem killer.
In reality, 'nr_pages_scanned' reflects sc->priority.
You can see this in get_scan_count() in vmscan.c:
	size = get_lru_size(lruvec, lru);
	scan = size >> sc->priority;
So, let me reconstruct your example with the above assumption.
If sc->priority is DEF_PRIORITY (12):
 	nr_pages_scanned = 24 (100,000 >> 12)
 	lru_pages = 100,000
 	batch_size = SHRINK_BATCH = 128
 	max_pass= 100,000
 
 	total_scan = shrinker->nr_in_batch = 0
 	delta = 4 * 24 / 32 = 3
 	delta = 3 * 100,000 = 300,000
 	delta = 300,000 / 100,001 = 2
 	total_scan += delta = 2
So, do_shrinker_shrink() is not called for the lowmem killer until
we have called shrink_slab() 64 times if sc->priority is DEF_PRIORITY.
So, AFAICT, if we don't have much trouble reclaiming memory, it will not
be triggered frequently.
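To see how the per-pass delta depends on sc->priority in this scenario, a small model helps. It is only a sketch: it treats the whole LRU as a single list of lru_size pages, takes max_pass equal to lru_pages as in the example, and the function names are made up. Under integer division the delta at DEF_PRIORITY comes out at 2.

```c
#define MODEL_DEF_PRIORITY 12	/* DEF_PRIORITY as used in the example */

/* Pages scanned in one pass over a single LRU of lru_size pages,
 * following the quoted "scan = size >> sc->priority". */
static long pages_scanned_at(long lru_size, int priority)
{
	return lru_size >> priority;
}

/* Per-pass delta for a shrinker whose count equals the LRU size
 * (max_pass == lru_pages), as in the lowmem killer example. */
static long delta_at_priority(long lru_size, int priority, long seeks)
{
	long long delta = (4LL * pages_scanned_at(lru_size, priority)) / seeks;

	delta *= lru_size;
	delta /= lru_size + 1;
	return (long)delta;
}
```

With lru_size = 100,000 and seeks = 32, the delta at MODEL_DEF_PRIORITY is far below the batch size of 128, while at priority 0 (the whole list scanned) it is well above it, so in this model the shrinker only runs often once reclaim pressure has raised the scan rate.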
I like this patchset, and I think the shrink_slab interface should be
reworked. All I want to say is that this patch is not a trivial
change and that users should be notified to test it.
To say it again, I don't want to become a blocker for this patchset :)
Please let me know what I am missing.
Thanks.
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
> 
^ permalink raw reply	[flat|nested] 97+ messages in thread
* Re: [PATCH v2 02/28] vmscan: take at least one pass with shrinkers
  2013-04-01  8:10     ` Glauber Costa
@ 2013-04-10  5:09       ` Ric Mason
  2013-04-10  7:32         ` Glauber Costa
  2013-04-10  9:19         ` Dave Chinner
  0 siblings, 2 replies; 97+ messages in thread
From: Ric Mason @ 2013-04-10  5:09 UTC (permalink / raw)
  To: Glauber Costa
  Cc: Kamezawa Hiroyuki, linux-mm, linux-fsdevel, containers,
	Michal Hocko, Johannes Weiner, Andrew Morton, Dave Shrinnker,
	Greg Thelen, hughd, yinghan, Theodore Ts'o, Al Viro
Hi Glauber,
On 04/01/2013 04:10 PM, Glauber Costa wrote:
> Hi Kame,
>
>> Doesn't this break
>>
>> ==
>>                 /*
>>                  * copy the current shrinker scan count into a local variable
>>                  * and zero it so that other concurrent shrinker invocations
>>                  * don't also do this scanning work.
>>                  */
>>                 nr = atomic_long_xchg(&shrinker->nr_in_batch, 0);
>> ==
>>
>> This xchg magic ?
>>
>> Thanks,
>> -Kame
> This is done before the actual reclaim attempt, and all it does is to
> indicate to other concurrent shrinkers that "I've got it", and others
> should not attempt to shrink.
>
> Even before I touch this, this quantity represents the number of
> entities we will try to shrink. Not necessarily we will succeed. What my
> patch does, is to try at least once if the number is too small.
>
> Before it, we will try to shrink 512 objects and succeed at 0 (because
> batch is 1024). After this, we will try to free 512 objects and succeed
> at an undefined quantity between 0 and 512.
Where do you get the magic numbers 512 and 1024? The value of SHRINK_BATCH
is 128.
>
> In both cases, we will zero out nr_in_batch in the shrinker structure to
> notify other shrinkers that we are the ones shrinking.
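The "I've got it" hand-off described in the quoted text can be sketched with C11 atomics. This is a userspace illustration only: model_nr_in_batch is a hypothetical stand-in for shrinker->nr_in_batch, and atomic_exchange() plays the role of atomic_long_xchg().

```c
#include <stdatomic.h>

/* Hypothetical stand-in for shrinker->nr_in_batch. */
static atomic_long model_nr_in_batch;

/* Take ownership of all deferred work and leave zero behind, so that a
 * concurrent caller doing the same exchange finds nothing to scan. */
static long claim_deferred_work(void)
{
	return atomic_exchange(&model_nr_in_batch, 0);
}

/* Hand unfinished work back so that a later pass can pick it up. */
static void return_deferred_work(long remainder)
{
	if (remainder > 0)
		atomic_fetch_add(&model_nr_in_batch, remainder);
}
```

The exchange itself does not reclaim anything; it only decides which caller owns the pending count. Whatever that caller does not scan is added back afterwards.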
>
^ permalink raw reply	[flat|nested] 97+ messages in thread
* Re: [PATCH v2 02/28] vmscan: take at least one pass with shrinkers
  2013-04-10  2:51                   ` Joonsoo Kim
@ 2013-04-10  7:30                     ` Glauber Costa
  2013-04-10  8:19                       ` Joonsoo Kim
  2013-04-10  8:46                     ` Wanpeng Li
                                       ` (2 subsequent siblings)
  3 siblings, 1 reply; 97+ messages in thread
From: Glauber Costa @ 2013-04-10  7:30 UTC (permalink / raw)
  To: Joonsoo Kim
  Cc: Dave Chinner, linux-mm, linux-fsdevel, containers, Michal Hocko,
	Johannes Weiner, kamezawa.hiroyu, Andrew Morton, Greg Thelen,
	hughd, yinghan, Theodore Ts'o, Al Viro
On 04/10/2013 06:51 AM, Joonsoo Kim wrote:
> As you can see, before this patch, do_shrinker_shrink() for
> "huge_zero_page_shrinker" is not called until we call shrink_slab() more
> than 13 times. *Frequency* we call do_shrinker_shrink() actually is
> largely different with before. With this patch, we actually call
> do_shrinker_shrink() for "huge_zero_page_shrinker" 12 times more
> than before. Can we be convinced that there will be no problem?
> 
> This is why I worry about this change.
> Am I worried too much? :)
Yes, you are. The number of times shrink_slab() is called is completely
unpredictable. Changing the size of cached data structures is a lot more
likely to change this than this shrinker change, for instance.
Not to mention, the number of times shrink_slab() is called is not
changed directly here, but rather the number of times an individual
shrinker actually does work.
^ permalink raw reply	[flat|nested] 97+ messages in thread
* Re: [PATCH v2 02/28] vmscan: take at least one pass with shrinkers
  2013-04-10  5:09       ` Ric Mason
@ 2013-04-10  7:32         ` Glauber Costa
  2013-04-10  9:19         ` Dave Chinner
  1 sibling, 0 replies; 97+ messages in thread
From: Glauber Costa @ 2013-04-10  7:32 UTC (permalink / raw)
  To: Ric Mason
  Cc: Kamezawa Hiroyuki, linux-mm, linux-fsdevel, containers,
	Michal Hocko, Johannes Weiner, Andrew Morton, Dave Shrinnker,
	Greg Thelen, hughd, yinghan, Theodore Ts'o, Al Viro
On 04/10/2013 09:09 AM, Ric Mason wrote:
>> Before it, we will try to shrink 512 objects and succeed at 0 (because
>> batch is 1024). After this, we will try to free 512 objects and succeed
>> at an undefined quantity between 0 and 512.
> Where you get the magic number 512 and 1024? The value of SHRINK_BATCH
> is 128.
> 
This is shrinker-defined. For instance, the super-block shrinker reads:
                s->s_shrink.shrink = prune_super;
                s->s_shrink.batch = 1024;
And then vmscan:
                long batch_size = shrinker->batch ? shrinker->batch
                                                  : SHRINK_BATCH;
I am dealing too much with the super block shrinker these days, so I
just had that cached in my mind and forgot to check the code and be more
explicit.
In any case, that was a numeric example that is valid nevertheless.
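To make the 512/1024 example concrete, here is a small model of the batch selection and of the before/after behaviour. It is a sketch under assumptions: the patch itself is not quoted in this message, so scan_at_least_once() is only my reading of "take at least one pass", and all names are illustrative.

```c
#define MODEL_SHRINK_BATCH 128	/* default batch size quoted in this thread */

/* Mirrors the vmscan snippet above: a shrinker-specific batch wins,
 * otherwise the default is used. */
static long effective_batch(long shrinker_batch)
{
	return shrinker_batch ? shrinker_batch : MODEL_SHRINK_BATCH;
}

/* Objects the shrinker is asked to scan when only whole batches run. */
static long scan_whole_batches_only(long total_scan, long batch_size)
{
	return (total_scan / batch_size) * batch_size;
}

/* Assumed reading of "take at least one pass": a total below one batch
 * still gets a single, smaller pass instead of being skipped. */
static long scan_at_least_once(long total_scan, long batch_size)
{
	if (total_scan > 0 && total_scan < batch_size)
		return total_scan;
	return scan_whole_batches_only(total_scan, batch_size);
}
```

With the super-block batch of 1024 and 512 pending objects, the first variant asks the shrinker to scan nothing, while the second asks it to scan 512, of which anywhere between 0 and 512 may actually be freed.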
^ permalink raw reply	[flat|nested] 97+ messages in thread
* Re: [PATCH v2 02/28] vmscan: take at least one pass with shrinkers
  2013-04-10  7:30                     ` Glauber Costa
@ 2013-04-10  8:19                       ` Joonsoo Kim
  0 siblings, 0 replies; 97+ messages in thread
From: Joonsoo Kim @ 2013-04-10  8:19 UTC (permalink / raw)
  To: Glauber Costa
  Cc: Dave Chinner, linux-mm, linux-fsdevel, containers, Michal Hocko,
	Johannes Weiner, kamezawa.hiroyu, Andrew Morton, Greg Thelen,
	hughd, yinghan, Theodore Ts'o, Al Viro
On Wed, Apr 10, 2013 at 11:30:32AM +0400, Glauber Costa wrote:
> On 04/10/2013 06:51 AM, Joonsoo Kim wrote:
> > As you can see, before this patch, do_shrinker_shrink() for
> > "huge_zero_page_shrinker" is not called until we call shrink_slab() more
> > than 13 times. *Frequency* we call do_shrinker_shrink() actually is
> > largely different with before. With this patch, we actually call
> > do_shrinker_shrink() for "huge_zero_page_shrinker" 12 times more
> > than before. Can we be convinced that there will be no problem?
> > 
> > This is why I worry about this change.
> > Am I worried too much? :)
> 
> Yes, you are. The amount of times shrink_slab is called is completely
> unpredictable. Changing the size of cached data structures is a lot more
> likely to change this than this shrinker change, for instance.
> 
> Not to mention, the amount of times shrink_slab() is called is not
> changed directly here. But rather, the amount of times an individual
> shrinker actually does work.
Yes, I was worried about the number of times an individual shrinker is
triggered. As you mentioned, it can be unpredictable. My concern, without
data, may be useless and invalid to you. So, from now on, I will stop
worrying about this.
Thanks.
> 
> 
^ permalink raw reply	[flat|nested] 97+ messages in thread
* Re: [PATCH v2 02/28] vmscan: take at least one pass with shrinkers
  2013-04-10  2:51                   ` Joonsoo Kim
  2013-04-10  7:30                     ` Glauber Costa
  2013-04-10  8:46                     ` Wanpeng Li
@ 2013-04-10  8:46                     ` Wanpeng Li
  2013-04-10 10:07                       ` Dave Chinner
       [not found]                     ` <20130410025115.GA5872-Hm3cg6mZ9cc@public.gmane.org>
  3 siblings, 1 reply; 97+ messages in thread
From: Wanpeng Li @ 2013-04-10  8:46 UTC (permalink / raw)
  To: Glauber Costa
  Cc: Dave Chinner, linux-mm, linux-fsdevel, containers, Michal Hocko,
	Johannes Weiner, kamezawa.hiroyu, Andrew Morton, Greg Thelen,
	hughd, yinghan, Theodore Ts'o, Al Viro
Hi Glauber,
On Wed, Apr 10, 2013 at 11:51:16AM +0900, Joonsoo Kim wrote:
>Hello, Dave.
>
>On Tue, Apr 09, 2013 at 10:30:08PM +1000, Dave Chinner wrote:
>> On Tue, Apr 09, 2013 at 11:05:05AM +0900, Joonsoo Kim wrote:
>> > On Tue, Apr 09, 2013 at 11:29:31AM +1000, Dave Chinner wrote:
>> > > On Tue, Apr 09, 2013 at 09:55:47AM +0900, Joonsoo Kim wrote:
>> > > > lowmemkiller makes spare memory via killing a task.
>> > > > 
>> > > > Below is code from lowmem_shrink() in lowmemorykiller.c
>> > > > 
>> > > >         for (i = 0; i < array_size; i++) {
>> > > >                 if (other_free < lowmem_minfree[i] &&
>> > > >                     other_file < lowmem_minfree[i]) {
>> > > >                         min_score_adj = lowmem_adj[i];
>> > > >                         break;
>> > > >                 }   
>> > > >         } 
>> > > 
>> > > I don't think you understand what the current lowmemkiller shrinker
>> > > hackery actually does.
>> > > 
>> > >         rem = global_page_state(NR_ACTIVE_ANON) +
>> > >                 global_page_state(NR_ACTIVE_FILE) +
>> > >                 global_page_state(NR_INACTIVE_ANON) +
>> > >                 global_page_state(NR_INACTIVE_FILE);
>> > >         if (sc->nr_to_scan <= 0 || min_score_adj == OOM_SCORE_ADJ_MAX + 1) {
>> > >                 lowmem_print(5, "lowmem_shrink %lu, %x, return %d\n",
>> > >                              sc->nr_to_scan, sc->gfp_mask, rem);
>> > >                 return rem;
>> > >         }
>> > > 
>> > > So, when nr_to_scan == 0 (i.e. the count phase), the shrinker is
>> > > going to return a count of active/inactive pages in the cache. That
>> > > is almost always going to be non-zero, and almost always be > 1000
>> > > because of the minimum working set needed to run the system.
>> > > Even after applying the seek count adjustment, total_scan is almost
>> > > always going to be larger than the shrinker default batch size of
>> > > 128, and that means this shrinker will almost always run at least
>> > > once per shrink_slab() call.
>> > 
>> > I don't think so.
>> > Yes, lowmem_shrink() return number of (in)active lru pages
>> > when nr_to_scan is 0. And in shrink_slab(), we divide it by lru_pages.
>> > lru_pages can vary where shrink_slab() is called, anyway, perhaps this
>> > logic makes total_scan below 128.
>> 
>> "perhaps"
>> 
>> 
>> There is no "perhaps" here - there is *zero* guarantee of the
>> behaviour you are claiming the lowmem killer shrinker is dependent
>> on with the existing shrinker infrastructure. So, lets say we have:
>> 
>> 	nr_pages_scanned = 1024
>> 	lru_pages = 100,000
>> 
>> Your shrinker is going to return 100,000 when nr_to_scan = 0. So,
>> we have:
>> 
>> 	batch_size = SHRINK_BATCH = 128
>> 	max_pass= 100,000
>> 
>> 	total_scan = shrinker->nr_in_batch = 0
>> 	delta = 4 * 1024 / 32 = 128
>> 	delta = 128 * 100,000 = 12,800,000
>> 	delta = 12,800,000 / 100,001 = 127
>> 	total_scan += delta = 127
>> 
>> Assuming the LRU pages count does not change(*), nr_pages_scanned is
>> irrelevant and delta always comes in 1 count below the batch size,
>> and the shrinker is not called. The remainder is then:
>> 
>> 	shrinker->nr_in_batch += total_scan = 127
>> 
>> (*) the lru page count will change, because reclaim and shrinkers
>> run concurrently, and so we can't even make a simple contrived case
>> where delta is consistently < batch_size here.
>> 
>> Anyway, the next time the shrinker is entered, we start with:
>> 
>> 	total_scan = shrinker->nr_in_batch = 127
>> 	.....
>> 	total_scan += delta = 254
>> 
>> 	<shrink once, total scan -= batch_size = 126>
>> 
>> 	shrinker->nr_in_batch += total_scan = 126
>> 
>> And so on for all the subsequent shrink_slab calls....
>> 
>> IOWs, this algorithm effectively causes the shrinker to be called
>> 127 times out of 128 in this arbitrary scenario. It does not behave
>> as you are assuming it to, and as such any code based on those
>> assumptions is broken....
>
>Thanks for good example. I got your point :)
>But, my concern is not solved entirely, because this is not problem
>just for lowmem killer and I can think counter example. And other drivers
>can be suffered from this change.
>
>I look at the code for "huge_zero_page_shrinker".
>They return HPAGE_PMD_NR if there is a shrinkable object.
>
>I try to borrow your example for this case.
>
> 	nr_pages_scanned = 1,000
> 	lru_pages = 100,000
> 	batch_size = SHRINK_BATCH = 128
> 	max_pass= 512 (HPAGE_PMD_NR)
>
> 	total_scan = shrinker->nr_in_batch = 0
> 	delta = 4 * 1,000 / 2 = 2,000
> 	delta = 2,000 * 512 = 1,024,000
> 	delta = 1,024,000 / 100,001 = 10
> 	total_scan += delta = 10
>
>As you can see, before this patch, do_shrinker_shrink() for
>"huge_zero_page_shrinker" is not called until we call shrink_slab() more
>than 13 times. *Frequency* we call do_shrinker_shrink() actually is
>largely different with before. With this patch, we actually call
>do_shrinker_shrink() for "huge_zero_page_shrinker" 12 times more
>than before. Can we be convinced that there will be no problem?
>
>This is why I worry about this change.
>Am I worried too much? :)
>
>I show another scenario what I am thinking for lowmem killer.
>
>In reality, 'nr_pages_scanned' reflect sc->priority.
>You can see it get_scan_count() in vmscan.c
>
>	size = get_lru_size(lruvec, lru);
>	scan = size >> sc->priority;
>
>So, I try to re-construct your example with above assumption.
>
>If sc->priority is DEF_PRIORITY (12)
>
> 	nr_pages_scanned = 24 (100,000 >> 12)
> 	lru_pages = 100,000
> 	batch_size = SHRINK_BATCH = 128
> 	max_pass= 100,000
>
> 	total_scan = shrinker->nr_in_batch = 0
> 	delta = 4 * 24 / 32 = 3
> 	delta = 3 * 100,000 = 300,000
> 	delta = 300,000 / 100,001 = 2
> 	total_scan += delta = 2
>
>So, do_shrinker_shrink() is not called for lowmem killer until
>we have called shrink_slab() 64 times if sc->priority is DEF_PRIORITY.
>So, AICT, if we don't have trouble too much in reclaiming memory, it will not
>triggered frequently.
>
As Joonsoo's example shows, before the patch, if the scan priority is low,
the slab cache won't be shrunk; after the patch, however, the slab cache is
shrunk more aggressively. Furthermore, these slab cache pages may be more
seek-expensive than LRU pages.
Regards,
Wanpeng Li 
>I like this patchset, and I think shrink_slab interface should be
>re-worked. What I want to say is just that this patch is not trivial
>change and should notify user to test it.
>I want to say again, I don't want to become a stopper for this patchset :)
>
>Please let me know what I am missing.
>
>Thanks.
>
>> Cheers,
>> 
>> Dave.
>> -- 
>> Dave Chinner
>> david@fromorbit.com
>> 
^ permalink raw reply	[flat|nested] 97+ messages in thread
* Re: [PATCH v2 02/28] vmscan: take at least one pass with shrinkers
       [not found]                     ` <20130410025115.GA5872-Hm3cg6mZ9cc@public.gmane.org>
@ 2013-04-10  8:46                       ` Wanpeng Li
  0 siblings, 0 replies; 97+ messages in thread
From: Wanpeng Li @ 2013-04-10  8:46 UTC (permalink / raw)
  To: Glauber Costa
  Cc: Theodore Ts'o, hughd-hpIqsD4AKlfQT0dZR+AlfA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Dave Chinner, Michal Hocko, linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
	Al Viro, Johannes Weiner, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	Andrew Morton
Hi Glauber,
On Wed, Apr 10, 2013 at 11:51:16AM +0900, Joonsoo Kim wrote:
>Hello, Dave.
>
>On Tue, Apr 09, 2013 at 10:30:08PM +1000, Dave Chinner wrote:
>> On Tue, Apr 09, 2013 at 11:05:05AM +0900, Joonsoo Kim wrote:
>> > On Tue, Apr 09, 2013 at 11:29:31AM +1000, Dave Chinner wrote:
>> > > On Tue, Apr 09, 2013 at 09:55:47AM +0900, Joonsoo Kim wrote:
>> > > > lowmemkiller makes spare memory via killing a task.
>> > > > 
>> > > > Below is code from lowmem_shrink() in lowmemorykiller.c
>> > > > 
>> > > >         for (i = 0; i < array_size; i++) {
>> > > >                 if (other_free < lowmem_minfree[i] &&
>> > > >                     other_file < lowmem_minfree[i]) {
>> > > >                         min_score_adj = lowmem_adj[i];
>> > > >                         break;
>> > > >                 }   
>> > > >         } 
>> > > 
>> > > I don't think you understand what the current lowmemkiller shrinker
>> > > hackery actually does.
>> > > 
>> > >         rem = global_page_state(NR_ACTIVE_ANON) +
>> > >                 global_page_state(NR_ACTIVE_FILE) +
>> > >                 global_page_state(NR_INACTIVE_ANON) +
>> > >                 global_page_state(NR_INACTIVE_FILE);
>> > >         if (sc->nr_to_scan <= 0 || min_score_adj == OOM_SCORE_ADJ_MAX + 1) {
>> > >                 lowmem_print(5, "lowmem_shrink %lu, %x, return %d\n",
>> > >                              sc->nr_to_scan, sc->gfp_mask, rem);
>> > >                 return rem;
>> > >         }
>> > > 
>> > > So, when nr_to_scan == 0 (i.e. the count phase), the shrinker is
>> > > going to return a count of active/inactive pages in the cache. That
>> > > is almost always going to be non-zero, and almost always be > 1000
>> > > because of the minimum working set needed to run the system.
>> > > Even after applying the seek count adjustment, total_scan is almost
>> > > always going to be larger than the shrinker default batch size of
>> > > 128, and that means this shrinker will almost always run at least
>> > > once per shrink_slab() call.
>> > 
>> > I don't think so.
>> > Yes, lowmem_shrink() returns the number of (in)active LRU pages
>> > when nr_to_scan is 0. And in shrink_slab(), we divide it by lru_pages.
>> > lru_pages can vary depending on where shrink_slab() is called; anyway,
>> > perhaps this logic keeps total_scan below 128.
>> 
>> "perhaps"
>> 
>> 
>> There is no "perhaps" here - there is *zero* guarantee of the
>> behaviour you are claiming the lowmem killer shrinker is dependent
>> on with the existing shrinker infrastructure. So, lets say we have:
>> 
>> 	nr_pages_scanned = 1000
>> 	lru_pages = 100,000
>> 
>> Your shrinker is going to return 100,000 when nr_to_scan = 0. So,
>> we have:
>> 
>> 	batch_size = SHRINK_BATCH = 128
>> 	max_pass= 100,000
>> 
>> 	total_scan = shrinker->nr_in_batch = 0
>> 	delta = 4 * 1000 / 32 = 128
>> 	delta = 128 * 100,000 = 12,800,000
>> 	delta = 12,800,000 / 100,001 = 127
>> 	total_scan += delta = 127
>> 
>> Assuming the LRU pages count does not change(*), nr_pages_scanned is
>> irrelevant and delta always comes in 1 count below the batch size,
>> and the shrinker is not called. The remainder is then:
>> 
>> 	shrinker->nr_in_batch += total_scan = 127
>> 
>> (*) the lru page count will change, because reclaim and shrinkers
>> run concurrently, and so we can't even make a simple contrived case
>> where delta is consistently < batch_size here.
>> 
>> Anyway, the next time the shrinker is entered, we start with:
>> 
>> 	total_scan = shrinker->nr_in_batch = 127
>> 	.....
>> 	total_scan += delta = 254
>> 
>> 	<shrink once, total scan -= batch_size = 126>
>> 
>> 	shrinker->nr_in_batch += total_scan = 126
>> 
>> And so on for all the subsequent shrink_slab calls....
>> 
>> IOWs, this algorithm effectively causes the shrinker to be called
>> 127 times out of 128 in this arbitrary scenario. It does not behave
>> as you are assuming it to, and as such any code based on those
>> assumptions is broken....
>
>Thanks for the good example. I got your point :)
>But my concern is not entirely resolved, because this is not a problem
>just for the lowmem killer, and I can think of a counterexample. Other
>drivers can also suffer from this change.
>
>I looked at the code for "huge_zero_page_shrinker".
>It returns HPAGE_PMD_NR if there is a shrinkable object.
>
>I try to borrow your example for this case.
>
> 	nr_pages_scanned = 1,000
> 	lru_pages = 100,000
> 	batch_size = SHRINK_BATCH = 128
> 	max_pass= 512 (HPAGE_PMD_NR)
>
> 	total_scan = shrinker->nr_in_batch = 0
> 	delta = 4 * 1,000 / 2 = 2,000
> 	delta = 2,000 * 512 = 1,024,000
> 	delta = 1,024,000 / 100,001 = 10
> 	total_scan += delta = 10
>
>As you can see, before this patch, do_shrinker_shrink() for
>"huge_zero_page_shrinker" is not called until we call shrink_slab() more
>than 13 times. The *frequency* with which we call do_shrinker_shrink() is
>very different from before. With this patch, we call
>do_shrinker_shrink() for "huge_zero_page_shrinker" 12 times more often
>than before. Can we be convinced that there will be no problem?
>
>This is why I worry about this change.
>Am I worrying too much? :)
>
>Here is another scenario I have in mind for the lowmem killer.
>
>In reality, 'nr_pages_scanned' reflects sc->priority.
>You can see it in get_scan_count() in vmscan.c
>
>	size = get_lru_size(lruvec, lru);
>	scan = size >> sc->priority;
>
>So, I tried to reconstruct your example with the above assumption.
>
>If sc->priority is DEF_PRIORITY (12)
>
> 	nr_pages_scanned = 25 (100,000 / 4,096)
> 	lru_pages = 100,000
> 	batch_size = SHRINK_BATCH = 128
> 	max_pass= 100,000
>
> 	total_scan = shrinker->nr_in_batch = 0
> 	delta = 4 * 25 / 32 = 3
> 	delta = 3 * 100,000 = 300,000
> 	delta = 300,000 / 100,001 = 3
> 	total_scan += delta = 3
>
>So, do_shrinker_shrink() is not called for the lowmem killer until
>we call shrink_slab() more than 40 times if sc->priority is DEF_PRIORITY.
>So, AFAICT, if we don't have too much trouble reclaiming memory, it will
>not be triggered frequently.
>
As Joonsoo's example shows, before the patch the slab cache won't be
shrunk when the scan priority is low; after the patch, the slab cache is
shrunk more aggressively. Furthermore, these slab cache pages may be more
seek-expensive than LRU pages.
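The scan-count arithmetic quoted above can be sketched in plain C. This is a userspace model under stated assumptions, not the kernel code: scan_delta() and calls_until_first_scan() are hypothetical helper names, and the "patched" behaviour is assumed to be "take one pass whenever total_scan is non-zero", which is how the patch is described in this thread. Exact integer division gives slightly smaller values than some of the hand-rounded figures in the quoted mails (for example 2 rather than 3 for the DEF_PRIORITY case); that does not change the argument.

```c
#include <assert.h>

#define SHRINK_BATCH 128

/* Per-call increment, following the three steps written out in the thread. */
static unsigned long long scan_delta(unsigned long nr_pages_scanned,
                                     int seeks,
                                     unsigned long max_pass,
                                     unsigned long lru_pages)
{
    unsigned long long delta;

    assert(seeks > 0);
    delta = (4ULL * nr_pages_scanned) / (unsigned int)seeks;
    delta *= max_pass;
    delta /= (unsigned long long)lru_pages + 1;
    return delta;
}

/*
 * Count how many shrink_slab()-style calls pass before the shrinker
 * callback is first asked to scan.
 *   at_least_one_pass == 0: old loop, scan only while total_scan >= batch.
 *   at_least_one_pass != 0: assumed patched loop, scan once if total_scan > 0.
 * Returns -1 if no scan happens within max_calls.
 */
static int calls_until_first_scan(unsigned long long delta, long batch,
                                  int at_least_one_pass, int max_calls)
{
    unsigned long long nr_in_batch = 0; /* deferred work carried over */
    int call;

    for (call = 1; call <= max_calls; call++) {
        unsigned long long total_scan = nr_in_batch + delta;

        if (total_scan >= (unsigned long long)batch)
            return call;
        if (at_least_one_pass && total_scan > 0)
            return call;
        nr_in_batch = total_scan; /* remainder is saved for the next call */
    }
    return -1;
}
```

With a delta of 10 and the default batch of 128, the old loop first scans on the 13th call, while the assumed patched loop scans on the first call.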
Regards,
Wanpeng Li 
>I like this patchset, and I think the shrink_slab interface should be
>reworked. What I want to say is just that this patch is not a trivial
>change, and users should be notified to test it.
>I want to say again, I don't want to become a blocker for this patchset :)
>
>Please let me know what I am missing.
>
>Thanks.
>
>> Cheers,
>> 
>> Dave.
>> -- 
>> Dave Chinner
>> david-FqsqvQoI3Ljby3iVrkZq2A@public.gmane.org
>> 
>> --
>> To unsubscribe, send a message with 'unsubscribe linux-mm' in
>> the body to majordomo-Bw31MaZKKs0EbZ0PF+XxCw@public.gmane.org  For more info on Linux MM,
>> see: http://www.linux-mm.org/ .
>> Don't email: <a href=mailto:"dont-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org"> email-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org </a>
^ permalink raw reply	[flat|nested] 97+ messages in thread
* Re: [PATCH v2 02/28] vmscan: take at least one pass with shrinkers
  2013-04-10  5:09       ` Ric Mason
  2013-04-10  7:32         ` Glauber Costa
@ 2013-04-10  9:19         ` Dave Chinner
  1 sibling, 0 replies; 97+ messages in thread
From: Dave Chinner @ 2013-04-10  9:19 UTC (permalink / raw)
  To: Ric Mason
  Cc: Glauber Costa, Kamezawa Hiroyuki, linux-mm, linux-fsdevel,
	containers, Michal Hocko, Johannes Weiner, Andrew Morton,
	Greg Thelen, hughd, yinghan, Theodore Ts'o, Al Viro
On Wed, Apr 10, 2013 at 01:09:42PM +0800, Ric Mason wrote:
> Hi Glauber,
> On 04/01/2013 04:10 PM, Glauber Costa wrote:
> > Hi Kame,
> >
> >> Doesn't this break
> >>
> >> ==
> >>                 /*
> >>                  * copy the current shrinker scan count into a local variable
> >>                  * and zero it so that other concurrent shrinker invocations
> >>                  * don't also do this scanning work.
> >>                  */
> >>                 nr = atomic_long_xchg(&shrinker->nr_in_batch, 0);
> >> ==
> >>
> >> This xchg magic ?
> >>
> >> Thanks,
> >> -Kame
> > This is done before the actual reclaim attempt, and all it does is to
> > indicate to other concurrent shrinkers that "I've got it", and others
> > should not attempt to shrink.
> >
> > Even before I touch this, this quantity represents the number of
> > entities we will try to shrink. We will not necessarily succeed. What my
> > patch does is try at least once if the number is too small.
> >
> > Before it, we will try to shrink 512 objects and succeed at 0 (because
> > batch is 1024). After this, we will try to free 512 objects and succeed
> > at an undefined quantity between 0 and 512.
> 
> Where do you get the magic numbers 512 and 1024 from? The value of
> SHRINK_BATCH is 128.
The default is SHRINK_BATCH, but batch size has been customisable
for some time now...
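A minimal sketch of how a per-shrinker batch size can override the default. The struct below is a cut-down stand-in for the kernel's struct shrinker that keeps only a batch field, and effective_batch() is a hypothetical helper name; the "zero means use SHRINK_BATCH" convention is assumed here to match the kernel of that era.

```c
#include <assert.h>
#include <stddef.h>

#define SHRINK_BATCH 128

/* Stand-in for struct shrinker; only the field relevant here. */
struct shrinker_sketch {
    long batch; /* 0 means "no preference, use the default" */
};

/* Pick the batch size a shrink_slab()-style loop would use. */
static long effective_batch(const struct shrinker_sketch *s)
{
    assert(s != NULL);
    return s->batch ? s->batch : SHRINK_BATCH;
}
```

A shrinker that sets batch to 1024 is then asked to scan in units of 1024 objects, which is consistent with the 512-versus-1024 figures in the quoted explanation.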
Cheers,
Dave.
-- 
Dave Chinner
david@fromorbit.com
^ permalink raw reply	[flat|nested] 97+ messages in thread
* Re: [PATCH v2 02/28] vmscan: take at least one pass with shrinkers
  2013-04-10  8:46                     ` Wanpeng Li
@ 2013-04-10 10:07                       ` Dave Chinner
  2013-04-10 14:03                         ` JoonSoo Kim
  0 siblings, 1 reply; 97+ messages in thread
From: Dave Chinner @ 2013-04-10 10:07 UTC (permalink / raw)
  To: Wanpeng Li
  Cc: Glauber Costa, linux-mm, linux-fsdevel, containers, Michal Hocko,
	Johannes Weiner, kamezawa.hiroyu, Andrew Morton, Greg Thelen,
	hughd, yinghan, Theodore Ts'o, Al Viro
On Wed, Apr 10, 2013 at 04:46:06PM +0800, Wanpeng Li wrote:
> On Wed, Apr 10, 2013 at 11:51:16AM +0900, Joonsoo Kim wrote:
> >On Tue, Apr 09, 2013 at 10:30:08PM +1000, Dave Chinner wrote:
> >> On Tue, Apr 09, 2013 at 11:05:05AM +0900, Joonsoo Kim wrote:
> >> > I don't think so.
> >> > Yes, lowmem_shrink() returns the number of (in)active LRU pages
> >> > when nr_to_scan is 0. And in shrink_slab(), we divide it by lru_pages.
> >> > lru_pages can vary depending on where shrink_slab() is called; anyway,
> >> > perhaps this logic keeps total_scan below 128.
> >> 
> >> "perhaps"
> >> 
> >> 
> >> There is no "perhaps" here - there is *zero* guarantee of the
> >> behaviour you are claiming the lowmem killer shrinker is dependent
> >> on with the existing shrinker infrastructure. So, lets say we have:
.....
> >> IOWs, this algorithm effectively causes the shrinker to be called
> >> 127 times out of 128 in this arbitrary scenario. It does not behave
> >> as you are assuming it to, and as such any code based on those
> >> assumptions is broken....
> >
> >Thanks for the good example. I got your point :)
> >But my concern is not entirely resolved, because this is not a problem
> >just for the lowmem killer, and I can think of a counterexample. Other
> >drivers can also suffer from this change.
> >
> >I looked at the code for "huge_zero_page_shrinker".
> >It returns HPAGE_PMD_NR if there is a shrinkable object.
<sigh>
Yet another new shrinker that is just plain broken. It tracks a
*single object*, and returns a value only when the ref count value
is 1, which will result in freeing the zero page at some
random time in the future after some number of other calls to the
shrinker where the refcount is also 1.
This is *insane*.
> >I try to borrow your example for this case.
> >
> > 	nr_pages_scanned = 1,000
> > 	lru_pages = 100,000
> > 	batch_size = SHRINK_BATCH = 128
> > 	max_pass= 512 (HPAGE_PMD_NR)
> >
> > 	total_scan = shrinker->nr_in_batch = 0
> > 	delta = 4 * 1,000 / 2 = 2,000
> > 	delta = 2,000 * 512 = 1,024,000
> > 	delta = 1,024,000 / 100,001 = 10
> > 	total_scan += delta = 10
> >
> >As you can see, before this patch, do_shrinker_shrink() for
> >"huge_zero_page_shrinker" is not called until we call shrink_slab() more
> >than 13 times. The *frequency* with which we call do_shrinker_shrink() is
> >very different from before.
If the frequency of the shrinker calls breaks the shrinker
functionality or the subsystem because it pays no attention to
nr_to_scan, then the shrinker is fundamentally broken. The shrinker
has *no control* over the frequency of the calls to it or the batch
size, and so being dependent on "small numbers means few calls" for
correct behaviour is dangerously unpredictable and completely
non-deterministic.
Besides, if you don't want to be shrunk, return a count of -1.
Shock, horror, it is even documented in the API!
 * 'sc' is passed shrink_control which includes a count 'nr_to_scan'             
 * and a 'gfpmask'.  It should look through the least-recently-used              
 * 'nr_to_scan' entries and attempt to free them up.  It should return           
 * the number of objects which remain in the cache.  If it returns -1, it means  
 * it cannot do any scanning at this time (eg. there is a risk of deadlock).     
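The contract in the comment above can be sketched as a callback that honours nr_to_scan and returns -1 when it cannot scan. This is a userspace illustration only: struct shrink_control_sketch, struct demo_cache and demo_cache_shrink() are hypothetical names, and the cache is modelled as a bare object counter rather than a real LRU.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Stand-in for shrink_control; field names follow the quoted comment. */
struct shrink_control_sketch {
    unsigned long nr_to_scan; /* 0 means "count only, free nothing" */
    unsigned int gfp_mask;
};

/* Hypothetical cache state used for the illustration. */
struct demo_cache {
    unsigned long nr_objects;
    bool can_scan_now; /* false models e.g. a risk of deadlock */
};

/* Returns the number of objects left in the cache, or -1 if it cannot scan. */
static long demo_cache_shrink(struct demo_cache *c,
                              const struct shrink_control_sketch *sc)
{
    unsigned long freed;

    assert(c != NULL && sc != NULL);

    if (!c->can_scan_now)
        return -1;                   /* opt out of this shrink_slab() pass */

    if (sc->nr_to_scan == 0)
        return (long)c->nr_objects;  /* count phase */

    /* Free at most nr_to_scan objects, never more than the cache holds. */
    freed = sc->nr_to_scan < c->nr_objects ? sc->nr_to_scan : c->nr_objects;
    c->nr_objects -= freed;
    return (long)c->nr_objects;
}
```

Because the amount of work is bounded by nr_to_scan on every call, this kind of callback behaves the same whether it is invoked rarely with large batches or often with small ones.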
> >With this patch, we call
> >do_shrinker_shrink() for "huge_zero_page_shrinker" 12 times more often
> >than before. Can we be convinced that there will be no problem?
> >
> >This is why I worry about this change.
> >Am I worrying too much? :)
You're worrying about the wrong thing. You're assuming that
shrinkers are implemented correctly and sanely, but the reality is
that most shrinkers are fundamentally broken in some way or another.
These are just two examples of many. We are trying to fix the API
and shrinker infrastructure to remove the current insanity. We want
to make the shrinkers more flexible so that stuff like one-shot low
memory event notifications can be implemented without grotesque
hacks like the shrinkers you've used as examples so far...
-Dave.
-- 
Dave Chinner
david@fromorbit.com
^ permalink raw reply	[flat|nested] 97+ messages in thread
* Re: [PATCH v2 02/28] vmscan: take at least one pass with shrinkers
  2013-04-10 10:07                       ` Dave Chinner
@ 2013-04-10 14:03                         ` JoonSoo Kim
  2013-04-11  0:41                           ` Dave Chinner
  0 siblings, 1 reply; 97+ messages in thread
From: JoonSoo Kim @ 2013-04-10 14:03 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Wanpeng Li, Glauber Costa, Linux Memory Management List,
	linux-fsdevel, containers, Michal Hocko, Johannes Weiner,
	kamezawa.hiroyu, Andrew Morton, Greg Thelen, hughd, yinghan,
	Theodore Ts'o, Al Viro
2013/4/10 Dave Chinner <david@fromorbit.com>:
> On Wed, Apr 10, 2013 at 04:46:06PM +0800, Wanpeng Li wrote:
>> On Wed, Apr 10, 2013 at 11:51:16AM +0900, Joonsoo Kim wrote:
>> >On Tue, Apr 09, 2013 at 10:30:08PM +1000, Dave Chinner wrote:
>> >> On Tue, Apr 09, 2013 at 11:05:05AM +0900, Joonsoo Kim wrote:
>> >> > I don't think so.
>> >> > Yes, lowmem_shrink() returns the number of (in)active LRU pages
>> >> > when nr_to_scan is 0. And in shrink_slab(), we divide it by lru_pages.
>> >> > lru_pages can vary depending on where shrink_slab() is called; anyway,
>> >> > perhaps this logic keeps total_scan below 128.
>> >>
>> >> "perhaps"
>> >>
>> >>
>> >> There is no "perhaps" here - there is *zero* guarantee of the
>> >> behaviour you are claiming the lowmem killer shrinker is dependent
>> >> on with the existing shrinker infrastructure. So, lets say we have:
> .....
>> >> IOWs, this algorithm effectively causes the shrinker to be called
>> >> 127 times out of 128 in this arbitrary scenario. It does not behave
>> >> as you are assuming it to, and as such any code based on those
>> >> assumptions is broken....
>> >
>> >Thanks for the good example. I got your point :)
>> >But my concern is not entirely resolved, because this is not a problem
>> >just for the lowmem killer, and I can think of a counterexample. Other
>> >drivers can also suffer from this change.
>> >
>> >I looked at the code for "huge_zero_page_shrinker".
>> >It returns HPAGE_PMD_NR if there is a shrinkable object.
>
> <sigh>
>
> Yet another new shrinker that is just plain broken. It tracks a
> *single object*, and returns a value only when the ref count value
> is 1, which will result in freeing the zero page at some
> random time in the future after some number of other calls to the
> shrinker where the refcount is also 1.
>
> This is *insane*.
>
>> >I try to borrow your example for this case.
>> >
>> >     nr_pages_scanned = 1,000
>> >     lru_pages = 100,000
>> >     batch_size = SHRINK_BATCH = 128
>> >     max_pass= 512 (HPAGE_PMD_NR)
>> >
>> >     total_scan = shrinker->nr_in_batch = 0
>> >     delta = 4 * 1,000 / 2 = 2,000
>> >     delta = 2,000 * 512 = 1,024,000
>> >     delta = 1,024,000 / 100,001 = 10
>> >     total_scan += delta = 10
>> >
>> >As you can see, before this patch, do_shrinker_shrink() for
>> >"huge_zero_page_shrinker" is not called until we call shrink_slab() more
>> >than 13 times. The *frequency* with which we call do_shrinker_shrink() is
>> >very different from before.
>
> If the frequency of the shrinker calls breaks the shrinker
> functionality or the subsystem because it pays no attention to
> nr_to_scan, then the shrinker is fundamentally broken. The shrinker
> has *no control* over the frequency of the calls to it or the batch
> size, and so being dependent on "small numbers means few calls" for
> correct behaviour is dangerously unpredictable and completely
> non-deterministic.
>
> Besides, if you don't want to be shrunk, return a count of -1.
> Shock, horror, it is even documented in the API!
>
>  * 'sc' is passed shrink_control which includes a count 'nr_to_scan'
>  * and a 'gfpmask'.  It should look through the least-recently-used
>  * 'nr_to_scan' entries and attempt to free them up.  It should return
>  * the number of objects which remain in the cache.  If it returns -1, it means
>  * it cannot do any scanning at this time (eg. there is a risk of deadlock).
>
>> >With this patch, we call
>> >do_shrinker_shrink() for "huge_zero_page_shrinker" 12 times more often
>> >than before. Can we be convinced that there will be no problem?
>> >
>> >This is why I worry about this change.
>> >Am I worrying too much? :)
>
> You're worrying about the wrong thing. You're assuming that
> shrinkers are implemented correctly and sanely, but the reality is
> that most shrinkers are fundamentally broken in some way or another.
>
> These are just two examples of many. We are trying to fix the API
> and shrinker infrastructure to remove the current insanity. We want
> to make the shrinkers more flexible so that stuff like one-shot low
> memory event notifications can be implemented without grotesque
> hacks like the shrinkers you've used as examples so far...
Yes, that is great.
I already know that many shrinkers are wrongly implemented.
The examples above speak for themselves.
Another issue I found is that they don't account "nr_reclaimed" precisely.
No code checks whether "current->reclaim_state" exists,
except prune_inode(). So if they reclaim a page directly, they will not
account for how many pages are freed, and shrink_zone() and shrink_slab()
will be called excessively.
Maybe there is no properly implemented shrinker except the fs ones :)
But this is the reality we live in, so I was worried about it.
Now I'm okay, so please forget my concern.
Thanks.
> -Dave.
> --
> Dave Chinner
> david@fromorbit.com
^ permalink raw reply	[flat|nested] 97+ messages in thread
* Re: [PATCH v2 02/28] vmscan: take at least one pass with shrinkers
  2013-04-10 14:03                         ` JoonSoo Kim
@ 2013-04-11  0:41                           ` Dave Chinner
  2013-04-11  7:27                             ` Wanpeng Li
                                               ` (2 more replies)
  0 siblings, 3 replies; 97+ messages in thread
From: Dave Chinner @ 2013-04-11  0:41 UTC (permalink / raw)
  To: JoonSoo Kim
  Cc: Wanpeng Li, Glauber Costa, Linux Memory Management List,
	linux-fsdevel, containers, Michal Hocko, Johannes Weiner,
	kamezawa.hiroyu, Andrew Morton, Greg Thelen, hughd, yinghan,
	Theodore Ts'o, Al Viro
On Wed, Apr 10, 2013 at 11:03:39PM +0900, JoonSoo Kim wrote:
> Another issue I found is that they don't account "nr_reclaimed" precisely.
> No code checks whether "current->reclaim_state" exists,
> except prune_inode().
That's because prune_inode() can free page cache pages when the
inode mapping is invalidated. Hence it accounts this in addition
to the slab objects being freed.
IOWs, if you have a shrinker that frees pages from the page cache,
you need to do this. Last time I checked, only inode cache reclaim
caused extra page cache reclaim to occur, so most (all?) other
shrinkers do not need to do this.
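A minimal sketch of the accounting pattern being described, assuming the shrinker reports extra freed pages through a per-task reclaim state. The types below are userspace stand-ins (struct task_sketch plays the role of "current"), and account_extra_pages() is a hypothetical helper, not a kernel function.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for the kernel's reclaim_state. */
struct reclaim_state_sketch {
    unsigned long reclaimed_slab;
};

/* Stand-in for the current task; reclaim_state is NULL outside reclaim. */
struct task_sketch {
    struct reclaim_state_sketch *reclaim_state;
};

/*
 * Report pages freed as a side effect of shrinking (for example page
 * cache pages dropped when an inode mapping is invalidated), so the
 * caller's reclaim accounting sees them.
 */
static void account_extra_pages(struct task_sketch *task, unsigned long pages)
{
    assert(task != NULL);
    if (task->reclaim_state)
        task->reclaim_state->reclaimed_slab += pages;
}
```

The NULL check matters because a shrinker can also run when the task is not in a reclaim context and has no reclaim state to update.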
It's just another wart that we need to clean up....
Cheers,
Dave.
-- 
Dave Chinner
david@fromorbit.com
^ permalink raw reply	[flat|nested] 97+ messages in thread
* Re: [PATCH v2 02/28] vmscan: take at least one pass with shrinkers
  2013-04-11  0:41                           ` Dave Chinner
  2013-04-11  7:27                             ` Wanpeng Li
  2013-04-11  7:27                             ` Wanpeng Li
@ 2013-04-11  7:27                             ` Wanpeng Li
  2013-04-11  9:25                               ` Dave Chinner
  2 siblings, 1 reply; 97+ messages in thread
From: Wanpeng Li @ 2013-04-11  7:27 UTC (permalink / raw)
  To: Dave Chinner
  Cc: JoonSoo Kim, Wanpeng Li, Glauber Costa,
	Linux Memory Management List, linux-fsdevel, containers,
	Michal Hocko, Johannes Weiner, kamezawa.hiroyu, Andrew Morton,
	Greg Thelen, hughd, yinghan, Theodore Ts'o, Al Viro
On Thu, Apr 11, 2013 at 10:41:14AM +1000, Dave Chinner wrote:
>On Wed, Apr 10, 2013 at 11:03:39PM +0900, JoonSoo Kim wrote:
>> Another issue I found is that they don't account "nr_reclaimed" precisely.
>> No code checks whether "current->reclaim_state" exists,
>> except prune_inode().
>
>That's because prune_inode() can free page cache pages when the
>inode mapping is invalidated. Hence it accounts this in addition
>to the slab objects being freed.
>
>IOWs, if you have a shrinker that frees pages from the page cache,
>you need to do this. Last time I checked, only inode cache reclaim
>caused extra page cache reclaim to occur, so most (all?) other
>shrinkers do not need to do this.
>
Should we account "nr_reclaimed" for the huge zero page? A large
number (512) of pages is reclaimed there, and accounting them could
throttle direct or kswapd reclaim to avoid reclaiming excess pages. I can
do this work if you think the idea is worthwhile.
Regards,
Wanpeng Li 
>It's just another wart that we need to clean up....
>
>Cheers,
>
>Dave.
>-- 
>Dave Chinner
>david@fromorbit.com
^ permalink raw reply	[flat|nested] 97+ messages in thread
* Re: [PATCH v2 02/28] vmscan: take at least one pass with shrinkers
  2013-04-11  7:27                             ` Wanpeng Li
@ 2013-04-11  9:25                               ` Dave Chinner
  0 siblings, 0 replies; 97+ messages in thread
From: Dave Chinner @ 2013-04-11  9:25 UTC (permalink / raw)
  To: Wanpeng Li
  Cc: JoonSoo Kim, Glauber Costa, Linux Memory Management List,
	linux-fsdevel, containers, Michal Hocko, Johannes Weiner,
	kamezawa.hiroyu, Andrew Morton, Greg Thelen, hughd, yinghan,
	Theodore Ts'o, Al Viro
On Thu, Apr 11, 2013 at 03:27:30PM +0800, Wanpeng Li wrote:
> On Thu, Apr 11, 2013 at 10:41:14AM +1000, Dave Chinner wrote:
> >On Wed, Apr 10, 2013 at 11:03:39PM +0900, JoonSoo Kim wrote:
> >> Another thing I found is that they don't account "nr_reclaimed" precisely.
> >> There is no code that checks whether "current->reclaim_state" exists,
> >> except prune_inode().
> >
> >That's because prune_inode() can free page cache pages when the
> >inode mapping is invalidated. Hence it accounts this in addition
> >to the slab objects being freed.
> >
> >IOWs, if you have a shrinker that frees pages from the page cache,
> >you need to do this. Last time I checked, only inode cache reclaim
> >caused extra page cache reclaim to occur, so most (all?) other
> >shrinkers do not need to do this.
> >
> 
> Should we account "nr_reclaimed" against the huge zero page? A large
> number (512) of pages is reclaimed, which can throttle direct or
> kswapd reclaim to avoid reclaiming excess pages. I can do this work if
> you think the idea is needed.
I'm not sure. The huge zero page is allocated through:
	zero_page = alloc_pages((GFP_TRANSHUGE | __GFP_ZERO) & ~__GFP_MOVABLE,   
				HPAGE_PMD_ORDER);
which means the pages reclaimed by the shrinker aren't file/anon LRU
pages.  Hence I'm not sure what extra accounting might be useful
here, but accounting them as LRU pages being reclaimed seems wrong.
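For readers following the accounting question, the pattern under discussion can be sketched in userspace C roughly as below. This is only an illustration, not the kernel code: struct task_model, account_freed_pages() and reclaim_pass_model() are hypothetical names, and only the reclaim_state/reclaimed_slab pair is meant to mirror the fields referred to above. A shrinker that frees pages beyond its own slab objects adds them to the current task's reclaim_state, if one is installed, and the reclaim path later folds that count into nr_reclaimed.

```c
#include <assert.h>
#include <stddef.h>

/* Modelled on the kernel's per-task reclaim bookkeeping (simplified). */
struct reclaim_state {
	unsigned long reclaimed_slab;
};

/* Hypothetical stand-in for the "current" task. */
struct task_model {
	struct reclaim_state *reclaim_state;
};

/*
 * What a shrinker does when it frees pages in addition to slab objects:
 * account them only if the caller installed a reclaim_state.
 */
void account_freed_pages(struct task_model *task, unsigned long nr_pages)
{
	assert(task != NULL);
	if (task->reclaim_state)
		task->reclaim_state->reclaimed_slab += nr_pages;
}

/*
 * Simplified reclaim pass: install a reclaim_state, let the shrinker run,
 * then fold whatever it reported into nr_reclaimed.
 */
unsigned long reclaim_pass_model(struct task_model *task,
				 unsigned long pages_freed_by_shrinker)
{
	struct reclaim_state rs = { .reclaimed_slab = 0 };
	unsigned long nr_reclaimed = 0;

	task->reclaim_state = &rs;
	account_freed_pages(task, pages_freed_by_shrinker);
	nr_reclaimed += rs.reclaimed_slab;
	task->reclaim_state = NULL;

	return nr_reclaimed;
}
```

Outside a reclaim pass no reclaim_state is installed, so the accounting call is a no-op. Whether the 512 pages of the huge zero page belong in this counter at all is exactly the open question here.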
FWIW, the reclaim of a single global object by a shrinker is not
really a use case the shrinkers were designed for, so I suspect that
anything we try to do right now within the current framework will
just be a hack.
I suspect that what we need to do is add the current zone reclaim
priority to the shrinker control structure (like has been done with
the nodemask) so that objects like this can be considered for
removal at a specific reclaim priority level rather than trying to
use scan/count trickery to get where we want to be.
Perhaps we need a shrinker->shrink_priority method that is called just
once when the reclaim priority is high enough to trigger it. i.e.
all these "do something special when memory reclaim is struggling to
make progress" operations set the priority at which they get called
and every time shrink_slab() is then called with that priority (or
higher) the shrinker->shrink_priority method is called just once?
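A rough userspace sketch of that idea follows, purely to make the proposal concrete. Nothing here is an existing kernel interface: the priority field in the control structure, the trigger_priority field, the shrink_priority callback and all function names are hypothetical. It assumes the vmscan convention that reclaim priority starts at DEF_PRIORITY (12) and counts down toward 0 as reclaim struggles, so "that priority (or higher)" is read as "numerically at or below the trigger value".

```c
#include <assert.h>
#include <stddef.h>

#define DEF_PRIORITY 12	/* reclaim starts here and counts down toward 0 */

/* Hypothetical: control structure carrying the current reclaim priority. */
struct shrink_control_model {
	int priority;
};

/* Hypothetical shrinker with a priority-triggered, call-once method. */
struct shrinker_model {
	int trigger_priority;	/* fire when priority <= this value */
	int has_object;		/* the single global object, e.g. a huge zero page */
	unsigned long calls;	/* how often shrink_priority was invoked */
	unsigned long (*shrink_priority)(struct shrinker_model *s,
					 struct shrink_control_model *sc);
};

/* Frees the single object if it is still present; returns pages freed. */
unsigned long free_single_object(struct shrinker_model *s,
				 struct shrink_control_model *sc)
{
	(void)sc;
	s->calls++;
	if (!s->has_object)
		return 0;
	s->has_object = 0;
	return 512;
}

/* One shrink_slab()-like pass: each eligible shrinker is called exactly once. */
unsigned long shrink_slab_pass(struct shrinker_model **shrinkers, size_t n,
			       struct shrink_control_model *sc)
{
	unsigned long freed = 0;

	assert(sc != NULL);
	for (size_t i = 0; i < n; i++) {
		struct shrinker_model *s = shrinkers[i];

		if (s->shrink_priority && sc->priority <= s->trigger_priority)
			freed += s->shrink_priority(s, sc);
	}
	return freed;
}

/* Walk priorities from DEF_PRIORITY down to stop_priority, as reclaim would. */
unsigned long reclaim_loop(struct shrinker_model *s, int stop_priority)
{
	struct shrinker_model *list[] = { s };
	unsigned long freed = 0;

	for (int prio = DEF_PRIORITY; prio >= stop_priority; prio--) {
		struct shrink_control_model sc = { .priority = prio };

		freed += shrink_slab_pass(list, 1, &sc);
	}
	return freed;
}
```

With trigger_priority set to 2, a loop that stops at priority 5 never calls the method. A loop that runs down to 0 calls it once per pass at priorities 2, 1 and 0, and only the first of those calls finds anything to free. There is no scan/count arithmetic involved, which is the point of the suggestion.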
Cheers,
Dave.
-- 
Dave Chinner
david@fromorbit.com
end of thread, other threads:[~2013-04-11  9:25 UTC | newest]
Thread overview: 97+ messages
2013-03-29  9:13 [PATCH v2 00/28] memcg-aware slab shrinking Glauber Costa
2013-03-29  9:13 ` [PATCH v2 01/28] super: fix calculation of shrinkable objects for small numbers Glauber Costa
2013-04-01  7:16   ` Kamezawa Hiroyuki
2013-03-29  9:13 ` [PATCH v2 02/28] vmscan: take at least one pass with shrinkers Glauber Costa
2013-04-01  7:26   ` Kamezawa Hiroyuki
2013-04-01  8:10     ` Glauber Costa
2013-04-10  5:09       ` Ric Mason
2013-04-10  7:32         ` Glauber Costa
2013-04-10  9:19         ` Dave Chinner
2013-04-08  8:42   ` Joonsoo Kim
2013-04-08  8:47     ` Glauber Costa
2013-04-08  9:01       ` Joonsoo Kim
2013-04-08  9:05         ` Glauber Costa
2013-04-09  0:55           ` Joonsoo Kim
2013-04-09  1:29             ` Dave Chinner
2013-04-09  2:05               ` Joonsoo Kim
2013-04-09  7:43                 ` Glauber Costa
2013-04-09  9:08                   ` Joonsoo Kim
2013-04-09 12:30                 ` Dave Chinner
2013-04-10  2:51                   ` Joonsoo Kim
2013-04-10  7:30                     ` Glauber Costa
2013-04-10  8:19                       ` Joonsoo Kim
2013-04-10  8:46                     ` Wanpeng Li
2013-04-10  8:46                     ` Wanpeng Li
2013-04-10 10:07                       ` Dave Chinner
2013-04-10 14:03                         ` JoonSoo Kim
2013-04-11  0:41                           ` Dave Chinner
2013-04-11  7:27                             ` Wanpeng Li
2013-04-11  7:27                             ` Wanpeng Li
2013-04-11  7:27                             ` Wanpeng Li
2013-04-11  9:25                               ` Dave Chinner
     [not found]                     ` <20130410025115.GA5872-Hm3cg6mZ9cc@public.gmane.org>
2013-04-10  8:46                       ` Wanpeng Li
2013-03-29  9:13 ` [PATCH v2 03/28] dcache: convert dentry_stat.nr_unused to per-cpu counters Glauber Costa
2013-04-05  1:09   ` Greg Thelen
2013-04-05  1:15     ` Dave Chinner
2013-04-08  9:14       ` Glauber Costa
2013-04-08 13:18         ` Glauber Costa
2013-04-08 23:26         ` Dave Chinner
2013-04-09  8:02           ` Glauber Costa
2013-04-09 12:47             ` Dave Chinner
2013-03-29  9:13 ` [PATCH v2 04/28] dentry: move to per-sb LRU locks Glauber Costa
2013-03-29  9:13 ` [PATCH v2 05/28] dcache: remove dentries from LRU before putting on dispose list Glauber Costa
2013-04-03  6:51   ` Sha Zhengju
2013-04-03  8:55     ` Glauber Costa
2013-04-04  6:19     ` Dave Chinner
2013-04-04  6:56       ` Glauber Costa
2013-03-29  9:13 ` [PATCH v2 06/28] mm: new shrinker API Glauber Costa
2013-04-05  1:09   ` Greg Thelen
2013-03-29  9:13 ` [PATCH v2 07/28] shrinker: convert superblock shrinkers to new API Glauber Costa
2013-03-29  9:13 ` [PATCH v2 08/28] list: add a new LRU list type Glauber Costa
2013-04-04 21:53   ` Greg Thelen
2013-04-05  1:20     ` Dave Chinner
2013-04-05  8:01       ` Glauber Costa
2013-04-06  0:04         ` Dave Chinner
2013-03-29  9:13 ` [PATCH v2 09/28] inode: convert inode lru list to generic lru list code Glauber Costa
2013-03-29  9:13 ` [PATCH v2 10/28] dcache: convert to use new lru list infrastructure Glauber Costa
2013-04-08 13:14   ` Glauber Costa
2013-04-08 23:28     ` Dave Chinner
2013-03-29  9:13 ` [PATCH v2 11/28] list_lru: per-node " Glauber Costa
2013-03-29  9:13 ` [PATCH v2 12/28] shrinker: add node awareness Glauber Costa
2013-03-29  9:13 ` [PATCH v2 13/28] fs: convert inode and dentry shrinking to be node aware Glauber Costa
2013-03-29  9:13 ` [PATCH v2 14/28] xfs: convert buftarg LRU to generic code Glauber Costa
2013-03-29  9:13 ` [PATCH v2 15/28] xfs: convert dquot cache lru to list_lru Glauber Costa
2013-03-29  9:13 ` [PATCH v2 16/28] fs: convert fs shrinkers to new scan/count API Glauber Costa
2013-03-29  9:13 ` [PATCH v2 17/28] drivers: convert shrinkers to new count/scan API Glauber Costa
2013-03-29  9:14 ` [PATCH v2 18/28] shrinker: convert remaining shrinkers to " Glauber Costa
2013-03-29  9:14 ` [PATCH v2 19/28] hugepage: convert huge zero page shrinker to new shrinker API Glauber Costa
2013-03-29  9:14 ` [PATCH v2 20/28] shrinker: Kill old ->shrink API Glauber Costa
2013-03-29  9:14 ` [PATCH v2 21/28] vmscan: also shrink slab in memcg pressure Glauber Costa
2013-04-01  7:46   ` Kamezawa Hiroyuki
2013-04-01  8:51     ` Glauber Costa
2013-04-03 10:11   ` Sha Zhengju
2013-04-03 10:43     ` Glauber Costa
2013-04-04  9:35       ` Sha Zhengju
2013-04-05  8:25         ` Glauber Costa
2013-03-29  9:14 ` [PATCH v2 22/28] memcg,list_lru: duplicate LRUs upon kmemcg creation Glauber Costa
2013-04-01  8:05   ` Kamezawa Hiroyuki
2013-04-01  8:22     ` Glauber Costa
2013-03-29  9:14 ` [PATCH v2 23/28] lru: add an element to a memcg list Glauber Costa
2013-04-01  8:18   ` Kamezawa Hiroyuki
2013-04-01  8:29     ` Glauber Costa
2013-03-29  9:14 ` [PATCH v2 24/28] list_lru: also include memcg lists in counts and scans Glauber Costa
2013-03-29  9:14 ` [PATCH v2 25/28] list_lru: per-memcg walks Glauber Costa
2013-03-29  9:14 ` [PATCH v2 26/28] memcg: per-memcg kmem shrinking Glauber Costa
2013-04-01  8:31   ` Kamezawa Hiroyuki
2013-04-01  8:48     ` Glauber Costa
2013-04-01  9:01       ` Kamezawa Hiroyuki
2013-04-01  9:14         ` Glauber Costa
2013-04-01  9:35         ` Kamezawa Hiroyuki
2013-03-29  9:14 ` [PATCH v2 27/28] list_lru: reclaim proportionaly between memcgs and nodes Glauber Costa
2013-03-29  9:14 ` [PATCH v2 28/28] super: targeted memcg reclaim Glauber Costa
2013-04-01 12:38 ` [PATCH v2 00/28] memcg-aware slab shrinking Serge Hallyn
2013-04-01 12:45   ` Glauber Costa
2013-04-01 14:12     ` Serge Hallyn
2013-04-08  8:11       ` Glauber Costa
2013-04-02  4:58   ` Dave Chinner
2013-04-02  7:55     ` Glauber Costa