[PATCH v10 00/35] kmemcg shrinkers

* [PATCH v10 00/35] kmemcg shrinkers
@ 2013-06-03 19:29 Glauber Costa
  2013-06-03 19:29 ` [PATCH v10 02/35] super: fix calculation of shrinkable objects for small numbers Glauber Costa
                   ` (5 more replies)
  0 siblings, 6 replies; 103+ messages in thread
From: Glauber Costa @ 2013-06-03 19:29 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Mel Gorman, Dave Chinner,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, cgroups-u79uwXL29TY76Z2rM5mHXA,
	kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A, Michal Hocko,
	Johannes Weiner, hughd-hpIqsD4AKlfQT0dZR+AlfA, Greg Thelen

Andrew,

This submission contains one small bug fix over the last one. I have been
testing it regularly and believe it is ready for merging. I have follow-up
patches for this series, with a few improvements (namely: dynamically sized
list_lru node arrays, memcg flush-at-destruction, kmemcg shrinking when setting
limit < usage). But since this series is already quite mature and very
extensive, I don't believe that adding new patches now would get them the
appropriate level of review. So please advise me if there is anything crucial
missing in here. Thanks!

Hi,

This patchset implements targeted shrinking for memcg when kmem limits are
present. So far, we've been accounting kernel objects but failing allocations
when short of memory. This is because our only option would be to call the
global shrinker, depleting objects from all caches and breaking isolation.

The main idea is to associate per-memcg lists with each of the LRUs. The main
LRU still provides a single entry point and when adding or removing an element
from the LRU, we use the page information to figure out which memcg it belongs
to and relay it to the right list.
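
The routing idea above can be sketched in plain userspace C. This is only an
illustration under stated assumptions, not the code from this series: the
names (struct memcg_lru, memcg_lru_add, MAX_MEMCGS, NO_MEMCG) are hypothetical,
the array is fixed-size, there is no locking, and an integer memcg_id stands
in for the owner that the real code derives from the page.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_MEMCGS 4      /* hypothetical fixed bound, for the sketch only */
#define NO_MEMCG   (-1)   /* object is not charged to any memcg */

struct lru_item {
	struct lru_item *prev, *next;
	int memcg_id;         /* stand-in for the owner derived from the page */
};

struct lru_list {
	struct lru_item head; /* sentinel of a circular doubly linked list */
	long nr_items;
};

struct memcg_lru {
	struct lru_list global;             /* objects with no memcg */
	struct lru_list memcg[MAX_MEMCGS];  /* one list per memcg */
};

void lru_list_init(struct lru_list *l)
{
	l->head.prev = l->head.next = &l->head;
	l->nr_items = 0;
}

void memcg_lru_init(struct memcg_lru *lru)
{
	lru_list_init(&lru->global);
	for (int i = 0; i < MAX_MEMCGS; i++)
		lru_list_init(&lru->memcg[i]);
}

/* Pick the list an item belongs to; callers never choose a list themselves. */
struct lru_list *lru_list_of(struct memcg_lru *lru, const struct lru_item *it)
{
	if (it->memcg_id == NO_MEMCG)
		return &lru->global;
	assert(it->memcg_id >= 0 && it->memcg_id < MAX_MEMCGS);
	return &lru->memcg[it->memcg_id];
}

/* Single entry point for insertion: append at the tail of the right list. */
void memcg_lru_add(struct memcg_lru *lru, struct lru_item *it)
{
	struct lru_list *l = lru_list_of(lru, it);

	it->prev = l->head.prev;
	it->next = &l->head;
	l->head.prev->next = it;
	l->head.prev = it;
	l->nr_items++;
}

/* Single entry point for removal: unlink and fix up the owning list's count. */
void memcg_lru_del(struct memcg_lru *lru, struct lru_item *it)
{
	struct lru_list *l = lru_list_of(lru, it);

	it->prev->next = it->next;
	it->next->prev = it->prev;
	it->prev = it->next = it;
	l->nr_items--;
}
```

With this shape, targeted reclaim for one memcg only has to walk
lru->memcg[id], while callers adding or removing objects keep using one
interface.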

Base work:
==========

Please note that this builds upon the recent work from Dave Chinner that
sanitizes the LRU shrinking API and makes the shrinkers node-aware. Node
awareness is not *strictly* needed for my work, but I still perceive it
as an advantage. The API unification is a major need, and I build upon it
heavily. That allows us to manipulate the LRUs without knowledge of the
underlying objects with ease. This time, I am including that work here as
a baseline.

Main changes from *v9:
* Fixed iteration over all memcgs from list_lru side.

Main changes from *v8:
* fixed xfs umount bug
* rebase to current linux-next

Main changes from *v7:
* Fixed races for memcg
* Enhanced memcg hierarchy walks during global pressure (we were walking only
  the global list, not all memcgs)

Main changes from *v6:
* Change nr_unused_dentry to long, Dave reported an int not being enough
* Fixed shrink_list leak, by Dave
* LRU API now gets a node id, instead of a node mask.
* per-node deferred work, leading to smoother behavior

Main changes from *v5:
* Rebased to linux-next, and fix the conflicts with the dcache.
* Make sure LRU_RETRY only retries once
* Prevent the bcache shrinker from scanning the caches when disabled (by
  returning 0 in the count function)
* Fix i915 return code when mutex cannot be acquired.
* Only scan less-than-batch objects in memcg scenarios
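
The bcache item above relies on the count/scan split of the new shrinker API:
one callback reports how many objects could be freed, another frees them, and
a count of 0 means there is nothing to scan. A minimal userspace sketch of
that contract follows. It assumes hypothetical names (struct toy_cache,
toy_count, toy_scan, toy_shrink) and is not the kernel's struct shrinker or
the bcache code.

```c
#include <stdbool.h>

/* Hypothetical cache state; not a kernel structure. */
struct toy_cache {
	unsigned long nr_freeable;
	bool shrinker_disabled;
};

/* Count callback: report freeable objects, or 0 to ask not to be scanned. */
unsigned long toy_count(const struct toy_cache *c)
{
	if (c->shrinker_disabled)
		return 0;
	return c->nr_freeable;
}

/* Scan callback: free up to nr_to_scan objects and return how many were freed. */
unsigned long toy_scan(struct toy_cache *c, unsigned long nr_to_scan)
{
	unsigned long freed = nr_to_scan < c->nr_freeable ? nr_to_scan
							  : c->nr_freeable;

	c->nr_freeable -= freed;
	return freed;
}

/* Caller side: only invoke the scan callback when count reports work to do. */
unsigned long toy_shrink(struct toy_cache *c, unsigned long nr_to_scan)
{
	if (toy_count(c) == 0)
		return 0;
	return toy_scan(c, nr_to_scan);
}
```

Keeping the "disabled" decision in the count path means the scan path never
has to know about it.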

Main changes from *v4:
* Fixed a bug in user-generated memcg pressure
* Fixed overly aggressive slab shrinker behavior spotted by Mel Gorman
* Various other fixes and comments by Mel Gorman

Main changes from *v3:
* Merged suggestions from mailing list.
* Removed the memcg-walking code from LRU. vmscan now drives all the hierarchy
  decisions, which makes more sense
* lazily free the old memcg arrays (they now need to be saved in struct lru).
  Since we need to call synchronize_rcu, calling it for every LRU can become
  expensive
* Moved the dead memcg shrinker to vmpressure. Already independently sent to
  linux-mm for review.
* Changed locking convention for LRU_RETRY. It now needs to return locked, which
  silences warnings about possible lock imbalance (although previous code was
  correct)

Main changes from *v2:
* shrink dead memcgs when global pressure kicks in. Uses the new lru API.
* bugfixes and comments from the mailing list.
* proper hierarchy-aware walk in shrink_slab.

Main changes from *v1:
* merged comments from the mailing list
* reworked lru-memcg API
* effective proportional shrinking
* sanitized locking on the memcg side
* bill user memory first when kmem == umem
* various bugfixes

Dave Chinner (18):
  dcache: convert dentry_stat.nr_unused to per-cpu counters
  dentry: move to per-sb LRU locks
  dcache: remove dentries from LRU before putting on dispose list
  mm: new shrinker API
  shrinker: convert superblock shrinkers to new API
  list: add a new LRU list type
  inode: convert inode lru list to generic lru list code.
  dcache: convert to use new lru list infrastructure
  list_lru: per-node list infrastructure
  shrinker: add node awareness
  fs: convert inode and dentry shrinking to be node aware
  xfs: convert buftarg LRU to generic code
  xfs: rework buffer dispose list tracking
  xfs: convert dquot cache lru to list_lru
  fs: convert fs shrinkers to new scan/count API
  drivers: convert shrinkers to new count/scan API
  shrinker: convert remaining shrinkers to count/scan API
  shrinker: Kill old ->shrink API.

Glauber Costa (17):
  fs: bump inode and dentry counters to long
  super: fix calculation of shrinkable objects for small numbers
  vmscan: per-node deferred work
  list_lru: per-node API
  i915: bail out earlier when shrinker cannot acquire mutex
  hugepage: convert huge zero page shrinker to new shrinker API
  vmscan: also shrink slab in memcg pressure
  memcg,list_lru: duplicate LRUs upon kmemcg creation
  lru: add an element to a memcg list
  list_lru: per-memcg walks
  memcg: per-memcg kmem shrinking
  memcg: scan cache objects hierarchically
  vmscan: take at least one pass with shrinkers
  super: targeted memcg reclaim
  memcg: move initialization to memcg creation
  vmpressure: in-kernel notifications
  memcg: reap dead memcgs upon global memory pressure.

 arch/x86/kvm/mmu.c                        |  28 +-
 drivers/gpu/drm/i915/i915_dma.c           |   4 +-
 drivers/gpu/drm/i915/i915_gem.c           |  71 +++--
 drivers/gpu/drm/ttm/ttm_page_alloc.c      |  48 ++--
 drivers/gpu/drm/ttm/ttm_page_alloc_dma.c  |  55 ++--
 drivers/md/bcache/btree.c                 |  43 +--
 drivers/md/bcache/sysfs.c                 |   2 +-
 drivers/md/dm-bufio.c                     |  65 +++--
 drivers/staging/android/ashmem.c          |  46 +++-
 drivers/staging/android/lowmemorykiller.c |  40 +--
 drivers/staging/zcache/zcache-main.c      |  29 +-
 fs/dcache.c                               | 259 +++++++++++-------
 fs/drop_caches.c                          |   1 +
 fs/ext4/extents_status.c                  |  30 ++-
 fs/gfs2/glock.c                           |  30 ++-
 fs/gfs2/main.c                            |   3 +-
 fs/gfs2/quota.c                           |  14 +-
 fs/gfs2/quota.h                           |   4 +-
 fs/inode.c                                | 194 ++++++-------
 fs/internal.h                             |   7 +-
 fs/mbcache.c                              |  53 ++--
 fs/nfs/dir.c                              |  20 +-
 fs/nfs/internal.h                         |   4 +-
 fs/nfs/super.c                            |   3 +-
 fs/nfsd/nfscache.c                        |  31 ++-
 fs/quota/dquot.c                          |  39 ++-
 fs/super.c                                | 104 ++++---
 fs/ubifs/shrinker.c                       |  20 +-
 fs/ubifs/super.c                          |   3 +-
 fs/ubifs/ubifs.h                          |   3 +-
 fs/xfs/xfs_buf.c                          | 249 ++++++++---------
 fs/xfs/xfs_buf.h                          |  17 +-
 fs/xfs/xfs_dquot.c                        |   7 +-
 fs/xfs/xfs_icache.c                       |   4 +-
 fs/xfs/xfs_icache.h                       |   2 +-
 fs/xfs/xfs_qm.c                           | 277 +++++++++----------
 fs/xfs/xfs_qm.h                           |   4 +-
 fs/xfs/xfs_super.c                        |  12 +-
 include/linux/dcache.h                    |  14 +-
 include/linux/fs.h                        |  25 +-
 include/linux/list_lru.h                  | 162 +++++++++++
 include/linux/memcontrol.h                |  45 ++++
 include/linux/shrinker.h                  |  72 ++++-
 include/linux/swap.h                      |   2 +
 include/linux/vmpressure.h                |   6 +
 include/trace/events/vmscan.h             |   4 +-
 include/uapi/linux/fs.h                   |   6 +-
 kernel/sysctl.c                           |   6 +-
 lib/Makefile                              |   2 +-
 lib/list_lru.c                            | 407 ++++++++++++++++++++++++++++
 mm/huge_memory.c                          |  17 +-
 mm/memcontrol.c                           | 433 ++++++++++++++++++++++++++----
 mm/memory-failure.c                       |   2 +
 mm/slab_common.c                          |   1 -
 mm/vmpressure.c                           |  52 +++-
 mm/vmscan.c                               | 380 +++++++++++++++++++-------
 net/sunrpc/auth.c                         |  45 +++-
 57 files changed, 2513 insertions(+), 993 deletions(-)
 create mode 100644 include/linux/list_lru.h
 create mode 100644 lib/list_lru.c

-- 
1.8.1.4


* [PATCH v10 02/35] super: fix calculation of shrinkable objects for small numbers
  2013-06-03 19:29 [PATCH v10 00/35] kmemcg shrinkers Glauber Costa
@ 2013-06-03 19:29 ` Glauber Costa
       [not found] ` <1370287804-3481-1-git-send-email-glommer-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org>
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 103+ messages in thread
From: Glauber Costa @ 2013-06-03 19:29 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-fsdevel, Mel Gorman, Dave Chinner, linux-mm, cgroups,
	kamezawa.hiroyu, Michal Hocko, Johannes Weiner, hughd,
	Greg Thelen, Glauber Costa, Theodore Ts'o, Al Viro

The sysctl knob sysctl_vfs_cache_pressure is used to determine what
percentage of the shrinkable objects in our cache we should actively try
to shrink.

It works great in situations in which we have many objects (at least
more than 100), because the approximation errors will be negligible. But
if this is not the case, especially when total_objects < 100, we may end
up concluding that we have no objects at all (total / 100 = 0, if total
< 100).
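
The rounding problem can be reproduced in userspace. The sketch below is an
illustration only: mult_frac_ul() is a function rendition of the kernel's
mult_frac() macro written for this example, pressure_divide_first() is a
hypothetical name for the old expression, and the global is a stand-in for
the real sysctl.

```c
#include <assert.h>

/* Stand-in for the sysctl; 100 is the default value of vfs_cache_pressure. */
int sysctl_vfs_cache_pressure = 100;

/* Old form: divides first, so any val < 100 collapses to 0. */
unsigned long pressure_divide_first(unsigned long val)
{
	return (val / 100) * sysctl_vfs_cache_pressure;
}

/*
 * x * numer / denom, computed from the quotient and remainder of x / denom.
 * The remainder term keeps the precision that divide-first loses, and the
 * full product x * numer is never formed.
 */
unsigned long mult_frac_ul(unsigned long x, unsigned long numer,
			   unsigned long denom)
{
	unsigned long quot, rem;

	assert(denom != 0);
	quot = x / denom;
	rem = x % denom;
	return quot * numer + (rem * numer) / denom;
}

/* Same shape as the helper this patch adds to include/linux/dcache.h. */
unsigned long vfs_pressure_ratio(unsigned long val)
{
	return mult_frac_ul(val, sysctl_vfs_cache_pressure, 100);
}
```

With 50 objects and the default pressure of 100, the divide-first form
reports 0 objects while vfs_pressure_ratio() reports 50. With 250 objects
the two forms report 200 and 250 respectively.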

This is certainly not the biggest killer in the world, but may matter in
very low kernel memory situations.

[ v2: fix it for all occurrences of sysctl_vfs_cache_pressure ]

Signed-off-by: Glauber Costa <glommer@openvz.org>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Mel Gorman <mgorman@suse.de>
CC: Dave Chinner <david@fromorbit.com>
CC: "Theodore Ts'o" <tytso@mit.edu>
CC: Al Viro <viro@zeniv.linux.org.uk>
---
 fs/gfs2/glock.c        |  2 +-
 fs/gfs2/quota.c        |  2 +-
 fs/mbcache.c           |  2 +-
 fs/nfs/dir.c           |  2 +-
 fs/quota/dquot.c       |  5 ++---
 fs/super.c             | 14 +++++++-------
 fs/xfs/xfs_qm.c        |  2 +-
 include/linux/dcache.h |  4 ++++
 8 files changed, 18 insertions(+), 15 deletions(-)

diff --git a/fs/gfs2/glock.c b/fs/gfs2/glock.c
index 9435384..3bd2748 100644
--- a/fs/gfs2/glock.c
+++ b/fs/gfs2/glock.c
@@ -1463,7 +1463,7 @@ static int gfs2_shrink_glock_memory(struct shrinker *shrink,
 		gfs2_scan_glock_lru(sc->nr_to_scan);
 	}
 
-	return (atomic_read(&lru_count) / 100) * sysctl_vfs_cache_pressure;
+	return vfs_pressure_ratio(atomic_read(&lru_count));
 }
 
 static struct shrinker glock_shrinker = {
diff --git a/fs/gfs2/quota.c b/fs/gfs2/quota.c
index c253b13..f9f4077 100644
--- a/fs/gfs2/quota.c
+++ b/fs/gfs2/quota.c
@@ -114,7 +114,7 @@ int gfs2_shrink_qd_memory(struct shrinker *shrink, struct shrink_control *sc)
 	spin_unlock(&qd_lru_lock);
 
 out:
-	return (atomic_read(&qd_lru_count) * sysctl_vfs_cache_pressure) / 100;
+	return vfs_pressure_ratio(atomic_read(&qd_lru_count));
 }
 
 static u64 qd2index(struct gfs2_quota_data *qd)
diff --git a/fs/mbcache.c b/fs/mbcache.c
index 8c32ef3..5eb0476 100644
--- a/fs/mbcache.c
+++ b/fs/mbcache.c
@@ -189,7 +189,7 @@ mb_cache_shrink_fn(struct shrinker *shrink, struct shrink_control *sc)
 	list_for_each_entry_safe(entry, tmp, &free_list, e_lru_list) {
 		__mb_cache_entry_forget(entry, gfp_mask);
 	}
-	return (count / 100) * sysctl_vfs_cache_pressure;
+	return vfs_pressure_ratio(count);
 }
 
 
diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c
index c662ff6..a6a3d05 100644
--- a/fs/nfs/dir.c
+++ b/fs/nfs/dir.c
@@ -1978,7 +1978,7 @@ remove_lru_entry:
 	}
 	spin_unlock(&nfs_access_lru_lock);
 	nfs_access_free_list(&head);
-	return (atomic_long_read(&nfs_access_nr_entries) / 100) * sysctl_vfs_cache_pressure;
+	return vfs_pressure_ratio(atomic_long_read(&nfs_access_nr_entries));
 }
 
 static void __nfs_access_zap_cache(struct nfs_inode *nfsi, struct list_head *head)
diff --git a/fs/quota/dquot.c b/fs/quota/dquot.c
index 3e64169..762b09c 100644
--- a/fs/quota/dquot.c
+++ b/fs/quota/dquot.c
@@ -719,9 +719,8 @@ static int shrink_dqcache_memory(struct shrinker *shrink,
 		prune_dqcache(nr);
 		spin_unlock(&dq_list_lock);
 	}
-	return ((unsigned)
-		percpu_counter_read_positive(&dqstats.counter[DQST_FREE_DQUOTS])
-		/100) * sysctl_vfs_cache_pressure;
+	return vfs_pressure_ratio(
+	percpu_counter_read_positive(&dqstats.counter[DQST_FREE_DQUOTS]));
 }
 
 static struct shrinker dqcache_shrinker = {
diff --git a/fs/super.c b/fs/super.c
index 7465d43..2a37fd6 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -82,13 +82,13 @@ static int prune_super(struct shrinker *shrink, struct shrink_control *sc)
 		int	inodes;
 
 		/* proportion the scan between the caches */
-		dentries = (sc->nr_to_scan * sb->s_nr_dentry_unused) /
-							total_objects;
-		inodes = (sc->nr_to_scan * sb->s_nr_inodes_unused) /
-							total_objects;
+		dentries = mult_frac(sc->nr_to_scan, sb->s_nr_dentry_unused,
+							total_objects);
+		inodes = mult_frac(sc->nr_to_scan, sb->s_nr_inodes_unused,
+							total_objects);
 		if (fs_objects)
-			fs_objects = (sc->nr_to_scan * fs_objects) /
-							total_objects;
+			fs_objects = mult_frac(sc->nr_to_scan, fs_objects,
+							total_objects);
 		/*
 		 * prune the dcache first as the icache is pinned by it, then
 		 * prune the icache, followed by the filesystem specific caches
@@ -104,7 +104,7 @@ static int prune_super(struct shrinker *shrink, struct shrink_control *sc)
 				sb->s_nr_inodes_unused + fs_objects;
 	}
 
-	total_objects = (total_objects / 100) * sysctl_vfs_cache_pressure;
+	total_objects = vfs_pressure_ratio(total_objects);
 	drop_super(sb);
 	return total_objects;
 }
diff --git a/fs/xfs/xfs_qm.c b/fs/xfs/xfs_qm.c
index f41702b..7ade175 100644
--- a/fs/xfs/xfs_qm.c
+++ b/fs/xfs/xfs_qm.c
@@ -1585,7 +1585,7 @@ xfs_qm_shake(
 	}
 
 out:
-	return (qi->qi_lru_count / 100) * sysctl_vfs_cache_pressure;
+	return vfs_pressure_ratio(qi->qi_lru_count);
 }
 
 /*
diff --git a/include/linux/dcache.h b/include/linux/dcache.h
index 1a82bdb..bd08285 100644
--- a/include/linux/dcache.h
+++ b/include/linux/dcache.h
@@ -411,4 +411,8 @@ static inline bool d_mountpoint(struct dentry *dentry)
 
 extern int sysctl_vfs_cache_pressure;
 
+static inline unsigned long vfs_pressure_ratio(unsigned long val)
+{
+	return mult_frac(val, sysctl_vfs_cache_pressure, 100);
+}
 #endif	/* __LINUX_DCACHE_H */
-- 
1.8.1.4

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>


[parent not found: <1370287804-3481-1-git-send-email-glommer-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org>]

* [PATCH v10 01/35] fs: bump inode and dentry counters to long
       [not found] ` <1370287804-3481-1-git-send-email-glommer-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org>
@ 2013-06-03 19:29   ` Glauber Costa
  2013-06-03 19:29   ` [PATCH v10 03/35] dcache: convert dentry_stat.nr_unused to per-cpu counters Glauber Costa
                     ` (29 subsequent siblings)
  30 siblings, 0 replies; 103+ messages in thread
From: Glauber Costa @ 2013-06-03 19:29 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Mel Gorman, Dave Chinner,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, cgroups-u79uwXL29TY76Z2rM5mHXA,
	kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A, Michal Hocko,
	Johannes Weiner, hughd-hpIqsD4AKlfQT0dZR+AlfA, Greg Thelen,
	Glauber Costa, Dave Chinner

There are situations in very large machines in which we can have a large
quantity of dirty inodes, unused dentries, etc. This is particularly
true when umounting a filesystem, where every live object will eventually
be discarded.

Dave Chinner reported a problem with this while experimenting with the
shrinker revamp patchset. So we believe it is time for a change. This
patch just converts the ints to longs. Machines where it matters should have a
big long anyway.

Signed-off-by: Glauber Costa <glommer-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org>
CC: Dave Chinner <dchinner-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
---
 fs/dcache.c             |  8 ++++----
 fs/inode.c              | 18 +++++++++---------
 fs/internal.h           |  2 +-
 include/linux/dcache.h  | 10 +++++-----
 include/linux/fs.h      |  4 ++--
 include/uapi/linux/fs.h |  6 +++---
 kernel/sysctl.c         |  6 +++---
 7 files changed, 27 insertions(+), 27 deletions(-)

diff --git a/fs/dcache.c b/fs/dcache.c
index f09b908..aca4e4b 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -117,13 +117,13 @@ struct dentry_stat_t dentry_stat = {
 	.age_limit = 45,
 };
 
-static DEFINE_PER_CPU(unsigned int, nr_dentry);
+static DEFINE_PER_CPU(long, nr_dentry);
 
 #if defined(CONFIG_SYSCTL) && defined(CONFIG_PROC_FS)
-static int get_nr_dentry(void)
+static long get_nr_dentry(void)
 {
 	int i;
-	int sum = 0;
+	long sum = 0;
 	for_each_possible_cpu(i)
 		sum += per_cpu(nr_dentry, i);
 	return sum < 0 ? 0 : sum;
@@ -133,7 +133,7 @@ int proc_nr_dentry(ctl_table *table, int write, void __user *buffer,
 		   size_t *lenp, loff_t *ppos)
 {
 	dentry_stat.nr_dentry = get_nr_dentry();
-	return proc_dointvec(table, write, buffer, lenp, ppos);
+	return proc_doulongvec_minmax(table, write, buffer, lenp, ppos);
 }
 #endif
 
diff --git a/fs/inode.c b/fs/inode.c
index 00d5fc3..ff29765 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -70,33 +70,33 @@ EXPORT_SYMBOL(empty_aops);
  */
 struct inodes_stat_t inodes_stat;
 
-static DEFINE_PER_CPU(unsigned int, nr_inodes);
-static DEFINE_PER_CPU(unsigned int, nr_unused);
+static DEFINE_PER_CPU(unsigned long, nr_inodes);
+static DEFINE_PER_CPU(unsigned long, nr_unused);
 
 static struct kmem_cache *inode_cachep __read_mostly;
 
-static int get_nr_inodes(void)
+static long get_nr_inodes(void)
 {
 	int i;
-	int sum = 0;
+	long sum = 0;
 	for_each_possible_cpu(i)
 		sum += per_cpu(nr_inodes, i);
 	return sum < 0 ? 0 : sum;
 }
 
-static inline int get_nr_inodes_unused(void)
+static inline long get_nr_inodes_unused(void)
 {
 	int i;
-	int sum = 0;
+	long sum = 0;
 	for_each_possible_cpu(i)
 		sum += per_cpu(nr_unused, i);
 	return sum < 0 ? 0 : sum;
 }
 
-int get_nr_dirty_inodes(void)
+long get_nr_dirty_inodes(void)
 {
 	/* not actually dirty inodes, but a wild approximation */
-	int nr_dirty = get_nr_inodes() - get_nr_inodes_unused();
+	long nr_dirty = get_nr_inodes() - get_nr_inodes_unused();
 	return nr_dirty > 0 ? nr_dirty : 0;
 }
 
@@ -109,7 +109,7 @@ int proc_nr_inodes(ctl_table *table, int write,
 {
 	inodes_stat.nr_inodes = get_nr_inodes();
 	inodes_stat.nr_unused = get_nr_inodes_unused();
-	return proc_dointvec(table, write, buffer, lenp, ppos);
+	return proc_doulongvec_minmax(table, write, buffer, lenp, ppos);
 }
 #endif
 
diff --git a/fs/internal.h b/fs/internal.h
index eaa75f7..cd5009f 100644
--- a/fs/internal.h
+++ b/fs/internal.h
@@ -117,7 +117,7 @@ extern void inode_add_lru(struct inode *inode);
  */
 extern void inode_wb_list_del(struct inode *inode);
 
-extern int get_nr_dirty_inodes(void);
+extern long get_nr_dirty_inodes(void);
 extern void evict_inodes(struct super_block *);
 extern int invalidate_inodes(struct super_block *, bool);
 
diff --git a/include/linux/dcache.h b/include/linux/dcache.h
index 1a6bb81..1a82bdb 100644
--- a/include/linux/dcache.h
+++ b/include/linux/dcache.h
@@ -54,11 +54,11 @@ struct qstr {
 #define hashlen_len(hashlen)  ((u32)((hashlen) >> 32))
 
 struct dentry_stat_t {
-	int nr_dentry;
-	int nr_unused;
-	int age_limit;          /* age in seconds */
-	int want_pages;         /* pages requested by system */
-	int dummy[2];
+	long nr_dentry;
+	long nr_unused;
+	long age_limit;          /* age in seconds */
+	long want_pages;         /* pages requested by system */
+	long dummy[2];
 };
 extern struct dentry_stat_t dentry_stat;
 
diff --git a/include/linux/fs.h b/include/linux/fs.h
index f47e43c..204d615 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1265,12 +1265,12 @@ struct super_block {
 	struct list_head	s_mounts;	/* list of mounts; _not_ for fs use */
 	/* s_dentry_lru, s_nr_dentry_unused protected by dcache.c lru locks */
 	struct list_head	s_dentry_lru;	/* unused dentry lru */
-	int			s_nr_dentry_unused;	/* # of dentry on lru */
+	long			s_nr_dentry_unused;	/* # of dentry on lru */
 
 	/* s_inode_lru_lock protects s_inode_lru and s_nr_inodes_unused */
 	spinlock_t		s_inode_lru_lock ____cacheline_aligned_in_smp;
 	struct list_head	s_inode_lru;		/* unused inode lru */
-	int			s_nr_inodes_unused;	/* # of inodes on lru */
+	long			s_nr_inodes_unused;	/* # of inodes on lru */
 
 	struct block_device	*s_bdev;
 	struct backing_dev_info *s_bdi;
diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
index a4ed56c..6c28b61 100644
--- a/include/uapi/linux/fs.h
+++ b/include/uapi/linux/fs.h
@@ -49,9 +49,9 @@ struct files_stat_struct {
 };
 
 struct inodes_stat_t {
-	int nr_inodes;
-	int nr_unused;
-	int dummy[5];		/* padding for sysctl ABI compatibility */
+	long nr_inodes;
+	long nr_unused;
+	long dummy[5];		/* padding for sysctl ABI compatibility */
 };
 
 
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 9edcf45..fb90f7c 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -1456,14 +1456,14 @@ static struct ctl_table fs_table[] = {
 	{
 		.procname	= "inode-nr",
 		.data		= &inodes_stat,
-		.maxlen		= 2*sizeof(int),
+		.maxlen		= 2*sizeof(long),
 		.mode		= 0444,
 		.proc_handler	= proc_nr_inodes,
 	},
 	{
 		.procname	= "inode-state",
 		.data		= &inodes_stat,
-		.maxlen		= 7*sizeof(int),
+		.maxlen		= 7*sizeof(long),
 		.mode		= 0444,
 		.proc_handler	= proc_nr_inodes,
 	},
@@ -1493,7 +1493,7 @@ static struct ctl_table fs_table[] = {
 	{
 		.procname	= "dentry-state",
 		.data		= &dentry_stat,
-		.maxlen		= 6*sizeof(int),
+		.maxlen		= 6*sizeof(long),
 		.mode		= 0444,
 		.proc_handler	= proc_nr_dentry,
 	},
-- 
1.8.1.4


* [PATCH v10 03/35] dcache: convert dentry_stat.nr_unused to per-cpu counters
       [not found] ` <1370287804-3481-1-git-send-email-glommer-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org>
  2013-06-03 19:29   ` [PATCH v10 01/35] fs: bump inode and dentry counters to long Glauber Costa
@ 2013-06-03 19:29   ` Glauber Costa
  2013-06-05 23:07     ` Andrew Morton
  2013-06-03 19:29   ` [PATCH v10 04/35] dentry: move to per-sb LRU locks Glauber Costa
                     ` (28 subsequent siblings)
  30 siblings, 1 reply; 103+ messages in thread
From: Glauber Costa @ 2013-06-03 19:29 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Mel Gorman, Dave Chinner,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, cgroups-u79uwXL29TY76Z2rM5mHXA,
	kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A, Michal Hocko,
	Johannes Weiner, hughd-hpIqsD4AKlfQT0dZR+AlfA, Greg Thelen,
	Dave Chinner

From: Dave Chinner <dchinner-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>

Before we split up the dcache_lru_lock, the unused dentry counter
needs to be made independent of the global dcache_lru_lock. Convert
it to per-cpu counters to do this.
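
For readers unfamiliar with the pattern, here is a userspace sketch of what
such an open-coded per-cpu counter amounts to. It is an approximation under
stated assumptions: a plain array replaces DEFINE_PER_CPU, the CPU index is
passed explicitly instead of using this_cpu_inc()/this_cpu_dec(), and the
names (NR_CPUS_SKETCH, nr_unused_slot, unused_inc, unused_dec, unused_total)
are hypothetical.

```c
#include <assert.h>

#define NR_CPUS_SKETCH 4   /* hypothetical CPU count for the sketch */

/* One slot per CPU; updates touch only the local slot, so no shared lock. */
long nr_unused_slot[NR_CPUS_SKETCH];

void unused_inc(int cpu)
{
	assert(cpu >= 0 && cpu < NR_CPUS_SKETCH);
	nr_unused_slot[cpu]++;
}

void unused_dec(int cpu)
{
	assert(cpu >= 0 && cpu < NR_CPUS_SKETCH);
	nr_unused_slot[cpu]--;
}

/*
 * Readers sum every slot. A single slot may be negative, because an object
 * can be added on one CPU and removed on another, so only the total is
 * meaningful. The total is clamped at 0 in case an unlocked read observes a
 * transiently negative sum.
 */
long unused_total(void)
{
	long sum = 0;

	for (int cpu = 0; cpu < NR_CPUS_SKETCH; cpu++)
		sum += nr_unused_slot[cpu];
	return sum < 0 ? 0 : sum;
}
```

The cost moves from the update path to the read path: increments and
decrements are cheap and local, while every read walks all slots.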

[ v5: comment about possible cpus ]
Signed-off-by: Dave Chinner <dchinner-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
Reviewed-by: Christoph Hellwig <hch-jcswGhMUV9g@public.gmane.org>
Acked-by: Mel Gorman <mgorman-l3A5Bk7waGM@public.gmane.org>
---
 fs/dcache.c | 18 +++++++++++++++---
 1 file changed, 15 insertions(+), 3 deletions(-)

diff --git a/fs/dcache.c b/fs/dcache.c
index aca4e4b..9f2aa96 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -118,8 +118,10 @@ struct dentry_stat_t dentry_stat = {
 };
 
 static DEFINE_PER_CPU(long, nr_dentry);
+static DEFINE_PER_CPU(long, nr_dentry_unused);
 
 #if defined(CONFIG_SYSCTL) && defined(CONFIG_PROC_FS)
+/* scan possible cpus instead of online and avoid worrying about CPU hotplug. */
 static long get_nr_dentry(void)
 {
 	int i;
@@ -129,10 +131,20 @@ static long get_nr_dentry(void)
 	return sum < 0 ? 0 : sum;
 }
 
+static long get_nr_dentry_unused(void)
+{
+	int i;
+	long sum = 0;
+	for_each_possible_cpu(i)
+		sum += per_cpu(nr_dentry_unused, i);
+	return sum < 0 ? 0 : sum;
+}
+
 int proc_nr_dentry(ctl_table *table, int write, void __user *buffer,
 		   size_t *lenp, loff_t *ppos)
 {
 	dentry_stat.nr_dentry = get_nr_dentry();
+	dentry_stat.nr_unused = get_nr_dentry_unused();
 	return proc_doulongvec_minmax(table, write, buffer, lenp, ppos);
 }
 #endif
@@ -312,7 +324,7 @@ static void dentry_lru_add(struct dentry *dentry)
 		spin_lock(&dcache_lru_lock);
 		list_add(&dentry->d_lru, &dentry->d_sb->s_dentry_lru);
 		dentry->d_sb->s_nr_dentry_unused++;
-		dentry_stat.nr_unused++;
+		this_cpu_inc(nr_dentry_unused);
 		spin_unlock(&dcache_lru_lock);
 	}
 }
@@ -322,7 +334,7 @@ static void __dentry_lru_del(struct dentry *dentry)
 	list_del_init(&dentry->d_lru);
 	dentry->d_flags &= ~DCACHE_SHRINK_LIST;
 	dentry->d_sb->s_nr_dentry_unused--;
-	dentry_stat.nr_unused--;
+	this_cpu_dec(nr_dentry_unused);
 }
 
 /*
@@ -343,7 +355,7 @@ static void dentry_lru_move_list(struct dentry *dentry, struct list_head *list)
 	if (list_empty(&dentry->d_lru)) {
 		list_add_tail(&dentry->d_lru, list);
 		dentry->d_sb->s_nr_dentry_unused++;
-		dentry_stat.nr_unused++;
+		this_cpu_inc(nr_dentry_unused);
 	} else {
 		list_move_tail(&dentry->d_lru, list);
 	}
-- 
1.8.1.4


* Re: [PATCH v10 03/35] dcache: convert dentry_stat.nr_unused to per-cpu counters
  2013-06-03 19:29   ` [PATCH v10 03/35] dcache: convert dentry_stat.nr_unused to per-cpu counters Glauber Costa
@ 2013-06-05 23:07     ` Andrew Morton
  2013-06-06  1:45       ` Dave Chinner
  0 siblings, 1 reply; 103+ messages in thread
From: Andrew Morton @ 2013-06-05 23:07 UTC (permalink / raw)
  To: Glauber Costa
  Cc: linux-fsdevel, Mel Gorman, Dave Chinner, linux-mm, cgroups,
	kamezawa.hiroyu, Michal Hocko, Johannes Weiner, hughd,
	Greg Thelen, Dave Chinner

On Mon,  3 Jun 2013 23:29:32 +0400 Glauber Costa <glommer@openvz.org> wrote:

> From: Dave Chinner <dchinner@redhat.com>
> 
> Before we split up the dcache_lru_lock, the unused dentry counter
> needs to be made independent of the global dcache_lru_lock. Convert
> it to per-cpu counters to do this.
> 
> ...
>
> --- a/fs/dcache.c
> +++ b/fs/dcache.c
> @@ -118,8 +118,10 @@ struct dentry_stat_t dentry_stat = {
>  };
>  
>  static DEFINE_PER_CPU(long, nr_dentry);
> +static DEFINE_PER_CPU(long, nr_dentry_unused);
>  
>  #if defined(CONFIG_SYSCTL) && defined(CONFIG_PROC_FS)
> +/* scan possible cpus instead of online and avoid worrying about CPU hotplug. */

That's a poor comment.  It explains what the code does (which is dead
obvious) but fails to explain *why* the code does it.

> @@ -129,10 +131,20 @@ static long get_nr_dentry(void)
>  	return sum < 0 ? 0 : sum;
>  }
>  
> +static long get_nr_dentry_unused(void)
> +{
> +	int i;
> +	long sum = 0;
> +	for_each_possible_cpu(i)
> +		sum += per_cpu(nr_dentry_unused, i);
> +	return sum < 0 ? 0 : sum;
> +}

And I'm sure we've asked and answered ad nauseam why this code needed
to open-code the counters instead of using the provided library code,
yet the answer to that *still* isn't in the code comments or even in
the changelog.  It should be.


Given that the existing proc_nr_dentry() will suck mud rocks on
large-cpu-count machines (due to get_nr_dentry()), I guess we can
assume that nobody will be especially hurt by making proc_nr_dentry()
suck even harder...


* Re: [PATCH v10 03/35] dcache: convert dentry_stat.nr_unused to per-cpu counters
  2013-06-05 23:07     ` Andrew Morton
@ 2013-06-06  1:45       ` Dave Chinner
  2013-06-06  2:48         ` Andrew Morton
  0 siblings, 1 reply; 103+ messages in thread
From: Dave Chinner @ 2013-06-06  1:45 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Glauber Costa, linux-fsdevel, Mel Gorman, linux-mm, cgroups,
	kamezawa.hiroyu, Michal Hocko, Johannes Weiner, hughd,
	Greg Thelen, Dave Chinner

On Wed, Jun 05, 2013 at 04:07:31PM -0700, Andrew Morton wrote:
> On Mon,  3 Jun 2013 23:29:32 +0400 Glauber Costa <glommer@openvz.org> wrote:
> 
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > Before we split up the dcache_lru_lock, the unused dentry counter
> > needs to be made independent of the global dcache_lru_lock. Convert
> > it to per-cpu counters to do this.
> > 
> > ...
> >
> > --- a/fs/dcache.c
> > +++ b/fs/dcache.c
> > @@ -118,8 +118,10 @@ struct dentry_stat_t dentry_stat = {
> >  };
> >  
> >  static DEFINE_PER_CPU(long, nr_dentry);
> > +static DEFINE_PER_CPU(long, nr_dentry_unused);
> >  
> >  #if defined(CONFIG_SYSCTL) && defined(CONFIG_PROC_FS)
> > +/* scan possible cpus instead of online and avoid worrying about CPU hotplug. */
> 
> That's a poor comment.  It explains what the code does (which is dead
> obvious) but fails to explain *why* the code does it.
> 
> > @@ -129,10 +131,20 @@ static long get_nr_dentry(void)
> >  	return sum < 0 ? 0 : sum;
> >  }
> >  
> > +static long get_nr_dentry_unused(void)
> > +{
> > +	int i;
> > +	long sum = 0;
> > +	for_each_possible_cpu(i)
> > +		sum += per_cpu(nr_dentry_unused, i);
> > +	return sum < 0 ? 0 : sum;
> > +}
> 
> > And I'm sure we've asked and answered ad nauseam why this code needed
> to open-code the counters instead of using the provided library code,
> yet the answer to that *still* isn't in the code comments or even in
> the changelog.  It should be.

<sigh>

They were, originally, generic per-cpu counters:

312d3ca fs: use percpu counter for nr_dentry and nr_dentry_unused
cffbc8a fs: Convert nr_inodes and nr_unused to per-cpu counters

but then, well, let me just point you at the last time someone asked
this:

http://lwn.net/Articles/546587/

This is how we ended up with these fucked-up custom per-cpu
counters:

86c8749 vfs: revert per-cpu nr_unused counters for dentry and inodes
3e880fb fs: use fast counters for vfs caches

And so here we are now reverting 86c8749 because we're now
implementing the side of the scalability pile that requires the
unused counters to scale globally. I don't care to revisit 3e880fb
in this patch series, so this patch just duplicates existing
infrastructure.

> Given that the existing proc_nr_dentry() will suck mud rocks on
> large-cpu-count machines (due to get_nr_dentry()), I guess we can
> assume that nobody will be especially hurt by making proc_nr_dentry()
> suck even harder...

Yup, another reason I don't like the current implementation, too.
But making this better was labelled "optimising the slow path" and
so roundly dismissed.

Andrew, if you want to push the changes back to generic per-cpu
counters through to Linus, then I'll write the patches for you.  But
- and this is a big but - I'll only do this if you are going to deal
with the "performance trumps all other concerns" fanatics over
whether it should be merged or not. I have better things to do
with my time than have a flamewar over trivial details like this.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH v10 03/35] dcache: convert dentry_stat.nr_unused to per-cpu counters
  2013-06-06  1:45       ` Dave Chinner
@ 2013-06-06  2:48         ` Andrew Morton
  2013-06-06  4:02           ` Dave Chinner
  2013-06-06 12:40           ` Glauber Costa
  0 siblings, 2 replies; 103+ messages in thread
From: Andrew Morton @ 2013-06-06  2:48 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Glauber Costa, linux-fsdevel, Mel Gorman, linux-mm, cgroups,
	kamezawa.hiroyu, Michal Hocko, Johannes Weiner, hughd,
	Greg Thelen, Dave Chinner

On Thu, 6 Jun 2013 11:45:09 +1000 Dave Chinner <david@fromorbit.com> wrote:

> Andrew, if you want to push the changes back to generic per-cpu
> counters through to Linus, then I'll write the patches for you.  But
> - and this is a big but - I'll only do this if you are going to deal
> with the "performance trumps all other concerns" fanatics over
> whether it should be merged or not. I have better things to do
> with my time than have a flamewar over trivial details like this.

Please view my comments as a critique of the changelog, not of the code. 

There are presumably good (but undisclosed) reasons for going this way,
but this question is so bleeding obvious that the decision should have
been addressed up-front and in good detail.

And, preferably, with benchmark numbers.  Because it might have been
the wrong decision - stranger things have happened.


^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH v10 03/35] dcache: convert dentry_stat.nr_unused to per-cpu counters
  2013-06-06  2:48         ` Andrew Morton
@ 2013-06-06  4:02           ` Dave Chinner
  2013-06-06 12:40           ` Glauber Costa
  1 sibling, 0 replies; 103+ messages in thread
From: Dave Chinner @ 2013-06-06  4:02 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Glauber Costa, linux-fsdevel, Mel Gorman, linux-mm, cgroups,
	kamezawa.hiroyu, Michal Hocko, Johannes Weiner, hughd,
	Greg Thelen, Dave Chinner

On Wed, Jun 05, 2013 at 07:48:01PM -0700, Andrew Morton wrote:
> On Thu, 6 Jun 2013 11:45:09 +1000 Dave Chinner <david@fromorbit.com> wrote:
> 
> > Andrew, if you want to push the changes back to generic per-cpu
> > counters through to Linus, then I'll write the patches for you.  But
> > - and this is a big but - I'll only do this if you are going to deal
> > with the "performance trumps all other concerns" fanatics over
> > whether it should be merged or not. I have better things to do
> > with my time than have a flamewar over trivial details like this.
> 
> Please view my comments as a critique of the changelog, not of the code. 
> 
> There are presumably good (but undisclosed) reasons for going this way,
> but this question is so bleeding obvious that the decision should have
> been addressed up-front and in good detail.

The answer is so bleeding obvious I didn't think it needed to be
documented. ;) i.e. implement it the same way that its sibling is
implemented because consistency is good....

> And, preferably, with benchmark numbers.  Because it might have been
> the wrong decision - stranger things have happened.

I've never been able to measure the difference in fast-path
performance that can be attributed to the generic CPU counters
having more overhead than the special ones. If you've got any
workload where the fast-path counter overhead shows up in a
profile, I'd be very interested....

Cheers,

dave.
-- 
Dave Chinner
david@fromorbit.com


^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH v10 03/35] dcache: convert dentry_stat.nr_unused to per-cpu counters
  2013-06-06  2:48         ` Andrew Morton
  2013-06-06  4:02           ` Dave Chinner
@ 2013-06-06 12:40           ` Glauber Costa
  2013-06-06 22:25             ` Andrew Morton
  1 sibling, 1 reply; 103+ messages in thread
From: Glauber Costa @ 2013-06-06 12:40 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Dave Chinner, Glauber Costa, linux-fsdevel, Mel Gorman, linux-mm,
	cgroups, kamezawa.hiroyu, Michal Hocko, Johannes Weiner, hughd,
	Greg Thelen, Dave Chinner

[-- Attachment #1: Type: text/plain, Size: 1021 bytes --]

On 06/06/2013 06:48 AM, Andrew Morton wrote:
> On Thu, 6 Jun 2013 11:45:09 +1000 Dave Chinner <david@fromorbit.com> wrote:
> 
>> Andrew, if you want to push the changes back to generic per-cpu
>> counters through to Linus, then I'll write the patches for you.  But
>> - and this is a big but - I'll only do this if you are going to deal
>> with the "performance trumps all other concerns" fanatics over
>> whether it should be merged or not. I have better things to do
>> with my time than have a flamewar over trivial details like this.
> 
> Please view my comments as a critique of the changelog, not of the code. 
> 
> There are presumably good (but undisclosed) reasons for going this way,
> but this question is so bleeding obvious that the decision should have
> been addressed up-front and in good detail.
> 
> And, preferably, with benchmark numbers.  Because it might have been
> the wrong decision - stranger things have happened.
> 

I have folded the attached patch here. Let me know if it still needs
more love.


[-- Attachment #2: 3.patch --]
[-- Type: text/x-patch, Size: 1013 bytes --]

diff --git a/fs/dcache.c b/fs/dcache.c
index 9f2aa96..0466dbd 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -121,7 +121,19 @@ static DEFINE_PER_CPU(long, nr_dentry);
 static DEFINE_PER_CPU(long, nr_dentry_unused);
 
 #if defined(CONFIG_SYSCTL) && defined(CONFIG_PROC_FS)
-/* scan possible cpus instead of online and avoid worrying about CPU hotplug. */
+
+/*
+ * Here we resort to our own counters instead of using generic per-cpu counters
+ * for consistency with what the vfs inode code does. We are expected to harvest
+ * better code and performance by having our own specialized counters.
+ *
+ * Please note that the loop is done over all possible CPUs, not over all online
+ * CPUs. The reason for this is that we don't want to play games with CPUs going
+ * on and off. If one of them goes off, we will just keep their counters.
+ *
+ * glommer: See cffbc8a for details, and if you ever intend to change this,
+ * please update all vfs counters to match.
+ */
 static long get_nr_dentry(void)
 {
 	int i;

^ permalink raw reply related	[flat|nested] 103+ messages in thread

* Re: [PATCH v10 03/35] dcache: convert dentry_stat.nr_unused to per-cpu counters
  2013-06-06 12:40           ` Glauber Costa
@ 2013-06-06 22:25             ` Andrew Morton
       [not found]               ` <20130606152546.52f614d852da32d28a0b460f-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org>
  0 siblings, 1 reply; 103+ messages in thread
From: Andrew Morton @ 2013-06-06 22:25 UTC (permalink / raw)
  To: Glauber Costa
  Cc: Dave Chinner, Glauber Costa, linux-fsdevel, Mel Gorman, linux-mm,
	cgroups, kamezawa.hiroyu, Michal Hocko, Johannes Weiner, hughd,
	Greg Thelen, Dave Chinner

On Thu, 6 Jun 2013 16:40:42 +0400 Glauber Costa <glommer@parallels.com> wrote:

> +/*
> + * Here we resort to our own counters instead of using generic per-cpu counters
> + * for consistency with what the vfs inode code does. We are expected to harvest
> + * better code and performance by having our own specialized counters.
> + *
> + * Please note that the loop is done over all possible CPUs, not over all online
> + * CPUs. The reason for this is that we don't want to play games with CPUs going
> + * on and off. If one of them goes off, we will just keep their counters.
> + *
> + * glommer: See cffbc8a for details, and if you ever intend to change this,
> + * please update all vfs counters to match.

Handling CPU hotplug is really quite simple - see lib/percpu_counter.c

(I can't imagine why percpu_counter_hotcpu_callback() sums all the
counters - all it needs to do is to spill hcpu's counter into current's
counter).
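
The "spill hcpu's counter into current's counter" idea can be sketched in userspace as follows. This is an illustration under assumptions, not the code in lib/percpu_counter.c: the arrays, NR_CPUS_SKETCH, counter_add(), counter_cpu_dead() and counter_sum_online() are all hypothetical names, and no locking is modelled.

```c
#include <assert.h>
#include <stdbool.h>

#define NR_CPUS_SKETCH 4	/* hypothetical CPU count for this sketch */

static long count_pcpu[NR_CPUS_SKETCH];
static bool cpu_online_sketch[NR_CPUS_SKETCH] = { true, true, true, true };

/* Fast path: adjust the slot of the CPU doing the update. */
static void counter_add(int cpu, long delta)
{
	assert(cpu >= 0 && cpu < NR_CPUS_SKETCH);
	count_pcpu[cpu] += delta;
}

/* Hotplug path: fold the dead CPU's residue into a surviving CPU's slot. */
static void counter_cpu_dead(int dead, int target)
{
	assert(dead != target && cpu_online_sketch[target]);
	count_pcpu[target] += count_pcpu[dead];
	count_pcpu[dead] = 0;
	cpu_online_sketch[dead] = false;
}

/* Read path: with the fold above, summing online CPUs loses nothing. */
static long counter_sum_online(void)
{
	long sum = 0;
	int i;

	for (i = 0; i < NR_CPUS_SKETCH; i++)
		if (cpu_online_sketch[i])
			sum += count_pcpu[i];
	return sum;
}
```

The trade-off against the comment quoted above is that the read side shrinks to the online CPUs, at the price of a hotplug callback for each counter.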


^ permalink raw reply	[flat|nested] 103+ messages in thread

[parent not found: <20130606152546.52f614d852da32d28a0b460f-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org>]

* Re: [PATCH v10 03/35] dcache: convert dentry_stat.nr_unused to per-cpu counters
       [not found]               ` <20130606152546.52f614d852da32d28a0b460f-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org>
@ 2013-06-06 23:42                 ` Dave Chinner
  2013-06-07  6:03                   ` Glauber Costa
  0 siblings, 1 reply; 103+ messages in thread
From: Dave Chinner @ 2013-06-06 23:42 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Glauber Costa, Glauber Costa,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Mel Gorman,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, cgroups-u79uwXL29TY76Z2rM5mHXA,
	kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A, Michal Hocko,
	Johannes Weiner, hughd-hpIqsD4AKlfQT0dZR+AlfA, Greg Thelen,
	Dave Chinner

On Thu, Jun 06, 2013 at 03:25:46PM -0700, Andrew Morton wrote:
> On Thu, 6 Jun 2013 16:40:42 +0400 Glauber Costa <glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org> wrote:
> 
> > +/*
> > + * Here we resort to our own counters instead of using generic per-cpu counters
> > + * for consistency with what the vfs inode code does. We are expected to harvest
> > + * better code and performance by having our own specialized counters.
> > + *
> > + * Please note that the loop is done over all possible CPUs, not over all online
> > + * CPUs. The reason for this is that we don't want to play games with CPUs going
> > + * on and off. If one of them goes off, we will just keep their counters.
> > + *
> > + * glommer: See cffbc8a for details, and if you ever intend to change this,
> > + * please update all vfs counters to match.
> 
> Handling CPU hotplug is really quite simple - see lib/percpu_counter.c

Yes, it is - you're preaching to the choir, Andrew.

But, well, if you want us to add notifiers to optimise the summation
to just the active CPUs, then let's just convert the code to use the
generic per-cpu counters and stop wasting time rehashing tired old
arguments.

Cheers,

Dave.
-- 
Dave Chinner
david-FqsqvQoI3Ljby3iVrkZq2A@public.gmane.org

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH v10 03/35] dcache: convert dentry_stat.nr_unused to per-cpu counters
  2013-06-06 23:42                 ` Dave Chinner
@ 2013-06-07  6:03                   ` Glauber Costa
  0 siblings, 0 replies; 103+ messages in thread
From: Glauber Costa @ 2013-06-07  6:03 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Andrew Morton, Glauber Costa,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Mel Gorman,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, cgroups-u79uwXL29TY76Z2rM5mHXA,
	kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A, Michal Hocko,
	Johannes Weiner, hughd-hpIqsD4AKlfQT0dZR+AlfA, Greg Thelen,
	Dave Chinner, KAMEZAWA Hiroyuki

On 06/07/2013 03:42 AM, Dave Chinner wrote:
> On Thu, Jun 06, 2013 at 03:25:46PM -0700, Andrew Morton wrote:
>> On Thu, 6 Jun 2013 16:40:42 +0400 Glauber Costa <glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org> wrote:
>>
>>> +/*
>>> + * Here we resort to our own counters instead of using generic per-cpu counters
>>> + * for consistency with what the vfs inode code does. We are expected to harvest
>>> + * better code and performance by having our own specialized counters.
>>> + *
>>> + * Please note that the loop is done over all possible CPUs, not over all online
>>> + * CPUs. The reason for this is that we don't want to play games with CPUs going
>>> + * on and off. If one of them goes off, we will just keep their counters.
>>> + *
>>> + * glommer: See cffbc8a for details, and if you ever intend to change this,
>>> + * please update all vfs counters to match.
>>
>> Handling CPU hotplug is really quite simple - see lib/percpu_counter.c
> 
> Yes, it is - you're preaching to the choir, Andrew.
> 
> But, well, if you want us to add notifiers to optimise the summation
> to just the active CPUs, then let's just convert the code to use the
> generic per-cpu counters and stop wasting time rehashing tired old
> arguments.
> 

It is not even only this. I had this very same discussion a while ago
with Kamezawa - memcg also uses its own percpu counters. If my mind does
not betray me, that was because the patterns generated for a
percpu_counter array are quite bad. So this is not the single offender.
(And again, I came up with the "why not percpu counters" as soon as Dave
posted this patch for the first time).

One thing that it seems to indicate is that the percpu counters are too
generic, and maybe could use some work.

^ permalink raw reply	[flat|nested] 103+ messages in thread

* [PATCH v10 04/35] dentry: move to per-sb LRU locks
       [not found] ` <1370287804-3481-1-git-send-email-glommer-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org>
  2013-06-03 19:29   ` [PATCH v10 01/35] fs: bump inode and dentry counters to long Glauber Costa
  2013-06-03 19:29   ` [PATCH v10 03/35] dcache: convert dentry_stat.nr_unused to per-cpu counters Glauber Costa
@ 2013-06-03 19:29   ` Glauber Costa
  2013-06-05 23:07     ` Andrew Morton
  2013-06-03 19:29   ` [PATCH v10 06/35] mm: new shrinker API Glauber Costa
                     ` (27 subsequent siblings)
  30 siblings, 1 reply; 103+ messages in thread
From: Glauber Costa @ 2013-06-03 19:29 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Mel Gorman, Dave Chinner,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, cgroups-u79uwXL29TY76Z2rM5mHXA,
	kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A, Michal Hocko,
	Johannes Weiner, hughd-hpIqsD4AKlfQT0dZR+AlfA, Greg Thelen,
	Dave Chinner

From: Dave Chinner <dchinner-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>

With the dentry LRUs being per-sb structures, there is no real need
for a global dentry_lru_lock. The locking can be made more
fine-grained by moving to a per-sb LRU lock, isolating the LRU
operations of different filesystems completely from each other.

Signed-off-by: Dave Chinner <dchinner-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
Reviewed-by: Christoph Hellwig <hch-jcswGhMUV9g@public.gmane.org>
Acked-by: Mel Gorman <mgorman-l3A5Bk7waGM@public.gmane.org>
---
 fs/dcache.c        | 33 ++++++++++++++++-----------------
 fs/super.c         |  1 +
 include/linux/fs.h |  4 +++-
 3 files changed, 20 insertions(+), 18 deletions(-)

diff --git a/fs/dcache.c b/fs/dcache.c
index 9f2aa96..9d8ec4a 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -48,7 +48,7 @@
  *   - the dcache hash table
  * s_anon bl list spinlock protects:
  *   - the s_anon list (see __d_drop)
- * dcache_lru_lock protects:
+ * dentry->d_sb->s_dentry_lru_lock protects:
  *   - the dcache lru lists and counters
  * d_lock protects:
  *   - d_flags
@@ -63,7 +63,7 @@
  * Ordering:
  * dentry->d_inode->i_lock
  *   dentry->d_lock
- *     dcache_lru_lock
+ *     dentry->d_sb->s_dentry_lru_lock
  *     dcache_hash_bucket lock
  *     s_anon lock
  *
@@ -81,7 +81,6 @@
 int sysctl_vfs_cache_pressure __read_mostly = 100;
 EXPORT_SYMBOL_GPL(sysctl_vfs_cache_pressure);
 
-static __cacheline_aligned_in_smp DEFINE_SPINLOCK(dcache_lru_lock);
 __cacheline_aligned_in_smp DEFINE_SEQLOCK(rename_lock);
 
 EXPORT_SYMBOL(rename_lock);
@@ -321,11 +320,11 @@ static void dentry_unlink_inode(struct dentry * dentry)
 static void dentry_lru_add(struct dentry *dentry)
 {
 	if (list_empty(&dentry->d_lru)) {
-		spin_lock(&dcache_lru_lock);
+		spin_lock(&dentry->d_sb->s_dentry_lru_lock);
 		list_add(&dentry->d_lru, &dentry->d_sb->s_dentry_lru);
 		dentry->d_sb->s_nr_dentry_unused++;
 		this_cpu_inc(nr_dentry_unused);
-		spin_unlock(&dcache_lru_lock);
+		spin_unlock(&dentry->d_sb->s_dentry_lru_lock);
 	}
 }
 
@@ -343,15 +342,15 @@ static void __dentry_lru_del(struct dentry *dentry)
 static void dentry_lru_del(struct dentry *dentry)
 {
 	if (!list_empty(&dentry->d_lru)) {
-		spin_lock(&dcache_lru_lock);
+		spin_lock(&dentry->d_sb->s_dentry_lru_lock);
 		__dentry_lru_del(dentry);
-		spin_unlock(&dcache_lru_lock);
+		spin_unlock(&dentry->d_sb->s_dentry_lru_lock);
 	}
 }
 
 static void dentry_lru_move_list(struct dentry *dentry, struct list_head *list)
 {
-	spin_lock(&dcache_lru_lock);
+	spin_lock(&dentry->d_sb->s_dentry_lru_lock);
 	if (list_empty(&dentry->d_lru)) {
 		list_add_tail(&dentry->d_lru, list);
 		dentry->d_sb->s_nr_dentry_unused++;
@@ -359,7 +358,7 @@ static void dentry_lru_move_list(struct dentry *dentry, struct list_head *list)
 	} else {
 		list_move_tail(&dentry->d_lru, list);
 	}
-	spin_unlock(&dcache_lru_lock);
+	spin_unlock(&dentry->d_sb->s_dentry_lru_lock);
 }
 
 /**
@@ -839,14 +838,14 @@ void prune_dcache_sb(struct super_block *sb, int count)
 	LIST_HEAD(tmp);
 
 relock:
-	spin_lock(&dcache_lru_lock);
+	spin_lock(&sb->s_dentry_lru_lock);
 	while (!list_empty(&sb->s_dentry_lru)) {
 		dentry = list_entry(sb->s_dentry_lru.prev,
 				struct dentry, d_lru);
 		BUG_ON(dentry->d_sb != sb);
 
 		if (!spin_trylock(&dentry->d_lock)) {
-			spin_unlock(&dcache_lru_lock);
+			spin_unlock(&sb->s_dentry_lru_lock);
 			cpu_relax();
 			goto relock;
 		}
@@ -862,11 +861,11 @@ relock:
 			if (!--count)
 				break;
 		}
-		cond_resched_lock(&dcache_lru_lock);
+		cond_resched_lock(&sb->s_dentry_lru_lock);
 	}
 	if (!list_empty(&referenced))
 		list_splice(&referenced, &sb->s_dentry_lru);
-	spin_unlock(&dcache_lru_lock);
+	spin_unlock(&sb->s_dentry_lru_lock);
 
 	shrink_dentry_list(&tmp);
 }
@@ -882,14 +881,14 @@ void shrink_dcache_sb(struct super_block *sb)
 {
 	LIST_HEAD(tmp);
 
-	spin_lock(&dcache_lru_lock);
+	spin_lock(&sb->s_dentry_lru_lock);
 	while (!list_empty(&sb->s_dentry_lru)) {
 		list_splice_init(&sb->s_dentry_lru, &tmp);
-		spin_unlock(&dcache_lru_lock);
+		spin_unlock(&sb->s_dentry_lru_lock);
 		shrink_dentry_list(&tmp);
-		spin_lock(&dcache_lru_lock);
+		spin_lock(&sb->s_dentry_lru_lock);
 	}
-	spin_unlock(&dcache_lru_lock);
+	spin_unlock(&sb->s_dentry_lru_lock);
 }
 EXPORT_SYMBOL(shrink_dcache_sb);
 
diff --git a/fs/super.c b/fs/super.c
index 2a37fd6..0be75fb 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -182,6 +182,7 @@ static struct super_block *alloc_super(struct file_system_type *type, int flags)
 		INIT_HLIST_BL_HEAD(&s->s_anon);
 		INIT_LIST_HEAD(&s->s_inodes);
 		INIT_LIST_HEAD(&s->s_dentry_lru);
+		spin_lock_init(&s->s_dentry_lru_lock);
 		INIT_LIST_HEAD(&s->s_inode_lru);
 		spin_lock_init(&s->s_inode_lru_lock);
 		INIT_LIST_HEAD(&s->s_mounts);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 204d615..11f9ad2 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1263,7 +1263,9 @@ struct super_block {
 	struct list_head	s_files;
 #endif
 	struct list_head	s_mounts;	/* list of mounts; _not_ for fs use */
-	/* s_dentry_lru, s_nr_dentry_unused protected by dcache.c lru locks */
+
+	/* s_dentry_lru_lock protects s_dentry_lru and s_nr_dentry_unused */
+	spinlock_t		s_dentry_lru_lock ____cacheline_aligned_in_smp;
 	struct list_head	s_dentry_lru;	/* unused dentry lru */
 	long			s_nr_dentry_unused;	/* # of dentry on lru */
 
-- 
1.8.1.4

^ permalink raw reply related	[flat|nested] 103+ messages in thread

* Re: [PATCH v10 04/35] dentry: move to per-sb LRU locks
  2013-06-03 19:29   ` [PATCH v10 04/35] dentry: move to per-sb LRU locks Glauber Costa
@ 2013-06-05 23:07     ` Andrew Morton
  2013-06-06  1:56       ` Dave Chinner
  2013-06-06  8:03       ` Glauber Costa
  0 siblings, 2 replies; 103+ messages in thread
From: Andrew Morton @ 2013-06-05 23:07 UTC (permalink / raw)
  To: Glauber Costa
  Cc: linux-fsdevel, Mel Gorman, Dave Chinner, linux-mm, cgroups,
	kamezawa.hiroyu, Michal Hocko, Johannes Weiner, hughd,
	Greg Thelen, Dave Chinner

On Mon,  3 Jun 2013 23:29:33 +0400 Glauber Costa <glommer@openvz.org> wrote:

> From: Dave Chinner <dchinner@redhat.com>
> 
> With the dentry LRUs being per-sb structures, there is no real need
> for a global dentry_lru_lock. The locking can be made more
> fine-grained by moving to a per-sb LRU lock, isolating the LRU
> operations of different filesystems completely from each other.

What's the point to this patch?  Is it to enable some additional
development, or is it a standalone performance tweak?

If the latter then the patch obviously makes this dentry code bloatier
and straight-line slower.  So we're assuming that the multiprocessor
contention-avoidance benefits will outweigh that cost.  Got any proof
of this?



^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH v10 04/35] dentry: move to per-sb LRU locks
  2013-06-05 23:07     ` Andrew Morton
@ 2013-06-06  1:56       ` Dave Chinner
  2013-06-06  8:03       ` Glauber Costa
  1 sibling, 0 replies; 103+ messages in thread
From: Dave Chinner @ 2013-06-06  1:56 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Glauber Costa, linux-fsdevel, Mel Gorman, linux-mm, cgroups,
	kamezawa.hiroyu, Michal Hocko, Johannes Weiner, hughd,
	Greg Thelen, Dave Chinner

On Wed, Jun 05, 2013 at 04:07:38PM -0700, Andrew Morton wrote:
> On Mon,  3 Jun 2013 23:29:33 +0400 Glauber Costa <glommer@openvz.org> wrote:
> 
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > With the dentry LRUs being per-sb structures, there is no real need
> > for a global dentry_lru_lock. The locking can be made more
> > fine-grained by moving to a per-sb LRU lock, isolating the LRU
> > operations of different filesystems completely from each other.
> 
> What's the point to this patch?  Is it to enable some additional
> development, or is it a standalone performance tweak?

It's the separation of the global lock into locks of the same scope
the generic LRU list requires.

> If the latter then the patch obviously makes this dentry code bloatier
> and straight-line slower.  So we're assuming that the multiprocessor
> contention-avoidance benefits will outweigh that cost.  Got any proof
> of this?

Well, it will do that too for workloads that span multiple
filesystems, but that isn't the point of the patch. It's merely a
stepping stone...

Cheers,

Dave.

-- 
Dave Chinner
david@fromorbit.com


^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH v10 04/35] dentry: move to per-sb LRU locks
  2013-06-05 23:07     ` Andrew Morton
  2013-06-06  1:56       ` Dave Chinner
@ 2013-06-06  8:03       ` Glauber Costa
  2013-06-06 12:51         ` Glauber Costa
  1 sibling, 1 reply; 103+ messages in thread
From: Glauber Costa @ 2013-06-06  8:03 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Glauber Costa, linux-fsdevel, Mel Gorman, Dave Chinner, linux-mm,
	cgroups, kamezawa.hiroyu, Michal Hocko, Johannes Weiner, hughd,
	Greg Thelen, Dave Chinner

On 06/06/2013 03:07 AM, Andrew Morton wrote:
> On Mon,  3 Jun 2013 23:29:33 +0400 Glauber Costa <glommer@openvz.org> wrote:
> 
>> From: Dave Chinner <dchinner@redhat.com>
>>
>> With the dentry LRUs being per-sb structures, there is no real need
>> for a global dentry_lru_lock. The locking can be made more
>> fine-grained by moving to a per-sb LRU lock, isolating the LRU
>> operations of different filesystems completely from each other.
> 
> What's the point to this patch?  Is it to enable some additional
> development, or is it a standalone performance tweak?
> 
> If the latter then the patch obviously makes this dentry code bloatier
> and straight-line slower.  So we're assuming that the multiprocessor
> contention-avoidance benefits will outweigh that cost.  Got any proof
> of this?
> 
> 
This is preparation for the whole point of this series, which is to
abstract the lru manipulation into a list_lru. It is hard to do that
when the dcache has a single lock for all manipulations, and multiple
lists under its umbrella.



^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH v10 04/35] dentry: move to per-sb LRU locks
  2013-06-06  8:03       ` Glauber Costa
@ 2013-06-06 12:51         ` Glauber Costa
  0 siblings, 0 replies; 103+ messages in thread
From: Glauber Costa @ 2013-06-06 12:51 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Glauber Costa, linux-fsdevel, Mel Gorman, Dave Chinner, linux-mm,
	cgroups, kamezawa.hiroyu, Michal Hocko, Johannes Weiner, hughd,
	Greg Thelen, Dave Chinner

On 06/06/2013 12:03 PM, Glauber Costa wrote:
> On 06/06/2013 03:07 AM, Andrew Morton wrote:
>> On Mon,  3 Jun 2013 23:29:33 +0400 Glauber Costa <glommer@openvz.org> wrote:
>>
>>> From: Dave Chinner <dchinner@redhat.com>
>>>
>>> With the dentry LRUs being per-sb structures, there is no real need
>>> for a global dentry_lru_lock. The locking can be made more
>>> fine-grained by moving to a per-sb LRU lock, isolating the LRU
>>> operations of different filesystems completely from each other.
>>
>> What's the point to this patch?  Is it to enable some additional
>> development, or is it a standalone performance tweak?
>>
>> If the latter then the patch obviously makes this dentry code bloatier
>> and straight-line slower.  So we're assuming that the multiprocessor
>> contention-avoidance benefits will outweigh that cost.  Got any proof
>> of this?
>>
>>
> This is preparation for the whole point of this series, which is to
> abstract the lru manipulation into a list_lru. It is hard to do that
> when the dcache has a single lock for all manipulations, and multiple
> lists under its umbrella.
> 
> 

I have updated the changelog, which now reads:

With the dentry LRUs being per-sb structures, there is no real need for
a global dentry_lru_lock. The locking can be made more fine-grained by
moving to a per-sb LRU lock, isolating the LRU operations of different
filesystems completely from each other. The need for this is independent
of any performance consideration that may arise: in the interest of
abstracting the lru operations away, it is mandatory that each lru works
around its own lock instead of a global lock for all of them.
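
What "each lru works around its own lock" means can be shown with a small userspace model. This is a sketch under assumptions and not the list_lru code from this series: struct sketch_lru, struct sketch_item and the helpers are hypothetical, and a C11 atomic_flag spin loop stands in for the kernel spinlock.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

struct sketch_item {
	struct sketch_item *next;
};

/* The lock lives next to the list and counter it protects. */
struct sketch_lru {
	atomic_flag lock;
	struct sketch_item *head;
	long nr_items;
};

static void sketch_lru_init(struct sketch_lru *lru)
{
	atomic_flag_clear(&lru->lock);
	lru->head = NULL;
	lru->nr_items = 0;
}

static void sketch_lru_lock(struct sketch_lru *lru)
{
	while (atomic_flag_test_and_set_explicit(&lru->lock, memory_order_acquire))
		;	/* spin until the flag was previously clear */
}

static void sketch_lru_unlock(struct sketch_lru *lru)
{
	atomic_flag_clear_explicit(&lru->lock, memory_order_release);
}

/* Add at the head; only this list's lock is taken. */
static void sketch_lru_add(struct sketch_lru *lru, struct sketch_item *item)
{
	assert(item != NULL);
	sketch_lru_lock(lru);
	item->next = lru->head;
	lru->head = item;
	lru->nr_items++;
	sketch_lru_unlock(lru);
}

/* Remove and return the head item, or NULL if the list is empty. */
static struct sketch_item *sketch_lru_del_first(struct sketch_lru *lru)
{
	struct sketch_item *item;

	sketch_lru_lock(lru);
	item = lru->head;
	if (item) {
		lru->head = item->next;
		item->next = NULL;
		lru->nr_items--;
	}
	sketch_lru_unlock(lru);
	return item;
}
```

Because no helper reaches outside the struct it is handed, the same code can serve any number of lists, which is the property a single global lock covering all of them would not give.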


^ permalink raw reply	[flat|nested] 103+ messages in thread

* [PATCH v10 06/35] mm: new shrinker API
       [not found] ` <1370287804-3481-1-git-send-email-glommer-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org>
                     ` (2 preceding siblings ...)
  2013-06-03 19:29   ` [PATCH v10 04/35] dentry: move to per-sb LRU locks Glauber Costa
@ 2013-06-03 19:29   ` Glauber Costa
  2013-06-05 23:07     ` Andrew Morton
  2013-06-03 19:29   ` [PATCH v10 07/35] shrinker: convert superblock shrinkers to new API Glauber Costa
                     ` (26 subsequent siblings)
  30 siblings, 1 reply; 103+ messages in thread
From: Glauber Costa @ 2013-06-03 19:29 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Mel Gorman, Dave Chinner,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, cgroups-u79uwXL29TY76Z2rM5mHXA,
	kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A, Michal Hocko,
	Johannes Weiner, hughd-hpIqsD4AKlfQT0dZR+AlfA, Greg Thelen,
	Dave Chinner, Glauber Costa

From: Dave Chinner <dchinner-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>

The current shrinker callout API uses a single shrinker call for
multiple functions. To determine the function, a special magical
value is passed in a parameter to change the behaviour. This
complicates the implementation and return value specification for
the different behaviours.

Separate the two different behaviours into separate operations, one
to return a count of freeable objects in the cache, and another to
scan a certain number of objects in the cache for freeing. In
defining these new operations, ensure the return values and
resultant behaviours are clearly defined and documented.

Modify shrink_slab() to use the new API and implement the callouts
for all the existing shrinkers.
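
As a rough illustration of the count/scan split described above, the contract can be modelled in userspace like this. It is a sketch under assumptions only: struct sketch_cache, struct sketch_control and the three functions are hypothetical, the callbacks take a simplified argument list rather than the struct shrinker signatures in the patch, and the driver loop does not reproduce how shrink_slab() computes its scan target.

```c
#include <assert.h>

struct sketch_cache {
	long nr_freeable;	/* objects that could be freed right now */
	int may_deadlock;	/* nonzero: scanning is unsafe in this context */
};

struct sketch_control {
	long nr_to_scan;
};

/* Count: cheap and side-effect free; 0 means nothing to free. */
static long sketch_count_objects(struct sketch_cache *c)
{
	return c->nr_freeable > 0 ? c->nr_freeable : 0;
}

/* Scan: return the number freed, or -1 if no progress can be made. */
static long sketch_scan_objects(struct sketch_cache *c,
				struct sketch_control *sc)
{
	long freed;

	assert(sc->nr_to_scan >= 0);
	if (c->may_deadlock)
		return -1;
	freed = sc->nr_to_scan < c->nr_freeable ? sc->nr_to_scan
						: c->nr_freeable;
	c->nr_freeable -= freed;
	return freed;
}

/* Driver: scan only if the count is positive, and stop on the first -1. */
static long sketch_shrink(struct sketch_cache *c, long batch, long total)
{
	long freed = 0;

	if (sketch_count_objects(c) <= 0)
		return 0;

	while (total > 0) {
		struct sketch_control sc = {
			.nr_to_scan = batch < total ? batch : total,
		};
		long ret = sketch_scan_objects(c, &sc);

		if (ret == -1)
			break;
		freed += ret;
		total -= sc.nr_to_scan;
	}
	return freed;
}
```

With two entry points, the count callback never has to encode "how many remain" and "cannot scan now" in one return value, which is the ambiguity the magic nr_to_scan == 0 convention carried.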

Signed-off-by: Dave Chinner <dchinner-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
Signed-off-by: Glauber Costa <glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
Acked-by: Mel Gorman <mgorman-l3A5Bk7waGM@public.gmane.org>
---
 include/linux/shrinker.h | 36 ++++++++++++++++++++++++----------
 mm/vmscan.c              | 50 +++++++++++++++++++++++++++++++-----------------
 2 files changed, 58 insertions(+), 28 deletions(-)

diff --git a/include/linux/shrinker.h b/include/linux/shrinker.h
index ac6b8ee..c277b4e 100644
--- a/include/linux/shrinker.h
+++ b/include/linux/shrinker.h
@@ -4,31 +4,47 @@
 /*
  * This struct is used to pass information from page reclaim to the shrinkers.
  * We consolidate the values for easier extention later.
+ *
+ * The 'gfpmask' refers to the allocation we are currently trying to
+ * fulfil.
+ *
+ * Note that 'shrink' will be passed nr_to_scan == 0 when the VM is
+ * querying the cache size, so a fastpath for that case is appropriate.
  */
 struct shrink_control {
 	gfp_t gfp_mask;
 
 	/* How many slab objects shrinker() should scan and try to reclaim */
-	unsigned long nr_to_scan;
+	long nr_to_scan;
 };
 
 /*
  * A callback you can register to apply pressure to ageable caches.
  *
- * 'sc' is passed shrink_control which includes a count 'nr_to_scan'
- * and a 'gfpmask'.  It should look through the least-recently-used
- * 'nr_to_scan' entries and attempt to free them up.  It should return
- * the number of objects which remain in the cache.  If it returns -1, it means
- * it cannot do any scanning at this time (eg. there is a risk of deadlock).
+ * @shrink() should look through the least-recently-used 'nr_to_scan' entries
+ * and attempt to free them up.  It should return the number of objects which
+ * remain in the cache.  If it returns -1, it means it cannot do any scanning at
+ * this time (eg. there is a risk of deadlock).
  *
- * The 'gfpmask' refers to the allocation we are currently trying to
- * fulfil.
+ * @count_objects should return the number of freeable items in the cache. If
+ * there are no objects to free or the number of freeable items cannot be
+ * determined, it should return 0. No deadlock checks should be done during the
+ * count callback - the shrinker relies on aggregating scan counts that couldn't
+ * be executed due to potential deadlocks to be run at a later call when the
+ * deadlock condition is no longer pending.
  *
- * Note that 'shrink' will be passed nr_to_scan == 0 when the VM is
- * querying the cache size, so a fastpath for that case is appropriate.
+ * @scan_objects will only be called if @count_objects returned a positive
+ * value for the number of freeable objects. The callout should scan the cache
+ * and attempt to free items from the cache. It should then return the number of
+ * objects freed during the scan, or -1 if progress cannot be made due to
+ * potential deadlocks. If -1 is returned, then no further attempts to call the
+ * @scan_objects will be made from the current reclaim context.
  */
 struct shrinker {
 	int (*shrink)(struct shrinker *, struct shrink_control *sc);
+	long (*count_objects)(struct shrinker *, struct shrink_control *sc);
+	long (*scan_objects)(struct shrinker *, struct shrink_control *sc);
+
 	int seeks;	/* seeks to recreate an obj */
 	long batch;	/* reclaim batch size, 0 = default */
 
diff --git a/mm/vmscan.c b/mm/vmscan.c
index b1b38ad..6ac3ec2 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -205,19 +205,19 @@ static inline int do_shrinker_shrink(struct shrinker *shrinker,
  *
  * Returns the number of slab objects which we shrunk.
  */
-unsigned long shrink_slab(struct shrink_control *shrink,
+unsigned long shrink_slab(struct shrink_control *shrinkctl,
 			  unsigned long nr_pages_scanned,
 			  unsigned long lru_pages)
 {
 	struct shrinker *shrinker;
-	unsigned long ret = 0;
+	unsigned long freed = 0;
 
 	if (nr_pages_scanned == 0)
 		nr_pages_scanned = SWAP_CLUSTER_MAX;
 
 	if (!down_read_trylock(&shrinker_rwsem)) {
 		/* Assume we'll be able to shrink next time */
-		ret = 1;
+		freed = 1;
 		goto out;
 	}
 
@@ -225,13 +225,16 @@ unsigned long shrink_slab(struct shrink_control *shrink,
 		unsigned long long delta;
 		long total_scan;
 		long max_pass;
-		int shrink_ret = 0;
 		long nr;
 		long new_nr;
 		long batch_size = shrinker->batch ? shrinker->batch
 						  : SHRINK_BATCH;
 
-		max_pass = do_shrinker_shrink(shrinker, shrink, 0);
+		if (shrinker->scan_objects) {
+			max_pass = shrinker->count_objects(shrinker, shrinkctl);
+			WARN_ON(max_pass < 0);
+		} else
+			max_pass = do_shrinker_shrink(shrinker, shrinkctl, 0);
 		if (max_pass <= 0)
 			continue;
 
@@ -248,8 +251,8 @@ unsigned long shrink_slab(struct shrink_control *shrink,
 		do_div(delta, lru_pages + 1);
 		total_scan += delta;
 		if (total_scan < 0) {
-			printk(KERN_ERR "shrink_slab: %pF negative objects to "
-			       "delete nr=%ld\n",
+			printk(KERN_ERR
+			"shrink_slab: %pF negative objects to delete nr=%ld\n",
 			       shrinker->shrink, total_scan);
 			total_scan = max_pass;
 		}
@@ -277,20 +280,31 @@ unsigned long shrink_slab(struct shrink_control *shrink,
 		if (total_scan > max_pass * 2)
 			total_scan = max_pass * 2;
 
-		trace_mm_shrink_slab_start(shrinker, shrink, nr,
+		trace_mm_shrink_slab_start(shrinker, shrinkctl, nr,
 					nr_pages_scanned, lru_pages,
 					max_pass, delta, total_scan);
 
 		while (total_scan >= batch_size) {
-			int nr_before;
+			long ret;
+
+			if (shrinker->scan_objects) {
+				shrinkctl->nr_to_scan = batch_size;
+				ret = shrinker->scan_objects(shrinker, shrinkctl);
+
+				if (ret == -1)
+					break;
+				freed += ret;
+			} else {
+				int nr_before;
+				nr_before = do_shrinker_shrink(shrinker, shrinkctl, 0);
+				ret = do_shrinker_shrink(shrinker, shrinkctl,
+								batch_size);
+				if (ret == -1)
+					break;
+				if (ret < nr_before)
+					freed += nr_before - ret;
+			}
 
-			nr_before = do_shrinker_shrink(shrinker, shrink, 0);
-			shrink_ret = do_shrinker_shrink(shrinker, shrink,
-							batch_size);
-			if (shrink_ret == -1)
-				break;
-			if (shrink_ret < nr_before)
-				ret += nr_before - shrink_ret;
 			count_vm_events(SLABS_SCANNED, batch_size);
 			total_scan -= batch_size;
 
@@ -308,12 +322,12 @@ unsigned long shrink_slab(struct shrink_control *shrink,
 		else
 			new_nr = atomic_long_read(&shrinker->nr_in_batch);
 
-		trace_mm_shrink_slab_end(shrinker, shrink_ret, nr, new_nr);
+		trace_mm_shrink_slab_end(shrinker, freed, nr, new_nr);
 	}
 	up_read(&shrinker_rwsem);
 out:
 	cond_resched();
-	return ret;
+	return freed;
 }
 
 static inline int is_page_cache_freeable(struct page *page)
-- 
1.8.1.4

^ permalink raw reply related	[flat|nested] 103+ messages in thread

* Re: [PATCH v10 06/35] mm: new shrinker API
  2013-06-03 19:29   ` [PATCH v10 06/35] mm: new shrinker API Glauber Costa
@ 2013-06-05 23:07     ` Andrew Morton
       [not found]       ` <20130605160751.499f0ebb35e89a80dd7931f2-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org>
  0 siblings, 1 reply; 103+ messages in thread
From: Andrew Morton @ 2013-06-05 23:07 UTC (permalink / raw)
  To: Glauber Costa
  Cc: linux-fsdevel, Mel Gorman, Dave Chinner, linux-mm, cgroups,
	kamezawa.hiroyu, Michal Hocko, Johannes Weiner, hughd,
	Greg Thelen, Dave Chinner, Glauber Costa

On Mon,  3 Jun 2013 23:29:35 +0400 Glauber Costa <glommer@openvz.org> wrote:

> From: Dave Chinner <dchinner@redhat.com>
> 
> The current shrinker callout API uses a single shrinker call for
> multiple functions. To determine the function, a special magical
> value is passed in a parameter to change the behaviour. This
> complicates the implementation and return value specification for
> the different behaviours.
> 
> Separate the two different behaviours into separate operations, one
> to return a count of freeable objects in the cache, and another to
> scan a certain number of objects in the cache for freeing. In
> defining these new operations, ensure the return values and
> resultant behaviours are clearly defined and documented.
> 
> Modify shrink_slab() to use the new API and implement the callouts
> for all the existing shrinkers.
> 
> ...
>
> --- a/include/linux/shrinker.h
> +++ b/include/linux/shrinker.h
> @@ -4,31 +4,47 @@
>  /*
>   * This struct is used to pass information from page reclaim to the shrinkers.
>   * We consolidate the values for easier extention later.
> + *
> + * The 'gfpmask' refers to the allocation we are currently trying to
> + * fulfil.
> + *
> + * Note that 'shrink' will be passed nr_to_scan == 0 when the VM is
> + * querying the cache size, so a fastpath for that case is appropriate.
>   */
>  struct shrink_control {
>  	gfp_t gfp_mask;
>  
>  	/* How many slab objects shrinker() should scan and try to reclaim */
> -	unsigned long nr_to_scan;
> +	long nr_to_scan;

Why this change?

(I might have asked this before, but because the changelog wasn't
updated, you get to answer it again!)

>  };
>  
>  /*
>   * A callback you can register to apply pressure to ageable caches.
>   *
> - * 'sc' is passed shrink_control which includes a count 'nr_to_scan'
> - * and a 'gfpmask'.  It should look through the least-recently-used
> - * 'nr_to_scan' entries and attempt to free them up.  It should return
> - * the number of objects which remain in the cache.  If it returns -1, it means
> - * it cannot do any scanning at this time (eg. there is a risk of deadlock).
> + * @shrink() should look through the least-recently-used 'nr_to_scan' entries
> + * and attempt to free them up.  It should return the number of objects which
> + * remain in the cache.  If it returns -1, it means it cannot do any scanning at
> + * this time (eg. there is a risk of deadlock).
>   *
> - * The 'gfpmask' refers to the allocation we are currently trying to
> - * fulfil.
> + * @count_objects should return the number of freeable items in the cache. If
> + * there are no objects to free or the number of freeable items cannot be
> + * determined, it should return 0. No deadlock checks should be done during the
> + * count callback - the shrinker relies on aggregating scan counts that couldn't
> + * be executed due to potential deadlocks to be run at a later call when the
> + * deadlock condition is no longer pending.
>   *
> - * Note that 'shrink' will be passed nr_to_scan == 0 when the VM is
> - * querying the cache size, so a fastpath for that case is appropriate.
> + * @scan_objects will only be called if @count_objects returned a positive
> + * value for the number of freeable objects.

Saying "positive value" implies to me that count_objects() can return a
negative code, but such a thing is not documented here.  If
count_objects() *doesn't* return a -ve code then s/positive/non-zero/
here would clear up confusion.

> The callout should scan the cache
> + * and attempt to free items from the cache. It should then return the number of
> + * objects freed during the scan, or -1 if progress cannot be made due to
> + * potential deadlocks. If -1 is returned, then no further attempts to call the
> + * @scan_objects will be made from the current reclaim context.
>   */
>  struct shrinker {
>  	int (*shrink)(struct shrinker *, struct shrink_control *sc);
> +	long (*count_objects)(struct shrinker *, struct shrink_control *sc);
> +	long (*scan_objects)(struct shrinker *, struct shrink_control *sc);

As these both return counts-of-things, one would expect the return type
to be unsigned.

I assume that scan_objects was made signed for the "return -1" thing,
although that might not have been the best decision - it could return
~0UL, for example.

It's unclear why count_objects() returns a signed quantity.
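
To make the ~0UL suggestion concrete, here is a minimal userspace
sketch, not kernel code.  SCAN_STOP and every toy_* name are
hypothetical and exist only for this illustration; the posted patch
uses a signed return and -1 instead.

```c
#include <assert.h>

/* Hypothetical sentinel (not in the posted patch): "cannot scan now". */
#define SCAN_STOP (~0UL)

/* Toy cache standing in for a real shrinker's private state. */
struct toy_cache {
	unsigned long nr_freeable;
	int would_deadlock;	/* models a "no __GFP_FS" style condition */
};

/* Unsigned flavour of ->scan_objects: objects freed, or SCAN_STOP. */
static unsigned long toy_scan(struct toy_cache *c, unsigned long nr_to_scan)
{
	unsigned long freed;

	if (c->would_deadlock)
		return SCAN_STOP;

	freed = nr_to_scan < c->nr_freeable ? nr_to_scan : c->nr_freeable;
	c->nr_freeable -= freed;
	return freed;
}

/* Caller side: stop on the sentinel, otherwise accumulate the count. */
static unsigned long toy_shrink(struct toy_cache *c, unsigned long total,
				unsigned long batch)
{
	unsigned long freed = 0;

	assert(batch > 0);	/* a zero batch would never terminate */

	while (total >= batch) {
		unsigned long ret = toy_scan(c, batch);

		if (ret == SCAN_STOP)
			break;
		freed += ret;
		total -= batch;
	}
	return freed;
}
```

With 300 freeable objects, toy_shrink(&c, 256, 128) frees 256 and
leaves 44; with would_deadlock set it frees nothing.  The cost of the
sentinel is one unusable count value at the very top of the range.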


>  	int seeks;	/* seeks to recreate an obj */
>  	long batch;	/* reclaim batch size, 0 = default */
>  
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index b1b38ad..6ac3ec2 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -205,19 +205,19 @@ static inline int do_shrinker_shrink(struct shrinker *shrinker,
>   *
>   * Returns the number of slab objects which we shrunk.
>   */
> -unsigned long shrink_slab(struct shrink_control *shrink,
> +unsigned long shrink_slab(struct shrink_control *shrinkctl,
>  			  unsigned long nr_pages_scanned,
>  			  unsigned long lru_pages)
>  {
>  	struct shrinker *shrinker;
> -	unsigned long ret = 0;
> +	unsigned long freed = 0;
>  
>  	if (nr_pages_scanned == 0)
>  		nr_pages_scanned = SWAP_CLUSTER_MAX;
>  
>  	if (!down_read_trylock(&shrinker_rwsem)) {
>  		/* Assume we'll be able to shrink next time */
> -		ret = 1;
> +		freed = 1;

That's odd - it didn't free anything?  Needs a comment to avoid
mystifying other readers.

>  		goto out;
>  	}
>  
> @@ -225,13 +225,16 @@ unsigned long shrink_slab(struct shrink_control *shrink,
>  		unsigned long long delta;
>  		long total_scan;
>  		long max_pass;
> -		int shrink_ret = 0;
>  		long nr;
>  		long new_nr;
>  		long batch_size = shrinker->batch ? shrinker->batch
>  						  : SHRINK_BATCH;
>  
> -		max_pass = do_shrinker_shrink(shrinker, shrink, 0);
> +		if (shrinker->scan_objects) {

Did you mean to test ->scan_objects here?  Or ->count_objects? 
->scan_objects makes sense but I wanna know if it was a copy-n-paste
bug.

> +			max_pass = shrinker->count_objects(shrinker, shrinkctl);
> +			WARN_ON(max_pass < 0);

OK so from that I see that ->count_objects() doesn't return negative.

If this warning ever triggers, I expect it will trigger *a lot*.
WARN_ON_ONCE would be more prudent.  Or just nuke it.

> +		} else
> +			max_pass = do_shrinker_shrink(shrinker, shrinkctl, 0);
>  		if (max_pass <= 0)
>  			continue;
>  
> @@ -248,8 +251,8 @@ unsigned long shrink_slab(struct shrink_control *shrink,
>  		do_div(delta, lru_pages + 1);
>  		total_scan += delta;
>  		if (total_scan < 0) {
> -			printk(KERN_ERR "shrink_slab: %pF negative objects to "
> -			       "delete nr=%ld\n",
> +			printk(KERN_ERR
> +			"shrink_slab: %pF negative objects to delete nr=%ld\n",
>  			       shrinker->shrink, total_scan);
>  			total_scan = max_pass;
>  		}
> @@ -277,20 +280,31 @@ unsigned long shrink_slab(struct shrink_control *shrink,
>  		if (total_scan > max_pass * 2)
>  			total_scan = max_pass * 2;
>  
> -		trace_mm_shrink_slab_start(shrinker, shrink, nr,
> +		trace_mm_shrink_slab_start(shrinker, shrinkctl, nr,
>  					nr_pages_scanned, lru_pages,
>  					max_pass, delta, total_scan);
>  
>  		while (total_scan >= batch_size) {
> -			int nr_before;
> +			long ret;
> +
> +			if (shrinker->scan_objects) {
> +				shrinkctl->nr_to_scan = batch_size;
> +				ret = shrinker->scan_objects(shrinker, shrinkctl);
> +
> +				if (ret == -1)
> +					break;
> +				freed += ret;
> +			} else {
> +				int nr_before;
> +				nr_before = do_shrinker_shrink(shrinker, shrinkctl, 0);
> +				ret = do_shrinker_shrink(shrinker, shrinkctl,
> +								batch_size);
> +				if (ret == -1)
> +					break;
> +				if (ret < nr_before)

This test seems unnecessary.

> +					freed += nr_before - ret;
> +			}
>  
> -			nr_before = do_shrinker_shrink(shrinker, shrink, 0);
> -			shrink_ret = do_shrinker_shrink(shrinker, shrink,
> -							batch_size);
> -			if (shrink_ret == -1)
> -				break;
> -			if (shrink_ret < nr_before)
> -				ret += nr_before - shrink_ret;
>  			count_vm_events(SLABS_SCANNED, batch_size);
>  			total_scan -= batch_size;
>  
> @@ -308,12 +322,12 @@ unsigned long shrink_slab(struct shrink_control *shrink,
>  		else
>  			new_nr = atomic_long_read(&shrinker->nr_in_batch);
>  
> -		trace_mm_shrink_slab_end(shrinker, shrink_ret, nr, new_nr);
> +		trace_mm_shrink_slab_end(shrinker, freed, nr, new_nr);
>  	}
>  	up_read(&shrinker_rwsem);
>  out:
>  	cond_resched();
> -	return ret;
> +	return freed;
>  }
>  
>  static inline int is_page_cache_freeable(struct page *page)

shrink_slab() has a long, long history of exhibiting various overflows
- both multiplicative and over-incrementing.  I looked, and can't see
any introduction of such problems here, but please do check it
carefully.  Expect the impossible :(


^ permalink raw reply	[flat|nested] 103+ messages in thread

[parent not found: <20130605160751.499f0ebb35e89a80dd7931f2-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org>]

* Re: [PATCH v10 06/35] mm: new shrinker API
       [not found]       ` <20130605160751.499f0ebb35e89a80dd7931f2-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org>
@ 2013-06-06  7:58         ` Glauber Costa
  0 siblings, 0 replies; 103+ messages in thread
From: Glauber Costa @ 2013-06-06  7:58 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Glauber Costa, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Mel Gorman,
	Dave Chinner, linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
	cgroups-u79uwXL29TY76Z2rM5mHXA,
	kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A, Michal Hocko,
	Johannes Weiner, hughd-hpIqsD4AKlfQT0dZR+AlfA, Greg Thelen,
	Dave Chinner

On 06/06/2013 03:07 AM, Andrew Morton wrote:
> On Mon,  3 Jun 2013 23:29:35 +0400 Glauber Costa <glommer-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org> wrote:
> 
>> From: Dave Chinner <dchinner-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
>>
>> The current shrinker callout API uses a single shrinker call for
>> multiple functions. To determine the function, a special magical
>> value is passed in a parameter to change the behaviour. This
>> complicates the implementation and return value specification for
>> the different behaviours.
>>
>> Separate the two different behaviours into separate operations, one
>> to return a count of freeable objects in the cache, and another to
>> scan a certain number of objects in the cache for freeing. In
>> defining these new operations, ensure the return values and
>> resultant behaviours are clearly defined and documented.
>>
>> Modify shrink_slab() to use the new API and implement the callouts
>> for all the existing shrinkers.
>>
>> ...
>>
>> --- a/include/linux/shrinker.h
>> +++ b/include/linux/shrinker.h
>> @@ -4,31 +4,47 @@
>>  /*
>>   * This struct is used to pass information from page reclaim to the shrinkers.
>>   * We consolidate the values for easier extention later.
>> + *
>> + * The 'gfpmask' refers to the allocation we are currently trying to
>> + * fulfil.
>> + *
>> + * Note that 'shrink' will be passed nr_to_scan == 0 when the VM is
>> + * querying the cache size, so a fastpath for that case is appropriate.
>>   */
>>  struct shrink_control {
>>  	gfp_t gfp_mask;
>>  
>>  	/* How many slab objects shrinker() should scan and try to reclaim */
>> -	unsigned long nr_to_scan;
>> +	long nr_to_scan;
> 
> Why this change?
> 
> (I might have asked this before, but because the changelog wasn't
> updated, you get to answer it again!)
> 

There were various reasons to have a signed quantity for nr_to_scan,
but I believe I have fixed all of them by now. We still want the lru
nr_items to be a signed quantity, but this one can go back to unsigned.
I will make sure of that, and shout if there is still any impediment.

>>  };
>>  
>>  /*
>>   * A callback you can register to apply pressure to ageable caches.
>>   *
>> - * 'sc' is passed shrink_control which includes a count 'nr_to_scan'
>> - * and a 'gfpmask'.  It should look through the least-recently-used
>> - * 'nr_to_scan' entries and attempt to free them up.  It should return
>> - * the number of objects which remain in the cache.  If it returns -1, it means
>> - * it cannot do any scanning at this time (eg. there is a risk of deadlock).
>> + * @shrink() should look through the least-recently-used 'nr_to_scan' entries
>> + * and attempt to free them up.  It should return the number of objects which
>> + * remain in the cache.  If it returns -1, it means it cannot do any scanning at
>> + * this time (eg. there is a risk of deadlock).
>>   *
>> - * The 'gfpmask' refers to the allocation we are currently trying to
>> - * fulfil.
>> + * @count_objects should return the number of freeable items in the cache. If
>> + * there are no objects to free or the number of freeable items cannot be
>> + * determined, it should return 0. No deadlock checks should be done during the
>> + * count callback - the shrinker relies on aggregating scan counts that couldn't
>> + * be executed due to potential deadlocks to be run at a later call when the
>> + * deadlock condition is no longer pending.
>>   *
>> - * Note that 'shrink' will be passed nr_to_scan == 0 when the VM is
>> - * querying the cache size, so a fastpath for that case is appropriate.
>> + * @scan_objects will only be called if @count_objects returned a positive
>> + * value for the number of freeable objects.
> 
> Saying "positive value" implies to me that count_objects() can return a
> negative code, but such a thing is not documented here.  If
> count_objects() *doesn't* return a -ve code then s/positive/non-zero/
> here would clear up confusion.
> 
Ok, I will update.

>> The callout should scan the cache
>> + * and attempt to free items from the cache. It should then return the number of
>> + * objects freed during the scan, or -1 if progress cannot be made due to
>> + * potential deadlocks. If -1 is returned, then no further attempts to call the
>> + * @scan_objects will be made from the current reclaim context.
>>   */
>>  struct shrinker {
>>  	int (*shrink)(struct shrinker *, struct shrink_control *sc);
>> +	long (*count_objects)(struct shrinker *, struct shrink_control *sc);
>> +	long (*scan_objects)(struct shrinker *, struct shrink_control *sc);
> 
> As these both return counts-of-things, one would expect the return type
> to be unsigned.
> 
> I assume that scan_objects was made signed for the "return -1" thing,
> although that might not have been the best decision - it could return
> ~0UL, for example.
> 

Ok. By using long we are already limiting the number of scanned objects
to half the range of an unsigned long anyway, so reserving a special
value won't hurt.

> It's unclear why count_objects() returns a signed quantity.
> 
> 
I can only guess, but I believe Dave originally just wanted them
symmetrical, for it was a slightly mechanical conversion.

The only benefit of letting count_objects return a signed value is
catching conversion bugs. We have already caught one this way: a
shrinker returned count < 0 because it was mistakenly converted, and
the "return -1" that existed before ended up in both count and scan.

Since this has already proved useful once, how about we leave it like
this and give it some time in linux-next? (I have audited Dave's
conversion, but honestly I haven't stress-tested every driver that has
a shrinker.)

vmscan has a WARN_ON() testing for that, so we'll know. I can provide
another patch to change the type after a while.

Would that work for you ?
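
For reference, the kind of conversion bug meant here can be modelled in
userspace. This is only a sketch under assumptions: the toy_* names,
good_*/bad_* callbacks and the warned_once flag are made up for the
example, and the warn-once behaviour stands in for WARN_ON_ONCE.

```c
#include <assert.h>
#include <stdio.h>

/* Toy stand-in for struct shrinker with the split count/scan callouts. */
struct toy_shrinker {
	long (*count_objects)(struct toy_shrinker *s);
	long (*scan_objects)(struct toy_shrinker *s, long nr_to_scan);
	long nr_cached;
};

static int warned_once;	/* models WARN_ON_ONCE: complain a single time */

/* Correctly converted shrinker: count never goes negative. */
static long good_count(struct toy_shrinker *s)
{
	return s->nr_cached;
}

static long good_scan(struct toy_shrinker *s, long nr_to_scan)
{
	long freed = nr_to_scan < s->nr_cached ? nr_to_scan : s->nr_cached;

	s->nr_cached -= freed;
	return freed;
}

/* Mis-converted shrinker: the old "return -1" leaked into count. */
static long bad_count(struct toy_shrinker *s)
{
	(void)s;
	return -1;
}

/* One pass over a shrinker, following the loop in the patch above. */
static long toy_shrink_one(struct toy_shrinker *s, long total_scan, long batch)
{
	long freed = 0;
	long max_pass = s->count_objects(s);

	assert(batch > 0);

	if (max_pass < 0 && !warned_once) {
		warned_once = 1;
		fprintf(stderr, "count_objects returned %ld\n", max_pass);
	}
	if (max_pass <= 0)
		return 0;

	while (total_scan >= batch) {
		long ret = s->scan_objects(s, batch);

		if (ret == -1)
			break;
		freed += ret;
		total_scan -= batch;
	}
	return freed;
}
```

A shrinker wired up with bad_count is skipped entirely and trips the
warning once, which is exactly what a signed return type buys us while
the conversions settle.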

>>  	int seeks;	/* seeks to recreate an obj */
>>  	long batch;	/* reclaim batch size, 0 = default */
>>  
>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>> index b1b38ad..6ac3ec2 100644
>> --- a/mm/vmscan.c
>> +++ b/mm/vmscan.c
>> @@ -205,19 +205,19 @@ static inline int do_shrinker_shrink(struct shrinker *shrinker,
>>   *
>>   * Returns the number of slab objects which we shrunk.
>>   */
>> -unsigned long shrink_slab(struct shrink_control *shrink,
>> +unsigned long shrink_slab(struct shrink_control *shrinkctl,
>>  			  unsigned long nr_pages_scanned,
>>  			  unsigned long lru_pages)
>>  {
>>  	struct shrinker *shrinker;
>> -	unsigned long ret = 0;
>> +	unsigned long freed = 0;
>>  
>>  	if (nr_pages_scanned == 0)
>>  		nr_pages_scanned = SWAP_CLUSTER_MAX;
>>  
>>  	if (!down_read_trylock(&shrinker_rwsem)) {
>>  		/* Assume we'll be able to shrink next time */
>> -		ret = 1;
>> +		freed = 1;
> 
> That's odd - it didn't free anything?  Needs a comment to avoid
> mystifying other readers.
> 

This is because a return value of zero would make us stop trying. There
is a comment saying that: "Assume we'll be able to shrink next time",
but admittedly it is not saying much. I confess that I still remember
the first time I looked into this code, and it took me a while to figure
out this was the reason.

>>  		goto out;
>>  	}
>>  
>> @@ -225,13 +225,16 @@ unsigned long shrink_slab(struct shrink_control *shrink,
>>  		unsigned long long delta;
>>  		long total_scan;
>>  		long max_pass;
>> -		int shrink_ret = 0;
>>  		long nr;
>>  		long new_nr;
>>  		long batch_size = shrinker->batch ? shrinker->batch
>>  						  : SHRINK_BATCH;
>>  
>> -		max_pass = do_shrinker_shrink(shrinker, shrink, 0);
>> +		if (shrinker->scan_objects) {
> 
> Did you mean to test ->scan_objects here?  Or ->count_objects? 
> ->scan_objects makes sense but I wanna know if it was a copy-n-paste
> bug.
> 
It doesn't really matter, because:
1) This is temporary and will go away.
2) No shrinker is half-converted.

>> +			max_pass = shrinker->count_objects(shrinker, shrinkctl);
>> +			WARN_ON(max_pass < 0);
> 
> OK so from that I see that ->count_objects() doesn't return negative.
> 
> If this warning ever triggers, I expect it will trigger *a lot*.
> WARN_ON_ONCE would be more prudent.  Or just nuke it.
> 

I can change it to WARN_ON_ONCE. As I have suggested, we could leave it
like this (with WARN_ON_ONCE) for some time in linux-next until we are
more or less confident that this was stressed enough.

>> +		} else
>> +			max_pass = do_shrinker_shrink(shrinker, shrinkctl, 0);
>>  		if (max_pass <= 0)
>>  			continue;
>>  
>> @@ -248,8 +251,8 @@ unsigned long shrink_slab(struct shrink_control *shrink,
>>  		do_div(delta, lru_pages + 1);
>>  		total_scan += delta;
>>  		if (total_scan < 0) {
>> -			printk(KERN_ERR "shrink_slab: %pF negative objects to "
>> -			       "delete nr=%ld\n",
>> +			printk(KERN_ERR
>> +			"shrink_slab: %pF negative objects to delete nr=%ld\n",
>>  			       shrinker->shrink, total_scan);
>>  			total_scan = max_pass;
>>  		}
>> @@ -277,20 +280,31 @@ unsigned long shrink_slab(struct shrink_control *shrink,
>>  		if (total_scan > max_pass * 2)
>>  			total_scan = max_pass * 2;
>>  
>> -		trace_mm_shrink_slab_start(shrinker, shrink, nr,
>> +		trace_mm_shrink_slab_start(shrinker, shrinkctl, nr,
>>  					nr_pages_scanned, lru_pages,
>>  					max_pass, delta, total_scan);
>>  
>>  		while (total_scan >= batch_size) {
>> -			int nr_before;
>> +			long ret;
>> +
>> +			if (shrinker->scan_objects) {
>> +				shrinkctl->nr_to_scan = batch_size;
>> +				ret = shrinker->scan_objects(shrinker, shrinkctl);
>> +
>> +				if (ret == -1)
>> +					break;
>> +				freed += ret;
>> +			} else {
>> +				int nr_before;
>> +				nr_before = do_shrinker_shrink(shrinker, shrinkctl, 0);
>> +				ret = do_shrinker_shrink(shrinker, shrinkctl,
>> +								batch_size);
>> +				if (ret == -1)
>> +					break;
>> +				if (ret < nr_before)
> 
> This test seems unnecessary.
> 

Everything within the "else" is going away in a couple of patches. This
is just to keep the tree working while we convert everybody. And since
this is just moving the code below inside a conditional, I would prefer
leaving it this way to make sure that this is actually just the same
code going to a different place.

>> +					freed += nr_before - ret;
>> +			}
>>  
>> -			nr_before = do_shrinker_shrink(shrinker, shrink, 0);
>> -			shrink_ret = do_shrinker_shrink(shrinker, shrink,
>> -							batch_size);
>> -			if (shrink_ret == -1)
>> -				break;
>> -			if (shrink_ret < nr_before)
>> -				ret += nr_before - shrink_ret;
>>  			count_vm_events(SLABS_SCANNED, batch_size);
>>  			total_scan -= batch_size;
>>  
>> @@ -308,12 +322,12 @@ unsigned long shrink_slab(struct shrink_control *shrink,
>>  		else
>>  			new_nr = atomic_long_read(&shrinker->nr_in_batch);
>>  
>> -		trace_mm_shrink_slab_end(shrinker, shrink_ret, nr, new_nr);
>> +		trace_mm_shrink_slab_end(shrinker, freed, nr, new_nr);
>>  	}
>>  	up_read(&shrinker_rwsem);
>>  out:
>>  	cond_resched();
>> -	return ret;
>> +	return freed;
>>  }
>>  
>>  static inline int is_page_cache_freeable(struct page *page)
> 
> shrink_slab() has a long, long history of exhibiting various overflows
> - both multiplicative and over-incrementing.  I looked, and can't see
> any introduction of such problems here, but please do check it
> carefully.  Expect the impossible :(
> 

Yes, cap'n. Will do that.
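
As a starting point for that audit, here is a userspace sketch of the
total_scan arithmetic with the overflow cases made explicit. It is an
illustration under assumptions, not the vmscan code: the delta formula
(4 * nr_pages_scanned / seeks, scaled by max_pass / (lru_pages + 1)) is
recalled from shrink_slab() rather than shown in the hunks above, the
saturation step is an addition of this sketch, and toy_total_scan is a
made-up name.

```c
#include <assert.h>
#include <limits.h>

/*
 * Compute how many objects to scan for one shrinker.
 * Preconditions for this sketch: max_pass > 0 and seeks > 0.
 */
static long toy_total_scan(long nr, unsigned long nr_pages_scanned,
			   unsigned long lru_pages, long max_pass,
			   unsigned int seeks)
{
	unsigned long long delta;
	long total_scan;

	assert(max_pass > 0 && seeks > 0);

	delta = (4ULL * nr_pages_scanned) / seeks;

	/* Multiplicative overflow: saturate instead of wrapping. */
	if (delta > ULLONG_MAX / (unsigned long long)max_pass)
		delta = ULLONG_MAX;
	else
		delta *= (unsigned long long)max_pass;

	delta /= (unsigned long long)lru_pages + 1;

	/*
	 * Over-incrementing: nr + delta must fit in a long.  The patch
	 * detects this after the fact as "negative objects to delete"
	 * and falls back to max_pass; the sketch checks beforehand.
	 */
	if (nr < 0 || delta > (unsigned long long)(LONG_MAX - nr))
		total_scan = max_pass;
	else
		total_scan = nr + (long)delta;

	/* Never scan more than twice the freeable count. */
	if (max_pass <= LONG_MAX / 2 && total_scan > max_pass * 2)
		total_scan = max_pass * 2;

	return total_scan;
}
```

Feeding it nr == LONG_MAX exercises the fallback to max_pass, and a
large deferred nr with a small cache exercises the 2 * max_pass clamp.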

^ permalink raw reply	[flat|nested] 103+ messages in thread

* [PATCH v10 07/35] shrinker: convert superblock shrinkers to new API
       [not found] ` <1370287804-3481-1-git-send-email-glommer-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org>
                     ` (3 preceding siblings ...)
  2013-06-03 19:29   ` [PATCH v10 06/35] mm: new shrinker API Glauber Costa
@ 2013-06-03 19:29   ` Glauber Costa
  2013-06-03 19:29   ` [PATCH v10 08/35] list: add a new LRU list type Glauber Costa
                     ` (25 subsequent siblings)
  30 siblings, 0 replies; 103+ messages in thread
From: Glauber Costa @ 2013-06-03 19:29 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Mel Gorman, Dave Chinner,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, cgroups-u79uwXL29TY76Z2rM5mHXA,
	kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A, Michal Hocko,
	Johannes Weiner, hughd-hpIqsD4AKlfQT0dZR+AlfA, Greg Thelen,
	Dave Chinner, Glauber Costa

From: Dave Chinner <dchinner-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>

Convert superblock shrinker to use the new count/scan API, and
propagate the API changes through to the filesystem callouts. The
filesystem callouts already use a count/scan API, so it's just
changing counters to longs to match the VM API.

This requires the dentry and inode shrinker callouts to be converted
to the count/scan API. This is mainly a mechanical change.

[ v8: fix super_cache_count() return value ]
[ glommer: use mult_frac for fractional proportions, build fixes ]
Signed-off-by: Dave Chinner <dchinner-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
Signed-off-by: Glauber Costa <glommer-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org>
Acked-by: Mel Gorman <mgorman-l3A5Bk7waGM@public.gmane.org>
---
 fs/dcache.c         | 10 +++++---
 fs/inode.c          |  7 +++--
 fs/internal.h       |  2 ++
 fs/super.c          | 74 ++++++++++++++++++++++++++++++++---------------------
 fs/xfs/xfs_icache.c |  4 +--
 fs/xfs/xfs_icache.h |  2 +-
 fs/xfs/xfs_super.c  |  8 +++---
 include/linux/fs.h  |  8 ++----
 8 files changed, 67 insertions(+), 48 deletions(-)

diff --git a/fs/dcache.c b/fs/dcache.c
index 03d0c21..f048f95 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -856,11 +856,12 @@ static void shrink_dentry_list(struct list_head *list)
  * This function may fail to free any resources if all the dentries are in
  * use.
  */
-void prune_dcache_sb(struct super_block *sb, int count)
+long prune_dcache_sb(struct super_block *sb, unsigned long nr_to_scan)
 {
 	struct dentry *dentry;
 	LIST_HEAD(referenced);
 	LIST_HEAD(tmp);
+	long freed = 0;
 
 relock:
 	spin_lock(&sb->s_dentry_lru_lock);
@@ -885,7 +886,8 @@ relock:
 			this_cpu_dec(nr_dentry_unused);
 			sb->s_nr_dentry_unused--;
 			spin_unlock(&dentry->d_lock);
-			if (!--count)
+			freed++;
+			if (!--nr_to_scan)
 				break;
 		}
 		cond_resched_lock(&sb->s_dentry_lru_lock);
@@ -895,6 +897,7 @@ relock:
 	spin_unlock(&sb->s_dentry_lru_lock);
 
 	shrink_dentry_list(&tmp);
+	return freed;
 }
 
 /*
@@ -1282,9 +1285,8 @@ rename_retry:
 void shrink_dcache_parent(struct dentry * parent)
 {
 	LIST_HEAD(dispose);
-	int found;
 
-	while ((found = select_parent(parent, &dispose)) != 0) {
+	while (select_parent(parent, &dispose)) {
 		shrink_dentry_list(&dispose);
 		cond_resched();
 	}
diff --git a/fs/inode.c b/fs/inode.c
index ff29765..1ddaa2e 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -704,10 +704,11 @@ static int can_unuse(struct inode *inode)
  * LRU does not have strict ordering. Hence we don't want to reclaim inodes
  * with this flag set because they are the inodes that are out of order.
  */
-void prune_icache_sb(struct super_block *sb, int nr_to_scan)
+long prune_icache_sb(struct super_block *sb, unsigned long nr_to_scan)
 {
 	LIST_HEAD(freeable);
-	int nr_scanned;
+	long nr_scanned;
+	long freed = 0;
 	unsigned long reap = 0;
 
 	spin_lock(&sb->s_inode_lru_lock);
@@ -777,6 +778,7 @@ void prune_icache_sb(struct super_block *sb, int nr_to_scan)
 		list_move(&inode->i_lru, &freeable);
 		sb->s_nr_inodes_unused--;
 		this_cpu_dec(nr_unused);
+		freed++;
 	}
 	if (current_is_kswapd())
 		__count_vm_events(KSWAPD_INODESTEAL, reap);
@@ -787,6 +789,7 @@ void prune_icache_sb(struct super_block *sb, int nr_to_scan)
 		current->reclaim_state->reclaimed_slab += reap;
 
 	dispose_list(&freeable);
+	return freed;
 }
 
 static void __wait_on_freeing_inode(struct inode *inode);
diff --git a/fs/internal.h b/fs/internal.h
index cd5009f..ea43c89 100644
--- a/fs/internal.h
+++ b/fs/internal.h
@@ -110,6 +110,7 @@ extern int open_check_o_direct(struct file *f);
  * inode.c
  */
 extern spinlock_t inode_sb_list_lock;
+extern long prune_icache_sb(struct super_block *sb, unsigned long nr_to_scan);
 extern void inode_add_lru(struct inode *inode);
 
 /*
@@ -125,6 +126,7 @@ extern int invalidate_inodes(struct super_block *, bool);
  * dcache.c
  */
 extern struct dentry *__d_alloc(struct super_block *, const struct qstr *);
+extern long prune_dcache_sb(struct super_block *sb, unsigned long nr_to_scan);
 
 /*
  * read_write.c
diff --git a/fs/super.c b/fs/super.c
index 0be75fb..18871f6 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -53,11 +53,14 @@ static char *sb_writers_name[SB_FREEZE_LEVELS] = {
  * shrinker path and that leads to deadlock on the shrinker_rwsem. Hence we
  * take a passive reference to the superblock to avoid this from occurring.
  */
-static int prune_super(struct shrinker *shrink, struct shrink_control *sc)
+static long super_cache_scan(struct shrinker *shrink, struct shrink_control *sc)
 {
 	struct super_block *sb;
-	int	fs_objects = 0;
-	int	total_objects;
+	long	fs_objects = 0;
+	long	total_objects;
+	long	freed = 0;
+	long	dentries;
+	long	inodes;
 
 	sb = container_of(shrink, struct super_block, s_shrink);
 
@@ -65,7 +68,7 @@ static int prune_super(struct shrinker *shrink, struct shrink_control *sc)
 	 * Deadlock avoidance.  We may hold various FS locks, and we don't want
 	 * to recurse into the FS that called us in clear_inode() and friends..
 	 */
-	if (sc->nr_to_scan && !(sc->gfp_mask & __GFP_FS))
+	if (!(sc->gfp_mask & __GFP_FS))
 		return -1;
 
 	if (!grab_super_passive(sb))
@@ -77,33 +80,45 @@ static int prune_super(struct shrinker *shrink, struct shrink_control *sc)
 	total_objects = sb->s_nr_dentry_unused +
 			sb->s_nr_inodes_unused + fs_objects + 1;
 
-	if (sc->nr_to_scan) {
-		int	dentries;
-		int	inodes;
-
-		/* proportion the scan between the caches */
-		dentries = mult_frac(sc->nr_to_scan, sb->s_nr_dentry_unused,
-							total_objects);
-		inodes = mult_frac(sc->nr_to_scan, sb->s_nr_inodes_unused,
-							total_objects);
-		if (fs_objects)
-			fs_objects = mult_frac(sc->nr_to_scan, fs_objects,
-							total_objects);
-		/*
-		 * prune the dcache first as the icache is pinned by it, then
-		 * prune the icache, followed by the filesystem specific caches
-		 */
-		prune_dcache_sb(sb, dentries);
-		prune_icache_sb(sb, inodes);
+	/* proportion the scan between the caches */
+	dentries = mult_frac(sc->nr_to_scan, sb->s_nr_dentry_unused,
+								total_objects);
+	inodes = mult_frac(sc->nr_to_scan, sb->s_nr_inodes_unused,
+								total_objects);
 
-		if (fs_objects && sb->s_op->free_cached_objects) {
-			sb->s_op->free_cached_objects(sb, fs_objects);
-			fs_objects = sb->s_op->nr_cached_objects(sb);
-		}
-		total_objects = sb->s_nr_dentry_unused +
-				sb->s_nr_inodes_unused + fs_objects;
+	/*
+	 * prune the dcache first as the icache is pinned by it, then
+	 * prune the icache, followed by the filesystem specific caches
+	 */
+	freed = prune_dcache_sb(sb, dentries);
+	freed += prune_icache_sb(sb, inodes);
+
+	if (fs_objects) {
+		fs_objects = mult_frac(sc->nr_to_scan, fs_objects,
+								total_objects);
+		freed += sb->s_op->free_cached_objects(sb, fs_objects);
 	}
 
+	drop_super(sb);
+	return freed;
+}
+
+static long super_cache_count(struct shrinker *shrink, struct shrink_control *sc)
+{
+	struct super_block *sb;
+	long	total_objects = 0;
+
+	sb = container_of(shrink, struct super_block, s_shrink);
+
+	if (!grab_super_passive(sb))
+		return 0;
+
+	if (sb->s_op && sb->s_op->nr_cached_objects)
+		total_objects = sb->s_op->nr_cached_objects(sb);
+
+	total_objects += sb->s_nr_dentry_unused;
+	total_objects += sb->s_nr_inodes_unused;
+
 	total_objects = vfs_pressure_ratio(total_objects);
 	drop_super(sb);
 	return total_objects;
@@ -217,7 +232,8 @@ static struct super_block *alloc_super(struct file_system_type *type, int flags)
 		s->cleancache_poolid = -1;
 
 		s->s_shrink.seeks = DEFAULT_SEEKS;
-		s->s_shrink.shrink = prune_super;
+		s->s_shrink.scan_objects = super_cache_scan;
+		s->s_shrink.count_objects = super_cache_count;
 		s->s_shrink.batch = 1024;
 	}
 out:
diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
index 96e344e..b35c311 100644
--- a/fs/xfs/xfs_icache.c
+++ b/fs/xfs/xfs_icache.c
@@ -1164,7 +1164,7 @@ xfs_reclaim_inodes(
  * them to be cleaned, which we hope will not be very long due to the
  * background walker having already kicked the IO off on those dirty inodes.
  */
-void
+long
 xfs_reclaim_inodes_nr(
 	struct xfs_mount	*mp,
 	int			nr_to_scan)
@@ -1173,7 +1173,7 @@ xfs_reclaim_inodes_nr(
 	xfs_reclaim_work_queue(mp);
 	xfs_ail_push_all(mp->m_ail);
 
-	xfs_reclaim_inodes_ag(mp, SYNC_TRYLOCK | SYNC_WAIT, &nr_to_scan);
+	return xfs_reclaim_inodes_ag(mp, SYNC_TRYLOCK | SYNC_WAIT, &nr_to_scan);
 }
 
 /*
diff --git a/fs/xfs/xfs_icache.h b/fs/xfs/xfs_icache.h
index e0f138c..2d6d2d3 100644
--- a/fs/xfs/xfs_icache.h
+++ b/fs/xfs/xfs_icache.h
@@ -31,7 +31,7 @@ void xfs_reclaim_worker(struct work_struct *work);
 
 int xfs_reclaim_inodes(struct xfs_mount *mp, int mode);
 int xfs_reclaim_inodes_count(struct xfs_mount *mp);
-void xfs_reclaim_inodes_nr(struct xfs_mount *mp, int nr_to_scan);
+long xfs_reclaim_inodes_nr(struct xfs_mount *mp, int nr_to_scan);
 
 void xfs_inode_set_reclaim_tag(struct xfs_inode *ip);
 
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index ea341ce..1ff991b 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -1523,19 +1523,19 @@ xfs_fs_mount(
 	return mount_bdev(fs_type, flags, dev_name, data, xfs_fs_fill_super);
 }
 
-static int
+static long
 xfs_fs_nr_cached_objects(
 	struct super_block	*sb)
 {
 	return xfs_reclaim_inodes_count(XFS_M(sb));
 }
 
-static void
+static long
 xfs_fs_free_cached_objects(
 	struct super_block	*sb,
-	int			nr_to_scan)
+	long			nr_to_scan)
 {
-	xfs_reclaim_inodes_nr(XFS_M(sb), nr_to_scan);
+	return xfs_reclaim_inodes_nr(XFS_M(sb), nr_to_scan);
 }
 
 static const struct super_operations xfs_super_operations = {
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 11f9ad2..b0170ec 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1326,10 +1326,6 @@ struct super_block {
 	int s_readonly_remount;
 };
 
-/* superblock cache pruning functions */
-extern void prune_icache_sb(struct super_block *sb, int nr_to_scan);
-extern void prune_dcache_sb(struct super_block *sb, int nr_to_scan);
-
 extern struct timespec current_fs_time(struct super_block *sb);
 
 /*
@@ -1616,8 +1612,8 @@ struct super_operations {
 	ssize_t (*quota_write)(struct super_block *, int, const char *, size_t, loff_t);
 #endif
 	int (*bdev_try_to_free_page)(struct super_block*, struct page*, gfp_t);
-	int (*nr_cached_objects)(struct super_block *);
-	void (*free_cached_objects)(struct super_block *, int);
+	long (*nr_cached_objects)(struct super_block *);
+	long (*free_cached_objects)(struct super_block *, long);
 };
 
 /*
-- 
1.8.1.4


* [PATCH v10 08/35] list: add a new LRU list type
       [not found] ` <1370287804-3481-1-git-send-email-glommer-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org>
                     ` (4 preceding siblings ...)
  2013-06-03 19:29   ` [PATCH v10 07/35] shrinker: convert superblock shrinkers to new API Glauber Costa
@ 2013-06-03 19:29   ` Glauber Costa
       [not found]     ` <1370287804-3481-9-git-send-email-glommer-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org>
  2013-06-03 19:29   ` [PATCH v10 09/35] inode: convert inode lru list to generic lru list code Glauber Costa
                     ` (24 subsequent siblings)
  30 siblings, 1 reply; 103+ messages in thread
From: Glauber Costa @ 2013-06-03 19:29 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Mel Gorman, Dave Chinner,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, cgroups-u79uwXL29TY76Z2rM5mHXA,
	kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A, Michal Hocko,
	Johannes Weiner, hughd-hpIqsD4AKlfQT0dZR+AlfA, Greg Thelen,
	Dave Chinner, Glauber Costa

From: Dave Chinner <dchinner-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>

Several subsystems use the same construct for LRU lists - a list
head, a spin lock and an item count. They also use exactly the same
code for adding and removing items from the LRU. Create a generic
type for these LRU lists.

This is the beginning of generic, node aware LRUs for shrinkers to
work with.

[ glommer: enum defined constants for lru. Suggested by gthelen,
  don't relock over retry ]
Signed-off-by: Dave Chinner <dchinner-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
Signed-off-by: Glauber Costa <glommer-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org>
Reviewed-by: Greg Thelen <gthelen-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
Acked-by: Mel Gorman <mgorman-l3A5Bk7waGM@public.gmane.org>
---
 include/linux/list_lru.h |  46 ++++++++++++++++++
 lib/Makefile             |   2 +-
 lib/list_lru.c           | 122 +++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 169 insertions(+), 1 deletion(-)
 create mode 100644 include/linux/list_lru.h
 create mode 100644 lib/list_lru.c

diff --git a/include/linux/list_lru.h b/include/linux/list_lru.h
new file mode 100644
index 0000000..4f82a57
--- /dev/null
+++ b/include/linux/list_lru.h
@@ -0,0 +1,46 @@
+/*
+ * Copyright (c) 2010-2012 Red Hat, Inc. All rights reserved.
+ * Author: David Chinner
+ *
+ * Generic LRU infrastructure
+ */
+#ifndef _LRU_LIST_H
+#define _LRU_LIST_H
+
+#include <linux/list.h>
+
+enum lru_status {
+	LRU_REMOVED,		/* item removed from list */
+	LRU_ROTATE,		/* item referenced, give another pass */
+	LRU_SKIP,		/* item cannot be locked, skip */
+	LRU_RETRY,		/* item not freeable. May drop the lock
+				   internally, but has to return locked. */
+};
+
+struct list_lru {
+	spinlock_t		lock;
+	struct list_head	list;
+	long			nr_items;
+};
+
+int list_lru_init(struct list_lru *lru);
+int list_lru_add(struct list_lru *lru, struct list_head *item);
+int list_lru_del(struct list_lru *lru, struct list_head *item);
+
+static inline unsigned long list_lru_count(struct list_lru *lru)
+{
+	return lru->nr_items;
+}
+
+typedef enum lru_status
+(*list_lru_walk_cb)(struct list_head *item, spinlock_t *lock, void *cb_arg);
+
+typedef void (*list_lru_dispose_cb)(struct list_head *dispose_list);
+
+unsigned long list_lru_walk(struct list_lru *lru, list_lru_walk_cb isolate,
+		   void *cb_arg, unsigned long nr_to_walk);
+
+unsigned long
+list_lru_dispose_all(struct list_lru *lru, list_lru_dispose_cb dispose);
+
+#endif /* _LRU_LIST_H */
diff --git a/lib/Makefile b/lib/Makefile
index af911db..d610fda 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -13,7 +13,7 @@ lib-y := ctype.o string.o vsprintf.o cmdline.o \
 	 sha1.o md5.o irq_regs.o reciprocal_div.o argv_split.o \
 	 proportions.o flex_proportions.o prio_heap.o ratelimit.o show_mem.o \
 	 is_single_threaded.o plist.o decompress.o kobject_uevent.o \
-	 earlycpio.o percpu-refcount.o
+	 earlycpio.o percpu-refcount.o list_lru.o
 
 obj-$(CONFIG_ARCH_HAS_DEBUG_STRICT_USER_COPY_CHECKS) += usercopy.o
 lib-$(CONFIG_MMU) += ioremap.o
diff --git a/lib/list_lru.c b/lib/list_lru.c
new file mode 100644
index 0000000..3127edd
--- /dev/null
+++ b/lib/list_lru.c
@@ -0,0 +1,122 @@
+/*
+ * Copyright (c) 2010-2012 Red Hat, Inc. All rights reserved.
+ * Author: David Chinner
+ *
+ * Generic LRU infrastructure
+ */
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/list_lru.h>
+
+int
+list_lru_add(
+	struct list_lru	*lru,
+	struct list_head *item)
+{
+	spin_lock(&lru->lock);
+	if (list_empty(item)) {
+		list_add_tail(item, &lru->list);
+		lru->nr_items++;
+		spin_unlock(&lru->lock);
+		return 1;
+	}
+	spin_unlock(&lru->lock);
+	return 0;
+}
+EXPORT_SYMBOL_GPL(list_lru_add);
+
+int
+list_lru_del(
+	struct list_lru	*lru,
+	struct list_head *item)
+{
+	spin_lock(&lru->lock);
+	if (!list_empty(item)) {
+		list_del_init(item);
+		lru->nr_items--;
+		spin_unlock(&lru->lock);
+		return 1;
+	}
+	spin_unlock(&lru->lock);
+	return 0;
+}
+EXPORT_SYMBOL_GPL(list_lru_del);
+
+unsigned long
+list_lru_walk(
+	struct list_lru *lru,
+	list_lru_walk_cb isolate,
+	void		*cb_arg,
+	unsigned long	nr_to_walk)
+{
+	struct list_head *item, *n;
+	unsigned long removed = 0;
+
+	spin_lock(&lru->lock);
+	list_for_each_safe(item, n, &lru->list) {
+		enum lru_status ret;
+		bool first_pass = true;
+restart:
+		ret = isolate(item, &lru->lock, cb_arg);
+		switch (ret) {
+		case LRU_REMOVED:
+			lru->nr_items--;
+			removed++;
+			break;
+		case LRU_ROTATE:
+			list_move_tail(item, &lru->list);
+			break;
+		case LRU_SKIP:
+			break;
+		case LRU_RETRY:
+			if (!first_pass)
+				break;
+			first_pass = false;
+			goto restart;
+		default:
+			BUG();
+		}
+
+		if (nr_to_walk-- == 0)
+			break;
+
+	}
+	spin_unlock(&lru->lock);
+	return removed;
+}
+EXPORT_SYMBOL_GPL(list_lru_walk);
+
+unsigned long
+list_lru_dispose_all(
+	struct list_lru *lru,
+	list_lru_dispose_cb dispose)
+{
+	unsigned long disposed = 0;
+	LIST_HEAD(dispose_list);
+
+	spin_lock(&lru->lock);
+	while (!list_empty(&lru->list)) {
+		list_splice_init(&lru->list, &dispose_list);
+		disposed += lru->nr_items;
+		lru->nr_items = 0;
+		spin_unlock(&lru->lock);
+
+		dispose(&dispose_list);
+
+		spin_lock(&lru->lock);
+	}
+	spin_unlock(&lru->lock);
+	return disposed;
+}
+
+int
+list_lru_init(
+	struct list_lru	*lru)
+{
+	spin_lock_init(&lru->lock);
+	INIT_LIST_HEAD(&lru->list);
+	lru->nr_items = 0;
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(list_lru_init);
-- 
1.8.1.4


* Re: [PATCH v10 08/35] list: add a new LRU list type
       [not found]     ` <1370287804-3481-9-git-send-email-glommer-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org>
@ 2013-06-05 23:07       ` Andrew Morton
  2013-06-06  2:49         ` Dave Chinner
  2013-06-06  8:10         ` Glauber Costa
  0 siblings, 2 replies; 103+ messages in thread
From: Andrew Morton @ 2013-06-05 23:07 UTC (permalink / raw)
  To: Glauber Costa
  Cc: linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Mel Gorman, Dave Chinner,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, cgroups-u79uwXL29TY76Z2rM5mHXA,
	kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A, Michal Hocko,
	Johannes Weiner, hughd-hpIqsD4AKlfQT0dZR+AlfA, Greg Thelen,
	Dave Chinner

On Mon,  3 Jun 2013 23:29:37 +0400 Glauber Costa <glommer-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org> wrote:

> From: Dave Chinner <dchinner-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
> 
> Several subsystems use the same construct for LRU lists - a list
> head, a spin lock and an item count. They also use exactly the same
> code for adding and removing items from the LRU. Create a generic
> type for these LRU lists.
> 
> This is the beginning of generic, node aware LRUs for shrinkers to
> work with.
> 
> ...
>
> --- /dev/null
> +++ b/include/linux/list_lru.h
> @@ -0,0 +1,46 @@
> +/*
> + * Copyright (c) 2010-2012 Red Hat, Inc. All rights reserved.
> + * Author: David Chinner
> + *
> + * Generic LRU infrastructure
> + */
> +#ifndef _LRU_LIST_H
> +#define _LRU_LIST_H
> +
> +#include <linux/list.h>
> +
> +enum lru_status {
> +	LRU_REMOVED,		/* item removed from list */
> +	LRU_ROTATE,		/* item referenced, give another pass */
> +	LRU_SKIP,		/* item cannot be locked, skip */
> +	LRU_RETRY,		/* item not freeable. May drop the lock
> +				   internally, but has to return locked. */
> +};

What's this?

Seems to be the return code from the undocumented list_lru_walk_cb?

> +struct list_lru {
> +	spinlock_t		lock;
> +	struct list_head	list;
> +	long			nr_items;

Should be an unsigned type.

> +};
> +
> +int list_lru_init(struct list_lru *lru);
> +int list_lru_add(struct list_lru *lru, struct list_head *item);
> +int list_lru_del(struct list_lru *lru, struct list_head *item);
> +
> +static inline unsigned long list_lru_count(struct list_lru *lru)
> +{
> +	return lru->nr_items;
> +}

It got changed to unsigned here!

> +typedef enum lru_status
> +(*list_lru_walk_cb)(struct list_head *item, spinlock_t *lock, void *cb_arg);
> +
> +typedef void (*list_lru_dispose_cb)(struct list_head *dispose_list);
> +
> +unsigned long list_lru_walk(struct list_lru *lru, list_lru_walk_cb isolate,
> +		   void *cb_arg, unsigned long nr_to_walk);
> +
> +unsigned long
> +list_lru_dispose_all(struct list_lru *lru, list_lru_dispose_cb dispose);
> +
> +#endif /* _LRU_LIST_H */
> diff --git a/lib/Makefile b/lib/Makefile
> index af911db..d610fda 100644
> --- a/lib/Makefile
> +++ b/lib/Makefile
> @@ -13,7 +13,7 @@ lib-y := ctype.o string.o vsprintf.o cmdline.o \
>  	 sha1.o md5.o irq_regs.o reciprocal_div.o argv_split.o \
>  	 proportions.o flex_proportions.o prio_heap.o ratelimit.o show_mem.o \
>  	 is_single_threaded.o plist.o decompress.o kobject_uevent.o \
> -	 earlycpio.o percpu-refcount.o
> +	 earlycpio.o percpu-refcount.o list_lru.o
>  
>  obj-$(CONFIG_ARCH_HAS_DEBUG_STRICT_USER_COPY_CHECKS) += usercopy.o
>  lib-$(CONFIG_MMU) += ioremap.o
> diff --git a/lib/list_lru.c b/lib/list_lru.c
> new file mode 100644
> index 0000000..3127edd
> --- /dev/null
> +++ b/lib/list_lru.c
> @@ -0,0 +1,122 @@
> +/*
> + * Copyright (c) 2010-2012 Red Hat, Inc. All rights reserved.
> + * Author: David Chinner
> + *
> + * Generic LRU infrastructure
> + */
> +#include <linux/kernel.h>
> +#include <linux/module.h>
> +#include <linux/list_lru.h>
> +
> +int
> +list_lru_add(
> +	struct list_lru	*lru,
> +	struct list_head *item)

This is lib/, not fs/xfs/ ;)

> +{
> +	spin_lock(&lru->lock);

OK, problems.  Long experience has shown us that in-kernel container
library code like this should not perform its own locking.  Things like:

- I want to call it from interrupts!

- I want to use a mutex!

- I want to use RCU!

- I already hold a lock and don't need this code to take another one!

- I need to sleep in my isolate callback, but the library code is
  holding a spinlock!

- I want to test lru.nr_items in a non-racy fashion, but to do that I
  have to take a lib/-private spinlock!

etcetera.  It's just heaps less flexible and useful this way, and
library code should be flexible and useful.

If you want to put a spinlocked layer on top of the core code then fine
- that looks to be simple enough, apart from list_lru_dispose_all().
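
The layering suggested above can be sketched roughly as follows. This
is a userspace illustration only, not code from the patch: the names
lru_core, lru_locked, sketch_lock() and friends are all hypothetical,
and a C11 atomic_flag stands in for the kernel spinlock. The core does
no locking at all; the thin wrapper adds it for callers that want it.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Userspace stand-in for struct list_head (assumption, not kernel code). */
struct list_node { struct list_node *next, *prev; };

void node_init(struct list_node *n) { n->next = n; n->prev = n; }

/* Core container: no locking, the caller provides exclusion. */
struct lru_core {
	struct list_node list;
	unsigned long nr_items;
};

void lru_core_init(struct lru_core *lru)
{
	node_init(&lru->list);
	lru->nr_items = 0;
}

/* Returns 1 if the item was added, 0 if it is already on some list. */
int lru_core_add(struct lru_core *lru, struct list_node *item)
{
	assert(lru != NULL && item != NULL);
	if (item->next != item)
		return 0;
	item->prev = lru->list.prev;
	item->next = &lru->list;
	lru->list.prev->next = item;
	lru->list.prev = item;
	lru->nr_items++;
	return 1;
}

/* Optional locked layer on top of the core (hypothetical names). */
struct lru_locked {
	atomic_flag lock;
	struct lru_core core;
};

static void sketch_lock(atomic_flag *l)
{
	while (atomic_flag_test_and_set_explicit(l, memory_order_acquire))
		;	/* spin until the flag is released */
}

static void sketch_unlock(atomic_flag *l)
{
	atomic_flag_clear_explicit(l, memory_order_release);
}

void lru_locked_init(struct lru_locked *lru)
{
	atomic_flag_clear(&lru->lock);
	lru_core_init(&lru->core);
}

int lru_locked_add(struct lru_locked *lru, struct list_node *item)
{
	int ret;

	sketch_lock(&lru->lock);
	ret = lru_core_add(&lru->core, item);
	sketch_unlock(&lru->lock);
	return ret;
}

/* A count that is exact at the moment it is sampled. */
unsigned long lru_locked_count(struct lru_locked *lru)
{
	unsigned long n;

	sketch_lock(&lru->lock);
	n = lru->core.nr_items;
	sketch_unlock(&lru->lock);
	return n;
}

/* Adds two items, retries one of them, and returns the final count. */
unsigned long lru_layering_demo(void)
{
	struct lru_locked lru;
	struct list_node a, b;

	lru_locked_init(&lru);
	node_init(&a);
	node_init(&b);
	lru_locked_add(&lru, &a);
	lru_locked_add(&lru, &b);
	if (lru_locked_add(&lru, &a) != 0)	/* second add must be refused */
		return 0;
	return lru_locked_count(&lru);
}
```

A caller that already holds its own mutex or runs under some other
exclusion would call lru_core_add() directly and skip the wrapper.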

> +	if (list_empty(item)) {
> +		list_add_tail(item, &lru->list);
> +		lru->nr_items++;
> +		spin_unlock(&lru->lock);
> +		return 1;
> +	}
> +	spin_unlock(&lru->lock);
> +	return 0;
> +}
> +EXPORT_SYMBOL_GPL(list_lru_add);

So an undocumented, i-have-to-guess-why feature of list_lru_add() is
that it will refuse to add an item which appears to be on a list
already?

This is a little bit strange, because one could legitimately do

	list_del(item);		/* from my private list */
	list_lru_add(lru, item);

but this interface forced me to do a needless lru_del_init().

Maybe this is good, maybe it is bad.  It depends on what the author(s)
were thinking at the time ;)


Either way, returning 1 on success and 0 on failure is surprising.  0
means success, please.  Alternatively I guess one could make it return
bool and document the dang thing, hence retaining the current 0/1 concept.

> +
> +int
> +list_lru_del(
> +	struct list_lru	*lru,
> +	struct list_head *item)
> +{
> +	spin_lock(&lru->lock);
> +	if (!list_empty(item)) {
> +		list_del_init(item);
> +		lru->nr_items--;
> +		spin_unlock(&lru->lock);
> +		return 1;
> +	}
> +	spin_unlock(&lru->lock);
> +	return 0;
> +}
> +EXPORT_SYMBOL_GPL(list_lru_del);
> +
> +unsigned long
> +list_lru_walk(
> +	struct list_lru *lru,
> +	list_lru_walk_cb isolate,
> +	void		*cb_arg,
> +	unsigned long	nr_to_walk)

Interface documentation, please.

> +{
> +	struct list_head *item, *n;
> +	unsigned long removed = 0;
> +
> +	spin_lock(&lru->lock);
> +	list_for_each_safe(item, n, &lru->list) {
> +		enum lru_status ret;
> +		bool first_pass = true;
> +restart:
> +		ret = isolate(item, &lru->lock, cb_arg);
> +		switch (ret) {
> +		case LRU_REMOVED:
> +			lru->nr_items--;
> +			removed++;
> +			break;
> +		case LRU_ROTATE:
> +			list_move_tail(item, &lru->list);
> +			break;
> +		case LRU_SKIP:
> +			break;
> +		case LRU_RETRY:

With no documentation in the code or the changelog, I haven't a clue why
these four possibilities exist :(

> +			if (!first_pass)
> +				break;
> +			first_pass = false;
> +			goto restart;
> +		default:
> +			BUG();
> +		}
> +
> +		if (nr_to_walk-- == 0)
> +			break;
> +
> +	}
> +	spin_unlock(&lru->lock);
> +	return removed;
> +}
> +EXPORT_SYMBOL_GPL(list_lru_walk);

Passing the address of the spinlock to the list_lru_walk_cb handler is
rather gross.

And afaict it is unresolvably buggy - if the handler dropped that lock,
list_lru_walk() is now left holding a list_head at *item which could
have been altered or even freed.

How [patch 09/35]'s inode_lru_isolate() avoids this bug I don't know. 
Perhaps it doesn't.


Addendum: having now read through the evolution of lib/list_lru.c, it's
pretty apparent that this code is highly specific to the inode and
dcache shrinkers and is unlikely to see applications elsewhere.  So
hrm, perhaps we're kinda kidding ourselves by putting it in lib/ at
all.


* Re: [PATCH v10 08/35] list: add a new LRU list type
  2013-06-05 23:07       ` Andrew Morton
@ 2013-06-06  2:49         ` Dave Chinner
  2013-06-06  3:05           ` Andrew Morton
  2013-06-06 14:28           ` Glauber Costa
  2013-06-06  8:10         ` Glauber Costa
  1 sibling, 2 replies; 103+ messages in thread
From: Dave Chinner @ 2013-06-06  2:49 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Glauber Costa, linux-fsdevel, Mel Gorman, linux-mm, cgroups,
	kamezawa.hiroyu, Michal Hocko, Johannes Weiner, hughd,
	Greg Thelen, Dave Chinner

On Wed, Jun 05, 2013 at 04:07:58PM -0700, Andrew Morton wrote:
> On Mon,  3 Jun 2013 23:29:37 +0400 Glauber Costa <glommer@openvz.org> wrote:
> 
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > Several subsystems use the same construct for LRU lists - a list
> > head, a spin lock and an item count. They also use exactly the same
> > code for adding and removing items from the LRU. Create a generic
> > type for these LRU lists.
> > 
> > This is the beginning of generic, node aware LRUs for shrinkers to
> > work with.
> > 
> > ...
> >
> > --- /dev/null
> > +++ b/include/linux/list_lru.h
> > @@ -0,0 +1,46 @@
> > +/*
> > + * Copyright (c) 2010-2012 Red Hat, Inc. All rights reserved.
> > + * Author: David Chinner
> > + *
> > + * Generic LRU infrastructure
> > + */
> > +#ifndef _LRU_LIST_H
> > +#define _LRU_LIST_H
> > +
> > +#include <linux/list.h>
> > +
> > +enum lru_status {
> > +	LRU_REMOVED,		/* item removed from list */
> > +	LRU_ROTATE,		/* item referenced, give another pass */
> > +	LRU_SKIP,		/* item cannot be locked, skip */
> > +	LRU_RETRY,		/* item not freeable. May drop the lock
> > +				   internally, but has to return locked. */
> > +};
> 
> What's this?
> 
> Seems to be the return code from the undocumented list_lru_walk_cb?
> 
> > +struct list_lru {
> > +	spinlock_t		lock;
> > +	struct list_head	list;
> > +	long			nr_items;
> 
> Should be an unsigned type.
> 
> > +};
> > +
> > +int list_lru_init(struct list_lru *lru);
> > +int list_lru_add(struct list_lru *lru, struct list_head *item);
> > +int list_lru_del(struct list_lru *lru, struct list_head *item);
> > +
> > +static inline unsigned long list_lru_count(struct list_lru *lru)
> > +{
> > +	return lru->nr_items;
> > +}
> 
> It got changed to unsigned here!
> 
> > +typedef enum lru_status
> > +(*list_lru_walk_cb)(struct list_head *item, spinlock_t *lock, void *cb_arg);
> > +
> > +typedef void (*list_lru_dispose_cb)(struct list_head *dispose_list);
> > +
> > +unsigned long list_lru_walk(struct list_lru *lru, list_lru_walk_cb isolate,
> > +		   void *cb_arg, unsigned long nr_to_walk);
> > +
> > +unsigned long
> > +list_lru_dispose_all(struct list_lru *lru, list_lru_dispose_cb dispose);
> > +
> > +#endif /* _LRU_LIST_H */
> > diff --git a/lib/Makefile b/lib/Makefile
> > index af911db..d610fda 100644
> > --- a/lib/Makefile
> > +++ b/lib/Makefile
> > @@ -13,7 +13,7 @@ lib-y := ctype.o string.o vsprintf.o cmdline.o \
> >  	 sha1.o md5.o irq_regs.o reciprocal_div.o argv_split.o \
> >  	 proportions.o flex_proportions.o prio_heap.o ratelimit.o show_mem.o \
> >  	 is_single_threaded.o plist.o decompress.o kobject_uevent.o \
> > -	 earlycpio.o percpu-refcount.o
> > +	 earlycpio.o percpu-refcount.o list_lru.o
> >  
> >  obj-$(CONFIG_ARCH_HAS_DEBUG_STRICT_USER_COPY_CHECKS) += usercopy.o
> >  lib-$(CONFIG_MMU) += ioremap.o
> > diff --git a/lib/list_lru.c b/lib/list_lru.c
> > new file mode 100644
> > index 0000000..3127edd
> > --- /dev/null
> > +++ b/lib/list_lru.c
> > @@ -0,0 +1,122 @@
> > +/*
> > + * Copyright (c) 2010-2012 Red Hat, Inc. All rights reserved.
> > + * Author: David Chinner
> > + *
> > + * Generic LRU infrastructure
> > + */
> > +#include <linux/kernel.h>
> > +#include <linux/module.h>
> > +#include <linux/list_lru.h>
> > +
> > +int
> > +list_lru_add(
> > +	struct list_lru	*lru,
> > +	struct list_head *item)
> 
> This is lib/, not fs/xfs/ ;)
> 
> > +{
> > +	spin_lock(&lru->lock);
> 
> OK, problems.  Long experience has shown us that in-kernel container
> library code like this should not perform its own locking.  Things like:
> 
> - I want to call it from interrupts!
> - I want to use a mutex!
> - I want to use RCU!

Wrap them around the outside of all your LRU operations, then.

> - I already hold a lock and don't need this code to take another one!

The internal lru lock is for simplicity of implementation.

> - I need to sleep in my isolate callback, but the library code is
>   holding a spinlock!

The isolate callback gets passed the spinlock that it is holding
precisely so the callback can drop it and do sleeping operations.

> - I want to test lru.nr_items in a non-racy fashion, but to do that I
>   have to take a lib/-private spinlock!

Nobody should be peeking at the internals of the list structures.
That's just completely broken. Use the APIs that are provided, as
there is no guarantee that the implementation of the lists is going
to remain the same over time. The LRU list locks are an internal
implementation detail, and are only exposed in the places where
callbacks might need to drop them. And even then they are exposed as
just a pointer to the lock to avoid exposing internal details that
nobody has any business fucking with.

The current implementation is designed to be basic and obviously
correct, not some wacky, amazingly optimised code that nobody but
the original author can understand.

> etcetera.  It's just heaps less flexible and useful this way, and
> library code should be flexible and useful.

Quite frankly, the problem with all the existing LRU code is that
everyone rolls their own list and locking scheme. And you know what?
All people do is cookie-cutter copy-n-paste some buggy
implementation from somewhere else.

> If you want to put a spinlocked layer on top of the core code then fine
> - that looks to be simple enough, apart from list_lru_dispose_all().

I'm not interested in modifying the code for some nebulous "what if"
scenario. When someone comes up with an actual need that they can't
scratch by wrapping their needed exclusion around the outside of the
LRU like the dentry and inode caches do, then we can change it to
address that need.

> > +		list_add_tail(item, &lru->list);
> > +		lru->nr_items++;
> > +		spin_unlock(&lru->lock);
> > +		return 1;
> > +	}
> > +	spin_unlock(&lru->lock);
> > +	return 0;
> > +}
> > +EXPORT_SYMBOL_GPL(list_lru_add);
> 
> So an undocumented, i-have-to-guess-why feature of list_lru_add() is
> that it will refuse to add an item which appears to be on a list
> already?

Because callers don't know ahead of time if the item is on a list
already. This is the same behaviour that the inode and dentry cache
LRUs have had for years. i.e. it supports lazy LRU addition and
prevents objects that might be on dispose lists from being re-added
to the LRU list and corrupting the lists...

> This is a little bit strange, because one could legitimately do
> 
> 	list_del(item);		/* from my private list */
> 	list_lru_add(lru, item);
> 
> but this interface forced me to do a needless lru_del_init().

How do you know what list the item is on in list_lru_add()? We have
to know to get the accounting right. i.e. if it is already on the
LRU and we remove it and then re-add it, the number of items on the
list doesn't change. But if it's on some private list, then we have
to increment the number of items on the LRU list.

So, if it's already on a list, we cannot determine what the correct
thing to do is, and hence the callers of list_lru_add() must ensure
that the item is not on a private list before trying to add it to
the LRU.
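
As a rough illustration of that rule, here is a minimal userspace
sketch. It is not the kernel code: the lru_sketch type and the
*_sketch helpers are hypothetical stand-ins for struct list_lru and
the list.h primitives, and locking is left out. An item still linked
on a private list is refused, so the count stays right; once it has
been unlinked and reinitialised, the add succeeds.

```c
#include <assert.h>
#include <stddef.h>

/* Userspace stand-in for the kernel's struct list_head (assumption). */
struct list_head { struct list_head *next, *prev; };

void init_list_head(struct list_head *h) { h->next = h; h->prev = h; }

int list_is_empty(const struct list_head *h) { return h->next == h; }

void list_add_tail_sketch(struct list_head *item, struct list_head *head)
{
	item->prev = head->prev;
	item->next = head;
	head->prev->next = item;
	head->prev = item;
}

/* Unlink and reinitialise, so the item reads as "on no list". */
void list_del_init_sketch(struct list_head *item)
{
	item->prev->next = item->next;
	item->next->prev = item->prev;
	init_list_head(item);
}

struct lru_sketch {
	struct list_head list;
	long nr_items;
};

/* Returns 1 if added, 0 if the item already sits on some list. */
int lru_sketch_add(struct lru_sketch *lru, struct list_head *item)
{
	assert(lru != NULL && item != NULL);
	if (!list_is_empty(item))
		return 0;
	list_add_tail_sketch(item, &lru->list);
	lru->nr_items++;
	return 1;
}

/* Returns 1 if removed, 0 if the item was not on any list. */
int lru_sketch_del(struct lru_sketch *lru, struct list_head *item)
{
	assert(lru != NULL && item != NULL);
	if (list_is_empty(item))
		return 0;
	list_del_init_sketch(item);
	lru->nr_items--;
	return 1;
}

/* Walks through the cases; returns the final count, or -1 on surprise. */
long lru_sketch_demo(void)
{
	struct lru_sketch lru;
	struct list_head private_list, item;

	init_list_head(&lru.list);
	lru.nr_items = 0;
	init_list_head(&private_list);
	init_list_head(&item);

	/* Item is on a private list: the add is refused, count unchanged. */
	list_add_tail_sketch(&item, &private_list);
	if (lru_sketch_add(&lru, &item) != 0 || lru.nr_items != 0)
		return -1;

	/* Caller unlinks it first, then the add succeeds. */
	list_del_init_sketch(&item);
	if (lru_sketch_add(&lru, &item) != 1)
		return -1;

	/* Lazy re-add of an item already on the LRU is a no-op. */
	if (lru_sketch_add(&lru, &item) != 0)
		return -1;

	/* Delete, double delete, and add again. */
	if (lru_sketch_del(&lru, &item) != 1 || lru_sketch_del(&lru, &item) != 0)
		return -1;
	if (lru_sketch_add(&lru, &item) != 1)
		return -1;

	return lru.nr_items;
}
```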

> Maybe this is good, maybe it is bad.  It depends on what the author(s)
> were thinking at the time ;)
> 
> 
> Either way, returning 1 on success and 0 on failure is surprising.  0
> means success, please.  Alternatively I guess one could make it return
> bool and document the dang thing, hence retaining the current 0/1 concept.

Sure, that can be fixed. Documentation is lacking at this point.

> > +restart:
> > +		ret = isolate(item, &lru->lock, cb_arg);
> > +		switch (ret) {
> > +		case LRU_REMOVED:
> > +			lru->nr_items--;
> > +			removed++;
> > +			break;
> > +		case LRU_ROTATE:
> > +			list_move_tail(item, &lru->list);
> > +			break;
> > +		case LRU_SKIP:
> > +			break;
> > +		case LRU_RETRY:
> 
> With no documentation in the code or the changelog, I haven't a clue why
> these four possibilities exist :(

Documentation would explain that:

> Passing the address of the spinlock to the list_lru_walk_cb handler is
> rather gross.
> 
> And afaict it is unresolvably buggy - if the handler dropped that lock,
> list_lru_walk() is now left holding a list_head at *item which could
> have been altered or even freed.
>
> How [patch 09/35]'s inode_lru_isolate() avoids this bug I don't know. 
> Perhaps it doesn't.

The LRU_RETRY case is supposed to handle this. However, the LRU_RETRY
return code is now buggy and you've caught that. It'll need fixing.
My original code only had inode_lru_isolate() drop the lru lock, and
it would return LRU_RETRY which would restart the scan of the list
from the start, thereby avoiding those problems.
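
A rough userspace sketch of that restart-from-the-head behaviour, for
illustration only. It is neither the original code nor the code in
this patch: walk_lru, walk_node, the WALK_* values and isolate_sketch()
are hypothetical, and no real lock is taken. The point is that after a
retry status the walker trusts neither the current item nor the saved
next pointer, and starts again from the list head. The walk budget
bounds the total number of callback invocations.

```c
#include <assert.h>
#include <stddef.h>

enum walk_status { WALK_REMOVED, WALK_ROTATE, WALK_SKIP, WALK_RETRY };

struct walk_node {
	struct walk_node *next, *prev;
	int busy;	/* pretend the callback must drop the lock for this */
};

struct walk_lru {
	struct walk_node head;	/* sentinel node */
	unsigned long nr_items;
};

typedef enum walk_status (*walk_cb)(struct walk_node *item, void *arg);

void walk_unlink(struct walk_node *n)
{
	n->prev->next = n->next;
	n->next->prev = n->prev;
	n->next = n->prev = n;
}

void walk_add_tail(struct walk_lru *lru, struct walk_node *n)
{
	n->prev = lru->head.prev;
	n->next = &lru->head;
	lru->head.prev->next = n;
	lru->head.prev = n;
}

unsigned long walk_restart_on_retry(struct walk_lru *lru, walk_cb isolate,
				    void *arg, unsigned long nr_to_walk)
{
	struct walk_node *item, *next = NULL;
	unsigned long removed = 0;

restart:
	for (item = lru->head.next; item != &lru->head; item = next) {
		if (nr_to_walk == 0)
			break;
		nr_to_walk--;
		next = item->next;	/* only valid while the list is stable */

		switch (isolate(item, arg)) {
		case WALK_REMOVED:	/* callback unlinked the item */
			lru->nr_items--;
			removed++;
			break;
		case WALK_ROTATE:
			walk_unlink(item);
			walk_add_tail(lru, item);
			break;
		case WALK_SKIP:
			break;
		case WALK_RETRY:
			/* List may have changed: forget item and next. */
			goto restart;
		}
	}
	return removed;
}

/* Hypothetical callback: busy items ask for a retry once, others go. */
enum walk_status isolate_sketch(struct walk_node *item, void *arg)
{
	(void)arg;
	if (item->busy) {
		item->busy = 0;	/* stands in for "dropped lock, did work" */
		return WALK_RETRY;
	}
	walk_unlink(item);
	return WALK_REMOVED;
}

/* Three items, the middle one busy; returns how many were removed. */
unsigned long walk_demo(void)
{
	struct walk_lru lru;
	struct walk_node a = { NULL, NULL, 0 };
	struct walk_node b = { NULL, NULL, 1 };
	struct walk_node c = { NULL, NULL, 0 };
	unsigned long removed;

	lru.head.next = lru.head.prev = &lru.head;
	lru.head.busy = 0;
	walk_add_tail(&lru, &a);
	walk_add_tail(&lru, &b);
	walk_add_tail(&lru, &c);
	lru.nr_items = 3;

	removed = walk_restart_on_retry(&lru, isolate_sketch, NULL, 10);
	assert(lru.nr_items == 0);
	return removed;
}
```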

> Addendum: having now read through the evolution of lib/list_lru.c, it's
> pretty apparent that this code is highly specific to the inode and
> dcache shrinkers and is unlikely to see applications elsewhere.  So
> hrm, perhaps we're kinda kidding ourselves by putting it in lib/ at
> all.

In this patch set, it replaces the LRU in the XFS buffer cache, the
LRU in the XFS dquot cache, and I've got patches that use it in the
XFS inode cache as well. And they were all drop-in replacements,
just like for the inode and dentry caches. It's hard to claim that
it's so specific to the inode/dentry caches when there are at least
3 other LRUs that were pretty trivial to convert for use...

The whole point of the patchset is to introduce infrastructure that
is generically useful. Sure, it might start out looking like the
thing that it was derived from, but we've got to start somewhere.
Given that there are 5 different users already, it's obviously
already more than just usable for the inode and dentry caches.

The only reason that there haven't been more subsystems converted is
that we are concentrating on getting what we already have merged
first....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH v10 08/35] list: add a new LRU list type
  2013-06-06  2:49         ` Dave Chinner
@ 2013-06-06  3:05           ` Andrew Morton
       [not found]             ` <20130605200554.d4dae16f.akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org>
  2013-06-06 14:28           ` Glauber Costa
  1 sibling, 1 reply; 103+ messages in thread
From: Andrew Morton @ 2013-06-06  3:05 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Glauber Costa, linux-fsdevel, Mel Gorman, linux-mm, cgroups,
	kamezawa.hiroyu, Michal Hocko, Johannes Weiner, hughd,
	Greg Thelen, Dave Chinner

On Thu, 6 Jun 2013 12:49:09 +1000 Dave Chinner <david@fromorbit.com> wrote:

> > > +{
> > > +	spin_lock(&lru->lock);
> > 
> > OK, problems.  Long experience has shown us that in-kernel container
> > library code like this should not perform its own locking.  Things like:
> > 
> > - I want to call it from interrupts!
> > - I want to use a mutex!
> > - I want to use RCU!
> 
> Wrap them around the outside of all your LRU operations, then.
> 
> > - I already hold a lock and don't need this code to take another one!
> 
> The internal lru lock is for simplicity of implementation.
> 
> > - I need to sleep in my isolate callback, but the library code is
> >   holding a spinlock!
> 
> The isolate callback gets passed the spinlock that it is holding
> precisely so the callback can drop it and do sleeping operations.

As I said, "Long experience has shown".  These restrictions reduce the
usefulness of this code.

> > - I want to test lru.nr_items in a non-racy fashion, but to do that I
> >   have to take a lib/-private spinlock!
> 
> Nobody should be peeking at the internals of the list structures.
> That's just completely broken. Use the APIs that are provided

Those APIs don't work.  It isn't possible for callers to get an exact
count, unless they provide redundant external locking.  This problem is
a consequence of the decision to perform lib-internal locking.

> The current implementation is designed to be basic and obviously
> correct, not some wacky, amazingly optimised code that nobody but
> the original author can understand.

Implementations which expect caller-provided locking are simpler.
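
To make the layering concrete, here is a minimal userspace sketch of an
unlocked core with an optional spinlocked wrapper on top. It assumes a
stand-in list node and a C11 atomic_flag in place of the kernel spinlock;
lru_core, lru_locked and their helpers are hypothetical names, not anything
in the patch.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Minimal userspace stand-in for the kernel's struct list_head. */
struct list_node {
	struct list_node *next, *prev;
};

static void node_init(struct list_node *n) { n->next = n; n->prev = n; }
static bool node_unlinked(const struct list_node *n) { return n->next == n; }

/* Hypothetical unlocked core: the caller serialises all access. */
struct lru_core {
	struct list_node head;
	long nr_items;
};

static void lru_core_init(struct lru_core *lru)
{
	node_init(&lru->head);
	lru->nr_items = 0;
}

/* Returns true if the item was added. No locking is done here. */
static bool lru_core_add(struct lru_core *lru, struct list_node *item)
{
	if (!node_unlinked(item))
		return false;
	item->prev = lru->head.prev;
	item->next = &lru->head;
	lru->head.prev->next = item;
	lru->head.prev = item;
	lru->nr_items++;
	return true;
}

/* Optional spinlocked layer for callers that want built-in locking. */
struct lru_locked {
	atomic_flag lock;
	struct lru_core core;
};

static void lru_locked_init(struct lru_locked *lru)
{
	atomic_flag_clear(&lru->lock);
	lru_core_init(&lru->core);
}

static void lru_lock(struct lru_locked *lru)
{
	while (atomic_flag_test_and_set_explicit(&lru->lock,
						 memory_order_acquire))
		;	/* spin until the flag was previously clear */
}

static void lru_unlock(struct lru_locked *lru)
{
	atomic_flag_clear_explicit(&lru->lock, memory_order_release);
}

static bool lru_locked_add(struct lru_locked *lru, struct list_node *item)
{
	bool added;

	lru_lock(lru);
	added = lru_core_add(&lru->core, item);
	lru_unlock(lru);
	return added;
}
```

A caller that already holds its own lock would use lru_core_add() directly
and could read nr_items exactly under that same lock.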

> > This is a little bit strange, because one could legitimately do
> > 
> > 	list_del(item);		/* from my private list */
> > 	list_lru_add(lru, item);
> > 
> > but this interface forced me to do a needless lru_del_init().
> 
> How do you know what list the item is on in list_lru_add()? We have
> to know to get the accounting right. i.e. if it is already on the
> LRU and we remove it and then re-add it, the number of items on the
> list doesn't change. But if it's on some private list, then we have
> to increment the number of items on the LRU list.
> 
> So, if it's already on a list, we cannot determine what the correct
> thing to do is, and hence the callers of list_lru_add() must ensure
> that the item is not on a private list before trying to add it to
> the LRU.

It isn't "already on a list" - the caller just removed it!

It's suboptimal, but I'm not saying this decision was wrong.  However
explanation and documentation is needed to demonstrate that it was
correct.

> > Addendum: having now read through the evolution of lib/list_lru.c, it's
> > pretty apparent that this code is highly specific to the inode and
> > dcache shrinkers and is unlikely to see applications elsewhere.  So
> > hrm, perhaps we're kinda kidding ourselves by putting it in lib/ at
> > all.
> 
> In this patch set, it replaces the LRU in the xfs buffer cache, the
> LRU in the XFS dquot cache, and I've got patches that use it in the
> XFS inode cache as well. And they were all drop-in replacements,
> just like for the inode and dentry caches. It's hard to claim that
> it's so specific to the inode/dentry caches when there are at least
> 3 other LRUs that were pretty trivial to convert for use...
> 
> The whole point of the patchset is to introduce infrastructure that
> is generically useful. Sure, it might start out looking like the
> thing that it was derived from, but we've got to start somewhere.
> Given that there are 5 different users already, it's obviously
> already more than just usable for the inode and dentry caches.
> 
> The only reason that there haven't been more subsystems converted is
> that we are concentrating on getting what we already have merged
> first....

I'm not objecting to the code per-se - I'm sure it's appropriate to the
current callsites.  But these restrictions do reduce its overall
applicability.  And I do agree that it's not worth generalizing it
because of what-if scenarios.

Why was it called "lru", btw?  iirc it's actually a "stack" (or
"queue"?) and any lru functionality is actually implemented externally.
There is no "list_lru_touch()".


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 103+ messages in thread

[parent not found: <20130605200554.d4dae16f.akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org>]

* Re: [PATCH v10 08/35] list: add a new LRU list type
       [not found]             ` <20130605200554.d4dae16f.akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org>
@ 2013-06-06  4:44               ` Dave Chinner
  2013-06-06  7:04                 ` Andrew Morton
  0 siblings, 1 reply; 103+ messages in thread
From: Dave Chinner @ 2013-06-06  4:44 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Glauber Costa, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Mel Gorman,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, cgroups-u79uwXL29TY76Z2rM5mHXA,
	kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A, Michal Hocko,
	Johannes Weiner, hughd-hpIqsD4AKlfQT0dZR+AlfA, Greg Thelen,
	Dave Chinner

On Wed, Jun 05, 2013 at 08:05:54PM -0700, Andrew Morton wrote:
> On Thu, 6 Jun 2013 12:49:09 +1000 Dave Chinner <david-FqsqvQoI3Ljby3iVrkZq2A@public.gmane.org> wrote:
> 
> > > > +{
> > > > +	spin_lock(&lru->lock);
> > > 
> > > OK, problems.  Long experience has shown us that in-kernel container
> > > library code like this should not perform its own locking.  Things like:
> > > 
> > > - I want to call it from interrupts!
> > > - I want to use a mutex!
> > > - I want to use RCU!
> > 
> > Wrap them around the outside of all your LRU operations, then.
> > 
> > > - I already hold a lock and don't need this code to take another one!
> > 
> > The internal lru lock is for simplicity of implementation.
> > 
> > > - I need to sleep in my isolate callback, but the library code is
> > >   holding a spinlock!
> > 
> > The isolate callback gets passed the spinlock that it is holding
> > precisely so the callback can drop it and do sleeping operations.
> 
> As I said, "Long experience has shown".  These restrictions reduce the
> usefulness of this code.

Only if you want generic, "use for absolutely anything"
functionality.

This code isn't "use for absolutely anything" infrastructure - it's
implementing a specific design pattern that is repeated over and
over again in the kernel in a generic, abstracted manner. It solves
one problem, not an abstract class of problems. The fact is that
this one problem is solved in 15 different ways, s

> > > - I want to test lru.nr_items in a non-racy fashion, but to do that I
> > >   have to take a lib/-private spinlock!
> > 
> > Nobody should be peeking at the internals of the list structures.
> > That's just completely broken. Use the APIs that are provided
> 
> Those APIs don't work.  It isn't possible for callers to get an exact
> count, unless they provide redundant external locking.  This problem is
> a consequence of the decision to perform lib-internal locking.

There hasn't been a requirement for an exact count. There never has
been. The shrinkers certainly don't need one, and I can't think of
any reason why you'd need an exact count...

> > The current implementation is designed to be basic and obviously
> > correct, not some wacky, amazingly optimised code that nobody but
> > the original author can understand.
> 
> Implementations which expect caller-provided locking are simpler.

In some situations, yes.

> > > This is a little bit strange, because one could legitimately do
> > > 
> > > 	list_del(item);		/* from my private list */
> > > 	list_lru_add(lru, item);
> > > 
> > > but this interface forced me to do a needless lru_del_init().
> > 
> > How do you know what list the item is on in list_lru_add()? We have
> > to know to get the accounting right. i.e. if it is already on the
> > LRU and we remove it and then re-add it, the number of items on the
> > list doesn't change. But if it's on some private list, then we have
> > to increment the number of items on the LRU list.
> > 
> > So, if it's already on a list, we cannot determine what the correct
> > thing to do is, and hence the callers of list_lru_add() must ensure
> > that the item is not on a private list before trying to add it to
> > the LRU.
> 
> It isn't "already on a list" - the caller just removed it!

Sorry, then I didn't understand what your question is. Why would you
need to call lru_del_init() for an object on a private list?

Oh, you meant the list_del_init()? In which case, your item won't
get added to the LRU. Too bad, so sad. Needs documentation.
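
For the record, a minimal userspace sketch of why that happens. The list
helpers below are simplified stand-ins for the kernel ones and the poison
values are made up; lru_add() repeats the list_empty() check from the
patch without the locking.

```c
#include <stdbool.h>

struct list_head {
	struct list_head *next, *prev;
};

/* Stand-in poison values; the kernel's LIST_POISON1/2 are different. */
#define POISON1 ((struct list_head *)0x100)
#define POISON2 ((struct list_head *)0x200)

static void INIT_LIST_HEAD(struct list_head *h) { h->next = h; h->prev = h; }
static bool list_empty(const struct list_head *h) { return h->next == h; }

static void list_add_tail(struct list_head *item, struct list_head *head)
{
	item->prev = head->prev;
	item->next = head;
	head->prev->next = item;
	head->prev = item;
}

static void list_unlink(struct list_head *e)
{
	e->prev->next = e->next;
	e->next->prev = e->prev;
}

/* list_del() leaves the entry poisoned, so it never looks "empty". */
static void list_del(struct list_head *e)
{
	list_unlink(e);
	e->next = POISON1;
	e->prev = POISON2;
}

/* list_del_init() leaves the entry pointing at itself. */
static void list_del_init(struct list_head *e)
{
	list_unlink(e);
	INIT_LIST_HEAD(e);
}

struct list_lru {
	struct list_head list;
	long nr_items;
};

/* Same check as the patch's list_lru_add(), minus the spinlock. */
static int lru_add(struct list_lru *lru, struct list_head *item)
{
	if (list_empty(item)) {
		list_add_tail(item, &lru->list);
		lru->nr_items++;
		return 1;
	}
	return 0;
}
```

An item taken off a private list with list_del() is refused by lru_add(),
while one taken off with list_del_init() is accepted.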

> > > Addendum: having now read through the evolution of lib/list_lru.c, it's
> > > pretty apparent that this code is highly specific to the inode and
> > > dcache shrinkers and is unlikely to see applications elsewhere.  So
> > > hrm, perhaps we're kinda kidding ourselves by putting it in lib/ at
> > > all.
> > 
> > In this patch set, it replaces the LRU in the xfs buffer cache, the
> > LRU in the XFS dquot cache, and I've got patches that use it in the
> > XFS inode cache as well. And they were all drop-in replacements,
> > just like for the inode and dentry caches. It's hard to claim that
> > it's so specific to the inode/dentry caches when there are at least
> > 3 other LRUs that were pretty trivial to convert for use...
> > 
> > The whole point of the patchset is to introduce infrastructure that
> > is generically useful. Sure, it might start out looking like the
> > thing that it was derived from, but we've got to start somewhere.
> > Given that there are 5 different users already, it's obviously
> > already more than just usable for the inode and dentry caches.
> > 
> > The only reason that there haven't been more subsystems converted is
> > that we are concentrating on getting what we already have merged
> > first....
> 
> I'm not objecting to the code per-se - I'm sure it's appropriate to the
> current callsites.  But these restrictions do reduce its overall
> applicability.  And I do agree that it's not worth generalizing it
> because of what-if scenarios.

I'm not disagreeing with you about the restrictions and how they
limit what it can be used for. But as I explained above, there is a
specific design pattern/use case for these lists - that of an
independent list based LRU that tightly integrates with the shrinker
infrastructure.

> Why was it called "lru", btw?  iirc it's actually a "stack" (or
> "queue"?) and any lru functionality is actually implemented externally.

Because it's a bunch of infrastructure and helper functions that
callers use to implement a list based LRU that tightly integrates
with the shrinker infrastructure.  ;)

I'm open to a better name - something just as short and concise
would be nice ;)

> There is no "list_lru_touch()".

Different LRU implementations have different methods of marking
objects referenced and reclaiming them, and so it is kept external.
e.g. inodes/dentries use a single flag within the object. The XFS
buffer cache uses an LRU reference count to do hierarchical
referencing of objects, and so that isn't implemented within the
list infrastructure itself. All the infrastructure provides is the
lists themselves and methods to add, remove and scan the lists; anything
specific to an object on the list needs to be managed externally.
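
As a rough illustration of the single-flag case, here is a userspace
sketch in which the caller owns the "referenced" state and the list code
only acts on the status the callback returns. struct obj, obj_isolate()
and the simplified lru_walk() (no locking, no LRU_RETRY) are hypothetical
stand-ins, not the patch's code.

```c
#include <stdbool.h>
#include <stddef.h>

struct list_head {
	struct list_head *next, *prev;
};

enum lru_status { LRU_REMOVED, LRU_ROTATE, LRU_SKIP };

struct obj {
	struct list_head lru;	/* first member, so a cast recovers the obj */
	bool referenced;	/* caller-owned "recently used" flag */
};

static void list_init(struct list_head *h) { h->next = h; h->prev = h; }

static void list_unlink(struct list_head *e)
{
	e->prev->next = e->next;
	e->next->prev = e->prev;
	list_init(e);
}

static void list_add_tail(struct list_head *item, struct list_head *head)
{
	item->prev = head->prev;
	item->next = head;
	head->prev->next = item;
	head->prev = item;
}

/* Caller-side policy: referenced objects get one more pass. */
static enum lru_status obj_isolate(struct list_head *item, void *arg)
{
	struct obj *o = (struct obj *)item;

	(void)arg;
	if (o->referenced) {
		o->referenced = false;
		return LRU_ROTATE;
	}
	list_unlink(item);
	return LRU_REMOVED;
}

/* Simplified walker: it knows nothing about how objects are referenced. */
static unsigned long lru_walk(struct list_head *head,
			      enum lru_status (*isolate)(struct list_head *,
							 void *),
			      void *arg, unsigned long nr_to_walk)
{
	struct list_head *item = head->next, *n;
	unsigned long removed = 0;

	while (item != head && nr_to_walk > 0) {
		nr_to_walk--;
		n = item->next;	/* saved first: the callback may unlink item */
		switch (isolate(item, arg)) {
		case LRU_REMOVED:
			removed++;
			break;
		case LRU_ROTATE:
			list_unlink(item);
			list_add_tail(item, head);
			break;
		case LRU_SKIP:
			break;
		}
		item = n;
	}
	return removed;
}
```

A cache with a different referencing scheme, such as a per-object LRU
reference count, would only need a different callback.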

Cheers,

Dave.
-- 
Dave Chinner
david-FqsqvQoI3Ljby3iVrkZq2A@public.gmane.org

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH v10 08/35] list: add a new LRU list type
  2013-06-06  4:44               ` Dave Chinner
@ 2013-06-06  7:04                 ` Andrew Morton
  2013-06-06  9:03                   ` Glauber Costa
  0 siblings, 1 reply; 103+ messages in thread
From: Andrew Morton @ 2013-06-06  7:04 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Glauber Costa, linux-fsdevel, Mel Gorman, linux-mm, cgroups,
	kamezawa.hiroyu, Michal Hocko, Johannes Weiner, hughd,
	Greg Thelen, Dave Chinner

On Thu, 6 Jun 2013 14:44:26 +1000 Dave Chinner <david@fromorbit.com> wrote:

> > Why was it called "lru", btw?  iirc it's actually a "stack" (or
> > "queue"?) and any lru functionality is actually implemented externally.
> 
> Because it's a bunch of infrastructure and helper functions that
> callers use to implement a list based LRU that tightly integrates
> with the shrinker infrastructure.  ;)
> 
> I'm open to a better name - something just as short and concise
> would be nice ;)

Not a biggie, but it's nice to get these things exact on day one.

"queue"?  Because someone who wants a queue is likely to look at
list_lru.c and think "hm, that's no good".  Whereas if it's queue.c
then they're more likely to use it.  Then start cursing at its
internal spin_lock() :)

But anyone who just wants a queue doesn't want their queue_lru_del()
calling into memcg code(!).  I do think it would be more appropriate to
discard the lib/ idea and move it all into fs/ or mm/.


^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH v10 08/35] list: add a new LRU list type
  2013-06-06  7:04                 ` Andrew Morton
@ 2013-06-06  9:03                   ` Glauber Costa
  2013-06-06  9:55                     ` Andrew Morton
  0 siblings, 1 reply; 103+ messages in thread
From: Glauber Costa @ 2013-06-06  9:03 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Dave Chinner, Glauber Costa, linux-fsdevel, Mel Gorman, linux-mm,
	cgroups, kamezawa.hiroyu, Michal Hocko, Johannes Weiner, hughd,
	Greg Thelen, Dave Chinner

On 06/06/2013 11:04 AM, Andrew Morton wrote:
> But anyone who just wants a queue doesn't want their queue_lru_del()
> calling into memcg code(!).

It won't call any relevant memcg code unless the list_lru (or queue, or
whatever) is explicitly marked as memcg-aware.


> I do think it would be more appropriate to
> discard the lib/ idea and move it all into fs/ or mm/.

I have no particular love for this in lib/.

Most of the users are in fs/, so I see no point in mm/.
So for me, if you are really not happy about lib/, I would suggest moving
this to fs/.



^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH v10 08/35] list: add a new LRU list type
  2013-06-06  9:03                   ` Glauber Costa
@ 2013-06-06  9:55                     ` Andrew Morton
       [not found]                       ` <20130606025517.8400c279.akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org>
  0 siblings, 1 reply; 103+ messages in thread
From: Andrew Morton @ 2013-06-06  9:55 UTC (permalink / raw)
  To: Glauber Costa
  Cc: Dave Chinner, Glauber Costa, linux-fsdevel, Mel Gorman, linux-mm,
	cgroups, kamezawa.hiroyu, Michal Hocko, Johannes Weiner, hughd,
	Greg Thelen, Dave Chinner

On Thu, 6 Jun 2013 13:03:37 +0400 Glauber Costa <glommer@parallels.com> wrote:

>  I do think it would be more appropriate to
> > discard the lib/ idea and move it all into fs/ or mm/.
> I have no particular love for this in lib/
> 
> Most of the users are in fs/, so I see no point in mm/
> So for me, if you are really not happy about lib, I would suggest moving
> this to fs/

Always feel free to differ but yes, fs/ seems better to me.

I suggested mm/ also because that's where the shrinker core resides.


^ permalink raw reply	[flat|nested] 103+ messages in thread

[parent not found: <20130606025517.8400c279.akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org>]

* Re: [PATCH v10 08/35] list: add a new LRU list type
       [not found]                       ` <20130606025517.8400c279.akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org>
@ 2013-06-06 11:47                         ` Glauber Costa
  0 siblings, 0 replies; 103+ messages in thread
From: Glauber Costa @ 2013-06-06 11:47 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Dave Chinner, Glauber Costa, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	Mel Gorman, linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
	cgroups-u79uwXL29TY76Z2rM5mHXA,
	kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A, Michal Hocko,
	Johannes Weiner, hughd-hpIqsD4AKlfQT0dZR+AlfA, Greg Thelen,
	Dave Chinner

On 06/06/2013 01:55 PM, Andrew Morton wrote:
> On Thu, 6 Jun 2013 13:03:37 +0400 Glauber Costa <glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org> wrote:
> 
>>  I do think it would be more appropriate to
>>> discard the lib/ idea and move it all into fs/ or mm/.
>> I have no particular love for this in lib/
>>
>> Most of the users are in fs/, so I see no point in mm/
>> So for me, if you are really not happy about lib, I would suggest moving
>> this to fs/
> 
> Always feel free to differ but yes, fs/ seems better to me.
> 
> I suggested mm/ also because that's where the shrinker core resides.
> 
As I said, unless Dave has a strong point against it, I don't really
care if it lives in lib/ or not. It is infrastructure, but not
necessarily lib-like infrastructure.

Now, I have been thinking about this during the last hour, and although
all users are in fs/, putting it into mm/ would give us another
advantage: it has already been noted that we would like to have, if
possible, stronger ties between shrinkers, caches and the underlying
lists. We also use a bunch of mm/ infrastructure.

This is always something we can change if it really hurts, but right now
I am 51 % mm/ 49 % fs/

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH v10 08/35] list: add a new LRU list type
  2013-06-06  2:49         ` Dave Chinner
  2013-06-06  3:05           ` Andrew Morton
@ 2013-06-06 14:28           ` Glauber Costa
  1 sibling, 0 replies; 103+ messages in thread
From: Glauber Costa @ 2013-06-06 14:28 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Andrew Morton, Glauber Costa,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Mel Gorman,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, cgroups-u79uwXL29TY76Z2rM5mHXA,
	kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A, Michal Hocko,
	Johannes Weiner, hughd-hpIqsD4AKlfQT0dZR+AlfA, Greg Thelen,
	Dave Chinner

On 06/06/2013 06:49 AM, Dave Chinner wrote:
>> > How [patch 09/35]'s inode_lru_isolate() avoids this bug I don't know. 
>> > Perhaps it doesn't.
> The LRU_RETRY case is supposed to handle this. However, the LRU_RETRY
> return code is now buggy and you've caught that. It'll need fixing.
> My original code only had inode_lru_isolate() drop the lru lock, and
> it would return LRU_RETRY which would restart the scan of the list
> from the start, thereby avoiding those problems.
> 
Yes, I have changed that, but I wasn't aware that your original
intention for restarting from the beginning was to avoid such problems.
And having only half the brain Andrew has, I didn't notice it myself.

I will fix this somehow while trying to keep the behavior Mel insisted
on, i.e. not retrying forever.
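
One possible shape for that, sketched in userspace with stand-in types:
restart the scan from the list head whenever the callback reports
LRU_RETRY (the saved cursor may be stale once the lock was dropped), and
charge every callback invocation against nr_to_walk so the walk still
terminates. walk_restarting(), demo_item and demo_isolate() are
hypothetical names, and locking is left out entirely.

```c
#include <stddef.h>

struct list_head {
	struct list_head *next, *prev;
};

enum lru_status { LRU_REMOVED, LRU_SKIP, LRU_RETRY };

typedef enum lru_status (*walk_cb)(struct list_head *item, void *arg);

static void list_init(struct list_head *h) { h->next = h; h->prev = h; }

static void list_add_tail(struct list_head *item, struct list_head *head)
{
	item->prev = head->prev;
	item->next = head;
	head->prev->next = item;
	head->prev = item;
}

static unsigned long walk_restarting(struct list_head *head, walk_cb isolate,
				     void *arg, unsigned long nr_to_walk)
{
	struct list_head *item, *n;
	unsigned long removed = 0;

restart:
	for (item = head->next; item != head; item = n) {
		n = item->next;
		if (nr_to_walk == 0)
			break;
		nr_to_walk--;	/* retries are charged too: bounded work */

		switch (isolate(item, arg)) {
		case LRU_REMOVED:
			removed++;
			break;
		case LRU_SKIP:
			break;
		case LRU_RETRY:
			/* item and n may be stale: rescan from the head */
			goto restart;
		}
	}
	return removed;
}

/* Demo object: "busy" counts how many times isolation must be retried. */
struct demo_item {
	struct list_head lru;	/* first member, so a cast recovers the item */
	int busy;
};

static enum lru_status demo_isolate(struct list_head *item, void *arg)
{
	struct demo_item *d = (struct demo_item *)item;

	(void)arg;
	if (d->busy > 0) {
		d->busy--;
		return LRU_RETRY;
	}
	item->prev->next = item->next;
	item->next->prev = item->prev;
	list_init(item);
	return LRU_REMOVED;
}
```

An item that keeps asking for a retry simply uses up the nr_to_walk budget
instead of looping forever.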

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH v10 08/35] list: add a new LRU list type
  2013-06-05 23:07       ` Andrew Morton
  2013-06-06  2:49         ` Dave Chinner
@ 2013-06-06  8:10         ` Glauber Costa
  1 sibling, 0 replies; 103+ messages in thread
From: Glauber Costa @ 2013-06-06  8:10 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Glauber Costa, linux-fsdevel, Mel Gorman, Dave Chinner, linux-mm,
	cgroups, kamezawa.hiroyu, Michal Hocko, Johannes Weiner, hughd,
	Greg Thelen, Dave Chinner

On 06/06/2013 03:07 AM, Andrew Morton wrote:
> On Mon,  3 Jun 2013 23:29:37 +0400 Glauber Costa <glommer@openvz.org> wrote:
> 
>> From: Dave Chinner <dchinner@redhat.com>
>>
>> Several subsystems use the same construct for LRU lists - a list
>> head, a spin lock and an item count. They also use exactly the same
>> code for adding and removing items from the LRU. Create a generic
>> type for these LRU lists.
>>
>> This is the beginning of generic, node aware LRUs for shrinkers to
>> work with.
>>
>> ...
>>
>> --- /dev/null
>> +++ b/include/linux/list_lru.h
>> @@ -0,0 +1,46 @@
>> +/*
>> + * Copyright (c) 2010-2012 Red Hat, Inc. All rights reserved.
>> + * Author: David Chinner
>> + *
>> + * Generic LRU infrastructure
>> + */
>> +#ifndef _LRU_LIST_H
>> +#define _LRU_LIST_H
>> +
>> +#include <linux/list.h>
>> +
>> +enum lru_status {
>> +	LRU_REMOVED,		/* item removed from list */
>> +	LRU_ROTATE,		/* item referenced, give another pass */
>> +	LRU_SKIP,		/* item cannot be locked, skip */
>> +	LRU_RETRY,		/* item not freeable. May drop the lock
>> +				   internally, but has to return locked. */
>> +};
> 
> What's this?
> 
> Seems to be the return code from the undocumented list_lru_walk_cb?
> 
Yes, it is.

>> +struct list_lru {
>> +	spinlock_t		lock;
>> +	struct list_head	list;
>> +	long			nr_items;
> 
> Should be an unsigned type.
> 

I can change if you *really* insist, but this one in particular will
increase with list_lru_add, but can decrease in two places: with an
explicit list_lru_del, and also later when the element is finally purged
through the walker.

Although it seems to be quite stable now, it is quite easy for an
imbalance to appear tomorrow, and having a signed type helps us find it
very easily (we also have a WARN_ON for this).
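
A tiny userspace sketch of the idea, with hypothetical names: the counter
is kept signed so that one deletion too many shows up as a negative value
rather than a huge positive one. Clamping in the read helper is an
assumption made for this sketch; the patch's list_lru_count() returns the
field directly.

```c
#include <stdbool.h>

struct lru_count {
	long nr_items;	/* signed on purpose */
};

/*
 * Returns false if the decrement exposed an imbalance; this is the
 * point where the kernel code would WARN_ON.
 */
static bool lru_count_dec(struct lru_count *c)
{
	c->nr_items--;
	return c->nr_items >= 0;
}

/* Exported view: never report a negative count to callers. */
static unsigned long lru_count_read(const struct lru_count *c)
{
	return c->nr_items < 0 ? 0UL : (unsigned long)c->nr_items;
}
```

With an unsigned field the same bug would wrap around to a very large
count and be much harder to tell apart from a legitimately long list.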

>> +};
>> +
>> +int list_lru_init(struct list_lru *lru);
>> +int list_lru_add(struct list_lru *lru, struct list_head *item);
>> +int list_lru_del(struct list_lru *lru, struct list_head *item);
>> +
>> +static inline unsigned long list_lru_count(struct list_lru *lru)
>> +{
>> +	return lru->nr_items;
>> +}
> 
> It got changed to unsigned here!
> 

Yes, because this is the interface that is exported.
The internal interface is kept as a long to make sure that we're not
having imbalances. We WARN at every deletion.

>> +typedef enum lru_status
>> +(*list_lru_walk_cb)(struct list_head *item, spinlock_t *lock, void *cb_arg);
>> +
>> +typedef void (*list_lru_dispose_cb)(struct list_head *dispose_list);
>> +
>> +unsigned long list_lru_walk(struct list_lru *lru, list_lru_walk_cb isolate,
>> +		   void *cb_arg, unsigned long nr_to_walk);
>> +
>> +unsigned long
>> +list_lru_dispose_all(struct list_lru *lru, list_lru_dispose_cb dispose);
>> +
>> +#endif /* _LRU_LIST_H */
>> diff --git a/lib/Makefile b/lib/Makefile
>> index af911db..d610fda 100644
>> --- a/lib/Makefile
>> +++ b/lib/Makefile
>> @@ -13,7 +13,7 @@ lib-y := ctype.o string.o vsprintf.o cmdline.o \
>>  	 sha1.o md5.o irq_regs.o reciprocal_div.o argv_split.o \
>>  	 proportions.o flex_proportions.o prio_heap.o ratelimit.o show_mem.o \
>>  	 is_single_threaded.o plist.o decompress.o kobject_uevent.o \
>> -	 earlycpio.o percpu-refcount.o
>> +	 earlycpio.o percpu-refcount.o list_lru.o
>>  
>>  obj-$(CONFIG_ARCH_HAS_DEBUG_STRICT_USER_COPY_CHECKS) += usercopy.o
>>  lib-$(CONFIG_MMU) += ioremap.o
>> diff --git a/lib/list_lru.c b/lib/list_lru.c
>> new file mode 100644
>> index 0000000..3127edd
>> --- /dev/null
>> +++ b/lib/list_lru.c
>> @@ -0,0 +1,122 @@
>> +/*
>> + * Copyright (c) 2010-2012 Red Hat, Inc. All rights reserved.
>> + * Author: David Chinner
>> + *
>> + * Generic LRU infrastructure
>> + */
>> +#include <linux/kernel.h>
>> +#include <linux/module.h>
>> +#include <linux/list_lru.h>
>> +
>> +int
>> +list_lru_add(
>> +	struct list_lru	*lru,
>> +	struct list_head *item)
> 
> This is lib/, not fs/xfs/ ;)
> 
>> +{
>> +	spin_lock(&lru->lock);
> 
> OK, problems.  Long experience has shown us that in-kernel container
> library code like this should not perform its own locking.  Things like:
> 
> - I want to call it from interrupts!
> 
> - I want to use a mutex!
> 
> - I want to use RCU!
> 
> - I already hold a lock and don't need this code to take another one!
> 
> - I need to sleep in my isolate callback, but the library code is
>   holding a spinlock!
> 
> - I want to test lru.nr_items in a non-racy fashion, but to do that I
>   have to take a lib/-private spinlock!
> 
> etcetera.  It's just heaps less flexible and useful this way, and
> library code should be flexible and useful.
> 
> If you want to put a spinlocked layer on top of the core code then fine
> - that looks to be simple enough, apart from list_lru_dispose_all().
> 
I will leave that to Dave =p

>> +	if (list_empty(item)) {
>> +		list_add_tail(item, &lru->list);
>> +		lru->nr_items++;
>> +		spin_unlock(&lru->lock);
>> +		return 1;
>> +	}
>> +	spin_unlock(&lru->lock);
>> +	return 0;
>> +}
>> +EXPORT_SYMBOL_GPL(list_lru_add);
> 
> So an undocumented, i-have-to-guess-why feature of list_lru_add() is
> that it will refuse to add an item which appears to be on a list
> already?
> 
> This is a little bit strange, because one could legitimately do
> 
> 	list_del(item);		/* from my private list */
> 	list_lru_add(lru, item);
> 
> but this interface forced me to do a needless lru_del_init().
> 
> Maybe this is good, maybe it is bad.  It depends on what the author(s)
> were thinking at the time ;)
> 
> 
> Either way, returning 1 on success and 0 on failure is surprising.  0
> means success, please.  Alternatively I guess one could make it return
> bool and document the dang thing, hence retaining the current 0/1 concept.
> 
>> +
>> +int
>> +list_lru_del(
>> +	struct list_lru	*lru,
>> +	struct list_head *item)
>> +{
>> +	spin_lock(&lru->lock);
>> +	if (!list_empty(item)) {
>> +		list_del_init(item);
>> +		lru->nr_items--;
>> +		spin_unlock(&lru->lock);
>> +		return 1;
>> +	}
>> +	spin_unlock(&lru->lock);
>> +	return 0;
>> +}
>> +EXPORT_SYMBOL_GPL(list_lru_del);
>> +
>> +unsigned long
>> +list_lru_walk(
>> +	struct list_lru *lru,
>> +	list_lru_walk_cb isolate,
>> +	void		*cb_arg,
>> +	unsigned long	nr_to_walk)
> 
> Interface documentation, please.
> 
>> +{
>> +	struct list_head *item, *n;
>> +	unsigned long removed = 0;
>> +
>> +	spin_lock(&lru->lock);
>> +	list_for_each_safe(item, n, &lru->list) {
>> +		enum lru_status ret;
>> +		bool first_pass = true;
>> +restart:
>> +		ret = isolate(item, &lru->lock, cb_arg);
>> +		switch (ret) {
>> +		case LRU_REMOVED:
>> +			lru->nr_items--;
>> +			removed++;
>> +			break;
>> +		case LRU_ROTATE:
>> +			list_move_tail(item, &lru->list);
>> +			break;
>> +		case LRU_SKIP:
>> +			break;
>> +		case LRU_RETRY:
> 
> With no documentation in the code or the changelog, I haven't a clue why
> these four possibilities exist :(
> 
>> +			if (!first_pass)
>> +				break;
>> +			first_pass = false;
>> +			goto restart;
>> +		default:
>> +			BUG();
>> +		}
>> +
>> +		if (nr_to_walk-- == 0)
>> +			break;
>> +
>> +	}
>> +	spin_unlock(&lru->lock);
>> +	return removed;
>> +}
>> +EXPORT_SYMBOL_GPL(list_lru_walk);
> 
> Passing the address of the spinlock to the list_lru_walk_cb handler is
> rather gross.
> 
> And afacit it is unresolvably buggy - if the handler dropped that lock,
> list_lru_walk() is now left holding a list_head at *item which could
> have been altered or even freed.
> 
> How [patch 09/35]'s inode_lru_isolate() avoids this bug I don't know. 
> Perhaps it doesn't.
> 
> 
> Addendum: having now read through the evolution of lib/list_lru.c, it's
> pretty apparent that this code is highly specific to the inode and
> dcache shrinkers and is unlikely to see applications elsewhere.  So
> hrm, perhaps we're kinda kidding ourselves by putting it in lib/ at
> all.
> 


^ permalink raw reply	[flat|nested] 103+ messages in thread

* [PATCH v10 09/35] inode: convert inode lru list to generic lru list code.
       [not found] ` <1370287804-3481-1-git-send-email-glommer-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org>
                     ` (5 preceding siblings ...)
  2013-06-03 19:29   ` [PATCH v10 08/35] list: add a new LRU list type Glauber Costa
@ 2013-06-03 19:29   ` Glauber Costa
  2013-06-03 19:29   ` [PATCH v10 10/35] dcache: convert to use new lru list infrastructure Glauber Costa
                     ` (23 subsequent siblings)
  30 siblings, 0 replies; 103+ messages in thread
From: Glauber Costa @ 2013-06-03 19:29 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Mel Gorman, Dave Chinner,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, cgroups-u79uwXL29TY76Z2rM5mHXA,
	kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A, Michal Hocko,
	Johannes Weiner, hughd-hpIqsD4AKlfQT0dZR+AlfA, Greg Thelen,
	Dave Chinner, Glauber Costa

From: Dave Chinner <dchinner-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>

[ glommer: adapted for new LRU return codes ]
Signed-off-by: Dave Chinner <dchinner-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
Signed-off-by: Glauber Costa <glommer-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org>
---
 fs/inode.c         | 175 +++++++++++++++++++++--------------------------------
 fs/super.c         |  12 ++--
 include/linux/fs.h |   6 +-
 3 files changed, 77 insertions(+), 116 deletions(-)

diff --git a/fs/inode.c b/fs/inode.c
index 1ddaa2e..5d85521 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -17,6 +17,7 @@
 #include <linux/prefetch.h>
 #include <linux/buffer_head.h> /* for inode_has_buffers */
 #include <linux/ratelimit.h>
+#include <linux/list_lru.h>
 #include "internal.h"
 
 /*
@@ -24,7 +25,7 @@
  *
  * inode->i_lock protects:
  *   inode->i_state, inode->i_hash, __iget()
- * inode->i_sb->s_inode_lru_lock protects:
+ * Inode LRU list locks protect:
  *   inode->i_sb->s_inode_lru, inode->i_lru
  * inode_sb_list_lock protects:
  *   sb->s_inodes, inode->i_sb_list
@@ -37,7 +38,7 @@
  *
  * inode_sb_list_lock
  *   inode->i_lock
- *     inode->i_sb->s_inode_lru_lock
+ *     Inode LRU list locks
  *
  * bdi->wb.list_lock
  *   inode->i_lock
@@ -399,13 +400,8 @@ EXPORT_SYMBOL(ihold);
 
 static void inode_lru_list_add(struct inode *inode)
 {
-	spin_lock(&inode->i_sb->s_inode_lru_lock);
-	if (list_empty(&inode->i_lru)) {
-		list_add(&inode->i_lru, &inode->i_sb->s_inode_lru);
-		inode->i_sb->s_nr_inodes_unused++;
+	if (list_lru_add(&inode->i_sb->s_inode_lru, &inode->i_lru))
 		this_cpu_inc(nr_unused);
-	}
-	spin_unlock(&inode->i_sb->s_inode_lru_lock);
 }
 
 /*
@@ -423,13 +419,9 @@ void inode_add_lru(struct inode *inode)
 
 static void inode_lru_list_del(struct inode *inode)
 {
-	spin_lock(&inode->i_sb->s_inode_lru_lock);
-	if (!list_empty(&inode->i_lru)) {
-		list_del_init(&inode->i_lru);
-		inode->i_sb->s_nr_inodes_unused--;
+
+	if (list_lru_del(&inode->i_sb->s_inode_lru, &inode->i_lru))
 		this_cpu_dec(nr_unused);
-	}
-	spin_unlock(&inode->i_sb->s_inode_lru_lock);
 }
 
 /**
@@ -673,24 +665,8 @@ int invalidate_inodes(struct super_block *sb, bool kill_dirty)
 	return busy;
 }
 
-static int can_unuse(struct inode *inode)
-{
-	if (inode->i_state & ~I_REFERENCED)
-		return 0;
-	if (inode_has_buffers(inode))
-		return 0;
-	if (atomic_read(&inode->i_count))
-		return 0;
-	if (inode->i_data.nrpages)
-		return 0;
-	return 1;
-}
-
 /*
- * Walk the superblock inode LRU for freeable inodes and attempt to free them.
- * This is called from the superblock shrinker function with a number of inodes
- * to trim from the LRU. Inodes to be freed are moved to a temporary list and
- * then are freed outside inode_lock by dispose_list().
+ * Isolate the inode from the LRU in preparation for freeing it.
  *
  * Any inodes which are pinned purely because of attached pagecache have their
  * pagecache removed.  If the inode has metadata buffers attached to
@@ -704,90 +680,79 @@ static int can_unuse(struct inode *inode)
  * LRU does not have strict ordering. Hence we don't want to reclaim inodes
  * with this flag set because they are the inodes that are out of order.
  */
-long prune_icache_sb(struct super_block *sb, unsigned long nr_to_scan)
+static enum lru_status
+inode_lru_isolate(struct list_head *item, spinlock_t *lru_lock, void *arg)
 {
-	LIST_HEAD(freeable);
-	long nr_scanned;
-	long freed = 0;
-	unsigned long reap = 0;
+	struct list_head *freeable = arg;
+	struct inode	*inode = container_of(item, struct inode, i_lru);
 
-	spin_lock(&sb->s_inode_lru_lock);
-	for (nr_scanned = nr_to_scan; nr_scanned >= 0; nr_scanned--) {
-		struct inode *inode;
+	/*
+	 * we are inverting the lru lock/inode->i_lock here, so use a trylock.
+	 * If we fail to get the lock, just skip it.
+	 */
+	if (!spin_trylock(&inode->i_lock))
+		return LRU_SKIP;
 
-		if (list_empty(&sb->s_inode_lru))
-			break;
+	/*
+	 * Referenced or dirty inodes are still in use. Give them another pass
+	 * through the LRU as we cannot reclaim them now.
+	 */
+	if (atomic_read(&inode->i_count) ||
+	    (inode->i_state & ~I_REFERENCED)) {
+		list_del_init(&inode->i_lru);
+		spin_unlock(&inode->i_lock);
+		this_cpu_dec(nr_unused);
+		return LRU_REMOVED;
+	}
 
-		inode = list_entry(sb->s_inode_lru.prev, struct inode, i_lru);
+	/* recently referenced inodes get one more pass */
+	if (inode->i_state & I_REFERENCED) {
+		inode->i_state &= ~I_REFERENCED;
+		spin_unlock(&inode->i_lock);
+		return LRU_ROTATE;
+	}
 
-		/*
-		 * we are inverting the sb->s_inode_lru_lock/inode->i_lock here,
-		 * so use a trylock. If we fail to get the lock, just move the
-		 * inode to the back of the list so we don't spin on it.
-		 */
-		if (!spin_trylock(&inode->i_lock)) {
-			list_move(&inode->i_lru, &sb->s_inode_lru);
-			continue;
+	if (inode_has_buffers(inode) || inode->i_data.nrpages) {
+		__iget(inode);
+		spin_unlock(&inode->i_lock);
+		spin_unlock(lru_lock);
+		if (remove_inode_buffers(inode)) {
+			unsigned long reap;
+			reap = invalidate_mapping_pages(&inode->i_data, 0, -1);
+			if (current_is_kswapd())
+				__count_vm_events(KSWAPD_INODESTEAL, reap);
+			else
+				__count_vm_events(PGINODESTEAL, reap);
+			if (current->reclaim_state)
+				current->reclaim_state->reclaimed_slab += reap;
 		}
+		iput(inode);
+		spin_lock(lru_lock);
+		return LRU_RETRY;
+	}
 
-		/*
-		 * Referenced or dirty inodes are still in use. Give them
-		 * another pass through the LRU as we canot reclaim them now.
-		 */
-		if (atomic_read(&inode->i_count) ||
-		    (inode->i_state & ~I_REFERENCED)) {
-			list_del_init(&inode->i_lru);
-			spin_unlock(&inode->i_lock);
-			sb->s_nr_inodes_unused--;
-			this_cpu_dec(nr_unused);
-			continue;
-		}
+	WARN_ON(inode->i_state & I_NEW);
+	inode->i_state |= I_FREEING;
+	spin_unlock(&inode->i_lock);
 
-		/* recently referenced inodes get one more pass */
-		if (inode->i_state & I_REFERENCED) {
-			inode->i_state &= ~I_REFERENCED;
-			list_move(&inode->i_lru, &sb->s_inode_lru);
-			spin_unlock(&inode->i_lock);
-			continue;
-		}
-		if (inode_has_buffers(inode) || inode->i_data.nrpages) {
-			__iget(inode);
-			spin_unlock(&inode->i_lock);
-			spin_unlock(&sb->s_inode_lru_lock);
-			if (remove_inode_buffers(inode))
-				reap += invalidate_mapping_pages(&inode->i_data,
-								0, -1);
-			iput(inode);
-			spin_lock(&sb->s_inode_lru_lock);
-
-			if (inode != list_entry(sb->s_inode_lru.next,
-						struct inode, i_lru))
-				continue;	/* wrong inode or list_empty */
-			/* avoid lock inversions with trylock */
-			if (!spin_trylock(&inode->i_lock))
-				continue;
-			if (!can_unuse(inode)) {
-				spin_unlock(&inode->i_lock);
-				continue;
-			}
-		}
-		WARN_ON(inode->i_state & I_NEW);
-		inode->i_state |= I_FREEING;
-		spin_unlock(&inode->i_lock);
+	list_move(&inode->i_lru, freeable);
+	this_cpu_dec(nr_unused);
+	return LRU_REMOVED;
+}
 
-		list_move(&inode->i_lru, &freeable);
-		sb->s_nr_inodes_unused--;
-		this_cpu_dec(nr_unused);
-		freed++;
-	}
-	if (current_is_kswapd())
-		__count_vm_events(KSWAPD_INODESTEAL, reap);
-	else
-		__count_vm_events(PGINODESTEAL, reap);
-	spin_unlock(&sb->s_inode_lru_lock);
-	if (current->reclaim_state)
-		current->reclaim_state->reclaimed_slab += reap;
+/*
+ * Walk the superblock inode LRU for freeable inodes and attempt to free them.
+ * This is called from the superblock shrinker function with a number of inodes
+ * to trim from the LRU. Inodes to be freed are moved to a temporary list and
+ * then are freed outside inode_lock by dispose_list().
+ */
+long prune_icache_sb(struct super_block *sb, unsigned long nr_to_scan)
+{
+	LIST_HEAD(freeable);
+	long freed;
 
+	freed = list_lru_walk(&sb->s_inode_lru, inode_lru_isolate,
+						&freeable, nr_to_scan);
 	dispose_list(&freeable);
 	return freed;
 }
diff --git a/fs/super.c b/fs/super.c
index 18871f6..83f6eb4 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -77,14 +77,13 @@ static long super_cache_scan(struct shrinker *shrink, struct shrink_control *sc)
 	if (sb->s_op && sb->s_op->nr_cached_objects)
 		fs_objects = sb->s_op->nr_cached_objects(sb);
 
-	total_objects = sb->s_nr_dentry_unused +
-			sb->s_nr_inodes_unused + fs_objects + 1;
+	inodes = list_lru_count(&sb->s_inode_lru);
+	total_objects = sb->s_nr_dentry_unused + inodes + fs_objects + 1;
 
 	/* proportion the scan between the caches */
 	dentries = mult_frac(sc->nr_to_scan, sb->s_nr_dentry_unused,
 								total_objects);
-	inodes = mult_frac(sc->nr_to_scan, sb->s_nr_inodes_unused,
-								total_objects);
+	inodes = mult_frac(sc->nr_to_scan, inodes, total_objects);
 
 	/*
 	 * prune the dcache first as the icache is pinned by it, then
@@ -117,7 +116,7 @@ static long super_cache_count(struct shrinker *shrink, struct shrink_control *sc
 		total_objects = sb->s_op->nr_cached_objects(sb);
 
 	total_objects += sb->s_nr_dentry_unused;
-	total_objects += sb->s_nr_inodes_unused;
+	total_objects += list_lru_count(&sb->s_inode_lru);
 
 	total_objects = vfs_pressure_ratio(total_objects);
 	drop_super(sb);
@@ -198,8 +197,7 @@ static struct super_block *alloc_super(struct file_system_type *type, int flags)
 		INIT_LIST_HEAD(&s->s_inodes);
 		INIT_LIST_HEAD(&s->s_dentry_lru);
 		spin_lock_init(&s->s_dentry_lru_lock);
-		INIT_LIST_HEAD(&s->s_inode_lru);
-		spin_lock_init(&s->s_inode_lru_lock);
+		list_lru_init(&s->s_inode_lru);
 		INIT_LIST_HEAD(&s->s_mounts);
 		init_rwsem(&s->s_umount);
 		lockdep_set_class(&s->s_umount, &type->s_umount_key);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index b0170ec..06695d7 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -10,6 +10,7 @@
 #include <linux/stat.h>
 #include <linux/cache.h>
 #include <linux/list.h>
+#include <linux/list_lru.h>
 #include <linux/radix-tree.h>
 #include <linux/rbtree.h>
 #include <linux/init.h>
@@ -1269,10 +1270,7 @@ struct super_block {
 	struct list_head	s_dentry_lru;	/* unused dentry lru */
 	long			s_nr_dentry_unused;	/* # of dentry on lru */
 
-	/* s_inode_lru_lock protects s_inode_lru and s_nr_inodes_unused */
-	spinlock_t		s_inode_lru_lock ____cacheline_aligned_in_smp;
-	struct list_head	s_inode_lru;		/* unused inode lru */
-	long			s_nr_inodes_unused;	/* # of inodes on lru */
+	struct list_lru		s_inode_lru ____cacheline_aligned_in_smp;
 
 	struct block_device	*s_bdev;
 	struct backing_dev_info *s_bdi;
-- 
1.8.1.4


* [PATCH v10 10/35] dcache: convert to use new lru list infrastructure
       [not found] ` <1370287804-3481-1-git-send-email-glommer-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org>
                     ` (6 preceding siblings ...)
  2013-06-03 19:29   ` [PATCH v10 09/35] inode: convert inode lru list to generic lru list code Glauber Costa
@ 2013-06-03 19:29   ` Glauber Costa
  2013-06-03 19:29   ` [PATCH v10 11/35] list_lru: per-node " Glauber Costa
                     ` (22 subsequent siblings)
  30 siblings, 0 replies; 103+ messages in thread
From: Glauber Costa @ 2013-06-03 19:29 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Mel Gorman, Dave Chinner,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, cgroups-u79uwXL29TY76Z2rM5mHXA,
	kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A, Michal Hocko,
	Johannes Weiner, hughd-hpIqsD4AKlfQT0dZR+AlfA, Greg Thelen,
	Dave Chinner, Glauber Costa

From: Dave Chinner <dchinner-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>

[ glommer: don't reintroduce double decrement of nr_unused_dentries,
  adapted for new LRU return codes ]
Signed-off-by: Dave Chinner <dchinner-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
Signed-off-by: Glauber Costa <glommer-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org>
---
 fs/dcache.c        | 165 ++++++++++++++++++++++++-----------------------------
 fs/super.c         |  11 ++--
 include/linux/fs.h |  15 +++--
 3 files changed, 87 insertions(+), 104 deletions(-)
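
For reviewers who have not followed the earlier list_lru patches, the
contract between the walker and an isolate callback can be modelled in
plain userspace C. This is only a sketch under stated assumptions: every
sketch_* name is hypothetical, locking is omitted entirely, and the
LRU_RETRY case used by the inode conversion is left out. It mirrors the
three outcomes dentry_lru_isolate() relies on: skip an entry whose
trylock would fail, rotate a recently referenced entry, and remove
everything else.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical userspace stand-ins for the LRU return codes. */
enum sketch_status { SKETCH_REMOVED, SKETCH_ROTATE, SKETCH_SKIP };

struct sketch_item {
	struct sketch_item *prev, *next;
	int in_use;	/* models a non-zero d_count */
	int referenced;	/* models DCACHE_REFERENCED */
	int locked;	/* models a failed spin_trylock() */
};

struct sketch_lru {
	struct sketch_item head;	/* list sentinel */
	long nr_items;
};

typedef enum sketch_status (*sketch_isolate_cb)(struct sketch_item *item,
						void *arg);

static void sketch_lru_init(struct sketch_lru *lru)
{
	lru->head.prev = lru->head.next = &lru->head;
	lru->nr_items = 0;
}

static void sketch_unlink(struct sketch_item *item)
{
	item->prev->next = item->next;
	item->next->prev = item->prev;
	item->prev = item->next = item;
}

/* Link at the tail without touching the item count. */
static void sketch_link_tail(struct sketch_lru *lru, struct sketch_item *item)
{
	item->prev = lru->head.prev;
	item->next = &lru->head;
	lru->head.prev->next = item;
	lru->head.prev = item;
}

static void sketch_lru_add(struct sketch_lru *lru, struct sketch_item *item)
{
	sketch_link_tail(lru, item);
	lru->nr_items++;
}

/* Callback: decides the fate of one item; arg is the dispose list. */
static enum sketch_status sketch_isolate(struct sketch_item *item, void *arg)
{
	struct sketch_lru *dispose = arg;

	if (item->locked)
		return SKETCH_SKIP;
	if (item->in_use) {		/* still used: just drop it from the LRU */
		sketch_unlink(item);
		return SKETCH_REMOVED;
	}
	if (item->referenced) {		/* one more pass; walker does the move */
		item->referenced = 0;
		return SKETCH_ROTATE;
	}
	sketch_unlink(item);
	sketch_lru_add(dispose, item);
	return SKETCH_REMOVED;
}

/* Walker: owns the list accounting, visits at most nr_to_walk items. */
static long sketch_lru_walk(struct sketch_lru *lru, sketch_isolate_cb isolate,
			    void *arg, unsigned long nr_to_walk)
{
	struct sketch_item *item, *next;
	long isolated = 0;

	for (item = lru->head.next; item != &lru->head && nr_to_walk > 0;
	     item = next, nr_to_walk--) {
		next = item->next;	/* saved first: callback may unlink item */
		switch (isolate(item, arg)) {
		case SKETCH_REMOVED:
			lru->nr_items--;
			isolated++;
			break;
		case SKETCH_ROTATE:
			sketch_unlink(item);
			sketch_link_tail(lru, item);
			break;
		case SKETCH_SKIP:
			break;
		}
	}
	return isolated;
}
```

As in the patch, an entry that is still in use is reported as removed
and therefore counted in the walker's return value, even though nothing
is queued for disposal. A caller would initialise two lists with
sketch_lru_init(), populate one with sketch_lru_add(), and pass the
other as the callback argument to sketch_lru_walk().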

diff --git a/fs/dcache.c b/fs/dcache.c
index f048f95..30731d3 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -37,6 +37,7 @@
 #include <linux/rculist_bl.h>
 #include <linux/prefetch.h>
 #include <linux/ratelimit.h>
+#include <linux/list_lru.h>
 #include "internal.h"
 #include "mount.h"
 
@@ -319,20 +320,8 @@ static void dentry_unlink_inode(struct dentry * dentry)
  */
 static void dentry_lru_add(struct dentry *dentry)
 {
-	if (list_empty(&dentry->d_lru)) {
-		spin_lock(&dentry->d_sb->s_dentry_lru_lock);
-		list_add(&dentry->d_lru, &dentry->d_sb->s_dentry_lru);
-		dentry->d_sb->s_nr_dentry_unused++;
+	if (list_lru_add(&dentry->d_sb->s_dentry_lru, &dentry->d_lru))
 		this_cpu_inc(nr_dentry_unused);
-		spin_unlock(&dentry->d_sb->s_dentry_lru_lock);
-	}
-}
-
-static void __dentry_lru_del(struct dentry *dentry)
-{
-	list_del_init(&dentry->d_lru);
-	dentry->d_sb->s_nr_dentry_unused--;
-	this_cpu_dec(nr_dentry_unused);
 }
 
 /*
@@ -350,26 +339,8 @@ static void dentry_lru_del(struct dentry *dentry)
 		return;
 	}
 
-	if (!list_empty(&dentry->d_lru)) {
-		spin_lock(&dentry->d_sb->s_dentry_lru_lock);
-		__dentry_lru_del(dentry);
-		spin_unlock(&dentry->d_sb->s_dentry_lru_lock);
-	}
-}
-
-static void dentry_lru_move_list(struct dentry *dentry, struct list_head *list)
-{
-	BUG_ON(dentry->d_flags & DCACHE_SHRINK_LIST);
-
-	spin_lock(&dentry->d_sb->s_dentry_lru_lock);
-	if (list_empty(&dentry->d_lru)) {
-		list_add_tail(&dentry->d_lru, list);
-	} else {
-		list_move_tail(&dentry->d_lru, list);
-		dentry->d_sb->s_nr_dentry_unused--;
+	if (list_lru_del(&dentry->d_sb->s_dentry_lru, &dentry->d_lru))
 		this_cpu_dec(nr_dentry_unused);
-	}
-	spin_unlock(&dentry->d_sb->s_dentry_lru_lock);
 }
 
 /**
@@ -844,12 +815,72 @@ static void shrink_dentry_list(struct list_head *list)
 	rcu_read_unlock();
 }
 
+static enum lru_status
+dentry_lru_isolate(struct list_head *item, spinlock_t *lru_lock, void *arg)
+{
+	struct list_head *freeable = arg;
+	struct dentry	*dentry = container_of(item, struct dentry, d_lru);
+
+
+	/*
+	 * we are inverting the lru lock/dentry->d_lock here,
+	 * so use a trylock. If we fail to get the lock, just skip
+	 * it
+	 */
+	if (!spin_trylock(&dentry->d_lock))
+		return LRU_SKIP;
+
+	/*
+	 * Referenced dentries are still in use. If they have active
+	 * counts, just remove them from the LRU. Otherwise give them
+	 * another pass through the LRU.
+	 */
+	if (dentry->d_count) {
+		list_del_init(&dentry->d_lru);
+		spin_unlock(&dentry->d_lock);
+		return LRU_REMOVED;
+	}
+
+	if (dentry->d_flags & DCACHE_REFERENCED) {
+		dentry->d_flags &= ~DCACHE_REFERENCED;
+		spin_unlock(&dentry->d_lock);
+
+		/*
+		 * The list move itself will be made by the common LRU code. At
+		 * this point, we've dropped the dentry->d_lock but keep the
+		 * lru lock. This is safe to do, since every list movement is
+		 * protected by the lru lock even if both locks are held.
+		 *
+		 * This is guaranteed by the fact that all LRU management
+		 * functions are intermediated by the LRU API calls like
+		 * list_lru_add and list_lru_del. List movement in this file
+		 * only ever occurs through these functions or through callbacks
+		 * like this one, that are called from the LRU API.
+		 *
+		 * The only exceptions to this are functions like
+		 * shrink_dentry_list, and code that first checks for the
+		 * DCACHE_SHRINK_LIST flag.  Those are guaranteed to be
+		 * operating only with stack provided lists after they are
+		 * properly isolated from the main list.  It is thus, always a
+		 * local access.
+		 */
+		return LRU_ROTATE;
+	}
+
+	dentry->d_flags |= DCACHE_SHRINK_LIST;
+	list_move_tail(&dentry->d_lru, freeable);
+	this_cpu_dec(nr_dentry_unused);
+	spin_unlock(&dentry->d_lock);
+
+	return LRU_REMOVED;
+}
+
 /**
  * prune_dcache_sb - shrink the dcache
  * @sb: superblock
- * @count: number of entries to try to free
+ * @nr_to_scan : number of entries to try to free
  *
- * Attempt to shrink the superblock dcache LRU by @count entries. This is
+ * Attempt to shrink the superblock dcache LRU by @nr_to_scan entries. This is
  * done when we need more memory an called from the superblock shrinker
  * function.
  *
@@ -858,45 +889,12 @@ static void shrink_dentry_list(struct list_head *list)
  */
 long prune_dcache_sb(struct super_block *sb, unsigned long nr_to_scan)
 {
-	struct dentry *dentry;
-	LIST_HEAD(referenced);
-	LIST_HEAD(tmp);
-	long freed = 0;
-
-relock:
-	spin_lock(&sb->s_dentry_lru_lock);
-	while (!list_empty(&sb->s_dentry_lru)) {
-		dentry = list_entry(sb->s_dentry_lru.prev,
-				struct dentry, d_lru);
-		BUG_ON(dentry->d_sb != sb);
-
-		if (!spin_trylock(&dentry->d_lock)) {
-			spin_unlock(&sb->s_dentry_lru_lock);
-			cpu_relax();
-			goto relock;
-		}
-
-		if (dentry->d_flags & DCACHE_REFERENCED) {
-			dentry->d_flags &= ~DCACHE_REFERENCED;
-			list_move(&dentry->d_lru, &referenced);
-			spin_unlock(&dentry->d_lock);
-		} else {
-			list_move(&dentry->d_lru, &tmp);
-			dentry->d_flags |= DCACHE_SHRINK_LIST;
-			this_cpu_dec(nr_dentry_unused);
-			sb->s_nr_dentry_unused--;
-			spin_unlock(&dentry->d_lock);
-			freed++;
-			if (!--nr_to_scan)
-				break;
-		}
-		cond_resched_lock(&sb->s_dentry_lru_lock);
-	}
-	if (!list_empty(&referenced))
-		list_splice(&referenced, &sb->s_dentry_lru);
-	spin_unlock(&sb->s_dentry_lru_lock);
+	LIST_HEAD(dispose);
+	long freed;
 
-	shrink_dentry_list(&tmp);
+	freed = list_lru_walk(&sb->s_dentry_lru, dentry_lru_isolate,
+			      &dispose, nr_to_scan);
+	shrink_dentry_list(&dispose);
 	return freed;
 }
 
@@ -930,24 +928,10 @@ shrink_dcache_list(
  */
 void shrink_dcache_sb(struct super_block *sb)
 {
-	LIST_HEAD(tmp);
-
-	spin_lock(&sb->s_dentry_lru_lock);
-	while (!list_empty(&sb->s_dentry_lru)) {
-		/*
-		 * account for removal here so we don't need to handle it later
-		 * even though the dentry is no longer on the lru list.
-		 */
-		list_splice_init(&sb->s_dentry_lru, &tmp);
-		this_cpu_sub(nr_dentry_unused, sb->s_nr_dentry_unused);
-		sb->s_nr_dentry_unused = 0;
-		spin_unlock(&sb->s_dentry_lru_lock);
+	long disposed;
 
-		shrink_dcache_list(&tmp);
-
-		spin_lock(&sb->s_dentry_lru_lock);
-	}
-	spin_unlock(&sb->s_dentry_lru_lock);
+	disposed = list_lru_dispose_all(&sb->s_dentry_lru, shrink_dcache_list);
+	this_cpu_sub(nr_dentry_unused, disposed);
 }
 EXPORT_SYMBOL(shrink_dcache_sb);
 
@@ -1220,7 +1204,8 @@ resume:
 		if (dentry->d_count) {
 			dentry_lru_del(dentry);
 		} else if (!(dentry->d_flags & DCACHE_SHRINK_LIST)) {
-			dentry_lru_move_list(dentry, dispose);
+			dentry_lru_del(dentry);
+			list_add_tail(&dentry->d_lru, dispose);
 			dentry->d_flags |= DCACHE_SHRINK_LIST;
 			found++;
 		}
diff --git a/fs/super.c b/fs/super.c
index 83f6eb4..8d8a62c 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -78,11 +78,11 @@ static long super_cache_scan(struct shrinker *shrink, struct shrink_control *sc)
 		fs_objects = sb->s_op->nr_cached_objects(sb);
 
 	inodes = list_lru_count(&sb->s_inode_lru);
-	total_objects = sb->s_nr_dentry_unused + inodes + fs_objects + 1;
+	dentries = list_lru_count(&sb->s_dentry_lru);
+	total_objects = dentries + inodes + fs_objects + 1;
 
 	/* proportion the scan between the caches */
-	dentries = mult_frac(sc->nr_to_scan, sb->s_nr_dentry_unused,
-								total_objects);
+	dentries = mult_frac(sc->nr_to_scan, dentries, total_objects);
 	inodes = mult_frac(sc->nr_to_scan, inodes, total_objects);
 
 	/*
@@ -115,7 +115,7 @@ static long super_cache_count(struct shrinker *shrink, struct shrink_control *sc
 	if (sb->s_op && sb->s_op->nr_cached_objects)
 		total_objects = sb->s_op->nr_cached_objects(sb);
 
-	total_objects += sb->s_nr_dentry_unused;
+	total_objects += list_lru_count(&sb->s_dentry_lru);
 	total_objects += list_lru_count(&sb->s_inode_lru);
 
 	total_objects = vfs_pressure_ratio(total_objects);
@@ -195,8 +195,7 @@ static struct super_block *alloc_super(struct file_system_type *type, int flags)
 		INIT_HLIST_NODE(&s->s_instances);
 		INIT_HLIST_BL_HEAD(&s->s_anon);
 		INIT_LIST_HEAD(&s->s_inodes);
-		INIT_LIST_HEAD(&s->s_dentry_lru);
-		spin_lock_init(&s->s_dentry_lru_lock);
+		list_lru_init(&s->s_dentry_lru);
 		list_lru_init(&s->s_inode_lru);
 		INIT_LIST_HEAD(&s->s_mounts);
 		init_rwsem(&s->s_umount);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 06695d7..0d05a98 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1264,14 +1264,6 @@ struct super_block {
 	struct list_head	s_files;
 #endif
 	struct list_head	s_mounts;	/* list of mounts; _not_ for fs use */
-
-	/* s_dentry_lru_lock protects s_dentry_lru and s_nr_dentry_unused */
-	spinlock_t		s_dentry_lru_lock ____cacheline_aligned_in_smp;
-	struct list_head	s_dentry_lru;	/* unused dentry lru */
-	long			s_nr_dentry_unused;	/* # of dentry on lru */
-
-	struct list_lru		s_inode_lru ____cacheline_aligned_in_smp;
-
 	struct block_device	*s_bdev;
 	struct backing_dev_info *s_bdi;
 	struct mtd_info		*s_mtd;
@@ -1322,6 +1314,13 @@ struct super_block {
 
 	/* Being remounted read-only */
 	int s_readonly_remount;
+
+	/*
+	 * Keep the lru lists last in the structure so they always sit on their
+	 * own individual cachelines.
+	 */
+	struct list_lru		s_dentry_lru ____cacheline_aligned_in_smp;
+	struct list_lru		s_inode_lru ____cacheline_aligned_in_smp;
 };
 
 extern struct timespec current_fs_time(struct super_block *sb);
-- 
1.8.1.4


* [PATCH v10 11/35] list_lru: per-node list infrastructure
       [not found] ` <1370287804-3481-1-git-send-email-glommer-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org>
                     ` (7 preceding siblings ...)
  2013-06-03 19:29   ` [PATCH v10 10/35] dcache: convert to use new lru list infrastructure Glauber Costa
@ 2013-06-03 19:29   ` Glauber Costa
  2013-06-05 23:08     ` Andrew Morton
  2013-06-03 19:29   ` [PATCH v10 12/35] shrinker: add node awareness Glauber Costa
                     ` (21 subsequent siblings)
  30 siblings, 1 reply; 103+ messages in thread
From: Glauber Costa @ 2013-06-03 19:29 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Mel Gorman, Dave Chinner,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, cgroups-u79uwXL29TY76Z2rM5mHXA,
	kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A, Michal Hocko,
	Johannes Weiner, hughd-hpIqsD4AKlfQT0dZR+AlfA, Greg Thelen,
	Dave Chinner, Glauber Costa

From: Dave Chinner <dchinner-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>

Now that we have an LRU list API, we can start to enhance the
implementation.  This splits the single LRU list into per-node lists
and locks to enhance scalability. Items are placed on lists
according to the node the memory belongs to. To make scanning the
lists efficient, also track whether the per-node lists have entries
in them in an active nodemask.

Note:
We use a fixed-size array for the node LRU, so this struct can be very big
if MAX_NUMNODES is big. If this becomes a problem, it is fixable by
turning the array into a pointer and dynamically allocating it with
nr_node_ids entries. That quantity is firmware-provided, and would still
provide room for all nodes at the cost of a pointer lookup and an extra
allocation. Because that allocation will most likely come from a
different slab cache than the main structure holding this structure, the
allocation may very well fail.

[ glommer: fixed warnings, added note about node lru ]
Signed-off-by: Dave Chinner <dchinner-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
Signed-off-by: Glauber Costa <glommer-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org>
Reviewed-by: Greg Thelen <gthelen-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
Acked-by: Mel Gorman <mgorman-l3A5Bk7waGM@public.gmane.org>
---
 include/linux/list_lru.h |  24 +++++--
 lib/list_lru.c           | 161 +++++++++++++++++++++++++++++++++++------------
 2 files changed, 139 insertions(+), 46 deletions(-)

diff --git a/include/linux/list_lru.h b/include/linux/list_lru.h
index 4f82a57..668f1f1 100644
--- a/include/linux/list_lru.h
+++ b/include/linux/list_lru.h
@@ -8,6 +8,7 @@
 #define _LRU_LIST_H
 
 #include <linux/list.h>
+#include <linux/nodemask.h>
 
 enum lru_status {
 	LRU_REMOVED,		/* item removed from list */
@@ -17,20 +18,31 @@ enum lru_status {
 				   internally, but has to return locked. */
 };
 
-struct list_lru {
+struct list_lru_node {
 	spinlock_t		lock;
 	struct list_head	list;
 	long			nr_items;
+} ____cacheline_aligned_in_smp;
+
+struct list_lru {
+	/*
+	 * Because we use a fixed-size array, this struct can be very big if
+	 * MAX_NUMNODES is big. If this becomes a problem this is fixable by
+	 * turning this into a pointer and dynamically allocating this to
+	 * nr_node_ids. This quantity is firmware-provided, and still would
+	 * provide room for all nodes at the cost of a pointer lookup and an
+	 * extra allocation. Because that allocation will most likely come from
+	 * a different slab cache than the main structure holding this
+	 * structure, we may very well fail.
+	 */
+	struct list_lru_node	node[MAX_NUMNODES];
+	nodemask_t		active_nodes;
 };
 
 int list_lru_init(struct list_lru *lru);
 int list_lru_add(struct list_lru *lru, struct list_head *item);
 int list_lru_del(struct list_lru *lru, struct list_head *item);
-
-static inline unsigned long list_lru_count(struct list_lru *lru)
-{
-	return lru->nr_items;
-}
+unsigned long list_lru_count(struct list_lru *lru);
 
 typedef enum lru_status
 (*list_lru_walk_cb)(struct list_head *item, spinlock_t *lock, void *cb_arg);
diff --git a/lib/list_lru.c b/lib/list_lru.c
index 3127edd..7611df7 100644
--- a/lib/list_lru.c
+++ b/lib/list_lru.c
@@ -6,6 +6,7 @@
  */
 #include <linux/kernel.h>
 #include <linux/module.h>
+#include <linux/mm.h>
 #include <linux/list_lru.h>
 
 int
@@ -13,14 +14,19 @@ list_lru_add(
 	struct list_lru	*lru,
 	struct list_head *item)
 {
-	spin_lock(&lru->lock);
+	int nid = page_to_nid(virt_to_page(item));
+	struct list_lru_node *nlru = &lru->node[nid];
+
+	spin_lock(&nlru->lock);
+	BUG_ON(nlru->nr_items < 0);
 	if (list_empty(item)) {
-		list_add_tail(item, &lru->list);
-		lru->nr_items++;
-		spin_unlock(&lru->lock);
+		list_add_tail(item, &nlru->list);
+		if (nlru->nr_items++ == 0)
+			node_set(nid, lru->active_nodes);
+		spin_unlock(&nlru->lock);
 		return 1;
 	}
-	spin_unlock(&lru->lock);
+	spin_unlock(&nlru->lock);
 	return 0;
 }
 EXPORT_SYMBOL_GPL(list_lru_add);
@@ -30,41 +36,69 @@ list_lru_del(
 	struct list_lru	*lru,
 	struct list_head *item)
 {
-	spin_lock(&lru->lock);
+	int nid = page_to_nid(virt_to_page(item));
+	struct list_lru_node *nlru = &lru->node[nid];
+
+	spin_lock(&nlru->lock);
 	if (!list_empty(item)) {
 		list_del_init(item);
-		lru->nr_items--;
-		spin_unlock(&lru->lock);
+		if (--nlru->nr_items == 0)
+			node_clear(nid, lru->active_nodes);
+		BUG_ON(nlru->nr_items < 0);
+		spin_unlock(&nlru->lock);
 		return 1;
 	}
-	spin_unlock(&lru->lock);
+	spin_unlock(&nlru->lock);
 	return 0;
 }
 EXPORT_SYMBOL_GPL(list_lru_del);
 
 unsigned long
-list_lru_walk(
-	struct list_lru *lru,
-	list_lru_walk_cb isolate,
-	void		*cb_arg,
-	unsigned long	nr_to_walk)
+list_lru_count(struct list_lru *lru)
 {
+	long count = 0;
+	int nid;
+
+	for_each_node_mask(nid, lru->active_nodes) {
+		struct list_lru_node *nlru = &lru->node[nid];
+
+		spin_lock(&nlru->lock);
+		BUG_ON(nlru->nr_items < 0);
+		count += nlru->nr_items;
+		spin_unlock(&nlru->lock);
+	}
+
+	return count;
+}
+EXPORT_SYMBOL_GPL(list_lru_count);
+
+static unsigned long
+list_lru_walk_node(
+	struct list_lru		*lru,
+	int			nid,
+	list_lru_walk_cb	isolate,
+	void			*cb_arg,
+	unsigned long		*nr_to_walk)
+{
+	struct list_lru_node	*nlru = &lru->node[nid];
 	struct list_head *item, *n;
-	unsigned long removed = 0;
+	unsigned long isolated = 0;
 
-	spin_lock(&lru->lock);
-	list_for_each_safe(item, n, &lru->list) {
+	spin_lock(&nlru->lock);
+	list_for_each_safe(item, n, &nlru->list) {
 		enum lru_status ret;
 		bool first_pass = true;
 restart:
-		ret = isolate(item, &lru->lock, cb_arg);
+		ret = isolate(item, &nlru->lock, cb_arg);
 		switch (ret) {
 		case LRU_REMOVED:
-			lru->nr_items--;
-			removed++;
+			if (--nlru->nr_items == 0)
+				node_clear(nid, lru->active_nodes);
+			BUG_ON(nlru->nr_items < 0);
+			isolated++;
 			break;
 		case LRU_ROTATE:
-			list_move_tail(item, &lru->list);
+			list_move_tail(item, &nlru->list);
 			break;
 		case LRU_SKIP:
 			break;
@@ -77,46 +111,93 @@ restart:
 			BUG();
 		}
 
-		if (nr_to_walk-- == 0)
+		if ((*nr_to_walk)-- == 0)
 			break;
 
 	}
-	spin_unlock(&lru->lock);
-	return removed;
+	spin_unlock(&nlru->lock);
+	return isolated;
 }
-EXPORT_SYMBOL_GPL(list_lru_walk);
 
 unsigned long
-list_lru_dispose_all(
-	struct list_lru *lru,
-	list_lru_dispose_cb dispose)
+list_lru_walk(
+	struct list_lru	*lru,
+	list_lru_walk_cb isolate,
+	void		*cb_arg,
+	unsigned long	nr_to_walk)
 {
-	unsigned long disposed = 0;
+	long isolated = 0;
+	int nid;
+
+	for_each_node_mask(nid, lru->active_nodes) {
+		isolated += list_lru_walk_node(lru, nid, isolate,
+					       cb_arg, &nr_to_walk);
+		if (nr_to_walk <= 0)
+			break;
+	}
+	return isolated;
+}
+EXPORT_SYMBOL_GPL(list_lru_walk);
+
+static unsigned long
+list_lru_dispose_all_node(
+	struct list_lru		*lru,
+	int			nid,
+	list_lru_dispose_cb	dispose)
+{
+	struct list_lru_node	*nlru = &lru->node[nid];
 	LIST_HEAD(dispose_list);
+	unsigned long disposed = 0;
 
-	spin_lock(&lru->lock);
-	while (!list_empty(&lru->list)) {
-		list_splice_init(&lru->list, &dispose_list);
-		disposed += lru->nr_items;
-		lru->nr_items = 0;
-		spin_unlock(&lru->lock);
+	spin_lock(&nlru->lock);
+	while (!list_empty(&nlru->list)) {
+		list_splice_init(&nlru->list, &dispose_list);
+		disposed += nlru->nr_items;
+		nlru->nr_items = 0;
+		node_clear(nid, lru->active_nodes);
+		spin_unlock(&nlru->lock);
 
 		dispose(&dispose_list);
 
-		spin_lock(&lru->lock);
+		spin_lock(&nlru->lock);
 	}
-	spin_unlock(&lru->lock);
+	spin_unlock(&nlru->lock);
 	return disposed;
 }
 
+unsigned long
+list_lru_dispose_all(
+	struct list_lru		*lru,
+	list_lru_dispose_cb	dispose)
+{
+	unsigned long disposed;
+	unsigned long total = 0;
+	int nid;
+
+	do {
+		disposed = 0;
+		for_each_node_mask(nid, lru->active_nodes) {
+			disposed += list_lru_dispose_all_node(lru, nid,
+							      dispose);
+		}
+		total += disposed;
+	} while (disposed != 0);
+
+	return total;
+}
+
 int
 list_lru_init(
 	struct list_lru	*lru)
 {
-	spin_lock_init(&lru->lock);
-	INIT_LIST_HEAD(&lru->list);
-	lru->nr_items = 0;
+	int i;
 
+	nodes_clear(lru->active_nodes);
+	for (i = 0; i < MAX_NUMNODES; i++) {
+		spin_lock_init(&lru->node[i].lock);
+		INIT_LIST_HEAD(&lru->node[i].list);
+		lru->node[i].nr_items = 0;
+	}
 	return 0;
 }
 EXPORT_SYMBOL_GPL(list_lru_init);
-- 
1.8.1.4


* Re: [PATCH v10 11/35] list_lru: per-node list infrastructure
  2013-06-03 19:29   ` [PATCH v10 11/35] list_lru: per-node " Glauber Costa
@ 2013-06-05 23:08     ` Andrew Morton
  2013-06-06  3:21       ` Dave Chinner
  2013-06-06 16:15       ` Glauber Costa
  0 siblings, 2 replies; 103+ messages in thread
From: Andrew Morton @ 2013-06-05 23:08 UTC (permalink / raw)
  To: Glauber Costa
  Cc: linux-fsdevel, Mel Gorman, Dave Chinner, linux-mm, cgroups,
	kamezawa.hiroyu, Michal Hocko, Johannes Weiner, hughd,
	Greg Thelen, Dave Chinner

On Mon,  3 Jun 2013 23:29:40 +0400 Glauber Costa <glommer@openvz.org> wrote:

> From: Dave Chinner <dchinner@redhat.com>
> 
> Now that we have an LRU list API, we can start to enhance the
> implementation.  This splits the single LRU list into per-node lists
> and locks to enhance scalability.

Do we have any runtime measurements?  They're pretty important for
justifying inclusion of the code.

Measurements for non-NUMA and uniprocessor kernels would be useful in
making that decision as well.

In fact a lot of the patchset is likely to be injurious to small
machines.  We should quantify this and then persuade ourselves that the
large-machine gains are worth the small-machine losses.

> Items are placed on lists
> according to the node the memory belongs to. To make scanning the
> lists efficient, also track whether the per-node lists have entries
> in them in an active nodemask.
> 
> Note:
> We use a fixed-size array for the node LRU, this struct can be very big
> if MAX_NUMNODES is big. If this becomes a problem this is fixable by
> turning this into a pointer and dynamically allocating this to
> nr_node_ids. This quantity is firmware-provided, and still would provide
> room for all nodes at the cost of a pointer lookup and an extra
> allocation. Because that allocation will most likely come from a
> different slab cache than the main structure holding this structure, we
> may very well fail.

Surprised.  How big is MAX_NUMNODES likely to get?

lib/flex_array.c might be of use here.
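
To make the trade-off in the quoted note concrete, a dynamically sized
node array might look like the userspace sketch below. It is
illustrative only: the sketch_* names are made up, calloc() stands in
for a kernel allocator, and the nr_node_ids parameter stands in for the
kernel variable of the same name. It does not use lib/flex_array.c.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Hypothetical per-node bookkeeping; the real one also has a lock and list. */
struct sketch_lru_node {
	long nr_items;
};

struct sketch_dyn_lru {
	struct sketch_lru_node *node;	/* nr_nodes entries, heap allocated */
	int nr_nodes;
};

/*
 * Size the array to the number of possible nodes instead of a
 * compile-time maximum. Unlike the fixed-size array, this can fail.
 */
static int sketch_dyn_lru_init(struct sketch_dyn_lru *lru, int nr_node_ids)
{
	lru->node = NULL;
	lru->nr_nodes = 0;

	if (nr_node_ids <= 0)
		return -EINVAL;

	lru->node = calloc((size_t)nr_node_ids, sizeof(*lru->node));
	if (!lru->node)
		return -ENOMEM;

	lru->nr_nodes = nr_node_ids;
	return 0;
}

static void sketch_dyn_lru_destroy(struct sketch_dyn_lru *lru)
{
	free(lru->node);
	lru->node = NULL;
	lru->nr_nodes = 0;
}
```

The costs named in the note are both visible here: every access goes
through the node pointer, and init gains a failure path that each
caller has to handle and later undo with a destroy call.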

>
> ...
>
> -struct list_lru {
> +struct list_lru_node {
>  	spinlock_t		lock;
>  	struct list_head	list;
>  	long			nr_items;
> +} ____cacheline_aligned_in_smp;
> +
> +struct list_lru {
> +	/*
> +	 * Because we use a fixed-size array, this struct can be very big if
> +	 * MAX_NUMNODES is big. If this becomes a problem this is fixable by
> +	 * turning this into a pointer and dynamically allocating this to
> +	 * nr_node_ids. This quantity is firwmare-provided, and still would
> +	 * provide room for all nodes at the cost of a pointer lookup and an
> +	 * extra allocation. Because that allocation will most likely come from
> +	 * a different slab cache than the main structure holding this
> +	 * structure, we may very well fail.
> +	 */
> +	struct list_lru_node	node[MAX_NUMNODES];
> +	nodemask_t		active_nodes;

Some documentation of the data structure would be helpful.  It appears
that active_nodes tracks (ie: duplicates) node[x].nr_items!=0.

It's unclear that active_nodes is really needed - we could just iterate
across all items in list_lru.node[].  Are we sure that the correct
tradeoff decision was made here?

What's the story on NUMA node hotplug, btw?

>  };
>  
>
> ...
>
>  unsigned long
> -list_lru_walk(
> -	struct list_lru *lru,
> -	list_lru_walk_cb isolate,
> -	void		*cb_arg,
> -	unsigned long	nr_to_walk)
> +list_lru_count(struct list_lru *lru)
>  {
> +	long count = 0;
> +	int nid;
> +
> +	for_each_node_mask(nid, lru->active_nodes) {
> +		struct list_lru_node *nlru = &lru->node[nid];
> +
> +		spin_lock(&nlru->lock);
> +		BUG_ON(nlru->nr_items < 0);

This is buggy.

The bit in lru->active_nodes could be cleared by now.  We can only make
this assertion if we recheck lru->active_nodes[nid] inside the
spinlocked region.
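
A userspace model of that recheck, for illustration only: the sketch_*
names are hypothetical, a C11 atomic_flag stands in for the spinlock,
and an atomic bitmask stands in for nodemask_t. The assertion used here
(bit set implies nr_items > 0) is an assumption about which invariant
is wanted, and is stronger than the BUG_ON() in the patch.

```c
#include <assert.h>
#include <stdatomic.h>

#define SKETCH_MAX_NODES 4

struct sketch_node {
	atomic_flag lock;	/* stand-in for spinlock_t */
	long nr_items;
};

struct sketch_lru {
	struct sketch_node node[SKETCH_MAX_NODES];
	atomic_ulong active_nodes;	/* bit n set: node n has items */
};

static void sketch_lock(struct sketch_node *n)
{
	while (atomic_flag_test_and_set_explicit(&n->lock,
						 memory_order_acquire))
		;	/* spin until the holder clears the flag */
}

static void sketch_unlock(struct sketch_node *n)
{
	atomic_flag_clear_explicit(&n->lock, memory_order_release);
}

static void sketch_init(struct sketch_lru *lru)
{
	atomic_init(&lru->active_nodes, 0);
	for (int i = 0; i < SKETCH_MAX_NODES; i++) {
		atomic_flag_clear(&lru->node[i].lock);
		lru->node[i].nr_items = 0;
	}
}

/* Set a node's item count, updating its mask bit under the node lock. */
static void sketch_set_items(struct sketch_lru *lru, int nid, long nr)
{
	struct sketch_node *n = &lru->node[nid];

	sketch_lock(n);
	n->nr_items = nr;
	if (nr > 0)
		atomic_fetch_or(&lru->active_nodes, 1UL << nid);
	else
		atomic_fetch_and(&lru->active_nodes, ~(1UL << nid));
	sketch_unlock(n);
}

static long sketch_count(struct sketch_lru *lru)
{
	/* Unlocked snapshot: any bit in it may be stale when we get there. */
	unsigned long snapshot = atomic_load(&lru->active_nodes);
	long count = 0;

	for (int nid = 0; nid < SKETCH_MAX_NODES; nid++) {
		struct sketch_node *n = &lru->node[nid];

		if (!(snapshot & (1UL << nid)))
			continue;

		sketch_lock(n);
		/* Recheck under the lock before asserting anything. */
		if (atomic_load(&lru->active_nodes) & (1UL << nid)) {
			assert(n->nr_items > 0);
			count += n->nr_items;
		}
		sketch_unlock(n);
	}
	return count;
}
```

Even with the recheck, the sum is only a snapshot: each node is sampled
under its own lock, so nodes visited earlier can change before the
function returns.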

> +		count += nlru->nr_items;
> +		spin_unlock(&nlru->lock);
> +	}
> +
> +	return count;
> +}
> +EXPORT_SYMBOL_GPL(list_lru_count);

list_lru_count()'s return value is of course approximate.  If callers
require that the returned value be exact, they will need to provide
their own locking on top of list_lru's internal locking (which would
then become redundant).

This is the sort of thing which should be discussed in the interface
documentation.

list_lru_count() can be very expensive.

>
> ...
>



* Re: [PATCH v10 11/35] list_lru: per-node list infrastructure
  2013-06-05 23:08     ` Andrew Morton
@ 2013-06-06  3:21       ` Dave Chinner
  2013-06-06  3:51         ` Andrew Morton
  2013-06-06  8:21         ` Glauber Costa
  2013-06-06 16:15       ` Glauber Costa
  1 sibling, 2 replies; 103+ messages in thread
From: Dave Chinner @ 2013-06-06  3:21 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Glauber Costa, linux-fsdevel, Mel Gorman, linux-mm, cgroups,
	kamezawa.hiroyu, Michal Hocko, Johannes Weiner, hughd,
	Greg Thelen, Dave Chinner

On Wed, Jun 05, 2013 at 04:08:04PM -0700, Andrew Morton wrote:
> On Mon,  3 Jun 2013 23:29:40 +0400 Glauber Costa <glommer@openvz.org> wrote:
> 
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > Now that we have an LRU list API, we can start to enhance the
> > implementation.  This splits the single LRU list into per-node lists
> > and locks to enhance scalability.
> 
> Do we have any runtime measurements?  They're pretty important for
> justifying inclusion of the code.

Nothing I've officially posted, because I've been busy with other
XFS stuff. But, well, if you look here:

http://oss.sgi.com/pipermail/xfs/2013-June/026888.html

-  12.74%  [kernel]  [k] __ticket_spin_trylock
   - __ticket_spin_trylock
      - 60.49% _raw_spin_lock
         + 91.79% inode_add_lru			>>> inode_lru_lock
         + 2.98% dentry_lru_del			>>> dcache_lru_lock
         + 1.30% shrink_dentry_list
         + 0.71% evict
      - 20.42% do_raw_spin_lock
         - _raw_spin_lock
            + 13.41% inode_add_lru		>>> inode_lru_lock
            + 10.55% evict
            + 8.26% dentry_lru_del		>>> dcache_lru_lock
            + 7.62% __remove_inode_hash
....
      - 10.37% do_raw_spin_trylock
         - _raw_spin_trylock
            + 79.65% prune_icache_sb		>>> inode_lru_lock
            + 11.04% shrink_dentry_list
            + 9.24% prune_dcache_sb		>>> dcache_lru_lock
      - 8.72% _raw_spin_trylock
         + 46.33% prune_icache_sb		>>> inode_lru_lock
         + 46.08% shrink_dentry_list
         + 7.60% prune_dcache_sb		>>> dcache_lru_lock

This is from an 8p system w/ fake-numa=4 running an 8-way find+stat
workload on 50 million files. 12.5% CPU usage means we are burning
an entire CPU of that system just in __ticket_spin_trylock(), and
the numbers above indicate that roughly 60% of that CPU time is from
the inode_lru_lock.

So, more than half a CPU being spent just trying to get the
inode_lru_lock. The generic LRU list code drops
__ticket_spin_trylock() back down to roughly 2% of the total CPU
usage for the same workload - the CPU burn associated with the
contention on the global lock goes away.

It's pretty obvious that if a global lock is causing contention issues
on an 8p system, then larger systems are going to be much, much worse.

> Measurements for non-NUMA and uniprocessor kernels would be useful in
> making that decision as well.

I get the same spinlock contention problems when I run without the
fake-numa kernel parameter on the VM. The generic LRU lists can't
fix the problem for non-numa systems.

> In fact a lot of the patchset is likely to be injurious to small
> machines.  We should quantify this and then persuade ourselves that the
> large-machine gains are worth the small-machine losses.

I haven't been able to measure any CPU usage difference from the
changes for non-numa systems on workloads that stress the LRUs. If
you've got any ideas on how I might demonstrate a regression, then
I'm all ears. But if I can't measure the difference, there is
none...

> 
> > Items are placed on lists
> > according to the node the memory belongs to. To make scanning the
> > lists efficient, also track whether the per-node lists have entries
> > in them in a active nodemask.
> > 
> > Note:
> > We use a fixed-size array for the node LRU, this struct can be very big
> > if MAX_NUMNODES is big. If this becomes a problem this is fixable by
> > turning this into a pointer and dynamically allocating this to
> > nr_node_ids. This quantity is firmware-provided, and still would provide
> > room for all nodes at the cost of a pointer lookup and an extra
> > allocation. Because that allocation will most likely come from a
> > different slab cache than the main structure holding this structure, we
> > may very well fail.
> 
> Surprised.  How big is MAX_NUMNODES likely to get?

AFAICT, 1024.

> lib/flex_array.c might be of use here.

Never heard of it :/

Perhaps it might, but that would be something to do further down the
track...

> 
> >
> > ...
> >
> > -struct list_lru {
> > +struct list_lru_node {
> >  	spinlock_t		lock;
> >  	struct list_head	list;
> >  	long			nr_items;
> > +} ____cacheline_aligned_in_smp;
> > +
> > +struct list_lru {
> > +	/*
> > +	 * Because we use a fixed-size array, this struct can be very big if
> > +	 * MAX_NUMNODES is big. If this becomes a problem this is fixable by
> > +	 * turning this into a pointer and dynamically allocating this to
> > +	 * nr_node_ids. This quantity is firwmare-provided, and still would
> > +	 * provide room for all nodes at the cost of a pointer lookup and an
> > +	 * extra allocation. Because that allocation will most likely come from
> > +	 * a different slab cache than the main structure holding this
> > +	 * structure, we may very well fail.
> > +	 */
> > +	struct list_lru_node	node[MAX_NUMNODES];
> > +	nodemask_t		active_nodes;
> 
> Some documentation of the data structure would be helpful.  It appears
> that active_nodes tracks (ie: duplicates) node[x].nr_items!=0.
> 
> It's unclear that active_nodes is really needed - we could just iterate
> across all items in list_lru.node[].  Are we sure that the correct
> tradeoff decision was made here?

Yup. Think of all the cache line misses that checking
node[x].nr_items != 0 entails. If MAX_NUMNODES = 1024, there's 1024
cacheline misses right there. The nodemask is a much more cache
friendly method of storing active node state.

Not to mention that for small machines with a large MAX_NUMNODES,
we'd be checking nodes that never have items stored on them...
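
The bookkeeping behind that argument can be modelled in user space. What
follows is only a minimal sketch under stated assumptions, not the list_lru
code: the sketch_* names are hypothetical, a 64-bit word stands in for
nodemask_t, SKETCH_MAX_NODES stands in for MAX_NUMNODES, and locking is left
out entirely. It shows the bit being set on the 0 -> 1 transition, cleared on
the 1 -> 0 transition, and the count loop touching only nodes whose bit is
set.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for MAX_NUMNODES; must fit in the 64-bit mask. */
#define SKETCH_MAX_NODES 64

struct sketch_node {
	long nr_items;
	/* Pad so each node sits on its own (assumed 64-byte) cache line. */
	char pad[64 - sizeof(long)];
};

struct sketch_lru {
	struct sketch_node node[SKETCH_MAX_NODES];
	uint64_t active;	/* bit n set <=> node[n].nr_items != 0 */
};

/* Account one item on node nid; set the bit on the 0 -> 1 transition. */
static void sketch_add(struct sketch_lru *lru, int nid)
{
	if (lru->node[nid].nr_items++ == 0)
		lru->active |= UINT64_C(1) << nid;
}

/* Drop one item from node nid; clear the bit on the 1 -> 0 transition. */
static void sketch_del(struct sketch_lru *lru, int nid)
{
	assert(lru->node[nid].nr_items > 0);
	if (--lru->node[nid].nr_items == 0)
		lru->active &= ~(UINT64_C(1) << nid);
}

/* Sum the per-node counts, reading only nodes whose bit is set. */
static long sketch_count(const struct sketch_lru *lru, int *nodes_visited)
{
	long count = 0;
	uint64_t mask = lru->active;
	int nid;

	*nodes_visited = 0;
	for (nid = 0; mask != 0; nid++, mask >>= 1) {
		if (!(mask & 1))
			continue;
		count += lru->node[nid].nr_items;
		(*nodes_visited)++;
	}
	return count;
}
```

With items on only two nodes, sketch_count() reads two node structures
instead of all SKETCH_MAX_NODES of them, which is the cache-line saving
being described.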

> What's the story on NUMA node hotplug, btw?

Do we care? hotplug doesn't change MAX_NUMNODES, and if you are
removing a node you have to free all the memory on the node,
so that should already be taken care of by external code....

> 
> >  };
> >  
> >
> > ...
> >
> >  unsigned long
> > -list_lru_walk(
> > -	struct list_lru *lru,
> > -	list_lru_walk_cb isolate,
> > -	void		*cb_arg,
> > -	unsigned long	nr_to_walk)
> > +list_lru_count(struct list_lru *lru)
> >  {
> > +	long count = 0;
> > +	int nid;
> > +
> > +	for_each_node_mask(nid, lru->active_nodes) {
> > +		struct list_lru_node *nlru = &lru->node[nid];
> > +
> > +		spin_lock(&nlru->lock);
> > +		BUG_ON(nlru->nr_items < 0);
> 
> This is buggy.

Yup, good catch.

> > +EXPORT_SYMBOL_GPL(list_lru_count);
> 
> list_lru_count()'s return value is of course approximate.  If callers
> require that the returned value be exact, they will need to provide
> their own locking on top of list_lru's internal locking (which would
> then become redundant).
> 
> This is the sort of thing which should be discussed in the interface
> documentation.

Yup.

> list_lru_count() can be very expensive.

Well, yes. But on machines big enough for the cost of list_lru_count()
to be a concern, it's still far less expensive than a global LRU
lock.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH v10 11/35] list_lru: per-node list infrastructure
  2013-06-06  3:21       ` Dave Chinner
@ 2013-06-06  3:51         ` Andrew Morton
  2013-06-06  8:21         ` Glauber Costa
  1 sibling, 0 replies; 103+ messages in thread
From: Andrew Morton @ 2013-06-06  3:51 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Glauber Costa, linux-fsdevel, Mel Gorman, linux-mm, cgroups,
	kamezawa.hiroyu, Michal Hocko, Johannes Weiner, hughd,
	Greg Thelen, Dave Chinner

On Thu, 6 Jun 2013 13:21:07 +1000 Dave Chinner <david@fromorbit.com> wrote:

> > > +struct list_lru {
> > > +	/*
> > > +	 * Because we use a fixed-size array, this struct can be very big if
> > > +	 * MAX_NUMNODES is big. If this becomes a problem this is fixable by
> > > +	 * turning this into a pointer and dynamically allocating this to
> > > +	 * nr_node_ids. This quantity is firwmare-provided, and still would
> > > +	 * provide room for all nodes at the cost of a pointer lookup and an
> > > +	 * extra allocation. Because that allocation will most likely come from
> > > +	 * a different slab cache than the main structure holding this
> > > +	 * structure, we may very well fail.
> > > +	 */
> > > +	struct list_lru_node	node[MAX_NUMNODES];
> > > +	nodemask_t		active_nodes;
> > 
> > Some documentation of the data structure would be helpful.  It appears
> > that active_nodes tracks (ie: duplicates) node[x].nr_items!=0.
> > 
> > It's unclear that active_nodes is really needed - we could just iterate
> > across all items in list_lru.node[].  Are we sure that the correct
> > tradeoff decision was made here?
> 
> Yup. Think of all the cache line misses that checking
> node[x].nr_items != 0 entails. If MAX_NUMNODES = 1024, there's 1024
> cacheline misses right there. The nodemask is a much more cache
> friendly method of storing active node state.

Well, it depends on the relative frequency of list-wide walking.  If
that's "very low" then the cost of maintaining active_nodes could
dominate.

Plus all the callsites which traverse active_nodes will touch
list_lru.node[n] anyway, so the cache-miss impact will be unaltered.

> not to mention that for small machines with a large MAX_NUMNODES,
> we'd be checking nodes that never have items stored on them...

Yes, there is that.

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH v10 11/35] list_lru: per-node list infrastructure
  2013-06-06  3:21       ` Dave Chinner
  2013-06-06  3:51         ` Andrew Morton
@ 2013-06-06  8:21         ` Glauber Costa
  1 sibling, 0 replies; 103+ messages in thread
From: Glauber Costa @ 2013-06-06  8:21 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Andrew Morton, Glauber Costa,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Mel Gorman,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, cgroups-u79uwXL29TY76Z2rM5mHXA,
	kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A, Michal Hocko,
	Johannes Weiner, hughd-hpIqsD4AKlfQT0dZR+AlfA, Greg Thelen,
	Dave Chinner

[-- Attachment #1: Type: text/plain, Size: 1250 bytes --]

On 06/06/2013 07:21 AM, Dave Chinner wrote:
>> It's unclear that active_nodes is really needed - we could just iterate
>> across all items in list_lru.node[].  Are we sure that the correct
>> tradeoff decision was made here?
> Yup. Think of all the cache line misses that checking
> node[x].nr_items != 0 entails. If MAX_NUMNODES = 1024, there's 1024
> cacheline misses right there. The nodemask is a much more cache
> friendly method of storing active node state.
> 
> not to mention that for small machines with a large MAX_NUMNODES,
> we'd be checking nodes that never have items stored on them...
> 
>> What's the story on NUMA node hotplug, btw?
> Do we care? hotplug doesn't change MAX_NUMNODES, and if you are
> removing a node you have to free all the memory on the node,
> so that should already be taken care of by external code....
> 

Mel has already complained about this.
I have a patch that makes it dynamic but I didn't include it in here
because the series was already too big. I was also hoping to get it
on top of the others, to avoid disruption.

I am attaching it here for your appreciation.

For the record, nr_node_ids is firmware provided and it actually covers
possible nodes, not just online nodes. So hotplug won't change that.
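
The trade-off between a fixed MAX_NUMNODES array and arrays sized by
nr_node_ids can be illustrated with a small user-space sketch. This is a
hedged illustration only: the sketch_* names are hypothetical, calloc()/free()
stand in for kzalloc()/kfree(), and -1 stands in for -ENOMEM. The points it
tries to show are that each array is sized by its own element type and that a
failed second allocation unwinds the first.

```c
#include <assert.h>
#include <stdlib.h>

struct sketch_node {
	long nr_items;
};

struct sketch_lru {
	struct sketch_node *node;	/* nr_ids entries */
	long *node_totals;		/* nr_ids entries */
	unsigned int nr_ids;		/* stand-in for nr_node_ids */
};

/* Returns 0 on success, -1 on allocation failure. */
static int sketch_lru_init(struct sketch_lru *lru, unsigned int nr_ids)
{
	lru->node = calloc(nr_ids, sizeof(*lru->node));
	if (!lru->node)
		return -1;

	/* Size this array by its own element type, not by *lru->node. */
	lru->node_totals = calloc(nr_ids, sizeof(*lru->node_totals));
	if (!lru->node_totals) {
		free(lru->node);
		lru->node = NULL;
		return -1;
	}

	lru->nr_ids = nr_ids;
	return 0;
}

/* Release both arrays; safe to call once after a successful init. */
static void sketch_lru_destroy(struct sketch_lru *lru)
{
	free(lru->node_totals);
	free(lru->node);
	lru->node_totals = NULL;
	lru->node = NULL;
	lru->nr_ids = 0;
}
```

The cost Dave's note mentions is visible here: init can now fail, so every
caller that embeds such a structure needs an error path and a matching
destroy call.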



[-- Attachment #2: 0001-list_lru-dynamically-adjust-node-arrays.patch --]
[-- Type: text/x-patch, Size: 7342 bytes --]

>From cfc280ee20d93b1901c5ad2dcb13635ce7703d92 Mon Sep 17 00:00:00 2001
From: Glauber Costa <glommer-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org>
Date: Wed, 22 May 2013 09:55:15 +0400
Subject: [PATCH] list_lru: dynamically adjust node arrays

We currently use a compile-time constant to size the node array for the
list_lru structure. Due to this, we don't need to allocate any memory at
initialization time. But as a consequence, the structures that contain
embedded list_lru lists can become way too big (the superblock for
instance contains two of them).

This patch aims at ameliorating this situation by dynamically allocating
the node arrays with the firmware provided nr_node_ids.

Signed-off-by: Glauber Costa <glommer-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org>
Cc: Dave Chinner <dchinner-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
Cc: Mel Gorman <mgorman-l3A5Bk7waGM@public.gmane.org>
---
 fs/super.c               |  9 +++++++--
 fs/xfs/xfs_buf.c         |  6 +++++-
 fs/xfs/xfs_qm.c          | 10 ++++++++--
 include/linux/list_lru.h | 21 ++++---------------
 lib/list_lru.c           | 52 ++++++++++++++++++++++++++++++++++++++++++------
 5 files changed, 70 insertions(+), 28 deletions(-)

diff --git a/fs/super.c b/fs/super.c
index ff40e33..f8dfcec 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -209,8 +209,10 @@ static struct super_block *alloc_super(struct file_system_type *type, int flags)
 		INIT_HLIST_BL_HEAD(&s->s_anon);
 		INIT_LIST_HEAD(&s->s_inodes);
 
-		list_lru_init_memcg(&s->s_dentry_lru);
-		list_lru_init_memcg(&s->s_inode_lru);
+		if (list_lru_init_memcg(&s->s_dentry_lru))
+			goto err_out;
+		if (list_lru_init_memcg(&s->s_inode_lru))
+			goto err_out_dentry_lru;
 
 		INIT_LIST_HEAD(&s->s_mounts);
 		init_rwsem(&s->s_umount);
@@ -251,6 +253,9 @@ static struct super_block *alloc_super(struct file_system_type *type, int flags)
 	}
 out:
 	return s;
+
+err_out_dentry_lru:
+	list_lru_destroy(&s->s_dentry_lru);
 err_out:
 	security_sb_free(s);
 #ifdef CONFIG_SMP
diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
index 0d7a619..b8cde02 100644
--- a/fs/xfs/xfs_buf.c
+++ b/fs/xfs/xfs_buf.c
@@ -1592,6 +1592,7 @@ xfs_free_buftarg(
 	struct xfs_mount	*mp,
 	struct xfs_buftarg	*btp)
 {
+	list_lru_destroy(&btp->bt_lru);
 	unregister_shrinker(&btp->bt_shrinker);
 
 	if (mp->m_flags & XFS_MOUNT_BARRIER)
@@ -1666,9 +1667,12 @@ xfs_alloc_buftarg(
 	if (!btp->bt_bdi)
 		goto error;
 
-	list_lru_init(&btp->bt_lru);
 	if (xfs_setsize_buftarg_early(btp, bdev))
 		goto error;
+
+	if (list_lru_init(&btp->bt_lru))
+		goto error;
+
 	btp->bt_shrinker.count_objects = xfs_buftarg_shrink_count;
 	btp->bt_shrinker.scan_objects = xfs_buftarg_shrink_scan;
 	btp->bt_shrinker.seeks = DEFAULT_SEEKS;
diff --git a/fs/xfs/xfs_qm.c b/fs/xfs/xfs_qm.c
index 85ca39e..29ea575 100644
--- a/fs/xfs/xfs_qm.c
+++ b/fs/xfs/xfs_qm.c
@@ -780,11 +780,18 @@ xfs_qm_init_quotainfo(
 
 	qinf = mp->m_quotainfo = kmem_zalloc(sizeof(xfs_quotainfo_t), KM_SLEEP);
 
+	if ((error = list_lru_init(&qinf->qi_lru))) {
+		kmem_free(qinf);
+		mp->m_quotainfo = NULL;
+		return error;
+	}
+
 	/*
 	 * See if quotainodes are setup, and if not, allocate them,
 	 * and change the superblock accordingly.
 	 */
 	if ((error = xfs_qm_init_quotainos(mp))) {
+		list_lru_destroy(&qinf->qi_lru);
 		kmem_free(qinf);
 		mp->m_quotainfo = NULL;
 		return error;
@@ -794,8 +801,6 @@ xfs_qm_init_quotainfo(
 	INIT_RADIX_TREE(&qinf->qi_gquota_tree, GFP_NOFS);
 	mutex_init(&qinf->qi_tree_lock);
 
-	list_lru_init(&qinf->qi_lru);
-
 	/* mutex used to serialize quotaoffs */
 	mutex_init(&qinf->qi_quotaofflock);
 
@@ -883,6 +888,7 @@ xfs_qm_destroy_quotainfo(
 	qi = mp->m_quotainfo;
 	ASSERT(qi != NULL);
 
+	list_lru_destroy(&qi->qi_lru);
 	unregister_shrinker(&qi->qi_shrinker);
 
 	if (qi->qi_uquotaip) {
diff --git a/include/linux/list_lru.h b/include/linux/list_lru.h
index dcb67dc..6d6efda 100644
--- a/include/linux/list_lru.h
+++ b/include/linux/list_lru.h
@@ -42,18 +42,8 @@ struct list_lru_array {
 };
 
 struct list_lru {
-	/*
-	 * Because we use a fixed-size array, this struct can be very big if
-	 * MAX_NUMNODES is big. If this becomes a problem this is fixable by
-	 * turning this into a pointer and dynamically allocating this to
-	 * nr_node_ids. This quantity is firwmare-provided, and still would
-	 * provide room for all nodes at the cost of a pointer lookup and an
-	 * extra allocation. Because that allocation will most likely come from
-	 * a different slab cache than the main structure holding this
-	 * structure, we may very well fail.
-	 */
-	struct list_lru_node	node[MAX_NUMNODES];
-	atomic_long_t		node_totals[MAX_NUMNODES];
+	struct list_lru_node	*node;
+	atomic_long_t		*node_totals;
 	nodemask_t		active_nodes;
 #ifdef CONFIG_MEMCG_KMEM
 	/* All memcg-aware LRUs will be chained in the lrus list */
@@ -78,14 +68,11 @@ struct mem_cgroup;
 struct list_lru_array *lru_alloc_array(void);
 int memcg_update_all_lrus(unsigned long num);
 void memcg_destroy_all_lrus(struct mem_cgroup *memcg);
-void list_lru_destroy(struct list_lru *lru);
 int __memcg_init_lru(struct list_lru *lru);
-#else
-static inline void list_lru_destroy(struct list_lru *lru)
-{
-}
 #endif
 
+void list_lru_destroy(struct list_lru *lru);
+
 int __list_lru_init(struct list_lru *lru, bool memcg_enabled);
 static inline int list_lru_init(struct list_lru *lru)
 {
diff --git a/lib/list_lru.c b/lib/list_lru.c
index f919f99..1b38d67 100644
--- a/lib/list_lru.c
+++ b/lib/list_lru.c
@@ -334,7 +334,6 @@ int __memcg_init_lru(struct list_lru *lru)
 {
 	int ret;
 
-	INIT_LIST_HEAD(&lru->lrus);
 	mutex_lock(&all_memcg_lrus_mutex);
 	list_add(&lru->lrus, &all_memcg_lrus);
 	ret = memcg_new_lru(lru);
@@ -369,8 +368,11 @@ out:
 	return ret;
 }
 
-void list_lru_destroy(struct list_lru *lru)
+static void list_lru_destroy_memcg(struct list_lru *lru)
 {
+	if (list_empty(&lru->lrus))
+		return;
+
 	mutex_lock(&all_memcg_lrus_mutex);
 	list_del(&lru->lrus);
 	mutex_unlock(&all_memcg_lrus_mutex);
@@ -388,20 +390,58 @@ void memcg_destroy_all_lrus(struct mem_cgroup *memcg)
 	}
 	mutex_unlock(&all_memcg_lrus_mutex);
 }
+
+int memcg_list_lru_init(struct list_lru *lru, bool memcg_enabled)
+{
+	INIT_LIST_HEAD(&lru->lrus);
+	if (memcg_enabled)
+		return memcg_init_lru(lru);
+
+	return 0;
+}
+#else
+static void list_lru_destroy_memcg(struct list_lru *lru)
+{
+}
+
+int memcg_list_lru_init(struct list_lru *lru, bool memcg_enabled)
+{
+	return 0;
+}
 #endif
 
 int __list_lru_init(struct list_lru *lru, bool memcg_enabled)
 {
 	int i;
 
+	size_t size;
+
+	size = sizeof(*lru->node) * nr_node_ids;
+	lru->node = kzalloc(size, GFP_KERNEL);
+	if (!lru->node)
+		return -ENOMEM;
+
+	size = sizeof(*lru->node_totals) * nr_node_ids;
+	lru->node_totals = kzalloc(size, GFP_KERNEL);
+	if (!lru->node_totals) {
+		kfree(lru->node);
+		return -ENOMEM;
+	}
+
 	nodes_clear(lru->active_nodes);
-	for (i = 0; i < MAX_NUMNODES; i++) {
+	for (i = 0; i < nr_node_ids; i++) {
 		list_lru_init_one(&lru->node[i]);
 		atomic_long_set(&lru->node_totals[i], 0);
 	}
 
-	if (memcg_enabled)
-		return memcg_init_lru(lru);
-	return 0;
+	return memcg_list_lru_init(lru, memcg_enabled);
 }
 EXPORT_SYMBOL_GPL(__list_lru_init);
+
+void list_lru_destroy(struct list_lru *lru)
+{
+	kfree(lru->node);
+	kfree(lru->node_totals);
+	list_lru_destroy_memcg(lru);
+}
+EXPORT_SYMBOL_GPL(list_lru_destroy);
-- 
1.8.1.4


^ permalink raw reply related	[flat|nested] 103+ messages in thread

* Re: [PATCH v10 11/35] list_lru: per-node list infrastructure
  2013-06-05 23:08     ` Andrew Morton
  2013-06-06  3:21       ` Dave Chinner
@ 2013-06-06 16:15       ` Glauber Costa
  2013-06-06 16:48         ` Andrew Morton
  1 sibling, 1 reply; 103+ messages in thread
From: Glauber Costa @ 2013-06-06 16:15 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Glauber Costa, linux-fsdevel, Mel Gorman, Dave Chinner, linux-mm,
	cgroups, kamezawa.hiroyu, Michal Hocko, Johannes Weiner, hughd,
	Greg Thelen, Dave Chinner

On 06/06/2013 03:08 AM, Andrew Morton wrote:
>> +	for_each_node_mask(nid, lru->active_nodes) {
>> +		struct list_lru_node *nlru = &lru->node[nid];
>> +
>> +		spin_lock(&nlru->lock);
>> +		BUG_ON(nlru->nr_items < 0);
> This is buggy.
> 
> The bit in lru->active_nodes could be cleared by now.  We can only make
> this assertion if we recheck lru->active_nodes[nid] inside the
> spinlocked region.
> 
Sorry Andrew, how so?
We will clear that flag if nr_items == 0. nr_items should *never* get to
be less than 0, it doesn't matter if the node is cleared or not.

If the node is cleared, we would expect the following statement to
expand to
   count += nlru->nr_items = 0;
   spin_unlock(&nlru->lock);

Which is actually cheaper than testing for the bit being still set.


^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH v10 11/35] list_lru: per-node list infrastructure
  2013-06-06 16:15       ` Glauber Costa
@ 2013-06-06 16:48         ` Andrew Morton
  0 siblings, 0 replies; 103+ messages in thread
From: Andrew Morton @ 2013-06-06 16:48 UTC (permalink / raw)
  To: Glauber Costa
  Cc: Glauber Costa, linux-fsdevel, Mel Gorman, Dave Chinner, linux-mm,
	cgroups, kamezawa.hiroyu, Michal Hocko, Johannes Weiner, hughd,
	Greg Thelen, Dave Chinner

On Thu, 6 Jun 2013 20:15:07 +0400 Glauber Costa <glommer@parallels.com> wrote:

> On 06/06/2013 03:08 AM, Andrew Morton wrote:
> >> +	for_each_node_mask(nid, lru->active_nodes) {
> >> > +		struct list_lru_node *nlru = &lru->node[nid];
> >> > +
> >> > +		spin_lock(&nlru->lock);
> >> > +		BUG_ON(nlru->nr_items < 0);
> > This is buggy.
> > 
> > The bit in lru->active_nodes could be cleared by now.  We can only make
> > this assertion if we recheck lru->active_nodes[nid] inside the
> > spinlocked region.
> > 
> Sorry Andrew, how so ?
> We will clear that flag if nr_items == 0. nr_items should *never* get to
> be less than 0, it doesn't matter if the node is cleared or not.
> 
> If the node is cleared, we would expect the following statement to
> expand to
>    count += nlru->nr_items = 0;
>    spin_unlock(&nlru->lock);
> 
> Which is actually cheaper than testing for the bit being still set.

Well OK - I didn't actually look at the expression the BUG_ON() was
testing.  You got lucky ;)

My point was that nlru->lock protects ->active_nodes and so the above
code is racy due to a locking error.  I now see that was incorrect -
active_nodes has no locking.

Well, it kinda has accidental locking - nlru->lock happens to protect
this nlru's bit in active_nodes while permitting other nlru's bits to
concurrently change.

The bottom line is that code which does

	if (node_isset(n, active_nodes))
		use(n);

can end up using a node which is no longer in the active_nodes, because
there is no locking.  This is a bit weird and worrisome and might lead
to bugs in the future, at least.  Perhaps we can improve the
maintainability by documenting this at the active_nodes site, dunno.

This code gets changed a lot in later patches and I didn't check to see
if the problem remains in the final product.
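
One way to make the check-then-use pattern safe is to treat the unlocked
test as a hint and re-read the state once the per-node lock is held. The
following is a user-space sketch of that idea under stated assumptions, not
the list_lru implementation: all sketch_* and node_* names are hypothetical,
a C11 atomic_flag spin loop stands in for spinlock_t, and an atomic_uint
stands in for the nodemask.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define SKETCH_NODES 4

struct sketch_node {
	atomic_flag lock;	/* tiny spinlock standing in for spinlock_t */
	long nr_items;		/* protected by lock */
};

struct sketch_lru {
	struct sketch_node node[SKETCH_NODES];
	atomic_uint active;	/* bit n: node n is believed non-empty */
};

static void sketch_init(struct sketch_lru *lru)
{
	for (int i = 0; i < SKETCH_NODES; i++) {
		atomic_flag_clear(&lru->node[i].lock);
		lru->node[i].nr_items = 0;
	}
	atomic_init(&lru->active, 0);
}

static void node_lock(struct sketch_node *n)
{
	while (atomic_flag_test_and_set_explicit(&n->lock, memory_order_acquire))
		;	/* spin until the flag was previously clear */
}

static void node_unlock(struct sketch_node *n)
{
	atomic_flag_clear_explicit(&n->lock, memory_order_release);
}

/* Node n's bit is only ever changed while node n's lock is held. */
static void sketch_add(struct sketch_lru *lru, int nid)
{
	struct sketch_node *n = &lru->node[nid];

	node_lock(n);
	if (n->nr_items++ == 0)
		atomic_fetch_or(&lru->active, 1u << nid);
	node_unlock(n);
}

static void sketch_del(struct sketch_lru *lru, int nid)
{
	struct sketch_node *n = &lru->node[nid];

	node_lock(n);
	if (--n->nr_items == 0)
		atomic_fetch_and(&lru->active, ~(1u << nid));
	node_unlock(n);
}

/*
 * The unlocked bit test is only a hint and may be stale by the time the
 * lock is taken, so the node state is re-read under the lock.
 * Returns true and stores the item count if the node was still non-empty.
 */
static bool sketch_peek(struct sketch_lru *lru, int nid, long *items)
{
	struct sketch_node *n = &lru->node[nid];
	bool ok;

	if (!(atomic_load(&lru->active) & (1u << nid)))
		return false;		/* cheap unlocked hint */

	node_lock(n);
	ok = n->nr_items != 0;		/* authoritative answer, under the lock */
	if (ok)
		*items = n->nr_items;
	node_unlock(n);
	return ok;
}
```

Because each bit changes only under its own node's lock, rechecking nr_items
under that lock is equivalent to rechecking the bit, which matches the
"accidental locking" described above.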

^ permalink raw reply	[flat|nested] 103+ messages in thread

* [PATCH v10 12/35] shrinker: add node awareness
       [not found] ` <1370287804-3481-1-git-send-email-glommer-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org>
                     ` (8 preceding siblings ...)
  2013-06-03 19:29   ` [PATCH v10 11/35] list_lru: per-node " Glauber Costa
@ 2013-06-03 19:29   ` Glauber Costa
  2013-06-05 23:08     ` Andrew Morton
  2013-06-03 19:29   ` [PATCH v10 13/35] vmscan: per-node deferred work Glauber Costa
                     ` (20 subsequent siblings)
  30 siblings, 1 reply; 103+ messages in thread
From: Glauber Costa @ 2013-06-03 19:29 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Mel Gorman, Dave Chinner,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, cgroups-u79uwXL29TY76Z2rM5mHXA,
	kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A, Michal Hocko,
	Johannes Weiner, hughd-hpIqsD4AKlfQT0dZR+AlfA, Greg Thelen,
	Dave Chinner, Glauber Costa

From: Dave Chinner <dchinner-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>

Pass the node of the current zone being reclaimed to shrink_slab(),
allowing the shrinker control nodemask to be set appropriately for
node aware shrinkers.

[ v3: update ashmem ]
Signed-off-by: Dave Chinner <dchinner-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
Signed-off-by: Glauber Costa <glommer-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org>
Acked-by: Mel Gorman <mgorman-l3A5Bk7waGM@public.gmane.org>
---
 drivers/staging/android/ashmem.c |  3 +++
 fs/drop_caches.c                 |  1 +
 include/linux/shrinker.h         |  3 +++
 mm/memory-failure.c              |  2 ++
 mm/vmscan.c                      | 11 ++++++++---
 5 files changed, 17 insertions(+), 3 deletions(-)

diff --git a/drivers/staging/android/ashmem.c b/drivers/staging/android/ashmem.c
index 21a3f72..65f36d7 100644
--- a/drivers/staging/android/ashmem.c
+++ b/drivers/staging/android/ashmem.c
@@ -692,6 +692,9 @@ static long ashmem_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
 				.gfp_mask = GFP_KERNEL,
 				.nr_to_scan = 0,
 			};
+
+			nodes_setall(sc.nodes_to_scan);
+
 			ret = ashmem_shrink(&ashmem_shrinker, &sc);
 			sc.nr_to_scan = ret;
 			ashmem_shrink(&ashmem_shrinker, &sc);
diff --git a/fs/drop_caches.c b/fs/drop_caches.c
index f23d2a7..c3f44e7 100644
--- a/fs/drop_caches.c
+++ b/fs/drop_caches.c
@@ -44,6 +44,7 @@ static void drop_slab(void)
 		.gfp_mask = GFP_KERNEL,
 	};
 
+	nodes_setall(shrink.nodes_to_scan);
 	do {
 		nr_objects = shrink_slab(&shrink, 1000, 1000);
 	} while (nr_objects > 10);
diff --git a/include/linux/shrinker.h b/include/linux/shrinker.h
index c277b4e..98be3ab 100644
--- a/include/linux/shrinker.h
+++ b/include/linux/shrinker.h
@@ -16,6 +16,9 @@ struct shrink_control {
 
 	/* How many slab objects shrinker() should scan and try to reclaim */
 	long nr_to_scan;
+
+	/* shrink from these nodes */
+	nodemask_t nodes_to_scan;
 };
 
 /*
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index ceb0c7f..86788ff 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -248,10 +248,12 @@ void shake_page(struct page *p, int access)
 	 */
 	if (access) {
 		int nr;
+		int nid = page_to_nid(p);
 		do {
 			struct shrink_control shrink = {
 				.gfp_mask = GFP_KERNEL,
 			};
+			node_set(nid, shrink.nodes_to_scan);
 
 			nr = shrink_slab(&shrink, 1000, 1000);
 			if (page_count(p) == 1)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 6ac3ec2..53e647f 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2296,12 +2296,16 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
 		 */
 		if (global_reclaim(sc)) {
 			unsigned long lru_pages = 0;
+
+			nodes_clear(shrink->nodes_to_scan);
 			for_each_zone_zonelist(zone, z, zonelist,
 					gfp_zone(sc->gfp_mask)) {
 				if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL))
 					continue;
 
 				lru_pages += zone_reclaimable_pages(zone);
+				node_set(zone_to_nid(zone),
+					 shrink->nodes_to_scan);
 			}
 
 			shrink_slab(shrink, sc->nr_scanned, lru_pages);
@@ -2769,6 +2773,8 @@ static bool kswapd_shrink_zone(struct zone *zone,
 		return true;
 
 	shrink_zone(zone, sc);
+	nodes_clear(shrink.nodes_to_scan);
+	node_set(zone_to_nid(zone), shrink.nodes_to_scan);
 
 	reclaim_state->reclaimed_slab = 0;
 	nr_slab = shrink_slab(&shrink, sc->nr_scanned, lru_pages);
@@ -3477,10 +3483,9 @@ static int __zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
 		 * number of slab pages and shake the slab until it is reduced
 		 * by the same nr_pages that we used for reclaiming unmapped
 		 * pages.
-		 *
-		 * Note that shrink_slab will free memory on all zones and may
-		 * take a long time.
 		 */
+		nodes_clear(shrink.nodes_to_scan);
+		node_set(zone_to_nid(zone), shrink.nodes_to_scan);
 		for (;;) {
 			unsigned long lru_pages = zone_reclaimable_pages(zone);
 
-- 
1.8.1.4

^ permalink raw reply related	[flat|nested] 103+ messages in thread

* Re: [PATCH v10 12/35] shrinker: add node awareness
  2013-06-03 19:29   ` [PATCH v10 12/35] shrinker: add node awareness Glauber Costa
@ 2013-06-05 23:08     ` Andrew Morton
  2013-06-06  3:26       ` Dave Chinner
       [not found]       ` <20130605160810.5b203c3368b9df7d087ee3b1-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org>
  0 siblings, 2 replies; 103+ messages in thread
From: Andrew Morton @ 2013-06-05 23:08 UTC (permalink / raw)
  To: Glauber Costa
  Cc: linux-fsdevel, Mel Gorman, Dave Chinner, linux-mm, cgroups,
	kamezawa.hiroyu, Michal Hocko, Johannes Weiner, hughd,
	Greg Thelen, Dave Chinner

On Mon,  3 Jun 2013 23:29:41 +0400 Glauber Costa <glommer@openvz.org> wrote:

> From: Dave Chinner <dchinner@redhat.com>
> 
> Pass the node of the current zone being reclaimed to shrink_slab(),
> allowing the shrinker control nodemask to be set appropriately for
> node aware shrinkers.

Again, some musings on node hotplug would be interesting.

> --- a/drivers/staging/android/ashmem.c
> +++ b/drivers/staging/android/ashmem.c
> @@ -692,6 +692,9 @@ static long ashmem_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
>  				.gfp_mask = GFP_KERNEL,
>  				.nr_to_scan = 0,
>  			};
> +
> +			nodes_setall(sc.nodes_to_scan);

hm, is there some way to do this within the initializer? ie:

				.nodes_to_scan = magic_goes_here(),

Also, it's a bit sad to set bits for not-present and not-online nodes.

>  			ret = ashmem_shrink(&ashmem_shrinker, &sc);
>  			sc.nr_to_scan = ret;
>  			ashmem_shrink(&ashmem_shrinker, &sc);
>
> ...
>


^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH v10 12/35] shrinker: add node awareness
  2013-06-05 23:08     ` Andrew Morton
@ 2013-06-06  3:26       ` Dave Chinner
  2013-06-06  3:54         ` Andrew Morton
       [not found]       ` <20130605160810.5b203c3368b9df7d087ee3b1-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org>
  1 sibling, 1 reply; 103+ messages in thread
From: Dave Chinner @ 2013-06-06  3:26 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Glauber Costa, linux-fsdevel, Mel Gorman, linux-mm, cgroups,
	kamezawa.hiroyu, Michal Hocko, Johannes Weiner, hughd,
	Greg Thelen, Dave Chinner

On Wed, Jun 05, 2013 at 04:08:10PM -0700, Andrew Morton wrote:
> On Mon,  3 Jun 2013 23:29:41 +0400 Glauber Costa <glommer@openvz.org> wrote:
> 
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > Pass the node of the current zone being reclaimed to shrink_slab(),
> > allowing the shrinker control nodemask to be set appropriately for
> > node aware shrinkers.
> 
> Again, some musings on node hotplug would be interesting.
> 
> > --- a/drivers/staging/android/ashmem.c
> > +++ b/drivers/staging/android/ashmem.c
> > @@ -692,6 +692,9 @@ static long ashmem_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
> >  				.gfp_mask = GFP_KERNEL,
> >  				.nr_to_scan = 0,
> >  			};
> > +
> > +			nodes_setall(sc.nodes_to_scan);
> 
> hm, is there some way to do this within the initializer? ie:
> 
> 				.nodes_to_scan = magic_goes_here(),

Nothing obvious - it's essentially a memset call, so I'm not sure
how that could be put in the initialiser...

> Also, it's a bit sad to set bits for not-present and not-online nodes.

Yup. Plenty of scope for future optimisation.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH v10 12/35] shrinker: add node awareness
  2013-06-06  3:26       ` Dave Chinner
@ 2013-06-06  3:54         ` Andrew Morton
  0 siblings, 0 replies; 103+ messages in thread
From: Andrew Morton @ 2013-06-06  3:54 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Glauber Costa, linux-fsdevel, Mel Gorman, linux-mm, cgroups,
	kamezawa.hiroyu, Michal Hocko, Johannes Weiner, hughd,
	Greg Thelen, Dave Chinner

On Thu, 6 Jun 2013 13:26:59 +1000 Dave Chinner <david@fromorbit.com> wrote:

> On Wed, Jun 05, 2013 at 04:08:10PM -0700, Andrew Morton wrote:
> > On Mon,  3 Jun 2013 23:29:41 +0400 Glauber Costa <glommer@openvz.org> wrote:
> > 
> > > From: Dave Chinner <dchinner@redhat.com>
> > > 
> > > Pass the node of the current zone being reclaimed to shrink_slab(),
> > > allowing the shrinker control nodemask to be set appropriately for
> > > node aware shrinkers.
> > 
> > Again, some musings on node hotplug would be interesting.
> > 
> > > --- a/drivers/staging/android/ashmem.c
> > > +++ b/drivers/staging/android/ashmem.c
> > > @@ -692,6 +692,9 @@ static long ashmem_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
> > >  				.gfp_mask = GFP_KERNEL,
> > >  				.nr_to_scan = 0,
> > >  			};
> > > +
> > > +			nodes_setall(sc.nodes_to_scan);
> > 
> > hm, is there some way to do this within the initializer? ie:
> > 
> > 				.nodes_to_scan = magic_goes_here(),
> 
> Nothing obvious - it's essentially a memset call, so I'm not sure
> how that could be put in the initialiser...

I was thinking something like

		.nodes_to_scan = node_online_map,

which would solve both problems.  But node_online_map is nowhere near
the appropriate type, ho-hum.

We could newly accumulate such a thing in register_one_node(), but I
don't see a need.
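
To make the initializer idea concrete, here is a minimal userspace sketch, not kernel code. It assumes a hypothetical struct-typed "online" bitmap (model_nodemask) that can be copied by value, which is exactly what the message above notes is not readily available for node_online_map. All model_* names and MODEL_MAX_NODES are invented for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define MODEL_MAX_NODES 64	/* hypothetical stand-in for the node limit */

/* Hypothetical userspace node bitmap; not the kernel's nodemask_t. */
struct model_nodemask {
	unsigned long long bits;
};

/* Hypothetical cut-down shrink_control carrying only the fields used here. */
struct model_shrink_control {
	unsigned long nr_to_scan;
	struct model_nodemask nodes_to_scan;
};

/* In the spirit of nodes_setall(): set every bit, online or not. */
static void model_setall(struct model_nodemask *mask)
{
	memset(mask, 0xff, sizeof(*mask));
}

static bool model_test(const struct model_nodemask *mask, int nid)
{
	assert(nid >= 0 && nid < MODEL_MAX_NODES);
	return (mask->bits >> nid) & 1ULL;
}

/* Count the nodes selected in the mask. */
static int model_weight(const struct model_nodemask *mask)
{
	int n = 0;

	for (int nid = 0; nid < MODEL_MAX_NODES; nid++)
		n += model_test(mask, nid);
	return n;
}

/* Fill the mask inside the initializer by copying a struct-typed map. */
static struct model_shrink_control
model_make_sc(const struct model_nodemask *online)
{
	struct model_shrink_control sc = {
		.nr_to_scan = 0,
		.nodes_to_scan = *online,
	};
	return sc;
}
```

With such a map the separate set-all call goes away and only online nodes are selected; without it, the memset-style call after the initializer remains the simplest option.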



^ permalink raw reply	[flat|nested] 103+ messages in thread

[parent not found: <20130605160810.5b203c3368b9df7d087ee3b1-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org>]

* Re: [PATCH v10 12/35] shrinker: add node awareness
       [not found]       ` <20130605160810.5b203c3368b9df7d087ee3b1-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org>
@ 2013-06-06  8:23         ` Glauber Costa
  0 siblings, 0 replies; 103+ messages in thread
From: Glauber Costa @ 2013-06-06  8:23 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Glauber Costa, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Mel Gorman,
	Dave Chinner, linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
	cgroups-u79uwXL29TY76Z2rM5mHXA,
	kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A, Michal Hocko,
	Johannes Weiner, hughd-hpIqsD4AKlfQT0dZR+AlfA, Greg Thelen,
	Dave Chinner

On 06/06/2013 03:08 AM, Andrew Morton wrote:
> On Mon,  3 Jun 2013 23:29:41 +0400 Glauber Costa <glommer-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org> wrote:
> 
>> From: Dave Chinner <dchinner-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
>>
>> Pass the node of the current zone being reclaimed to shrink_slab(),
>> allowing the shrinker control nodemask to be set appropriately for
>> node aware shrinkers.
> 
> Again, some musings on node hotplug would be interesting.
> 
>> --- a/drivers/staging/android/ashmem.c
>> +++ b/drivers/staging/android/ashmem.c
>> @@ -692,6 +692,9 @@ static long ashmem_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
>>  				.gfp_mask = GFP_KERNEL,
>>  				.nr_to_scan = 0,
>>  			};
>> +
>> +			nodes_setall(sc.nodes_to_scan);
> 
> hm, is there some way to do this within the initializer? ie:
> 
> 				.nodes_to_scan = magic_goes_here(),
> 
> Also, it's a bit sad to set bits for not-present and not-online nodes.
> 

Unfortunately there is no "nodes_setpresent" or anything like that in
nodemask.h. Maybe I should just go ahead and write them.

^ permalink raw reply	[flat|nested] 103+ messages in thread

* [PATCH v10 13/35] vmscan: per-node deferred work
       [not found] ` <1370287804-3481-1-git-send-email-glommer-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org>
                     ` (9 preceding siblings ...)
  2013-06-03 19:29   ` [PATCH v10 12/35] shrinker: add node awareness Glauber Costa
@ 2013-06-03 19:29   ` Glauber Costa
  2013-06-05 23:08     ` Andrew Morton
  2013-06-03 19:29   ` [PATCH v10 14/35] list_lru: per-node API Glauber Costa
                     ` (19 subsequent siblings)
  30 siblings, 1 reply; 103+ messages in thread
From: Glauber Costa @ 2013-06-03 19:29 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Mel Gorman, Dave Chinner,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, cgroups-u79uwXL29TY76Z2rM5mHXA,
	kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A, Michal Hocko,
	Johannes Weiner, hughd-hpIqsD4AKlfQT0dZR+AlfA, Greg Thelen,
	Glauber Costa, Dave Chinner

We already keep per-node LRU lists for objects being shrunk, but the
work that is deferred from one run to another is kept global. This
creates an impedance problem, where upon node pressure, work deferred
will accumulate and end up being flushed in other nodes.

In large machines, many nodes can accumulate at the same time, all
adding to the global counter.  As we accumulate more and more, we start
to ask for the caches to flush even bigger numbers. The result is that
the caches are depleted and do not stabilize. To achieve stable steady
state behavior, we need to tackle it differently.

In this patch we keep the deferred count per-node, and will never
accumulate that to other nodes.

Signed-off-by: Glauber Costa <glommer-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org>
Cc: Dave Chinner <dchinner-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
Cc: Mel Gorman <mgorman-l3A5Bk7waGM@public.gmane.org>
---
 include/linux/shrinker.h |  30 +++++-
 mm/vmscan.c              | 245 ++++++++++++++++++++++++++++-------------------
 2 files changed, 175 insertions(+), 100 deletions(-)

diff --git a/include/linux/shrinker.h b/include/linux/shrinker.h
index 98be3ab..d70b123 100644
--- a/include/linux/shrinker.h
+++ b/include/linux/shrinker.h
@@ -19,6 +19,8 @@ struct shrink_control {
 
 	/* shrink from these nodes */
 	nodemask_t nodes_to_scan;
+	/* current node being shrunk (for NUMA aware shrinkers) */
+	int nid;
 };
 
 /*
@@ -42,6 +44,8 @@ struct shrink_control {
  * objects freed during the scan, or -1 if progress cannot be made due to
  * potential deadlocks. If -1 is returned, then no further attempts to call the
  * @scan_objects will be made from the current reclaim context.
+ *
+ * @flags determine the shrinker abilities, like numa awareness 
  */
 struct shrinker {
 	int (*shrink)(struct shrinker *, struct shrink_control *sc);
@@ -50,12 +54,34 @@ struct shrinker {
 
 	int seeks;	/* seeks to recreate an obj */
 	long batch;	/* reclaim batch size, 0 = default */
+	unsigned long flags;
 
 	/* These are for internal use */
 	struct list_head list;
-	atomic_long_t nr_in_batch; /* objs pending delete */
+	/*
+	 * We would like to avoid allocating memory when registering a new
+	 * shrinker. All shrinkers will need to keep track of deferred objects,
+	 * and we need a counter for this. If the shrinkers are not NUMA aware,
+	 * this is a small and bounded space that fits into an atomic_long_t.
+	 * This is because that the deferring decisions are global, and we will
+	 * not allocate in this case.
+	 *
+	 * When the shrinker is NUMA aware, we will need this to be a per-node
+	 * array. Numerically speaking, the minority of shrinkers are NUMA
+	 * aware, so this saves quite a bit.
+	 */
+	union {
+		/* objs pending delete */
+		atomic_long_t nr_deferred;
+		/* objs pending delete, per node */
+		atomic_long_t *nr_deferred_node;
+	};
 };
 #define DEFAULT_SEEKS 2 /* A good number if you don't know better. */
-extern void register_shrinker(struct shrinker *);
+
+/* Flags */
+#define SHRINKER_NUMA_AWARE (1 << 0)
+
+extern int register_shrinker(struct shrinker *);
 extern void unregister_shrinker(struct shrinker *);
 #endif
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 53e647f..08eec9d 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -155,14 +155,36 @@ static unsigned long get_lru_size(struct lruvec *lruvec, enum lru_list lru)
 }
 
 /*
- * Add a shrinker callback to be called from the vm
+ * Add a shrinker callback to be called from the vm.
+ *
+ * It cannot fail, unless the flag SHRINKER_NUMA_AWARE is specified.
+ * With this flag set, this function will allocate memory and may fail.
  */
-void register_shrinker(struct shrinker *shrinker)
+int register_shrinker(struct shrinker *shrinker)
 {
-	atomic_long_set(&shrinker->nr_in_batch, 0);
+	/*
+	 * If we only have one possible node in the system anyway, save
+	 * ourselves the trouble and disable NUMA aware behavior. This way we
+	 * will allocate nothing and save memory and some small loop time
+	 * later.
+	 */
+	if (nr_node_ids == 1)
+		shrinker->flags &= ~SHRINKER_NUMA_AWARE;
+
+	if (shrinker->flags & SHRINKER_NUMA_AWARE) {
+		size_t size;
+
+		size = sizeof(*shrinker->nr_deferred_node) * nr_node_ids;
+		shrinker->nr_deferred_node = kzalloc(size, GFP_KERNEL);
+		if (!shrinker->nr_deferred_node)
+			return -ENOMEM;
+	} else
+		atomic_long_set(&shrinker->nr_deferred, 0);
+
 	down_write(&shrinker_rwsem);
 	list_add_tail(&shrinker->list, &shrinker_list);
 	up_write(&shrinker_rwsem);
+	return 0;
 }
 EXPORT_SYMBOL(register_shrinker);
 
@@ -186,6 +208,116 @@ static inline int do_shrinker_shrink(struct shrinker *shrinker,
 }
 
 #define SHRINK_BATCH 128
+
+static unsigned long
+shrink_slab_node(struct shrink_control *shrinkctl, struct shrinker *shrinker,
+		 unsigned long nr_pages_scanned, unsigned long lru_pages,
+		 atomic_long_t *deferred)
+{
+	unsigned long freed = 0;
+	unsigned long long delta;
+	long total_scan;
+	long max_pass;
+	long nr;
+	long new_nr;
+	long batch_size = shrinker->batch ? shrinker->batch
+					  : SHRINK_BATCH;
+
+	if (shrinker->scan_objects) {
+		max_pass = shrinker->count_objects(shrinker, shrinkctl);
+		WARN_ON(max_pass < 0);
+	} else
+		max_pass = do_shrinker_shrink(shrinker, shrinkctl, 0);
+	if (max_pass <= 0)
+		return 0;
+
+	/*
+	 * copy the current shrinker scan count into a local variable
+	 * and zero it so that other concurrent shrinker invocations
+	 * don't also do this scanning work.
+	 */
+	nr = atomic_long_xchg(deferred, 0);
+
+	total_scan = nr;
+	delta = (4 * nr_pages_scanned) / shrinker->seeks;
+	delta *= max_pass;
+	do_div(delta, lru_pages + 1);
+	total_scan += delta;
+	if (total_scan < 0) {
+		printk(KERN_ERR
+		"shrink_slab: %pF negative objects to delete nr=%ld\n",
+		       shrinker->shrink, total_scan);
+		total_scan = max_pass;
+	}
+
+	/*
+	 * We need to avoid excessive windup on filesystem shrinkers
+	 * due to large numbers of GFP_NOFS allocations causing the
+	 * shrinkers to return -1 all the time. This results in a large
+	 * nr being built up so when a shrink that can do some work
+	 * comes along it empties the entire cache due to nr >>>
+	 * max_pass.  This is bad for sustaining a working set in
+	 * memory.
+	 *
+	 * Hence only allow the shrinker to scan the entire cache when
+	 * a large delta change is calculated directly.
+	 */
+	if (delta < max_pass / 4)
+		total_scan = min(total_scan, max_pass / 2);
+
+	/*
+	 * Avoid risking looping forever due to too large nr value:
+	 * never try to free more than twice the estimate number of
+	 * freeable entries.
+	 */
+	if (total_scan > max_pass * 2)
+		total_scan = max_pass * 2;
+
+	trace_mm_shrink_slab_start(shrinker, shrinkctl, nr,
+				nr_pages_scanned, lru_pages,
+				max_pass, delta, total_scan);
+
+	while (total_scan >= batch_size) {
+		long ret;
+
+		if (shrinker->scan_objects) {
+			shrinkctl->nr_to_scan = batch_size;
+			ret = shrinker->scan_objects(shrinker, shrinkctl);
+
+			if (ret == -1)
+				break;
+			freed += ret;
+		} else {
+			int nr_before;
+			nr_before = do_shrinker_shrink(shrinker, shrinkctl, 0);
+			ret = do_shrinker_shrink(shrinker, shrinkctl,
+							batch_size);
+			if (ret == -1)
+				break;
+			if (ret < nr_before)
+				freed += nr_before - ret;
+		}
+
+		count_vm_events(SLABS_SCANNED, batch_size);
+		total_scan -= batch_size;
+
+		cond_resched();
+	}
+
+	/*
+	 * move the unused scan count back into the shrinker in a
+	 * manner that handles concurrent updates. If we exhausted the
+	 * scan, there is no need to do an update.
+	 */
+	if (total_scan > 0)
+		new_nr = atomic_long_add_return(total_scan, deferred);
+	else
+		new_nr = atomic_long_read(deferred);
+
+	trace_mm_shrink_slab_end(shrinker, freed, nr, new_nr);
+	return freed;
+}
+
 /*
  * Call the shrink functions to age shrinkable caches
  *
@@ -222,107 +354,24 @@ unsigned long shrink_slab(struct shrink_control *shrinkctl,
 	}
 
 	list_for_each_entry(shrinker, &shrinker_list, list) {
-		unsigned long long delta;
-		long total_scan;
-		long max_pass;
-		long nr;
-		long new_nr;
-		long batch_size = shrinker->batch ? shrinker->batch
-						  : SHRINK_BATCH;
 
-		if (shrinker->scan_objects) {
-			max_pass = shrinker->count_objects(shrinker, shrinkctl);
-			WARN_ON(max_pass < 0);
-		} else
-			max_pass = do_shrinker_shrink(shrinker, shrinkctl, 0);
-		if (max_pass <= 0)
-			continue;
+		if (!(shrinker->flags & SHRINKER_NUMA_AWARE)) {
+			shrinkctl->nid = 0;
 
-		/*
-		 * copy the current shrinker scan count into a local variable
-		 * and zero it so that other concurrent shrinker invocations
-		 * don't also do this scanning work.
-		 */
-		nr = atomic_long_xchg(&shrinker->nr_in_batch, 0);
-
-		total_scan = nr;
-		delta = (4 * nr_pages_scanned) / shrinker->seeks;
-		delta *= max_pass;
-		do_div(delta, lru_pages + 1);
-		total_scan += delta;
-		if (total_scan < 0) {
-			printk(KERN_ERR
-			"shrink_slab: %pF negative objects to delete nr=%ld\n",
-			       shrinker->shrink, total_scan);
-			total_scan = max_pass;
+			freed += shrink_slab_node(shrinkctl, shrinker,
+				 nr_pages_scanned, lru_pages,
+				 &shrinker->nr_deferred);
+			continue;
 		}
 
-		/*
-		 * We need to avoid excessive windup on filesystem shrinkers
-		 * due to large numbers of GFP_NOFS allocations causing the
-		 * shrinkers to return -1 all the time. This results in a large
-		 * nr being built up so when a shrink that can do some work
-		 * comes along it empties the entire cache due to nr >>>
-		 * max_pass.  This is bad for sustaining a working set in
-		 * memory.
-		 *
-		 * Hence only allow the shrinker to scan the entire cache when
-		 * a large delta change is calculated directly.
-		 */
-		if (delta < max_pass / 4)
-			total_scan = min(total_scan, max_pass / 2);
-
-		/*
-		 * Avoid risking looping forever due to too large nr value:
-		 * never try to free more than twice the estimate number of
-		 * freeable entries.
-		 */
-		if (total_scan > max_pass * 2)
-			total_scan = max_pass * 2;
-
-		trace_mm_shrink_slab_start(shrinker, shrinkctl, nr,
-					nr_pages_scanned, lru_pages,
-					max_pass, delta, total_scan);
-
-		while (total_scan >= batch_size) {
-			long ret;
-
-			if (shrinker->scan_objects) {
-				shrinkctl->nr_to_scan = batch_size;
-				ret = shrinker->scan_objects(shrinker, shrinkctl);
-
-				if (ret == -1)
-					break;
-				freed += ret;
-			} else {
-				int nr_before;
-				nr_before = do_shrinker_shrink(shrinker, shrinkctl, 0);
-				ret = do_shrinker_shrink(shrinker, shrinkctl,
-								batch_size);
-				if (ret == -1)
-					break;
-				if (ret < nr_before)
-					freed += nr_before - ret;
-			}
-
-			count_vm_events(SLABS_SCANNED, batch_size);
-			total_scan -= batch_size;
+		for_each_node_mask(shrinkctl->nid, shrinkctl->nodes_to_scan) {
+			if (!node_online(shrinkctl->nid))
+				continue;
 
-			cond_resched();
+			freed += shrink_slab_node(shrinkctl, shrinker,
+				 nr_pages_scanned, lru_pages,
+				 &shrinker->nr_deferred_node[shrinkctl->nid]);
 		}
-
-		/*
-		 * move the unused scan count back into the shrinker in a
-		 * manner that handles concurrent updates. If we exhausted the
-		 * scan, there is no need to do an update.
-		 */
-		if (total_scan > 0)
-			new_nr = atomic_long_add_return(total_scan,
-					&shrinker->nr_in_batch);
-		else
-			new_nr = atomic_long_read(&shrinker->nr_in_batch);
-
-		trace_mm_shrink_slab_end(shrinker, freed, nr, new_nr);
 	}
 	up_read(&shrinker_rwsem);
 out:
-- 
1.8.1.4

^ permalink raw reply related	[flat|nested] 103+ messages in thread

* Re: [PATCH v10 13/35] vmscan: per-node deferred work
  2013-06-03 19:29   ` [PATCH v10 13/35] vmscan: per-node deferred work Glauber Costa
@ 2013-06-05 23:08     ` Andrew Morton
  2013-06-06  3:37       ` Dave Chinner
       [not found]       ` <20130605160815.fb69f7d4d1736455727fc669-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org>
  0 siblings, 2 replies; 103+ messages in thread
From: Andrew Morton @ 2013-06-05 23:08 UTC (permalink / raw)
  To: Glauber Costa
  Cc: linux-fsdevel, Mel Gorman, Dave Chinner, linux-mm, cgroups,
	kamezawa.hiroyu, Michal Hocko, Johannes Weiner, hughd,
	Greg Thelen, Dave Chinner

On Mon,  3 Jun 2013 23:29:42 +0400 Glauber Costa <glommer@openvz.org> wrote:

> We already keep per-node LRU lists for objects being shrunk, but the
> work that is deferred from one run to another is kept global. This
> creates an impedance problem, where upon node pressure, work deferred
> will accumulate and end up being flushed in other nodes.

This changelog would be more useful if it had more specificity.  Where
do we keep these per-node LRU lists (names of variables?).  Where do we
keep the global data?  In what function does this other-node flushing
happen?

Generally so that readers can go and look at the data structures and
functions which you're talking about.

> In large machines, many nodes can accumulate at the same time, all
> adding to the global counter.

What global counter?

>  As we accumulate more and more, we start
> to ask for the caches to flush even bigger numbers.

Where does this happen?

> The result is that
> the caches are depleted and do not stabilize. To achieve stable steady
> state behavior, we need to tackle it differently.
> 
> In this patch we keep the deferred count per-node, and will never
> accumulate that to other nodes.
> 
> ...
>
> --- a/include/linux/shrinker.h
> +++ b/include/linux/shrinker.h
> @@ -19,6 +19,8 @@ struct shrink_control {
>  
>  	/* shrink from these nodes */
>  	nodemask_t nodes_to_scan;
> +	/* current node being shrunk (for NUMA aware shrinkers) */
> +	int nid;
>  };
>  
>  /*
> @@ -42,6 +44,8 @@ struct shrink_control {
>   * objects freed during the scan, or -1 if progress cannot be made due to
>   * potential deadlocks. If -1 is returned, then no further attempts to call the
>   * @scan_objects will be made from the current reclaim context.
> + *
> + * @flags determine the shrinker abilities, like numa awareness 
>   */
>  struct shrinker {
>  	int (*shrink)(struct shrinker *, struct shrink_control *sc);
> @@ -50,12 +54,34 @@ struct shrinker {
>  
>  	int seeks;	/* seeks to recreate an obj */
>  	long batch;	/* reclaim batch size, 0 = default */
> +	unsigned long flags;
>  
>  	/* These are for internal use */
>  	struct list_head list;
> -	atomic_long_t nr_in_batch; /* objs pending delete */
> +	/*
> +	 * We would like to avoid allocating memory when registering a new
> +	 * shrinker.

That's quite surprising.  What are the reasons for this?

>		 All shrinkers will need to keep track of deferred objects,

What is a deferred object and why does this deferral happen?

> +	 * and we need a counter for this. If the shrinkers are not NUMA aware,
> +	 * this is a small and bounded space that fits into an atomic_long_t.
> +	 * This is because that the deferring decisions are global, and we will

s/that//

> +	 * not allocate in this case.
> +	 *
> +	 * When the shrinker is NUMA aware, we will need this to be a per-node
> +	 * array. Numerically speaking, the minority of shrinkers are NUMA
> +	 * aware, so this saves quite a bit.
> +	 */

I don't really understand what's going on here :(

> +	union {
> +		/* objs pending delete */
> +		atomic_long_t nr_deferred;
> +		/* objs pending delete, per node */
> +		atomic_long_t *nr_deferred_node;
> +	};
>  };
>  #define DEFAULT_SEEKS 2 /* A good number if you don't know better. */
> -extern void register_shrinker(struct shrinker *);
> +
> +/* Flags */
> +#define SHRINKER_NUMA_AWARE (1 << 0)
> +
> +extern int register_shrinker(struct shrinker *);
>  extern void unregister_shrinker(struct shrinker *);
>  #endif
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 53e647f..08eec9d 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -155,14 +155,36 @@ static unsigned long get_lru_size(struct lruvec *lruvec, enum lru_list lru)
>  }
>  
>  /*
> - * Add a shrinker callback to be called from the vm
> + * Add a shrinker callback to be called from the vm.
> + *
> + * It cannot fail, unless the flag SHRINKER_NUMA_AWARE is specified.
> + * With this flag set, this function will allocate memory and may fail.
>   */

Again, I don't see what the big deal is with memory allocation. 
register_shrinker() is pretty rare, is likely to happen when the system
is under little stress and GFP_KERNEL is quite strong.  Why all the
concern?

> -void register_shrinker(struct shrinker *shrinker)
> +int register_shrinker(struct shrinker *shrinker)
>  {
> -	atomic_long_set(&shrinker->nr_in_batch, 0);
> +	/*
> +	 * If we only have one possible node in the system anyway, save
> +	 * ourselves the trouble and disable NUMA aware behavior. This way we
> +	 * will allocate nothing and save memory and some small loop time
> +	 * later.
> +	 */
> +	if (nr_node_ids == 1)
> +		shrinker->flags &= ~SHRINKER_NUMA_AWARE;
> +
> +	if (shrinker->flags & SHRINKER_NUMA_AWARE) {
> +		size_t size;
> +
> +		size = sizeof(*shrinker->nr_deferred_node) * nr_node_ids;
> +		shrinker->nr_deferred_node = kzalloc(size, GFP_KERNEL);
> +		if (!shrinker->nr_deferred_node)
> +			return -ENOMEM;
> +	} else
> +		atomic_long_set(&shrinker->nr_deferred, 0);
> +
>  	down_write(&shrinker_rwsem);
>  	list_add_tail(&shrinker->list, &shrinker_list);
>  	up_write(&shrinker_rwsem);
> +	return 0;
>  }
>  EXPORT_SYMBOL(register_shrinker);

What would be the cost if we were to do away with SHRINKER_NUMA_AWARE
and treat all shrinkers the same way?  The need to allocate extra
memory per shrinker?  That sounds pretty cheap?
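
For reference, the registration path quoted above can be modeled in userspace roughly as follows. This is a sketch under stated assumptions, not the kernel implementation: plain long counters stand in for atomic_long_t, calloc() stands in for kzalloc(), the node count is passed as a parameter, and model_unregister() is a guess at the matching teardown, which is not shown in the quoted hunk. All model_* names are hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

#define MODEL_NUMA_AWARE (1UL << 0)	/* mirrors SHRINKER_NUMA_AWARE */

/* Hypothetical cut-down shrinker: one word of storage, two meanings. */
struct model_shrinker {
	unsigned long flags;
	union {
		long nr_deferred;	/* non-NUMA: one global counter */
		long *nr_deferred_node;	/* NUMA aware: per-node array */
	};
};

/* Returns 0 on success, -ENOMEM if the per-node array cannot be allocated. */
static int model_register(struct model_shrinker *s, int nr_node_ids)
{
	assert(nr_node_ids >= 1);

	/* A single-node system gains nothing from per-node counters. */
	if (nr_node_ids == 1)
		s->flags &= ~MODEL_NUMA_AWARE;

	if (s->flags & MODEL_NUMA_AWARE) {
		s->nr_deferred_node = calloc((size_t)nr_node_ids,
					     sizeof(*s->nr_deferred_node));
		if (!s->nr_deferred_node)
			return -ENOMEM;
	} else {
		s->nr_deferred = 0;
	}
	return 0;
}

/* Assumed teardown: only the NUMA-aware variant owns heap memory. */
static void model_unregister(struct model_shrinker *s)
{
	if (s->flags & MODEL_NUMA_AWARE)
		free(s->nr_deferred_node);
}
```

In this model, treating every shrinker as NUMA aware costs one array of nr_node_ids counters per shrinker and makes registration fallible for all callers, which is the trade-off the question above is probing.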

> @@ -186,6 +208,116 @@ static inline int do_shrinker_shrink(struct shrinker *shrinker,
>  }
>  
>  #define SHRINK_BATCH 128
> +
> +static unsigned long
> +shrink_slab_node(struct shrink_control *shrinkctl, struct shrinker *shrinker,
> +		 unsigned long nr_pages_scanned, unsigned long lru_pages,
> +		 atomic_long_t *deferred)
> +{
> +	unsigned long freed = 0;
> +	unsigned long long delta;
> +	long total_scan;
> +	long max_pass;
> +	long nr;
> +	long new_nr;
> +	long batch_size = shrinker->batch ? shrinker->batch
> +					  : SHRINK_BATCH;
> +
> +	if (shrinker->scan_objects) {
> +		max_pass = shrinker->count_objects(shrinker, shrinkctl);
> +		WARN_ON(max_pass < 0);
> +	} else
> +		max_pass = do_shrinker_shrink(shrinker, shrinkctl, 0);
> +	if (max_pass <= 0)
> +		return 0;
> +
> +	/*
> +	 * copy the current shrinker scan count into a local variable
> +	 * and zero it so that other concurrent shrinker invocations
> +	 * don't also do this scanning work.
> +	 */
> +	nr = atomic_long_xchg(deferred, 0);

This comment seems wrong.  It implies that "deferred" refers to "the
current shrinker scan count".  But how are these two the same thing?  A
"scan count" would refer to the number of objects to be scanned (or
which were scanned - it's unclear).  Whereas "deferred" would refer to
the number of those to-be-scanned objects which we didn't process and
is hence less than or equal to the "scan count".

It's all very foggy :(  This whole concept of deferral needs more
explanation, please.

> +	total_scan = nr;
> +	delta = (4 * nr_pages_scanned) / shrinker->seeks;
> +	delta *= max_pass;
> +	do_div(delta, lru_pages + 1);
> +	total_scan += delta;
> +	if (total_scan < 0) {
> +		printk(KERN_ERR
> +		"shrink_slab: %pF negative objects to delete nr=%ld\n",
> +		       shrinker->shrink, total_scan);
> +		total_scan = max_pass;
> +	}
> +
> +	/*
> +	 * We need to avoid excessive windup on filesystem shrinkers
> +	 * due to large numbers of GFP_NOFS allocations causing the
> +	 * shrinkers to return -1 all the time. This results in a large
> +	 * nr being built up so when a shrink that can do some work
> +	 * comes along it empties the entire cache due to nr >>>
> +	 * max_pass.  This is bad for sustaining a working set in
> +	 * memory.
> +	 *
> +	 * Hence only allow the shrinker to scan the entire cache when
> +	 * a large delta change is calculated directly.
> +	 */

That was an important comment.  So the whole problem we're tackling
here is fs shrinkers baling out in GFP_NOFS allocations?


> +	if (delta < max_pass / 4)
> +		total_scan = min(total_scan, max_pass / 2);
> +
> +	/*
> +	 * Avoid risking looping forever due to too large nr value:
> +	 * never try to free more than twice the estimate number of

"estimated"

> +	 * freeable entries.
> +	 */
> +	if (total_scan > max_pass * 2)
> +		total_scan = max_pass * 2;
> +
> +	trace_mm_shrink_slab_start(shrinker, shrinkctl, nr,
> +				nr_pages_scanned, lru_pages,
> +				max_pass, delta, total_scan);
> +
> +	while (total_scan >= batch_size) {
> +		long ret;
> +
> +		if (shrinker->scan_objects) {
> +			shrinkctl->nr_to_scan = batch_size;
> +			ret = shrinker->scan_objects(shrinker, shrinkctl);
> +
> +			if (ret == -1)
> +				break;
> +			freed += ret;
> +		} else {
> +			int nr_before;
> +			nr_before = do_shrinker_shrink(shrinker, shrinkctl, 0);
> +			ret = do_shrinker_shrink(shrinker, shrinkctl,
> +							batch_size);
> +			if (ret == -1)
> +				break;
> +			if (ret < nr_before)
> +				freed += nr_before - ret;
> +		}
> +
> +		count_vm_events(SLABS_SCANNED, batch_size);
> +		total_scan -= batch_size;
> +
> +		cond_resched();
> +	}
> +
> +	/*
> +	 * move the unused scan count back into the shrinker in a
> +	 * manner that handles concurrent updates. If we exhausted the
> +	 * scan, there is no need to do an update.
> +	 */
> +	if (total_scan > 0)
> +		new_nr = atomic_long_add_return(total_scan, deferred);
> +	else
> +		new_nr = atomic_long_read(deferred);
> +
> +	trace_mm_shrink_slab_end(shrinker, freed, nr, new_nr);
> +	return freed;
> +}
> 
> ...
>


^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH v10 13/35] vmscan: per-node deferred work
  2013-06-05 23:08     ` Andrew Morton
@ 2013-06-06  3:37       ` Dave Chinner
  2013-06-06  4:59         ` Dave Chinner
       [not found]       ` <20130605160815.fb69f7d4d1736455727fc669-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org>
  1 sibling, 1 reply; 103+ messages in thread
From: Dave Chinner @ 2013-06-06  3:37 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Glauber Costa, linux-fsdevel, Mel Gorman, linux-mm, cgroups,
	kamezawa.hiroyu, Michal Hocko, Johannes Weiner, hughd,
	Greg Thelen, Dave Chinner

On Wed, Jun 05, 2013 at 04:08:15PM -0700, Andrew Morton wrote:
> On Mon,  3 Jun 2013 23:29:42 +0400 Glauber Costa <glommer@openvz.org> wrote:
> 
> > We already keep per-node LRU lists for objects being shrunk, but the
> > work that is deferred from one run to another is kept global. This
> > creates an impedance problem, where upon node pressure, work deferred
> > will accumulate and end up being flushed in other nodes.
> 
> This changelog would be more useful if it had more specificity.  Where
> do we keep these per-node LRU lists (names of variables?).

In the per-node LRU lists the shrinker walks ;)

> Where do we
> keep the global data? 

In the struct shrinker

> In what function does this other-node flushing
> happen?

Any shrinker that is run on a different node.

> Generally so that readers can go and look at the data structures and
> functions which you're talking about.
> 
> > In large machines, many nodes can accumulate at the same time, all
> > adding to the global counter.
> 
> What global counter?

shrinker->nr

> >  As we accumulate more and more, we start
> > to ask for the caches to flush even bigger numbers.
> 
> Where does this happen?

The shrinker scan loop ;)

> > @@ -186,6 +208,116 @@ static inline int do_shrinker_shrink(struct shrinker *shrinker,
> >  }
> >  
> >  #define SHRINK_BATCH 128
> > +
> > +static unsigned long
> > +shrink_slab_node(struct shrink_control *shrinkctl, struct shrinker *shrinker,
> > +		 unsigned long nr_pages_scanned, unsigned long lru_pages,
> > +		 atomic_long_t *deferred)
> > +{
> > +	unsigned long freed = 0;
> > +	unsigned long long delta;
> > +	long total_scan;
> > +	long max_pass;
> > +	long nr;
> > +	long new_nr;
> > +	long batch_size = shrinker->batch ? shrinker->batch
> > +					  : SHRINK_BATCH;
> > +
> > +	if (shrinker->scan_objects) {
> > +		max_pass = shrinker->count_objects(shrinker, shrinkctl);
> > +		WARN_ON(max_pass < 0);
> > +	} else
> > +		max_pass = do_shrinker_shrink(shrinker, shrinkctl, 0);
> > +	if (max_pass <= 0)
> > +		return 0;
> > +
> > +	/*
> > +	 * copy the current shrinker scan count into a local variable
> > +	 * and zero it so that other concurrent shrinker invocations
> > +	 * don't also do this scanning work.
> > +	 */
> > +	nr = atomic_long_xchg(deferred, 0);
> 
> This comment seems wrong.  It implies that "deferred" refers to "the
> current shrinker scan count".  But how are these two the same thing?  A
> "scan count" would refer to the number of objects to be scanned (or
> which were scanned - it's unclear).  Whereas "deferred" would refer to
> the number of those to-be-scanned objects which we didn't process and
> is hence less than or equal to the "scan count".
> 
> It's all very foggy :(  This whole concept of deferral needs more
> explanation, please.

You wrote the shrinker deferral code way back in 2.5.42 (IIRC), so
maybe you can explain it to us? :)
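
As a rough illustration of what "deferred" means in the quoted hunk, here is a userspace sketch using C11 atomics. It is a simplified model, not the kernel code: new_work stands in for the computed delta, can_scan == 0 models a shrinker that returns -1 on every call (for example under GFP_NOFS), and the windup clamps are left out. MODEL_NR_NODES and all model_* names are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>

#define MODEL_NR_NODES 4	/* hypothetical node count */

/* One backlog counter per node, in the spirit of nr_deferred_node[]. */
static atomic_long model_deferred[MODEL_NR_NODES];

/*
 * Model one shrink pass on node nid. Returns the number of objects
 * "scanned"; work that could not be done is deferred to the next pass
 * on the same node and never spills over to another node.
 */
static long model_shrink_node(int nid, long new_work, long batch, int can_scan)
{
	long done = 0;
	long total;

	assert(nid >= 0 && nid < MODEL_NR_NODES && batch > 0);

	/* Claim the backlog so a concurrent pass does not repeat it. */
	total = atomic_exchange(&model_deferred[nid], 0) + new_work;

	/* Work is only done in whole batches. */
	while (can_scan && total >= batch) {
		done += batch;
		total -= batch;
	}

	/* Hand back whatever was not scanned. */
	if (total > 0)
		atomic_fetch_add(&model_deferred[nid], total);

	return done;
}
```

In this model an aborted pass on node 0 leaves its whole request in model_deferred[0], and a later pass on node 1 neither sees nor flushes it. That separation is the behaviour the patch aims for.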

> 
> > +	total_scan = nr;
> > +	delta = (4 * nr_pages_scanned) / shrinker->seeks;
> > +	delta *= max_pass;
> > +	do_div(delta, lru_pages + 1);
> > +	total_scan += delta;
> > +	if (total_scan < 0) {
> > +		printk(KERN_ERR
> > +		"shrink_slab: %pF negative objects to delete nr=%ld\n",
> > +		       shrinker->shrink, total_scan);
> > +		total_scan = max_pass;
> > +	}
> > +
> > +	/*
> > +	 * We need to avoid excessive windup on filesystem shrinkers
> > +	 * due to large numbers of GFP_NOFS allocations causing the
> > +	 * shrinkers to return -1 all the time. This results in a large
> > +	 * nr being built up so when a shrink that can do some work
> > +	 * comes along it empties the entire cache due to nr >>>
> > +	 * max_pass.  This is bad for sustaining a working set in
> > +	 * memory.
> > +	 *
> > +	 * Hence only allow the shrinker to scan the entire cache when
> > +	 * a large delta change is calculated directly.
> > +	 */
> 
> That was an important comment.  So the whole problem we're tackling
> here is fs shrinkers baling out in GFP_NOFS allocations?

commit 3567b59aa80ac4417002bf58e35dce5c777d4164
Author: Dave Chinner <dchinner@redhat.com>
Date:   Fri Jul 8 14:14:36 2011 +1000

    vmscan: reduce wind up shrinker->nr when shrinker can't do work
    
    When a shrinker returns -1 to shrink_slab() to indicate it cannot do
    any work given the current memory reclaim requirements, it adds the
    entire total_scan count to shrinker->nr. The idea behind this is that
    when the shrinker is next called and can do work, it will do the work
    of the previously aborted shrinker call as well.
    
    However, if a filesystem is doing lots of allocation with GFP_NOFS
    set, then we get many, many more aborts from the shrinkers than we
    do successful calls. The result is that shrinker->nr winds up to
    its maximum permissible value (twice the current cache size) and
    then when the next shrinker call that can do work is issued, it
    has enough scan count built up to free the entire cache twice over.
    
    This manifests itself in the cache going from full to empty in a
    matter of seconds, even when only a small part of the cache is
    needed to be emptied to free sufficient memory.
    
    Under metadata intensive workloads on ext4 and XFS, I'm seeing the
    VFS caches increase memory consumption up to 75% of memory (no page
    cache pressure) over a period of 30-60s, and then the shrinker
    empties them down to zero in the space of 2-3s. This cycle repeats
    over and over again, with the shrinker completely trashing the inode
    and dentry caches every minute or so the workload continues.
    
    This behaviour was made obvious by the shrink_slab tracepoints added
    earlier in the series, and made worse by the patch that corrected
    the concurrent accounting of shrinker->nr.
    
    To avoid this problem, stop repeated small increments of the total
    scan value from winding shrinker->nr up to a value that can cause
    the entire cache to be freed. We still need to allow it to wind up,
    so use the delta as the "large scan" threshold check - if the delta
    is more than a quarter of the entire cache size, then it is a large
    scan and allowed to cause lots of windup because we are clearly
    needing to free lots of memory.
    
    If it isn't a large scan then limit the total scan to half the size
    of the cache so that windup never increases to consume the whole
    cache. Reducing the total scan limit further does not allow enough
    wind-up to maintain the current levels of performance, whilst a
    higher threshold does not prevent the windup from freeing the entire
    cache under sustained workloads.
    
    Signed-off-by: Dave Chinner <dchinner@redhat.com>
    Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
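
The limit described in that changelog can be modelled in a few lines of
user-space C. This is only a sketch under stated assumptions:
clamp_total_scan() is a hypothetical helper, not a kernel function, and
"cache" stands in for what the shrinker code calls max_pass.

```c
#include <assert.h>

/*
 * Illustrative user-space model of the windup limit described above.
 * clamp_total_scan() is a hypothetical name, not a kernel function.
 *
 * deferred: scan count carried over from earlier aborted calls
 * delta:    scan count computed for this call from reclaim pressure
 * cache:    current number of objects in the cache
 */
static long clamp_total_scan(long deferred, long delta, long cache)
{
	long total_scan = deferred + delta;

	assert(deferred >= 0 && delta >= 0 && cache >= 0);

	/* Not a large scan: never let windup exceed half the cache. */
	if (delta < cache / 4 && total_scan > cache / 2)
		total_scan = cache / 2;

	/* Even a large scan may not exceed twice the cache size. */
	if (total_scan > cache * 2)
		total_scan = cache * 2;

	return total_scan;
}
```

With a cache of 1000 objects, a wound-up count of 2000 plus a small delta
of 10 is cut back to 500, while the same count plus a delta of 300 (more
than a quarter of the cache) is allowed up to the 2000 ceiling.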



-- 
Dave Chinner
david@fromorbit.com

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH v10 13/35] vmscan: per-node deferred work
  2013-06-06  3:37       ` Dave Chinner
@ 2013-06-06  4:59         ` Dave Chinner
  2013-06-06  7:12           ` Andrew Morton
  0 siblings, 1 reply; 103+ messages in thread
From: Dave Chinner @ 2013-06-06  4:59 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Glauber Costa, linux-fsdevel, Mel Gorman, linux-mm, cgroups,
	kamezawa.hiroyu, Michal Hocko, Johannes Weiner, hughd,
	Greg Thelen, Dave Chinner

On Thu, Jun 06, 2013 at 01:37:42PM +1000, Dave Chinner wrote:
> On Wed, Jun 05, 2013 at 04:08:15PM -0700, Andrew Morton wrote:
> > On Mon,  3 Jun 2013 23:29:42 +0400 Glauber Costa <glommer@openvz.org> wrote:
> > 
> > > We already keep per-node LRU lists for objects being shrunk, but the
> > > work that is deferred from one run to another is kept global. This
> > > creates an impedance problem, where upon node pressure, work deferred
> > > will accumulate and end up being flushed in other nodes.
> > 
> > This changelog would be more useful if it had more specificity.  Where
> > do we keep these per-node LRU lists (names of variables?).
> 
> In the per-node LRU lists the shrinker walks ;)
> 
> > Where do we
> > keep the global data? 
> 
> In the struct shrinker
> 
> > In what function does this other-node flushing
> > happen?
> 
> Any shrinker that is run on a different node.
> 
> > Generally so that readers can go and look at the data structures and
> > functions which you're talking about.
> > 
> > > In large machines, many nodes can accumulate at the same time, all
> > > adding to the global counter.
> > 
> > What global counter?
> 
> shrinker->nr
> 
> > >  As we accumulate more and more, we start
> > > to ask for the caches to flush even bigger numbers.
> > 
> > Where does this happen?
> 
> The shrinker scan loop ;)

Answers which don't really tell you more than you already knew :/

To give you more background, Andrew, here's a pointer to the
discussion where we analysed the problem that led to this patch:

http://marc.info/?l=linux-fsdevel&m=136852512724091&w=4

Cheers,

Dave.

-- 
Dave Chinner
david@fromorbit.com


^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH v10 13/35] vmscan: per-node deferred work
  2013-06-06  4:59         ` Dave Chinner
@ 2013-06-06  7:12           ` Andrew Morton
  0 siblings, 0 replies; 103+ messages in thread
From: Andrew Morton @ 2013-06-06  7:12 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Glauber Costa, linux-fsdevel, Mel Gorman, linux-mm, cgroups,
	kamezawa.hiroyu, Michal Hocko, Johannes Weiner, hughd,
	Greg Thelen, Dave Chinner

On Thu, 6 Jun 2013 14:59:07 +1000 Dave Chinner <david@fromorbit.com> wrote:

> On Thu, Jun 06, 2013 at 01:37:42PM +1000, Dave Chinner wrote:
> > On Wed, Jun 05, 2013 at 04:08:15PM -0700, Andrew Morton wrote:
> > > On Mon,  3 Jun 2013 23:29:42 +0400 Glauber Costa <glommer@openvz.org> wrote:
> > > 
> > > > We already keep per-node LRU lists for objects being shrunk, but the
> > > > work that is deferred from one run to another is kept global. This
> > > > creates an impedance problem, where upon node pressure, work deferred
> > > > will accumulate and end up being flushed in other nodes.
> > > 
> > > This changelog would be more useful if it had more specificity.  Where
> > > do we keep these per-node LRU lists (names of variables?).
> > 
> > In the per-node LRU lists the shrinker walks ;)
> > 
> > > Where do we
> > > keep the global data? 
> > 
> > In the struct shrinker
> > 
> > > In what function does this other-node flushing
> > > happen?
> > 
> > Any shrinker that is run on a different node.
> > 
> > > Generally so that readers can go and look at the data structures and
> > > functions which you're talking about.
> > > 
> > > > In large machines, many nodes can accumulate at the same time, all
> > > > adding to the global counter.
> > > 
> > > What global counter?
> > 
> > shrinker->nr
> > 
> > > >  As we accumulate more and more, we start
> > > > to ask for the caches to flush even bigger numbers.
> > > 
> > > Where does this happen?
> > 
> > The shrinker scan loop ;)
> 
> Answers which don't really tell you more than you already knew :/
> 
> To give you more background, Andrew, here's a pointer to the
> discussion where we analysed the problem that led to this patch:
> 
> http://marc.info/?l=linux-fsdevel&m=136852512724091&w=4

Thanks, I'll read that later.  But that only helps me!  And I'll forget
it all in six hours.

Please understand where I'm coming from here: I review code from the
point of view (amongst others) "how understandable and maintainable is
this".  And I hope that reviewees understand that "if this reader asked
that question then others will wonder the same thing, so I need to fix
that up".

And I do think that about 2% of readers look in Documentation/, 1% of
readers go back to look at changelogs and 0% of readers go back and
look at the mailing list discussion.  It's most effective if it's right
there in the .c file.

Obviously there are tradeoffs here, but code which overdoes the
explain-thyself thing is rare to non-existent.

^ permalink raw reply	[flat|nested] 103+ messages in thread

[parent not found: <20130605160815.fb69f7d4d1736455727fc669-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org>]

* Re: [PATCH v10 13/35] vmscan: per-node deferred work
       [not found]       ` <20130605160815.fb69f7d4d1736455727fc669-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org>
@ 2013-06-06  9:00         ` Glauber Costa
  0 siblings, 0 replies; 103+ messages in thread
From: Glauber Costa @ 2013-06-06  9:00 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Glauber Costa, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Mel Gorman,
	Dave Chinner, linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
	cgroups-u79uwXL29TY76Z2rM5mHXA,
	kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A, Michal Hocko,
	Johannes Weiner, hughd-hpIqsD4AKlfQT0dZR+AlfA, Greg Thelen,
	Dave Chinner

On 06/06/2013 03:08 AM, Andrew Morton wrote:
> On Mon,  3 Jun 2013 23:29:42 +0400 Glauber Costa <glommer-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org> wrote:
> 
>> We already keep per-node LRU lists for objects being shrunk, but the
>> work that is deferred from one run to another is kept global. This
>> creates an impedance problem, where upon node pressure, work deferred
>> will accumulate and end up being flushed in other nodes.
> 
> This changelog would be more useful if it had more specificity.  Where
> do we keep these per-node LRU lists (names of variables?).  Where do we
> keep the global data?  In what function does this other-node flushing
> happen?
> 
> Generally so that readers can go and look at the data structures and
> functions which you're talking about.
> 
>> In large machines, many nodes can accumulate at the same time, all
>> adding to the global counter.
> 
> What global counter?
> 
>>  As we accumulate more and more, we start
>> to ask for the caches to flush even bigger numbers.
> 
> Where does this happen?
> 
>> The result is that
>> the caches are depleted and do not stabilize. To achieve stable steady
>> state behavior, we need to tackle it differently.
>>
>> In this patch we keep the deferred count per-node, and will never
>> accumulate that to other nodes.
>>
>> ...
>>
>> --- a/include/linux/shrinker.h
>> +++ b/include/linux/shrinker.h
>> @@ -19,6 +19,8 @@ struct shrink_control {
>>  
>>  	/* shrink from these nodes */
>>  	nodemask_t nodes_to_scan;
>> +	/* current node being shrunk (for NUMA aware shrinkers) */
>> +	int nid;
>>  };
>>  
>>  /*
>> @@ -42,6 +44,8 @@ struct shrink_control {
>>   * objects freed during the scan, or -1 if progress cannot be made due to
>>   * potential deadlocks. If -1 is returned, then no further attempts to call the
>>   * @scan_objects will be made from the current reclaim context.
>> + *
>> + * @flags determine the shrinker abilities, like numa awareness 
>>   */
>>  struct shrinker {
>>  	int (*shrink)(struct shrinker *, struct shrink_control *sc);
>> @@ -50,12 +54,34 @@ struct shrinker {
>>  
>>  	int seeks;	/* seeks to recreate an obj */
>>  	long batch;	/* reclaim batch size, 0 = default */
>> +	unsigned long flags;
>>  
>>  	/* These are for internal use */
>>  	struct list_head list;
>> -	atomic_long_t nr_in_batch; /* objs pending delete */
>> +	/*
>> +	 * We would like to avoid allocating memory when registering a new
>> +	 * shrinker.
> 
> That's quite surprising.  What are the reasons for this?
> 
>> 		 All shrinkers will need to keep track of deferred objects,
> 
> What is a deferred object and why does this deferral happen?
> 
>> +	 * and we need a counter for this. If the shrinkers are not NUMA aware,
>> +	 * this is a small and bounded space that fits into an atomic_long_t.
>> +	 * This is because that the deferring decisions are global, and we will
> 
> s/that//
> 
>> +	 * not allocate in this case.
>> +	 *
>> +	 * When the shrinker is NUMA aware, we will need this to be a per-node
>> +	 * array. Numerically speaking, the minority of shrinkers are NUMA
>> +	 * aware, so this saves quite a bit.
>> +	 */
> 
> I don't really understand what's going on here :(
> 

Ok. We need an array allocation for NUMA aware shrinkers, but we don't
need any for non NUMA-aware shrinkers. There is nothing wrong with the
memory allocation per se, in terms of contexts, etc.

But in a NUMA *machine*, we would be allocating a lot of wasted memory
for creating arrays in shrinkers that are not NUMA capable at all.

Turns out, they seem to be the majority (at least so far).

Aside from the memory allocated, we still have all the useless loops and
cacheline dirtying. So I figured it would be useful to not make them all
NUMA aware if we can avoid it.
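
To make the trade-off concrete, here is a rough user-space sketch of the
idea: one embedded counter for shrinkers that are not NUMA aware, and an
allocated per-node array only when the flag is set. All demo_* names and
DEMO_NUMA_AWARE are made up for illustration, and calloc() stands in for
kzalloc(); it is not the code from the patch.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>

#define DEMO_NUMA_AWARE (1UL << 0)	/* stand-in for SHRINKER_NUMA_AWARE */

struct demo_shrinker {
	unsigned long flags;
	union {
		atomic_long nr_deferred;	/* single global counter */
		atomic_long *nr_deferred_node;	/* one counter per node */
	};
};

/* Returns 0 on success, -1 if the per-node array cannot be allocated. */
static int demo_register(struct demo_shrinker *s, int nr_nodes)
{
	/* With a single node there is nothing to gain from an array. */
	if (nr_nodes == 1)
		s->flags &= ~DEMO_NUMA_AWARE;

	if (s->flags & DEMO_NUMA_AWARE) {
		s->nr_deferred_node = calloc((size_t)nr_nodes,
					     sizeof(*s->nr_deferred_node));
		if (!s->nr_deferred_node)
			return -1;
		for (int i = 0; i < nr_nodes; i++)
			atomic_init(&s->nr_deferred_node[i], 0);
	} else {
		/* No allocation at all in the common, non-NUMA-aware case. */
		atomic_init(&s->nr_deferred, 0);
	}
	return 0;
}

static void demo_unregister(struct demo_shrinker *s)
{
	if (s->flags & DEMO_NUMA_AWARE)
		free(s->nr_deferred_node);
}

/* Pick the counter that deferred work for node nid is charged to. */
static atomic_long *demo_counter(struct demo_shrinker *s, int nid)
{
	assert(nid >= 0);
	if (s->flags & DEMO_NUMA_AWARE)
		return &s->nr_deferred_node[nid];
	return &s->nr_deferred;
}
```

A shrinker without the flag always gets the same counter back regardless
of nid, so it never pays for the array or for looping over nodes.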


>> +	union {
>> +		/* objs pending delete */
>> +		atomic_long_t nr_deferred;
>> +		/* objs pending delete, per node */
>> +		atomic_long_t *nr_deferred_node;
>> +	};
>>  };
>>  #define DEFAULT_SEEKS 2 /* A good number if you don't know better. */
>> -extern void register_shrinker(struct shrinker *);
>> +
>> +/* Flags */
>> +#define SHRINKER_NUMA_AWARE (1 << 0)
>> +
>> +extern int register_shrinker(struct shrinker *);
>>  extern void unregister_shrinker(struct shrinker *);
>>  #endif
>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>> index 53e647f..08eec9d 100644
>> --- a/mm/vmscan.c
>> +++ b/mm/vmscan.c
>> @@ -155,14 +155,36 @@ static unsigned long get_lru_size(struct lruvec *lruvec, enum lru_list lru)
>>  }
>>  
>>  /*
>> - * Add a shrinker callback to be called from the vm
>> + * Add a shrinker callback to be called from the vm.
>> + *
>> + * It cannot fail, unless the flag SHRINKER_NUMA_AWARE is specified.
>> + * With this flag set, this function will allocate memory and may fail.
>>   */
> 
> Again, I don't see what the big deal is with memory allocation. 
> register_shrinker() is pretty rare, is likely to happen when the system
> is under little stress and GFP_KERNEL is quite strong.  Why all the
> concern?
> 
>> -void register_shrinker(struct shrinker *shrinker)
>> +int register_shrinker(struct shrinker *shrinker)
>>  {
>> -	atomic_long_set(&shrinker->nr_in_batch, 0);
>> +	/*
>> +	 * If we only have one possible node in the system anyway, save
>> +	 * ourselves the trouble and disable NUMA aware behavior. This way we
>> +	 * will allocate nothing and save memory and some small loop time
>> +	 * later.
>> +	 */
>> +	if (nr_node_ids == 1)
>> +		shrinker->flags &= ~SHRINKER_NUMA_AWARE;
>> +
>> +	if (shrinker->flags & SHRINKER_NUMA_AWARE) {
>> +		size_t size;
>> +
>> +		size = sizeof(*shrinker->nr_deferred_node) * nr_node_ids;
>> +		shrinker->nr_deferred_node = kzalloc(size, GFP_KERNEL);
>> +		if (!shrinker->nr_deferred_node)
>> +			return -ENOMEM;
>> +	} else
>> +		atomic_long_set(&shrinker->nr_deferred, 0);
>> +
>>  	down_write(&shrinker_rwsem);
>>  	list_add_tail(&shrinker->list, &shrinker_list);
>>  	up_write(&shrinker_rwsem);
>> +	return 0;
>>  }
>>  EXPORT_SYMBOL(register_shrinker);
> 
> What would be the cost if we were to do away with SHRINKER_NUMA_AWARE
> and treat all shrinkers the same way?  The need to allocate extra
> memory per shrinker?  That sounds pretty cheap?
> 

Well, maybe I am just a little bit more frenetic about savings than you
are. There are quite a bunch of shrinkers.


>> @@ -186,6 +208,116 @@ static inline int do_shrinker_shrink(struct shrinker *shrinker,
>>  }
>>  
>>  #define SHRINK_BATCH 128
>> +
>> +static unsigned long
>> +shrink_slab_node(struct shrink_control *shrinkctl, struct shrinker *shrinker,
>> +		 unsigned long nr_pages_scanned, unsigned long lru_pages,
>> +		 atomic_long_t *deferred)
>> +{
>> +	unsigned long freed = 0;
>> +	unsigned long long delta;
>> +	long total_scan;
>> +	long max_pass;
>> +	long nr;
>> +	long new_nr;
>> +	long batch_size = shrinker->batch ? shrinker->batch
>> +					  : SHRINK_BATCH;
>> +
>> +	if (shrinker->scan_objects) {
>> +		max_pass = shrinker->count_objects(shrinker, shrinkctl);
>> +		WARN_ON(max_pass < 0);
>> +	} else
>> +		max_pass = do_shrinker_shrink(shrinker, shrinkctl, 0);
>> +	if (max_pass <= 0)
>> +		return 0;
>> +
>> +	/*
>> +	 * copy the current shrinker scan count into a local variable
>> +	 * and zero it so that other concurrent shrinker invocations
>> +	 * don't also do this scanning work.
>> +	 */
>> +	nr = atomic_long_xchg(deferred, 0);
> 
> This comment seems wrong.  It implies that "deferred" refers to "the
> current shrinker scan count".  But how are these two the same thing?  A
> "scan count" would refer to the number of objects to be scanned (or
> which were scanned - it's unclear).  Whereas "deferred" would refer to
> the number of those to-be-scanned objects which we didn't process and
> is hence less than or equal to the "scan count".
> 
> It's all very foggy :(  This whole concept of deferral needs more
> explanation, please.
> 

>> +	total_scan = nr;
>> +	delta = (4 * nr_pages_scanned) / shrinker->seeks;
>> +	delta *= max_pass;
>> +	do_div(delta, lru_pages + 1);
>> +	total_scan += delta;
>> +	if (total_scan < 0) {
>> +		printk(KERN_ERR
>> +		"shrink_slab: %pF negative objects to delete nr=%ld\n",
>> +		       shrinker->shrink, total_scan);
>> +		total_scan = max_pass;
>> +	}
>> +
>> +	/*
>> +	 * We need to avoid excessive windup on filesystem shrinkers
>> +	 * due to large numbers of GFP_NOFS allocations causing the
>> +	 * shrinkers to return -1 all the time. This results in a large
>> +	 * nr being built up so when a shrink that can do some work
>> +	 * comes along it empties the entire cache due to nr >>>
>> +	 * max_pass.  This is bad for sustaining a working set in
>> +	 * memory.
>> +	 *
>> +	 * Hence only allow the shrinker to scan the entire cache when
>> +	 * a large delta change is calculated directly.
>> +	 */
> 
> That was an important comment.  So the whole problem we're tackling
> here is fs shrinkers bailing out in GFP_NOFS allocations?
> 
The main problem, yes. Not the whole.
The whole problem is shrinkers bailing out. For the fs shrinkers it
happens in GFP_NOFS allocations. For the other shrinkers, I have no idea.

But if they bail out, we'll defer the scan just the same.
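
Since the deferral concept keeps coming up, a toy user-space model may
help. It assumes two nodes and a shrinker that either scans everything it
is asked to or bails out entirely; demo_deferred and demo_shrink_node()
are hypothetical names, and the windup clamping discussed earlier is left
out to keep the sketch short.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define DEMO_NODES 2

/* One deferred-work counter per node, as the patch proposes. */
static atomic_long demo_deferred[DEMO_NODES];

/*
 * One reclaim pass on node nid. 'delta' is the new work requested by this
 * pass; 'can_scan' models whether the shrinker is able to make progress
 * (false stands for a bail-out, e.g. in a GFP_NOFS context).
 * Returns the number of objects scanned.
 */
static long demo_shrink_node(int nid, long delta, bool can_scan)
{
	long total, scanned;

	assert(nid >= 0 && nid < DEMO_NODES);

	/* Claim the deferred work so concurrent passes do not repeat it. */
	total = atomic_exchange(&demo_deferred[nid], 0) + delta;

	scanned = can_scan ? total : 0;

	/* Whatever was not scanned is deferred to the next pass on this node. */
	if (total > scanned)
		atomic_fetch_add(&demo_deferred[nid], total - scanned);

	return scanned;
}
```

Two aborted passes on node 0 leave 200 objects deferred there. A pass on
node 1 in between does only its own work, and the backlog is picked up
by the next pass on node 0 that can make progress.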

^ permalink raw reply	[flat|nested] 103+ messages in thread

* [PATCH v10 14/35] list_lru: per-node API
       [not found] ` <1370287804-3481-1-git-send-email-glommer-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org>
                     ` (10 preceding siblings ...)
  2013-06-03 19:29   ` [PATCH v10 13/35] vmscan: per-node deferred work Glauber Costa
@ 2013-06-03 19:29   ` Glauber Costa
  2013-06-03 19:29   ` [PATCH v10 15/35] fs: convert inode and dentry shrinking to be node aware Glauber Costa
                     ` (18 subsequent siblings)
  30 siblings, 0 replies; 103+ messages in thread
From: Glauber Costa @ 2013-06-03 19:29 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Mel Gorman, Dave Chinner,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, cgroups-u79uwXL29TY76Z2rM5mHXA,
	kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A, Michal Hocko,
	Johannes Weiner, hughd-hpIqsD4AKlfQT0dZR+AlfA, Greg Thelen,
	Glauber Costa, Dave Chinner

This patch adapts the list_lru API to accept an optional node argument,
to be used by NUMA aware shrinking functions. Code that does not care
about the NUMA placement of objects can still call into the very same
functions as before. They will simply iterate over all nodes.
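
As a rough illustration of the resulting API shape, the following
user-space sketch keeps per-node counts and builds the node-unaware
helpers on top of the per-node ones. The demo_* names and the bitmask
standing in for the active_nodes nodemask are assumptions made for the
example; the real walk calls an isolate callback per item, whereas this
model simply removes every item it walks.

```c
#include <assert.h>

#define DEMO_MAX_NODES 4

struct demo_lru {
	unsigned long active_nodes;		/* bit n set: node n has items */
	unsigned long nr_items[DEMO_MAX_NODES];	/* per-node item counts */
};

/* Per-node entry point, for NUMA aware callers. */
static unsigned long demo_count_node(const struct demo_lru *lru, int nid)
{
	assert(nid >= 0 && nid < DEMO_MAX_NODES);
	return lru->nr_items[nid];
}

/* Node-unaware callers keep using this; it sums over the active nodes. */
static unsigned long demo_count(const struct demo_lru *lru)
{
	unsigned long count = 0;

	for (int nid = 0; nid < DEMO_MAX_NODES; nid++)
		if (lru->active_nodes & (1UL << nid))
			count += demo_count_node(lru, nid);
	return count;
}

/* Remove up to *nr_to_walk items from one node; the budget is decremented. */
static unsigned long demo_walk_node(struct demo_lru *lru, int nid,
				    unsigned long *nr_to_walk)
{
	unsigned long n;

	assert(nid >= 0 && nid < DEMO_MAX_NODES);
	n = lru->nr_items[nid] < *nr_to_walk ? lru->nr_items[nid] : *nr_to_walk;
	lru->nr_items[nid] -= n;
	*nr_to_walk -= n;
	if (!lru->nr_items[nid])
		lru->active_nodes &= ~(1UL << nid);
	return n;
}

/* Walk all active nodes, sharing a single budget between them. */
static unsigned long demo_walk(struct demo_lru *lru, unsigned long nr_to_walk)
{
	unsigned long isolated = 0;

	for (int nid = 0; nid < DEMO_MAX_NODES; nid++) {
		if (!(lru->active_nodes & (1UL << nid)))
			continue;
		isolated += demo_walk_node(lru, nid, &nr_to_walk);
		if (nr_to_walk == 0)
			break;
	}
	return isolated;
}
```

Passing the budget by pointer is what lets the all-nodes wrapper stop
early once the requested number of items has been walked.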

Signed-off-by: Glauber Costa <glommer-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org>
Cc: Dave Chinner <dchinner-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
Cc: Mel Gorman <mgorman-l3A5Bk7waGM@public.gmane.org>
---
 include/linux/list_lru.h | 35 ++++++++++++++++++++++++++++++++---
 lib/list_lru.c           | 41 +++++++++--------------------------------
 2 files changed, 41 insertions(+), 35 deletions(-)

diff --git a/include/linux/list_lru.h b/include/linux/list_lru.h
index 668f1f1..cf59a8a 100644
--- a/include/linux/list_lru.h
+++ b/include/linux/list_lru.h
@@ -42,15 +42,44 @@ struct list_lru {
 int list_lru_init(struct list_lru *lru);
 int list_lru_add(struct list_lru *lru, struct list_head *item);
 int list_lru_del(struct list_lru *lru, struct list_head *item);
-unsigned long list_lru_count(struct list_lru *lru);
+
+unsigned long list_lru_count_node(struct list_lru *lru, int nid);
+static inline unsigned long list_lru_count(struct list_lru *lru)
+{
+	long count = 0;
+	int nid;
+
+	for_each_node_mask(nid, lru->active_nodes)
+		count += list_lru_count_node(lru, nid);
+
+	return count;
+}
 
 typedef enum lru_status
 (*list_lru_walk_cb)(struct list_head *item, spinlock_t *lock, void *cb_arg);
 
 typedef void (*list_lru_dispose_cb)(struct list_head *dispose_list);
 
-unsigned long list_lru_walk(struct list_lru *lru, list_lru_walk_cb isolate,
-		   void *cb_arg, unsigned long nr_to_walk);
+
+unsigned long list_lru_walk_node(struct list_lru *lru, int nid,
+				 list_lru_walk_cb isolate, void *cb_arg,
+				 unsigned long *nr_to_walk);
+
+static inline unsigned long
+list_lru_walk(struct list_lru *lru, list_lru_walk_cb isolate,
+	      void *cb_arg, unsigned long nr_to_walk)
+{
+	long isolated = 0;
+	int nid;
+
+	for_each_node_mask(nid, lru->active_nodes) {
+		isolated += list_lru_walk_node(lru, nid, isolate,
+					       cb_arg, &nr_to_walk);
+		if (nr_to_walk <= 0)
+			break;
+	}
+	return isolated;
+}
 
 unsigned long
 list_lru_dispose_all(struct list_lru *lru, list_lru_dispose_cb dispose);
diff --git a/lib/list_lru.c b/lib/list_lru.c
index 7611df7..dae13d6 100644
--- a/lib/list_lru.c
+++ b/lib/list_lru.c
@@ -54,25 +54,21 @@ list_lru_del(
 EXPORT_SYMBOL_GPL(list_lru_del);
 
 unsigned long
-list_lru_count(struct list_lru *lru)
+list_lru_count_node(struct list_lru *lru, int nid)
 {
 	long count = 0;
-	int nid;
-
-	for_each_node_mask(nid, lru->active_nodes) {
-		struct list_lru_node *nlru = &lru->node[nid];
+	struct list_lru_node *nlru = &lru->node[nid];
 
-		spin_lock(&nlru->lock);
-		BUG_ON(nlru->nr_items < 0);
-		count += nlru->nr_items;
-		spin_unlock(&nlru->lock);
-	}
+	spin_lock(&nlru->lock);
+	BUG_ON(nlru->nr_items < 0);
+	count += nlru->nr_items;
+	spin_unlock(&nlru->lock);
 
 	return count;
 }
-EXPORT_SYMBOL_GPL(list_lru_count);
+EXPORT_SYMBOL_GPL(list_lru_count_node);
 
-static unsigned long
+unsigned long
 list_lru_walk_node(
 	struct list_lru		*lru,
 	int			nid,
@@ -118,26 +114,7 @@ restart:
 	spin_unlock(&nlru->lock);
 	return isolated;
 }
-
-unsigned long
-list_lru_walk(
-	struct list_lru	*lru,
-	list_lru_walk_cb isolate,
-	void		*cb_arg,
-	unsigned long	nr_to_walk)
-{
-	long isolated = 0;
-	int nid;
-
-	for_each_node_mask(nid, lru->active_nodes) {
-		isolated += list_lru_walk_node(lru, nid, isolate,
-					       cb_arg, &nr_to_walk);
-		if (nr_to_walk <= 0)
-			break;
-	}
-	return isolated;
-}
-EXPORT_SYMBOL_GPL(list_lru_walk);
+EXPORT_SYMBOL_GPL(list_lru_walk_node);
 
 static unsigned long
 list_lru_dispose_all_node(
-- 
1.8.1.4

^ permalink raw reply related	[flat|nested] 103+ messages in thread

* [PATCH v10 15/35] fs: convert inode and dentry shrinking to be node aware
       [not found] ` <1370287804-3481-1-git-send-email-glommer-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org>
                     ` (11 preceding siblings ...)
  2013-06-03 19:29   ` [PATCH v10 14/35] list_lru: per-node API Glauber Costa
@ 2013-06-03 19:29   ` Glauber Costa
  2013-06-03 19:29   ` [PATCH v10 16/35] xfs: convert buftarg LRU to generic code Glauber Costa
                     ` (17 subsequent siblings)
  30 siblings, 0 replies; 103+ messages in thread
From: Glauber Costa @ 2013-06-03 19:29 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Mel Gorman, Dave Chinner,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, cgroups-u79uwXL29TY76Z2rM5mHXA,
	kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A, Michal Hocko,
	Johannes Weiner, hughd-hpIqsD4AKlfQT0dZR+AlfA, Greg Thelen,
	Dave Chinner, Glauber Costa

From: Dave Chinner <dchinner-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>

Now that the shrinker is passing a node in the scan control
structure, we can pass this to the generic LRU list code to
isolate reclaim to the lists on matching nodes.
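
For readers unfamiliar with the superblock shrinker, here is a small
user-space sketch of how a scan budget can be split between the dentry,
inode and filesystem-private caches in proportion to their sizes, which
after this change would be the per-node sizes. demo_mult_frac() mirrors
the arithmetic of the kernel's mult_frac() macro as far as I recall it;
demo_proportion() and struct demo_split are hypothetical names, not
functions from fs/super.c.

```c
#include <assert.h>

/* x * numer / denom without overflowing the intermediate product. */
static long demo_mult_frac(long x, long numer, long denom)
{
	long quot, rem;

	assert(denom > 0);
	quot = x / denom;
	rem = x % denom;
	return quot * numer + (rem * numer) / denom;
}

struct demo_split {
	long dentries;
	long inodes;
	long fs_objects;
};

/*
 * Split nr_to_scan between the three caches according to how many
 * objects each one holds on the node being scanned.
 */
static struct demo_split demo_proportion(long nr_to_scan, long dentries,
					 long inodes, long fs_objects)
{
	/* The +1 keeps the divisor non-zero when every cache is empty. */
	long total = dentries + inodes + fs_objects + 1;
	struct demo_split s;

	s.dentries = demo_mult_frac(nr_to_scan, dentries, total);
	s.inodes = demo_mult_frac(nr_to_scan, inodes, total);
	s.fs_objects = demo_mult_frac(nr_to_scan, fs_objects, total);
	return s;
}
```

With a budget of 1024 and a node holding 600 dentries, 300 inodes and 99
filesystem objects, the split comes out as 614, 307 and 101.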

v7: refactoring of the LRU list API in a separate patch
Signed-off-by: Dave Chinner <dchinner-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
Signed-off-by: Glauber Costa <glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
Acked-by: Mel Gorman <mgorman-l3A5Bk7waGM@public.gmane.org>
---
 fs/dcache.c        |  8 +++++---
 fs/inode.c         |  7 ++++---
 fs/internal.h      |  6 ++++--
 fs/super.c         | 23 ++++++++++++++---------
 fs/xfs/xfs_super.c |  6 ++++--
 include/linux/fs.h |  4 ++--
 6 files changed, 33 insertions(+), 21 deletions(-)

diff --git a/fs/dcache.c b/fs/dcache.c
index 30731d3..e07aa73 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -879,6 +879,7 @@ dentry_lru_isolate(struct list_head *item, spinlock_t *lru_lock, void *arg)
  * prune_dcache_sb - shrink the dcache
  * @sb: superblock
  * @nr_to_scan : number of entries to try to free
+ * @nodes_to_walk: which nodes to scan for freeable entities
  *
  * Attempt to shrink the superblock dcache LRU by @nr_to_scan entries. This is
  * done when we need more memory an called from the superblock shrinker
@@ -887,13 +888,14 @@ dentry_lru_isolate(struct list_head *item, spinlock_t *lru_lock, void *arg)
  * This function may fail to free any resources if all the dentries are in
  * use.
  */
-long prune_dcache_sb(struct super_block *sb, unsigned long nr_to_scan)
+long prune_dcache_sb(struct super_block *sb, unsigned long nr_to_scan,
+		     int nid)
 {
 	LIST_HEAD(dispose);
 	long freed;
 
-	freed = list_lru_walk(&sb->s_dentry_lru, dentry_lru_isolate,
-			      &dispose, nr_to_scan);
+	freed = list_lru_walk_node(&sb->s_dentry_lru, nid, dentry_lru_isolate,
+				       &dispose, &nr_to_scan);
 	shrink_dentry_list(&dispose);
 	return freed;
 }
diff --git a/fs/inode.c b/fs/inode.c
index 5d85521..00b804e 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -746,13 +746,14 @@ inode_lru_isolate(struct list_head *item, spinlock_t *lru_lock, void *arg)
  * to trim from the LRU. Inodes to be freed are moved to a temporary list and
  * then are freed outside inode_lock by dispose_list().
  */
-long prune_icache_sb(struct super_block *sb, unsigned long nr_to_scan)
+long prune_icache_sb(struct super_block *sb, unsigned long nr_to_scan,
+		     int nid)
 {
 	LIST_HEAD(freeable);
 	long freed;
 
-	freed = list_lru_walk(&sb->s_inode_lru, inode_lru_isolate,
-						&freeable, nr_to_scan);
+	freed = list_lru_walk_node(&sb->s_inode_lru, nid, inode_lru_isolate,
+				       &freeable, &nr_to_scan);
 	dispose_list(&freeable);
 	return freed;
 }
diff --git a/fs/internal.h b/fs/internal.h
index ea43c89..8902d56 100644
--- a/fs/internal.h
+++ b/fs/internal.h
@@ -110,7 +110,8 @@ extern int open_check_o_direct(struct file *f);
  * inode.c
  */
 extern spinlock_t inode_sb_list_lock;
-extern long prune_icache_sb(struct super_block *sb, unsigned long nr_to_scan);
+extern long prune_icache_sb(struct super_block *sb, unsigned long nr_to_scan,
+			    int nid);
 extern void inode_add_lru(struct inode *inode);
 
 /*
@@ -126,7 +127,8 @@ extern int invalidate_inodes(struct super_block *, bool);
  * dcache.c
  */
 extern struct dentry *__d_alloc(struct super_block *, const struct qstr *);
-extern long prune_dcache_sb(struct super_block *sb, unsigned long nr_to_scan);
+extern long prune_dcache_sb(struct super_block *sb, unsigned long nr_to_scan,
+			    int nid);
 
 /*
  * read_write.c
diff --git a/fs/super.c b/fs/super.c
index 8d8a62c..adbbb1a 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -75,10 +75,10 @@ static long super_cache_scan(struct shrinker *shrink, struct shrink_control *sc)
 		return -1;
 
 	if (sb->s_op && sb->s_op->nr_cached_objects)
-		fs_objects = sb->s_op->nr_cached_objects(sb);
+		fs_objects = sb->s_op->nr_cached_objects(sb, sc->nid);
 
-	inodes = list_lru_count(&sb->s_inode_lru);
-	dentries = list_lru_count(&sb->s_dentry_lru);
+	inodes = list_lru_count_node(&sb->s_inode_lru, sc->nid);
+	dentries = list_lru_count_node(&sb->s_dentry_lru, sc->nid);
 	total_objects = dentries + inodes + fs_objects + 1;
 
 	/* proportion the scan between the caches */
@@ -89,13 +89,14 @@ static long super_cache_scan(struct shrinker *shrink, struct shrink_control *sc)
 	 * prune the dcache first as the icache is pinned by it, then
 	 * prune the icache, followed by the filesystem specific caches
 	 */
-	freed = prune_dcache_sb(sb, dentries);
-	freed += prune_icache_sb(sb, inodes);
+	freed = prune_dcache_sb(sb, dentries, sc->nid);
+	freed += prune_icache_sb(sb, inodes, sc->nid);
 
 	if (fs_objects) {
 		fs_objects = mult_frac(sc->nr_to_scan, fs_objects,
 								total_objects);
-		freed += sb->s_op->free_cached_objects(sb, fs_objects);
+		freed += sb->s_op->free_cached_objects(sb, fs_objects,
+						       sc->nid);
 	}
 
 	drop_super(sb);
@@ -113,10 +114,13 @@ static long super_cache_count(struct shrinker *shrink, struct shrink_control *sc
 		return 0;
 
 	if (sb->s_op && sb->s_op->nr_cached_objects)
-		total_objects = sb->s_op->nr_cached_objects(sb);
+		total_objects = sb->s_op->nr_cached_objects(sb,
+						 sc->nid);
 
-	total_objects += list_lru_count(&sb->s_dentry_lru);
-	total_objects += list_lru_count(&sb->s_inode_lru);
+	total_objects += list_lru_count_node(&sb->s_dentry_lru,
+						 sc->nid);
+	total_objects += list_lru_count_node(&sb->s_inode_lru,
+						 sc->nid);
 
 	total_objects = vfs_pressure_ratio(total_objects);
 	drop_super(sb);
@@ -232,6 +236,7 @@ static struct super_block *alloc_super(struct file_system_type *type, int flags)
 		s->s_shrink.scan_objects = super_cache_scan;
 		s->s_shrink.count_objects = super_cache_count;
 		s->s_shrink.batch = 1024;
+		s->s_shrink.flags = SHRINKER_NUMA_AWARE;
 	}
 out:
 	return s;
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index 1ff991b..fef5e68 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -1525,7 +1525,8 @@ xfs_fs_mount(
 
 static long
 xfs_fs_nr_cached_objects(
-	struct super_block	*sb)
+	struct super_block	*sb,
+	int			nid)
 {
 	return xfs_reclaim_inodes_count(XFS_M(sb));
 }
@@ -1533,7 +1534,8 @@ xfs_fs_nr_cached_objects(
 static long
 xfs_fs_free_cached_objects(
 	struct super_block	*sb,
-	long			nr_to_scan)
+	long			nr_to_scan,
+	int			nid)
 {
 	return xfs_reclaim_inodes_nr(XFS_M(sb), nr_to_scan);
 }
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 0d05a98..d752df5 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1609,8 +1609,8 @@ struct super_operations {
 	ssize_t (*quota_write)(struct super_block *, int, const char *, size_t, loff_t);
 #endif
 	int (*bdev_try_to_free_page)(struct super_block*, struct page*, gfp_t);
-	long (*nr_cached_objects)(struct super_block *);
-	long (*free_cached_objects)(struct super_block *, long);
+	long (*nr_cached_objects)(struct super_block *, int);
+	long (*free_cached_objects)(struct super_block *, long, int);
 };
 
 /*
-- 
1.8.1.4

^ permalink raw reply related	[flat|nested] 103+ messages in thread

* [PATCH v10 16/35] xfs: convert buftarg LRU to generic code
       [not found] ` <1370287804-3481-1-git-send-email-glommer-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org>
                     ` (12 preceding siblings ...)
  2013-06-03 19:29   ` [PATCH v10 15/35] fs: convert inode and dentry shrinking to be node aware Glauber Costa
@ 2013-06-03 19:29   ` Glauber Costa
  2013-06-03 19:29   ` [PATCH v10 17/35] xfs: rework buffer dispose list tracking Glauber Costa
                     ` (16 subsequent siblings)
  30 siblings, 0 replies; 103+ messages in thread
From: Glauber Costa @ 2013-06-03 19:29 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Mel Gorman, Dave Chinner,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, cgroups-u79uwXL29TY76Z2rM5mHXA,
	kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A, Michal Hocko,
	Johannes Weiner, hughd-hpIqsD4AKlfQT0dZR+AlfA, Greg Thelen,
	Glauber Costa, Dave Chinner

From: Dave Chinner <dchinner-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>

Convert the buftarg LRU to use the new generic LRU list and take
advantage of the functionality it supplies to make the buffer cache
shrinker node aware.

* v7: Add NUMA aware flag

Signed-off-by: Glauber Costa <glommer-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org>
Signed-off-by: Dave Chinner <dchinner-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
---
 fs/xfs/xfs_buf.c | 170 ++++++++++++++++++++++++++-----------------------------
 fs/xfs/xfs_buf.h |   5 +-
 2 files changed, 82 insertions(+), 93 deletions(-)

diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
index e8610aa..dcf3c0d 100644
--- a/fs/xfs/xfs_buf.c
+++ b/fs/xfs/xfs_buf.c
@@ -85,20 +85,14 @@ xfs_buf_vmap_len(
  * The LRU takes a new reference to the buffer so that it will only be freed
  * once the shrinker takes the buffer off the LRU.
  */
-STATIC void
+static void
 xfs_buf_lru_add(
 	struct xfs_buf	*bp)
 {
-	struct xfs_buftarg *btp = bp->b_target;
-
-	spin_lock(&btp->bt_lru_lock);
-	if (list_empty(&bp->b_lru)) {
-		atomic_inc(&bp->b_hold);
-		list_add_tail(&bp->b_lru, &btp->bt_lru);
-		btp->bt_lru_nr++;
+	if (list_lru_add(&bp->b_target->bt_lru, &bp->b_lru)) {
 		bp->b_lru_flags &= ~_XBF_LRU_DISPOSE;
+		atomic_inc(&bp->b_hold);
 	}
-	spin_unlock(&btp->bt_lru_lock);
 }
 
 /*
@@ -107,24 +101,13 @@ xfs_buf_lru_add(
  * The unlocked check is safe here because it only occurs when there are not
  * b_lru_ref counts left on the inode under the pag->pag_buf_lock. it is there
  * to optimise the shrinker removing the buffer from the LRU and calling
- * xfs_buf_free(). i.e. it removes an unnecessary round trip on the
- * bt_lru_lock.
+ * xfs_buf_free().
  */
-STATIC void
+static void
 xfs_buf_lru_del(
 	struct xfs_buf	*bp)
 {
-	struct xfs_buftarg *btp = bp->b_target;
-
-	if (list_empty(&bp->b_lru))
-		return;
-
-	spin_lock(&btp->bt_lru_lock);
-	if (!list_empty(&bp->b_lru)) {
-		list_del_init(&bp->b_lru);
-		btp->bt_lru_nr--;
-	}
-	spin_unlock(&btp->bt_lru_lock);
+	list_lru_del(&bp->b_target->bt_lru, &bp->b_lru);
 }
 
 /*
@@ -151,18 +134,10 @@ xfs_buf_stale(
 	bp->b_flags &= ~_XBF_DELWRI_Q;
 
 	atomic_set(&(bp)->b_lru_ref, 0);
-	if (!list_empty(&bp->b_lru)) {
-		struct xfs_buftarg *btp = bp->b_target;
-
-		spin_lock(&btp->bt_lru_lock);
-		if (!list_empty(&bp->b_lru) &&
-		    !(bp->b_lru_flags & _XBF_LRU_DISPOSE)) {
-			list_del_init(&bp->b_lru);
-			btp->bt_lru_nr--;
-			atomic_dec(&bp->b_hold);
-		}
-		spin_unlock(&btp->bt_lru_lock);
-	}
+	if (!(bp->b_lru_flags & _XBF_LRU_DISPOSE) &&
+	    (list_lru_del(&bp->b_target->bt_lru, &bp->b_lru)))
+		atomic_dec(&bp->b_hold);
+
 	ASSERT(atomic_read(&bp->b_hold) >= 1);
 }
 
@@ -1502,83 +1477,97 @@ xfs_buf_iomove(
  * returned. These buffers will have an elevated hold count, so wait on those
  * while freeing all the buffers only held by the LRU.
  */
-void
-xfs_wait_buftarg(
-	struct xfs_buftarg	*btp)
+static enum lru_status
+xfs_buftarg_wait_rele(
+	struct list_head	*item,
+	spinlock_t		*lru_lock,
+	void			*arg)
+
 {
-	struct xfs_buf		*bp;
+	struct xfs_buf		*bp = container_of(item, struct xfs_buf, b_lru);
 
-restart:
-	spin_lock(&btp->bt_lru_lock);
-	while (!list_empty(&btp->bt_lru)) {
-		bp = list_first_entry(&btp->bt_lru, struct xfs_buf, b_lru);
-		if (atomic_read(&bp->b_hold) > 1) {
-			trace_xfs_buf_wait_buftarg(bp, _RET_IP_);
-			list_move_tail(&bp->b_lru, &btp->bt_lru);
-			spin_unlock(&btp->bt_lru_lock);
-			delay(100);
-			goto restart;
-		}
+	if (atomic_read(&bp->b_hold) > 1) {
+		/* need to wait */
+		trace_xfs_buf_wait_buftarg(bp, _RET_IP_);
+		spin_unlock(lru_lock);
+		delay(100);
+	} else {
 		/*
 		 * clear the LRU reference count so the buffer doesn't get
 		 * ignored in xfs_buf_rele().
 		 */
 		atomic_set(&bp->b_lru_ref, 0);
-		spin_unlock(&btp->bt_lru_lock);
+		spin_unlock(lru_lock);
 		xfs_buf_rele(bp);
-		spin_lock(&btp->bt_lru_lock);
 	}
-	spin_unlock(&btp->bt_lru_lock);
+
+	spin_lock(lru_lock);
+	return LRU_RETRY;
 }
 
-int
-xfs_buftarg_shrink(
+void
+xfs_wait_buftarg(
+	struct xfs_buftarg	*btp)
+{
+	while (list_lru_count(&btp->bt_lru))
+		list_lru_walk(&btp->bt_lru, xfs_buftarg_wait_rele,
+			      NULL, LONG_MAX);
+}
+
+static enum lru_status
+xfs_buftarg_isolate(
+	struct list_head	*item,
+	spinlock_t		*lru_lock,
+	void			*arg)
+{
+	struct xfs_buf		*bp = container_of(item, struct xfs_buf, b_lru);
+	struct list_head	*dispose = arg;
+
+	/*
+	 * Decrement the b_lru_ref count unless the value is already
+	 * zero. If the value is already zero, we need to reclaim the
+	 * buffer, otherwise it gets another trip through the LRU.
+	 */
+	if (!atomic_add_unless(&bp->b_lru_ref, -1, 0))
+		return LRU_ROTATE;
+
+	bp->b_lru_flags |= _XBF_LRU_DISPOSE;
+	list_move(item, dispose);
+	return LRU_REMOVED;
+}
+
+static long
+xfs_buftarg_shrink_scan(
 	struct shrinker		*shrink,
 	struct shrink_control	*sc)
 {
 	struct xfs_buftarg	*btp = container_of(shrink,
 					struct xfs_buftarg, bt_shrinker);
-	struct xfs_buf		*bp;
-	int nr_to_scan = sc->nr_to_scan;
 	LIST_HEAD(dispose);
+	long			freed;
+	unsigned long		nr_to_scan = sc->nr_to_scan;
 
-	if (!nr_to_scan)
-		return btp->bt_lru_nr;
-
-	spin_lock(&btp->bt_lru_lock);
-	while (!list_empty(&btp->bt_lru)) {
-		if (nr_to_scan-- <= 0)
-			break;
-
-		bp = list_first_entry(&btp->bt_lru, struct xfs_buf, b_lru);
-
-		/*
-		 * Decrement the b_lru_ref count unless the value is already
-		 * zero. If the value is already zero, we need to reclaim the
-		 * buffer, otherwise it gets another trip through the LRU.
-		 */
-		if (!atomic_add_unless(&bp->b_lru_ref, -1, 0)) {
-			list_move_tail(&bp->b_lru, &btp->bt_lru);
-			continue;
-		}
-
-		/*
-		 * remove the buffer from the LRU now to avoid needing another
-		 * lock round trip inside xfs_buf_rele().
-		 */
-		list_move(&bp->b_lru, &dispose);
-		btp->bt_lru_nr--;
-		bp->b_lru_flags |= _XBF_LRU_DISPOSE;
-	}
-	spin_unlock(&btp->bt_lru_lock);
+	freed = list_lru_walk_node(&btp->bt_lru, sc->nid, xfs_buftarg_isolate,
+				       &dispose, &nr_to_scan);
 
 	while (!list_empty(&dispose)) {
+		struct xfs_buf *bp;
 		bp = list_first_entry(&dispose, struct xfs_buf, b_lru);
 		list_del_init(&bp->b_lru);
 		xfs_buf_rele(bp);
 	}
 
-	return btp->bt_lru_nr;
+	return freed;
+}
+
+static long
+xfs_buftarg_shrink_count(
+	struct shrinker		*shrink,
+	struct shrink_control	*sc)
+{
+	struct xfs_buftarg	*btp = container_of(shrink,
+					struct xfs_buftarg, bt_shrinker);
+	return list_lru_count_node(&btp->bt_lru, sc->nid);
 }
 
 void
@@ -1660,12 +1649,13 @@ xfs_alloc_buftarg(
 	if (!btp->bt_bdi)
 		goto error;
 
-	INIT_LIST_HEAD(&btp->bt_lru);
-	spin_lock_init(&btp->bt_lru_lock);
+	list_lru_init(&btp->bt_lru);
 	if (xfs_setsize_buftarg_early(btp, bdev))
 		goto error;
-	btp->bt_shrinker.shrink = xfs_buftarg_shrink;
+	btp->bt_shrinker.count_objects = xfs_buftarg_shrink_count;
+	btp->bt_shrinker.scan_objects = xfs_buftarg_shrink_scan;
 	btp->bt_shrinker.seeks = DEFAULT_SEEKS;
+	btp->bt_shrinker.flags = SHRINKER_NUMA_AWARE;
 	register_shrinker(&btp->bt_shrinker);
 	return btp;
 
diff --git a/fs/xfs/xfs_buf.h b/fs/xfs/xfs_buf.h
index 433a12e..5ec7d35 100644
--- a/fs/xfs/xfs_buf.h
+++ b/fs/xfs/xfs_buf.h
@@ -25,6 +25,7 @@
 #include <linux/fs.h>
 #include <linux/buffer_head.h>
 #include <linux/uio.h>
+#include <linux/list_lru.h>
 
 /*
  *	Base types
@@ -92,9 +93,7 @@ typedef struct xfs_buftarg {
 
 	/* LRU control structures */
 	struct shrinker		bt_shrinker;
-	struct list_head	bt_lru;
-	spinlock_t		bt_lru_lock;
-	unsigned int		bt_lru_nr;
+	struct list_lru		bt_lru;
 } xfs_buftarg_t;
 
 struct xfs_buf;
-- 
1.8.1.4

* [PATCH v10 17/35] xfs: rework buffer dispose list tracking
       [not found] ` <1370287804-3481-1-git-send-email-glommer-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org>
                     ` (13 preceding siblings ...)
  2013-06-03 19:29   ` [PATCH v10 16/35] xfs: convert buftarg LRU to generic code Glauber Costa
@ 2013-06-03 19:29   ` Glauber Costa
  2013-06-03 19:29   ` [PATCH v10 18/35] xfs: convert dquot cache lru to list_lru Glauber Costa
                     ` (15 subsequent siblings)
  30 siblings, 0 replies; 103+ messages in thread
From: Glauber Costa @ 2013-06-03 19:29 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Mel Gorman, Dave Chinner,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, cgroups-u79uwXL29TY76Z2rM5mHXA,
	kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A, Michal Hocko,
	Johannes Weiner, hughd-hpIqsD4AKlfQT0dZR+AlfA, Greg Thelen,
	Dave Chinner, Glauber Costa

From: Dave Chinner <dchinner-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>

In converting the buffer LRU lists to use the generic code, the
locking for marking the buffers as on the dispose list was lost.
This results in confusion in LRU buffer tracking and accounting,
resulting in reference counts being mucked up and filesystems being
unmountable.

To fix this, introduce an internal buffer spinlock to protect the
state field that holds the dispose list information. Because locking
is now needed around xfs_buf_lru_add/del, and they are used in
exactly one place each, two lines apart, get rid of the wrappers
and code the logic directly in place.
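
As a rough userspace illustration of the locking rule described above (not
the kernel code itself), the sketch below guards a dispose flag with a
per-object lock and uses a trylock on the reclaim side, where the lock order
is inverted. Every demo_* name is hypothetical, and a C11 atomic_flag stands
in for the kernel spinlock.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define DEMO_STATE_DISPOSE	(1u << 0)	/* stand-in for XFS_BSTATE_DISPOSE */

struct demo_buf {
	atomic_flag	lock;	/* stand-in for bp->b_lock */
	unsigned int	state;	/* stand-in for bp->b_state */
	bool		on_lru;	/* stand-in for LRU list membership */
	int		hold;	/* stand-in for bp->b_hold */
};

static void demo_lock(struct demo_buf *bp)
{
	while (atomic_flag_test_and_set(&bp->lock))
		;	/* spin until the flag is released */
}

static bool demo_trylock(struct demo_buf *bp)
{
	return !atomic_flag_test_and_set(&bp->lock);
}

static void demo_unlock(struct demo_buf *bp)
{
	atomic_flag_clear(&bp->lock);
}

static void demo_buf_init(struct demo_buf *bp)
{
	atomic_flag_clear(&bp->lock);
	bp->state = 0;
	bp->on_lru = false;
	bp->hold = 1;
}

/* Release path: add to the LRU and clear a stale dispose flag under the lock. */
static void demo_buf_lru_add(struct demo_buf *bp)
{
	demo_lock(bp);
	if (!bp->on_lru) {
		bp->on_lru = true;
		bp->state &= ~DEMO_STATE_DISPOSE;
		bp->hold++;	/* the LRU takes its own reference */
	}
	demo_unlock(bp);
}

/* Stale path: drop the LRU reference unless reclaim already claimed the buffer. */
static bool demo_buf_stale(struct demo_buf *bp)
{
	bool removed = false;

	demo_lock(bp);
	if (!(bp->state & DEMO_STATE_DISPOSE) && bp->on_lru) {
		bp->on_lru = false;
		bp->hold--;
		removed = true;
	}
	demo_unlock(bp);
	return removed;
}

/* Reclaim path: lock order is inverted here, so only try the lock. */
static bool demo_buf_isolate(struct demo_buf *bp)
{
	if (!demo_trylock(bp))
		return false;	/* comparable to returning LRU_SKIP */
	bp->state |= DEMO_STATE_DISPOSE;
	bp->on_lru = false;	/* moved to a private dispose list */
	demo_unlock(bp);
	return true;
}
```

Once demo_buf_isolate() has set the flag, demo_buf_stale() leaves the hold
count alone, which is the accounting property the lock is meant to protect.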

Further, the LRU emptying code used on unmount is less than optimal.
Convert it to use a dispose list as per a normal shrinker walk, and
repeat the walk that fills the dispose list until the LRU is empty.
This avoids needing to drop and regain the LRU lock for every item
being freed, and allows the same logic as the shrinker isolate call
to be used. Simpler, easier to understand.
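
The walk-then-dispose pattern described above can be sketched in plain C
roughly as follows. This is only an illustration under simplifying
assumptions: a singly linked list replaces list_lru, no locking is shown, and
all demo_* names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

struct demo_item {
	struct demo_item	*next;
	int			hold;		/* 1 means only the LRU holds it */
	int			released;	/* set once disposed */
};

struct demo_list {
	struct demo_item	*head;
	size_t			count;
};

static void demo_push(struct demo_list *l, struct demo_item *it)
{
	it->next = l->head;
	l->head = it;
	l->count++;
}

/*
 * One walk over the LRU: items held only by the LRU are moved to the
 * dispose list, busy items are skipped and stay on the LRU.
 */
static size_t demo_walk(struct demo_list *lru, struct demo_list *dispose)
{
	struct demo_item **link = &lru->head;
	size_t moved = 0;

	while (*link) {
		struct demo_item *it = *link;

		if (it->hold > 1) {
			link = &it->next;	/* busy, skip this pass */
			continue;
		}
		*link = it->next;		/* unlink from the LRU */
		lru->count--;
		demo_push(dispose, it);
		moved++;
	}
	return moved;
}

/* Release everything on the dispose list, outside of the walk. */
static size_t demo_dispose(struct demo_list *dispose)
{
	size_t n = 0;

	while (dispose->head) {
		struct demo_item *it = dispose->head;

		dispose->head = it->next;
		dispose->count--;
		it->next = NULL;
		it->hold--;
		it->released = 1;
		n++;
	}
	return n;
}
```

A caller would repeat demo_walk() followed by demo_dispose() until
lru.count reaches zero, sleeping between passes if busy items remain.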

Signed-off-by: Dave Chinner <dchinner-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
Signed-off-by: Glauber Costa <glommer-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org>
---
 fs/xfs/xfs_buf.c | 125 +++++++++++++++++++++++++++++++------------------------
 fs/xfs/xfs_buf.h |  12 ++++--
 2 files changed, 79 insertions(+), 58 deletions(-)

diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
index dcf3c0d..0d7a619 100644
--- a/fs/xfs/xfs_buf.c
+++ b/fs/xfs/xfs_buf.c
@@ -80,37 +80,6 @@ xfs_buf_vmap_len(
 }
 
 /*
- * xfs_buf_lru_add - add a buffer to the LRU.
- *
- * The LRU takes a new reference to the buffer so that it will only be freed
- * once the shrinker takes the buffer off the LRU.
- */
-static void
-xfs_buf_lru_add(
-	struct xfs_buf	*bp)
-{
-	if (list_lru_add(&bp->b_target->bt_lru, &bp->b_lru)) {
-		bp->b_lru_flags &= ~_XBF_LRU_DISPOSE;
-		atomic_inc(&bp->b_hold);
-	}
-}
-
-/*
- * xfs_buf_lru_del - remove a buffer from the LRU
- *
- * The unlocked check is safe here because it only occurs when there are not
- * b_lru_ref counts left on the inode under the pag->pag_buf_lock. it is there
- * to optimise the shrinker removing the buffer from the LRU and calling
- * xfs_buf_free().
- */
-static void
-xfs_buf_lru_del(
-	struct xfs_buf	*bp)
-{
-	list_lru_del(&bp->b_target->bt_lru, &bp->b_lru);
-}
-
-/*
  * When we mark a buffer stale, we remove the buffer from the LRU and clear the
  * b_lru_ref count so that the buffer is freed immediately when the buffer
  * reference count falls to zero. If the buffer is already on the LRU, we need
@@ -133,12 +102,14 @@ xfs_buf_stale(
 	 */
 	bp->b_flags &= ~_XBF_DELWRI_Q;
 
-	atomic_set(&(bp)->b_lru_ref, 0);
-	if (!(bp->b_lru_flags & _XBF_LRU_DISPOSE) &&
+	spin_lock(&bp->b_lock);
+	atomic_set(&bp->b_lru_ref, 0);
+	if (!(bp->b_state & XFS_BSTATE_DISPOSE) &&
 	    (list_lru_del(&bp->b_target->bt_lru, &bp->b_lru)))
 		atomic_dec(&bp->b_hold);
 
 	ASSERT(atomic_read(&bp->b_hold) >= 1);
+	spin_unlock(&bp->b_lock);
 }
 
 static int
@@ -202,6 +173,7 @@ _xfs_buf_alloc(
 	INIT_LIST_HEAD(&bp->b_list);
 	RB_CLEAR_NODE(&bp->b_rbnode);
 	sema_init(&bp->b_sema, 0); /* held, no waiters */
+	spin_lock_init(&bp->b_lock);
 	XB_SET_OWNER(bp);
 	bp->b_target = target;
 	bp->b_flags = flags;
@@ -891,12 +863,33 @@ xfs_buf_rele(
 
 	ASSERT(atomic_read(&bp->b_hold) > 0);
 	if (atomic_dec_and_lock(&bp->b_hold, &pag->pag_buf_lock)) {
-		if (!(bp->b_flags & XBF_STALE) &&
-			   atomic_read(&bp->b_lru_ref)) {
-			xfs_buf_lru_add(bp);
+		spin_lock(&bp->b_lock);
+		if (!(bp->b_flags & XBF_STALE) && atomic_read(&bp->b_lru_ref)) {
+			/*
+			 * If the buffer is added to the LRU take a new
+			 * reference to the buffer for the LRU and clear the
+			 * (now stale) dispose list state flag
+			 */
+			if (list_lru_add(&bp->b_target->bt_lru, &bp->b_lru)) {
+				bp->b_state &= ~XFS_BSTATE_DISPOSE;
+				atomic_inc(&bp->b_hold);
+			}
+			spin_unlock(&bp->b_lock);
 			spin_unlock(&pag->pag_buf_lock);
 		} else {
-			xfs_buf_lru_del(bp);
+			/*
+			 * most of the time buffers will already be removed from
+			 * the LRU, so optimise that case by checking for the
+			 * XFS_BSTATE_DISPOSE flag indicating the last list the
+			 * buffer was on was the disposal list
+			 */
+			if (!(bp->b_state & XFS_BSTATE_DISPOSE)) {
+				list_lru_del(&bp->b_target->bt_lru, &bp->b_lru);
+			} else {
+				ASSERT(list_empty(&bp->b_lru));
+			}
+			spin_unlock(&bp->b_lock);
+
 			ASSERT(!(bp->b_flags & _XBF_DELWRI_Q));
 			rb_erase(&bp->b_rbnode, &pag->pag_buf_tree);
 			spin_unlock(&pag->pag_buf_lock);
@@ -1485,33 +1478,48 @@ xfs_buftarg_wait_rele(
 
 {
 	struct xfs_buf		*bp = container_of(item, struct xfs_buf, b_lru);
+	struct list_head	*dispose = arg;
 
 	if (atomic_read(&bp->b_hold) > 1) {
-		/* need to wait */
+		/* need to wait, so skip it this pass */
 		trace_xfs_buf_wait_buftarg(bp, _RET_IP_);
-		spin_unlock(lru_lock);
-		delay(100);
-	} else {
-		/*
-		 * clear the LRU reference count so the buffer doesn't get
-		 * ignored in xfs_buf_rele().
-		 */
-		atomic_set(&bp->b_lru_ref, 0);
-		spin_unlock(lru_lock);
-		xfs_buf_rele(bp);
+		return LRU_SKIP;
 	}
+	if (!spin_trylock(&bp->b_lock))
+		return LRU_SKIP;
 
-	spin_lock(lru_lock);
-	return LRU_RETRY;
+	/*
+	 * clear the LRU reference count so the buffer doesn't get
+	 * ignored in xfs_buf_rele().
+	 */
+	atomic_set(&bp->b_lru_ref, 0);
+	bp->b_state |= XFS_BSTATE_DISPOSE;
+	list_move(item, dispose);
+	spin_unlock(&bp->b_lock);
+	return LRU_REMOVED;
 }
 
 void
 xfs_wait_buftarg(
 	struct xfs_buftarg	*btp)
 {
-	while (list_lru_count(&btp->bt_lru))
+	LIST_HEAD(dispose);
+	int loop = 0;
+
+	/* loop until there is nothing left on the lru list. */
+	while (list_lru_count(&btp->bt_lru)) {
 		list_lru_walk(&btp->bt_lru, xfs_buftarg_wait_rele,
-			      NULL, LONG_MAX);
+			      &dispose, LONG_MAX);
+
+		while (!list_empty(&dispose)) {
+			struct xfs_buf *bp;
+			bp = list_first_entry(&dispose, struct xfs_buf, b_lru);
+			list_del_init(&bp->b_lru);
+			xfs_buf_rele(bp);
+		}
+		if (loop++ != 0)
+			delay(100);
+	}
 }
 
 static enum lru_status
@@ -1524,15 +1532,24 @@ xfs_buftarg_isolate(
 	struct list_head	*dispose = arg;
 
 	/*
+	 * we are inverting the lru lock/bp->b_lock here, so use a trylock.
+	 * If we fail to get the lock, just skip it.
+	 */
+	if (!spin_trylock(&bp->b_lock))
+		return LRU_SKIP;
+	/*
 	 * Decrement the b_lru_ref count unless the value is already
 	 * zero. If the value is already zero, we need to reclaim the
 	 * buffer, otherwise it gets another trip through the LRU.
 	 */
-	if (!atomic_add_unless(&bp->b_lru_ref, -1, 0))
+	if (!atomic_add_unless(&bp->b_lru_ref, -1, 0)) {
+		spin_unlock(&bp->b_lock);
 		return LRU_ROTATE;
+	}
 
-	bp->b_lru_flags |= _XBF_LRU_DISPOSE;
+	bp->b_state |= XFS_BSTATE_DISPOSE;
 	list_move(item, dispose);
+	spin_unlock(&bp->b_lock);
 	return LRU_REMOVED;
 }
 
diff --git a/fs/xfs/xfs_buf.h b/fs/xfs/xfs_buf.h
index 5ec7d35..e656833 100644
--- a/fs/xfs/xfs_buf.h
+++ b/fs/xfs/xfs_buf.h
@@ -60,7 +60,6 @@ typedef enum {
 #define _XBF_KMEM	 (1 << 21)/* backed by heap memory */
 #define _XBF_DELWRI_Q	 (1 << 22)/* buffer on a delwri queue */
 #define _XBF_COMPOUND	 (1 << 23)/* compound buffer */
-#define _XBF_LRU_DISPOSE (1 << 24)/* buffer being discarded */
 
 typedef unsigned int xfs_buf_flags_t;
 
@@ -79,8 +78,12 @@ typedef unsigned int xfs_buf_flags_t;
 	{ _XBF_PAGES,		"PAGES" }, \
 	{ _XBF_KMEM,		"KMEM" }, \
 	{ _XBF_DELWRI_Q,	"DELWRI_Q" }, \
-	{ _XBF_COMPOUND,	"COMPOUND" }, \
-	{ _XBF_LRU_DISPOSE,	"LRU_DISPOSE" }
+	{ _XBF_COMPOUND,	"COMPOUND" }
+
+/*
+ * Internal state flags.
+ */
+#define XFS_BSTATE_DISPOSE	 (1 << 0)	/* buffer being discarded */
 
 typedef struct xfs_buftarg {
 	dev_t			bt_dev;
@@ -136,7 +139,8 @@ typedef struct xfs_buf {
 	 * bt_lru_lock and not by b_sema
 	 */
 	struct list_head	b_lru;		/* lru list */
-	xfs_buf_flags_t		b_lru_flags;	/* internal lru status flags */
+	spinlock_t		b_lock;		/* internal state lock */
+	unsigned int		b_state;	/* internal state flags */
 	wait_queue_head_t	b_waiters;	/* unpin waiters */
 	struct list_head	b_list;
 	struct xfs_perag	*b_pag;		/* contains rbtree root */
-- 
1.8.1.4

* [PATCH v10 18/35] xfs: convert dquot cache lru to list_lru
       [not found] ` <1370287804-3481-1-git-send-email-glommer-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org>
                     ` (14 preceding siblings ...)
  2013-06-03 19:29   ` [PATCH v10 17/35] xfs: rework buffer dispose list tracking Glauber Costa
@ 2013-06-03 19:29   ` Glauber Costa
  2013-06-03 19:29   ` [PATCH v10 19/35] fs: convert fs shrinkers to new scan/count API Glauber Costa
                     ` (14 subsequent siblings)
  30 siblings, 0 replies; 103+ messages in thread
From: Glauber Costa @ 2013-06-03 19:29 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Mel Gorman, Dave Chinner,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, cgroups-u79uwXL29TY76Z2rM5mHXA,
	kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A, Michal Hocko,
	Johannes Weiner, hughd-hpIqsD4AKlfQT0dZR+AlfA, Greg Thelen,
	Dave Chinner, Glauber Costa

From: Dave Chinner <dchinner-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>

Convert the XFS dquot lru to use the list_lru construct and convert
the shrinker to being node aware.
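
To make "node aware" concrete, here is a small userspace sketch of per-node
counting and scanning. It only models the idea: the real list_lru keeps a
list and lock per node, while this version keeps a bare counter per node.
DEMO_MAX_NODES and the demo_* functions are hypothetical names.

```c
#include <assert.h>

#define DEMO_MAX_NODES	4	/* hypothetical node count for the sketch */

struct demo_lru {
	long	nr_items[DEMO_MAX_NODES];	/* one counter per NUMA node */
};

/* Account one object on the node it lives on. */
static int demo_lru_add(struct demo_lru *lru, int nid)
{
	assert(nid >= 0 && nid < DEMO_MAX_NODES);
	lru->nr_items[nid]++;
	return 1;
}

/* Per-node count: what a node-aware count callback would report. */
static long demo_lru_count_node(const struct demo_lru *lru, int nid)
{
	assert(nid >= 0 && nid < DEMO_MAX_NODES);
	return lru->nr_items[nid];
}

/* Global count: the sum over all nodes. */
static long demo_lru_count(const struct demo_lru *lru)
{
	long total = 0;

	for (int nid = 0; nid < DEMO_MAX_NODES; nid++)
		total += lru->nr_items[nid];
	return total;
}

/* Per-node scan: touches only the requested node, returns objects freed. */
static long demo_lru_scan_node(struct demo_lru *lru, int nid,
			       unsigned long nr_to_scan)
{
	long freed = 0;

	assert(nid >= 0 && nid < DEMO_MAX_NODES);
	while (nr_to_scan-- > 0 && lru->nr_items[nid] > 0) {
		lru->nr_items[nid]--;
		freed++;
	}
	return freed;
}
```

Scanning node 0 leaves the other nodes untouched, so memory pressure on one
node does not empty the caches of the others.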

* v7: Add NUMA aware flag
[ glommer: edited for conflicts + warning fixes ]
Signed-off-by: Dave Chinner <dchinner-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
Signed-off-by: Glauber Costa <glommer-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org>
---
 fs/xfs/xfs_dquot.c |   7 +-
 fs/xfs/xfs_qm.c    | 277 +++++++++++++++++++++++++++--------------------------
 fs/xfs/xfs_qm.h    |   4 +-
 3 files changed, 144 insertions(+), 144 deletions(-)

diff --git a/fs/xfs/xfs_dquot.c b/fs/xfs/xfs_dquot.c
index a41f8bf..4e9178d 100644
--- a/fs/xfs/xfs_dquot.c
+++ b/fs/xfs/xfs_dquot.c
@@ -950,13 +950,8 @@ xfs_qm_dqput_final(
 
 	trace_xfs_dqput_free(dqp);
 
-	mutex_lock(&qi->qi_lru_lock);
-	if (list_empty(&dqp->q_lru)) {
-		list_add_tail(&dqp->q_lru, &qi->qi_lru_list);
-		qi->qi_lru_count++;
+	if (list_lru_add(&qi->qi_lru, &dqp->q_lru))
 		XFS_STATS_INC(xs_qm_dquot_unused);
-	}
-	mutex_unlock(&qi->qi_lru_lock);
 
 	/*
 	 * If we just added a udquot to the freelist, then we want to release
diff --git a/fs/xfs/xfs_qm.c b/fs/xfs/xfs_qm.c
index 7ade175..85ca39e 100644
--- a/fs/xfs/xfs_qm.c
+++ b/fs/xfs/xfs_qm.c
@@ -50,8 +50,9 @@
  */
 STATIC int	xfs_qm_init_quotainos(xfs_mount_t *);
 STATIC int	xfs_qm_init_quotainfo(xfs_mount_t *);
-STATIC int	xfs_qm_shake(struct shrinker *, struct shrink_control *);
 
+
+STATIC void	xfs_qm_dqfree_one(struct xfs_dquot *dqp);
 /*
  * We use the batch lookup interface to iterate over the dquots as it
  * currently is the only interface into the radix tree code that allows
@@ -196,12 +197,9 @@ xfs_qm_dqpurge(
 	 * We move dquots to the freelist as soon as their reference count
 	 * hits zero, so it really should be on the freelist here.
 	 */
-	mutex_lock(&qi->qi_lru_lock);
 	ASSERT(!list_empty(&dqp->q_lru));
-	list_del_init(&dqp->q_lru);
-	qi->qi_lru_count--;
+	list_lru_del(&qi->qi_lru, &dqp->q_lru);
 	XFS_STATS_DEC(xs_qm_dquot_unused);
-	mutex_unlock(&qi->qi_lru_lock);
 
 	xfs_qm_dqdestroy(dqp);
 
@@ -631,6 +629,141 @@ xfs_qm_calc_dquots_per_chunk(
 	return ndquots;
 }
 
+struct xfs_qm_isolate {
+	struct list_head	buffers;
+	struct list_head	dispose;
+};
+
+static enum lru_status
+xfs_qm_dquot_isolate(
+	struct list_head	*item,
+	spinlock_t		*lru_lock,
+	void			*arg)
+{
+	struct xfs_dquot	*dqp = container_of(item,
+						struct xfs_dquot, q_lru);
+	struct xfs_qm_isolate	*isol = arg;
+
+	if (!xfs_dqlock_nowait(dqp))
+		goto out_miss_busy;
+
+	/*
+	 * This dquot has acquired a reference in the meantime remove it from
+	 * the freelist and try again.
+	 */
+	if (dqp->q_nrefs) {
+		xfs_dqunlock(dqp);
+		XFS_STATS_INC(xs_qm_dqwants);
+
+		trace_xfs_dqreclaim_want(dqp);
+		list_del_init(&dqp->q_lru);
+		XFS_STATS_DEC(xs_qm_dquot_unused);
+		return LRU_REMOVED;
+	}
+
+	/*
+	 * If the dquot is dirty, flush it. If it's already being flushed, just
+	 * skip it so there is time for the IO to complete before we try to
+	 * reclaim it again on the next LRU pass.
+	 */
+	if (!xfs_dqflock_nowait(dqp)) {
+		xfs_dqunlock(dqp);
+		goto out_miss_busy;
+	}
+
+	if (XFS_DQ_IS_DIRTY(dqp)) {
+		struct xfs_buf	*bp = NULL;
+		int		error;
+
+		trace_xfs_dqreclaim_dirty(dqp);
+
+		/* we have to drop the LRU lock to flush the dquot */
+		spin_unlock(lru_lock);
+
+		error = xfs_qm_dqflush(dqp, &bp);
+		if (error) {
+			xfs_warn(dqp->q_mount, "%s: dquot %p flush failed",
+				 __func__, dqp);
+			goto out_unlock_dirty;
+		}
+
+		xfs_buf_delwri_queue(bp, &isol->buffers);
+		xfs_buf_relse(bp);
+		goto out_unlock_dirty;
+	}
+	xfs_dqfunlock(dqp);
+
+	/*
+	 * Prevent lookups now that we are past the point of no return.
+	 */
+	dqp->dq_flags |= XFS_DQ_FREEING;
+	xfs_dqunlock(dqp);
+
+	ASSERT(dqp->q_nrefs == 0);
+	list_move_tail(&dqp->q_lru, &isol->dispose);
+	XFS_STATS_DEC(xs_qm_dquot_unused);
+	trace_xfs_dqreclaim_done(dqp);
+	XFS_STATS_INC(xs_qm_dqreclaims);
+	return LRU_REMOVED;
+
+out_miss_busy:
+	trace_xfs_dqreclaim_busy(dqp);
+	XFS_STATS_INC(xs_qm_dqreclaim_misses);
+	return LRU_SKIP;
+
+out_unlock_dirty:
+	trace_xfs_dqreclaim_busy(dqp);
+	XFS_STATS_INC(xs_qm_dqreclaim_misses);
+	return LRU_RETRY;
+}
+
+static long
+xfs_qm_shrink_scan(
+	struct shrinker		*shrink,
+	struct shrink_control	*sc)
+{
+	struct xfs_quotainfo	*qi = container_of(shrink,
+					struct xfs_quotainfo, qi_shrinker);
+	struct xfs_qm_isolate	isol;
+	long			freed;
+	int			error;
+	unsigned long		nr_to_scan = sc->nr_to_scan;
+
+	if ((sc->gfp_mask & (__GFP_FS|__GFP_WAIT)) != (__GFP_FS|__GFP_WAIT))
+		return 0;
+
+	INIT_LIST_HEAD(&isol.buffers);
+	INIT_LIST_HEAD(&isol.dispose);
+
+	freed = list_lru_walk_node(&qi->qi_lru, sc->nid, xfs_qm_dquot_isolate, &isol,
+					&nr_to_scan);
+
+	error = xfs_buf_delwri_submit(&isol.buffers);
+	if (error)
+		xfs_warn(NULL, "%s: dquot reclaim failed", __func__);
+
+	while (!list_empty(&isol.dispose)) {
+		struct xfs_dquot	*dqp;
+
+		dqp = list_first_entry(&isol.dispose, struct xfs_dquot, q_lru);
+		list_del_init(&dqp->q_lru);
+		xfs_qm_dqfree_one(dqp);
+	}
+
+	return freed;
+}
+
+static long
+xfs_qm_shrink_count(
+	struct shrinker		*shrink,
+	struct shrink_control	*sc)
+{
+	struct xfs_quotainfo	*qi = container_of(shrink,
+					struct xfs_quotainfo, qi_shrinker);
+
+	return list_lru_count_node(&qi->qi_lru, sc->nid);
+}
+
 /*
  * This initializes all the quota information that's kept in the
  * mount structure
@@ -661,9 +794,7 @@ xfs_qm_init_quotainfo(
 	INIT_RADIX_TREE(&qinf->qi_gquota_tree, GFP_NOFS);
 	mutex_init(&qinf->qi_tree_lock);
 
-	INIT_LIST_HEAD(&qinf->qi_lru_list);
-	qinf->qi_lru_count = 0;
-	mutex_init(&qinf->qi_lru_lock);
+	list_lru_init(&qinf->qi_lru);
 
 	/* mutex used to serialize quotaoffs */
 	mutex_init(&qinf->qi_quotaofflock);
@@ -729,8 +860,10 @@ xfs_qm_init_quotainfo(
 		qinf->qi_rtbwarnlimit = XFS_QM_RTBWARNLIMIT;
 	}
 
-	qinf->qi_shrinker.shrink = xfs_qm_shake;
+	qinf->qi_shrinker.count_objects = xfs_qm_shrink_count;
+	qinf->qi_shrinker.scan_objects = xfs_qm_shrink_scan;
 	qinf->qi_shrinker.seeks = DEFAULT_SEEKS;
+	qinf->qi_shrinker.flags = SHRINKER_NUMA_AWARE;
 	register_shrinker(&qinf->qi_shrinker);
 	return 0;
 }
@@ -1462,132 +1595,6 @@ xfs_qm_dqfree_one(
 	xfs_qm_dqdestroy(dqp);
 }
 
-STATIC void
-xfs_qm_dqreclaim_one(
-	struct xfs_dquot	*dqp,
-	struct list_head	*buffer_list,
-	struct list_head	*dispose_list)
-{
-	struct xfs_mount	*mp = dqp->q_mount;
-	struct xfs_quotainfo	*qi = mp->m_quotainfo;
-	int			error;
-
-	if (!xfs_dqlock_nowait(dqp))
-		goto out_move_tail;
-
-	/*
-	 * This dquot has acquired a reference in the meantime remove it from
-	 * the freelist and try again.
-	 */
-	if (dqp->q_nrefs) {
-		xfs_dqunlock(dqp);
-
-		trace_xfs_dqreclaim_want(dqp);
-		XFS_STATS_INC(xs_qm_dqwants);
-
-		list_del_init(&dqp->q_lru);
-		qi->qi_lru_count--;
-		XFS_STATS_DEC(xs_qm_dquot_unused);
-		return;
-	}
-
-	/*
-	 * Try to grab the flush lock. If this dquot is in the process of
-	 * getting flushed to disk, we don't want to reclaim it.
-	 */
-	if (!xfs_dqflock_nowait(dqp))
-		goto out_unlock_move_tail;
-
-	if (XFS_DQ_IS_DIRTY(dqp)) {
-		struct xfs_buf	*bp = NULL;
-
-		trace_xfs_dqreclaim_dirty(dqp);
-
-		error = xfs_qm_dqflush(dqp, &bp);
-		if (error) {
-			xfs_warn(mp, "%s: dquot %p flush failed",
-				 __func__, dqp);
-			goto out_unlock_move_tail;
-		}
-
-		xfs_buf_delwri_queue(bp, buffer_list);
-		xfs_buf_relse(bp);
-		/*
-		 * Give the dquot another try on the freelist, as the
-		 * flushing will take some time.
-		 */
-		goto out_unlock_move_tail;
-	}
-	xfs_dqfunlock(dqp);
-
-	/*
-	 * Prevent lookups now that we are past the point of no return.
-	 */
-	dqp->dq_flags |= XFS_DQ_FREEING;
-	xfs_dqunlock(dqp);
-
-	ASSERT(dqp->q_nrefs == 0);
-	list_move_tail(&dqp->q_lru, dispose_list);
-	qi->qi_lru_count--;
-	XFS_STATS_DEC(xs_qm_dquot_unused);
-
-	trace_xfs_dqreclaim_done(dqp);
-	XFS_STATS_INC(xs_qm_dqreclaims);
-	return;
-
-	/*
-	 * Move the dquot to the tail of the list so that we don't spin on it.
-	 */
-out_unlock_move_tail:
-	xfs_dqunlock(dqp);
-out_move_tail:
-	list_move_tail(&dqp->q_lru, &qi->qi_lru_list);
-	trace_xfs_dqreclaim_busy(dqp);
-	XFS_STATS_INC(xs_qm_dqreclaim_misses);
-}
-
-STATIC int
-xfs_qm_shake(
-	struct shrinker		*shrink,
-	struct shrink_control	*sc)
-{
-	struct xfs_quotainfo	*qi =
-		container_of(shrink, struct xfs_quotainfo, qi_shrinker);
-	int			nr_to_scan = sc->nr_to_scan;
-	LIST_HEAD		(buffer_list);
-	LIST_HEAD		(dispose_list);
-	struct xfs_dquot	*dqp;
-	int			error;
-
-	if ((sc->gfp_mask & (__GFP_FS|__GFP_WAIT)) != (__GFP_FS|__GFP_WAIT))
-		return 0;
-	if (!nr_to_scan)
-		goto out;
-
-	mutex_lock(&qi->qi_lru_lock);
-	while (!list_empty(&qi->qi_lru_list)) {
-		if (nr_to_scan-- <= 0)
-			break;
-		dqp = list_first_entry(&qi->qi_lru_list, struct xfs_dquot,
-				       q_lru);
-		xfs_qm_dqreclaim_one(dqp, &buffer_list, &dispose_list);
-	}
-	mutex_unlock(&qi->qi_lru_lock);
-
-	error = xfs_buf_delwri_submit(&buffer_list);
-	if (error)
-		xfs_warn(NULL, "%s: dquot reclaim failed", __func__);
-
-	while (!list_empty(&dispose_list)) {
-		dqp = list_first_entry(&dispose_list, struct xfs_dquot, q_lru);
-		list_del_init(&dqp->q_lru);
-		xfs_qm_dqfree_one(dqp);
-	}
-
-out:
-	return vfs_pressure_ratio(qi->qi_lru_count);
-}
-
 /*
  * Start a transaction and write the incore superblock changes to
  * disk. flags parameter indicates which fields have changed.
diff --git a/fs/xfs/xfs_qm.h b/fs/xfs/xfs_qm.h
index 5d16a6e..8173b5e 100644
--- a/fs/xfs/xfs_qm.h
+++ b/fs/xfs/xfs_qm.h
@@ -47,9 +47,7 @@ typedef struct xfs_quotainfo {
 	struct mutex qi_tree_lock;
 	xfs_inode_t	*qi_uquotaip;	 /* user quota inode */
 	xfs_inode_t	*qi_gquotaip;	 /* group quota inode */
-	struct list_head qi_lru_list;
-	struct mutex	 qi_lru_lock;
-	int		 qi_lru_count;
+	struct list_lru	 qi_lru;
 	int		 qi_dquots;
 	time_t		 qi_btimelimit;	 /* limit for blks timer */
 	time_t		 qi_itimelimit;	 /* limit for inodes timer */
-- 
1.8.1.4

* [PATCH v10 19/35] fs: convert fs shrinkers to new scan/count API
       [not found] ` <1370287804-3481-1-git-send-email-glommer-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org>
                     ` (15 preceding siblings ...)
  2013-06-03 19:29   ` [PATCH v10 18/35] xfs: convert dquot cache lru to list_lru Glauber Costa
@ 2013-06-03 19:29   ` Glauber Costa
  2013-06-03 19:29   ` [PATCH v10 21/35] i915: bail out earlier when shrinker cannot acquire mutex Glauber Costa
                     ` (13 subsequent siblings)
  30 siblings, 0 replies; 103+ messages in thread
From: Glauber Costa @ 2013-06-03 19:29 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Mel Gorman, Dave Chinner,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, cgroups-u79uwXL29TY76Z2rM5mHXA,
	kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A, Michal Hocko,
	Johannes Weiner, hughd-hpIqsD4AKlfQT0dZR+AlfA, Greg Thelen,
	Dave Chinner, Glauber Costa, Adrian Hunter

From: Dave Chinner <dchinner-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>

Convert the filesystem shrinkers to use the new API, and standardise
some of the behaviours of the shrinkers at the same time. For
example, nr_to_scan means the number of objects to scan, not the
number of objects to free.
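
The distinction between objects scanned and objects freed can be shown with
a toy cache. This is a userspace sketch, not any of the converted shrinkers:
the demo_* names and the second-chance policy are assumptions made for
illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define DEMO_NR_OBJECTS	4	/* hypothetical cache size */

struct demo_object {
	bool	live;		/* still in the cache */
	bool	referenced;	/* recently used, gets a second chance */
};

struct demo_cache {
	struct demo_object	objs[DEMO_NR_OBJECTS];
};

/* Count callback: report how many objects could be scanned, free nothing. */
static long demo_count_objects(const struct demo_cache *c)
{
	long nr = 0;

	for (size_t i = 0; i < DEMO_NR_OBJECTS; i++)
		if (c->objs[i].live)
			nr++;
	return nr;
}

/*
 * Scan callback: examine at most nr_to_scan live objects and return how
 * many were actually freed. A referenced object uses up scan budget but
 * is not freed on this pass.
 */
static long demo_scan_objects(struct demo_cache *c, unsigned long nr_to_scan)
{
	long freed = 0;

	for (size_t i = 0; i < DEMO_NR_OBJECTS && nr_to_scan > 0; i++) {
		struct demo_object *o = &c->objs[i];

		if (!o->live)
			continue;
		nr_to_scan--;
		if (o->referenced) {
			o->referenced = false;
			continue;
		}
		o->live = false;
		freed++;
	}
	return freed;
}
```

With a budget of three, the scan may free fewer than three objects; the
return value reports the freed count, not the budget consumed.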

I refactored the CIFS idmap shrinker a little - it really needs to
be broken up into a shrinker per tree and keep an item count with
the tree root so that we don't need to walk the tree every time the
shrinker needs to count the number of objects in the tree (i.e.
all the time under memory pressure).
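
The suggestion of keeping an item count with the tree root might look like
the sketch below. It uses a plain binary search tree in userspace rather than
the CIFS idmap structures, and every demo_* name is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct demo_node {
	int			key;
	struct demo_node	*left;
	struct demo_node	*right;
};

struct demo_tree {
	struct demo_node	*root;
	long			nr_items;	/* maintained on every insert */
};

/* Insert a node and keep the cached count in step with the tree. */
static bool demo_tree_insert(struct demo_tree *t, struct demo_node *n)
{
	struct demo_node **link = &t->root;

	while (*link) {
		if (n->key == (*link)->key)
			return false;	/* duplicate, count unchanged */
		link = n->key < (*link)->key ? &(*link)->left
					     : &(*link)->right;
	}
	n->left = n->right = NULL;
	*link = n;
	t->nr_items++;
	return true;
}

/* O(1): what a count callback could return without touching the tree. */
static long demo_tree_count_cached(const struct demo_tree *t)
{
	return t->nr_items;
}

/* O(n): the full walk that the cached counter makes unnecessary. */
static long demo_tree_count_walk(const struct demo_node *n)
{
	if (!n)
		return 0;
	return 1 + demo_tree_count_walk(n->left) +
		   demo_tree_count_walk(n->right);
}
```

A removal path would decrement nr_items in the same way; it is omitted here
to keep the sketch short.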

[ glommer: fixes for ext4, ubifs, nfs, cifs and glock. Fixes are
  needed mainly due to new code merged in the tree ]
Signed-off-by: Dave Chinner <dchinner-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
Signed-off-by: Glauber Costa <glommer-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org>
Acked-by: Mel Gorman <mgorman-l3A5Bk7waGM@public.gmane.org>
Acked-by: Artem Bityutskiy <artem.bityutskiy-VuQAYsv1563Yd54FQh9/CA@public.gmane.org>
Acked-by: Jan Kara <jack-AlSwsSmVLrQ@public.gmane.org>
Acked-by: Steven Whitehouse <swhiteho-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
CC: Adrian Hunter <adrian.hunter-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
---
 fs/ext4/extents_status.c | 30 ++++++++++++++++------------
 fs/gfs2/glock.c          | 28 +++++++++++++++-----------
 fs/gfs2/main.c           |  3 ++-
 fs/gfs2/quota.c          | 12 +++++++-----
 fs/gfs2/quota.h          |  4 +++-
 fs/mbcache.c             | 51 ++++++++++++++++++++++++++++--------------------
 fs/nfs/dir.c             | 18 ++++++++++++++---
 fs/nfs/internal.h        |  4 +++-
 fs/nfs/super.c           |  3 ++-
 fs/nfsd/nfscache.c       | 31 ++++++++++++++++++++---------
 fs/quota/dquot.c         | 34 +++++++++++++++-----------------
 fs/ubifs/shrinker.c      | 20 +++++++++++--------
 fs/ubifs/super.c         |  3 ++-
 fs/ubifs/ubifs.h         |  3 ++-
 14 files changed, 151 insertions(+), 93 deletions(-)

diff --git a/fs/ext4/extents_status.c b/fs/ext4/extents_status.c
index e6941e6..4bce4f0 100644
--- a/fs/ext4/extents_status.c
+++ b/fs/ext4/extents_status.c
@@ -878,20 +878,26 @@ int ext4_es_zeroout(struct inode *inode, struct ext4_extent *ex)
 				     EXTENT_STATUS_WRITTEN);
 }
 
-static int ext4_es_shrink(struct shrinker *shrink, struct shrink_control *sc)
+
+static long ext4_es_count(struct shrinker *shrink, struct shrink_control *sc)
+{
+	long nr;
+	struct ext4_sb_info *sbi = container_of(shrink,
+					struct ext4_sb_info, s_es_shrinker);
+
+	nr = percpu_counter_read_positive(&sbi->s_extent_cache_cnt);
+	trace_ext4_es_shrink_enter(sbi->s_sb, sc->nr_to_scan, nr);
+	return nr;
+}
+
+static long ext4_es_scan(struct shrinker *shrink, struct shrink_control *sc)
 {
 	struct ext4_sb_info *sbi = container_of(shrink,
 					struct ext4_sb_info, s_es_shrinker);
 	struct ext4_inode_info *ei;
 	struct list_head *cur, *tmp, scanned;
 	int nr_to_scan = sc->nr_to_scan;
-	int ret, nr_shrunk = 0;
-
-	ret = percpu_counter_read_positive(&sbi->s_extent_cache_cnt);
-	trace_ext4_es_shrink_enter(sbi->s_sb, nr_to_scan, ret);
-
-	if (!nr_to_scan)
-		return ret;
+	int ret = 0, nr_shrunk = 0;
 
 	INIT_LIST_HEAD(&scanned);
 
@@ -920,9 +926,8 @@ static int ext4_es_shrink(struct shrinker *shrink, struct shrink_control *sc)
 	list_splice_tail(&scanned, &sbi->s_es_lru);
 	spin_unlock(&sbi->s_es_lru_lock);
 
-	ret = percpu_counter_read_positive(&sbi->s_extent_cache_cnt);
 	trace_ext4_es_shrink_exit(sbi->s_sb, nr_shrunk, ret);
-	return ret;
+	return nr_shrunk;
 }
 
 void ext4_es_register_shrinker(struct super_block *sb)
@@ -932,7 +937,8 @@ void ext4_es_register_shrinker(struct super_block *sb)
 	sbi = EXT4_SB(sb);
 	INIT_LIST_HEAD(&sbi->s_es_lru);
 	spin_lock_init(&sbi->s_es_lru_lock);
-	sbi->s_es_shrinker.shrink = ext4_es_shrink;
+	sbi->s_es_shrinker.scan_objects = ext4_es_scan;
+	sbi->s_es_shrinker.count_objects = ext4_es_count;
 	sbi->s_es_shrinker.seeks = DEFAULT_SEEKS;
 	register_shrinker(&sbi->s_es_shrinker);
 }
@@ -973,7 +979,7 @@ static int __es_try_to_reclaim_extents(struct ext4_inode_info *ei,
 	struct ext4_es_tree *tree = &ei->i_es_tree;
 	struct rb_node *node;
 	struct extent_status *es;
-	int nr_shrunk = 0;
+	long nr_shrunk = 0;
 
 	if (ei->i_es_lru_nr == 0)
 		return 0;
diff --git a/fs/gfs2/glock.c b/fs/gfs2/glock.c
index 3bd2748..4ddbccb 100644
--- a/fs/gfs2/glock.c
+++ b/fs/gfs2/glock.c
@@ -1428,21 +1428,22 @@ __acquires(&lru_lock)
  * gfs2_dispose_glock_lru() above.
  */
 
-static void gfs2_scan_glock_lru(int nr)
+static long gfs2_scan_glock_lru(int nr)
 {
 	struct gfs2_glock *gl;
 	LIST_HEAD(skipped);
 	LIST_HEAD(dispose);
+	long freed = 0;
 
 	spin_lock(&lru_lock);
-	while(nr && !list_empty(&lru_list)) {
+	while ((nr-- >= 0) && !list_empty(&lru_list)) {
 		gl = list_entry(lru_list.next, struct gfs2_glock, gl_lru);
 
 		/* Test for being demotable */
 		if (!test_and_set_bit(GLF_LOCK, &gl->gl_flags)) {
 			list_move(&gl->gl_lru, &dispose);
 			atomic_dec(&lru_count);
-			nr--;
+			freed++;
 			continue;
 		}
 
@@ -1452,23 +1453,28 @@ static void gfs2_scan_glock_lru(int nr)
 	if (!list_empty(&dispose))
 		gfs2_dispose_glock_lru(&dispose);
 	spin_unlock(&lru_lock);
+
+	return freed;
 }
 
-static int gfs2_shrink_glock_memory(struct shrinker *shrink,
-				    struct shrink_control *sc)
+static long gfs2_glock_shrink_scan(struct shrinker *shrink,
+				   struct shrink_control *sc)
 {
-	if (sc->nr_to_scan) {
-		if (!(sc->gfp_mask & __GFP_FS))
-			return -1;
-		gfs2_scan_glock_lru(sc->nr_to_scan);
-	}
+	if (!(sc->gfp_mask & __GFP_FS))
+		return -1;
+	return gfs2_scan_glock_lru(sc->nr_to_scan);
+}
 
+static long gfs2_glock_shrink_count(struct shrinker *shrink,
+				    struct shrink_control *sc)
+{
 	return vfs_pressure_ratio(atomic_read(&lru_count));
 }
 
 static struct shrinker glock_shrinker = {
-	.shrink = gfs2_shrink_glock_memory,
 	.seeks = DEFAULT_SEEKS,
+	.count_objects = gfs2_glock_shrink_count,
+	.scan_objects = gfs2_glock_shrink_scan,
 };
 
 /**
diff --git a/fs/gfs2/main.c b/fs/gfs2/main.c
index e04d0e0..a105d84 100644
--- a/fs/gfs2/main.c
+++ b/fs/gfs2/main.c
@@ -32,7 +32,8 @@
 struct workqueue_struct *gfs2_control_wq;
 
 static struct shrinker qd_shrinker = {
-	.shrink = gfs2_shrink_qd_memory,
+	.count_objects = gfs2_qd_shrink_count,
+	.scan_objects = gfs2_qd_shrink_scan,
 	.seeks = DEFAULT_SEEKS,
 };
 
diff --git a/fs/gfs2/quota.c b/fs/gfs2/quota.c
index f9f4077..0e832ce 100644
--- a/fs/gfs2/quota.c
+++ b/fs/gfs2/quota.c
@@ -75,14 +75,12 @@ static LIST_HEAD(qd_lru_list);
 static atomic_t qd_lru_count = ATOMIC_INIT(0);
 static DEFINE_SPINLOCK(qd_lru_lock);
 
-int gfs2_shrink_qd_memory(struct shrinker *shrink, struct shrink_control *sc)
+long gfs2_qd_shrink_scan(struct shrinker *shrink, struct shrink_control *sc)
 {
 	struct gfs2_quota_data *qd;
 	struct gfs2_sbd *sdp;
 	int nr_to_scan = sc->nr_to_scan;
-
-	if (nr_to_scan == 0)
-		goto out;
+	long freed = 0;
 
 	if (!(sc->gfp_mask & __GFP_FS))
 		return -1;
@@ -110,10 +108,14 @@ int gfs2_shrink_qd_memory(struct shrinker *shrink, struct shrink_control *sc)
 		kmem_cache_free(gfs2_quotad_cachep, qd);
 		spin_lock(&qd_lru_lock);
 		nr_to_scan--;
+		freed++;
 	}
 	spin_unlock(&qd_lru_lock);
+	return freed;
+}
 
-out:
+long gfs2_qd_shrink_count(struct shrinker *shrink, struct shrink_control *sc)
+{
 	return vfs_pressure_ratio(atomic_read(&qd_lru_count));
 }
 
diff --git a/fs/gfs2/quota.h b/fs/gfs2/quota.h
index 4f5e6e4..4f61708 100644
--- a/fs/gfs2/quota.h
+++ b/fs/gfs2/quota.h
@@ -53,7 +53,9 @@ static inline int gfs2_quota_lock_check(struct gfs2_inode *ip)
 	return ret;
 }
 
-extern int gfs2_shrink_qd_memory(struct shrinker *shrink,
+extern long gfs2_qd_shrink_count(struct shrinker *shrink,
+				 struct shrink_control *sc);
+extern long gfs2_qd_shrink_scan(struct shrinker *shrink,
 				 struct shrink_control *sc);
 extern const struct quotactl_ops gfs2_quotactl_ops;
 
diff --git a/fs/mbcache.c b/fs/mbcache.c
index 5eb0476..009a463 100644
--- a/fs/mbcache.c
+++ b/fs/mbcache.c
@@ -86,18 +86,6 @@ static LIST_HEAD(mb_cache_list);
 static LIST_HEAD(mb_cache_lru_list);
 static DEFINE_SPINLOCK(mb_cache_spinlock);
 
-/*
- * What the mbcache registers as to get shrunk dynamically.
- */
-
-static int mb_cache_shrink_fn(struct shrinker *shrink,
-			      struct shrink_control *sc);
-
-static struct shrinker mb_cache_shrinker = {
-	.shrink = mb_cache_shrink_fn,
-	.seeks = DEFAULT_SEEKS,
-};
-
 static inline int
 __mb_cache_entry_is_hashed(struct mb_cache_entry *ce)
 {
@@ -151,7 +139,7 @@ forget:
 
 
 /*
- * mb_cache_shrink_fn()  memory pressure callback
+ * mb_cache_shrink_scan()  memory pressure callback
  *
  * This function is called by the kernel memory management when memory
  * gets low.
@@ -159,17 +147,18 @@ forget:
  * @shrink: (ignored)
  * @sc: shrink_control passed from reclaim
  *
- * Returns the number of objects which are present in the cache.
+ * Returns the number of objects freed.
  */
-static int
-mb_cache_shrink_fn(struct shrinker *shrink, struct shrink_control *sc)
+static long
+mb_cache_shrink_scan(
+	struct shrinker		*shrink,
+	struct shrink_control	*sc)
 {
 	LIST_HEAD(free_list);
-	struct mb_cache *cache;
 	struct mb_cache_entry *entry, *tmp;
-	int count = 0;
 	int nr_to_scan = sc->nr_to_scan;
 	gfp_t gfp_mask = sc->gfp_mask;
+	long freed = 0;
 
 	mb_debug("trying to free %d entries", nr_to_scan);
 	spin_lock(&mb_cache_spinlock);
@@ -179,19 +168,39 @@ mb_cache_shrink_fn(struct shrinker *shrink, struct shrink_control *sc)
 				   struct mb_cache_entry, e_lru_list);
 		list_move_tail(&ce->e_lru_list, &free_list);
 		__mb_cache_entry_unhash(ce);
+		freed++;
+	}
+	spin_unlock(&mb_cache_spinlock);
+	list_for_each_entry_safe(entry, tmp, &free_list, e_lru_list) {
+		__mb_cache_entry_forget(entry, gfp_mask);
 	}
+	return freed;
+}
+
+static long
+mb_cache_shrink_count(
+	struct shrinker		*shrink,
+	struct shrink_control	*sc)
+{
+	struct mb_cache *cache;
+	long count = 0;
+
+	spin_lock(&mb_cache_spinlock);
 	list_for_each_entry(cache, &mb_cache_list, c_cache_list) {
 		mb_debug("cache %s (%d)", cache->c_name,
 			  atomic_read(&cache->c_entry_count));
 		count += atomic_read(&cache->c_entry_count);
 	}
 	spin_unlock(&mb_cache_spinlock);
-	list_for_each_entry_safe(entry, tmp, &free_list, e_lru_list) {
-		__mb_cache_entry_forget(entry, gfp_mask);
-	}
+
 	return vfs_pressure_ratio(count);
 }
 
+static struct shrinker mb_cache_shrinker = {
+	.count_objects = mb_cache_shrink_count,
+	.scan_objects = mb_cache_shrink_scan,
+	.seeks = DEFAULT_SEEKS,
+};
 
 /*
  * mb_cache_create()  create a new cache
diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c
index a6a3d05..36d66d4 100644
--- a/fs/nfs/dir.c
+++ b/fs/nfs/dir.c
@@ -1938,17 +1938,20 @@ static void nfs_access_free_list(struct list_head *head)
 	}
 }
 
-int nfs_access_cache_shrinker(struct shrinker *shrink,
-			      struct shrink_control *sc)
+long
+nfs_access_cache_scan(
+	struct shrinker		*shrink,
+	struct shrink_control	*sc)
 {
 	LIST_HEAD(head);
 	struct nfs_inode *nfsi, *next;
 	struct nfs_access_entry *cache;
 	int nr_to_scan = sc->nr_to_scan;
 	gfp_t gfp_mask = sc->gfp_mask;
+	long freed = 0;
 
 	if ((gfp_mask & GFP_KERNEL) != GFP_KERNEL)
-		return (nr_to_scan == 0) ? 0 : -1;
+		return -1;
 
 	spin_lock(&nfs_access_lru_lock);
 	list_for_each_entry_safe(nfsi, next, &nfs_access_lru_list, access_cache_inode_lru) {
@@ -1964,6 +1967,7 @@ int nfs_access_cache_shrinker(struct shrinker *shrink,
 				struct nfs_access_entry, lru);
 		list_move(&cache->lru, &head);
 		rb_erase(&cache->rb_node, &nfsi->access_cache);
+		freed++;
 		if (!list_empty(&nfsi->access_cache_entry_lru))
 			list_move_tail(&nfsi->access_cache_inode_lru,
 					&nfs_access_lru_list);
@@ -1978,6 +1982,14 @@ remove_lru_entry:
 	}
 	spin_unlock(&nfs_access_lru_lock);
 	nfs_access_free_list(&head);
+	return freed;
+}
+
+long
+nfs_access_cache_count(
+	struct shrinker		*shrink,
+	struct shrink_control	*sc)
+{
 	return vfs_pressure_ratio(atomic_long_read(&nfs_access_nr_entries));
 }
 
diff --git a/fs/nfs/internal.h b/fs/nfs/internal.h
index 91e59a3..9651e20 100644
--- a/fs/nfs/internal.h
+++ b/fs/nfs/internal.h
@@ -269,7 +269,9 @@ extern struct nfs_client *nfs_init_client(struct nfs_client *clp,
 			   const char *ip_addr, rpc_authflavor_t authflavour);
 
 /* dir.c */
-extern int nfs_access_cache_shrinker(struct shrinker *shrink,
+extern long nfs_access_cache_count(struct shrinker *shrink,
+					struct shrink_control *sc);
+extern long nfs_access_cache_scan(struct shrinker *shrink,
 					struct shrink_control *sc);
 struct dentry *nfs_lookup(struct inode *, struct dentry *, unsigned int);
 int nfs_create(struct inode *, struct dentry *, umode_t, bool);
diff --git a/fs/nfs/super.c b/fs/nfs/super.c
index a366107..2fed70f 100644
--- a/fs/nfs/super.c
+++ b/fs/nfs/super.c
@@ -359,7 +359,8 @@ static void unregister_nfs4_fs(void)
 #endif
 
 static struct shrinker acl_shrinker = {
-	.shrink		= nfs_access_cache_shrinker,
+	.count_objects	= nfs_access_cache_count,
+	.scan_objects	= nfs_access_cache_scan,
 	.seeks		= DEFAULT_SEEKS,
 };
 
diff --git a/fs/nfsd/nfscache.c b/fs/nfsd/nfscache.c
index e76244e..5564c38 100644
--- a/fs/nfsd/nfscache.c
+++ b/fs/nfsd/nfscache.c
@@ -59,11 +59,14 @@ static unsigned int		longest_chain_cachesize;
 
 static int	nfsd_cache_append(struct svc_rqst *rqstp, struct kvec *vec);
 static void	cache_cleaner_func(struct work_struct *unused);
-static int 	nfsd_reply_cache_shrink(struct shrinker *shrink,
-					struct shrink_control *sc);
+static long	nfsd_reply_cache_count(struct shrinker *shrink,
+				       struct shrink_control *sc);
+static long	nfsd_reply_cache_scan(struct shrinker *shrink,
+				      struct shrink_control *sc);
 
 static struct shrinker nfsd_reply_cache_shrinker = {
-	.shrink	= nfsd_reply_cache_shrink,
+	.scan_objects = nfsd_reply_cache_scan,
+	.count_objects = nfsd_reply_cache_count,
 	.seeks	= 1,
 };
 
@@ -232,16 +235,18 @@ nfsd_cache_entry_expired(struct svc_cacherep *rp)
  * Walk the LRU list and prune off entries that are older than RC_EXPIRE.
  * Also prune the oldest ones when the total exceeds the max number of entries.
  */
-static void
+static long
 prune_cache_entries(void)
 {
 	struct svc_cacherep *rp, *tmp;
+	long freed = 0;
 
 	list_for_each_entry_safe(rp, tmp, &lru_head, c_lru) {
 		if (!nfsd_cache_entry_expired(rp) &&
 		    num_drc_entries <= max_drc_entries)
 			break;
 		nfsd_reply_cache_free_locked(rp);
+		freed++;
 	}
 
 	/*
@@ -254,6 +259,7 @@ prune_cache_entries(void)
 		cancel_delayed_work(&cache_cleaner);
 	else
 		mod_delayed_work(system_wq, &cache_cleaner, RC_EXPIRE);
+	return freed;
 }
 
 static void
@@ -264,20 +270,27 @@ cache_cleaner_func(struct work_struct *unused)
 	spin_unlock(&cache_lock);
 }
 
-static int
-nfsd_reply_cache_shrink(struct shrinker *shrink, struct shrink_control *sc)
+static long
+nfsd_reply_cache_count(struct shrinker *shrink, struct shrink_control *sc)
 {
-	unsigned int num;
+	long num;
 
 	spin_lock(&cache_lock);
-	if (sc->nr_to_scan)
-		prune_cache_entries();
 	num = num_drc_entries;
 	spin_unlock(&cache_lock);
 
 	return num;
 }
 
+static long
+nfsd_reply_cache_scan(struct shrinker *shrink, struct shrink_control *sc)
+{
+	long freed;
+	spin_lock(&cache_lock);
+	freed = prune_cache_entries();
+	spin_unlock(&cache_lock);
+	return freed;
+}
 /*
  * Walk an xdr_buf and get a CRC for at most the first RC_CSUMLEN bytes
  */
diff --git a/fs/quota/dquot.c b/fs/quota/dquot.c
index 762b09c..fd6b762 100644
--- a/fs/quota/dquot.c
+++ b/fs/quota/dquot.c
@@ -687,44 +687,42 @@ int dquot_quota_sync(struct super_block *sb, int type)
 }
 EXPORT_SYMBOL(dquot_quota_sync);
 
-/* Free unused dquots from cache */
-static void prune_dqcache(int count)
+static long
+dqcache_shrink_scan(
+	struct shrinker		*shrink,
+	struct shrink_control	*sc)
 {
 	struct list_head *head;
 	struct dquot *dquot;
+	long freed = 0;
 
 	head = free_dquots.prev;
-	while (head != &free_dquots && count) {
+	while (head != &free_dquots && sc->nr_to_scan) {
 		dquot = list_entry(head, struct dquot, dq_free);
 		remove_dquot_hash(dquot);
 		remove_free_dquot(dquot);
 		remove_inuse(dquot);
 		do_destroy_dquot(dquot);
-		count--;
+		sc->nr_to_scan--;
+		freed++;
 		head = free_dquots.prev;
 	}
+	return freed;
 }
 
-/*
- * This is called from kswapd when we think we need some
- * more memory
- */
-static int shrink_dqcache_memory(struct shrinker *shrink,
-				 struct shrink_control *sc)
-{
-	int nr = sc->nr_to_scan;
+static long
+dqcache_shrink_count(
+	struct shrinker		*shrink,
+	struct shrink_control	*sc)
 
-	if (nr) {
-		spin_lock(&dq_list_lock);
-		prune_dqcache(nr);
-		spin_unlock(&dq_list_lock);
-	}
+{
 	return vfs_pressure_ratio(
 	percpu_counter_read_positive(&dqstats.counter[DQST_FREE_DQUOTS]));
 }
 
 static struct shrinker dqcache_shrinker = {
-	.shrink = shrink_dqcache_memory,
+	.count_objects = dqcache_shrink_count,
+	.scan_objects = dqcache_shrink_scan,
 	.seeks = DEFAULT_SEEKS,
 };
 
diff --git a/fs/ubifs/shrinker.c b/fs/ubifs/shrinker.c
index 9e1d056..669d8c0 100644
--- a/fs/ubifs/shrinker.c
+++ b/fs/ubifs/shrinker.c
@@ -277,19 +277,23 @@ static int kick_a_thread(void)
 	return 0;
 }
 
-int ubifs_shrinker(struct shrinker *shrink, struct shrink_control *sc)
+long ubifs_shrink_count(struct shrinker *shrink, struct shrink_control *sc)
+{
+	long clean_zn_cnt = atomic_long_read(&ubifs_clean_zn_cnt);
+
+	/*
+	 * Due to the way UBIFS updates the clean znode counter it may
+	 * temporarily be negative.
+	 */
+	return clean_zn_cnt >= 0 ? clean_zn_cnt : 1;
+}
+
+long ubifs_shrink_scan(struct shrinker *shrink, struct shrink_control *sc)
 {
 	int nr = sc->nr_to_scan;
 	int freed, contention = 0;
 	long clean_zn_cnt = atomic_long_read(&ubifs_clean_zn_cnt);
 
-	if (nr == 0)
-		/*
-		 * Due to the way UBIFS updates the clean znode counter it may
-		 * temporarily be negative.
-		 */
-		return clean_zn_cnt >= 0 ? clean_zn_cnt : 1;
-
 	if (!clean_zn_cnt) {
 		/*
 		 * No clean znodes, nothing to reap. All we can do in this case
diff --git a/fs/ubifs/super.c b/fs/ubifs/super.c
index f21acf0..ff357e0 100644
--- a/fs/ubifs/super.c
+++ b/fs/ubifs/super.c
@@ -49,7 +49,8 @@ struct kmem_cache *ubifs_inode_slab;
 
 /* UBIFS TNC shrinker description */
 static struct shrinker ubifs_shrinker_info = {
-	.shrink = ubifs_shrinker,
+	.scan_objects = ubifs_shrink_scan,
+	.count_objects = ubifs_shrink_count,
 	.seeks = DEFAULT_SEEKS,
 };
 
diff --git a/fs/ubifs/ubifs.h b/fs/ubifs/ubifs.h
index b2babce..bcdafcc 100644
--- a/fs/ubifs/ubifs.h
+++ b/fs/ubifs/ubifs.h
@@ -1624,7 +1624,8 @@ int ubifs_tnc_start_commit(struct ubifs_info *c, struct ubifs_zbranch *zroot);
 int ubifs_tnc_end_commit(struct ubifs_info *c);
 
 /* shrinker.c */
-int ubifs_shrinker(struct shrinker *shrink, struct shrink_control *sc);
+long ubifs_shrink_scan(struct shrinker *shrink, struct shrink_control *sc);
+long ubifs_shrink_count(struct shrinker *shrink, struct shrink_control *sc);
 
 /* commit.c */
 int ubifs_bg_thread(void *info);
-- 
1.8.1.4

* [PATCH v10 21/35] i915: bail out earlier when shrinker cannot acquire mutex
       [not found] ` <1370287804-3481-1-git-send-email-glommer-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org>
                     ` (16 preceding siblings ...)
  2013-06-03 19:29   ` [PATCH v10 19/35] fs: convert fs shrinkers to new scan/count API Glauber Costa
@ 2013-06-03 19:29   ` Glauber Costa
  2013-06-03 19:29   ` [PATCH v10 22/35] shrinker: convert remaining shrinkers to count/scan API Glauber Costa
                     ` (12 subsequent siblings)
  30 siblings, 0 replies; 103+ messages in thread
From: Glauber Costa @ 2013-06-03 19:29 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Mel Gorman, Dave Chinner,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, cgroups-u79uwXL29TY76Z2rM5mHXA,
	kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A, Michal Hocko,
	Johannes Weiner, hughd-hpIqsD4AKlfQT0dZR+AlfA, Greg Thelen,
	Glauber Costa, Dave Chinner, Kent Overstreet

The main shrinker driver will keep trying for a while to free objects if
the returned value from the shrink scan procedure is 0.  That means "no
objects now", but a retry could very well succeed.

A negative value has a different meaning: it is impossible to shrink
right now, and we had better bail out soon. We find this behavior more
appropriate for the case where the lock cannot be taken, especially given
the hammer behavior of the i915: if another thread is already shrinking,
we are unlikely to be able to shrink anything anyway when we finally
acquire the mutex.
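
The difference between the two return values can be sketched in
userspace. This is only an illustration under stated assumptions:
struct toy_cache, toy_scan(), toy_driver() and MAX_RETRIES are
hypothetical names, and the loop is not the kernel's actual reclaim
logic, which batches and accounts differently.

```c
#include <stdbool.h>

/* Hypothetical toy cache; not a kernel structure. */
struct toy_cache {
	long nr_objects;	/* objects currently freeable */
	bool lock_busy;		/* stands in for a failed mutex_trylock() */
	int calls;		/* how many times toy_scan() ran */
};

#define MAX_RETRIES 3		/* assumed retry budget for this sketch */

/* Scan callback: -1 = cannot shrink at all, 0 = nothing freed this time. */
static long toy_scan(struct toy_cache *c, long nr_to_scan)
{
	long freed;

	c->calls++;
	if (c->lock_busy)
		return -1;
	freed = nr_to_scan < c->nr_objects ? nr_to_scan : c->nr_objects;
	c->nr_objects -= freed;
	return freed;
}

/* Driver: retries on 0, gives up immediately on a negative value. */
static long toy_driver(struct toy_cache *c, long nr_to_scan)
{
	long total = 0;
	int retries = 0;

	while (total < nr_to_scan && retries < MAX_RETRIES) {
		long ret = toy_scan(c, nr_to_scan - total);

		if (ret < 0)
			break;		/* impossible to shrink: bail out */
		if (ret == 0)
			retries++;	/* "no objects now": try again */
		total += ret;
	}
	return total;
}
```

With lock_busy set, toy_driver() calls toy_scan() exactly once; with an
empty but unlocked cache it burns the whole retry budget first, which is
the cost this patch avoids.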

Signed-off-by: Glauber Costa <glommer-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org>
Acked-by: Daniel Vetter <daniel.vetter-/w4YWyX8dFk@public.gmane.org>
CC: Dave Chinner <dchinner-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
CC: Mel Gorman <mgorman-l3A5Bk7waGM@public.gmane.org>
CC: Kent Overstreet <koverstreet-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
---
 drivers/gpu/drm/i915/i915_gem.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
index e360031..72a05ee 100644
--- a/drivers/gpu/drm/i915/i915_gem.c
+++ b/drivers/gpu/drm/i915/i915_gem.c
@@ -4490,10 +4490,10 @@ i915_gem_inactive_count(struct shrinker *shrinker, struct shrink_control *sc)
 
 	if (!mutex_trylock(&dev->struct_mutex)) {
 		if (!mutex_is_locked_by(&dev->struct_mutex, current))
-			return 0;
+			return -1;
 
 		if (dev_priv->mm.shrinker_no_lock_stealing)
-			return 0;
+			return -1;
 
 		unlock = false;
 	}
-- 
1.8.1.4

* [PATCH v10 22/35] shrinker: convert remaining shrinkers to count/scan API
       [not found] ` <1370287804-3481-1-git-send-email-glommer-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org>
                     ` (17 preceding siblings ...)
  2013-06-03 19:29   ` [PATCH v10 21/35] i915: bail out earlier when shrinker cannot acquire mutex Glauber Costa
@ 2013-06-03 19:29   ` Glauber Costa
  2013-06-05 23:08     ` Andrew Morton
  2013-06-03 19:29   ` [PATCH v10 23/35] hugepage: convert huge zero page shrinker to new shrinker API Glauber Costa
                     ` (11 subsequent siblings)
  30 siblings, 1 reply; 103+ messages in thread
From: Glauber Costa @ 2013-06-03 19:29 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Mel Gorman, Dave Chinner,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, cgroups-u79uwXL29TY76Z2rM5mHXA,
	kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A, Michal Hocko,
	Johannes Weiner, hughd-hpIqsD4AKlfQT0dZR+AlfA, Greg Thelen,
	Dave Chinner, Glauber Costa, Marcelo Tosatti, Gleb Natapov,
	Chuck Lever, J. Bruce Fields, Trond Myklebust

From: Dave Chinner <dchinner-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>

Convert the remaining couple of random shrinkers in the tree to the
new API.

Signed-off-by: Dave Chinner <dchinner-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
Signed-off-by: Glauber Costa <glommer-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org>
CC: Marcelo Tosatti <mtosatti-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
CC: Gleb Natapov <gleb-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
CC: Chuck Lever <chuck.lever-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>
CC: J. Bruce Fields <bfields-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
CC: Trond Myklebust <Trond.Myklebust-HgOvQuBEEgTQT0dZR+AlfA@public.gmane.org>
---
 arch/x86/kvm/mmu.c | 28 +++++++++++++++++++++-------
 net/sunrpc/auth.c  | 45 +++++++++++++++++++++++++++++++--------------
 2 files changed, 52 insertions(+), 21 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index f8ca2f3..d2ce14c 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -4213,13 +4213,14 @@ restart:
 	spin_unlock(&kvm->mmu_lock);
 }
 
-static int mmu_shrink(struct shrinker *shrink, struct shrink_control *sc)
+static long
+mmu_shrink_scan(
+	struct shrinker		*shrink,
+	struct shrink_control	*sc)
 {
 	struct kvm *kvm;
 	int nr_to_scan = sc->nr_to_scan;
-
-	if (nr_to_scan == 0)
-		goto out;
+	long freed = 0;
 
 	raw_spin_lock(&kvm_lock);
 
@@ -4247,24 +4248,37 @@ static int mmu_shrink(struct shrinker *shrink, struct shrink_control *sc)
 		idx = srcu_read_lock(&kvm->srcu);
 		spin_lock(&kvm->mmu_lock);
 
-		prepare_zap_oldest_mmu_page(kvm, &invalid_list);
+		freed += prepare_zap_oldest_mmu_page(kvm, &invalid_list);
 		kvm_mmu_commit_zap_page(kvm, &invalid_list);
 
 		spin_unlock(&kvm->mmu_lock);
 		srcu_read_unlock(&kvm->srcu, idx);
 
+		/*
+		 * unfair on small ones
+		 * per-vm shrinkers cry out
+		 * sadness comes quickly
+		 */
 		list_move_tail(&kvm->vm_list, &vm_list);
 		break;
 	}
 
 	raw_spin_unlock(&kvm_lock);
+	return freed;
 
-out:
+}
+
+static long
+mmu_shrink_count(
+	struct shrinker		*shrink,
+	struct shrink_control	*sc)
+{
 	return percpu_counter_read_positive(&kvm_total_used_mmu_pages);
 }
 
 static struct shrinker mmu_shrinker = {
-	.shrink = mmu_shrink,
+	.count_objects = mmu_shrink_count,
+	.scan_objects = mmu_shrink_scan,
 	.seeks = DEFAULT_SEEKS * 10,
 };
 
diff --git a/net/sunrpc/auth.c b/net/sunrpc/auth.c
index ed2fdd2..9ce0976 100644
--- a/net/sunrpc/auth.c
+++ b/net/sunrpc/auth.c
@@ -413,12 +413,13 @@ EXPORT_SYMBOL_GPL(rpcauth_destroy_credcache);
 /*
  * Remove stale credentials. Avoid sleeping inside the loop.
  */
-static int
+static long
 rpcauth_prune_expired(struct list_head *free, int nr_to_scan)
 {
 	spinlock_t *cache_lock;
 	struct rpc_cred *cred, *next;
 	unsigned long expired = jiffies - RPC_AUTH_EXPIRY_MORATORIUM;
+	long freed = 0;
 
 	list_for_each_entry_safe(cred, next, &cred_unused, cr_lru) {
 
@@ -430,10 +431,11 @@ rpcauth_prune_expired(struct list_head *free, int nr_to_scan)
 		 */
 		if (time_in_range(cred->cr_expire, expired, jiffies) &&
 		    test_bit(RPCAUTH_CRED_HASHED, &cred->cr_flags) != 0)
-			return 0;
+			break;
 
 		list_del_init(&cred->cr_lru);
 		number_cred_unused--;
+		freed++;
 		if (atomic_read(&cred->cr_count) != 0)
 			continue;
 
@@ -446,29 +448,43 @@ rpcauth_prune_expired(struct list_head *free, int nr_to_scan)
 		}
 		spin_unlock(cache_lock);
 	}
-	return (number_cred_unused / 100) * sysctl_vfs_cache_pressure;
+	return freed;
 }
 
 /*
  * Run memory cache shrinker.
  */
-static int
-rpcauth_cache_shrinker(struct shrinker *shrink, struct shrink_control *sc)
+static long
+rpcauth_cache_shrink_scan(
+	struct shrinker		*shrink,
+	struct shrink_control	*sc)
+
 {
 	LIST_HEAD(free);
-	int res;
-	int nr_to_scan = sc->nr_to_scan;
-	gfp_t gfp_mask = sc->gfp_mask;
+	long freed;
+
+	if ((sc->gfp_mask & GFP_KERNEL) != GFP_KERNEL)
+		return -1;
 
-	if ((gfp_mask & GFP_KERNEL) != GFP_KERNEL)
-		return (nr_to_scan == 0) ? 0 : -1;
+	/* nothing left, don't come back */
 	if (list_empty(&cred_unused))
-		return 0;
+		return -1;
+
 	spin_lock(&rpc_credcache_lock);
-	res = rpcauth_prune_expired(&free, nr_to_scan);
+	freed = rpcauth_prune_expired(&free, sc->nr_to_scan);
 	spin_unlock(&rpc_credcache_lock);
 	rpcauth_destroy_credlist(&free);
-	return res;
+
+	return freed;
+}
+
+static long
+rpcauth_cache_shrink_count(
+	struct shrinker		*shrink,
+	struct shrink_control	*sc)
+
+{
+	return (number_cred_unused / 100) * sysctl_vfs_cache_pressure;
 }
 
 /*
@@ -784,7 +800,8 @@ rpcauth_uptodatecred(struct rpc_task *task)
 }
 
 static struct shrinker rpc_cred_shrinker = {
-	.shrink = rpcauth_cache_shrinker,
+	.count_objects = rpcauth_cache_shrink_count,
+	.scan_objects = rpcauth_cache_shrink_scan,
 	.seeks = DEFAULT_SEEKS,
 };
 
-- 
1.8.1.4

* Re: [PATCH v10 22/35] shrinker: convert remaining shrinkers to count/scan API
  2013-06-03 19:29   ` [PATCH v10 22/35] shrinker: convert remaining shrinkers to count/scan API Glauber Costa
@ 2013-06-05 23:08     ` Andrew Morton
       [not found]       ` <20130605160821.59adf9ad4efe48144fd9e237-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org>
  0 siblings, 1 reply; 103+ messages in thread
From: Andrew Morton @ 2013-06-05 23:08 UTC (permalink / raw)
  To: Glauber Costa
  Cc: linux-fsdevel, Mel Gorman, Dave Chinner, linux-mm, cgroups,
	kamezawa.hiroyu, Michal Hocko, Johannes Weiner, hughd,
	Greg Thelen, Dave Chinner, Marcelo Tosatti, Gleb Natapov,
	Chuck Lever, J. Bruce Fields, Trond Myklebust

On Mon,  3 Jun 2013 23:29:51 +0400 Glauber Costa <glommer@openvz.org> wrote:

> From: Dave Chinner <dchinner@redhat.com>
> 
> Convert the remaining couple of random shrinkers in the tree to the
> new API.

Gee we have a lot of shrinkers.

> --- a/arch/x86/kvm/mmu.c
> +++ b/arch/x86/kvm/mmu.c
> @@ -4213,13 +4213,14 @@ restart:
>  	spin_unlock(&kvm->mmu_lock);
>  }
>  
> -static int mmu_shrink(struct shrinker *shrink, struct shrink_control *sc)
> +static long
> +mmu_shrink_scan(
> +	struct shrinker		*shrink,
> +	struct shrink_control	*sc)
>
> ...
>
> --- a/net/sunrpc/auth.c
> +++ b/net/sunrpc/auth.c
> -static int
> -rpcauth_cache_shrinker(struct shrinker *shrink, struct shrink_control *sc)
> +static long
> +rpcauth_cache_shrink_scan(
> +	struct shrinker		*shrink,
> +	struct shrink_control	*sc)
> +

It is pretty poor form to switch other people's code into this very
non-standard XFSish coding style.  The maintainers are just going to
have to go wtf and switch it back one day.

Really, it would be best if you were to go through the entire patchset
and undo all this.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

[parent not found: <20130605160821.59adf9ad4efe48144fd9e237-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org>]

* Re: [PATCH v10 22/35] shrinker: convert remaining shrinkers to count/scan API
       [not found]       ` <20130605160821.59adf9ad4efe48144fd9e237-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org>
@ 2013-06-06  3:41         ` Dave Chinner
  2013-06-06  8:27           ` Glauber Costa
  0 siblings, 1 reply; 103+ messages in thread
From: Dave Chinner @ 2013-06-06  3:41 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Glauber Costa, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Mel Gorman,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, cgroups-u79uwXL29TY76Z2rM5mHXA,
	kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A, Michal Hocko,
	Johannes Weiner, hughd-hpIqsD4AKlfQT0dZR+AlfA, Greg Thelen,
	Dave Chinner, Marcelo Tosatti, Gleb Natapov, Chuck Lever,
	J. Bruce Fields, Trond Myklebust

On Wed, Jun 05, 2013 at 04:08:21PM -0700, Andrew Morton wrote:
> On Mon,  3 Jun 2013 23:29:51 +0400 Glauber Costa <glommer-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org> wrote:
> 
> > From: Dave Chinner <dchinner-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
> > 
> > Convert the remaining couple of random shrinkers in the tree to the
> > new API.
> 
> Gee we have a lot of shrinkers.

And a large number of them are busted in some way, too :/

> > --- a/arch/x86/kvm/mmu.c
> > +++ b/arch/x86/kvm/mmu.c
> > @@ -4213,13 +4213,14 @@ restart:
> >  	spin_unlock(&kvm->mmu_lock);
> >  }
> >  
> > -static int mmu_shrink(struct shrinker *shrink, struct shrink_control *sc)
> > +static long
> > +mmu_shrink_scan(
> > +	struct shrinker		*shrink,
> > +	struct shrink_control	*sc)
> >
> > ...
> >
> > --- a/net/sunrpc/auth.c
> > +++ b/net/sunrpc/auth.c
> > -static int
> > -rpcauth_cache_shrinker(struct shrinker *shrink, struct shrink_control *sc)
> > +static long
> > +rpcauth_cache_shrink_scan(
> > +	struct shrinker		*shrink,
> > +	struct shrink_control	*sc)
> > +
> 
> It is pretty poor form to switch other people's code into this very
> non-standard XFSish coding style.  The maintainers are just going to
> have to go wtf and switch it back one day.

My bad.  That's left over from when I was originally developing the
patch set and passed a couple more parameters to the shrinkers,
pushing every single declaration to well over the line length
limits. I never converted them back as I removed the extra
parameters, because it's far easier to just delete a line than to
delete a variable and reformat the entire function declaration....

> Really, it would be best if you were to go through the entire patchset
> and undo all this.

Sure, that can be done.

Cheers,

Dave.
-- 
Dave Chinner
david-FqsqvQoI3Ljby3iVrkZq2A@public.gmane.org

* Re: [PATCH v10 22/35] shrinker: convert remaining shrinkers to count/scan API
  2013-06-06  3:41         ` Dave Chinner
@ 2013-06-06  8:27           ` Glauber Costa
  0 siblings, 0 replies; 103+ messages in thread
From: Glauber Costa @ 2013-06-06  8:27 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Andrew Morton, Glauber Costa, linux-fsdevel, Mel Gorman, linux-mm,
	cgroups, kamezawa.hiroyu, Michal Hocko, Johannes Weiner, hughd,
	Greg Thelen, Dave Chinner, Marcelo Tosatti, Gleb Natapov,
	Chuck Lever, J. Bruce Fields, Trond Myklebust

On 06/06/2013 07:41 AM, Dave Chinner wrote:
>> Really, it would be best if you were to go through the entire patchset
>> and undo all this.
> Sure, that can be done.

There is a lot to do, a lot to rebase, and many conflicts to fix.
Since I will be the one resending this anyway, let me just go ahead and
fix them.


* [PATCH v10 23/35] hugepage: convert huge zero page shrinker to new shrinker API
       [not found] ` <1370287804-3481-1-git-send-email-glommer-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org>
                     ` (18 preceding siblings ...)
  2013-06-03 19:29   ` [PATCH v10 22/35] shrinker: convert remaining shrinkers to count/scan API Glauber Costa
@ 2013-06-03 19:29   ` Glauber Costa
  2013-06-03 19:29   ` [PATCH v10 24/35] shrinker: Kill old ->shrink API Glauber Costa
                     ` (10 subsequent siblings)
  30 siblings, 0 replies; 103+ messages in thread
From: Glauber Costa @ 2013-06-03 19:29 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Mel Gorman, Dave Chinner,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, cgroups-u79uwXL29TY76Z2rM5mHXA,
	kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A, Michal Hocko,
	Johannes Weiner, hughd-hpIqsD4AKlfQT0dZR+AlfA, Greg Thelen,
	Glauber Costa, Dave Chinner

It consists of:

* returning long instead of int
* separating count from scan
* returning the number of freed entities in scan
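
The three points above can be sketched in userspace with C11 atomics.
This is a rough illustration, not the mm/huge_memory.c code: zp_count(),
zp_scan(), zp_refcount, zp_page and ZP_NR_PAGES are hypothetical
stand-ins, and the value 512 is assumed purely for the example.

```c
#include <stdatomic.h>
#include <stddef.h>
#include <stdlib.h>

#define ZP_NR_PAGES 512L	/* stands in for HPAGE_PMD_NR; value assumed */

static atomic_int zp_refcount;		/* 1 means only the cache holds it */
static _Atomic(void *) zp_page;		/* the cached object */

/* count: report what could be freed, with no side effects. */
static long zp_count(void)
{
	return atomic_load(&zp_refcount) == 1 ? ZP_NR_PAGES : 0;
}

/* scan: free the object if ours is the last reference; return amount freed. */
static long zp_scan(void)
{
	int expected = 1;

	if (atomic_compare_exchange_strong(&zp_refcount, &expected, 0)) {
		void *page = atomic_exchange(&zp_page, NULL);

		free(page);
		return ZP_NR_PAGES;
	}
	return 0;
}
```

Because the count callback only reads the refcount, it may be called at
any time; only the scan callback changes state, and it reports what it
actually freed rather than what remains.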

Signed-off-by: Glauber Costa <glommer-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org>
Reviewed-by: Greg Thelen <gthelen-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
Acked-by: Kirill A. Shutemov <kirill.shutemov-VuQAYsv1563Yd54FQh9/CA@public.gmane.org>
CC: Dave Chinner <dchinner-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
---
 mm/huge_memory.c | 17 +++++++++++------
 1 file changed, 11 insertions(+), 6 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 243e710..8dc36f5 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -211,24 +211,29 @@ static void put_huge_zero_page(void)
 	BUG_ON(atomic_dec_and_test(&huge_zero_refcount));
 }
 
-static int shrink_huge_zero_page(struct shrinker *shrink,
-		struct shrink_control *sc)
+static long shrink_huge_zero_page_count(struct shrinker *shrink,
+					struct shrink_control *sc)
 {
-	if (!sc->nr_to_scan)
-		/* we can free zero page only if last reference remains */
-		return atomic_read(&huge_zero_refcount) == 1 ? HPAGE_PMD_NR : 0;
+	/* we can free zero page only if last reference remains */
+	return atomic_read(&huge_zero_refcount) == 1 ? HPAGE_PMD_NR : 0;
+}
 
+static long shrink_huge_zero_page_scan(struct shrinker *shrink,
+				       struct shrink_control *sc)
+{
 	if (atomic_cmpxchg(&huge_zero_refcount, 1, 0) == 1) {
 		struct page *zero_page = xchg(&huge_zero_page, NULL);
 		BUG_ON(zero_page == NULL);
 		__free_page(zero_page);
+		return HPAGE_PMD_NR;
 	}
 
 	return 0;
 }
 
 static struct shrinker huge_zero_page_shrinker = {
-	.shrink = shrink_huge_zero_page,
+	.count_objects = shrink_huge_zero_page_count,
+	.scan_objects = shrink_huge_zero_page_scan,
 	.seeks = DEFAULT_SEEKS,
 };
 
-- 
1.8.1.4

* [PATCH v10 24/35] shrinker: Kill old ->shrink API.
       [not found] ` <1370287804-3481-1-git-send-email-glommer-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org>
                     ` (19 preceding siblings ...)
  2013-06-03 19:29   ` [PATCH v10 23/35] hugepage: convert huge zero page shrinker to new shrinker API Glauber Costa
@ 2013-06-03 19:29   ` Glauber Costa
  2013-06-03 19:29   ` [PATCH v10 25/35] vmscan: also shrink slab in memcg pressure Glauber Costa
                     ` (9 subsequent siblings)
  30 siblings, 0 replies; 103+ messages in thread
From: Glauber Costa @ 2013-06-03 19:29 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Mel Gorman, Dave Chinner,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, cgroups-u79uwXL29TY76Z2rM5mHXA,
	kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A, Michal Hocko,
	Johannes Weiner, hughd-hpIqsD4AKlfQT0dZR+AlfA, Greg Thelen,
	Dave Chinner, Glauber Costa

From: Dave Chinner <dchinner-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>

There are no more users of this API, so kill it dead, dead, dead and
quietly bury the corpse in a shallow, unmarked grave in a dark
forest deep in the hills...

[ glommer: added flowers to the grave ]
Signed-off-by: Dave Chinner <dchinner-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
Signed-off-by: Glauber Costa <glommer-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org>
Reviewed-by: Greg Thelen <gthelen-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
Acked-by: Mel Gorman <mgorman-l3A5Bk7waGM@public.gmane.org>
---
 include/linux/shrinker.h      | 15 +++++----------
 include/trace/events/vmscan.h |  4 ++--
 mm/vmscan.c                   | 38 ++++++++------------------------------
 3 files changed, 15 insertions(+), 42 deletions(-)

diff --git a/include/linux/shrinker.h b/include/linux/shrinker.h
index d70b123..0786394 100644
--- a/include/linux/shrinker.h
+++ b/include/linux/shrinker.h
@@ -7,14 +7,15 @@
  *
  * The 'gfpmask' refers to the allocation we are currently trying to
  * fulfil.
- *
- * Note that 'shrink' will be passed nr_to_scan == 0 when the VM is
- * querying the cache size, so a fastpath for that case is appropriate.
  */
 struct shrink_control {
 	gfp_t gfp_mask;
 
-	/* How many slab objects shrinker() should scan and try to reclaim */
+	/*
+	 * How many objects scan_objects should scan and try to reclaim.
+	 * This is reset before every call, so it is safe for callees
+	 * to modify.
+	 */
 	long nr_to_scan;
 
 	/* shrink from these nodes */
@@ -26,11 +27,6 @@ struct shrink_control {
 /*
  * A callback you can register to apply pressure to ageable caches.
  *
- * @shrink() should look through the least-recently-used 'nr_to_scan' entries
- * and attempt to free them up.  It should return the number of objects which
- * remain in the cache.  If it returns -1, it means it cannot do any scanning at
- * this time (eg. there is a risk of deadlock).
- *
  * @count_objects should return the number of freeable items in the cache. If
  * there are no objects to free or the number of freeable items cannot be
  * determined, it should return 0. No deadlock checks should be done during the
@@ -48,7 +44,6 @@ struct shrink_control {
  * @flags determine the shrinker abilities, like numa awareness 
  */
 struct shrinker {
-	int (*shrink)(struct shrinker *, struct shrink_control *sc);
 	long (*count_objects)(struct shrinker *, struct shrink_control *sc);
 	long (*scan_objects)(struct shrinker *, struct shrink_control *sc);
 
diff --git a/include/trace/events/vmscan.h b/include/trace/events/vmscan.h
index 63cfccc..132a985 100644
--- a/include/trace/events/vmscan.h
+++ b/include/trace/events/vmscan.h
@@ -202,7 +202,7 @@ TRACE_EVENT(mm_shrink_slab_start,
 
 	TP_fast_assign(
 		__entry->shr = shr;
-		__entry->shrink = shr->shrink;
+		__entry->shrink = shr->scan_objects;
 		__entry->nr_objects_to_shrink = nr_objects_to_shrink;
 		__entry->gfp_flags = sc->gfp_mask;
 		__entry->pgs_scanned = pgs_scanned;
@@ -241,7 +241,7 @@ TRACE_EVENT(mm_shrink_slab_end,
 
 	TP_fast_assign(
 		__entry->shr = shr;
-		__entry->shrink = shr->shrink;
+		__entry->shrink = shr->scan_objects;
 		__entry->unused_scan = unused_scan_cnt;
 		__entry->new_scan = new_scan_cnt;
 		__entry->retval = shrinker_retval;
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 08eec9d..7641614 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -199,14 +199,6 @@ void unregister_shrinker(struct shrinker *shrinker)
 }
 EXPORT_SYMBOL(unregister_shrinker);
 
-static inline int do_shrinker_shrink(struct shrinker *shrinker,
-				     struct shrink_control *sc,
-				     unsigned long nr_to_scan)
-{
-	sc->nr_to_scan = nr_to_scan;
-	return (*shrinker->shrink)(shrinker, sc);
-}
-
 #define SHRINK_BATCH 128
 
 static unsigned long
@@ -223,11 +215,8 @@ shrink_slab_node(struct shrink_control *shrinkctl, struct shrinker *shrinker,
 	long batch_size = shrinker->batch ? shrinker->batch
 					  : SHRINK_BATCH;
 
-	if (shrinker->scan_objects) {
-		max_pass = shrinker->count_objects(shrinker, shrinkctl);
-		WARN_ON(max_pass < 0);
-	} else
-		max_pass = do_shrinker_shrink(shrinker, shrinkctl, 0);
+	max_pass = shrinker->count_objects(shrinker, shrinkctl);
+	WARN_ON(max_pass < 0);
 	if (max_pass <= 0)
 		return 0;
 
@@ -246,7 +235,7 @@ shrink_slab_node(struct shrink_control *shrinkctl, struct shrinker *shrinker,
 	if (total_scan < 0) {
 		printk(KERN_ERR
 		"shrink_slab: %pF negative objects to delete nr=%ld\n",
-		       shrinker->shrink, total_scan);
+		       shrinker->scan_objects, total_scan);
 		total_scan = max_pass;
 	}
 
@@ -280,23 +269,12 @@ shrink_slab_node(struct shrink_control *shrinkctl, struct shrinker *shrinker,
 	while (total_scan >= batch_size) {
 		long ret;
 
-		if (shrinker->scan_objects) {
-			shrinkctl->nr_to_scan = batch_size;
-			ret = shrinker->scan_objects(shrinker, shrinkctl);
+		shrinkctl->nr_to_scan = batch_size;
+		ret = shrinker->scan_objects(shrinker, shrinkctl);
 
-			if (ret == -1)
-				break;
-			freed += ret;
-		} else {
-			int nr_before;
-			nr_before = do_shrinker_shrink(shrinker, shrinkctl, 0);
-			ret = do_shrinker_shrink(shrinker, shrinkctl,
-							batch_size);
-			if (ret == -1)
-				break;
-			if (ret < nr_before)
-				freed += nr_before - ret;
-		}
+		if (ret == -1)
+			break;
+		freed += ret;
 
 		count_vm_events(SLABS_SCANNED, batch_size);
 		total_scan -= batch_size;
-- 
1.8.1.4

^ permalink raw reply related	[flat|nested] 103+ messages in thread
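
For readers following the archive, the call sequence that remains in shrink_slab_node() after this patch can be modelled in userspace. This is only a sketch under stated assumptions: the struct layouts are trimmed to the fields used here, toy_count(), toy_scan() and shrink_one() are hypothetical names, and total_scan is passed in directly, whereas the kernel derives it from LRU pressure.

```c
#include <assert.h>
#include <stddef.h>

#define SHRINK_BATCH 128

/* Trimmed-down stand-ins for the kernel structures (assumption). */
struct shrink_control {
	long nr_to_scan;
};

struct shrinker {
	long (*count_objects)(struct shrinker *, struct shrink_control *);
	long (*scan_objects)(struct shrinker *, struct shrink_control *);
	long batch;
};

/* Hypothetical toy cache: just a counter of freeable objects. */
static long toy_cache_objects = 1000;

static long toy_count(struct shrinker *s, struct shrink_control *sc)
{
	(void)s;
	(void)sc;
	return toy_cache_objects;
}

static long toy_scan(struct shrinker *s, struct shrink_control *sc)
{
	long freed = sc->nr_to_scan < toy_cache_objects ?
		     sc->nr_to_scan : toy_cache_objects;

	(void)s;
	toy_cache_objects -= freed;
	return freed;		/* number of objects actually freed */
}

/*
 * Model of the batch loop: count first, then scan in batches until the
 * budget is used up or the shrinker returns -1 ("cannot scan now").
 */
static long shrink_one(struct shrinker *shrinker, struct shrink_control *sc,
		       long total_scan)
{
	long batch_size = shrinker->batch ? shrinker->batch : SHRINK_BATCH;
	long freed = 0;

	assert(total_scan >= 0);
	if (shrinker->count_objects(shrinker, sc) <= 0)
		return 0;

	while (total_scan >= batch_size) {
		long ret;

		/* nr_to_scan is reset before every call. */
		sc->nr_to_scan = batch_size;
		ret = shrinker->scan_objects(shrinker, sc);
		if (ret == -1)
			break;
		freed += ret;
		total_scan -= batch_size;
	}
	return freed;
}
```

The point of the split is visible in the loop: scan_objects reports what it freed, so the caller no longer has to call the shrinker twice and subtract the results.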

* [PATCH v10 25/35] vmscan: also shrink slab in memcg pressure
       [not found] ` <1370287804-3481-1-git-send-email-glommer-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org>
                     ` (20 preceding siblings ...)
  2013-06-03 19:29   ` [PATCH v10 24/35] shrinker: Kill old ->shrink API Glauber Costa
@ 2013-06-03 19:29   ` Glauber Costa
  2013-06-03 19:29   ` [PATCH v10 26/35] memcg,list_lru: duplicate LRUs upon kmemcg creation Glauber Costa
                     ` (8 subsequent siblings)
  30 siblings, 0 replies; 103+ messages in thread
From: Glauber Costa @ 2013-06-03 19:29 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Mel Gorman, Dave Chinner,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, cgroups-u79uwXL29TY76Z2rM5mHXA,
	kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A, Michal Hocko,
	Johannes Weiner, hughd-hpIqsD4AKlfQT0dZR+AlfA, Greg Thelen,
	Glauber Costa, Dave Chinner, Rik van Riel

Without the surrounding infrastructure, this patch is a bit of a hammer:
it will basically shrink objects from all memcgs under memcg pressure.
At least, however, we will keep the scan limited to the shrinkers marked
as per-memcg.

Future patches will implement the in-shrinker logic to filter objects
based on their memcg association.

Signed-off-by: Glauber Costa <glommer-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org>
Cc: Dave Chinner <dchinner-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
Cc: Mel Gorman <mgorman-l3A5Bk7waGM@public.gmane.org>
Cc: Rik van Riel <riel-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
Cc: Johannes Weiner <hannes-druUgvl0LCNAfugRpC6u6w@public.gmane.org>
Cc: Michal Hocko <mhocko-AlSwsSmVLrQ@public.gmane.org>
Cc: Hugh Dickins <hughd-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
Cc: Kamezawa Hiroyuki <kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A@public.gmane.org>
Cc: Andrew Morton <akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org>
---
 include/linux/memcontrol.h | 17 +++++++++++++++++
 include/linux/shrinker.h   |  6 +++++-
 mm/memcontrol.c            | 16 +++++++++++++++-
 mm/vmscan.c                | 46 +++++++++++++++++++++++++++++++++++++++++++---
 4 files changed, 80 insertions(+), 5 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 7b4d9d7..489c6d7 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -200,6 +200,9 @@ void mem_cgroup_split_huge_fixup(struct page *head);
 bool mem_cgroup_bad_page_check(struct page *page);
 void mem_cgroup_print_bad_page(struct page *page);
 #endif
+
+unsigned long
+memcg_zone_reclaimable_pages(struct mem_cgroup *memcg, struct zone *zone);
 #else /* CONFIG_MEMCG */
 struct mem_cgroup;
 
@@ -378,6 +381,12 @@ static inline void mem_cgroup_replace_page_cache(struct page *oldpage,
 				struct page *newpage)
 {
 }
+
+static inline unsigned long
+memcg_zone_reclaimable_pages(struct mem_cgroup *memcg, struct zone *zone)
+{
+	return 0;
+}
 #endif /* CONFIG_MEMCG */
 
 #if !defined(CONFIG_MEMCG) || !defined(CONFIG_DEBUG_VM)
@@ -430,6 +439,8 @@ static inline bool memcg_kmem_enabled(void)
 	return static_key_false(&memcg_kmem_enabled_key);
 }
 
+bool memcg_kmem_is_active(struct mem_cgroup *memcg);
+
 /*
  * In general, we'll do everything in our power to not incur in any overhead
  * for non-memcg users for the kmem functions. Not even a function call, if we
@@ -563,6 +574,12 @@ memcg_kmem_get_cache(struct kmem_cache *cachep, gfp_t gfp)
 	return __memcg_kmem_get_cache(cachep, gfp);
 }
 #else
+
+static inline bool memcg_kmem_is_active(struct mem_cgroup *memcg)
+{
+	return false;
+}
+
 #define for_each_memcg_cache_index(_idx)	\
 	for (; NULL; )
 
diff --git a/include/linux/shrinker.h b/include/linux/shrinker.h
index 0786394..bfa9666 100644
--- a/include/linux/shrinker.h
+++ b/include/linux/shrinker.h
@@ -22,6 +22,9 @@ struct shrink_control {
 	nodemask_t nodes_to_scan;
 	/* current node being shrunk (for NUMA aware shrinkers) */
 	int nid;
+
+	/* reclaim from this memcg only (if not NULL) */
+	struct mem_cgroup *target_mem_cgroup;
 };
 
 /*
@@ -75,7 +78,8 @@ struct shrinker {
 #define DEFAULT_SEEKS 2 /* A good number if you don't know better. */
 
 /* Flags */
-#define SHRINKER_NUMA_AWARE (1 << 0)
+#define SHRINKER_NUMA_AWARE	(1 << 0)
+#define SHRINKER_MEMCG_AWARE	(1 << 1)
 
 extern int register_shrinker(struct shrinker *);
 extern void unregister_shrinker(struct shrinker *);
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index a1982ba..27af2d1 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -465,7 +465,7 @@ static inline void memcg_kmem_set_active(struct mem_cgroup *memcg)
 	set_bit(KMEM_ACCOUNTED_ACTIVE, &memcg->kmem_account_flags);
 }
 
-static bool memcg_kmem_is_active(struct mem_cgroup *memcg)
+bool memcg_kmem_is_active(struct mem_cgroup *memcg)
 {
 	return test_bit(KMEM_ACCOUNTED_ACTIVE, &memcg->kmem_account_flags);
 }
@@ -1044,6 +1044,20 @@ mem_cgroup_zone_nr_lru_pages(struct mem_cgroup *memcg, int nid, int zid,
 	return ret;
 }
 
+unsigned long
+memcg_zone_reclaimable_pages(struct mem_cgroup *memcg, struct zone *zone)
+{
+	int nid = zone_to_nid(zone);
+	int zid = zone_idx(zone);
+	unsigned long val;
+
+	val = mem_cgroup_zone_nr_lru_pages(memcg, nid, zid, LRU_ALL_FILE);
+	if (do_swap_account)
+		val += mem_cgroup_zone_nr_lru_pages(memcg, nid, zid,
+						    LRU_ALL_ANON);
+	return val;
+}
+
 static unsigned long
 mem_cgroup_node_nr_lru_pages(struct mem_cgroup *memcg,
 			int nid, unsigned int lru_mask)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 7641614..109a3bf 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -139,11 +139,42 @@ static bool global_reclaim(struct scan_control *sc)
 {
 	return !sc->target_mem_cgroup;
 }
+
+/*
+ * kmem reclaim should usually not be triggered when we are doing targeted
+ * reclaim. It is only valid when global reclaim is triggered, or when the
+ * underlying memcg has kmem objects.
+ */
+static bool has_kmem_reclaim(struct scan_control *sc)
+{
+	return !sc->target_mem_cgroup ||
+		memcg_kmem_is_active(sc->target_mem_cgroup);
+}
+
+static unsigned long
+zone_nr_reclaimable_pages(struct scan_control *sc, struct zone *zone)
+{
+	if (global_reclaim(sc))
+		return zone_reclaimable_pages(zone);
+	return memcg_zone_reclaimable_pages(sc->target_mem_cgroup, zone);
+}
+
 #else
 static bool global_reclaim(struct scan_control *sc)
 {
 	return true;
 }
+
+static bool has_kmem_reclaim(struct scan_control *sc)
+{
+	return true;
+}
+
+static unsigned long
+zone_nr_reclaimable_pages(struct scan_control *sc, struct zone *zone)
+{
+	return zone_reclaimable_pages(zone);
+}
 #endif
 
 static unsigned long get_lru_size(struct lruvec *lruvec, enum lru_list lru)
@@ -332,6 +363,14 @@ unsigned long shrink_slab(struct shrink_control *shrinkctl,
 	}
 
 	list_for_each_entry(shrinker, &shrinker_list, list) {
+		/*
+		 * If we don't have a target mem cgroup, we scan them all.
+		 * Otherwise we will limit our scan to shrinkers marked as
+		 * memcg aware
+		 */
+		if (shrinkctl->target_mem_cgroup &&
+		    !(shrinker->flags & SHRINKER_MEMCG_AWARE))
+			continue;
 
 		if (!(shrinker->flags & SHRINKER_NUMA_AWARE)) {
 			shrinkctl->nid = 0;
@@ -2319,9 +2358,9 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
 
 		/*
 		 * Don't shrink slabs when reclaiming memory from
-		 * over limit cgroups
+		 * over limit cgroups, unless we know they have kmem objects
 		 */
-		if (global_reclaim(sc)) {
+		if (has_kmem_reclaim(sc)) {
 			unsigned long lru_pages = 0;
 
 			nodes_clear(shrink->nodes_to_scan);
@@ -2330,7 +2369,7 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
 				if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL))
 					continue;
 
-				lru_pages += zone_reclaimable_pages(zone);
+				lru_pages += zone_nr_reclaimable_pages(sc, zone);
 				node_set(zone_to_nid(zone),
 					 shrink->nodes_to_scan);
 			}
@@ -2598,6 +2637,7 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
 	};
 	struct shrink_control shrink = {
 		.gfp_mask = sc.gfp_mask,
+		.target_mem_cgroup = memcg,
 	};
 
 	/*
-- 
1.8.1.4

^ permalink raw reply related	[flat|nested] 103+ messages in thread
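
The filter this patch adds to shrink_slab() can be illustrated outside the kernel. The following is a minimal sketch, assuming simplified structures; select_shrinkers() and the name field are hypothetical and exist only for illustration. The flag values and the skip condition mirror the hunk above.

```c
#include <assert.h>
#include <stddef.h>

#define SHRINKER_NUMA_AWARE	(1 << 0)
#define SHRINKER_MEMCG_AWARE	(1 << 1)

/* Toy stand-ins; the real structures carry many more fields. */
struct mem_cgroup {
	int id;
};

struct shrink_control {
	/* reclaim from this memcg only (if not NULL) */
	struct mem_cgroup *target_mem_cgroup;
};

struct shrinker {
	const char *name;	/* hypothetical, for illustration only */
	unsigned long flags;
};

/*
 * Copy into 'out' the shrinkers that would be visited for this
 * shrink_control and return how many were picked. Under global reclaim
 * (no target memcg) every shrinker is visited; under memcg pressure only
 * those flagged SHRINKER_MEMCG_AWARE are.
 */
static size_t select_shrinkers(const struct shrink_control *sc,
			       const struct shrinker *all, size_t n,
			       const struct shrinker **out)
{
	size_t i, picked = 0;

	assert(sc && all && out);
	for (i = 0; i < n; i++) {
		if (sc->target_mem_cgroup &&
		    !(all[i].flags & SHRINKER_MEMCG_AWARE))
			continue;
		out[picked++] = &all[i];
	}
	return picked;
}
```

Note that the filter only selects shrinkers; as the changelog says, a selected shrinker still frees objects from every memcg until the in-shrinker logic arrives in later patches.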

* [PATCH v10 26/35] memcg,list_lru: duplicate LRUs upon kmemcg creation
       [not found] ` <1370287804-3481-1-git-send-email-glommer-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org>
                     ` (21 preceding siblings ...)
  2013-06-03 19:29   ` [PATCH v10 25/35] vmscan: also shrink slab in memcg pressure Glauber Costa
@ 2013-06-03 19:29   ` Glauber Costa
  2013-06-05 23:08     ` Andrew Morton
  2013-06-03 19:29   ` [PATCH v10 27/35] lru: add an element to a memcg list Glauber Costa
                     ` (7 subsequent siblings)
  30 siblings, 1 reply; 103+ messages in thread
From: Glauber Costa @ 2013-06-03 19:29 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Mel Gorman, Dave Chinner,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, cgroups-u79uwXL29TY76Z2rM5mHXA,
	kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A, Michal Hocko,
	Johannes Weiner, hughd-hpIqsD4AKlfQT0dZR+AlfA, Greg Thelen,
	Glauber Costa, Dave Chinner, Rik van Riel

When a new memcg is created, we need to open up room for its descriptors
in all of the list_lrus that are marked per-memcg. The process is quite
similar to the one we are using for the kmem caches: we initialize the
new structures in an array indexed by kmemcg_id, and grow the array if
needed. Key data like the size of the array will be shared between the
kmem cache code and the list_lru code (they basically describe the same
thing)

Signed-off-by: Glauber Costa <glommer-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org>
Cc: Dave Chinner <dchinner-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
Cc: Mel Gorman <mgorman-l3A5Bk7waGM@public.gmane.org>
Cc: Rik van Riel <riel-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
Cc: Johannes Weiner <hannes-druUgvl0LCNAfugRpC6u6w@public.gmane.org>
Cc: Michal Hocko <mhocko-AlSwsSmVLrQ@public.gmane.org>
Cc: Hugh Dickins <hughd-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
Cc: Kamezawa Hiroyuki <kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A@public.gmane.org>
Cc: Andrew Morton <akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org>
---
 include/linux/list_lru.h   |  48 +++++++++++++-
 include/linux/memcontrol.h |  12 ++++
 lib/list_lru.c             | 102 +++++++++++++++++++++++++++---
 mm/memcontrol.c            | 151 +++++++++++++++++++++++++++++++++++++++++++--
 mm/slab_common.c           |   1 -
 5 files changed, 297 insertions(+), 17 deletions(-)

diff --git a/include/linux/list_lru.h b/include/linux/list_lru.h
index cf59a8a..57fe0e3 100644
--- a/include/linux/list_lru.h
+++ b/include/linux/list_lru.h
@@ -24,6 +24,23 @@ struct list_lru_node {
 	long			nr_items;
 } ____cacheline_aligned_in_smp;
 
+/*
+ * This is supposed to be M x N matrix, where M is kmem-limited memcg, and N is
+ * the number of nodes. Both dimensions are likely to be very small, but are
+ * potentially very big. Therefore we will allocate or grow them dynamically.
+ *
+ * The size of M will increase as new memcgs appear and can be 0 if no memcgs
+ * are being used. This is done in mm/memcontrol.c in a way quite similar than
+ * the way we use for the slab cache management.
+ *
+ * The size o N can't be determined at compile time, but won't increase once we
+ * determine it. It is nr_node_ids, the firmware-provided maximum number of
+ * nodes in a system.
+ */
+struct list_lru_array {
+	struct list_lru_node node[1];
+};
+
 struct list_lru {
 	/*
 	 * Because we use a fixed-size array, this struct can be very big if
@@ -37,9 +54,38 @@ struct list_lru {
 	 */
 	struct list_lru_node	node[MAX_NUMNODES];
 	nodemask_t		active_nodes;
+#ifdef CONFIG_MEMCG_KMEM
+	/* All memcg-aware LRUs will be chained in the lrus list */
+	struct list_head	lrus;
+	/* M x N matrix as described above */
+	struct list_lru_array	**memcg_lrus;
+#endif
 };
 
-int list_lru_init(struct list_lru *lru);
+struct mem_cgroup;
+#ifdef CONFIG_MEMCG_KMEM
+struct list_lru_array *lru_alloc_array(void);
+int memcg_update_all_lrus(unsigned long num);
+void memcg_destroy_all_lrus(struct mem_cgroup *memcg);
+void list_lru_destroy(struct list_lru *lru);
+int __memcg_init_lru(struct list_lru *lru);
+#else
+static inline void list_lru_destroy(struct list_lru *lru)
+{
+}
+#endif
+
+int __list_lru_init(struct list_lru *lru, bool memcg_enabled);
+static inline int list_lru_init(struct list_lru *lru)
+{
+	return __list_lru_init(lru, false);
+}
+
+static inline int list_lru_init_memcg(struct list_lru *lru)
+{
+	return __list_lru_init(lru, true);
+}
+
 int list_lru_add(struct list_lru *lru, struct list_head *item);
 int list_lru_del(struct list_lru *lru, struct list_head *item);
 
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 489c6d7..3442eb9 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -23,6 +23,7 @@
 #include <linux/vm_event_item.h>
 #include <linux/hardirq.h>
 #include <linux/jump_label.h>
+#include <linux/list_lru.h>
 
 struct mem_cgroup;
 struct page_cgroup;
@@ -470,6 +471,12 @@ void memcg_update_array_size(int num_groups);
 struct kmem_cache *
 __memcg_kmem_get_cache(struct kmem_cache *cachep, gfp_t gfp);
 
+int memcg_new_lru(struct list_lru *lru);
+int memcg_init_lru(struct list_lru *lru);
+
+int memcg_kmem_update_lru_size(struct list_lru *lru, int num_groups,
+			       bool new_lru);
+
 void mem_cgroup_destroy_cache(struct kmem_cache *cachep);
 void kmem_cache_destroy_memcg_children(struct kmem_cache *s);
 
@@ -633,6 +640,11 @@ memcg_kmem_get_cache(struct kmem_cache *cachep, gfp_t gfp)
 static inline void kmem_cache_destroy_memcg_children(struct kmem_cache *s)
 {
 }
+
+static inline int memcg_init_lru(struct list_lru *lru)
+{
+	return 0;
+}
 #endif /* CONFIG_MEMCG_KMEM */
 #endif /* _LINUX_MEMCONTROL_H */
 
diff --git a/lib/list_lru.c b/lib/list_lru.c
index dae13d6..db35edc 100644
--- a/lib/list_lru.c
+++ b/lib/list_lru.c
@@ -2,12 +2,17 @@
  * Copyright (c) 2010-2012 Red Hat, Inc. All rights reserved.
  * Author: David Chinner
  *
+ * Memcg Awareness
+ * Copyright (C) 2013 Parallels Inc.
+ * Author: Glauber Costa
+ *
  * Generic LRU infrastructure
  */
 #include <linux/kernel.h>
 #include <linux/module.h>
 #include <linux/mm.h>
 #include <linux/list_lru.h>
+#include <linux/memcontrol.h>
 
 int
 list_lru_add(
@@ -163,18 +168,97 @@ list_lru_dispose_all(
 	return total;
 }
 
-int
-list_lru_init(
-	struct list_lru	*lru)
+/*
+ * This protects the list of all LRU in the system. One only needs
+ * to take when registering an LRU, or when duplicating the list of lrus.
+ * Transversing an LRU can and should be done outside the lock
+ */
+static DEFINE_MUTEX(all_memcg_lrus_mutex);
+static LIST_HEAD(all_memcg_lrus);
+
+static void list_lru_init_one(struct list_lru_node *lru)
 {
+	spin_lock_init(&lru->lock);
+	INIT_LIST_HEAD(&lru->list);
+	lru->nr_items = 0;
+}
+
+struct list_lru_array *lru_alloc_array(void)
+{
+	struct list_lru_array *lru_array;
 	int i;
 
-	nodes_clear(lru->active_nodes);
-	for (i = 0; i < MAX_NUMNODES; i++) {
-		spin_lock_init(&lru->node[i].lock);
-		INIT_LIST_HEAD(&lru->node[i].list);
-		lru->node[i].nr_items = 0;
+	lru_array = kzalloc(nr_node_ids * sizeof(struct list_lru_node),
+				GFP_KERNEL);
+	if (!lru_array)
+		return NULL;
+
+	for (i = 0; i < nr_node_ids; i++)
+		list_lru_init_one(&lru_array->node[i]);
+
+	return lru_array;
+}
+
+#ifdef CONFIG_MEMCG_KMEM
+int __memcg_init_lru(struct list_lru *lru)
+{
+	int ret;
+
+	INIT_LIST_HEAD(&lru->lrus);
+	mutex_lock(&all_memcg_lrus_mutex);
+	list_add(&lru->lrus, &all_memcg_lrus);
+	ret = memcg_new_lru(lru);
+	mutex_unlock(&all_memcg_lrus_mutex);
+	return ret;
+}
+
+int memcg_update_all_lrus(unsigned long num)
+{
+	int ret = 0;
+	struct list_lru *lru;
+
+	mutex_lock(&all_memcg_lrus_mutex);
+	list_for_each_entry(lru, &all_memcg_lrus, lrus) {
+		ret = memcg_kmem_update_lru_size(lru, num, false);
+		if (ret)
+			goto out;
+	}
+out:
+	mutex_unlock(&all_memcg_lrus_mutex);
+	return ret;
+}
+
+void list_lru_destroy(struct list_lru *lru)
+{
+	mutex_lock(&all_memcg_lrus_mutex);
+	list_del(&lru->lrus);
+	mutex_unlock(&all_memcg_lrus_mutex);
+}
+
+void memcg_destroy_all_lrus(struct mem_cgroup *memcg)
+{
+	struct list_lru *lru;
+	mutex_lock(&all_memcg_lrus_mutex);
+	list_for_each_entry(lru, &all_memcg_lrus, lrus) {
+		kfree(lru->memcg_lrus[memcg_cache_id(memcg)]);
+		lru->memcg_lrus[memcg_cache_id(memcg)] = NULL;
+		/* everybody must beaware that this memcg is no longer valid */
+		wmb();
 	}
+	mutex_unlock(&all_memcg_lrus_mutex);
+}
+#endif
+
+int __list_lru_init(struct list_lru *lru, bool memcg_enabled)
+{
+	int i;
+
+	nodes_clear(lru->active_nodes);
+	for (i = 0; i < MAX_NUMNODES; i++)
+		list_lru_init_one(&lru->node[i]);
+
+	if (memcg_enabled)
+		return memcg_init_lru(lru);
 	return 0;
 }
-EXPORT_SYMBOL_GPL(list_lru_init);
+EXPORT_SYMBOL_GPL(__list_lru_init);
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 27af2d1..5d31b4a 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3163,16 +3163,30 @@ int memcg_update_cache_sizes(struct mem_cgroup *memcg)
 	memcg_kmem_set_activated(memcg);
 
 	ret = memcg_update_all_caches(num+1);
-	if (ret) {
-		ida_simple_remove(&kmem_limited_groups, num);
-		memcg_kmem_clear_activated(memcg);
-		return ret;
-	}
+	if (ret)
+		goto out;
+
+	/*
+	 * We should make sure that the array size is not updated until we are
+	 * done; otherwise we have no easy way to know whether or not we should
+	 * grow the array.
+	 */
+	ret = memcg_update_all_lrus(num + 1);
+	if (ret)
+		goto out;
 
 	memcg->kmemcg_id = num;
+
+	memcg_update_array_size(num + 1);
+
 	INIT_LIST_HEAD(&memcg->memcg_slab_caches);
 	mutex_init(&memcg->slab_caches_mutex);
+
 	return 0;
+out:
+	ida_simple_remove(&kmem_limited_groups, num);
+	memcg_kmem_clear_activated(memcg);
+	return ret;
 }
 
 static size_t memcg_caches_array_size(int num_groups)
@@ -3254,6 +3268,129 @@ int memcg_update_cache_size(struct kmem_cache *s, int num_groups)
 	return 0;
 }
 
+/*
+ * memcg_kmem_update_lru_size - fill in kmemcg info into a list_lru
+ *
+ * @lru: the lru we are operating with
+ * @num_groups: how many kmem-limited cgroups we have
+ * @new_lru: true if this is a new_lru being created, false if this
+ * was triggered from the memcg side
+ *
+ * Returns 0 on success, and an error code otherwise.
+ *
+ * This function can be called either when a new kmem-limited memcg appears,
+ * or when a new list_lru is created. The work is roughly the same in both
+ * cases, but in the latter we never have to expand the array size.
+ *
+ * This is always protected by all_memcg_lrus_mutex on the list_lru side. But
+ * a race can still exist if a new memcg becomes kmem limited at the same time
+ * that we are registering a new list_lru. Memcg creation is protected by
+ * memcg_create_mutex, so the creation of a new lru has to be protected by
+ * that as well.
+ *
+ * The lock ordering is that memcg_create_mutex needs to be acquired before
+ * the lru-side mutex.
+ */
+int memcg_kmem_update_lru_size(struct list_lru *lru, int num_groups,
+			       bool new_lru)
+{
+	struct list_lru_array **new_lru_array;
+	struct list_lru_array *lru_array;
+
+	lru_array = lru_alloc_array();
+	if (!lru_array)
+		return -ENOMEM;
+
+	/*
+	 * When a new LRU is created, we still need to update all data for that
+	 * LRU. The procedure for late LRUs and new memcgs are quite similar, we
+	 * only need to make sure we get into the loop even if num_groups <
+	 * memcg_limited_groups_array_size.
+	 */
+	if ((num_groups > memcg_limited_groups_array_size) || new_lru) {
+		int i;
+		struct list_lru_array **old_array;
+		size_t size = memcg_caches_array_size(num_groups);
+		int num_memcgs = memcg_limited_groups_array_size;
+
+		new_lru_array = kzalloc(size * sizeof(void *), GFP_KERNEL);
+		if (!new_lru_array) {
+			kfree(lru_array);
+			return -ENOMEM;
+		}
+
+		for (i = 0; lru->memcg_lrus && (i < num_memcgs); i++) {
+			if (!lru->memcg_lrus[i])
+				continue;
+			new_lru_array[i] = lru->memcg_lrus[i];
+		}
+
+		old_array = lru->memcg_lrus;
+		lru->memcg_lrus = new_lru_array;
+		/*
+		 * We don't need a barrier here because we are just copying
+		 * information over. Anybody operating in memcg_lrus will
+		 * either follow the new array or the old one and they contain
+		 * exactly the same information. The new space in the end is
+		 * always empty anyway.
+		 */
+		if (lru->memcg_lrus)
+			kfree(old_array);
+	}
+
+	if (lru->memcg_lrus) {
+		lru->memcg_lrus[num_groups - 1] = lru_array;
+		/*
+		 * Here we do need the barrier, because of the state transition
+		 * implied by the assignment of the array. All users should be
+		 * able to see it
+		 */
+		wmb();
+	}
+	return 0;
+}
+
+/*
+ * This is called with the LRU-mutex being held.
+ */
+int memcg_new_lru(struct list_lru *lru)
+{
+	struct mem_cgroup *iter;
+
+	if (!memcg_kmem_enabled())
+		return 0;
+
+	for_each_mem_cgroup(iter) {
+		int ret;
+		int memcg_id = memcg_cache_id(iter);
+		if (memcg_id < 0)
+			continue;
+
+		ret = memcg_kmem_update_lru_size(lru, memcg_id + 1, true);
+		if (ret) {
+			mem_cgroup_iter_break(root_mem_cgroup, iter);
+			return ret;
+		}
+	}
+	return 0;
+}
+
+/*
+ * We need to call back and forth from memcg to LRU because of the lock
+ * ordering.  This complicates the flow a little bit, but since the memcg mutex
+ * is held through the whole duration of memcg creation, we need to hold it
+ * before we hold the LRU-side mutex in the case of a new list creation as
+ * well.
+ */
+int memcg_init_lru(struct list_lru *lru)
+{
+	int ret;
+	mutex_lock(&memcg_create_mutex);
+	ret = __memcg_init_lru(lru);
+	mutex_unlock(&memcg_create_mutex);
+	return ret;
+}
+
 int memcg_register_cache(struct mem_cgroup *memcg, struct kmem_cache *s,
 			 struct kmem_cache *root_cache)
 {
@@ -6063,8 +6200,10 @@ static void kmem_cgroup_destroy(struct mem_cgroup *memcg)
 	 * possible that the charges went down to 0 between mark_dead and the
 	 * res_counter read, so in that case, we don't need the put
 	 */
-	if (memcg_kmem_test_and_clear_dead(memcg))
+	if (memcg_kmem_test_and_clear_dead(memcg)) {
+		memcg_destroy_all_lrus(memcg);
 		mem_cgroup_put(memcg);
+	}
 }
 #else
 static int memcg_init_kmem(struct mem_cgroup *memcg, struct cgroup_subsys *ss)
diff --git a/mm/slab_common.c b/mm/slab_common.c
index ff3218a..b729c53 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -102,7 +102,6 @@ int memcg_update_all_caches(int num_memcgs)
 			goto out;
 	}
 
-	memcg_update_array_size(num_memcgs);
 out:
 	mutex_unlock(&slab_mutex);
 	return ret;
-- 
1.8.1.4

^ permalink raw reply related	[flat|nested] 103+ messages in thread
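
The array handling described in the changelog (grow the per-memcg pointer array if needed, keep the existing entries, then install the new memcg's slot) can be sketched in userspace. This is an illustration under assumptions, not the patch's code: struct lru, struct lru_array and lru_add_memcg() are hypothetical stand-ins, calloc()/free() replace kzalloc()/kfree(), the array grows to exactly num_groups rather than to memcg_caches_array_size(), and locking and memory barriers are left out.

```c
#include <assert.h>
#include <stdlib.h>

/* Stand-in for struct list_lru_array (one entry per memcg). */
struct lru_array {
	long nr_items;
};

/* Stand-in for the memcg part of struct list_lru. */
struct lru {
	struct lru_array **memcg_lrus;	/* indexed by kmemcg id */
	size_t size;			/* number of slots allocated */
};

/*
 * Make room for the memcg whose id is num_groups - 1 and install a fresh
 * entry for it. Assumes that slot is currently empty.
 * Returns 0 on success, -1 on allocation failure.
 */
static int lru_add_memcg(struct lru *lru, size_t num_groups)
{
	struct lru_array *entry;

	assert(num_groups >= 1);
	entry = calloc(1, sizeof(*entry));
	if (!entry)
		return -1;

	if (num_groups > lru->size) {
		struct lru_array **bigger;
		size_t i;

		bigger = calloc(num_groups, sizeof(*bigger));
		if (!bigger) {
			free(entry);
			return -1;
		}
		/* Carry over every existing entry; new slots stay NULL. */
		for (i = 0; i < lru->size; i++)
			bigger[i] = lru->memcg_lrus[i];

		free(lru->memcg_lrus);	/* free(NULL) is a no-op */
		lru->memcg_lrus = bigger;
		lru->size = num_groups;
	}

	lru->memcg_lrus[num_groups - 1] = entry;
	return 0;
}
```

A reader walking memcg_lrus concurrently would need the ordering guarantees discussed in the patch comments; the sketch is single-threaded and ignores them.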

* Re: [PATCH v10 26/35] memcg,list_lru: duplicate LRUs upon kmemcg creation
  2013-06-03 19:29   ` [PATCH v10 26/35] memcg,list_lru: duplicate LRUs upon kmemcg creation Glauber Costa
@ 2013-06-05 23:08     ` Andrew Morton
       [not found]       ` <20130605160828.1ec9f3538258d9a6d6c74083-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org>
  0 siblings, 1 reply; 103+ messages in thread
From: Andrew Morton @ 2013-06-05 23:08 UTC (permalink / raw)
  To: Glauber Costa
  Cc: linux-fsdevel, Mel Gorman, Dave Chinner, linux-mm, cgroups,
	kamezawa.hiroyu, Michal Hocko, Johannes Weiner, hughd,
	Greg Thelen, Dave Chinner, Rik van Riel

On Mon,  3 Jun 2013 23:29:55 +0400 Glauber Costa <glommer@openvz.org> wrote:

> When a new memcg is created, we need to open up room for its descriptors
> in all of the list_lrus that are marked per-memcg. The process is quite
> similar to the one we are using for the kmem caches: we initialize the
> new structures in an array indexed by kmemcg_id, and grow the array if
> needed. Key data like the size of the array will be shared between the
> kmem cache code and the list_lru code (they basically describe the same
> thing)

Gee this is a big patchset.

>
> ...
>
> --- a/include/linux/list_lru.h
> +++ b/include/linux/list_lru.h
> @@ -24,6 +24,23 @@ struct list_lru_node {
>  	long			nr_items;
>  } ____cacheline_aligned_in_smp;
>  
> +/*
> + * This is supposed to be M x N matrix, where M is kmem-limited memcg, and N is
> + * the number of nodes. Both dimensions are likely to be very small, but are
> + * potentially very big. Therefore we will allocate or grow them dynamically.
> + *
> + * The size of M will increase as new memcgs appear and can be 0 if no memcgs
> + * are being used. This is done in mm/memcontrol.c in a way quite similar than

"similar to"

> + * the way we use for the slab cache management.
> + *
> + * The size o N can't be determined at compile time, but won't increase once we

"value of N"

> + * determine it. It is nr_node_ids, the firmware-provided maximum number of
> + * nodes in a system.


> + */
> +struct list_lru_array {
> +	struct list_lru_node node[1];
> +};
> +
>  struct list_lru {
>  	/*
>  	 * Because we use a fixed-size array, this struct can be very big if
> @@ -37,9 +54,38 @@ struct list_lru {
>  	 */
>  	struct list_lru_node	node[MAX_NUMNODES];
>  	nodemask_t		active_nodes;
> +#ifdef CONFIG_MEMCG_KMEM
> +	/* All memcg-aware LRUs will be chained in the lrus list */
> +	struct list_head	lrus;
> +	/* M x N matrix as described above */
> +	struct list_lru_array	**memcg_lrus;
> +#endif
>  };

It's here where I decided "this code shouldn't be in lib/" ;)

> -int list_lru_init(struct list_lru *lru);
> +struct mem_cgroup;
> +#ifdef CONFIG_MEMCG_KMEM
> +struct list_lru_array *lru_alloc_array(void);

Experience teaches us that it is often a mistake for callees to assume
they will always be called in GFP_KERNEL context.  For high-level init
code we can usually get away with it, but I do think that the decision
to not provide a gfp_t argument should be justified up-front, and that
this restriction should be mentioned in the interface documentation
(when it is written ;)).

>
> ...
>
> @@ -163,18 +168,97 @@ list_lru_dispose_all(
>  	return total;
>  }
>  
> -int
> -list_lru_init(
> -	struct list_lru	*lru)
> +/*
> + * This protects the list of all LRU in the system. One only needs
> + * to take when registering an LRU, or when duplicating the list of lrus.

That isn't very grammatical.

> + * Transversing an LRU can and should be done outside the lock
> + */
> +static DEFINE_MUTEX(all_memcg_lrus_mutex);
> +static LIST_HEAD(all_memcg_lrus);
> +
> +static void list_lru_init_one(struct list_lru_node *lru)
>  {
> +	spin_lock_init(&lru->lock);
> +	INIT_LIST_HEAD(&lru->list);
> +	lru->nr_items = 0;
> +}
> +
> +struct list_lru_array *lru_alloc_array(void)
> +{
> +	struct list_lru_array *lru_array;
>  	int i;
>  
> -	nodes_clear(lru->active_nodes);
> -	for (i = 0; i < MAX_NUMNODES; i++) {
> -		spin_lock_init(&lru->node[i].lock);
> -		INIT_LIST_HEAD(&lru->node[i].list);
> -		lru->node[i].nr_items = 0;
> +	lru_array = kzalloc(nr_node_ids * sizeof(struct list_lru_node),
> +				GFP_KERNEL);

Could use kcalloc() here.

> +	if (!lru_array)
> +		return NULL;
> +
> +	for (i = 0; i < nr_node_ids; i++)
> +		list_lru_init_one(&lru_array->node[i]);
> +
> +	return lru_array;
> +}
> +
> +#ifdef CONFIG_MEMCG_KMEM
> +int __memcg_init_lru(struct list_lru *lru)
> +{
> +	int ret;
> +
> +	INIT_LIST_HEAD(&lru->lrus);
> +	mutex_lock(&all_memcg_lrus_mutex);
> +	list_add(&lru->lrus, &all_memcg_lrus);
> +	ret = memcg_new_lru(lru);
> +	mutex_unlock(&all_memcg_lrus_mutex);
> +	return ret;
> +}
> +
> +int memcg_update_all_lrus(unsigned long num)
> +{
> +	int ret = 0;
> +	struct list_lru *lru;
> +
> +	mutex_lock(&all_memcg_lrus_mutex);
> +	list_for_each_entry(lru, &all_memcg_lrus, lrus) {
> +		ret = memcg_kmem_update_lru_size(lru, num, false);
> +		if (ret)
> +			goto out;
> +	}
> +out:
> +	mutex_unlock(&all_memcg_lrus_mutex);
> +	return ret;
> +}
> +
> +void list_lru_destroy(struct list_lru *lru)

This is a memcg-specific function (which lives in lib/list_lru.c!) and
hence should be called, say, memcg_list_lru_destroy().

> +{
> +	mutex_lock(&all_memcg_lrus_mutex);
> +	list_del(&lru->lrus);
> +	mutex_unlock(&all_memcg_lrus_mutex);
> +}
> +
> +void memcg_destroy_all_lrus(struct mem_cgroup *memcg)
> +{
> +	struct list_lru *lru;
> +	mutex_lock(&all_memcg_lrus_mutex);
> +	list_for_each_entry(lru, &all_memcg_lrus, lrus) {
> +		kfree(lru->memcg_lrus[memcg_cache_id(memcg)]);
> +		lru->memcg_lrus[memcg_cache_id(memcg)] = NULL;

Some common-subexpression-elimination-by-hand would probably improve
the output code here.

> +		/* everybody must beaware that this memcg is no longer valid */

"be aware"

> +		wmb();

The code implies that other code paths can come in here and start
playing with the pointer without taking all_memcg_lrus_mutex?  If so,
where, how, why, etc?

I'd be more comfortable if the sequence was something like

	lru->memcg_lrus[memcg_cache_id(memcg)] = NULL;
	wmb();
	kfree(lru->memcg_lrus[memcg_cache_id(memcg)]);

but that still has holes and is still scary.
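
For illustration only, a userspace sketch of that ordering using C11
atomics instead of the kernel primitives.  Taken literally, the three
lines above would hand the already-cleared slot to kfree(), so the old
pointer has to be saved first.  The struct and function names here are
hypothetical, and the sketch does not close the remaining hole: the
caller still has to wait until no lockless reader can hold the old
pointer before freeing it.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct list_lru_array; not the kernel type. */
struct lru_array_sketch {
	long nr_items;
};

/*
 * Unpublish slots[id] and hand the old pointer back to the caller.
 * The index is computed once and the slot is read exactly once, so the
 * pointer that gets freed later is the one that was really published.
 */
static struct lru_array_sketch *
unpublish_slot(struct lru_array_sketch *_Atomic *slots, int id)
{
	assert(id >= 0);
	/* Swap in NULL; acq_rel ordering plays the role of the barrier. */
	return atomic_exchange_explicit(&slots[id], NULL,
					memory_order_acq_rel);
}
```

A caller would free the returned pointer only after whatever mechanism
it uses to wait for readers (a lock, or an RCU grace period in the
kernel) has completed.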


What's going on here?

>  	}
> +	mutex_unlock(&all_memcg_lrus_mutex);
> +}
> +#endif
> +
> +int __list_lru_init(struct list_lru *lru, bool memcg_enabled)
> +{
> +	int i;
> +
> +	nodes_clear(lru->active_nodes);
> +	for (i = 0; i < MAX_NUMNODES; i++)
> +		list_lru_init_one(&lru->node[i]);
> +
> +	if (memcg_enabled)
> +		return memcg_init_lru(lru);

OK, this is weird.  list_lru.c calls into a memcg initialisation
function!  That memcg initialisation function then calls into
list_lru.c stuff, as expected.

Seems screwed up.  What's going on here?

>  	return 0;
>  }
> -EXPORT_SYMBOL_GPL(list_lru_init);
> +EXPORT_SYMBOL_GPL(__list_lru_init);
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 27af2d1..5d31b4a 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -3163,16 +3163,30 @@ int memcg_update_cache_sizes(struct mem_cgroup *memcg)
>  	memcg_kmem_set_activated(memcg);
>  
>  	ret = memcg_update_all_caches(num+1);
> -	if (ret) {
> -		ida_simple_remove(&kmem_limited_groups, num);
> -		memcg_kmem_clear_activated(memcg);
> -		return ret;
> -	}
> +	if (ret)
> +		goto out;
> +
> +	/*
> +	 * We should make sure that the array size is not updated until we are
> +	 * done; otherwise we have no easy way to know whether or not we should
> +	 * grow the array.
> +	 */

What's the locking here, to prevent concurrent array-resizers?

> +	ret = memcg_update_all_lrus(num + 1);
> +	if (ret)
> +		goto out;
>  
>  	memcg->kmemcg_id = num;
> +
> +	memcg_update_array_size(num + 1);
> +
>  	INIT_LIST_HEAD(&memcg->memcg_slab_caches);
>  	mutex_init(&memcg->slab_caches_mutex);
> +
>  	return 0;
> +out:
> +	ida_simple_remove(&kmem_limited_groups, num);
> +	memcg_kmem_clear_activated(memcg);
> +	return ret;
>  }
>  
>  static size_t memcg_caches_array_size(int num_groups)
> @@ -3254,6 +3268,129 @@ int memcg_update_cache_size(struct kmem_cache *s, int num_groups)
>  	return 0;
>  }
>  
> +/*
> + * memcg_kmem_update_lru_size - fill in kmemcg info into a list_lru
> + *
> + * @lru: the lru we are operating with
> + * @num_groups: how many kmem-limited cgroups we have
> + * @new_lru: true if this is a new_lru being created, false if this
> + * was triggered from the memcg side
> + *
> + * Returns 0 on success, and an error code otherwise.
> + *
> + * This function can be called either when a new kmem-limited memcg appears,
> + * or when a new list_lru is created. The work is roughly the same in two cases,

"both cases"

> + * but in the later we never have to expand the array size.

"latter"

> + *
> + * This is always protected by the all_lrus_mutex from the list_lru side.  But
> + * a race can still exists if a new memcg becomes kmem limited at the same time

"exist"

> + * that we are registering a new memcg. Creation is protected by the
> + * memcg_mutex, so the creation of a new lru have to be protected by that as

"has"

> + * well.
> + *
> + * The lock ordering is that the memcg_mutex needs to be acquired before the
> + * lru-side mutex.

It's nice to provide the C name of this "lru-side mutex".

> + */

This purports to be a kerneldoc comment, but it doesn't start with the
kerneldoc /** token.  Please review the entire patchset for this
(common) oddity.

> +int memcg_kmem_update_lru_size(struct list_lru *lru, int num_groups,
> +			       bool new_lru)
> +{
> +	struct list_lru_array **new_lru_array;
> +	struct list_lru_array *lru_array;
> +
> +	lru_array = lru_alloc_array();
> +	if (!lru_array)
> +		return -ENOMEM;
> +
> +	/*
> +	 * When a new LRU is created, we still need to update all data for that
> +	 * LRU. The procedure for late LRUs and new memcgs are quite similar, we

"procedures"

> +	 * only need to make sure we get into the loop even if num_groups <
> +	 * memcg_limited_groups_array_size.

This sentence is hard to follow.  Particularly the "even if" part. 
Rework it?

> +	 */
> +	if ((num_groups > memcg_limited_groups_array_size) || new_lru) {
> +		int i;
> +		struct list_lru_array **old_array;
> +		size_t size = memcg_caches_array_size(num_groups);
> +		int num_memcgs = memcg_limited_groups_array_size;
> +
> +		new_lru_array = kzalloc(size * sizeof(void *), GFP_KERNEL);

Could use kcalloc().

What are the implications of that GFP_KERNEL?  That we cannot take
memcg_mutex and "the lru-side mutex" on the direct reclaim -> shrink
codepaths.  Is that honoured?  Any other potential problems here?

> +		if (!new_lru_array) {
> +			kfree(lru_array);
> +			return -ENOMEM;
> +		}
> +
> +		for (i = 0; lru->memcg_lrus && (i < num_memcgs); i++) {
> +			if (lru->memcg_lrus && lru->memcg_lrus[i])
> +				continue;
> +			new_lru_array[i] =  lru->memcg_lrus[i];
> +		}
> +
> +		old_array = lru->memcg_lrus;
> +		lru->memcg_lrus = new_lru_array;
> +		/*
> +		 * We don't need a barrier here because we are just copying
> +		 * information over. Anybody operating in memcg_lrus will

s/in/on/

> +		 * either follow the new array or the old one and they contain
> +		 * exactly the same information. The new space in the end is

s/in/at/

> +		 * always empty anyway.
> +		 */
> +		if (lru->memcg_lrus)
> +			kfree(old_array);
> +	}
> +
> +	if (lru->memcg_lrus) {
> +		lru->memcg_lrus[num_groups - 1] = lru_array;
> +		/*
> +		 * Here we do need the barrier, because of the state transition
> +		 * implied by the assignment of the array. All users should be
> +		 * able to see it
> +		 */
> +		wmb();

Am worried about this lockless concurrency stuff.  Perhaps putting a
description of the overall design somewhere would be sensible.

> +	}
> +	return 0;
> +}
> +
> +/*
> + * This is called with the LRU-mutex being held.

That's "all_memcg_lrus_mutex", yes?  Not "all_lrus_mutex".  Clear as mud :(

> + */
> +int memcg_new_lru(struct list_lru *lru)
> +{
> +	struct mem_cgroup *iter;
> +
> +	if (!memcg_kmem_enabled())
> +		return 0;

So the caller took all_memcg_lrus_mutex needlessly in this case.  Could
be optimised.

> +	for_each_mem_cgroup(iter) {
> +		int ret;
> +		int memcg_id = memcg_cache_id(iter);
> +		if (memcg_id < 0)
> +			continue;
> +
> +		ret = memcg_kmem_update_lru_size(lru, memcg_id + 1, true);
> +		if (ret) {
> +			mem_cgroup_iter_break(root_mem_cgroup, iter);
> +			return ret;
> +		}
> +	}
> +	return 0;
> +}
> +
> +/*
> + * We need to call back and forth from memcg to LRU because of the lock
> + * ordering.  This complicates the flow a little bit, but since the memcg mutex

"the memcg mutex" is named... what?

> + * is held through the whole duration of memcg creation, we need to hold it
> + * before we hold the LRU-side mutex in the case of a new list creation as

"LRU-side mutex" has a name?

> + * well.
> + */
>
> ...
>


^ permalink raw reply	[flat|nested] 103+ messages in thread

[parent not found: <20130605160828.1ec9f3538258d9a6d6c74083-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org>]

* Re: [PATCH v10 26/35] memcg,list_lru: duplicate LRUs upon kmemcg creation
       [not found]       ` <20130605160828.1ec9f3538258d9a6d6c74083-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org>
@ 2013-06-06  8:52         ` Glauber Costa
  0 siblings, 0 replies; 103+ messages in thread
From: Glauber Costa @ 2013-06-06  8:52 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Glauber Costa, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Mel Gorman,
	Dave Chinner, linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
	cgroups-u79uwXL29TY76Z2rM5mHXA,
	kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A, Michal Hocko,
	Johannes Weiner, hughd-hpIqsD4AKlfQT0dZR+AlfA, Greg Thelen,
	Dave Chinner, Rik van Riel

On 06/06/2013 03:08 AM, Andrew Morton wrote:
> On Mon,  3 Jun 2013 23:29:55 +0400 Glauber Costa <glommer-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org> wrote:
> 
>> When a new memcg is created, we need to open up room for its descriptors
>> in all of the list_lrus that are marked per-memcg. The process is quite
>> similar to the one we are using for the kmem caches: we initialize the
>> new structures in an array indexed by kmemcg_id, and grow the array if
>> needed. Key data like the size of the array will be shared between the
>> kmem cache code and the list_lru code (they basically describe the same
>> thing)
> 
> Gee this is a big patchset.
> 
>>
>> ...
>>
>> --- a/include/linux/list_lru.h
>> +++ b/include/linux/list_lru.h
>> @@ -24,6 +24,23 @@ struct list_lru_node {
>>  	long			nr_items;
>>  } ____cacheline_aligned_in_smp;
>>  
>> +/*
>> + * This is supposed to be M x N matrix, where M is kmem-limited memcg, and N is
>> + * the number of nodes. Both dimensions are likely to be very small, but are
>> + * potentially very big. Therefore we will allocate or grow them dynamically.
>> + *
>> + * The size of M will increase as new memcgs appear and can be 0 if no memcgs
>> + * are being used. This is done in mm/memcontrol.c in a way quite similar than
> 
> "similar to"
> 
>> + * the way we use for the slab cache management.
>> + *
>> + * The size o N can't be determined at compile time, but won't increase once we
> 
> "value of N"
> 
>> + * determine it. It is nr_node_ids, the firmware-provided maximum number of
>> + * nodes in a system.
> 
> 
>> + */
>> +struct list_lru_array {
>> +	struct list_lru_node node[1];
>> +};
>> +
>>  struct list_lru {
>>  	/*
>>  	 * Because we use a fixed-size array, this struct can be very big if
>> @@ -37,9 +54,38 @@ struct list_lru {
>>  	 */
>>  	struct list_lru_node	node[MAX_NUMNODES];
>>  	nodemask_t		active_nodes;
>> +#ifdef CONFIG_MEMCG_KMEM
>> +	/* All memcg-aware LRUs will be chained in the lrus list */
>> +	struct list_head	lrus;
>> +	/* M x N matrix as described above */
>> +	struct list_lru_array	**memcg_lrus;
>> +#endif
>>  };
> 
> It's here where I decided "this code shouldn't be in lib/" ;)
> 
>> -int list_lru_init(struct list_lru *lru);
>> +struct mem_cgroup;
>> +#ifdef CONFIG_MEMCG_KMEM
>> +struct list_lru_array *lru_alloc_array(void);
> 
> Experience teaches us that it is often a mistake for callees to assume
> they will always be called in GFP_KERNEL context.  For high-level init
> code we can usually get away with it, but I do think that the decision
> to not provide a gfp_t argument should be justified up-front, and that
> this restriction should be mentioned in the interface documentation
> (when it is written ;)).
> 
>>
>> ...
>>
>> @@ -163,18 +168,97 @@ list_lru_dispose_all(
>>  	return total;
>>  }
>>  
>> -int
>> -list_lru_init(
>> -	struct list_lru	*lru)
>> +/*
>> + * This protects the list of all LRU in the system. One only needs
>> + * to take when registering an LRU, or when duplicating the list of lrus.
> 
> That isn't very grammatical.
> 
>> + * Transversing an LRU can and should be done outside the lock
>> + */
>> +static DEFINE_MUTEX(all_memcg_lrus_mutex);
>> +static LIST_HEAD(all_memcg_lrus);
>> +
>> +static void list_lru_init_one(struct list_lru_node *lru)
>>  {
>> +	spin_lock_init(&lru->lock);
>> +	INIT_LIST_HEAD(&lru->list);
>> +	lru->nr_items = 0;
>> +}
>> +
>> +struct list_lru_array *lru_alloc_array(void)
>> +{
>> +	struct list_lru_array *lru_array;
>>  	int i;
>>  
>> -	nodes_clear(lru->active_nodes);
>> -	for (i = 0; i < MAX_NUMNODES; i++) {
>> -		spin_lock_init(&lru->node[i].lock);
>> -		INIT_LIST_HEAD(&lru->node[i].list);
>> -		lru->node[i].nr_items = 0;
>> +	lru_array = kzalloc(nr_node_ids * sizeof(struct list_lru_node),
>> +				GFP_KERNEL);
> 
> Could use kcalloc() here.
> 
>> +	if (!lru_array)
>> +		return NULL;
>> +
>> +	for (i = 0; i < nr_node_ids; i++)
>> +		list_lru_init_one(&lru_array->node[i]);
>> +
>> +	return lru_array;
>> +}
>> +
>> +#ifdef CONFIG_MEMCG_KMEM
>> +int __memcg_init_lru(struct list_lru *lru)
>> +{
>> +	int ret;
>> +
>> +	INIT_LIST_HEAD(&lru->lrus);
>> +	mutex_lock(&all_memcg_lrus_mutex);
>> +	list_add(&lru->lrus, &all_memcg_lrus);
>> +	ret = memcg_new_lru(lru);
>> +	mutex_unlock(&all_memcg_lrus_mutex);
>> +	return ret;
>> +}
>> +
>> +int memcg_update_all_lrus(unsigned long num)
>> +{
>> +	int ret = 0;
>> +	struct list_lru *lru;
>> +
>> +	mutex_lock(&all_memcg_lrus_mutex);
>> +	list_for_each_entry(lru, &all_memcg_lrus, lrus) {
>> +		ret = memcg_kmem_update_lru_size(lru, num, false);
>> +		if (ret)
>> +			goto out;
>> +	}
>> +out:
>> +	mutex_unlock(&all_memcg_lrus_mutex);
>> +	return ret;
>> +}
>> +
>> +void list_lru_destroy(struct list_lru *lru)
> 
> This is a memcg-specific function (which lives in lib/list_lru.c!) and
> hence should be called, say, memcg_list_lru_destroy().
> 
>> +{
>> +	mutex_lock(&all_memcg_lrus_mutex);
>> +	list_del(&lru->lrus);
>> +	mutex_unlock(&all_memcg_lrus_mutex);
>> +}
>> +
>> +void memcg_destroy_all_lrus(struct mem_cgroup *memcg)
>> +{
>> +	struct list_lru *lru;
>> +	mutex_lock(&all_memcg_lrus_mutex);
>> +	list_for_each_entry(lru, &all_memcg_lrus, lrus) {
>> +		kfree(lru->memcg_lrus[memcg_cache_id(memcg)]);
>> +		lru->memcg_lrus[memcg_cache_id(memcg)] = NULL;
> 
> Some common-subexpression-elimination-by-hand would probably improve
> the output code here.
> 
>> +		/* everybody must beaware that this memcg is no longer valid */
> 
> "be aware"
> 
>> +		wmb();
> 
> The code implies that other code paths can come in here and start
> playing with the pointer without taking all_memcg_lrus_mutex?  If so,
> where, how, why, etc?
> 
> I'd be more comfortable if the sequence was something like
> 
> 	lru->memcg_lrus[memcg_cache_id(memcg)] = NULL;
> 	wmb();
> 	kfree(lru->memcg_lrus[memcg_cache_id(memcg)]);
> 
> but that still has holes and is still scary.
> 
> 
> What's going on here?
> 
>>  	}
>> +	mutex_unlock(&all_memcg_lrus_mutex);
>> +}
>> +#endif
>> +
>> +int __list_lru_init(struct list_lru *lru, bool memcg_enabled)
>> +{
>> +	int i;
>> +
>> +	nodes_clear(lru->active_nodes);
>> +	for (i = 0; i < MAX_NUMNODES; i++)
>> +		list_lru_init_one(&lru->node[i]);
>> +
>> +	if (memcg_enabled)
>> +		return memcg_init_lru(lru);
> 
> OK, this is weird.  list_lru.c calls into a memcg initialisation
> function!  That memcg initialisation function then calls into
> list_lru.c stuff, as expected.
> 
> Seems screwed up.  What's going on here?
> 

I documented this on the memcg side.

/*
 * We need to call back and forth from memcg to LRU because of the lock
 * ordering.  This complicates the flow a little bit, but since the memcg mutex
 * is held through the whole duration of memcg creation, we need to hold it
 * before we hold the LRU-side mutex in the case of a new list creation as
 * well.
 */

>>  	return 0;
>>  }
>> -EXPORT_SYMBOL_GPL(list_lru_init);
>> +EXPORT_SYMBOL_GPL(__list_lru_init);
>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>> index 27af2d1..5d31b4a 100644
>> --- a/mm/memcontrol.c
>> +++ b/mm/memcontrol.c
>> @@ -3163,16 +3163,30 @@ int memcg_update_cache_sizes(struct mem_cgroup *memcg)
>>  	memcg_kmem_set_activated(memcg);
>>  
>>  	ret = memcg_update_all_caches(num+1);
>> -	if (ret) {
>> -		ida_simple_remove(&kmem_limited_groups, num);
>> -		memcg_kmem_clear_activated(memcg);
>> -		return ret;
>> -	}
>> +	if (ret)
>> +		goto out;
>> +
>> +	/*
>> +	 * We should make sure that the array size is not updated until we are
>> +	 * done; otherwise we have no easy way to know whether or not we should
>> +	 * grow the array.
>> +	 */
> 
> What's the locking here, to prevent concurrent array-resizers?
> 

The hammer-like set limit mutex is protecting all of this.

>> +	ret = memcg_update_all_lrus(num + 1);
>> +	if (ret)
>> +		goto out;
>>  
>>  	memcg->kmemcg_id = num;
>> +
>> +	memcg_update_array_size(num + 1);
>> +
>>  	INIT_LIST_HEAD(&memcg->memcg_slab_caches);
>>  	mutex_init(&memcg->slab_caches_mutex);
>> +
>>  	return 0;
>> +out:
>> +	ida_simple_remove(&kmem_limited_groups, num);
>> +	memcg_kmem_clear_activated(memcg);
>> +	return ret;
>>  }
>>  
>>  static size_t memcg_caches_array_size(int num_groups)
>> @@ -3254,6 +3268,129 @@ int memcg_update_cache_size(struct kmem_cache *s, int num_groups)
>>  	return 0;
>>  }
>>  
>> +/*
>> + * memcg_kmem_update_lru_size - fill in kmemcg info into a list_lru
>> + *
>> + * @lru: the lru we are operating with
>> + * @num_groups: how many kmem-limited cgroups we have
>> + * @new_lru: true if this is a new_lru being created, false if this
>> + * was triggered from the memcg side
>> + *
>> + * Returns 0 on success, and an error code otherwise.
>> + *
>> + * This function can be called either when a new kmem-limited memcg appears,
>> + * or when a new list_lru is created. The work is roughly the same in two cases,
> 
> "both cases"
> 
>> + * but in the later we never have to expand the array size.
> 
> "latter"
> 
>> + *
>> + * This is always protected by the all_lrus_mutex from the list_lru side.  But
>> + * a race can still exists if a new memcg becomes kmem limited at the same time
> 
> "exist"
> 
>> + * that we are registering a new memcg. Creation is protected by the
>> + * memcg_mutex, so the creation of a new lru have to be protected by that as
> 
> "has"
> 
>> + * well.
>> + *
>> + * The lock ordering is that the memcg_mutex needs to be acquired before the
>> + * lru-side mutex.
> 
> It's nice to provide the C name of this "lru-side mutex".
> 
>> + */
> 
> This purports to be a kerneldoc comment, but it doesn't start with the
> kerneldoc /** token.  Please review the entire patchset for this
> (common) oddity.
> 
>> +int memcg_kmem_update_lru_size(struct list_lru *lru, int num_groups,
>> +			       bool new_lru)
>> +{
>> +	struct list_lru_array **new_lru_array;
>> +	struct list_lru_array *lru_array;
>> +
>> +	lru_array = lru_alloc_array();
>> +	if (!lru_array)
>> +		return -ENOMEM;
>> +
>> +	/*
>> +	 * When a new LRU is created, we still need to update all data for that
>> +	 * LRU. The procedure for late LRUs and new memcgs are quite similar, we
> 
> "procedures"
> 
>> +	 * only need to make sure we get into the loop even if num_groups <
>> +	 * memcg_limited_groups_array_size.
> 
> This sentence is hard to follow.  Particularly the "even if" part. 
> Rework it?
> 
>> +	 */
>> +	if ((num_groups > memcg_limited_groups_array_size) || new_lru) {
>> +		int i;
>> +		struct list_lru_array **old_array;
>> +		size_t size = memcg_caches_array_size(num_groups);
>> +		int num_memcgs = memcg_limited_groups_array_size;
>> +
>> +		new_lru_array = kzalloc(size * sizeof(void *), GFP_KERNEL);
> 
> Could use kcalloc().
> 
> What are the implications of that GFP_KERNEL?  That we cannot take
> memcg_mutex and "the lru-side mutex" on the direct reclaim -> shrink
> codepaths.  Is that honoured?  Any other potential problems here?
> 

This has nothing to do with the mutex. You cannot register a new LRU,
and you cannot create a new memcg.

Both of those operations are always done - at least so far - in
GFP_KERNEL contexts.

>> +		if (!new_lru_array) {
>> +			kfree(lru_array);
>> +			return -ENOMEM;
>> +		}
>> +
>> +		for (i = 0; lru->memcg_lrus && (i < num_memcgs); i++) {
>> +			if (lru->memcg_lrus && lru->memcg_lrus[i])
>> +				continue;
>> +			new_lru_array[i] =  lru->memcg_lrus[i];
>> +		}
>> +
>> +		old_array = lru->memcg_lrus;
>> +		lru->memcg_lrus = new_lru_array;
>> +		/*
>> +		 * We don't need a barrier here because we are just copying
>> +		 * information over. Anybody operating in memcg_lrus will
> 
> s/in/on/
> 
>> +		 * either follow the new array or the old one and they contain
>> +		 * exactly the same information. The new space in the end is
> 
> s/in/at/
> 
>> +		 * always empty anyway.
>> +		 */
>> +		if (lru->memcg_lrus)
>> +			kfree(old_array);
>> +	}
>> +
>> +	if (lru->memcg_lrus) {
>> +		lru->memcg_lrus[num_groups - 1] = lru_array;
>> +		/*
>> +		 * Here we do need the barrier, because of the state transition
>> +		 * implied by the assignment of the array. All users should be
>> +		 * able to see it
>> +		 */
>> +		wmb();
> 
> Am worried about this lockless concurrency stuff.  Perhaps putting a
> description of the overall design somewhere would be sensible.
> 

I can do that. But in here it is really not a substitute for a lock, as
Tejun has been complaining. We would just like to make sure that the
change is immediately visible.
>> +	}
>> +	return 0;
>> +}
>> +
>> +/*
>> + * This is called with the LRU-mutex being held.
> 
> That's "all_memcg_lrus_mutex", yes?  Not "all_lrus_mutex".  Clear as mud :(
> 
yes.

>> + */
>> +int memcg_new_lru(struct list_lru *lru)
>> +{
>> +	struct mem_cgroup *iter;
>> +
>> +	if (!memcg_kmem_enabled())
>> +		return 0;
> 
> So the caller took all_memcg_lrus_mutex needlessly in this case.  Could
> be optimised.
> 
OK, but this is a registration function, so it is hardly worth it.


>> +	for_each_mem_cgroup(iter) {
>> +		int ret;
>> +		int memcg_id = memcg_cache_id(iter);
>> +		if (memcg_id < 0)
>> +			continue;
>> +
>> +		ret = memcg_kmem_update_lru_size(lru, memcg_id + 1, true);
>> +		if (ret) {
>> +			mem_cgroup_iter_break(root_mem_cgroup, iter);
>> +			return ret;
>> +		}
>> +	}
>> +	return 0;
>> +}
>> +
>> +/*
>> + * We need to call back and forth from memcg to LRU because of the lock
>> + * ordering.  This complicates the flow a little bit, but since the memcg mutex
> 
> "the memcg mutex" is named... what?
> 
As you have noticed, I have been avoiding using the names of the
mutexes, because they are internal to "the other" file (lru.c in the
case of memcg, memcontrol.c in the case of lru).

It is so easy for this to get out of sync, leading to an even more
confusing "wth is this memcg_mutex that does not exist??", that I
decided to write a generic "memcg side mutex" instead.

I will of course flip it, if you prefer, master.

>> + * is held through the whole duration of memcg creation, we need to hold it
>> + * before we hold the LRU-side mutex in the case of a new list creation as
> 
> "LRU-side mutex" has a name?
> 
Yes, it has.

^ permalink raw reply	[flat|nested] 103+ messages in thread

* [PATCH v10 27/35] lru: add an element to a memcg list
       [not found] ` <1370287804-3481-1-git-send-email-glommer-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org>
                     ` (22 preceding siblings ...)
  2013-06-03 19:29   ` [PATCH v10 26/35] memcg,list_lru: duplicate LRUs upon kmemcg creation Glauber Costa
@ 2013-06-03 19:29   ` Glauber Costa
  2013-06-05 23:08     ` Andrew Morton
  2013-06-03 19:29   ` [PATCH v10 28/35] list_lru: per-memcg walks Glauber Costa
                     ` (6 subsequent siblings)
  30 siblings, 1 reply; 103+ messages in thread
From: Glauber Costa @ 2013-06-03 19:29 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Mel Gorman, Dave Chinner,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, cgroups-u79uwXL29TY76Z2rM5mHXA,
	kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A, Michal Hocko,
	Johannes Weiner, hughd-hpIqsD4AKlfQT0dZR+AlfA, Greg Thelen,
	Glauber Costa, Dave Chinner, Rik van Riel

With the infrastructure we now have, we can add an element to a memcg
LRU list instead of the global list. The memcg lists are still
per-node.

Technically, we will never trigger per-node shrinking if the memcg is
short of memory. Therefore an alternative to this would be to add the
element to *both* a single-node memcg array and a per-node global array.

There are two main reasons for this design choice:

1) adding an extra list_head to each of the objects would waste 16 bytes
per object, always remembering that we are talking about 1 dentry + 1
inode in the common case. This means a close to 10% increase in the
dentry size, and a smaller yet significant increase in the inode size. In
terms of total memory, this design pays 32 bytes per superblock per node
(size of struct list_lru_node), which means that in any scenario where
we have more than 10 dentry + inode pairs, we would already be paying
more memory in the two-list-heads approach than we will here with 1 node
x 10 superblocks. The turning point of course depends on the workload,
but I hope the figures above convince you that the memory footprint is
on my side in any workload that matters.

2) The main drawback of this, namely, that we lose global LRU order, is
not really seen by me as a disadvantage: if we are using memcg to
isolate the workloads, global pressure should try to balance the amount
reclaimed from all memcgs the same way the shrinkers will already
naturally balance the amount reclaimed from each superblock. (This
patchset needs some love in this regard, btw).
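
The break-even point in item 1 can be checked with a few lines of
arithmetic. This is a minimal sketch, assuming the figures quoted above
(16 bytes per list_head, 32 bytes per struct list_lru_node) and reading
"10 dentries + inodes" as 10 dentry + inode pairs. The function names
are made up for illustration and are not part of the patch:

```c
#include <assert.h>
#include <stddef.h>

enum {
	LIST_HEAD_BYTES = 16,	/* extra list_head per object (from the text) */
	LRU_NODE_BYTES = 32,	/* struct list_lru_node size (from the text) */
};

/* Cost of a second list_head in every dentry and every inode. */
static size_t two_list_heads_cost(size_t dentry_inode_pairs)
{
	return dentry_inode_pairs * 2 * LIST_HEAD_BYTES;
}

/* Cost of one extra list_lru_node per superblock per node. */
static size_t per_memcg_lists_cost(size_t superblocks, size_t nodes)
{
	return superblocks * nodes * LRU_NODE_BYTES;
}

/* Number of dentry + inode pairs at which both approaches cost the same. */
static size_t break_even_pairs(size_t superblocks, size_t nodes)
{
	size_t per_pair = 2 * LIST_HEAD_BYTES;

	assert(superblocks > 0 && nodes > 0);
	/* Round up so the result is the first count that reaches the cost. */
	return (per_memcg_lists_cost(superblocks, nodes) + per_pair - 1) /
		per_pair;
}
```

With 10 superblocks on 1 node, both approaches cost 320 bytes at 10
pairs; any additional cached object tips the balance in favour of the
per-memcg lists.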

To help us easily track down which nodes have elements in the list and
which nodes don't, we will rely on an auxiliary node bitmap at the
global level.
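
A rough userspace sketch of that bookkeeping, assuming one total per
node that covers the global list and every memcg list, and a bitmap bit
that follows the transitions of that total between zero and non-zero.
It uses C11 atomics rather than the kernel's atomic_long_t and
nodemask_t, all names are hypothetical, and it ignores the window
between the counter update and the bitmap update:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define SKETCH_MAX_NODES 4	/* hypothetical; stands in for MAX_NUMNODES */

struct lru_sketch {
	/* Items on node i, summed over the global and all memcg lists. */
	atomic_long node_totals[SKETCH_MAX_NODES];
	/* Bit i is set while node_totals[i] is non-zero. */
	atomic_ulong active_nodes;
};

/* Account one item added to any list (global or memcg) on node nid. */
static void sketch_item_added(struct lru_sketch *lru, int nid)
{
	assert(nid >= 0 && nid < SKETCH_MAX_NODES);
	/* fetch_add returns the old value: 0 means the node was empty. */
	if (atomic_fetch_add(&lru->node_totals[nid], 1) == 0)
		atomic_fetch_or(&lru->active_nodes, 1UL << nid);
}

/* Account one item removed from any list on node nid. */
static void sketch_item_removed(struct lru_sketch *lru, int nid)
{
	assert(nid >= 0 && nid < SKETCH_MAX_NODES);
	/* Old value 1 means the total just dropped to zero. */
	if (atomic_fetch_sub(&lru->node_totals[nid], 1) == 1)
		atomic_fetch_and(&lru->active_nodes, ~(1UL << nid));
}

static bool sketch_node_active(struct lru_sketch *lru, int nid)
{
	return (atomic_load(&lru->active_nodes) & (1UL << nid)) != 0;
}
```

The point of keeping a per-node total is that a node stays marked
active as long as any list on it, global or memcg, still holds an item.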

[ v8: create LRUs before creating caches, and avoid races in which
  elements are added to a non-existent LRU ]
[ v2: move memcg_kmem_lru_of_page to list_lru.c and then unpublish the
  auxiliary functions it uses ]
Signed-off-by: Glauber Costa <glommer-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org>
Cc: Dave Chinner <dchinner-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
Cc: Mel Gorman <mgorman-l3A5Bk7waGM@public.gmane.org>
Cc: Rik van Riel <riel-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
Cc: Johannes Weiner <hannes-druUgvl0LCNAfugRpC6u6w@public.gmane.org>
Cc: Michal Hocko <mhocko-AlSwsSmVLrQ@public.gmane.org>
Cc: Hugh Dickins <hughd-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
Cc: Kamezawa Hiroyuki <kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A@public.gmane.org>
Cc: Andrew Morton <akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org>
---
 include/linux/list_lru.h   |  11 +++++
 include/linux/memcontrol.h |   8 ++++
 lib/list_lru.c             | 112 +++++++++++++++++++++++++++++++++++++++++----
 mm/memcontrol.c            |  43 +++++++++++++----
 4 files changed, 156 insertions(+), 18 deletions(-)

diff --git a/include/linux/list_lru.h b/include/linux/list_lru.h
index 57fe0e3..3b8c301 100644
--- a/include/linux/list_lru.h
+++ b/include/linux/list_lru.h
@@ -53,12 +53,23 @@ struct list_lru {
 	 * structure, we may very well fail.
 	 */
 	struct list_lru_node	node[MAX_NUMNODES];
+	atomic_long_t		node_totals[MAX_NUMNODES];
 	nodemask_t		active_nodes;
 #ifdef CONFIG_MEMCG_KMEM
 	/* All memcg-aware LRUs will be chained in the lrus list */
 	struct list_head	lrus;
 	/* M x N matrix as described above */
 	struct list_lru_array	**memcg_lrus;
+	/*
+	 * The memcg_lrus is RCU protected, so we need to keep the previous
+	 * array around when we update it, and we can only free it after
+	 * synchronize_rcu(). A typical system has many LRUs, which means
+	 * that if we call synchronize_rcu() after each LRU update, this
+	 * will become very expensive. We add this pointer here, and then
+	 * after all LRUs are updated, we call synchronize_rcu() once, and
+	 * free all the old_arrays.
+	 */
+	void *old_array;
 #endif
 };
 
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 3442eb9..50f199f 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -24,6 +24,7 @@
 #include <linux/hardirq.h>
 #include <linux/jump_label.h>
 #include <linux/list_lru.h>
+#include <linux/mm.h>
 
 struct mem_cgroup;
 struct page_cgroup;
@@ -474,6 +475,8 @@ __memcg_kmem_get_cache(struct kmem_cache *cachep, gfp_t gfp);
 int memcg_new_lru(struct list_lru *lru);
 int memcg_init_lru(struct list_lru *lru);
 
+struct mem_cgroup *mem_cgroup_from_kmem_page(struct page *page);
+
 int memcg_kmem_update_lru_size(struct list_lru *lru, int num_groups,
 			       bool new_lru);
 
@@ -645,6 +648,11 @@ static inline int memcg_init_lru(struct list_lru *lru)
 {
 	return 0;
 }
+
+static inline struct mem_cgroup *mem_cgroup_from_kmem_page(struct page *page)
+{
+	return NULL;
+}
 #endif /* CONFIG_MEMCG_KMEM */
 #endif /* _LINUX_MEMCONTROL_H */
 
diff --git a/lib/list_lru.c b/lib/list_lru.c
index db35edc..128af5e 100644
--- a/lib/list_lru.c
+++ b/lib/list_lru.c
@@ -14,19 +14,93 @@
 #include <linux/list_lru.h>
 #include <linux/memcontrol.h>
 
+/*
+ * lru_node_of_index - returns the node-lru of a specific lru
+ * @lru: the global lru we are operating on
+ * @index: if non-negative, the memcg id. If negative, means global lru.
+ * @nid: node id of the corresponding node we want to manipulate
+ */
+static struct list_lru_node *
+lru_node_of_index(struct list_lru *lru, int index, int nid)
+{
+#ifdef CONFIG_MEMCG_KMEM
+	struct list_lru_node *nlru;
+
+	if (index < 0)
+		return &lru->node[nid];
+
+	/*
+	 * If we reach here with index >= 0, it means the page where the object
+	 * comes from is associated with a memcg. Because memcg_lrus is
+	 * populated before the caches, we can be sure that this request is
+	 * truly for a LRU list that does not have memcg caches.
+	 */
+	if (!lru->memcg_lrus)
+		return &lru->node[nid];
+
+	/*
+	 * Because we will only ever free the memcg_lrus after synchronize_rcu,
+	 * we are safe with the rcu lock here: even if we are operating in the
+	 * stale version of the array, the data is still valid and we are not
+	 * risking anything.
+	 *
+	 * The read barrier is needed to make sure that we see the pointer
+	 * assignment for the specific memcg
+	 */
+	rcu_read_lock();
+	rmb();
+	/*
+	 * The array exists, but the particular memcg does not. That is an
+	 * impossible situation: it would mean we are trying to add to a list
+	 * belonging to a memcg that does not exist. Either it wasn't created
+	 * or it has already been freed. In both cases it should no longer have
+	 * objects. BUG_ON to avoid a NULL dereference.
+	 */
+	BUG_ON(!lru->memcg_lrus[index]);
+	nlru = &lru->memcg_lrus[index]->node[nid];
+	rcu_read_unlock();
+	return nlru;
+#else
+	BUG_ON(index >= 0); /* nobody should be passing index >= 0 with !KMEM */
+	return &lru->node[nid];
+#endif
+}
+
+struct list_lru_node *
+memcg_kmem_lru_of_page(struct list_lru *lru, struct page *page)
+{
+	struct mem_cgroup *memcg = mem_cgroup_from_kmem_page(page);
+	int nid = page_to_nid(page);
+	int memcg_id;
+
+	if (!memcg || !memcg_kmem_is_active(memcg))
+		return &lru->node[nid];
+
+	memcg_id = memcg_cache_id(memcg);
+	return lru_node_of_index(lru, memcg_id, nid);
+}
+
 int
 list_lru_add(
 	struct list_lru	*lru,
 	struct list_head *item)
 {
-	int nid = page_to_nid(virt_to_page(item));
-	struct list_lru_node *nlru = &lru->node[nid];
+	struct page *page = virt_to_page(item);
+	struct list_lru_node *nlru;
+	int nid = page_to_nid(page);
+
+	nlru = memcg_kmem_lru_of_page(lru, page);
 
 	spin_lock(&nlru->lock);
 	BUG_ON(nlru->nr_items < 0);
 	if (list_empty(item)) {
 		list_add_tail(item, &nlru->list);
-		if (nlru->nr_items++ == 0)
+		nlru->nr_items++;
+		/*
+		 * We only consider a node active or inactive based on the
+		 * total figure for all involved children.
+		 */
+		if (atomic_long_add_return(1, &lru->node_totals[nid]) == 1)
 			node_set(nid, lru->active_nodes);
 		spin_unlock(&nlru->lock);
 		return 1;
@@ -41,14 +115,20 @@ list_lru_del(
 	struct list_lru	*lru,
 	struct list_head *item)
 {
-	int nid = page_to_nid(virt_to_page(item));
-	struct list_lru_node *nlru = &lru->node[nid];
+	struct page *page = virt_to_page(item);
+	struct list_lru_node *nlru;
+	int nid = page_to_nid(page);
+
+	nlru = memcg_kmem_lru_of_page(lru, page);
 
 	spin_lock(&nlru->lock);
 	if (!list_empty(item)) {
 		list_del_init(item);
-		if (--nlru->nr_items == 0)
+		nlru->nr_items--;
+
+		if (atomic_long_dec_and_test(&lru->node_totals[nid]))
 			node_clear(nid, lru->active_nodes);
+
 		BUG_ON(nlru->nr_items < 0);
 		spin_unlock(&nlru->lock);
 		return 1;
@@ -93,9 +173,10 @@ restart:
 		ret = isolate(item, &nlru->lock, cb_arg);
 		switch (ret) {
 		case LRU_REMOVED:
-			if (--nlru->nr_items == 0)
-				node_clear(nid, lru->active_nodes);
+			nlru->nr_items--;
 			BUG_ON(nlru->nr_items < 0);
+			if (atomic_long_dec_and_test(&lru->node_totals[nid]))
+				node_clear(nid, lru->active_nodes);
 			isolated++;
 			break;
 		case LRU_ROTATE:
@@ -224,6 +305,17 @@ int memcg_update_all_lrus(unsigned long num)
 			goto out;
 	}
 out:
+	/*
+	 * Even if we were to use call_rcu, we still have to keep the old array
+	 * pointer somewhere. It is easier for us to just synchronize rcu here
+	 * since we are in a fine context. Now we guarantee that there are no
+	 * more users of old_array, and proceed freeing it for all LRUs
+	 */
+	synchronize_rcu();
+	list_for_each_entry(lru, &all_memcg_lrus, lrus) {
+		kfree(lru->old_array);
+		lru->old_array = NULL;
+	}
 	mutex_unlock(&all_memcg_lrus_mutex);
 	return ret;
 }
@@ -254,8 +346,10 @@ int __list_lru_init(struct list_lru *lru, bool memcg_enabled)
 	int i;
 
 	nodes_clear(lru->active_nodes);
-	for (i = 0; i < MAX_NUMNODES; i++)
+	for (i = 0; i < MAX_NUMNODES; i++) {
 		list_lru_init_one(&lru->node[i]);
+		atomic_long_set(&lru->node_totals[i], 0);
+	}
 
 	if (memcg_enabled)
 		return memcg_init_lru(lru);
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 5d31b4a..846c82c 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3162,19 +3162,22 @@ int memcg_update_cache_sizes(struct mem_cgroup *memcg)
 	 */
 	memcg_kmem_set_activated(memcg);
 
-	ret = memcg_update_all_caches(num+1);
-	if (ret)
-		goto out;
-
 	/*
-	 * We should make sure that the array size is not updated until we are
-	 * done; otherwise we have no easy way to know whether or not we should
-	 * grow the array.
+	 * We have to make absolutely sure that we update the LRUs before we
+	 * update the caches. Once the caches are updated, they will be able to
+	 * start hosting objects. If a cache is created very quickly, and and
+	 * element is used and disposed to the LRU quickly as well, we may end
+	 * up with a NULL pointer in list_lru_add because the lists are not yet
+	 * ready.
 	 */
 	ret = memcg_update_all_lrus(num + 1);
 	if (ret)
 		goto out;
 
+	ret = memcg_update_all_caches(num+1);
+	if (ret)
+		goto out;
+
 	memcg->kmemcg_id = num;
 
 	memcg_update_array_size(num + 1);
@@ -3320,7 +3323,7 @@ int memcg_kmem_update_lru_size(struct list_lru *lru, int num_groups,
 		}
 
 		for (i = 0; lru->memcg_lrus && (i < num_memcgs); i++) {
-			if (lru->memcg_lrus && lru->memcg_lrus[i])
+			if (lru->memcg_lrus && !lru->memcg_lrus[i])
 				continue;
 			new_lru_array[i] =  lru->memcg_lrus[i];
 		}
@@ -3333,9 +3336,15 @@ int memcg_kmem_update_lru_size(struct list_lru *lru, int num_groups,
 		 * either follow the new array or the old one and they contain
 		 * exactly the same information. The new space in the end is
 		 * always empty anyway.
+		 *
+		 * We do have to make sure that no more users of the old
+		 * memcg_lrus array exist before we free, and this is achieved
+		 * by rcu. Since it would be too slow to synchronize RCU for
+		 * every LRU, we store the pointer and let the LRU code free
+		 * all of them when all LRUs are updated.
 		 */
 		if (lru->memcg_lrus)
-			kfree(old_array);
+			lru->old_array = old_array;
 	}
 
 	if (lru->memcg_lrus) {
@@ -3479,6 +3488,22 @@ static inline void memcg_resume_kmem_account(void)
 	current->memcg_kmem_skip_account--;
 }
 
+struct mem_cgroup *mem_cgroup_from_kmem_page(struct page *page)
+{
+	struct page_cgroup *pc;
+	struct mem_cgroup *memcg = NULL;
+
+	pc = lookup_page_cgroup(page);
+	if (!PageCgroupUsed(pc))
+		return NULL;
+
+	lock_page_cgroup(pc);
+	if (PageCgroupUsed(pc))
+		memcg = pc->mem_cgroup;
+	unlock_page_cgroup(pc);
+	return memcg;
+}
+
 static void kmem_cache_destroy_work_func(struct work_struct *w)
 {
 	struct kmem_cache *cachep;
-- 
1.8.1.4

^ permalink raw reply related	[flat|nested] 103+ messages in thread
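
The node_totals bookkeeping in the patch above can be hard to picture from
the hunks alone. What follows is a minimal userspace sketch, not the kernel
code: it assumes C11 atomics in place of atomic_long_t, and the names
lru_sketch, sketch_init, sketch_add, sketch_del, NR_NODES and NR_LISTS are
hypothetical. It only illustrates the idea that a node's bit in active_nodes
follows the per-node total over the global list and all memcg lists, rather
than the count of any single list.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define NR_NODES 4	/* hypothetical; stands in for MAX_NUMNODES */
#define NR_LISTS 3	/* hypothetical; list 0 is global, 1..2 are memcg lists */

struct lru_sketch {
	long nr_items[NR_LISTS][NR_NODES];	/* per-list, per-node counts */
	atomic_long node_totals[NR_NODES];	/* sum over all lists of a node */
	bool active[NR_NODES];			/* stands in for active_nodes */
};

static void sketch_init(struct lru_sketch *lru)
{
	for (int nid = 0; nid < NR_NODES; nid++) {
		for (int list = 0; list < NR_LISTS; list++)
			lru->nr_items[list][nid] = 0;
		atomic_init(&lru->node_totals[nid], 0);
		lru->active[nid] = false;
	}
}

static void sketch_add(struct lru_sketch *lru, int list, int nid)
{
	lru->nr_items[list][nid]++;
	/* atomic_fetch_add returns the old value: 0 means first item on node */
	if (atomic_fetch_add(&lru->node_totals[nid], 1) == 0)
		lru->active[nid] = true;
}

static void sketch_del(struct lru_sketch *lru, int list, int nid)
{
	assert(lru->nr_items[list][nid] > 0);
	lru->nr_items[list][nid]--;
	/* old value 1 means the node total just dropped to zero */
	if (atomic_fetch_sub(&lru->node_totals[nid], 1) == 1)
		lru->active[nid] = false;
}
```

With this shape, removing the last item of one memcg list leaves the node
marked active as long as any other list on the same node still holds items.
The sketch omits the per-list spinlock that the real code takes around the
list manipulation.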

* Re: [PATCH v10 27/35] lru: add an element to a memcg list
  2013-06-03 19:29   ` [PATCH v10 27/35] lru: add an element to a memcg list Glauber Costa
@ 2013-06-05 23:08     ` Andrew Morton
  2013-06-06  8:44       ` Glauber Costa
  0 siblings, 1 reply; 103+ messages in thread
From: Andrew Morton @ 2013-06-05 23:08 UTC (permalink / raw)
  To: Glauber Costa
  Cc: linux-fsdevel, Mel Gorman, Dave Chinner, linux-mm, cgroups,
	kamezawa.hiroyu, Michal Hocko, Johannes Weiner, hughd,
	Greg Thelen, Dave Chinner, Rik van Riel

On Mon,  3 Jun 2013 23:29:56 +0400 Glauber Costa <glommer@openvz.org> wrote:

> With the infrastructure we now have, we can add an element to a memcg
> LRU list instead of the global list. The memcg lists are still
> per-node.
> 
> Technically, we will never trigger per-node shrinking in the memcg is

"if the memcg".

> short of memory.

hm, why?  If the memcg is short on memory, we *want* to trigger pernode
shrinking?  Is this sentence describing a design feature or is it
describing a shortcoming which is about to be overcome?

> Therefore an alternative to this would be to add the
> element to *both* a single-node memcg array and a per-node global array.

The latter, I think.

> There are two main reasons for this design choice:
> 
> 1) adding an extra list_head to each of the objects would waste 16-bytes
> per object, always remembering that we are talking about 1 dentry + 1
> inode in the common case. This means a close to 10 % increase in the
> dentry size, and a lower yet significant increase in the inode size. In
> terms of total memory, this design pays 32-byte per-superblock-per-node
> (size of struct list_lru_node), which means that in any scenario where
> we have more than 10 dentries + inodes, we would already be paying more
> memory in the two-list-heads approach than we will here with 1 node x 10
> superblocks. The turning point of course depends on the workload, but I
> hope the figures above would convince you that the memory footprint is
> in my side in any workload that matters.

yup.  Assume the number of dentries and inodes is huge.

> 2) The main drawback of this, namely, that we loose global LRU order, is

"lose"

> not really seen by me as a disadvantage: if we are using memcg to
> isolate the workloads, global pressure should try to balance the amount
> reclaimed from all memcgs the same way the shrinkers will already
> naturally balance the amount reclaimed from each superblock. (This
> patchset needs some love in this regard, btw).
> 
> To help us easily tracking down which nodes have and which nodes doesn't

"track"

"don't"

> have elements in the list, we will count on an auxiliary node bitmap in

"use an auxiliary node bitmap at"

> the global level.
> 
> ...
>
> @@ -53,12 +53,23 @@ struct list_lru {
>  	 * structure, we may very well fail.
>  	 */
>  	struct list_lru_node	node[MAX_NUMNODES];
> +	atomic_long_t		node_totals[MAX_NUMNODES];
>  	nodemask_t		active_nodes;
>  #ifdef CONFIG_MEMCG_KMEM
>  	/* All memcg-aware LRUs will be chained in the lrus list */
>  	struct list_head	lrus;
>  	/* M x N matrix as described above */
>  	struct list_lru_array	**memcg_lrus;
> +	/*
> +	 * The memcg_lrus is RCU protected

It is?  I don't recall seeing that in the earlier patches.  Is some
description missing?

>	  , so we need to keep the previous
> +	 * array around when we update it. But we can only do that after
> +	 * synchronize_rcu(). A typical system has many LRUs, which means
> +	 * that if we call synchronize_rcu after each LRU update, this
> +	 * will become very expensive. We add this pointer here, and then
> +	 * after all LRUs are update, we call synchronize_rcu() once, and

"updated"

> +	 * free all the old_arrays.
> +	 */
> +	void *old_array;
>  #endif
>  };
>  
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 3442eb9..50f199f 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -24,6 +24,7 @@
>  #include <linux/hardirq.h>
>  #include <linux/jump_label.h>
>  #include <linux/list_lru.h>
> +#include <linux/mm.h>

erk.  There's a good chance that mm.h already includes memcontrol.h, or
vice versa, by some circuitous path.  Expect problems from this.

afaict the include is only needed for struct page?  If so, simply
adding a forward declaration for that would be prudent.

>  struct mem_cgroup;
>  struct page_cgroup;
> 
> ...
>
> +static struct list_lru_node *
> +lru_node_of_index(struct list_lru *lru, int index, int nid)
> +{
> +#ifdef CONFIG_MEMCG_KMEM
> +	struct list_lru_node *nlru;
> +
> +	if (index < 0)
> +		return &lru->node[nid];
> +
> +	/*
> +	 * If we reach here with index >= 0, it means the page where the object
> +	 * comes from is associated with a memcg. Because memcg_lrus is
> +	 * populated before the caches, we can be sure that this request is
> +	 * truly for a LRU list that does not have memcg caches.

"an LRU" :)

> +	 */
> +	if (!lru->memcg_lrus)
> +		return &lru->node[nid];
> +
> +	/*
> +	 * Because we will only ever free the memcg_lrus after synchronize_rcu,
> +	 * we are safe with the rcu lock here: even if we are operating in the
> +	 * stale version of the array, the data is still valid and we are not
> +	 * risking anything.
> +	 *
> +	 * The read barrier is needed to make sure that we see the pointer
> +	 * assigment for the specific memcg
> +	 */
> +	rcu_read_lock();
> +	rmb();
> +	/*
> +	 * The array exist, but the particular memcg does not. That is an

"exists"

> +	 * impossible situation: it would mean we are trying to add to a list
> +	 * belonging to a memcg that does not exist. Either wasn't created or

"it wasn't created or it"

> +	 * has been already freed. In both cases it should no longer have
> +	 * objects. BUG_ON to avoid a NULL dereference.

Well.  We could jsut permit the NULL reference - that provides the same
info.  But an explicit BUG_ON does show that it has been thought through!

> +	 */
> +	BUG_ON(!lru->memcg_lrus[index]);
> +	nlru = &lru->memcg_lrus[index]->node[nid];
> +	rcu_read_unlock();
> +	return nlru;
> +#else
> +	BUG_ON(index >= 0); /* nobody should be passing index < 0 with !KMEM */
> +	return &lru->node[nid];
> +#endif
> +}
> +
> 
> ...
>
>  int
>  list_lru_add(
>  	struct list_lru	*lru,
>  	struct list_head *item)
>  {
> -	int nid = page_to_nid(virt_to_page(item));
> -	struct list_lru_node *nlru = &lru->node[nid];
> +	struct page *page = virt_to_page(item);
> +	struct list_lru_node *nlru;
> +	int nid = page_to_nid(page);
> +
> +	nlru = memcg_kmem_lru_of_page(lru, page);
>  
>  	spin_lock(&nlru->lock);
>  	BUG_ON(nlru->nr_items < 0);
>  	if (list_empty(item)) {
>  		list_add_tail(item, &nlru->list);
> -		if (nlru->nr_items++ == 0)
> +		nlru->nr_items++;
> +		/*
> +		 * We only consider a node active or inactive based on the
> +		 * total figure for all involved children.

Is "children" an appropriate term in this context?  Where would one go
to understand the overall object hierarchy here?

> +		 */
> +		if (atomic_long_add_return(1, &lru->node_totals[nid]) == 1)
>  			node_set(nid, lru->active_nodes);
>  		spin_unlock(&nlru->lock);
>  		return 1;
> 
> ...
>
> +	/*
> +	 * Even if we were to use call_rcu, we still have to keep the old array
> +	 * pointer somewhere. It is easier for us to just synchronize rcu here
> +	 * since we are in a fine context. Now we guarantee that there are no
> +	 * more users of old_array, and proceed freeing it for all LRUs

"a fine context" is a fine term, but it's unclear what is meant by it ;)

> +	 */
> +	synchronize_rcu();
> +	list_for_each_entry(lru, &all_memcg_lrus, lrus) {
> +		kfree(lru->old_array);
> +		lru->old_array = NULL;
> +	}
>  	mutex_unlock(&all_memcg_lrus_mutex);
>  	return ret;
>  }
> 
> ...
>
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -3162,19 +3162,22 @@ int memcg_update_cache_sizes(struct mem_cgroup *memcg)
>  	 */
>  	memcg_kmem_set_activated(memcg);
>  
> -	ret = memcg_update_all_caches(num+1);
> -	if (ret)
> -		goto out;
> -
>  	/*
> -	 * We should make sure that the array size is not updated until we are
> -	 * done; otherwise we have no easy way to know whether or not we should
> -	 * grow the array.
> +	 * We have to make absolutely sure that we update the LRUs before we
> +	 * update the caches. Once the caches are updated, they will be able to
> +	 * start hosting objects. If a cache is created very quickly, and and

s/and/an/

> +	 * element is used and disposed to the LRU quickly as well, we may end
> +	 * up with a NULL pointer in list_lru_add because the lists are not yet
> +	 * ready.
>  	 */
>  	ret = memcg_update_all_lrus(num + 1);
>  	if (ret)
>  		goto out;
>  
> +	ret = memcg_update_all_caches(num+1);
> +	if (ret)
> +		goto out;
> +
>  	memcg->kmemcg_id = num;
>  
>  	memcg_update_array_size(num + 1);
> 
> ...
>


^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH v10 27/35] lru: add an element to a memcg list
  2013-06-05 23:08     ` Andrew Morton
@ 2013-06-06  8:44       ` Glauber Costa
  0 siblings, 0 replies; 103+ messages in thread
From: Glauber Costa @ 2013-06-06  8:44 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Glauber Costa, linux-fsdevel, Mel Gorman, Dave Chinner, linux-mm,
	cgroups, kamezawa.hiroyu, Michal Hocko, Johannes Weiner, hughd,
	Greg Thelen, Dave Chinner, Rik van Riel


>> ...
>>
>> @@ -53,12 +53,23 @@ struct list_lru {
>>  	 * structure, we may very well fail.
>>  	 */
>>  	struct list_lru_node	node[MAX_NUMNODES];
>> +	atomic_long_t		node_totals[MAX_NUMNODES];
>>  	nodemask_t		active_nodes;
>>  #ifdef CONFIG_MEMCG_KMEM
>>  	/* All memcg-aware LRUs will be chained in the lrus list */
>>  	struct list_head	lrus;
>>  	/* M x N matrix as described above */
>>  	struct list_lru_array	**memcg_lrus;
>> +	/*
>> +	 * The memcg_lrus is RCU protected
> 
> It is?  I don't recall seeing that in the earlier patches.  Is some
> description missing?
> 

Yes, it is.

memcg_update_all_lrus will do synchronize_rcu(), and lru_node_of_index will
do the read locking.

>> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
>> index 3442eb9..50f199f 100644
>> --- a/include/linux/memcontrol.h
>> +++ b/include/linux/memcontrol.h
>> @@ -24,6 +24,7 @@
>>  #include <linux/hardirq.h>
>>  #include <linux/jump_label.h>
>>  #include <linux/list_lru.h>
>> +#include <linux/mm.h>
> 
> erk.  There's a good chance that mm.h already includes memcontrol.h, or
> vice versa, by some circuitous path.  Expect problems from this.
> 
> afaict the include is only needed for struct page?  If so, simply
> adding a forward declaration for that would be prudent.
> 
In fact, Rothwell had just already complained about this.
Funny, I have been running this with multiple configs for a while on 2
machines + kbot.

> 
>> +	 * impossible situation: it would mean we are trying to add to a list
>> +	 * belonging to a memcg that does not exist. Either wasn't created or
> 
> "it wasn't created or it"
> 
>> +	 * has been already freed. In both cases it should no longer have
>> +	 * objects. BUG_ON to avoid a NULL dereference.
> 
> Well.  We could jsut permit the NULL reference - that provides the same
> info.  But an explicit BUG_ON does show that it has been thought through!
> 
Actually this is one of the bugs I have to fix. *Right now* this code is
correct, but later on it is not. When we are unmounting, for instance, we
loop through all indexes. I have updated this in the patch that does the
loop, but to avoid demanding more mental effort than is due, I can
just move it right here.

>> +	 */
>> +	BUG_ON(!lru->memcg_lrus[index]);
>> +	nlru = &lru->memcg_lrus[index]->node[nid];
>> +	rcu_read_unlock();
>> +	return nlru;
>> +#else
>> +	BUG_ON(index >= 0); /* nobody should be passing index < 0 with !KMEM */
>> +	return &lru->node[nid];
>> +#endif
>> +}
>> +
>>
>> ...
>>
>>  int
>>  list_lru_add(
>>  	struct list_lru	*lru,
>>  	struct list_head *item)
>>  {
>> -	int nid = page_to_nid(virt_to_page(item));
>> -	struct list_lru_node *nlru = &lru->node[nid];
>> +	struct page *page = virt_to_page(item);
>> +	struct list_lru_node *nlru;
>> +	int nid = page_to_nid(page);
>> +
>> +	nlru = memcg_kmem_lru_of_page(lru, page);
>>  
>>  	spin_lock(&nlru->lock);
>>  	BUG_ON(nlru->nr_items < 0);
>>  	if (list_empty(item)) {
>>  		list_add_tail(item, &nlru->list);
>> -		if (nlru->nr_items++ == 0)
>> +		nlru->nr_items++;
>> +		/*
>> +		 * We only consider a node active or inactive based on the
>> +		 * total figure for all involved children.
> 
> Is "children" an appropriate term in this context?  Where would one go
> to understand the overall object hierarchy here?
> 
children is always an appropriate term. Every time one mentions it
people go sentimental and are more likely to be helpful.

But that aside, I believe this could be changed to something else.

>> +		 */
>> +		if (atomic_long_add_return(1, &lru->node_totals[nid]) == 1)
>>  			node_set(nid, lru->active_nodes);
>>  		spin_unlock(&nlru->lock);
>>  		return 1;
>>
>> ...
>>
>> +	/*
>> +	 * Even if we were to use call_rcu, we still have to keep the old array
>> +	 * pointer somewhere. It is easier for us to just synchronize rcu here
>> +	 * since we are in a fine context. Now we guarantee that there are no
>> +	 * more users of old_array, and proceed freeing it for all LRUs
> 
> "a fine context" is a fine term, but it's unclear what is meant by it ;)
> 
fine!

I mean a synchronize_rcu friendly context (can sleep, etc)


>> +	 */
>> +	synchronize_rcu();
>> +	list_for_each_entry(lru, &all_memcg_lrus, lrus) {
>> +		kfree(lru->old_array);
>> +		lru->old_array = NULL;
>> +	}
>>  	mutex_unlock(&all_memcg_lrus_mutex);
>>  	return ret;
>>  }
>>

Here is the answer to your "is this really RCU protected?? " btw.

>> ...
>>
>> --- a/mm/memcontrol.c
>> +++ b/mm/memcontrol.c
>> @@ -3162,19 +3162,22 @@ int memcg_update_cache_sizes(struct mem_cgroup *memcg)
>>  	 */
>>  	memcg_kmem_set_activated(memcg);
>>  
>> -	ret = memcg_update_all_caches(num+1);
>> -	if (ret)
>> -		goto out;
>> -
>>  	/*
>> -	 * We should make sure that the array size is not updated until we are
>> -	 * done; otherwise we have no easy way to know whether or not we should
>> -	 * grow the array.
>> +	 * We have to make absolutely sure that we update the LRUs before we
>> +	 * update the caches. Once the caches are updated, they will be able to
>> +	 * start hosting objects. If a cache is created very quickly, and and
> 
> s/and/an/
> 
>> +	 * element is used and disposed to the LRU quickly as well, we may end
>> +	 * up with a NULL pointer in list_lru_add because the lists are not yet
>> +	 * ready.
>>  	 */
>>  	ret = memcg_update_all_lrus(num + 1);
>>  	if (ret)
>>  		goto out;
>>  
>> +	ret = memcg_update_all_caches(num+1);
>> +	if (ret)
>> +		goto out;
>> +
>>  	memcg->kmemcg_id = num;
>>  
>>  	memcg_update_array_size(num + 1);
>>
>> ...
>>
> 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 103+ messages in thread
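
The "stash the old array, wait once, free everything" scheme discussed in
this reply can be sketched in userspace as follows. This is an illustration
under stated assumptions, not the patch's implementation: wait_for_readers()
is a hypothetical stand-in for synchronize_rcu() that only counts how often
it is called, and lru_sketch, grow_one, grow_all and NR_LRUS are made-up
names. The point shown is that growing N arrays costs one grace period
rather than N.

```c
#include <assert.h>
#include <stdlib.h>

#define NR_LRUS 3	/* hypothetical number of LRUs in the system */

struct lru_sketch {
	void **memcg_lrus;	/* current array, indexed by memcg id */
	int size;		/* number of slots in memcg_lrus */
	void *old_array;	/* stale array, freed after the grace period */
};

static int grace_periods;	/* how many times we "synchronized" */

/* Stand-in for synchronize_rcu(); the real call blocks until readers exit. */
static void wait_for_readers(void)
{
	grace_periods++;
}

/* Grow one LRU's array; the old array is parked, not freed. */
static int grow_one(struct lru_sketch *lru, int new_size)
{
	void **new_array = calloc(new_size, sizeof(*new_array));

	if (!new_array)
		return -1;
	for (int i = 0; i < lru->size; i++)
		new_array[i] = lru->memcg_lrus[i];
	lru->old_array = lru->memcg_lrus;
	lru->memcg_lrus = new_array;
	lru->size = new_size;
	return 0;
}

/* Grow every LRU, then pay for a single grace period and free in bulk. */
static int grow_all(struct lru_sketch *lrus, int n, int new_size)
{
	int ret = 0;

	for (int i = 0; i < n; i++) {
		ret = grow_one(&lrus[i], new_size);
		if (ret)
			break;
	}
	wait_for_readers();
	for (int i = 0; i < n; i++) {
		free(lrus[i].old_array);	/* free(NULL) is a no-op */
		lrus[i].old_array = NULL;
	}
	return ret;
}
```

As in the patch, the free loop also runs on the error path, so arrays parked
before a failed allocation are not leaked. Because the wait can block, the
caller has to be in a context that is allowed to sleep, which is what "a fine
context" refers to above.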

* [PATCH v10 28/35] list_lru: per-memcg walks
       [not found] ` <1370287804-3481-1-git-send-email-glommer-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org>
                     ` (23 preceding siblings ...)
  2013-06-03 19:29   ` [PATCH v10 27/35] lru: add an element to a memcg list Glauber Costa
@ 2013-06-03 19:29   ` Glauber Costa
  2013-06-05 23:08     ` Andrew Morton
  2013-06-03 19:29   ` [PATCH v10 29/35] memcg: per-memcg kmem shrinking Glauber Costa
                     ` (5 subsequent siblings)
  30 siblings, 1 reply; 103+ messages in thread
From: Glauber Costa @ 2013-06-03 19:29 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Mel Gorman, Dave Chinner,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, cgroups-u79uwXL29TY76Z2rM5mHXA,
	kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A, Michal Hocko,
	Johannes Weiner, hughd-hpIqsD4AKlfQT0dZR+AlfA, Greg Thelen,
	Glauber Costa, Dave Chinner, Rik van Riel

This patch extend the list_lru interfaces to allow for a memcg
parameter. Because most of its users won't need it, instead of
modifying the function signatures we create a new set of _memcg()
functions and write the old API ontop of that.

Signed-off-by: Glauber Costa <glommer-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org>
Cc: Dave Chinner <dchinner-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
Cc: Mel Gorman <mgorman-l3A5Bk7waGM@public.gmane.org>
Cc: Rik van Riel <riel-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
Cc: Johannes Weiner <hannes-druUgvl0LCNAfugRpC6u6w@public.gmane.org>
Cc: Michal Hocko <mhocko-AlSwsSmVLrQ@public.gmane.org>
c: Hugh Dickins <hughd-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
Cc: Kamezawa Hiroyuki <kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A@public.gmane.org>
Cc: Andrew Morton <akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org>
---
 include/linux/list_lru.h   | 26 +++++++++++--
 include/linux/memcontrol.h |  2 +
 lib/list_lru.c             | 97 ++++++++++++++++++++++++++++++++++------------
 3 files changed, 97 insertions(+), 28 deletions(-)

diff --git a/include/linux/list_lru.h b/include/linux/list_lru.h
index 3b8c301..dcb67dc 100644
--- a/include/linux/list_lru.h
+++ b/include/linux/list_lru.h
@@ -100,7 +100,15 @@ static inline int list_lru_init_memcg(struct list_lru *lru)
 int list_lru_add(struct list_lru *lru, struct list_head *item);
 int list_lru_del(struct list_lru *lru, struct list_head *item);
 
-unsigned long list_lru_count_node(struct list_lru *lru, int nid);
+unsigned long list_lru_count_node_memcg(struct list_lru *lru, int nid,
+					struct mem_cgroup *memcg);
+
+static inline unsigned long
+list_lru_count_node(struct list_lru *lru, int nid)
+{
+	return list_lru_count_node_memcg(lru, nid, NULL);
+}
+
 static inline unsigned long list_lru_count(struct list_lru *lru)
 {
 	long count = 0;
@@ -118,9 +126,19 @@ typedef enum lru_status
 typedef void (*list_lru_dispose_cb)(struct list_head *dispose_list);
 
 
-unsigned long list_lru_walk_node(struct list_lru *lru, int nid,
-				 list_lru_walk_cb isolate, void *cb_arg,
-				 unsigned long *nr_to_walk);
+unsigned long
+list_lru_walk_node_memcg(struct list_lru *lru, int nid,
+			 list_lru_walk_cb isolate, void *cb_arg,
+			 unsigned long *nr_to_walk, struct mem_cgroup *memcg);
+
+static inline unsigned long
+list_lru_walk_node(struct list_lru *lru, int nid,
+		 list_lru_walk_cb isolate, void *cb_arg,
+		 unsigned long *nr_to_walk)
+{
+	return list_lru_walk_node_memcg(lru, nid, isolate, cb_arg,
+					nr_to_walk, NULL);
+}
 
 static inline unsigned long
 list_lru_walk(struct list_lru *lru, list_lru_walk_cb isolate,
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 50f199f..3eeece8 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -593,6 +593,8 @@ static inline bool memcg_kmem_is_active(struct mem_cgroup *memcg)
 #define for_each_memcg_cache_index(_idx)	\
 	for (; NULL; )
 
+#define memcg_limited_groups_array_size 0
+
 static inline bool memcg_kmem_enabled(void)
 {
 	return false;
diff --git a/lib/list_lru.c b/lib/list_lru.c
index 128af5e..f919f99 100644
--- a/lib/list_lru.c
+++ b/lib/list_lru.c
@@ -50,13 +50,16 @@ lru_node_of_index(struct list_lru *lru, int index, int nid)
 	rcu_read_lock();
 	rmb();
 	/*
-	 * The array exist, but the particular memcg does not. That is an
-	 * impossible situation: it would mean we are trying to add to a list
-	 * belonging to a memcg that does not exist. Either wasn't created or
-	 * has been already freed. In both cases it should no longer have
-	 * objects. BUG_ON to avoid a NULL dereference.
+	 * The array exists, but the particular memcg does not. This cannot
+	 * happen when we are called from memcg_kmem_lru_of_page with a
+	 * definite memcg, but it can happen when we are iterating over all
+	 * memcgs (for instance, when disposing all lists).
 	 */
-	BUG_ON(!lru->memcg_lrus[index]);
+	if (!lru->memcg_lrus[index]) {
+		rcu_read_unlock();
+		return NULL;
+	}
+
 	nlru = &lru->memcg_lrus[index]->node[nid];
 	rcu_read_unlock();
 	return nlru;
@@ -80,6 +83,23 @@ memcg_kmem_lru_of_page(struct list_lru *lru, struct page *page)
 	return lru_node_of_index(lru, memcg_id, nid);
 }
 
+/*
+ * This helper will loop through all node-data in the LRU, either global or
+ * per-memcg.  If memcg is either not present or not used,
+ * memcg_limited_groups_array_size will be 0. _idx starts at -1, and it will
+ * still be allowed to execute once.
+ *
+ * We convention that for _idx = -1, the global node info should be used.
+ * After that, we will go through each of the memcgs, starting at 0.
+ *
+ * We don't need any kind of locking for the loop because
+ * memcg_limited_groups_array_size can only grow, gaining new fields at the
+ * end. The old ones are just copied, and any interesting manipulation happens
+ * in the node list itself, and we already lock the list.
+ */
+#define for_each_memcg_lru_index(_idx)	\
+	for ((_idx) = -1; ((_idx) < memcg_limited_groups_array_size); (_idx)++)
+
 int
 list_lru_add(
 	struct list_lru	*lru,
@@ -139,10 +159,19 @@ list_lru_del(
 EXPORT_SYMBOL_GPL(list_lru_del);
 
 unsigned long
-list_lru_count_node(struct list_lru *lru, int nid)
+list_lru_count_node_memcg(struct list_lru *lru, int nid,
+			  struct mem_cgroup *memcg)
 {
 	long count = 0;
-	struct list_lru_node *nlru = &lru->node[nid];
+	int memcg_id = -1;
+	struct list_lru_node *nlru;
+
+	if (memcg && memcg_kmem_is_active(memcg))
+		memcg_id = memcg_cache_id(memcg);
+
+	nlru = lru_node_of_index(lru, memcg_id, nid);
+	if (!nlru)
+		return 0;
 
 	spin_lock(&nlru->lock);
 	BUG_ON(nlru->nr_items < 0);
@@ -151,19 +180,28 @@ list_lru_count_node(struct list_lru *lru, int nid)
 
 	return count;
 }
-EXPORT_SYMBOL_GPL(list_lru_count_node);
+EXPORT_SYMBOL_GPL(list_lru_count_node_memcg);
 
 unsigned long
-list_lru_walk_node(
+list_lru_walk_node_memcg(
 	struct list_lru		*lru,
 	int			nid,
 	list_lru_walk_cb	isolate,
 	void			*cb_arg,
-	unsigned long		*nr_to_walk)
+	unsigned long		*nr_to_walk,
+	struct mem_cgroup	*memcg)
 {
-	struct list_lru_node	*nlru = &lru->node[nid];
 	struct list_head *item, *n;
 	unsigned long isolated = 0;
+	struct list_lru_node *nlru;
+	int memcg_id = -1;
+
+	if (memcg && memcg_kmem_is_active(memcg))
+		memcg_id = memcg_cache_id(memcg);
+
+	nlru = lru_node_of_index(lru, memcg_id, nid);
+	if (!nlru)
+		return 0;
 
 	spin_lock(&nlru->lock);
 	list_for_each_safe(item, n, &nlru->list) {
@@ -200,7 +238,7 @@ restart:
 	spin_unlock(&nlru->lock);
 	return isolated;
 }
-EXPORT_SYMBOL_GPL(list_lru_walk_node);
+EXPORT_SYMBOL_GPL(list_lru_walk_node_memcg);
 
 static unsigned long
 list_lru_dispose_all_node(
@@ -208,23 +246,34 @@ list_lru_dispose_all_node(
 	int			nid,
 	list_lru_dispose_cb	dispose)
 {
-	struct list_lru_node	*nlru = &lru->node[nid];
+	struct list_lru_node *nlru;
 	LIST_HEAD(dispose_list);
 	unsigned long disposed = 0;
+	int idx;
 
-	spin_lock(&nlru->lock);
-	while (!list_empty(&nlru->list)) {
-		list_splice_init(&nlru->list, &dispose_list);
-		disposed += nlru->nr_items;
-		nlru->nr_items = 0;
-		node_clear(nid, lru->active_nodes);
-		spin_unlock(&nlru->lock);
-
-		dispose(&dispose_list);
+	for_each_memcg_lru_index(idx) {
+		nlru = lru_node_of_index(lru, idx, nid);
+		if (!nlru)
+			continue;
 
 		spin_lock(&nlru->lock);
+		while (!list_empty(&nlru->list)) {
+			list_splice_init(&nlru->list, &dispose_list);
+
+			if (atomic_long_sub_and_test(nlru->nr_items,
+							&lru->node_totals[nid]))
+				node_clear(nid, lru->active_nodes);
+			disposed += nlru->nr_items;
+			nlru->nr_items = 0;
+			spin_unlock(&nlru->lock);
+
+			dispose(&dispose_list);
+
+			spin_lock(&nlru->lock);
+		}
+		spin_unlock(&nlru->lock);
 	}
-	spin_unlock(&nlru->lock);
+
 	return disposed;
 }
 
-- 
1.8.1.4

^ permalink raw reply related	[flat|nested] 103+ messages in thread
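
The iteration convention introduced by the patch above (index -1 selects the
global list, indexes 0 and up select memcg slots, and an unpopulated slot is
skipped) can be sketched in userspace like this. It is a simplified model
under stated assumptions, not the kernel code: node_sketch, lru_sketch,
node_of_index, dispose_all, for_each_lru_index and NR_MEMCG are hypothetical
names, locking is omitted, and the loop index is assumed to be a signed int
so that the -1 start value compares below the array size.

```c
#include <assert.h>
#include <stddef.h>

#define NR_MEMCG 3	/* hypothetical; stands in for the memcg array size */

struct node_sketch {
	long nr_items;
};

struct lru_sketch {
	struct node_sketch global;		/* used for index -1 */
	struct node_sketch *memcg[NR_MEMCG];	/* NULL: slot not populated */
};

/* idx is assumed to be a signed int; the walk starts at the global list. */
#define for_each_lru_index(idx, size) \
	for ((idx) = -1; (idx) < (size); (idx)++)

/* Returns NULL for an empty memcg slot instead of treating it as a bug. */
static struct node_sketch *node_of_index(struct lru_sketch *lru, int idx)
{
	if (idx < 0)
		return &lru->global;
	return lru->memcg[idx];
}

/* Drain every list, global and per-memcg, and report how many items went. */
static long dispose_all(struct lru_sketch *lru)
{
	long disposed = 0;
	int idx;

	for_each_lru_index(idx, NR_MEMCG) {
		struct node_sketch *n = node_of_index(lru, idx);

		if (!n)
			continue;	/* memcg never created or already gone */
		disposed += n->nr_items;
		n->nr_items = 0;
	}
	return disposed;
}
```

A caller that targets one definite memcg would still expect a non-NULL
result from the lookup; only whole-array walks such as the dispose path need
the NULL check.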

* Re: [PATCH v10 28/35] list_lru: per-memcg walks
  2013-06-03 19:29   ` [PATCH v10 28/35] list_lru: per-memcg walks Glauber Costa
@ 2013-06-05 23:08     ` Andrew Morton
       [not found]       ` <20130605160837.0d0a35fbd4b32d7ad02f7136-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org>
  0 siblings, 1 reply; 103+ messages in thread
From: Andrew Morton @ 2013-06-05 23:08 UTC (permalink / raw)
  To: Glauber Costa
  Cc: linux-fsdevel, Mel Gorman, Dave Chinner, linux-mm, cgroups,
	kamezawa.hiroyu, Michal Hocko, Johannes Weiner, hughd,
	Greg Thelen, Dave Chinner, Rik van Riel

On Mon,  3 Jun 2013 23:29:57 +0400 Glauber Costa <glommer@openvz.org> wrote:

> This patch extend the list_lru interfaces to allow for a memcg

"extends"

> parameter. Because most of its users won't need it, instead of
> modifying the function signatures we create a new set of _memcg()
> functions and write the old API ontop of that.

"on top"

> Signed-off-by: Glauber Costa <glommer@openvz.org>
> Cc: Dave Chinner <dchinner@redhat.com>
> Cc: Mel Gorman <mgorman@suse.de>
> Cc: Rik van Riel <riel@redhat.com>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Michal Hocko <mhocko@suse.cz>
> c: Hugh Dickins <hughd@google.com>

I'd rate him a d, personally.

> Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> 
> ...
>
> --- a/lib/list_lru.c
> +++ b/lib/list_lru.c
> @@ -50,13 +50,16 @@ lru_node_of_index(struct list_lru *lru, int index, int nid)
>  	rcu_read_lock();
>  	rmb();
>  	/*
> -	 * The array exist, but the particular memcg does not. That is an
> -	 * impossible situation: it would mean we are trying to add to a list
> -	 * belonging to a memcg that does not exist. Either wasn't created or
> -	 * has been already freed. In both cases it should no longer have
> -	 * objects. BUG_ON to avoid a NULL dereference.
> +	 * The array exist, but the particular memcg does not. This cannot
> +	 * happen when we are called from memcg_kmem_lru_of_page with a
> +	 * definite memcg, but it can happen when we are iterating over all
> +	 * memcgs (for instance, when disposing all lists.
>  	 */
> -	BUG_ON(!lru->memcg_lrus[index]);
> +	if (!lru->memcg_lrus[index]) {
> +		rcu_read_unlock();
> +		return NULL;
> +	}

It took 28 patches, but my head is now spinning and my vision is fading
in and out.

>  	nlru = &lru->memcg_lrus[index]->node[nid];
>  	rcu_read_unlock();
>  	return nlru;
> @@ -80,6 +83,23 @@ memcg_kmem_lru_of_page(struct list_lru *lru, struct page *page)
>  	return lru_node_of_index(lru, memcg_id, nid);
>  }
>  
> +/*
> + * This helper will loop through all node-data in the LRU, either global or
> + * per-memcg.  If memcg is either not present or not used,
> + * memcg_limited_groups_array_size will be 0. _idx starts at -1, and it will
> + * still be allowed to execute once.
> + *
> + * We convention that for _idx = -1, the global node info should be used.

I don't think that "convention" is a verb, but I rather like the way
it is used here.

> + * After that, we will go through each of the memcgs, starting at 0.
> + *
> + * We don't need any kind of locking for the loop because
> + * memcg_limited_groups_array_size can only grow, gaining new fields at the
> + * end. The old ones are just copied, and any interesting manipulation happen
> + * in the node list itself, and we already lock the list.

Might be worth mentioning what type _idx should have.  Although I suspect
the code will work OK if _idx has unsigned type.

> + */
> +#define for_each_memcg_lru_index(_idx)	\
> +	for ((_idx) = -1; ((_idx) < memcg_limited_groups_array_size); (_idx)++)
> +
>  int
>  list_lru_add(
>  	struct list_lru	*lru,
> @@ -139,10 +159,19 @@ list_lru_del(
>  EXPORT_SYMBOL_GPL(list_lru_del);
>  
>  unsigned long
> -list_lru_count_node(struct list_lru *lru, int nid)
> +list_lru_count_node_memcg(struct list_lru *lru, int nid,
> +			  struct mem_cgroup *memcg)
>  {
>  	long count = 0;

But this function returns unsigned long.

> -	struct list_lru_node *nlru = &lru->node[nid];
> +	int memcg_id = -1;
> +	struct list_lru_node *nlru;
> +
> +	if (memcg && memcg_kmem_is_active(memcg))
> +		memcg_id = memcg_cache_id(memcg);
> +
> +	nlru = lru_node_of_index(lru, memcg_id, nid);
> +	if (!nlru)
> +		return 0;
>  
>  	spin_lock(&nlru->lock);
>  	BUG_ON(nlru->nr_items < 0);
> @@ -151,19 +180,28 @@ list_lru_count_node(struct list_lru *lru, int nid)
>  
>  	return count;
>  }
> -EXPORT_SYMBOL_GPL(list_lru_count_node);
> +EXPORT_SYMBOL_GPL(list_lru_count_node_memcg);
>  
>  unsigned long
> -list_lru_walk_node(
> +list_lru_walk_node_memcg(
>  	struct list_lru		*lru,
>  	int			nid,
>  	list_lru_walk_cb	isolate,
>  	void			*cb_arg,
> -	unsigned long		*nr_to_walk)
> +	unsigned long		*nr_to_walk,
> +	struct mem_cgroup	*memcg)
>  {
> -	struct list_lru_node	*nlru = &lru->node[nid];
>  	struct list_head *item, *n;
>  	unsigned long isolated = 0;
> +	struct list_lru_node *nlru;
> +	int memcg_id = -1;
> +
> +	if (memcg && memcg_kmem_is_active(memcg))
> +		memcg_id = memcg_cache_id(memcg);

Could use a helper function for this I guess.  The nice thing about
this is that it gives one a logical place at which to describe what's
going on.

> +	nlru = lru_node_of_index(lru, memcg_id, nid);
> +	if (!nlru)
> +		return 0;
>  
>  	spin_lock(&nlru->lock);
>  	list_for_each_safe(item, n, &nlru->list) {
> @@ -200,7 +238,7 @@ restart:
>  	spin_unlock(&nlru->lock);
>  	return isolated;
>  }
> -EXPORT_SYMBOL_GPL(list_lru_walk_node);
> +EXPORT_SYMBOL_GPL(list_lru_walk_node_memcg);
>  
>  static unsigned long
>  list_lru_dispose_all_node(
> 
> ...
>

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 103+ messages in thread

[parent not found: <20130605160837.0d0a35fbd4b32d7ad02f7136-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org>]

* Re: [PATCH v10 28/35] list_lru: per-memcg walks
       [not found]       ` <20130605160837.0d0a35fbd4b32d7ad02f7136-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org>
@ 2013-06-06  8:37         ` Glauber Costa
  0 siblings, 0 replies; 103+ messages in thread
From: Glauber Costa @ 2013-06-06  8:37 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Glauber Costa, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Mel Gorman,
	Dave Chinner, linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
	cgroups-u79uwXL29TY76Z2rM5mHXA,
	kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A, Michal Hocko,
	Johannes Weiner, hughd-hpIqsD4AKlfQT0dZR+AlfA, Greg Thelen,
	Dave Chinner, Rik van Riel

>>  	/*
>> -	 * The array exist, but the particular memcg does not. That is an
>> -	 * impossible situation: it would mean we are trying to add to a list
>> -	 * belonging to a memcg that does not exist. Either wasn't created or
>> -	 * has been already freed. In both cases it should no longer have
>> -	 * objects. BUG_ON to avoid a NULL dereference.
>> +	 * The array exist, but the particular memcg does not. This cannot
>> +	 * happen when we are called from memcg_kmem_lru_of_page with a
>> +	 * definite memcg, but it can happen when we are iterating over all
>> +	 * memcgs (for instance, when disposing all lists.
>>  	 */
>> -	BUG_ON(!lru->memcg_lrus[index]);
>> +	if (!lru->memcg_lrus[index]) {
>> +		rcu_read_unlock();
>> +		return NULL;
>> +	}
> 
> It took 28 patches, but my head is now spinning and my vision is fading
> in and out.
> 
You're a hero.

>>  	nlru = &lru->memcg_lrus[index]->node[nid];
>>  	rcu_read_unlock();
>>  	return nlru;
>> @@ -80,6 +83,23 @@ memcg_kmem_lru_of_page(struct list_lru *lru, struct page *page)
>>  	return lru_node_of_index(lru, memcg_id, nid);
>>  }
>>  
>> +/*
>> + * This helper will loop through all node-data in the LRU, either global or
>> + * per-memcg.  If memcg is either not present or not used,
>> + * memcg_limited_groups_array_size will be 0. _idx starts at -1, and it will
>> + * still be allowed to execute once.
>> + *
>> + * We convention that for _idx = -1, the global node info should be used.
> 
> I don't think that "convention" is a verb, but I rather like the way
> it is used here.
> 
We can convention to do it this way from now on.

>> + * After that, we will go through each of the memcgs, starting at 0.
>> + *
>> + * We don't need any kind of locking for the loop because
>> + * memcg_limited_groups_array_size can only grow, gaining new fields at the
>> + * end. The old ones are just copied, and any interesting manipulation happen
>> + * in the node list itself, and we already lock the list.
> 
> Might be worth mentioning what type _idx should have.  Although I suspect
> the code will work OK if _idx has unsigned type.
> 

We convention -1 to be "no memcg", so it has to be an int.
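
To make the type requirement concrete, here is a small userspace sketch (not
part of the patch). It copies the macro as posted and assumes that
memcg_limited_groups_array_size is a plain int; the two counting functions
are made up for illustration. With an unsigned index the initial -1 wraps
to UINT_MAX and the bound is converted to unsigned for the comparison, so
the loop body should never run, not even for the global entry.

```c
#include <assert.h>

/* Stand-in for the kernel global; assumed here to be a plain int. */
static int memcg_limited_groups_array_size;

/* Same definition as in the patch. */
#define for_each_memcg_lru_index(_idx)	\
	for ((_idx) = -1; ((_idx) < memcg_limited_groups_array_size); (_idx)++)

/* Iterations with a signed index: the global slot (-1) plus one per memcg. */
static int iterations_with_int(int nr_memcgs)
{
	int idx, n = 0;

	assert(nr_memcgs >= 0);
	memcg_limited_groups_array_size = nr_memcgs;
	for_each_memcg_lru_index(idx)
		n++;
	return n;
}

/*
 * Iterations with an unsigned index: -1 becomes UINT_MAX, and the int bound
 * is converted to unsigned for the comparison, so the test fails at once.
 */
static int iterations_with_unsigned(int nr_memcgs)
{
	unsigned int idx;
	int n = 0;

	assert(nr_memcgs >= 0);
	memcg_limited_groups_array_size = nr_memcgs;
	for_each_memcg_lru_index(idx)
		n++;
	return n;
}
```

In this sketch the signed variant visits nr_memcgs + 1 entries, while the
unsigned variant visits none.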

>> + */
>> +#define for_each_memcg_lru_index(_idx)	\
>> +	for ((_idx) = -1; ((_idx) < memcg_limited_groups_array_size); (_idx)++)
>> +
>>  int
>>  list_lru_add(
>>  	struct list_lru	*lru,
>> @@ -139,10 +159,19 @@ list_lru_del(
>>  EXPORT_SYMBOL_GPL(list_lru_del);
>>  
>>  
>>  unsigned long
>> -list_lru_walk_node(
>> +list_lru_walk_node_memcg(
>>  	struct list_lru		*lru,
>>  	int			nid,
>>  	list_lru_walk_cb	isolate,
>>  	void			*cb_arg,
>> -	unsigned long		*nr_to_walk)
>> +	unsigned long		*nr_to_walk,
>> +	struct mem_cgroup	*memcg)
>>  {
>> -	struct list_lru_node	*nlru = &lru->node[nid];
>>  	struct list_head *item, *n;
>>  	unsigned long isolated = 0;
>> +	struct list_lru_node *nlru;
>> +	int memcg_id = -1;
>> +
>> +	if (memcg && memcg_kmem_is_active(memcg))
>> +		memcg_id = memcg_cache_id(memcg);
> 
> Could use a helper function for this I guess.  The nice thing about
> this is that it gives one a logical place at which to describe what's
> going on.
> 
Ok.
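
One possible shape for such a helper, as a userspace sketch only. The name
memcg_lru_index is hypothetical, and struct mem_cgroup and the two accessors
are reduced to stand-ins so the snippet builds outside the kernel; the real
ones live in mm/memcontrol.c.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Userspace stand-ins for the kernel type and accessors. */
struct mem_cgroup {
	bool kmem_active;	/* what memcg_kmem_is_active() would report */
	int kmemcg_id;		/* what memcg_cache_id() would return */
};

static bool memcg_kmem_is_active(const struct mem_cgroup *memcg)
{
	return memcg->kmem_active;
}

static int memcg_cache_id(const struct mem_cgroup *memcg)
{
	return memcg->kmemcg_id;
}

/*
 * Hypothetical helper: map a memcg to its index in lru->memcg_lrus[].
 * Returns -1, the convention for "use the global per-node lists", when
 * there is no memcg or kmem accounting is not active for it.
 */
static int memcg_lru_index(const struct mem_cgroup *memcg)
{
	int id;

	if (!memcg || !memcg_kmem_is_active(memcg))
		return -1;

	id = memcg_cache_id(memcg);
	assert(id >= 0);	/* assumption of this sketch: active implies a valid id */
	return id;
}
```

Both list_lru_count_node_memcg() and list_lru_walk_node_memcg() could then
pass the result straight to lru_node_of_index(), and the comment on the
helper becomes the single place that documents the -1 convention.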

^ permalink raw reply	[flat|nested] 103+ messages in thread

* [PATCH v10 29/35] memcg: per-memcg kmem shrinking
       [not found] ` <1370287804-3481-1-git-send-email-glommer-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org>
                     ` (24 preceding siblings ...)
  2013-06-03 19:29   ` [PATCH v10 28/35] list_lru: per-memcg walks Glauber Costa
@ 2013-06-03 19:29   ` Glauber Costa
  2013-06-05 23:08     ` Andrew Morton
  2013-06-03 19:29   ` [PATCH v10 30/35] memcg: scan cache objects hierarchically Glauber Costa
                     ` (4 subsequent siblings)
  30 siblings, 1 reply; 103+ messages in thread
From: Glauber Costa @ 2013-06-03 19:29 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Mel Gorman, Dave Chinner,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, cgroups-u79uwXL29TY76Z2rM5mHXA,
	kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A, Michal Hocko,
	Johannes Weiner, hughd-hpIqsD4AKlfQT0dZR+AlfA, Greg Thelen,
	Glauber Costa, Dave Chinner, Rik van Riel

If the kernel limit is smaller than the user limit, we will have
situations in which our allocations fail but freeing user pages will buy
us nothing.  In those, we would like to call a specialized memcg
reclaimer that only frees kernel memory and leaves the user memory alone.
Those are also expected to fail when we account memcg->kmem, instead of
when we account memcg->res. Based on that, this patch implements a
memcg-specific reclaimer that only shrinks kernel objects, without
touching user pages.

There might be situations in which there are plenty of objects to
shrink, but we can't do it because the __GFP_FS flag is not set.
Although they can happen with user pages, they are a lot more common
with fs-metadata: this is the case with almost all inode allocation.

Those allocations are, however, capable of waiting.  So we can just spawn
a worker, let it finish its job and proceed with the allocation. As slow
as it is, at this point we are already past any hopes anyway.
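
The charge path described above can be approximated in userspace as
follows. This is a simplified sketch, not the patch's code: struct counter
stands in for struct res_counter, reclaim_kmem() is a hypothetical reclaimer
that always frees what it is asked to (real shrinkers give no such
guarantee), and the __GFP_FS handling is left out entirely.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Simplified stand-in for struct res_counter: usage and limit only. */
struct counter {
	uint64_t usage;
	uint64_t limit;
};

/* Charge size bytes; returns 0 on success, -1 if the limit would be exceeded. */
static int counter_charge(struct counter *c, uint64_t size)
{
	if (c->usage + size > c->limit)
		return -1;
	c->usage += size;
	return 0;
}

/* Mirrors the KMEM_MAY_SHRINK test: kmem limit below the overall limit. */
static bool kmem_may_shrink(const struct counter *kmem, const struct counter *res)
{
	return kmem->limit < res->limit;
}

/* Hypothetical reclaimer: frees up to want bytes, returns the amount freed. */
static uint64_t reclaim_kmem(struct counter *kmem, uint64_t want)
{
	uint64_t freed = want < kmem->usage ? want : kmem->usage;

	kmem->usage -= freed;
	return freed;
}

/* Try to charge; on failure reclaim and retry, up to retries extra attempts. */
static int try_charge_kmem(struct counter *kmem, uint64_t size,
			   bool can_wait, int retries)
{
	int ret;

	assert(retries >= 0);
	do {
		ret = counter_charge(kmem, size);
		if (!ret || !can_wait)
			return ret;
		reclaim_kmem(kmem, size);
	} while (retries--);

	return ret;
}
```

Note that a do/while loop on retries-- makes retries + 1 attempts in total,
which is also how the loop in memcg_try_charge_kmem() is written.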

[ v2: moved congestion_wait call to vmscan.c ]
Signed-off-by: Glauber Costa <glommer-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org>
Cc: Dave Chinner <dchinner-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
Cc: Mel Gorman <mgorman-l3A5Bk7waGM@public.gmane.org>
Cc: Rik van Riel <riel-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
Cc: Johannes Weiner <hannes-druUgvl0LCNAfugRpC6u6w@public.gmane.org>
Cc: Michal Hocko <mhocko-AlSwsSmVLrQ@public.gmane.org>
Cc: Hugh Dickins <hughd-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
Cc: Kamezawa Hiroyuki <kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A@public.gmane.org>
Cc: Andrew Morton <akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org>
---
 include/linux/swap.h |   2 +
 mm/memcontrol.c      | 180 ++++++++++++++++++++++++++++++++++++++++-----------
 mm/vmscan.c          |  44 ++++++++++++-
 3 files changed, 187 insertions(+), 39 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index ca031f7..5a0ef45 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -268,6 +268,8 @@ extern unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
 extern int __isolate_lru_page(struct page *page, isolate_mode_t mode);
 extern unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *mem,
 						  gfp_t gfp_mask, bool noswap);
+extern unsigned long try_to_free_mem_cgroup_kmem(struct mem_cgroup *mem,
+						 gfp_t gfp_mask);
 extern unsigned long mem_cgroup_shrink_node_zone(struct mem_cgroup *mem,
 						gfp_t gfp_mask, bool noswap,
 						struct zone *zone,
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 846c82c..768a771 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -382,7 +382,8 @@ struct mem_cgroup {
 	atomic_t	numainfo_events;
 	atomic_t	numainfo_updating;
 #endif
-
+	/* when kmem shrinkers can sleep but can't proceed due to context */
+	struct work_struct kmemcg_shrink_work;
 	/*
 	 * Per cgroup active and inactive list, similar to the
 	 * per zone LRU lists.
@@ -448,11 +449,14 @@ static inline void memcg_dangling_free(struct mem_cgroup *memcg) {}
 static inline void memcg_dangling_add(struct mem_cgroup *memcg) {}
 #endif
 
+static DEFINE_MUTEX(set_limit_mutex);
+
 /* internal only representation about the status of kmem accounting. */
 enum {
 	KMEM_ACCOUNTED_ACTIVE = 0, /* accounted by this cgroup itself */
 	KMEM_ACCOUNTED_ACTIVATED, /* static key enabled. */
 	KMEM_ACCOUNTED_DEAD, /* dead memcg with pending kmem charges */
+	KMEM_MAY_SHRINK, /* kmem limit < mem limit, shrink kmem only */
 };
 
 /* We account when limit is on, but only after call sites are patched */
@@ -491,6 +495,31 @@ static bool memcg_kmem_test_and_clear_dead(struct mem_cgroup *memcg)
 	return test_and_clear_bit(KMEM_ACCOUNTED_DEAD,
 				  &memcg->kmem_account_flags);
 }
+
+/*
+ * If the kernel limit is smaller than the user limit, we will have situations
+ * in which our allocations fail but freeing user pages will buy us nothing.
+ * In those, we would like to call a specialized memcg reclaimer that only
+ * frees kernel memory and leave the user memory alone.
+ *
+ * This test exists so we can differentiate between those. Everytime one of the
+ * limits is updated, we need to run it. The set_limit_mutex must be held, so
+ * they don't change again.
+ */
+static void memcg_update_shrink_status(struct mem_cgroup *memcg)
+{
+	mutex_lock(&set_limit_mutex);
+	if (res_counter_read_u64(&memcg->kmem, RES_LIMIT) <
+		res_counter_read_u64(&memcg->res, RES_LIMIT))
+		set_bit(KMEM_MAY_SHRINK, &memcg->kmem_account_flags);
+	else
+		clear_bit(KMEM_MAY_SHRINK, &memcg->kmem_account_flags);
+	mutex_unlock(&set_limit_mutex);
+}
+#else
+static void memcg_update_shrink_status(struct mem_cgroup *memcg)
+{
+}
 #endif
 
 /* Stuffs for move charges at task migration. */
@@ -3013,8 +3042,6 @@ static void __mem_cgroup_commit_charge(struct mem_cgroup *memcg,
 	memcg_check_events(memcg, page);
 }
 
-static DEFINE_MUTEX(set_limit_mutex);
-
 #ifdef CONFIG_MEMCG_KMEM
 static inline bool memcg_can_account_kmem(struct mem_cgroup *memcg)
 {
@@ -3056,16 +3083,91 @@ static int mem_cgroup_slabinfo_read(struct cgroup *cont, struct cftype *cft,
 }
 #endif
 
+/*
+ * During the creation a new cache, we need to disable our accounting mechanism
+ * altogether. This is true even if we are not creating, but rather just
+ * enqueing new caches to be created.
+ *
+ * This is because that process will trigger allocations; some visible, like
+ * explicit kmallocs to auxiliary data structures, name strings and internal
+ * cache structures; some well concealed, like INIT_WORK() that can allocate
+ * objects during debug.
+ *
+ * If any allocation happens during memcg_kmem_get_cache, we will recurse back
+ * to it. This may not be a bounded recursion: since the first cache creation
+ * failed to complete (waiting on the allocation), we'll just try to create the
+ * cache again, failing at the same point.
+ *
+ * memcg_kmem_get_cache is prepared to abort after seeing a positive count of
+ * memcg_kmem_skip_account. So we enclose anything that might allocate memory
+ * inside the following two functions.
+ */
+static inline void memcg_stop_kmem_account(void)
+{
+	VM_BUG_ON(!current->mm);
+	current->memcg_kmem_skip_account++;
+}
+
+static inline void memcg_resume_kmem_account(void)
+{
+	VM_BUG_ON(!current->mm);
+	current->memcg_kmem_skip_account--;
+}
+
+static int memcg_try_charge_kmem(struct mem_cgroup *memcg, gfp_t gfp, u64 size)
+{
+	int retries = MEM_CGROUP_RECLAIM_RETRIES;
+	struct res_counter *fail_res;
+	int ret;
+
+	do {
+		ret = res_counter_charge(&memcg->kmem, size, &fail_res);
+		if (!ret)
+			return ret;
+
+		if (!(gfp & __GFP_WAIT))
+			return ret;
+
+		/*
+		 * We will try to shrink kernel memory present in caches. We
+		 * are sure that we can wait, so we will. The duration of our
+		 * wait is determined by congestion, the same way as vmscan.c
+		 *
+		 * If we are in FS context, though, then although we can wait,
+		 * we cannot call the shrinkers. Most fs shrinkers (which
+		 * comprises most of our kmem data) will not run without
+		 * __GFP_FS since they can deadlock. The solution is to
+		 * synchronously run that in a different context.
+		 */
+		if (!(gfp & __GFP_FS)) {
+			/*
+			 * we are already short on memory, every queue
+			 * allocation is likely to fail
+			 */
+			memcg_stop_kmem_account();
+			schedule_work(&memcg->kmemcg_shrink_work);
+			flush_work(&memcg->kmemcg_shrink_work);
+			memcg_resume_kmem_account();
+		} else
+			try_to_free_mem_cgroup_kmem(memcg, gfp);
+	} while (retries--);
+
+	return ret;
+}
+
 static int memcg_charge_kmem(struct mem_cgroup *memcg, gfp_t gfp, u64 size)
 {
 	struct res_counter *fail_res;
 	struct mem_cgroup *_memcg;
 	int ret = 0;
 	bool may_oom;
+	bool kmem_first = test_bit(KMEM_MAY_SHRINK, &memcg->kmem_account_flags);
 
-	ret = res_counter_charge(&memcg->kmem, size, &fail_res);
-	if (ret)
-		return ret;
+	if (kmem_first) {
+		ret = memcg_try_charge_kmem(memcg, gfp, size);
+		if (ret)
+			return ret;
+	}
 
 	/*
 	 * Conditions under which we can wait for the oom_killer. Those are
@@ -3098,12 +3200,41 @@ static int memcg_charge_kmem(struct mem_cgroup *memcg, gfp_t gfp, u64 size)
 			res_counter_charge_nofail(&memcg->memsw, size,
 						  &fail_res);
 		ret = 0;
-	} else if (ret)
+		if (kmem_first)
+			res_counter_charge_nofail(&memcg->kmem, size, &fail_res);
+	} else if (ret && kmem_first)
 		res_counter_uncharge(&memcg->kmem, size);
 
+	if (!ret && !kmem_first) {
+		ret = res_counter_charge(&memcg->kmem, size, &fail_res);
+		if (!ret)
+			return ret;
+
+		res_counter_uncharge(&memcg->res, size);
+		if (do_swap_account)
+			res_counter_uncharge(&memcg->memsw, size);
+	}
+
 	return ret;
 }
 
+/*
+ * There might be situations in which there are plenty of objects to shrink,
+ * but we can't do it because the __GFP_FS flag is not set.  This is the case
+ * with almost all inode allocation. They are, however, capable of waiting.
+ * So we can just spawn a worker, let it finish its job and proceed with the
+ * allocation. As slow as it is, at this point we are already past any hopes
+ * anyway.
+ */
+static void kmemcg_shrink_work_fn(struct work_struct *w)
+{
+	struct mem_cgroup *memcg;
+
+	memcg = container_of(w, struct mem_cgroup, kmemcg_shrink_work);
+	try_to_free_mem_cgroup_kmem(memcg, GFP_KERNEL);
+}
+
+
 static void memcg_uncharge_kmem(struct mem_cgroup *memcg, u64 size)
 {
 	res_counter_uncharge(&memcg->res, size);
@@ -3183,6 +3314,7 @@ int memcg_update_cache_sizes(struct mem_cgroup *memcg)
 	memcg_update_array_size(num + 1);
 
 	INIT_LIST_HEAD(&memcg->memcg_slab_caches);
+	INIT_WORK(&memcg->kmemcg_shrink_work, kmemcg_shrink_work_fn);
 	mutex_init(&memcg->slab_caches_mutex);
 
 	return 0;
@@ -3457,37 +3589,6 @@ out:
 	kfree(s->memcg_params);
 }
 
-/*
- * During the creation a new cache, we need to disable our accounting mechanism
- * altogether. This is true even if we are not creating, but rather just
- * enqueing new caches to be created.
- *
- * This is because that process will trigger allocations; some visible, like
- * explicit kmallocs to auxiliary data structures, name strings and internal
- * cache structures; some well concealed, like INIT_WORK() that can allocate
- * objects during debug.
- *
- * If any allocation happens during memcg_kmem_get_cache, we will recurse back
- * to it. This may not be a bounded recursion: since the first cache creation
- * failed to complete (waiting on the allocation), we'll just try to create the
- * cache again, failing at the same point.
- *
- * memcg_kmem_get_cache is prepared to abort after seeing a positive count of
- * memcg_kmem_skip_account. So we enclose anything that might allocate memory
- * inside the following two functions.
- */
-static inline void memcg_stop_kmem_account(void)
-{
-	VM_BUG_ON(!current->mm);
-	current->memcg_kmem_skip_account++;
-}
-
-static inline void memcg_resume_kmem_account(void)
-{
-	VM_BUG_ON(!current->mm);
-	current->memcg_kmem_skip_account--;
-}
-
 struct mem_cgroup *mem_cgroup_from_kmem_page(struct page *page)
 {
 	struct page_cgroup *pc;
@@ -5572,6 +5673,9 @@ static int mem_cgroup_write(struct cgroup *cont, struct cftype *cft,
 			ret = memcg_update_kmem_limit(cont, val);
 		else
 			return -EINVAL;
+
+		if (!ret)
+			memcg_update_shrink_status(memcg);
 		break;
 	case RES_SOFT_LIMIT:
 		ret = res_counter_memparse_write_strategy(buffer, &val);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 109a3bf..82793e1 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2659,7 +2659,49 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
 
 	return nr_reclaimed;
 }
-#endif
+
+#ifdef CONFIG_MEMCG_KMEM
+/*
+ * This function is called when we are under kmem-specific pressure.  It will
+ * only trigger in environments with kmem.limit_in_bytes < limit_in_bytes, IOW,
+ * with a lower kmem allowance than the memory allowance.
+ *
+ * In this situation, freeing user pages from the cgroup won't do us any good.
+ * What we really need is to call the memcg-aware shrinkers, in the hope of
+ * freeing pages holding kmem objects. It may also be that we won't be able to
+ * free any pages, but will get rid of old objects opening up space for new
+ * ones.
+ */
+unsigned long try_to_free_mem_cgroup_kmem(struct mem_cgroup *memcg,
+					  gfp_t gfp_mask)
+{
+	long freed;
+
+	struct shrink_control shrink = {
+		.gfp_mask = gfp_mask,
+		.target_mem_cgroup = memcg,
+	};
+
+	if (!(gfp_mask & __GFP_WAIT))
+		return 0;
+
+	/*
+	 * memcg pressure is always global */
+	nodes_setall(shrink.nodes_to_scan);
+
+	/*
+	 * We haven't scanned any user LRU, so we basically come up with
+	 * crafted values of nr_scanned and LRU page (1 and 0 respectively).
+	 * This should be enough to tell shrink_slab that the freeing
+	 * responsibility is all on himself.
+	 */
+	freed = shrink_slab(&shrink, 1, 0);
+	if (!freed)
+		congestion_wait(BLK_RW_ASYNC, HZ/10);
+	return freed;
+}
+#endif /* CONFIG_MEMCG_KMEM */
+#endif /* CONFIG_MEMCG */
 
 static void age_active_anon(struct zone *zone, struct scan_control *sc)
 {
-- 
1.8.1.4

^ permalink raw reply related	[flat|nested] 103+ messages in thread

* Re: [PATCH v10 29/35] memcg: per-memcg kmem shrinking
  2013-06-03 19:29   ` [PATCH v10 29/35] memcg: per-memcg kmem shrinking Glauber Costa
@ 2013-06-05 23:08     ` Andrew Morton
       [not found]       ` <20130605160841.909420c06bfde62039489d2e-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org>
  0 siblings, 1 reply; 103+ messages in thread
From: Andrew Morton @ 2013-06-05 23:08 UTC (permalink / raw)
  To: Glauber Costa
  Cc: linux-fsdevel, Mel Gorman, Dave Chinner, linux-mm, cgroups,
	kamezawa.hiroyu, Michal Hocko, Johannes Weiner, hughd,
	Greg Thelen, Dave Chinner, Rik van Riel

On Mon,  3 Jun 2013 23:29:58 +0400 Glauber Costa <glommer@openvz.org> wrote:

> If the kernel limit is smaller than the user limit, we will have
> situations in which our allocations fail but freeing user pages will buy
> us nothing.  In those, we would like to call a specialized memcg
> reclaimer that only frees kernel memory and leave the user memory alone.
> Those are also expected to fail when we account memcg->kmem, instead of
> when we account memcg->res. Based on that, this patch implements a
> memcg-specific reclaimer, that only shrinks kernel objects, withouth
> touching user pages.
> 
> There might be situations in which there are plenty of objects to
> shrink, but we can't do it because the __GFP_FS flag is not set.
> Although they can happen with user pages, they are a lot more common
> with fs-metadata: this is the case with almost all inode allocation.
> 
> Those allocations are, however, capable of waiting.  So we can just span

"spawn"!

> a worker, let it finish its job and proceed with the allocation. As slow
> as it is, at this point we are already past any hopes anyway.

>
> ...
>
> + * If the kernel limit is smaller than the user limit, we will have situations
> + * in which our allocations fail but freeing user pages will buy us nothing.
> + * In those, we would like to call a specialized memcg reclaimer that only
> + * frees kernel memory and leave the user memory alone.

"leaves"

> + * This test exists so we can differentiate between those. Everytime one of the

"Every time"

> + * limits is updated, we need to run it. The set_limit_mutex must be held, so
> + * they don't change again.
> + */
> +static void memcg_update_shrink_status(struct mem_cgroup *memcg)
> +{
> +	mutex_lock(&set_limit_mutex);
> +	if (res_counter_read_u64(&memcg->kmem, RES_LIMIT) <
> +		res_counter_read_u64(&memcg->res, RES_LIMIT))
> +		set_bit(KMEM_MAY_SHRINK, &memcg->kmem_account_flags);
> +	else
> +		clear_bit(KMEM_MAY_SHRINK, &memcg->kmem_account_flags);
> +	mutex_unlock(&set_limit_mutex);
> +}
> +#else
> +static void memcg_update_shrink_status(struct mem_cgroup *memcg)
> +{
> +}
>  #endif
>  
>  /* Stuffs for move charges at task migration. */
> @@ -3013,8 +3042,6 @@ static void __mem_cgroup_commit_charge(struct mem_cgroup *memcg,
>  	memcg_check_events(memcg, page);
>  }
>  
> -static DEFINE_MUTEX(set_limit_mutex);
> -
>  #ifdef CONFIG_MEMCG_KMEM
>  static inline bool memcg_can_account_kmem(struct mem_cgroup *memcg)
>  {
> @@ -3056,16 +3083,91 @@ static int mem_cgroup_slabinfo_read(struct cgroup *cont, struct cftype *cft,
>  }
>  #endif
>  
> +/*
> + * During the creation a new cache, we need to disable our accounting mechanism

"creation of"

> + * altogether. This is true even if we are not creating, but rather just
> + * enqueing new caches to be created.
> + *
> + * This is because that process will trigger allocations; some visible, like
> + * explicit kmallocs to auxiliary data structures, name strings and internal
> + * cache structures; some well concealed, like INIT_WORK() that can allocate
> + * objects during debug.
> + *
> + * If any allocation happens during memcg_kmem_get_cache, we will recurse back
> + * to it. This may not be a bounded recursion: since the first cache creation
> + * failed to complete (waiting on the allocation), we'll just try to create the
> + * cache again, failing at the same point.
> + *
> + * memcg_kmem_get_cache is prepared to abort after seeing a positive count of
> + * memcg_kmem_skip_account. So we enclose anything that might allocate memory
> + * inside the following two functions.

Please identify the type of caches we're talking about here.  slab
caches?  inode/dentry/anything-which-has-a-shrinker?

(yes, these observations pertain to existing code)

> + */
> +static inline void memcg_stop_kmem_account(void)
> +{
> +	VM_BUG_ON(!current->mm);
> +	current->memcg_kmem_skip_account++;
> +}
> +
> +static inline void memcg_resume_kmem_account(void)
> +{
> +	VM_BUG_ON(!current->mm);
> +	current->memcg_kmem_skip_account--;
> +}
> +
> +static int memcg_try_charge_kmem(struct mem_cgroup *memcg, gfp_t gfp, u64 size)
> +{
> +	int retries = MEM_CGROUP_RECLAIM_RETRIES;
> +	struct res_counter *fail_res;
> +	int ret;
> +
> +	do {
> +		ret = res_counter_charge(&memcg->kmem, size, &fail_res);
> +		if (!ret)
> +			return ret;
> +
> +		if (!(gfp & __GFP_WAIT))
> +			return ret;
> +
> +		/*
> +		 * We will try to shrink kernel memory present in caches. We
> +		 * are sure that we can wait, so we will. The duration of our
> +		 * wait is determined by congestion, the same way as vmscan.c
> +		 *
> +		 * If we are in FS context, though, then although we can wait,
> +		 * we cannot call the shrinkers. Most fs shrinkers (which
> +		 * comprises most of our kmem data) will not run without
> +		 * __GFP_FS since they can deadlock. The solution is to
> +		 * synchronously run that in a different context.

But this is pointless.  Calling a function via a different thread and
then waiting for it to complete is equivalent to calling it directly.

> +		 */
> +		if (!(gfp & __GFP_FS)) {
> +			/*
> +			 * we are already short on memory, every queue
> +			 * allocation is likely to fail
> +			 */
> +			memcg_stop_kmem_account();
> +			schedule_work(&memcg->kmemcg_shrink_work);
> +			flush_work(&memcg->kmemcg_shrink_work);
> +			memcg_resume_kmem_account();
> +		} else
> +			try_to_free_mem_cgroup_kmem(memcg, gfp);
> +	} while (retries--);
> +
> +	return ret;
> +}
>
> ...
>

^ permalink raw reply	[flat|nested] 103+ messages in thread

[parent not found: <20130605160841.909420c06bfde62039489d2e-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org>]

* Re: [PATCH v10 29/35] memcg: per-memcg kmem shrinking
       [not found]       ` <20130605160841.909420c06bfde62039489d2e-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org>
@ 2013-06-06  8:35         ` Glauber Costa
  2013-06-06  9:49           ` Andrew Morton
  0 siblings, 1 reply; 103+ messages in thread
From: Glauber Costa @ 2013-06-06  8:35 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Glauber Costa, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Mel Gorman,
	Dave Chinner, linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
	cgroups-u79uwXL29TY76Z2rM5mHXA,
	kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A, Michal Hocko,
	Johannes Weiner, hughd-hpIqsD4AKlfQT0dZR+AlfA, Greg Thelen,
	Dave Chinner, Rik van Riel

On 06/06/2013 03:08 AM, Andrew Morton wrote:
>> +
>> > +		/*
>> > +		 * We will try to shrink kernel memory present in caches. We
>> > +		 * are sure that we can wait, so we will. The duration of our
>> > +		 * wait is determined by congestion, the same way as vmscan.c
>> > +		 *
>> > +		 * If we are in FS context, though, then although we can wait,
>> > +		 * we cannot call the shrinkers. Most fs shrinkers (which
>> > +		 * comprises most of our kmem data) will not run without
>> > +		 * __GFP_FS since they can deadlock. The solution is to
>> > +		 * synchronously run that in a different context.
> But this is pointless.  Calling a function via a different thread and
> then waiting for it to complete is equivalent to calling it directly.
> 
Not in this case. We are in a wait-capable context (we check for this
right before we reach this), but we are not in an fs-capable context.

So the reason we do this, which I tried to cover in the changelog, is
to escape from the GFP_FS limitation that our call chain has, not the
wait limitation.

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH v10 29/35] memcg: per-memcg kmem shrinking
  2013-06-06  8:35         ` Glauber Costa
@ 2013-06-06  9:49           ` Andrew Morton
       [not found]             ` <20130606024906.e5b85b28.akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org>
  0 siblings, 1 reply; 103+ messages in thread
From: Andrew Morton @ 2013-06-06  9:49 UTC (permalink / raw)
  To: Glauber Costa
  Cc: Glauber Costa, linux-fsdevel, Mel Gorman, Dave Chinner, linux-mm,
	cgroups, kamezawa.hiroyu, Michal Hocko, Johannes Weiner, hughd,
	Greg Thelen, Dave Chinner, Rik van Riel

On Thu, 6 Jun 2013 12:35:33 +0400 Glauber Costa <glommer@parallels.com> wrote:

> On 06/06/2013 03:08 AM, Andrew Morton wrote:
> >> +
> >> > +		/*
> >> > +		 * We will try to shrink kernel memory present in caches. We
> >> > +		 * are sure that we can wait, so we will. The duration of our
> >> > +		 * wait is determined by congestion, the same way as vmscan.c
> >> > +		 *
> >> > +		 * If we are in FS context, though, then although we can wait,
> >> > +		 * we cannot call the shrinkers. Most fs shrinkers (which
> >> > +		 * comprises most of our kmem data) will not run without
> >> > +		 * __GFP_FS since they can deadlock. The solution is to
> >> > +		 * synchronously run that in a different context.
> > But this is pointless.  Calling a function via a different thread and
> > then waiting for it to complete is equivalent to calling it directly.
> > 
> Not in this case. We are in wait-capable context (we check for this
> right before we reach this), but we are not in fs capable context.
> 
> So the reason we do this - which I tried to cover in the changelog, is
> to escape from the GFP_FS limitation that our call chain has, not the
> wait limitation.

But that's equivalent to calling the code directly.  Look:

some_fs_function()
{
	lock(some-fs-lock);
	...
}

some_other_fs_function()
{
	lock(some-fs-lock);
	alloc_pages(GFP_NOFS);
	->...
	  ->schedule_work(some_fs_function);
	    flush_scheduled_work();

that flush_scheduled_work() won't complete until some_fs_function() has
completed.  But some_fs_function() won't complete, because we're
holding some-fs-lock.
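
The same point can be reproduced in userspace with POSIX threads. This is
only an analogy under stated assumptions: a pthread stands in for the
workqueue worker, pthread_join() plays the role of the flush, and the worker
uses trylock so that the demonstration reports the conflict instead of
hanging. All names below are made up for the sketch.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>

static pthread_mutex_t some_fs_lock = PTHREAD_MUTEX_INITIALIZER;

/* Stand-in for the shrinker: it needs some_fs_lock to make progress. */
static void *worker_fn(void *arg)
{
	int *result = arg;

	/* trylock, so the sketch records the conflict rather than blocking */
	*result = pthread_mutex_trylock(&some_fs_lock);
	if (*result == 0)
		pthread_mutex_unlock(&some_fs_lock);
	return NULL;
}

/*
 * Run the worker in another thread and wait for it, optionally while
 * holding the lock. Returns what the worker's trylock returned.
 */
static int run_worker_and_wait(int hold_lock)
{
	pthread_t worker;
	int result = -1;
	int err;

	if (hold_lock)
		pthread_mutex_lock(&some_fs_lock);

	err = pthread_create(&worker, NULL, worker_fn, &result);
	assert(err == 0);
	err = pthread_join(worker, NULL);	/* plays the role of the flush */
	assert(err == 0);

	if (hold_lock)
		pthread_mutex_unlock(&some_fs_lock);
	return result;
}
```

When the caller holds the lock across the wait, the worker sees EBUSY; had
it used a blocking lock instead, neither thread would ever finish. Moving
the work to another thread changes the stack it runs on, not the set of
locks the waiter is holding.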


	

^ permalink raw reply	[flat|nested] 103+ messages in thread

[parent not found: <20130606024906.e5b85b28.akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org>]

* Re: [PATCH v10 29/35] memcg: per-memcg kmem shrinking
       [not found]             ` <20130606024906.e5b85b28.akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org>
@ 2013-06-06 12:09               ` Glauber Costa
       [not found]                 ` <51B07BEC.9010205-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
  0 siblings, 1 reply; 103+ messages in thread
From: Glauber Costa @ 2013-06-06 12:09 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Glauber Costa, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Mel Gorman,
	Dave Chinner, linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
	cgroups-u79uwXL29TY76Z2rM5mHXA,
	kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A, Michal Hocko,
	Johannes Weiner, hughd-hpIqsD4AKlfQT0dZR+AlfA, Greg Thelen,
	Dave Chinner, Rik van Riel

On 06/06/2013 01:49 PM, Andrew Morton wrote:
> On Thu, 6 Jun 2013 12:35:33 +0400 Glauber Costa <glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org> wrote:
> 
>> On 06/06/2013 03:08 AM, Andrew Morton wrote:
>>>> +
>>>>> +		/*
>>>>> +		 * We will try to shrink kernel memory present in caches. We
>>>>> +		 * are sure that we can wait, so we will. The duration of our
>>>>> +		 * wait is determined by congestion, the same way as vmscan.c
>>>>> +		 *
>>>>> +		 * If we are in FS context, though, then although we can wait,
>>>>> +		 * we cannot call the shrinkers. Most fs shrinkers (which
>>>>> +		 * comprises most of our kmem data) will not run without
>>>>> +		 * __GFP_FS since they can deadlock. The solution is to
>>>>> +		 * synchronously run that in a different context.
>>> But this is pointless.  Calling a function via a different thread and
>>> then waiting for it to complete is equivalent to calling it directly.
>>>
>> Not in this case. We are in wait-capable context (we check for this
>> right before we reach this), but we are not in fs capable context.
>>
>> So the reason we do this - which I tried to cover in the changelog, is
>> to escape from the GFP_FS limitation that our call chain has, not the
>> wait limitation.
> 
> But that's equivalent to calling the code directly.  Look:
> 
> some_fs_function()
> {
> 	lock(some-fs-lock);
> 	...
> }
> 
> some_other_fs_function()
> {
> 	lock(some-fs-lock);
> 	alloc_pages(GFP_NOFS);
> 	->...
> 	  ->schedule_work(some_fs_function);
> 	    flush_scheduled_work();
> 
> that flush_scheduled_work() won't complete until some_fs_function() has
> completed.  But some_fs_function() won't complete, because we're
> holding some-fs-lock.
> 

In my experience during this series, most of the kmem allocation here
will be filesystem related. This means that we will allocate that with
GFP_FS on. If we don't do anything like that, reclaim is almost
pointless since it will never free anything (only once here and there
when the allocation is not from fs).

It tends to work just fine like this. It may very well be because fs
people just mark everything as NOFS out of safety and we aren't *really*
holding any locks in common situations, but it will blow in our faces in
a subtle way (which none of us want).

That said, suggestions are more than welcome.

^ permalink raw reply	[flat|nested] 103+ messages in thread

[parent not found: <51B07BEC.9010205-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>]

* Re: [PATCH v10 29/35] memcg: per-memcg kmem shrinking
       [not found]                 ` <51B07BEC.9010205-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
@ 2013-06-06 22:23                   ` Andrew Morton
  2013-06-07  6:10                     ` Glauber Costa
  0 siblings, 1 reply; 103+ messages in thread
From: Andrew Morton @ 2013-06-06 22:23 UTC (permalink / raw)
  To: Glauber Costa
  Cc: Glauber Costa, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Mel Gorman,
	Dave Chinner, linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
	cgroups-u79uwXL29TY76Z2rM5mHXA,
	kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A, Michal Hocko,
	Johannes Weiner, hughd-hpIqsD4AKlfQT0dZR+AlfA, Greg Thelen,
	Dave Chinner, Rik van Riel

On Thu, 6 Jun 2013 16:09:16 +0400 Glauber Costa <glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org> wrote:

> >>> then waiting for it to complete is equivalent to calling it directly.
> >>>
> >> Not in this case. We are in wait-capable context (we check for this
> >> right before we reach this), but we are not in fs capable context.
> >>
> >> So the reason we do this - which I tried to cover in the changelog, is
> >> to escape from the GFP_FS limitation that our call chain has, not the
> >> wait limitation.
> > 
> > But that's equivalent to calling the code directly.  Look:
> > 
> > some_fs_function()
> > {
> > 	lock(some-fs-lock);
> > 	...
> > }
> > 
> > some_other_fs_function()
> > {
> > 	lock(some-fs-lock);
> > 	alloc_pages(GFP_NOFS);
> > 	->...
> > 	  ->schedule_work(some_fs_function);
> > 	    flush_scheduled_work();
> > 
> > that flush_scheduled_work() won't complete until some_fs_function() has
> > completed.  But some_fs_function() won't complete, because we're
> > holding some-fs-lock.
> > 
> 
> In my experience during this series, most of the kmem allocation here

"most"?

> will be filesystem related. This means that we will allocate that with
> GFP_FS on.

eh?  filesystems do a tremendous amount of GFP_NOFS allocation.  

akpm3:/usr/src/25> grep GFP_NOFS fs/*/*.c|wc -l
898

> If we don't do anything like that, reclaim is almost
> pointless since it will never free anything (only once here and there
> when the allocation is not from fs).

It depends what you mean by "reclaim".  There are a lot of things which
vmscan can do for a GFP_NOFS allocation.  Scraping clean pagecache,
clean swapcache, well-behaved (ahem) shrinkable caches.

> It tends to work just fine like this. It may very well be because fs
> people just mark everything as NOFS out of safety and we aren't *really*
> holding any locks in common situations, but it will blow in our faces in
> a subtle way (which none of us want).
> 
> That said, suggestions are more than welcome.

At a minimum we should remove all the schedule_work() stuff, call the
callback function synchronously and add

	/* This code is full of deadlocks */


Sorry, this part of the patchset is busted and needs a fundamental
rethink.

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH v10 29/35] memcg: per-memcg kmem shrinking
  2013-06-06 22:23                   ` Andrew Morton
@ 2013-06-07  6:10                     ` Glauber Costa
  0 siblings, 0 replies; 103+ messages in thread
From: Glauber Costa @ 2013-06-07  6:10 UTC (permalink / raw)
  To: Andrew Morton, Dave Chinner
  Cc: Glauber Costa, linux-fsdevel, Mel Gorman, Dave Chinner, linux-mm,
	cgroups, kamezawa.hiroyu, Michal Hocko, Johannes Weiner, hughd,
	Greg Thelen, Rik van Riel

On 06/07/2013 02:23 AM, Andrew Morton wrote:
> On Thu, 6 Jun 2013 16:09:16 +0400 Glauber Costa <glommer@parallels.com> wrote:
> 
>>>>> then waiting for it to complete is equivalent to calling it directly.
>>>>>
>>>> Not in this case. We are in wait-capable context (we check for this
>>>> right before we reach this), but we are not in fs capable context.
>>>>
>>>> So the reason we do this - which I tried to cover in the changelog, is
>>>> to escape from the GFP_FS limitation that our call chain has, not the
>>>> wait limitation.
>>>
>>> But that's equivalent to calling the code directly.  Look:
>>>
>>> some_fs_function()
>>> {
>>> 	lock(some-fs-lock);
>>> 	...
>>> }
>>>
>>> some_other_fs_function()
>>> {
>>> 	lock(some-fs-lock);
>>> 	alloc_pages(GFP_NOFS);
>>> 	->...
>>> 	  ->schedule_work(some_fs_function);
>>> 	    flush_scheduled_work();
>>>
>>> that flush_scheduled_work() won't complete until some_fs_function() has
>>> completed.  But some_fs_function() won't complete, because we're
>>> holding some-fs-lock.
>>>
>>
>> In my experience during this series, most of the kmem allocation here
> 
> "most"?
> 

Yes, dentries, inodes, buffer_heads. They constitute the bulk of kmem
allocations. (Please note that I am talking about kmem allocations only)

>> will be filesystem related. This means that we will allocate that with
>> GFP_FS on.
> 
> eh?  filesystems do a tremendous amount of GFP_NOFS allocation.  
> 
> akpm3:/usr/src/25> grep GFP_NOFS fs/*/*.c|wc -l
> 898
> 

My bad, I thought one thing, wrote another. I meant GFP_FS off.

>> If we don't do anything like that, reclaim is almost
>> pointless since it will never free anything (only once here and there
>> when the allocation is not from fs).
> 
> It depends what you mean by "reclaim".  There are a lot of things which
> vmscan can do for a GFP_NOFS allocation.  Scraping clean pagecache,
> clean swapcache, well-behaved (ahem) shrinkable caches.

I mean exclusively shrinkable caches. This code is executed only when we
reach the kernel memory limit. Therefore, we know that depleting user
pages won't help. And now that we have targeted shrinking, we shrink
just the caches.
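
The constraint discussed in this thread, that fs shrinkers refuse to run
when the allocation lacks __GFP_FS, can be sketched with a toy model. This
is only an illustration under stated assumptions: every sketch_* name and
both flag values are hypothetical and are not the kernel's definitions.

```c
/* Hypothetical flag values for this sketch; not the kernel's definitions. */
#define SKETCH_GFP_WAIT	0x1u
#define SKETCH_GFP_FS	0x2u

struct sketch_shrink_control {
	unsigned int gfp_mask;
	unsigned long nr_to_scan;
};

/* Toy cache: just a counter of freeable objects. */
static unsigned long sketch_cached_objects = 1000;

/*
 * Toy fs shrinker. Returns the number of objects freed, or -1 when it
 * refuses to run because the caller may already hold fs locks.
 */
static long sketch_fs_cache_scan(struct sketch_shrink_control *sc)
{
	unsigned long n;

	if (!(sc->gfp_mask & SKETCH_GFP_FS))
		return -1;

	n = sc->nr_to_scan < sketch_cached_objects ?
		sc->nr_to_scan : sketch_cached_objects;
	sketch_cached_objects -= n;
	return (long)n;
}
```

With gfp_mask set to SKETCH_GFP_WAIT alone the call returns -1 and the cache
is untouched; only a mask that includes SKETCH_GFP_FS frees anything, which
is why a NOFS caller at the kmem limit gets little out of shrinking caches.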

> 
>> It tends to work just fine like this. It may very well be because fs
>> people just mark everything as NOFS out of safety and we aren't *really*
>> holding any locks in common situations, but it will blow in our faces in
>> a subtle way (which none of us want).
>>
>> That said, suggestions are more than welcome.
> 
> At a minimum we should remove all the schedule_work() stuff, call the
> callback function synchronously and add
> 
> 	/* This code is full of deadlocks */
> 
> 
> Sorry, this part of the patchset is busted and needs a fundamental
> rethink.
> 
Okay, I will go back to it soon.

I suspect we may have no choice but to just let the shrinkers run
asynchronously, which will fail this allocation but at least free
memory in time for the next one.

Dave Shrinkers, would you be so kind as to look at this problem from the
top of your mighty filesystem knowledge and see if you have a better
suggestion?


^ permalink raw reply	[flat|nested] 103+ messages in thread

* [PATCH v10 30/35] memcg: scan cache objects hierarchically
       [not found] ` <1370287804-3481-1-git-send-email-glommer-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org>
                     ` (25 preceding siblings ...)
  2013-06-03 19:29   ` [PATCH v10 29/35] memcg: per-memcg kmem shrinking Glauber Costa
@ 2013-06-03 19:29   ` Glauber Costa
  2013-06-05 23:08     ` Andrew Morton
  2013-06-03 19:30   ` [PATCH v10 32/35] super: targeted memcg reclaim Glauber Costa
                     ` (3 subsequent siblings)
  30 siblings, 1 reply; 103+ messages in thread
From: Glauber Costa @ 2013-06-03 19:29 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Mel Gorman, Dave Chinner,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, cgroups-u79uwXL29TY76Z2rM5mHXA,
	kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A, Michal Hocko,
	Johannes Weiner, hughd-hpIqsD4AKlfQT0dZR+AlfA, Greg Thelen,
	Glauber Costa, Dave Chinner, Rik van Riel

When reaching shrink_slab, we should descent in children memcg searching
for objects that could be shrunk. This is true even if the memcg does
not have kmem limits on, since the kmem res_counter will also be billed
against the user res_counter of the parent.

It is possible that we will free objects and not free any pages, which
will just harm the child groups without helping the parent group at all.
But at this point, we basically are prepared to pay the price.

Signed-off-by: Glauber Costa <glommer-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org>
Cc: Dave Chinner <dchinner-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
Cc: Mel Gorman <mgorman-l3A5Bk7waGM@public.gmane.org>
Cc: Rik van Riel <riel-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
Cc: Johannes Weiner <hannes-druUgvl0LCNAfugRpC6u6w@public.gmane.org>
Cc: Michal Hocko <mhocko-AlSwsSmVLrQ@public.gmane.org>
Cc: Hugh Dickins <hughd-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
Cc: Kamezawa Hiroyuki <kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A@public.gmane.org>
Cc: Andrew Morton <akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org>
---
 include/linux/memcontrol.h |  6 ++++
 mm/memcontrol.c            | 13 ++++++++
 mm/vmscan.c                | 79 ++++++++++++++++++++++++++++++++++++----------
 3 files changed, 81 insertions(+), 17 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 3eeece8..c8b1412 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -441,6 +441,7 @@ static inline bool memcg_kmem_enabled(void)
 	return static_key_false(&memcg_kmem_enabled_key);
 }
 
+bool memcg_kmem_should_reclaim(struct mem_cgroup *memcg);
 bool memcg_kmem_is_active(struct mem_cgroup *memcg);
 
 /*
@@ -585,6 +586,11 @@ memcg_kmem_get_cache(struct kmem_cache *cachep, gfp_t gfp)
 }
 #else
 
+static inline bool memcg_kmem_should_reclaim(struct mem_cgroup *memcg)
+{
+	return false;
+}
+
 static inline bool memcg_kmem_is_active(struct mem_cgroup *memcg)
 {
 	return false;
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 768a771..aa6853f 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3043,6 +3043,19 @@ static void __mem_cgroup_commit_charge(struct mem_cgroup *memcg,
 }
 
 #ifdef CONFIG_MEMCG_KMEM
+bool memcg_kmem_should_reclaim(struct mem_cgroup *memcg)
+{
+	struct mem_cgroup *iter;
+
+	for_each_mem_cgroup_tree(iter, memcg) {
+		if (memcg_kmem_is_active(iter)) {
+			mem_cgroup_iter_break(memcg, iter);
+			return true;
+		}
+	}
+	return false;
+}
+
 static inline bool memcg_can_account_kmem(struct mem_cgroup *memcg)
 {
 	return !mem_cgroup_disabled() && !mem_cgroup_is_root(memcg) &&
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 82793e1..a42c742 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -148,7 +148,7 @@ static bool global_reclaim(struct scan_control *sc)
 static bool has_kmem_reclaim(struct scan_control *sc)
 {
 	return !sc->target_mem_cgroup ||
-		memcg_kmem_is_active(sc->target_mem_cgroup);
+		memcg_kmem_should_reclaim(sc->target_mem_cgroup);
 }
 
 static unsigned long
@@ -346,12 +346,39 @@ shrink_slab_node(struct shrink_control *shrinkctl, struct shrinker *shrinker,
  *
  * Returns the number of slab objects which we shrunk.
  */
+static unsigned long
+shrink_slab_one(struct shrink_control *shrinkctl, struct shrinker *shrinker,
+		unsigned long nr_pages_scanned, unsigned long lru_pages)
+{
+	unsigned long freed = 0;
+
+	if (!(shrinker->flags & SHRINKER_NUMA_AWARE)) {
+		shrinkctl->nid = 0;
+
+		return shrink_slab_node(shrinkctl, shrinker,
+			 nr_pages_scanned, lru_pages,
+			 &shrinker->nr_deferred);
+	}
+
+	for_each_node_mask(shrinkctl->nid, shrinkctl->nodes_to_scan) {
+		if (!node_online(shrinkctl->nid))
+			continue;
+
+		freed += shrink_slab_node(shrinkctl, shrinker,
+			 nr_pages_scanned, lru_pages,
+			 &shrinker->nr_deferred_node[shrinkctl->nid]);
+	}
+
+	return freed;
+}
+
 unsigned long shrink_slab(struct shrink_control *shrinkctl,
 			  unsigned long nr_pages_scanned,
 			  unsigned long lru_pages)
 {
 	struct shrinker *shrinker;
 	unsigned long freed = 0;
+	struct mem_cgroup *root = shrinkctl->target_mem_cgroup;
 
 	if (nr_pages_scanned == 0)
 		nr_pages_scanned = SWAP_CLUSTER_MAX;
@@ -368,27 +395,45 @@ unsigned long shrink_slab(struct shrink_control *shrinkctl,
 		 * Otherwise we will limit our scan to shrinkers marked as
 		 * memcg aware
 		 */
-		if (shrinkctl->target_mem_cgroup &&
-		    !(shrinker->flags & SHRINKER_MEMCG_AWARE))
+		if (!(shrinker->flags & SHRINKER_MEMCG_AWARE) &&
+			shrinkctl->target_mem_cgroup)
 			continue;
 
-		if (!(shrinker->flags & SHRINKER_NUMA_AWARE)) {
-			shrinkctl->nid = 0;
+		/*
+		 * In a hierarchical chain, it might be that not all memcgs are
+		 * kmem active. kmemcg design mandates that when one memcg is
+		 * active, its children will be active as well. But it is
+		 * perfectly possible that its parent is not.
+		 *
+		 * We also need to make sure we scan at least once, for the
+		 * global case. So if we don't have a target memcg (saved in
+		 * root), we proceed normally and expect to break in the next
+		 * round.
+		 */
+		do {
+			struct mem_cgroup *memcg = shrinkctl->target_mem_cgroup;
 
-			freed += shrink_slab_node(shrinkctl, shrinker,
-				 nr_pages_scanned, lru_pages,
-				 &shrinker->nr_deferred);
-			continue;
-		}
+			if (!memcg || memcg_kmem_is_active(memcg))
+				freed += shrink_slab_one(shrinkctl, shrinker,
+					 nr_pages_scanned, lru_pages);
 
-		for_each_node_mask(shrinkctl->nid, shrinkctl->nodes_to_scan) {
-			if (!node_online(shrinkctl->nid))
-				continue;
+			/*
+			 * For non-memcg aware shrinkers, we will arrive here
+			 * at first pass because we need to scan the root
+			 * memcg.  We need to bail out, since exactly because
+			 * they are not memcg aware, instead of noticing they
+			 * have nothing to shrink, they will just shrink again,
+			 * and deplete too many objects.
+			 */
+			if (!(shrinker->flags & SHRINKER_MEMCG_AWARE))
+				break;
 
-			freed += shrink_slab_node(shrinkctl, shrinker,
-				 nr_pages_scanned, lru_pages,
-				 &shrinker->nr_deferred_node[shrinkctl->nid]);
-		}
+			shrinkctl->target_mem_cgroup =
+				mem_cgroup_iter(root, memcg, NULL);
+		} while (shrinkctl->target_mem_cgroup);
+
+		/* restore original state */
+		shrinkctl->target_mem_cgroup = root;
 	}
 	up_read(&shrinker_rwsem);
 out:
-- 
1.8.1.4
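
The hierarchical scan that patch 30/35 describes can be sketched on a toy
tree. This is a simplified illustration, not the patch's code: the toy_*
names and the first_child/next_sibling layout are assumptions, and plain
recursion is used here where the patch walks the hierarchy with
mem_cgroup_iter().

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy memcg: a tree node with a "kmem active" flag and a cache size. */
struct toy_memcg {
	bool kmem_active;
	unsigned long cached_objects;
	struct toy_memcg *first_child;
	struct toy_memcg *next_sibling;
};

/* True if root, or any group below it, is kmem active. */
static bool toy_should_reclaim(const struct toy_memcg *root)
{
	const struct toy_memcg *child;

	if (!root)
		return false;
	if (root->kmem_active)
		return true;
	for (child = root->first_child; child; child = child->next_sibling)
		if (toy_should_reclaim(child))
			return true;
	return false;
}

/*
 * Free up to nr objects from every kmem-active group in the subtree,
 * skipping inactive groups but still descending through them.
 * Returns the total number of objects freed.
 */
static unsigned long toy_shrink_tree(struct toy_memcg *root, unsigned long nr)
{
	struct toy_memcg *child;
	unsigned long freed = 0;

	if (root->kmem_active) {
		unsigned long n = nr < root->cached_objects ?
			nr : root->cached_objects;

		root->cached_objects -= n;
		freed += n;
	}
	for (child = root->first_child; child; child = child->next_sibling)
		freed += toy_shrink_tree(child, nr);
	return freed;
}
```

An inactive parent with active descendants still reports that reclaim is
worthwhile, and the walk shrinks only the active groups beneath it.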

^ permalink raw reply related	[flat|nested] 103+ messages in thread

* Re: [PATCH v10 30/35] memcg: scan cache objects hierarchically
  2013-06-03 19:29   ` [PATCH v10 30/35] memcg: scan cache objects hierarchically Glauber Costa
@ 2013-06-05 23:08     ` Andrew Morton
  0 siblings, 0 replies; 103+ messages in thread
From: Andrew Morton @ 2013-06-05 23:08 UTC (permalink / raw)
  To: Glauber Costa
  Cc: linux-fsdevel, Mel Gorman, Dave Chinner, linux-mm, cgroups,
	kamezawa.hiroyu, Michal Hocko, Johannes Weiner, hughd,
	Greg Thelen, Dave Chinner, Rik van Riel

On Mon,  3 Jun 2013 23:29:59 +0400 Glauber Costa <glommer@openvz.org> wrote:

> When reaching shrink_slab, we should descent in children memcg searching

"descend into child memcgs"

> for objects that could be shrunk. This is true even if the memcg does

"can be"

> not have kmem limits on, since the kmem res_counter will also be billed
> against the user res_counter of the parent.
> 
> It is possible that we will free objects and not free any pages, that
> will just harm the child groups without helping the parent group at all.
> But at this point, we basically are prepared to pay the price.
> 
> ...
>
>  #ifdef CONFIG_MEMCG_KMEM
> +bool memcg_kmem_should_reclaim(struct mem_cgroup *memcg)
> +{
> +	struct mem_cgroup *iter;
> +
> +	for_each_mem_cgroup_tree(iter, memcg) {
> +		if (memcg_kmem_is_active(iter)) {
> +			mem_cgroup_iter_break(memcg, iter);
> +			return true;
> +		}
> +	}
> +	return false;
> +}

Locking requirements for this function?  Perhaps the
for_each_mem_cgroup_tree() definition site would be an appropriate
place to document this.

>  static inline bool memcg_can_account_kmem(struct mem_cgroup *memcg)
>  {
>  	return !mem_cgroup_disabled() && !mem_cgroup_is_root(memcg) &&
>
> ...
>


^ permalink raw reply	[flat|nested] 103+ messages in thread

* [PATCH v10 32/35] super: targeted memcg reclaim
       [not found] ` <1370287804-3481-1-git-send-email-glommer-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org>
                     ` (26 preceding siblings ...)
  2013-06-03 19:29   ` [PATCH v10 30/35] memcg: scan cache objects hierarchically Glauber Costa
@ 2013-06-03 19:30   ` Glauber Costa
  2013-06-03 19:30   ` [PATCH v10 33/35] memcg: move initialization to memcg creation Glauber Costa
                     ` (2 subsequent siblings)
  30 siblings, 0 replies; 103+ messages in thread
From: Glauber Costa @ 2013-06-03 19:30 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Mel Gorman, Dave Chinner,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, cgroups-u79uwXL29TY76Z2rM5mHXA,
	kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A, Michal Hocko,
	Johannes Weiner, hughd-hpIqsD4AKlfQT0dZR+AlfA, Greg Thelen,
	Glauber Costa, Dave Chinner, Rik van Riel

We now have all our dentries and inodes placed in memcg-specific LRU
lists. All we have to do is restrict the reclaim to the said lists in
case of memcg pressure.

That can't be done so easily for the fs_objects part of the equation,
since this is heavily fs-specific. What we do is pass on the context,
and let the filesystems decide whether they want to support it. At this
time, we just don't shrink them under memcg pressure (none is supported),
leaving that for global pressure only.

Marking the superblock shrinker and its LRUs as memcg-aware will
guarantee that the shrinkers will get invoked during targeted reclaim.

Signed-off-by: Glauber Costa <glommer-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org>
Cc: Dave Chinner <dchinner-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
Cc: Mel Gorman <mgorman-l3A5Bk7waGM@public.gmane.org>
Cc: Rik van Riel <riel-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
Cc: Johannes Weiner <hannes-druUgvl0LCNAfugRpC6u6w@public.gmane.org>
Cc: Michal Hocko <mhocko-AlSwsSmVLrQ@public.gmane.org>
Cc: Hugh Dickins <hughd-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
Cc: Kamezawa Hiroyuki <kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A@public.gmane.org>
Cc: Andrew Morton <akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org>
---
 fs/dcache.c   |  7 ++++---
 fs/inode.c    |  7 ++++---
 fs/internal.h |  5 +++--
 fs/super.c    | 39 ++++++++++++++++++++++++++-------------
 4 files changed, 37 insertions(+), 21 deletions(-)

diff --git a/fs/dcache.c b/fs/dcache.c
index e07aa73..cace5cd 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -889,13 +889,14 @@ dentry_lru_isolate(struct list_head *item, spinlock_t *lru_lock, void *arg)
  * use.
  */
 long prune_dcache_sb(struct super_block *sb, unsigned long nr_to_scan,
-		     int nid)
+		     int nid, struct mem_cgroup *memcg)
 {
 	LIST_HEAD(dispose);
 	long freed;
 
-	freed = list_lru_walk_node(&sb->s_dentry_lru, nid, dentry_lru_isolate,
-				       &dispose, &nr_to_scan);
+	freed = list_lru_walk_node_memcg(&sb->s_dentry_lru, nid,
+					dentry_lru_isolate, &dispose,
+					&nr_to_scan, memcg);
 	shrink_dentry_list(&dispose);
 	return freed;
 }
diff --git a/fs/inode.c b/fs/inode.c
index 00b804e..b9a8125 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -747,13 +747,14 @@ inode_lru_isolate(struct list_head *item, spinlock_t *lru_lock, void *arg)
  * then are freed outside inode_lock by dispose_list().
  */
 long prune_icache_sb(struct super_block *sb, unsigned long nr_to_scan,
-		     int nid)
+		     int nid, struct mem_cgroup *memcg)
 {
 	LIST_HEAD(freeable);
 	long freed;
 
-	freed = list_lru_walk_node(&sb->s_inode_lru, nid, inode_lru_isolate,
-				       &freeable, &nr_to_scan);
+	freed = list_lru_walk_node_memcg(&sb->s_inode_lru, nid,
+					inode_lru_isolate, &freeable,
+					&nr_to_scan, memcg);
 	dispose_list(&freeable);
 	return freed;
 }
diff --git a/fs/internal.h b/fs/internal.h
index 8902d56..601bd15 100644
--- a/fs/internal.h
+++ b/fs/internal.h
@@ -16,6 +16,7 @@ struct file_system_type;
 struct linux_binprm;
 struct path;
 struct mount;
+struct mem_cgroup;
 
 /*
  * block_dev.c
@@ -111,7 +112,7 @@ extern int open_check_o_direct(struct file *f);
  */
 extern spinlock_t inode_sb_list_lock;
 extern long prune_icache_sb(struct super_block *sb, unsigned long nr_to_scan,
-			    int nid);
+			    int nid, struct mem_cgroup *memcg);
 extern void inode_add_lru(struct inode *inode);
 
 /*
@@ -128,7 +129,7 @@ extern int invalidate_inodes(struct super_block *, bool);
  */
 extern struct dentry *__d_alloc(struct super_block *, const struct qstr *);
 extern long prune_dcache_sb(struct super_block *sb, unsigned long nr_to_scan,
-			    int nid);
+			    int nid, struct mem_cgroup *memcg);
 
 /*
  * read_write.c
diff --git a/fs/super.c b/fs/super.c
index adbbb1a..ff40e33 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -34,6 +34,7 @@
 #include <linux/cleancache.h>
 #include <linux/fsnotify.h>
 #include <linux/lockdep.h>
+#include <linux/memcontrol.h>
 #include "internal.h"
 
 
@@ -56,6 +57,7 @@ static char *sb_writers_name[SB_FREEZE_LEVELS] = {
 static long super_cache_scan(struct shrinker *shrink, struct shrink_control *sc)
 {
 	struct super_block *sb;
+	struct mem_cgroup *memcg = sc->target_mem_cgroup;
 	long	fs_objects = 0;
 	long	total_objects;
 	long	freed = 0;
@@ -74,11 +76,12 @@ static long super_cache_scan(struct shrinker *shrink, struct shrink_control *sc)
 	if (!grab_super_passive(sb))
 		return -1;
 
-	if (sb->s_op && sb->s_op->nr_cached_objects)
+	if (sb->s_op && sb->s_op->nr_cached_objects && !memcg)
 		fs_objects = sb->s_op->nr_cached_objects(sb, sc->nid);
 
-	inodes = list_lru_count_node(&sb->s_inode_lru, sc->nid);
-	dentries = list_lru_count_node(&sb->s_dentry_lru, sc->nid);
+	inodes = list_lru_count_node_memcg(&sb->s_inode_lru, sc->nid, memcg);
+	dentries = list_lru_count_node_memcg(&sb->s_dentry_lru, sc->nid, memcg);
+
 	total_objects = dentries + inodes + fs_objects + 1;
 
 	/* proportion the scan between the caches */
@@ -89,8 +92,8 @@ static long super_cache_scan(struct shrinker *shrink, struct shrink_control *sc)
 	 * prune the dcache first as the icache is pinned by it, then
 	 * prune the icache, followed by the filesystem specific caches
 	 */
-	freed = prune_dcache_sb(sb, dentries, sc->nid);
-	freed += prune_icache_sb(sb, inodes, sc->nid);
+	freed = prune_dcache_sb(sb, dentries, sc->nid, memcg);
+	freed += prune_icache_sb(sb, inodes, sc->nid, memcg);
 
 	if (fs_objects) {
 		fs_objects = mult_frac(sc->nr_to_scan, fs_objects,
@@ -107,20 +110,26 @@ static long super_cache_count(struct shrinker *shrink, struct shrink_control *sc
 {
 	struct super_block *sb;
 	long	total_objects = 0;
+	struct mem_cgroup *memcg = sc->target_mem_cgroup;
 
 	sb = container_of(shrink, struct super_block, s_shrink);
 
 	if (!grab_super_passive(sb))
 		return 0;
 
-	if (sb->s_op && sb->s_op->nr_cached_objects)
+	/*
+	 * Ideally we would pass memcg to nr_cached_objects, and
+	 * let the underlying filesystem decide. Most likely the
+	 * path will be if (!memcg) return;, but even then.
+	 */
+	if (sb->s_op && sb->s_op->nr_cached_objects && !memcg)
 		total_objects = sb->s_op->nr_cached_objects(sb,
 						 sc->nid);
 
-	total_objects += list_lru_count_node(&sb->s_dentry_lru,
-						 sc->nid);
-	total_objects += list_lru_count_node(&sb->s_inode_lru,
-						 sc->nid);
+	total_objects += list_lru_count_node_memcg(&sb->s_dentry_lru,
+						 sc->nid, memcg);
+	total_objects += list_lru_count_node_memcg(&sb->s_inode_lru,
+						 sc->nid, memcg);
 
 	total_objects = vfs_pressure_ratio(total_objects);
 	drop_super(sb);
@@ -199,8 +208,10 @@ static struct super_block *alloc_super(struct file_system_type *type, int flags)
 		INIT_HLIST_NODE(&s->s_instances);
 		INIT_HLIST_BL_HEAD(&s->s_anon);
 		INIT_LIST_HEAD(&s->s_inodes);
-		list_lru_init(&s->s_dentry_lru);
-		list_lru_init(&s->s_inode_lru);
+
+		list_lru_init_memcg(&s->s_dentry_lru);
+		list_lru_init_memcg(&s->s_inode_lru);
+
 		INIT_LIST_HEAD(&s->s_mounts);
 		init_rwsem(&s->s_umount);
 		lockdep_set_class(&s->s_umount, &type->s_umount_key);
@@ -236,7 +247,7 @@ static struct super_block *alloc_super(struct file_system_type *type, int flags)
 		s->s_shrink.scan_objects = super_cache_scan;
 		s->s_shrink.count_objects = super_cache_count;
 		s->s_shrink.batch = 1024;
-		s->s_shrink.flags = SHRINKER_NUMA_AWARE;
+		s->s_shrink.flags = SHRINKER_NUMA_AWARE | SHRINKER_MEMCG_AWARE;
 	}
 out:
 	return s;
@@ -319,6 +330,8 @@ void deactivate_locked_super(struct super_block *s)
 
 		/* caches are now gone, we can safely kill the shrinker now */
 		unregister_shrinker(&s->s_shrink);
+		list_lru_destroy(&s->s_dentry_lru);
+		list_lru_destroy(&s->s_inode_lru);
 		put_filesystem(fs);
 		put_super(s);
 	} else {
-- 
1.8.1.4
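
How the scan budget in super_cache_scan() is divided between the caches can
be sketched as below. This is a standalone illustration, assuming the
elided context proportions dentries and inodes with mult_frac() the same
way the visible hunk does for fs_objects. toy_mult_frac() reproduces the
arithmetic of the kernel's mult_frac() macro; toy_split_scan() and
struct toy_scan_split are hypothetical names.

```c
/*
 * x * numer / denom, computed as quotient and remainder so that the
 * intermediate product does not overflow for large x.
 */
static unsigned long toy_mult_frac(unsigned long x, unsigned long numer,
				   unsigned long denom)
{
	unsigned long quot = x / denom;
	unsigned long rem = x % denom;

	return quot * numer + (rem * numer) / denom;
}

struct toy_scan_split {
	unsigned long dentries;
	unsigned long inodes;
};

/*
 * Split nr_to_scan between the dentry and inode LRUs in proportion to
 * their object counts. Under memcg pressure the patch does not count
 * fs_objects, so callers modelling that case pass fs_objects == 0.
 */
static struct toy_scan_split toy_split_scan(unsigned long nr_to_scan,
					    unsigned long dentries,
					    unsigned long inodes,
					    unsigned long fs_objects)
{
	/* The +1 mirrors the patch and keeps the divisor non-zero. */
	unsigned long total = dentries + inodes + fs_objects + 1;
	struct toy_scan_split split;

	split.dentries = toy_mult_frac(nr_to_scan, dentries, total);
	split.inodes = toy_mult_frac(nr_to_scan, inodes, total);
	return split;
}
```

For example, with nr_to_scan = 1024, 300 dentries, 99 inodes and no
fs_objects, the total is 400 and the split is 768 dentries and 253 inodes.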

^ permalink raw reply related	[flat|nested] 103+ messages in thread

* [PATCH v10 33/35] memcg: move initialization to memcg creation
       [not found] ` <1370287804-3481-1-git-send-email-glommer-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org>
                     ` (27 preceding siblings ...)
  2013-06-03 19:30   ` [PATCH v10 32/35] super: targeted memcg reclaim Glauber Costa
@ 2013-06-03 19:30   ` Glauber Costa
  2013-06-03 19:30   ` [PATCH v10 34/35] vmpressure: in-kernel notifications Glauber Costa
  2013-06-03 19:30   ` [PATCH v10 35/35] memcg: reap dead memcgs upon global memory pressure Glauber Costa
  30 siblings, 0 replies; 103+ messages in thread
From: Glauber Costa @ 2013-06-03 19:30 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Mel Gorman, Dave Chinner,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, cgroups-u79uwXL29TY76Z2rM5mHXA,
	kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A, Michal Hocko,
	Johannes Weiner, hughd-hpIqsD4AKlfQT0dZR+AlfA, Greg Thelen,
	Glauber Costa, Dave Chinner, Rik van Riel

Those structures are only used for memcgs that are effectively using
kmemcg. However, in a later patch I intend to scan that list
unconditionally (list empty meaning no kmem caches present), which
simplifies the code a lot.

So move the initialization to early kmem creation.

Signed-off-by: Glauber Costa <glommer-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org>
Cc: Dave Chinner <dchinner-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
Cc: Mel Gorman <mgorman-l3A5Bk7waGM@public.gmane.org>
Cc: Rik van Riel <riel-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
Cc: Johannes Weiner <hannes-druUgvl0LCNAfugRpC6u6w@public.gmane.org>
Cc: Michal Hocko <mhocko-AlSwsSmVLrQ@public.gmane.org>
Cc: Hugh Dickins <hughd-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
Cc: Kamezawa Hiroyuki <kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A@public.gmane.org>
Cc: Andrew Morton <akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org>
---
 mm/memcontrol.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index aa6853f..c0e1113f 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3326,9 +3326,7 @@ int memcg_update_cache_sizes(struct mem_cgroup *memcg)
 
 	memcg_update_array_size(num + 1);
 
-	INIT_LIST_HEAD(&memcg->memcg_slab_caches);
 	INIT_WORK(&memcg->kmemcg_shrink_work, kmemcg_shrink_work_fn);
-	mutex_init(&memcg->slab_caches_mutex);
 
 	return 0;
 out:
@@ -6319,6 +6317,8 @@ static int memcg_init_kmem(struct mem_cgroup *memcg, struct cgroup_subsys *ss)
 {
 	int ret;
 
+	INIT_LIST_HEAD(&memcg->memcg_slab_caches);
+	mutex_init(&memcg->slab_caches_mutex);
 	memcg->kmemcg_id = -1;
 	ret = memcg_propagate_kmem(memcg);
 	if (ret)
-- 
1.8.1.4

^ permalink raw reply related	[flat|nested] 103+ messages in thread

* [PATCH v10 34/35] vmpressure: in-kernel notifications
       [not found] ` <1370287804-3481-1-git-send-email-glommer-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org>
                     ` (28 preceding siblings ...)
  2013-06-03 19:30   ` [PATCH v10 33/35] memcg: move initialization to memcg creation Glauber Costa
@ 2013-06-03 19:30   ` Glauber Costa
  2013-06-03 19:30   ` [PATCH v10 35/35] memcg: reap dead memcgs upon global memory pressure Glauber Costa
  30 siblings, 0 replies; 103+ messages in thread
From: Glauber Costa @ 2013-06-03 19:30 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Mel Gorman, Dave Chinner,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, cgroups-u79uwXL29TY76Z2rM5mHXA,
	kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A, Michal Hocko,
	Johannes Weiner, hughd-hpIqsD4AKlfQT0dZR+AlfA, Greg Thelen,
	Glauber Costa, John Stultz, Joonsoo Kim

From: Glauber Costa <glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>

During the past weeks, it became clear to us that the shrinker interface
we have right now works very well for some particular types of users,
but not that well for others. The latter are usually people interested in
one-shot notifications, who were forced to adapt themselves to the
count+scan behavior of shrinkers. To do so, they had no choice but to
greatly abuse the shrinker interface, producing little monsters all over.

During LSF/MM, one of the proposals that popped up during our session
was to reuse Anton Vorontsov's vmpressure for this. They are designed
for userspace consumption, but also provide a well-established,
cgroup-aware entry point for notifications.

This patch extends that to also support in-kernel users. Events that
should be generated for in-kernel consumption will be marked as such,
and for those, we will call a registered function instead of triggering
an eventfd notification.
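
The dispatch described above can be sketched in userspace as follows. This
is a rough model under stated assumptions, not the patch's implementation:
an int counter stands in for the eventfd, all toy_* names are hypothetical,
and the choice to call kernel callbacks regardless of level is an
assumption of this sketch.

```c
#include <stdbool.h>

/* Toy event: either a userspace counter or an in-kernel callback. */
struct toy_event {
	union {
		int *efd_counter;	/* stand-in for struct eventfd_ctx */
		void (*fn)(void);
	};
	int level;
	bool kernel_event;
};

static int toy_kernel_calls;

/* Example in-kernel consumer: just counts how often it was notified. */
static void toy_kernel_callback(void)
{
	toy_kernel_calls++;
}

/*
 * Walk the registered events. Kernel events invoke their callback;
 * userspace events are signalled only when the pressure level reaches
 * the level they registered for. Returns true if any userspace event
 * was signalled.
 */
static bool toy_dispatch(struct toy_event *events, int n, int level)
{
	bool signalled = false;
	int i;

	for (i = 0; i < n; i++) {
		struct toy_event *ev = &events[i];

		if (ev->kernel_event) {
			ev->fn();
		} else if (level >= ev->level) {
			(*ev->efd_counter)++;
			signalled = true;
		}
	}
	return signalled;
}
```

The kernel_event flag is what tells the walker which member of the union is
valid, so it has to be set at registration time and never changed afterwards.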

Please note that due to my lack of understanding of each shrinker user,
I will stay away from converting the actual users; you are all welcome
to do so.

Signed-off-by: Glauber Costa <glommer-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org>
Acked-by: Anton Vorontsov <anton-9xeibp6oKSgdnm+yROfE0A@public.gmane.org>
Acked-by: Pekka Enberg <penberg-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
Reviewed-by: Greg Thelen <gthelen-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
Cc: Dave Chinner <david-FqsqvQoI3Ljby3iVrkZq2A@public.gmane.org>
Cc: John Stultz <john.stultz-QSEj5FYQhm4dnm+yROfE0A@public.gmane.org>
Cc: Andrew Morton <akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org>
Cc: Joonsoo Kim <js1304-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
Cc: Michal Hocko <mhocko-AlSwsSmVLrQ@public.gmane.org>
Cc: Kamezawa Hiroyuki <kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A@public.gmane.org>
Cc: Johannes Weiner <hannes-druUgvl0LCNAfugRpC6u6w@public.gmane.org>
---
 include/linux/vmpressure.h |  6 ++++++
 mm/vmpressure.c            | 52 +++++++++++++++++++++++++++++++++++++++++++---
 2 files changed, 55 insertions(+), 3 deletions(-)

diff --git a/include/linux/vmpressure.h b/include/linux/vmpressure.h
index 76be077..3131e72 100644
--- a/include/linux/vmpressure.h
+++ b/include/linux/vmpressure.h
@@ -19,6 +19,9 @@ struct vmpressure {
 	/* Have to grab the lock on events traversal or modifications. */
 	struct mutex events_lock;
 
+	/* False if only kernel users want to be notified, true otherwise. */
+	bool notify_userspace;
+
 	struct work_struct work;
 };
 
@@ -36,6 +39,9 @@ extern struct vmpressure *css_to_vmpressure(struct cgroup_subsys_state *css);
 extern int vmpressure_register_event(struct cgroup *cg, struct cftype *cft,
 				     struct eventfd_ctx *eventfd,
 				     const char *args);
+
+extern int vmpressure_register_kernel_event(struct cgroup *cg,
+					    void (*fn)(void));
 extern void vmpressure_unregister_event(struct cgroup *cg, struct cftype *cft,
 					struct eventfd_ctx *eventfd);
 #else
diff --git a/mm/vmpressure.c b/mm/vmpressure.c
index 736a601..e16256e 100644
--- a/mm/vmpressure.c
+++ b/mm/vmpressure.c
@@ -135,8 +135,12 @@ static enum vmpressure_levels vmpressure_calc_level(unsigned long scanned,
 }
 
 struct vmpressure_event {
-	struct eventfd_ctx *efd;
+	union {
+		struct eventfd_ctx *efd;
+		void (*fn)(void);
+	};
 	enum vmpressure_levels level;
+	bool kernel_event;
 	struct list_head node;
 };
 
@@ -152,12 +156,15 @@ static bool vmpressure_event(struct vmpressure *vmpr,
 	mutex_lock(&vmpr->events_lock);
 
 	list_for_each_entry(ev, &vmpr->events, node) {
-		if (level >= ev->level) {
+		if (ev->kernel_event) {
+			ev->fn();
+		} else if (vmpr->notify_userspace && level >= ev->level) {
 			eventfd_signal(ev->efd, 1);
 			signalled = true;
 		}
 	}
 
+	vmpr->notify_userspace = false;
 	mutex_unlock(&vmpr->events_lock);
 
 	return signalled;
@@ -227,7 +234,7 @@ void vmpressure(gfp_t gfp, struct mem_cgroup *memcg,
 	 * we account it too.
 	 */
 	if (!(gfp & (__GFP_HIGHMEM | __GFP_MOVABLE | __GFP_IO | __GFP_FS)))
-		return;
+		goto schedule;
 
 	/*
 	 * If we got here with no pages scanned, then that is an indicator
@@ -244,8 +251,15 @@ void vmpressure(gfp_t gfp, struct mem_cgroup *memcg,
 	vmpr->scanned += scanned;
 	vmpr->reclaimed += reclaimed;
 	scanned = vmpr->scanned;
+	/*
+	 * If we didn't reach this point, only kernel events will be triggered.
+	 * It is the job of the worker thread to clean this up once the
+	 * notifications are all delivered.
+	 */
+	vmpr->notify_userspace = true;
 	mutex_unlock(&vmpr->sr_lock);
 
+schedule:
 	if (scanned < vmpressure_win || work_pending(&vmpr->work))
 		return;
 	schedule_work(&vmpr->work);
@@ -328,6 +342,38 @@ int vmpressure_register_event(struct cgroup *cg, struct cftype *cft,
 }
 
 /**
+ * vmpressure_register_kernel_event() - Register kernel-side notification
+ * @cg:		cgroup that is interested in vmpressure notifications
+ * @fn:		function to be called when pressure happens
+ *
+ * This function registers in-kernel users interested in receiving notifications
+ * about pressure conditions. Pressure notifications will be triggered at the
+ * same time as userspace notifications (with no particular ordering relative
+ * to them).
+ *
+ * Pressure notifications are an alternative method to shrinkers and will serve
+ * well users that are interested in a one-shot notification, with a
+ * well-defined cgroup aware interface.
+ */
+int vmpressure_register_kernel_event(struct cgroup *cg, void (*fn)(void))
+{
+	struct vmpressure *vmpr = cg_to_vmpressure(cg);
+	struct vmpressure_event *ev;
+
+	ev = kzalloc(sizeof(*ev), GFP_KERNEL);
+	if (!ev)
+		return -ENOMEM;
+
+	ev->kernel_event = true;
+	ev->fn = fn;
+
+	mutex_lock(&vmpr->events_lock);
+	list_add(&ev->node, &vmpr->events);
+	mutex_unlock(&vmpr->events_lock);
+	return 0;
+}
+
+/**
  * vmpressure_unregister_event() - Unbind eventfd from vmpressure
  * @cg:		cgroup handle
  * @cft:	cgroup control files handle
-- 
1.8.1.4

^ permalink raw reply related	[flat|nested] 103+ messages in thread

* [PATCH v10 35/35] memcg: reap dead memcgs upon global memory pressure.
       [not found] ` <1370287804-3481-1-git-send-email-glommer-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org>
                     ` (29 preceding siblings ...)
  2013-06-03 19:30   ` [PATCH v10 34/35] vmpressure: in-kernel notifications Glauber Costa
@ 2013-06-03 19:30   ` Glauber Costa
  2013-06-05 23:09     ` Andrew Morton
  30 siblings, 1 reply; 103+ messages in thread
From: Glauber Costa @ 2013-06-03 19:30 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Mel Gorman, Dave Chinner,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, cgroups-u79uwXL29TY76Z2rM5mHXA,
	kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A, Michal Hocko,
	Johannes Weiner, hughd-hpIqsD4AKlfQT0dZR+AlfA, Greg Thelen,
	Glauber Costa, Dave Chinner, Rik van Riel

When we delete kmem-enabled memcgs, they can still be zombieing
around for a while. The reason is that the objects may still be alive,
and we won't be able to delete them at destruction time.

The only entry point for that, though, are the shrinkers. The
shrinker interface, however, is not exactly tailored to our needs. It
could be a little bit better by using the API Dave Chinner proposed, but
it is still not ideal since we aren't really a count-and-scan event, but
more a one-off flush-all-you-can event that would have to abuse that
somehow.
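
To make the "one-off flush-all-you-can" idea concrete, here is a small
userspace sketch (not kernel code). It assumes simplified stand-in types:
struct cache, struct dead_group, cache_shrink and reap_dead_groups are
hypothetical names for this example only. Locking is omitted, and the
handler returns the number of objects freed purely so that the result can
be checked; a callback registered through the in-kernel vmpressure
interface returns nothing.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a per-memcg slab cache. */
struct cache {
	long cached_objects;	/* objects a shrink pass can release */
	struct cache *next;
};

/* Hypothetical stand-in for a removed memcg still pinned by objects. */
struct dead_group {
	struct cache *caches;
	struct dead_group *next;
};

struct dead_group *dangling;	/* models the list of dangling memcgs */

/* Stand-in for shrinking one cache: release everything reclaimable. */
long cache_shrink(struct cache *c)
{
	long freed = c->cached_objects;

	assert(freed >= 0);
	c->cached_objects = 0;
	return freed;
}

/*
 * One-shot pressure handler: no count/scan negotiation, just walk every
 * dead group and flush each of its caches.
 */
long reap_dead_groups(void)
{
	struct dead_group *g;
	struct cache *c;
	long freed = 0;

	for (g = dangling; g; g = g->next)
		for (c = g->caches; c; c = c->next)
			freed += cache_shrink(c);
	return freed;
}
```

The point of the sketch is the shape of the handler: it takes no scan
target and reports no object count up front, which is why it fits a
notification better than a count-and-scan shrinker.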

Signed-off-by: Glauber Costa <glommer-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org>
Cc: Dave Chinner <dchinner-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
Cc: Mel Gorman <mgorman-l3A5Bk7waGM@public.gmane.org>
Cc: Rik van Riel <riel-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
Cc: Johannes Weiner <hannes-druUgvl0LCNAfugRpC6u6w@public.gmane.org>
Cc: Michal Hocko <mhocko-AlSwsSmVLrQ@public.gmane.org>
Cc: Hugh Dickins <hughd-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
Cc: Kamezawa Hiroyuki <kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A@public.gmane.org>
Cc: Andrew Morton <akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org>
---
 mm/memcontrol.c | 52 ++++++++++++++++++++++++++++++++++++++++++++++------
 1 file changed, 46 insertions(+), 6 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index c0e1113f..919fb24b 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -400,7 +400,6 @@ static size_t memcg_size(void)
 		nr_node_ids * sizeof(struct mem_cgroup_per_node);
 }
 
-#ifdef CONFIG_MEMCG_DEBUG_ASYNC_DESTROY
 static LIST_HEAD(dangling_memcgs);
 static DEFINE_MUTEX(dangling_memcgs_mutex);
 
@@ -409,11 +408,14 @@ static inline void memcg_dangling_free(struct mem_cgroup *memcg)
 	mutex_lock(&dangling_memcgs_mutex);
 	list_del(&memcg->dead);
 	mutex_unlock(&dangling_memcgs_mutex);
+#ifdef CONFIG_MEMCG_DEBUG_ASYNC_DESTROY
 	free_pages((unsigned long)memcg->memcg_name, 0);
+#endif
 }
 
 static inline void memcg_dangling_add(struct mem_cgroup *memcg)
 {
+#ifdef CONFIG_MEMCG_DEBUG_ASYNC_DESTROY
 	/*
 	 * cgroup.c will do page-sized allocations most of the time,
 	 * so we'll just follow the pattern. Also, __get_free_pages
@@ -439,15 +441,12 @@ static inline void memcg_dangling_add(struct mem_cgroup *memcg)
 	}
 
 add_list:
+#endif
 	INIT_LIST_HEAD(&memcg->dead);
 	mutex_lock(&dangling_memcgs_mutex);
 	list_add(&memcg->dead, &dangling_memcgs);
 	mutex_unlock(&dangling_memcgs_mutex);
 }
-#else
-static inline void memcg_dangling_free(struct mem_cgroup *memcg) {}
-static inline void memcg_dangling_add(struct mem_cgroup *memcg) {}
-#endif
 
 static DEFINE_MUTEX(set_limit_mutex);
 
@@ -6313,6 +6312,41 @@ static int mem_cgroup_oom_control_write(struct cgroup *cgrp,
 }
 
 #ifdef CONFIG_MEMCG_KMEM
+static void memcg_vmpressure_shrink_dead(void)
+{
+	struct memcg_cache_params *params, *tmp;
+	struct kmem_cache *cachep;
+	struct mem_cgroup *memcg;
+
+	mutex_lock(&dangling_memcgs_mutex);
+	list_for_each_entry(memcg, &dangling_memcgs, dead) {
+		mutex_lock(&memcg->slab_caches_mutex);
+		/* The element may go away as an indirect result of shrink */
+		list_for_each_entry_safe(params, tmp,
+					 &memcg->memcg_slab_caches, list) {
+			cachep = memcg_params_to_cache(params);
+			/*
+			 * the cpu_hotplug lock is taken in kmem_cache_create
+			 * outside the slab_caches_mutex manipulation. It will
+			 * be taken by kmem_cache_shrink to flush the cache.
+			 * So we need to drop the lock. It is all right because
+			 * the lock only protects elements moving in and out the
+			 * list.
+			 */
+			mutex_unlock(&memcg->slab_caches_mutex);
+			kmem_cache_shrink(cachep);
+			mutex_lock(&memcg->slab_caches_mutex);
+		}
+		mutex_unlock(&memcg->slab_caches_mutex);
+	}
+	mutex_unlock(&dangling_memcgs_mutex);
+}
+
+static void memcg_register_kmem_events(struct cgroup *cont)
+{
+	vmpressure_register_kernel_event(cont, memcg_vmpressure_shrink_dead);
+}
+
 static int memcg_init_kmem(struct mem_cgroup *memcg, struct cgroup_subsys *ss)
 {
 	int ret;
@@ -6348,6 +6382,10 @@ static void kmem_cgroup_destroy(struct mem_cgroup *memcg)
 	}
 }
 #else
+static inline void memcg_register_kmem_events(struct cgroup *cont)
+{
+}
+
 static int memcg_init_kmem(struct mem_cgroup *memcg, struct cgroup_subsys *ss)
 {
 	return 0;
@@ -6733,8 +6771,10 @@ mem_cgroup_css_online(struct cgroup *cont)
 	struct mem_cgroup *memcg, *parent;
 	int error = 0;
 
-	if (!cont->parent)
+	if (!cont->parent) {
+		memcg_register_kmem_events(cont);
 		return 0;
+	}
 
 	mutex_lock(&memcg_create_mutex);
 	memcg = mem_cgroup_from_cont(cont);
-- 
1.8.1.4

^ permalink raw reply related	[flat|nested] 103+ messages in thread

* Re: [PATCH v10 35/35] memcg: reap dead memcgs upon global memory pressure.
  2013-06-03 19:30   ` [PATCH v10 35/35] memcg: reap dead memcgs upon global memory pressure Glauber Costa
@ 2013-06-05 23:09     ` Andrew Morton
  2013-06-06  8:33       ` Glauber Costa
  0 siblings, 1 reply; 103+ messages in thread
From: Andrew Morton @ 2013-06-05 23:09 UTC (permalink / raw)
  To: Glauber Costa
  Cc: linux-fsdevel, Mel Gorman, Dave Chinner, linux-mm, cgroups,
	kamezawa.hiroyu, Michal Hocko, Johannes Weiner, hughd,
	Greg Thelen, Dave Chinner, Rik van Riel

On Mon,  3 Jun 2013 23:30:04 +0400 Glauber Costa <glommer@openvz.org> wrote:

> When we delete kmem-enabled memcgs, they can still be zombieing
> around for a while. The reason is that the objects may still be alive,
> and we won't be able to delete them at destruction time.
> 
> The only entry point for that, though, are the shrinkers. The
> shrinker interface, however, is not exactly tailored to our needs. It
> could be a little bit better by using the API Dave Chinner proposed, but
> it is still not ideal since we aren't really a count-and-scan event, but
> more a one-off flush-all-you-can event that would have to abuse that
> somehow.

This patch is significantly dependent on
http://ozlabs.org/~akpm/mmots/broken-out/memcg-debugging-facility-to-access-dangling-memcgs.patch,
which was designated "mm only debug patch" when I merged it six months
ago.

We can go ahead and merge
memcg-debugging-facility-to-access-dangling-memcgs.patch upstream I
guess, but we shouldn't do that just because it makes the
patch-wrangling a bit easier!

Is memcg-debugging-facility-to-access-dangling-memcgs.patch worth merging in
its own right?  If so, what changed since our earlier decision?

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH v10 35/35] memcg: reap dead memcgs upon global memory pressure.
  2013-06-05 23:09     ` Andrew Morton
@ 2013-06-06  8:33       ` Glauber Costa
  0 siblings, 0 replies; 103+ messages in thread
From: Glauber Costa @ 2013-06-06  8:33 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Glauber Costa, linux-fsdevel, Mel Gorman, Dave Chinner, linux-mm,
	cgroups, kamezawa.hiroyu, Michal Hocko, Johannes Weiner, hughd,
	Greg Thelen, Dave Chinner, Rik van Riel

On 06/06/2013 03:09 AM, Andrew Morton wrote:
> On Mon,  3 Jun 2013 23:30:04 +0400 Glauber Costa <glommer@openvz.org> wrote:
> 
>> When we delete kmem-enabled memcgs, they can still be zombieing
>> around for a while. The reason is that the objects may still be alive,
>> and we won't be able to delete them at destruction time.
>>
>> The only entry point for that, though, are the shrinkers. The
>> shrinker interface, however, is not exactly tailored to our needs. It
>> could be a little bit better by using the API Dave Chinner proposed, but
>> it is still not ideal since we aren't really a count-and-scan event, but
>> more a one-off flush-all-you-can event that would have to abuse that
>> somehow.
> 
> This patch is significantly dependent on
> http://ozlabs.org/~akpm/mmots/broken-out/memcg-debugging-facility-to-access-dangling-memcgs.patch,
> which was designated "mm only debug patch" when I merged it six months
> ago.
> 
> We can go ahead and merge
> memcg-debugging-facility-to-access-dangling-memcgs.patch upstream I
> guess, but we shouldn't do that just because it makes the
> patch-wrangling a bit easier!
> 
> Is memcg-debugging-facility-to-access-dangling-memcgs.patch worth merging in
> its own right?  If so, what changed since our earlier decision?
> 

I was under the impression that it *was* merged, even though it
shouldn't have been - it was showing up on -next, so I could be wrong. I am
basically using part of the infrastructure for this patch, but the rest
can go away.

If the patch isn't really merged and I was just confused (can happen),
what I would prefer to do is what I have done originally: I will append
part of that in this patch (the part that adds memcgs to the dangling
list), and leave the file part in a separate patch. I will then resend
you that patch as a debug-only patch.

To do that, it would be most helpful if you could remove that from your
tree temporarily.

^ permalink raw reply	[flat|nested] 103+ messages in thread

* [PATCH v10 05/35] dcache: remove dentries from LRU before putting on dispose list
  2013-06-03 19:29 [PATCH v10 00/35] kmemcg shrinkers Glauber Costa
  2013-06-03 19:29 ` [PATCH v10 02/35] super: fix calculation of shrinkable objects for small numbers Glauber Costa
       [not found] ` <1370287804-3481-1-git-send-email-glommer-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org>
@ 2013-06-03 19:29 ` Glauber Costa
  2013-06-05 23:07   ` Andrew Morton
  2013-06-03 19:29 ` [PATCH v10 20/35] drivers: convert shrinkers to new count/scan API Glauber Costa
                   ` (2 subsequent siblings)
  5 siblings, 1 reply; 103+ messages in thread
From: Glauber Costa @ 2013-06-03 19:29 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-fsdevel, Mel Gorman, Dave Chinner, linux-mm, cgroups,
	kamezawa.hiroyu, Michal Hocko, Johannes Weiner, hughd,
	Greg Thelen, Dave Chinner

From: Dave Chinner <dchinner@redhat.com>

One of the big problems with modifying the way the dcache shrinker
and LRU implementation works is that the LRU is abused in several
ways. One of these is shrink_dentry_list().

Basically, we can move a dentry off the LRU onto a different list
without doing any accounting changes, and then use dentry_lru_prune()
to remove it from whatever list it is now on to do the LRU
accounting at that point.

This makes it -really hard- to change the LRU implementation. The
use of the per-sb LRU lock serialises movement of the dentries
between the different lists and the removal of them, and this is the
only reason that it works. If we want to break up the dentry LRU
lock and lists into, say, per-node lists, we remove the only
serialisation that allows this lru list/dispose list abuse to work.

To make this work effectively, the dispose list has to be isolated
from the LRU list - dentries have to be removed from the LRU
*before* being placed on the dispose list. This means that the LRU
accounting and isolation is completed before disposal is started,
and that means we can change the LRU implementation freely in
future.

This means that dentries *must* be marked with DCACHE_SHRINK_LIST
when they are placed on the dispose list so that we don't think that
parent dentries found in try_prune_one_dentry() are on the LRU when
they are actually on the dispose list. This would result in
accounting the dentry to the LRU a second time. Hence
dentry_lru_del() has to handle the DCACHE_SHRINK_LIST case
differently because the dentry isn't on the LRU list.
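
The accounting rule described above can be sketched in a few lines of
userspace C (not kernel code). This is a simplified model under stated
assumptions: struct entry, struct lru, lru_add, lru_isolate and lru_del
are hypothetical names, ON_SHRINK_LIST models DCACHE_SHRINK_LIST, and list
membership is reduced to a boolean. Locking is omitted.

```c
#include <assert.h>
#include <stdbool.h>

#define ON_SHRINK_LIST 0x1u	/* models DCACHE_SHRINK_LIST */

struct entry {
	unsigned int flags;
	bool on_list;		/* linked on some list: LRU or dispose */
};

struct lru {
	long nr_unused;		/* entries accounted to the LRU */
};

void lru_add(struct lru *lru, struct entry *e)
{
	if (!e->on_list) {
		e->on_list = true;
		lru->nr_unused++;
	}
}

/*
 * Isolation: the entry leaves the LRU for the dispose list. The LRU
 * accounting is settled here, before disposal starts, and the entry is
 * flagged so later code knows it is no longer on the LRU.
 */
void lru_isolate(struct lru *lru, struct entry *e)
{
	assert(e->on_list && !(e->flags & ON_SHRINK_LIST));
	lru->nr_unused--;
	e->flags |= ON_SHRINK_LIST;
}

/* Removal that has to cope with both cases. */
void lru_del(struct lru *lru, struct entry *e)
{
	if (e->flags & ON_SHRINK_LIST) {
		/* Already unaccounted at isolation time: only unlink. */
		e->on_list = false;
		e->flags &= ~ON_SHRINK_LIST;
		return;
	}
	if (e->on_list) {
		e->on_list = false;
		lru->nr_unused--;
	}
}
```

Without the flag check in lru_del(), an entry reached through the parent
walk while sitting on the dispose list would be subtracted from nr_unused a
second time.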

[ v2: don't decrement nr unused twice, spotted by Sha Zhengju ]
[ v7: (dchinner)
- shrink list leaks dentries when inode/parent can't be locked in
  dentry_kill().
- remove the readdition of dentry_lru_prune(). ]

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/dcache.c | 98 ++++++++++++++++++++++++++++++++++++++++++++++++-------------
 1 file changed, 77 insertions(+), 21 deletions(-)

diff --git a/fs/dcache.c b/fs/dcache.c
index 9d8ec4a..03d0c21 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -315,7 +315,7 @@ static void dentry_unlink_inode(struct dentry * dentry)
 }
 
 /*
- * dentry_lru_(add|del|prune|move_tail) must be called with d_lock held.
+ * dentry_lru_(add|del|move_list) must be called with d_lock held.
  */
 static void dentry_lru_add(struct dentry *dentry)
 {
@@ -331,16 +331,25 @@ static void dentry_lru_add(struct dentry *dentry)
 static void __dentry_lru_del(struct dentry *dentry)
 {
 	list_del_init(&dentry->d_lru);
-	dentry->d_flags &= ~DCACHE_SHRINK_LIST;
 	dentry->d_sb->s_nr_dentry_unused--;
 	this_cpu_dec(nr_dentry_unused);
 }
 
 /*
  * Remove a dentry with references from the LRU.
+ *
+ * If we are on the shrink list, then we can get to try_prune_one_dentry() and
+ * lose our last reference through the parent walk. In this case, we need to
+ * remove ourselves from the shrink list, not the LRU.
  */
 static void dentry_lru_del(struct dentry *dentry)
 {
+	if (dentry->d_flags & DCACHE_SHRINK_LIST) {
+		list_del_init(&dentry->d_lru);
+		dentry->d_flags &= ~DCACHE_SHRINK_LIST;
+		return;
+	}
+
 	if (!list_empty(&dentry->d_lru)) {
 		spin_lock(&dentry->d_sb->s_dentry_lru_lock);
 		__dentry_lru_del(dentry);
@@ -350,13 +359,15 @@ static void dentry_lru_del(struct dentry *dentry)
 
 static void dentry_lru_move_list(struct dentry *dentry, struct list_head *list)
 {
+	BUG_ON(dentry->d_flags & DCACHE_SHRINK_LIST);
+
 	spin_lock(&dentry->d_sb->s_dentry_lru_lock);
 	if (list_empty(&dentry->d_lru)) {
 		list_add_tail(&dentry->d_lru, list);
-		dentry->d_sb->s_nr_dentry_unused++;
-		this_cpu_inc(nr_dentry_unused);
 	} else {
 		list_move_tail(&dentry->d_lru, list);
+		dentry->d_sb->s_nr_dentry_unused--;
+		this_cpu_dec(nr_dentry_unused);
 	}
 	spin_unlock(&dentry->d_sb->s_dentry_lru_lock);
 }
@@ -454,7 +465,8 @@ EXPORT_SYMBOL(d_drop);
  * If ref is non-zero, then decrement the refcount too.
  * Returns dentry requiring refcount drop, or NULL if we're done.
  */
-static inline struct dentry *dentry_kill(struct dentry *dentry, int ref)
+static inline struct dentry *
+dentry_kill(struct dentry *dentry, int ref, int unlock_on_failure)
 	__releases(dentry->d_lock)
 {
 	struct inode *inode;
@@ -463,8 +475,10 @@ static inline struct dentry *dentry_kill(struct dentry *dentry, int ref)
 	inode = dentry->d_inode;
 	if (inode && !spin_trylock(&inode->i_lock)) {
 relock:
-		spin_unlock(&dentry->d_lock);
-		cpu_relax();
+		if (unlock_on_failure) {
+			spin_unlock(&dentry->d_lock);
+			cpu_relax();
+		}
 		return dentry; /* try again with same dentry */
 	}
 	if (IS_ROOT(dentry))
@@ -551,7 +565,7 @@ repeat:
 	return;
 
 kill_it:
-	dentry = dentry_kill(dentry, 1);
+	dentry = dentry_kill(dentry, 1, 1);
 	if (dentry)
 		goto repeat;
 }
@@ -750,12 +764,12 @@ EXPORT_SYMBOL(d_prune_aliases);
  *
  * This may fail if locks cannot be acquired no problem, just try again.
  */
-static void try_prune_one_dentry(struct dentry *dentry)
+static struct dentry * try_prune_one_dentry(struct dentry *dentry)
 	__releases(dentry->d_lock)
 {
 	struct dentry *parent;
 
-	parent = dentry_kill(dentry, 0);
+	parent = dentry_kill(dentry, 0, 0);
 	/*
 	 * If dentry_kill returns NULL, we have nothing more to do.
 	 * if it returns the same dentry, trylocks failed. In either
@@ -767,9 +781,9 @@ static void try_prune_one_dentry(struct dentry *dentry)
 	 * fragmentation.
 	 */
 	if (!parent)
-		return;
+		return NULL;
 	if (parent == dentry)
-		return;
+		return dentry;
 
 	/* Prune ancestors. */
 	dentry = parent;
@@ -778,10 +792,11 @@ static void try_prune_one_dentry(struct dentry *dentry)
 		if (dentry->d_count > 1) {
 			dentry->d_count--;
 			spin_unlock(&dentry->d_lock);
-			return;
+			return NULL;
 		}
-		dentry = dentry_kill(dentry, 1);
+		dentry = dentry_kill(dentry, 1, 1);
 	}
+	return NULL;
 }
 
 static void shrink_dentry_list(struct list_head *list)
@@ -800,21 +815,31 @@ static void shrink_dentry_list(struct list_head *list)
 		}
 
 		/*
+		 * The dispose list is isolated and dentries are not accounted
+		 * to the LRU here, so we can simply remove it from the list
+		 * here regardless of whether it is referenced or not.
+		 */
+		list_del_init(&dentry->d_lru);
+		dentry->d_flags &= ~DCACHE_SHRINK_LIST;
+
+		/*
 		 * We found an inuse dentry which was not removed from
-		 * the LRU because of laziness during lookup.  Do not free
-		 * it - just keep it off the LRU list.
+		 * the LRU because of laziness during lookup. Do not free it.
 		 */
 		if (dentry->d_count) {
-			dentry_lru_del(dentry);
 			spin_unlock(&dentry->d_lock);
 			continue;
 		}
-
 		rcu_read_unlock();
 
-		try_prune_one_dentry(dentry);
+		dentry = try_prune_one_dentry(dentry);
 
 		rcu_read_lock();
+		if (dentry) {
+			dentry->d_flags |= DCACHE_SHRINK_LIST;
+			list_add(&dentry->d_lru, list);
+			spin_unlock(&dentry->d_lock);
+		}
 	}
 	rcu_read_unlock();
 }
@@ -855,8 +880,10 @@ relock:
 			list_move(&dentry->d_lru, &referenced);
 			spin_unlock(&dentry->d_lock);
 		} else {
-			list_move_tail(&dentry->d_lru, &tmp);
+			list_move(&dentry->d_lru, &tmp);
 			dentry->d_flags |= DCACHE_SHRINK_LIST;
+			this_cpu_dec(nr_dentry_unused);
+			sb->s_nr_dentry_unused--;
 			spin_unlock(&dentry->d_lock);
 			if (!--count)
 				break;
@@ -870,6 +897,27 @@ relock:
 	shrink_dentry_list(&tmp);
 }
 
+/*
+ * Mark all the dentries as being on the dispose list so we don't think they are
+ * still on the LRU if we try to kill them from ascending the parent chain in
+ * try_prune_one_dentry() rather than directly from the dispose list.
+ */
+static void
+shrink_dcache_list(
+	struct list_head *dispose)
+{
+	struct dentry *dentry;
+
+	rcu_read_lock();
+	list_for_each_entry_rcu(dentry, dispose, d_lru) {
+		spin_lock(&dentry->d_lock);
+		dentry->d_flags |= DCACHE_SHRINK_LIST;
+		spin_unlock(&dentry->d_lock);
+	}
+	rcu_read_unlock();
+	shrink_dentry_list(dispose);
+}
+
 /**
  * shrink_dcache_sb - shrink dcache for a superblock
  * @sb: superblock
@@ -883,9 +931,17 @@ void shrink_dcache_sb(struct super_block *sb)
 
 	spin_lock(&sb->s_dentry_lru_lock);
 	while (!list_empty(&sb->s_dentry_lru)) {
+		/*
+		 * account for removal here so we don't need to handle it later
+		 * even though the dentry is no longer on the lru list.
+		 */
 		list_splice_init(&sb->s_dentry_lru, &tmp);
+		this_cpu_sub(nr_dentry_unused, sb->s_nr_dentry_unused);
+		sb->s_nr_dentry_unused = 0;
 		spin_unlock(&sb->s_dentry_lru_lock);
-		shrink_dentry_list(&tmp);
+
+		shrink_dcache_list(&tmp);
+
 		spin_lock(&sb->s_dentry_lru_lock);
 	}
 	spin_unlock(&sb->s_dentry_lru_lock);
-- 
1.8.1.4

^ permalink raw reply related	[flat|nested] 103+ messages in thread

* Re: [PATCH v10 05/35] dcache: remove dentries from LRU before putting on dispose list
  2013-06-03 19:29 ` [PATCH v10 05/35] dcache: remove dentries from LRU before putting on dispose list Glauber Costa
@ 2013-06-05 23:07   ` Andrew Morton
  2013-06-06  8:04     ` Glauber Costa
  0 siblings, 1 reply; 103+ messages in thread
From: Andrew Morton @ 2013-06-05 23:07 UTC (permalink / raw)
  To: Glauber Costa
  Cc: linux-fsdevel, Mel Gorman, Dave Chinner, linux-mm, cgroups,
	kamezawa.hiroyu, Michal Hocko, Johannes Weiner, hughd,
	Greg Thelen, Dave Chinner

On Mon,  3 Jun 2013 23:29:34 +0400 Glauber Costa <glommer@openvz.org> wrote:

> From: Dave Chinner <dchinner@redhat.com>
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>

Several of these patches were missing your (Glauber's) Signed-off-by:. 
I added this in my copies.

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH v10 05/35] dcache: remove dentries from LRU before putting on dispose list
  2013-06-05 23:07   ` Andrew Morton
@ 2013-06-06  8:04     ` Glauber Costa
  0 siblings, 0 replies; 103+ messages in thread
From: Glauber Costa @ 2013-06-06  8:04 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Glauber Costa, linux-fsdevel, Mel Gorman, Dave Chinner, linux-mm,
	cgroups, kamezawa.hiroyu, Michal Hocko, Johannes Weiner, hughd,
	Greg Thelen, Dave Chinner

On 06/06/2013 03:07 AM, Andrew Morton wrote:
> On Mon,  3 Jun 2013 23:29:34 +0400 Glauber Costa <glommer@openvz.org> wrote:
> 
>> From: Dave Chinner <dchinner@redhat.com>
>>
>> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> 
> Several of these patches were missing your (Glauber's) Signed-off-by:. 
> I added this in my copies.
> 
I remember updating them all when Mel complained.
This one in particular I might have missed, since Dave provided an
updated copy after that fact - and I may have forgotten to update it.

Thanks for noting, I will go through them all again making sure they're
all there.

^ permalink raw reply	[flat|nested] 103+ messages in thread

* [PATCH v10 20/35] drivers: convert shrinkers to new count/scan API
  2013-06-03 19:29 [PATCH v10 00/35] kmemcg shrinkers Glauber Costa
                   ` (2 preceding siblings ...)
  2013-06-03 19:29 ` [PATCH v10 05/35] dcache: remove dentries from LRU before putting on dispose list Glauber Costa
@ 2013-06-03 19:29 ` Glauber Costa
  2013-06-03 19:30 ` [PATCH v10 31/35] vmscan: take at least one pass with shrinkers Glauber Costa
  2013-06-05 23:07 ` [PATCH v10 00/35] kmemcg shrinkers Andrew Morton
  5 siblings, 0 replies; 103+ messages in thread
From: Glauber Costa @ 2013-06-03 19:29 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-fsdevel, Mel Gorman, Dave Chinner, linux-mm, cgroups,
	kamezawa.hiroyu, Michal Hocko, Johannes Weiner, hughd,
	Greg Thelen, Dave Chinner, Glauber Costa, Daniel Vetter,
	Kent Overstreet, Arve Hjønnevåg, John Stultz,
	David Rientjes, Jerome Glisse, Thomas Hellstrom

From: Dave Chinner <dchinner@redhat.com>

Convert the driver shrinkers to the new API. Most changes are
compile tested only because I either don't have the hardware or it's
staging stuff.
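
For readers unfamiliar with the conversion, the following userspace sketch
(not kernel code) contrasts the two callback shapes. It is a simplified
model: struct pool, struct shrink_request, pool_shrink_legacy, pool_count
and pool_scan are hypothetical names, with shrink_request standing in for
struct shrink_control. The return type follows the long used by this
series.

```c
#include <assert.h>

/* Models the nr_to_scan field of struct shrink_control. */
struct shrink_request {
	long nr_to_scan;
};

/* Hypothetical driver cache holding freeable objects. */
struct pool {
	long nr_free;
};

/*
 * Old style: one callback does both jobs. nr_to_scan == 0 means
 * "just report the count"; otherwise free objects, then report what
 * is left.
 */
long pool_shrink_legacy(struct pool *p, struct shrink_request *sc)
{
	long n = sc->nr_to_scan < p->nr_free ? sc->nr_to_scan : p->nr_free;

	p->nr_free -= n;
	return p->nr_free;
}

/* New style, part 1: only report how many objects could be freed. */
long pool_count(struct pool *p)
{
	return p->nr_free;
}

/* New style, part 2: free up to nr_to_scan objects, return how many. */
long pool_scan(struct pool *p, struct shrink_request *sc)
{
	long freed;

	assert(sc->nr_to_scan >= 0);
	freed = sc->nr_to_scan < p->nr_free ? sc->nr_to_scan : p->nr_free;
	p->nr_free -= freed;
	return freed;
}
```

The split removes the overloaded return value: the count callback never
frees anything, and the scan callback reports progress rather than the
remaining population.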

FWIW, the md and android code is pretty good, but the rest of it
makes me want to claw my eyes out.  The amount of broken code I just
encountered is mind boggling.  I've added comments explaining what
is broken, but I fear that some of the code would be best dealt with
by being dragged behind the bike shed, buried in mud up to its
neck and then run over repeatedly with a blunt lawn mower.

Special mention goes to the zcache/zcache2 drivers. They can't
co-exist in the build at the same time, they are under different
menu options in menuconfig, they only show up when you've got the
right set of mm subsystem options configured and so even compile
testing is an exercise in pulling teeth.  And that doesn't even take
into account the horrible, broken code...

[ glommer: fixes for i915, android lowmem, zcache, bcache ]
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Glauber Costa <glommer@openvz.org>
Acked-by: Mel Gorman <mgorman@suse.de>
CC: Daniel Vetter <daniel.vetter@ffwll.ch>
CC: Kent Overstreet <koverstreet@google.com>
CC: Arve Hjønnevåg <arve@android.com>
CC: John Stultz <john.stultz@linaro.org>
CC: David Rientjes <rientjes@google.com>
CC: Jerome Glisse <jglisse@redhat.com>
CC: Thomas Hellstrom <thellstrom@vmware.com>
---
 drivers/gpu/drm/i915/i915_dma.c           |  4 +-
 drivers/gpu/drm/i915/i915_gem.c           | 67 ++++++++++++++++++++++---------
 drivers/gpu/drm/ttm/ttm_page_alloc.c      | 48 ++++++++++++++--------
 drivers/gpu/drm/ttm/ttm_page_alloc_dma.c  | 55 ++++++++++++++++---------
 drivers/md/bcache/btree.c                 | 43 ++++++++++----------
 drivers/md/bcache/sysfs.c                 |  2 +-
 drivers/md/dm-bufio.c                     | 65 +++++++++++++++++++-----------
 drivers/staging/android/ashmem.c          | 45 ++++++++++++++-------
 drivers/staging/android/lowmemorykiller.c | 40 ++++++++++--------
 drivers/staging/zcache/zcache-main.c      | 29 ++++++++-----
 10 files changed, 256 insertions(+), 142 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_dma.c b/drivers/gpu/drm/i915/i915_dma.c
index 3cd2b60..6d80cc4 100644
--- a/drivers/gpu/drm/i915/i915_dma.c
+++ b/drivers/gpu/drm/i915/i915_dma.c
@@ -1676,7 +1676,7 @@ int i915_driver_load(struct drm_device *dev, unsigned long flags)
 	return 0;
 
 out_gem_unload:
-	if (dev_priv->mm.inactive_shrinker.shrink)
+	if (dev_priv->mm.inactive_shrinker.scan_objects)
 		unregister_shrinker(&dev_priv->mm.inactive_shrinker);
 
 	if (dev->pdev->msi_enabled)
@@ -1712,7 +1712,7 @@ int i915_driver_unload(struct drm_device *dev)
 
 	i915_teardown_sysfs(dev);
 
-	if (dev_priv->mm.inactive_shrinker.shrink)
+	if (dev_priv->mm.inactive_shrinker.scan_objects)
 		unregister_shrinker(&dev_priv->mm.inactive_shrinker);
 
 	mutex_lock(&dev->struct_mutex);
diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
index c605097..e360031 100644
--- a/drivers/gpu/drm/i915/i915_gem.c
+++ b/drivers/gpu/drm/i915/i915_gem.c
@@ -53,10 +53,12 @@ static void i915_gem_object_update_fence(struct drm_i915_gem_object *obj,
 					 struct drm_i915_fence_reg *fence,
 					 bool enable);
 
-static int i915_gem_inactive_shrink(struct shrinker *shrinker,
+static long i915_gem_inactive_count(struct shrinker *shrinker,
 				    struct shrink_control *sc);
+static long i915_gem_inactive_scan(struct shrinker *shrinker,
+				   struct shrink_control *sc);
 static long i915_gem_purge(struct drm_i915_private *dev_priv, long target);
-static void i915_gem_shrink_all(struct drm_i915_private *dev_priv);
+static long i915_gem_shrink_all(struct drm_i915_private *dev_priv);
 static void i915_gem_object_truncate(struct drm_i915_gem_object *obj);
 
 static inline void i915_gem_object_fence_lost(struct drm_i915_gem_object *obj)
@@ -1729,15 +1731,20 @@ i915_gem_purge(struct drm_i915_private *dev_priv, long target)
 	return __i915_gem_shrink(dev_priv, target, true);
 }
 
-static void
+static long
 i915_gem_shrink_all(struct drm_i915_private *dev_priv)
 {
 	struct drm_i915_gem_object *obj, *next;
+	long freed = 0;
 
 	i915_gem_evict_everything(dev_priv->dev);
 
-	list_for_each_entry_safe(obj, next, &dev_priv->mm.unbound_list, gtt_list)
+	list_for_each_entry_safe(obj, next, &dev_priv->mm.unbound_list, gtt_list) {
+		if (obj->pages_pin_count == 0)
+			freed += obj->base.size >> PAGE_SHIFT;
 		i915_gem_object_put_pages(obj);
+	}
+	return freed;
 }
 
 static int
@@ -4245,7 +4252,8 @@ i915_gem_load(struct drm_device *dev)
 
 	dev_priv->mm.interruptible = true;
 
-	dev_priv->mm.inactive_shrinker.shrink = i915_gem_inactive_shrink;
+	dev_priv->mm.inactive_shrinker.scan_objects = i915_gem_inactive_scan;
+	dev_priv->mm.inactive_shrinker.count_objects = i915_gem_inactive_count;
 	dev_priv->mm.inactive_shrinker.seeks = DEFAULT_SEEKS;
 	register_shrinker(&dev_priv->mm.inactive_shrinker);
 }
@@ -4468,8 +4476,8 @@ static bool mutex_is_locked_by(struct mutex *mutex, struct task_struct *task)
 #endif
 }
 
-static int
-i915_gem_inactive_shrink(struct shrinker *shrinker, struct shrink_control *sc)
+static long
+i915_gem_inactive_count(struct shrinker *shrinker, struct shrink_control *sc)
 {
 	struct drm_i915_private *dev_priv =
 		container_of(shrinker,
@@ -4477,9 +4485,8 @@ i915_gem_inactive_shrink(struct shrinker *shrinker, struct shrink_control *sc)
 			     mm.inactive_shrinker);
 	struct drm_device *dev = dev_priv->dev;
 	struct drm_i915_gem_object *obj;
-	int nr_to_scan = sc->nr_to_scan;
 	bool unlock = true;
-	int cnt;
+	long cnt;
 
 	if (!mutex_trylock(&dev->struct_mutex)) {
 		if (!mutex_is_locked_by(&dev->struct_mutex, current))
@@ -4491,15 +4498,6 @@ i915_gem_inactive_shrink(struct shrinker *shrinker, struct shrink_control *sc)
 		unlock = false;
 	}
 
-	if (nr_to_scan) {
-		nr_to_scan -= i915_gem_purge(dev_priv, nr_to_scan);
-		if (nr_to_scan > 0)
-			nr_to_scan -= __i915_gem_shrink(dev_priv, nr_to_scan,
-							false);
-		if (nr_to_scan > 0)
-			i915_gem_shrink_all(dev_priv);
-	}
-
 	cnt = 0;
 	list_for_each_entry(obj, &dev_priv->mm.unbound_list, gtt_list)
 		if (obj->pages_pin_count == 0)
@@ -4512,3 +4510,36 @@ i915_gem_inactive_shrink(struct shrinker *shrinker, struct shrink_control *sc)
 		mutex_unlock(&dev->struct_mutex);
 	return cnt;
 }
+static long
+i915_gem_inactive_scan(struct shrinker *shrinker, struct shrink_control *sc)
+{
+	struct drm_i915_private *dev_priv =
+		container_of(shrinker,
+			     struct drm_i915_private,
+			     mm.inactive_shrinker);
+	struct drm_device *dev = dev_priv->dev;
+	int nr_to_scan = sc->nr_to_scan;
+	long freed;
+	bool unlock = true;
+
+	if (!mutex_trylock(&dev->struct_mutex)) {
+		if (!mutex_is_locked_by(&dev->struct_mutex, current))
+			return 0;
+
+		if (dev_priv->mm.shrinker_no_lock_stealing)
+			return 0;
+
+		unlock = false;
+	}
+
+	freed = i915_gem_purge(dev_priv, nr_to_scan);
+	if (freed < nr_to_scan)
+		freed += __i915_gem_shrink(dev_priv, nr_to_scan,
+							false);
+	if (freed < nr_to_scan)
+		freed += i915_gem_shrink_all(dev_priv);
+
+	if (unlock)
+		mutex_unlock(&dev->struct_mutex);
+	return freed;
+}
diff --git a/drivers/gpu/drm/ttm/ttm_page_alloc.c b/drivers/gpu/drm/ttm/ttm_page_alloc.c
index bd2a3b4..83058a2 100644
--- a/drivers/gpu/drm/ttm/ttm_page_alloc.c
+++ b/drivers/gpu/drm/ttm/ttm_page_alloc.c
@@ -377,28 +377,28 @@ out:
 	return nr_free;
 }
 
-/* Get good estimation how many pages are free in pools */
-static int ttm_pool_get_num_unused_pages(void)
-{
-	unsigned i;
-	int total = 0;
-	for (i = 0; i < NUM_POOLS; ++i)
-		total += _manager->pools[i].npages;
-
-	return total;
-}
-
 /**
  * Callback for mm to request pool to reduce number of page held.
+ *
+ * XXX: (dchinner) Deadlock warning!
+ *
+ * ttm_page_pool_free() does memory allocation using GFP_KERNEL.  That means
+ * this can deadlock when called with a sc->gfp_mask that is not equal to
+ * GFP_KERNEL.
+ *
+ * This code is crying out for a shrinker per pool....
  */
-static int ttm_pool_mm_shrink(struct shrinker *shrink,
-			      struct shrink_control *sc)
+static long
+ttm_pool_shrink_scan(
+	struct shrinker		*shrink,
+	struct shrink_control	*sc)
 {
 	static atomic_t start_pool = ATOMIC_INIT(0);
 	unsigned i;
 	unsigned pool_offset = atomic_add_return(1, &start_pool);
 	struct ttm_page_pool *pool;
 	int shrink_pages = sc->nr_to_scan;
+	long freed = 0;
 
 	pool_offset = pool_offset % NUM_POOLS;
 	/* select start pool in round robin fashion */
@@ -408,14 +408,30 @@ static int ttm_pool_mm_shrink(struct shrinker *shrink,
 			break;
 		pool = &_manager->pools[(i + pool_offset)%NUM_POOLS];
 		shrink_pages = ttm_page_pool_free(pool, nr_free);
+		freed += nr_free - shrink_pages;
 	}
-	/* return estimated number of unused pages in pool */
-	return ttm_pool_get_num_unused_pages();
+	return freed;
+}
+
+
+static long
+ttm_pool_shrink_count(
+	struct shrinker		*shrink,
+	struct shrink_control	*sc)
+{
+	unsigned i;
+	long count = 0;
+
+	for (i = 0; i < NUM_POOLS; ++i)
+		count += _manager->pools[i].npages;
+
+	return count;
 }
 
 static void ttm_pool_mm_shrink_init(struct ttm_pool_manager *manager)
 {
-	manager->mm_shrink.shrink = &ttm_pool_mm_shrink;
+	manager->mm_shrink.count_objects = &ttm_pool_shrink_count;
+	manager->mm_shrink.scan_objects = &ttm_pool_shrink_scan;
 	manager->mm_shrink.seeks = 1;
 	register_shrinker(&manager->mm_shrink);
 }
diff --git a/drivers/gpu/drm/ttm/ttm_page_alloc_dma.c b/drivers/gpu/drm/ttm/ttm_page_alloc_dma.c
index b8b3943..b3b4f99 100644
--- a/drivers/gpu/drm/ttm/ttm_page_alloc_dma.c
+++ b/drivers/gpu/drm/ttm/ttm_page_alloc_dma.c
@@ -918,19 +918,6 @@ int ttm_dma_populate(struct ttm_dma_tt *ttm_dma, struct device *dev)
 }
 EXPORT_SYMBOL_GPL(ttm_dma_populate);
 
-/* Get good estimation how many pages are free in pools */
-static int ttm_dma_pool_get_num_unused_pages(void)
-{
-	struct device_pools *p;
-	unsigned total = 0;
-
-	mutex_lock(&_manager->lock);
-	list_for_each_entry(p, &_manager->pools, pools)
-		total += p->pool->npages_free;
-	mutex_unlock(&_manager->lock);
-	return total;
-}
-
 /* Put all pages in pages list to correct pool to wait for reuse */
 void ttm_dma_unpopulate(struct ttm_dma_tt *ttm_dma, struct device *dev)
 {
@@ -1002,18 +989,31 @@ EXPORT_SYMBOL_GPL(ttm_dma_unpopulate);
 
 /**
  * Callback for mm to request pool to reduce number of page held.
+ *
+ * XXX: (dchinner) Deadlock warning!
+ *
+ * ttm_dma_page_pool_free() does GFP_KERNEL memory allocation, and so attention
+ * needs to be paid to sc->gfp_mask to determine if this can be done or not.
+ * GFP_KERNEL memory allocation in a GFP_ATOMIC reclaim context would be really
+ * bad.
+ *
+ * I'm getting sadder as I hear more pathetical whimpers about needing per-pool
+ * shrinkers
  */
-static int ttm_dma_pool_mm_shrink(struct shrinker *shrink,
-				  struct shrink_control *sc)
+static long
+ttm_dma_pool_shrink_scan(
+	struct shrinker		*shrink,
+	struct shrink_control	*sc)
 {
 	static atomic_t start_pool = ATOMIC_INIT(0);
 	unsigned idx = 0;
 	unsigned pool_offset = atomic_add_return(1, &start_pool);
 	unsigned shrink_pages = sc->nr_to_scan;
 	struct device_pools *p;
+	long freed = 0;
 
 	if (list_empty(&_manager->pools))
-		return 0;
+		return -1;
 
 	mutex_lock(&_manager->lock);
 	pool_offset = pool_offset % _manager->npools;
@@ -1029,18 +1029,35 @@ static int ttm_dma_pool_mm_shrink(struct shrinker *shrink,
 			continue;
 		nr_free = shrink_pages;
 		shrink_pages = ttm_dma_page_pool_free(p->pool, nr_free);
+		freed += nr_free - shrink_pages;
+
 		pr_debug("%s: (%s:%d) Asked to shrink %d, have %d more to go\n",
 			 p->pool->dev_name, p->pool->name, current->pid,
 			 nr_free, shrink_pages);
 	}
 	mutex_unlock(&_manager->lock);
-	/* return estimated number of unused pages in pool */
-	return ttm_dma_pool_get_num_unused_pages();
+	return freed;
+}
+
+static long
+ttm_dma_pool_shrink_count(
+	struct shrinker		*shrink,
+	struct shrink_control	*sc)
+{
+	struct device_pools *p;
+	long count = 0;
+
+	mutex_lock(&_manager->lock);
+	list_for_each_entry(p, &_manager->pools, pools)
+		count += p->pool->npages_free;
+	mutex_unlock(&_manager->lock);
+	return count;
 }
 
 static void ttm_dma_pool_mm_shrink_init(struct ttm_pool_manager *manager)
 {
-	manager->mm_shrink.shrink = &ttm_dma_pool_mm_shrink;
+	manager->mm_shrink.count_objects = &ttm_dma_pool_shrink_count;
+	manager->mm_shrink.scan_objects = &ttm_dma_pool_shrink_scan;
 	manager->mm_shrink.seeks = 1;
 	register_shrinker(&manager->mm_shrink);
 }
diff --git a/drivers/md/bcache/btree.c b/drivers/md/bcache/btree.c
index 36688d6..e305f96 100644
--- a/drivers/md/bcache/btree.c
+++ b/drivers/md/bcache/btree.c
@@ -599,24 +599,12 @@ static int mca_reap(struct btree *b, struct closure *cl, unsigned min_order)
 	return 0;
 }
 
-static int bch_mca_shrink(struct shrinker *shrink, struct shrink_control *sc)
+static long bch_mca_scan(struct shrinker *shrink, struct shrink_control *sc)
 {
 	struct cache_set *c = container_of(shrink, struct cache_set, shrink);
 	struct btree *b, *t;
 	unsigned long i, nr = sc->nr_to_scan;
-
-	if (c->shrinker_disabled)
-		return 0;
-
-	if (c->try_harder)
-		return 0;
-
-	/*
-	 * If nr == 0, we're supposed to return the number of items we have
-	 * cached. Not allowed to return -1.
-	 */
-	if (!nr)
-		return mca_can_free(c) * c->btree_pages;
+	long freed = 0;
 
 	/* Return -1 if we can't do anything right now */
 	if (sc->gfp_mask & __GFP_WAIT)
@@ -629,14 +617,14 @@ static int bch_mca_shrink(struct shrinker *shrink, struct shrink_control *sc)
 
 	i = 0;
 	list_for_each_entry_safe(b, t, &c->btree_cache_freeable, list) {
-		if (!nr)
+		if (freed >= nr)
 			break;
 
 		if (++i > 3 &&
 		    !mca_reap(b, NULL, 0)) {
 			mca_data_free(b);
 			rw_unlock(true, b);
-			--nr;
+			freed++;
 		}
 	}
 
@@ -647,7 +635,7 @@ static int bch_mca_shrink(struct shrinker *shrink, struct shrink_control *sc)
 	if (list_empty(&c->btree_cache))
 		goto out;
 
-	for (i = 0; nr && i < c->bucket_cache_used; i++) {
+	for (i = 0; i < c->bucket_cache_used; i++) {
 		b = list_first_entry(&c->btree_cache, struct btree, list);
 		list_rotate_left(&c->btree_cache);
 
@@ -656,14 +644,26 @@ static int bch_mca_shrink(struct shrinker *shrink, struct shrink_control *sc)
 			mca_bucket_free(b);
 			mca_data_free(b);
 			rw_unlock(true, b);
-			--nr;
+			freed++;
 		} else
 			b->accessed = 0;
 	}
 out:
-	nr = mca_can_free(c) * c->btree_pages;
 	mutex_unlock(&c->bucket_lock);
-	return nr;
+	return freed;
+}
+
+static long bch_mca_count(struct shrinker *shrink, struct shrink_control *sc)
+{
+	struct cache_set *c = container_of(shrink, struct cache_set, shrink);
+
+	if (c->shrinker_disabled)
+		return 0;
+
+	if (c->try_harder)
+		return 0;
+
+	return mca_can_free(c) * c->btree_pages;
 }
 
 void bch_btree_cache_free(struct cache_set *c)
@@ -732,7 +732,8 @@ int bch_btree_cache_alloc(struct cache_set *c)
 		c->verify_data = NULL;
 #endif
 
-	c->shrink.shrink = bch_mca_shrink;
+	c->shrink.count_objects = bch_mca_count;
+	c->shrink.scan_objects = bch_mca_scan;
 	c->shrink.seeks = 4;
 	c->shrink.batch = c->btree_pages * 2;
 	register_shrinker(&c->shrink);
diff --git a/drivers/md/bcache/sysfs.c b/drivers/md/bcache/sysfs.c
index 4d9cca4..fa8d048 100644
--- a/drivers/md/bcache/sysfs.c
+++ b/drivers/md/bcache/sysfs.c
@@ -535,7 +535,7 @@ STORE(__bch_cache_set)
 		struct shrink_control sc;
 		sc.gfp_mask = GFP_KERNEL;
 		sc.nr_to_scan = strtoul_or_return(buf);
-		c->shrink.shrink(&c->shrink, &sc);
+		c->shrink.scan_objects(&c->shrink, &sc);
 	}
 
 	sysfs_strtoul(congested_read_threshold_us,
diff --git a/drivers/md/dm-bufio.c b/drivers/md/dm-bufio.c
index d489dfd..cdafc63 100644
--- a/drivers/md/dm-bufio.c
+++ b/drivers/md/dm-bufio.c
@@ -1383,62 +1383,80 @@ static int __cleanup_old_buffer(struct dm_buffer *b, gfp_t gfp,
 				unsigned long max_jiffies)
 {
 	if (jiffies - b->last_accessed < max_jiffies)
-		return 1;
+		return 0;
 
 	if (!(gfp & __GFP_IO)) {
 		if (test_bit(B_READING, &b->state) ||
 		    test_bit(B_WRITING, &b->state) ||
 		    test_bit(B_DIRTY, &b->state))
-			return 1;
+			return 0;
 	}
 
 	if (b->hold_count)
-		return 1;
+		return 0;
 
 	__make_buffer_clean(b);
 	__unlink_buffer(b);
 	__free_buffer_wake(b);
 
-	return 0;
+	return 1;
 }
 
-static void __scan(struct dm_bufio_client *c, unsigned long nr_to_scan,
-		   struct shrink_control *sc)
+static long __scan(struct dm_bufio_client *c, unsigned long nr_to_scan,
+		   gfp_t gfp_mask)
 {
 	int l;
 	struct dm_buffer *b, *tmp;
+	long freed = 0;
 
 	for (l = 0; l < LIST_SIZE; l++) {
-		list_for_each_entry_safe_reverse(b, tmp, &c->lru[l], lru_list)
-			if (!__cleanup_old_buffer(b, sc->gfp_mask, 0) &&
-			    !--nr_to_scan)
-				return;
+		list_for_each_entry_safe_reverse(b, tmp, &c->lru[l], lru_list) {
+			freed += __cleanup_old_buffer(b, gfp_mask, 0);
+			if (!--nr_to_scan)
+				break;
+		}
 		dm_bufio_cond_resched();
 	}
+	return freed;
 }
 
-static int shrink(struct shrinker *shrinker, struct shrink_control *sc)
+static long
+dm_bufio_shrink_scan(
+	struct shrinker		*shrink,
+	struct shrink_control	*sc)
 {
 	struct dm_bufio_client *c =
-	    container_of(shrinker, struct dm_bufio_client, shrinker);
-	unsigned long r;
-	unsigned long nr_to_scan = sc->nr_to_scan;
+	    container_of(shrink, struct dm_bufio_client, shrinker);
+	long freed;
 
 	if (sc->gfp_mask & __GFP_IO)
 		dm_bufio_lock(c);
 	else if (!dm_bufio_trylock(c))
-		return !nr_to_scan ? 0 : -1;
+		return -1;
 
-	if (nr_to_scan)
-		__scan(c, nr_to_scan, sc);
+	freed  = __scan(c, sc->nr_to_scan, sc->gfp_mask);
+	dm_bufio_unlock(c);
+	return freed;
+}
 
-	r = c->n_buffers[LIST_CLEAN] + c->n_buffers[LIST_DIRTY];
-	if (r > INT_MAX)
-		r = INT_MAX;
+static long
+dm_bufio_shrink_count(
+	struct shrinker		*shrink,
+	struct shrink_control	*sc)
+{
+	struct dm_bufio_client *c =
+	    container_of(shrink, struct dm_bufio_client, shrinker);
+	long count;
+
+	if (sc->gfp_mask & __GFP_IO)
+		dm_bufio_lock(c);
+	else if (!dm_bufio_trylock(c))
+		return 0;
 
+	count = c->n_buffers[LIST_CLEAN] + c->n_buffers[LIST_DIRTY];
 	dm_bufio_unlock(c);
+	return count;
 
-	return r;
 }
 
 /*
@@ -1540,7 +1558,8 @@ struct dm_bufio_client *dm_bufio_client_create(struct block_device *bdev, unsign
 	__cache_size_refresh();
 	mutex_unlock(&dm_bufio_clients_lock);
 
-	c->shrinker.shrink = shrink;
+	c->shrinker.count_objects = dm_bufio_shrink_count;
+	c->shrinker.scan_objects = dm_bufio_shrink_scan;
 	c->shrinker.seeks = 1;
 	c->shrinker.batch = 0;
 	register_shrinker(&c->shrinker);
@@ -1627,7 +1646,7 @@ static void cleanup_old_buffers(void)
 			struct dm_buffer *b;
 			b = list_entry(c->lru[LIST_CLEAN].prev,
 				       struct dm_buffer, lru_list);
-			if (__cleanup_old_buffer(b, 0, max_age * HZ))
+			if (!__cleanup_old_buffer(b, 0, max_age * HZ))
 				break;
 			dm_bufio_cond_resched();
 		}
diff --git a/drivers/staging/android/ashmem.c b/drivers/staging/android/ashmem.c
index 65f36d7..7b466b5 100644
--- a/drivers/staging/android/ashmem.c
+++ b/drivers/staging/android/ashmem.c
@@ -341,27 +341,28 @@ out:
 /*
  * ashmem_shrink - our cache shrinker, called from mm/vmscan.c :: shrink_slab
  *
- * 'nr_to_scan' is the number of objects (pages) to prune, or 0 to query how
- * many objects (pages) we have in total.
+ * 'nr_to_scan' is the number of objects to scan for freeing.
  *
  * 'gfp_mask' is the mask of the allocation that got us into this mess.
  *
- * Return value is the number of objects (pages) remaining, or -1 if we cannot
+ * Return value is the number of objects freed or -1 if we cannot
  * proceed without risk of deadlock (due to gfp_mask).
  *
  * We approximate LRU via least-recently-unpinned, jettisoning unpinned partial
  * chunks of ashmem regions LRU-wise one-at-a-time until we hit 'nr_to_scan'
  * pages freed.
  */
-static int ashmem_shrink(struct shrinker *s, struct shrink_control *sc)
+static long
+ashmem_shrink_scan(
+	struct shrinker		*shrink,
+	struct shrink_control	*sc)
 {
 	struct ashmem_range *range, *next;
+	long freed = 0;
 
 	/* We might recurse into filesystem code, so bail out if necessary */
-	if (sc->nr_to_scan && !(sc->gfp_mask & __GFP_FS))
+	if (!(sc->gfp_mask & __GFP_FS))
 		return -1;
-	if (!sc->nr_to_scan)
-		return lru_count;
 
 	mutex_lock(&ashmem_mutex);
 	list_for_each_entry_safe(range, next, &ashmem_lru_list, lru) {
@@ -374,17 +375,34 @@ static int ashmem_shrink(struct shrinker *s, struct shrink_control *sc)
 		range->purged = ASHMEM_WAS_PURGED;
 		lru_del(range);
 
-		sc->nr_to_scan -= range_size(range);
-		if (sc->nr_to_scan <= 0)
+		freed += range_size(range);
+		if (--sc->nr_to_scan <= 0)
 			break;
 	}
 	mutex_unlock(&ashmem_mutex);
+	return freed;
+}
 
+static long
+ashmem_shrink_count(
+	struct shrinker		*shrink,
+	struct shrink_control	*sc)
+{
+	/*
+	 * note that lru_count is count of pages on the lru, not a count of
+	 * objects on the list. This means the scan function needs to return the
+	 * number of pages freed, not the number of objects scanned.
+	 */
 	return lru_count;
 }
 
 static struct shrinker ashmem_shrinker = {
-	.shrink = ashmem_shrink,
+	.count_objects = ashmem_shrink_count,
+	.scan_objects = ashmem_shrink_scan,
+	/*
+	 * XXX (dchinner): I wish people would comment on why they need
+	 * significant changes to the default value here
+	 */
 	.seeks = DEFAULT_SEEKS * 4,
 };
 
@@ -690,14 +708,11 @@ static long ashmem_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
 		if (capable(CAP_SYS_ADMIN)) {
 			struct shrink_control sc = {
 				.gfp_mask = GFP_KERNEL,
-				.nr_to_scan = 0,
+				.nr_to_scan = LONG_MAX,
 			};
 
 			nodes_setall(sc.nodes_to_scan);
-
-			ret = ashmem_shrink(&ashmem_shrinker, &sc);
-			sc.nr_to_scan = ret;
-			ashmem_shrink(&ashmem_shrinker, &sc);
+			ashmem_shrink_scan(&ashmem_shrinker, &sc);
 		}
 		break;
 	}
diff --git a/drivers/staging/android/lowmemorykiller.c b/drivers/staging/android/lowmemorykiller.c
index fe74494..d23bfea 100644
--- a/drivers/staging/android/lowmemorykiller.c
+++ b/drivers/staging/android/lowmemorykiller.c
@@ -66,7 +66,15 @@ static unsigned long lowmem_deathpending_timeout;
 			pr_info(x);			\
 	} while (0)
 
-static int lowmem_shrink(struct shrinker *s, struct shrink_control *sc)
+static long lowmem_count(struct shrinker *s, struct shrink_control *sc)
+{
+	return global_page_state(NR_ACTIVE_ANON) +
+		global_page_state(NR_ACTIVE_FILE) +
+		global_page_state(NR_INACTIVE_ANON) +
+		global_page_state(NR_INACTIVE_FILE);
+}
+
+static long lowmem_scan(struct shrinker *s, struct shrink_control *sc)
 {
 	struct task_struct *tsk;
 	struct task_struct *selected = NULL;
@@ -92,19 +100,17 @@ static int lowmem_shrink(struct shrinker *s, struct shrink_control *sc)
 			break;
 		}
 	}
-	if (sc->nr_to_scan > 0)
-		lowmem_print(3, "lowmem_shrink %lu, %x, ofree %d %d, ma %hd\n",
-				sc->nr_to_scan, sc->gfp_mask, other_free,
-				other_file, min_score_adj);
-	rem = global_page_state(NR_ACTIVE_ANON) +
-		global_page_state(NR_ACTIVE_FILE) +
-		global_page_state(NR_INACTIVE_ANON) +
-		global_page_state(NR_INACTIVE_FILE);
-	if (sc->nr_to_scan <= 0 || min_score_adj == OOM_SCORE_ADJ_MAX + 1) {
-		lowmem_print(5, "lowmem_shrink %lu, %x, return %d\n",
-			     sc->nr_to_scan, sc->gfp_mask, rem);
-		return rem;
+
+	lowmem_print(3, "lowmem_scan %lu, %x, ofree %d %d, ma %hd\n",
+			sc->nr_to_scan, sc->gfp_mask, other_free,
+			other_file, min_score_adj);
+
+	if (min_score_adj == OOM_SCORE_ADJ_MAX + 1) {
+		lowmem_print(5, "lowmem_scan %lu, %x, return 0\n",
+			     sc->nr_to_scan, sc->gfp_mask);
+		return 0;
 	}
+
 	selected_oom_score_adj = min_score_adj;
 
 	rcu_read_lock();
@@ -154,16 +160,18 @@ static int lowmem_shrink(struct shrinker *s, struct shrink_control *sc)
 		lowmem_deathpending_timeout = jiffies + HZ;
 		send_sig(SIGKILL, selected, 0);
 		set_tsk_thread_flag(selected, TIF_MEMDIE);
-		rem -= selected_tasksize;
+		rem += selected_tasksize;
 	}
-	lowmem_print(4, "lowmem_shrink %lu, %x, return %d\n",
+
+	lowmem_print(4, "lowmem_scan %lu, %x, return %d\n",
 		     sc->nr_to_scan, sc->gfp_mask, rem);
 	rcu_read_unlock();
 	return rem;
 }
 
 static struct shrinker lowmem_shrinker = {
-	.shrink = lowmem_shrink,
+	.scan_objects = lowmem_scan,
+	.count_objects = lowmem_count,
 	.seeks = DEFAULT_SEEKS * 16
 };
 
diff --git a/drivers/staging/zcache/zcache-main.c b/drivers/staging/zcache/zcache-main.c
index dcceed2..4ade8e3 100644
--- a/drivers/staging/zcache/zcache-main.c
+++ b/drivers/staging/zcache/zcache-main.c
@@ -1140,23 +1140,19 @@ static bool zcache_freeze;
  * pageframes in use.  FIXME POLICY: Probably the writeback should only occur
  * if the eviction doesn't free enough pages.
  */
-static int shrink_zcache_memory(struct shrinker *shrink,
-				struct shrink_control *sc)
+static long scan_zcache_memory(struct shrinker *shrink,
+			       struct shrink_control *sc)
 {
 	static bool in_progress;
-	int ret = -1;
-	int nr = sc->nr_to_scan;
 	int nr_evict = 0;
 	int nr_writeback = 0;
 	struct page *page;
 	int  file_pageframes_inuse, anon_pageframes_inuse;
-
-	if (nr <= 0)
-		goto skip_evict;
+	long freed = 0;
 
 	/* don't allow more than one eviction thread at a time */
 	if (in_progress)
-		goto skip_evict;
+		return 0;
 
 	in_progress = true;
 
@@ -1176,6 +1172,7 @@ static int shrink_zcache_memory(struct shrinker *shrink,
 		if (page == NULL)
 			break;
 		zcache_free_page(page);
+		freed++;
 	}
 
 	zcache_last_active_anon_pageframes =
@@ -1192,13 +1189,22 @@ static int shrink_zcache_memory(struct shrinker *shrink,
 #ifdef CONFIG_ZCACHE_WRITEBACK
 		int writeback_ret;
 		writeback_ret = zcache_frontswap_writeback();
-		if (writeback_ret == -ENOMEM)
+		if (writeback_ret != -ENOMEM)
+			freed++;
+		else
 #endif
 			break;
 	}
 	in_progress = false;
 
-skip_evict:
+	return freed;
+}
+
+static long count_zcache_memory(struct shrinker *shrink,
+				struct shrink_control *sc)
+{
+	int ret = -1;
+
 	/* resample: has changed, but maybe not all the way yet */
 	zcache_last_active_file_pageframes =
 		global_page_state(NR_LRU_BASE + LRU_ACTIVE_FILE);
@@ -1212,7 +1218,8 @@ skip_evict:
 }
 
 static struct shrinker zcache_shrinker = {
-	.shrink = shrink_zcache_memory,
+	.scan_objects = scan_zcache_memory,
+	.count_objects = count_zcache_memory,
 	.seeks = DEFAULT_SEEKS,
 };
 
-- 
1.8.1.4

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 103+ messages in thread
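
The driver conversions in the patch above all follow one pattern: the old
->shrink callback is split into a cheap count_objects callback and a
scan_objects callback that returns the number of objects actually freed.
A minimal userspace sketch of that split follows. The struct definitions
are simplified stand-ins rather than the kernel's, and demo_cache,
cache_of(), demo_count(), demo_scan() and demo_cache_init() are
hypothetical names used only for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-ins for the kernel types (assumption, not the real ones). */
struct shrink_control {
	unsigned long nr_to_scan;
};

struct shrinker {
	long (*count_objects)(struct shrinker *, struct shrink_control *);
	long (*scan_objects)(struct shrinker *, struct shrink_control *);
};

/* Hypothetical cache that embeds its shrinker, as the drivers above do. */
struct demo_cache {
	long nr_cached;
	struct shrinker shrinker;
};

/* Userspace equivalent of container_of() for this one struct. */
#define cache_of(s) \
	((struct demo_cache *)((char *)(s) - offsetof(struct demo_cache, shrinker)))

/* count: report how many objects could be freed, without freeing any. */
static long demo_count(struct shrinker *s, struct shrink_control *sc)
{
	(void)sc;
	return cache_of(s)->nr_cached;
}

/* scan: free up to sc->nr_to_scan objects and return how many were freed. */
static long demo_scan(struct shrinker *s, struct shrink_control *sc)
{
	struct demo_cache *c = cache_of(s);
	long freed = 0;

	while (sc->nr_to_scan > 0 && c->nr_cached > 0) {
		c->nr_cached--;
		sc->nr_to_scan--;
		freed++;
	}
	return freed;
}

/* Wire up both callbacks, mirroring the two assignments in each driver. */
static void demo_cache_init(struct demo_cache *c, long nr)
{
	assert(nr >= 0);
	c->nr_cached = nr;
	c->shrinker.count_objects = demo_count;
	c->shrinker.scan_objects = demo_scan;
}
```

The point of the split is that the caller no longer has to pass
nr_to_scan == 0 to mean "just count", and the scan return value is a
number of freed objects rather than a remaining total.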

* [PATCH v10 31/35] vmscan: take at least one pass with shrinkers
  2013-06-03 19:29 [PATCH v10 00/35] kmemcg shrinkers Glauber Costa
                   ` (3 preceding siblings ...)
  2013-06-03 19:29 ` [PATCH v10 20/35] drivers: convert shrinkers to new count/scan API Glauber Costa
@ 2013-06-03 19:30 ` Glauber Costa
  2013-06-05 23:07 ` [PATCH v10 00/35] kmemcg shrinkers Andrew Morton
  5 siblings, 0 replies; 103+ messages in thread
From: Glauber Costa @ 2013-06-03 19:30 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-fsdevel, Mel Gorman, Dave Chinner, linux-mm, cgroups,
	kamezawa.hiroyu, Michal Hocko, Johannes Weiner, hughd,
	Greg Thelen, Glauber Costa, Carlos Maiolino, Theodore Ts'o,
	Al Viro

In very low free kernel memory situations, it may be the case that we
have fewer objects to free than our initial batch size. If this is the
case, it is better to shrink those, and open space for the new workload
than to keep them and fail the new allocations.

In particular, we are concerned with the direct reclaim case for memcg.
Although this same technique can be applied to other situations just as well,
we will start conservative and apply it for that case, which is the one
that matters the most.
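
As a rough userspace sketch of the loop shape described above (not the
kernel code itself): the first pass may run with fewer than batch_size
objects, but only when the caller allows it. Here struct pool,
pool_scan(), shrink_batches() and the small_pass_ok flag are hypothetical;
small_pass_ok stands in for the "target memcg has kmem accounting active"
check, and the -1 return handling of the real callback is left out.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical pool of freeable objects standing in for a real cache. */
struct pool {
	long objects;
};

/* Stand-in for scan_objects: free up to nr_to_scan, return the number freed. */
static long pool_scan(struct pool *p, unsigned long nr_to_scan)
{
	long want = (long)nr_to_scan;
	long freed = want < p->objects ? want : p->objects;

	p->objects -= freed;
	return freed;
}

/*
 * do/while form: the body is entered once even if total_scan is below
 * batch_size, and the checks inside decide whether that short pass runs.
 */
static long shrink_batches(struct pool *p, unsigned long total_scan,
			   unsigned long batch_size, bool small_pass_ok)
{
	long freed = 0;

	assert(batch_size > 0);

	do {
		unsigned long nr_to_scan =
			total_scan < batch_size ? total_scan : batch_size;

		/* "no objects" is different from "few objects" */
		if (!total_scan)
			break;

		/* a short pass is only taken when explicitly allowed */
		if (total_scan < batch_size && !small_pass_ok)
			break;

		freed += pool_scan(p, nr_to_scan);
		total_scan -= nr_to_scan;
	} while (total_scan >= batch_size);

	return freed;
}
```

Note that only the first pass can be short: once a full batch has run, a
remainder smaller than batch_size is left for a later call, as before.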

[ v6: only do it per memcg ]
[ v5: differentiate no-scan case, don't do this for kswapd ]

Signed-off-by: Glauber Costa <glommer@openvz.org>
CC: Dave Chinner <david@fromorbit.com>
CC: Carlos Maiolino <cmaiolino@redhat.com>
CC: "Theodore Ts'o" <tytso@mit.edu>
CC: Al Viro <viro@zeniv.linux.org.uk>
---
 mm/vmscan.c | 23 ++++++++++++++++++-----
 1 file changed, 18 insertions(+), 5 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index a42c742..1067b1c 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -297,21 +297,34 @@ shrink_slab_node(struct shrink_control *shrinkctl, struct shrinker *shrinker,
 				nr_pages_scanned, lru_pages,
 				max_pass, delta, total_scan);
 
-	while (total_scan >= batch_size) {
+	do {
 		long ret;
+		unsigned long nr_to_scan = min(batch_size, total_scan);
+		struct mem_cgroup *memcg = shrinkctl->target_mem_cgroup;
+
+		/*
+		 * Differentiate between "few objects" and "no objects"
+		 * as returned by the count step.
+		 */
+		if (!total_scan)
+			break;
+
+		if ((total_scan < batch_size) &&
+		   !(memcg && memcg_kmem_is_active(memcg)))
+			break;
 
-		shrinkctl->nr_to_scan = batch_size;
+		shrinkctl->nr_to_scan = nr_to_scan;
 		ret = shrinker->scan_objects(shrinker, shrinkctl);
 
 		if (ret == -1)
 			break;
 		freed += ret;
 
-		count_vm_events(SLABS_SCANNED, batch_size);
-		total_scan -= batch_size;
+		count_vm_events(SLABS_SCANNED, nr_to_scan);
+		total_scan -= nr_to_scan;
 
 		cond_resched();
-	}
+	} while (total_scan >= batch_size);
 
 	/*
 	 * move the unused scan count back into the shrinker in a
-- 
1.8.1.4


^ permalink raw reply related	[flat|nested] 103+ messages in thread

* Re: [PATCH v10 00/35] kmemcg shrinkers
  2013-06-03 19:29 [PATCH v10 00/35] kmemcg shrinkers Glauber Costa
                   ` (4 preceding siblings ...)
  2013-06-03 19:30 ` [PATCH v10 31/35] vmscan: take at least one pass with shrinkers Glauber Costa
@ 2013-06-05 23:07 ` Andrew Morton
  2013-06-06  3:44   ` Dave Chinner
                     ` (2 more replies)
  5 siblings, 3 replies; 103+ messages in thread
From: Andrew Morton @ 2013-06-05 23:07 UTC (permalink / raw)
  To: Glauber Costa
  Cc: linux-fsdevel, Mel Gorman, Dave Chinner, linux-mm, cgroups,
	kamezawa.hiroyu, Michal Hocko, Johannes Weiner, hughd,
	Greg Thelen

On Mon,  3 Jun 2013 23:29:29 +0400 Glauber Costa <glommer@openvz.org> wrote:

> Andrew,
> 
> This submission contains one small bug fix over the last one. I have been
> testing it regularly and believe this is ready for merging. I have follow up
> patches for this series, with a few improvements (namely: dynamic sized
> list_lru node arrays, memcg flush-at-destruction, kmemcg shrinking setting
> limit < usage).  But since this series is already quite mature - and very
> extensive, I don't believe that adding new patches would make them receive the
> appropriate level of review. So please advise me if there is anything crucial
> missing in here. Thanks!
> 
> Hi,
> 
> This patchset implements targeted shrinking for memcg when kmem limits are
> present. So far, we've been accounting kernel objects but failing allocations
> when short of memory. This is because our only option would be to call the
> global shrinker, depleting objects from all caches and breaking isolation.
> 
> The main idea is to associate per-memcg lists with each of the LRUs. The main
> LRU still provides a single entry point and when adding or removing an element
> from the LRU, we use the page information to figure out which memcg it belongs
> to and relay it to the right list.
> 
> Base work:
> ==========
> 
> Please note that this builds upon the recent work from Dave Chinner that
> sanitizes the LRU shrinking API and make the shrinkers node aware. Node
> awareness is not *strictly* needed for my work, but I still perceive it
> as an advantage. The API unification is a major need, and I build upon it
> heavily. That allows us to manipulate the LRUs without knowledge of the
> underlying objects with ease. This time, I am including that work here as
> a baseline.

This patchset is huge.

My overall take is that the patchset is massive and intrusive and scary
:( I'd like to see more evidence that the memcg people (mhocko, hannes,
kamezawa etc) have spent quality time reviewing and testing this code. 
There really is a lot of it!

I haven't seen any show-stoppers yet so I guess I'll slam it all into
-next and cross fingers.  I would ask that the relevant developers set
aside a solid day to read and runtime test it all.  Realistically, it's
likely to take considerably more time than that.

I do expect that I'll drop the entire patchset again for the next
version, if only because the next version should withdraw all the
switch-random-code-to-xfs-coding-style changes...


I'm thinking that we should approach this in two stages: all the new
shrinker stuff separated from the memcg_kmem work.  So we merge
everything up to "shrinker: Kill old ->shrink API" and then continue to
work on the memcg things?


^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH v10 00/35] kmemcg shrinkers
  2013-06-05 23:07 ` [PATCH v10 00/35] kmemcg shrinkers Andrew Morton
@ 2013-06-06  3:44   ` Dave Chinner
  2013-06-06  5:51   ` Glauber Costa
       [not found]   ` <20130605160721.da995af82eb247ccf8f8537f-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org>
  2 siblings, 0 replies; 103+ messages in thread
From: Dave Chinner @ 2013-06-06  3:44 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Glauber Costa, linux-fsdevel, Mel Gorman, linux-mm, cgroups,
	kamezawa.hiroyu, Michal Hocko, Johannes Weiner, hughd,
	Greg Thelen

On Wed, Jun 05, 2013 at 04:07:21PM -0700, Andrew Morton wrote:
> On Mon,  3 Jun 2013 23:29:29 +0400 Glauber Costa <glommer@openvz.org> wrote:
> 
> > Andrew,
> > 
> > This submission contains one small bug fix over the last one. I have been
> > testing it regularly and believe this is ready for merging. I have follow up
> > patches for this series, with a few improvements (namely: dynamic sized
> > list_lru node arrays, memcg flush-at-destruction, kmemcg shrinking setting
> > limit < usage).  But since this series is already quite mature - and very
> > extensive, I don't believe that adding new patches would make them receive the
> > appropriate level of review. So please advise me if there is anything crucial
> > missing in here. Thanks!
> > 
> > Hi,
> > 
> > This patchset implements targeted shrinking for memcg when kmem limits are
> > present. So far, we've been accounting kernel objects but failing allocations
> > when short of memory. This is because our only option would be to call the
> > global shrinker, depleting objects from all caches and breaking isolation.
> > 
> > The main idea is to associate per-memcg lists with each of the LRUs. The main
> > LRU still provides a single entry point and when adding or removing an element
> > from the LRU, we use the page information to figure out which memcg it belongs
> > to and relay it to the right list.
> > 
> > Base work:
> > ==========
> > 
> > Please note that this builds upon the recent work from Dave Chinner that
> > sanitizes the LRU shrinking API and make the shrinkers node aware. Node
> > awareness is not *strictly* needed for my work, but I still perceive it
> > as an advantage. The API unification is a major need, and I build upon it
> > heavily. That allows us to manipulate the LRUs without knowledge of the
> > underlying objects with ease. This time, I am including that work here as
> > a baseline.
> 
> This patchset is huge.

*nod*

> My overall take is that the patchset is massive and intrusive and scary
> :( I'd like to see more evidence that the memcg people (mhocko, hannes,
> kamezawa etc) have spent quality time reviewing and testing this code. 
> There really is a lot of it!
> 
> I haven't seen any show-stoppers yet so I guess I'll slam it all into
> -next and cross fingers.  I would ask that the relevant developers set
> aside a solid day to read and runtime test it all.  Realistically, it's
> likely to take considerably more time than that.

Yes, it will.

> I do expect that I'll drop the entire patchset again for the next
> version, if only because the next version should withdraw all the
> switch-random-code-to-xfs-coding-style changes...
> 
> 
> I'm thinking that we should approach this in two stages: all the new
> shrinker stuff separated from the memcg_kmem work.  So we merge
> everything up to "shrinker: Kill old ->shrink API" and then continue to
> work on the memcg things?

Fine by me. I'll work with Glauber to get all the documentation and
formatting and bugs you found fixed for the LRU/shrinker part of the 
patchset as quickly as possible...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com



* Re: [PATCH v10 00/35] kmemcg shrinkers
  2013-06-05 23:07 ` [PATCH v10 00/35] kmemcg shrinkers Andrew Morton
  2013-06-06  3:44   ` Dave Chinner
@ 2013-06-06  5:51   ` Glauber Costa
       [not found]     ` <51B02347.60809-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
       [not found]   ` <20130605160721.da995af82eb247ccf8f8537f-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org>
  2 siblings, 1 reply; 103+ messages in thread
From: Glauber Costa @ 2013-06-06  5:51 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Glauber Costa, linux-fsdevel, Mel Gorman, Dave Chinner, linux-mm,
	cgroups, kamezawa.hiroyu, Michal Hocko, Johannes Weiner, hughd,
	Greg Thelen

On 06/06/2013 03:07 AM, Andrew Morton wrote:
> On Mon,  3 Jun 2013 23:29:29 +0400 Glauber Costa <glommer@openvz.org> wrote:
> 
>> Andrew,
>>
>> This submission contains one small bug fix over the last one. I have been
>> testing it regularly and believe this is ready for merging. I have follow up
>> patches for this series, with a few improvements (namely: dynamic sized
>> list_lru node arrays, memcg flush-at-destruction, kmemcg shrinking setting
>> limit < usage).  But since this series is already quite mature - and very
>> extensive, I don't believe that adding new patches would make them receive the
>> appropriate level of review. So please advise me if there is anything crucial
>> missing in here. Thanks!
>>
>> Hi,
>>
>> This patchset implements targeted shrinking for memcg when kmem limits are
>> present. So far, we've been accounting kernel objects but failing allocations
>> when short of memory. This is because our only option would be to call the
>> global shrinker, depleting objects from all caches and breaking isolation.
>>
>> The main idea is to associate per-memcg lists with each of the LRUs. The main
>> LRU still provides a single entry point and when adding or removing an element
>> from the LRU, we use the page information to figure out which memcg it belongs
>> to and relay it to the right list.
>>
>> Base work:
>> ==========
>>
>> Please note that this builds upon the recent work from Dave Chinner that
>> sanitizes the LRU shrinking API and makes the shrinkers node aware. Node
>> awareness is not *strictly* needed for my work, but I still perceive it
>> as an advantage. The API unification is a major need, and I build upon it
>> heavily. That allows us to manipulate the LRUs without knowledge of the
>> underlying objects with ease. This time, I am including that work here as
>> a baseline.
> 
> This patchset is huge.
> 
> My overall take is that the patchset is massive and intrusive and scary
> :( I'd like to see more evidence that the memcg people (mhocko, hannes,
> kamezawa etc) have spent quality time reviewing and testing this code. 
> There really is a lot of it!
> 

More review is useful, indeed.

> I haven't seen any show-stoppers yet so I guess I'll slam it all into
> -next and cross fingers.  I would ask that the relevant developers set
> aside a solid day to read and runtime test it all.  Realistically, it's
> likely to take considerably more time than that.
> 
> I do expect that I'll drop the entire patchset again for the next
> version, if only because the next version should withdraw all the
> switch-random-code-to-xfs-coding-style changes...
> 
Ok, how do you want me to proceed? Should I send a new series, or an
incremental one? When exactly?

I do have at least two fixes to send that popped out this week: one of
them for the drivers patch, since Kent complained about a malconversion
of the bcache driver, and another one in the memcg page path.

> 
> I'm thinking that we should approach this in two stages: all the new
> shrinker stuff separated from the memcg_kmem work.  So we merge
> everything up to "shrinker: Kill old ->shrink API" and then continue to
> work on the memcg things?
> 

I agree with this; the shrinker part got a very thorough review from Mel
recently. I do need to send you the fix for the bcache driver (or the
whole thing, if you prefer), and address whatever comments you have.

Please note that, as I mentioned in the opening letter, I have two
follow-up patches for memcg (one of them allows us to use the shrinker
infrastructure to reduce the value of kmem.limit, and the other one
flushes the caches upon destruction). I haven't included them in the
series because it is already huge, and I believe that, being new, they
would not get the review they deserve. Splitting it in two would allow
me to include them in a smaller series.

I will go over your comments in a couple of hours. Please just advise me
how you would like me to proceed with this logistically (new submission,
fixes, for which patches, etc.).


[parent not found: <51B02347.60809-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>]

* Re: [PATCH v10 00/35] kmemcg shrinkers
       [not found]     ` <51B02347.60809-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
@ 2013-06-06  7:18       ` Andrew Morton
       [not found]         ` <20130606001855.48d9da2e.akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org>
  0 siblings, 1 reply; 103+ messages in thread
From: Andrew Morton @ 2013-06-06  7:18 UTC (permalink / raw)
  To: Glauber Costa
  Cc: Glauber Costa, linux-fsdevel, Mel Gorman, Dave Chinner, linux-mm,
	cgroups, kamezawa.hiroyu, Michal Hocko, Johannes Weiner, hughd,
	Greg Thelen

On Thu, 6 Jun 2013 09:51:03 +0400 Glauber Costa <glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org> wrote:

> On 06/06/2013 03:07 AM, Andrew Morton wrote:
> > On Mon,  3 Jun 2013 23:29:29 +0400 Glauber Costa <glommer-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org> wrote:
> 
> > I haven't seen any show-stoppers yet so I guess I'll slam it all into
> > -next and cross fingers.  I would ask that the relevant developers set
> > aside a solid day to read and runtime test it all.  Realistically, it's
> > likely to take considerably more time than that.
> > 
> > I do expect that I'll drop the entire patchset again for the next
> > version, if only because the next version should withdraw all the
> > switch-random-code-to-xfs-coding-style changes...
> > 
> Ok, how do you want me to proceed? Should I send a new series, or an
> incremental one? When exactly?
> 
> I do have at least two fixes to send that popped out this week: one of
> them for the drivers patch, since Kent complained about a malconversion
> of the bcache driver, and another one in the memcg page path.

Definitely a new series.  I tossed this series into -mm and -next so
that others can conveniently review and test it (hint).

> > 
> > I'm thinking that we should approach this in two stages: all the new
> > shrinker stuff separated from the memcg_kmem work.  So we merge
> > everything up to "shrinker: Kill old ->shrink API" and then continue to
> > work on the memcg things?
> > 
> 
> I agree with this; the shrinker part got a very thorough review from Mel
> recently. I do need to send you the fix for the bcache driver (or the
> whole thing, if you prefer), and address whatever comments you have.
> 
> Please note that, as I mentioned in the opening letter, I have two
> follow-up patches for memcg (one of them allows us to use the shrinker
> infrastructure to reduce the value of kmem.limit, and the other one
> flushes the caches upon destruction). I haven't included them in the
> series because it is already huge, and I believe that, being new, they
> would not get the review they deserve. Splitting it in two would allow
> me to include them in a smaller series.
> 
> I will go over your comments in a couple of hours. Please just advise me
> how you would like me to proceed with this logistically (new submission,
> fixes, for which patches, etc.).

New everything, please.  There's no hurry - linux-next is going on
holidays for a week.

The shrinker stuff seems sensible and straightforward and I expect we
can proceed with that at the normal pace.  The memcg changes struck me
as being hairy as hell and I'd really like to see the other memcg
people go through it carefully.

Of course, "new series" doesn't give you an easily accessible tree to
target.  I could drop it all again to give you a clean shot at
tomorrow's -next?


[parent not found: <20130606001855.48d9da2e.akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org>]

* Re: [PATCH v10 00/35] kmemcg shrinkers
       [not found]         ` <20130606001855.48d9da2e.akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org>
@ 2013-06-06  7:37           ` Glauber Costa
  2013-06-06  7:47             ` Andrew Morton
  0 siblings, 1 reply; 103+ messages in thread
From: Glauber Costa @ 2013-06-06  7:37 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Glauber Costa, linux-fsdevel, Mel Gorman, Dave Chinner, linux-mm,
	cgroups, kamezawa.hiroyu, Michal Hocko, Johannes Weiner, hughd,
	Greg Thelen

On 06/06/2013 11:18 AM, Andrew Morton wrote:
> The shrinker stuff seems sensible and straightforward and I expect we
> can proceed with that at the normal pace.  The memcg changes struck me
> as being hairy as hell and I'd really like to see the other memcg
> people go through it carefully.
> 
> Of course, "new series" doesn't give you an easily accessible tree to
> target.  I could drop it all again to give you a clean shot at
> tomorrow's -next?
If you just keep them on top (not really sure how hard it is for you), I
can just remove them all and apply a new series on top.


* Re: [PATCH v10 00/35] kmemcg shrinkers
  2013-06-06  7:37           ` Glauber Costa
@ 2013-06-06  7:47             ` Andrew Morton
  2013-06-06  7:59               ` Glauber Costa
  0 siblings, 1 reply; 103+ messages in thread
From: Andrew Morton @ 2013-06-06  7:47 UTC (permalink / raw)
  To: Glauber Costa
  Cc: Glauber Costa, linux-fsdevel, Mel Gorman, Dave Chinner, linux-mm,
	cgroups, kamezawa.hiroyu, Michal Hocko, Johannes Weiner, hughd,
	Greg Thelen

On Thu, 6 Jun 2013 11:37:03 +0400 Glauber Costa <glommer@parallels.com> wrote:

> On 06/06/2013 11:18 AM, Andrew Morton wrote:
> > The shrinker stuff seems sensible and straightforward and I expect we
> > can proceed with that at the normal pace.  The memcg changes struck me
> > as being hairy as hell and I'd really like to see the other memcg
> > people go through it carefully.
> > 
> > Of course, "new series" doesn't give you an easily accessible tree to
> > target.  I could drop it all again to give you a clean shot at
> > tomorrow's -next?
> If you just keep them on top (not really sure how hard it is for you), I
> can just remove them all and apply a new series on top.

I could do that but then anyone else who wants to test the code has to
do the same thing.  Dropping them out of -next does seem the clean
approach.

We still need to work out what to do with
memcg-debugging-facility-to-access-dangling-memcgs.patch btw.  See
other email.



* Re: [PATCH v10 00/35] kmemcg shrinkers
  2013-06-06  7:47             ` Andrew Morton
@ 2013-06-06  7:59               ` Glauber Costa
  0 siblings, 0 replies; 103+ messages in thread
From: Glauber Costa @ 2013-06-06  7:59 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Glauber Costa, linux-fsdevel, Mel Gorman, Dave Chinner, linux-mm,
	cgroups, kamezawa.hiroyu, Michal Hocko, Johannes Weiner, hughd,
	Greg Thelen

On 06/06/2013 11:47 AM, Andrew Morton wrote:
> On Thu, 6 Jun 2013 11:37:03 +0400 Glauber Costa <glommer@parallels.com> wrote:
> 
>> On 06/06/2013 11:18 AM, Andrew Morton wrote:
>>> The shrinker stuff seems sensible and straightforward and I expect we
>>> can proceed with that at the normal pace.  The memcg changes struck me
>>> as being hairy as hell and I'd really like to see the other memcg
>>> people go through it carefully.
>>>
>>> Of course, "new series" doesn't give you an easily accessible tree to
>>> target.  I could drop it all again to give you a clean shot at
>>> tomorrow's -next?
>> If you just keep them on top (not really sure how hard it is for you), I
>> can just remove them all and apply a new series on top.
> 
> I could do that but then anyone else who wants to test the code has to
> do the same thing.  Dropping them out of -next does seem the clean
> approach.
> 
> We still need to work out what to do with
> memcg-debugging-facility-to-access-dangling-memcgs.patch btw.  See
> other email.
> 
Ok. I am using half of that as official infrastructure now. I will reply
to your comments when I reach that.


[parent not found: <20130605160721.da995af82eb247ccf8f8537f-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org>]

* Re: [PATCH v10 00/35] kmemcg shrinkers
       [not found]   ` <20130605160721.da995af82eb247ccf8f8537f-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org>
@ 2013-06-07 14:15     ` Michal Hocko
  0 siblings, 0 replies; 103+ messages in thread
From: Michal Hocko @ 2013-06-07 14:15 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Glauber Costa, linux-fsdevel, Mel Gorman, Dave Chinner, linux-mm,
	cgroups, kamezawa.hiroyu, Johannes Weiner, hughd, Greg Thelen

On Wed 05-06-13 16:07:21, Andrew Morton wrote:
[...]
> This patchset is huge.

Yes, it is really huge, which has pushed it lower on my todo list because
other things always preempted it.
 
> My overall take is that the patchset is massive and intrusive and scary
> :( I'd like to see more evidence that the memcg people (mhocko, hannes,
> kamezawa etc) have spent quality time reviewing and testing this code. 
> There really is a lot of it!

I am only following the discussions right now, and I haven't even been able
to catch up on those. I plan to review the memcg parts soon, but cannot give
any estimate. I am sorry about that, but time doesn't allow it.
-- 
Michal Hocko
SUSE Labs


Thread overview: 103+ messages
2013-06-03 19:29 [PATCH v10 00/35] kmemcg shrinkers Glauber Costa
2013-06-03 19:29 ` [PATCH v10 02/35] super: fix calculation of shrinkable objects for small numbers Glauber Costa
     [not found] ` <1370287804-3481-1-git-send-email-glommer-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org>
2013-06-03 19:29   ` [PATCH v10 01/35] fs: bump inode and dentry counters to long Glauber Costa
2013-06-03 19:29   ` [PATCH v10 03/35] dcache: convert dentry_stat.nr_unused to per-cpu counters Glauber Costa
2013-06-05 23:07     ` Andrew Morton
2013-06-06  1:45       ` Dave Chinner
2013-06-06  2:48         ` Andrew Morton
2013-06-06  4:02           ` Dave Chinner
2013-06-06 12:40           ` Glauber Costa
2013-06-06 22:25             ` Andrew Morton
     [not found]               ` <20130606152546.52f614d852da32d28a0b460f-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org>
2013-06-06 23:42                 ` Dave Chinner
2013-06-07  6:03                   ` Glauber Costa
2013-06-03 19:29   ` [PATCH v10 04/35] dentry: move to per-sb LRU locks Glauber Costa
2013-06-05 23:07     ` Andrew Morton
2013-06-06  1:56       ` Dave Chinner
2013-06-06  8:03       ` Glauber Costa
2013-06-06 12:51         ` Glauber Costa
2013-06-03 19:29   ` [PATCH v10 06/35] mm: new shrinker API Glauber Costa
2013-06-05 23:07     ` Andrew Morton
     [not found]       ` <20130605160751.499f0ebb35e89a80dd7931f2-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org>
2013-06-06  7:58         ` Glauber Costa
2013-06-03 19:29   ` [PATCH v10 07/35] shrinker: convert superblock shrinkers to new API Glauber Costa
2013-06-03 19:29   ` [PATCH v10 08/35] list: add a new LRU list type Glauber Costa
     [not found]     ` <1370287804-3481-9-git-send-email-glommer-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org>
2013-06-05 23:07       ` Andrew Morton
2013-06-06  2:49         ` Dave Chinner
2013-06-06  3:05           ` Andrew Morton
     [not found]             ` <20130605200554.d4dae16f.akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org>
2013-06-06  4:44               ` Dave Chinner
2013-06-06  7:04                 ` Andrew Morton
2013-06-06  9:03                   ` Glauber Costa
2013-06-06  9:55                     ` Andrew Morton
     [not found]                       ` <20130606025517.8400c279.akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org>
2013-06-06 11:47                         ` Glauber Costa
2013-06-06 14:28           ` Glauber Costa
2013-06-06  8:10         ` Glauber Costa
2013-06-03 19:29   ` [PATCH v10 09/35] inode: convert inode lru list to generic lru list code Glauber Costa
2013-06-03 19:29   ` [PATCH v10 10/35] dcache: convert to use new lru list infrastructure Glauber Costa
2013-06-03 19:29   ` [PATCH v10 11/35] list_lru: per-node " Glauber Costa
2013-06-05 23:08     ` Andrew Morton
2013-06-06  3:21       ` Dave Chinner
2013-06-06  3:51         ` Andrew Morton
2013-06-06  8:21         ` Glauber Costa
2013-06-06 16:15       ` Glauber Costa
2013-06-06 16:48         ` Andrew Morton
2013-06-03 19:29   ` [PATCH v10 12/35] shrinker: add node awareness Glauber Costa
2013-06-05 23:08     ` Andrew Morton
2013-06-06  3:26       ` Dave Chinner
2013-06-06  3:54         ` Andrew Morton
     [not found]       ` <20130605160810.5b203c3368b9df7d087ee3b1-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org>
2013-06-06  8:23         ` Glauber Costa
2013-06-03 19:29   ` [PATCH v10 13/35] vmscan: per-node deferred work Glauber Costa
2013-06-05 23:08     ` Andrew Morton
2013-06-06  3:37       ` Dave Chinner
2013-06-06  4:59         ` Dave Chinner
2013-06-06  7:12           ` Andrew Morton
     [not found]       ` <20130605160815.fb69f7d4d1736455727fc669-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org>
2013-06-06  9:00         ` Glauber Costa
2013-06-03 19:29   ` [PATCH v10 14/35] list_lru: per-node API Glauber Costa
2013-06-03 19:29   ` [PATCH v10 15/35] fs: convert inode and dentry shrinking to be node aware Glauber Costa
2013-06-03 19:29   ` [PATCH v10 16/35] xfs: convert buftarg LRU to generic code Glauber Costa
2013-06-03 19:29   ` [PATCH v10 17/35] xfs: rework buffer dispose list tracking Glauber Costa
2013-06-03 19:29   ` [PATCH v10 18/35] xfs: convert dquot cache lru to list_lru Glauber Costa
2013-06-03 19:29   ` [PATCH v10 19/35] fs: convert fs shrinkers to new scan/count API Glauber Costa
2013-06-03 19:29   ` [PATCH v10 21/35] i915: bail out earlier when shrinker cannot acquire mutex Glauber Costa
2013-06-03 19:29   ` [PATCH v10 22/35] shrinker: convert remaining shrinkers to count/scan API Glauber Costa
2013-06-05 23:08     ` Andrew Morton
     [not found]       ` <20130605160821.59adf9ad4efe48144fd9e237-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org>
2013-06-06  3:41         ` Dave Chinner
2013-06-06  8:27           ` Glauber Costa
2013-06-03 19:29   ` [PATCH v10 23/35] hugepage: convert huge zero page shrinker to new shrinker API Glauber Costa
2013-06-03 19:29   ` [PATCH v10 24/35] shrinker: Kill old ->shrink API Glauber Costa
2013-06-03 19:29   ` [PATCH v10 25/35] vmscan: also shrink slab in memcg pressure Glauber Costa
2013-06-03 19:29   ` [PATCH v10 26/35] memcg,list_lru: duplicate LRUs upon kmemcg creation Glauber Costa
2013-06-05 23:08     ` Andrew Morton
     [not found]       ` <20130605160828.1ec9f3538258d9a6d6c74083-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org>
2013-06-06  8:52         ` Glauber Costa
2013-06-03 19:29   ` [PATCH v10 27/35] lru: add an element to a memcg list Glauber Costa
2013-06-05 23:08     ` Andrew Morton
2013-06-06  8:44       ` Glauber Costa
2013-06-03 19:29   ` [PATCH v10 28/35] list_lru: per-memcg walks Glauber Costa
2013-06-05 23:08     ` Andrew Morton
     [not found]       ` <20130605160837.0d0a35fbd4b32d7ad02f7136-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org>
2013-06-06  8:37         ` Glauber Costa
2013-06-03 19:29   ` [PATCH v10 29/35] memcg: per-memcg kmem shrinking Glauber Costa
2013-06-05 23:08     ` Andrew Morton
     [not found]       ` <20130605160841.909420c06bfde62039489d2e-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org>
2013-06-06  8:35         ` Glauber Costa
2013-06-06  9:49           ` Andrew Morton
     [not found]             ` <20130606024906.e5b85b28.akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org>
2013-06-06 12:09               ` Glauber Costa
     [not found]                 ` <51B07BEC.9010205-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
2013-06-06 22:23                   ` Andrew Morton
2013-06-07  6:10                     ` Glauber Costa
2013-06-03 19:29   ` [PATCH v10 30/35] memcg: scan cache objects hierarchically Glauber Costa
2013-06-05 23:08     ` Andrew Morton
2013-06-03 19:30   ` [PATCH v10 32/35] super: targeted memcg reclaim Glauber Costa
2013-06-03 19:30   ` [PATCH v10 33/35] memcg: move initialization to memcg creation Glauber Costa
2013-06-03 19:30   ` [PATCH v10 34/35] vmpressure: in-kernel notifications Glauber Costa
2013-06-03 19:30   ` [PATCH v10 35/35] memcg: reap dead memcgs upon global memory pressure Glauber Costa
2013-06-05 23:09     ` Andrew Morton
2013-06-06  8:33       ` Glauber Costa
2013-06-03 19:29 ` [PATCH v10 05/35] dcache: remove dentries from LRU before putting on dispose list Glauber Costa
2013-06-05 23:07   ` Andrew Morton
2013-06-06  8:04     ` Glauber Costa
2013-06-03 19:29 ` [PATCH v10 20/35] drivers: convert shrinkers to new count/scan API Glauber Costa
2013-06-03 19:30 ` [PATCH v10 31/35] vmscan: take at least one pass with shrinkers Glauber Costa
2013-06-05 23:07 ` [PATCH v10 00/35] kmemcg shrinkers Andrew Morton
2013-06-06  3:44   ` Dave Chinner
2013-06-06  5:51   ` Glauber Costa
     [not found]     ` <51B02347.60809-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
2013-06-06  7:18       ` Andrew Morton
     [not found]         ` <20130606001855.48d9da2e.akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org>
2013-06-06  7:37           ` Glauber Costa
2013-06-06  7:47             ` Andrew Morton
2013-06-06  7:59               ` Glauber Costa
     [not found]   ` <20130605160721.da995af82eb247ccf8f8537f-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org>
2013-06-07 14:15     ` Michal Hocko
