[patch 03/35] mm: implement per-zone shrinker

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

* [patch 03/35] mm: implement per-zone shrinker
       [not found] <20101019034216.319085068@kernel.dk>
@ 2010-10-19  3:42 ` npiggin
  2010-10-19  4:49   ` KOSAKI Motohiro
  2010-10-19  3:42 ` [patch 04/35] vfs: convert inode and dentry caches to " npiggin
       [not found] ` <20101019034658.744504135@kernel.dk>
  2 siblings, 1 reply; 8+ messages in thread
From: npiggin @ 2010-10-19  3:42 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel; +Cc: linux-mm

[-- Attachment #1: mm-zone-shrinker.patch --]
[-- Type: text/plain, Size: 23713 bytes --]

Allow the shrinker to do per-zone shrinking. This requires adding a zone
argument to the shrinker callback and calling shrinkers for each zone
scanned. The logic somewhat in vmscan code gets simpler: the shrinkers are
invoked for each zone, around the same time as the pagecache scanner.
Zone reclaim needed a bit of surgery to cope with the change, but the
idea is the same.

But all shrinkers are currently global-based, so they need a way to
convert per-zone ratios into global scan ratios. So seeing as we are
changing the shrinker API anyway, let's reorganise it to make it saner.

So the shrinker callback is passed:
- the number of pagecache pages scanned in this zone
- the number of pagecache pages in this zone
- the total number of pagecache pages in all zones to be scanned

The shrinker is now completely responsible for calculating and batching
(given helpers), which provides better flexibility. vmscan helper functions
are provided to accumulate these ratios, and help with batching.

Finally, add some fixed-point scaling to the ratio, which helps rounding.

The old shrinker API remains for unconverted code. There is no urgency
to convert them at once.

Cc: linux-mm@kvack.org
Signed-off-by: Nick Piggin <npiggin@kernel.dk>

---
 fs/drop_caches.c    |    6 
 include/linux/mm.h  |   43 ++++++
 mm/memory-failure.c |   10 -
 mm/vmscan.c         |  327 +++++++++++++++++++++++++++++++++++++---------------
 4 files changed, 279 insertions(+), 107 deletions(-)

Index: linux-2.6/include/linux/mm.h
===================================================================
--- linux-2.6.orig/include/linux/mm.h	2010-10-19 14:19:40.000000000 +1100
+++ linux-2.6/include/linux/mm.h	2010-10-19 14:36:48.000000000 +1100
@@ -997,6 +997,10 @@
 /*
  * A callback you can register to apply pressure to ageable caches.
  *
+ * 'shrink_zone' is the new shrinker API. It is to be used in preference
+ * to 'shrink'. One must point to a shrinker function, the other must
+ * be NULL. See 'shrink_slab' for details about the shrink_zone API.
+ *
  * 'shrink' is passed a count 'nr_to_scan' and a 'gfpmask'.  It should
  * look through the least-recently-used 'nr_to_scan' entries and
  * attempt to free them up.  It should return the number of objects
@@ -1013,13 +1017,53 @@
 	int (*shrink)(struct shrinker *, int nr_to_scan, gfp_t gfp_mask);
 	int seeks;	/* seeks to recreate an obj */
 
+	/*
+	 * shrink_zone - slab shrinker callback for reclaimable objects
+	 * @shrink: this struct shrinker
+	 * @zone: zone to scan
+	 * @scanned: pagecache lru pages scanned in zone
+	 * @total: total pagecache lru pages in zone
+	 * @global: global pagecache lru pages (for zone-unaware shrinkers)
+	 * @flags: shrinker flags
+	 * @gfp_mask: gfp context we are operating within
+	 *
+	 * The shrinkers are responsible for calculating the appropriate
+	 * pressure to apply, batching up scanning (and cond_resched,
+	 * cond_resched_lock etc), and updating events counters including
+	 * count_vm_event(SLABS_SCANNED, nr).
+	 *
+	 * This approach gives flexibility to the shrinkers. They know best how
+	 * to do batching, how much time between cond_resched is appropriate,
+	 * what statistics to increment, etc.
+	 */
+	void (*shrink_zone)(struct shrinker *shrink,
+		struct zone *zone, unsigned long scanned,
+		unsigned long total, unsigned long global,
+		unsigned long flags, gfp_t gfp_mask);
+
 	/* These are for internal use */
 	struct list_head list;
 	long nr;	/* objs pending delete */
 };
+
+/* Constants for use by old shrinker API */
 #define DEFAULT_SEEKS 2 /* A good number if you don't know better. */
+
+/* Constants for use by new shrinker API */
+/*
+ * SHRINK_DEFAULT_SEEKS is shifted by 4 to match an arbitrary constant
+ * in the old shrinker code.
+ */
+#define SHRINK_FACTOR	(128UL) /* Fixed point shift */
+#define SHRINK_DEFAULT_SEEKS	(SHRINK_FACTOR*DEFAULT_SEEKS/4)
+#define SHRINK_BATCH	128	/* A good number if you don't know better */
+
 extern void register_shrinker(struct shrinker *);
 extern void unregister_shrinker(struct shrinker *);
+extern void shrinker_add_scan(unsigned long *dst,
+				unsigned long scanned, unsigned long total,
+				unsigned long objects, unsigned int ratio);
+extern unsigned long shrinker_do_scan(unsigned long *dst, unsigned long batch);
 
 int vma_wants_writenotify(struct vm_area_struct *vma);
 
@@ -1443,8 +1487,7 @@
 
 int drop_caches_sysctl_handler(struct ctl_table *, int,
 					void __user *, size_t *, loff_t *);
-unsigned long shrink_slab(unsigned long scanned, gfp_t gfp_mask,
-			unsigned long lru_pages);
+void shrink_all_slab(void);
 
 #ifndef CONFIG_MMU
 #define randomize_va_space 0
Index: linux-2.6/mm/vmscan.c
===================================================================
--- linux-2.6.orig/mm/vmscan.c	2010-10-19 14:19:40.000000000 +1100
+++ linux-2.6/mm/vmscan.c	2010-10-19 14:33:38.000000000 +1100
@@ -74,6 +74,9 @@
 	/* Can pages be swapped as part of reclaim? */
 	int may_swap;
 
+	/* Can slab pages be reclaimed? */
+	int may_reclaim_slab;
+
 	int swappiness;
 
 	int order;
@@ -163,6 +166,8 @@
  */
 void register_shrinker(struct shrinker *shrinker)
 {
+	BUG_ON(shrinker->shrink && shrinker->shrink_zone);
+	BUG_ON(!shrinker->shrink && !shrinker->shrink_zone);
 	shrinker->nr = 0;
 	down_write(&shrinker_rwsem);
 	list_add_tail(&shrinker->list, &shrinker_list);
@@ -181,43 +186,101 @@
 }
 EXPORT_SYMBOL(unregister_shrinker);
 
-#define SHRINK_BATCH 128
 /*
- * Call the shrink functions to age shrinkable caches
+ * shrinker_add_scan - accumulate shrinker scan
+ * @dst: scan counter variable
+ * @scanned: pagecache pages scanned
+ * @total: total pagecache objects
+ * @tot: total objects in this cache
+ * @ratio: ratio of pagecache value to object value
  *
- * Here we assume it costs one seek to replace a lru page and that it also
- * takes a seek to recreate a cache object.  With this in mind we age equal
- * percentages of the lru and ageable caches.  This should balance the seeks
- * generated by these structures.
+ * shrinker_add_scan accumulates a number of objects to scan into @dst,
+ * based on the following ratio:
  *
- * If the vm encountered mapped pages on the LRU it increase the pressure on
- * slab to avoid swapping.
+ * proportion = scanned / total        // proportion of pagecache scanned
+ * obj_prop   = objects * proportion   // same proportion of objects
+ * to_scan    = obj_prop / ratio       // modify by ratio
+ * *dst += (total / scanned)           // accumulate to dst
  *
- * We do weird things to avoid (scanned*seeks*entries) overflowing 32 bits.
+ * The ratio is a fixed point integer with a factor SHRINK_FACTOR.
+ * Higher ratios give objects higher value.
  *
- * `lru_pages' represents the number of on-LRU pages in all the zones which
- * are eligible for the caller's allocation attempt.  It is used for balancing
- * slab reclaim versus page reclaim.
+ * @dst is also fixed point, so cannot be used as a simple count.
+ * shrinker_do_scan will take care of that for us.
  *
- * Returns the number of slab objects which we shrunk.
+ * There is no synchronisation here, which is fine really. A rare lost
+ * update is no huge deal in reclaim code.
  */
-unsigned long shrink_slab(unsigned long scanned, gfp_t gfp_mask,
-			unsigned long lru_pages)
+void shrinker_add_scan(unsigned long *dst,
+			unsigned long scanned, unsigned long total,
+			unsigned long objects, unsigned int ratio)
 {
-	struct shrinker *shrinker;
-	unsigned long ret = 0;
+	unsigned long long delta;
 
-	if (scanned == 0)
-		scanned = SWAP_CLUSTER_MAX;
+	delta = (unsigned long long)scanned * objects;
+	delta *= SHRINK_FACTOR;
+	do_div(delta, total + 1);
+	delta *= SHRINK_FACTOR; /* ratio is also in SHRINK_FACTOR units */
+	do_div(delta, ratio + 1);
 
-	if (!down_read_trylock(&shrinker_rwsem))
-		return 1;	/* Assume we'll be able to shrink next time */
+	/*
+	 * Avoid risking looping forever due to too large nr value:
+	 * never try to free more than twice the estimate number of
+	 * freeable entries.
+	 */
+	*dst += delta;
+
+	if (*dst / SHRINK_FACTOR > objects)
+		*dst = objects * SHRINK_FACTOR;
+}
+EXPORT_SYMBOL(shrinker_add_scan);
+
+/*
+ * shrinker_do_scan - scan a batch of objects
+ * @dst: scan counter
+ * @batch: number of objects to scan in this batch
+ * @Returns: number of objects to scan
+ *
+ * shrinker_do_scan takes the scan counter accumulated by shrinker_add_scan,
+ * and decrements it by @batch if it is greater than batch and returns batch.
+ * Otherwise returns 0. The caller should use the return value as the number
+ * of objects to scan next.
+ *
+ * Between shrinker_do_scan calls, the caller should drop locks if possible
+ * and call cond_resched.
+ *
+ * Note, @dst is a fixed point scaled integer. See shrinker_add_scan.
+ *
+ * Like shrinker_add_scan, shrinker_do_scan is not SMP safe, but it doesn't
+ * really need to be.
+ */
+unsigned long shrinker_do_scan(unsigned long *dst, unsigned long batch)
+{
+	unsigned long nr = ACCESS_ONCE(*dst);
+	if (nr < batch * SHRINK_FACTOR)
+		return 0;
+	*dst = nr - batch * SHRINK_FACTOR;
+	return batch;
+}
+EXPORT_SYMBOL(shrinker_do_scan);
+
+#define SHRINK_BATCH 128
+/*
+ * Scan the deprecated shrinkers. This will go away soon in favour of
+ * converting everybody to new shrinker API.
+ */
+static void shrink_slab_old(unsigned long scanned, gfp_t gfp_mask,
+			unsigned long lru_pages)
+{
+	struct shrinker *shrinker;
 
 	list_for_each_entry(shrinker, &shrinker_list, list) {
 		unsigned long long delta;
 		unsigned long total_scan;
 		unsigned long max_pass;
 
+		if (!shrinker->shrink)
+			continue;
 		max_pass = (*shrinker->shrink)(shrinker, 0, gfp_mask);
 		delta = (4 * scanned) / shrinker->seeks;
 		delta *= max_pass;
@@ -244,15 +307,11 @@
 		while (total_scan >= SHRINK_BATCH) {
 			long this_scan = SHRINK_BATCH;
 			int shrink_ret;
-			int nr_before;
 
-			nr_before = (*shrinker->shrink)(shrinker, 0, gfp_mask);
 			shrink_ret = (*shrinker->shrink)(shrinker, this_scan,
 								gfp_mask);
 			if (shrink_ret == -1)
 				break;
-			if (shrink_ret < nr_before)
-				ret += nr_before - shrink_ret;
 			count_vm_events(SLABS_SCANNED, this_scan);
 			total_scan -= this_scan;
 
@@ -261,8 +320,75 @@
 
 		shrinker->nr += total_scan;
 	}
+}
+/*
+ * shrink_slab - Call the shrink functions to age shrinkable caches
+ * @zone: the zone we are currently reclaiming from
+ * @scanned: how many pagecache pages were scanned in this zone
+ * @total: total number of reclaimable pagecache pages in this zone
+ * @global: total number of reclaimable pagecache pages in the system
+ * @gfp_mask: gfp context that we are in
+ *
+ * Slab shrinkers should scan their objects in a proportion to the ratio of
+ * scanned to total pagecache pages in this zone, modified by a "cost"
+ * constant.
+ *
+ * For example, we have a slab cache with 100 reclaimable objects in a
+ * particular zone, and the cost of reclaiming an object is determined to be
+ * twice as expensive as reclaiming a pagecache page (due to likelihood and
+ * cost of reconstruction). If we have 200 reclaimable pagecache pages in that
+ * zone particular zone, and scan 20 of them (10%), we should scan 5% (5) of
+ * the objects in our slab cache.
+ *
+ * If we have a single global list of objects and no per-zone lists, the
+ * global count of objects can be used to find the correct ratio to scan.
+ *
+ * See shrinker_add_scan and shrinker_do_scan for helper functions and
+ * details on how to calculate these numbers.
+ */
+static void shrink_slab(struct zone *zone, unsigned long scanned,
+			unsigned long total, unsigned long global,
+			gfp_t gfp_mask)
+{
+	struct shrinker *shrinker;
+
+	if (scanned == 0)
+		scanned = SWAP_CLUSTER_MAX;
+
+	if (!down_read_trylock(&shrinker_rwsem))
+		return;
+
+	/* do a global shrink with the old shrinker API */
+	shrink_slab_old(scanned, gfp_mask, global);
+
+	list_for_each_entry(shrinker, &shrinker_list, list) {
+		if (!shrinker->shrink_zone)
+			continue;
+		(*shrinker->shrink_zone)(shrinker, zone, scanned,
+					total, global, 0, gfp_mask);
+	}
 	up_read(&shrinker_rwsem);
-	return ret;
+}
+
+void shrink_all_slab(void)
+{
+	struct zone *zone;
+	struct reclaim_state reclaim_state;
+
+	current->reclaim_state = &reclaim_state;
+	do {
+		reclaim_state.reclaimed_slab = 0;
+		/*
+		 * Use "100" for "scanned", "total", and "global", so
+		 * that shrinkers scan a large proportion of their
+		 * objects. 100 rather than 1 in order to reduce rounding
+		 * errors.
+		 */
+		for_each_populated_zone(zone)
+			shrink_slab(zone, 100, 100, 100, GFP_KERNEL);
+	} while (reclaim_state.reclaimed_slab);
+
+	current->reclaim_state = NULL;
 }
 
 static inline int is_page_cache_freeable(struct page *page)
@@ -1740,18 +1866,24 @@
  * This is a basic per-zone page freer.  Used by both kswapd and direct reclaim.
  */
 static void shrink_zone(int priority, struct zone *zone,
-				struct scan_control *sc)
+		struct scan_control *sc, unsigned long global_lru_pages)
 {
 	unsigned long nr[NR_LRU_LISTS];
 	unsigned long nr_to_scan;
 	enum lru_list l;
 	unsigned long nr_reclaimed = sc->nr_reclaimed;
 	unsigned long nr_to_reclaim = sc->nr_to_reclaim;
+	unsigned long nr_scanned = sc->nr_scanned;
+	unsigned long lru_pages = 0;
 
 	get_scan_count(zone, sc, nr, priority);
 
 	set_lumpy_reclaim_mode(priority, sc);
 
+	/* Used by slab shrinking, below */
+	if (sc->may_reclaim_slab)
+		lru_pages = zone_reclaimable_pages(zone);
+
 	while (nr[LRU_INACTIVE_ANON] || nr[LRU_ACTIVE_FILE] ||
 					nr[LRU_INACTIVE_FILE]) {
 		for_each_evictable_lru(l) {
@@ -1776,8 +1908,6 @@
 			break;
 	}
 
-	sc->nr_reclaimed = nr_reclaimed;
-
 	/*
 	 * Even if we did not try to evict anon pages at all, we want to
 	 * rebalance the anon lru active/inactive ratio.
@@ -1785,6 +1915,23 @@
 	if (inactive_anon_is_low(zone, sc) && nr_swap_pages > 0)
 		shrink_active_list(SWAP_CLUSTER_MAX, zone, sc, priority, 0);
 
+	/*
+	 * Don't shrink slabs when reclaiming memory from
+	 * over limit cgroups
+	 */
+	if (sc->may_reclaim_slab) {
+		struct reclaim_state *reclaim_state = current->reclaim_state;
+
+		shrink_slab(zone, sc->nr_scanned - nr_scanned,
+			lru_pages, global_lru_pages, sc->gfp_mask);
+		if (reclaim_state) {
+			nr_reclaimed += reclaim_state->reclaimed_slab;
+			reclaim_state->reclaimed_slab = 0;
+		}
+	}
+
+	sc->nr_reclaimed = nr_reclaimed;
+
 	throttle_vm_writeout(sc->gfp_mask);
 }
 
@@ -1805,7 +1952,7 @@
  * scan then give up on it.
  */
 static void shrink_zones(int priority, struct zonelist *zonelist,
-					struct scan_control *sc)
+		struct scan_control *sc, unsigned long global_lru_pages)
 {
 	struct zoneref *z;
 	struct zone *zone;
@@ -1825,7 +1972,7 @@
 				continue;	/* Let kswapd poll it */
 		}
 
-		shrink_zone(priority, zone, sc);
+		shrink_zone(priority, zone, sc, global_lru_pages);
 	}
 }
 
@@ -1882,7 +2029,6 @@
 {
 	int priority;
 	unsigned long total_scanned = 0;
-	struct reclaim_state *reclaim_state = current->reclaim_state;
 	struct zoneref *z;
 	struct zone *zone;
 	unsigned long writeback_threshold;
@@ -1894,30 +2040,20 @@
 		count_vm_event(ALLOCSTALL);
 
 	for (priority = DEF_PRIORITY; priority >= 0; priority--) {
-		sc->nr_scanned = 0;
-		if (!priority)
-			disable_swap_token();
-		shrink_zones(priority, zonelist, sc);
-		/*
-		 * Don't shrink slabs when reclaiming memory from
-		 * over limit cgroups
-		 */
-		if (scanning_global_lru(sc)) {
-			unsigned long lru_pages = 0;
-			for_each_zone_zonelist(zone, z, zonelist,
-					gfp_zone(sc->gfp_mask)) {
-				if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL))
-					continue;
+		unsigned long lru_pages = 0;
 
-				lru_pages += zone_reclaimable_pages(zone);
-			}
+		for_each_zone_zonelist(zone, z, zonelist,
+				gfp_zone(sc->gfp_mask)) {
+			if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL))
+				continue;
 
-			shrink_slab(sc->nr_scanned, sc->gfp_mask, lru_pages);
-			if (reclaim_state) {
-				sc->nr_reclaimed += reclaim_state->reclaimed_slab;
-				reclaim_state->reclaimed_slab = 0;
-			}
+			lru_pages += zone_reclaimable_pages(zone);
 		}
+
+		sc->nr_scanned = 0;
+		if (!priority)
+			disable_swap_token();
+		shrink_zones(priority, zonelist, sc, lru_pages);
 		total_scanned += sc->nr_scanned;
 		if (sc->nr_reclaimed >= sc->nr_to_reclaim)
 			goto out;
@@ -1975,6 +2111,7 @@
 		.nr_to_reclaim = SWAP_CLUSTER_MAX,
 		.may_unmap = 1,
 		.may_swap = 1,
+		.may_reclaim_slab = 1,
 		.swappiness = vm_swappiness,
 		.order = order,
 		.mem_cgroup = NULL,
@@ -2004,6 +2141,7 @@
 		.may_writepage = !laptop_mode,
 		.may_unmap = 1,
 		.may_swap = !noswap,
+		.may_reclaim_slab = 0,
 		.swappiness = swappiness,
 		.order = 0,
 		.mem_cgroup = mem,
@@ -2022,7 +2160,7 @@
 	 * will pick up pages from other mem cgroup's as well. We hack
 	 * the priority and make it zero.
 	 */
-	shrink_zone(0, zone, &sc);
+	shrink_zone(0, zone, &sc, zone_reclaimable_pages(zone));
 
 	trace_mm_vmscan_memcg_softlimit_reclaim_end(sc.nr_reclaimed);
 
@@ -2040,6 +2178,7 @@
 		.may_writepage = !laptop_mode,
 		.may_unmap = 1,
 		.may_swap = !noswap,
+		.may_reclaim_slab = 0,
 		.nr_to_reclaim = SWAP_CLUSTER_MAX,
 		.swappiness = swappiness,
 		.order = 0,
@@ -2117,11 +2256,11 @@
 	int priority;
 	int i;
 	unsigned long total_scanned;
-	struct reclaim_state *reclaim_state = current->reclaim_state;
 	struct scan_control sc = {
 		.gfp_mask = GFP_KERNEL,
 		.may_unmap = 1,
 		.may_swap = 1,
+		.may_reclaim_slab = 1,
 		/*
 		 * kswapd doesn't want to be bailed out while reclaim. because
 		 * we want to put equal scanning pressure on each zone.
@@ -2195,7 +2334,6 @@
 		 */
 		for (i = 0; i <= end_zone; i++) {
 			struct zone *zone = pgdat->node_zones + i;
-			int nr_slab;
 
 			if (!populated_zone(zone))
 				continue;
@@ -2217,15 +2355,11 @@
 			 */
 			if (!zone_watermark_ok(zone, order,
 					8*high_wmark_pages(zone), end_zone, 0))
-				shrink_zone(priority, zone, &sc);
-			reclaim_state->reclaimed_slab = 0;
-			nr_slab = shrink_slab(sc.nr_scanned, GFP_KERNEL,
-						lru_pages);
-			sc.nr_reclaimed += reclaim_state->reclaimed_slab;
+				shrink_zone(priority, zone, &sc, lru_pages);
 			total_scanned += sc.nr_scanned;
 			if (zone->all_unreclaimable)
 				continue;
-			if (nr_slab == 0 && !zone_reclaimable(zone))
+			if (!zone_reclaimable(zone))
 				zone->all_unreclaimable = 1;
 			/*
 			 * If we've done a decent amount of scanning and
@@ -2482,6 +2616,7 @@
 		.may_swap = 1,
 		.may_unmap = 1,
 		.may_writepage = 1,
+		.may_reclaim_slab = 1,
 		.nr_to_reclaim = nr_to_reclaim,
 		.hibernation_mode = 1,
 		.swappiness = vm_swappiness,
@@ -2665,13 +2800,14 @@
 		.may_writepage = !!(zone_reclaim_mode & RECLAIM_WRITE),
 		.may_unmap = !!(zone_reclaim_mode & RECLAIM_SWAP),
 		.may_swap = 1,
+		.may_reclaim_slab = 0,
 		.nr_to_reclaim = max_t(unsigned long, nr_pages,
 				       SWAP_CLUSTER_MAX),
 		.gfp_mask = gfp_mask,
 		.swappiness = vm_swappiness,
 		.order = order,
 	};
-	unsigned long nr_slab_pages0, nr_slab_pages1;
+	unsigned long lru_pages, slab_pages;
 
 	cond_resched();
 	/*
@@ -2684,51 +2820,61 @@
 	reclaim_state.reclaimed_slab = 0;
 	p->reclaim_state = &reclaim_state;
 
+	lru_pages = zone_reclaimable_pages(zone);
+	slab_pages = zone_page_state(zone, NR_SLAB_RECLAIMABLE);
+
 	if (zone_pagecache_reclaimable(zone) > zone->min_unmapped_pages) {
+		if (slab_pages > zone->min_slab_pages)
+			sc.may_reclaim_slab = 1;
 		/*
 		 * Free memory by calling shrink zone with increasing
 		 * priorities until we have enough memory freed.
 		 */
 		priority = ZONE_RECLAIM_PRIORITY;
 		do {
-			shrink_zone(priority, zone, &sc);
+			shrink_zone(priority, zone, &sc, lru_pages);
 			priority--;
 		} while (priority >= 0 && sc.nr_reclaimed < nr_pages);
-	}
 
-	nr_slab_pages0 = zone_page_state(zone, NR_SLAB_RECLAIMABLE);
-	if (nr_slab_pages0 > zone->min_slab_pages) {
+	} else if (slab_pages > zone->min_slab_pages) {
 		/*
-		 * shrink_slab() does not currently allow us to determine how
-		 * many pages were freed in this zone. So we take the current
-		 * number of slab pages and shake the slab until it is reduced
-		 * by the same nr_pages that we used for reclaiming unmapped
-		 * pages.
-		 *
-		 * Note that shrink_slab will free memory on all zones and may
-		 * take a long time.
+		 * Scanning slab without pagecache, have to open code
+		 * call to shrink_slab (shirnk_zone drives slab reclaim via
+		 * pagecache scanning, so it isn't set up to shrink slab
+		 * without scanning pagecache.
 		 */
-		for (;;) {
-			unsigned long lru_pages = zone_reclaimable_pages(zone);
-
-			/* No reclaimable slab or very low memory pressure */
-			if (!shrink_slab(sc.nr_scanned, gfp_mask, lru_pages))
-				break;
 
-			/* Freed enough memory */
-			nr_slab_pages1 = zone_page_state(zone,
-							NR_SLAB_RECLAIMABLE);
-			if (nr_slab_pages1 + nr_pages <= nr_slab_pages0)
-				break;
-		}
+		/*
+		 * lru_pages / 10  -- put a 10% pressure on the slab
+		 * which roughly corresponds to ZONE_RECLAIM_PRIORITY
+		 * scanning 1/16th of pagecache.
+		 *
+		 * Global slabs will be shrink at a relatively more
+		 * aggressive rate because we don't calculate the
+		 * global lru size for speed. But they really should
+		 * be converted to per zone slabs if they are important
+		 */
+		shrink_slab(zone, lru_pages / 10, lru_pages, lru_pages,
+				gfp_mask);
 
 		/*
-		 * Update nr_reclaimed by the number of slab pages we
-		 * reclaimed from this zone.
+		 * Although we have a zone based slab shrinker API, some slabs
+		 * are still scanned globally. This means we can't quite
+		 * determine how many pages were freed in this zone by
+		 * checking reclaimed_slab. However the regular shrink_zone
+		 * paths have exactly the same problem that they largely
+		 * ignore. So don't be different.
+		 *
+		 * The situation will improve dramatically as important slabs
+		 * are switched over to using reclaimed_slab after the
+		 * important slabs are converted to using per zone shrinkers.
+		 *
+		 * Note that shrink_slab may free memory on all zones and may
+		 * take a long time, but again switching important slabs to
+		 * zone based shrinkers will solve this problem.
 		 */
-		nr_slab_pages1 = zone_page_state(zone, NR_SLAB_RECLAIMABLE);
-		if (nr_slab_pages1 < nr_slab_pages0)
-			sc.nr_reclaimed += nr_slab_pages0 - nr_slab_pages1;
+		sc.nr_reclaimed += reclaim_state.reclaimed_slab;
+		reclaim_state.reclaimed_slab = 0;
 	}
 
 	p->reclaim_state = NULL;
Index: linux-2.6/fs/drop_caches.c
===================================================================
--- linux-2.6.orig/fs/drop_caches.c	2010-10-19 14:19:40.000000000 +1100
+++ linux-2.6/fs/drop_caches.c	2010-10-19 14:20:01.000000000 +1100
@@ -35,11 +35,7 @@
 
 static void drop_slab(void)
 {
-	int nr_objects;
-
-	do {
-		nr_objects = shrink_slab(1000, GFP_KERNEL, 1000);
-	} while (nr_objects > 10);
+	shrink_all_slab();
 }
 
 int drop_caches_sysctl_handler(ctl_table *table, int write,
Index: linux-2.6/mm/memory-failure.c
===================================================================
--- linux-2.6.orig/mm/memory-failure.c	2010-10-19 14:19:40.000000000 +1100
+++ linux-2.6/mm/memory-failure.c	2010-10-19 14:20:01.000000000 +1100
@@ -231,14 +231,8 @@
 	 * Only all shrink_slab here (which would also
 	 * shrink other caches) if access is not potentially fatal.
 	 */
-	if (access) {
-		int nr;
-		do {
-			nr = shrink_slab(1000, GFP_KERNEL, 1000);
-			if (page_count(p) == 1)
-				break;
-		} while (nr > 10);
-	}
+	if (access)
+		shrink_all_slab();
 }
 EXPORT_SYMBOL_GPL(shake_page);
 


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [patch 03/35] mm: implement per-zone shrinker
  2010-10-19  3:42 ` [patch 03/35] mm: implement per-zone shrinker npiggin
@ 2010-10-19  4:49   ` KOSAKI Motohiro
  2010-10-19  5:33     ` Nick Piggin
  0 siblings, 1 reply; 8+ messages in thread
From: KOSAKI Motohiro @ 2010-10-19  4:49 UTC (permalink / raw)
  To: npiggin; +Cc: kosaki.motohiro, linux-kernel, linux-fsdevel, linux-mm

Hi

> Index: linux-2.6/include/linux/mm.h
> ===================================================================
> --- linux-2.6.orig/include/linux/mm.h	2010-10-19 14:19:40.000000000 +1100
> +++ linux-2.6/include/linux/mm.h	2010-10-19 14:36:48.000000000 +1100
> @@ -997,6 +997,10 @@
>  /*
>   * A callback you can register to apply pressure to ageable caches.
>   *
> + * 'shrink_zone' is the new shrinker API. It is to be used in preference
> + * to 'shrink'. One must point to a shrinker function, the other must
> + * be NULL. See 'shrink_slab' for details about the shrink_zone API.
> + *
>   * 'shrink' is passed a count 'nr_to_scan' and a 'gfpmask'.  It should
>   * look through the least-recently-used 'nr_to_scan' entries and
>   * attempt to free them up.  It should return the number of objects
> @@ -1013,13 +1017,53 @@
>  	int (*shrink)(struct shrinker *, int nr_to_scan, gfp_t gfp_mask);
>  	int seeks;	/* seeks to recreate an obj */
>  
> +	/*
> +	 * shrink_zone - slab shrinker callback for reclaimable objects
> +	 * @shrink: this struct shrinker
> +	 * @zone: zone to scan
> +	 * @scanned: pagecache lru pages scanned in zone
> +	 * @total: total pagecache lru pages in zone
> +	 * @global: global pagecache lru pages (for zone-unaware shrinkers)
> +	 * @flags: shrinker flags
> +	 * @gfp_mask: gfp context we are operating within
> +	 *
> +	 * The shrinkers are responsible for calculating the appropriate
> +	 * pressure to apply, batching up scanning (and cond_resched,
> +	 * cond_resched_lock etc), and updating events counters including
> +	 * count_vm_event(SLABS_SCANNED, nr).
> +	 *
> +	 * This approach gives flexibility to the shrinkers. They know best how
> +	 * to do batching, how much time between cond_resched is appropriate,
> +	 * what statistics to increment, etc.
> +	 */
> +	void (*shrink_zone)(struct shrinker *shrink,
> +		struct zone *zone, unsigned long scanned,
> +		unsigned long total, unsigned long global,
> +		unsigned long flags, gfp_t gfp_mask);

Now we decided to don't remove old (*shrink)() interface and zone unaware
slab users continue to use it. so why do we need global argument?
If only zone aware shrinker user (*shrink_zone)(), we can remove it.

Personally I think we should remove it because a removing makes a clear
message that all shrinker need to implement zone awareness eventually.



--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [patch 03/35] mm: implement per-zone shrinker
  2010-10-19  4:49   ` KOSAKI Motohiro
@ 2010-10-19  5:33     ` Nick Piggin
  2010-10-19  5:40       ` KOSAKI Motohiro
  0 siblings, 1 reply; 8+ messages in thread
From: Nick Piggin @ 2010-10-19  5:33 UTC (permalink / raw)
  To: KOSAKI Motohiro; +Cc: npiggin, linux-kernel, linux-fsdevel, linux-mm

Hi,

On Tue, Oct 19, 2010 at 01:49:12PM +0900, KOSAKI Motohiro wrote:
> Hi
> 
> > Index: linux-2.6/include/linux/mm.h
> > ===================================================================
> > --- linux-2.6.orig/include/linux/mm.h	2010-10-19 14:19:40.000000000 +1100
> > +++ linux-2.6/include/linux/mm.h	2010-10-19 14:36:48.000000000 +1100
> > @@ -997,6 +997,10 @@
> >  /*
> >   * A callback you can register to apply pressure to ageable caches.
> >   *
> > + * 'shrink_zone' is the new shrinker API. It is to be used in preference
> > + * to 'shrink'. One must point to a shrinker function, the other must
> > + * be NULL. See 'shrink_slab' for details about the shrink_zone API.
> 
...

> Now we decided to don't remove old (*shrink)() interface and zone unaware
> slab users continue to use it. so why do we need global argument?
> If only zone aware shrinker user (*shrink_zone)(), we can remove it.
> 
> Personally I think we should remove it because a removing makes a clear
> message that all shrinker need to implement zone awareness eventually.

I agree, I do want to remove the old API, but it's easier to merge if
I just start by adding the new API. It is split out from my previous
patch which does convert all users of the API. When this gets merged, I
will break those out and send them via respective maintainers, then
remove the old API when they're all converted upstream.

Thanks,
Nick

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [patch 03/35] mm: implement per-zone shrinker
  2010-10-19  5:33     ` Nick Piggin
@ 2010-10-19  5:40       ` KOSAKI Motohiro
  0 siblings, 0 replies; 8+ messages in thread
From: KOSAKI Motohiro @ 2010-10-19  5:40 UTC (permalink / raw)
  To: Nick Piggin; +Cc: kosaki.motohiro, linux-kernel, linux-fsdevel, linux-mm

> Hi,
> 
> On Tue, Oct 19, 2010 at 01:49:12PM +0900, KOSAKI Motohiro wrote:
> > Hi
> > 
> > > Index: linux-2.6/include/linux/mm.h
> > > ===================================================================
> > > --- linux-2.6.orig/include/linux/mm.h	2010-10-19 14:19:40.000000000 +1100
> > > +++ linux-2.6/include/linux/mm.h	2010-10-19 14:36:48.000000000 +1100
> > > @@ -997,6 +997,10 @@
> > >  /*
> > >   * A callback you can register to apply pressure to ageable caches.
> > >   *
> > > + * 'shrink_zone' is the new shrinker API. It is to be used in preference
> > > + * to 'shrink'. One must point to a shrinker function, the other must
> > > + * be NULL. See 'shrink_slab' for details about the shrink_zone API.
> > 
> ...
> 
> > Now we decided to don't remove old (*shrink)() interface and zone unaware
> > slab users continue to use it. so why do we need global argument?
> > If only zone aware shrinker user (*shrink_zone)(), we can remove it.
> > 
> > Personally I think we should remove it because a removing makes a clear
> > message that all shrinker need to implement zone awareness eventually.
> 
> I agree, I do want to remove the old API, but it's easier to merge if
> I just start by adding the new API. It is split out from my previous
> patch which does convert all users of the API. When this gets merged, I
> will break those out and send them via respective maintainers, then
> remove the old API when they're all converted upstream.

Ok, I've got. I have no objection this step-by-step development. thanks
quick responce!




--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 8+ messages in thread

* [patch 04/35] vfs: convert inode and dentry caches to per-zone shrinker
       [not found] <20101019034216.319085068@kernel.dk>
  2010-10-19  3:42 ` [patch 03/35] mm: implement per-zone shrinker npiggin
@ 2010-10-19  3:42 ` npiggin
       [not found] ` <20101019034658.744504135@kernel.dk>
  2 siblings, 0 replies; 8+ messages in thread
From: npiggin @ 2010-10-19  3:42 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel; +Cc: linux-mm

[-- Attachment #1: vfs-zone-shrinker.patch --]
[-- Type: text/plain, Size: 5149 bytes --]

Convert inode and dentry caches to per-zone shrinker API in preparation
for doing proper per-zone cache LRU lists. These two caches tend to be
the most important in the system after the pagecache lrus, so making these
per-zone will help to fix up the funny quirks in vmscan code that tries
to reconcile the whole zone-driven scanning with the global slab reclaim.

Cc: linux-mm@kvack.org
Signed-off-by: Nick Piggin <npiggin@kernel.dk>

---
 fs/dcache.c |   31 ++++++++++++++++++++-----------
 fs/inode.c  |   39 ++++++++++++++++++++++++---------------
 2 files changed, 44 insertions(+), 26 deletions(-)

Index: linux-2.6/fs/dcache.c
===================================================================
--- linux-2.6.orig/fs/dcache.c	2010-10-19 14:35:42.000000000 +1100
+++ linux-2.6/fs/dcache.c	2010-10-19 14:36:53.000000000 +1100
@@ -534,7 +534,7 @@
  *
  * This function may fail to free any resources if all the dentries are in use.
  */
-static void prune_dcache(int count)
+static void prune_dcache(unsigned long count)
 {
 	struct super_block *sb, *p = NULL;
 	int w_count;
@@ -887,7 +887,8 @@
 EXPORT_SYMBOL(shrink_dcache_parent);
 
 /*
- * Scan `nr' dentries and return the number which remain.
+ * shrink_dcache_memory scans and reclaims unused dentries. This function
+ * is defined according to the shrinker API described in linux/mm.h.
  *
  * We need to avoid reentering the filesystem if the caller is performing a
  * GFP_NOFS allocation attempt.  One example deadlock is:
@@ -895,22 +896,30 @@
  * ext2_new_block->getblk->GFP->shrink_dcache_memory->prune_dcache->
  * prune_one_dentry->dput->dentry_iput->iput->inode->i_sb->s_op->put_inode->
  * ext2_discard_prealloc->ext2_free_blocks->lock_super->DEADLOCK.
- *
- * In this case we return -1 to tell the caller that we baled.
  */
-static int shrink_dcache_memory(struct shrinker *shrink, int nr, gfp_t gfp_mask)
+static void shrink_dcache_memory(struct shrinker *shrink,
+		struct zone *zone, unsigned long scanned,
+		unsigned long total, unsigned long global,
+		unsigned long flags, gfp_t gfp_mask)
 {
-	if (nr) {
-		if (!(gfp_mask & __GFP_FS))
-			return -1;
+	static unsigned long nr_to_scan;
+	unsigned long nr;
+
+	shrinker_add_scan(&nr_to_scan, scanned, global,
+			dentry_stat.nr_unused,
+			SHRINK_DEFAULT_SEEKS * 100 / sysctl_vfs_cache_pressure);
+	if (!(gfp_mask & __GFP_FS))
+	       return;
+
+	while ((nr = shrinker_do_scan(&nr_to_scan, SHRINK_BATCH))) {
 		prune_dcache(nr);
+		count_vm_events(SLABS_SCANNED, nr);
+		cond_resched();
 	}
-	return (dentry_stat.nr_unused / 100) * sysctl_vfs_cache_pressure;
 }
 
 static struct shrinker dcache_shrinker = {
-	.shrink = shrink_dcache_memory,
-	.seeks = DEFAULT_SEEKS,
+	.shrink_zone = shrink_dcache_memory,
 };
 
 /**
Index: linux-2.6/fs/inode.c
===================================================================
--- linux-2.6.orig/fs/inode.c	2010-10-19 14:35:42.000000000 +1100
+++ linux-2.6/fs/inode.c	2010-10-19 14:37:05.000000000 +1100
@@ -445,7 +445,7 @@
  * If the inode has metadata buffers attached to mapping->private_list then
  * try to remove them.
  */
-static void prune_icache(int nr_to_scan)
+static void prune_icache(unsigned long nr_to_scan)
 {
 	LIST_HEAD(freeable);
 	int nr_pruned = 0;
@@ -503,27 +503,36 @@
  * not open and the dcache references to those inodes have already been
  * reclaimed.
  *
- * This function is passed the number of inodes to scan, and it returns the
- * total number of remaining possibly-reclaimable inodes.
+ * This function is defined according to shrinker API described in linux/mm.h.
  */
-static int shrink_icache_memory(struct shrinker *shrink, int nr, gfp_t gfp_mask)
+static void shrink_icache_memory(struct shrinker *shrink,
+		struct zone *zone, unsigned long scanned,
+		unsigned long total, unsigned long global,
+		unsigned long flags, gfp_t gfp_mask)
 {
-	if (nr) {
-		/*
-		 * Nasty deadlock avoidance.  We may hold various FS locks,
-		 * and we don't want to recurse into the FS that called us
-		 * in clear_inode() and friends..
-		 */
-		if (!(gfp_mask & __GFP_FS))
-			return -1;
+	static unsigned long nr_to_scan;
+	unsigned long nr;
+
+	shrinker_add_scan(&nr_to_scan, scanned, global,
+			inodes_stat.nr_unused,
+			SHRINK_DEFAULT_SEEKS * 100 / sysctl_vfs_cache_pressure);
+	/*
+	 * Nasty deadlock avoidance.  We may hold various FS locks,
+	 * and we don't want to recurse into the FS that called us
+	 * in clear_inode() and friends..
+	 */
+	if (!(gfp_mask & __GFP_FS))
+	       return;
+
+	while ((nr = shrinker_do_scan(&nr_to_scan, SHRINK_BATCH))) {
 		prune_icache(nr);
+		count_vm_events(SLABS_SCANNED, nr);
+		cond_resched();
 	}
-	return (inodes_stat.nr_unused / 100) * sysctl_vfs_cache_pressure;
 }
 
 static struct shrinker icache_shrinker = {
-	.shrink = shrink_icache_memory,
-	.seeks = DEFAULT_SEEKS,
+	.shrink_zone = shrink_icache_memory,
 };
 
 static void __wait_on_freeing_inode(struct inode *inode);


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 8+ messages in thread

[parent not found: <20101019034658.744504135@kernel.dk>]

[parent not found: <20101019123852.GA12506@dastard>]

[parent not found: <20101020023556.GC3740@amd>]

* Re: [patch 31/35] fs: icache per-zone inode LRU
       [not found]     ` <20101020023556.GC3740@amd>
@ 2010-10-20  3:12       ` Nick Piggin
  2010-10-20  9:43         ` Dave Chinner
  0 siblings, 1 reply; 8+ messages in thread
From: Nick Piggin @ 2010-10-20  3:12 UTC (permalink / raw)
  To: Nick Piggin, linux-mm; +Cc: Dave Chinner, linux-kernel, linux-fsdevel

Gah. Try again.

On Wed, Oct 20, 2010 at 01:35:56PM +1100, Nick Piggin wrote:
> [I should have cc'ed this one to linux-mm as well, so I quote your
> reply in full here]
> 
> On Tue, Oct 19, 2010 at 11:38:52PM +1100, Dave Chinner wrote:
> > On Tue, Oct 19, 2010 at 02:42:47PM +1100, npiggin@kernel.dk wrote:
> > > Per-zone LRUs and shrinkers for inode cache.
> > 
> > Regardless of whether this is the right way to scale or not, I don't
> > like the fact that this moves the cache LRUs into the memory
> > management structures, and expands the use of MM specific structures
> > throughout the code.
> 
> The zone structure really is the basic unit of memory abstraction
> in the whole zoned VM concept (which covers different properties
> of both physical address and NUMA cost).
> 
> The zone contains structures for memory management that aren't
> otherwise directly related to one another. Generic page waitqueues,
> page allocator structures, pagecache reclaim structures, memory model
> data, and various statistics.
> 
> Structures to reclaim inodes from a particular zone belong in the
> zone struct as much as those to reclaim pagecache or anonymous
> memory from that zone too. It actually fits far better in here than
> globally, because all our allocation/reclaiming/watermarks etc is
> driven per-zone.
> 
> The structure is not frequent -- a couple per NUMA node.
> 
> 
> > It ties the cache implementation to the current
> > VM implementation. That, IMO, goes against all the principle of
> > modularisation at the source code level, and it means we have to tie
> > all shrinker implemenations to the current internal implementation
> > of the VM. I don't think that is wise thing to do because of the
> > dependencies and impedance mismatches it introduces.
> 
> It's very fundamental. We allocate memory from, and have to reclaim
> memory from -- zones. Memory reclaim is driven based on how the VM
> wants to reclaim memory: nothing you can do to avoid some linkage
> between the two.
> 
> Look at it this way. The dumb global shrinker is also tied to an
> MM implementation detail, but that detail in fact does *not* match
> the reality of the MM, and so it has all these problems interacting
> with real reclaim.
> 
> What problems? OK, on an N zone system (assuming equal zones and
> even distribution of objects around memory), then if there is a shortage
> on a particular zone, slabs from _all_ zones are reclaimed. We reclaim
> a factor of N too many objects. In a NUMA situation, we also touch
> remote memory with a chance (N-1)/N.
> 
> As number of nodes grow beyond 2, this quickly goes down hill.
> 
> In summary, there needs to be some knowledge of how MM reclaims memory
> in memory reclaim shrinkers -- simply can't do a good implementation
> without that. If the zone concept changes, the MM gets turned upside
> down and all those assumptions would need to be revisited anyway.
> 
> 
> > As an example: XFS inodes to be reclaimed are simply tagged in a
> > radix tree so the shrinker can reclaim inodes in optimal IO order
> > rather strict LRU order. It simply does not match a zone-based
> 
> This is another problem, similar to what we have in pagecache. In
> the pagecache, we need to clean pages in optimal IO order, but we
> still reclaim them according to some LRU order.
> 
> If you reclaim them in optimal IO order, cache efficiency will go
> down because you sacrifice recency/frequency information. If you
> IO in reclaim order, IO efficiency goes down. The solution is to
> decouple them with like writeout versus reclaim.
> 
> But anyway, that's kind of an "aside": inode caches are reclaimed
> in LRU, IO-suboptimal order today anyway. Per-zone LRU doesn't
> change that in the slightest.
> 
> > shrinker implementation in any way, shape or form, nor does it's
> > inherent parallelism match that of the way shrinkers are called.
> > 
> > Any change in shrinker infrastructure needs to be able to handle
> > these sorts of impedance mismatches between the VM and the cache
> > subsystem. The current API doesn't handle this very well, either,
> > so it's something that we need to fix so that scalability is easy
> > for everyone.
> > 
> > Anyway, my main point is that tying the LRU and shrinker scaling to
> > the implementation of the VM is a one-off solution that doesn't work
> > for generic infrastructure.
> 
> No it isn't. It worked for the pagecache, and it works for dcache.
> 
> 
> > Other subsystems need the same
> > large-machine scaling treatment, and there's no way we should be
> > tying them all into the struct zone. It needs further abstraction.
> 
> An abstraction? Other than the zone? What do you suggest? Invent
> something that the VM has no concept of and try to use that?
> 
> No. The zone is the right thing to base it on.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [patch 31/35] fs: icache per-zone inode LRU
  2010-10-20  3:12       ` [patch 31/35] fs: icache per-zone inode LRU Nick Piggin
@ 2010-10-20  9:43         ` Dave Chinner
  2010-10-20 10:02           ` Nick Piggin
  0 siblings, 1 reply; 8+ messages in thread
From: Dave Chinner @ 2010-10-20  9:43 UTC (permalink / raw)
  To: Nick Piggin; +Cc: linux-mm, linux-kernel, linux-fsdevel

> On Wed, Oct 20, 2010 at 01:35:56PM +1100, Nick Piggin wrote:
> > On Tue, Oct 19, 2010 at 11:38:52PM +1100, Dave Chinner wrote:
> > > On Tue, Oct 19, 2010 at 02:42:47PM +1100, npiggin@kernel.dk wrote:
> > > > Per-zone LRUs and shrinkers for inode cache.
> > > 
> > > Regardless of whether this is the right way to scale or not, I don't
> > > like the fact that this moves the cache LRUs into the memory
> > > management structures, and expands the use of MM specific structures
> > > throughout the code.
> > 
> > The zone structure really is the basic unit of memory abstraction
> > in the whole zoned VM concept (which covers different properties
> > of both physical address and NUMA cost).

[ snip lecture on NUMA VM 101 - I got that at SGI w.r.t. Irix more than
8 years ago, and Linux isn't any different. ]

> > > It ties the cache implementation to the current
> > > VM implementation. That, IMO, goes against all the principle of
> > > modularisation at the source code level, and it means we have to tie
> > > all shrinker implemenations to the current internal implementation
> > > of the VM. I don't think that is wise thing to do because of the
> > > dependencies and impedance mismatches it introduces.
> > 
> > It's very fundamental. We allocate memory from, and have to reclaim
> > memory from -- zones. Memory reclaim is driven based on how the VM
> > wants to reclaim memory: nothing you can do to avoid some linkage
> > between the two.

The allocation API exposes per-node allocation, not zones. The zones
are the internal implementation of the API, not what people use
directly for allocation...

> > > As an example: XFS inodes to be reclaimed are simply tagged in a
> > > radix tree so the shrinker can reclaim inodes in optimal IO order
> > > rather strict LRU order. It simply does not match a zone-based
....
> > But anyway, that's kind of an "aside": inode caches are reclaimed
> > in LRU, IO-suboptimal order today anyway. Per-zone LRU doesn't
> > change that in the slightest.

I suspect you didn't read what I wrote, so I'll repeat it. XFS has
reclaimed inodes in optimal IO order for several releases and so
per-zone LRU would change that drastically.

> > > Other subsystems need the same
> > > large-machine scaling treatment, and there's no way we should be
> > > tying them all into the struct zone. It needs further abstraction.
> > 
> > An abstraction? Other than the zone? What do you suggest? Invent
> > something that the VM has no concept of and try to use that?

I think you answered that question yourself a moment ago:

> > The structure is not frequent -- a couple per NUMA node.

Sounds to me like a per-node LRU/shrinker arrangement is an
abstraction that the VM could work with. Indeed, make it run only
from the *per-node kswapd* instead of from direct reclaim, and we'd
also solve the unbound reclaim parallelism problem at the same
time...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [patch 31/35] fs: icache per-zone inode LRU
  2010-10-20  9:43         ` Dave Chinner
@ 2010-10-20 10:02           ` Nick Piggin
  0 siblings, 0 replies; 8+ messages in thread
From: Nick Piggin @ 2010-10-20 10:02 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Nick Piggin, linux-mm, linux-kernel, linux-fsdevel

On Wed, Oct 20, 2010 at 08:43:02PM +1100, Dave Chinner wrote:
> > On Wed, Oct 20, 2010 at 01:35:56PM +1100, Nick Piggin wrote:
> > > 
> > > It's very fundamental. We allocate memory from, and have to reclaim
> > > memory from -- zones. Memory reclaim is driven based on how the VM
> > > wants to reclaim memory: nothing you can do to avoid some linkage
> > > between the two.
> 
> The allocation API exposes per-node allocation, not zones. The zones
> are the internal implementation of the API, not what people use
> directly for allocation...

Of course it exposes zones (with GFP flags). In fact they were exposed
before the zone concept was extended to NUMA.

> > > > As an example: XFS inodes to be reclaimed are simply tagged in a
> > > > radix tree so the shrinker can reclaim inodes in optimal IO order
> > > > rather strict LRU order. It simply does not match a zone-based
> ....
> > > But anyway, that's kind of an "aside": inode caches are reclaimed
> > > in LRU, IO-suboptimal order today anyway. Per-zone LRU doesn't
> > > change that in the slightest.
> 
> I suspect you didn't read what I wrote, so I'll repeat it. XFS has
> reclaimed inodes in optimal IO order for several releases and so
> per-zone LRU would change that drastically.

You were talking about XFS's own inode reclaim code? My patches
of course don't change that. I would like to see them usable by
XFS as well of course, but I'm not forcing anything to be
shoehorned in where it doesn't fit properly yet.

The Linux inode reclaimer is pretty well "random" from POV of
disk order, as you know.

I don't have the complete answer about how to write back required
inode information in IO optimal order, and at the same time make
reclaim optimal reclaiming choices.

It could be that a 2 stage reclaim process is enough (have the
Linux inode reclaim make the thing and make it eligible for IO
and real reclaiming, then have an inode writeout pass that does
IO optimal reclaiming from those).

That is really quite speculative and out of scope of this patch set.
But the point is that this patch set doesn't prohibit anything like
that happening, does not change XFS's reclaim currently.

> > > > Other subsystems need the same
> > > > large-machine scaling treatment, and there's no way we should be
> > > > tying them all into the struct zone. It needs further abstraction.
> > > 
> > > An abstraction? Other than the zone? What do you suggest? Invent
> > > something that the VM has no concept of and try to use that?
> 
> I think you answered that question yourself a moment ago:
> 
> > > The structure is not frequent -- a couple per NUMA node.
> 
> Sounds to me like a per-node LRU/shrinker arrangement is an
> abstraction that the VM could work with.

The zone really is the right place. If you do it per node, then
you can still have shortages in one node in a zone but not
another, causing the same excessive reclaim problem.

> Indeed, make it run only
> from the *per-node kswapd* instead of from direct reclaim, and we'd
> also solve the unbound reclaim parallelism problem at the same
> time...

That's also out of scope, but it is among things being
considered, as far as I know (along with capping number of
threads in reclaim etc). But doing zone LRUs doesn't change
this either -- kswapd pagecache reclaim also works per node,
by simply processing all the zones that belong to the node.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2010-10-20 10:02 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
     [not found] <20101019034216.319085068@kernel.dk>
2010-10-19  3:42 ` [patch 03/35] mm: implement per-zone shrinker npiggin
2010-10-19  4:49   ` KOSAKI Motohiro
2010-10-19  5:33     ` Nick Piggin
2010-10-19  5:40       ` KOSAKI Motohiro
2010-10-19  3:42 ` [patch 04/35] vfs: convert inode and dentry caches to " npiggin
     [not found] ` <20101019034658.744504135@kernel.dk>
     [not found]   ` <20101019123852.GA12506@dastard>
     [not found]     ` <20101020023556.GC3740@amd>
2010-10-20  3:12       ` [patch 31/35] fs: icache per-zone inode LRU Nick Piggin
2010-10-20  9:43         ` Dave Chinner
2010-10-20 10:02           ` Nick Piggin

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).