public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed
* [PATCH 00/19] Adaptive read-ahead V8
@ 2005-11-25 15:12 Wu Fengguang
  2005-11-25 15:12 ` [PATCH 01/19] mm: delayed page activation Wu Fengguang
                   ` (19 more replies)
  0 siblings, 20 replies; 32+ messages in thread
From: Wu Fengguang @ 2005-11-25 15:12 UTC (permalink / raw)
  To: linux-kernel; +Cc: Andrew Morton

Changelog
=========

V8  2005-11-25

- balance zone aging only in page relaim paths and do it right
- do the aging of slabs in the same way as zones
- add debug code to dump the detailed page reclaim steps
- undo exposing of struct radix_tree_node and uninline related functions
- work better with nfsd
- generalize accelerated context based read-ahead
- account smooth read-ahead aging based on page referenced/activate bits
- avoid divide error in compute_thrashing_threshold()
- more low lantency efforts
- update some comments
- rebase debug actions on debugfs entries instead of magic readahead_ratio values

V7  2005-11-09

- new tunable parameters: readahead_hit_rate/readahead_live_chunk
- support sparse sequential accesses
- delay look-ahead if drive is spinned down in laptop mode
- disable look-ahead for loopback file
- make mandatory thrashing protection more simple and robust
- attempt to improve responsiveness on large read-ahead size

V6  2005-11-01

- cancel look-ahead in laptop mode
- increase read-ahead limit to 0xFFFF pages

V5  2005-10-28

- rewrite context based method to make it clean and robust
- improved accuracy of stateful thrashing threshold estimation
- make page aging equal to the number of code pages scanned
- sort out the thrashing protection logic
- enhanced debug/accounting facilities

V4  2005-10-15

- detect and save live chunks on page reclaim
- support database workload
- support reading backward
- radix tree lookup look-aside cache

V3  2005-10-06

- major code reorganization and documention
- stateful estimation of thrashing-threshold
- context method with accelerated grow up phase
- adaptive look-ahead
- early detection and rescue of pages in danger
- statitics data collection
- synchronized page aging between zones

V2  2005-09-15

- delayed page activation
- look-ahead: towards pipelined read-ahead

V1  2005-09-13

Initial release which features:
        o stateless (for now)
        o adapts to available memory / read speed
        o free of thrashing (in theory)

And handles:
        o large number of slow streams (FTP server)
	o open/read/close access patterns (NFS server)
        o multiple interleaved, sequential streams in one file
	  (multithread / multimedia / database)


Overview
========

The current read-ahead logic uses an inflexible algorithm with 128KB
VM_MAX_READAHEAD. Less memory leads to thrashing, more memory helps no
throughput. The new logic is simply safer and faster. It makes sure
every single read-ahead request is safe for the current load. Memory
tight systems are expected to benefit a lot: no thrashing any more.
It can also help boost I/O throughput for large memory systems, for
VM_MAX_READAHEAD now defaults to 1MB. The value is no longer tightly
coupled with the thrashing problem, and therefore constrainted by it.

Thanks,
Wu Fengguang
--
Dept. Automation                University of Science and Technology of China

^ permalink raw reply	[flat|nested] 32+ messages in thread

* [PATCH 01/19] mm: delayed page activation
  2005-11-25 15:12 [PATCH 00/19] Adaptive read-ahead V8 Wu Fengguang
@ 2005-11-25 15:12 ` Wu Fengguang
  2005-11-25 15:12 ` [PATCH 02/19] vm: kswapd incmin Wu Fengguang
                   ` (18 subsequent siblings)
  19 siblings, 0 replies; 32+ messages in thread
From: Wu Fengguang @ 2005-11-25 15:12 UTC (permalink / raw)
  To: linux-kernel; +Cc: Andrew Morton, Wu Fengguang

[-- Attachment #1: mm-delayed-activation.patch --]
[-- Type: text/plain, Size: 4764 bytes --]

When a page is referenced the second time in inactive_list, mark it with
PG_activate instead of moving it into active_list immediately. The actual
moving work is delayed to vmscan time.

This implies two essential changes:
- keeps the adjecency of pages in lru;
- lifts the page reference counter max from 1 to 3.

And leads to the following improvements:
- read-ahead for a leading reader will not be disturbed by a following reader;
- enables the thrashing protection logic to save pages for following readers;
- keeping relavant pages together helps improve I/O efficiency;
- and also helps decrease vm fragmantation;
- increased refcnt space might help page replacement algorithms.

Signed-off-by: Wu Fengguang <wfg@mail.ustc.edu.cn>
---

 include/linux/page-flags.h |   31 +++++++++++++++++++++++++++++++
 mm/page_alloc.c            |    1 +
 mm/swap.c                  |    9 ++++-----
 mm/vmscan.c                |    6 ++++++
 4 files changed, 42 insertions(+), 5 deletions(-)

--- linux-2.6.15-rc2-mm1.orig/include/linux/page-flags.h
+++ linux-2.6.15-rc2-mm1/include/linux/page-flags.h
@@ -76,6 +76,7 @@
 #define PG_reclaim		17	/* To be reclaimed asap */
 #define PG_nosave_free		18	/* Free, should not be written */
 #define PG_uncached		19	/* Page has been mapped as uncached */
+#define PG_activate		20	/* delayed activate */
 
 /*
  * Global page accounting.  One instance per CPU.  Only unsigned longs are
@@ -304,6 +305,12 @@ extern void __mod_page_state(unsigned lo
 #define SetPageUncached(page)	set_bit(PG_uncached, &(page)->flags)
 #define ClearPageUncached(page)	clear_bit(PG_uncached, &(page)->flags)
 
+#define PageActivate(page)	test_bit(PG_activate, &(page)->flags)
+#define SetPageActivate(page)	set_bit(PG_activate, &(page)->flags)
+#define ClearPageActivate(page)	clear_bit(PG_activate, &(page)->flags)
+#define TestClearPageActivate(page) test_and_clear_bit(PG_activate, &(page)->flags)
+#define TestSetPageActivate(page) test_and_set_bit(PG_activate, &(page)->flags)
+
 struct page;	/* forward declaration */
 
 int test_clear_page_dirty(struct page *page);
@@ -329,4 +336,28 @@ static inline void set_page_writeback(st
 #define ClearPageFsMisc(page)		clear_bit(PG_fs_misc, &(page)->flags)
 #define TestClearPageFsMisc(page)	test_and_clear_bit(PG_fs_misc, &(page)->flags)
 
+#if PG_activate < PG_referenced
+#error unexpected page flags order
+#endif
+
+#define PAGE_REFCNT_0		0
+#define PAGE_REFCNT_1		(1 << PG_referenced)
+#define PAGE_REFCNT_2		(1 << PG_activate)
+#define PAGE_REFCNT_3		((1 << PG_activate) | (1 << PG_referenced))
+#define PAGE_REFCNT_MASK	PAGE_REFCNT_3
+
+/*
+ * STATUS   REFERENCE COUNT
+ *  __                   0
+ *  _R       PAGE_REFCNT_1
+ *  A_       PAGE_REFCNT_2
+ *  AR       PAGE_REFCNT_3
+ *
+ *  A/R: Active / Referenced
+ */
+static inline unsigned long page_refcnt(struct page *page)
+{
+	return page->flags & PAGE_REFCNT_MASK;
+}
+
 #endif	/* PAGE_FLAGS_H */
--- linux-2.6.15-rc2-mm1.orig/mm/page_alloc.c
+++ linux-2.6.15-rc2-mm1/mm/page_alloc.c
@@ -508,6 +508,7 @@ static int prep_new_page(struct page *pa
 
 	page->flags &= ~(1 << PG_uptodate | 1 << PG_error |
 			1 << PG_referenced | 1 << PG_arch_1 |
+			1 << PG_activate |
 			1 << PG_checked | 1 << PG_mappedtodisk);
 	set_page_private(page, 0);
 	set_page_refs(page, order);
--- linux-2.6.15-rc2-mm1.orig/mm/swap.c
+++ linux-2.6.15-rc2-mm1/mm/swap.c
@@ -29,7 +29,6 @@
 #include <linux/percpu.h>
 #include <linux/cpu.h>
 #include <linux/notifier.h>
-#include <linux/init.h>
 
 /* How many pages do we try to swap or page in/out together? */
 int page_cluster;
@@ -115,13 +114,13 @@ void fastcall activate_page(struct page 
  * Mark a page as having seen activity.
  *
  * inactive,unreferenced	->	inactive,referenced
- * inactive,referenced		->	active,unreferenced
- * active,unreferenced		->	active,referenced
+ * inactive,referenced		->	activate,unreferenced
+ * activate,unreferenced	->	activate,referenced
  */
 void fastcall mark_page_accessed(struct page *page)
 {
-	if (!PageActive(page) && PageReferenced(page) && PageLRU(page)) {
-		activate_page(page);
+	if (!PageActivate(page) && PageReferenced(page) && PageLRU(page)) {
+		SetPageActivate(page);
 		ClearPageReferenced(page);
 	} else if (!PageReferenced(page)) {
 		SetPageReferenced(page);
--- linux-2.6.15-rc2-mm1.orig/mm/vmscan.c
+++ linux-2.6.15-rc2-mm1/mm/vmscan.c
@@ -454,6 +454,12 @@ static int shrink_list(struct list_head 
 		if (PageWriteback(page))
 			goto keep_locked;
 
+		if (PageActivate(page)) {
+			ClearPageActivate(page);
+			ClearPageReferenced(page);
+			goto activate_locked;
+		}
+
 		referenced = page_referenced(page, 1, sc->priority <= 0);
 		/* In active use or really unfreeable?  Activate it. */
 		if (referenced && page_mapping_inuse(page))

--

^ permalink raw reply	[flat|nested] 32+ messages in thread

* [PATCH 02/19] vm: kswapd incmin
  2005-11-25 15:12 [PATCH 00/19] Adaptive read-ahead V8 Wu Fengguang
  2005-11-25 15:12 ` [PATCH 01/19] mm: delayed page activation Wu Fengguang
@ 2005-11-25 15:12 ` Wu Fengguang
  2005-11-25 15:12 ` [PATCH 03/19] mm: balance page aging between zones and slabs Wu Fengguang
                   ` (17 subsequent siblings)
  19 siblings, 0 replies; 32+ messages in thread
From: Wu Fengguang @ 2005-11-25 15:12 UTC (permalink / raw)
  To: linux-kernel; +Cc: Andrew Morton, Nick Piggin, Wu Fengguang

[-- Attachment #1: vm-kswapd-incmin.patch --]
[-- Type: text/plain, Size: 4789 bytes --]

Explicitly teach kswapd about the incremental min logic instead of just scanning
all zones under the first low zone. This should keep more even pressure applied
on the zones.

Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Wu Fengguang <wfg@mail.ustc.edu.cn>
---


This patch is taken unchanged from Nick Piggin's work.

 mm/vmscan.c |  105 ++++++++++++++++++++----------------------------------------
 1 files changed, 35 insertions(+), 70 deletions(-)

--- linux-2.6.15-rc2-mm1.orig/mm/vmscan.c
+++ linux-2.6.15-rc2-mm1/mm/vmscan.c
@@ -1314,97 +1314,63 @@ loop_again:
 	}
 
 	for (priority = DEF_PRIORITY; priority >= 0; priority--) {
-		int end_zone = 0;	/* Inclusive.  0 = ZONE_DMA */
 		unsigned long lru_pages = 0;
+		int first_low_zone = 0;
 
 		all_zones_ok = 1;
+		sc.nr_scanned = 0;
+		sc.nr_reclaimed = 0;
+		sc.priority = priority;
+		sc.swap_cluster_max = nr_pages ? nr_pages : SWAP_CLUSTER_MAX;
 
-		if (nr_pages == 0) {
-			/*
-			 * Scan in the highmem->dma direction for the highest
-			 * zone which needs scanning
-			 */
-			for (i = pgdat->nr_zones - 1; i >= 0; i--) {
-				struct zone *zone = pgdat->node_zones + i;
+		/* Scan in the highmem->dma direction */
+		for (i = pgdat->nr_zones - 1; i >= 0; i--) {
+			struct zone *zone = pgdat->node_zones + i;
 
-				if (!populated_zone(zone))
-					continue;
+			if (!populated_zone(zone))
+				continue;
 
-				if (zone->all_unreclaimable &&
-						priority != DEF_PRIORITY)
+			if (nr_pages == 0) {	/* Not software suspend */
+				if (zone_watermark_ok(zone, order,
+					zone->pages_high, first_low_zone, 0))
 					continue;
 
-				if (!zone_watermark_ok(zone, order,
-						zone->pages_high, 0, 0)) {
-					end_zone = i;
-					goto scan;
-				}
+				all_zones_ok = 0;
+				if (first_low_zone < i)
+					first_low_zone = i;
 			}
-			goto out;
-		} else {
-			end_zone = pgdat->nr_zones - 1;
-		}
-scan:
-		for (i = 0; i <= end_zone; i++) {
-			struct zone *zone = pgdat->node_zones + i;
-
-			lru_pages += zone->nr_active + zone->nr_inactive;
-		}
-
-		/*
-		 * Now scan the zone in the dma->highmem direction, stopping
-		 * at the last zone which needs scanning.
-		 *
-		 * We do this because the page allocator works in the opposite
-		 * direction.  This prevents the page allocator from allocating
-		 * pages behind kswapd's direction of progress, which would
-		 * cause too much scanning of the lower zones.
-		 */
-		for (i = 0; i <= end_zone; i++) {
-			struct zone *zone = pgdat->node_zones + i;
-			int nr_slab;
-
-			if (!populated_zone(zone))
-				continue;
 
 			if (zone->all_unreclaimable && priority != DEF_PRIORITY)
 				continue;
 
-			if (nr_pages == 0) {	/* Not software suspend */
-				if (!zone_watermark_ok(zone, order,
-						zone->pages_high, end_zone, 0))
-					all_zones_ok = 0;
-			}
 			zone->temp_priority = priority;
 			if (zone->prev_priority > priority)
 				zone->prev_priority = priority;
-			sc.nr_scanned = 0;
-			sc.nr_reclaimed = 0;
-			sc.priority = priority;
-			sc.swap_cluster_max = nr_pages? nr_pages : SWAP_CLUSTER_MAX;
+			lru_pages += zone->nr_active + zone->nr_inactive;
+
 			atomic_inc(&zone->reclaim_in_progress);
 			shrink_zone(zone, &sc);
 			atomic_dec(&zone->reclaim_in_progress);
-			reclaim_state->reclaimed_slab = 0;
-			nr_slab = shrink_slab(sc.nr_scanned, GFP_KERNEL,
-						lru_pages);
-			sc.nr_reclaimed += reclaim_state->reclaimed_slab;
-			total_reclaimed += sc.nr_reclaimed;
-			total_scanned += sc.nr_scanned;
-			if (zone->all_unreclaimable)
-				continue;
-			if (nr_slab == 0 && zone->pages_scanned >=
+
+			if (zone->pages_scanned >=
 				    (zone->nr_active + zone->nr_inactive) * 4)
 				zone->all_unreclaimable = 1;
-			/*
-			 * If we've done a decent amount of scanning and
-			 * the reclaim ratio is low, start doing writepage
-			 * even in laptop mode
-			 */
-			if (total_scanned > SWAP_CLUSTER_MAX * 2 &&
-			    total_scanned > total_reclaimed+total_reclaimed/2)
-				sc.may_writepage = 1;
 		}
+		reclaim_state->reclaimed_slab = 0;
+		shrink_slab(sc.nr_scanned, GFP_KERNEL, lru_pages);
+		sc.nr_reclaimed += reclaim_state->reclaimed_slab;
+		total_reclaimed += sc.nr_reclaimed;
+		total_scanned += sc.nr_scanned;
+
+		/*
+		 * If we've done a decent amount of scanning and
+		 * the reclaim ratio is low, start doing writepage
+		 * even in laptop mode
+		 */
+		if (total_scanned > SWAP_CLUSTER_MAX * 2 &&
+		    total_scanned > total_reclaimed+total_reclaimed/2)
+			sc.may_writepage = 1;
+
 		if (nr_pages && to_free > total_reclaimed)
 			continue;	/* swsusp: need to do more work */
 		if (all_zones_ok)
@@ -1425,7 +1391,6 @@ scan:
 		if ((total_reclaimed >= SWAP_CLUSTER_MAX) && (!nr_pages))
 			break;
 	}
-out:
 	for (i = 0; i < pgdat->nr_zones; i++) {
 		struct zone *zone = pgdat->node_zones + i;
 

--

^ permalink raw reply	[flat|nested] 32+ messages in thread

* [PATCH 03/19] mm: balance page aging between zones and slabs
  2005-11-25 15:12 [PATCH 00/19] Adaptive read-ahead V8 Wu Fengguang
  2005-11-25 15:12 ` [PATCH 01/19] mm: delayed page activation Wu Fengguang
  2005-11-25 15:12 ` [PATCH 02/19] vm: kswapd incmin Wu Fengguang
@ 2005-11-25 15:12 ` Wu Fengguang
  2005-11-25 15:12 ` [PATCH 04/19] mm: debug page reclaim Wu Fengguang
                   ` (16 subsequent siblings)
  19 siblings, 0 replies; 32+ messages in thread
From: Wu Fengguang @ 2005-11-25 15:12 UTC (permalink / raw)
  To: linux-kernel
  Cc: Andrew Morton, Marcelo Tosatti, Magnus Damm, Nick Piggin,
	Andrea Arcangeli, Wu Fengguang

[-- Attachment #1: mm-balanced-aging.patch --]
[-- Type: text/plain, Size: 16543 bytes --]

The zone aging rates are currently imbalanced, the gap can be as large as 3
times, which can severely damage read-ahead requests and shorten their
effective life time.

The current slab shrinking code is way too fragile. It should manage aging
pace by itself, and provide a simple and robust interface.

THE METHOD
==========
This patch adds three variables in struct zone and shrinker
- aging_total
- aging_milestone
- page_age
to keep track of page aging rate, and keep it in sync on page reclaim time.
The aging_total is just a per-zone counter-part to the per-cpu
pgscan_{kswapd,direct}_{zone name}. But it is not direct comparable between
zones, so the aging_milestone/page_age are maintained based on aging_total.

The main design aspects of page balancing:
- In the kswapd reclaim path,
	- reclaim also the least aged zone to help it catch up with the most
	  aged zone;
	- the reclaim for watermark is avoided whenever possible by calling
	  watermark_ok() with classzone_idx = 0;
	- in this way, the reclaims for aging are garuanteed to be more than
	  reclaims for watermark if there's a big imbalance, so that there
	  will not be any chance for a growing gap.
- In the direct reclaim path,
	- reclaim only the least aged local/headless zone in the first two
	  rounds;
	- two rounds are enough in normal to get enough free pages;
	- and prevents unnecessarily disturbing remote nodes.
- In shrink_cache()/shrink_zone()/shrink_list(),
	- zone->nr_scan_inactive is purged, the requested zone will be
	  scanned at least SWAP_CLUSTER_MAX pages;
	- sc->nr_scanned reflects exactly the number of cold pages scanned.
- In shrink_slab(),
	- keep the age of slabs in line with that of the largest zone, using
	  the same syncing logic as that of the zones;
	- a minimal number of unused slabs are reserved in normal;
	- the slabs are shrinked more if vm pressure is high;
	- `mmap pages found' - `shrink more slabs' - `avoid swapping',
	  I don't think they are strongly related, so the code is removed.

THE RESULT
==========
- It does not touch the time critical page alloc path, so will not hurt
  performance at all.
- The shrink of slabs become simple, consistent and robust. The callers no
  longer have to worry about the tricky aging thing.
- Experiments with 64M/512M/2G memory show that it keeps the zone ages in
  sync perfectly. One can check the effects by running:
			tar c / | cat > /dev/null &
			watch -n1 'grep "age " /proc/zoneinfo'

Signed-off-by: Wu Fengguang <wfg@mail.ustc.edu.cn>
---

 include/linux/mm.h     |    4 
 include/linux/mmzone.h |   14 +++
 mm/page_alloc.c        |   16 ++-
 mm/vmscan.c            |  222 +++++++++++++++++++++++++++++++------------------
 4 files changed, 173 insertions(+), 83 deletions(-)

--- linux-2.6.15-rc2-mm1.orig/include/linux/mmzone.h
+++ linux-2.6.15-rc2-mm1/include/linux/mmzone.h
@@ -149,6 +149,20 @@ struct zone {
 	unsigned long		pages_scanned;	   /* since last reclaim */
 	int			all_unreclaimable; /* All pages pinned */
 
+	/* Fields for balanced page aging:
+	 * aging_total     - The accumulated number of activities that may
+	 *                   cause page aging, that is, make some pages closer
+	 *                   to the tail of inactive_list.
+	 * aging_milestone - A snapshot of total_scan every time a full
+	 *                   inactive_list of pages become aged.
+	 * page_age        - A normalized value showing the percent of pages
+	 *                   have been aged.  It is compared between zones to
+	 *                   balance the rate of page aging.
+	 */
+	unsigned long		aging_total;
+	unsigned long		aging_milestone;
+	unsigned long		page_age;
+
 	/*
 	 * Does the allocator try to reclaim pages from the zone as soon
 	 * as it fails a watermark_ok() in __alloc_pages?
--- linux-2.6.15-rc2-mm1.orig/include/linux/mm.h
+++ linux-2.6.15-rc2-mm1/include/linux/mm.h
@@ -775,7 +775,9 @@ struct shrinker {
 	shrinker_t		shrinker;
 	struct list_head	list;
 	int			seeks;	/* seeks to recreate an obj */
-	long			nr;	/* objs pending delete */
+	unsigned long		aging_total;
+	unsigned long		aging_milestone;
+	unsigned long		page_age;
 	struct shrinker_stats	*s_stats;
 };
 
--- linux-2.6.15-rc2-mm1.orig/mm/page_alloc.c
+++ linux-2.6.15-rc2-mm1/mm/page_alloc.c
@@ -1468,6 +1468,8 @@ void show_free_areas(void)
 			" active:%lukB"
 			" inactive:%lukB"
 			" present:%lukB"
+			" aging:%lukB"
+			" age:%lu"
 			" pages_scanned:%lu"
 			" all_unreclaimable? %s"
 			"\n",
@@ -1479,6 +1481,8 @@ void show_free_areas(void)
 			K(zone->nr_active),
 			K(zone->nr_inactive),
 			K(zone->present_pages),
+			K(zone->aging_total),
+			zone->page_age,
 			zone->pages_scanned,
 			(zone->all_unreclaimable ? "yes" : "no")
 			);
@@ -2089,9 +2093,11 @@ static void __init free_area_init_core(s
 		INIT_LIST_HEAD(&zone->active_list);
 		INIT_LIST_HEAD(&zone->inactive_list);
 		zone->nr_scan_active = 0;
-		zone->nr_scan_inactive = 0;
 		zone->nr_active = 0;
 		zone->nr_inactive = 0;
+		zone->aging_total = 0;
+		zone->aging_milestone = 0;
+		zone->page_age = 0;
 		atomic_set(&zone->reclaim_in_progress, 0);
 		if (!size)
 			continue;
@@ -2240,7 +2246,9 @@ static int zoneinfo_show(struct seq_file
 			   "\n        high     %lu"
 			   "\n        active   %lu"
 			   "\n        inactive %lu"
-			   "\n        scanned  %lu (a: %lu i: %lu)"
+			   "\n        aging    %lu"
+			   "\n        age      %lu"
+			   "\n        scanned  %lu (a: %lu)"
 			   "\n        spanned  %lu"
 			   "\n        present  %lu",
 			   zone->free_pages,
@@ -2249,8 +2257,10 @@ static int zoneinfo_show(struct seq_file
 			   zone->pages_high,
 			   zone->nr_active,
 			   zone->nr_inactive,
+			   zone->aging_total,
+			   zone->page_age,
 			   zone->pages_scanned,
-			   zone->nr_scan_active, zone->nr_scan_inactive,
+			   zone->nr_scan_active,
 			   zone->spanned_pages,
 			   zone->present_pages);
 		seq_printf(m,
--- linux-2.6.15-rc2-mm1.orig/mm/vmscan.c
+++ linux-2.6.15-rc2-mm1/mm/vmscan.c
@@ -123,6 +123,56 @@ static long total_memory;
 static LIST_HEAD(shrinker_list);
 static DECLARE_RWSEM(shrinker_rwsem);
 
+#ifdef CONFIG_HIGHMEM64G
+#define		PAGE_AGE_SHIFT  8
+#elif BITS_PER_LONG == 32
+#define		PAGE_AGE_SHIFT  12
+#elif BITS_PER_LONG == 64
+#define		PAGE_AGE_SHIFT  20
+#else
+#error unknown BITS_PER_LONG
+#endif
+#define		PAGE_AGE_MASK   ((1 << PAGE_AGE_SHIFT) - 1)
+
+/*
+ * The simplified code is: (a->page_age > b->page_age)
+ * The complexity deals with the wrap-around problem.
+ * Two page ages not close enough should also be ignored:
+ * they are out of sync and the comparison may be nonsense.
+ */
+#define pages_more_aged(a, b) 						\
+	((b->page_age - a->page_age) & PAGE_AGE_MASK) >			\
+			PAGE_AGE_MASK - (1 << (PAGE_AGE_SHIFT - 3))	\
+
+/*
+ * Keep track of the percent of cold pages that have been scanned / aged.
+ * It's not really ##%, but a high resolution normalized value.
+ */
+static inline void update_zone_age(struct zone *z, int nr_scan)
+{
+	unsigned long len = z->nr_inactive + z->free_pages;
+
+	z->aging_total += nr_scan;
+
+	if (z->aging_total - z->aging_milestone > len)
+		z->aging_milestone += len;
+
+	z->page_age = ((z->aging_total - z->aging_milestone)
+						<< PAGE_AGE_SHIFT) / len;
+}
+
+static inline void update_slab_age(struct shrinker *s,
+					unsigned long len, int nr_scan)
+{
+	s->aging_total += nr_scan;
+
+	if (s->aging_total - s->aging_milestone > len)
+		s->aging_milestone += len;
+
+	s->page_age = ((s->aging_total - s->aging_milestone)
+						<< PAGE_AGE_SHIFT) / len;
+}
+
 /*
  * Add a shrinker callback to be called from the vm
  */
@@ -134,7 +184,9 @@ struct shrinker *set_shrinker(int seeks,
         if (shrinker) {
 	        shrinker->shrinker = theshrinker;
 	        shrinker->seeks = seeks;
-	        shrinker->nr = 0;
+	        shrinker->aging_total = 0;
+	        shrinker->aging_milestone = 0;
+	        shrinker->page_age = 0;
 		shrinker->s_stats = alloc_percpu(struct shrinker_stats);
 		if (!shrinker->s_stats) {
 			kfree(shrinker);
@@ -170,80 +222,61 @@ EXPORT_SYMBOL(remove_shrinker);
  * percentages of the lru and ageable caches.  This should balance the seeks
  * generated by these structures.
  *
- * If the vm encounted mapped pages on the LRU it increase the pressure on
- * slab to avoid swapping.
- *
- * We do weird things to avoid (scanned*seeks*entries) overflowing 32 bits.
- *
- * `lru_pages' represents the number of on-LRU pages in all the zones which
- * are eligible for the caller's allocation attempt.  It is used for balancing
- * slab reclaim versus page reclaim.
+ * If the vm pressure is high, shrink the slabs more.
  *
  * Returns the number of slab objects which we shrunk.
  */
-static int shrink_slab(unsigned long scanned, gfp_t gfp_mask,
-			unsigned long lru_pages)
+static int shrink_slab(gfp_t gfp_mask)
 {
 	struct shrinker *shrinker;
-	int ret = 0;
-
-	if (scanned == 0)
-		scanned = SWAP_CLUSTER_MAX;
+	struct pglist_data *pgdat;
+	struct zone *zone;
+	int n;
 
 	if (!down_read_trylock(&shrinker_rwsem))
 		return 1;	/* Assume we'll be able to shrink next time */
 
-	list_for_each_entry(shrinker, &shrinker_list, list) {
-		unsigned long long delta;
-		unsigned long total_scan;
-		unsigned long max_pass = (*shrinker->shrinker)(0, gfp_mask);
-
-		delta = (4 * scanned) / shrinker->seeks;
-		delta *= max_pass;
-		do_div(delta, lru_pages + 1);
-		shrinker->nr += delta;
-		if (shrinker->nr < 0) {
-			printk(KERN_ERR "%s: nr=%ld\n",
-					__FUNCTION__, shrinker->nr);
-			shrinker->nr = max_pass;
-		}
+	/* find the major zone for the slabs to catch up age with */
+	pgdat = NODE_DATA(numa_node_id());
+	zone = pgdat->node_zones;
+	for (n = 1; n < pgdat->nr_zones; n++) {
+		struct zone *z = pgdat->node_zones + n;
 
-		/*
-		 * Avoid risking looping forever due to too large nr value:
-		 * never try to free more than twice the estimate number of
-		 * freeable entries.
-		 */
-		if (shrinker->nr > max_pass * 2)
-			shrinker->nr = max_pass * 2;
-
-		total_scan = shrinker->nr;
-		shrinker->nr = 0;
+		if (zone->present_pages < z->present_pages)
+			zone = z;
+	}
 
-		while (total_scan >= SHRINK_BATCH) {
-			long this_scan = SHRINK_BATCH;
-			int shrink_ret;
+	n = 0;
+	list_for_each_entry(shrinker, &shrinker_list, list) {
+		while (pages_more_aged(zone, shrinker)) {
 			int nr_before;
+			int nr_after;
 
 			nr_before = (*shrinker->shrinker)(0, gfp_mask);
-			shrink_ret = (*shrinker->shrinker)(this_scan, gfp_mask);
-			if (shrink_ret == -1)
+			if (nr_before <= SHRINK_BATCH * zone->prev_priority)
+				break;
+
+			nr_after = (*shrinker->shrinker)(SHRINK_BATCH, gfp_mask);
+			if (nr_after == -1)
 				break;
-			if (shrink_ret < nr_before) {
-				ret += nr_before - shrink_ret;
-				shrinker_stat_add(shrinker, nr_freed,
-					(nr_before - shrink_ret));
+
+			if (nr_after < nr_before) {
+				int nr_freed = nr_before - nr_after;
+
+				n += nr_freed;
+				shrinker_stat_add(shrinker, nr_freed, nr_freed);
 			}
-			shrinker_stat_add(shrinker, nr_req, this_scan);
-			mod_page_state(slabs_scanned, this_scan);
-			total_scan -= this_scan;
+			shrinker_stat_add(shrinker, nr_req, SHRINK_BATCH);
+			mod_page_state(slabs_scanned, SHRINK_BATCH);
+			update_slab_age(shrinker, nr_before * DEF_PRIORITY,
+						SHRINK_BATCH * shrinker->seeks *
+							zone->prev_priority);
 
 			cond_resched();
 		}
-
-		shrinker->nr += total_scan;
 	}
 	up_read(&shrinker_rwsem);
-	return ret;
+	return n;
 }
 
 /* Called without lock on whether page is mapped, so answer is unstable */
@@ -446,11 +479,6 @@ static int shrink_list(struct list_head 
 
 		BUG_ON(PageActive(page));
 
-		sc->nr_scanned++;
-		/* Double the slab pressure for mapped and swapcache pages */
-		if (page_mapped(page) || PageSwapCache(page))
-			sc->nr_scanned++;
-
 		if (PageWriteback(page))
 			goto keep_locked;
 
@@ -894,11 +922,13 @@ static void shrink_cache(struct zone *zo
 					     &page_list, &nr_scan);
 		zone->nr_inactive -= nr_taken;
 		zone->pages_scanned += nr_scan;
+		update_zone_age(zone, nr_scan);
 		spin_unlock_irq(&zone->lru_lock);
 
 		if (nr_taken == 0)
 			goto done;
 
+		sc->nr_scanned += nr_scan;
 		max_scan -= nr_scan;
 		if (current_is_kswapd())
 			mod_page_state_zone(zone, pgscan_kswapd, nr_scan);
@@ -1101,12 +1131,7 @@ shrink_zone(struct zone *zone, struct sc
 	else
 		nr_active = 0;
 
-	zone->nr_scan_inactive += (zone->nr_inactive >> sc->priority) + 1;
-	nr_inactive = zone->nr_scan_inactive;
-	if (nr_inactive >= sc->swap_cluster_max)
-		zone->nr_scan_inactive = 0;
-	else
-		nr_inactive = 0;
+	nr_inactive = (zone->nr_inactive >> sc->priority) + 1;
 
 	sc->nr_to_reclaim = sc->swap_cluster_max;
 
@@ -1153,6 +1178,7 @@ static void
 shrink_caches(struct zone **zones, struct scan_control *sc)
 {
 	int i;
+	struct zone *z = NULL;
 
 	for (i = 0; zones[i] != NULL; i++) {
 		struct zone *zone = zones[i];
@@ -1170,8 +1196,31 @@ shrink_caches(struct zone **zones, struc
 		if (zone->all_unreclaimable && sc->priority != DEF_PRIORITY)
 			continue;	/* Let kswapd poll it */
 
+		/*
+		 * Balance page aging in local zones and following headless
+		 * zones in the first two rounds of direct reclaim.
+		 */
+		if (sc->priority > DEF_PRIORITY - 2) {
+			if (zone->zone_pgdat != zones[0]->zone_pgdat) {
+				cpumask_t cpu = node_to_cpumask(
+						zone->zone_pgdat->node_id);
+				if (!cpus_empty(cpu))
+					break;
+			}
+
+			if (!z)
+				z = zone;
+			else if (pages_more_aged(z, zone))
+				z = zone;
+
+			continue;
+		}
+
 		shrink_zone(zone, sc);
 	}
+
+	if (z)
+		shrink_zone(z, sc);
 }
  
 /*
@@ -1194,7 +1243,6 @@ int try_to_free_pages(struct zone **zone
 	int total_scanned = 0, total_reclaimed = 0;
 	struct reclaim_state *reclaim_state = current->reclaim_state;
 	struct scan_control sc;
-	unsigned long lru_pages = 0;
 	int i;
 
 	delay_prefetch();
@@ -1212,7 +1260,6 @@ int try_to_free_pages(struct zone **zone
 			continue;
 
 		zone->temp_priority = DEF_PRIORITY;
-		lru_pages += zone->nr_active + zone->nr_inactive;
 	}
 
 	for (priority = DEF_PRIORITY; priority >= 0; priority--) {
@@ -1222,7 +1269,7 @@ int try_to_free_pages(struct zone **zone
 		sc.priority = priority;
 		sc.swap_cluster_max = SWAP_CLUSTER_MAX;
 		shrink_caches(zones, &sc);
-		shrink_slab(sc.nr_scanned, gfp_mask, lru_pages);
+		shrink_slab(gfp_mask);
 		if (reclaim_state) {
 			sc.nr_reclaimed += reclaim_state->reclaimed_slab;
 			reclaim_state->reclaimed_slab = 0;
@@ -1296,6 +1343,23 @@ static int balance_pgdat(pg_data_t *pgda
 	int total_scanned, total_reclaimed;
 	struct reclaim_state *reclaim_state = current->reclaim_state;
 	struct scan_control sc;
+	struct zone *youngest_zone;
+	struct zone *oldest_zone;
+
+	youngest_zone = oldest_zone = NULL;
+	for (i = 0; i < pgdat->nr_zones; i++) {
+		struct zone *zone = pgdat->node_zones + i;
+
+		if (zone->present_pages == 0)
+			continue;
+
+		if (!oldest_zone)
+			youngest_zone = oldest_zone = zone;
+		else if (pages_more_aged(zone, oldest_zone))
+			oldest_zone = zone;
+		else if (pages_more_aged(youngest_zone, zone))
+			youngest_zone = zone;
+	}
 
 loop_again:
 	total_scanned = 0;
@@ -1314,9 +1378,6 @@ loop_again:
 	}
 
 	for (priority = DEF_PRIORITY; priority >= 0; priority--) {
-		unsigned long lru_pages = 0;
-		int first_low_zone = 0;
-
 		all_zones_ok = 1;
 		sc.nr_scanned = 0;
 		sc.nr_reclaimed = 0;
@@ -1331,13 +1392,17 @@ loop_again:
 				continue;
 
 			if (nr_pages == 0) {	/* Not software suspend */
-				if (zone_watermark_ok(zone, order,
-					zone->pages_high, first_low_zone, 0))
+				if (!zone_watermark_ok(zone, order,
+							zone->pages_high,
+							0, 0)) {
+					all_zones_ok = 0;
+				} else if (zone == youngest_zone &&
+						pages_more_aged(oldest_zone,
+								youngest_zone)) {
+					/* if (priority > DEF_PRIORITY - 2) */
+						/* all_zones_ok = 0; */
+				} else
 					continue;
-
-				all_zones_ok = 0;
-				if (first_low_zone < i)
-					first_low_zone = i;
 			}
 
 			if (zone->all_unreclaimable && priority != DEF_PRIORITY)
@@ -1346,7 +1411,6 @@ loop_again:
 			zone->temp_priority = priority;
 			if (zone->prev_priority > priority)
 				zone->prev_priority = priority;
-			lru_pages += zone->nr_active + zone->nr_inactive;
 
 			atomic_inc(&zone->reclaim_in_progress);
 			shrink_zone(zone, &sc);
@@ -1357,7 +1421,7 @@ loop_again:
 				zone->all_unreclaimable = 1;
 		}
 		reclaim_state->reclaimed_slab = 0;
-		shrink_slab(sc.nr_scanned, GFP_KERNEL, lru_pages);
+		shrink_slab(GFP_KERNEL);
 		sc.nr_reclaimed += reclaim_state->reclaimed_slab;
 		total_reclaimed += sc.nr_reclaimed;
 		total_scanned += sc.nr_scanned;

--

^ permalink raw reply	[flat|nested] 32+ messages in thread

* [PATCH 04/19] mm: debug page reclaim
  2005-11-25 15:12 [PATCH 00/19] Adaptive read-ahead V8 Wu Fengguang
                   ` (2 preceding siblings ...)
  2005-11-25 15:12 ` [PATCH 03/19] mm: balance page aging between zones and slabs Wu Fengguang
@ 2005-11-25 15:12 ` Wu Fengguang
  2005-11-25 15:12 ` [PATCH 05/19] radixtree: sync with mainline Wu Fengguang
                   ` (15 subsequent siblings)
  19 siblings, 0 replies; 32+ messages in thread
From: Wu Fengguang @ 2005-11-25 15:12 UTC (permalink / raw)
  To: linux-kernel; +Cc: Andrew Morton, Wu Fengguang

[-- Attachment #1: mm-debug-page-reclaim.patch --]
[-- Type: text/plain, Size: 8784 bytes --]

Show the detailed steps of early/direct/kswapd page reclaim.

To enable the printk traces:
# echo y > /debug/debug_page_reclaim

Sample lines:

reclaim zone2 from kswapd for watermark, prio 12, scan-reclaimed 33-31, age 1248, hot+cold+free pages 92563+25703+860
reclaim zone2 from kswapd for watermark, prio 11, scan-reclaimed 33-32, age 1253, hot+cold+free pages 92547+25688+892
reclaim zone2 from kswapd for watermark, prio 12, scan-reclaimed 33-32, age 1258, hot+cold+free pages 92547+25656+924
reclaim zone2 from kswapd for watermark, prio 12, scan-reclaimed 33-29, age 1262, hot+cold+free pages 92532+25639+924
reclaim zone2 from kswapd for watermark, prio 11, scan-reclaimed 33-28, age 1269, hot+cold+free pages 92531+25611+956
reclaim zone2 from kswapd for watermark, prio 12, scan-reclaimed 33-31, age 1274, hot+cold+free pages 92535+25579+988
reclaim zone2 from kswapd for watermark, prio 11, scan-reclaimed 33-29, age 1278, hot+cold+free pages 92518+25565+1020
reclaim zone2 from kswapd for watermark, prio 12, scan-reclaimed 33-29, age 1282, hot+cold+free pages 92507+25547+1052
reclaim zone2 from kswapd for aging, prio 11, scan-reclaimed 33-27, age 1287, hot+cold+free pages 92506+25519+1084
reclaim zone2 from direct for aging, prio 12, scan-reclaimed 33-29, age 1292, hot+cold+free pages 92505+25494+1116
reclaim zone2 from direct for aging, prio 11, scan-reclaimed 33-32, age 1297, hot+cold+free pages 92506+25464+1148
reclaim zone2 from kswapd for watermark, prio 12, scan-reclaimed 33-32, age 1302, hot+cold+free pages 92506+25746+860
reclaim zone2 from kswapd for watermark, prio 12, scan-reclaimed 33-32, age 1307, hot+cold+free pages 92488+25732+892
reclaim zone2 from kswapd for watermark, prio 12, scan-reclaimed 33-32, age 1312, hot+cold+free pages 92488+25700+924
reclaim zone2 from kswapd for watermark, prio 12, scan-reclaimed 33-32, age 1316, hot+cold+free pages 92481+25675+956
reclaim zone2 from kswapd for watermark, prio 12, scan-reclaimed 33-32, age 1322, hot+cold+free pages 92481+25643+988
reclaim zone2 from kswapd for watermark, prio 12, scan-reclaimed 33-32, age 1326, hot+cold+free pages 92474+25618+1020
reclaim zone2 from kswapd for watermark, prio 12, scan-reclaimed 33-32, age 1331, hot+cold+free pages 92474+25586+1052
reclaim zone2 from kswapd for aging, prio 12, scan-reclaimed 33-32, age 1336, hot+cold+free pages 92474+25554+1084
reclaim zone2 from direct for aging, prio 12, scan-reclaimed 33-32, age 1341, hot+cold+free pages 92474+25522+1116
reclaim zone2 from direct for aging, prio 12, scan-reclaimed 33-32, age 1347, hot+cold+free pages 92470+25744+892
reclaim zone2 from kswapd for watermark, prio 12, scan-reclaimed 33-32, age 1352, hot+cold+free pages 92463+25712+924
reclaim zone2 from kswapd for watermark, prio 12, scan-reclaimed 33-25, age 1357, hot+cold+free pages 92462+25681+956

Signed-off-by: Wu Fengguang <wfg@mail.ustc.edu.cn>
---


 mm/vmscan.c |   90 ++++++++++++++++++++++++++++++++++++++++++++++++++----------
 1 files changed, 75 insertions(+), 15 deletions(-)

--- linux-2.6.15-rc2-mm1.orig/mm/vmscan.c
+++ linux-2.6.15-rc2-mm1/mm/vmscan.c
@@ -38,6 +38,7 @@
 #include <asm/div64.h>
 
 #include <linux/swapops.h>
+#include <linux/debugfs.h>
 
 /* possible outcome of pageout() */
 typedef enum {
@@ -72,10 +73,7 @@ struct scan_control {
 	/* This context's GFP mask */
 	gfp_t gfp_mask;
 
-	int may_writepage;
-
-	/* Can pages be swapped as part of reclaim? */
-	int may_swap;
+	unsigned long flags;
 
 	/* This context's SWAP_CLUSTER_MAX. If freeing memory for
 	 * suspend, we effectively ignore SWAP_CLUSTER_MAX.
@@ -84,6 +82,59 @@ struct scan_control {
 	int swap_cluster_max;
 };
 
+#define SC_RECLAIM_FROM_KSWAPD		0x01
+#define SC_RECLAIM_FROM_DIRECT		0x02
+#define SC_RECLAIM_FROM_EARLY		0x04
+#define SC_RECLAIM_FOR_WATERMARK	0x10
+#define SC_RECLAIM_FOR_AGING		0x20
+#define SC_RECLAIM_MASK			0xFF
+
+#define SC_MAY_WRITEPAGE		0x100
+#define SC_MAY_SWAP			0x200	/* Can pages be swapped as part of reclaim? */
+
+#ifdef CONFIG_DEBUG_FS
+static u32 debug_page_reclaim;
+
+static inline void debug_reclaim(struct scan_control *sc, unsigned long flags)
+{
+	sc->flags = (sc->flags & ~SC_RECLAIM_MASK) | flags;
+}
+
+static inline void debug_reclaim_report(struct scan_control *sc, struct zone *z)
+{
+	if (!debug_page_reclaim)
+		return;
+
+	printk(KERN_DEBUG "reclaim zone%d from %s for %s, "
+			"prio %d, scan-reclaimed %lu-%lu, age %lu, "
+			"hot+cold+free pages %lu+%lu+%lu\n",
+			zone_idx(z),
+			(sc->flags & SC_RECLAIM_FROM_KSWAPD) ? "kswapd" :
+			((sc->flags & SC_RECLAIM_FROM_DIRECT) ? "direct" :
+								"early"),
+			(sc->flags & SC_RECLAIM_FOR_AGING) ?
+							"aging" : "watermark",
+			sc->priority, sc->nr_scanned, sc->nr_reclaimed,
+			z->page_age, z->nr_active, z->nr_inactive, z->free_pages);
+}
+
+static inline void debug_reclaim_init(void)
+{
+	debugfs_create_bool("debug_page_reclaim", 0644, NULL,
+							&debug_page_reclaim);
+}
+#else
+static inline void debug_reclaim(struct scan_control *sc, int flags)
+{
+}
+static inline void debug_reclaim_report(struct scan_control *sc, struct zone *z)
+{
+}
+static inline void debug_reclaim_init(void)
+{
+}
+#endif
+
 #define lru_to_page(_head) (list_entry((_head)->prev, struct page, lru))
 
 #ifdef ARCH_HAS_PREFETCH
@@ -499,7 +550,7 @@ static int shrink_list(struct list_head 
 		 * Try to allocate it some swap space here.
 		 */
 		if (PageAnon(page) && !PageSwapCache(page)) {
-			if (!sc->may_swap)
+			if (!(sc->flags & SC_MAY_SWAP))
 				goto keep_locked;
 			if (!add_to_swap(page, GFP_ATOMIC))
 				goto activate_locked;
@@ -530,7 +581,7 @@ static int shrink_list(struct list_head 
 				goto keep_locked;
 			if (!may_enter_fs)
 				goto keep_locked;
-			if (laptop_mode && !sc->may_writepage)
+			if (laptop_mode && !(sc->flags & SC_MAY_WRITEPAGE))
 				goto keep_locked;
 
 			/* Page is dirty, try to write it out here */
@@ -939,6 +990,7 @@ static void shrink_cache(struct zone *zo
 			mod_page_state(kswapd_steal, nr_freed);
 		mod_page_state_zone(zone, pgsteal, nr_freed);
 		sc->nr_to_reclaim -= nr_freed;
+		debug_reclaim_report(sc, zone);
 
 		spin_lock_irq(&zone->lru_lock);
 		/*
@@ -1216,11 +1268,14 @@ shrink_caches(struct zone **zones, struc
 			continue;
 		}
 
+		debug_reclaim(sc, SC_RECLAIM_FROM_DIRECT);
 		shrink_zone(zone, sc);
 	}
 
-	if (z)
+	if (z) {
+		debug_reclaim(sc, SC_RECLAIM_FROM_DIRECT|SC_RECLAIM_FOR_AGING);
 		shrink_zone(z, sc);
+	}
 }
  
 /*
@@ -1248,8 +1303,7 @@ int try_to_free_pages(struct zone **zone
 	delay_prefetch();
 
 	sc.gfp_mask = gfp_mask;
-	sc.may_writepage = 0;
-	sc.may_swap = 1;
+	sc.flags = SC_MAY_SWAP;
 
 	inc_page_state(allocstall);
 
@@ -1290,7 +1344,7 @@ int try_to_free_pages(struct zone **zone
 		 */
 		if (total_scanned > sc.swap_cluster_max + sc.swap_cluster_max/2) {
 			wakeup_pdflush(laptop_mode ? 0 : total_scanned);
-			sc.may_writepage = 1;
+			sc.flags |= SC_MAY_WRITEPAGE;
 		}
 
 		/* Take a nap, wait for some writeback to complete */
@@ -1365,8 +1419,7 @@ loop_again:
 	total_scanned = 0;
 	total_reclaimed = 0;
 	sc.gfp_mask = GFP_KERNEL;
-	sc.may_writepage = 0;
-	sc.may_swap = 1;
+	sc.flags = SC_MAY_SWAP;
 	sc.nr_mapped = read_page_state(nr_mapped);
 
 	inc_page_state(pageoutrun);
@@ -1395,10 +1448,16 @@ loop_again:
 				if (!zone_watermark_ok(zone, order,
 							zone->pages_high,
 							0, 0)) {
+					debug_reclaim(&sc,
+							SC_RECLAIM_FROM_KSWAPD|
+							SC_RECLAIM_FOR_WATERMARK);
 					all_zones_ok = 0;
 				} else if (zone == youngest_zone &&
 						pages_more_aged(oldest_zone,
 								youngest_zone)) {
+					debug_reclaim(&sc,
+							SC_RECLAIM_FROM_KSWAPD|
+							SC_RECLAIM_FOR_AGING);
 					/* if (priority > DEF_PRIORITY - 2) */
 						/* all_zones_ok = 0; */
 				} else
@@ -1433,7 +1492,7 @@ loop_again:
 		 */
 		if (total_scanned > SWAP_CLUSTER_MAX * 2 &&
 		    total_scanned > total_reclaimed+total_reclaimed/2)
-			sc.may_writepage = 1;
+			sc.flags |= SC_MAY_WRITEPAGE;
 
 		if (nr_pages && to_free > total_reclaimed)
 			continue;	/* swsusp: need to do more work */
@@ -1623,6 +1682,7 @@ static int __init kswapd_init(void)
 		= find_task_by_pid(kernel_thread(kswapd, pgdat, CLONE_KERNEL));
 	total_memory = nr_free_pagecache_pages();
 	hotcpu_notifier(cpu_callback, 0);
+	debug_reclaim_init();
 	return 0;
 }
 
@@ -1645,8 +1705,7 @@ int zone_reclaim(struct zone *zone, gfp_
 		return 0;
 
 	sc.gfp_mask = gfp_mask;
-	sc.may_writepage = 0;
-	sc.may_swap = 0;
+	sc.flags = 0;
 	sc.nr_mapped = read_page_state(nr_mapped);
 	sc.nr_scanned = 0;
 	sc.nr_reclaimed = 0;
@@ -1662,6 +1721,7 @@ int zone_reclaim(struct zone *zone, gfp_
 	if (atomic_read(&zone->reclaim_in_progress) > 0)
 		goto out;
 
+	debug_reclaim(&sc, SC_RECLAIM_FROM_EARLY);
 	shrink_zone(zone, &sc);
 	total_reclaimed = sc.nr_reclaimed;
 

--

^ permalink raw reply	[flat|nested] 32+ messages in thread

* [PATCH 05/19] radixtree: sync with mainline
  2005-11-25 15:12 [PATCH 00/19] Adaptive read-ahead V8 Wu Fengguang
                   ` (3 preceding siblings ...)
  2005-11-25 15:12 ` [PATCH 04/19] mm: debug page reclaim Wu Fengguang
@ 2005-11-25 15:12 ` Wu Fengguang
  2005-11-25 15:12 ` [PATCH 06/19] radixtree: look-aside cache Wu Fengguang
                   ` (14 subsequent siblings)
  19 siblings, 0 replies; 32+ messages in thread
From: Wu Fengguang @ 2005-11-25 15:12 UTC (permalink / raw)
  To: linux-kernel; +Cc: Andrew Morton, Christoph Lameter, Wu Fengguang

[-- Attachment #1: radixtree-sync-with-mainline.patch --]
[-- Type: text/plain, Size: 1182 bytes --]

The patch from Christoph Lameter:

[PATCH] radix-tree: Remove unnecessary indirections and clean up code

is only partially merged into -mm tree. This patch completes it.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Wu Fengguang <wfg@mail.ustc.edu.cn>
---

 lib/radix-tree.c |   12 +++++-------
 1 files changed, 5 insertions(+), 7 deletions(-)

--- linux-2.6.14-mm1.orig/lib/radix-tree.c
+++ linux-2.6.14-mm1/lib/radix-tree.c
@@ -291,27 +291,25 @@ static inline void **__lookup_slot(struc
 				   unsigned long index)
 {
 	unsigned int height, shift;
-	struct radix_tree_node **slot;
+	struct radix_tree_node *slot;
 
 	height = root->height;
 	if (index > radix_tree_maxindex(height))
 		return NULL;
 
 	shift = (height-1) * RADIX_TREE_MAP_SHIFT;
-	slot = &root->rnode;
+	slot = root->rnode;
 
 	while (height > 0) {
-		if (*slot == NULL)
+		if (slot == NULL)
 			return NULL;
 
-		slot = (struct radix_tree_node **)
-			((*slot)->slots +
-				((index >> shift) & RADIX_TREE_MAP_MASK));
+		slot = slot->slots[(index >> shift) & RADIX_TREE_MAP_MASK];
 		shift -= RADIX_TREE_MAP_SHIFT;
 		height--;
 	}
 
-	return (void **)slot;
+	return slot;
 }
 
 /**

--

^ permalink raw reply	[flat|nested] 32+ messages in thread

* [PATCH 06/19] radixtree: look-aside cache
  2005-11-25 15:12 [PATCH 00/19] Adaptive read-ahead V8 Wu Fengguang
                   ` (4 preceding siblings ...)
  2005-11-25 15:12 ` [PATCH 05/19] radixtree: sync with mainline Wu Fengguang
@ 2005-11-25 15:12 ` Wu Fengguang
  2005-11-25 15:12 ` [PATCH 07/19] readahead: some preparation Wu Fengguang
                   ` (13 subsequent siblings)
  19 siblings, 0 replies; 32+ messages in thread
From: Wu Fengguang @ 2005-11-25 15:12 UTC (permalink / raw)
  To: linux-kernel; +Cc: Andrew Morton, Nick Piggin, Wu Fengguang

[-- Attachment #1: radixtree-lookaside-cache.patch --]
[-- Type: text/plain, Size: 11999 bytes --]

This introduces a set of lookup functions to radix tree for the read-ahead
logic.  Other access patterns with high locality may also benefit from them.

- radix_tree_lookup_node(root, index, level)
	Perform partial lookup, return the @level'th parent of the slot at
	@index.

- radix_tree_cache_xxx()
	Init/Query the cache.
- radix_tree_cache_lookup(root, cache, index)
	Perform lookup with the aid of a look-aside cache.
	For sequential scans, it has a time complexity of 64*O(1) + 1*O(logn).

	Typical usage:

   void func() {
  +       struct radix_tree_cache cache;
  +
  +       radix_tree_cache_init(&cache);
          read_lock_irq(&mapping->tree_lock);
          for(;;) {
  -               page = radix_tree_lookup(&mapping->page_tree, index);
  +               page = radix_tree_cache_lookup(&mapping->page_tree, &cache, index);
          }
          read_unlock_irq(&mapping->tree_lock);
   }                                                                                                                       	

- radix_tree_lookup_head(root, index, max_scan)
- radix_tree_lookup_tail(root, index, max_scan)
	Assume [head, tail) to be a segment with continuous pages. The two
	functions search for the head and tail index of the segment at @index.

Signed-off-by: Wu Fengguang <wfg@mail.ustc.edu.cn>
---

 include/linux/radix-tree.h |   80 +++++++++++++++++-
 lib/radix-tree.c           |  196 ++++++++++++++++++++++++++++++++++++++++-----
 2 files changed, 254 insertions(+), 22 deletions(-)

--- linux-2.6.15-rc2-mm1.orig/include/linux/radix-tree.h
+++ linux-2.6.15-rc2-mm1/include/linux/radix-tree.h
@@ -22,12 +22,24 @@
 #include <linux/preempt.h>
 #include <linux/types.h>
 
+#define RADIX_TREE_MAP_SHIFT	6
+#define RADIX_TREE_MAP_SIZE	(1UL << RADIX_TREE_MAP_SHIFT)
+#define RADIX_TREE_MAP_MASK	(RADIX_TREE_MAP_SIZE-1)
+
 struct radix_tree_root {
 	unsigned int		height;
 	gfp_t			gfp_mask;
 	struct radix_tree_node	*rnode;
 };
 
+/*
+ * Support access patterns with strong locality.
+ */
+struct radix_tree_cache {
+	unsigned long first_index;
+	struct radix_tree_node *tree_node;
+};
+
 #define RADIX_TREE_INIT(mask)	{					\
 	.height = 0,							\
 	.gfp_mask = (mask),						\
@@ -45,9 +57,18 @@ do {									\
 } while (0)
 
 int radix_tree_insert(struct radix_tree_root *, unsigned long, void *);
-void *radix_tree_lookup(struct radix_tree_root *, unsigned long);
-void **radix_tree_lookup_slot(struct radix_tree_root *, unsigned long);
+void *radix_tree_lookup_node(struct radix_tree_root *, unsigned long,
+							unsigned int);
+void **radix_tree_lookup_slot(struct radix_tree_root *root, unsigned long);
 void *radix_tree_delete(struct radix_tree_root *, unsigned long);
+int radix_tree_cache_count(struct radix_tree_cache *cache);
+void *radix_tree_cache_lookup_node(struct radix_tree_root *root,
+				struct radix_tree_cache *cache,
+				unsigned long index, unsigned int level);
+unsigned long radix_tree_lookup_head(struct radix_tree_root *root,
+				unsigned long index, unsigned int max_scan);
+unsigned long radix_tree_lookup_tail(struct radix_tree_root *root,
+				unsigned long index, unsigned int max_scan);
 unsigned int
 radix_tree_gang_lookup(struct radix_tree_root *root, void **results,
 			unsigned long first_index, unsigned int max_items);
@@ -69,4 +90,59 @@ static inline void radix_tree_preload_en
 	preempt_enable();
 }
 
+/**
+ *	radix_tree_lookup    -    perform lookup operation on a radix tree
+ *	@root:		radix tree root
+ *	@index:		index key
+ *
+ *	Lookup the item at the position @index in the radix tree @root.
+ */
+static inline void *radix_tree_lookup(struct radix_tree_root *root,
+							unsigned long index)
+{
+	return radix_tree_lookup_node(root, index, 0);
+}
+
+/**
+ *	radix_tree_cache_init    -    init the cache
+ *	@cache:		look-aside cache
+ *
+ *	Init the @cache.
+ */
+static inline void radix_tree_cache_init(struct radix_tree_cache *cache)
+{
+	cache->first_index = RADIX_TREE_MAP_MASK;
+	cache->tree_node = NULL;
+}
+
+/**
+ *	radix_tree_cache_lookup    -    cached lookup page
+ *	@root:		radix tree root
+ *	@cache:		look-aside cache
+ *	@index:		index key
+ *
+ *	Lookup the item at the position @index in the radix tree @root.
+ */
+static inline void *radix_tree_cache_lookup(struct radix_tree_root *root,
+				struct radix_tree_cache *cache,
+				unsigned long index)
+{
+	return radix_tree_cache_lookup_node(root, cache, index, 0);
+}
+
+static inline int radix_tree_cache_size(struct radix_tree_cache *cache)
+{
+	return RADIX_TREE_MAP_SIZE;
+}
+
+static inline int radix_tree_cache_full(struct radix_tree_cache *cache)
+{
+	return radix_tree_cache_count(cache) == radix_tree_cache_size(cache);
+}
+
+static inline int radix_tree_cache_first_index(struct radix_tree_cache *cache)
+{
+	return cache->first_index;
+}
+
 #endif /* _LINUX_RADIX_TREE_H */
--- linux-2.6.15-rc2-mm1.orig/lib/radix-tree.c
+++ linux-2.6.15-rc2-mm1/lib/radix-tree.c
@@ -32,16 +32,7 @@
 #include <linux/bitops.h>
 
 
-#ifdef __KERNEL__
-#define RADIX_TREE_MAP_SHIFT	6
-#else
-#define RADIX_TREE_MAP_SHIFT	3	/* For more stressful testing */
-#endif
 #define RADIX_TREE_TAGS		2
-
-#define RADIX_TREE_MAP_SIZE	(1UL << RADIX_TREE_MAP_SHIFT)
-#define RADIX_TREE_MAP_MASK	(RADIX_TREE_MAP_SIZE-1)
-
 #define RADIX_TREE_TAG_LONGS	\
 	((RADIX_TREE_MAP_SIZE + BITS_PER_LONG - 1) / BITS_PER_LONG)
 
@@ -287,8 +278,21 @@ int radix_tree_insert(struct radix_tree_
 }
 EXPORT_SYMBOL(radix_tree_insert);
 
-static inline void **__lookup_slot(struct radix_tree_root *root,
-				   unsigned long index)
+/**
+ *	radix_tree_lookup_node    -    low level lookup routine
+ *	@root:		radix tree root
+ *	@index:		index key
+ *	@level:		stop at that many levels from bottom
+ *
+ *	Lookup the item at the position @index in the radix tree @root.
+ *	The return value is:
+ *	@level == 0:      page at @index;
+ *	@level == 1:      the corresponding bottom level tree node;
+ *	@level < height:  (height - @level)th level tree node;
+ *	@level >= height: root node.
+ */
+void *radix_tree_lookup_node(struct radix_tree_root *root,
+				unsigned long index, unsigned int level)
 {
 	unsigned int height, shift;
 	struct radix_tree_node *slot;
@@ -300,7 +304,7 @@ static inline void **__lookup_slot(struc
 	shift = (height-1) * RADIX_TREE_MAP_SHIFT;
 	slot = root->rnode;
 
-	while (height > 0) {
+	while (height > level) {
 		if (slot == NULL)
 			return NULL;
 
@@ -311,6 +315,49 @@ static inline void **__lookup_slot(struc
 
 	return slot;
 }
+EXPORT_SYMBOL(radix_tree_lookup_node);
+
+/**
+ *	radix_tree_cache_lookup_node    -    cached lookup node
+ *	@root:		radix tree root
+ *	@cache:		look-aside cache
+ *	@index:		index key
+ *
+ *	Lookup the item at the position @index in the radix tree @root,
+ *	and return the node @level levels from the bottom in the search path.
+ *	@cache stores the last accessed upper level tree node by this
+ *	function, and is always checked first before searching in the tree.
+ *	It can improve speed for access patterns with strong locality.
+ *	NOTE:
+ *	- The cache becomes invalid on leaving the lock;
+ *	- Do not intermix calls with different @level.
+ */
+void *radix_tree_cache_lookup_node(struct radix_tree_root *root,
+				struct radix_tree_cache *cache,
+				unsigned long index, unsigned int level)
+{
+	struct radix_tree_node *node;
+        unsigned long i;
+        unsigned long mask;
+
+        if (level >= root->height)
+                return root->rnode;
+
+        i = ((index >> (level * RADIX_TREE_MAP_SHIFT)) & RADIX_TREE_MAP_MASK);
+        mask = ~((RADIX_TREE_MAP_SIZE << (level * RADIX_TREE_MAP_SHIFT)) - 1);
+
+	if ((index & mask) == cache->first_index)
+                return cache->tree_node->slots[i];
+
+	node = radix_tree_lookup_node(root, index, level + 1);
+	if (!node)
+		return 0;
+
+	cache->tree_node = node;
+	cache->first_index = (index & mask);
+        return node->slots[i];
+}
+EXPORT_SYMBOL(radix_tree_cache_lookup_node);
 
 /**
  *	radix_tree_lookup_slot    -    lookup a slot in a radix tree
@@ -322,25 +369,134 @@ static inline void **__lookup_slot(struc
  */
 void **radix_tree_lookup_slot(struct radix_tree_root *root, unsigned long index)
 {
-	return __lookup_slot(root, index);
+	struct radix_tree_node *node;
+
+	node = radix_tree_lookup_node(root, index, 1);
+	return node->slots + (index & RADIX_TREE_MAP_MASK);
 }
 EXPORT_SYMBOL(radix_tree_lookup_slot);
 
 /**
- *	radix_tree_lookup    -    perform lookup operation on a radix tree
+ *	radix_tree_cache_count    -    items in the cached node
+ *	@cache:      radix tree look-aside cache
+ *
+ *      Query the number of items contained in the cached node.
+ */
+int radix_tree_cache_count(struct radix_tree_cache *cache)
+{
+	if (!(cache->first_index & RADIX_TREE_MAP_MASK))
+		return cache->tree_node->count;
+	else
+		return 0;
+}
+EXPORT_SYMBOL(radix_tree_cache_count);
+
+/**
+ *	radix_tree_lookup_head    -    lookup the head index
  *	@root:		radix tree root
  *	@index:		index key
+ *	@max_scan:      max items to scan
  *
- *	Lookup the item at the position @index in the radix tree @root.
+ *      Lookup head index of the segment which contains @index. A segment is
+ *      a set of continuous pages in a file.
+ *      CASE                       RETURN VALUE
+ *      no page at @index          (not head) = @index + 1
+ *      found in the range         @index - @max_scan < (head index) <= @index
+ *      not found in range         (unfinished head) <= @index - @max_scan
  */
-void *radix_tree_lookup(struct radix_tree_root *root, unsigned long index)
+unsigned long radix_tree_lookup_head(struct radix_tree_root *root,
+				unsigned long index, unsigned int max_scan)
 {
-	void **slot;
+	struct radix_tree_cache cache;
+	struct radix_tree_node *node;
+	int i;
+	unsigned long origin;
 
-	slot = __lookup_slot(root, index);
-	return slot != NULL ? *slot : NULL;
+	origin = index;
+	if (unlikely(max_scan > index))
+		max_scan = index;
+        radix_tree_cache_init(&cache);
+
+next_node:
+	if (origin - index > max_scan)
+		goto out;
+
+	node = radix_tree_cache_lookup_node(root, &cache, index, 1);
+	if (!node)
+		goto out;
+
+	if (node->count == RADIX_TREE_MAP_SIZE) {
+		if (index < RADIX_TREE_MAP_SIZE) {
+			index = -1;
+			goto out;
+		}
+		index = (index - RADIX_TREE_MAP_SIZE) | RADIX_TREE_MAP_MASK;
+		goto next_node;
+	}
+
+	for (i = index & RADIX_TREE_MAP_MASK; i >= 0; i--, index--) {
+		if (!node->slots[i])
+			goto out;
+	}
+
+	goto next_node;
+
+out:
+	return index + 1;
+}
+EXPORT_SYMBOL(radix_tree_lookup_head);
+
+/**
+ *	radix_tree_lookup_tail    -    lookup the tail index
+ *	@root:		radix tree root
+ *	@index:		index key
+ *	@max_scan:      max items to scan
+ *
+ *      Lookup tail(pass the end) index of the segment which contains @index.
+ *      A segment is a set of continuous pages in a file.
+ *      CASE                       RETURN VALUE
+ *      found in the range         @index <= (tail index) < @index + @max_scan
+ *      not found in range         @index + @max_scan <= (non tail)
+ */
+unsigned long radix_tree_lookup_tail(struct radix_tree_root *root,
+				unsigned long index, unsigned int max_scan)
+{
+	struct radix_tree_cache cache;
+	struct radix_tree_node *node;
+	int i;
+	unsigned long origin;
+
+	origin = index;
+	if (unlikely(index + max_scan < index))
+		max_scan = LONG_MAX - index;
+        radix_tree_cache_init(&cache);
+
+next_node:
+	if (index - origin >= max_scan)
+		goto out;
+
+	node = radix_tree_cache_lookup_node(root, &cache, index, 1);
+	if (!node)
+		goto out;
+
+	if (node->count == RADIX_TREE_MAP_SIZE) {
+		index = (index | RADIX_TREE_MAP_MASK) + 1;
+		if (unlikely(!index))
+			goto out;
+		goto next_node;
+	}
+
+	for (i = index & RADIX_TREE_MAP_MASK; i < RADIX_TREE_MAP_SIZE; i++, index++) {
+		if (!node->slots[i])
+			goto out;
+	}
+
+	goto next_node;
+
+out:
+	return index;
 }
-EXPORT_SYMBOL(radix_tree_lookup);
+EXPORT_SYMBOL(radix_tree_lookup_tail);
 
 /**
  *	radix_tree_tag_set - set a tag on a radix tree node

--

^ permalink raw reply	[flat|nested] 32+ messages in thread

* [PATCH 07/19] readahead: some preparation
  2005-11-25 15:12 [PATCH 00/19] Adaptive read-ahead V8 Wu Fengguang
                   ` (5 preceding siblings ...)
  2005-11-25 15:12 ` [PATCH 06/19] radixtree: look-aside cache Wu Fengguang
@ 2005-11-25 15:12 ` Wu Fengguang
  2005-11-25 15:12 ` [PATCH 08/19] readahead: call scheme Wu Fengguang
                   ` (12 subsequent siblings)
  19 siblings, 0 replies; 32+ messages in thread
From: Wu Fengguang @ 2005-11-25 15:12 UTC (permalink / raw)
  To: linux-kernel; +Cc: Andrew Morton, Wu Fengguang

[-- Attachment #1: readahead-prepare.patch --]
[-- Type: text/plain, Size: 8900 bytes --]

Some random changes that do not fit in elsewhere.

Signed-off-by: Wu Fengguang <wfg@mail.ustc.edu.cn>
---

 include/linux/mm.h |    7 +
 mm/readahead.c     |  193 ++++++++++++++++++++++++++++++++++++++++++++++++++---
 2 files changed, 191 insertions(+), 9 deletions(-)

--- linux-2.6.15-rc2-mm1.orig/include/linux/mm.h
+++ linux-2.6.15-rc2-mm1/include/linux/mm.h
@@ -986,6 +986,13 @@ void handle_ra_miss(struct address_space
 		    struct file_ra_state *ra, pgoff_t offset);
 unsigned long max_sane_readahead(unsigned long nr);
 
+#ifdef CONFIG_DEBUG_FS
+extern u32 readahead_debug_level;
+#define READAHEAD_DEBUG_LEVEL(n)	(readahead_debug_level >= n)
+#else
+#define READAHEAD_DEBUG_LEVEL(n)	(0)
+#endif
+
 /* Do stack extension */
 extern int expand_stack(struct vm_area_struct *vma, unsigned long address);
 #ifdef CONFIG_IA64
--- linux-2.6.15-rc2-mm1.orig/mm/readahead.c
+++ linux-2.6.15-rc2-mm1/mm/readahead.c
@@ -15,13 +15,58 @@
 #include <linux/backing-dev.h>
 #include <linux/pagevec.h>
 
+/* The default max/min read-ahead pages. */
+#define KB(size)	(((size)*1024 + PAGE_CACHE_SIZE-1) / PAGE_CACHE_SIZE)
+#define MAX_RA_PAGES	KB(VM_MAX_READAHEAD)
+#define MIN_RA_PAGES	KB(VM_MIN_READAHEAD)
+
+#define next_page(pg) (list_entry((pg)->lru.prev, struct page, lru))
+#define prev_page(pg) (list_entry((pg)->lru.next, struct page, lru))
+
+#define dprintk(args...) \
+	do { if (READAHEAD_DEBUG_LEVEL(1)) printk(KERN_DEBUG args); } while(0)
+#define ddprintk(args...) \
+	do { if (READAHEAD_DEBUG_LEVEL(2)) printk(KERN_DEBUG args); } while(0)
+
+/*
+ * Debug facilities.
+ */
+#ifdef CONFIG_DEBUG_FS
+#define DEBUG_READAHEAD
+#endif
+
+#ifdef DEBUG_READAHEAD
+#include <linux/jiffies.h>
+#include <linux/debugfs.h>
+#include <linux/seq_file.h>
+#include <linux/init.h>
+
+u32 readahead_debug_level = 0;
+
+static int __init readahead_init(void)
+{
+	struct dentry *root;
+
+	root = debugfs_create_dir("readahead", NULL);
+
+	debugfs_create_u32("debug_level", 0644, root, &readahead_debug_level);
+
+	return 0;
+}
+
+module_init(readahead_init)
+#else /* !DEBUG_READAHEAD */
+
+#endif /* DEBUG_READAHEAD */
+
+
 void default_unplug_io_fn(struct backing_dev_info *bdi, struct page *page)
 {
 }
 EXPORT_SYMBOL(default_unplug_io_fn);
 
 struct backing_dev_info default_backing_dev_info = {
-	.ra_pages	= (VM_MAX_READAHEAD * 1024) / PAGE_CACHE_SIZE,
+	.ra_pages	= MAX_RA_PAGES,
 	.state		= 0,
 	.capabilities	= BDI_CAP_MAP_COPY,
 	.unplug_io_fn	= default_unplug_io_fn,
@@ -50,7 +95,7 @@ static inline unsigned long get_max_read
 
 static inline unsigned long get_min_readahead(struct file_ra_state *ra)
 {
-	return (VM_MIN_READAHEAD * 1024) / PAGE_CACHE_SIZE;
+	return MIN_RA_PAGES;
 }
 
 static inline void ra_off(struct file_ra_state *ra)
@@ -258,10 +303,11 @@ out:
  */
 static int
 __do_page_cache_readahead(struct address_space *mapping, struct file *filp,
-			pgoff_t offset, unsigned long nr_to_read)
+			pgoff_t offset, unsigned long nr_to_read,
+			unsigned long lookahead_size)
 {
 	struct inode *inode = mapping->host;
-	struct page *page;
+	struct page *page = NULL;
 	unsigned long end_index;	/* The last page we want to read */
 	LIST_HEAD(page_pool);
 	int page_idx;
@@ -271,7 +317,7 @@ __do_page_cache_readahead(struct address
 	if (isize == 0)
 		goto out;
 
- 	end_index = ((isize - 1) >> PAGE_CACHE_SHIFT);
+	end_index = ((isize - 1) >> PAGE_CACHE_SHIFT);
 
 	/*
 	 * Preallocate as many pages as we will need.
@@ -284,8 +330,14 @@ __do_page_cache_readahead(struct address
 			break;
 
 		page = radix_tree_lookup(&mapping->page_tree, page_offset);
-		if (page)
+		if (page) {
+#ifdef READAHEAD_STREAMING
+			if (prefer_adaptive_readahead() &&
+				page_idx == nr_to_read - lookahead_size)
+				SetPageReadahead(page);
+#endif
 			continue;
+		}
 
 		read_unlock_irq(&mapping->tree_lock);
 		page = page_cache_alloc_cold(mapping);
@@ -294,6 +346,9 @@ __do_page_cache_readahead(struct address
 			break;
 		page->index = page_offset;
 		list_add(&page->lru, &page_pool);
+		if (prefer_adaptive_readahead() &&
+				page_idx == nr_to_read - lookahead_size)
+			SetPageReadahead(page);
 		ret++;
 	}
 	read_unlock_irq(&mapping->tree_lock);
@@ -330,7 +385,7 @@ int force_page_cache_readahead(struct ad
 		if (this_chunk > nr_to_read)
 			this_chunk = nr_to_read;
 		err = __do_page_cache_readahead(mapping, filp,
-						offset, this_chunk);
+						offset, this_chunk, 0);
 		if (err < 0) {
 			ret = err;
 			break;
@@ -377,7 +432,7 @@ int do_page_cache_readahead(struct addre
 	if (bdi_read_congested(mapping->backing_dev_info))
 		return -1;
 
-	return __do_page_cache_readahead(mapping, filp, offset, nr_to_read);
+	return __do_page_cache_readahead(mapping, filp, offset, nr_to_read, 0);
 }
 
 /*
@@ -397,7 +452,10 @@ blockable_page_cache_readahead(struct ad
 	if (!block && bdi_read_congested(mapping->backing_dev_info))
 		return 0;
 
-	actual = __do_page_cache_readahead(mapping, filp, offset, nr_to_read);
+	actual = __do_page_cache_readahead(mapping, filp, offset, nr_to_read, 0);
+
+	dprintk("blockable-readahead(ino=%lu, ra=%lu+%lu) = %d\n",
+			mapping->host->i_ino, offset, nr_to_read, actual);
 
 	return check_ra_success(ra, nr_to_read, actual);
 }
@@ -575,3 +633,120 @@ unsigned long max_sane_readahead(unsigne
 	__get_zone_counts(&active, &inactive, &free, NODE_DATA(numa_node_id()));
 	return min(nr, (inactive + free) / 2);
 }
+
+/*
+ * Adaptive read-ahead.
+ *
+ * Good read patterns are compact both in space and time. The read-ahead logic
+ * tries to grant larger read-ahead size to better readers under the constraint
+ * of system memory and load pressures.
+ *
+ * It employs two methods to estimate the max thrashing safe read-ahead size:
+ *   1. state based   - the default one
+ *   2. context based - the fail safe one
+ * The integration of the dual methods has the merit of being agile and robust.
+ * It makes the overall design clean: special cases are handled in general by
+ * the stateless method, leaving the stateful one simple and fast.
+ *
+ * To improve throughput and decrease read delay, the logic 'looks ahead'.
+ * In most read-ahead chunks, one page will be selected and tagged with
+ * PG_readahead. Later when the page with PG_readahead is read, the logic
+ * will be notified to submit the next read-ahead chunk in advance.
+ *
+ *                 a read-ahead chunk
+ *    +-----------------------------------------+
+ *    |       # PG_readahead                    |
+ *    +-----------------------------------------+
+ *            ^ When this page is read, notify me for the next read-ahead.
+ *
+ *
+ * Here are some variable names used frequently:
+ *
+ *                                   |<------- la_size ------>|
+ *                  +-----------------------------------------+
+ *                  |                #                        |
+ *                  +-----------------------------------------+
+ *      ra_index -->|<---------------- ra_size -------------->|
+ *
+ */
+
+/*
+ * The nature of read-ahead allows most tests to fail or even be wrong.
+ * Here we just do not bother to call get_page(), it's meaningless anyway.
+ */
+static inline struct page *__find_page(struct address_space *mapping,
+							pgoff_t offset)
+{
+	return radix_tree_lookup(&mapping->page_tree, offset);
+}
+
+static inline struct page *find_page(struct address_space *mapping,
+							pgoff_t offset)
+{
+	struct page *page;
+
+	read_lock_irq(&mapping->tree_lock);
+	page = __find_page(mapping, offset);
+	read_unlock_irq(&mapping->tree_lock);
+#ifdef DEBUG_READAHEAD_RADIXTREE
+	if (page)
+		BUG_ON(page->index != offset);
+#endif
+	return page;
+}
+
+/*
+ * Move pages in danger (of thrashing) to the head of inactive_list.
+ * Not expected to happen frequently.
+ */
+static int rescue_pages(struct page *page, pgoff_t nr_pages)
+{
+	unsigned long pgrescue;
+	pgoff_t index;
+	struct address_space *mapping;
+	struct zone *zone;
+
+	BUG_ON(!nr_pages || !page);
+	pgrescue = 0;
+	index = page_index(page);
+	mapping = page_mapping(page);
+
+	dprintk("rescue_pages(ino=%lu, index=%lu nr=%lu)\n",
+			mapping->host->i_ino, index, nr_pages);
+
+	for(;;) {
+		zone = page_zone(page);
+		spin_lock_irq(&zone->lru_lock);
+
+		if (!PageLRU(page))
+			goto out_unlock;
+
+		while (page_mapping(page) == mapping &&
+				page_index(page) == index) {
+			struct page *the_page = page;
+			page = next_page(page);
+			if (!PageActive(the_page) &&
+					!PageActivate(the_page) &&
+					!PageLocked(the_page) &&
+					page_count(the_page) == 1) {
+				list_move(&the_page->lru, &zone->inactive_list);
+				pgrescue++;
+			}
+			index++;
+			if (!--nr_pages)
+				goto out_unlock;
+		}
+
+		spin_unlock_irq(&zone->lru_lock);
+
+		page = find_page(mapping, index);
+		if (!page)
+			goto out;
+	}
+out_unlock:
+	spin_unlock_irq(&zone->lru_lock);
+out:
+	ra_account(0, RA_EVENT_READAHEAD_RESCUE, pgrescue);
+
+	return nr_pages ? index : 0;
+}

--

^ permalink raw reply	[flat|nested] 32+ messages in thread

* [PATCH 08/19] readahead: call scheme
  2005-11-25 15:12 [PATCH 00/19] Adaptive read-ahead V8 Wu Fengguang
                   ` (6 preceding siblings ...)
  2005-11-25 15:12 ` [PATCH 07/19] readahead: some preparation Wu Fengguang
@ 2005-11-25 15:12 ` Wu Fengguang
  2005-11-25 15:12 ` [PATCH 09/19] readahead: parameters Wu Fengguang
                   ` (11 subsequent siblings)
  19 siblings, 0 replies; 32+ messages in thread
From: Wu Fengguang @ 2005-11-25 15:12 UTC (permalink / raw)
  To: linux-kernel; +Cc: Andrew Morton, Wu Fengguang

[-- Attachment #1: readahead-call-scheme.patch --]
[-- Type: text/plain, Size: 13823 bytes --]

An new page flag PG_readahead is introduced as a look-ahead mark.
The look-ahead mark corresponds to `ahead_start' of the current logic.

The read-ahead logic is called when
        - read reaches a look-ahead mark;
        - read on a non-present page.

And ra_access() is called on every page reference to maintain the cache_hit
counter.

This scheme has the following benefits:
        - makes all stateful/stateless methods happy;
        - eliminates the cache hit problem naturally;
        - lives in harmony with application managed read-aheads via
          fadvise/madvise.

Signed-off-by: Wu Fengguang <wfg@mail.ustc.edu.cn>
---

 include/linux/mm.h         |   13 +++
 include/linux/page-flags.h |    5 +
 mm/filemap.c               |   74 ++++++++++++++++---
 mm/page_alloc.c            |    2 
 mm/readahead.c             |  170 +++++++++++++++++++++++++++++++++++++++++++++
 5 files changed, 252 insertions(+), 12 deletions(-)

--- linux-2.6.15-rc2-mm1.orig/include/linux/page-flags.h
+++ linux-2.6.15-rc2-mm1/include/linux/page-flags.h
@@ -77,6 +77,7 @@
 #define PG_nosave_free		18	/* Free, should not be written */
 #define PG_uncached		19	/* Page has been mapped as uncached */
 #define PG_activate		20	/* delayed activate */
+#define PG_readahead		21	/* check readahead when reading this page */
 
 /*
  * Global page accounting.  One instance per CPU.  Only unsigned longs are
@@ -311,6 +312,10 @@ extern void __mod_page_state(unsigned lo
 #define TestClearPageActivate(page) test_and_clear_bit(PG_activate, &(page)->flags)
 #define TestSetPageActivate(page) test_and_set_bit(PG_activate, &(page)->flags)
 
+#define PageReadahead(page)	test_bit(PG_readahead, &(page)->flags)
+#define SetPageReadahead(page)	set_bit(PG_readahead, &(page)->flags)
+#define TestClearPageReadahead(page) test_and_clear_bit(PG_readahead, &(page)->flags)
+
 struct page;	/* forward declaration */
 
 int test_clear_page_dirty(struct page *page);
--- linux-2.6.15-rc2-mm1.orig/include/linux/mm.h
+++ linux-2.6.15-rc2-mm1/include/linux/mm.h
@@ -985,6 +985,13 @@ unsigned long page_cache_readahead(struc
 void handle_ra_miss(struct address_space *mapping, 
 		    struct file_ra_state *ra, pgoff_t offset);
 unsigned long max_sane_readahead(unsigned long nr);
+unsigned long
+page_cache_readahead_adaptive(struct address_space *mapping,
+			struct file_ra_state *ra, struct file *filp,
+			struct page *prev_page, struct page *page,
+			pgoff_t first_index,
+			pgoff_t index, pgoff_t last_index);
+void fastcall ra_access(struct file_ra_state *ra, struct page *page);
 
 #ifdef CONFIG_DEBUG_FS
 extern u32 readahead_debug_level;
@@ -993,6 +1000,12 @@ extern u32 readahead_debug_level;
 #define READAHEAD_DEBUG_LEVEL(n)	(0)
 #endif
 
+extern int readahead_ratio;
+static inline int prefer_adaptive_readahead(void)
+{
+	return readahead_ratio > 9;
+}
+
 /* Do stack extension */
 extern int expand_stack(struct vm_area_struct *vma, unsigned long address);
 #ifdef CONFIG_IA64
--- linux-2.6.15-rc2-mm1.orig/mm/page_alloc.c
+++ linux-2.6.15-rc2-mm1/mm/page_alloc.c
@@ -508,7 +508,7 @@ static int prep_new_page(struct page *pa
 
 	page->flags &= ~(1 << PG_uptodate | 1 << PG_error |
 			1 << PG_referenced | 1 << PG_arch_1 |
-			1 << PG_activate |
+			1 << PG_activate | 1 << PG_readahead |
 			1 << PG_checked | 1 << PG_mappedtodisk);
 	set_page_private(page, 0);
 	set_page_refs(page, order);
--- linux-2.6.15-rc2-mm1.orig/mm/filemap.c
+++ linux-2.6.15-rc2-mm1/mm/filemap.c
@@ -765,10 +765,12 @@ void do_generic_mapping_read(struct addr
 	unsigned long prev_index;
 	loff_t isize;
 	struct page *cached_page;
+	struct page *prev_page;
 	int error;
 	struct file_ra_state ra = *_ra;
 
 	cached_page = NULL;
+	prev_page = NULL;
 	index = *ppos >> PAGE_CACHE_SHIFT;
 	next_index = index;
 	prev_index = ra.prev_page;
@@ -779,6 +781,10 @@ void do_generic_mapping_read(struct addr
 	if (!isize)
 		goto out;
 
+	if (READAHEAD_DEBUG_LEVEL(5))
+		printk(KERN_DEBUG "read-file(ino=%lu, req=%lu+%lu)\n",
+			inode->i_ino, index, last_index - index);
+
 	end_index = (isize - 1) >> PAGE_CACHE_SHIFT;
 	for (;;) {
 		struct page *page;
@@ -797,16 +803,36 @@ void do_generic_mapping_read(struct addr
 		nr = nr - offset;
 
 		cond_resched();
-		if (index == next_index)
+
+		if (!prefer_adaptive_readahead() && index == next_index)
 			next_index = page_cache_readahead(mapping, &ra, filp,
 					index, last_index - index);
 
 find_page:
 		page = find_get_page(mapping, index);
+		if (prefer_adaptive_readahead()) {
+			if (unlikely(page == NULL)) {
+				page_cache_readahead_adaptive(mapping, &ra,
+						filp, prev_page, NULL,
+						*ppos >> PAGE_CACHE_SHIFT,
+						index, last_index);
+				page = find_get_page(mapping, index);
+			} else if (PageReadahead(page)) {
+				page_cache_readahead_adaptive(mapping, &ra,
+						filp, prev_page, page,
+						*ppos >> PAGE_CACHE_SHIFT,
+						index, last_index);
+			}
+		}
 		if (unlikely(page == NULL)) {
-			handle_ra_miss(mapping, &ra, index);
+			if (!prefer_adaptive_readahead())
+				handle_ra_miss(mapping, &ra, index);
 			goto no_cached_page;
 		}
+		if (prev_page)
+			page_cache_release(prev_page);
+		prev_page = page;
+		ra_access(&ra, page);
 		if (!PageUptodate(page))
 			goto page_not_up_to_date;
 page_ok:
@@ -841,7 +868,6 @@ page_ok:
 		index += offset >> PAGE_CACHE_SHIFT;
 		offset &= ~PAGE_CACHE_MASK;
 
-		page_cache_release(page);
 		if (ret == nr && desc->count)
 			continue;
 		goto out;
@@ -853,7 +879,6 @@ page_not_up_to_date:
 		/* Did it get unhashed before we got the lock? */
 		if (!page->mapping) {
 			unlock_page(page);
-			page_cache_release(page);
 			continue;
 		}
 
@@ -883,7 +908,6 @@ readpage:
 					 * invalidate_inode_pages got it
 					 */
 					unlock_page(page);
-					page_cache_release(page);
 					goto find_page;
 				}
 				unlock_page(page);
@@ -904,7 +928,6 @@ readpage:
 		isize = i_size_read(inode);
 		end_index = (isize - 1) >> PAGE_CACHE_SHIFT;
 		if (unlikely(!isize || index > end_index)) {
-			page_cache_release(page);
 			goto out;
 		}
 
@@ -913,7 +936,6 @@ readpage:
 		if (index == end_index) {
 			nr = ((isize - 1) & ~PAGE_CACHE_MASK) + 1;
 			if (nr <= offset) {
-				page_cache_release(page);
 				goto out;
 			}
 		}
@@ -923,7 +945,6 @@ readpage:
 readpage_error:
 		/* UHHUH! A synchronous read error occurred. Report it */
 		desc->error = error;
-		page_cache_release(page);
 		goto out;
 
 no_cached_page:
@@ -948,15 +969,22 @@ no_cached_page:
 		}
 		page = cached_page;
 		cached_page = NULL;
+		if (prev_page)
+			page_cache_release(prev_page);
+		prev_page = page;
 		goto readpage;
 	}
 
 out:
 	*_ra = ra;
+	if (prefer_adaptive_readahead())
+		_ra->prev_page = prev_index;
 
 	*ppos = ((loff_t) index << PAGE_CACHE_SHIFT) + offset;
 	if (cached_page)
 		page_cache_release(cached_page);
+	if (prev_page)
+		page_cache_release(prev_page);
 	if (filp)
 		file_accessed(filp);
 }
@@ -1237,6 +1265,12 @@ struct page *filemap_nopage(struct vm_ar
 
 	pgoff = ((address-area->vm_start) >> PAGE_CACHE_SHIFT) + area->vm_pgoff;
 
+	if (READAHEAD_DEBUG_LEVEL(5))
+		printk(KERN_DEBUG "read-mmap(ino=%lu, idx=%lu, hint=%s)\n",
+			inode->i_ino, pgoff,
+			VM_RandomReadHint(area) ? "random" :
+			(VM_SequentialReadHint(area) ? "sequential" : "none"));
+
 retry_all:
 	size = (i_size_read(inode) + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT;
 	if (pgoff >= size)
@@ -1252,19 +1286,33 @@ retry_all:
 	 *
 	 * For sequential accesses, we use the generic readahead logic.
 	 */
-	if (VM_SequentialReadHint(area))
+	if (!prefer_adaptive_readahead() && VM_SequentialReadHint(area))
 		page_cache_readahead(mapping, ra, file, pgoff, 1);
 
+
 	/*
 	 * Do we have something in the page cache already?
 	 */
 retry_find:
 	page = find_get_page(mapping, pgoff);
+	if (prefer_adaptive_readahead() && VM_SequentialReadHint(area)) {
+		if (!page) {
+			page_cache_readahead_adaptive(mapping, ra,
+						file, NULL, NULL,
+						pgoff, pgoff, pgoff + 1);
+			page = find_get_page(mapping, pgoff);
+		} else if (PageReadahead(page)) {
+			page_cache_readahead_adaptive(mapping, ra,
+						file, NULL, page,
+						pgoff, pgoff, pgoff + 1);
+		}
+	}
 	if (!page) {
 		unsigned long ra_pages;
 
 		if (VM_SequentialReadHint(area)) {
-			handle_ra_miss(mapping, ra, pgoff);
+			if (!prefer_adaptive_readahead())
+				handle_ra_miss(mapping, ra, pgoff);
 			goto no_cached_page;
 		}
 		ra->mmap_miss++;
@@ -1301,6 +1349,8 @@ retry_find:
 	if (!did_readaround)
 		ra->mmap_hit++;
 
+	ra_access(ra, page);
+
 	/*
 	 * Ok, found a page in the page cache, now we need to check
 	 * that it's up-to-date.
@@ -1315,6 +1365,8 @@ success:
 	mark_page_accessed(page);
 	if (type)
 		*type = majmin;
+	if (prefer_adaptive_readahead())
+		ra->prev_page = page->index;
 	return page;
 
 outside_data_content:
--- linux-2.6.15-rc2-mm1.orig/mm/readahead.c
+++ linux-2.6.15-rc2-mm1/mm/readahead.c
@@ -20,6 +20,43 @@
 #define MAX_RA_PAGES	KB(VM_MAX_READAHEAD)
 #define MIN_RA_PAGES	KB(VM_MIN_READAHEAD)
 
+/* Detailed classification of read-ahead behaviors. */
+#define RA_CLASS_SHIFT 3
+#define RA_CLASS_MASK  ((1 << RA_CLASS_SHIFT) - 1)
+enum ra_class {
+	RA_CLASS_ALL,
+	RA_CLASS_NEWFILE,
+	RA_CLASS_STATE,
+	RA_CLASS_CONTEXT,
+	RA_CLASS_CONTEXT_ACCELERATED,
+	RA_CLASS_BACKWARD,
+	RA_CLASS_RANDOM_THRASHING,
+	RA_CLASS_RANDOM_SEEK,
+	RA_CLASS_END,
+};
+
+/* Read-ahead events to be accounted. */
+enum ra_event {
+	RA_EVENT_CACHE_MISS,		/* read cache misses */
+	RA_EVENT_READRANDOM,		/* random reads */
+	RA_EVENT_IO_CONGESTION,		/* io congestion */
+	RA_EVENT_IO_CACHE_HIT,		/* canceled io due to cache hit */
+	RA_EVENT_IO_BLOCK,		/* read on locked page */
+
+	RA_EVENT_READAHEAD,		/* read-ahead issued */
+	RA_EVENT_READAHEAD_HIT,		/* read-ahead page hit */
+	RA_EVENT_LOOKAHEAD,		/* look-ahead issued */
+	RA_EVENT_LOOKAHEAD_HIT,		/* look-ahead mark hit */
+	RA_EVENT_LOOKAHEAD_NOACTION,	/* look-ahead mark ignored */
+	RA_EVENT_READAHEAD_EOF,		/* read-ahead reaches EOF */
+	RA_EVENT_READAHEAD_SHRINK,	/* ra_size decreased, reflects var. */
+	RA_EVENT_READAHEAD_THRASHING,	/* read-ahead thrashing happened */
+	RA_EVENT_READAHEAD_MUTILATE,	/* read-ahead request mutilated */
+	RA_EVENT_READAHEAD_RESCUE,	/* read-ahead rescued */
+
+	RA_EVENT_END
+};
+
 #define next_page(pg) (list_entry((pg)->lru.prev, struct page, lru))
 #define prev_page(pg) (list_entry((pg)->lru.next, struct page, lru))
 
@@ -750,3 +787,136 @@ out:
 
 	return nr_pages ? index : 0;
 }
+
+/*
+ * This is the entry point of the adaptive read-ahead logic.
+ *
+ * It is only called on two conditions:
+ * 1. page == NULL
+ *    A cache miss happened, it can be either a random read or a sequential one.
+ * 2. page != NULL
+ *    There is a look-ahead mark(PG_readahead) from a previous sequential read.
+ *    It's time to do some checking and submit the next read-ahead IO.
+ *
+ * That has the merits of:
+ * - makes all stateful/stateless methods happy;
+ * - eliminates the cache hit problem naturally;
+ * - lives in harmony with application managed read-aheads via fadvise/madvise.
+ */
+unsigned long
+page_cache_readahead_adaptive(struct address_space *mapping,
+			struct file_ra_state *ra, struct file *filp,
+			struct page *prev_page, struct page *page,
+			pgoff_t begin_index,
+			pgoff_t index, pgoff_t end_index)
+{
+	unsigned long size;
+	unsigned long ra_min;
+	unsigned long ra_max;
+	int ret;
+
+	if (page) {
+		if(!TestClearPageReadahead(page))
+			return 0;
+		if (bdi_read_congested(mapping->backing_dev_info))
+			return 0;
+	}
+
+	if (page)
+		ra_account(ra, RA_EVENT_LOOKAHEAD_HIT,
+				ra->readahead_index - ra->lookahead_index);
+	else if (index)
+		ra_account(ra, RA_EVENT_CACHE_MISS, end_index - begin_index);
+
+	size = end_index - index;
+	get_readahead_bounds(ra, &ra_min, &ra_max);
+
+	/* readahead disabled? */
+	if (unlikely(!ra_min || !readahead_ratio)) {
+		size = max_sane_readahead(size);
+		goto readit;
+	}
+
+	/*
+	 * Start of file.
+	 */
+	if (index == 0)
+		return newfile_readahead(mapping, filp, ra, end_index, ra_min);
+
+	/*
+	 * State based sequential read-ahead.
+	 */
+	if (!disable_stateful_method() &&
+		index == ra->lookahead_index &&
+		(page || index == ra->readahead_index) &&
+		(ra_cache_hit_ok(ra) ||
+		 end_index - begin_index >= ra_max))
+		return state_based_readahead(mapping, filp, ra, page, ra_max);
+
+	/*
+	 * Backward read-ahead.
+	 */
+	if (try_read_backward(ra, begin_index, end_index, size, ra_min, ra_max))
+		return ra_dispatch(ra, mapping, filp);
+
+	/*
+	 * Context based sequential read-ahead.
+	 */
+	ret = try_context_based_readahead(mapping, ra, prev_page, page,
+						index, size, ra_min, ra_max);
+	if (ret > 0)
+		return ra_dispatch(ra, mapping, filp);
+	if (ret < 0)
+		return 0;
+
+	/* No action on look ahead time? */
+	if (page) {
+		ra_account(ra, RA_EVENT_LOOKAHEAD_NOACTION,
+					ra->readahead_index - index);
+		return 0;
+	}
+
+	/*
+	 * Random read that follows a sequential one.
+	 */
+	if (try_random_readahead(ra, index, size, ra_max))
+		return ra_dispatch(ra, mapping, filp);
+
+	/*
+	 * Random read.
+	 */
+	if (size > ra_max)
+		size = ra_max;
+
+readit:
+	size = __do_page_cache_readahead(mapping, filp, index, size, 0);
+
+	ra_account(ra, RA_EVENT_READRANDOM, size);
+	dprintk("readrandom(ino=%lu, pages=%lu, index=%lu-%lu-%lu) = %lu\n",
+			mapping->host->i_ino, mapping->nrpages,
+			begin_index, index, end_index, size);
+
+	return size;
+}
+
+/*
+ * Call me!
+ */
+void fastcall ra_access(struct file_ra_state *ra, struct page *page)
+{
+	if (page->flags & ((1 << PG_active)   |
+			   (1 << PG_activate) |
+			   (1 << PG_referenced)))
+		return;
+
+	if (!ra_has_index(ra, page->index))
+		return;
+
+	ra->cache_hit++;
+
+	if (page->index >= ra->ra_index)
+		ra_account(ra, RA_EVENT_READAHEAD_HIT, 1);
+	else
+		ra_account(ra, RA_EVENT_READAHEAD_HIT, -1);
+}
+

--

^ permalink raw reply	[flat|nested] 32+ messages in thread

* [PATCH 09/19] readahead: parameters
  2005-11-25 15:12 [PATCH 00/19] Adaptive read-ahead V8 Wu Fengguang
                   ` (7 preceding siblings ...)
  2005-11-25 15:12 ` [PATCH 08/19] readahead: call scheme Wu Fengguang
@ 2005-11-25 15:12 ` Wu Fengguang
  2005-11-25 15:12 ` [PATCH 10/19] readahead: state based method Wu Fengguang
                   ` (10 subsequent siblings)
  19 siblings, 0 replies; 32+ messages in thread
From: Wu Fengguang @ 2005-11-25 15:12 UTC (permalink / raw)
  To: linux-kernel; +Cc: Andrew Morton, Wu Fengguang

[-- Attachment #1: readahead-parameters.patch --]
[-- Type: text/plain, Size: 8583 bytes --]

- new entry in /proc/sys/vm/readahead_ratio with default value of 50;
- new entry in /proc/sys/vm/readahead_hit_rate with default value of 2;
- new entry in /proc/sys/vm/readahead_live_chunk with default value of 2M;
- limit mmap read-around size to 256kb;
- dynamic minimal/initial read-ahead size.

readahead_ratio is divided into several ranges:

	condition			action
===================================================================
readahead_ratio == 0		disable read-ahead
readahead_ratio < 9		select old read-ahead logic
readahead_ratio >= 9		select new read-ahead logic
readahead_ratio >= 80		enable live pages protection

Signed-off-by: Wu Fengguang <wfg@mail.ustc.edu.cn>
---

 Documentation/sysctl/vm.txt |   51 ++++++++++++++++++++++++++++++++++++++++++++
 include/linux/mm.h          |    5 +++-
 include/linux/sysctl.h      |    3 ++
 kernel/sysctl.c             |   34 +++++++++++++++++++++++++++++
 mm/filemap.c                |    7 ++++++
 mm/readahead.c              |   39 +++++++++++++++++++++++++++++++++
 6 files changed, 138 insertions(+), 1 deletion(-)

--- linux-2.6.15-rc2-mm1.orig/Documentation/sysctl/vm.txt
+++ linux-2.6.15-rc2-mm1/Documentation/sysctl/vm.txt
@@ -27,6 +27,9 @@ Currently, these files are in /proc/sys/
 - laptop_mode
 - block_dump
 - swap_prefetch
+- readahead_ratio
+- readahead_hit_rate
+- readahead_live_chunk
 
 ==============================================================
 
@@ -114,3 +117,51 @@ except when laptop_mode is enabled and t
 Setting it to 0 disables prefetching entirely.
 
 The default value is dependant on ramsize.
+
+==============================================================
+
+readahead_ratio
+
+This limits read-ahead size to percent of the thrashing-threshold.
+The thrashing-threshold is dynamicly estimated according to the
+_history_ read speed and system load, and used to limit the
+_future_ read-ahead request size.
+
+Set it to a low value if you have not enough memory to counteract
+the I/O load fluctuations. But if there's plenty of memory, set it
+to a larger value might help increase read speed. Also note that a
+value >= 80 activates mandatory thrashing protection(see
+readahead_live_chunk).
+
+The default value is 50.
+
+==============================================================
+
+readahead_hit_rate
+
+This is the max allowed value of (read-ahead-pages : accessed-pages).
+If the previous read-ahead request has bad hit rate, kernel will be
+very conservative to issue the next read-ahead.
+
+A large value helps speedup some sparse access patterns, at the cost
+of more memory consumption. It is recommended to keep the value below
+(max-readahead-pages / 8).
+
+The default value is 2.
+
+==============================================================
+
+readahead_live_chunk
+
+In a file server, there are typically one or more sequential
+readers working on a file. The kernel can detect most live
+chunks(a sequence of pages to be accessed by an active reader),
+and save them for their imminent readers. This is called
+mandatory thrashing protection, and is only in effect when
+(readahead_ratio >= 80).
+
+This parameter controls the max allowed chunk size, i.e. the max
+number of pages pinned for an active reader.
+
+The default value is 2MB size of pages. That is 512 on most archs.
+Increase it if you have enough memory.
--- linux-2.6.15-rc2-mm1.orig/include/linux/mm.h
+++ linux-2.6.15-rc2-mm1/include/linux/mm.h
@@ -968,11 +968,14 @@ extern int filemap_populate(struct vm_ar
 int write_one_page(struct page *page, int wait);
 
 /* readahead.c */
-#define VM_MAX_READAHEAD	128	/* kbytes */
+#define VM_MAX_READAHEAD	1024	/* kbytes */
 #define VM_MIN_READAHEAD	16	/* kbytes (includes current page) */
 #define VM_MAX_CACHE_HIT    	256	/* max pages in a row in cache before
 					 * turning readahead off */
 
+/* turn on read-ahead thrashing protection if (readahead_ratio >= ##) */
+#define VM_READAHEAD_PROTECT_RATIO	80
+
 int do_page_cache_readahead(struct address_space *mapping, struct file *filp,
 			pgoff_t offset, unsigned long nr_to_read);
 int force_page_cache_readahead(struct address_space *mapping, struct file *filp,
--- linux-2.6.15-rc2-mm1.orig/include/linux/sysctl.h
+++ linux-2.6.15-rc2-mm1/include/linux/sysctl.h
@@ -182,6 +182,9 @@ enum
 	VM_LEGACY_VA_LAYOUT=27, /* legacy/compatibility virtual address space layout */
 	VM_SWAP_TOKEN_TIMEOUT=28, /* default time for token time out */
 	VM_SWAP_PREFETCH=29,	/* int: amount to swap prefetch */
+	VM_READAHEAD_RATIO=30, /* percent of read-ahead size to thrashing-threshold */
+	VM_READAHEAD_HIT_RATE=31, /* one accessed page legitimizes so many read-ahead pages */
+	VM_READAHEAD_LIVE_CHUNK=32, /* pin no more than that many pages for a live reader */
 };
 
 
--- linux-2.6.15-rc2-mm1.orig/kernel/sysctl.c
+++ linux-2.6.15-rc2-mm1/kernel/sysctl.c
@@ -68,6 +68,9 @@ extern int min_free_kbytes;
 extern int printk_ratelimit_jiffies;
 extern int printk_ratelimit_burst;
 extern int pid_max_min, pid_max_max;
+extern int readahead_ratio;
+extern int readahead_hit_rate;
+extern int readahead_live_chunk;
 
 #if defined(CONFIG_X86_LOCAL_APIC) && defined(CONFIG_X86)
 int unknown_nmi_panic;
@@ -668,6 +671,7 @@ static ctl_table kern_table[] = {
 /* Constants for minimum and maximum testing in vm_table.
    We use these as one-element integer vectors. */
 static int zero;
+static int one = 1;
 static int one_hundred = 100;
 
 
@@ -867,6 +871,36 @@ static ctl_table vm_table[] = {
 	},
 #endif
 #endif
+	{
+		.ctl_name	= VM_READAHEAD_RATIO,
+		.procname	= "readahead_ratio",
+		.data		= &readahead_ratio,
+		.maxlen		= sizeof(readahead_ratio),
+		.mode		= 0644,
+		.proc_handler	= &proc_dointvec,
+		.strategy	= &sysctl_intvec,
+		.extra1		= &zero,
+	},
+	{
+		.ctl_name	= VM_READAHEAD_HIT_RATE,
+		.procname	= "readahead_hit_rate",
+		.data		= &readahead_hit_rate,
+		.maxlen		= sizeof(readahead_hit_rate),
+		.mode		= 0644,
+		.proc_handler	= &proc_dointvec,
+		.strategy	= &sysctl_intvec,
+		.extra1		= &one,
+	},
+	{
+		.ctl_name	= VM_READAHEAD_LIVE_CHUNK,
+		.procname	= "readahead_live_chunk",
+		.data		= &readahead_live_chunk,
+		.maxlen		= sizeof(readahead_live_chunk),
+		.mode		= 0644,
+		.proc_handler	= &proc_dointvec,
+		.strategy	= &sysctl_intvec,
+		.extra1		= &zero,
+	},
 	{ .ctl_name = 0 }
 };
 
--- linux-2.6.15-rc2-mm1.orig/mm/filemap.c
+++ linux-2.6.15-rc2-mm1/mm/filemap.c
@@ -1336,6 +1336,13 @@ retry_find:
 		if (ra_pages) {
 			pgoff_t start = 0;
 
+			/*
+			 * Max read-around should be much smaller than
+			 * max read-ahead.
+			 * How about adding a tunable parameter for this?
+			 */
+			if (ra_pages > 64)
+				ra_pages = 64;
 			if (pgoff > ra_pages / 2)
 				start = pgoff - ra_pages / 2;
 			do_page_cache_readahead(mapping, file, start, ra_pages);
--- linux-2.6.15-rc2-mm1.orig/mm/readahead.c
+++ linux-2.6.15-rc2-mm1/mm/readahead.c
@@ -20,6 +20,24 @@
 #define MAX_RA_PAGES	KB(VM_MAX_READAHEAD)
 #define MIN_RA_PAGES	KB(VM_MIN_READAHEAD)
 
+/* In laptop mode, poll delayed look-ahead on every ## pages read. */
+#define LAPTOP_POLL_INTERVAL 16
+
+/* Set look-ahead size to 1/# of the thrashing-threshold. */
+#define LOOKAHEAD_RATIO 8
+
+/* Set read-ahead size to ##% of the thrashing-threshold. */
+int readahead_ratio = 50;
+EXPORT_SYMBOL(readahead_ratio);
+
+/* Readahead as long as cache hit ratio keeps above 1/##. */
+int readahead_hit_rate = 2;
+EXPORT_SYMBOL(readahead_hit_rate);
+
+/* Scan backward ## pages to find a live reader. */
+int readahead_live_chunk = 2 * MAX_RA_PAGES;
+EXPORT_SYMBOL(readahead_live_chunk);
+
 /* Detailed classification of read-ahead behaviors. */
 #define RA_CLASS_SHIFT 3
 #define RA_CLASS_MASK  ((1 << RA_CLASS_SHIFT) - 1)
@@ -789,6 +807,27 @@ out:
 }
 
 /*
+ * ra_size is mainly determined by:
+ * 1. sequential-start: min(MIN_RA_PAGES + (pages>>14), KB(128))
+ * 2. sequential-max:	min(ra->ra_pages, 0xFFFF)
+ * 3. sequential:	(thrashing-threshold) * readahead_ratio / 100
+ *
+ * Table of concrete numbers for 4KB page size:
+ *  (inactive + free) (in MB):    4   8   16   32   64  128  256  512 1024
+ *    initial ra_size (in KB):   16  16   16   16   20   24   32   48   64
+ */
+static inline void get_readahead_bounds(struct file_ra_state *ra,
+					unsigned long *ra_min,
+					unsigned long *ra_max)
+{
+	unsigned long pages;
+
+	pages = node_free_and_cold_pages();
+	*ra_max = min(min(pages/2, 0xFFFFUL), ra->ra_pages);
+	*ra_min = min(min(MIN_RA_PAGES + (pages>>14), KB(128)), *ra_max/2);
+}
+
+/*
  * This is the entry point of the adaptive read-ahead logic.
  *
  * It is only called on two conditions:

--

^ permalink raw reply	[flat|nested] 32+ messages in thread

* [PATCH 10/19] readahead: state based method
  2005-11-25 15:12 [PATCH 00/19] Adaptive read-ahead V8 Wu Fengguang
                   ` (8 preceding siblings ...)
  2005-11-25 15:12 ` [PATCH 09/19] readahead: parameters Wu Fengguang
@ 2005-11-25 15:12 ` Wu Fengguang
  2005-11-25 15:21   ` Eric Dumazet
  2005-11-25 15:12 ` [PATCH 11/19] readahead: context " Wu Fengguang
                   ` (9 subsequent siblings)
  19 siblings, 1 reply; 32+ messages in thread
From: Wu Fengguang @ 2005-11-25 15:12 UTC (permalink / raw)
  To: linux-kernel; +Cc: Andrew Morton, Wu Fengguang

[-- Attachment #1: readahead-method-stateful.patch --]
[-- Type: text/plain, Size: 14166 bytes --]

This is the fast code path.

Major steps:
        - estimate a thrashing safe ra_size;
        - assemble the next read-ahead request in file_ra_state;
        - submit it.

Signed-off-by: Wu Fengguang <wfg@mail.ustc.edu.cn>
---

 include/linux/fs.h |    8 +
 include/linux/mm.h |    6 
 mm/memory.c        |    1 
 mm/readahead.c     |  355 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 mm/swap.c          |    2 
 mm/vmscan.c        |    3 
 6 files changed, 374 insertions(+), 1 deletion(-)

--- linux-2.6.15-rc2-mm1.orig/include/linux/fs.h
+++ linux-2.6.15-rc2-mm1/include/linux/fs.h
@@ -604,13 +604,19 @@ struct file_ra_state {
 	unsigned long start;		/* Current window */
 	unsigned long size;
 	unsigned long flags;		/* ra flags RA_FLAG_xxx*/
-	unsigned long cache_hit;	/* cache hit count*/
+	uint64_t      cache_hit;	/* cache hit count*/
 	unsigned long prev_page;	/* Cache last read() position */
 	unsigned long ahead_start;	/* Ahead window */
 	unsigned long ahead_size;
 	unsigned long ra_pages;		/* Maximum readahead window */
 	unsigned long mmap_hit;		/* Cache hit stat for mmap accesses */
 	unsigned long mmap_miss;	/* Cache miss stat for mmap accesses */
+
+	unsigned long age;
+	pgoff_t la_index;
+	pgoff_t ra_index;
+	pgoff_t lookahead_index;
+	pgoff_t readahead_index;
 };
 #define RA_FLAG_MISS 0x01	/* a cache miss occured against this file */
 #define RA_FLAG_INCACHE 0x02	/* file is already in cache */
--- linux-2.6.15-rc2-mm1.orig/include/linux/mm.h
+++ linux-2.6.15-rc2-mm1/include/linux/mm.h
@@ -1009,6 +1009,12 @@ static inline int prefer_adaptive_readah
 	return readahead_ratio > 9;
 }
 
+DECLARE_PER_CPU(unsigned long, readahead_aging);
+static inline void inc_readahead_aging(void)
+{
+	__get_cpu_var(readahead_aging)++;
+}
+
 /* Do stack extension */
 extern int expand_stack(struct vm_area_struct *vma, unsigned long address);
 #ifdef CONFIG_IA64
--- linux-2.6.15-rc2-mm1.orig/mm/memory.c
+++ linux-2.6.15-rc2-mm1/mm/memory.c
@@ -1894,6 +1894,7 @@ static int do_anonymous_page(struct mm_s
 		page_table = pte_offset_map_lock(mm, pmd, address, &ptl);
 		if (!pte_none(*page_table))
 			goto release;
+		inc_readahead_aging();
 		inc_mm_counter(mm, anon_rss);
 		lru_cache_add_active(page);
 		SetPageReferenced(page);
--- linux-2.6.15-rc2-mm1.orig/mm/vmscan.c
+++ linux-2.6.15-rc2-mm1/mm/vmscan.c
@@ -539,6 +539,9 @@ static int shrink_list(struct list_head 
 			goto activate_locked;
 		}
 
+		if (!PageReferenced(page))
+			inc_readahead_aging();
+
 		referenced = page_referenced(page, 1, sc->priority <= 0);
 		/* In active use or really unfreeable?  Activate it. */
 		if (referenced && page_mapping_inuse(page))
--- linux-2.6.15-rc2-mm1.orig/mm/swap.c
+++ linux-2.6.15-rc2-mm1/mm/swap.c
@@ -124,6 +124,8 @@ void fastcall mark_page_accessed(struct 
 		ClearPageReferenced(page);
 	} else if (!PageReferenced(page)) {
 		SetPageReferenced(page);
+		if (PageLRU(page) && !PageActivate(page))
+			inc_readahead_aging();
 	}
 }
 
--- linux-2.6.15-rc2-mm1.orig/mm/readahead.c
+++ linux-2.6.15-rc2-mm1/mm/readahead.c
@@ -38,6 +38,12 @@ EXPORT_SYMBOL(readahead_hit_rate);
 int readahead_live_chunk = 2 * MAX_RA_PAGES;
 EXPORT_SYMBOL(readahead_live_chunk);
 
+/* Analog to zone->aging_total.
+ * But mainly increased on fresh page references, so is much more smoother.
+ */
+DEFINE_PER_CPU(unsigned long, readahead_aging);
+EXPORT_PER_CPU_SYMBOL(readahead_aging);
+
 /* Detailed classification of read-ahead behaviors. */
 #define RA_CLASS_SHIFT 3
 #define RA_CLASS_MASK  ((1 << RA_CLASS_SHIFT) - 1)
@@ -97,6 +103,7 @@ enum ra_event {
 #include <linux/init.h>
 
 u32 readahead_debug_level = 0;
+u32 debug_disable_stateful_method = 0;
 
 static int __init readahead_init(void)
 {
@@ -105,13 +112,26 @@ static int __init readahead_init(void)
 	root = debugfs_create_dir("readahead", NULL);
 
 	debugfs_create_u32("debug_level", 0644, root, &readahead_debug_level);
+	debugfs_create_bool("disable_stateful_method", 0644, root,
+						&debug_disable_stateful_method);
 
 	return 0;
 }
 
 module_init(readahead_init)
+
+static inline int disable_stateful_method(void)
+{
+	return debug_disable_stateful_method;
+}
+
 #else /* !DEBUG_READAHEAD */
 
+static inline int disable_stateful_method(void)
+{
+	return 0;
+}
+
 #endif /* DEBUG_READAHEAD */
 
 
@@ -807,6 +827,341 @@ out:
 }
 
 /*
+ * State based calculation of read-ahead request.
+ *
+ * This figure shows the meaning of file_ra_state members:
+ *
+ *             chunk A                            chunk B
+ *  +---------------------------+-------------------------------------------+
+ *  |             #             |                   #                       |
+ *  +---------------------------+-------------------------------------------+
+ *                ^             ^                   ^                       ^
+ *              la_index      ra_index     lookahead_index         readahead_index
+ */
+
+/*
+ * The global effective length of the inactive_list(s).
+ */
+static unsigned long node_free_and_cold_pages(void)
+{
+	unsigned int i;
+	unsigned long sum = 0;
+	struct zone *zones = NODE_DATA(numa_node_id())->node_zones;
+
+	for (i = 0; i < MAX_NR_ZONES; i++)
+		sum += zones[i].nr_inactive +
+			zones[i].free_pages - zones[i].pages_low;
+
+	return sum;
+}
+
+/*
+ * A more smooth and timely analog to zone->aging_total.
+ */
+static unsigned long node_readahead_aging(void)
+{
+	unsigned long cpu;
+	unsigned long sum = 0;
+	cpumask_t mask = node_to_cpumask(numa_node_id());
+
+	for_each_cpu_mask(cpu, mask)
+		sum += per_cpu(readahead_aging, cpu);
+
+	return sum;
+}
+
+/*
+ * Set class of read-ahead
+ */
+static inline void set_ra_class(struct file_ra_state *ra,
+				enum ra_class ra_class)
+{
+	unsigned long FLAGS_MASK;
+	unsigned long flags;
+	unsigned long old_ra_class;
+
+	FLAGS_MASK = ~(RA_CLASS_MASK | (RA_CLASS_MASK << RA_CLASS_SHIFT));
+	flags = ra->flags & FLAGS_MASK;
+
+	old_ra_class = (ra->flags & RA_CLASS_MASK) << RA_CLASS_SHIFT;
+
+	ra->flags = flags | old_ra_class | ra_class;
+}
+
+/*
+ * The 64bit cache_hit stores three accumulated value and one counter value.
+ * MSB                                                                   LSB
+ * 3333333333333333 : 2222222222222222 : 1111111111111111 : 0000000000000000
+ */
+static inline int ra_cache_hit(struct file_ra_state *ra, int nr)
+{
+	return (ra->cache_hit >> (nr * 16)) & 0xFFFF;
+}
+
+/*
+ * Conceptual code:
+ * ra_cache_hit(ra, 1) += ra_cache_hit(ra, 0);
+ * ra_cache_hit(ra, 0) = 0;
+ */
+static inline void ra_addup_cache_hit(struct file_ra_state *ra)
+{
+	int n;
+
+	n = ra_cache_hit(ra, 0);
+	ra->cache_hit -= n;
+	n <<= 16;
+	ra->cache_hit += n;
+}
+
+/*
+ * The read-ahead is deemed success if cache-hit-rate >= 1/readahead_hit_rate.
+ */
+static inline int ra_cache_hit_ok(struct file_ra_state *ra)
+{
+	return ra_cache_hit(ra, 0) * readahead_hit_rate >=
+					(ra->lookahead_index - ra->la_index);
+}
+
+/*
+ * Check if @index falls in the @ra request.
+ */
+static inline int ra_has_index(struct file_ra_state *ra, pgoff_t index)
+{
+	if (index < ra->la_index || index >= ra->readahead_index)
+		return 0;
+
+	if (index >= ra->ra_index)
+		return 1;
+	else
+		return -1;
+}
+
+/*
+ * Prepare file_ra_state for a new read-ahead sequence.
+ */
+static inline void ra_state_init(struct file_ra_state *ra,
+				pgoff_t la_index, pgoff_t ra_index)
+{
+	ra_addup_cache_hit(ra);
+	ra->cache_hit <<= 16;
+	ra->lookahead_index = la_index;
+	ra->readahead_index = ra_index;
+}
+
+/*
+ * Take down a new read-ahead request in file_ra_state.
+ */
+static inline void ra_state_update(struct file_ra_state *ra,
+				unsigned long ra_size, unsigned long la_size)
+{
+#ifdef DEBUG_READAHEAD
+	unsigned long old_ra = ra->readahead_index - ra->ra_index;
+	if (ra_size < old_ra && ra_cache_hit(ra, 0))
+		ra_account(ra, RA_EVENT_READAHEAD_SHRINK, old_ra - ra_size);
+#endif
+	ra_addup_cache_hit(ra);
+	ra->ra_index = ra->readahead_index;
+	ra->la_index = ra->lookahead_index;
+	ra->readahead_index += ra_size;
+	ra->lookahead_index = ra->readahead_index - la_size;
+	ra->age = node_readahead_aging();
+}
+
+/*
+ * Adjust the read-ahead request in file_ra_state.
+ */
+static inline void ra_state_adjust(struct file_ra_state *ra,
+				unsigned long ra_size, unsigned long la_size)
+{
+	ra->readahead_index = ra->ra_index + ra_size;
+	ra->lookahead_index = ra->readahead_index - la_size;
+}
+
+/*
+ * Submit IO for the read-ahead request in file_ra_state.
+ */
+static int ra_dispatch(struct file_ra_state *ra,
+			struct address_space *mapping, struct file *filp)
+{
+	pgoff_t eof_index;
+	unsigned long ra_size;
+	unsigned long la_size;
+	int actual;
+	enum ra_class ra_class;
+
+	ra_class = (ra->flags & RA_CLASS_MASK);
+	BUG_ON(ra_class == 0 || ra_class > RA_CLASS_END);
+
+	eof_index = ((i_size_read(mapping->host) - 1) >> PAGE_CACHE_SHIFT) + 1;
+	ra_size = ra->readahead_index - ra->ra_index;
+	la_size = ra->readahead_index - ra->lookahead_index;
+
+	/* Snap to EOF. */
+	if (unlikely(ra->ra_index >= eof_index))
+		return 0;
+	if (ra->readahead_index + ra_size / 2 > eof_index) {
+		if (ra_class == RA_CLASS_CONTEXT_ACCELERATED &&
+				eof_index > ra->lookahead_index + 1)
+			la_size = eof_index - ra->lookahead_index;
+		else
+			la_size = 0;
+		ra_size = eof_index - ra->ra_index;
+		ra_state_adjust(ra, ra_size, la_size);
+	}
+
+	actual = __do_page_cache_readahead(mapping, filp,
+					ra->ra_index, ra_size, la_size);
+
+#ifdef READAHEAD_STREAMING
+	if (actual < ra_size) {
+		struct page *page = find_page(mapping, ra->ra_index + actual);
+		if (page)
+			rescue_pages(page, ra_size);
+	}
+#endif
+
+	if (ra->readahead_index == eof_index)
+		ra_account(ra, RA_EVENT_READAHEAD_EOF, actual);
+	if (la_size)
+		ra_account(ra, RA_EVENT_LOOKAHEAD, la_size);
+	ra_account(ra, RA_EVENT_READAHEAD, actual);
+
+	dprintk("readahead-%s(ino=%lu, index=%lu, ra=%lu+%lu-%lu) = %d\n",
+			ra_class_name[ra_class],
+			mapping->host->i_ino, ra->la_index,
+			ra->ra_index, ra_size, la_size, actual);
+
+	return actual;
+}
+
+/*
+ * Determine the request parameters from primitive values.
+ *
+ * It applies the following rules:
+ *   - Substract ra_size by the old look-ahead to get real safe read-ahead;
+ *   - Set new la_size according to the (still large) ra_size;
+ *   - Apply upper limits;
+ *   - Make sure stream_shift is not too small.
+ *     (So that the next global_shift will not be too small.)
+ *
+ * Input:
+ * ra_size stores the estimated thrashing-threshold.
+ * la_size stores the look-ahead size of previous request.
+ */
+static inline int adjust_rala(unsigned long ra_max,
+				unsigned long *ra_size, unsigned long *la_size)
+{
+	unsigned long stream_shift = *la_size;
+
+	if (*ra_size > *la_size)
+		*ra_size -= *la_size;
+	else
+		return 0;
+
+	*la_size = *ra_size / LOOKAHEAD_RATIO;
+
+	if (*ra_size > ra_max)
+		*ra_size = ra_max;
+	if (*la_size > *ra_size)
+		*la_size = *ra_size;
+
+	stream_shift += (*ra_size - *la_size);
+	if (stream_shift < *ra_size / 4)
+		*la_size -= (*ra_size / 4 - stream_shift);
+
+	return 1;
+}
+
+/*
+ * The function estimates two values:
+ * 1. thrashing-threshold for the current stream
+ *    It is returned to make the next read-ahead request.
+ * 2. the remained space for the current chunk
+ *    It will be checked to ensure that the current chunk is safe.
+ *
+ * The computation will be pretty accurate under heavy load, and will vibrate
+ * more with light load(small global_shift), so the grow speed of ra_size
+ * must be limited, and a moderate large stream_shift must be insured.
+ *
+ * This figure illustrates the formula used in the function:
+ * While the stream reads stream_shift pages inside the chunks,
+ * the chunks are shifted global_shift pages inside inactive_list.
+ *
+ *      chunk A                    chunk B
+ *                          |<=============== global_shift ================|
+ *  +-------------+         +-------------------+                          |
+ *  |       #     |         |           #       |            inactive_list |
+ *  +-------------+         +-------------------+                     head |
+ *          |---->|         |---------->|
+ *             |                  |
+ *             +-- stream_shift --+
+ */
+static inline unsigned long compute_thrashing_threshold(
+						struct file_ra_state *ra,
+						unsigned long *remain)
+{
+	unsigned long global_size;
+	unsigned long global_shift;
+	unsigned long stream_shift;
+	unsigned long ra_size;
+
+	global_size = node_free_and_cold_pages();
+	global_shift = node_readahead_aging() - ra->age + 1;
+	stream_shift = ra_cache_hit(ra, 0);
+
+	ra_size = stream_shift *
+			global_size * readahead_ratio / (100 * global_shift);
+
+	if (global_size > global_shift)
+		*remain = stream_shift *
+				(global_size - global_shift) / global_shift;
+	else
+		*remain = 0;
+
+	ddprintk("compute_thrashing_threshold: "
+			"ra=%lu=%lu*%lu/%lu, remain %lu for %lu\n",
+			ra_size, stream_shift, global_size, global_shift,
+			*remain, ra->readahead_index - ra->lookahead_index);
+
+	return ra_size;
+}
+
+/*
+ * Main function for file_ra_state based read-ahead.
+ */
+static inline unsigned long
+state_based_readahead(struct address_space *mapping, struct file *filp,
+			struct file_ra_state *ra, struct page *page,
+			unsigned long ra_max)
+{
+	unsigned long ra_old;
+	unsigned long ra_size;
+	unsigned long la_size;
+	unsigned long remain_space;
+	unsigned long growth_limit;
+
+	la_size = ra->readahead_index - ra->lookahead_index;
+	ra_old = ra->readahead_index - ra->ra_index;
+	ra_size = compute_thrashing_threshold(ra, &remain_space);
+
+	if (readahead_ratio < VM_READAHEAD_PROTECT_RATIO &&
+			remain_space <= la_size && la_size > 1) {
+		rescue_pages(page, la_size);
+		return 0;
+	}
+
+	growth_limit = 2 * ra_old + (ra_max - ra_old) / 16;
+	if (!adjust_rala(min(ra_max, growth_limit), &ra_size, &la_size))
+		return 0;
+
+	set_ra_class(ra, RA_CLASS_STATE);
+	ra_state_update(ra, ra_size, la_size);
+
+	return ra_dispatch(ra, mapping, filp);
+}
+
+
+/*
  * ra_size is mainly determined by:
  * 1. sequential-start: min(MIN_RA_PAGES + (pages>>14), KB(128))
  * 2. sequential-max:	min(ra->ra_pages, 0xFFFF)

--

^ permalink raw reply	[flat|nested] 32+ messages in thread

* [PATCH 11/19] readahead: context based method
  2005-11-25 15:12 [PATCH 00/19] Adaptive read-ahead V8 Wu Fengguang
                   ` (9 preceding siblings ...)
  2005-11-25 15:12 ` [PATCH 10/19] readahead: state based method Wu Fengguang
@ 2005-11-25 15:12 ` Wu Fengguang
  2005-11-25 15:12 ` [PATCH 12/19] readahead: other methods Wu Fengguang
                   ` (8 subsequent siblings)
  19 siblings, 0 replies; 32+ messages in thread
From: Wu Fengguang @ 2005-11-25 15:12 UTC (permalink / raw)
  To: linux-kernel; +Cc: Andrew Morton, Wu Fengguang

[-- Attachment #1: readahead-method-context.patch --]
[-- Type: text/plain, Size: 10430 bytes --]

This is the slow code path.

No valid state info is available, so the page cache is queried to abtain the
required position/timing infomation.

Major steps:
        - look back/forward to find the ra_index;
        - look back to estimate a thrashing safe ra_size;
        - assemble the next read-ahead request in file_ra_state;
        - submit it.

Signed-off-by: Wu Fengguang <wfg@mail.ustc.edu.cn>
---

 mm/readahead.c |  345 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 files changed, 345 insertions(+)

--- linux-2.6.15-rc2-mm1.orig/mm/readahead.c
+++ linux-2.6.15-rc2-mm1/mm/readahead.c
@@ -1160,6 +1160,351 @@ state_based_readahead(struct address_spa
 	return ra_dispatch(ra, mapping, filp);
 }
 
+/*
+ * Page cache context based estimation of read-ahead/look-ahead size/index.
+ *
+ * The logic first looks around to find the start point of next read-ahead,
+ * and then, if necessary, looks backward in the inactive_list to get an
+ * estimation of the thrashing-threshold.
+ *
+ * The estimation theory can be illustrated with figure:
+ *
+ *   chunk A           chunk B                      chunk C                 head
+ *
+ *   l01 l11           l12   l21                    l22
+ *| |-->|-->|       |------>|-->|                |------>|
+ *| +-------+       +-----------+                +-------------+               |
+ *| |   #   |       |       #   |                |       #     |               |
+ *| +-------+       +-----------+                +-------------+               |
+ *| |<==============|<===========================|<============================|
+ *        L0                     L1                            L2
+ *
+ * Let f(l) = L be a map from
+ * 	l: the number of pages read by the stream
+ * to
+ * 	L: the number of pages pushed into inactive_list in the mean time
+ * then
+ * 	f(l01) <= L0
+ * 	f(l11 + l12) = L1
+ * 	f(l21 + l22) = L2
+ * 	...
+ * 	f(l01 + l11 + ...) <= Sum(L0 + L1 + ...)
+ *			   <= Length(inactive_list) = f(thrashing-threshold)
+ *
+ * So the count of countinuous history pages left in the inactive_list is always
+ * a lower estimation of the true thrashing-threshold.
+ */
+
+/*
+ * STATUS   REFERENCE COUNT      TYPE
+ *  A__                   0      not in inactive list
+ *  ___                   0      fresh
+ *  __R       PAGE_REFCNT_1      stale
+ *  _a_       PAGE_REFCNT_2      disturbed once
+ *  _aR       PAGE_REFCNT_3      disturbed twice
+ *
+ *  A/a/R: Active / aCTIVATE / Referenced
+ */
+static inline unsigned long cold_page_refcnt(struct page *page)
+{
+	if (!page || PageActive(page))
+		return 0;
+
+	return page_refcnt(page);
+}
+
+static inline char page_refcnt_symbol(struct page *page)
+{
+	if (!page)
+		return 'X';
+	if (PageActive(page))
+		return 'A';
+	switch (page_refcnt(page)) {
+		case 0:
+			return '_';
+		case PAGE_REFCNT_1:
+			return '-';
+		case PAGE_REFCNT_2:
+			return '=';
+		case PAGE_REFCNT_3:
+			return '#';
+	}
+	return '?';
+}
+
+/*
+ * Count/estimate cache hits in range [first_index, last_index].
+ * The estimation is simple and optimistic.
+ */
+static int count_cache_hit(struct address_space *mapping,
+				pgoff_t first_index, pgoff_t last_index)
+{
+	struct page *page;
+	int size = last_index - first_index + 1;
+	int count = 0;
+	int i;
+
+	read_lock_irq(&mapping->tree_lock);
+
+	/*
+	 * The first page may well is chunk head and has been accessed,
+	 * so it is index 0 that makes the estimation optimistic. This
+	 * behavior guarantees a readahead when (size < ra_max) and
+	 * (readahead_hit_rate >= 16).
+	 */
+	for (i = 0; i < 16;) {
+		page = __find_page(mapping, first_index +
+						size * ((i++ * 29) & 15) / 16);
+		if (cold_page_refcnt(page) >= PAGE_REFCNT_1 && ++count >= 2)
+			break;
+	}
+
+	read_unlock_irq(&mapping->tree_lock);
+
+	return size * count / i;
+}
+
+/*
+ * Look back and check history pages to estimate thrashing-threshold.
+ */
+static int query_page_cache(struct address_space *mapping,
+			struct file_ra_state *ra,
+			unsigned long *remain, pgoff_t offset,
+			unsigned long ra_min, unsigned long ra_max)
+{
+	int count;
+	pgoff_t index;
+	unsigned long nr_lookback;
+	struct radix_tree_cache cache;
+
+	/*
+	 * Scan backward and check the near @ra_max pages.
+	 * The count here determines ra_size.
+	 */
+	read_lock_irq(&mapping->tree_lock);
+	index = radix_tree_lookup_head(&mapping->page_tree, offset, ra_max);
+	read_unlock_irq(&mapping->tree_lock);
+#ifdef DEBUG_READAHEAD_RADIXTREE
+	if (index <= offset) {
+		WARN_ON(!find_page(mapping, index));
+		if (index + ra_max > offset)
+			WARN_ON(find_page(mapping, index - 1));
+	} else {
+		BUG_ON(index > offset + 1);
+		WARN_ON(find_page(mapping, offset));
+	}
+#endif
+
+	*remain = offset - index + 1;
+
+	if (unlikely(*remain <= ra_min))
+		return ra_min;
+
+	if (!index)
+		return *remain;
+
+	if (offset + 1 == ra->readahead_index && ra_cache_hit_ok(ra))
+		count = *remain;
+	else if (count_cache_hit(mapping, index, offset) *
+						readahead_hit_rate >= *remain)
+		count = *remain;
+	else
+		return ra_min;
+
+	if (count < ra_max)
+		goto out;
+
+	/*
+	 * Check the far pages coarsely.
+	 * The big count here helps increase la_size.
+	 */
+	nr_lookback = ra_max * (LOOKAHEAD_RATIO + 1) *
+						100 / (readahead_ratio + 1);
+	if (nr_lookback > offset)
+		nr_lookback = offset;
+
+	radix_tree_cache_init(&cache);
+	read_lock_irq(&mapping->tree_lock);
+	for (count += ra_max; count < nr_lookback; count += ra_max) {
+		struct radix_tree_node *node;
+		node = radix_tree_cache_lookup_node(&mapping->page_tree,
+						&cache, offset - count, 1);
+		if (!node)
+			break;
+#ifdef DEBUG_READAHEAD_RADIXTREE
+		if (node != radix_tree_lookup_node(&mapping->page_tree,
+							offset - count, 1)) {
+			read_unlock_irq(&mapping->tree_lock);
+			printk(KERN_ERR "check radix_tree_cache_lookup_node!\n");
+			return 1;
+		}
+#endif
+	}
+	read_unlock_irq(&mapping->tree_lock);
+
+	/*
+	 *  For sequential read that extends from index 0, the counted value
+	 *  may well be far under the true threshold, so return it unmodified
+	 *  for further process in adjust_rala_accelerated().
+	 */
+	if (count >= offset)
+		return offset + 1;
+
+out:
+	count = count * readahead_ratio / 100;
+	return count;
+}
+
+/*
+ * Scan backward in the file for the first non-present page.
+ */
+static inline pgoff_t first_absent_page_bw(struct address_space *mapping,
+					pgoff_t index, unsigned long max_scan)
+{
+	struct radix_tree_cache cache;
+	struct page *page;
+	pgoff_t origin;
+
+	origin = index;
+	if (max_scan > index)
+		max_scan = index;
+
+	radix_tree_cache_init(&cache);
+	read_lock_irq(&mapping->tree_lock);
+	for (; origin - index <= max_scan;) {
+		page = radix_tree_cache_lookup(&mapping->page_tree,
+							&cache, --index);
+		if (page) {
+			index++;
+			break;
+		}
+	}
+	read_unlock_irq(&mapping->tree_lock);
+
+	return index;
+}
+
+/*
+ * Scan forward in the file for the first non-present page.
+ */
+static inline pgoff_t first_absent_page(struct address_space *mapping,
+					pgoff_t index, unsigned long max_scan)
+{
+	pgoff_t ra_index;
+
+	read_lock_irq(&mapping->tree_lock);
+	ra_index = radix_tree_lookup_tail(&mapping->page_tree,
+					index + 1, max_scan);
+	read_unlock_irq(&mapping->tree_lock);
+
+#ifdef DEBUG_READAHEAD_RADIXTREE
+	BUG_ON(ra_index <= index);
+	if (index + max_scan > index) {
+		if (ra_index <= index + max_scan)
+			WARN_ON(find_page(mapping, ra_index));
+		WARN_ON(!find_page(mapping, ra_index - 1));
+	}
+#endif
+
+	if (ra_index <= index + max_scan)
+		return ra_index;
+	else
+		return 0;
+}
+
+/*
+ * Determine the request parameters for context based read-ahead that extends
+ * from start of file.
+ *
+ * The major weakness of stateless method is perhaps the slow grow up speed of
+ * ra_size. The logic tries to make up for this in the important case of
+ * sequential reads that extend from start of file. In this case, the ra_size
+ * is not choosed to make the whole next chunk safe(as in normal ones). Only
+ * half of which is safe. The added 'unsafe' half is the look-ahead part. It
+ * is expected to be safeguarded by rescue_pages() when the previous chunks are
+ * lost.
+ */
+static inline int adjust_rala_accelerated(unsigned long ra_max,
+				unsigned long *ra_size, unsigned long *la_size)
+{
+	pgoff_t index = *ra_size;
+
+	*ra_size -= min(*ra_size, *la_size);
+	*ra_size = *ra_size * readahead_ratio / 100;
+	*la_size = index * readahead_ratio / 100;
+	*ra_size += *la_size;
+
+	if (*ra_size > ra_max)
+		*ra_size = ra_max;
+	if (*la_size > *ra_size)
+		*la_size = *ra_size;
+
+	return 1;
+}
+
+/*
+ * Main function for page context based read-ahead.
+ */
+static inline int
+try_context_based_readahead(struct address_space *mapping,
+			struct file_ra_state *ra,
+			struct page *prev_page, struct page *page,
+			pgoff_t index, unsigned long ra_size,
+			unsigned long ra_min, unsigned long ra_max)
+{
+	pgoff_t ra_index;
+	unsigned long la_size;
+	unsigned long remain_pages;
+
+	/* Where to start read-ahead?
+	 * NFSv3 daemons may process adjecent requests in parallel,
+	 * leading to many locally disordered, globally sequential reads.
+	 * So do not require nearby history pages to be present or accessed.
+	 */
+	if (page) {
+		ra_index = first_absent_page(mapping, index, ra_max * 5 / 4);
+		if (unlikely(!ra_index))
+			return -1;
+	} else if (!prev_page) {
+		ra_index = first_absent_page_bw(mapping, index,
+						readahead_hit_rate + ra_min);
+		if (index - ra_index > readahead_hit_rate + ra_min)
+			return 0;
+		ra_min += 2 * (index - ra_index);
+		index = ra_index;
+	} else {
+		ra_index = index;
+		if (ra_has_index(ra, index))
+			ra_account(ra, RA_EVENT_READAHEAD_MUTILATE,
+						ra->readahead_index - index);
+	}
+
+	ra_size = query_page_cache(mapping, ra, &remain_pages,
+						index - 1, ra_min, ra_max);
+
+	la_size = ra_index - index;
+	if (readahead_ratio < VM_READAHEAD_PROTECT_RATIO &&
+			remain_pages <= la_size && la_size > 1) {
+		rescue_pages(page, la_size);
+		return -1;
+	}
+
+	if (ra_size == index) {
+		if (!adjust_rala_accelerated(ra_max, &ra_size, &la_size))
+			return -1;
+		set_ra_class(ra, RA_CLASS_CONTEXT_ACCELERATED);
+	} else {
+		if (!adjust_rala(ra_max, &ra_size, &la_size))
+			return -1;
+		set_ra_class(ra, RA_CLASS_CONTEXT);
+	}
+
+	ra_state_init(ra, index, ra_index);
+	ra_state_update(ra, ra_size, la_size);
+
+	return 1;
+}
+
 
 /*
  * ra_size is mainly determined by:

--

^ permalink raw reply	[flat|nested] 32+ messages in thread

* [PATCH 12/19] readahead: other methods
  2005-11-25 15:12 [PATCH 00/19] Adaptive read-ahead V8 Wu Fengguang
                   ` (10 preceding siblings ...)
  2005-11-25 15:12 ` [PATCH 11/19] readahead: context " Wu Fengguang
@ 2005-11-25 15:12 ` Wu Fengguang
  2005-11-25 15:12 ` [PATCH 13/19] readahead: detect and rescue live pages Wu Fengguang
                   ` (7 subsequent siblings)
  19 siblings, 0 replies; 32+ messages in thread
From: Wu Fengguang @ 2005-11-25 15:12 UTC (permalink / raw)
  To: linux-kernel; +Cc: Andrew Morton, Wu Fengguang

[-- Attachment #1: readahead-method-others.patch --]
[-- Type: text/plain, Size: 3633 bytes --]

Various read-ahead strategies for:
	- fresh read from start of file
	- backward prefetching
	- seek and read one record pattern(db workload)
	- quick recover from thrashing

Signed-off-by: Wu Fengguang <wfg@mail.ustc.edu.cn>
---

 mm/readahead.c |  111 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 files changed, 111 insertions(+)

--- linux-2.6.15-rc1-mm2.orig/mm/readahead.c
+++ linux-2.6.15-rc1-mm2/mm/readahead.c
@@ -1505,6 +1505,117 @@ try_context_based_readahead(struct addre
 	return 1;
 }
 
+/*
+ * Read-ahead on start of file.
+ *
+ * It is most important for small files.
+ * 1. Set a moderate large read-ahead size;
+ * 2. Issue the next read-ahead request as soon as possible.
+ *
+ * But be careful, there are some applications that dip into only the very head
+ * of a file. The most important thing is to prevent them from triggering the
+ * next (much larger) read-ahead request, which leads to lots of cache misses.
+ * Two pages should be enough for them, correct me if I'm wrong.
+ */
+static inline unsigned long
+newfile_readahead(struct address_space *mapping,
+		struct file *filp, struct file_ra_state *ra,
+		unsigned long req_size, unsigned long ra_min)
+{
+	unsigned long ra_size;
+	unsigned long la_size;
+
+	if (req_size > ra_min)
+		req_size = ra_min;
+
+	ra_size = 4 * req_size;
+	la_size = 2 * req_size;
+
+	set_ra_class(ra, RA_CLASS_NEWFILE);
+	ra_state_init(ra, 0, 0);
+	ra_state_update(ra, ra_size, la_size);
+
+	return ra_dispatch(ra, mapping, filp);
+}
+
+/*
+ * Backward prefetching.
+ * No look ahead and thrashing threshold estimation for stepping backward
+ * pattern: should be unnecessary.
+ */
+static inline int
+try_read_backward(struct file_ra_state *ra,
+			pgoff_t begin_index, pgoff_t end_index,
+			unsigned long ra_size,
+			unsigned long ra_min, unsigned long ra_max)
+{
+	if (ra_size > ra_max || end_index > ra->prev_page)
+		return 0;
+
+	if (ra_has_index(ra, ra->prev_page)) {
+		if (end_index > ra->la_index)
+			return 0;
+		ra_size += 2 * ra_cache_hit(ra, 0);
+		end_index = ra->la_index;
+	} else {
+		ra_size += readahead_hit_rate + ra_min;
+		end_index = ra->prev_page;
+	}
+
+	if (ra_size > ra_max)
+		ra_size = ra_max;
+
+	if (end_index > begin_index + ra_size)
+		return 0;
+
+	begin_index = end_index - ra_size;
+
+	set_ra_class(ra, RA_CLASS_BACKWARD);
+	ra_state_init(ra, begin_index, begin_index);
+	ra_state_update(ra, ra_size, 0);
+
+	return 1;
+}
+
+/*
+ * If there is a previous sequential read, it is likely to be another
+ * sequential read at the new position.
+ * Databases are known to have this seek-and-read-one-record pattern.
+ */
+static inline int
+try_random_readahead(struct file_ra_state *ra, pgoff_t index,
+			unsigned long ra_size, unsigned long ra_max)
+{
+	unsigned long hit0 = ra_cache_hit(ra, 0);
+	unsigned long hit1 = ra_cache_hit(ra, 1) + hit0;
+	unsigned long hit2 = ra_cache_hit(ra, 2);
+	unsigned long hit3 = ra_cache_hit(ra, 3);
+
+	if (!ra_has_index(ra, ra->prev_page))
+		return 0;
+
+	if (index == ra->prev_page + 1) {    /* read after thrashing */
+		ra_size = hit0;
+		set_ra_class(ra, RA_CLASS_RANDOM_THRASHING);
+		ra_account(ra, RA_EVENT_READAHEAD_THRASHING,
+						ra->readahead_index - index);
+	} else if (ra_size < hit1 &&         /* read after seeking   */
+			hit1 > hit2 / 2 &&
+			hit2 > hit3 / 2 &&
+			hit3 > hit1 / 2) {
+		ra_size = max(hit1, hit2);
+		set_ra_class(ra, RA_CLASS_RANDOM_SEEK);
+	} else
+		return 0;
+
+	if (ra_size > ra_max)
+		ra_size = ra_max;
+
+	ra_state_init(ra, index, index);
+	ra_state_update(ra, ra_size, 0);
+
+	return 1;
+}
 
 /*
  * ra_size is mainly determined by:

--

^ permalink raw reply	[flat|nested] 32+ messages in thread

* [PATCH 13/19] readahead: detect and rescue live pages
  2005-11-25 15:12 [PATCH 00/19] Adaptive read-ahead V8 Wu Fengguang
                   ` (11 preceding siblings ...)
  2005-11-25 15:12 ` [PATCH 12/19] readahead: other methods Wu Fengguang
@ 2005-11-25 15:12 ` Wu Fengguang
  2005-11-25 15:12 ` [PATCH 14/19] readahead: events accounting Wu Fengguang
                   ` (6 subsequent siblings)
  19 siblings, 0 replies; 32+ messages in thread
From: Wu Fengguang @ 2005-11-25 15:12 UTC (permalink / raw)
  To: linux-kernel; +Cc: Andrew Morton, Wu Fengguang

[-- Attachment #1: readahead-protect-live-pages.patch --]
[-- Type: text/plain, Size: 13422 bytes --]

It tries to identify and protect live pages (sequential pages that are going
to be accessed in the near future) on vmscan time. Almost all live pages can
be sorted out and saved.

The cost: dead pages that won't be read will be kept in inactive_list for
another round. Hopefully there won't be much, for file servers normally have
read-ahead hit rate as high as 99%.

This feature is greatly demanded by file servers, though should be useless
for desktop. It saves the case when one fast reader flushes the cache and
strips the read-ahead size of slow readers, or the case where caching is just
a wasting of memory. Then you can raise readahead_ratio to 200 or more to
enforce a larger read-ahead size and strip the useless cached data out of lru.

Its use is currently limited, for the context based method is not ready for
the case. This problem will be solved when the timing info of evicted pages
are available. It can also be extended to further benefit large memory
systems.  I'll leave them as future work.

Signed-off-by: Wu Fengguang <wfg@mail.ustc.edu.cn>
---

 include/linux/mm.h         |    2 
 include/linux/page-flags.h |    2 
 mm/page_alloc.c            |    2 
 mm/readahead.c             |  318 ++++++++++++++++++++++++++++++++++++++++++++-
 mm/vmscan.c                |   10 +
 5 files changed, 333 insertions(+), 1 deletion(-)

--- linux-2.6.15-rc2-mm1.orig/include/linux/page-flags.h
+++ linux-2.6.15-rc2-mm1/include/linux/page-flags.h
@@ -107,6 +107,8 @@ struct page_state {
 	unsigned long pgfree;		/* page freeings */
 	unsigned long pgactivate;	/* pages moved inactive->active */
 	unsigned long pgdeactivate;	/* pages moved active->inactive */
+	unsigned long pgkeephot;	/* pages sent back to active */
+	unsigned long pgkeepcold;	/* pages sent back to inactive */
 
 	unsigned long pgfault;		/* faults (major+minor) */
 	unsigned long pgmajfault;	/* faults (major only) */
--- linux-2.6.15-rc2-mm1.orig/include/linux/mm.h
+++ linux-2.6.15-rc2-mm1/include/linux/mm.h
@@ -995,6 +995,8 @@ page_cache_readahead_adaptive(struct add
 			pgoff_t first_index,
 			pgoff_t index, pgoff_t last_index);
 void fastcall ra_access(struct file_ra_state *ra, struct page *page);
+int rescue_ra_pages(struct list_head *page_list, struct list_head *save_list);
+
 
 #ifdef CONFIG_DEBUG_FS
 extern u32 readahead_debug_level;
--- linux-2.6.15-rc2-mm1.orig/mm/page_alloc.c
+++ linux-2.6.15-rc2-mm1/mm/page_alloc.c
@@ -2351,6 +2351,8 @@ static char *vmstat_text[] = {
 	"pgfree",
 	"pgactivate",
 	"pgdeactivate",
+	"pgkeephot",
+	"pgkeepcold",
 
 	"pgfault",
 	"pgmajfault",
--- linux-2.6.15-rc2-mm1.orig/mm/vmscan.c
+++ linux-2.6.15-rc2-mm1/mm/vmscan.c
@@ -501,6 +501,8 @@ cannot_free:
 	return 0;
 }
 
+extern int readahead_live_chunk;
+
 /*
  * shrink_list adds the number of reclaimed pages to sc->nr_reclaimed
  */
@@ -509,10 +511,15 @@ static int shrink_list(struct list_head 
 	LIST_HEAD(ret_pages);
 	struct pagevec freed_pvec;
 	int pgactivate = 0;
+	int pgkeep = 0;
 	int reclaimed = 0;
 
 	cond_resched();
 
+	if (readahead_ratio >= VM_READAHEAD_PROTECT_RATIO &&
+							readahead_live_chunk)
+		pgkeep += rescue_ra_pages(page_list, &ret_pages);
+
 	pagevec_init(&freed_pvec, 1);
 	while (!list_empty(page_list)) {
 		struct address_space *mapping;
@@ -661,11 +668,13 @@ keep_locked:
 keep:
 		list_add(&page->lru, &ret_pages);
 		BUG_ON(PageLRU(page));
+		pgkeep++;
 	}
 	list_splice(&ret_pages, page_list);
 	if (pagevec_count(&freed_pvec))
 		__pagevec_release_nonlru(&freed_pvec);
 	mod_page_state(pgactivate, pgactivate);
+	mod_page_state(pgkeepcold, pgkeep - pgactivate);
 	sc->nr_reclaimed += reclaimed;
 	return reclaimed;
 }
@@ -1162,6 +1171,7 @@ refill_inactive_zone(struct zone *zone, 
 
 	mod_page_state_zone(zone, pgrefill, pgscanned);
 	mod_page_state(pgdeactivate, pgdeactivate);
+	mod_page_state(pgkeephot, pgmoved);
 }
 
 /*
--- linux-2.6.15-rc2-mm1.orig/mm/readahead.c
+++ linux-2.6.15-rc2-mm1/mm/readahead.c
@@ -400,7 +400,7 @@ __do_page_cache_readahead(struct address
 	read_lock_irq(&mapping->tree_lock);
 	for (page_idx = 0; page_idx < nr_to_read; page_idx++) {
 		pgoff_t page_offset = offset + page_idx;
-		
+
 		if (page_offset > end_index)
 			break;
 
@@ -1770,3 +1770,319 @@ void fastcall ra_access(struct file_ra_s
 		ra_account(ra, RA_EVENT_READAHEAD_HIT, -1);
 }
 
+/*
+ * Detect and protect live read-ahead pages.
+ *
+ * This function provides safty guarantee for file servers with big
+ * readahead_ratio(>=VM_READAHEAD_PROTECT_RATIO) set.  The goal is to save all
+ * and only the sequential pages that are to be accessed in the near future.
+ *
+ * This function is called when pages in @page_list are to be freed,
+ * it protects live read-ahead pages by moving them into @save_list.
+ *
+ * The general idea is to classify pages of a file into groups of sequential
+ * accessed pages. Dead sequential pages are left over, live sequential pages
+ * are saved.
+ *
+ * Live read-ahead pages are defined as sequential pages that have reading in
+ * progress. They are detected by reference count pattern of:
+ *
+ *                        live head       live pages
+ *  ra pages group -->   ------------___________________
+ *                                   [  pages to save  ] (*)
+ *
+ * (*) for now, an extra page from the live head may also be saved.
+ *
+ * In pratical, the group of pages are fragmented into chunks. To tell whether
+ * pages inside a chunk are alive, we must check:
+ * 1) Are there any live heads inside the chunk?
+ * 2) Are there any live heads in the group before the chunk?
+ * 3) Sepcial case: live head just sits on the boundary of current chunk?
+ *
+ * The detailed rules employed must ensure:
+ * - no page is pinned in inactive_list.
+ * - no excessive pages are saved.
+ *
+ * A picture of common cases:
+ *             back search            chunk             case
+ *           -----___________|[____________________]    Normal
+ *           ----------------|----[________________]    Normal
+ *                           |----[________________]    Normal
+ *           ----------------|----------------------    Normal
+ *                           |----------------------    Normal
+ *           ________________|______________________    ra miss
+ *                           |______________________    ra miss
+ *           ________________|_______--------[_____]    two readers
+ *           ----____________|[______--------______]    two readers
+ *                           |_______--------[_____]    two readers
+ *                           |----[____------______]    two readers
+ *           ----------------|----[____------______]    two readers
+ *           _______---------|---------------[_____]    two readers
+ *           ----___---------|[--------------______]    two readers
+ *           ________________|---------------[_____]    two readers
+ *           ----____________|[--------------______]    two readers
+ *           ====------------|[---_________________]    two readers
+ *                           |====[----------______]    two readers
+ *                           |###======[-----------]    three readers
+ */
+static int save_chunk(struct page *head, struct page *live_head,
+			struct page *tail, struct list_head *save_list)
+{
+	struct page *page;
+	struct address_space *mapping;
+	struct radix_tree_cache cache;
+	int i;
+	pgoff_t index;
+	pgoff_t head_index;
+	unsigned long refcnt;
+
+#ifdef DEBUG_READAHEAD
+	static char static_buf[PAGE_SIZE];
+	static char *zone_names[] = {"DMA", "DMA32", "Normal", "HighMem"};
+	char *pat = static_buf;
+	int pidx = 0;
+#define	log_symbol(symbol)						\
+	do { 								\
+		if (READAHEAD_DEBUG_LEVEL(3) && pidx < PAGE_SIZE - 1)	\
+			pat[pidx++] = symbol; 				\
+	} while (0)
+
+	if (READAHEAD_DEBUG_LEVEL(3)) {
+		pat = (char *)get_zeroed_page(GFP_KERNEL);
+		if (!pat)
+			pat = static_buf;
+	}
+#else
+#define log_symbol(symbol) do {} while (0)
+#endif
+#define	log_page(page)	log_symbol(page_refcnt_symbol(page))
+
+	head_index = head->index;
+	mapping = head->mapping;
+	radix_tree_cache_init(&cache);
+
+	BUG_ON(!mapping); /* QUESTION: in what case mapping will be NULL ? */
+	read_lock_irq(&mapping->tree_lock);
+
+	/*
+	 * Common case test.
+	 * Does the far end indicates a leading live head?
+	 */
+	index = radix_tree_lookup_head(&mapping->page_tree,
+					head_index, readahead_live_chunk);
+	if (index >= head_index)
+		goto skip_scan_locked;
+
+	page = __find_page(mapping, index);
+	BUG_ON(!page);
+	log_symbol('|');
+	log_page(page);
+	refcnt = cold_page_refcnt(page);
+	if (head_index - index < readahead_live_chunk &&
+			refcnt > page_refcnt(head)) {
+		live_head = head;
+		goto skip_scan_locked;
+	}
+
+	/*
+	 * The slow path.
+	 * Scan page by page to see if the whole chunk should be saved.
+	 */
+	if (next_page(head) != tail)
+		head_index = next_page(head)->index;
+	else
+		head_index++;
+	for (i = 0, index++; index <= head_index; index++) {
+		page = radix_tree_cache_lookup(&mapping->page_tree, &cache,
+									index);
+		if (index == head->index)
+			log_symbol('|');
+		log_page(page);
+
+		if (!page) {
+			WARN_ON(index < head->index);
+			break;
+		}
+
+		if (refcnt == page_refcnt(page))
+			i++;
+		else if (refcnt < page_refcnt(page))
+			i = 0;
+		else if (i < 1)
+			i = INT_MIN;
+		else {
+			live_head = head;
+			break;
+		}
+
+		refcnt = page_refcnt(page);
+	}
+
+skip_scan_locked:
+#ifdef DEBUG_READAHEAD
+	if (index < head->index)
+		log_symbol('*');
+	index = prev_page(tail)->index;
+
+	log_symbol('|');
+	for (page = head; page != tail; page = next_page(page)) {
+		BUG_ON(PageAnon(page));
+		BUG_ON(PageSwapCache(page));
+		/* BUG_ON(page_mapped(page)); */
+
+		if (page == live_head)
+			log_symbol('[');
+		log_page(page);
+	}
+	if (live_head)
+		log_symbol(']');
+#endif
+
+	/*
+	 * Special case work around.
+	 *
+	 * Save one extra page if it is a live head of the following chunk.
+	 * Just to be safe.  It protects the rare situation when the reader
+	 * is just crossing the chunk boundary, and the following chunk is not
+	 * far away from tail of inactive_list.
+	 *
+	 * The special case is awkwardly delt with for now. They will be all set
+	 * when the timing information of recently evicted pages are available.
+	 * Dead pages can also be purged earlier with the timing info.
+	 */
+	if (live_head != head) {
+		struct page *last_page = prev_page(tail);
+		page = radix_tree_cache_lookup(&mapping->page_tree, &cache,
+						last_page->index + 1);
+		log_symbol('|');
+		log_page(page);
+		if (page && !live_head) {
+			refcnt = page_refcnt(last_page);
+			if (page_refcnt(page) >= refcnt)
+				page = radix_tree_cache_lookup(
+						&mapping->page_tree, &cache,
+						last_page->index + 2);
+			log_page(page);
+			if (page && page_refcnt(page) < refcnt) {
+				live_head = last_page;
+				log_symbol('I');
+			}
+		} else if (!page && live_head) {
+			live_head = next_page(live_head);
+				log_symbol('D');
+		}
+	}
+	log_symbol('\0');
+
+	read_unlock_irq(&mapping->tree_lock);
+
+	/*
+	 * Now save the alive pages.
+	 */
+	i = 0;
+	if (live_head) {
+		for (; live_head != tail;) { /* never dereference tail! */
+			page = next_page(live_head);
+			if (!PageActivate(live_head)) {
+				list_move(&live_head->lru, save_list);
+				i++;
+				if (!page_refcnt(live_head))
+					inc_readahead_aging();
+			}
+			live_head = page;
+		}
+
+		if (i)
+			ra_account(0, RA_EVENT_READAHEAD_RESCUE, i);
+	}
+
+#ifdef DEBUG_READAHEAD
+	if (READAHEAD_DEBUG_LEVEL(3)) {
+		ddprintk("save_chunk(ino=%lu, idx=%lu-%lu, %s@%s:%s)"
+				" = %d\n",
+				mapping->host->i_ino,
+				head->index, index,
+				mapping_mapped(mapping) ? "mmap" : "file",
+				zone_names[page_zonenum(head)], pat, i);
+		if (pat != static_buf)
+			free_page((unsigned long)pat);
+	}
+#endif
+
+	return i;
+}
+
+int rescue_ra_pages(struct list_head *page_list, struct list_head *save_list)
+{
+	struct address_space *mapping;
+	struct page *chunk_head;
+	struct page *live_head;
+	struct page *page;
+	unsigned long refcnt;
+	unsigned long min_refcnt;
+	int n;
+	int ret = 0;
+
+	page = list_to_page(page_list);
+
+next_chunk:
+	chunk_head = page;
+	live_head = NULL;
+	mapping = page->mapping;
+	n = 0;
+	min_refcnt = LONG_MAX;
+
+next_head_page:
+	refcnt = page_refcnt(page);
+	if (min_refcnt > refcnt)
+		min_refcnt = refcnt;
+	page = next_page(page);
+
+	if (mapping != page->mapping || &page->lru == page_list)
+		goto save_chunk;
+
+	/* At least 2 pages followed by a fall in refcnt makes a live head:
+	 *               --_
+	 *                ^ live_head
+	 */
+	if (refcnt == page_refcnt(page))
+		n++;
+	else if (refcnt < page_refcnt(page))
+		n = 0;
+	else if (n < 1)
+		n = INT_MIN;
+	else
+		goto got_live_head;
+
+	goto next_head_page;
+
+got_live_head:
+	n = 0;
+	live_head = prev_page(page);
+
+next_page:
+	if (refcnt < page_refcnt(page)) /* limit the number of rises */
+		n++;
+	refcnt = page_refcnt(page);
+	if (min_refcnt > refcnt)
+		min_refcnt = refcnt;
+	page = next_page(page);
+
+	if (mapping != page->mapping || &page->lru == page_list)
+		goto save_chunk;
+
+	goto next_page;
+
+save_chunk:
+	if (mapping && !PageAnon(chunk_head) &&
+			!PageSwapCache(chunk_head) &&
+			/* !page_mapped(chunk_head) && */
+			min_refcnt < PAGE_REFCNT_2 &&
+			n <= 3)
+		ret += save_chunk(chunk_head, live_head, page, save_list);
+
+	if (&page->lru != page_list)
+		goto next_chunk;
+
+	return ret;
+}

--

^ permalink raw reply	[flat|nested] 32+ messages in thread

* [PATCH 14/19] readahead: events accounting
  2005-11-25 15:12 [PATCH 00/19] Adaptive read-ahead V8 Wu Fengguang
                   ` (12 preceding siblings ...)
  2005-11-25 15:12 ` [PATCH 13/19] readahead: detect and rescue live pages Wu Fengguang
@ 2005-11-25 15:12 ` Wu Fengguang
  2005-11-25 15:12 ` [PATCH 15/19] readahead: page aging accounting Wu Fengguang
                   ` (5 subsequent siblings)
  19 siblings, 0 replies; 32+ messages in thread
From: Wu Fengguang @ 2005-11-25 15:12 UTC (permalink / raw)
  To: linux-kernel; +Cc: Andrew Morton, J?rn Engel, Ingo Oeser

[-- Attachment #1: readahead-account-events.patch --]
[-- Type: text/plain, Size: 10352 bytes --]

A debugfs file named `readahead/events' is created according to advices from
J?rn Engel, Andrew Morton and Ingo Oeser. It yields to much better
readability than the preious /proc/vmstat interface :)

It reveals various read-ahead activities/events, and is vital to the testing.

-------------------------------------------------------------------------
If you are experiencing performance problems, or want to help improve the
read-ahead logic, please send me the debug data. Thanks.

- Preparations

## Compile with CONFIG_DEBUG_FS('Kernel hacking  --->  Debug Filesystem')
mkdir /debug
mount -t debug none /debug

- For each session with distinct access pattern

echo > /debug/readahead/events # reset the counters
# echo > /var/log/kern.log # you may want to backup it first
# echo 5 > /debug/readahead/debug_level # show verbose printk traces
## do one benchmark/task
# echo 0 > /debug/readahead/debug_level # revert to normal value
cp /debug/readahead/events readahead-events-`date +'%F_%R'`
# bzip2 -c /var/log/kern.log > kern.log-`date +'%F_%R'`.bz2

The commented out commands can uncover the very detailed file accesses,
which are useful sometimes. But the log file will be rather huge.

-------------------------------------------------------------------------
This is a trimmed down output on my PC:
# cat /debug/readahead/events
[table requests]      total    newfile      state    context      none
cache_miss              403         56         12         69       263
read_random             260         37          5         17       201
io_congestion             0          0          0          0         0
io_cache_hit             85          0         24         46         0
io_block               9796       5613        822        143      3203
readahead              5956       5418        383        139         0
lookahead               961        650        212         98         0
lookahead_hit           449        181        164         58        41
lookahead_ignore          0          0          0          0         0
readahead_eof          4981       4768        171         28         0
readahead_shrink          0          0          0          0         0
readahead_thrash          0          0          0          0         0
readahead_mutilt          0          0          0          0         0
readahead_rescue         45          0          0          0        45

[table pages]         total    newfile      state    context      none
cache_miss             5590         72       2506        181      2826
read_random             265         37          5         17       206
io_congestion             0          0          0          0         0
io_cache_hit           2440          0       1054       1366         0
io_block             165848      11117     147794       3668      3203
readahead             43080      11360      28949       2621         0
readahead_hit         38251      10716      25669       1818         9
lookahead             24013       1718      21641        647         0
lookahead_hit         20161        770      18679        712         0
lookahead_ignore          0          0          0          0         0
readahead_eof         15961       7924       7440        461         0
readahead_shrink          0          0          0          0         0
readahead_thrash          0          0          0          0         0
readahead_mutilt          0          0          0          0         0
readahead_rescue        240          0          0          0       240

[table summary]       total    newfile      state    context      none
random_rate              4%         0%         1%        10%       99%
ra_hit_rate             88%        94%        88%        69%      900%
la_hit_rate             46%        27%        76%        58%     4100%
avg_ra_size               7          2         75         19         0
avg_la_size              25          3        102          7         0

Signed-off-by: Wu Fengguang <wfg@mail.ustc.edu.cn>
---

 mm/readahead.c |  194 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++-
 1 files changed, 192 insertions(+), 2 deletions(-)

--- linux-2.6.15-rc2-mm1.orig/mm/readahead.c
+++ linux-2.6.15-rc2-mm1/mm/readahead.c
@@ -105,6 +105,9 @@ enum ra_event {
 u32 readahead_debug_level = 0;
 u32 debug_disable_stateful_method = 0;
 
+static struct dentry *readahead_events_dentry;
+extern struct file_operations ra_debug_fops;
+
 static int __init readahead_init(void)
 {
 	struct dentry *root;
@@ -115,6 +118,9 @@ static int __init readahead_init(void)
 	debugfs_create_bool("disable_stateful_method", 0644, root,
 						&debug_disable_stateful_method);
 
+	readahead_events_dentry = debugfs_create_file("events",
+					0644, root, NULL, &ra_debug_fops);
+
 	return 0;
 }
 
@@ -132,6 +138,178 @@ static inline int disable_stateful_metho
 	return 0;
 }
 
+#endif
+
+/*
+ * Read-ahead events accounting.
+ */
+#ifdef DEBUG_READAHEAD
+
+static char *ra_class_name[] = {
+	"total",
+	"newfile",
+	"state",
+	"context",
+	"contexta",
+	"backward",
+	"onthrash",
+	"onraseek",
+	"none",
+};
+
+static char *ra_event_name[] = {
+	"cache_miss",
+	"read_random",
+	"io_congestion",
+	"io_cache_hit",
+	"io_block",
+	"readahead",
+	"readahead_hit",
+	"lookahead",
+	"lookahead_hit",
+	"lookahead_ignore",
+	"readahead_eof",
+	"readahead_shrink",
+	"readahead_thrash",
+	"readahead_mutilt",
+	"readahead_rescue",
+};
+
+static unsigned long ra_event_count[RA_CLASS_END+1][RA_EVENT_END][2];
+
+static inline void ra_account(struct file_ra_state *ra,
+				enum ra_event e, int pages)
+{
+	enum ra_class c;
+
+	c = (ra ? ra->flags & RA_CLASS_MASK : RA_CLASS_END);
+	if (e == RA_EVENT_READAHEAD_HIT && pages < 0) {
+		c = (ra->flags >> RA_CLASS_SHIFT) & RA_CLASS_MASK;
+		pages = -pages;
+	}
+	if (!c)
+		c = RA_CLASS_END;
+	BUG_ON(c > RA_CLASS_END);
+
+	ra_event_count[c][e][0] += 1;
+	ra_event_count[c][e][1] += pages;
+}
+
+static int ra_account_show(struct seq_file *s, void *_)
+{
+	int i;
+	int c;
+	int e;
+	static char event_fmt[] = "%-16s";
+	static char class_fmt[] = "%11s";
+	static char item_fmt[] = "%11lu";
+	static char percent_format[] = "%10lu%%";
+	static char *table_name[] = {
+		"[table requests]",
+		"[table pages]",
+		"[table summary]"};
+
+	for (i = 0; i <= 1; i++) {
+		for (e = 0; e < RA_EVENT_END; e++) {
+			ra_event_count[0][e][i] = 0;
+			for (c = 1; c <= RA_CLASS_END; c++)
+				ra_event_count[0][e][i] +=
+							ra_event_count[c][e][i];
+		}
+
+		seq_printf(s, event_fmt, table_name[i]);
+		for (c = 0; c <= RA_CLASS_END; c++)
+			seq_printf(s, class_fmt, ra_class_name[c]);
+		seq_puts(s, "\n");
+
+		for (e = 0; e < RA_EVENT_END; e++) {
+			if (e == RA_EVENT_READAHEAD_HIT && i == 0)
+				continue;
+
+			seq_printf(s, event_fmt, ra_event_name[e]);
+			for (c = 0; c <= RA_CLASS_END; c++)
+				seq_printf(s, item_fmt,
+						ra_event_count[c][e][i]);
+			seq_puts(s, "\n");
+		}
+		seq_puts(s, "\n");
+	}
+
+	seq_printf(s, event_fmt, table_name[2]);
+	for (c = 0; c <= RA_CLASS_END; c++)
+		seq_printf(s, class_fmt, ra_class_name[c]);
+	seq_puts(s, "\n");
+
+	seq_printf(s, event_fmt, "random_rate");
+	for (c = 0; c <= RA_CLASS_END; c++)
+		seq_printf(s, percent_format,
+			(ra_event_count[c][RA_EVENT_READRANDOM][0] * 100) /
+			(ra_event_count[c][RA_EVENT_READRANDOM][0] +
+			 ra_event_count[c][RA_EVENT_READAHEAD][0] + 1));
+	seq_puts(s, "\n");
+
+	seq_printf(s, event_fmt, "ra_hit_rate");
+	for (c = 0; c <= RA_CLASS_END; c++)
+		seq_printf(s, percent_format,
+			(ra_event_count[c][RA_EVENT_READAHEAD_HIT][1] * 100) /
+			(ra_event_count[c][RA_EVENT_READAHEAD][1] + 1));
+	seq_puts(s, "\n");
+
+	seq_printf(s, event_fmt, "la_hit_rate");
+	for (c = 0; c <= RA_CLASS_END; c++)
+		seq_printf(s, percent_format,
+			(ra_event_count[c][RA_EVENT_LOOKAHEAD_HIT][0] * 100) /
+			(ra_event_count[c][RA_EVENT_LOOKAHEAD][0] + 1));
+	seq_puts(s, "\n");
+
+	seq_printf(s, event_fmt, "avg_ra_size");
+	for (c = 0; c <= RA_CLASS_END; c++)
+		seq_printf(s, item_fmt,
+			(ra_event_count[c][RA_EVENT_READAHEAD][1] +
+			 ra_event_count[c][RA_EVENT_READAHEAD][0] / 2) /
+			(ra_event_count[c][RA_EVENT_READAHEAD][0] + 1));
+	seq_puts(s, "\n");
+
+	seq_printf(s, event_fmt, "avg_la_size");
+	for (c = 0; c <= RA_CLASS_END; c++)
+		seq_printf(s, item_fmt,
+			(ra_event_count[c][RA_EVENT_LOOKAHEAD][1] +
+			 ra_event_count[c][RA_EVENT_LOOKAHEAD][0] / 2) /
+			(ra_event_count[c][RA_EVENT_LOOKAHEAD][0] + 1));
+	seq_puts(s, "\n");
+
+	return 0;
+}
+
+static int ra_debug_open(struct inode *inode, struct file *file)
+{
+	return single_open(file, ra_account_show, NULL);
+}
+
+static ssize_t ra_debug_write(struct file *file, const char __user *buf,
+				size_t size, loff_t *offset)
+{
+	if (file->f_dentry == readahead_events_dentry)
+		memset(ra_event_count, 0, sizeof(ra_event_count));
+	return 1;
+}
+
+static struct file_operations ra_debug_fops = {
+	.owner		= THIS_MODULE,
+	.open		= ra_debug_open,
+	.write		= ra_debug_write,
+	.read		= seq_read,
+	.llseek		= seq_lseek,
+	.release	= single_release,
+};
+
+#else /* !DEBUG_READAHEAD */
+
+static inline void ra_account(struct file_ra_state *ra,
+				enum ra_event e, int pages)
+{
+}
+
 #endif /* DEBUG_READAHEAD */
 
 
@@ -1024,6 +1202,8 @@ static int ra_dispatch(struct file_ra_st
 		ra_account(ra, RA_EVENT_READAHEAD_EOF, actual);
 	if (la_size)
 		ra_account(ra, RA_EVENT_LOOKAHEAD, la_size);
+	if (ra_size > actual)
+		ra_account(ra, RA_EVENT_IO_CACHE_HIT, ra_size - actual);
 	ra_account(ra, RA_EVENT_READAHEAD, actual);
 
 	dprintk("readahead-%s(ino=%lu, index=%lu, ra=%lu+%lu-%lu) = %d\n",
@@ -1668,8 +1848,11 @@ page_cache_readahead_adaptive(struct add
 	if (page) {
 		if(!TestClearPageReadahead(page))
 			return 0;
-		if (bdi_read_congested(mapping->backing_dev_info))
+		if (bdi_read_congested(mapping->backing_dev_info)) {
+			ra_account(ra, RA_EVENT_IO_CONGESTION,
+							end_index - index);
 			return 0;
+		}
 	}
 
 	if (page)
@@ -1759,8 +1942,15 @@ void fastcall ra_access(struct file_ra_s
 			   (1 << PG_referenced)))
 		return;
 
-	if (!ra_has_index(ra, page->index))
+	if (ra_has_index(ra, page->index)) {
+		if (!PageUptodate(page))
+			ra_account(ra, RA_EVENT_IO_BLOCK,
+					ra->readahead_index - page->index);
+	} else {
+		if (!PageUptodate(page))
+			ra_account(0, RA_EVENT_IO_BLOCK, 1);
 		return;
+	}
 
 	ra->cache_hit++;
 

--

^ permalink raw reply	[flat|nested] 32+ messages in thread

* [PATCH 15/19] readahead: page aging accounting
  2005-11-25 15:12 [PATCH 00/19] Adaptive read-ahead V8 Wu Fengguang
                   ` (13 preceding siblings ...)
  2005-11-25 15:12 ` [PATCH 14/19] readahead: events accounting Wu Fengguang
@ 2005-11-25 15:12 ` Wu Fengguang
  2005-11-25 15:12 ` [PATCH 16/19] readahead: laptop mode support Wu Fengguang
                   ` (4 subsequent siblings)
  19 siblings, 0 replies; 32+ messages in thread
From: Wu Fengguang @ 2005-11-25 15:12 UTC (permalink / raw)
  To: linux-kernel; +Cc: Andrew Morton, Wu Fengguang

[-- Attachment #1: readahead-account-aging.patch --]
[-- Type: text/plain, Size: 7977 bytes --]

The accuracy of stateful thrashing-threshold estimation depends largely on the
measurement of cold page aging speed.

A file named `page_aging' is created in debugfs to monitor the trace of two
measurement variables: per-zone `aging_total' and per-cpu `readahead_aging'.
Their values and the jiffies are recorded each time one of them has a delta
of 1, 1/2, 1/4, 1/16, 1/256, 1/4096 (nr_inactive + nr_free).

Sample series of collected data shows that smooth_aging is more stable in
small sampling granularity:

  time         dt         page_aging8       smooth_aging8
872765         26     520056       33     653782      163
872791         12     520089      132     653945       51
872803          4     520221      132     653996       66
872807         17     520353      165     654062      107
872824         22     520518       99     654169       74
872846        372     520617       99     654243       78
873218        294     520716       99     654321       73
873512        196     520815       99     654394      130
873708         15     520914      231     654524       28
873723         15     521145      198     654552        9
873738        881     521343       99     654561      182
874619        700     521442        0     654743      198
875319        384     521442       66     654941      110
875703       2119     521508       99     655051     1632
877822       3960     521607        0     656683      980
881782        904     521607        0     657663      216

  time         dt         page_aging1       smooth_aging1
-90822      12418       5775    12999      33302    10767
-78404      17510      18774    10303      44069    10345
-60894      24757      29077     9871      54414    14615
-36137      19194      38948    10404      69029    13726
-16943      19636      49352    10440      82755    12865
  2693      16299      59792    12453      95620    10734
 18992      19851      72245    10073     106354    15960
 38843      16099      82318    10767     122314    14059
 54942      16094      93085    10041     136373    12117
 71036      19888     103126    12595     148490    16155
 90924      18452     115721     9782     164645    11705
109376      22395     125503    10214     176350    13679
131771      19310     135717    10759     190029    11843
151081      20793     146476    10699     201872    12595
171874      22308     157175    10321     214467    13157
194182      17954     167496    10773     227624    14803
212136      19946     178269    10554     242427    13391
232082      21051     188823    11179     255818    11783

Signed-off-by: Wu Fengguang <wfg@mail.ustc.edu.cn>
---

 mm/readahead.c |  155 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++-
 1 files changed, 154 insertions(+), 1 deletion(-)

--- linux-2.6.15-rc1-mm2.orig/mm/readahead.c
+++ linux-2.6.15-rc1-mm2/mm/readahead.c
@@ -108,6 +108,9 @@ u32 debug_disable_stateful_method;
 static struct dentry *readahead_events_dentry;
 extern struct file_operations ra_debug_fops;
 
+static struct dentry *page_aging_dentry;
+extern struct file_operations aginginfo_fops;
+
 static int __init readahead_init(void)
 {
 	struct dentry *root;
@@ -120,6 +123,8 @@ static int __init readahead_init(void)
 
 	readahead_events_dentry = debugfs_create_file("events",
 					0644, root, NULL, &ra_debug_fops);
+	page_aging_dentry = debugfs_create_file("page_aging",
+					0644, root, NULL, &aginginfo_fops);
 
 	return 0;
 }
@@ -281,9 +286,14 @@ static int ra_account_show(struct seq_fi
 	return 0;
 }
 
+extern struct seq_operations aginginfo_ops;
+
 static int ra_debug_open(struct inode *inode, struct file *file)
 {
-	return single_open(file, ra_account_show, NULL);
+	if (file->f_dentry == readahead_events_dentry)
+		return single_open(file, ra_account_show, NULL);
+	else
+		return seq_open(file, &aginginfo_ops);
 }
 
 static ssize_t ra_debug_write(struct file *file, const char __user *buf,
@@ -312,6 +322,146 @@ static inline void ra_account(struct fil
 
 #endif /* DEBUG_READAHEAD */
 
+/*
+ * Measure the aging progress of cold pages over time.
+ */
+#ifdef DEBUG_READAHEAD
+
+#define AGING_INFO_SIZE	(1 << 8)
+#define AGING_INFO_MASK	(AGING_INFO_SIZE - 1)
+static int aging_info_shift[] = {0, 1, 2, 4, 8, 12};
+#define AGING_INFO_SHIFTS	(sizeof(aging_info_shift)/\
+				 sizeof(aging_info_shift[0]))
+static int aging_info_index[AGING_INFO_SHIFTS];
+static unsigned long aging_info[AGING_INFO_SIZE][AGING_INFO_SHIFTS*3];
+static spinlock_t aging_info_lock = SPIN_LOCK_UNLOCKED;
+
+static unsigned long node_free_and_cold_pages(void);
+static unsigned long node_readahead_aging(void);
+
+/*
+ * The accumulated count of pages pushed into inactive_list(s).
+ */
+static unsigned long aging_total(void)
+{
+	unsigned int i;
+	unsigned long sum = 0;
+	struct zone *zones = NODE_DATA(numa_node_id())->node_zones;
+
+	for (i = 0; i < MAX_NR_ZONES; i++)
+		sum += zones[i].aging_total;
+
+	return sum;
+}
+
+static void collect_aging_info(void)
+{
+	int i;
+	unsigned long mem;
+	unsigned long page_aging;
+	unsigned long smooth_aging;
+
+	mem = node_free_and_cold_pages();
+	page_aging = aging_total();
+	smooth_aging = node_readahead_aging();
+
+	spin_lock_irq(&aging_info_lock);
+
+	for (i = AGING_INFO_SHIFTS - 1; i >= 0; i--) {
+		if (smooth_aging - aging_info[aging_info_index[i]][i*3+2] +
+		      page_aging - aging_info[aging_info_index[i]][i*3+1] >
+					2 * (mem >> aging_info_shift[i])) {
+			aging_info_index[i]++;
+			aging_info_index[i] &= AGING_INFO_MASK;
+			aging_info[aging_info_index[i]][i*3] = jiffies;
+			aging_info[aging_info_index[i]][i*3+1] = page_aging;
+			aging_info[aging_info_index[i]][i*3+2] = smooth_aging;
+		} else
+			break;
+	}
+
+	spin_unlock_irq(&aging_info_lock);
+}
+
+static void *aginginfo_start(struct seq_file *s, loff_t *pos)
+{
+	int n = *pos;
+	int i;
+
+	spin_lock_irq(&aging_info_lock);
+
+	if (!n) {
+		for (i = 0; i < AGING_INFO_SHIFTS; i++) {
+			seq_printf(s, "%12s %10s %18s%d %18s%d\t", "time","dt",
+                                        "page_aging", aging_info_shift[i],
+                                        "smooth_aging", aging_info_shift[i]);
+		}
+		seq_puts(s, "\n");
+	}
+
+	if (++n < AGING_INFO_SIZE)
+		return (void *)n;
+	else
+		return NULL;
+}
+
+static void *aginginfo_next(struct seq_file *s, void *p, loff_t *pos)
+{
+	int n = (int)p;
+
+	++*pos;
+	return (void *)(++n < AGING_INFO_SIZE ? n : 0);
+}
+
+static void aginginfo_stop(struct seq_file *s, void *p)
+{
+	spin_unlock_irq(&aging_info_lock);
+}
+
+static int aginginfo_show(struct seq_file *s, void *p)
+{
+	int n = (int)p;
+	int i;
+	int index0;
+	int index1;
+	long time;
+	unsigned long nr1;
+	unsigned long nr2;
+
+	for (i = 0; i < AGING_INFO_SHIFTS; i++) {
+		index0 = aging_info_index[i] + n;
+		index1 = aging_info_index[i] + n + 1;
+		index0 &= AGING_INFO_MASK;
+		index1 &= AGING_INFO_MASK;
+		time = aging_info[index0][i*3];
+		nr1 = aging_info[index1][i*3+1] - aging_info[index0][i*3+1];
+		nr2 = aging_info[index1][i*3+2] - aging_info[index0][i*3+2];
+		seq_printf(s, "%12ld %10lu %10lu %8lu %10lu %8lu\t",
+				time,  aging_info[index1][i*3] - time,
+				aging_info[index0][i*3+1], nr1,
+				aging_info[index0][i*3+2], nr2);
+	}
+	seq_puts(s, "\n");
+
+	return 0;
+}
+
+
+static struct seq_operations aginginfo_ops = {
+	.start	= aginginfo_start,
+	.next	= aginginfo_next,
+	.stop	= aginginfo_stop,
+	.show	= aginginfo_show,
+};
+
+static struct file_operations aginginfo_fops = {
+	.open		= ra_debug_open,
+	.read		= seq_read,
+	.llseek		= seq_lseek,
+	.release	= seq_release,
+};
+
+#endif /* DEBUG_READAHEAD */
 
 void default_unplug_io_fn(struct backing_dev_info *bdi, struct page *page)
 {
@@ -1298,6 +1448,9 @@ static inline unsigned long compute_thra
 	else
 		*remain = 0;
 
+#ifdef DEBUG_READAHEAD
+	collect_aging_info();
+#endif
 	ddprintk("compute_thrashing_threshold: "
 			"ra=%lu=%lu*%lu/%lu, remain %lu for %lu\n",
 			ra_size, stream_shift, global_size, global_shift,

--

^ permalink raw reply	[flat|nested] 32+ messages in thread

* [PATCH 16/19] readahead: laptop mode support
  2005-11-25 15:12 [PATCH 00/19] Adaptive read-ahead V8 Wu Fengguang
                   ` (14 preceding siblings ...)
  2005-11-25 15:12 ` [PATCH 15/19] readahead: page aging accounting Wu Fengguang
@ 2005-11-25 15:12 ` Wu Fengguang
  2005-11-25 16:06   ` Bart Samwel
  2005-11-25 15:12 ` [PATCH 17/19] readahead: disable look-ahead for loopback file Wu Fengguang
                   ` (3 subsequent siblings)
  19 siblings, 1 reply; 32+ messages in thread
From: Wu Fengguang @ 2005-11-25 15:12 UTC (permalink / raw)
  To: linux-kernel; +Cc: Andrew Morton, Bart Samwel

[-- Attachment #1: readahead-laptop-mode.patch --]
[-- Type: text/plain, Size: 3313 bytes --]

When the laptop drive is spinned down, defer look-ahead to spin up time.

The implementation employs a poll based method, for performance is not a
concern in this code path. The poll interval is 64KB, which should be small
enough for movies/musics. The user space application is responsible for
proper caching to hide the spin-up-and-read delay.

------------------------------------------------------------------------
For crazy laptop users who prefer aggressive read-ahead, here is the way:

# echo 1000 > /proc/sys/vm/readahead_ratio
# blockdev --setra 524280 /dev/hda      # this is the max possible value

Notes:
- It is still an untested feature.
- It is safer to use blockdev+fadvise to increase ra-max for a single file,
  which needs patching your movie player.
- Be sure to restore them to sane values in normal operations!

Signed-off-by: Wu Fengguang <wfg@mail.ustc.edu.cn>
---

 include/linux/writeback.h |    6 ++++++
 mm/page-writeback.c       |    2 +-
 mm/readahead.c            |   30 ++++++++++++++++++++++++++++++
 3 files changed, 37 insertions(+), 1 deletion(-)

--- linux-2.6.15-rc1-mm2.orig/include/linux/writeback.h
+++ linux-2.6.15-rc1-mm2/include/linux/writeback.h
@@ -85,6 +85,12 @@ void laptop_io_completion(void);
 void laptop_sync_completion(void);
 void throttle_vm_writeout(void);
 
+extern struct timer_list laptop_mode_wb_timer;
+static inline int laptop_spinned_down(void)
+{
+	return !timer_pending(&laptop_mode_wb_timer);
+}
+
 /* These are exported to sysctl. */
 extern int dirty_background_ratio;
 extern int vm_dirty_ratio;
--- linux-2.6.15-rc1-mm2.orig/mm/page-writeback.c
+++ linux-2.6.15-rc1-mm2/mm/page-writeback.c
@@ -369,7 +369,7 @@ static void wb_timer_fn(unsigned long un
 static void laptop_timer_fn(unsigned long unused);
 
 static DEFINE_TIMER(wb_timer, wb_timer_fn, 0, 0);
-static DEFINE_TIMER(laptop_mode_wb_timer, laptop_timer_fn, 0, 0);
+DEFINE_TIMER(laptop_mode_wb_timer, laptop_timer_fn, 0, 0);
 
 /*
  * Periodic writeback of "old" data.
--- linux-2.6.15-rc1-mm2.orig/mm/readahead.c
+++ linux-2.6.15-rc1-mm2/mm/readahead.c
@@ -14,6 +14,7 @@
 #include <linux/blkdev.h>
 #include <linux/backing-dev.h>
 #include <linux/pagevec.h>
+#include <linux/writeback.h>
 
 /* The default max/min read-ahead pages. */
 #define KB(size)	(((size)*1024 + PAGE_CACHE_SIZE-1) / PAGE_CACHE_SIZE)
@@ -1155,6 +1156,30 @@ out:
 }
 
 /*
+ * Set a new look-ahead mark at @new_index.
+ * Return 0 if the new mark is successfully set.
+ */
+static inline int renew_lookahead(struct address_space *mapping,
+				struct file_ra_state *ra,
+				pgoff_t index, pgoff_t new_index)
+{
+	struct page *page;
+
+	if (index == ra->lookahead_index &&
+			new_index >= ra->readahead_index)
+		return 1;
+
+	page = find_page(mapping, new_index);
+	if (!page)
+		return 1;
+
+	SetPageReadahead(page);
+	if (ra->lookahead_index == index)
+		ra->lookahead_index = new_index;
+	return 0;
+}
+
+/*
  * State based calculation of read-ahead request.
  *
  * This figure shows the meaning of file_ra_state members:
@@ -2006,6 +2031,11 @@ page_cache_readahead_adaptive(struct add
 							end_index - index);
 			return 0;
 		}
+		if (laptop_mode && laptop_spinned_down()) {
+			if (!renew_lookahead(mapping, ra, index,
+						index + LAPTOP_POLL_INTERVAL))
+				return 0;
+		}
 	}
 
 	if (page)

--

^ permalink raw reply	[flat|nested] 32+ messages in thread

* [PATCH 17/19] readahead: disable look-ahead for loopback file
  2005-11-25 15:12 [PATCH 00/19] Adaptive read-ahead V8 Wu Fengguang
                   ` (15 preceding siblings ...)
  2005-11-25 15:12 ` [PATCH 16/19] readahead: laptop mode support Wu Fengguang
@ 2005-11-25 15:12 ` Wu Fengguang
  2005-11-25 15:12 ` [PATCH 18/19] readahead: nfsd support Wu Fengguang
                   ` (2 subsequent siblings)
  19 siblings, 0 replies; 32+ messages in thread
From: Wu Fengguang @ 2005-11-25 15:12 UTC (permalink / raw)
  To: linux-kernel; +Cc: Andrew Morton, Wu Fengguang

[-- Attachment #1: readahead-disable-lookahead-for-loopback.patch --]
[-- Type: text/plain, Size: 2027 bytes --]

Loopback files normally contain filesystem, in which case there are already
proper look-aheads in the upper layer, more look-aheads on the loopback file
only ruins the read-ahead hit rate.

Signed-off-by: Wu Fengguang <wfg@mail.ustc.edu.cn>
---

I'd like to thank Tero Grundstr?m for uncovering the loopback problem.

 drivers/block/loop.c |    6 ++++++
 include/linux/fs.h   |    1 +
 mm/readahead.c       |    8 ++++++++
 3 files changed, 15 insertions(+)

--- linux-2.6.15-rc2-mm1.orig/include/linux/fs.h
+++ linux-2.6.15-rc2-mm1/include/linux/fs.h
@@ -620,6 +620,7 @@ struct file_ra_state {
 };
 #define RA_FLAG_MISS 0x01	/* a cache miss occured against this file */
 #define RA_FLAG_INCACHE 0x02	/* file is already in cache */
+#define RA_FLAG_NO_LOOKAHEAD	(1UL<<31)	/* disable look-ahead */
 
 struct file {
 	/*
--- linux-2.6.15-rc2-mm1.orig/drivers/block/loop.c
+++ linux-2.6.15-rc2-mm1/drivers/block/loop.c
@@ -782,6 +782,12 @@ static int loop_set_fd(struct loop_devic
 	mapping = file->f_mapping;
 	inode = mapping->host;
 
+	/*
+	 * The upper layer should already do proper look-ahead,
+	 * one more look-ahead here only ruins the cache hit rate.
+	 */
+	file->f_ra.flags |= RA_FLAG_NO_LOOKAHEAD;
+
 	if (!(file->f_mode & FMODE_WRITE))
 		lo_flags |= LO_FLAGS_READ_ONLY;
 
--- linux-2.6.15-rc2-mm1.orig/mm/readahead.c
+++ linux-2.6.15-rc2-mm1/mm/readahead.c
@@ -1312,6 +1312,11 @@ static inline void ra_state_update(struc
 	if (ra_size < old_ra && ra_cache_hit(ra, 0))
 		ra_account(ra, RA_EVENT_READAHEAD_SHRINK, old_ra - ra_size);
 #endif
+
+	/* Disable look-ahead for loopback file. */
+	if (unlikely(ra->flags & RA_FLAG_NO_LOOKAHEAD))
+		la_size = 0;
+
 	ra_addup_cache_hit(ra);
 	ra->ra_index = ra->readahead_index;
 	ra->la_index = ra->lookahead_index;
@@ -1672,6 +1677,9 @@ static int query_page_cache(struct addre
 	if (count < ra_max)
 		goto out;
 
+	if (unlikely(ra->flags & RA_FLAG_NO_LOOKAHEAD))
+		goto out;
+
 	/*
 	 * Check the far pages coarsely.
 	 * The big count here helps increase la_size.

--

^ permalink raw reply	[flat|nested] 32+ messages in thread

* [PATCH 18/19] readahead: nfsd support
  2005-11-25 15:12 [PATCH 00/19] Adaptive read-ahead V8 Wu Fengguang
                   ` (16 preceding siblings ...)
  2005-11-25 15:12 ` [PATCH 17/19] readahead: disable look-ahead for loopback file Wu Fengguang
@ 2005-11-25 15:12 ` Wu Fengguang
  2005-11-25 15:12 ` [PATCH 19/19] io: avoid too much latency from read-ahead Wu Fengguang
  2005-11-25 15:43 ` [PATCH 00/19] Adaptive read-ahead V8 Diego Calleja
  19 siblings, 0 replies; 32+ messages in thread
From: Wu Fengguang @ 2005-11-25 15:12 UTC (permalink / raw)
  To: linux-kernel; +Cc: Andrew Morton, Neil Brown

[-- Attachment #1: readahead-nfsd-support.patch --]
[-- Type: text/plain, Size: 4985 bytes --]

- disable nfsd raparms: the new logic do not rely on it
- disable look-ahead on start of file: leave it to the client

For the case of NFS service, the new read-ahead logic
+ can handle disordered nfsd requests
+ can handle concurrent sequential requests on large files
  with the help of look-ahead
- will have much ado about the concurrent ones on small files

------------------------------------------------------------------------
Notes about the concurrent nfsd requests issue:

nfsd read requests can be out of order, concurrent and with no ra-state info.
They are handled by the context based read-ahead method, which does the job
in the following steps:

1. scan in page cache
2. make read-ahead decisions
3. alloc new pages
4. insert new pages to page cache

A single read-ahead chunk in the client side will be dissembled and serviced
by many concurrent nfsd in the server side. It is highly possible for two or
more of these parallel nfsd instances to be in step 1/2/3 at the same time.
Without knowing others working on the same file region, they will issue
overlaped read-ahead requests, which lead to many conflicts at step 4.

There's no much luck to eliminate the concurrent problem in general and
efficient ways. But for small to medium NFS servers where the bottleneck
lies in storage devices, here is a performance tip:

# for pid in `pidof nfsd`; do taskset -p 1 $pid; done

This command effectively serializes all nfsd requests. It would be nice if
someone can code this serialization on a per-file basis.

------------------------------------------------------------------------
Here is some test output(8 nfsd; local mount with tcp,rsize=8192):

SERIALIZED, SMALL FILES
=======================
readahead_ratio = 0, ra_max = 128kb (old logic, the ra_max is really not relavant)
96.51s real  11.32s system  3.27s user  160334+2829 cs  diff -r $NFSDIR $NFSDIR2
readahead_ratio = 70, ra_max = 1024kb (new read-ahead logic)
94.88s real  11.53s system  3.20s user  152415+3777 cs  diff -r $NFSDIR $NFSDIR2

PARALLEL, SMALL FILES
=====================
readahead_ratio = 0, ra_max = 128kb
99.87s real  11.41s system  3.15s user  173945+9163 cs  diff -r $NFSDIR $NFSDIR2
readahead_ratio = 70, ra_max = 1024kb
100.14s real  12.06s system  3.16s user  170865+13406 cs  diff -r $NFSDIR $NFSDIR2

SERIALIZED, BIG FILES
=====================
readahead_ratio = 0, ra_max = 128kb
56.52s real  3.38s system  1.23s user  47930+5256 cs  diff $NFSFILE $NFSFILE2
readahead_ratio = 70, ra_max = 1024kb
32.54s real  5.71s system  1.38s user  23851+17007 cs  diff $NFSFILE $NFSFILE2

PARALLEL, BIG FILES
===================
readahead_ratio = 0, ra_max = 128kb
63.35s real  5.68s system  1.57s user  82594+48747 cs  diff $NFSFILE $NFSFILE2
readahead_ratio = 70, ra_max = 1024kb
33.87s real  10.17s system  1.55s user  72291+100079 cs  diff $NFSFILE $NFSFILE2

Signed-off-by: Wu Fengguang <wfg@mail.ustc.edu.cn>
---


 fs/nfsd/vfs.c      |    6 +++++-
 include/linux/fs.h |    1 +
 mm/readahead.c     |   11 +++++++++--
 3 files changed, 15 insertions(+), 3 deletions(-)

--- linux-2.6.15-rc2-mm1.orig/fs/nfsd/vfs.c
+++ linux-2.6.15-rc2-mm1/fs/nfsd/vfs.c
@@ -826,10 +826,14 @@ nfsd_vfs_read(struct svc_rqst *rqstp, st
 #endif
 
 	/* Get readahead parameters */
-	ra = nfsd_get_raparms(inode->i_sb->s_dev, inode->i_ino);
+	if (prefer_adaptive_readahead())
+		ra = NULL;
+	else
+		ra = nfsd_get_raparms(inode->i_sb->s_dev, inode->i_ino);
 
 	if (ra && ra->p_set)
 		file->f_ra = ra->p_ra;
+	file->f_ra.flags |= RA_FLAG_NFSD;
 
 	if (file->f_op->sendfile) {
 		svc_pushback_unused_pages(rqstp);
--- linux-2.6.15-rc2-mm1.orig/include/linux/fs.h
+++ linux-2.6.15-rc2-mm1/include/linux/fs.h
@@ -621,6 +621,7 @@ struct file_ra_state {
 #define RA_FLAG_MISS 0x01	/* a cache miss occured against this file */
 #define RA_FLAG_INCACHE 0x02	/* file is already in cache */
 #define RA_FLAG_NO_LOOKAHEAD	(1UL<<31)	/* disable look-ahead */
+#define RA_FLAG_NFSD		(1UL<<30)	/* request from nfsd */
 
 struct file {
 	/*
--- linux-2.6.15-rc2-mm1.orig/mm/readahead.c
+++ linux-2.6.15-rc2-mm1/mm/readahead.c
@@ -15,11 +15,13 @@
 #include <linux/backing-dev.h>
 #include <linux/pagevec.h>
 #include <linux/writeback.h>
+#include <linux/nfsd/const.h>
 
 /* The default max/min read-ahead pages. */
 #define KB(size)	(((size)*1024 + PAGE_CACHE_SIZE-1) / PAGE_CACHE_SIZE)
 #define MAX_RA_PAGES	KB(VM_MAX_READAHEAD)
 #define MIN_RA_PAGES	KB(VM_MIN_READAHEAD)
+#define MIN_NFSD_PAGES	KB(NFSSVC_MAXBLKSIZE/1024)
 
 /* In laptop mode, poll delayed look-ahead on every ## pages read. */
 #define LAPTOP_POLL_INTERVAL 16
@@ -1894,8 +1896,13 @@ newfile_readahead(struct address_space *
 	if (req_size > ra_min)
 		req_size = ra_min;
 
-	ra_size = 4 * req_size;
-	la_size = 2 * req_size;
+	if (unlikely(ra->flags & RA_FLAG_NFSD)) {
+		ra_size = MIN_NFSD_PAGES;
+		la_size = 0;
+	} else {
+		ra_size = 4 * req_size;
+		la_size = 2 * req_size;
+	}
 
 	set_ra_class(ra, RA_CLASS_NEWFILE);
 	ra_state_init(ra, 0, 0);

--

^ permalink raw reply	[flat|nested] 32+ messages in thread

* [PATCH 19/19] io: avoid too much latency from read-ahead
  2005-11-25 15:12 [PATCH 00/19] Adaptive read-ahead V8 Wu Fengguang
                   ` (17 preceding siblings ...)
  2005-11-25 15:12 ` [PATCH 18/19] readahead: nfsd support Wu Fengguang
@ 2005-11-25 15:12 ` Wu Fengguang
  2005-11-25 15:43 ` [PATCH 00/19] Adaptive read-ahead V8 Diego Calleja
  19 siblings, 0 replies; 32+ messages in thread
From: Wu Fengguang @ 2005-11-25 15:12 UTC (permalink / raw)
  To: linux-kernel; +Cc: Andrew Morton, Con Kolivas

[-- Attachment #1: io-low-latency.patch --]
[-- Type: text/plain, Size: 4666 bytes --]

Since the VM_MAX_READAHEAD is greatly enlarged and the algorithm become more
complex, it may be necessary to insert some cond_resched() calls in the
read-ahead path.

----------------------------------------------------------------------------
If you feel more latency with the new read-ahead code, the cause can
either be the complex read-ahead code, or some low level routines not ready
to support the larger default 1M read-ahead size. It can be cleared out with
the following commands:

dd if=/dev/zero of=sparse bs=1M seek=10000 count=1
cp sparse /dev/null

If the above copy does not lead to audio jitters, then it's an low level
issue. It's hard to find them out for now, though there's a workaround:

blockdev --setra 256 /dev/hda # or whatever device you are using

Signed-off-by: Wu Fengguang <wfg@mail.ustc.edu.cn>
---


This is recommended by Con Kolivas to improve respond time for desktop.

 fs/mpage.c     |    4 +++-
 mm/readahead.c |   16 +++++++++++++++-
 2 files changed, 18 insertions(+), 2 deletions(-)

--- linux-2.6.15-rc2-mm1.orig/mm/readahead.c
+++ linux-2.6.15-rc2-mm1/mm/readahead.c
@@ -368,6 +368,7 @@ static void collect_aging_info(void)
 	page_aging = aging_total();
 	smooth_aging = node_readahead_aging();
 
+	cond_resched();
 	spin_lock_irq(&aging_info_lock);
 
 	for (i = AGING_INFO_SHIFTS - 1; i >= 0; i--) {
@@ -625,8 +626,10 @@ static int read_pages(struct address_spa
 					page->index, GFP_KERNEL)) {
 			ret = mapping->a_ops->readpage(filp, page);
 			if (ret != AOP_TRUNCATED_PAGE) {
-				if (!pagevec_add(&lru_pvec, page))
+				if (!pagevec_add(&lru_pvec, page)) {
 					__pagevec_lru_add(&lru_pvec);
+					cond_resched();
+				}
 				continue;
 			} /* else fall through to release */
 		}
@@ -746,6 +749,7 @@ __do_page_cache_readahead(struct address
 		}
 
 		read_unlock_irq(&mapping->tree_lock);
+		cond_resched();
 		page = page_cache_alloc_cold(mapping);
 		read_lock_irq(&mapping->tree_lock);
 		if (!page)
@@ -1145,6 +1149,7 @@ static int rescue_pages(struct page *pag
 
 		spin_unlock_irq(&zone->lru_lock);
 
+		cond_resched();
 		page = find_page(mapping, index);
 		if (!page)
 			goto out;
@@ -1219,6 +1224,7 @@ static unsigned long node_readahead_agin
 	unsigned long sum = 0;
 	cpumask_t mask = node_to_cpumask(numa_node_id());
 
+	cond_resched();
 	for_each_cpu_mask(cpu, mask)
 		sum += per_cpu(readahead_aging, cpu);
 
@@ -1349,6 +1355,7 @@ static int ra_dispatch(struct file_ra_st
 	int actual;
 	enum ra_class ra_class;
 
+	cond_resched();
 	ra_class = (ra->flags & RA_CLASS_MASK);
 	BUG_ON(ra_class == 0 || ra_class > RA_CLASS_END);
 
@@ -1371,6 +1378,7 @@ static int ra_dispatch(struct file_ra_st
 
 	actual = __do_page_cache_readahead(mapping, filp,
 					ra->ra_index, ra_size, la_size);
+	cond_resched();
 
 #ifdef READAHEAD_STREAMING
 	if (actual < ra_size) {
@@ -1609,6 +1617,7 @@ static int count_cache_hit(struct addres
 	int count = 0;
 	int i;
 
+	cond_resched();
 	read_lock_irq(&mapping->tree_lock);
 
 	/*
@@ -1646,6 +1655,7 @@ static int query_page_cache(struct addre
 	 * Scan backward and check the near @ra_max pages.
 	 * The count here determines ra_size.
 	 */
+	cond_resched();
 	read_lock_irq(&mapping->tree_lock);
 	index = radix_tree_lookup_head(&mapping->page_tree, offset, ra_max);
 	read_unlock_irq(&mapping->tree_lock);
@@ -1691,6 +1701,7 @@ static int query_page_cache(struct addre
 	if (nr_lookback > offset)
 		nr_lookback = offset;
 
+	cond_resched();
 	radix_tree_cache_init(&cache);
 	read_lock_irq(&mapping->tree_lock);
 	for (count += ra_max; count < nr_lookback; count += ra_max) {
@@ -1737,6 +1748,7 @@ static inline pgoff_t first_absent_page_
 	if (max_scan > index)
 		max_scan = index;
 
+	cond_resched();
 	radix_tree_cache_init(&cache);
 	read_lock_irq(&mapping->tree_lock);
 	for (; origin - index <= max_scan;) {
@@ -1760,6 +1772,7 @@ static inline pgoff_t first_absent_page(
 {
 	pgoff_t ra_index;
 
+	cond_resched();
 	read_lock_irq(&mapping->tree_lock);
 	ra_index = radix_tree_lookup_tail(&mapping->page_tree,
 					index + 1, max_scan);
@@ -2469,6 +2482,7 @@ save_chunk:
 			n <= 3)
 		ret += save_chunk(chunk_head, live_head, page, save_list);
 
+	cond_resched();
 	if (&page->lru != page_list)
 		goto next_chunk;
 
--- linux-2.6.15-rc2-mm1.orig/fs/mpage.c
+++ linux-2.6.15-rc2-mm1/fs/mpage.c
@@ -343,8 +343,10 @@ mpage_readpages(struct address_space *ma
 			bio = do_mpage_readpage(bio, page,
 					nr_pages - page_idx,
 					&last_block_in_bio, get_block);
-			if (!pagevec_add(&lru_pvec, page))
+			if (!pagevec_add(&lru_pvec, page)) {
 				__pagevec_lru_add(&lru_pvec);
+				cond_resched();
+			}
 		} else {
 			page_cache_release(page);
 		}

--

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 10/19] readahead: state based method
  2005-11-25 15:12 ` [PATCH 10/19] readahead: state based method Wu Fengguang
@ 2005-11-25 15:21   ` Eric Dumazet
  2005-11-25 15:31     ` Eric Dumazet
  2005-11-26  3:09     ` Wu Fengguang
  0 siblings, 2 replies; 32+ messages in thread
From: Eric Dumazet @ 2005-11-25 15:21 UTC (permalink / raw)
  To: Wu Fengguang; +Cc: linux-kernel, Andrew Morton

Wu Fengguang a écrit :

>  include/linux/fs.h |    8 +
> 
> --- linux-2.6.15-rc2-mm1.orig/include/linux/fs.h
> +++ linux-2.6.15-rc2-mm1/include/linux/fs.h
> @@ -604,13 +604,19 @@ struct file_ra_state {
>  	unsigned long start;		/* Current window */
>  	unsigned long size;
>  	unsigned long flags;		/* ra flags RA_FLAG_xxx*/
> -	unsigned long cache_hit;	/* cache hit count*/
> +	uint64_t      cache_hit;	/* cache hit count*/
>  	unsigned long prev_page;	/* Cache last read() position */
>  	unsigned long ahead_start;	/* Ahead window */
>  	unsigned long ahead_size;
>  	unsigned long ra_pages;		/* Maximum readahead window */
>  	unsigned long mmap_hit;		/* Cache hit stat for mmap accesses */
>  	unsigned long mmap_miss;	/* Cache miss stat for mmap accesses */
> +
> +	unsigned long age;
> +	pgoff_t la_index;
> +	pgoff_t ra_index;
> +	pgoff_t lookahead_index;
> +	pgoff_t readahead_index;
>  };

Hum... This sizeof(struct file) increase seems quite large...

Have you ever considered to change struct file so that file_ra_state is not 
embedded, but dynamically allocated (or other strategy) for regular files ?

I mean, sockets, pipes cannot readahead... And some machines use far more 
sockets than regular files.

I wrote such a patch in the past I could resend...

Eric

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 10/19] readahead: state based method
  2005-11-25 15:21   ` Eric Dumazet
@ 2005-11-25 15:31     ` Eric Dumazet
  2005-11-26  3:09     ` Wu Fengguang
  1 sibling, 0 replies; 32+ messages in thread
From: Eric Dumazet @ 2005-11-25 15:31 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Wu Fengguang, linux-kernel, Andrew Morton

Eric Dumazet a écrit :

> Hum... This sizeof(struct file) increase seems quite large...
> 
> Have you ever considered to change struct file so that file_ra_state is 
> not embedded, but dynamically allocated (or other strategy) for regular 
> files ?
> 
> I mean, sockets, pipes cannot readahead... And some machines use far 
> more sockets than regular files.
> 
> I wrote such a patch in the past I could resend...

http://marc.theaimsgroup.com/?l=linux-kernel&m=112435605407020&w=2


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 00/19] Adaptive read-ahead V8
  2005-11-25 15:12 [PATCH 00/19] Adaptive read-ahead V8 Wu Fengguang
                   ` (18 preceding siblings ...)
  2005-11-25 15:12 ` [PATCH 19/19] io: avoid too much latency from read-ahead Wu Fengguang
@ 2005-11-25 15:43 ` Diego Calleja
  2005-11-25 19:31   ` Lee Revell
  2005-11-26  3:17   ` Wu Fengguang
  19 siblings, 2 replies; 32+ messages in thread
From: Diego Calleja @ 2005-11-25 15:43 UTC (permalink / raw)
  To: Wu Fengguang; +Cc: linux-kernel, akpm

El Fri, 25 Nov 2005 23:12:10 +0800,
Wu Fengguang <wfg@mail.ustc.edu.cn> escribió:

> Overview
> ========
> 
> The current read-ahead logic uses an inflexible algorithm with 128KB
> VM_MAX_READAHEAD. Less memory leads to thrashing, more memory helps no
> throughput. The new logic is simply safer and faster. It makes sure
> every single read-ahead request is safe for the current load. Memory
> tight systems are expected to benefit a lot: no thrashing any more.
> It can also help boost I/O throughput for large memory systems, for
> VM_MAX_READAHEAD now defaults to 1MB. The value is no longer tightly
> coupled with the thrashing problem, and therefore constrainted by it.


Recently, a openoffice hacker wrote in his blog that the kernel was
culprit of applications not starting as fast as in other systems.
Most of the reasons he gave were wrong, but there was a interesting
one: When you start your system, you've lots of free memory. Since
you have lots of memory, he said it was reasonable to expect that
kernel would readahead *heavily* everything it can to fill that
memory as soon as possible (hoping that what you readahead'ed was
part of the kde/gnome/openoffice libraries etc) and go back to the
normal behaviour when your free memory is used by caches etc.
"Teorically" it looks like a nice heuristic for desktops. Does
adaptative readahead does something like this?

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 16/19] readahead: laptop mode support
  2005-11-25 15:12 ` [PATCH 16/19] readahead: laptop mode support Wu Fengguang
@ 2005-11-25 16:06   ` Bart Samwel
  2005-11-26  3:33     ` Wu Fengguang
  0 siblings, 1 reply; 32+ messages in thread
From: Bart Samwel @ 2005-11-25 16:06 UTC (permalink / raw)
  To: Wu Fengguang; +Cc: linux-kernel, Andrew Morton

Wu Fengguang wrote:
> When the laptop drive is spinned down, defer look-ahead to spin up time.

Just a n00b question: isn't readahead something that happens at read 
time at the block device level? And doesn't the fact that you're reading 
something imply that the drive is spun up? Or can readahead be triggered 
by reading from cache?

> For crazy laptop users who prefer aggressive read-ahead, here is the way:
> 
> # echo 1000 > /proc/sys/vm/readahead_ratio
> # blockdev --setra 524280 /dev/hda      # this is the max possible value

These amounts of readahead are absolutely useless though. I've done 
measurements about a year ago, that show that at a spindown time of two 
minutes you've basically achieved all the power savings you can get. 
More than 10 minutes of spindown is absolutely useless unless you have a 
desktop drive, because those don't normally support more than 50,000 
spinup cycles. The only apps I can think of that work on this amount of 
data in such a short period of time are all apps where you shouldn't be 
concerned about power drawn by the hard drive. :)

--Bart

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 00/19] Adaptive read-ahead V8
  2005-11-25 15:43 ` [PATCH 00/19] Adaptive read-ahead V8 Diego Calleja
@ 2005-11-25 19:31   ` Lee Revell
  2005-11-26 23:03     ` Mark van der Made
  2005-11-26  3:17   ` Wu Fengguang
  1 sibling, 1 reply; 32+ messages in thread
From: Lee Revell @ 2005-11-25 19:31 UTC (permalink / raw)
  To: Diego Calleja; +Cc: Wu Fengguang, linux-kernel, akpm

On Fri, 2005-11-25 at 16:43 +0100, Diego Calleja wrote:
> Recently, a openoffice hacker wrote in his blog that the kernel was
> culprit of applications not starting as fast as in other systems. 

Useless without a link ;-)

Lee


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 10/19] readahead: state based method
  2005-11-25 15:21   ` Eric Dumazet
  2005-11-25 15:31     ` Eric Dumazet
@ 2005-11-26  3:09     ` Wu Fengguang
  1 sibling, 0 replies; 32+ messages in thread
From: Wu Fengguang @ 2005-11-26  3:09 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: linux-kernel, Andrew Morton

On Fri, Nov 25, 2005 at 04:21:22PM +0100, Eric Dumazet wrote:
> Wu Fengguang a écrit :
> 
> > include/linux/fs.h |    8 +
> >
> >--- linux-2.6.15-rc2-mm1.orig/include/linux/fs.h
> >+++ linux-2.6.15-rc2-mm1/include/linux/fs.h
> >@@ -604,13 +604,19 @@ struct file_ra_state {
> > 	unsigned long start;		/* Current window */
> > 	unsigned long size;
> > 	unsigned long flags;		/* ra flags RA_FLAG_xxx*/
> >-	unsigned long cache_hit;	/* cache hit count*/
> >+	uint64_t      cache_hit;	/* cache hit count*/
> > 	unsigned long prev_page;	/* Cache last read() position */
> > 	unsigned long ahead_start;	/* Ahead window */
> > 	unsigned long ahead_size;
> > 	unsigned long ra_pages;		/* Maximum readahead window */
> > 	unsigned long mmap_hit;		/* Cache hit stat for mmap accesses 
> > 	*/
> > 	unsigned long mmap_miss;	/* Cache miss stat for mmap accesses 
> > 	*/
> >+
> >+	unsigned long age;
> >+	pgoff_t la_index;
> >+	pgoff_t ra_index;
> >+	pgoff_t lookahead_index;
> >+	pgoff_t readahead_index;
> > };
> 
> Hum... This sizeof(struct file) increase seems quite large...

Thanks.
I'm not sure if the two read-ahead logics should coexist in long term.
If so, the file_ra_state can be changed as follows to save memory:

struct file_ra_state {
        union { 
                struct {
                        unsigned long start;            /* Current window */
                        unsigned long size;
                        unsigned long ahead_start;      /* Ahead window */
                        unsigned long ahead_size;
                };
                struct {
                        unsigned long mmap_hit;         /* Cache hit stat for mmap accesses */
                        unsigned long mmap_miss;        /* Cache miss stat for mmap accesses */
                };      
                struct {
                        unsigned long age;
                        pgoff_t la_index;
                        pgoff_t ra_index;
                        pgoff_t lookahead_index;
                        pgoff_t readahead_index;
                };
        };
        uint64_t      cache_hit;        /* cache hit count*/
        unsigned long flags;            /* ra flags RA_FLAG_xxx*/
        unsigned long prev_page;        /* Cache last read() position */
        unsigned long ra_pages;         /* Maximum readahead window */
};

The mmap_hit/mmap_miss should be only used in mmap read-around logic.

> Have you ever considered to change struct file so that file_ra_state is not 
> embedded, but dynamically allocated (or other strategy) for regular files ?
> 
> I mean, sockets, pipes cannot readahead... And some machines use far more 
> sockets than regular files.
> 
> I wrote such a patch in the past I could resend...

Yes, I noticed it, and think it generally a good idea. The only problem is that
I'm afraid the patch might make the file_ra_state tightly coupled with file. See
this comment:

 * Note that @filp is purely used for passing on to the ->readpage[s]()
 * handler: it may refer to a different file from @mapping (so we may not use
 * @filp->f_mapping or @filp->f_dentry->d_inode here).
 * Also, @ra may not be equal to &@filp->f_ra.

And this patch:

ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.15-rc2/2.6.15-rc2-mm1/broken-out/ext3_readdir-use-generic-readahead.patch
  Linus points out that ext3_readdir's readahead only cuts in when
  ext3_readdir() is operating at the very start of the directory.  So for large
  directories we end up performing no readahead at all and we suck.
  
  So take it all out and use the core VM's page_cache_readahead().  This means
  that ext3 directory reads will use all of readahead's dynamic sizing goop.
  
  Note that we're using the diretory's filp->f_ra to hold the readahead state,
  but readahead is actually being performed against the underlying blockdev's
  address_space.  Fortunately the readahead code is all set up to handle this.

Regards,
Wu

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 00/19] Adaptive read-ahead V8
  2005-11-25 15:43 ` [PATCH 00/19] Adaptive read-ahead V8 Diego Calleja
  2005-11-25 19:31   ` Lee Revell
@ 2005-11-26  3:17   ` Wu Fengguang
  2005-11-26 13:25     ` Tomasz Torcz
  1 sibling, 1 reply; 32+ messages in thread
From: Wu Fengguang @ 2005-11-26  3:17 UTC (permalink / raw)
  To: Diego Calleja; +Cc: linux-kernel, akpm

On Fri, Nov 25, 2005 at 04:43:17PM +0100, Diego Calleja wrote:
> Recently, a openoffice hacker wrote in his blog that the kernel was
> culprit of applications not starting as fast as in other systems.
> Most of the reasons he gave were wrong, but there was a interesting
> one: When you start your system, you've lots of free memory. Since
> you have lots of memory, he said it was reasonable to expect that
> kernel would readahead *heavily* everything it can to fill that
> memory as soon as possible (hoping that what you readahead'ed was
> part of the kde/gnome/openoffice libraries etc) and go back to the
> normal behaviour when your free memory is used by caches etc.
> "Teorically" it looks like a nice heuristic for desktops. Does
> adaptative readahead does something like this?

It's interesting ;)
In fact some distributions do have a read-ahead script to preload files on
startup. The readahead system call should be enough for this purpose:

NAME
       readahead - perform file readahead into page cache

SYNOPSIS
       #include <fcntl.h>

       ssize_t readahead(int fd, off64_t *offset, size_t count);

Thanks,
Wu

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 16/19] readahead: laptop mode support
  2005-11-25 16:06   ` Bart Samwel
@ 2005-11-26  3:33     ` Wu Fengguang
  0 siblings, 0 replies; 32+ messages in thread
From: Wu Fengguang @ 2005-11-26  3:33 UTC (permalink / raw)
  To: Bart Samwel; +Cc: linux-kernel, Andrew Morton

On Fri, Nov 25, 2005 at 05:06:25PM +0100, Bart Samwel wrote:
> Wu Fengguang wrote:
> >When the laptop drive is spinned down, defer look-ahead to spin up time.
> 
> Just a n00b question: isn't readahead something that happens at read 
> time at the block device level? And doesn't the fact that you're reading 
> something imply that the drive is spun up? Or can readahead be triggered 
> by reading from cache?

Yes, both the old and new read-ahead logic issue read-ahead requests before
the pages are immediately needed. It is called look-ahead in the new logic,
which achieves I/O pipelining, and helps hide the I/O latency.

> >For crazy laptop users who prefer aggressive read-ahead, here is the way:
> >
> ># echo 1000 > /proc/sys/vm/readahead_ratio
> ># blockdev --setra 524280 /dev/hda      # this is the max possible value
> 
> These amounts of readahead are absolutely useless though. I've done 
> measurements about a year ago, that show that at a spindown time of two 
> minutes you've basically achieved all the power savings you can get. 
> More than 10 minutes of spindown is absolutely useless unless you have a 
> desktop drive, because those don't normally support more than 50,000 
> spinup cycles. The only apps I can think of that work on this amount of 
> data in such a short period of time are all apps where you shouldn't be 
> concerned about power drawn by the hard drive. :)

Thanks, I have read about the paper, quite informative :)

It's just that some one suggested about the feature, and it's just a matter of
lifting some limit values in the code - so I did it :)

Regards,
Wu

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 00/19] Adaptive read-ahead V8
  2005-11-26  3:17   ` Wu Fengguang
@ 2005-11-26 13:25     ` Tomasz Torcz
  2005-11-26 14:11       ` Wu Fengguang
  0 siblings, 1 reply; 32+ messages in thread
From: Tomasz Torcz @ 2005-11-26 13:25 UTC (permalink / raw)
  To: Wu Fengguang, Diego Calleja, linux-kernel, akpm

[-- Attachment #1: Type: text/plain, Size: 1423 bytes --]

On Sat, Nov 26, 2005 at 11:17:55AM +0800, Wu Fengguang wrote:
> On Fri, Nov 25, 2005 at 04:43:17PM +0100, Diego Calleja wrote:
> > Recently, a openoffice hacker wrote in his blog that the kernel was
> > culprit of applications not starting as fast as in other systems.
> > Most of the reasons he gave were wrong, but there was a interesting
> > one: When you start your system, you've lots of free memory. Since
> > you have lots of memory, he said it was reasonable to expect that
> > kernel would readahead *heavily* everything it can to fill that
> > memory as soon as possible (hoping that what you readahead'ed was
> > part of the kde/gnome/openoffice libraries etc) and go back to the
> > normal behaviour when your free memory is used by caches etc.
> > "Teorically" it looks like a nice heuristic for desktops. Does
> > adaptative readahead does something like this?
> 
> It's interesting ;)
> In fact some distributions do have a read-ahead script to preload files on
> startup. The readahead system call should be enough for this purpose:
> 
> NAME
>        readahead - perform file readahead into page cache

posix_fadvise() with        POSIX_FADV_WILLNEED hint?
              The specified data will be accessed in the near future.

-- 
Tomasz Torcz                 Morality must always be based on practicality.
zdzichu@irc.-nie.spam-.pl                -- Baron Vladimir Harkonnen


[-- Attachment #2: Type: application/pgp-signature, Size: 229 bytes --]

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 00/19] Adaptive read-ahead V8
  2005-11-26 13:25     ` Tomasz Torcz
@ 2005-11-26 14:11       ` Wu Fengguang
  0 siblings, 0 replies; 32+ messages in thread
From: Wu Fengguang @ 2005-11-26 14:11 UTC (permalink / raw)
  To: Diego Calleja, linux-kernel, akpm

On Sat, Nov 26, 2005 at 02:25:24PM +0100, Tomasz Torcz wrote:
> > It's interesting ;)
> > In fact some distributions do have a read-ahead script to preload files on
> > startup. The readahead system call should be enough for this purpose:
> > 
> > NAME
> >        readahead - perform file readahead into page cache
> 
> posix_fadvise() with        POSIX_FADV_WILLNEED hint?
>               The specified data will be accessed in the near future.

Nod, this one is better.
They do roughly the same thing, while the latter interface is more portable.

Another note: if you do not know precisely which files to readahead, turn on
aggressive readahead feature temporary might help. There are mainly three
parameters that controls the aggressiveness, the following example values should
be large enough:

echo 16 > /proc/sys/vm/readahead_hit_rate   # overlook hit rate?
echo 1000 > /proc/sys/vm/readahead_ratio    # grow up ra-size quickly?
blockdev --setra 8192 /dev/had              # large ra-size limit?

Thanks,
Wu

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 00/19] Adaptive read-ahead V8
  2005-11-25 19:31   ` Lee Revell
@ 2005-11-26 23:03     ` Mark van der Made
  2005-11-27 19:54       ` Lee Revell
  0 siblings, 1 reply; 32+ messages in thread
From: Mark van der Made @ 2005-11-26 23:03 UTC (permalink / raw)
  To: Lee Revell; +Cc: Diego Calleja, Wu Fengguang, linux-kernel, akpm

On 11/25/05, Lee Revell <rlrevell@joe-job.com> wrote:
> On Fri, 2005-11-25 at 16:43 +0100, Diego Calleja wrote:
> > Recently, a openoffice hacker wrote in his blog that the kernel was
> > culprit of applications not starting as fast as in other systems.
>
> Useless without a link ;-)
>
I think Diego refers to Michael Meeks blog:
http://www.gnome.org/~michael/activity.html#2005-11-04
Michael Meek also speaks about kernel issues in an interview in Linux
Format 72 (nov 2005).

Mark

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 00/19] Adaptive read-ahead V8
  2005-11-26 23:03     ` Mark van der Made
@ 2005-11-27 19:54       ` Lee Revell
  0 siblings, 0 replies; 32+ messages in thread
From: Lee Revell @ 2005-11-27 19:54 UTC (permalink / raw)
  To: Mark van der Made; +Cc: Diego Calleja, Wu Fengguang, linux-kernel, akpm

On Sun, 2005-11-27 at 00:03 +0100, Mark van der Made wrote:
> On 11/25/05, Lee Revell <rlrevell@joe-job.com> wrote:
> > On Fri, 2005-11-25 at 16:43 +0100, Diego Calleja wrote:
> > > Recently, a openoffice hacker wrote in his blog that the kernel was
> > > culprit of applications not starting as fast as in other systems.
> >
> > Useless without a link ;-)
> >
> I think Diego refers to Michael Meeks blog:
> http://www.gnome.org/~michael/activity.html#2005-11-04
> Michael Meek also speaks about kernel issues in an interview in Linux
> Format 72 (nov 2005).
> 

This link tells a much more interesting story:

http://www.gnome.org/~lcolitti/gnome-startup/analysis/

Seems like most of the problems are in userspace at the application
level.

Maybe the excessive seeking when loading libraries could be solved by
physically rearranging the libraries on disk so that the "hot paths"
don't induce as many seeks.

Lee


^ permalink raw reply	[flat|nested] 32+ messages in thread

end of thread, other threads:[~2005-11-27 19:54 UTC | newest]

Thread overview: 32+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2005-11-25 15:12 [PATCH 00/19] Adaptive read-ahead V8 Wu Fengguang
2005-11-25 15:12 ` [PATCH 01/19] mm: delayed page activation Wu Fengguang
2005-11-25 15:12 ` [PATCH 02/19] vm: kswapd incmin Wu Fengguang
2005-11-25 15:12 ` [PATCH 03/19] mm: balance page aging between zones and slabs Wu Fengguang
2005-11-25 15:12 ` [PATCH 04/19] mm: debug page reclaim Wu Fengguang
2005-11-25 15:12 ` [PATCH 05/19] radixtree: sync with mainline Wu Fengguang
2005-11-25 15:12 ` [PATCH 06/19] radixtree: look-aside cache Wu Fengguang
2005-11-25 15:12 ` [PATCH 07/19] readahead: some preparation Wu Fengguang
2005-11-25 15:12 ` [PATCH 08/19] readahead: call scheme Wu Fengguang
2005-11-25 15:12 ` [PATCH 09/19] readahead: parameters Wu Fengguang
2005-11-25 15:12 ` [PATCH 10/19] readahead: state based method Wu Fengguang
2005-11-25 15:21   ` Eric Dumazet
2005-11-25 15:31     ` Eric Dumazet
2005-11-26  3:09     ` Wu Fengguang
2005-11-25 15:12 ` [PATCH 11/19] readahead: context " Wu Fengguang
2005-11-25 15:12 ` [PATCH 12/19] readahead: other methods Wu Fengguang
2005-11-25 15:12 ` [PATCH 13/19] readahead: detect and rescue live pages Wu Fengguang
2005-11-25 15:12 ` [PATCH 14/19] readahead: events accounting Wu Fengguang
2005-11-25 15:12 ` [PATCH 15/19] readahead: page aging accounting Wu Fengguang
2005-11-25 15:12 ` [PATCH 16/19] readahead: laptop mode support Wu Fengguang
2005-11-25 16:06   ` Bart Samwel
2005-11-26  3:33     ` Wu Fengguang
2005-11-25 15:12 ` [PATCH 17/19] readahead: disable look-ahead for loopback file Wu Fengguang
2005-11-25 15:12 ` [PATCH 18/19] readahead: nfsd support Wu Fengguang
2005-11-25 15:12 ` [PATCH 19/19] io: avoid too much latency from read-ahead Wu Fengguang
2005-11-25 15:43 ` [PATCH 00/19] Adaptive read-ahead V8 Diego Calleja
2005-11-25 19:31   ` Lee Revell
2005-11-26 23:03     ` Mark van der Made
2005-11-27 19:54       ` Lee Revell
2005-11-26  3:17   ` Wu Fengguang
2005-11-26 13:25     ` Tomasz Torcz
2005-11-26 14:11       ` Wu Fengguang

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox