public inbox for linux-kernel@vger.kernel.org
* [PATCH 00/23] Adaptive read-ahead V11
@ 2006-03-19  2:34 Wu Fengguang
  2006-03-19  2:34 ` [PATCH 01/23] readahead: kconfig options Wu Fengguang
                   ` (23 more replies)
  0 siblings, 24 replies; 37+ messages in thread
From: Wu Fengguang @ 2006-03-19  2:34 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel

Mornings,

A fresh patch for a fresh new day, and wish you a good appetite ;)

Highlights in this release:
	- The patch series has been heavily reworked.
	- The code has been re-audited and cleaned up.
	- The old stock read-ahead logic is untouched and will always be
	  available; the new adaptive read-ahead logic will be presented as a
	  compile time selectable feature.

Why do we need this?
In short, the stock read-ahead logic does not cover many important I/O
applications. This patch series presents Linux users with a new option.
Please refer to the first patch in this series for the new features.

Patches in the series:

[PATCH 01/23] readahead: kconfig options
[PATCH 02/23] radixtree: look-aside cache
[PATCH 03/23] radixtree: hole scanning functions
[PATCH 04/23] readahead: page flag PG_readahead
[PATCH 05/23] readahead: refactor do_generic_mapping_read()
[PATCH 06/23] readahead: refactor __do_page_cache_readahead()
[PATCH 07/23] readahead: insert cond_resched() calls
[PATCH 08/23] readahead: common macros
[PATCH 09/23] readahead: events accounting
[PATCH 10/23] readahead: support functions
[PATCH 11/23] readahead: sysctl parameters
[PATCH 12/23] readahead: min/max sizes
[PATCH 13/23] readahead: page cache aging accounting
[PATCH 14/23] readahead: state based method
[PATCH 15/23] readahead: context based method
[PATCH 16/23] readahead: other methods
[PATCH 17/23] readahead: call scheme
[PATCH 18/23] readahead: laptop mode
[PATCH 19/23] readahead: loop case
[PATCH 20/23] readahead: nfsd case
[PATCH 21/23] readahead: debug radix tree new functions
[PATCH 22/23] readahead: debug traces showing accessed file names
[PATCH 23/23] readahead: debug traces showing read patterns

Note that the last three patches are optional and serve only to aid
early-stage development.

Patches for stable kernels will soon be available at:
	http://www.vanheusden.com/ara/
Thanks to Folkert van Heusden for providing the free hosting!

Changelog
=========

V11  2006-03-19
- patchset rework
- add kconfig option to make the feature compile-time selectable
- improve radix tree scan functions
- fix bug of using smp_processor_id() in preemptible code
- avoid overflow in compute_thrashing_threshold()
- disable sparse read prefetching if (readahead_hit_rate == 1)
- make thrashing recovery a standalone function
- random cleanups

V10  2005-12-16
- remove delayed page activation
- remove live page protection
- revert mmap readaround to old behavior
- default to original readahead logic
- default to original readahead size
- merge comment fixes from Andreas Mohr
- merge radixtree cleanups from Christoph Lameter
- reduce sizeof(struct file_ra_state) by unnamed union
- stateful method cleanups
- account other read-ahead paths

V9  2005-12-03
- standalone mmap read-around code, a little smarter and tunable
- make stateful method aware of request size
- decouple readahead_ratio from live pages protection
- let readahead_ratio contribute to ra_size grow speed in stateful method
- account variance of ra_size

V8  2005-11-25

- balance zone aging only in page reclaim paths and do it right
- do the aging of slabs in the same way as zones
- add debug code to dump the detailed page reclaim steps
- undo exposing of struct radix_tree_node and uninline related functions
- work better with nfsd
- generalize accelerated context based read-ahead
- account smooth read-ahead aging based on page referenced/activate bits
- avoid divide error in compute_thrashing_threshold()
- more low-latency efforts
- update some comments
- rebase debug actions on debugfs entries instead of magic readahead_ratio values

V7  2005-11-09

- new tunable parameters: readahead_hit_rate/readahead_live_chunk
- support sparse sequential accesses
- delay look-ahead if drive is spun down in laptop mode
- disable look-ahead for loopback file
- make mandatory thrashing protection more simple and robust
- attempt to improve responsiveness on large read-ahead size

V6  2005-11-01

- cancel look-ahead in laptop mode
- increase read-ahead limit to 0xFFFF pages

V5  2005-10-28

- rewrite context based method to make it clean and robust
- improved accuracy of stateful thrashing threshold estimation
- make page aging equal to the number of code pages scanned
- sort out the thrashing protection logic
- enhanced debug/accounting facilities

V4  2005-10-15

- detect and save live chunks on page reclaim
- support database workload
- support reading backward
- radix tree lookup look-aside cache

V3  2005-10-06

- major code reorganization and documentation
- stateful estimation of thrashing-threshold
- context method with accelerated ramp-up phase
- adaptive look-ahead
- early detection and rescue of pages in danger
- statistics data collection
- synchronized page aging between zones

V2  2005-09-15

- delayed page activation
- look-ahead: towards pipelined read-ahead

V1  2005-09-13

Initial release which features:
        o stateless (for now)
        o adapts to available memory / read speed
        o free of thrashing (in theory)

And handles:
        o large number of slow streams (FTP server)
        o open/read/close access patterns (NFS server)
        o multiple interleaved, sequential streams in one file
          (multithread / multimedia / database)

Cheers,
Wu Fengguang
--
Dept. Automation                University of Science and Technology of China

^ permalink raw reply	[flat|nested] 37+ messages in thread

* [PATCH 01/23] readahead: kconfig options
  2006-03-19  2:34 [PATCH 00/23] Adaptive read-ahead V11 Wu Fengguang
@ 2006-03-19  2:34 ` Wu Fengguang
  2006-03-19  2:34 ` [PATCH 02/23] radixtree: look-aside cache Wu Fengguang
                   ` (22 subsequent siblings)
  23 siblings, 0 replies; 37+ messages in thread
From: Wu Fengguang @ 2006-03-19  2:34 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, Wu Fengguang

[-- Attachment #1: readahead-kconfig-options.patch --]
[-- Type: text/plain, Size: 4248 bytes --]

This patchset introduces a set of adaptive read-ahead methods.
They enable the kernel to better support many important I/O applications.

The functional features include:

- Adaptive read-ahead buffer management
	- aggressive, thrashing-safe read-ahead size
		- optimal memory utilisation while achieving good I/O throughput
		- no need to hand-tune VM_MAX_READAHEAD
		- supports slow and fast readers at the same time
		- supports large numbers of concurrent readers
	- shrinkable look-ahead size
		- cuts memory consumption by up to 40% in overloaded situations

- Support common access patterns
        - multiple streams on one fd
        - backward prefetching
        - sparse reading
        - seeking and reading

- Special case handling
        - nfsd support: the raparms cache is no longer required
        - laptop mode support: defer look-ahead while the drive is spun down
        - loopback file support: avoid double look-ahead

The design strategies are:

- Dual methods design
        - stateful method: the fast and default one
        - stateless method: the robust and failsafe one
        - if anything abnormal happens, the stateful method bails out and the
          stateless method queries the page cache, possibly restarting the
          read-ahead process

- Robust feedback design
        - sense and handle important states so that the logic won't run away
        - detect the danger of thrashing and prevent it in advance
        - extensive accounting and debugging traces

This patch:

Add kconfig options to enable/disable:
	- adaptive read-ahead logic
	- adaptive read-ahead debug traces and events accounting

The read-ahead introduction text is cited from the well written LWN article
"Adaptive file readahead" <http://lwn.net/Articles/155510/> :)

Signed-off-by: Wu Fengguang <wfg@mail.ustc.edu.cn>
---

 mm/Kconfig |   55 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 files changed, 55 insertions(+)

--- linux-2.6.16-rc6-mm2.orig/mm/Kconfig
+++ linux-2.6.16-rc6-mm2/mm/Kconfig
@@ -145,3 +145,58 @@ config MIGRATION
 	  while the virtual addresses are not changed. This is useful for
 	  example on NUMA systems to put pages nearer to the processors accessing
 	  the page.
+
+#
+# Adaptive file readahead
+#
+config ADAPTIVE_READAHEAD
+	bool "Adaptive file readahead (EXPERIMENTAL)"
+	default n
+	depends on EXPERIMENTAL
+	help
+	  Readahead is a technique employed by the kernel in an attempt
+	  to improve file reading performance. If the kernel has reason
+	  to believe that a particular file is being read sequentially,
+	  it will attempt to read blocks from the file into memory before
+	  the application requests them. When readahead works, it speeds
+	  up the system's throughput, since the reading application does
+	  not have to wait for its requests. When readahead fails, it
+	  instead generates useless I/O and occupies memory pages that
+	  are needed for some other purpose.
+
+	  Normally, the kernel uses a stock readahead logic that is well
+	  understood and well tuned. This option enables a more complex and
+	  feature-rich one. It is more aggressive and memory efficient in
+	  doing readahead, and supports some less-common access patterns such
+	  as reading backward and reading sparsely. However, due to the great
+	  diversity of real world applications, it might not fit everyone.
+
+	  Please refer to Documentation/sysctl/vm.txt for tunable parameters.
+
+	  Say Y here if you are building a kernel for file servers.
+	  Say N if you are unsure.
+
+config DEBUG_READAHEAD
+	bool "Readahead debug and accounting"
+	default n
+	depends on ADAPTIVE_READAHEAD
+	select DEBUG_FS
+	help
+	  This option injects extra code to dump detailed debug traces and do
+	  readahead events accounting.
+
+	  To actually get the data:
+
+	  mkdir /debug
+	  mount -t debugfs none /debug
+
+	  After that you can do the following:
+
+	  echo > /debug/readahead/events # reset the counters
+	  cat /debug/readahead/events    # check the counters
+
+	  echo 1 > /debug/readahead/debug_level # show printk traces
+	  echo 2 > /debug/readahead/debug_level # show verbose printk traces
+	  echo 0 > /debug/readahead/debug_level # stop filling my kern.log
+
+	  Say N, unless you have readahead performance problems.

--


* [PATCH 02/23] radixtree: look-aside cache
  2006-03-19  2:34 [PATCH 00/23] Adaptive read-ahead V11 Wu Fengguang
  2006-03-19  2:34 ` [PATCH 01/23] readahead: kconfig options Wu Fengguang
@ 2006-03-19  2:34 ` Wu Fengguang
  2006-03-20 16:01   ` Christoph Lameter
  2006-03-19  2:34 ` [PATCH 03/23] radixtree: hole scanning functions Wu Fengguang
                   ` (21 subsequent siblings)
  23 siblings, 1 reply; 37+ messages in thread
From: Wu Fengguang @ 2006-03-19  2:34 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, Nick Piggin, Christoph Lameter, Wu Fengguang

[-- Attachment #1: radixtree-lookaside-cache.patch --]
[-- Type: text/plain, Size: 9379 bytes --]

Introduce a set of lookup functions to radix tree for the read-ahead logic.
Other access patterns with high locality may also benefit from them.

- radix_tree_lookup_node(root, index, level)
	Perform partial lookup, return the @level'th parent of the slot at
	@index.

- radix_tree_cache_xxx()
	Init/Query the cache.
- radix_tree_cache_lookup(root, cache, index)
	Perform lookup with the aid of a look-aside cache.
	For sequential scans, one O(logN) full lookup is amortized over
	every 64 O(1) cached hits, i.e. 64*O(1) + 1*O(logN).

	Typical usage:

   void func() {
  +       struct radix_tree_cache cache;
  +
  +       radix_tree_cache_init(&cache);
          read_lock_irq(&mapping->tree_lock);
          for(;;) {
  -               page = radix_tree_lookup(&mapping->page_tree, index);
  +               page = radix_tree_cache_lookup(&mapping->page_tree, &cache, index);
          }
          read_unlock_irq(&mapping->tree_lock);
   }

Acked-by: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Wu Fengguang <wfg@mail.ustc.edu.cn>
---

 include/linux/radix-tree.h |   78 +++++++++++++++++++++++++++++++
 lib/radix-tree.c           |  110 ++++++++++++++++++++++++++++++++-------------
 2 files changed, 156 insertions(+), 32 deletions(-)

--- linux-2.6.16-rc6-mm2.orig/include/linux/radix-tree.h
+++ linux-2.6.16-rc6-mm2/include/linux/radix-tree.h
@@ -23,12 +23,24 @@
 #include <linux/preempt.h>
 #include <linux/types.h>
 
+#define RADIX_TREE_MAP_SHIFT	6
+#define RADIX_TREE_MAP_SIZE	(1UL << RADIX_TREE_MAP_SHIFT)
+#define RADIX_TREE_MAP_MASK	(RADIX_TREE_MAP_SIZE-1)
+
 struct radix_tree_root {
 	unsigned int		height;
 	gfp_t			gfp_mask;
 	struct radix_tree_node	*rnode;
 };
 
+/*
+ * Lookaside cache to support access patterns with strong locality.
+ */
+struct radix_tree_cache {
+	unsigned long first_index;
+	struct radix_tree_node *tree_node;
+};
+
 #define RADIX_TREE_INIT(mask)	{					\
 	.height = 0,							\
 	.gfp_mask = (mask),						\
@@ -48,9 +60,14 @@ do {									\
 #define RADIX_TREE_MAX_TAGS 2
 
 int radix_tree_insert(struct radix_tree_root *, unsigned long, void *);
-void *radix_tree_lookup(struct radix_tree_root *, unsigned long);
-void **radix_tree_lookup_slot(struct radix_tree_root *, unsigned long);
+void *radix_tree_lookup_node(struct radix_tree_root *, unsigned long,
+							unsigned int);
+void **radix_tree_lookup_slot(struct radix_tree_root *root, unsigned long);
 void *radix_tree_delete(struct radix_tree_root *, unsigned long);
+unsigned int radix_tree_cache_count(struct radix_tree_cache *cache);
+void *radix_tree_cache_lookup_node(struct radix_tree_root *root,
+				struct radix_tree_cache *cache,
+				unsigned long index, unsigned int level);
 unsigned int
 radix_tree_gang_lookup(struct radix_tree_root *root, void **results,
 			unsigned long first_index, unsigned int max_items);
@@ -73,4 +90,61 @@ static inline void radix_tree_preload_en
 	preempt_enable();
 }
 
+/**
+ *	radix_tree_lookup    -    perform lookup operation on a radix tree
+ *	@root:		radix tree root
+ *	@index:		index key
+ *
+ *	Lookup the item at the position @index in the radix tree @root.
+ */
+static inline void *radix_tree_lookup(struct radix_tree_root *root,
+							unsigned long index)
+{
+	return radix_tree_lookup_node(root, index, 0);
+}
+
+/**
+ *	radix_tree_cache_init    -    init a look-aside cache
+ *	@cache:		look-aside cache
+ *
+ *	Init the radix tree look-aside cache @cache.
+ */
+static inline void radix_tree_cache_init(struct radix_tree_cache *cache)
+{
+	cache->first_index = RADIX_TREE_MAP_MASK;
+	cache->tree_node = NULL;
+}
+
+/**
+ *	radix_tree_cache_lookup    -    cached lookup on a radix tree
+ *	@root:		radix tree root
+ *	@cache:		look-aside cache
+ *	@index:		index key
+ *
+ *	Lookup the item at the position @index in the radix tree @root,
+ *	and make use of @cache to speedup the lookup process.
+ */
+static inline void *radix_tree_cache_lookup(struct radix_tree_root *root,
+						struct radix_tree_cache *cache,
+						unsigned long index)
+{
+	return radix_tree_cache_lookup_node(root, cache, index, 0);
+}
+
+static inline unsigned int radix_tree_cache_size(struct radix_tree_cache *cache)
+{
+	return RADIX_TREE_MAP_SIZE;
+}
+
+static inline int radix_tree_cache_full(struct radix_tree_cache *cache)
+{
+	return radix_tree_cache_count(cache) == radix_tree_cache_size(cache);
+}
+
+static inline unsigned long
+radix_tree_cache_first_index(struct radix_tree_cache *cache)
+{
+	return cache->first_index;
+}
+
 #endif /* _LINUX_RADIX_TREE_H */
--- linux-2.6.16-rc6-mm2.orig/lib/radix-tree.c
+++ linux-2.6.16-rc6-mm2/lib/radix-tree.c
@@ -32,15 +32,6 @@
 #include <linux/bitops.h>
 
 
-#ifdef __KERNEL__
-#define RADIX_TREE_MAP_SHIFT	6
-#else
-#define RADIX_TREE_MAP_SHIFT	3	/* For more stressful testing */
-#endif
-
-#define RADIX_TREE_MAP_SIZE	(1UL << RADIX_TREE_MAP_SHIFT)
-#define RADIX_TREE_MAP_MASK	(RADIX_TREE_MAP_SIZE-1)
-
 #define RADIX_TREE_TAG_LONGS	\
 	((RADIX_TREE_MAP_SIZE + BITS_PER_LONG - 1) / BITS_PER_LONG)
 
@@ -289,32 +280,89 @@ int radix_tree_insert(struct radix_tree_
 }
 EXPORT_SYMBOL(radix_tree_insert);
 
-static inline void **__lookup_slot(struct radix_tree_root *root,
-				   unsigned long index)
+/**
+ *	radix_tree_lookup_node    -    low level lookup routine
+ *	@root:		radix tree root
+ *	@index:		index key
+ *	@level:		stop at that many levels from the tree leaf
+ *
+ *	Lookup the item at the position @index in the radix tree @root.
+ *	The return value is:
+ *	@level == 0:      page at @index;
+ *	@level == 1:      the corresponding bottom level tree node;
+ *	@level < height:  (@level-1)th parent node of the bottom node
+ *			  that contains @index;
+ *	@level >= height: the root node.
+ */
+void *radix_tree_lookup_node(struct radix_tree_root *root,
+				unsigned long index, unsigned int level)
 {
 	unsigned int height, shift;
-	struct radix_tree_node **slot;
+	struct radix_tree_node *slot;
 
 	height = root->height;
 	if (index > radix_tree_maxindex(height))
 		return NULL;
 
 	shift = (height-1) * RADIX_TREE_MAP_SHIFT;
-	slot = &root->rnode;
+	slot = root->rnode;
 
-	while (height > 0) {
-		if (*slot == NULL)
+	while (height > level) {
+		if (slot == NULL)
 			return NULL;
 
-		slot = (struct radix_tree_node **)
-			((*slot)->slots +
-				((index >> shift) & RADIX_TREE_MAP_MASK));
+		slot = slot->slots[(index >> shift) & RADIX_TREE_MAP_MASK];
 		shift -= RADIX_TREE_MAP_SHIFT;
 		height--;
 	}
 
-	return (void **)slot;
+	return slot;
+}
+EXPORT_SYMBOL(radix_tree_lookup_node);
+
+/**
+ *	radix_tree_cache_lookup_node    -    cached lookup node
+ *	@root:		radix tree root
+ *	@cache:		look-aside cache
+ *	@index:		index key
+ *
+ *	Lookup the item at the position @index in the radix tree @root,
+ *	and return the node @level levels from the bottom in the search path.
+ *
+ *	@cache stores the last accessed upper level tree node by this
+ *	function, and is always checked first before searching in the tree.
+ *	It can improve speed for access patterns with strong locality.
+ *
+ *	NOTE:
+ *	- The cache becomes invalid on leaving the lock;
+ *	- Do not intermix calls with different @level.
+ */
+void *radix_tree_cache_lookup_node(struct radix_tree_root *root,
+				struct radix_tree_cache *cache,
+				unsigned long index, unsigned int level)
+{
+	struct radix_tree_node *node;
+        unsigned long i;
+        unsigned long mask;
+
+        if (level >= root->height)
+                return root->rnode;
+
+        i = ((index >> (level * RADIX_TREE_MAP_SHIFT)) & RADIX_TREE_MAP_MASK);
+        mask = ~((RADIX_TREE_MAP_SIZE << (level * RADIX_TREE_MAP_SHIFT)) - 1);
+
+	if ((index & mask) == cache->first_index)
+                return cache->tree_node->slots[i];
+
+	node = radix_tree_lookup_node(root, index, level + 1);
+	if (!node)
+		return 0;
+
+	cache->tree_node = node;
+	cache->first_index = (index & mask);
+        return node->slots[i];
 }
+EXPORT_SYMBOL(radix_tree_cache_lookup_node);
 
 /**
  *	radix_tree_lookup_slot    -    lookup a slot in a radix tree
@@ -326,25 +374,27 @@ static inline void **__lookup_slot(struc
  */
 void **radix_tree_lookup_slot(struct radix_tree_root *root, unsigned long index)
 {
-	return __lookup_slot(root, index);
+	struct radix_tree_node *node;
+
+	node = radix_tree_lookup_node(root, index, 1);
+	return node->slots + (index & RADIX_TREE_MAP_MASK);
 }
 EXPORT_SYMBOL(radix_tree_lookup_slot);
 
 /**
- *	radix_tree_lookup    -    perform lookup operation on a radix tree
- *	@root:		radix tree root
- *	@index:		index key
+ *	radix_tree_cache_count    -    items in the cached node
+ *	@cache:      radix tree look-aside cache
  *
- *	Lookup the item at the position @index in the radix tree @root.
+ *      Query the number of items contained in the cached node.
  */
-void *radix_tree_lookup(struct radix_tree_root *root, unsigned long index)
+unsigned int radix_tree_cache_count(struct radix_tree_cache *cache)
 {
-	void **slot;
-
-	slot = __lookup_slot(root, index);
-	return slot != NULL ? *slot : NULL;
+	if (!(cache->first_index & RADIX_TREE_MAP_MASK))
+		return cache->tree_node->count;
+	else
+		return 0;
 }
-EXPORT_SYMBOL(radix_tree_lookup);
+EXPORT_SYMBOL(radix_tree_cache_count);
 
 /**
  *	radix_tree_tag_set - set a tag on a radix tree node

--


* [PATCH 03/23] radixtree: hole scanning functions
  2006-03-19  2:34 [PATCH 00/23] Adaptive read-ahead V11 Wu Fengguang
  2006-03-19  2:34 ` [PATCH 01/23] readahead: kconfig options Wu Fengguang
  2006-03-19  2:34 ` [PATCH 02/23] radixtree: look-aside cache Wu Fengguang
@ 2006-03-19  2:34 ` Wu Fengguang
  2006-03-19  2:34 ` [PATCH 04/23] readahead: page flag PG_readahead Wu Fengguang
                   ` (20 subsequent siblings)
  23 siblings, 0 replies; 37+ messages in thread
From: Wu Fengguang @ 2006-03-19  2:34 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, Wu Fengguang

[-- Attachment #1: radixtree-hole-scanning-functions.patch --]
[-- Type: text/plain, Size: 3778 bytes --]

Introduce a pair of functions to scan radix tree for hole/empty item.

 include/linux/radix-tree.h |    4 +
 lib/radix-tree.c           |  104 +++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 108 insertions(+)

Signed-off-by: Wu Fengguang <wfg@mail.ustc.edu.cn>
---

--- linux-2.6.16-rc6-mm2.orig/include/linux/radix-tree.h
+++ linux-2.6.16-rc6-mm2/include/linux/radix-tree.h
@@ -68,6 +68,10 @@ unsigned int radix_tree_cache_count(stru
 void *radix_tree_cache_lookup_node(struct radix_tree_root *root,
 				struct radix_tree_cache *cache,
 				unsigned long index, unsigned int level);
+unsigned long radix_tree_scan_hole_backward(struct radix_tree_root *root,
+				unsigned long index, unsigned long max_scan);
+unsigned long radix_tree_scan_hole(struct radix_tree_root *root,
+				unsigned long index, unsigned long max_scan);
 unsigned int
 radix_tree_gang_lookup(struct radix_tree_root *root, void **results,
 			unsigned long first_index, unsigned int max_items);
--- linux-2.6.16-rc6-mm2.orig/lib/radix-tree.c
+++ linux-2.6.16-rc6-mm2/lib/radix-tree.c
@@ -397,6 +397,110 @@ unsigned int radix_tree_cache_count(stru
 EXPORT_SYMBOL(radix_tree_cache_count);
 
 /**
+ *	radix_tree_scan_hole_backward    -    scan backward for hole
+ *	@root:		radix tree root
+ *	@index:		index key
+ *	@max_scan:      advice on max items to scan (it may scan a little more)
+ *
+ *      Scan backward from @index for a hole/empty item, stop when
+ *      - hit hole
+ *      - @max_scan or more items scanned
+ *      - hit index 0
+ *
+ *      Return the corresponding index.
+ */
+unsigned long radix_tree_scan_hole_backward(struct radix_tree_root *root,
+				unsigned long index, unsigned long max_scan)
+{
+	struct radix_tree_cache cache;
+	struct radix_tree_node *node;
+	unsigned long origin;
+	int i;
+
+	origin = index;
+        radix_tree_cache_init(&cache);
+
+	while (origin - index < max_scan) {
+		node = radix_tree_cache_lookup_node(root, &cache, index, 1);
+		if (!node)
+			break;
+
+		if (node->count == RADIX_TREE_MAP_SIZE) {
+			index = (index - RADIX_TREE_MAP_SIZE) |
+					RADIX_TREE_MAP_MASK;
+			goto check_underflow;
+		}
+
+		for (i = index & RADIX_TREE_MAP_MASK; i >= 0; i--, index--) {
+			if (!node->slots[i])
+				goto out;
+		}
+
+check_underflow:
+		if (unlikely(index == ULONG_MAX)) {
+			index = 0;
+			break;
+		}
+	}
+
+out:
+	return index;
+}
+EXPORT_SYMBOL(radix_tree_scan_hole_backward);
+
+/**
+ *	radix_tree_scan_hole    -    scan for hole
+ *	@root:		radix tree root
+ *	@index:		index key
+ *	@max_scan:      advice on max items to scan (it may scan a little more)
+ *
+ *      Scan forward from @index for a hole/empty item, stop when
+ *      - hit hole
+ *      - hit EOF
+ *      - hit index ULONG_MAX
+ *      - @max_scan or more items scanned
+ *
+ *      Return the corresponding index.
+ */
+unsigned long radix_tree_scan_hole(struct radix_tree_root *root,
+				unsigned long index, unsigned long max_scan)
+{
+	struct radix_tree_cache cache;
+	struct radix_tree_node *node;
+	unsigned long origin;
+	int i;
+
+	origin = index;
+        radix_tree_cache_init(&cache);
+
+	while (index - origin < max_scan) {
+		node = radix_tree_cache_lookup_node(root, &cache, index, 1);
+		if (!node)
+			break;
+
+		if (node->count == RADIX_TREE_MAP_SIZE) {
+			index = (index | RADIX_TREE_MAP_MASK) + 1;
+			goto check_overflow;
+		}
+
+		for (i = index & RADIX_TREE_MAP_MASK; i < RADIX_TREE_MAP_SIZE;
+								i++, index++) {
+			if (!node->slots[i])
+				goto out;
+		}
+
+check_overflow:
+		if (unlikely(!index)) {
+			index = ULONG_MAX;
+			break;
+		}
+	}
+out:
+	return index;
+}
+EXPORT_SYMBOL(radix_tree_scan_hole);
+
+/**
  *	radix_tree_tag_set - set a tag on a radix tree node
  *	@root:		radix tree root
  *	@index:		index key

--


* [PATCH 04/23] readahead: page flag PG_readahead
  2006-03-19  2:34 [PATCH 00/23] Adaptive read-ahead V11 Wu Fengguang
                   ` (2 preceding siblings ...)
  2006-03-19  2:34 ` [PATCH 03/23] radixtree: hole scanning functions Wu Fengguang
@ 2006-03-19  2:34 ` Wu Fengguang
  2006-03-19  2:34 ` [PATCH 05/23] readahead: refactor do_generic_mapping_read() Wu Fengguang
                   ` (19 subsequent siblings)
  23 siblings, 0 replies; 37+ messages in thread
From: Wu Fengguang @ 2006-03-19  2:34 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, Wu Fengguang

[-- Attachment #1: readahead-page-flag-PG_readahead.patch --]
[-- Type: text/plain, Size: 1851 bytes --]

A new page flag PG_readahead is introduced as a look-ahead mark, which
reminds the caller to give the adaptive read-ahead logic a chance to do
read-ahead in advance for I/O pipelining.

It roughly corresponds to `ahead_start' of the stock read-ahead logic.

Signed-off-by: Wu Fengguang <wfg@mail.ustc.edu.cn>
---

 include/linux/page-flags.h |    5 +++++
 mm/page_alloc.c            |    2 +-
 2 files changed, 6 insertions(+), 1 deletion(-)

--- linux-2.6.16-rc6-mm2.orig/include/linux/page-flags.h
+++ linux-2.6.16-rc6-mm2/include/linux/page-flags.h
@@ -76,6 +76,7 @@
 #define PG_reclaim		17	/* To be reclaimed asap */
 #define PG_nosave_free		18	/* Free, should not be written */
 #define PG_uncached		19	/* Page has been mapped as uncached */
+#define PG_readahead		20	/* Reminder to do readahead */
 
 /*
  * Global page accounting.  One instance per CPU.  Only unsigned longs are
@@ -343,6 +344,10 @@ extern void __mod_page_state_offset(unsi
 #define SetPageUncached(page)	set_bit(PG_uncached, &(page)->flags)
 #define ClearPageUncached(page)	clear_bit(PG_uncached, &(page)->flags)
 
+#define PageReadahead(page)	test_bit(PG_readahead, &(page)->flags)
+#define __SetPageReadahead(page) __set_bit(PG_readahead, &(page)->flags)
+#define TestClearPageReadahead(page) test_and_clear_bit(PG_readahead, &(page)->flags)
+
 struct page;	/* forward declaration */
 
 int test_clear_page_dirty(struct page *page);
--- linux-2.6.16-rc6-mm2.orig/mm/page_alloc.c
+++ linux-2.6.16-rc6-mm2/mm/page_alloc.c
@@ -549,7 +549,7 @@ static int prep_new_page(struct page *pa
 	if (PageReserved(page))
 		return 1;
 
-	page->flags &= ~(1 << PG_uptodate | 1 << PG_error |
+	page->flags &= ~(1 << PG_uptodate | 1 << PG_error | 1 << PG_readahead |
 			1 << PG_referenced | 1 << PG_arch_1 |
 			1 << PG_checked | 1 << PG_mappedtodisk);
 	set_page_private(page, 0);

--


* [PATCH 05/23] readahead: refactor do_generic_mapping_read()
  2006-03-19  2:34 [PATCH 00/23] Adaptive read-ahead V11 Wu Fengguang
                   ` (3 preceding siblings ...)
  2006-03-19  2:34 ` [PATCH 04/23] readahead: page flag PG_readahead Wu Fengguang
@ 2006-03-19  2:34 ` Wu Fengguang
  2006-03-19  2:34 ` [PATCH 06/23] readahead: refactor __do_page_cache_readahead() Wu Fengguang
                   ` (18 subsequent siblings)
  23 siblings, 0 replies; 37+ messages in thread
From: Wu Fengguang @ 2006-03-19  2:34 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, Wu Fengguang

[-- Attachment #1: readahead-refactor-do_generic_mapping_read.patch --]
[-- Type: text/plain, Size: 2536 bytes --]

In do_generic_mapping_read(), delay the release of accessed pages, so
that they can be passed to and used by the adaptive read-ahead code.

Signed-off-by: Wu Fengguang <wfg@mail.ustc.edu.cn>
---

 mm/filemap.c |   18 ++++++++++++------
 1 files changed, 12 insertions(+), 6 deletions(-)

--- linux-2.6.16-rc6-mm2.orig/mm/filemap.c
+++ linux-2.6.16-rc6-mm2/mm/filemap.c
@@ -799,10 +799,12 @@ void do_generic_mapping_read(struct addr
 	unsigned long prev_index;
 	loff_t isize;
 	struct page *cached_page;
+	struct page *prev_page;
 	int error;
 	struct file_ra_state ra = *_ra;
 
 	cached_page = NULL;
+	prev_page = NULL;
 	index = *ppos >> PAGE_CACHE_SHIFT;
 	next_index = index;
 	prev_index = ra.prev_page;
@@ -841,6 +843,11 @@ find_page:
 			handle_ra_miss(mapping, &ra, index);
 			goto no_cached_page;
 		}
+
+		if (prev_page)
+			page_cache_release(prev_page);
+		prev_page = page;
+
 		if (!PageUptodate(page))
 			goto page_not_up_to_date;
 page_ok:
@@ -875,7 +882,6 @@ page_ok:
 		index += offset >> PAGE_CACHE_SHIFT;
 		offset &= ~PAGE_CACHE_MASK;
 
-		page_cache_release(page);
 		if (ret == nr && desc->count)
 			continue;
 		goto out;
@@ -887,7 +893,6 @@ page_not_up_to_date:
 		/* Did it get unhashed before we got the lock? */
 		if (!page->mapping) {
 			unlock_page(page);
-			page_cache_release(page);
 			continue;
 		}
 
@@ -917,7 +922,6 @@ readpage:
 					 * invalidate_inode_pages got it
 					 */
 					unlock_page(page);
-					page_cache_release(page);
 					goto find_page;
 				}
 				unlock_page(page);
@@ -938,7 +942,6 @@ readpage:
 		isize = i_size_read(inode);
 		end_index = (isize - 1) >> PAGE_CACHE_SHIFT;
 		if (unlikely(!isize || index > end_index)) {
-			page_cache_release(page);
 			goto out;
 		}
 
@@ -947,7 +950,6 @@ readpage:
 		if (index == end_index) {
 			nr = ((isize - 1) & ~PAGE_CACHE_MASK) + 1;
 			if (nr <= offset) {
-				page_cache_release(page);
 				goto out;
 			}
 		}
@@ -957,7 +959,6 @@ readpage:
 readpage_error:
 		/* UHHUH! A synchronous read error occurred. Report it */
 		desc->error = error;
-		page_cache_release(page);
 		goto out;
 
 no_cached_page:
@@ -982,6 +983,9 @@ no_cached_page:
 		}
 		page = cached_page;
 		cached_page = NULL;
+		if (prev_page)
+			page_cache_release(prev_page);
+		prev_page = page;
 		goto readpage;
 	}
 
@@ -991,6 +995,8 @@ out:
 	*ppos = ((loff_t) index << PAGE_CACHE_SHIFT) + offset;
 	if (cached_page)
 		page_cache_release(cached_page);
+	if (prev_page)
+		page_cache_release(prev_page);
 	if (filp)
 		file_accessed(filp);
 }

--


* [PATCH 06/23] readahead: refactor __do_page_cache_readahead()
  2006-03-19  2:34 [PATCH 00/23] Adaptive read-ahead V11 Wu Fengguang
                   ` (4 preceding siblings ...)
  2006-03-19  2:34 ` [PATCH 05/23] readahead: refactor do_generic_mapping_read() Wu Fengguang
@ 2006-03-19  2:34 ` Wu Fengguang
  2006-03-19  2:34 ` [PATCH 07/23] readahead: insert cond_resched() calls Wu Fengguang
                   ` (17 subsequent siblings)
  23 siblings, 0 replies; 37+ messages in thread
From: Wu Fengguang @ 2006-03-19  2:34 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, Wu Fengguang

[-- Attachment #1: readahead-refactor-__do_page_cache_readahead.patch --]
[-- Type: text/plain, Size: 2546 bytes --]

Add look-ahead support to __do_page_cache_readahead(), which is needed by
the adaptive read-ahead logic.
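
The marking rule added here can be modeled in userspace as follows (a
hypothetical sketch, not kernel code: one page per chunk gets the
PG_readahead tag, and no page is tagged when lookahead_size is 0, matching
the existing callers):

```python
def readahead_chunk(offset, nr_to_read, lookahead_size):
    """Model of the refactored __do_page_cache_readahead(): return
    (page_index, has_readahead_mark) pairs. The page at position
    nr_to_read - lookahead_size carries the PG_readahead mark that later
    triggers the next read-ahead in advance."""
    return [(offset + i, i == nr_to_read - lookahead_size)
            for i in range(nr_to_read)]
```

With lookahead_size == 0 the mark condition never matches, so the callers
converted in this patch keep their old behavior.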

Signed-off-by: Wu Fengguang <wfg@mail.ustc.edu.cn>
---

 mm/readahead.c |   15 +++++++++------
 1 files changed, 9 insertions(+), 6 deletions(-)

--- linux-2.6.16-rc6-mm2.orig/mm/readahead.c
+++ linux-2.6.16-rc6-mm2/mm/readahead.c
@@ -268,7 +268,8 @@ out:
  */
 static int
 __do_page_cache_readahead(struct address_space *mapping, struct file *filp,
-			pgoff_t offset, unsigned long nr_to_read)
+			pgoff_t offset, unsigned long nr_to_read,
+			unsigned long lookahead_size)
 {
 	struct inode *inode = mapping->host;
 	struct page *page;
@@ -281,7 +282,7 @@ __do_page_cache_readahead(struct address
 	if (isize == 0)
 		goto out;
 
- 	end_index = ((isize - 1) >> PAGE_CACHE_SHIFT);
+	end_index = ((isize - 1) >> PAGE_CACHE_SHIFT);
 
 	/*
 	 * Preallocate as many pages as we will need.
@@ -304,6 +305,8 @@ __do_page_cache_readahead(struct address
 			break;
 		page->index = page_offset;
 		list_add(&page->lru, &page_pool);
+		if (page_idx == nr_to_read - lookahead_size)
+			__SetPageReadahead(page);
 		ret++;
 	}
 	read_unlock_irq(&mapping->tree_lock);
@@ -340,7 +343,7 @@ int force_page_cache_readahead(struct ad
 		if (this_chunk > nr_to_read)
 			this_chunk = nr_to_read;
 		err = __do_page_cache_readahead(mapping, filp,
-						offset, this_chunk);
+						offset, this_chunk, 0);
 		if (err < 0) {
 			ret = err;
 			break;
@@ -387,7 +390,7 @@ int do_page_cache_readahead(struct addre
 	if (bdi_read_congested(mapping->backing_dev_info))
 		return -1;
 
-	return __do_page_cache_readahead(mapping, filp, offset, nr_to_read);
+	return __do_page_cache_readahead(mapping, filp, offset, nr_to_read, 0);
 }
 
 /*
@@ -407,7 +410,7 @@ blockable_page_cache_readahead(struct ad
 	if (!block && bdi_read_congested(mapping->backing_dev_info))
 		return 0;
 
-	actual = __do_page_cache_readahead(mapping, filp, offset, nr_to_read);
+	actual = __do_page_cache_readahead(mapping, filp, offset, nr_to_read, 0);
 
 	return check_ra_success(ra, nr_to_read, actual);
 }
@@ -452,7 +455,7 @@ static int make_ahead_window(struct addr
  * @req_size: hint: total size of the read which the caller is performing in
  *            PAGE_CACHE_SIZE units
  *
- * page_cache_readahead() is the main function.  If performs the adaptive
+ * page_cache_readahead() is the main function.  It performs the adaptive
  * readahead window size management and submits the readahead I/O.
  *
  * Note that @filp is purely used for passing on to the ->readpage[s]()

--


* [PATCH 07/23] readahead: insert cond_resched() calls
  2006-03-19  2:34 [PATCH 00/23] Adaptive read-ahead V11 Wu Fengguang
                   ` (5 preceding siblings ...)
  2006-03-19  2:34 ` [PATCH 06/23] readahead: refactor __do_page_cache_readahead() Wu Fengguang
@ 2006-03-19  2:34 ` Wu Fengguang
  2006-03-19  3:50   ` Lee Revell
  2006-03-19  2:34 ` [PATCH 08/23] readahead: common macros Wu Fengguang
                   ` (16 subsequent siblings)
  23 siblings, 1 reply; 37+ messages in thread
From: Wu Fengguang @ 2006-03-19  2:34 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, Con Kolivas, Wu Fengguang

[-- Attachment #1: readahead-insert-cond_resched-calls.patch --]
[-- Type: text/plain, Size: 2171 bytes --]

Since VM_MAX_READAHEAD is greatly enlarged and the algorithm has become
more complex, it becomes necessary to insert some cond_resched() calls in
the read-ahead path.

If desktop users still experience audio jitter with the new read-ahead
code, please try one of the following ways to get rid of it:

1) compile kernel with CONFIG_PREEMPT_VOLUNTARY/CONFIG_PREEMPT
2) reduce the read-ahead request size by running
	blockdev --setra 256 /dev/hda # or whatever device you are using

Signed-off-by: Wu Fengguang <wfg@mail.ustc.edu.cn>
---


This patch was recommended by Con Kolivas to improve response time on the desktop.
Thanks!

 fs/mpage.c     |    4 +++-
 mm/readahead.c |    9 +++++++--
 2 files changed, 10 insertions(+), 3 deletions(-)

--- linux-2.6.16-rc6-mm2.orig/mm/readahead.c
+++ linux-2.6.16-rc6-mm2/mm/readahead.c
@@ -145,8 +145,10 @@ int read_cache_pages(struct address_spac
 			continue;
 		}
 		ret = filler(data, page);
-		if (!pagevec_add(&lru_pvec, page))
+		if (!pagevec_add(&lru_pvec, page)) {
+			cond_resched();
 			__pagevec_lru_add(&lru_pvec);
+		}
 		if (ret) {
 			while (!list_empty(pages)) {
 				struct page *victim;
@@ -184,8 +186,10 @@ static int read_pages(struct address_spa
 					page->index, GFP_KERNEL)) {
 			ret = mapping->a_ops->readpage(filp, page);
 			if (ret != AOP_TRUNCATED_PAGE) {
-				if (!pagevec_add(&lru_pvec, page))
+				if (!pagevec_add(&lru_pvec, page)) {
+					cond_resched();
 					__pagevec_lru_add(&lru_pvec);
+				}
 				continue;
 			} /* else fall through to release */
 		}
@@ -299,6 +303,7 @@ __do_page_cache_readahead(struct address
 			continue;
 
 		read_unlock_irq(&mapping->tree_lock);
+		cond_resched();
 		page = page_cache_alloc_cold(mapping);
 		read_lock_irq(&mapping->tree_lock);
 		if (!page)
--- linux-2.6.16-rc6-mm2.orig/fs/mpage.c
+++ linux-2.6.16-rc6-mm2/fs/mpage.c
@@ -407,8 +407,10 @@ mpage_readpages(struct address_space *ma
 					&last_block_in_bio, &map_bh,
 					&first_logical_block,
 					get_block);
-			if (!pagevec_add(&lru_pvec, page))
+			if (!pagevec_add(&lru_pvec, page)) {
+				cond_resched();
 				__pagevec_lru_add(&lru_pvec);
+			}
 		} else {
 			page_cache_release(page);
 		}

--


* [PATCH 08/23] readahead: common macros
  2006-03-19  2:34 [PATCH 00/23] Adaptive read-ahead V11 Wu Fengguang
                   ` (6 preceding siblings ...)
  2006-03-19  2:34 ` [PATCH 07/23] readahead: insert cond_resched() calls Wu Fengguang
@ 2006-03-19  2:34 ` Wu Fengguang
  2006-03-19  2:34 ` [PATCH 09/23] readahead: events accounting Wu Fengguang
                   ` (15 subsequent siblings)
  23 siblings, 0 replies; 37+ messages in thread
From: Wu Fengguang @ 2006-03-19  2:34 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, Wu Fengguang

[-- Attachment #1: readahead-common-macros.patch --]
[-- Type: text/plain, Size: 1556 bytes --]

Define some commonly used macros for the read-ahead logic.
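
The unit conversion performed by the KB() macro below can be sketched in
userspace like this (a hypothetical model assuming a 4096-byte
PAGE_CACHE_SIZE):

```python
PAGE_CACHE_SIZE = 4096  # assumed page size in bytes

def kb_to_pages(size_kb):
    """The KB() macro: round size_kb kilobytes up to whole page-cache
    pages, so e.g. VM_MIN_READAHEAD=16 becomes 4 pages."""
    return (size_kb * 1024 + PAGE_CACHE_SIZE - 1) // PAGE_CACHE_SIZE
```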

Signed-off-by: Wu Fengguang <wfg@mail.ustc.edu.cn>
---

 mm/readahead.c |   15 +++++++++++++--
 1 files changed, 13 insertions(+), 2 deletions(-)

--- linux-2.6.16-rc6-mm2.orig/mm/readahead.c
+++ linux-2.6.16-rc6-mm2/mm/readahead.c
@@ -14,6 +14,17 @@
 #include <linux/blkdev.h>
 #include <linux/backing-dev.h>
 #include <linux/pagevec.h>
+#include <linux/writeback.h>
+#include <linux/nfsd/const.h>
+
+/* The default max/min read-ahead pages. */
+#define KB(size)	(((size)*1024 + PAGE_CACHE_SIZE-1) / PAGE_CACHE_SIZE)
+#define MAX_RA_PAGES	KB(VM_MAX_READAHEAD)
+#define MIN_RA_PAGES	KB(VM_MIN_READAHEAD)
+#define MIN_NFSD_PAGES	KB(NFSSVC_MAXBLKSIZE/1024)
+
+#define next_page(pg) (list_entry((pg)->lru.prev, struct page, lru))
+#define prev_page(pg) (list_entry((pg)->lru.next, struct page, lru))
 
 void default_unplug_io_fn(struct backing_dev_info *bdi, struct page *page)
 {
@@ -21,7 +32,7 @@ void default_unplug_io_fn(struct backing
 EXPORT_SYMBOL(default_unplug_io_fn);
 
 struct backing_dev_info default_backing_dev_info = {
-	.ra_pages	= (VM_MAX_READAHEAD * 1024) / PAGE_CACHE_SIZE,
+	.ra_pages	= MAX_RA_PAGES,
 	.state		= 0,
 	.capabilities	= BDI_CAP_MAP_COPY,
 	.unplug_io_fn	= default_unplug_io_fn,
@@ -49,7 +60,7 @@ static inline unsigned long get_max_read
 
 static inline unsigned long get_min_readahead(struct file_ra_state *ra)
 {
-	return (VM_MIN_READAHEAD * 1024) / PAGE_CACHE_SIZE;
+	return MIN_RA_PAGES;
 }
 
 static inline void reset_ahead_window(struct file_ra_state *ra)

--


* [PATCH 09/23] readahead: events accounting
  2006-03-19  2:34 [PATCH 00/23] Adaptive read-ahead V11 Wu Fengguang
                   ` (7 preceding siblings ...)
  2006-03-19  2:34 ` [PATCH 08/23] readahead: common macros Wu Fengguang
@ 2006-03-19  2:34 ` Wu Fengguang
  2006-03-19  2:34 ` [PATCH 10/23] readahead: support functions Wu Fengguang
                   ` (14 subsequent siblings)
  23 siblings, 0 replies; 37+ messages in thread
From: Wu Fengguang @ 2006-03-19  2:34 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, Jörn Engel, Ingo Oeser

[-- Attachment #1: readahead-events-accounting.patch --]
[-- Type: text/plain, Size: 9539 bytes --]

A debugfs file named `readahead/events' is created based on advice from
Jörn Engel, Andrew Morton and Ingo Oeser.

It reveals various read-ahead activities/events, and is vital for testing.

---------------------------
If you are experiencing performance problems, or want to help improve the
read-ahead logic, please send me the debug data. Thanks.

- Preparations

## First compile kernel with CONFIG_DEBUG_READAHEAD
mkdir /debug
mount -t debug none /debug

- For each session with distinct access pattern

echo > /debug/readahead/events # reset the counters
# echo > /var/log/kern.log # you may want to backup it first
# echo 2 > /debug/readahead/debug_level # show verbose printk traces
## do one benchmark/task
# echo 0 > /debug/readahead/debug_level # revert to normal value
cp /debug/readahead/events readahead-events-`date +'%F_%R'`
# bzip2 -c /var/log/kern.log > kern.log-`date +'%F_%R'`.bz2

The commented-out commands can uncover more detailed file accesses,
which are sometimes useful. Note that the log file can grow huge!

Signed-off-by: Wu Fengguang <wfg@mail.ustc.edu.cn>
---

 mm/readahead.c |  270 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++-
 1 files changed, 269 insertions(+), 1 deletion(-)

--- linux-2.6.16-rc6-mm2.orig/mm/readahead.c
+++ linux-2.6.16-rc6-mm2/mm/readahead.c
@@ -26,6 +26,261 @@
 #define next_page(pg) (list_entry((pg)->lru.prev, struct page, lru))
 #define prev_page(pg) (list_entry((pg)->lru.next, struct page, lru))
 
+#ifdef CONFIG_ADAPTIVE_READAHEAD
+/*
+ * Detailed classification of read-ahead behaviors.
+ */
+#define RA_CLASS_SHIFT 4
+#define RA_CLASS_MASK  ((1 << RA_CLASS_SHIFT) - 1)
+enum ra_class {
+	RA_CLASS_ALL,
+	RA_CLASS_NEWFILE,
+	RA_CLASS_STATE,
+	RA_CLASS_CONTEXT,
+	RA_CLASS_CONTEXT_AGGRESSIVE,
+	RA_CLASS_BACKWARD,
+	RA_CLASS_THRASHING,
+	RA_CLASS_SEEK,
+	RA_CLASS_END,
+};
+#endif /* CONFIG_ADAPTIVE_READAHEAD */
+
+/*
+ * Read-ahead events accounting.
+ */
+#ifdef CONFIG_DEBUG_READAHEAD
+#include <linux/init.h>
+#include <linux/jiffies.h>
+#include <linux/debugfs.h>
+#include <linux/seq_file.h>
+
+/* Read-ahead events to be accounted. */
+enum ra_event {
+	RA_EVENT_CACHE_MISS,		/* read cache misses */
+	RA_EVENT_READRANDOM,		/* random reads */
+	RA_EVENT_IO_CONGESTION,		/* io congestion */
+	RA_EVENT_IO_CACHE_HIT,		/* canceled io due to cache hit */
+	RA_EVENT_IO_BLOCK,		/* read on locked page */
+
+	RA_EVENT_READAHEAD,		/* read-ahead issued */
+	RA_EVENT_READAHEAD_HIT,		/* read-ahead page hit */
+	RA_EVENT_LOOKAHEAD,		/* look-ahead issued */
+	RA_EVENT_LOOKAHEAD_HIT,		/* look-ahead mark hit */
+	RA_EVENT_LOOKAHEAD_NOACTION,	/* look-ahead mark ignored */
+	RA_EVENT_READAHEAD_MMAP,	/* read-ahead for memory mapped file */
+	RA_EVENT_READAHEAD_EOF,		/* read-ahead reaches EOF */
+	RA_EVENT_READAHEAD_SHRINK,	/* ra_size under previous la_size */
+	RA_EVENT_READAHEAD_THRASHING,	/* read-ahead thrashing happened */
+	RA_EVENT_READAHEAD_MUTILATE,	/* read-ahead request mutilated */
+	RA_EVENT_READAHEAD_RESCUE,	/* read-ahead rescued */
+
+	RA_EVENT_END
+};
+
+static const char * const ra_event_name[] = {
+	"cache_miss",
+	"read_random",
+	"io_congestion",
+	"io_cache_hit",
+	"io_block",
+	"readahead",
+	"readahead_hit",
+	"lookahead",
+	"lookahead_hit",
+	"lookahead_ignore",
+	"readahead_mmap",
+	"readahead_eof",
+	"readahead_shrink",
+	"readahead_thrash",
+	"readahead_mutilt",
+	"readahead_rescue",
+};
+
+static const char * const ra_class_name[] = {
+	"total",
+	"newfile",
+	"state",
+	"context",
+	"contexta",
+	"backward",
+	"onthrash",
+	"onraseek",
+	"none",
+};
+
+static unsigned long ra_events[RA_CLASS_END+1][RA_EVENT_END+1][2];
+
+static inline void ra_account(struct file_ra_state *ra,
+					enum ra_event e, int pages)
+{
+	enum ra_class c;
+
+	if (e == RA_EVENT_READAHEAD_HIT && pages < 0) {
+		c = (ra->flags >> RA_CLASS_SHIFT) & RA_CLASS_MASK;
+		pages = -pages;
+	} else if (ra)
+		c = ra->flags & RA_CLASS_MASK;
+	else
+		c = RA_CLASS_END;
+
+	if (!c)
+		c = RA_CLASS_END;
+
+	ra_events[c][e][0] += 1;
+	ra_events[c][e][1] += pages;
+
+	if (e == RA_EVENT_READAHEAD)
+		ra_events[c][RA_EVENT_END][1] += pages * pages;
+}
+
+static int ra_events_show(struct seq_file *s, void *_)
+{
+	int i;
+	int c;
+	int e;
+	static const char event_fmt[] = "%-16s";
+	static const char class_fmt[] = "%10s";
+	static const char item_fmt[] = "%10lu";
+	static const char percent_format[] = "%9lu%%";
+	static const char * const table_name[] = {
+		"[table requests]",
+		"[table pages]",
+		"[table summary]"};
+
+	for (i = 0; i <= 1; i++) {
+		for (e = 0; e <= RA_EVENT_END; e++) {
+			ra_events[0][e][i] = 0;
+			for (c = 1; c < RA_CLASS_END; c++)
+				ra_events[0][e][i] += ra_events[c][e][i];
+		}
+
+		seq_printf(s, event_fmt, table_name[i]);
+		for (c = 0; c <= RA_CLASS_END; c++)
+			seq_printf(s, class_fmt, ra_class_name[c]);
+		seq_puts(s, "\n");
+
+		for (e = 0; e < RA_EVENT_END; e++) {
+			if (e == RA_EVENT_READAHEAD_HIT && i == 0)
+				continue;
+			if (e == RA_EVENT_IO_BLOCK && i == 1)
+				continue;
+
+			seq_printf(s, event_fmt, ra_event_name[e]);
+			for (c = 0; c <= RA_CLASS_END; c++)
+				seq_printf(s, item_fmt, ra_events[c][e][i]);
+			seq_puts(s, "\n");
+		}
+		seq_puts(s, "\n");
+	}
+
+	seq_printf(s, event_fmt, table_name[2]);
+	for (c = 0; c <= RA_CLASS_END; c++)
+		seq_printf(s, class_fmt, ra_class_name[c]);
+	seq_puts(s, "\n");
+
+	seq_printf(s, event_fmt, "random_rate");
+	for (c = 0; c <= RA_CLASS_END; c++)
+		seq_printf(s, percent_format,
+			(ra_events[c][RA_EVENT_READRANDOM][0] * 100) /
+			((ra_events[c][RA_EVENT_READRANDOM][0] +
+			  ra_events[c][RA_EVENT_READAHEAD][0]) | 1));
+	seq_puts(s, "\n");
+
+	seq_printf(s, event_fmt, "ra_hit_rate");
+	for (c = 0; c <= RA_CLASS_END; c++)
+		seq_printf(s, percent_format,
+			(ra_events[c][RA_EVENT_READAHEAD_HIT][1] * 100) /
+			(ra_events[c][RA_EVENT_READAHEAD][1] | 1));
+	seq_puts(s, "\n");
+
+	seq_printf(s, event_fmt, "la_hit_rate");
+	for (c = 0; c <= RA_CLASS_END; c++)
+		seq_printf(s, percent_format,
+			(ra_events[c][RA_EVENT_LOOKAHEAD_HIT][0] * 100) /
+			(ra_events[c][RA_EVENT_LOOKAHEAD][0] | 1));
+	seq_puts(s, "\n");
+
+	seq_printf(s, event_fmt, "var_ra_size");
+	for (c = 0; c <= RA_CLASS_END; c++)
+		seq_printf(s, item_fmt,
+			(ra_events[c][RA_EVENT_END][1] -
+			 ra_events[c][RA_EVENT_READAHEAD][1] *
+			(ra_events[c][RA_EVENT_READAHEAD][1] /
+			(ra_events[c][RA_EVENT_READAHEAD][0] | 1))) /
+			(ra_events[c][RA_EVENT_READAHEAD][0] | 1));
+	seq_puts(s, "\n");
+
+	seq_printf(s, event_fmt, "avg_ra_size");
+	for (c = 0; c <= RA_CLASS_END; c++)
+		seq_printf(s, item_fmt,
+			(ra_events[c][RA_EVENT_READAHEAD][1] +
+			 ra_events[c][RA_EVENT_READAHEAD][0] / 2) /
+			(ra_events[c][RA_EVENT_READAHEAD][0] | 1));
+	seq_puts(s, "\n");
+
+	seq_printf(s, event_fmt, "avg_la_size");
+	for (c = 0; c <= RA_CLASS_END; c++)
+		seq_printf(s, item_fmt,
+			(ra_events[c][RA_EVENT_LOOKAHEAD][1] +
+			 ra_events[c][RA_EVENT_LOOKAHEAD][0] / 2) /
+			(ra_events[c][RA_EVENT_LOOKAHEAD][0] | 1));
+	seq_puts(s, "\n");
+
+	return 0;
+}
+
+static int ra_events_open(struct inode *inode, struct file *file)
+{
+	return single_open(file, ra_events_show, NULL);
+}
+
+static ssize_t ra_events_write(struct file *file, const char __user *buf,
+						size_t size, loff_t *offset)
+{
+	memset(ra_events, 0, sizeof(ra_events));
+	return 1;
+}
+
+struct file_operations ra_events_fops = {
+	.owner		= THIS_MODULE,
+	.open		= ra_events_open,
+	.write		= ra_events_write,
+	.read		= seq_read,
+	.llseek		= seq_lseek,
+	.release	= single_release,
+};
+
+u32 readahead_debug_level = 0;
+u32 disable_stateful_method = 0;
+
+static int __init readahead_init(void)
+{
+	struct dentry *root;
+
+	root = debugfs_create_dir("readahead", NULL);
+
+	debugfs_create_file("events", 0644, root, NULL, &ra_events_fops);
+
+	debugfs_create_u32("debug_level", 0644, root, &readahead_debug_level);
+	debugfs_create_bool("disable_stateful_method", 0644, root,
+			    &disable_stateful_method);
+
+	return 0;
+}
+
+module_init(readahead_init)
+#else
+#define ra_account(ra, e, pages)	do { } while (0)
+#define readahead_debug_level 		(0)
+#define disable_stateful_method		(0)
+#endif /* CONFIG_DEBUG_READAHEAD */
+
+#define dprintk(args...) \
+	do { if (readahead_debug_level >= 1) printk(KERN_DEBUG args); } while(0)
+#define ddprintk(args...) \
+	do { if (readahead_debug_level >= 2) printk(KERN_DEBUG args); } while(0)
+
+
 void default_unplug_io_fn(struct backing_dev_info *bdi, struct page *page)
 {
 }
@@ -368,6 +623,9 @@ int force_page_cache_readahead(struct ad
 		offset += this_chunk;
 		nr_to_read -= this_chunk;
 	}
+
+	ra_account(NULL, RA_EVENT_READAHEAD, ret);
+
 	return ret;
 }
 
@@ -403,10 +661,16 @@ static inline int check_ra_success(struc
 int do_page_cache_readahead(struct address_space *mapping, struct file *filp,
 			pgoff_t offset, unsigned long nr_to_read)
 {
+	unsigned long ret;
+
 	if (bdi_read_congested(mapping->backing_dev_info))
 		return -1;
 
-	return __do_page_cache_readahead(mapping, filp, offset, nr_to_read, 0);
+	ret = __do_page_cache_readahead(mapping, filp, offset, nr_to_read, 0);
+
+	ra_account(NULL, RA_EVENT_READAHEAD, ret);
+
+	return ret;
 }
 
 /*
@@ -428,6 +692,10 @@ blockable_page_cache_readahead(struct ad
 
 	actual = __do_page_cache_readahead(mapping, filp, offset, nr_to_read, 0);
 
+	ra_account(NULL, RA_EVENT_READAHEAD, actual);
+	dprintk("blockable-readahead(ino=%lu, ra=%lu+%lu) = %d\n",
+			mapping->host->i_ino, offset, nr_to_read, actual);
+
 	return check_ra_success(ra, nr_to_read, actual);
 }
 

--


* [PATCH 10/23] readahead: support functions
  2006-03-19  2:34 [PATCH 00/23] Adaptive read-ahead V11 Wu Fengguang
                   ` (8 preceding siblings ...)
  2006-03-19  2:34 ` [PATCH 09/23] readahead: events accounting Wu Fengguang
@ 2006-03-19  2:34 ` Wu Fengguang
  2006-03-19  2:34 ` [PATCH 11/23] readahead: sysctl parameters Wu Fengguang
                   ` (13 subsequent siblings)
  23 siblings, 0 replies; 37+ messages in thread
From: Wu Fengguang @ 2006-03-19  2:34 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, Wu Fengguang

[-- Attachment #1: readahead-support-functions.patch --]
[-- Type: text/plain, Size: 4649 bytes --]

Several support functions for adaptive read-ahead.
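
One of them, rescue_pages(), moves read-ahead pages in danger of thrashing
back to the head of the inactive list, where reclaim (which scans from the
tail) will reach them last. A minimal userspace model of that idea (names
and the list representation are assumptions, not the kernel data
structures):

```python
def rescue_pages(inactive_list, wanted):
    """Move the entries in `wanted` to the head of the inactive list,
    preserving their relative order; return the new list and how many
    entries were rescued."""
    rescued = [p for p in inactive_list if p in wanted]
    kept = [p for p in inactive_list if p not in wanted]
    return rescued + kept, len(rescued)
```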

Signed-off-by: Wu Fengguang <wfg@mail.ustc.edu.cn>
---

 include/linux/mm.h |   11 +++++
 mm/readahead.c     |  116 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 127 insertions(+)

--- linux-2.6.16-rc6-mm2.orig/include/linux/mm.h
+++ linux-2.6.16-rc6-mm2/include/linux/mm.h
@@ -1016,6 +1016,17 @@ void handle_ra_miss(struct address_space
 		    struct file_ra_state *ra, pgoff_t offset);
 unsigned long max_sane_readahead(unsigned long nr);
 
+#ifdef CONFIG_ADAPTIVE_READAHEAD
+extern int readahead_ratio;
+#else
+#define readahead_ratio 1
+#endif /* CONFIG_ADAPTIVE_READAHEAD */
+
+static inline int prefer_adaptive_readahead(void)
+{
+	return readahead_ratio >= 10;
+}
+
 /* Do stack extension */
 extern int expand_stack(struct vm_area_struct *vma, unsigned long address);
 #ifdef CONFIG_IA64
--- linux-2.6.16-rc6-mm2.orig/mm/readahead.c
+++ linux-2.6.16-rc6-mm2/mm/readahead.c
@@ -875,3 +875,119 @@ unsigned long max_sane_readahead(unsigne
 	__get_zone_counts(&active, &inactive, &free, NODE_DATA(numa_node_id()));
 	return min(nr, (inactive + free) / 2);
 }
+
+/*
+ * Adaptive read-ahead.
+ *
+ * Good read patterns are compact both in space and time. The read-ahead logic
+ * tries to grant larger read-ahead size to better readers under the constraint
+ * of system memory and load pressure.
+ *
+ * It employs two methods to estimate the max thrashing safe read-ahead size:
+ *   1. state based   - the default one
+ *   2. context based - the failsafe one
+ * The integration of the dual methods has the merit of being agile and robust.
+ * It makes the overall design clean: special cases are handled in general by
+ * the stateless method, leaving the stateful one simple and fast.
+ *
+ * To improve throughput and decrease read delay, the logic 'looks ahead'.
+ * In most read-ahead chunks, one page will be selected and tagged with
+ * PG_readahead. Later when the page with PG_readahead is read, the logic
+ * will be notified to submit the next read-ahead chunk in advance.
+ *
+ *                 a read-ahead chunk
+ *    +-----------------------------------------+
+ *    |       # PG_readahead                    |
+ *    +-----------------------------------------+
+ *            ^ When this page is read, notify me for the next read-ahead.
+ *
+ *
+ * Here are some variable names used frequently:
+ *
+ *                                   |<------- la_size ------>|
+ *                  +-----------------------------------------+
+ *                  |                #                        |
+ *                  +-----------------------------------------+
+ *      ra_index -->|<---------------- ra_size -------------->|
+ *
+ */
+
+#ifdef CONFIG_ADAPTIVE_READAHEAD
+
+/*
+ * The nature of read-ahead allows false tests to occur occasionally.
+ * Here we just do not bother to call get_page(), it's meaningless anyway.
+ */
+static inline struct page *__find_page(struct address_space *mapping,
+							pgoff_t offset)
+{
+	return radix_tree_lookup(&mapping->page_tree, offset);
+}
+
+static inline struct page *find_page(struct address_space *mapping,
+							pgoff_t offset)
+{
+	struct page *page;
+
+	read_lock_irq(&mapping->tree_lock);
+	page = __find_page(mapping, offset);
+	read_unlock_irq(&mapping->tree_lock);
+	return page;
+}
+
+/*
+ * Move pages in danger (of thrashing) to the head of inactive_list.
+ * Not expected to happen frequently.
+ */
+static unsigned long rescue_pages(struct page *page, unsigned long nr_pages)
+{
+	int pgrescue;
+	pgoff_t index;
+	struct zone *zone;
+	struct address_space *mapping;
+
+	BUG_ON(!nr_pages || !page);
+	pgrescue = 0;
+	index = page_index(page);
+	mapping = page_mapping(page);
+
+	dprintk("rescue_pages(ino=%lu, index=%lu nr=%lu)\n",
+			mapping->host->i_ino, index, nr_pages);
+
+	for(;;) {
+		zone = page_zone(page);
+		spin_lock_irq(&zone->lru_lock);
+
+		if (!PageLRU(page))
+			goto out_unlock;
+
+		while (page_mapping(page) == mapping &&
+				page_index(page) == index) {
+			struct page *the_page = page;
+			page = next_page(page);
+			if (!PageActive(the_page) &&
+					!PageLocked(the_page) &&
+					page_count(the_page) == 1) {
+				list_move(&the_page->lru, &zone->inactive_list);
+				pgrescue++;
+			}
+			index++;
+			if (!--nr_pages)
+				goto out_unlock;
+		}
+
+		spin_unlock_irq(&zone->lru_lock);
+
+		cond_resched();
+		page = find_page(mapping, index);
+		if (!page)
+			goto out;
+	}
+out_unlock:
+	spin_unlock_irq(&zone->lru_lock);
+out:
+	ra_account(NULL, RA_EVENT_READAHEAD_RESCUE, pgrescue);
+	return nr_pages;
+}
+
+#endif /* CONFIG_ADAPTIVE_READAHEAD */

--


* [PATCH 11/23] readahead: sysctl parameters
  2006-03-19  2:34 [PATCH 00/23] Adaptive read-ahead V11 Wu Fengguang
                   ` (9 preceding siblings ...)
  2006-03-19  2:34 ` [PATCH 10/23] readahead: support functions Wu Fengguang
@ 2006-03-19  2:34 ` Wu Fengguang
  2006-03-19  2:34 ` [PATCH 12/23] readahead: min/max sizes Wu Fengguang
                   ` (12 subsequent siblings)
  23 siblings, 0 replies; 37+ messages in thread
From: Wu Fengguang @ 2006-03-19  2:34 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, Wu Fengguang

[-- Attachment #1: readahead-parameter-sysctl-variables.patch --]
[-- Type: text/plain, Size: 5040 bytes --]

Add new sysctl entries in /proc/sys/vm:

- readahead_ratio = 50
	i.e. set read-ahead size to <=50% of the thrashing threshold
- readahead_hit_rate = 2
	i.e. read-ahead hit ratio >=50% is deemed ok

readahead_ratio also provides a way to select read-ahead logic at runtime:

	condition			    action
==========================================================================
readahead_ratio == 0		disable read-ahead
readahead_ratio <= 9		select the (old) stock read-ahead logic
readahead_ratio >= 10		select the (new) adaptive read-ahead logic
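
The selection table above amounts to the following sketch (a hypothetical
userspace model; in the kernel the check is done via
prefer_adaptive_readahead()):

```python
def select_readahead_logic(readahead_ratio):
    """Mirror the readahead_ratio runtime selection table."""
    if readahead_ratio == 0:
        return "disabled"
    if readahead_ratio <= 9:
        return "stock"
    return "adaptive"   # readahead_ratio >= 10
```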

Signed-off-by: Wu Fengguang <wfg@mail.ustc.edu.cn>
---

 Documentation/sysctl/vm.txt |   36 ++++++++++++++++++++++++++++++++++++
 include/linux/sysctl.h      |    2 ++
 kernel/sysctl.c             |   28 ++++++++++++++++++++++++++++
 mm/readahead.c              |   18 ++++++++++++++++++
 4 files changed, 84 insertions(+)

--- linux-2.6.16-rc6-mm2.orig/mm/readahead.c
+++ linux-2.6.16-rc6-mm2/mm/readahead.c
@@ -28,6 +28,24 @@
 
 #ifdef CONFIG_ADAPTIVE_READAHEAD
 /*
+ * Adaptive read-ahead parameters.
+ */
+
+/* In laptop mode, poll delayed look-ahead on every ## pages read. */
+#define LAPTOP_POLL_INTERVAL 16
+
+/* Set look-ahead size to 1/# of the thrashing-threshold. */
+#define LOOKAHEAD_RATIO 8
+
+/* Set read-ahead size to ##% of the thrashing-threshold. */
+int readahead_ratio = 50;
+EXPORT_SYMBOL(readahead_ratio);
+
+/* Readahead as long as cache hit ratio keeps above 1/##. */
+int readahead_hit_rate = 2;
+EXPORT_SYMBOL(readahead_hit_rate);
+
+/*
  * Detailed classification of read-ahead behaviors.
  */
 #define RA_CLASS_SHIFT 4
--- linux-2.6.16-rc6-mm2.orig/include/linux/sysctl.h
+++ linux-2.6.16-rc6-mm2/include/linux/sysctl.h
@@ -187,6 +187,8 @@ enum
 	VM_ZONE_RECLAIM_MODE=31, /* reclaim local zone memory before going off node */
 	VM_ZONE_RECLAIM_INTERVAL=32, /* time period to wait after reclaim failure */
 	VM_SWAP_PREFETCH=33,	/* swap prefetch */
+	VM_READAHEAD_RATIO=34,	/* percent of read-ahead size to thrashing-threshold */
+	VM_READAHEAD_HIT_RATE=35, /* one accessed page legitimizes so many read-ahead pages */
 };
 
 
--- linux-2.6.16-rc6-mm2.orig/kernel/sysctl.c
+++ linux-2.6.16-rc6-mm2/kernel/sysctl.c
@@ -74,6 +74,12 @@ extern int pid_max_min, pid_max_max;
 extern int sysctl_drop_caches;
 extern int percpu_pagelist_fraction;
 
+#if defined(CONFIG_ADAPTIVE_READAHEAD)
+extern int readahead_ratio;
+extern int readahead_hit_rate;
+static int one = 1;
+#endif
+
 #if defined(CONFIG_X86_LOCAL_APIC) && defined(CONFIG_X86)
 int unknown_nmi_panic;
 extern int proc_unknown_nmi_panic(ctl_table *, int, struct file *,
@@ -926,6 +932,28 @@ static ctl_table vm_table[] = {
 		.proc_handler	= &proc_dointvec,
 	},
 #endif
+#ifdef CONFIG_ADAPTIVE_READAHEAD
+	{
+		.ctl_name	= VM_READAHEAD_RATIO,
+		.procname	= "readahead_ratio",
+		.data		= &readahead_ratio,
+		.maxlen		= sizeof(readahead_ratio),
+		.mode		= 0644,
+		.proc_handler	= &proc_dointvec,
+		.strategy	= &sysctl_intvec,
+		.extra1		= &zero,
+	},
+	{
+		.ctl_name	= VM_READAHEAD_HIT_RATE,
+		.procname	= "readahead_hit_rate",
+		.data		= &readahead_hit_rate,
+		.maxlen		= sizeof(readahead_hit_rate),
+		.mode		= 0644,
+		.proc_handler	= &proc_dointvec,
+		.strategy	= &sysctl_intvec,
+		.extra1		= &one,
+	},
+#endif
 	{ .ctl_name = 0 }
 };
 
--- linux-2.6.16-rc6-mm2.orig/Documentation/sysctl/vm.txt
+++ linux-2.6.16-rc6-mm2/Documentation/sysctl/vm.txt
@@ -30,6 +30,8 @@ Currently, these files are in /proc/sys/
 - zone_reclaim_mode
 - zone_reclaim_interval
 - swap_prefetch
+- readahead_ratio
+- readahead_hit_rate
 
 ==============================================================
 
@@ -189,3 +191,37 @@ copying back pages from swap into the sw
 practice it can take many minutes before the vm is idle enough.
 
 The default value is 1.
+
+==============================================================
+
+readahead_ratio
+
+This limits the readahead size to a percentage of the
+thrashing-threshold, which is dynamically estimated from the _historical_
+read speed and system load, to deduce the _future_ readahead request size.
+
+Set it to a smaller value if you have not enough memory for all the
+concurrent readers, or the I/O loads fluctuate a lot. But if there's
+plenty of memory(>2MB per reader), enlarge it may help speedup reads.
+
+readahead_ratio also selects the readahead logic:
+0:	disable readahead totally
+1-9:	select the stock readahead logic
+10-inf:	select the adaptive readahead logic
+
+The default value is 50; reasonable values would be 50-100.
+
+==============================================================
+
+readahead_hit_rate
+
+This is the max allowed value of (readahead-pages : accessed-pages).
+Useful only when (readahead_ratio >= 10). If the previous readahead
+request had a poor hit rate, the kernel will be reluctant to do the
+next readahead.
+
+A larger value helps catch more sparse access patterns. Be aware that
+readahead of the sparse patterns sacrifices memory for speed.
+
+The default value is 2.
+It is recommended to keep the value below (max-readahead-pages / 8).

--


* [PATCH 12/23] readahead: min/max sizes
  2006-03-19  2:34 [PATCH 00/23] Adaptive read-ahead V11 Wu Fengguang
                   ` (10 preceding siblings ...)
  2006-03-19  2:34 ` [PATCH 11/23] readahead: sysctl parameters Wu Fengguang
@ 2006-03-19  2:34 ` Wu Fengguang
  2006-03-19  2:34 ` [PATCH 13/23] readahead: page cache aging accounting Wu Fengguang
                   ` (11 subsequent siblings)
  23 siblings, 0 replies; 37+ messages in thread
From: Wu Fengguang @ 2006-03-19  2:34 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, Wu Fengguang

[-- Attachment #1: readahead-parameter-minmax-sizes.patch --]
[-- Type: text/plain, Size: 2003 bytes --]

- Enlarge VM_MAX_READAHEAD to 1024 if the new read-ahead code is compiled
  in. This value is no longer tightly coupled with the thrashing problem,
  and therefore no longer constrained by it. The adaptive read-ahead logic
  merely takes it as an upper bound, and will not stick to it under memory
  pressure.

- Slightly enlarge the minimal/initial read-ahead size on big-memory
  systems. Memory-abundant systems are less likely to suffer thrashing at
  small read-ahead sizes, and a bigger initial value helps speed up the
  growth of ra_size.
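
The resulting bounds can be sketched in userspace as follows (a
hypothetical model of the get_readahead_bounds() helper added below;
reclaimable_pages stands in for max_sane_readahead()'s (inactive + free)/2
estimate, assuming 4KB pages):

```python
PAGE_SIZE_KB = 4  # assumed page size

def kb_to_pages(size_kb):
    """Round kilobytes up to whole pages (the KB() macro)."""
    return (size_kb + PAGE_SIZE_KB - 1) // PAGE_SIZE_KB

MIN_RA_PAGES = kb_to_pages(16)   # VM_MIN_READAHEAD

def readahead_bounds(reclaimable_pages, ra_pages):
    """Model of get_readahead_bounds(): ra_max is capped by the per-file
    window, and ra_min grows slowly with the amount of cache memory."""
    ra_max = min(reclaimable_pages, 0xFFFF, ra_pages)
    ra_min = min(MIN_RA_PAGES + (reclaimable_pages >> 13),
                 kb_to_pages(128), ra_max // 2)
    return ra_min, ra_max
```

For example, with 64MB of inactive+free memory (8192 reclaimable pages)
and the default 1024KB window (256 pages), ra_min comes out as 5 pages,
i.e. 20KB, matching the table in the patch.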

Signed-off-by: Wu Fengguang <wfg@mail.ustc.edu.cn>
---

 include/linux/mm.h |    4 ++++
 mm/readahead.c     |   17 +++++++++++++++++
 2 files changed, 21 insertions(+)

--- linux-2.6.16-rc6-mm2.orig/include/linux/mm.h
+++ linux-2.6.16-rc6-mm2/include/linux/mm.h
@@ -998,7 +998,11 @@ extern int filemap_populate(struct vm_ar
 int write_one_page(struct page *page, int wait);
 
 /* readahead.c */
+#ifdef CONFIG_ADAPTIVE_READAHEAD
+#define VM_MAX_READAHEAD	1024	/* kbytes */
+#else
 #define VM_MAX_READAHEAD	128	/* kbytes */
+#endif
 #define VM_MIN_READAHEAD	16	/* kbytes (includes current page) */
 #define VM_MAX_CACHE_HIT    	256	/* max pages in a row in cache before
 					 * turning readahead off */
--- linux-2.6.16-rc6-mm2.orig/mm/readahead.c
+++ linux-2.6.16-rc6-mm2/mm/readahead.c
@@ -1008,4 +1008,21 @@ out:
 	return nr_pages;
 }
 
+/*
+ * ra_min is mainly determined by the size of cache memory.
+ * Table of concrete numbers for 4KB page size:
+ *   inactive + free (MB):    4   8   16   32   64  128  256  512 1024
+ *            ra_min (KB):   16  16   16   16   20   24   32   48   64
+ */
+static inline void get_readahead_bounds(struct file_ra_state *ra,
+					unsigned long *ra_min,
+					unsigned long *ra_max)
+{
+	unsigned long pages;
+
+	pages = max_sane_readahead(KB(1024*1024));
+	*ra_max = min(min(pages, 0xFFFFUL), ra->ra_pages);
+	*ra_min = min(min(MIN_RA_PAGES + (pages>>13), KB(128)), *ra_max/2);
+}
+
 #endif /* CONFIG_ADAPTIVE_READAHEAD */

--


* [PATCH 13/23] readahead: page cache aging accounting
  2006-03-19  2:34 [PATCH 00/23] Adaptive read-ahead V11 Wu Fengguang
                   ` (11 preceding siblings ...)
  2006-03-19  2:34 ` [PATCH 12/23] readahead: min/max sizes Wu Fengguang
@ 2006-03-19  2:34 ` Wu Fengguang
  2006-03-19  2:34 ` [PATCH 14/23] readahead: state based method Wu Fengguang
                   ` (10 subsequent siblings)
  23 siblings, 0 replies; 37+ messages in thread
From: Wu Fengguang @ 2006-03-19  2:34 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, Wu Fengguang

[-- Attachment #1: readahead-aging-accounting.patch --]
[-- Type: text/plain, Size: 4401 bytes --]

Collect info about the globally available memory and its consumption speed.
The data are used by the stateful method to estimate the thrashing threshold.

They are the decisive factor in the correctness/accuracy of the resulting
read-ahead size.

- The accounting is done on a per-node basis, since the current vm subsystem
  allocates memory in a node-affine manner.

- The readahead_aging counter is mainly increased on the first access of
  read-ahead pages, which makes it go up constantly and smoothly and helps
  improve the accuracy on small/fast read-aheads.
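The per-CPU counting scheme can be sketched in user space. A toy model only, not the kernel code: real per-CPU variables need get_cpu()/put_cpu() for preemption safety as in the patch, and the CPU-to-node mapping comes from node_to_cpumask():

```c
#include <assert.h>

#define NR_CPUS 4	/* made-up CPU count for the sketch */

/* Each CPU bumps its own slot without locking; readers sum the
 * slots of the CPUs belonging to the local node. */
static unsigned long readahead_aging[NR_CPUS];

static void inc_readahead_aging(int cpu)
{
	readahead_aging[cpu]++;		/* one page aged/consumed */
}

static unsigned long node_readahead_aging(const int *node_cpus, int n)
{
	unsigned long sum = 0;
	int i;

	for (i = 0; i < n; i++)
		sum += readahead_aging[node_cpus[i]];
	return sum;
}
```

The sum is monotonically increasing, so two readings taken at different times give the amount of aging activity in between, which is exactly how the stateful method uses it.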

Signed-off-by: Wu Fengguang <wfg@mail.ustc.edu.cn>
---

 include/linux/mm.h |    9 +++++++++
 mm/memory.c        |    1 +
 mm/readahead.c     |   51 +++++++++++++++++++++++++++++++++++++++++++++++++++
 mm/swap.c          |    2 ++
 mm/vmscan.c        |    3 +++
 5 files changed, 66 insertions(+)

--- linux-2.6.16-rc6-mm2.orig/include/linux/mm.h
+++ linux-2.6.16-rc6-mm2/include/linux/mm.h
@@ -1031,6 +1031,15 @@ static inline int prefer_adaptive_readah
 	return readahead_ratio >= 10;
 }
 
+DECLARE_PER_CPU(unsigned long, readahead_aging);
+static inline void inc_readahead_aging(void)
+{
+	if (prefer_adaptive_readahead()) {
+		per_cpu(readahead_aging, get_cpu())++;
+		put_cpu();
+	}
+}
+
 /* Do stack extension */
 extern int expand_stack(struct vm_area_struct *vma, unsigned long address);
 #ifdef CONFIG_IA64
--- linux-2.6.16-rc6-mm2.orig/mm/memory.c
+++ linux-2.6.16-rc6-mm2/mm/memory.c
@@ -1984,6 +1984,7 @@ static int do_anonymous_page(struct mm_s
 		page_table = pte_offset_map_lock(mm, pmd, address, &ptl);
 		if (!pte_none(*page_table))
 			goto release;
+		inc_readahead_aging();
 		inc_mm_counter(mm, anon_rss);
 		lru_cache_add_active(page);
 		page_add_new_anon_rmap(page, vma, address);
--- linux-2.6.16-rc6-mm2.orig/mm/vmscan.c
+++ linux-2.6.16-rc6-mm2/mm/vmscan.c
@@ -440,6 +440,9 @@ static unsigned long shrink_page_list(st
 		if (PageWriteback(page))
 			goto keep_locked;
 
+		if (!PageReferenced(page))
+			inc_readahead_aging();
+
 		referenced = page_referenced(page, 1);
 		/* In active use or really unfreeable?  Activate it. */
 		if (referenced && page_mapping_inuse(page))
--- linux-2.6.16-rc6-mm2.orig/mm/swap.c
+++ linux-2.6.16-rc6-mm2/mm/swap.c
@@ -128,6 +128,8 @@ void fastcall mark_page_accessed(struct 
 		ClearPageReferenced(page);
 	} else if (!PageReferenced(page)) {
 		SetPageReferenced(page);
+		if (PageLRU(page))
+			inc_readahead_aging();
 	}
 }
 
--- linux-2.6.16-rc6-mm2.orig/mm/readahead.c
+++ linux-2.6.16-rc6-mm2/mm/readahead.c
@@ -46,6 +46,13 @@ int readahead_hit_rate = 2;
 EXPORT_SYMBOL(readahead_hit_rate);
 
 /*
+ * Measures the aging process of cold pages.
+ * Mainly increased on fresh page references to make it smooth.
+ */
+DEFINE_PER_CPU(unsigned long, readahead_aging);
+EXPORT_PER_CPU_SYMBOL(readahead_aging);
+
+/*
  * Detailed classification of read-ahead behaviors.
  */
 #define RA_CLASS_SHIFT 4
@@ -1009,6 +1016,50 @@ out:
 }
 
 /*
+ * State based calculation of read-ahead request.
+ *
+ * This figure shows the meaning of file_ra_state members:
+ *
+ *             chunk A                            chunk B
+ *  +---------------------------+-------------------------------------------+
+ *  |             #             |                   #                       |
+ *  +---------------------------+-------------------------------------------+
+ *                ^             ^                   ^                       ^
+ *              la_index      ra_index     lookahead_index         readahead_index
+ */
+
+/*
+ * The node's effective length of inactive_list(s).
+ */
+static unsigned long node_free_and_cold_pages(void)
+{
+	unsigned int i;
+	unsigned long sum = 0;
+	struct zone *zones = NODE_DATA(numa_node_id())->node_zones;
+
+	for (i = 0; i < MAX_NR_ZONES; i++)
+		sum += zones[i].nr_inactive +
+			zones[i].free_pages - zones[i].pages_low;
+
+	return sum;
+}
+
+/*
+ * The node's accumulated aging activities.
+ */
+static unsigned long node_readahead_aging(void)
+{
+	unsigned long cpu;
+	unsigned long sum = 0;
+	cpumask_t mask = node_to_cpumask(numa_node_id());
+
+	for_each_cpu_mask(cpu, mask)
+		sum += per_cpu(readahead_aging, cpu);
+
+	return sum;
+}
+
+/*
  * ra_min is mainly determined by the size of cache memory.
  * Table of concrete numbers for 4KB page size:
  *   inactive + free (MB):    4   8   16   32   64  128  256  512 1024

--

^ permalink raw reply	[flat|nested] 37+ messages in thread

* [PATCH 14/23] readahead: state based method
  2006-03-19  2:34 [PATCH 00/23] Adaptive read-ahead V11 Wu Fengguang
                   ` (12 preceding siblings ...)
  2006-03-19  2:34 ` [PATCH 13/23] readahead: page cache aging accounting Wu Fengguang
@ 2006-03-19  2:34 ` Wu Fengguang
  2006-03-19  2:34 ` [PATCH 15/23] readahead: context " Wu Fengguang
                   ` (9 subsequent siblings)
  23 siblings, 0 replies; 37+ messages in thread
From: Wu Fengguang @ 2006-03-19  2:34 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel

[-- Attachment #1: readahead-method-stateful.patch --]
[-- Type: text/plain, Size: 12309 bytes --]

This is the fast code path of adaptive read-ahead.

MAJOR STEPS
        - estimate a thrashing safe ra_size;
        - assemble the next read-ahead request in file_ra_state;
        - submit it.

THE REFERENCE MODEL
        1. inactive list has constant length and page flow speed
        2. the observed stream receives a steady flow of read requests
        3. no page activation, so that the inactive list forms a pipe

With that we get the picture shown below.

|<------------------------- constant length ------------------------->|
<<<<<<<<<<<<<<<<<<<<<<<<< steady flow of pages <<<<<<<<<<<<<<<<<<<<<<<<
+---------------------------------------------------------------------+
|tail                        inactive list                        head|
|   =======                  ==========----                           |
|   chunk A(stale pages)     chunk B(stale + fresh pages)             |
+---------------------------------------------------------------------+

REAL WORLD ISSUES

Real world workloads will always have fluctuations (violating assumptions
1 and 2). To counteract this, a tunable parameter readahead_ratio is
introduced to make the estimation conservative enough. Violating assumption 3
will not lead to thrashing; it is there just to simplify the discussion.
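Under the reference model, the estimate follows by simple proportion: while the stream consumes stream_shift of its own pages, global_shift pages flow through an inactive list of global_size pages. A user-space sketch, with the fixed-point scaling simplified from compute_thrashing_threshold() in this patch:

```c
#include <assert.h>

/* Pipe-model estimate: if the stream reads stream_shift pages while
 * global_shift pages flow through an inactive list of global_size
 * pages, about
 *	stream_shift * global_size / global_shift
 * pages can be read before a chunk reaches the list tail.
 * readahead_ratio (a percentage) scales this down for safety. */
static unsigned long thrashing_threshold(unsigned long stream_shift,
					 unsigned long global_size,
					 unsigned long global_shift,
					 unsigned long readahead_ratio)
{
	unsigned long long ll;

	if (!global_shift)
		global_shift = 1;	/* avoid division by zero */
	ll = (unsigned long long)stream_shift * global_size;
	ll /= global_shift;
	return (unsigned long)(ll * readahead_ratio / 100);
}
```

A fast stream (large stream_shift relative to global_shift) gets a large threshold; a slow one on a busy box gets a small, conservative one.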

Signed-off-by: Wu Fengguang <wfg@mail.ustc.edu.cn>
---

 include/linux/fs.h |   41 +++++--
 mm/readahead.c     |  286 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 317 insertions(+), 10 deletions(-)

--- linux-2.6.16-rc6-mm2.orig/include/linux/fs.h
+++ linux-2.6.16-rc6-mm2/include/linux/fs.h
@@ -611,19 +611,40 @@ struct fown_struct {
  * Track a single file's readahead state
  */
 struct file_ra_state {
-	unsigned long start;		/* Current window */
-	unsigned long size;
-	unsigned long flags;		/* ra flags RA_FLAG_xxx*/
-	unsigned long cache_hit;	/* cache hit count*/
-	unsigned long prev_page;	/* Cache last read() position */
-	unsigned long ahead_start;	/* Ahead window */
-	unsigned long ahead_size;
-	unsigned long ra_pages;		/* Maximum readahead window */
-	unsigned long mmap_hit;		/* Cache hit stat for mmap accesses */
-	unsigned long mmap_miss;	/* Cache miss stat for mmap accesses */
+	union {
+		struct { /* conventional read-ahead */
+			unsigned long start;		/* Current window */
+			unsigned long size;
+			unsigned long ahead_start;	/* Ahead window */
+			unsigned long ahead_size;
+			unsigned long cache_hit;        /* cache hit count */
+		};
+#ifdef CONFIG_ADAPTIVE_READAHEAD
+		struct { /* adaptive read-ahead */
+			pgoff_t la_index;
+			pgoff_t ra_index;
+			pgoff_t lookahead_index;
+			pgoff_t readahead_index;
+			unsigned long age;
+			uint64_t cache_hits;
+		};
+#endif
+	};
+
+	/* mmap read-around */
+	unsigned long mmap_hit;         /* Cache hit stat for mmap accesses */
+	unsigned long mmap_miss;        /* Cache miss stat for mmap accesses */
+
+	/* common ones */
+	unsigned long flags;            /* ra flags RA_FLAG_xxx*/
+	unsigned long prev_page;        /* Cache last read() position */
+	unsigned long ra_pages;         /* Maximum readahead window */
 };
 #define RA_FLAG_MISS 0x01	/* a cache miss occured against this file */
 #define RA_FLAG_INCACHE 0x02	/* file is already in cache */
+#define RA_FLAG_MMAP		(1UL<<31)	/* mmaped page access */
+#define RA_FLAG_NO_LOOKAHEAD	(1UL<<30)	/* disable look-ahead */
+#define RA_FLAG_NFSD		(1UL<<29)	/* request from nfsd */
 
 struct file {
 	/*
--- linux-2.6.16-rc6-mm2.orig/mm/readahead.c
+++ linux-2.6.16-rc6-mm2/mm/readahead.c
@@ -16,6 +16,7 @@
 #include <linux/pagevec.h>
 #include <linux/writeback.h>
 #include <linux/nfsd/const.h>
+#include <asm/div64.h>
 
 /* The default max/min read-ahead pages. */
 #define KB(size)	(((size)*1024 + PAGE_CACHE_SIZE-1) / PAGE_CACHE_SIZE)
@@ -1060,6 +1061,291 @@ static unsigned long node_readahead_agin
 }
 
 /*
+ * The 64bit cache_hits stores three accumulated values and a counter value.
+ * MSB                                                                   LSB
+ * 3333333333333333 : 2222222222222222 : 1111111111111111 : 0000000000000000
+ */
+static inline int ra_cache_hit(struct file_ra_state *ra, int nr)
+{
+	return (ra->cache_hits >> (nr * 16)) & 0xFFFF;
+}
+
+/*
+ * Conceptual code:
+ * ra_cache_hit(ra, 1) += ra_cache_hit(ra, 0);
+ * ra_cache_hit(ra, 0) = 0;
+ */
+static inline void ra_addup_cache_hit(struct file_ra_state *ra)
+{
+	int n;
+
+	n = ra_cache_hit(ra, 0);
+	ra->cache_hits -= n;
+	n <<= 16;
+	ra->cache_hits += n;
+}
+
+/*
+ * The read-ahead is deemed success if cache-hit-rate >= 1/readahead_hit_rate.
+ */
+static inline int ra_cache_hit_ok(struct file_ra_state *ra)
+{
+	return ra_cache_hit(ra, 0) * readahead_hit_rate >=
+					(ra->lookahead_index - ra->la_index);
+}
+
+/*
+ * Check if @index falls in the @ra request.
+ */
+static inline int ra_has_index(struct file_ra_state *ra, pgoff_t index)
+{
+	if (index < ra->la_index || index >= ra->readahead_index)
+		return 0;
+
+	if (index >= ra->ra_index)
+		return 1;
+	else
+		return -1;
+}
+
+/*
+ * Which method is issuing this read-ahead?
+ */
+static inline void ra_set_class(struct file_ra_state *ra,
+				enum ra_class ra_class)
+{
+	unsigned long flags_mask;
+	unsigned long flags;
+	unsigned long old_ra_class;
+
+	flags_mask = ~(RA_CLASS_MASK | (RA_CLASS_MASK << RA_CLASS_SHIFT));
+	flags = ra->flags & flags_mask;
+
+	old_ra_class = (ra->flags & RA_CLASS_MASK) << RA_CLASS_SHIFT;
+
+	ra->flags = flags | old_ra_class | ra_class;
+
+	ra_addup_cache_hit(ra);
+	if (ra_class != RA_CLASS_STATE)
+		ra->cache_hits <<= 16;
+
+	ra->age = node_readahead_aging();
+}
+
+/*
+ * Where is the old read-ahead and look-ahead?
+ */
+static inline void ra_set_index(struct file_ra_state *ra,
+				pgoff_t la_index, pgoff_t ra_index)
+{
+	ra->la_index = la_index;
+	ra->ra_index = ra_index;
+}
+
+/*
+ * Where is the new read-ahead and look-ahead?
+ */
+static inline void ra_set_size(struct file_ra_state *ra,
+				unsigned long ra_size, unsigned long la_size)
+{
+	/* Disable look-ahead for loopback file. */
+	if (unlikely(ra->flags & RA_FLAG_NO_LOOKAHEAD))
+		la_size = 0;
+
+	ra->readahead_index = ra->ra_index + ra_size;
+	ra->lookahead_index = ra->readahead_index - la_size;
+}
+
+/*
+ * Submit IO for the read-ahead request in file_ra_state.
+ */
+static int ra_dispatch(struct file_ra_state *ra,
+			struct address_space *mapping, struct file *filp)
+{
+	pgoff_t eof_index;
+	unsigned long ra_size;
+	unsigned long la_size;
+	int actual;
+	enum ra_class ra_class;
+
+	ra_class = (ra->flags & RA_CLASS_MASK);
+	BUG_ON(ra_class == 0 || ra_class > RA_CLASS_END);
+
+	eof_index = ((i_size_read(mapping->host) - 1) >> PAGE_CACHE_SHIFT) + 1;
+	ra_size = ra->readahead_index - ra->ra_index;
+	la_size = ra->readahead_index - ra->lookahead_index;
+
+	/* Snap to EOF. */
+	if (unlikely(ra->ra_index >= eof_index))
+		return 0;
+	if (ra->readahead_index + ra_size / 2 > eof_index) {
+		if (ra_class == RA_CLASS_CONTEXT_AGGRESSIVE &&
+				eof_index > ra->lookahead_index + 1)
+			la_size = eof_index - ra->lookahead_index;
+		else
+			la_size = 0;
+		ra_size = eof_index - ra->ra_index;
+		ra_set_size(ra, ra_size, la_size);
+	}
+
+	actual = __do_page_cache_readahead(mapping, filp,
+					ra->ra_index, ra_size, la_size);
+
+#ifdef CONFIG_DEBUG_READAHEAD
+	if (ra->flags & RA_FLAG_MMAP)
+		ra_account(ra, RA_EVENT_READAHEAD_MMAP, actual);
+	if (ra->readahead_index == eof_index)
+		ra_account(ra, RA_EVENT_READAHEAD_EOF, actual);
+	if (la_size)
+		ra_account(ra, RA_EVENT_LOOKAHEAD, la_size);
+	if (ra_size > actual)
+		ra_account(ra, RA_EVENT_IO_CACHE_HIT, ra_size - actual);
+	ra_account(ra, RA_EVENT_READAHEAD, actual);
+
+	dprintk("readahead-%s(ino=%lu, index=%lu, ra=%lu+%lu-%lu) = %d\n",
+			ra_class_name[ra_class],
+			mapping->host->i_ino, ra->la_index,
+			ra->ra_index, ra_size, la_size, actual);
+#endif /* CONFIG_DEBUG_READAHEAD */
+
+	return actual;
+}
+
+/*
+ * Determine the ra request from primitive values.
+ *
+ * It applies the following rules:
+ *   - Subtract the old look-ahead size from ra_size to get the really safe read-ahead;
+ *   - Set new la_size according to the (still large) ra_size;
+ *   - Apply upper limits;
+ *   - Make sure stream_shift is not too small.
+ *     (So that the next global_shift will not be too small.)
+ *
+ * Input:
+ * ra_size stores the estimated thrashing-threshold.
+ * la_size stores the look-ahead size of previous request.
+ */
+static inline int adjust_rala(unsigned long ra_max,
+				unsigned long *ra_size, unsigned long *la_size)
+{
+	unsigned long stream_shift = *la_size;
+
+	if (*ra_size > *la_size)
+		*ra_size -= *la_size;
+	else {
+		ra_account(NULL, RA_EVENT_READAHEAD_SHRINK, *ra_size);
+		return 0;
+	}
+
+	*la_size = *ra_size / LOOKAHEAD_RATIO;
+
+	if (*ra_size > ra_max)
+		*ra_size = ra_max;
+	if (*la_size > *ra_size)
+		*la_size = *ra_size;
+
+	stream_shift += (*ra_size - *la_size);
+	if (stream_shift < *ra_size / 4)
+		*la_size -= (*ra_size / 4 - stream_shift);
+
+	return 1;
+}
+
+/*
+ * The function estimates two values:
+ * 1. thrashing-threshold for the current stream
+ *    It is returned to make the next read-ahead request.
+ * 2. the remained safe space for the current chunk
+ *    It will be checked to ensure that the current chunk is safe.
+ *
+ * The computation will be pretty accurate under heavy load, but fluctuates
+ * more under light load (with small global_shift), so the growth speed of
+ * ra_size must be limited, and a moderately large stream_shift ensured.
+ *
+ * This figure illustrates the formula used in the function:
+ * While the stream reads stream_shift pages inside the chunks,
+ * the chunks are shifted global_shift pages inside inactive_list.
+ *
+ *      chunk A                    chunk B
+ *                          |<=============== global_shift ================|
+ *  +-------------+         +-------------------+                          |
+ *  |       #     |         |           #       |            inactive_list |
+ *  +-------------+         +-------------------+                     head |
+ *          |---->|         |---------->|
+ *             |                  |
+ *             +-- stream_shift --+
+ */
+static inline unsigned long compute_thrashing_threshold(
+						struct file_ra_state *ra,
+						unsigned long *remain)
+{
+	unsigned long global_size;
+	unsigned long global_shift;
+	unsigned long stream_shift;
+	unsigned long ra_size;
+	uint64_t ll;
+
+	global_size = node_free_and_cold_pages();
+	global_shift = node_readahead_aging() - ra->age;
+	global_shift |= 1UL;
+	stream_shift = ra_cache_hit(ra, 0);
+
+	ll = (uint64_t) stream_shift * (global_size >> 9) * readahead_ratio * 5;
+	do_div(ll, global_shift);
+	ra_size = ll;
+
+	if (global_size > global_shift) {
+		ll = (uint64_t) stream_shift * (global_size - global_shift);
+		do_div(ll, global_shift);
+		*remain = ll;
+	} else
+		*remain = 0;
+
+	ddprintk("compute_thrashing_threshold: "
+			"at %lu ra %lu=%lu*%lu/%lu, remain %lu for %lu\n",
+			ra->readahead_index, ra_size,
+			stream_shift, global_size, global_shift,
+			*remain, ra->readahead_index - ra->lookahead_index);
+
+	return ra_size;
+}
+
+/*
+ * Main function for file_ra_state based read-ahead.
+ */
+static inline unsigned long
+state_based_readahead(struct address_space *mapping, struct file *filp,
+			struct file_ra_state *ra,
+			struct page *page, pgoff_t index,
+			unsigned long ra_size, unsigned long ra_max)
+{
+	unsigned long ra_old;
+	unsigned long la_size;
+	unsigned long remain_space;
+	unsigned long growth_limit;
+
+	la_size = ra->readahead_index - index;
+	ra_old = ra->readahead_index - ra->ra_index;
+	growth_limit = ra_size + ra_max / 16 +
+				(2 + readahead_ratio / 64) * ra_old;
+	ra_size = compute_thrashing_threshold(ra, &remain_space);
+
+	if (page && remain_space <= la_size && la_size > 1) {
+		rescue_pages(page, la_size);
+		return 0;
+	}
+
+	if (!adjust_rala(min(ra_max, growth_limit), &ra_size, &la_size))
+		return 0;
+
+	ra_set_class(ra, RA_CLASS_STATE);
+	ra_set_index(ra, index, ra->readahead_index);
+	ra_set_size(ra, ra_size, la_size);
+
+	return ra_dispatch(ra, mapping, filp);
+}
+
+/*
  * ra_min is mainly determined by the size of cache memory.
  * Table of concrete numbers for 4KB page size:
  *   inactive + free (MB):    4   8   16   32   64  128  256  512 1024

--

^ permalink raw reply	[flat|nested] 37+ messages in thread

* [PATCH 15/23] readahead: context based method
  2006-03-19  2:34 [PATCH 00/23] Adaptive read-ahead V11 Wu Fengguang
                   ` (13 preceding siblings ...)
  2006-03-19  2:34 ` [PATCH 14/23] readahead: state based method Wu Fengguang
@ 2006-03-19  2:34 ` Wu Fengguang
  2006-03-19  2:34 ` [PATCH 16/23] readahead: other methods Wu Fengguang
                   ` (8 subsequent siblings)
  23 siblings, 0 replies; 37+ messages in thread
From: Wu Fengguang @ 2006-03-19  2:34 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, Wu Fengguang

[-- Attachment #1: readahead-method-context.patch --]
[-- Type: text/plain, Size: 16504 bytes --]

This is the slow code path of adaptive read-ahead.

No valid state info is available, so the page cache is queried to obtain
the required position/timing info. This kind of estimation is more conservative
than that of the stateful method, and also fluctuates more with load variance.

MAJOR STEPS
        - look back/forward to find the ra_index;
        - look back to estimate a thrashing safe ra_size;
        - assemble the next read-ahead request in file_ra_state;
        - submit it.
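The look-back estimation step can be sketched with a toy page-cache bitmap. The idea, spelled out in the in-code comments of this patch, is that the run of contiguous history pages still cached behind the read position is a lower bound on the thrashing threshold; the real code walks the radix tree via radix_tree_scan_hole_backward() instead of an array:

```c
#include <assert.h>

/* Toy model: cached[i] != 0 means page i is still in the page cache.
 * Counting how many consecutive history pages survive right behind
 * @offset gives a conservative (lower-bound) thrashing threshold. */
static unsigned long count_history_pages(const unsigned char *cached,
					 unsigned long offset,
					 unsigned long max_scan)
{
	unsigned long count = 0;

	while (count < max_scan && count < offset &&
	       cached[offset - 1 - count])
		count++;
	return count;
}
```

If those pages were about to be reclaimed, the run would have been eaten from the tail already, so a chunk no larger than the surviving run is expected to be safe.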

ALGORITHM DYNAMICS

* startup
When a sequential read is detected, the chunk size is set to readahead-min
and grows with each readahead. The growth speed is controlled by
readahead-ratio. When readahead-ratio == 100, the new logic grows chunk
sizes exponentially -- like the current logic, but lagging behind it in
the early steps.

* stabilize
When the chunk size reaches readahead-max, or comes close to
        (readahead-ratio * thrashing-threshold)
it stops growing and stays there.

The main difference from the stock readahead logic occurs at and after
the time the chunk size stops growing:
     -  The current logic grows the chunk size exponentially in the normal
        case and halves it each time thrashing is seen. That can lead to
        thrashing with almost every readahead for very slow streams.
     -  The new logic can stop at a size below the thrashing-threshold,
        and stay there stable.

* on stream speed-up or system load fall
The thrashing-threshold rises, and the chunk size is likely to be enlarged.

* on stream slow-down or system load spike
The thrashing-threshold falls.
If thrashing happens, the next read is treated as a random read, and with
another read the chunk-size-growing phase is restarted.

For a slow stream that has (thrashing-threshold < readahead-max):
      - When readahead-ratio = 100, there is only one chunk in cache most
        of the time;
      - When readahead-ratio = 50, there are two chunks in cache most of
        the time;
      - Lowering readahead-ratio helps gracefully cut down the chunk size
        without thrashing.
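The grow/stabilize dynamics above can be condensed into a hypothetical one-step update. This is an illustrative model only, not the kernel code; the doubling stands in for the exponential growth phase:

```c
#include <assert.h>

/* Sketch of the chunk-size dynamics: grow the chunk each step, but
 * never past readahead-max nor past readahead_ratio percent of the
 * current thrashing threshold, where it stabilizes. */
static unsigned long next_chunk_size(unsigned long cur,
				     unsigned long ra_max,
				     unsigned long thrash_threshold,
				     unsigned long readahead_ratio)
{
	unsigned long next = cur * 2;	/* exponential growth phase */
	unsigned long cap = thrash_threshold * readahead_ratio / 100;

	if (next > cap)
		next = cap;
	if (next > ra_max)
		next = ra_max;
	return next;
}
```

With readahead-ratio = 50, the stable size sits at half the thrashing threshold, which is why at most two chunks of the stream live in cache at once.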

OVERHEADS
The context based method has some overhead compared to the stateful method,
due to extra locking and memory scans.

Running oprofile on the following command shows the following differences:

	# diff sparse sparse1

	total oprofile samples	run1	run2
	stateful method		560482	558696
	stateless method	564463	559413

So the average overhead is about 0.4%.

Detailed diffprofile data:

# diffprofile oprofile.50.stateful oprofile.50.stateless
      2998    41.1% isolate_lru_pages
      2669    26.4% shrink_zone
      1822    14.7% system_call
      1419    27.6% radix_tree_delete
      1376    14.8% _raw_write_lock
      1279    27.4% free_pages_bulk
      1111    12.0% _raw_write_unlock
      1035    43.3% free_hot_cold_page
       849    15.3% unlock_page
       786    29.6% page_referenced
       710     4.6% kmap_atomic
       651    26.4% __pagevec_release_nonlru
       586    16.1% __rmqueue
       578    11.3% find_get_page
       481    15.5% page_waitqueue
       440     6.6% add_to_page_cache
       420    33.7% fget_light
       260     4.3% get_page_from_freelist
       223    13.7% find_busiest_group
       221    35.1% mutex_debug_check_no_locks_freed
       211     0.0% radix_tree_scan_hole
       198    35.5% delay_tsc
       195    14.8% ext3_get_branch
       182    12.6% profile_tick
       173     0.0% radix_tree_cache_lookup_node
       164    22.9% find_next_bit
       162    50.3% page_cache_readahead_adaptive
...
       106     0.0% radix_tree_scan_hole_backward
...
       -51    -7.6% radix_tree_preload
...
       -68    -2.1% radix_tree_insert
...
       -87    -2.0% mark_page_accessed
       -88    -2.0% __pagevec_lru_add
      -103    -7.7% softlockup_tick
      -107   -71.8% free_block
      -122   -77.7% do_IRQ
      -132   -82.0% do_timer
      -140   -47.1% ack_edge_ioapic_vector
      -168   -81.2% handle_IRQ_event
      -192   -35.2% irq_entries_start
      -204   -14.8% rw_verify_area
      -214   -13.2% account_system_time
      -233    -9.5% radix_tree_lookup_node
      -234   -16.6% scheduler_tick
      -259   -58.7% __do_IRQ
      -266    -6.8% put_page
      -318   -29.3% rcu_pending
      -333    -3.0% do_generic_mapping_read
      -337   -28.3% hrtimer_run_queues
      -493   -27.0% __rcu_pending
     -1038    -9.4% default_idle
     -3323    -3.5% __copy_to_user_ll
    -10331    -5.9% do_mpage_readpage

# diffprofile oprofile.50.stateful2 oprofile.50.stateless2
      1739     1.1% do_mpage_readpage
       833     0.9% __copy_to_user_ll
       340    21.3% find_busiest_group
       288     9.5% free_hot_cold_page
       261     4.6% _raw_read_unlock
       239     3.9% get_page_from_freelist
       201     0.0% radix_tree_scan_hole
       163    14.3% raise_softirq
       160     0.0% radix_tree_cache_lookup_node
       160    11.8% update_process_times
       136     9.3% fget_light
       121    35.1% page_cache_readahead_adaptive
       117    36.0% restore_all
       117     2.8% mark_page_accessed
       109     6.4% rebalance_tick
       107     9.4% sys_read
       102     0.0% radix_tree_scan_hole_backward
...
        63     4.0% readahead_cache_hit
...
       -10   -15.9% radix_tree_node_alloc
...
       -39    -1.7% radix_tree_lookup_node
       -39   -10.3% irq_entries_start
       -43    -1.3% radix_tree_insert
...
       -47    -4.6% __do_page_cache_readahead
       -64    -9.3% radix_tree_preload
       -65    -5.4% rw_verify_area
       -65    -2.2% vfs_read
       -70    -4.7% timer_interrupt
       -71    -1.0% __wake_up_bit
       -73    -1.1% radix_tree_delete
       -79   -12.6% __mod_page_state_offset
       -94    -1.8% __find_get_block
       -94    -2.2% __pagevec_lru_add
      -102    -1.7% free_pages_bulk
      -116    -1.3% _raw_read_lock
      -123    -7.4% do_sync_read
      -130    -8.4% ext3_get_blocks_handle
      -142    -3.8% put_page
      -146    -7.9% mpage_readpages
      -147    -5.6% apic_timer_interrupt
      -168    -1.6% _raw_write_unlock
      -172    -5.0% page_referenced
      -206    -3.2% unlock_page
      -212   -15.0% restore_nocheck
      -213    -2.1% default_idle
      -245    -5.0% __rmqueue
      -278    -4.3% find_get_page
      -282    -2.1% system_call
      -287   -11.8% run_timer_softirq
      -300    -2.7% _raw_write_lock
      -420    -3.2% shrink_zone
      -661    -5.7% isolate_lru_pages

Signed-off-by: Wu Fengguang <wfg@mail.ustc.edu.cn>
---

 mm/readahead.c |  343 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 files changed, 343 insertions(+)

--- linux-2.6.16-rc6-mm2.orig/mm/readahead.c
+++ linux-2.6.16-rc6-mm2/mm/readahead.c
@@ -1346,6 +1346,349 @@ state_based_readahead(struct address_spa
 }
 
 /*
+ * Page cache context based estimation of read-ahead/look-ahead size/index.
+ *
+ * The logic first looks around to find the start point of next read-ahead,
+ * and then, if necessary, looks backward in the inactive_list to get an
+ * estimation of the thrashing-threshold.
+ *
+ * The estimation theory can be illustrated with figure:
+ *
+ *   chunk A           chunk B                      chunk C                 head
+ *
+ *   l01 l11           l12   l21                    l22
+ *| |-->|-->|       |------>|-->|                |------>|
+ *| +-------+       +-----------+                +-------------+               |
+ *| |   #   |       |       #   |                |       #     |               |
+ *| +-------+       +-----------+                +-------------+               |
+ *| |<==============|<===========================|<============================|
+ *        L0                     L1                            L2
+ *
+ * Let f(l) = L be a map from
+ * 	l: the number of pages read by the stream
+ * to
+ * 	L: the number of pages pushed into inactive_list in the mean time
+ * then
+ * 	f(l01) <= L0
+ * 	f(l11 + l12) = L1
+ * 	f(l21 + l22) = L2
+ * 	...
+ * 	f(l01 + l11 + ...) <= Sum(L0 + L1 + ...)
+ *			   <= Length(inactive_list) = f(thrashing-threshold)
+ *
+ * So the count of continuous history pages left in the inactive_list is always
+ * a lower estimation of the true thrashing-threshold.
+ */
+
+#define PAGE_REFCNT_0           0
+#define PAGE_REFCNT_1           (1 << PG_referenced)
+#define PAGE_REFCNT_2           (1 << PG_active)
+#define PAGE_REFCNT_3           ((1 << PG_active) | (1 << PG_referenced))
+#define PAGE_REFCNT_MASK        PAGE_REFCNT_3
+
+/*
+ * STATUS   REFERENCE COUNT
+ *  __                   0
+ *  _R       PAGE_REFCNT_1
+ *  A_       PAGE_REFCNT_2
+ *  AR       PAGE_REFCNT_3
+ *
+ *  A/R: Active / Referenced
+ */
+static inline unsigned long page_refcnt(struct page *page)
+{
+        return page->flags & PAGE_REFCNT_MASK;
+}
+
+/*
+ * STATUS   REFERENCE COUNT      TYPE
+ *  __                   0      fresh
+ *  _R       PAGE_REFCNT_1      stale
+ *  A_       PAGE_REFCNT_2      disturbed once
+ *  AR       PAGE_REFCNT_3      disturbed twice
+ *
+ *  A/R: Active / Referenced
+ */
+static inline unsigned long cold_page_refcnt(struct page *page)
+{
+	if (!page || PageActive(page))
+		return 0;
+
+	return page_refcnt(page);
+}
+
+static inline char page_refcnt_symbol(struct page *page)
+{
+	if (!page)
+		return 'X';
+
+	switch (page_refcnt(page)) {
+		case 0:
+			return '_';
+		case PAGE_REFCNT_1:
+			return '-';
+		case PAGE_REFCNT_2:
+			return '=';
+		case PAGE_REFCNT_3:
+			return '#';
+		default:
+			return '?';
+	}
+}
+
+/*
+ * Count/estimate cache hits in range [first_index, last_index].
+ * The estimation is simple and optimistic.
+ */
+static int count_cache_hit(struct address_space *mapping,
+				pgoff_t first_index, pgoff_t last_index)
+{
+	struct page *page;
+	int size = last_index - first_index + 1;
+	int count = 0;
+	int i;
+
+	cond_resched();
+	read_lock_irq(&mapping->tree_lock);
+
+	/*
+	 * The first page may well be the chunk head and have been accessed,
+	 * so sampling index 0 first makes the estimation optimistic. This
+	 * behavior guarantees a readahead when (size < ra_max) and
+	 * (readahead_hit_rate >= 16).
+	 */
+	for (i = 0; i < 16;) {
+		page = __find_page(mapping, first_index +
+						size * ((i++ * 29) & 15) / 16);
+		if (cold_page_refcnt(page) >= PAGE_REFCNT_1 && ++count >= 2)
+			break;
+	}
+
+	read_unlock_irq(&mapping->tree_lock);
+
+	return size * count / i;
+}
+
+/*
+ * Look back and check history pages to estimate thrashing-threshold.
+ */
+static unsigned long query_page_cache_segment(struct address_space *mapping,
+				struct file_ra_state *ra,
+				unsigned long *remain, pgoff_t offset,
+				unsigned long ra_min, unsigned long ra_max)
+{
+	pgoff_t index;
+	unsigned long count;
+	unsigned long nr_lookback;
+	struct radix_tree_cache cache;
+
+	/*
+	 * Scan backward and check the near @ra_max pages.
+	 * The count here determines ra_size.
+	 */
+	cond_resched();
+	read_lock_irq(&mapping->tree_lock);
+	index = radix_tree_scan_hole_backward(&mapping->page_tree,
+							offset, ra_max);
+	read_unlock_irq(&mapping->tree_lock);
+
+	*remain = offset - index;
+
+	if (offset == ra->readahead_index && ra_cache_hit_ok(ra))
+		count = *remain;
+	else if (count_cache_hit(mapping, index + 1, offset) *
+						readahead_hit_rate >= *remain)
+		count = *remain;
+	else
+		count = ra_min;
+
+	/*
+	 * Unnecessary to count more?
+	 */
+	if (count < ra_max)
+		goto out;
+
+	if (unlikely(ra->flags & RA_FLAG_NO_LOOKAHEAD))
+		goto out;
+
+	/*
+	 * Check the far pages coarsely.
+	 * The big count here helps increase la_size.
+	 */
+	nr_lookback = ra_max * (LOOKAHEAD_RATIO + 1) *
+						100 / (readahead_ratio + 1);
+
+	cond_resched();
+	radix_tree_cache_init(&cache);
+	read_lock_irq(&mapping->tree_lock);
+	for (count += ra_max; count < nr_lookback; count += ra_max) {
+		struct radix_tree_node *node;
+		node = radix_tree_cache_lookup_node(&mapping->page_tree,
+						&cache, offset - count, 1);
+		if (!node)
+			break;
+	}
+	read_unlock_irq(&mapping->tree_lock);
+
+out:
+	/*
+	 *  For sequential reads that extend from index 0, the counted value
+	 *  may well be far under the true threshold, so return it unmodified
+	 *  for further processing in adjust_rala_aggressive().
+	 */
+	if (count >= offset)
+		count = offset;
+	else
+		count = max(ra_min, count * readahead_ratio / 100);
+
+	ddprintk("query_page_cache_segment: "
+			"ino=%lu, idx=%lu, count=%lu, remain=%lu\n",
+			mapping->host->i_ino, offset, count, *remain);
+
+	return count;
+}
+
+/*
+ * Find past-the-end index of the segment before @index.
+ */
+static inline pgoff_t find_segtail_backward(struct address_space *mapping,
+					pgoff_t index, unsigned long max_scan)
+{
+	struct radix_tree_cache cache;
+	struct page *page;
+	pgoff_t origin;
+
+	origin = index;
+	if (max_scan > index)
+		max_scan = index;
+
+	cond_resched();
+	radix_tree_cache_init(&cache);
+	read_lock_irq(&mapping->tree_lock);
+	for (; origin - index < max_scan;) {
+		page = radix_tree_cache_lookup(&mapping->page_tree,
+							&cache, --index);
+		if (page) {
+			read_unlock_irq(&mapping->tree_lock);
+			return index + 1;
+		}
+	}
+	read_unlock_irq(&mapping->tree_lock);
+
+	return 0;
+}
+
+/*
+ * Find past-the-end index of the segment at @index.
+ */
+static inline pgoff_t find_segtail(struct address_space *mapping,
+					pgoff_t index, unsigned long max_scan)
+{
+	pgoff_t ra_index;
+
+	cond_resched();
+	read_lock_irq(&mapping->tree_lock);
+	ra_index = radix_tree_scan_hole(&mapping->page_tree, index, max_scan);
+	read_unlock_irq(&mapping->tree_lock);
+
+	if (ra_index <= index + max_scan)
+		return ra_index;
+	else
+		return 0;
+}
+
+/*
+ * Determine the request parameters for context based read-ahead that extends
+ * from start of file.
+ *
+ * The major weakness of the stateless method is perhaps its slow ramp-up of
+ * ra_size. The logic tries to make up for this in the important case of
+ * sequential reads that extend from the start of file. In this case, ra_size
+ * is not chosen to make the whole next chunk safe (as in the normal case):
+ * only half of it is. The added 'unsafe' half is the look-ahead part, which
+ * is expected to be safeguarded by rescue_pages() when the previous chunks
+ * are lost.
+ */
+static inline int adjust_rala_aggressive(unsigned long ra_max,
+				unsigned long *ra_size, unsigned long *la_size)
+{
+	pgoff_t index = *ra_size;
+
+	*ra_size -= min(*ra_size, *la_size);
+	*ra_size = *ra_size * readahead_ratio / 100;
+	*la_size = index * readahead_ratio / 100;
+	*ra_size += *la_size;
+
+	if (*ra_size > ra_max)
+		*ra_size = ra_max;
+	if (*la_size > *ra_size)
+		*la_size = *ra_size;
+
+	return 1;
+}
+
+/*
+ * Main function for page context based read-ahead.
+ */
+static inline int
+try_context_based_readahead(struct address_space *mapping,
+			struct file_ra_state *ra, struct page *prev_page,
+			struct page *page, pgoff_t index,
+			unsigned long ra_min, unsigned long ra_max)
+{
+	pgoff_t ra_index;
+	unsigned long ra_size;
+	unsigned long la_size;
+	unsigned long remain_pages;
+
+	/* Where to start read-ahead?
+	 * NFSv3 daemons may process adjacent requests in parallel,
+	 * leading to many locally disordered, globally sequential reads.
+	 * So do not require nearby history pages to be present or accessed.
+	 */
+	if (page) {
+		ra_index = find_segtail(mapping, index, ra_max * 5 / 4);
+		if (!ra_index)
+			return -1;
+	} else if (prev_page || find_page(mapping, index - 1)) {
+		ra_index = index;
+	} else if (readahead_hit_rate > 1) {
+		ra_index = find_segtail_backward(mapping, index,
+						readahead_hit_rate + ra_min);
+		if (!ra_index)
+			return 0;
+		ra_min += 2 * (index - ra_index);
+		index = ra_index;	/* pretend the request starts here */
+	} else
+		return 0;
+
+	ra_size = query_page_cache_segment(mapping, ra, &remain_pages,
+							index, ra_min, ra_max);
+
+	la_size = ra_index - index;
+	if (page && remain_pages <= la_size &&
+			remain_pages < index && la_size > 1) {
+		rescue_pages(page, la_size);
+		return -1;
+	}
+
+	if (ra_size == index) {
+		if (!adjust_rala_aggressive(ra_max, &ra_size, &la_size))
+			return -1;
+		ra_set_class(ra, RA_CLASS_CONTEXT_AGGRESSIVE);
+	} else {
+		if (!adjust_rala(ra_max, &ra_size, &la_size))
+			return -1;
+		ra_set_class(ra, RA_CLASS_CONTEXT);
+	}
+
+	ra_set_index(ra, index, ra_index);
+	ra_set_size(ra, ra_size, la_size);
+
+	return 1;
+}
+
+/*
  * ra_min is mainly determined by the size of cache memory.
  * Table of concrete numbers for 4KB page size:
  *   inactive + free (MB):    4   8   16   32   64  128  256  512 1024

--

^ permalink raw reply	[flat|nested] 37+ messages in thread

* [PATCH 16/23] readahead: other methods
  2006-03-19  2:34 [PATCH 00/23] Adaptive read-ahead V11 Wu Fengguang
                   ` (14 preceding siblings ...)
  2006-03-19  2:34 ` [PATCH 15/23] readahead: context " Wu Fengguang
@ 2006-03-19  2:34 ` Wu Fengguang
  2006-03-19  2:34 ` [PATCH 17/23] readahead: call scheme Wu Fengguang
                   ` (7 subsequent siblings)
  23 siblings, 0 replies; 37+ messages in thread
From: Wu Fengguang @ 2006-03-19  2:34 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, Wu Fengguang

[-- Attachment #1: readahead-method-others.patch --]
[-- Type: text/plain, Size: 5066 bytes --]

Various read-ahead strategies for:
	- fresh reads from the start of file
	- backward prefetching
	- the seek-and-read-one-block pattern (db workload)
	- quick recovery from thrashing

Signed-off-by: Wu Fengguang <wfg@mail.ustc.edu.cn>
---

 mm/readahead.c |  153 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 files changed, 153 insertions(+)

--- linux-2.6.16-rc6-mm2.orig/mm/readahead.c
+++ linux-2.6.16-rc6-mm2/mm/readahead.c
@@ -1689,6 +1689,159 @@ try_context_based_readahead(struct addre
 }
 
 /*
+ * Read-ahead on start of file.
+ *
+ * The strategies here are most important for small files.
+ * 1. Set a moderately large read-ahead size;
+ * 2. Issue the next read-ahead request as soon as possible.
+ *
+ * But be careful: there are some applications that dip into only the very head
+ * of a file. The most important thing is to prevent them from triggering the
+ * next (much larger) read-ahead request, which leads to lots of cache misses.
+ * Two pages should be enough for them, correct me if I'm wrong.
+ */
+static inline unsigned long
+newfile_readahead(struct address_space *mapping,
+		struct file *filp, struct file_ra_state *ra,
+		unsigned long req_size, unsigned long ra_min)
+{
+	unsigned long ra_size;
+	unsigned long la_size;
+
+	if (req_size > ra_min) /* larger value risks thrashing */
+		req_size = ra_min;
+
+	ra_size = 4 * req_size;
+	la_size = 2 * req_size;
+
+	ra_set_class(ra, RA_CLASS_NEWFILE);
+	ra_set_index(ra, 0, 0);
+	ra_set_size(ra, ra_size, la_size);
+
+	return ra_dispatch(ra, mapping, filp);
+}
+
+/*
+ * Backward prefetching.
+ * No look-ahead or thrashing-threshold estimation for the backward-stepping
+ * pattern: they should be unnecessary.
+ */
+static inline int
+try_read_backward(struct file_ra_state *ra, pgoff_t begin_index,
+			unsigned long ra_size, unsigned long ra_max)
+{
+	pgoff_t end_index;
+
+	/* Are we reading backward? */
+	if (begin_index > ra->prev_page)
+		return 0;
+
+	if ((ra->flags & RA_CLASS_MASK) == RA_CLASS_BACKWARD &&
+					ra_has_index(ra, ra->prev_page)) {
+		ra_size += 2 * ra_cache_hit(ra, 0);
+		end_index = ra->la_index;
+	} else {
+		ra_size += ra_size + ra_size * (readahead_hit_rate - 1) / 2;
+		end_index = ra->prev_page;
+	}
+
+	if (ra_size > ra_max)
+		ra_size = ra_max;
+
+	/* Read traces close enough to be covered by the prefetching? */
+	if (end_index > begin_index + ra_size)
+		return 0;
+
+	begin_index = end_index - ra_size;
+
+	ra_set_class(ra, RA_CLASS_BACKWARD);
+	ra_set_index(ra, begin_index, begin_index);
+	ra_set_size(ra, ra_size, 0);
+
+	return 1;
+}
+
+/*
+ * Readahead thrashing recovery.
+ */
+static inline unsigned long
+thrashing_recovery_readahead(struct address_space *mapping,
+				struct file *filp, struct file_ra_state *ra,
+				pgoff_t index, unsigned long ra_max)
+{
+	unsigned long ra_size;
+
+	if (readahead_debug_level && find_page(mapping, index - 1))
+		ra_account(ra, RA_EVENT_READAHEAD_MUTILATE,
+						ra->readahead_index - index);
+	ra_account(ra, RA_EVENT_READAHEAD_THRASHING,
+						ra->readahead_index - index);
+
+	/*
+	 * Some thrashing occurs in (ra_index, la_index], in which case the
+	 * old read-ahead chunk is lost soon after the new one is allocated.
+	 * Ensure that we recover all needed pages in the old chunk.
+	 */
+	if (index < ra->ra_index)
+		ra_size = ra->ra_index - index;
+	else {
+		/* After thrashing, we know the exact thrashing-threshold. */
+		ra_size = ra_cache_hit(ra, 0);
+
+		/* And we'd better be a bit conservative. */
+		ra_size = ra_size * 3 / 4;
+	}
+
+	if (ra_size > ra_max)
+		ra_size = ra_max;
+
+	ra_set_class(ra, RA_CLASS_THRASHING);
+	ra_set_index(ra, index, index);
+	ra_set_size(ra, ra_size, ra_size / LOOKAHEAD_RATIO);
+
+	return ra_dispatch(ra, mapping, filp);
+}
+
+/*
+ * If there is a previous sequential read, it is likely to be another
+ * sequential read at the new position.
+ * Databases are known to have this seek-and-read-one-block pattern.
+ */
+static inline int
+try_readahead_on_seek(struct file_ra_state *ra, pgoff_t index,
+			unsigned long ra_size, unsigned long ra_max)
+{
+	unsigned long hit0 = ra_cache_hit(ra, 0);
+	unsigned long hit1 = ra_cache_hit(ra, 1) + hit0;
+	unsigned long hit2 = ra_cache_hit(ra, 2);
+	unsigned long hit3 = ra_cache_hit(ra, 3);
+
+	/* There's a previous read-ahead request? */
+	if (!ra_has_index(ra, ra->prev_page))
+		return 0;
+
+	/* The previous read-ahead sequences have similar sizes? */
+	if (!(ra_size < hit1 && hit1 > hit2 / 2 &&
+				hit2 > hit3 / 2 &&
+				hit3 > hit1 / 2))
+		return 0;
+
+	hit1 = max(hit1, hit2);
+
+	/* Follow the same prefetching direction. */
+	if ((ra->flags & RA_CLASS_MASK) == RA_CLASS_BACKWARD)
+		index = ((index > hit1 - ra_size) ? index - hit1 + ra_size : 0);
+
+	ra_size = min(hit1, ra_max);
+
+	ra_set_class(ra, RA_CLASS_SEEK);
+	ra_set_index(ra, index, index);
+	ra_set_size(ra, ra_size, 0);
+
+	return 1;
+}
+
+/*
  * ra_min is mainly determined by the size of cache memory.
  * Table of concrete numbers for 4KB page size:
  *   inactive + free (MB):    4   8   16   32   64  128  256  512 1024

--


* [PATCH 17/23] readahead: call scheme
  2006-03-19  2:34 [PATCH 00/23] Adaptive read-ahead V11 Wu Fengguang
                   ` (15 preceding siblings ...)
  2006-03-19  2:34 ` [PATCH 16/23] readahead: other methods Wu Fengguang
@ 2006-03-19  2:34 ` Wu Fengguang
  2006-03-19  2:34 ` [PATCH 18/23] readahead: laptop mode Wu Fengguang
                   ` (6 subsequent siblings)
  23 siblings, 0 replies; 37+ messages in thread
From: Wu Fengguang @ 2006-03-19  2:34 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, Wu Fengguang

[-- Attachment #1: readahead-call-scheme.patch --]
[-- Type: text/plain, Size: 9514 bytes --]

The read-ahead logic is called when the reading hits
        - a look-ahead mark;
        - a non-present page.

ra.prev_page should be properly set up on entry, and ra_access() should be
called on every page reference to maintain the cache_hits counter.

This call scheme achieves the following goals:
        - makes all stateful/stateless methods happy;
        - eliminates the cache hit problem naturally;
        - lives in harmony with application managed read-aheads via
          fadvise/madvise.

Signed-off-by: Wu Fengguang <wfg@mail.ustc.edu.cn>
---

 include/linux/mm.h |    7 ++
 mm/filemap.c       |   47 ++++++++++++++--
 mm/readahead.c     |  155 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 205 insertions(+), 4 deletions(-)

--- linux-2.6.16-rc6-mm2.orig/include/linux/mm.h
+++ linux-2.6.16-rc6-mm2/include/linux/mm.h
@@ -1019,10 +1019,17 @@ unsigned long page_cache_readahead(struc
 void handle_ra_miss(struct address_space *mapping, 
 		    struct file_ra_state *ra, pgoff_t offset);
 unsigned long max_sane_readahead(unsigned long nr);
+unsigned long
+page_cache_readahead_adaptive(struct address_space *mapping,
+			struct file_ra_state *ra, struct file *filp,
+			struct page *prev_page, struct page *page,
+			pgoff_t first_index, pgoff_t index, pgoff_t last_index);
 
 #ifdef CONFIG_ADAPTIVE_READAHEAD
+void fastcall readahead_cache_hit(struct file_ra_state *ra, struct page *page);
 extern int readahead_ratio;
 #else
+#define readahead_cache_hit(ra, page) do { } while (0)
 #define readahead_ratio 1
 #endif /* CONFIG_ADAPTIVE_READAHEAD */
 
--- linux-2.6.16-rc6-mm2.orig/mm/filemap.c
+++ linux-2.6.16-rc6-mm2/mm/filemap.c
@@ -833,14 +833,32 @@ void do_generic_mapping_read(struct addr
 		nr = nr - offset;
 
 		cond_resched();
-		if (index == next_index)
+
+		if (!prefer_adaptive_readahead() && index == next_index)
 			next_index = page_cache_readahead(mapping, &ra, filp,
 					index, last_index - index);
 
 find_page:
 		page = find_get_page(mapping, index);
+		if (prefer_adaptive_readahead()) {
+			if (unlikely(page == NULL)) {
+				ra.prev_page = prev_index;
+				page_cache_readahead_adaptive(mapping, &ra,
+						filp, prev_page, NULL,
+						*ppos >> PAGE_CACHE_SHIFT,
+						index, last_index);
+				page = find_get_page(mapping, index);
+			} else if (PageReadahead(page)) {
+				ra.prev_page = prev_index;
+				page_cache_readahead_adaptive(mapping, &ra,
+						filp, prev_page, page,
+						*ppos >> PAGE_CACHE_SHIFT,
+						index, last_index);
+			}
+		}
 		if (unlikely(page == NULL)) {
-			handle_ra_miss(mapping, &ra, index);
+			if (!prefer_adaptive_readahead())
+				handle_ra_miss(mapping, &ra, index);
 			goto no_cached_page;
 		}
 
@@ -848,6 +866,7 @@ find_page:
 			page_cache_release(prev_page);
 		prev_page = page;
 
+		readahead_cache_hit(&ra, page);
 		if (!PageUptodate(page))
 			goto page_not_up_to_date;
 page_ok:
@@ -991,6 +1010,8 @@ no_cached_page:
 
 out:
 	*_ra = ra;
+	if (prefer_adaptive_readahead())
+		_ra->prev_page = prev_index;
 
 	*ppos = ((loff_t) index << PAGE_CACHE_SHIFT) + offset;
 	if (cached_page)
@@ -1275,6 +1296,7 @@ struct page *filemap_nopage(struct vm_ar
 	unsigned long size, pgoff;
 	int did_readaround = 0, majmin = VM_FAULT_MINOR;
 
+	ra->flags |= RA_FLAG_MMAP;
 	pgoff = ((address-area->vm_start) >> PAGE_CACHE_SHIFT) + area->vm_pgoff;
 
 retry_all:
@@ -1292,19 +1314,33 @@ retry_all:
 	 *
 	 * For sequential accesses, we use the generic readahead logic.
 	 */
-	if (VM_SequentialReadHint(area))
+	if (!prefer_adaptive_readahead() && VM_SequentialReadHint(area))
 		page_cache_readahead(mapping, ra, file, pgoff, 1);
 
+
 	/*
 	 * Do we have something in the page cache already?
 	 */
 retry_find:
 	page = find_get_page(mapping, pgoff);
+	if (prefer_adaptive_readahead() && VM_SequentialReadHint(area)) {
+		if (!page) {
+			page_cache_readahead_adaptive(mapping, ra,
+						file, NULL, NULL,
+						pgoff, pgoff, pgoff + 1);
+			page = find_get_page(mapping, pgoff);
+		} else if (PageReadahead(page)) {
+			page_cache_readahead_adaptive(mapping, ra,
+						file, NULL, page,
+						pgoff, pgoff, pgoff + 1);
+		}
+	}
 	if (!page) {
 		unsigned long ra_pages;
 
 		if (VM_SequentialReadHint(area)) {
-			handle_ra_miss(mapping, ra, pgoff);
+			if (!prefer_adaptive_readahead())
+				handle_ra_miss(mapping, ra, pgoff);
 			goto no_cached_page;
 		}
 		ra->mmap_miss++;
@@ -1341,6 +1377,7 @@ retry_find:
 	if (!did_readaround)
 		ra->mmap_hit++;
 
+	readahead_cache_hit(ra, page);
 	/*
 	 * Ok, found a page in the page cache, now we need to check
 	 * that it's up-to-date.
@@ -1355,6 +1392,8 @@ success:
 	mark_page_accessed(page);
 	if (type)
 		*type = majmin;
+	if (prefer_adaptive_readahead())
+		ra->prev_page = page->index;
 	return page;
 
 outside_data_content:
--- linux-2.6.16-rc6-mm2.orig/mm/readahead.c
+++ linux-2.6.16-rc6-mm2/mm/readahead.c
@@ -1858,4 +1858,159 @@ static inline void get_readahead_bounds(
 	*ra_min = min(min(MIN_RA_PAGES + (pages>>13), KB(128)), *ra_max/2);
 }
 
+/**
+ * page_cache_readahead_adaptive - adaptive read-ahead main function
+ * @mapping, @ra, @filp: the same as page_cache_readahead()
+ * @prev_page: the page at @index-1, may be NULL to let the function find it
+ * @page: the page at @index, or NULL if non-present
+ * @begin_index, @index, @end_index: offsets into @mapping
+ * 		[@begin_index, @end_index) is the read the caller is performing
+ *	 	@index indicates the page to be read now
+ *
+ * page_cache_readahead_adaptive() is the entry point of the adaptive
+ * read-ahead logic. It tries a set of methods in turn to determine the
+ * appropriate readahead action and submits the readahead I/O.
+ *
+ * The caller is expected to point ra->prev_page to the previously accessed
+ * page, and to call this function in two cases:
+ * 1. @page == NULL
+ *    A cache miss happened; some pages have to be read in.
+ * 2. @page != NULL && PageReadahead(@page)
+ *    A look-ahead mark was encountered; it was set by a previous read-ahead
+ *    invocation to instruct the caller to give this function a chance to
+ *    check up and do the next read-ahead in advance.
+ */
+unsigned long
+page_cache_readahead_adaptive(struct address_space *mapping,
+			struct file_ra_state *ra, struct file *filp,
+			struct page *prev_page, struct page *page,
+			pgoff_t begin_index, pgoff_t index, pgoff_t end_index)
+{
+	unsigned long size;
+	unsigned long ra_min;
+	unsigned long ra_max;
+	int ret;
+
+	might_sleep();
+
+	if (page) {
+		if (!TestClearPageReadahead(page))
+			return 0;
+		if (bdi_read_congested(mapping->backing_dev_info)) {
+			ra_account(ra, RA_EVENT_IO_CONGESTION,
+							end_index - index);
+			return 0;
+		}
+	}
+
+	if (page)
+		ra_account(ra, RA_EVENT_LOOKAHEAD_HIT,
+				ra->readahead_index - ra->lookahead_index);
+	else if (index)
+		ra_account(ra, RA_EVENT_CACHE_MISS, end_index - begin_index);
+
+	size = end_index - index;
+	get_readahead_bounds(ra, &ra_min, &ra_max);
+
+	/* readahead disabled? */
+	if (unlikely(!ra_max || !readahead_ratio)) {
+		size = max_sane_readahead(size);
+		goto readit;
+	}
+
+	/*
+	 * Start of file.
+	 */
+	if (index == 0)
+		return newfile_readahead(mapping, filp, ra, end_index, ra_min);
+
+	/*
+	 * State based sequential read-ahead.
+	 */
+	if (!disable_stateful_method &&
+			index == ra->lookahead_index && ra_cache_hit_ok(ra))
+		return state_based_readahead(mapping, filp, ra, page,
+							index, size, ra_max);
+
+	/*
+	 * Recover from possible thrashing.
+	 */
+	if (!page && index == ra->prev_page + 1 && ra_has_index(ra, index))
+		return thrashing_recovery_readahead(mapping, filp, ra,
+								index, ra_max);
+
+	/*
+	 * Backward read-ahead.
+	 */
+	if (!page && begin_index == index &&
+				try_read_backward(ra, index, size, ra_max))
+		return ra_dispatch(ra, mapping, filp);
+
+	/*
+	 * Context based sequential read-ahead.
+	 */
+	ret = try_context_based_readahead(mapping, ra, prev_page, page,
+							index, ra_min, ra_max);
+	if (ret > 0)
+		return ra_dispatch(ra, mapping, filp);
+	if (ret < 0)
+		return 0;
+
+	/* No action on look ahead time? */
+	if (page) {
+		ra_account(ra, RA_EVENT_LOOKAHEAD_NOACTION,
+						ra->readahead_index - index);
+		return 0;
+	}
+
+	/*
+	 * Random read that follows a sequential one.
+	 */
+	if (try_readahead_on_seek(ra, index, size, ra_max))
+		return ra_dispatch(ra, mapping, filp);
+
+	/*
+	 * Random read.
+	 */
+	if (size > ra_max)
+		size = ra_max;
+
+readit:
+	size = __do_page_cache_readahead(mapping, filp, index, size, 0);
+
+	ra_account(ra, RA_EVENT_READRANDOM, size);
+	dprintk("readrandom(ino=%lu, pages=%lu, index=%lu-%lu-%lu) = %lu\n",
+			mapping->host->i_ino, mapping->nrpages,
+			begin_index, index, end_index, size);
+
+	return size;
+}
+
+/**
+ * readahead_cache_hit - adaptive read-ahead feedback function
+ * @ra: file_ra_state which holds the readahead state
+ * @page: the page just accessed
+ *
+ * readahead_cache_hit() is the feedback route of the adaptive read-ahead
+ * logic. It must be called on every access to the read-ahead pages.
+ */
+void fastcall readahead_cache_hit(struct file_ra_state *ra, struct page *page)
+{
+	if (PageActive(page) || PageReferenced(page))
+		return;
+
+	if (!PageUptodate(page))
+		ra_account(ra, RA_EVENT_IO_BLOCK, 1);
+
+	if (!ra_has_index(ra, page->index))
+		return;
+
+	ra->cache_hits++;
+
+	if (page->index >= ra->ra_index)
+		ra_account(ra, RA_EVENT_READAHEAD_HIT, 1);
+	else
+		ra_account(ra, RA_EVENT_READAHEAD_HIT, -1);
+}
+
 #endif /* CONFIG_ADAPTIVE_READAHEAD */

--


* [PATCH 18/23] readahead: laptop mode
  2006-03-19  2:34 [PATCH 00/23] Adaptive read-ahead V11 Wu Fengguang
                   ` (16 preceding siblings ...)
  2006-03-19  2:34 ` [PATCH 17/23] readahead: call scheme Wu Fengguang
@ 2006-03-19  2:34 ` Wu Fengguang
  2006-03-19  2:34 ` [PATCH 19/23] readahead: loop case Wu Fengguang
                   ` (5 subsequent siblings)
  23 siblings, 0 replies; 37+ messages in thread
From: Wu Fengguang @ 2006-03-19  2:34 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, Bart Samwel

[-- Attachment #1: readahead-laptop-mode.patch --]
[-- Type: text/plain, Size: 3063 bytes --]

When the laptop drive is spun down, defer look-ahead until spin-up time.

The implementation employs a poll-based method, since performance is not a
concern in this code path. The poll interval is 64KB, which should be small
enough for movies/music. The user-space application is responsible for
proper caching to hide the spin-up-and-read delay.

------------------------------------------------------------------------
For crazy laptop users who prefer aggressive read-ahead, here is the way:

# echo 1000 > /proc/sys/vm/readahead_ratio
# blockdev --setra 524280 /dev/hda      # this is the max possible value

Notes:
- It is still an untested feature.
- It is safer to use blockdev+fadvise to increase ra-max for a single file,
  which requires patching your movie player.
- Be sure to restore them to sane values in normal operations!

Signed-off-by: Wu Fengguang <wfg@mail.ustc.edu.cn>
---

 include/linux/writeback.h |    6 ++++++
 mm/page-writeback.c       |    2 +-
 mm/readahead.c            |   30 ++++++++++++++++++++++++++++++
 3 files changed, 37 insertions(+), 1 deletion(-)

--- linux-2.6.16-rc6-mm2.orig/include/linux/writeback.h
+++ linux-2.6.16-rc6-mm2/include/linux/writeback.h
@@ -85,6 +85,12 @@ void laptop_io_completion(void);
 void laptop_sync_completion(void);
 void throttle_vm_writeout(void);
 
+extern struct timer_list laptop_mode_wb_timer;
+static inline int laptop_spinned_down(void)
+{
+	return !timer_pending(&laptop_mode_wb_timer);
+}
+
 /* These are exported to sysctl. */
 extern int dirty_background_ratio;
 extern int vm_dirty_ratio;
--- linux-2.6.16-rc6-mm2.orig/mm/page-writeback.c
+++ linux-2.6.16-rc6-mm2/mm/page-writeback.c
@@ -377,7 +377,7 @@ static void wb_timer_fn(unsigned long un
 static void laptop_timer_fn(unsigned long unused);
 
 static DEFINE_TIMER(wb_timer, wb_timer_fn, 0, 0);
-static DEFINE_TIMER(laptop_mode_wb_timer, laptop_timer_fn, 0, 0);
+DEFINE_TIMER(laptop_mode_wb_timer, laptop_timer_fn, 0, 0);
 
 /*
  * Periodic writeback of "old" data.
--- linux-2.6.16-rc6-mm2.orig/mm/readahead.c
+++ linux-2.6.16-rc6-mm2/mm/readahead.c
@@ -1017,6 +1017,31 @@ out:
 }
 
 /*
+ * Set a new look-ahead mark at @new_index.
+ * Return 0 if the new mark is successfully set.
+ */
+static inline int renew_lookahead(struct address_space *mapping,
+				struct file_ra_state *ra,
+				pgoff_t index, pgoff_t new_index)
+{
+	struct page *page;
+
+	if (index == ra->lookahead_index &&
+			new_index >= ra->readahead_index)
+		return 1;
+
+	page = find_page(mapping, new_index);
+	if (!page)
+		return 1;
+
+	__SetPageReadahead(page);
+	if (ra->lookahead_index == index)
+		ra->lookahead_index = new_index;
+
+	return 0;
+}
+
+/*
  * State based calculation of read-ahead request.
  *
  * This figure shows the meaning of file_ra_state members:
@@ -1901,6 +1926,11 @@ page_cache_readahead_adaptive(struct add
 							end_index - index);
 			return 0;
 		}
+		if (laptop_mode && laptop_spinned_down()) {
+			if (!renew_lookahead(mapping, ra, index,
+						index + LAPTOP_POLL_INTERVAL))
+				return 0;
+		}
 	}
 
 	if (page)

--


* [PATCH 19/23] readahead: loop case
  2006-03-19  2:34 [PATCH 00/23] Adaptive read-ahead V11 Wu Fengguang
                   ` (17 preceding siblings ...)
  2006-03-19  2:34 ` [PATCH 18/23] readahead: laptop mode Wu Fengguang
@ 2006-03-19  2:34 ` Wu Fengguang
  2006-03-19  2:34 ` [PATCH 20/23] readahead: nfsd case Wu Fengguang
                   ` (4 subsequent siblings)
  23 siblings, 0 replies; 37+ messages in thread
From: Wu Fengguang @ 2006-03-19  2:34 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, Wu Fengguang

[-- Attachment #1: readahead-loop-case.patch --]
[-- Type: text/plain, Size: 894 bytes --]

Disable look-ahead for loop files.

Loopback files normally contain filesystems, in which case the upper layer
already does proper look-ahead; more look-ahead on the loopback file only
ruins the read-ahead hit rate.

Signed-off-by: Wu Fengguang <wfg@mail.ustc.edu.cn>
---

I'd like to thank Tero Grundström for uncovering the loopback problem.

 drivers/block/loop.c |    6 ++++++
 1 files changed, 6 insertions(+)

--- linux-2.6.16-rc6-mm2.orig/drivers/block/loop.c
+++ linux-2.6.16-rc6-mm2/drivers/block/loop.c
@@ -779,6 +779,12 @@ static int loop_set_fd(struct loop_devic
 	mapping = file->f_mapping;
 	inode = mapping->host;
 
+	/*
+	 * The upper layer should already do proper look-ahead,
+	 * one more look-ahead here only ruins the cache hit rate.
+	 */
+	file->f_ra.flags |= RA_FLAG_NO_LOOKAHEAD;
+
 	if (!(file->f_mode & FMODE_WRITE))
 		lo_flags |= LO_FLAGS_READ_ONLY;
 

--


* [PATCH 20/23] readahead: nfsd case
  2006-03-19  2:34 [PATCH 00/23] Adaptive read-ahead V11 Wu Fengguang
                   ` (18 preceding siblings ...)
  2006-03-19  2:34 ` [PATCH 19/23] readahead: loop case Wu Fengguang
@ 2006-03-19  2:34 ` Wu Fengguang
  2006-03-19  2:34 ` [PATCH 21/23] readahead: debug radix tree new functions Wu Fengguang
                   ` (3 subsequent siblings)
  23 siblings, 0 replies; 37+ messages in thread
From: Wu Fengguang @ 2006-03-19  2:34 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, Neil Brown

[-- Attachment #1: readahead-nfsd-case.patch --]
[-- Type: text/plain, Size: 5353 bytes --]

Adaptive read-ahead for nfsd:
1) disable nfsd raparms: the new logic does not rely on them
2) disable look-ahead on start of file: leave it to the client

For the case of NFS service, the new read-ahead logic
+ can handle disordered nfsd requests
+ can handle concurrent sequential requests on large files
  with the help of look-ahead
- will struggle with concurrent requests on small files

--------------------------------
Notes about the concurrent nfsd requests issue:

nfsd read requests can be out of order, concurrent, and carry no ra-state info.
They are handled by the context-based read-ahead method, which does the job
in the following steps:

1. scan in page cache
2. make read-ahead decisions
3. alloc new pages
4. insert new pages to page cache

A single read-ahead chunk on the client side will be disassembled and serviced
by many concurrent nfsd instances on the server side. It is highly likely that
two or more of these parallel nfsd instances are in step 1/2/3 at the same
time. Not knowing that others are working on the same file region, they will
issue overlapping read-ahead requests, which leads to many conflicts at step 4.

There is little hope of eliminating the concurrency problem in a general and
efficient way. But experiments show that mounting with tcp,rsize=32768 can
cut down the overhead a lot.

--------------------------------
Here are the benchmark outputs. The test cases cover
- small/big files
- small/big rsize mount option
- serialized/parallel nfsd processing

`serialized' means running the following command to enforce serialized
nfsd request processing:

# for pid in `pidof nfsd`; do taskset -p 1 $pid; done

8 nfsd; local mount with tcp,rsize=8192
=======================================

SERIALIZED, SMALL FILES
readahead_ratio = 0, ra_max = 128kb (old logic, the ra_max is not quite relevant)
96.51s real  11.32s system  3.27s user  160334+2829 cs  diff -r $NFSDIR $NFSDIR2
readahead_ratio = 70, ra_max = 1024kb (new read-ahead logic)
94.88s real  11.53s system  3.20s user  152415+3777 cs  diff -r $NFSDIR $NFSDIR2

SERIALIZED, BIG FILES
readahead_ratio = 0, ra_max = 128kb
56.52s real  3.38s system  1.23s user  47930+5256 cs  diff $NFSFILE $NFSFILE2
readahead_ratio = 70, ra_max = 1024kb
32.54s real  5.71s system  1.38s user  23851+17007 cs  diff $NFSFILE $NFSFILE2

PARALLEL, SMALL FILES
readahead_ratio = 0, ra_max = 128kb
99.87s real  11.41s system  3.15s user  173945+9163 cs  diff -r $NFSDIR $NFSDIR2
readahead_ratio = 70, ra_max = 1024kb
100.14s real  12.06s system  3.16s user  170865+13406 cs  diff -r $NFSDIR $NFSDIR2

PARALLEL, BIG FILES
readahead_ratio = 0, ra_max = 128kb
63.35s real  5.68s system  1.57s user  82594+48747 cs  diff $NFSFILE $NFSFILE2
readahead_ratio = 70, ra_max = 1024kb
33.87s real  10.17s system  1.55s user  72291+100079 cs  diff $NFSFILE $NFSFILE2

8 nfsd; local mount with tcp,rsize=32768
========================================
Note that the parallel numbers are now much better, and come close to those
of the serialized runs.

PARALLEL/NORMAL

readahead_ratio = 8, ra_max = 1024kb (old logic)
48.36s real  2.22s system  1.51s user  7209+4110 cs  diff $NFSFILE $NFSFILE2
readahead_ratio = 70, ra_max = 1024kb (new logic)
30.04s real  2.46s system  1.33s user  5420+2492 cs  diff $NFSFILE $NFSFILE2

readahead_ratio = 8, ra_max = 1024kb
92.99s real  10.32s system  3.23s user  145004+1826 cs  diff -r $NFSDIR $NFSDIR2 > /dev/null
readahead_ratio = 70, ra_max = 1024kb
90.96s real  10.68s system  3.22s user  144414+2520 cs  diff -r $NFSDIR $NFSDIR2 > /dev/null

SERIALIZED

readahead_ratio = 8, ra_max = 1024kb
47.58s real  2.10s system  1.27s user  7933+1357 cs  diff $NFSFILE $NFSFILE2
readahead_ratio = 70, ra_max = 1024kb
29.46s real  2.41s system  1.38s user  5590+2613 cs  diff $NFSFILE $NFSFILE2

readahead_ratio = 8, ra_max = 1024kb
93.02s real  10.67s system  3.25s user  144850+2286 cs  diff -r $NFSDIR $NFSDIR2 > /dev/null
readahead_ratio = 70, ra_max = 1024kb
91.15s real  11.04s system  3.31s user  144432+2814 cs  diff -r $NFSDIR $NFSDIR2 > /dev/null

Signed-off-by: Wu Fengguang <wfg@mail.ustc.edu.cn>
---


Greg Banks gave valuable recommendations on the test cases, which helped me
get a more complete picture. Thanks!

 fs/nfsd/vfs.c  |    6 +++++-
 mm/readahead.c |    9 +++++++--
 2 files changed, 12 insertions(+), 3 deletions(-)

--- linux-2.6.16-rc6-mm2.orig/fs/nfsd/vfs.c
+++ linux-2.6.16-rc6-mm2/fs/nfsd/vfs.c
@@ -833,10 +833,14 @@ nfsd_vfs_read(struct svc_rqst *rqstp, st
 #endif
 
 	/* Get readahead parameters */
-	ra = nfsd_get_raparms(inode->i_sb->s_dev, inode->i_ino);
+	if (prefer_adaptive_readahead())
+		ra = NULL;
+	else
+		ra = nfsd_get_raparms(inode->i_sb->s_dev, inode->i_ino);
 
 	if (ra && ra->p_set)
 		file->f_ra = ra->p_ra;
+	file->f_ra.flags |= RA_FLAG_NFSD;
 
 	if (file->f_op->sendfile) {
 		svc_pushback_unused_pages(rqstp);
--- linux-2.6.16-rc6-mm2.orig/mm/readahead.c
+++ linux-2.6.16-rc6-mm2/mm/readahead.c
@@ -1736,8 +1736,13 @@ newfile_readahead(struct address_space *
 	if (req_size > ra_min) /* larger value risks thrashing */
 		req_size = ra_min;
 
-	ra_size = 4 * req_size;
-	la_size = 2 * req_size;
+	if (unlikely(ra->flags & RA_FLAG_NFSD)) {
+		ra_size = MIN_NFSD_PAGES;
+		la_size = 0;
+	} else {
+		ra_size = 4 * req_size;
+		la_size = 2 * req_size;
+	}
 
 	ra_set_class(ra, RA_CLASS_NEWFILE);
 	ra_set_index(ra, 0, 0);

--


* [PATCH 21/23] readahead: debug radix tree new functions
  2006-03-19  2:34 [PATCH 00/23] Adaptive read-ahead V11 Wu Fengguang
                   ` (19 preceding siblings ...)
  2006-03-19  2:34 ` [PATCH 20/23] readahead: nfsd case Wu Fengguang
@ 2006-03-19  2:34 ` Wu Fengguang
  2006-03-19  2:34 ` [PATCH 22/23] readahead: debug traces showing accessed file names Wu Fengguang
                   ` (2 subsequent siblings)
  23 siblings, 0 replies; 37+ messages in thread
From: Wu Fengguang @ 2006-03-19  2:34 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, Wu Fengguang

[-- Attachment #1: readahead-debug-radix-tree.patch --]
[-- Type: text/plain, Size: 2114 bytes --]

Do some sanity checks on the newly added radix tree code.

This patch is meant to be removed after a while.

Signed-off-by: Wu Fengguang <wfg@mail.ustc.edu.cn>
---

 mm/readahead.c |   24 ++++++++++++++++++++++++
 1 files changed, 24 insertions(+)

--- linux-2.6.16-rc6-mm2.orig/mm/readahead.c
+++ linux-2.6.16-rc6-mm2/mm/readahead.c
@@ -80,6 +80,8 @@ enum ra_class {
 #include <linux/debugfs.h>
 #include <linux/seq_file.h>
 
+#define DEBUG_READAHEAD_RADIXTREE
+
 /* Read-ahead events to be accounted. */
 enum ra_event {
 	RA_EVENT_CACHE_MISS,		/* read cache misses */
@@ -1515,6 +1517,13 @@ static unsigned long query_page_cache_se
 	read_lock_irq(&mapping->tree_lock);
 	index = radix_tree_scan_hole_backward(&mapping->page_tree,
 							offset, ra_max);
+#ifdef DEBUG_READAHEAD_RADIXTREE
+	WARN_ON(index > offset);
+	if (index != offset)
+		WARN_ON(!__find_page(mapping, index + 1));
+	if (index && offset - index < ra_max)
+		WARN_ON(__find_page(mapping, index));
+#endif
 	read_unlock_irq(&mapping->tree_lock);
 
 	*remain = offset - index;
@@ -1550,6 +1559,11 @@ static unsigned long query_page_cache_se
 		struct radix_tree_node *node;
 		node = radix_tree_cache_lookup_node(&mapping->page_tree,
 						&cache, offset - count, 1);
+#ifdef DEBUG_READAHEAD_RADIXTREE
+		if (node != radix_tree_lookup_node(&mapping->page_tree,
+							offset - count, 1))
+			BUG();
+#endif
 		if (!node)
 			break;
 	}
@@ -1614,6 +1628,16 @@ static inline pgoff_t find_segtail(struc
 	cond_resched();
 	read_lock_irq(&mapping->tree_lock);
 	ra_index = radix_tree_scan_hole(&mapping->page_tree, index, max_scan);
+#ifdef DEBUG_READAHEAD_RADIXTREE
+	BUG_ON(!__find_page(mapping, index));
+	WARN_ON(ra_index < index);
+	if (ra_index != index && !__find_page(mapping, ra_index - 1))
+		printk(KERN_ERR "radix_tree_scan_hole(index=%lu ra_index=%lu "
+				"max_scan=%lu nrpages=%lu) fooled!\n",
+				index, ra_index, max_scan, mapping->nrpages);
+	if (ra_index != ~0UL && ra_index - index < max_scan)
+		WARN_ON(__find_page(mapping, ra_index));
+#endif
 	read_unlock_irq(&mapping->tree_lock);
 
 	if (ra_index <= index + max_scan)

--


* [PATCH 22/23] readahead: debug traces showing accessed file names
  2006-03-19  2:34 [PATCH 00/23] Adaptive read-ahead V11 Wu Fengguang
                   ` (20 preceding siblings ...)
  2006-03-19  2:34 ` [PATCH 21/23] readahead: debug radix tree new functions Wu Fengguang
@ 2006-03-19  2:34 ` Wu Fengguang
  2006-03-19  2:34 ` [PATCH 23/23] readahead: debug traces showing read patterns Wu Fengguang
  2006-03-19  3:10 ` [PATCH 00/23] Adaptive read-ahead V11 Jon Smirl
  23 siblings, 0 replies; 37+ messages in thread
From: Wu Fengguang @ 2006-03-19  2:34 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, Wu Fengguang

[-- Attachment #1: readahead-debug-traces-file-list.patch --]
[-- Type: text/plain, Size: 1011 bytes --]

Print file names on their first read-ahead, for tracing file access patterns.

Signed-off-by: Wu Fengguang <wfg@mail.ustc.edu.cn>
---

 mm/readahead.c |   14 ++++++++++++++
 1 files changed, 14 insertions(+)

--- linux-2.6.16-rc6-mm2.orig/mm/readahead.c
+++ linux-2.6.16-rc6-mm2/mm/readahead.c
@@ -1229,6 +1229,20 @@ static int ra_dispatch(struct file_ra_st
 		ra_account(ra, RA_EVENT_IO_CACHE_HIT, ra_size - actual);
 	ra_account(ra, RA_EVENT_READAHEAD, actual);
 
+	if (!ra->ra_index && filp->f_dentry->d_inode) {
+		char *fn;
+		static char path[1024];
+		unsigned long size;
+
+		size = (i_size_read(filp->f_dentry->d_inode)+1023)/1024;
+		fn = d_path(filp->f_dentry, filp->f_vfsmnt, path, 1000);
+		if (!IS_ERR(fn))
+			ddprintk("ino %lu is %s size %luK by %s(%d)\n",
+					filp->f_dentry->d_inode->i_ino,
+					fn, size,
+					current->comm, current->pid);
+	}
+
 	dprintk("readahead-%s(ino=%lu, index=%lu, ra=%lu+%lu-%lu) = %d\n",
 			ra_class_name[ra_class],
 			mapping->host->i_ino, ra->la_index,

--


* [PATCH 23/23] readahead: debug traces showing read patterns
  2006-03-19  2:34 [PATCH 00/23] Adaptive read-ahead V11 Wu Fengguang
                   ` (21 preceding siblings ...)
  2006-03-19  2:34 ` [PATCH 22/23] readahead: debug traces showing accessed file names Wu Fengguang
@ 2006-03-19  2:34 ` Wu Fengguang
  2006-03-19  3:10 ` [PATCH 00/23] Adaptive read-ahead V11 Jon Smirl
  23 siblings, 0 replies; 37+ messages in thread
From: Wu Fengguang @ 2006-03-19  2:34 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, Wu Fengguang

[-- Attachment #1: readahead-debug-traces-access-pattern.patch --]
[-- Type: text/plain, Size: 2681 bytes --]

Print all relevant read requests to help discover the access pattern.

If you are experiencing performance problems, or want to help improve
the read-ahead logic, please send me the trace data. Thanks.

- Preparations

# Compile kernel with option CONFIG_DEBUG_READAHEAD
mkdir /debug
mount -t debug none /debug

- For each session with distinct access pattern

echo > /debug/readahead # reset the counters
# echo > /var/log/kern.log # you may want to back it up first
echo 8 > /debug/readahead/debug_level # show verbose printk traces
# do one benchmark/task
echo 0 > /debug/readahead/debug_level # revert to normal value
cp /debug/readahead/events readahead-events-`date +'%F_%R'`
bzip2 -c /var/log/kern.log > kern.log-`date +'%F_%R'`.bz2

Signed-off-by: Wu Fengguang <wfg@mail.ustc.edu.cn>
---

 mm/filemap.c |   23 ++++++++++++++++++++++-
 1 files changed, 22 insertions(+), 1 deletion(-)

--- linux-2.6.16-rc6-mm2.orig/mm/filemap.c
+++ linux-2.6.16-rc6-mm2/mm/filemap.c
@@ -45,6 +45,12 @@ static ssize_t
 generic_file_direct_IO(int rw, struct kiocb *iocb, const struct iovec *iov,
 	loff_t offset, unsigned long nr_segs);
 
+#ifdef CONFIG_DEBUG_READAHEAD
+extern u32 readahead_debug_level;
+#else
+#define readahead_debug_level 0
+#endif /* CONFIG_DEBUG_READAHEAD */
+
 /*
  * Shared mappings implemented 30.11.1994. It's not fully working yet,
  * though.
@@ -815,6 +821,10 @@ void do_generic_mapping_read(struct addr
 	if (!isize)
 		goto out;
 
+	if (readahead_debug_level >= 5)
+		printk(KERN_DEBUG "read-file(ino=%lu, req=%lu+%lu)\n",
+			inode->i_ino, index, last_index - index);
+
 	end_index = (isize - 1) >> PAGE_CACHE_SHIFT;
 	for (;;) {
 		struct page *page;
@@ -867,6 +877,11 @@ find_page:
 		prev_page = page;
 
 		readahead_cache_hit(&ra, page);
+		if (readahead_debug_level >= 7)
+			printk(KERN_DEBUG "read-page(ino=%lu, idx=%lu, io=%s)\n",
+				inode->i_ino, index,
+				PageUptodate(page) ? "hit" : "miss");
+
 		if (!PageUptodate(page))
 			goto page_not_up_to_date;
 page_ok:
@@ -1317,7 +1332,6 @@ retry_all:
 	if (!prefer_adaptive_readahead() && VM_SequentialReadHint(area))
 		page_cache_readahead(mapping, ra, file, pgoff, 1);
 
-
 	/*
 	 * Do we have something in the page cache already?
 	 */
@@ -1378,6 +1392,13 @@ retry_find:
 		ra->mmap_hit++;
 
 	readahead_cache_hit(ra, page);
+	if (readahead_debug_level >= 6)
+		printk(KERN_DEBUG "read-mmap(ino=%lu, idx=%lu, hint=%s, io=%s)\n",
+			inode->i_ino, pgoff,
+			VM_RandomReadHint(area) ? "random" :
+			(VM_SequentialReadHint(area) ? "sequential" : "none"),
+			PageUptodate(page) ? "hit" : "miss");
+
 	/*
 	 * Ok, found a page in the page cache, now we need to check
 	 * that it's up-to-date.

--


* Re: [PATCH 00/23] Adaptive read-ahead V11
  2006-03-19  2:34 [PATCH 00/23] Adaptive read-ahead V11 Wu Fengguang
                   ` (22 preceding siblings ...)
  2006-03-19  2:34 ` [PATCH 23/23] readahead: debug traces showing read patterns Wu Fengguang
@ 2006-03-19  3:10 ` Jon Smirl
  2006-03-19  3:47   ` Wu Fengguang
  2006-03-27 21:38   ` Matt Heler
  23 siblings, 2 replies; 37+ messages in thread
From: Jon Smirl @ 2006-03-19  3:10 UTC (permalink / raw)
  To: Wu Fengguang; +Cc: linux-kernel

This is probably a readahead problem. The lighttpd people that are
encountering this problem are not regular lkml readers.

http://bugzilla.kernel.org/show_bug.cgi?id=5949

--
Jon Smirl
jonsmirl@gmail.com


* Re: [PATCH 00/23] Adaptive read-ahead V11
  2006-03-19  3:10 ` [PATCH 00/23] Adaptive read-ahead V11 Jon Smirl
@ 2006-03-19  3:47   ` Wu Fengguang
  2006-03-19  4:10     ` Jon Smirl
  2006-03-27 21:38   ` Matt Heler
  1 sibling, 1 reply; 37+ messages in thread
From: Wu Fengguang @ 2006-03-19  3:47 UTC (permalink / raw)
  To: Jon Smirl; +Cc: linux-kernel

On Sat, Mar 18, 2006 at 10:10:43PM -0500, Jon Smirl wrote:
> This is probably a readahead problem. The lighttpd people that are
> encountering this problem are not regular lkml readers.
> 
> http://bugzilla.kernel.org/show_bug.cgi?id=5949

[QUOTE]
        My general conclusion is that since they were able to write a user
        space implementation that avoids the problem something must be broken
        in the kernel readahead logic for sendfile().

Maybe the user space solution does the trick by using a larger window size?

IMHO, the stock read-ahead is not designed with extremely high concurrency in
mind. However, 100 streams should not be a problem at all.

Wu


* Re: [PATCH 07/23] readahead: insert cond_resched() calls
  2006-03-19  2:34 ` [PATCH 07/23] readahead: insert cond_resched() calls Wu Fengguang
@ 2006-03-19  3:50   ` Lee Revell
  2006-03-19  5:32     ` Wu Fengguang
  2006-03-20 13:31     ` Wu Fengguang
  0 siblings, 2 replies; 37+ messages in thread
From: Lee Revell @ 2006-03-19  3:50 UTC (permalink / raw)
  To: Wu Fengguang; +Cc: Andrew Morton, linux-kernel, Con Kolivas

On Sun, 2006-03-19 at 10:34 +0800, Wu Fengguang wrote:
> plain text document attachment
> (readahead-insert-cond_resched-calls.patch)
> Since the VM_MAX_READAHEAD is greatly enlarged and the algorithm become more
> complex, it becomes necessary to insert some cond_resched() calls in the
> read-ahead path.
> 
> If desktop users still feel audio jitters with the new read-ahead code,
> please try one of the following ways to get rid of it:
> 
> 1) compile kernel with CONFIG_PREEMPT_VOLUNTARY/CONFIG_PREEMPT
> 2) reduce the read-ahead request size by running
> 	blockdev --setra 256 /dev/hda # or whatever device you are using

Do you have any numbers on this (say, from Ingo's latency tracer)?  Have
you confirmed it's not a latency regression with the default settings?

Lee



* Re: [PATCH 00/23] Adaptive read-ahead V11
  2006-03-19  3:47   ` Wu Fengguang
@ 2006-03-19  4:10     ` Jon Smirl
  2006-03-19  5:09       ` Wu Fengguang
  0 siblings, 1 reply; 37+ messages in thread
From: Jon Smirl @ 2006-03-19  4:10 UTC (permalink / raw)
  To: Wu Fengguang, linux-kernel

On 3/18/06, Wu Fengguang <wfg@mail.ustc.edu.cn> wrote:
> On Sat, Mar 18, 2006 at 10:10:43PM -0500, Jon Smirl wrote:
> > This is probably a readahead problem. The lighttpd people that are
> > encountering this problem are not regular lkml readers.
> >
> > http://bugzilla.kernel.org/show_bug.cgi?id=5949
>
> [QUOTE]
>         My general conclusion is that since they were able to write a user
>         space implementation that avoids the problem something must be broken
>         in the kernel readahead logic for sendfile().
>
> Maybe the user space solution does the trick by using a larger window size?
>
> IMHO, the stock read-ahead is not designed with extremely high concurrency in
> mind. However, 100 streams should not be a problem at all.

Has anyone checked to see if the readahead logic is working as
expected from sendfile? IO from sendfile is a different type of
context than IO from user space; there could be sendfile-specific
problems. If window size is the trick, shouldn't sendfile
automatically adapt its window size? I don't think you can control
the sendfile window size from user space.

The goal of sendfile is to be the most optimal way to send a file over
the network. This is a case where user space code is easily beating
it.

--
Jon Smirl
jonsmirl@gmail.com


* Re: [PATCH 00/23] Adaptive read-ahead V11
  2006-03-19  4:10     ` Jon Smirl
@ 2006-03-19  5:09       ` Wu Fengguang
  2006-03-19 15:53         ` Jon Smirl
  0 siblings, 1 reply; 37+ messages in thread
From: Wu Fengguang @ 2006-03-19  5:09 UTC (permalink / raw)
  To: Jon Smirl; +Cc: linux-kernel

On Sat, Mar 18, 2006 at 11:10:33PM -0500, Jon Smirl wrote:
> > Maybe the user space solution does the trick by using a larger window size?
> >
> > IMHO, the stock read-ahead is not designed with extremely high concurrency in
> > mind. However, 100 streams should not be a problem at all.
> 
> Has anyone checked to see if the readahead logic is working as
> expected from sendfile? IO from sendfile is a different type of
> context than IO from user space, there could be sendfile specific

AFAIK, sendfile() and read() use the same readahead logic, which
handles them equally well.  And there is another readaround logic
which handles unhinted mmapped reads.

> problems. If window size is the trick, shouldn't sendfile
> automatically adapt it's window size? I don't think you can control
> the sendfile window size from user space.

For whole-file reads, the stock readahead logic by default uses a fixed
window size of VM_MAX_READAHEAD=128KB, while the adaptive readahead logic
uses an adaptive window size with a high limit of VM_MAX_READAHEAD=1024KB.

The VM_MAX_READAHEAD in the kernel is used to init the .ra_pages
attribute of block devices, which can later be altered in _runtime_.
To set a 512KB window size limit for hda, one can do it in two ways:
1) blockdev --setra 1024 /dev/hda
2) echo 512 > /sys/block/hda/queue/read_ahead_kb

Cheers,
Wu


* Re: [PATCH 07/23] readahead: insert cond_resched() calls
  2006-03-19  3:50   ` Lee Revell
@ 2006-03-19  5:32     ` Wu Fengguang
  2006-03-20 13:31     ` Wu Fengguang
  1 sibling, 0 replies; 37+ messages in thread
From: Wu Fengguang @ 2006-03-19  5:32 UTC (permalink / raw)
  To: Lee Revell; +Cc: Andrew Morton, linux-kernel, Con Kolivas

On Sat, Mar 18, 2006 at 10:50:42PM -0500, Lee Revell wrote:
> On Sun, 2006-03-19 at 10:34 +0800, Wu Fengguang wrote:
> > plain text document attachment
> > (readahead-insert-cond_resched-calls.patch)
> > Since the VM_MAX_READAHEAD is greatly enlarged and the algorithm become more
> > complex, it becomes necessary to insert some cond_resched() calls in the
> > read-ahead path.
> > 
> > If desktop users still feel audio jitters with the new read-ahead code,
> > please try one of the following ways to get rid of it:
> > 
> > 1) compile kernel with CONFIG_PREEMPT_VOLUNTARY/CONFIG_PREEMPT
> > 2) reduce the read-ahead request size by running
> > 	blockdev --setra 256 /dev/hda # or whatever device you are using
> 
> Do you have any numbers on this (say, from Ingo's latency tracer)?  Have
> you confirmed it's not a latency regression with the default settings?

Sorry, I did the testing simply by playing mp3s while doing heavy I/O.
The 128KB window size does not need these cond_resched()s, yet the
1024KB window size does.  With this patch, I do not feel any latency
issues with CONFIG_PREEMPT_NONE.

But I'm not sure it is the case for others, for there have been user
reports of audio jitter (without further information, hence still an
open problem).

Anyway, numbers will be collected soon ...

Cheers,
Wu


* Re: [PATCH 00/23] Adaptive read-ahead V11
  2006-03-19  5:09       ` Wu Fengguang
@ 2006-03-19 15:53         ` Jon Smirl
  2006-03-20 13:54           ` Wu Fengguang
  0 siblings, 1 reply; 37+ messages in thread
From: Jon Smirl @ 2006-03-19 15:53 UTC (permalink / raw)
  To: Wu Fengguang, linux-kernel

On 3/19/06, Wu Fengguang <wfg@mail.ustc.edu.cn> wrote:
> On Sat, Mar 18, 2006 at 11:10:33PM -0500, Jon Smirl wrote:
> > > Maybe the user space solution does the trick by using a larger window size?
> > >
> > > IMHO, the stock read-ahead is not designed with extremely high concurrency in
> > > mind. However, 100 streams should not be a problem at all.
> >
> > Has anyone checked to see if the readahead logic is working as
> > expected from sendfile? IO from sendfile is a different type of
> > context than IO from user space, there could be sendfile specific
>
> AFAIK, sendfile() and read() use the same readahead logic, which
> handles them equally good.  And there is another readaround logic
> which handles unhinted mmapped reads.

In another thread someone mentioned that this problem may have
something to do with the pools of memory being used for sendfile. The
readahead from sendfile is going into a moderately sized pool. When
you get 100 of them going at once the other threads flush the
readahead data out of the pool before it can be used and thus trigger
the thrashing seek storm. Is this true, that sendfile data is read
ahead into a fixed sized pool? If so, the readahead algorithms would
need to reduce the sendfile window sizes to stop the pool from
thrashing.

--
Jon Smirl
jonsmirl@gmail.com


* Re: [PATCH 07/23] readahead: insert cond_resched() calls
  2006-03-19  3:50   ` Lee Revell
  2006-03-19  5:32     ` Wu Fengguang
@ 2006-03-20 13:31     ` Wu Fengguang
  1 sibling, 0 replies; 37+ messages in thread
From: Wu Fengguang @ 2006-03-20 13:31 UTC (permalink / raw)
  To: Lee Revell; +Cc: Andrew Morton, linux-kernel, Con Kolivas

On Sat, Mar 18, 2006 at 10:50:42PM -0500, Lee Revell wrote:
> On Sun, 2006-03-19 at 10:34 +0800, Wu Fengguang wrote:
> > Since the VM_MAX_READAHEAD is greatly enlarged and the algorithm become more
> > complex, it becomes necessary to insert some cond_resched() calls in the
> > read-ahead path.
> 
> Do you have any numbers on this (say, from Ingo's latency tracer)?  Have
> you confirmed it's not a latency regression with the default settings?

Now I have some numbers from Ingo's latency tracer. When doing 1MB-sized
readaheads, the original kernel has ~5ms latency. This patch reduced that
number to ~1ms.

The trace data were collected with the following command:
        cp 3G_sparse_file /dev/null

The resulting maximum latencies are:

wfg ~% head -n5 latency_trace1
preemption latency trace v1.1.5 on 2.6.16
--------------------------------------------------------------------
 latency: 1169 us, #319/319, CPU#0 | (M:server VP:0, KP:0, SP:0 HP:0 #P:1)
    -----------------
    | task: xmms-5077 (uid:1000 nice:0 policy:0 rt_prio:0)
wfg ~% head -n5 latency_trace2
preemption latency trace v1.1.5 on 2.6.16
--------------------------------------------------------------------
 latency: 4835 us, #14102/14102, CPU#0 | (M:server VP:0, KP:0, SP:0 HP:0 #P:1)
    -----------------
    | task: xmms-5065 (uid:1000 nice:0 policy:0 rt_prio:0)

The following commands show that the 4835us latency is mainly caused by
        __do_page_cache_readahead():    from 575us to 1723us
        mpage_readpages():              from 1724us to 4795us

wfg ~% grep readahead latency_trace1
wfg ~% grep readpages latency_trace1
wfg ~% grep readahead latency_trace2
      cp-5068  0dn..  575us : _read_lock_irq (__do_page_cache_readahead)
      cp-5068  0dn..  576us : radix_tree_lookup_node (__do_page_cache_readahead)
      cp-5068  0dn..  576us : _read_unlock_irq (__do_page_cache_readahead)
      cp-5068  0dn..  577us : __alloc_pages (__do_page_cache_readahead)
      cp-5068  0dn..  580us : _read_lock_irq (__do_page_cache_readahead)
      cp-5068  0dn..  580us : radix_tree_lookup_node (__do_page_cache_readahead)
      cp-5068  0dn..  581us : _read_unlock_irq (__do_page_cache_readahead)
      cp-5068  0dn..  581us : __alloc_pages (__do_page_cache_readahead)
      cp-5068  0dn..  583us : _read_lock_irq (__do_page_cache_readahead)
......[899 LINES SKIPPED]
      cp-5068  0dn.. 1712us : radix_tree_lookup_node (__do_page_cache_readahead)
      cp-5068  0dn.. 1712us : _read_unlock_irq (__do_page_cache_readahead)
      cp-5068  0dn.. 1713us : __alloc_pages (__do_page_cache_readahead)
      cp-5068  0dn.. 1715us : _read_lock_irq (__do_page_cache_readahead)
      cp-5068  0dn.. 1716us : radix_tree_lookup_node (__do_page_cache_readahead)
      cp-5068  0dn.. 1716us : _read_unlock_irq (__do_page_cache_readahead)
      cp-5068  0dn.. 1716us : __alloc_pages (__do_page_cache_readahead)
      cp-5068  0dn.. 1719us : _read_lock_irq (__do_page_cache_readahead)
      cp-5068  0dn.. 1719us : radix_tree_lookup_node (__do_page_cache_readahead)
      cp-5068  0dn.. 1719us : _read_unlock_irq (__do_page_cache_readahead)
      cp-5068  0dn.. 1720us : __alloc_pages (__do_page_cache_readahead)
      cp-5068  0dn.. 1722us : _read_lock_irq (__do_page_cache_readahead)
      cp-5068  0dn.. 1723us : _read_unlock_irq (__do_page_cache_readahead)
      cp-5068  0dn.. 1723us : read_pages (__do_page_cache_readahead)
      cp-5068  0dn.. 4800us : readahead_cache_hit (do_generic_mapping_read)
wfg ~% grep readpages latency_trace2
      cp-5068  0dn.. 1724us : ext3_readpages (read_pages)
      cp-5068  0dn.. 1724us : mpage_readpages (ext3_readpages)
      cp-5068  0dn.. 1725us : add_to_page_cache (mpage_readpages)
      cp-5068  0dn.. 1728us : do_mpage_readpage (mpage_readpages)
      cp-5068  0dn.. 1742us : add_to_page_cache (mpage_readpages)
      cp-5068  0dn.. 1744us : do_mpage_readpage (mpage_readpages)
      cp-5068  0dn.. 1753us : add_to_page_cache (mpage_readpages)
      cp-5068  0dn.. 1755us : do_mpage_readpage (mpage_readpages)
      cp-5068  0dn.. 1764us : add_to_page_cache (mpage_readpages)
......[508 LINES SKIPPED]
      cp-5068  0dn.. 4716us : do_mpage_readpage (mpage_readpages)
      cp-5068  0dn.. 4726us : add_to_page_cache (mpage_readpages)
      cp-5068  0dn.. 4728us : do_mpage_readpage (mpage_readpages)
      cp-5068  0dn.. 4737us : add_to_page_cache (mpage_readpages)
      cp-5068  0dn.. 4739us : do_mpage_readpage (mpage_readpages)
      cp-5068  0dn.. 4748us : __pagevec_lru_add (mpage_readpages)
      cp-5068  0dn.. 4750us : add_to_page_cache (mpage_readpages)
      cp-5068  0dn.. 4752us : do_mpage_readpage (mpage_readpages)
      cp-5068  0dn.. 4761us : add_to_page_cache (mpage_readpages)
      cp-5068  0dn.. 4763us : do_mpage_readpage (mpage_readpages)
      cp-5068  0dn.. 4772us : add_to_page_cache (mpage_readpages)
      cp-5068  0dn.. 4774us : do_mpage_readpage (mpage_readpages)
      cp-5068  0dn.. 4784us : add_to_page_cache (mpage_readpages)
      cp-5068  0dn.. 4785us : do_mpage_readpage (mpage_readpages)
      cp-5068  0dn.. 4795us : __pagevec_lru_add (mpage_readpages)

Cheers,
Wu


* Re: [PATCH 00/23] Adaptive read-ahead V11
  2006-03-19 15:53         ` Jon Smirl
@ 2006-03-20 13:54           ` Wu Fengguang
  0 siblings, 0 replies; 37+ messages in thread
From: Wu Fengguang @ 2006-03-20 13:54 UTC (permalink / raw)
  To: Jon Smirl; +Cc: linux-kernel

On Sun, Mar 19, 2006 at 10:53:46AM -0500, Jon Smirl wrote:
> In another thread someone made a mention that this problem may have
> something to do with the pools of memory being used for sendfile. The
> readahead from sendfile is going into a moderately sized pool. When
> you get 100 of them going at once the other threads flush the
> readahead data out of the pool before it can be used and thus trigger
> the thrashing seek storm. Is this true, that sendfile data is read
> ahead into a fixed sized pool? If so, the readahead algorithms would

The pages are kept in a cache pool made up of all the free memory.
E.g., the following command shows a system with a 331MB cache pool:

% free -m
             total       used       free     shared    buffers     cached
Mem:           488        482          5          0          7        331
-/+ buffers/cache:        142        345
Swap:          127          0        127

That would be more than enough for the stock read-ahead to handle 100
concurrent readers.

> need to reduce the sendfile window sizes to stop the pool from
> thrashing.

Sure, it is the desired behavior. This patch provides exactly this feature :)

Cheers,
Wu


* Re: [PATCH 02/23] radixtree: look-aside cache
  2006-03-19  2:34 ` [PATCH 02/23] radixtree: look-aside cache Wu Fengguang
@ 2006-03-20 16:01   ` Christoph Lameter
  2006-03-21  2:19     ` Wu Fengguang
  0 siblings, 1 reply; 37+ messages in thread
From: Christoph Lameter @ 2006-03-20 16:01 UTC (permalink / raw)
  To: Wu Fengguang; +Cc: Andrew Morton, linux-kernel, Nick Piggin

On Sun, 19 Mar 2006, Wu Fengguang wrote:

> Signed-off-by: Christoph Lameter <clameter@sgi.com>

Hmm... This signoff exists because you are using some bit of earlier 
patches by me?

Typically the _node endings mean that one can specify a node number where 
to allocate memory.

Won't partial lookups like this complicate further work on the radix tree?
Could you get your speed optimizations into radix-tree.c so that others
can use them in the future?


* Re: [PATCH 02/23] radixtree: look-aside cache
  2006-03-20 16:01   ` Christoph Lameter
@ 2006-03-21  2:19     ` Wu Fengguang
  0 siblings, 0 replies; 37+ messages in thread
From: Wu Fengguang @ 2006-03-21  2:19 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: Andrew Morton, linux-kernel, Nick Piggin

On Mon, Mar 20, 2006 at 08:01:01AM -0800, Christoph Lameter wrote:
> On Sun, 19 Mar 2006, Wu Fengguang wrote:
> 
> > Signed-off-by: Christoph Lameter <clameter@sgi.com>
> 
> Hmm... This signoff exists because you are using some bit of earlier 
> patches by me?

Yes, the __lookup_slot() part of your patch named
        [PATCH] radix-tree: Remove unnecessary indirections and clean up code
is integrated into this patch, for Andrew found that it simply cannot be a
standalone patch:
        http://www.ussg.iu.edu/hypermail/linux/kernel/0512.0/0922.html

> Typically the _node endings mean that one can specify a node number where 
> to allocate memory.

Ah, I named it _node because this function is a generalized radix_tree_lookup().
The main difference is that radix_tree_lookup() returns the data in the leaf node,
whereas radix_tree_lookup_node() returns the data or one of its parent nodes.

If _node is a misleading name, would _parent or _level do the job?

> Wont partial lookups like this complicate further work on the radixtree? 

Yes, the new function is a bit confusing ;)
But the code change is minimal. Conceptually it works as follows:

-void *radix_tree_lookup(struct radix_tree_root *root,
-                               unsigned long index)                                              
+void *radix_tree_lookup_node(struct radix_tree_root *root,
+                               unsigned long index, unsigned int level)
 {
        unsigned int height, shift;
        struct radix_tree_node *slot;
@@ -300,7 +304,7 @@ static inline void **__lookup_slot(struc
        shift = (height-1) * RADIX_TREE_MAP_SHIFT;
        slot = root->rnode;

-       while (height > 0) {
+       while (height > level) {
                if (slot == NULL)
                        return NULL;

The main concern might be Nick's RCU patch. Since neither this function
nor the look-aside cache concept collides with RCU's basic assumptions,
the impact should be minimal.

> Could you  get your speed optimizations into radixtree.c so that others 
> can use it in the future?

Sure.
There are three sets of optimized functions that can be used by others:

- radix_tree_lookup_node(): used by the radix tree look-aside cache
                            code and context based read-ahead.
- radix_tree_cache_lookup()/radix_tree_cache_lookup_node():
- radix_tree_scan_hole()/radix_tree_scan_hole_backward():
  currently only used by context based read-ahead.

In the future I'd like to introduce a new function named
radix_tree_scan_data() to replace __lookup(), and rebase
radix_tree_gang_lookup() on it.

This change makes possible a better __do_page_cache_readahead()
implementation based on radix_tree_scan_hole()/radix_tree_scan_data().

Cheers,
Wu


* Re: [PATCH 00/23] Adaptive read-ahead V11
  2006-03-19  3:10 ` [PATCH 00/23] Adaptive read-ahead V11 Jon Smirl
  2006-03-19  3:47   ` Wu Fengguang
@ 2006-03-27 21:38   ` Matt Heler
  2006-03-28  3:44     ` Wu Fengguang
  1 sibling, 1 reply; 37+ messages in thread
From: Matt Heler @ 2006-03-27 21:38 UTC (permalink / raw)
  To: Jon Smirl; +Cc: Wu Fengguang, linux-kernel

We use lighttpd on our servers, and I can say with 100% certainty that this
problem happens a lot. Because of this, we were forced to use the userspace
mechanism that the lighttpd author made to circumvent this issue. However,
with this patch, I'm unable to reproduce any of the problems we had
experienced before. I/O wait has dropped significantly, from 80% to 20%.
I'd be happy to send over some benchmarks if need be.

Matt Heler

On Saturday 18 March 2006 10:10 pm, Jon Smirl wrote:
> This is probably a readahead problem. The lighttpd people that are
> encountering this problem are not regular lkml readers.
>
> http://bugzilla.kernel.org/show_bug.cgi?id=5949
>
> --
> Jon Smirl
> jonsmirl@gmail.com
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/


* Re: [PATCH 00/23] Adaptive read-ahead V11
  2006-03-27 21:38   ` Matt Heler
@ 2006-03-28  3:44     ` Wu Fengguang
  0 siblings, 0 replies; 37+ messages in thread
From: Wu Fengguang @ 2006-03-28  3:44 UTC (permalink / raw)
  To: Matt Heler; +Cc: Jon Smirl, linux-kernel

On Mon, Mar 27, 2006 at 04:38:33PM -0500, Matt Heler wrote:
> We use lighttpd on our servers, and I can say with 100% that this problem 
> happens alot. Because of this , we were forced to use to userspace mechanism 
> that the lighttpd author made to cirvumvent this issue. However with this 
> patch, I'm unable to produce any of the problems we had experienced before. 
> IO-Wait has dropped significantly from 80% to 20%. 
> I'd be happy to send over some benchmarks if need be.

Thanks, your production service would be the best benchmark ;)

Would you send some performance numbers and the basic server configuration?

Wu

