* [PATCH 01/23] readahead: kconfig options
2006-03-19 2:34 [PATCH 00/23] Adaptive read-ahead V11 Wu Fengguang
@ 2006-03-19 2:34 ` Wu Fengguang
2006-03-19 2:34 ` [PATCH 02/23] radixtree: look-aside cache Wu Fengguang
` (22 subsequent siblings)
23 siblings, 0 replies; 37+ messages in thread
From: Wu Fengguang @ 2006-03-19 2:34 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-kernel, Wu Fengguang
[-- Attachment #1: readahead-kconfig-options.patch --]
[-- Type: text/plain, Size: 4248 bytes --]
This patchset introduces a set of adaptive read-ahead methods.
They enable the kernel to better support many important I/O applications.
The functional features include:
- Adaptive read-ahead buffer management
- aggressive, thrashing-safe read-ahead size
- optimal memory utilisation while achieving good I/O throughput
- no need to hand-tune VM_MAX_READAHEAD
- support slow and fast readers at the same time
- support a large number of concurrent readers
- shrinkable look-ahead size
- cut memory consumption by up to 40% in overloaded situations
- Support common access patterns
- multiple streams on one fd
- backward prefetching
- sparse reading
- seeking and reading
- Special case handling
- nfsd support: the raparams cache is no longer required
- laptop mode support: defer look-ahead while the drive is spun down
- loopback file support: avoid double look-ahead
The design strategies are:
- Dual methods design
- stateful method: the fast and default one
- stateless method: the robust and failsafe one
- if anything abnormal happens, the stateful method bails out; the
stateless method then queries the page cache and possibly restarts the
read-ahead process (see the sketch after this list)
- Robust feedback design
- sense and handle important states so that the logic won't run away
- detect danger of thrashing and prevent it in advance
- extensive accounting and debugging traces
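To make the fallback concrete, here is a minimal sketch of the intended
control flow (the function names below are illustrative only, not the
actual ones used later in this series):

	/*
	 * Illustrative sketch: try the fast stateful method first; let
	 * the stateless method rebuild everything from the page cache
	 * whenever the per-file state cannot be trusted.
	 */
	static unsigned long adaptive_readahead(struct address_space *mapping,
						struct file_ra_state *ra,
						pgoff_t index)
	{
		unsigned long ra_size;

		ra_size = state_based_readahead(mapping, ra, index);
		if (ra_size)
			return ra_size;

		/* state lost or abnormal: query the page cache and restart */
		return context_based_readahead(mapping, ra, index);
	}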
This patch:
Add kconfig options to enable/disable:
- adaptive read-ahead logic
- adaptive read-ahead debug traces and events accounting
The read-ahead introduction text is cited from the well-written LWN article
"Adaptive file readahead" <http://lwn.net/Articles/155510/> :)
Signed-off-by: Wu Fengguang <wfg@mail.ustc.edu.cn>
---
mm/Kconfig | 55 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
1 files changed, 55 insertions(+)
--- linux-2.6.16-rc6-mm2.orig/mm/Kconfig
+++ linux-2.6.16-rc6-mm2/mm/Kconfig
@@ -145,3 +145,58 @@ config MIGRATION
while the virtual addresses are not changed. This is useful for
example on NUMA systems to put pages nearer to the processors accessing
the page.
+
+#
+# Adaptive file readahead
+#
+config ADAPTIVE_READAHEAD
+ bool "Adaptive file readahead (EXPERIMENTAL)"
+ default n
+ depends on EXPERIMENTAL
+ help
+ Readahead is a technique employed by the kernel in an attempt
+ to improve file reading performance. If the kernel has reason
+ to believe that a particular file is being read sequentially,
+ it will attempt to read blocks from the file into memory before
+ the application requests them. When readahead works, it speeds
+ up the system's throughput, since the reading application does
+ not have to wait for its requests. When readahead fails, it
+ instead generates useless I/O and occupies memory pages that
+ are needed for some other purpose.
+
+ Normally, the kernel uses a stock readahead logic that is well
+ understood and well tuned. This option enables a more complex and
+ feature-rich one. It is more aggressive and memory-efficient in
+ doing readahead, and supports some less-common access patterns such
+ as reading backward and reading sparsely. However, due to the great
+ diversity of real world applications, it might not fit everyone.
+
+ Please refer to Documentation/sysctl/vm.txt for tunable parameters.
+
+ Say Y here if you are building a kernel for file servers.
+ Say N if you are unsure.
+
+config DEBUG_READAHEAD
+ bool "Readahead debug and accounting"
+ default n
+ depends on ADAPTIVE_READAHEAD
+ select DEBUG_FS
+ help
+ This option injects extra code to dump detailed debug traces and do
+ readahead events accounting.
+
+ To actually get the data:
+
+ mkdir /debug
+ mount -t debugfs none /debug
+
+ After that you can do the following:
+
+ echo > /debug/readahead/events # reset the counters
+ cat /debug/readahead/events # check the counters
+
+ echo 1 > /debug/readahead/debug_level # show printk traces
+ echo 2 > /debug/readahead/debug_level # show verbose printk traces
+ echo 0 > /debug/readahead/debug_level # stop filling my kern.log
+
+ Say N, unless you have readahead performance problems.
--
* [PATCH 02/23] radixtree: look-aside cache
2006-03-19 2:34 [PATCH 00/23] Adaptive read-ahead V11 Wu Fengguang
2006-03-19 2:34 ` [PATCH 01/23] readahead: kconfig options Wu Fengguang
@ 2006-03-19 2:34 ` Wu Fengguang
2006-03-20 16:01 ` Christoph Lameter
2006-03-19 2:34 ` [PATCH 03/23] radixtree: hole scanning functions Wu Fengguang
` (21 subsequent siblings)
23 siblings, 1 reply; 37+ messages in thread
From: Wu Fengguang @ 2006-03-19 2:34 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-kernel, Nick Piggin, Christoph Lameter, Wu Fengguang
[-- Attachment #1: radixtree-lookaside-cache.patch --]
[-- Type: text/plain, Size: 9379 bytes --]
Introduce a set of lookup functions to radix tree for the read-ahead logic.
Other access patterns with high locality may also benefit from them.
- radix_tree_lookup_node(root, index, level)
Perform partial lookup, return the @level'th parent of the slot at
@index.
- radix_tree_cache_xxx()
Init/Query the cache.
- radix_tree_cache_lookup(root, cache, index)
Perform lookup with the aid of a look-aside cache.
For sequential scans, it has an amortized time complexity of 64*O(1) +
1*O(logN): each cached leaf node covers 64 slots, so only one full tree
descent is needed per 64 lookups.
Typical usage:
void func() {
+ struct radix_tree_cache cache;
+
+ radix_tree_cache_init(&cache);
read_lock_irq(&mapping->tree_lock);
for(;;) {
- page = radix_tree_lookup(&mapping->page_tree, index);
+ page = radix_tree_cache_lookup(&mapping->page_tree, &cache, index);
}
read_unlock_irq(&mapping->tree_lock);
}
Acked-by: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Wu Fengguang <wfg@mail.ustc.edu.cn>
---
include/linux/radix-tree.h | 78 +++++++++++++++++++++++++++++++
lib/radix-tree.c | 110 ++++++++++++++++++++++++++++++++-------------
2 files changed, 156 insertions(+), 32 deletions(-)
--- linux-2.6.16-rc6-mm2.orig/include/linux/radix-tree.h
+++ linux-2.6.16-rc6-mm2/include/linux/radix-tree.h
@@ -23,12 +23,24 @@
#include <linux/preempt.h>
#include <linux/types.h>
+#define RADIX_TREE_MAP_SHIFT 6
+#define RADIX_TREE_MAP_SIZE (1UL << RADIX_TREE_MAP_SHIFT)
+#define RADIX_TREE_MAP_MASK (RADIX_TREE_MAP_SIZE-1)
+
struct radix_tree_root {
unsigned int height;
gfp_t gfp_mask;
struct radix_tree_node *rnode;
};
+/*
+ * Lookaside cache to support access patterns with strong locality.
+ */
+struct radix_tree_cache {
+ unsigned long first_index;
+ struct radix_tree_node *tree_node;
+};
+
#define RADIX_TREE_INIT(mask) { \
.height = 0, \
.gfp_mask = (mask), \
@@ -48,9 +60,14 @@ do { \
#define RADIX_TREE_MAX_TAGS 2
int radix_tree_insert(struct radix_tree_root *, unsigned long, void *);
-void *radix_tree_lookup(struct radix_tree_root *, unsigned long);
-void **radix_tree_lookup_slot(struct radix_tree_root *, unsigned long);
+void *radix_tree_lookup_node(struct radix_tree_root *, unsigned long,
+ unsigned int);
+void **radix_tree_lookup_slot(struct radix_tree_root *root, unsigned long);
void *radix_tree_delete(struct radix_tree_root *, unsigned long);
+unsigned int radix_tree_cache_count(struct radix_tree_cache *cache);
+void *radix_tree_cache_lookup_node(struct radix_tree_root *root,
+ struct radix_tree_cache *cache,
+ unsigned long index, unsigned int level);
unsigned int
radix_tree_gang_lookup(struct radix_tree_root *root, void **results,
unsigned long first_index, unsigned int max_items);
@@ -73,4 +90,61 @@ static inline void radix_tree_preload_en
preempt_enable();
}
+/**
+ * radix_tree_lookup - perform lookup operation on a radix tree
+ * @root: radix tree root
+ * @index: index key
+ *
+ * Lookup the item at the position @index in the radix tree @root.
+ */
+static inline void *radix_tree_lookup(struct radix_tree_root *root,
+ unsigned long index)
+{
+ return radix_tree_lookup_node(root, index, 0);
+}
+
+/**
+ * radix_tree_cache_init - init a look-aside cache
+ * @cache: look-aside cache
+ *
+ * Init the radix tree look-aside cache @cache.
+ */
+static inline void radix_tree_cache_init(struct radix_tree_cache *cache)
+{
+ cache->first_index = RADIX_TREE_MAP_MASK;
+ cache->tree_node = NULL;
+}
+
+/**
+ * radix_tree_cache_lookup - cached lookup on a radix tree
+ * @root: radix tree root
+ * @cache: look-aside cache
+ * @index: index key
+ *
+ * Lookup the item at the position @index in the radix tree @root,
+ * and make use of @cache to speedup the lookup process.
+ */
+static inline void *radix_tree_cache_lookup(struct radix_tree_root *root,
+ struct radix_tree_cache *cache,
+ unsigned long index)
+{
+ return radix_tree_cache_lookup_node(root, cache, index, 0);
+}
+
+static inline unsigned int radix_tree_cache_size(struct radix_tree_cache *cache)
+{
+ return RADIX_TREE_MAP_SIZE;
+}
+
+static inline int radix_tree_cache_full(struct radix_tree_cache *cache)
+{
+ return radix_tree_cache_count(cache) == radix_tree_cache_size(cache);
+}
+
+static inline unsigned long
+radix_tree_cache_first_index(struct radix_tree_cache *cache)
+{
+ return cache->first_index;
+}
+
#endif /* _LINUX_RADIX_TREE_H */
--- linux-2.6.16-rc6-mm2.orig/lib/radix-tree.c
+++ linux-2.6.16-rc6-mm2/lib/radix-tree.c
@@ -32,15 +32,6 @@
#include <linux/bitops.h>
-#ifdef __KERNEL__
-#define RADIX_TREE_MAP_SHIFT 6
-#else
-#define RADIX_TREE_MAP_SHIFT 3 /* For more stressful testing */
-#endif
-
-#define RADIX_TREE_MAP_SIZE (1UL << RADIX_TREE_MAP_SHIFT)
-#define RADIX_TREE_MAP_MASK (RADIX_TREE_MAP_SIZE-1)
-
#define RADIX_TREE_TAG_LONGS \
((RADIX_TREE_MAP_SIZE + BITS_PER_LONG - 1) / BITS_PER_LONG)
@@ -289,32 +280,89 @@ int radix_tree_insert(struct radix_tree_
}
EXPORT_SYMBOL(radix_tree_insert);
-static inline void **__lookup_slot(struct radix_tree_root *root,
- unsigned long index)
+/**
+ * radix_tree_lookup_node - low level lookup routine
+ * @root: radix tree root
+ * @index: index key
+ * @level: stop at that many levels from the tree leaf
+ *
+ * Lookup the item at the position @index in the radix tree @root.
+ * The return value is:
+ * @level == 0: page at @index;
+ * @level == 1: the corresponding bottom level tree node;
+ * @level < height: (@level-1)th parent node of the bottom node
+ * that contains @index;
+ * @level >= height: the root node.
+ */
+void *radix_tree_lookup_node(struct radix_tree_root *root,
+ unsigned long index, unsigned int level)
{
unsigned int height, shift;
- struct radix_tree_node **slot;
+ struct radix_tree_node *slot;
height = root->height;
if (index > radix_tree_maxindex(height))
return NULL;
shift = (height-1) * RADIX_TREE_MAP_SHIFT;
- slot = &root->rnode;
+ slot = root->rnode;
- while (height > 0) {
- if (*slot == NULL)
+ while (height > level) {
+ if (slot == NULL)
return NULL;
- slot = (struct radix_tree_node **)
- ((*slot)->slots +
- ((index >> shift) & RADIX_TREE_MAP_MASK));
+ slot = slot->slots[(index >> shift) & RADIX_TREE_MAP_MASK];
shift -= RADIX_TREE_MAP_SHIFT;
height--;
}
- return (void **)slot;
+ return slot;
+}
+EXPORT_SYMBOL(radix_tree_lookup_node);
+
+/**
+ * radix_tree_cache_lookup_node - cached lookup node
+ * @root: radix tree root
+ * @cache: look-aside cache
+ * @index: index key
+ *
+ * Lookup the item at the position @index in the radix tree @root,
+ * and return the node @level levels from the bottom in the search path.
+ *
+ * @cache stores the last accessed upper level tree node by this
+ * function, and is always checked first before searching in the tree.
+ * It can improve speed for access patterns with strong locality.
+ *
+ * NOTE:
+ * - The cache becomes invalid on leaving the lock;
+ * - Do not intermix calls with different @level.
+ */
+void *radix_tree_cache_lookup_node(struct radix_tree_root *root,
+ struct radix_tree_cache *cache,
+ unsigned long index, unsigned int level)
+{
+ struct radix_tree_node *node;
+ unsigned long i;
+ unsigned long mask;
+
+ if (level >= root->height)
+ return root->rnode;
+
+ i = ((index >> (level * RADIX_TREE_MAP_SHIFT)) & RADIX_TREE_MAP_MASK);
+ mask = ~((RADIX_TREE_MAP_SIZE << (level * RADIX_TREE_MAP_SHIFT)) - 1);
+
+ if ((index & mask) == cache->first_index)
+ return cache->tree_node->slots[i];
+
+ node = radix_tree_lookup_node(root, index, level + 1);
+ if (!node)
+ return 0;
+
+ cache->tree_node = node;
+ cache->first_index = (index & mask);
+ return node->slots[i];
}
+EXPORT_SYMBOL(radix_tree_cache_lookup_node);
/**
* radix_tree_lookup_slot - lookup a slot in a radix tree
@@ -326,25 +374,27 @@ static inline void **__lookup_slot(struc
*/
void **radix_tree_lookup_slot(struct radix_tree_root *root, unsigned long index)
{
- return __lookup_slot(root, index);
+ struct radix_tree_node *node;
+
+ node = radix_tree_lookup_node(root, index, 1);
+ return node ? node->slots + (index & RADIX_TREE_MAP_MASK) : NULL;
}
EXPORT_SYMBOL(radix_tree_lookup_slot);
/**
- * radix_tree_lookup - perform lookup operation on a radix tree
- * @root: radix tree root
- * @index: index key
+ * radix_tree_cache_count - items in the cached node
+ * @cache: radix tree look-aside cache
*
- * Lookup the item at the position @index in the radix tree @root.
+ * Query the number of items contained in the cached node.
*/
-void *radix_tree_lookup(struct radix_tree_root *root, unsigned long index)
+unsigned int radix_tree_cache_count(struct radix_tree_cache *cache)
{
- void **slot;
-
- slot = __lookup_slot(root, index);
- return slot != NULL ? *slot : NULL;
+ if (!(cache->first_index & RADIX_TREE_MAP_MASK))
+ return cache->tree_node->count;
+ else
+ return 0;
}
-EXPORT_SYMBOL(radix_tree_lookup);
+EXPORT_SYMBOL(radix_tree_cache_count);
/**
* radix_tree_tag_set - set a tag on a radix tree node
--
* Re: [PATCH 02/23] radixtree: look-aside cache
2006-03-19 2:34 ` [PATCH 02/23] radixtree: look-aside cache Wu Fengguang
@ 2006-03-20 16:01 ` Christoph Lameter
2006-03-21 2:19 ` Wu Fengguang
0 siblings, 1 reply; 37+ messages in thread
From: Christoph Lameter @ 2006-03-20 16:01 UTC (permalink / raw)
To: Wu Fengguang; +Cc: Andrew Morton, linux-kernel, Nick Piggin
On Sun, 19 Mar 2006, Wu Fengguang wrote:
> Signed-off-by: Christoph Lameter <clameter@sgi.com>
Hmm... This signoff exists because you are using some bit of earlier
patches by me?
Typically the _node endings mean that one can specify a node number where
to allocate memory.
Won't partial lookups like this complicate further work on the radixtree?
Could you get your speed optimizations into radixtree.c so that others
can use it in the future?
* Re: [PATCH 02/23] radixtree: look-aside cache
2006-03-20 16:01 ` Christoph Lameter
@ 2006-03-21 2:19 ` Wu Fengguang
0 siblings, 0 replies; 37+ messages in thread
From: Wu Fengguang @ 2006-03-21 2:19 UTC (permalink / raw)
To: Christoph Lameter; +Cc: Andrew Morton, linux-kernel, Nick Piggin
On Mon, Mar 20, 2006 at 08:01:01AM -0800, Christoph Lameter wrote:
> On Sun, 19 Mar 2006, Wu Fengguang wrote:
>
> > Signed-off-by: Christoph Lameter <clameter@sgi.com>
>
> Hmm... This signoff exists because you are using some bit of earlier
> patches by me?
Yes, the __lookup_slot() part of your patch named
[PATCH] radix-tree: Remove unnecessary indirections and clean up code
is integrated into this patch, because Andrew found that it simply cannot be a
standalone patch:
http://www.ussg.iu.edu/hypermail/linux/kernel/0512.0/0922.html
> Typically the _node endings mean that one can specify a node number where
> to allocate memory.
Ah, I named it _node because this function is a generalized radix_tree_lookup().
The main difference is that radix_tree_lookup() returns the data in leaf node,
whereas radix_tree_lookup_node() returns the data or one of its parent nodes.
If _node is a misleading name, will _parent or _level do the job?
> Won't partial lookups like this complicate further work on the radixtree?
Yes, the new function is a bit confusing ;)
But the code change is minimal. Conceptually it works as follows:
-void *radix_tree_lookup(struct radix_tree_root *root,
- unsigned long index)
+void *radix_tree_lookup_node(struct radix_tree_root *root,
+ unsigned long index, unsigned int level)
{
unsigned int height, shift;
struct radix_tree_node *slot;
@@ -300,7 +304,7 @@ static inline void **__lookup_slot(struc
shift = (height-1) * RADIX_TREE_MAP_SHIFT;
slot = root->rnode;
- while (height > 0) {
+ while (height > level) {
if (slot == NULL)
return NULL;
The main concern might be Nick's RCU patch. Since neither this function nor
the look-aside cache concept collides with RCU's basic
assumptions, the impact should be minimal.
> Could you get your speed optimizations into radixtree.c so that others
> can use it in the future?
Sure.
There are three sets of optimized functions that can be used by others:
- radix_tree_lookup_node(): used by the radix tree look-aside cache
code and context based read-ahead.
- radix_tree_cache_lookup()/radix_tree_cache_lookup_node():
- radix_tree_scan_hole()/radix_tree_scan_hole_backward():
currently only used by context based read-ahead.
In the future I'd like to introduce a new function named
radix_tree_scan_data() to replace __lookup(), and rebase
radix_tree_gang_lookup() on it.
This change makes possible a better __do_page_cache_readahead()
implementation based on radix_tree_scan_hole()/radix_tree_scan_data().
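To illustrate the idea, here is a rough sketch, assuming the proposed
radix_tree_scan_data() mirrors the radix_tree_scan_hole() signature
(neither this caller nor radix_tree_scan_data() exists yet):

	/*
	 * Hypothetical: compute the next chunk [start, end) worth reading,
	 * skipping pages that are already cached.  Locking omitted.
	 */
	static void next_readahead_chunk(struct address_space *mapping,
					 pgoff_t offset, unsigned long nr,
					 pgoff_t *start, pgoff_t *end)
	{
		/* first missing page at/after @offset */
		*start = radix_tree_scan_hole(&mapping->page_tree, offset, nr);

		/* the chunk ends at the next already-cached page */
		*end = radix_tree_scan_data(&mapping->page_tree, *start,
					    offset + nr - *start);
	}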
Cheers,
Wu
* [PATCH 03/23] radixtree: hole scanning functions
2006-03-19 2:34 [PATCH 00/23] Adaptive read-ahead V11 Wu Fengguang
2006-03-19 2:34 ` [PATCH 01/23] readahead: kconfig options Wu Fengguang
2006-03-19 2:34 ` [PATCH 02/23] radixtree: look-aside cache Wu Fengguang
@ 2006-03-19 2:34 ` Wu Fengguang
2006-03-19 2:34 ` [PATCH 04/23] readahead: page flag PG_readahead Wu Fengguang
` (20 subsequent siblings)
23 siblings, 0 replies; 37+ messages in thread
From: Wu Fengguang @ 2006-03-19 2:34 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-kernel, Wu Fengguang
[-- Attachment #1: radixtree-hole-scanning-functions.patch --]
[-- Type: text/plain, Size: 3778 bytes --]
Introduce a pair of functions to scan the radix tree for a hole/empty item.
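As a usage sketch (the real callers appear in later patches; locking and
boundary handling are omitted), the length of the cached run of pages just
below @index could be measured like this:

	/*
	 * Sketch: count how many consecutive pages are cached right
	 * before @index; the caller must hold mapping->tree_lock.
	 */
	pgoff_t first_hole = radix_tree_scan_hole_backward(&mapping->page_tree,
							   index - 1, max_scan);
	unsigned long cached_run = index - 1 - first_hole;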
include/linux/radix-tree.h | 4 +
lib/radix-tree.c | 104 +++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 108 insertions(+)
Signed-off-by: Wu Fengguang <wfg@mail.ustc.edu.cn>
---
--- linux-2.6.16-rc6-mm2.orig/include/linux/radix-tree.h
+++ linux-2.6.16-rc6-mm2/include/linux/radix-tree.h
@@ -68,6 +68,10 @@ unsigned int radix_tree_cache_count(stru
void *radix_tree_cache_lookup_node(struct radix_tree_root *root,
struct radix_tree_cache *cache,
unsigned long index, unsigned int level);
+unsigned long radix_tree_scan_hole_backward(struct radix_tree_root *root,
+ unsigned long index, unsigned long max_scan);
+unsigned long radix_tree_scan_hole(struct radix_tree_root *root,
+ unsigned long index, unsigned long max_scan);
unsigned int
radix_tree_gang_lookup(struct radix_tree_root *root, void **results,
unsigned long first_index, unsigned int max_items);
--- linux-2.6.16-rc6-mm2.orig/lib/radix-tree.c
+++ linux-2.6.16-rc6-mm2/lib/radix-tree.c
@@ -397,6 +397,110 @@ unsigned int radix_tree_cache_count(stru
EXPORT_SYMBOL(radix_tree_cache_count);
/**
+ * radix_tree_scan_hole_backward - scan backward for hole
+ * @root: radix tree root
+ * @index: index key
+ * @max_scan: advice on max items to scan (it may scan a little more)
+ *
+ * Scan backward from @index for a hole/empty item, stop when
+ * - hit hole
+ * - @max_scan or more items scanned
+ * - hit index 0
+ *
+ * Return the corresponding index.
+ */
+unsigned long radix_tree_scan_hole_backward(struct radix_tree_root *root,
+ unsigned long index, unsigned long max_scan)
+{
+ struct radix_tree_cache cache;
+ struct radix_tree_node *node;
+ unsigned long origin;
+ int i;
+
+ origin = index;
+ radix_tree_cache_init(&cache);
+
+ while (origin - index < max_scan) {
+ node = radix_tree_cache_lookup_node(root, &cache, index, 1);
+ if (!node)
+ break;
+
+ if (node->count == RADIX_TREE_MAP_SIZE) {
+ index = (index - RADIX_TREE_MAP_SIZE) |
+ RADIX_TREE_MAP_MASK;
+ goto check_underflow;
+ }
+
+ for (i = index & RADIX_TREE_MAP_MASK; i >= 0; i--, index--) {
+ if (!node->slots[i])
+ goto out;
+ }
+
+check_underflow:
+ if (unlikely(index == ULONG_MAX)) {
+ index = 0;
+ break;
+ }
+ }
+
+out:
+ return index;
+}
+EXPORT_SYMBOL(radix_tree_scan_hole_backward);
+
+/**
+ * radix_tree_scan_hole - scan for hole
+ * @root: radix tree root
+ * @index: index key
+ * @max_scan: advice on max items to scan (it may scan a little more)
+ *
+ * Scan forward from @index for a hole/empty item, stop when
+ * - hit hole
+ * - hit EOF
+ * - hit index ULONG_MAX
+ * - @max_scan or more items scanned
+ *
+ * Return the corresponding index.
+ */
+unsigned long radix_tree_scan_hole(struct radix_tree_root *root,
+ unsigned long index, unsigned long max_scan)
+{
+ struct radix_tree_cache cache;
+ struct radix_tree_node *node;
+ unsigned long origin;
+ int i;
+
+ origin = index;
+ radix_tree_cache_init(&cache);
+
+ while (index - origin < max_scan) {
+ node = radix_tree_cache_lookup_node(root, &cache, index, 1);
+ if (!node)
+ break;
+
+ if (node->count == RADIX_TREE_MAP_SIZE) {
+ index = (index | RADIX_TREE_MAP_MASK) + 1;
+ goto check_overflow;
+ }
+
+ for (i = index & RADIX_TREE_MAP_MASK; i < RADIX_TREE_MAP_SIZE;
+ i++, index++) {
+ if (!node->slots[i])
+ goto out;
+ }
+
+check_overflow:
+ if (unlikely(!index)) {
+ index = ULONG_MAX;
+ break;
+ }
+ }
+out:
+ return index;
+}
+EXPORT_SYMBOL(radix_tree_scan_hole);
+
+/**
* radix_tree_tag_set - set a tag on a radix tree node
* @root: radix tree root
* @index: index key
--
* [PATCH 04/23] readahead: page flag PG_readahead
2006-03-19 2:34 [PATCH 00/23] Adaptive read-ahead V11 Wu Fengguang
` (2 preceding siblings ...)
2006-03-19 2:34 ` [PATCH 03/23] radixtree: hole scanning functions Wu Fengguang
@ 2006-03-19 2:34 ` Wu Fengguang
2006-03-19 2:34 ` [PATCH 05/23] readahead: refactor do_generic_mapping_read() Wu Fengguang
` (19 subsequent siblings)
23 siblings, 0 replies; 37+ messages in thread
From: Wu Fengguang @ 2006-03-19 2:34 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-kernel, Wu Fengguang
[-- Attachment #1: readahead-page-flag-PG_readahead.patch --]
[-- Type: text/plain, Size: 1851 bytes --]
A new page flag PG_readahead is introduced as a look-ahead mark, which
reminds the caller to give the adaptive read-ahead logic a chance to do
read-ahead ahead of time for I/O pipelining.
It roughly corresponds to `ahead_start' of the stock read-ahead logic.
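A sketch of the intended protocol on the consumer side, using the
accessors added below (the actual call site comes in later patches, and
the function name invoked here is made up):

	/* in the read path, for each page being accessed */
	if (TestClearPageReadahead(page))
		/* look-ahead mark hit: pipeline the next chunk now */
		submit_next_readahead(mapping, filp, ra, page);

The mark is test-and-cleared, so each look-ahead page triggers at most one
early read-ahead submission.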
Signed-off-by: Wu Fengguang <wfg@mail.ustc.edu.cn>
---
include/linux/page-flags.h | 5 +++++
mm/page_alloc.c | 2 +-
2 files changed, 6 insertions(+), 1 deletion(-)
--- linux-2.6.16-rc6-mm2.orig/include/linux/page-flags.h
+++ linux-2.6.16-rc6-mm2/include/linux/page-flags.h
@@ -76,6 +76,7 @@
#define PG_reclaim 17 /* To be reclaimed asap */
#define PG_nosave_free 18 /* Free, should not be written */
#define PG_uncached 19 /* Page has been mapped as uncached */
+#define PG_readahead 20 /* Reminder to do readahead */
/*
* Global page accounting. One instance per CPU. Only unsigned longs are
@@ -343,6 +344,10 @@ extern void __mod_page_state_offset(unsi
#define SetPageUncached(page) set_bit(PG_uncached, &(page)->flags)
#define ClearPageUncached(page) clear_bit(PG_uncached, &(page)->flags)
+#define PageReadahead(page) test_bit(PG_readahead, &(page)->flags)
+#define __SetPageReadahead(page) __set_bit(PG_readahead, &(page)->flags)
+#define TestClearPageReadahead(page) test_and_clear_bit(PG_readahead, &(page)->flags)
+
struct page; /* forward declaration */
int test_clear_page_dirty(struct page *page);
--- linux-2.6.16-rc6-mm2.orig/mm/page_alloc.c
+++ linux-2.6.16-rc6-mm2/mm/page_alloc.c
@@ -549,7 +549,7 @@ static int prep_new_page(struct page *pa
if (PageReserved(page))
return 1;
- page->flags &= ~(1 << PG_uptodate | 1 << PG_error |
+ page->flags &= ~(1 << PG_uptodate | 1 << PG_error | 1 << PG_readahead |
1 << PG_referenced | 1 << PG_arch_1 |
1 << PG_checked | 1 << PG_mappedtodisk);
set_page_private(page, 0);
--
* [PATCH 05/23] readahead: refactor do_generic_mapping_read()
2006-03-19 2:34 [PATCH 00/23] Adaptive read-ahead V11 Wu Fengguang
` (3 preceding siblings ...)
2006-03-19 2:34 ` [PATCH 04/23] readahead: page flag PG_readahead Wu Fengguang
@ 2006-03-19 2:34 ` Wu Fengguang
2006-03-19 2:34 ` [PATCH 06/23] readahead: refactor __do_page_cache_readahead() Wu Fengguang
` (18 subsequent siblings)
23 siblings, 0 replies; 37+ messages in thread
From: Wu Fengguang @ 2006-03-19 2:34 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-kernel, Wu Fengguang
[-- Attachment #1: readahead-refactor-do_generic_mapping_read.patch --]
[-- Type: text/plain, Size: 2536 bytes --]
In do_generic_mapping_read(), delay releasing the accessed page, so that
it can be passed to and used by the adaptive read-ahead code.
Signed-off-by: Wu Fengguang <wfg@mail.ustc.edu.cn>
---
mm/filemap.c | 18 ++++++++++++------
1 files changed, 12 insertions(+), 6 deletions(-)
--- linux-2.6.16-rc6-mm2.orig/mm/filemap.c
+++ linux-2.6.16-rc6-mm2/mm/filemap.c
@@ -799,10 +799,12 @@ void do_generic_mapping_read(struct addr
unsigned long prev_index;
loff_t isize;
struct page *cached_page;
+ struct page *prev_page;
int error;
struct file_ra_state ra = *_ra;
cached_page = NULL;
+ prev_page = NULL;
index = *ppos >> PAGE_CACHE_SHIFT;
next_index = index;
prev_index = ra.prev_page;
@@ -841,6 +843,11 @@ find_page:
handle_ra_miss(mapping, &ra, index);
goto no_cached_page;
}
+
+ if (prev_page)
+ page_cache_release(prev_page);
+ prev_page = page;
+
if (!PageUptodate(page))
goto page_not_up_to_date;
page_ok:
@@ -875,7 +882,6 @@ page_ok:
index += offset >> PAGE_CACHE_SHIFT;
offset &= ~PAGE_CACHE_MASK;
- page_cache_release(page);
if (ret == nr && desc->count)
continue;
goto out;
@@ -887,7 +893,6 @@ page_not_up_to_date:
/* Did it get unhashed before we got the lock? */
if (!page->mapping) {
unlock_page(page);
- page_cache_release(page);
continue;
}
@@ -917,7 +922,6 @@ readpage:
* invalidate_inode_pages got it
*/
unlock_page(page);
- page_cache_release(page);
goto find_page;
}
unlock_page(page);
@@ -938,7 +942,6 @@ readpage:
isize = i_size_read(inode);
end_index = (isize - 1) >> PAGE_CACHE_SHIFT;
if (unlikely(!isize || index > end_index)) {
- page_cache_release(page);
goto out;
}
@@ -947,7 +950,6 @@ readpage:
if (index == end_index) {
nr = ((isize - 1) & ~PAGE_CACHE_MASK) + 1;
if (nr <= offset) {
- page_cache_release(page);
goto out;
}
}
@@ -957,7 +959,6 @@ readpage:
readpage_error:
/* UHHUH! A synchronous read error occurred. Report it */
desc->error = error;
- page_cache_release(page);
goto out;
no_cached_page:
@@ -982,6 +983,9 @@ no_cached_page:
}
page = cached_page;
cached_page = NULL;
+ if (prev_page)
+ page_cache_release(prev_page);
+ prev_page = page;
goto readpage;
}
@@ -991,6 +995,8 @@ out:
*ppos = ((loff_t) index << PAGE_CACHE_SHIFT) + offset;
if (cached_page)
page_cache_release(cached_page);
+ if (prev_page)
+ page_cache_release(prev_page);
if (filp)
file_accessed(filp);
}
--
* [PATCH 06/23] readahead: refactor __do_page_cache_readahead()
2006-03-19 2:34 [PATCH 00/23] Adaptive read-ahead V11 Wu Fengguang
` (4 preceding siblings ...)
2006-03-19 2:34 ` [PATCH 05/23] readahead: refactor do_generic_mapping_read() Wu Fengguang
@ 2006-03-19 2:34 ` Wu Fengguang
2006-03-19 2:34 ` [PATCH 07/23] readahead: insert cond_resched() calls Wu Fengguang
` (17 subsequent siblings)
23 siblings, 0 replies; 37+ messages in thread
From: Wu Fengguang @ 2006-03-19 2:34 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-kernel, Wu Fengguang
[-- Attachment #1: readahead-refactor-__do_page_cache_readahead.patch --]
[-- Type: text/plain, Size: 2546 bytes --]
Add look-ahead support to __do_page_cache_readahead(), which is needed by
the adaptive read-ahead logic.
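For example, a call with nr_to_read = 32 and lookahead_size = 8 reads 32
pages and tags the page allocated at page_idx 24 (= 32 - 8) with
PG_readahead, so the reader hits the mark 8 pages before the chunk is
exhausted and the next read-ahead can be submitted in time. Passing
lookahead_size = 0, as all the existing callers below do, marks no page
and keeps the old behaviour.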
Signed-off-by: Wu Fengguang <wfg@mail.ustc.edu.cn>
---
mm/readahead.c | 15 +++++++++------
1 files changed, 9 insertions(+), 6 deletions(-)
--- linux-2.6.16-rc6-mm2.orig/mm/readahead.c
+++ linux-2.6.16-rc6-mm2/mm/readahead.c
@@ -268,7 +268,8 @@ out:
*/
static int
__do_page_cache_readahead(struct address_space *mapping, struct file *filp,
- pgoff_t offset, unsigned long nr_to_read)
+ pgoff_t offset, unsigned long nr_to_read,
+ unsigned long lookahead_size)
{
struct inode *inode = mapping->host;
struct page *page;
@@ -281,7 +282,7 @@ __do_page_cache_readahead(struct address
if (isize == 0)
goto out;
- end_index = ((isize - 1) >> PAGE_CACHE_SHIFT);
+ end_index = ((isize - 1) >> PAGE_CACHE_SHIFT);
/*
* Preallocate as many pages as we will need.
@@ -304,6 +305,8 @@ __do_page_cache_readahead(struct address
break;
page->index = page_offset;
list_add(&page->lru, &page_pool);
+ if (page_idx == nr_to_read - lookahead_size)
+ __SetPageReadahead(page);
ret++;
}
read_unlock_irq(&mapping->tree_lock);
@@ -340,7 +343,7 @@ int force_page_cache_readahead(struct ad
if (this_chunk > nr_to_read)
this_chunk = nr_to_read;
err = __do_page_cache_readahead(mapping, filp,
- offset, this_chunk);
+ offset, this_chunk, 0);
if (err < 0) {
ret = err;
break;
@@ -387,7 +390,7 @@ int do_page_cache_readahead(struct addre
if (bdi_read_congested(mapping->backing_dev_info))
return -1;
- return __do_page_cache_readahead(mapping, filp, offset, nr_to_read);
+ return __do_page_cache_readahead(mapping, filp, offset, nr_to_read, 0);
}
/*
@@ -407,7 +410,7 @@ blockable_page_cache_readahead(struct ad
if (!block && bdi_read_congested(mapping->backing_dev_info))
return 0;
- actual = __do_page_cache_readahead(mapping, filp, offset, nr_to_read);
+ actual = __do_page_cache_readahead(mapping, filp, offset, nr_to_read, 0);
return check_ra_success(ra, nr_to_read, actual);
}
@@ -452,7 +455,7 @@ static int make_ahead_window(struct addr
* @req_size: hint: total size of the read which the caller is performing in
* PAGE_CACHE_SIZE units
*
- * page_cache_readahead() is the main function. If performs the adaptive
+ * page_cache_readahead() is the main function. It performs the adaptive
* readahead window size management and submits the readahead I/O.
*
* Note that @filp is purely used for passing on to the ->readpage[s]()
--
* [PATCH 07/23] readahead: insert cond_resched() calls
2006-03-19 2:34 [PATCH 00/23] Adaptive read-ahead V11 Wu Fengguang
` (5 preceding siblings ...)
2006-03-19 2:34 ` [PATCH 06/23] readahead: refactor __do_page_cache_readahead() Wu Fengguang
@ 2006-03-19 2:34 ` Wu Fengguang
2006-03-19 3:50 ` Lee Revell
2006-03-19 2:34 ` [PATCH 08/23] readahead: common macros Wu Fengguang
` (16 subsequent siblings)
23 siblings, 1 reply; 37+ messages in thread
From: Wu Fengguang @ 2006-03-19 2:34 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-kernel, Con Kolivas, Wu Fengguang
[-- Attachment #1: readahead-insert-cond_resched-calls.patch --]
[-- Type: text/plain, Size: 2171 bytes --]
Since VM_MAX_READAHEAD is greatly enlarged and the algorithm has become more
complex, it becomes necessary to insert some cond_resched() calls in the
read-ahead path.
If desktop users still feel audio jitters with the new read-ahead code,
please try one of the following ways to get rid of it:
1) compile kernel with CONFIG_PREEMPT_VOLUNTARY/CONFIG_PREEMPT
2) reduce the read-ahead request size by running
blockdev --setra 256 /dev/hda # or whatever device you are using
Signed-off-by: Wu Fengguang <wfg@mail.ustc.edu.cn>
---
This patch was recommended by Con Kolivas to improve response time for desktops.
Thanks!
fs/mpage.c | 4 +++-
mm/readahead.c | 9 +++++++--
2 files changed, 10 insertions(+), 3 deletions(-)
--- linux-2.6.16-rc6-mm2.orig/mm/readahead.c
+++ linux-2.6.16-rc6-mm2/mm/readahead.c
@@ -145,8 +145,10 @@ int read_cache_pages(struct address_spac
continue;
}
ret = filler(data, page);
- if (!pagevec_add(&lru_pvec, page))
+ if (!pagevec_add(&lru_pvec, page)) {
+ cond_resched();
__pagevec_lru_add(&lru_pvec);
+ }
if (ret) {
while (!list_empty(pages)) {
struct page *victim;
@@ -184,8 +186,10 @@ static int read_pages(struct address_spa
page->index, GFP_KERNEL)) {
ret = mapping->a_ops->readpage(filp, page);
if (ret != AOP_TRUNCATED_PAGE) {
- if (!pagevec_add(&lru_pvec, page))
+ if (!pagevec_add(&lru_pvec, page)) {
+ cond_resched();
__pagevec_lru_add(&lru_pvec);
+ }
continue;
} /* else fall through to release */
}
@@ -299,6 +303,7 @@ __do_page_cache_readahead(struct address
continue;
read_unlock_irq(&mapping->tree_lock);
+ cond_resched();
page = page_cache_alloc_cold(mapping);
read_lock_irq(&mapping->tree_lock);
if (!page)
--- linux-2.6.16-rc6-mm2.orig/fs/mpage.c
+++ linux-2.6.16-rc6-mm2/fs/mpage.c
@@ -407,8 +407,10 @@ mpage_readpages(struct address_space *ma
&last_block_in_bio, &map_bh,
&first_logical_block,
get_block);
- if (!pagevec_add(&lru_pvec, page))
+ if (!pagevec_add(&lru_pvec, page)) {
+ cond_resched();
__pagevec_lru_add(&lru_pvec);
+ }
} else {
page_cache_release(page);
}
--
* Re: [PATCH 07/23] readahead: insert cond_resched() calls
2006-03-19 2:34 ` [PATCH 07/23] readahead: insert cond_resched() calls Wu Fengguang
@ 2006-03-19 3:50 ` Lee Revell
2006-03-19 5:32 ` Wu Fengguang
2006-03-20 13:31 ` Wu Fengguang
0 siblings, 2 replies; 37+ messages in thread
From: Lee Revell @ 2006-03-19 3:50 UTC (permalink / raw)
To: Wu Fengguang; +Cc: Andrew Morton, linux-kernel, Con Kolivas
On Sun, 2006-03-19 at 10:34 +0800, Wu Fengguang wrote:
> plain text document attachment
> (readahead-insert-cond_resched-calls.patch)
> > Since VM_MAX_READAHEAD is greatly enlarged and the algorithm has become more
> complex, it becomes necessary to insert some cond_resched() calls in the
> read-ahead path.
>
> If desktop users still feel audio jitters with the new read-ahead code,
> please try one of the following ways to get rid of it:
>
> 1) compile kernel with CONFIG_PREEMPT_VOLUNTARY/CONFIG_PREEMPT
> 2) reduce the read-ahead request size by running
> blockdev --setra 256 /dev/hda # or whatever device you are using
Do you have any numbers on this (say, from Ingo's latency tracer)? Have
you confirmed it's not a latency regression with the default settings?
Lee
* Re: [PATCH 07/23] readahead: insert cond_resched() calls
2006-03-19 3:50 ` Lee Revell
@ 2006-03-19 5:32 ` Wu Fengguang
2006-03-20 13:31 ` Wu Fengguang
1 sibling, 0 replies; 37+ messages in thread
From: Wu Fengguang @ 2006-03-19 5:32 UTC (permalink / raw)
To: Lee Revell; +Cc: Andrew Morton, linux-kernel, Con Kolivas
On Sat, Mar 18, 2006 at 10:50:42PM -0500, Lee Revell wrote:
> On Sun, 2006-03-19 at 10:34 +0800, Wu Fengguang wrote:
> > plain text document attachment
> > (readahead-insert-cond_resched-calls.patch)
> > > Since VM_MAX_READAHEAD is greatly enlarged and the algorithm has become more
> > complex, it becomes necessary to insert some cond_resched() calls in the
> > read-ahead path.
> >
> > If desktop users still feel audio jitters with the new read-ahead code,
> > please try one of the following ways to get rid of it:
> >
> > 1) compile kernel with CONFIG_PREEMPT_VOLUNTARY/CONFIG_PREEMPT
> > 2) reduce the read-ahead request size by running
> > blockdev --setra 256 /dev/hda # or whatever device you are using
>
> Do you have any numbers on this (say, from Ingo's latency tracer)? Have
> you confirmed it's not a latency regression with the default settings?
Sorry, I did the testing simply by playing mp3s while doing heavy I/O.
The 128KB window size does not need these cond_resched()s, yet the
1024KB window size does. With this patch, I do not feel any latency
issues with CONFIG_PREEMPT_NONE.
But I'm not sure that is the case for others, as there have been user
reports of audio jitter (without further information, hence still an
open problem).
Anyway, numbers will be collected soon ...
Cheers,
Wu
* Re: [PATCH 07/23] readahead: insert cond_resched() calls
2006-03-19 3:50 ` Lee Revell
2006-03-19 5:32 ` Wu Fengguang
@ 2006-03-20 13:31 ` Wu Fengguang
1 sibling, 0 replies; 37+ messages in thread
From: Wu Fengguang @ 2006-03-20 13:31 UTC (permalink / raw)
To: Lee Revell; +Cc: Andrew Morton, linux-kernel, Con Kolivas
On Sat, Mar 18, 2006 at 10:50:42PM -0500, Lee Revell wrote:
> On Sun, 2006-03-19 at 10:34 +0800, Wu Fengguang wrote:
> > Since VM_MAX_READAHEAD is greatly enlarged and the algorithm has become more
> > complex, it becomes necessary to insert some cond_resched() calls in the
> > read-ahead path.
>
> Do you have any numbers on this (say, from Ingo's latency tracer)? Have
> you confirmed it's not a latency regression with the default settings?
Now I have some from Ingo's latency tracer. When doing 1M sized readaheads,
the original kernel has ~5ms latency. This patch reduced that number to ~1ms.
The trace data were collected with the following command:
cp 3G_sparse_file /dev/null
The resulting maximum latencies are:
wfg ~% head -n5 latency_trace1
preemption latency trace v1.1.5 on 2.6.16
--------------------------------------------------------------------
latency: 1169 us, #319/319, CPU#0 | (M:server VP:0, KP:0, SP:0 HP:0 #P:1)
-----------------
| task: xmms-5077 (uid:1000 nice:0 policy:0 rt_prio:0)
wfg ~% head -n5 latency_trace2
preemption latency trace v1.1.5 on 2.6.16
--------------------------------------------------------------------
latency: 4835 us, #14102/14102, CPU#0 | (M:server VP:0, KP:0, SP:0 HP:0 #P:1)
-----------------
| task: xmms-5065 (uid:1000 nice:0 policy:0 rt_prio:0)
The following commands show that the 4835us latency is mainly caused by
__do_page_cache_readahead(): from 575us to 1723us
mpage_readpages(): from 1724us to 4795us
wfg ~% grep readahead latency_trace1
wfg ~% grep readpages latency_trace1
wfg ~% grep readahead latency_trace2
cp-5068 0dn.. 575us : _read_lock_irq (__do_page_cache_readahead)
cp-5068 0dn.. 576us : radix_tree_lookup_node (__do_page_cache_readahead)
cp-5068 0dn.. 576us : _read_unlock_irq (__do_page_cache_readahead)
cp-5068 0dn.. 577us : __alloc_pages (__do_page_cache_readahead)
cp-5068 0dn.. 580us : _read_lock_irq (__do_page_cache_readahead)
cp-5068 0dn.. 580us : radix_tree_lookup_node (__do_page_cache_readahead)
cp-5068 0dn.. 581us : _read_unlock_irq (__do_page_cache_readahead)
cp-5068 0dn.. 581us : __alloc_pages (__do_page_cache_readahead)
cp-5068 0dn.. 583us : _read_lock_irq (__do_page_cache_readahead)
......[899 LINES SKIPPED]
cp-5068 0dn.. 1712us : radix_tree_lookup_node (__do_page_cache_readahead)
cp-5068 0dn.. 1712us : _read_unlock_irq (__do_page_cache_readahead)
cp-5068 0dn.. 1713us : __alloc_pages (__do_page_cache_readahead)
cp-5068 0dn.. 1715us : _read_lock_irq (__do_page_cache_readahead)
cp-5068 0dn.. 1716us : radix_tree_lookup_node (__do_page_cache_readahead)
cp-5068 0dn.. 1716us : _read_unlock_irq (__do_page_cache_readahead)
cp-5068 0dn.. 1716us : __alloc_pages (__do_page_cache_readahead)
cp-5068 0dn.. 1719us : _read_lock_irq (__do_page_cache_readahead)
cp-5068 0dn.. 1719us : radix_tree_lookup_node (__do_page_cache_readahead)
cp-5068 0dn.. 1719us : _read_unlock_irq (__do_page_cache_readahead)
cp-5068 0dn.. 1720us : __alloc_pages (__do_page_cache_readahead)
cp-5068 0dn.. 1722us : _read_lock_irq (__do_page_cache_readahead)
cp-5068 0dn.. 1723us : _read_unlock_irq (__do_page_cache_readahead)
cp-5068 0dn.. 1723us : read_pages (__do_page_cache_readahead)
cp-5068 0dn.. 4800us : readahead_cache_hit (do_generic_mapping_read)
wfg ~% grep readpages latency_trace2
cp-5068 0dn.. 1724us : ext3_readpages (read_pages)
cp-5068 0dn.. 1724us : mpage_readpages (ext3_readpages)
cp-5068 0dn.. 1725us : add_to_page_cache (mpage_readpages)
cp-5068 0dn.. 1728us : do_mpage_readpage (mpage_readpages)
cp-5068 0dn.. 1742us : add_to_page_cache (mpage_readpages)
cp-5068 0dn.. 1744us : do_mpage_readpage (mpage_readpages)
cp-5068 0dn.. 1753us : add_to_page_cache (mpage_readpages)
cp-5068 0dn.. 1755us : do_mpage_readpage (mpage_readpages)
cp-5068 0dn.. 1764us : add_to_page_cache (mpage_readpages)
......[508 LINES SKIPPED]
cp-5068 0dn.. 4716us : do_mpage_readpage (mpage_readpages)
cp-5068 0dn.. 4726us : add_to_page_cache (mpage_readpages)
cp-5068 0dn.. 4728us : do_mpage_readpage (mpage_readpages)
cp-5068 0dn.. 4737us : add_to_page_cache (mpage_readpages)
cp-5068 0dn.. 4739us : do_mpage_readpage (mpage_readpages)
cp-5068 0dn.. 4748us : __pagevec_lru_add (mpage_readpages)
cp-5068 0dn.. 4750us : add_to_page_cache (mpage_readpages)
cp-5068 0dn.. 4752us : do_mpage_readpage (mpage_readpages)
cp-5068 0dn.. 4761us : add_to_page_cache (mpage_readpages)
cp-5068 0dn.. 4763us : do_mpage_readpage (mpage_readpages)
cp-5068 0dn.. 4772us : add_to_page_cache (mpage_readpages)
cp-5068 0dn.. 4774us : do_mpage_readpage (mpage_readpages)
cp-5068 0dn.. 4784us : add_to_page_cache (mpage_readpages)
cp-5068 0dn.. 4785us : do_mpage_readpage (mpage_readpages)
cp-5068 0dn.. 4795us : __pagevec_lru_add (mpage_readpages)
Cheers,
Wu
* [PATCH 08/23] readahead: common macros
2006-03-19 2:34 [PATCH 00/23] Adaptive read-ahead V11 Wu Fengguang
` (6 preceding siblings ...)
2006-03-19 2:34 ` [PATCH 07/23] readahead: insert cond_resched() calls Wu Fengguang
@ 2006-03-19 2:34 ` Wu Fengguang
2006-03-19 2:34 ` [PATCH 09/23] readahead: events accounting Wu Fengguang
` (15 subsequent siblings)
23 siblings, 0 replies; 37+ messages in thread
From: Wu Fengguang @ 2006-03-19 2:34 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-kernel, Wu Fengguang
[-- Attachment #1: readahead-common-macros.patch --]
[-- Type: text/plain, Size: 1556 bytes --]
Define some commonly used macros for the read-ahead logic.
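As a quick sanity check of the KB() arithmetic: with PAGE_CACHE_SIZE =
4096, KB(128) = (128*1024 + 4095) / 4096 = 32, so a 128KB read-ahead limit
still yields 32 pages, the same value the old open-coded expression
produced; the round-up only matters when the kilobyte count is not a
whole number of pages.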
Signed-off-by: Wu Fengguang <wfg@mail.ustc.edu.cn>
---
mm/readahead.c | 15 +++++++++++++--
1 files changed, 13 insertions(+), 2 deletions(-)
--- linux-2.6.16-rc6-mm2.orig/mm/readahead.c
+++ linux-2.6.16-rc6-mm2/mm/readahead.c
@@ -14,6 +14,17 @@
#include <linux/blkdev.h>
#include <linux/backing-dev.h>
#include <linux/pagevec.h>
+#include <linux/writeback.h>
+#include <linux/nfsd/const.h>
+
+/* The default max/min read-ahead pages. */
+#define KB(size) (((size)*1024 + PAGE_CACHE_SIZE-1) / PAGE_CACHE_SIZE)
+#define MAX_RA_PAGES KB(VM_MAX_READAHEAD)
+#define MIN_RA_PAGES KB(VM_MIN_READAHEAD)
+#define MIN_NFSD_PAGES KB(NFSSVC_MAXBLKSIZE/1024)
+
+#define next_page(pg) (list_entry((pg)->lru.prev, struct page, lru))
+#define prev_page(pg) (list_entry((pg)->lru.next, struct page, lru))
void default_unplug_io_fn(struct backing_dev_info *bdi, struct page *page)
{
@@ -21,7 +32,7 @@ void default_unplug_io_fn(struct backing
EXPORT_SYMBOL(default_unplug_io_fn);
struct backing_dev_info default_backing_dev_info = {
- .ra_pages = (VM_MAX_READAHEAD * 1024) / PAGE_CACHE_SIZE,
+ .ra_pages = MAX_RA_PAGES,
.state = 0,
.capabilities = BDI_CAP_MAP_COPY,
.unplug_io_fn = default_unplug_io_fn,
@@ -49,7 +60,7 @@ static inline unsigned long get_max_read
static inline unsigned long get_min_readahead(struct file_ra_state *ra)
{
- return (VM_MIN_READAHEAD * 1024) / PAGE_CACHE_SIZE;
+ return MIN_RA_PAGES;
}
static inline void reset_ahead_window(struct file_ra_state *ra)
--
* [PATCH 09/23] readahead: events accounting
2006-03-19 2:34 [PATCH 00/23] Adaptive read-ahead V11 Wu Fengguang
` (7 preceding siblings ...)
2006-03-19 2:34 ` [PATCH 08/23] readahead: common macros Wu Fengguang
@ 2006-03-19 2:34 ` Wu Fengguang
2006-03-19 2:34 ` [PATCH 10/23] readahead: support functions Wu Fengguang
` (14 subsequent siblings)
23 siblings, 0 replies; 37+ messages in thread
From: Wu Fengguang @ 2006-03-19 2:34 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-kernel, Jörn Engel, Ingo Oeser
[-- Attachment #1: readahead-events-accounting.patch --]
[-- Type: text/plain, Size: 9539 bytes --]
A debugfs file named `readahead/events' is created according to advice from
Jörn Engel, Andrew Morton and Ingo Oeser.
It reveals various read-ahead activities/events, and is vital to the testing.
---------------------------
If you are experiencing performance problems, or want to help improve the
read-ahead logic, please send me the debug data. Thanks.
- Preparations
## First compile kernel with CONFIG_DEBUG_READAHEAD
mkdir /debug
mount -t debugfs none /debug
- For each session with distinct access pattern
echo > /debug/readahead/events # reset the counters
# echo > /var/log/kern.log # you may want to backup it first
# echo 2 > /debug/readahead/debug_level # show verbose printk traces
## do one benchmark/task
# echo 0 > /debug/readahead/debug_level # revert to normal value
cp /debug/readahead/events readahead-events-`date +'%F_%R'`
# bzip2 -c /var/log/kern.log > kern.log-`date +'%F_%R'`.bz2
The commented-out commands can uncover more detailed file accesses,
which are useful sometimes. Note that the log file can grow huge!
Signed-off-by: Wu Fengguang <wfg@mail.ustc.edu.cn>
---
mm/readahead.c | 270 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++-
1 files changed, 269 insertions(+), 1 deletion(-)
--- linux-2.6.16-rc6-mm2.orig/mm/readahead.c
+++ linux-2.6.16-rc6-mm2/mm/readahead.c
@@ -26,6 +26,261 @@
#define next_page(pg) (list_entry((pg)->lru.prev, struct page, lru))
#define prev_page(pg) (list_entry((pg)->lru.next, struct page, lru))
+#ifdef CONFIG_ADAPTIVE_READAHEAD
+/*
+ * Detailed classification of read-ahead behaviors.
+ */
+#define RA_CLASS_SHIFT 4
+#define RA_CLASS_MASK ((1 << RA_CLASS_SHIFT) - 1)
+enum ra_class {
+ RA_CLASS_ALL,
+ RA_CLASS_NEWFILE,
+ RA_CLASS_STATE,
+ RA_CLASS_CONTEXT,
+ RA_CLASS_CONTEXT_AGGRESSIVE,
+ RA_CLASS_BACKWARD,
+ RA_CLASS_THRASHING,
+ RA_CLASS_SEEK,
+ RA_CLASS_END,
+};
+#endif /* CONFIG_ADAPTIVE_READAHEAD */
+
+/*
+ * Read-ahead events accounting.
+ */
+#ifdef CONFIG_DEBUG_READAHEAD
+#include <linux/init.h>
+#include <linux/jiffies.h>
+#include <linux/debugfs.h>
+#include <linux/seq_file.h>
+
+/* Read-ahead events to be accounted. */
+enum ra_event {
+ RA_EVENT_CACHE_MISS, /* read cache misses */
+ RA_EVENT_READRANDOM, /* random reads */
+ RA_EVENT_IO_CONGESTION, /* io congestion */
+ RA_EVENT_IO_CACHE_HIT, /* canceled io due to cache hit */
+ RA_EVENT_IO_BLOCK, /* read on locked page */
+
+ RA_EVENT_READAHEAD, /* read-ahead issued */
+ RA_EVENT_READAHEAD_HIT, /* read-ahead page hit */
+ RA_EVENT_LOOKAHEAD, /* look-ahead issued */
+ RA_EVENT_LOOKAHEAD_HIT, /* look-ahead mark hit */
+ RA_EVENT_LOOKAHEAD_NOACTION, /* look-ahead mark ignored */
+ RA_EVENT_READAHEAD_MMAP, /* read-ahead for memory mapped file */
+ RA_EVENT_READAHEAD_EOF, /* read-ahead reaches EOF */
+ RA_EVENT_READAHEAD_SHRINK, /* ra_size under previous la_size */
+ RA_EVENT_READAHEAD_THRASHING, /* read-ahead thrashing happened */
+ RA_EVENT_READAHEAD_MUTILATE, /* read-ahead request mutilated */
+ RA_EVENT_READAHEAD_RESCUE, /* read-ahead rescued */
+
+ RA_EVENT_END
+};
+
+static const char * const ra_event_name[] = {
+ "cache_miss",
+ "read_random",
+ "io_congestion",
+ "io_cache_hit",
+ "io_block",
+ "readahead",
+ "readahead_hit",
+ "lookahead",
+ "lookahead_hit",
+ "lookahead_ignore",
+ "readahead_mmap",
+ "readahead_eof",
+ "readahead_shrink",
+ "readahead_thrash",
+ "readahead_mutilt",
+ "readahead_rescue",
+};
+
+static const char * const ra_class_name[] = {
+ "total",
+ "newfile",
+ "state",
+ "context",
+ "contexta",
+ "backward",
+ "onthrash",
+ "onraseek",
+ "none",
+};
+
+static unsigned long ra_events[RA_CLASS_END+1][RA_EVENT_END+1][2];
+
+static inline void ra_account(struct file_ra_state *ra,
+ enum ra_event e, int pages)
+{
+ enum ra_class c;
+
+ if (e == RA_EVENT_READAHEAD_HIT && pages < 0) {
+ c = (ra->flags >> RA_CLASS_SHIFT) & RA_CLASS_MASK;
+ pages = -pages;
+ } else if (ra)
+ c = ra->flags & RA_CLASS_MASK;
+ else
+ c = RA_CLASS_END;
+
+ if (!c)
+ c = RA_CLASS_END;
+
+ ra_events[c][e][0] += 1;
+ ra_events[c][e][1] += pages;
+
+ if (e == RA_EVENT_READAHEAD)
+ ra_events[c][RA_EVENT_END][1] += pages * pages;
+}
+
+static int ra_events_show(struct seq_file *s, void *_)
+{
+ int i;
+ int c;
+ int e;
+ static const char event_fmt[] = "%-16s";
+ static const char class_fmt[] = "%10s";
+ static const char item_fmt[] = "%10lu";
+ static const char percent_format[] = "%9lu%%";
+ static const char * const table_name[] = {
+ "[table requests]",
+ "[table pages]",
+ "[table summary]"};
+
+ for (i = 0; i <= 1; i++) {
+ for (e = 0; e <= RA_EVENT_END; e++) {
+ ra_events[0][e][i] = 0;
+ for (c = 1; c < RA_CLASS_END; c++)
+ ra_events[0][e][i] += ra_events[c][e][i];
+ }
+
+ seq_printf(s, event_fmt, table_name[i]);
+ for (c = 0; c <= RA_CLASS_END; c++)
+ seq_printf(s, class_fmt, ra_class_name[c]);
+ seq_puts(s, "\n");
+
+ for (e = 0; e < RA_EVENT_END; e++) {
+ if (e == RA_EVENT_READAHEAD_HIT && i == 0)
+ continue;
+ if (e == RA_EVENT_IO_BLOCK && i == 1)
+ continue;
+
+ seq_printf(s, event_fmt, ra_event_name[e]);
+ for (c = 0; c <= RA_CLASS_END; c++)
+ seq_printf(s, item_fmt, ra_events[c][e][i]);
+ seq_puts(s, "\n");
+ }
+ seq_puts(s, "\n");
+ }
+
+ seq_printf(s, event_fmt, table_name[2]);
+ for (c = 0; c <= RA_CLASS_END; c++)
+ seq_printf(s, class_fmt, ra_class_name[c]);
+ seq_puts(s, "\n");
+
+ seq_printf(s, event_fmt, "random_rate");
+ for (c = 0; c <= RA_CLASS_END; c++)
+ seq_printf(s, percent_format,
+ (ra_events[c][RA_EVENT_READRANDOM][0] * 100) /
+ ((ra_events[c][RA_EVENT_READRANDOM][0] +
+ ra_events[c][RA_EVENT_READAHEAD][0]) | 1));
+ seq_puts(s, "\n");
+
+ seq_printf(s, event_fmt, "ra_hit_rate");
+ for (c = 0; c <= RA_CLASS_END; c++)
+ seq_printf(s, percent_format,
+ (ra_events[c][RA_EVENT_READAHEAD_HIT][1] * 100) /
+ (ra_events[c][RA_EVENT_READAHEAD][1] | 1));
+ seq_puts(s, "\n");
+
+ seq_printf(s, event_fmt, "la_hit_rate");
+ for (c = 0; c <= RA_CLASS_END; c++)
+ seq_printf(s, percent_format,
+ (ra_events[c][RA_EVENT_LOOKAHEAD_HIT][0] * 100) /
+ (ra_events[c][RA_EVENT_LOOKAHEAD][0] | 1));
+ seq_puts(s, "\n");
+
+ seq_printf(s, event_fmt, "var_ra_size");
+ for (c = 0; c <= RA_CLASS_END; c++)
+ seq_printf(s, item_fmt,
+ (ra_events[c][RA_EVENT_END][1] -
+ ra_events[c][RA_EVENT_READAHEAD][1] *
+ (ra_events[c][RA_EVENT_READAHEAD][1] /
+ (ra_events[c][RA_EVENT_READAHEAD][0] | 1))) /
+ (ra_events[c][RA_EVENT_READAHEAD][0] | 1));
+ seq_puts(s, "\n");
+
+ seq_printf(s, event_fmt, "avg_ra_size");
+ for (c = 0; c <= RA_CLASS_END; c++)
+ seq_printf(s, item_fmt,
+ (ra_events[c][RA_EVENT_READAHEAD][1] +
+ ra_events[c][RA_EVENT_READAHEAD][0] / 2) /
+ (ra_events[c][RA_EVENT_READAHEAD][0] | 1));
+ seq_puts(s, "\n");
+
+ seq_printf(s, event_fmt, "avg_la_size");
+ for (c = 0; c <= RA_CLASS_END; c++)
+ seq_printf(s, item_fmt,
+ (ra_events[c][RA_EVENT_LOOKAHEAD][1] +
+ ra_events[c][RA_EVENT_LOOKAHEAD][0] / 2) /
+ (ra_events[c][RA_EVENT_LOOKAHEAD][0] | 1));
+ seq_puts(s, "\n");
+
+ return 0;
+}
+
+static int ra_events_open(struct inode *inode, struct file *file)
+{
+ return single_open(file, ra_events_show, NULL);
+}
+
+static ssize_t ra_events_write(struct file *file, const char __user *buf,
+ size_t size, loff_t *offset)
+{
+ memset(ra_events, 0, sizeof(ra_events));
+ return 1;
+}
+
+struct file_operations ra_events_fops = {
+ .owner = THIS_MODULE,
+ .open = ra_events_open,
+ .write = ra_events_write,
+ .read = seq_read,
+ .llseek = seq_lseek,
+ .release = single_release,
+};
+
+u32 readahead_debug_level = 0;
+u32 disable_stateful_method = 0;
+
+static int __init readahead_init(void)
+{
+ struct dentry *root;
+
+ root = debugfs_create_dir("readahead", NULL);
+
+ debugfs_create_file("events", 0644, root, NULL, &ra_events_fops);
+
+ debugfs_create_u32("debug_level", 0644, root, &readahead_debug_level);
+ debugfs_create_bool("disable_stateful_method", 0644, root,
+ &disable_stateful_method);
+
+ return 0;
+}
+
+module_init(readahead_init)
+#else
+#define ra_account(ra, e, pages) do { } while (0)
+#define readahead_debug_level (0)
+#define disable_stateful_method (0)
+#endif /* CONFIG_DEBUG_READAHEAD */
+
+#define dprintk(args...) \
+ do { if (readahead_debug_level >= 1) printk(KERN_DEBUG args); } while(0)
+#define ddprintk(args...) \
+ do { if (readahead_debug_level >= 2) printk(KERN_DEBUG args); } while(0)
+
+
void default_unplug_io_fn(struct backing_dev_info *bdi, struct page *page)
{
}
@@ -368,6 +623,9 @@ int force_page_cache_readahead(struct ad
offset += this_chunk;
nr_to_read -= this_chunk;
}
+
+ ra_account(NULL, RA_EVENT_READAHEAD, ret);
+
return ret;
}
@@ -403,10 +661,16 @@ static inline int check_ra_success(struc
int do_page_cache_readahead(struct address_space *mapping, struct file *filp,
pgoff_t offset, unsigned long nr_to_read)
{
+ unsigned long ret;
+
if (bdi_read_congested(mapping->backing_dev_info))
return -1;
- return __do_page_cache_readahead(mapping, filp, offset, nr_to_read, 0);
+ ret = __do_page_cache_readahead(mapping, filp, offset, nr_to_read, 0);
+
+ ra_account(NULL, RA_EVENT_READAHEAD, ret);
+
+ return ret;
}
/*
@@ -428,6 +692,10 @@ blockable_page_cache_readahead(struct ad
actual = __do_page_cache_readahead(mapping, filp, offset, nr_to_read, 0);
+ ra_account(NULL, RA_EVENT_READAHEAD, actual);
+ dprintk("blockable-readahead(ino=%lu, ra=%lu+%lu) = %d\n",
+ mapping->host->i_ino, offset, nr_to_read, actual);
+
return check_ra_success(ra, nr_to_read, actual);
}
--
* [PATCH 10/23] readahead: support functions
2006-03-19 2:34 [PATCH 00/23] Adaptive read-ahead V11 Wu Fengguang
` (8 preceding siblings ...)
2006-03-19 2:34 ` [PATCH 09/23] readahead: events accounting Wu Fengguang
@ 2006-03-19 2:34 ` Wu Fengguang
2006-03-19 2:34 ` [PATCH 11/23] readahead: sysctl parameters Wu Fengguang
` (13 subsequent siblings)
23 siblings, 0 replies; 37+ messages in thread
From: Wu Fengguang @ 2006-03-19 2:34 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-kernel, Wu Fengguang
[-- Attachment #1: readahead-support-functions.patch --]
[-- Type: text/plain, Size: 4649 bytes --]
Several support functions for the adaptive read-ahead logic.
Signed-off-by: Wu Fengguang <wfg@mail.ustc.edu.cn>
---
include/linux/mm.h | 11 +++++
mm/readahead.c | 116 +++++++++++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 127 insertions(+)
--- linux-2.6.16-rc6-mm2.orig/include/linux/mm.h
+++ linux-2.6.16-rc6-mm2/include/linux/mm.h
@@ -1016,6 +1016,17 @@ void handle_ra_miss(struct address_space
struct file_ra_state *ra, pgoff_t offset);
unsigned long max_sane_readahead(unsigned long nr);
+#ifdef CONFIG_ADAPTIVE_READAHEAD
+extern int readahead_ratio;
+#else
+#define readahead_ratio 1
+#endif /* CONFIG_ADAPTIVE_READAHEAD */
+
+static inline int prefer_adaptive_readahead(void)
+{
+ return readahead_ratio >= 10;
+}
+
/* Do stack extension */
extern int expand_stack(struct vm_area_struct *vma, unsigned long address);
#ifdef CONFIG_IA64
--- linux-2.6.16-rc6-mm2.orig/mm/readahead.c
+++ linux-2.6.16-rc6-mm2/mm/readahead.c
@@ -875,3 +875,119 @@ unsigned long max_sane_readahead(unsigne
__get_zone_counts(&active, &inactive, &free, NODE_DATA(numa_node_id()));
return min(nr, (inactive + free) / 2);
}
+
+/*
+ * Adaptive read-ahead.
+ *
+ * Good read patterns are compact both in space and time. The read-ahead logic
+ * tries to grant larger read-ahead size to better readers under the constraint
+ * of system memory and load pressure.
+ *
+ * It employs two methods to estimate the max thrashing safe read-ahead size:
+ * 1. state based - the default one
+ * 2. context based - the failsafe one
+ * The integration of the dual methods has the merit of being agile and robust.
+ * It makes the overall design clean: special cases are handled in general by
+ * the stateless method, leaving the stateful one simple and fast.
+ *
+ * To improve throughput and decrease read delay, the logic 'looks ahead'.
+ * In most read-ahead chunks, one page will be selected and tagged with
+ * PG_readahead. Later when the page with PG_readahead is read, the logic
+ * will be notified to submit the next read-ahead chunk in advance.
+ *
+ * a read-ahead chunk
+ * +-----------------------------------------+
+ * | # PG_readahead |
+ * +-----------------------------------------+
+ * ^ When this page is read, notify me for the next read-ahead.
+ *
+ *
+ * Here are some variable names used frequently:
+ *
+ * |<------- la_size ------>|
+ * +-----------------------------------------+
+ * | # |
+ * +-----------------------------------------+
+ * ra_index -->|<---------------- ra_size -------------->|
+ *
+ */
+
+#ifdef CONFIG_ADAPTIVE_READAHEAD
+
+/*
+ * The nature of read-ahead allows these tests to be occasionally false.
+ * Here we just do not bother to call get_page(); it would be meaningless anyway.
+ */
+static inline struct page *__find_page(struct address_space *mapping,
+ pgoff_t offset)
+{
+ return radix_tree_lookup(&mapping->page_tree, offset);
+}
+
+static inline struct page *find_page(struct address_space *mapping,
+ pgoff_t offset)
+{
+ struct page *page;
+
+ read_lock_irq(&mapping->tree_lock);
+ page = __find_page(mapping, offset);
+ read_unlock_irq(&mapping->tree_lock);
+ return page;
+}
+
+/*
+ * Move pages in danger (of thrashing) to the head of inactive_list.
+ * Not expected to happen frequently.
+ */
+static unsigned long rescue_pages(struct page *page, unsigned long nr_pages)
+{
+ int pgrescue;
+ pgoff_t index;
+ struct zone *zone;
+ struct address_space *mapping;
+
+ BUG_ON(!nr_pages || !page);
+ pgrescue = 0;
+ index = page_index(page);
+ mapping = page_mapping(page);
+
+ dprintk("rescue_pages(ino=%lu, index=%lu nr=%lu)\n",
+ mapping->host->i_ino, index, nr_pages);
+
+ for (;;) {
+ zone = page_zone(page);
+ spin_lock_irq(&zone->lru_lock);
+
+ if (!PageLRU(page))
+ goto out_unlock;
+
+ while (page_mapping(page) == mapping &&
+ page_index(page) == index) {
+ struct page *the_page = page;
+ page = next_page(page);
+ if (!PageActive(the_page) &&
+ !PageLocked(the_page) &&
+ page_count(the_page) == 1) {
+ list_move(&the_page->lru, &zone->inactive_list);
+ pgrescue++;
+ }
+ index++;
+ if (!--nr_pages)
+ goto out_unlock;
+ }
+
+ spin_unlock_irq(&zone->lru_lock);
+
+ cond_resched();
+ page = find_page(mapping, index);
+ if (!page)
+ goto out;
+ }
+out_unlock:
+ spin_unlock_irq(&zone->lru_lock);
+out:
+ ra_account(NULL, RA_EVENT_READAHEAD_RESCUE, pgrescue);
+ return nr_pages;
+}
+
+#endif /* CONFIG_ADAPTIVE_READAHEAD */
--
* [PATCH 11/23] readahead: sysctl parameters
2006-03-19 2:34 [PATCH 00/23] Adaptive read-ahead V11 Wu Fengguang
` (9 preceding siblings ...)
2006-03-19 2:34 ` [PATCH 10/23] readahead: support functions Wu Fengguang
@ 2006-03-19 2:34 ` Wu Fengguang
2006-03-19 2:34 ` [PATCH 12/23] readahead: min/max sizes Wu Fengguang
` (12 subsequent siblings)
23 siblings, 0 replies; 37+ messages in thread
From: Wu Fengguang @ 2006-03-19 2:34 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-kernel, Wu Fengguang
[-- Attachment #1: readahead-parameter-sysctl-variables.patch --]
[-- Type: text/plain, Size: 5040 bytes --]
Add new sysctl entries in /proc/sys/vm:
- readahead_ratio = 50
i.e. limit read-ahead size to <= 50% of the thrashing threshold
- readahead_hit_rate = 2
i.e. a read-ahead hit ratio >= 50% is deemed ok
readahead_ratio also provides a way to select read-ahead logic at runtime:
condition action
==========================================================================
readahead_ratio == 0 disable read-ahead
readahead_ratio <= 9 select the (old) stock read-ahead logic
readahead_ratio >= 10 select the (new) adaptive read-ahead logic
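For illustration, the new knobs can be driven like any other vm sysctl.
A minimal (hypothetical) userspace helper, equivalent to
"echo 50 > /proc/sys/vm/readahead_ratio":

	#include <stdio.h>

	int main(void)
	{
		/* needs root; 50 is the default, >= 10 selects the
		 * adaptive read-ahead logic */
		FILE *f = fopen("/proc/sys/vm/readahead_ratio", "w");

		if (!f) {
			perror("readahead_ratio");
			return 1;
		}
		fprintf(f, "50\n");
		return fclose(f) ? 1 : 0;
	}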
Signed-off-by: Wu Fengguang <wfg@mail.ustc.edu.cn>
---
Documentation/sysctl/vm.txt | 36 ++++++++++++++++++++++++++++++++++++
include/linux/sysctl.h | 2 ++
kernel/sysctl.c | 28 ++++++++++++++++++++++++++++
mm/readahead.c | 18 ++++++++++++++++++
4 files changed, 84 insertions(+)
--- linux-2.6.16-rc6-mm2.orig/mm/readahead.c
+++ linux-2.6.16-rc6-mm2/mm/readahead.c
@@ -28,6 +28,24 @@
#ifdef CONFIG_ADAPTIVE_READAHEAD
/*
+ * Adaptive read-ahead parameters.
+ */
+
+/* In laptop mode, poll delayed look-ahead on every ## pages read. */
+#define LAPTOP_POLL_INTERVAL 16
+
+/* Set look-ahead size to 1/# of the thrashing-threshold. */
+#define LOOKAHEAD_RATIO 8
+
+/* Set read-ahead size to ##% of the thrashing-threshold. */
+int readahead_ratio = 50;
+EXPORT_SYMBOL(readahead_ratio);
+
+/* Readahead as long as cache hit ratio keeps above 1/##. */
+int readahead_hit_rate = 2;
+EXPORT_SYMBOL(readahead_hit_rate);
+
+/*
* Detailed classification of read-ahead behaviors.
*/
#define RA_CLASS_SHIFT 4
--- linux-2.6.16-rc6-mm2.orig/include/linux/sysctl.h
+++ linux-2.6.16-rc6-mm2/include/linux/sysctl.h
@@ -187,6 +187,8 @@ enum
VM_ZONE_RECLAIM_MODE=31, /* reclaim local zone memory before going off node */
VM_ZONE_RECLAIM_INTERVAL=32, /* time period to wait after reclaim failure */
VM_SWAP_PREFETCH=33, /* swap prefetch */
+ VM_READAHEAD_RATIO=34, /* percent of read-ahead size to thrashing-threshold */
+ VM_READAHEAD_HIT_RATE=35, /* one accessed page legitimizes so many read-ahead pages */
};
--- linux-2.6.16-rc6-mm2.orig/kernel/sysctl.c
+++ linux-2.6.16-rc6-mm2/kernel/sysctl.c
@@ -74,6 +74,12 @@ extern int pid_max_min, pid_max_max;
extern int sysctl_drop_caches;
extern int percpu_pagelist_fraction;
+#if defined(CONFIG_ADAPTIVE_READAHEAD)
+extern int readahead_ratio;
+extern int readahead_hit_rate;
+static int one = 1;
+#endif
+
#if defined(CONFIG_X86_LOCAL_APIC) && defined(CONFIG_X86)
int unknown_nmi_panic;
extern int proc_unknown_nmi_panic(ctl_table *, int, struct file *,
@@ -926,6 +932,28 @@ static ctl_table vm_table[] = {
.proc_handler = &proc_dointvec,
},
#endif
+#ifdef CONFIG_ADAPTIVE_READAHEAD
+ {
+ .ctl_name = VM_READAHEAD_RATIO,
+ .procname = "readahead_ratio",
+ .data = &readahead_ratio,
+ .maxlen = sizeof(readahead_ratio),
+ .mode = 0644,
+ .proc_handler = &proc_dointvec,
+ .strategy = &sysctl_intvec,
+ .extra1 = &zero,
+ },
+ {
+ .ctl_name = VM_READAHEAD_HIT_RATE,
+ .procname = "readahead_hit_rate",
+ .data = &readahead_hit_rate,
+ .maxlen = sizeof(readahead_hit_rate),
+ .mode = 0644,
+ .proc_handler = &proc_dointvec,
+ .strategy = &sysctl_intvec,
+ .extra1 = &one,
+ },
+#endif
{ .ctl_name = 0 }
};
--- linux-2.6.16-rc6-mm2.orig/Documentation/sysctl/vm.txt
+++ linux-2.6.16-rc6-mm2/Documentation/sysctl/vm.txt
@@ -30,6 +30,8 @@ Currently, these files are in /proc/sys/
- zone_reclaim_mode
- zone_reclaim_interval
- swap_prefetch
+- readahead_ratio
+- readahead_hit_rate
==============================================================
@@ -189,3 +191,37 @@ copying back pages from swap into the sw
practice it can take many minutes before the vm is idle enough.
The default value is 1.
+
+==============================================================
+
+readahead_ratio
+
+This limits the readahead size to a percentage of the thrashing-threshold,
+which is dynamically estimated from the _history_ read speed and system
+load, and is used to deduce the _future_ readahead request size.
+
+Set it to a smaller value if you do not have enough memory for all the
+concurrent readers, or if the I/O load fluctuates a lot. If there is
+plenty of memory (>2MB per reader), enlarging it may help speed up reads.
+
+readahead_ratio also selects the readahead logic:
+0: disable readahead totally
+1-9: select the stock readahead logic
+10-inf: select the adaptive readahead logic
+
+The default value is 50; reasonable values would be 50-100.
+
+==============================================================
+
+readahead_hit_rate
+
+This is the max allowed value of (readahead-pages : accessed-pages).
+It is used only when (readahead_ratio >= 10). If the previous readahead
+request has a bad hit rate, the kernel will be reluctant to do the next
+readahead.
+
+A larger value helps catch more sparse access patterns. Be aware that
+readahead of sparse patterns trades memory for speed.
+
+The default value is 2.
+It is recommended to keep the value below (max-readahead-pages / 8).
--
* [PATCH 12/23] readahead: min/max sizes
2006-03-19 2:34 [PATCH 00/23] Adaptive read-ahead V11 Wu Fengguang
` (10 preceding siblings ...)
2006-03-19 2:34 ` [PATCH 11/23] readahead: sysctl parameters Wu Fengguang
@ 2006-03-19 2:34 ` Wu Fengguang
2006-03-19 2:34 ` [PATCH 13/23] readahead: page cache aging accounting Wu Fengguang
` (11 subsequent siblings)
23 siblings, 0 replies; 37+ messages in thread
From: Wu Fengguang @ 2006-03-19 2:34 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-kernel, Wu Fengguang
[-- Attachment #1: readahead-parameter-minmax-sizes.patch --]
[-- Type: text/plain, Size: 2003 bytes --]
- Enlarge VM_MAX_READAHEAD to 1024 if the new read-ahead code is compiled in.
This value is no longer tightly coupled with the thrashing problem, and
therefore no longer constrained by it. The adaptive read-ahead logic merely
takes it as an upper bound, and will not stick to it under memory pressure.
- Slightly enlarge the minimal/initial read-ahead size on big-memory systems.
Memory-abundant systems are less likely to suffer from thrashing on small
read-ahead sizes. A bigger initial value helps speed up the growth of
ra_size.
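To put the new ceiling in page terms, a worked check with the KB() helper
already defined in mm/readahead.c, assuming a 4KB page size:

	#include <stdio.h>

	#define PAGE_CACHE_SIZE 4096		/* 4KB pages assumed */
	/* same definition as in mm/readahead.c */
	#define KB(size) (((size)*1024 + PAGE_CACHE_SIZE-1) / PAGE_CACHE_SIZE)

	int main(void)
	{
		printf("stock    max: %d pages\n", KB(128));	/* 32 */
		printf("adaptive max: %d pages\n", KB(1024));	/* 256 */
		return 0;
	}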
Signed-off-by: Wu Fengguang <wfg@mail.ustc.edu.cn>
---
include/linux/mm.h | 4 ++++
mm/readahead.c | 17 +++++++++++++++++
2 files changed, 21 insertions(+)
--- linux-2.6.16-rc6-mm2.orig/include/linux/mm.h
+++ linux-2.6.16-rc6-mm2/include/linux/mm.h
@@ -998,7 +998,11 @@ extern int filemap_populate(struct vm_ar
int write_one_page(struct page *page, int wait);
/* readahead.c */
+#ifdef CONFIG_ADAPTIVE_READAHEAD
+#define VM_MAX_READAHEAD 1024 /* kbytes */
+#else
#define VM_MAX_READAHEAD 128 /* kbytes */
+#endif
#define VM_MIN_READAHEAD 16 /* kbytes (includes current page) */
#define VM_MAX_CACHE_HIT 256 /* max pages in a row in cache before
* turning readahead off */
--- linux-2.6.16-rc6-mm2.orig/mm/readahead.c
+++ linux-2.6.16-rc6-mm2/mm/readahead.c
@@ -1008,4 +1008,21 @@ out:
return nr_pages;
}
+/*
+ * ra_min is mainly determined by the size of cache memory.
+ * Table of concrete numbers for 4KB page size:
+ * inactive + free (MB): 4 8 16 32 64 128 256 512 1024
+ * ra_min (KB): 16 16 16 16 20 24 32 48 64
+ */
+static inline void get_readahead_bounds(struct file_ra_state *ra,
+ unsigned long *ra_min,
+ unsigned long *ra_max)
+{
+ unsigned long pages;
+
+ pages = max_sane_readahead(KB(1024*1024));
+ *ra_max = min(min(pages, 0xFFFFUL), ra->ra_pages);
+ *ra_min = min(min(MIN_RA_PAGES + (pages>>13), KB(128)), *ra_max/2);
+}
+
#endif /* CONFIG_ADAPTIVE_READAHEAD */
--
* [PATCH 13/23] readahead: page cache aging accounting
2006-03-19 2:34 [PATCH 00/23] Adaptive read-ahead V11 Wu Fengguang
` (11 preceding siblings ...)
2006-03-19 2:34 ` [PATCH 12/23] readahead: min/max sizes Wu Fengguang
@ 2006-03-19 2:34 ` Wu Fengguang
2006-03-19 2:34 ` [PATCH 14/23] readahead: state based method Wu Fengguang
` (10 subsequent siblings)
23 siblings, 0 replies; 37+ messages in thread
From: Wu Fengguang @ 2006-03-19 2:34 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-kernel, Wu Fengguang
[-- Attachment #1: readahead-aging-accounting.patch --]
[-- Type: text/plain, Size: 4401 bytes --]
Collect info about the globally available memory and its consumption speed.
The data are used by the stateful method to estimate the thrashing threshold,
and are the decisive factor in the correctness/accuracy of the resulting
read-ahead size.
- The accounting is done on a per-node basis, since the current vm subsystem
allocates memory in a node-affine manner.
- readahead_aging is mainly increased on the first access of read-ahead
pages, which makes it go up steadily and smoothly; that helps improve the
accuracy for small/fast read-aheads.
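The aging counter effectively gives each node a clock that ticks once per
page flowing through the inactive list. As a hedged sketch of how the
stateful method will consume it (the real consumer in a later patch,
compute_thrashing_threshold(), adds readahead_ratio scaling and 64bit
arithmetic; the function name here is illustrative):

	/*
	 * Reference-model estimate: a stream that read stream_shift pages
	 * while the node aged by global_shift pages can safely keep about
	 * this many pages in an inactive list of global_size pages.
	 */
	static unsigned long thrashing_threshold_sketch(unsigned long stream_shift,
						unsigned long global_shift,
						unsigned long global_size)
	{
		if (!global_shift)
			global_shift = 1;	/* guard the division */
		return stream_shift * global_size / global_shift;
	}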
Signed-off-by: Wu Fengguang <wfg@mail.ustc.edu.cn>
---
include/linux/mm.h | 9 +++++++++
mm/memory.c | 1 +
mm/readahead.c | 51 +++++++++++++++++++++++++++++++++++++++++++++++++++
mm/swap.c | 2 ++
mm/vmscan.c | 3 +++
5 files changed, 66 insertions(+)
--- linux-2.6.16-rc6-mm2.orig/include/linux/mm.h
+++ linux-2.6.16-rc6-mm2/include/linux/mm.h
@@ -1031,6 +1031,15 @@ static inline int prefer_adaptive_readah
return readahead_ratio >= 10;
}
+DECLARE_PER_CPU(unsigned long, readahead_aging);
+static inline void inc_readahead_aging(void)
+{
+ if (prefer_adaptive_readahead()) {
+ per_cpu(readahead_aging, get_cpu())++;
+ put_cpu();
+ }
+}
+
/* Do stack extension */
extern int expand_stack(struct vm_area_struct *vma, unsigned long address);
#ifdef CONFIG_IA64
--- linux-2.6.16-rc6-mm2.orig/mm/memory.c
+++ linux-2.6.16-rc6-mm2/mm/memory.c
@@ -1984,6 +1984,7 @@ static int do_anonymous_page(struct mm_s
page_table = pte_offset_map_lock(mm, pmd, address, &ptl);
if (!pte_none(*page_table))
goto release;
+ inc_readahead_aging();
inc_mm_counter(mm, anon_rss);
lru_cache_add_active(page);
page_add_new_anon_rmap(page, vma, address);
--- linux-2.6.16-rc6-mm2.orig/mm/vmscan.c
+++ linux-2.6.16-rc6-mm2/mm/vmscan.c
@@ -440,6 +440,9 @@ static unsigned long shrink_page_list(st
if (PageWriteback(page))
goto keep_locked;
+ if (!PageReferenced(page))
+ inc_readahead_aging();
+
referenced = page_referenced(page, 1);
/* In active use or really unfreeable? Activate it. */
if (referenced && page_mapping_inuse(page))
--- linux-2.6.16-rc6-mm2.orig/mm/swap.c
+++ linux-2.6.16-rc6-mm2/mm/swap.c
@@ -128,6 +128,8 @@ void fastcall mark_page_accessed(struct
ClearPageReferenced(page);
} else if (!PageReferenced(page)) {
SetPageReferenced(page);
+ if (PageLRU(page))
+ inc_readahead_aging();
}
}
--- linux-2.6.16-rc6-mm2.orig/mm/readahead.c
+++ linux-2.6.16-rc6-mm2/mm/readahead.c
@@ -46,6 +46,13 @@ int readahead_hit_rate = 2;
EXPORT_SYMBOL(readahead_hit_rate);
/*
+ * Measures the aging process of cold pages.
+ * Mainly increased on fresh page references to make it smooth.
+ */
+DEFINE_PER_CPU(unsigned long, readahead_aging);
+EXPORT_PER_CPU_SYMBOL(readahead_aging);
+
+/*
* Detailed classification of read-ahead behaviors.
*/
#define RA_CLASS_SHIFT 4
@@ -1009,6 +1016,50 @@ out:
}
/*
+ * State based calculation of read-ahead request.
+ *
+ * This figure shows the meaning of file_ra_state members:
+ *
+ * chunk A chunk B
+ * +---------------------------+-------------------------------------------+
+ * | # | # |
+ * +---------------------------+-------------------------------------------+
+ * ^ ^ ^ ^
+ * la_index ra_index lookahead_index readahead_index
+ */
+
+/*
+ * The node's effective length of inactive_list(s).
+ */
+static unsigned long node_free_and_cold_pages(void)
+{
+ unsigned int i;
+ unsigned long sum = 0;
+ struct zone *zones = NODE_DATA(numa_node_id())->node_zones;
+
+ for (i = 0; i < MAX_NR_ZONES; i++)
+ sum += zones[i].nr_inactive +
+ zones[i].free_pages - zones[i].pages_low;
+
+ return sum;
+}
+
+/*
+ * The node's accumulated aging activities.
+ */
+static unsigned long node_readahead_aging(void)
+{
+ unsigned long cpu;
+ unsigned long sum = 0;
+ cpumask_t mask = node_to_cpumask(numa_node_id());
+
+ for_each_cpu_mask(cpu, mask)
+ sum += per_cpu(readahead_aging, cpu);
+
+ return sum;
+}
+
+/*
* ra_min is mainly determined by the size of cache memory.
* Table of concrete numbers for 4KB page size:
* inactive + free (MB): 4 8 16 32 64 128 256 512 1024
--
* [PATCH 14/23] readahead: state based method
2006-03-19 2:34 [PATCH 00/23] Adaptive read-ahead V11 Wu Fengguang
` (12 preceding siblings ...)
2006-03-19 2:34 ` [PATCH 13/23] readahead: page cache aging accounting Wu Fengguang
@ 2006-03-19 2:34 ` Wu Fengguang
2006-03-19 2:34 ` [PATCH 15/23] readahead: context " Wu Fengguang
` (9 subsequent siblings)
23 siblings, 0 replies; 37+ messages in thread
From: Wu Fengguang @ 2006-03-19 2:34 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-kernel
[-- Attachment #1: readahead-method-stateful.patch --]
[-- Type: text/plain, Size: 12309 bytes --]
This is the fast code path of adaptive read-ahead.
MAJOR STEPS
- estimate a thrashing safe ra_size;
- assemble the next read-ahead request in file_ra_state;
- submit it.
THE REFERENCE MODEL
1. inactive list has constant length and page flow speed
2. the observed stream receives a steady flow of read requests
3. no page activation, so that the inactive list forms a pipe
With that we get the picture shown below.
|<------------------------- constant length ------------------------->|
<<<<<<<<<<<<<<<<<<<<<<<<< steady flow of pages <<<<<<<<<<<<<<<<<<<<<<<<
+---------------------------------------------------------------------+
|tail inactive list head|
| ======= ==========---- |
| chunk A(stale pages) chunk B(stale + fresh pages) |
+---------------------------------------------------------------------+
REAL WORLD ISSUES
Real-world workloads will always have fluctuations (violating assumptions
1 and 2). To counteract this, a tunable parameter readahead_ratio is
introduced to make the estimation conservative enough. Violating assumption
3 will not lead to thrashing; it is assumed just for simplicity of discussion.
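To make the model concrete, a toy calculation with made-up numbers (the
in-kernel version, compute_thrashing_threshold(), computes the same ratio
with do_div() and then applies readahead_ratio):

	#include <stdio.h>

	int main(void)
	{
		unsigned long stream_shift = 1000;   /* pages read by the stream    */
		unsigned long global_shift = 10000;  /* pages aged in the mean time */
		unsigned long global_size  = 100000; /* inactive list length        */
		int readahead_ratio = 50;
		/* pages the stream can read before a fresh chunk is evicted */
		unsigned long threshold = stream_shift * global_size / global_shift;

		printf("thrashing threshold: %lu pages\n", threshold);	/* 10000 */
		printf("target ra_size:      %lu pages\n",
		       threshold * readahead_ratio / 100);		/* 5000 */
		return 0;
	}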
Signed-off-by: Wu Fengguang <wfg@mail.ustc.edu.cn>
---
include/linux/fs.h | 41 +++++--
mm/readahead.c | 286 +++++++++++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 317 insertions(+), 10 deletions(-)
--- linux-2.6.16-rc6-mm2.orig/include/linux/fs.h
+++ linux-2.6.16-rc6-mm2/include/linux/fs.h
@@ -611,19 +611,40 @@ struct fown_struct {
* Track a single file's readahead state
*/
struct file_ra_state {
- unsigned long start; /* Current window */
- unsigned long size;
- unsigned long flags; /* ra flags RA_FLAG_xxx*/
- unsigned long cache_hit; /* cache hit count*/
- unsigned long prev_page; /* Cache last read() position */
- unsigned long ahead_start; /* Ahead window */
- unsigned long ahead_size;
- unsigned long ra_pages; /* Maximum readahead window */
- unsigned long mmap_hit; /* Cache hit stat for mmap accesses */
- unsigned long mmap_miss; /* Cache miss stat for mmap accesses */
+ union {
+ struct { /* conventional read-ahead */
+ unsigned long start; /* Current window */
+ unsigned long size;
+ unsigned long ahead_start; /* Ahead window */
+ unsigned long ahead_size;
+ unsigned long cache_hit; /* cache hit count */
+ };
+#ifdef CONFIG_ADAPTIVE_READAHEAD
+ struct { /* adaptive read-ahead */
+ pgoff_t la_index;
+ pgoff_t ra_index;
+ pgoff_t lookahead_index;
+ pgoff_t readahead_index;
+ unsigned long age;
+ uint64_t cache_hits;
+ };
+#endif
+ };
+
+ /* mmap read-around */
+ unsigned long mmap_hit; /* Cache hit stat for mmap accesses */
+ unsigned long mmap_miss; /* Cache miss stat for mmap accesses */
+
+ /* common ones */
+ unsigned long flags; /* ra flags RA_FLAG_xxx*/
+ unsigned long prev_page; /* Cache last read() position */
+ unsigned long ra_pages; /* Maximum readahead window */
};
#define RA_FLAG_MISS 0x01 /* a cache miss occured against this file */
#define RA_FLAG_INCACHE 0x02 /* file is already in cache */
+#define RA_FLAG_MMAP (1UL<<31) /* mmaped page access */
+#define RA_FLAG_NO_LOOKAHEAD (1UL<<30) /* disable look-ahead */
+#define RA_FLAG_NFSD (1UL<<29) /* request from nfsd */
struct file {
/*
--- linux-2.6.16-rc6-mm2.orig/mm/readahead.c
+++ linux-2.6.16-rc6-mm2/mm/readahead.c
@@ -16,6 +16,7 @@
#include <linux/pagevec.h>
#include <linux/writeback.h>
#include <linux/nfsd/const.h>
+#include <asm/div64.h>
/* The default max/min read-ahead pages. */
#define KB(size) (((size)*1024 + PAGE_CACHE_SIZE-1) / PAGE_CACHE_SIZE)
@@ -1060,6 +1061,291 @@ static unsigned long node_readahead_agin
}
/*
+ * The 64bit cache_hits stores three accumulated values and a counter value.
+ * MSB LSB
+ * 3333333333333333 : 2222222222222222 : 1111111111111111 : 0000000000000000
+ */
+static inline int ra_cache_hit(struct file_ra_state *ra, int nr)
+{
+ return (ra->cache_hits >> (nr * 16)) & 0xFFFF;
+}
+
+/*
+ * Conceptual code:
+ * ra_cache_hit(ra, 1) += ra_cache_hit(ra, 0);
+ * ra_cache_hit(ra, 0) = 0;
+ */
+static inline void ra_addup_cache_hit(struct file_ra_state *ra)
+{
+ int n;
+
+ n = ra_cache_hit(ra, 0);
+ ra->cache_hits -= n;
+ n <<= 16;
+ ra->cache_hits += n;
+}
+
+/*
+ * The read-ahead is deemed success if cache-hit-rate >= 1/readahead_hit_rate.
+ */
+static inline int ra_cache_hit_ok(struct file_ra_state *ra)
+{
+ return ra_cache_hit(ra, 0) * readahead_hit_rate >=
+ (ra->lookahead_index - ra->la_index);
+}
+
+/*
+ * Check if @index falls in the @ra request.
+ */
+static inline int ra_has_index(struct file_ra_state *ra, pgoff_t index)
+{
+ if (index < ra->la_index || index >= ra->readahead_index)
+ return 0;
+
+ if (index >= ra->ra_index)
+ return 1;
+ else
+ return -1;
+}
+
+/*
+ * Which method is issuing this read-ahead?
+ */
+static inline void ra_set_class(struct file_ra_state *ra,
+ enum ra_class ra_class)
+{
+ unsigned long flags_mask;
+ unsigned long flags;
+ unsigned long old_ra_class;
+
+ flags_mask = ~(RA_CLASS_MASK | (RA_CLASS_MASK << RA_CLASS_SHIFT));
+ flags = ra->flags & flags_mask;
+
+ old_ra_class = (ra->flags & RA_CLASS_MASK) << RA_CLASS_SHIFT;
+
+ ra->flags = flags | old_ra_class | ra_class;
+
+ ra_addup_cache_hit(ra);
+ if (ra_class != RA_CLASS_STATE)
+ ra->cache_hits <<= 16;
+
+ ra->age = node_readahead_aging();
+}
+
+/*
+ * Where is the old read-ahead and look-ahead?
+ */
+static inline void ra_set_index(struct file_ra_state *ra,
+ pgoff_t la_index, pgoff_t ra_index)
+{
+ ra->la_index = la_index;
+ ra->ra_index = ra_index;
+}
+
+/*
+ * Where is the new read-ahead and look-ahead?
+ */
+static inline void ra_set_size(struct file_ra_state *ra,
+ unsigned long ra_size, unsigned long la_size)
+{
+ /* Disable look-ahead for loopback file. */
+ if (unlikely(ra->flags & RA_FLAG_NO_LOOKAHEAD))
+ la_size = 0;
+
+ ra->readahead_index = ra->ra_index + ra_size;
+ ra->lookahead_index = ra->readahead_index - la_size;
+}
+
+/*
+ * Submit IO for the read-ahead request in file_ra_state.
+ */
+static int ra_dispatch(struct file_ra_state *ra,
+ struct address_space *mapping, struct file *filp)
+{
+ pgoff_t eof_index;
+ unsigned long ra_size;
+ unsigned long la_size;
+ int actual;
+ enum ra_class ra_class;
+
+ ra_class = (ra->flags & RA_CLASS_MASK);
+ BUG_ON(ra_class == 0 || ra_class > RA_CLASS_END);
+
+ eof_index = ((i_size_read(mapping->host) - 1) >> PAGE_CACHE_SHIFT) + 1;
+ ra_size = ra->readahead_index - ra->ra_index;
+ la_size = ra->readahead_index - ra->lookahead_index;
+
+ /* Snap to EOF. */
+ if (unlikely(ra->ra_index >= eof_index))
+ return 0;
+ if (ra->readahead_index + ra_size / 2 > eof_index) {
+ if (ra_class == RA_CLASS_CONTEXT_AGGRESSIVE &&
+ eof_index > ra->lookahead_index + 1)
+ la_size = eof_index - ra->lookahead_index;
+ else
+ la_size = 0;
+ ra_size = eof_index - ra->ra_index;
+ ra_set_size(ra, ra_size, la_size);
+ }
+
+ actual = __do_page_cache_readahead(mapping, filp,
+ ra->ra_index, ra_size, la_size);
+
+#ifdef CONFIG_DEBUG_READAHEAD
+ if (ra->flags & RA_FLAG_MMAP)
+ ra_account(ra, RA_EVENT_READAHEAD_MMAP, actual);
+ if (ra->readahead_index == eof_index)
+ ra_account(ra, RA_EVENT_READAHEAD_EOF, actual);
+ if (la_size)
+ ra_account(ra, RA_EVENT_LOOKAHEAD, la_size);
+ if (ra_size > actual)
+ ra_account(ra, RA_EVENT_IO_CACHE_HIT, ra_size - actual);
+ ra_account(ra, RA_EVENT_READAHEAD, actual);
+
+ dprintk("readahead-%s(ino=%lu, index=%lu, ra=%lu+%lu-%lu) = %d\n",
+ ra_class_name[ra_class],
+ mapping->host->i_ino, ra->la_index,
+ ra->ra_index, ra_size, la_size, actual);
+#endif /* CONFIG_DEBUG_READAHEAD */
+
+ return actual;
+}
+
+/*
+ * Determine the ra request from primitive values.
+ *
+ * It applies the following rules:
+ * - Subtract the old look-ahead from ra_size to get the real safe read-ahead;
+ * - Set new la_size according to the (still large) ra_size;
+ * - Apply upper limits;
+ * - Make sure stream_shift is not too small.
+ * (So that the next global_shift will not be too small.)
+ *
+ * Input:
+ * ra_size stores the estimated thrashing-threshold.
+ * la_size stores the look-ahead size of previous request.
+ */
+static inline int adjust_rala(unsigned long ra_max,
+ unsigned long *ra_size, unsigned long *la_size)
+{
+ unsigned long stream_shift = *la_size;
+
+ if (*ra_size > *la_size)
+ *ra_size -= *la_size;
+ else {
+ ra_account(NULL, RA_EVENT_READAHEAD_SHRINK, *ra_size);
+ return 0;
+ }
+
+ *la_size = *ra_size / LOOKAHEAD_RATIO;
+
+ if (*ra_size > ra_max)
+ *ra_size = ra_max;
+ if (*la_size > *ra_size)
+ *la_size = *ra_size;
+
+ stream_shift += (*ra_size - *la_size);
+ if (stream_shift < *ra_size / 4)
+ *la_size -= (*ra_size / 4 - stream_shift);
+
+ return 1;
+}
+
+/*
+ * The function estimates two values:
+ * 1. thrashing-threshold for the current stream
+ * It is returned to make the next read-ahead request.
+ * 2. the remaining safe space for the current chunk
+ * It will be checked to ensure that the current chunk is safe.
+ *
+ * The computation is pretty accurate under heavy load, and fluctuates more
+ * on light load (with small global_shift), so the growth speed of ra_size
+ * must be limited, and a moderately large stream_shift must be ensured.
+ *
+ * This figure illustrates the formula used in the function:
+ * While the stream reads stream_shift pages inside the chunks,
+ * the chunks are shifted global_shift pages inside inactive_list.
+ *
+ * chunk A chunk B
+ * |<=============== global_shift ================|
+ * +-------------+ +-------------------+ |
+ * | # | | # | inactive_list |
+ * +-------------+ +-------------------+ head |
+ * |---->| |---------->|
+ * | |
+ * +-- stream_shift --+
+ */
+static inline unsigned long compute_thrashing_threshold(
+ struct file_ra_state *ra,
+ unsigned long *remain)
+{
+ unsigned long global_size;
+ unsigned long global_shift;
+ unsigned long stream_shift;
+ unsigned long ra_size;
+ uint64_t ll;
+
+ global_size = node_free_and_cold_pages();
+ global_shift = node_readahead_aging() - ra->age;
+ global_shift |= 1UL;
+ stream_shift = ra_cache_hit(ra, 0);
+
+ ll = (uint64_t) stream_shift * (global_size >> 9) * readahead_ratio * 5;
+ do_div(ll, global_shift);
+ ra_size = ll;
+
+ if (global_size > global_shift) {
+ ll = (uint64_t) stream_shift * (global_size - global_shift);
+ do_div(ll, global_shift);
+ *remain = ll;
+ } else
+ *remain = 0;
+
+ ddprintk("compute_thrashing_threshold: "
+ "at %lu ra %lu=%lu*%lu/%lu, remain %lu for %lu\n",
+ ra->readahead_index, ra_size,
+ stream_shift, global_size, global_shift,
+ *remain, ra->readahead_index - ra->lookahead_index);
+
+ return ra_size;
+}
+
+/*
+ * Main function for file_ra_state based read-ahead.
+ */
+static inline unsigned long
+state_based_readahead(struct address_space *mapping, struct file *filp,
+ struct file_ra_state *ra,
+ struct page *page, pgoff_t index,
+ unsigned long ra_size, unsigned long ra_max)
+{
+ unsigned long ra_old;
+ unsigned long la_size;
+ unsigned long remain_space;
+ unsigned long growth_limit;
+
+ la_size = ra->readahead_index - index;
+ ra_old = ra->readahead_index - ra->ra_index;
+ growth_limit = ra_size + ra_max / 16 +
+ (2 + readahead_ratio / 64) * ra_old;
+ ra_size = compute_thrashing_threshold(ra, &remain_space);
+
+ if (page && remain_space <= la_size && la_size > 1) {
+ rescue_pages(page, la_size);
+ return 0;
+ }
+
+ if (!adjust_rala(min(ra_max, growth_limit), &ra_size, &la_size))
+ return 0;
+
+ ra_set_class(ra, RA_CLASS_STATE);
+ ra_set_index(ra, index, ra->readahead_index);
+ ra_set_size(ra, ra_size, la_size);
+
+ return ra_dispatch(ra, mapping, filp);
+}
+
+/*
* ra_min is mainly determined by the size of cache memory.
* Table of concrete numbers for 4KB page size:
* inactive + free (MB): 4 8 16 32 64 128 256 512 1024
--
* [PATCH 15/23] readahead: context based method
2006-03-19 2:34 [PATCH 00/23] Adaptive read-ahead V11 Wu Fengguang
` (13 preceding siblings ...)
2006-03-19 2:34 ` [PATCH 14/23] readahead: state based method Wu Fengguang
@ 2006-03-19 2:34 ` Wu Fengguang
2006-03-19 2:34 ` [PATCH 16/23] readahead: other methods Wu Fengguang
` (8 subsequent siblings)
23 siblings, 0 replies; 37+ messages in thread
From: Wu Fengguang @ 2006-03-19 2:34 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-kernel, Wu Fengguang
[-- Attachment #1: readahead-method-context.patch --]
[-- Type: text/plain, Size: 16504 bytes --]
This is the slow code path of adaptive read-ahead.
No valid state info is available, so the page cache is queried to obtain
the required position/timing info. This kind of estimation is more conservative
than the stateful method, and also fluctuates more on load variance.
MAJOR STEPS
- look back/forward to find the ra_index;
- look back to estimate a thrashing safe ra_size;
- assemble the next read-ahead request in file_ra_state;
- submit it.
ALGORITHM DYNAMICS
* startup
When a sequential read is detected, chunk size is set to readahead-min
and grows with each readahead. The growth speed is controlled by
readahead-ratio. When readahead-ratio == 100, the new logic grows chunk
sizes exponentially -- like the current logic, but lags behind it at
early steps.
* stabilize
When chunk size reaches readahead-max, or comes close to
(readahead-ratio * thrashing-threshold)
it stops growing and stays there.
The main difference with the stock readahead logic occurs at and after
the time chunk size stops growing:
- The current logic grows chunk size exponentially in the normal case
and decreases it by 2 each time thrashing is seen. That can lead to
thrashing on almost every readahead for very slow streams.
- The new logic can stop at a size below the thrashing-threshold,
and remain stable there.
* on stream speed-up or system load fall
the thrashing-threshold follows and the chunk size is likely to be enlarged.
* on stream slow-down or system load surge
the thrashing-threshold falls.
If thrashing happens, the next read is treated as a random read,
and one more read restarts the chunk-size-growing phase.
For a slow stream that has (thrashing-threshold < readahead-max):
- When readahead-ratio = 100, there is only one chunk in cache most
of the time;
- When readahead-ratio = 50, there are two chunks in cache most of
the time.
- Lowering readahead-ratio helps cut down the chunk size gracefully
without thrashing (see the simulation sketch below).
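A toy simulation of the startup/stabilize dynamics above (illustrative
numbers; the real logic also honors ra_min/ra_max bounds and look-ahead
bookkeeping):

	#include <stdio.h>

	int main(void)
	{
		unsigned long ra_max = 256;	/* pages, made up */
		unsigned long threshold = 120;	/* thrashing-threshold, made up */
		unsigned long size = 8;		/* readahead-min */
		int readahead_ratio = 100;	/* try 50: smaller, safer chunks */
		int step;

		for (step = 0; step < 8; step++) {
			printf("step %d: chunk size %lu pages\n", step, size);
			size += size * readahead_ratio / 100;	/* startup */
			if (size > threshold * readahead_ratio / 100)
				size = threshold * readahead_ratio / 100;
			if (size > ra_max)			/* stabilize */
				size = ra_max;
		}
		return 0;
	}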
OVERHEADS
The context based method has some overhead compared to the stateful method,
due to more locking and memory scans.
Running oprofile on the following command shows the following differences:
# diff sparse sparse1
total oprofile samples run1 run2
stateful method 560482 558696
stateless method 564463 559413
So the average overhead is about 0.4%.
Detailed diffprofile data:
# diffprofile oprofile.50.stateful oprofile.50.stateless
2998 41.1% isolate_lru_pages
2669 26.4% shrink_zone
1822 14.7% system_call
1419 27.6% radix_tree_delete
1376 14.8% _raw_write_lock
1279 27.4% free_pages_bulk
1111 12.0% _raw_write_unlock
1035 43.3% free_hot_cold_page
849 15.3% unlock_page
786 29.6% page_referenced
710 4.6% kmap_atomic
651 26.4% __pagevec_release_nonlru
586 16.1% __rmqueue
578 11.3% find_get_page
481 15.5% page_waitqueue
440 6.6% add_to_page_cache
420 33.7% fget_light
260 4.3% get_page_from_freelist
223 13.7% find_busiest_group
221 35.1% mutex_debug_check_no_locks_freed
211 0.0% radix_tree_scan_hole
198 35.5% delay_tsc
195 14.8% ext3_get_branch
182 12.6% profile_tick
173 0.0% radix_tree_cache_lookup_node
164 22.9% find_next_bit
162 50.3% page_cache_readahead_adaptive
...
106 0.0% radix_tree_scan_hole_backward
...
-51 -7.6% radix_tree_preload
...
-68 -2.1% radix_tree_insert
...
-87 -2.0% mark_page_accessed
-88 -2.0% __pagevec_lru_add
-103 -7.7% softlockup_tick
-107 -71.8% free_block
-122 -77.7% do_IRQ
-132 -82.0% do_timer
-140 -47.1% ack_edge_ioapic_vector
-168 -81.2% handle_IRQ_event
-192 -35.2% irq_entries_start
-204 -14.8% rw_verify_area
-214 -13.2% account_system_time
-233 -9.5% radix_tree_lookup_node
-234 -16.6% scheduler_tick
-259 -58.7% __do_IRQ
-266 -6.8% put_page
-318 -29.3% rcu_pending
-333 -3.0% do_generic_mapping_read
-337 -28.3% hrtimer_run_queues
-493 -27.0% __rcu_pending
-1038 -9.4% default_idle
-3323 -3.5% __copy_to_user_ll
-10331 -5.9% do_mpage_readpage
# diffprofile oprofile.50.stateful2 oprofile.50.stateless2
1739 1.1% do_mpage_readpage
833 0.9% __copy_to_user_ll
340 21.3% find_busiest_group
288 9.5% free_hot_cold_page
261 4.6% _raw_read_unlock
239 3.9% get_page_from_freelist
201 0.0% radix_tree_scan_hole
163 14.3% raise_softirq
160 0.0% radix_tree_cache_lookup_node
160 11.8% update_process_times
136 9.3% fget_light
121 35.1% page_cache_readahead_adaptive
117 36.0% restore_all
117 2.8% mark_page_accessed
109 6.4% rebalance_tick
107 9.4% sys_read
102 0.0% radix_tree_scan_hole_backward
...
63 4.0% readahead_cache_hit
...
-10 -15.9% radix_tree_node_alloc
...
-39 -1.7% radix_tree_lookup_node
-39 -10.3% irq_entries_start
-43 -1.3% radix_tree_insert
...
-47 -4.6% __do_page_cache_readahead
-64 -9.3% radix_tree_preload
-65 -5.4% rw_verify_area
-65 -2.2% vfs_read
-70 -4.7% timer_interrupt
-71 -1.0% __wake_up_bit
-73 -1.1% radix_tree_delete
-79 -12.6% __mod_page_state_offset
-94 -1.8% __find_get_block
-94 -2.2% __pagevec_lru_add
-102 -1.7% free_pages_bulk
-116 -1.3% _raw_read_lock
-123 -7.4% do_sync_read
-130 -8.4% ext3_get_blocks_handle
-142 -3.8% put_page
-146 -7.9% mpage_readpages
-147 -5.6% apic_timer_interrupt
-168 -1.6% _raw_write_unlock
-172 -5.0% page_referenced
-206 -3.2% unlock_page
-212 -15.0% restore_nocheck
-213 -2.1% default_idle
-245 -5.0% __rmqueue
-278 -4.3% find_get_page
-282 -2.1% system_call
-287 -11.8% run_timer_softirq
-300 -2.7% _raw_write_lock
-420 -3.2% shrink_zone
-661 -5.7% isolate_lru_pages
Signed-off-by: Wu Fengguang <wfg@mail.ustc.edu.cn>
---
mm/readahead.c | 343 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++
1 files changed, 343 insertions(+)
--- linux-2.6.16-rc6-mm2.orig/mm/readahead.c
+++ linux-2.6.16-rc6-mm2/mm/readahead.c
@@ -1346,6 +1346,349 @@ state_based_readahead(struct address_spa
}
/*
+ * Page cache context based estimation of read-ahead/look-ahead size/index.
+ *
+ * The logic first looks around to find the start point of next read-ahead,
+ * and then, if necessary, looks backward in the inactive_list to get an
+ * estimation of the thrashing-threshold.
+ *
+ * The estimation theory can be illustrated with figure:
+ *
+ * chunk A chunk B chunk C head
+ *
+ * l01 l11 l12 l21 l22
+ *| |-->|-->| |------>|-->| |------>|
+ *| +-------+ +-----------+ +-------------+ |
+ *| | # | | # | | # | |
+ *| +-------+ +-----------+ +-------------+ |
+ *| |<==============|<===========================|<============================|
+ * L0 L1 L2
+ *
+ * Let f(l) = L be a map from
+ * l: the number of pages read by the stream
+ * to
+ * L: the number of pages pushed into inactive_list in the mean time
+ * then
+ * f(l01) <= L0
+ * f(l11 + l12) = L1
+ * f(l21 + l22) = L2
+ * ...
+ * f(l01 + l11 + ...) <= Sum(L0 + L1 + ...)
+ * <= Length(inactive_list) = f(thrashing-threshold)
+ *
+ * So the count of continuous history pages left in the inactive_list is always
+ * a lower bound of the true thrashing-threshold.
+ */
+
+#define PAGE_REFCNT_0 0
+#define PAGE_REFCNT_1 (1 << PG_referenced)
+#define PAGE_REFCNT_2 (1 << PG_active)
+#define PAGE_REFCNT_3 ((1 << PG_active) | (1 << PG_referenced))
+#define PAGE_REFCNT_MASK PAGE_REFCNT_3
+
+/*
+ * STATUS REFERENCE COUNT
+ * __ 0
+ * _R PAGE_REFCNT_1
+ * A_ PAGE_REFCNT_2
+ * AR PAGE_REFCNT_3
+ *
+ * A/R: Active / Referenced
+ */
+static inline unsigned long page_refcnt(struct page *page)
+{
+ return page->flags & PAGE_REFCNT_MASK;
+}
+
+/*
+ * STATUS REFERENCE COUNT TYPE
+ * __ 0 fresh
+ * _R PAGE_REFCNT_1 stale
+ * A_ PAGE_REFCNT_2 disturbed once
+ * AR PAGE_REFCNT_3 disturbed twice
+ *
+ * A/R: Active / Referenced
+ */
+static inline unsigned long cold_page_refcnt(struct page *page)
+{
+ if (!page || PageActive(page))
+ return 0;
+
+ return page_refcnt(page);
+}
+
+static inline char page_refcnt_symbol(struct page *page)
+{
+ if (!page)
+ return 'X';
+
+ switch (page_refcnt(page)) {
+ case 0:
+ return '_';
+ case PAGE_REFCNT_1:
+ return '-';
+ case PAGE_REFCNT_2:
+ return '=';
+ case PAGE_REFCNT_3:
+ return '#';
+ default:
+ return '?';
+ }
+}
+
+/*
+ * Count/estimate cache hits in range [first_index, last_index].
+ * The estimation is simple and optimistic.
+ */
+static int count_cache_hit(struct address_space *mapping,
+ pgoff_t first_index, pgoff_t last_index)
+{
+ struct page *page;
+ int size = last_index - first_index + 1;
+ int count = 0;
+ int i;
+
+ cond_resched();
+ read_lock_irq(&mapping->tree_lock);
+
+ /*
+ * The first page may well be the chunk head and have been accessed,
+ * so sampling index 0 makes the estimation optimistic. This
+ * behavior guarantees a readahead when (size < ra_max) and
+ * (readahead_hit_rate >= 16).
+ */
+ for (i = 0; i < 16;) {
+ page = __find_page(mapping, first_index +
+ size * ((i++ * 29) & 15) / 16);
+ if (cold_page_refcnt(page) >= PAGE_REFCNT_1 && ++count >= 2)
+ break;
+ }
+
+ read_unlock_irq(&mapping->tree_lock);
+
+ return size * count / i;
+}
+
+/*
+ * Look back and check history pages to estimate thrashing-threshold.
+ */
+static unsigned long query_page_cache_segment(struct address_space *mapping,
+ struct file_ra_state *ra,
+ unsigned long *remain, pgoff_t offset,
+ unsigned long ra_min, unsigned long ra_max)
+{
+ pgoff_t index;
+ unsigned long count;
+ unsigned long nr_lookback;
+ struct radix_tree_cache cache;
+
+ /*
+ * Scan backward and check the near @ra_max pages.
+ * The count here determines ra_size.
+ */
+ cond_resched();
+ read_lock_irq(&mapping->tree_lock);
+ index = radix_tree_scan_hole_backward(&mapping->page_tree,
+ offset, ra_max);
+ read_unlock_irq(&mapping->tree_lock);
+
+ *remain = offset - index;
+
+ if (offset == ra->readahead_index && ra_cache_hit_ok(ra))
+ count = *remain;
+ else if (count_cache_hit(mapping, index + 1, offset) *
+ readahead_hit_rate >= *remain)
+ count = *remain;
+ else
+ count = ra_min;
+
+ /*
+ * Unnecessary to count more?
+ */
+ if (count < ra_max)
+ goto out;
+
+ if (unlikely(ra->flags & RA_FLAG_NO_LOOKAHEAD))
+ goto out;
+
+ /*
+ * Check the far pages coarsely.
+ * The big count here helps increase la_size.
+ */
+ nr_lookback = ra_max * (LOOKAHEAD_RATIO + 1) *
+ 100 / (readahead_ratio + 1);
+
+ cond_resched();
+ radix_tree_cache_init(&cache);
+ read_lock_irq(&mapping->tree_lock);
+ for (count += ra_max; count < nr_lookback; count += ra_max) {
+ struct radix_tree_node *node;
+ node = radix_tree_cache_lookup_node(&mapping->page_tree,
+ &cache, offset - count, 1);
+ if (!node)
+ break;
+ }
+ read_unlock_irq(&mapping->tree_lock);
+
+out:
+ /*
+ * For sequential read that extends from index 0, the counted value
+ * may well be far under the true threshold, so return it unmodified
+ * for further process in adjust_rala_aggressive().
+ */
+ if (count >= offset)
+ count = offset;
+ else
+ count = max(ra_min, count * readahead_ratio / 100);
+
+ ddprintk("query_page_cache_segment: "
+ "ino=%lu, idx=%lu, count=%lu, remain=%lu\n",
+ mapping->host->i_ino, offset, count, *remain);
+
+ return count;
+}
+
+/*
+ * Find past-the-end index of the segment before @index.
+ */
+static inline pgoff_t find_segtail_backward(struct address_space *mapping,
+ pgoff_t index, unsigned long max_scan)
+{
+ struct radix_tree_cache cache;
+ struct page *page;
+ pgoff_t origin;
+
+ origin = index;
+ if (max_scan > index)
+ max_scan = index;
+
+ cond_resched();
+ radix_tree_cache_init(&cache);
+ read_lock_irq(&mapping->tree_lock);
+ for (; origin - index < max_scan;) {
+ page = radix_tree_cache_lookup(&mapping->page_tree,
+ &cache, --index);
+ if (page) {
+ read_unlock_irq(&mapping->tree_lock);
+ return index + 1;
+ }
+ }
+ read_unlock_irq(&mapping->tree_lock);
+
+ return 0;
+}
+
+/*
+ * Find past-the-end index of the segment at @index.
+ */
+static inline pgoff_t find_segtail(struct address_space *mapping,
+ pgoff_t index, unsigned long max_scan)
+{
+ pgoff_t ra_index;
+
+ cond_resched();
+ read_lock_irq(&mapping->tree_lock);
+ ra_index = radix_tree_scan_hole(&mapping->page_tree, index, max_scan);
+ read_unlock_irq(&mapping->tree_lock);
+
+ if (ra_index <= index + max_scan)
+ return ra_index;
+ else
+ return 0;
+}
+
+/*
+ * Determine the request parameters for context based read-ahead that extends
+ * from start of file.
+ *
+ * The major weakness of the stateless method is perhaps the slow growth of
+ * ra_size. The logic tries to make up for this in the important case of
+ * sequential reads that extend from start of file. In this case, the ra_size
+ * is not chosen to make the whole next chunk safe (as in normal ones). Only
+ * half of it is safe. The added 'unsafe' half is the look-ahead part. It
+ * is expected to be safeguarded by rescue_pages() when the previous chunks are
+ * lost.
+ */
+static inline int adjust_rala_aggressive(unsigned long ra_max,
+ unsigned long *ra_size, unsigned long *la_size)
+{
+ pgoff_t index = *ra_size;
+
+ *ra_size -= min(*ra_size, *la_size);
+ *ra_size = *ra_size * readahead_ratio / 100;
+ *la_size = index * readahead_ratio / 100;
+ *ra_size += *la_size;
+
+ if (*ra_size > ra_max)
+ *ra_size = ra_max;
+ if (*la_size > *ra_size)
+ *la_size = *ra_size;
+
+ return 1;
+}
+
+/*
+ * Main function for page context based read-ahead.
+ */
+static inline int
+try_context_based_readahead(struct address_space *mapping,
+ struct file_ra_state *ra, struct page *prev_page,
+ struct page *page, pgoff_t index,
+ unsigned long ra_min, unsigned long ra_max)
+{
+ pgoff_t ra_index;
+ unsigned long ra_size;
+ unsigned long la_size;
+ unsigned long remain_pages;
+
+ /* Where to start read-ahead?
+ * NFSv3 daemons may process adjacent requests in parallel,
+ * leading to many locally disordered, globally sequential reads.
+ * So do not require nearby history pages to be present or accessed.
+ */
+ if (page) {
+ ra_index = find_segtail(mapping, index, ra_max * 5 / 4);
+ if (!ra_index)
+ return -1;
+ } else if (prev_page || find_page(mapping, index - 1)) {
+ ra_index = index;
+ } else if (readahead_hit_rate > 1) {
+ ra_index = find_segtail_backward(mapping, index,
+ readahead_hit_rate + ra_min);
+ if (!ra_index)
+ return 0;
+ ra_min += 2 * (index - ra_index);
+ index = ra_index; /* pretend the request starts here */
+ } else
+ return 0;
+
+ ra_size = query_page_cache_segment(mapping, ra, &remain_pages,
+ index, ra_min, ra_max);
+
+ la_size = ra_index - index;
+ if (page && remain_pages <= la_size &&
+ remain_pages < index && la_size > 1) {
+ rescue_pages(page, la_size);
+ return -1;
+ }
+
+ if (ra_size == index) {
+ if (!adjust_rala_aggressive(ra_max, &ra_size, &la_size))
+ return -1;
+ ra_set_class(ra, RA_CLASS_CONTEXT_AGGRESSIVE);
+ } else {
+ if (!adjust_rala(ra_max, &ra_size, &la_size))
+ return -1;
+ ra_set_class(ra, RA_CLASS_CONTEXT);
+ }
+
+ ra_set_index(ra, index, ra_index);
+ ra_set_size(ra, ra_size, la_size);
+
+ return 1;
+}
+
+/*
* ra_min is mainly determined by the size of cache memory.
* Table of concrete numbers for 4KB page size:
* inactive + free (MB): 4 8 16 32 64 128 256 512 1024
--
* [PATCH 16/23] readahead: other methods
2006-03-19 2:34 [PATCH 00/23] Adaptive read-ahead V11 Wu Fengguang
` (14 preceding siblings ...)
2006-03-19 2:34 ` [PATCH 15/23] readahead: context " Wu Fengguang
@ 2006-03-19 2:34 ` Wu Fengguang
2006-03-19 2:34 ` [PATCH 17/23] readahead: call scheme Wu Fengguang
` (7 subsequent siblings)
23 siblings, 0 replies; 37+ messages in thread
From: Wu Fengguang @ 2006-03-19 2:34 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-kernel, Wu Fengguang
[-- Attachment #1: readahead-method-others.patch --]
[-- Type: text/plain, Size: 5066 bytes --]
Various read-ahead strategies for:
- fresh reads from the start of a file
- backward prefetching
- the seek-and-read-one-block pattern (db workload; see the sketch below)
- quick recovery from thrashing
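The seek-and-read-one-block case is the least obvious of these; in
userspace terms, try_readahead_on_seek() is fishing for roughly this
access shape (an illustration of the workload, with a hypothetical data
file):

	#define _XOPEN_SOURCE 500
	#include <fcntl.h>
	#include <stdlib.h>
	#include <unistd.h>

	int main(void)
	{
		char buf[4096];
		int fd = open("datafile", O_RDONLY);	/* hypothetical */
		int i;

		if (fd < 0)
			return 1;
		for (i = 0; i < 1000; i++) {
			off_t block = rand() % 25600;	/* seek somewhere... */
			/* ...and read one db-sized block there */
			pread(fd, buf, sizeof(buf), block * 4096);
		}
		return close(fd);
	}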
Signed-off-by: Wu Fengguang <wfg@mail.ustc.edu.cn>
---
mm/readahead.c | 153 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++
1 files changed, 153 insertions(+)
--- linux-2.6.16-rc6-mm2.orig/mm/readahead.c
+++ linux-2.6.16-rc6-mm2/mm/readahead.c
@@ -1689,6 +1689,159 @@ try_context_based_readahead(struct addre
}
/*
+ * Read-ahead on start of file.
+ *
+ * The strategies here are most important for small files.
+ * 1. Set a moderately large read-ahead size;
+ * 2. Issue the next read-ahead request as soon as possible.
+ *
+ * But be careful: there are some applications that dip into only the very head
+ * of a file. The most important thing is to prevent them from triggering the
+ * next (much larger) read-ahead request, which leads to lots of cache misses.
+ * Two pages should be enough for them, correct me if I'm wrong.
+ */
+static inline unsigned long
+newfile_readahead(struct address_space *mapping,
+ struct file *filp, struct file_ra_state *ra,
+ unsigned long req_size, unsigned long ra_min)
+{
+ unsigned long ra_size;
+ unsigned long la_size;
+
+ if (req_size > ra_min) /* larger value risks thrashing */
+ req_size = ra_min;
+
+ ra_size = 4 * req_size;
+ la_size = 2 * req_size;
+
+ ra_set_class(ra, RA_CLASS_NEWFILE);
+ ra_set_index(ra, 0, 0);
+ ra_set_size(ra, ra_size, la_size);
+
+ return ra_dispatch(ra, mapping, filp);
+}
+
+/*
+ * Backward prefetching.
+ * No look-ahead or thrashing-threshold estimation for the stepping-backward
+ * pattern: they should be unnecessary.
+ */
+static inline int
+try_read_backward(struct file_ra_state *ra, pgoff_t begin_index,
+ unsigned long ra_size, unsigned long ra_max)
+{
+ pgoff_t end_index;
+
+ /* Are we reading backward? */
+ if (begin_index > ra->prev_page)
+ return 0;
+
+ if ((ra->flags & RA_CLASS_MASK) == RA_CLASS_BACKWARD &&
+ ra_has_index(ra, ra->prev_page)) {
+ ra_size += 2 * ra_cache_hit(ra, 0);
+ end_index = ra->la_index;
+ } else {
+ ra_size += ra_size + ra_size * (readahead_hit_rate - 1) / 2;
+ end_index = ra->prev_page;
+ }
+
+ if (ra_size > ra_max)
+ ra_size = ra_max;
+
+ /* Read traces close enough to be covered by the prefetching? */
+ if (end_index > begin_index + ra_size)
+ return 0;
+
+ begin_index = end_index - ra_size;
+
+ ra_set_class(ra, RA_CLASS_BACKWARD);
+ ra_set_index(ra, begin_index, begin_index);
+ ra_set_size(ra, ra_size, 0);
+
+ return 1;
+}
+
+/*
+ * Readahead thrashing recovery.
+ */
+static inline unsigned long
+thrashing_recovery_readahead(struct address_space *mapping,
+ struct file *filp, struct file_ra_state *ra,
+ pgoff_t index, unsigned long ra_max)
+{
+ unsigned long ra_size;
+
+ if (readahead_debug_level && find_page(mapping, index - 1))
+ ra_account(ra, RA_EVENT_READAHEAD_MUTILATE,
+ ra->readahead_index - index);
+ ra_account(ra, RA_EVENT_READAHEAD_THRASHING,
+ ra->readahead_index - index);
+
+ /*
+ * Some thrashing occurs in (ra_index, la_index], in which case the
+ * old read-ahead chunk is lost soon after the new one is allocated.
+ * Ensure that we recover all needed pages in the old chunk.
+ */
+ if (index < ra->ra_index)
+ ra_size = ra->ra_index - index;
+ else {
+ /* After thrashing, we know the exact thrashing-threshold. */
+ ra_size = ra_cache_hit(ra, 0);
+
+ /* And we'd better be a bit conservative. */
+ ra_size = ra_size * 3 / 4;
+ }
+
+ if (ra_size > ra_max)
+ ra_size = ra_max;
+
+ ra_set_class(ra, RA_CLASS_THRASHING);
+ ra_set_index(ra, index, index);
+ ra_set_size(ra, ra_size, ra_size / LOOKAHEAD_RATIO);
+
+ return ra_dispatch(ra, mapping, filp);
+}
+
+/*
+ * If there is a previous sequential read, it is likely to be another
+ * sequential read at the new position.
+ * Databases are known to have this seek-and-read-one-block pattern.
+ */
+static inline int
+try_readahead_on_seek(struct file_ra_state *ra, pgoff_t index,
+ unsigned long ra_size, unsigned long ra_max)
+{
+ unsigned long hit0 = ra_cache_hit(ra, 0);
+ unsigned long hit1 = ra_cache_hit(ra, 1) + hit0;
+ unsigned long hit2 = ra_cache_hit(ra, 2);
+ unsigned long hit3 = ra_cache_hit(ra, 3);
+
+ /* There's a previous read-ahead request? */
+ if (!ra_has_index(ra, ra->prev_page))
+ return 0;
+
+ /* Do the previous read-ahead sequences have similar sizes? */
+ if (!(ra_size < hit1 && hit1 > hit2 / 2 &&
+ hit2 > hit3 / 2 &&
+ hit3 > hit1 / 2))
+ return 0;
+
+ hit1 = max(hit1, hit2);
+
+ /* Follow the same prefetching direction. */
+ if ((ra->flags & RA_CLASS_MASK) == RA_CLASS_BACKWARD)
+ index = ((index > hit1 - ra_size) ? index - hit1 + ra_size : 0);
+
+ ra_size = min(hit1, ra_max);
+
+ ra_set_class(ra, RA_CLASS_SEEK);
+ ra_set_index(ra, index, index);
+ ra_set_size(ra, ra_size, 0);
+
+ return 1;
+}
+
+/*
* ra_min is mainly determined by the size of cache memory.
* Table of concrete numbers for 4KB page size:
* inactive + free (MB): 4 8 16 32 64 128 256 512 1024
--
* [PATCH 17/23] readahead: call scheme
2006-03-19 2:34 [PATCH 00/23] Adaptive read-ahead V11 Wu Fengguang
` (15 preceding siblings ...)
2006-03-19 2:34 ` [PATCH 16/23] readahead: other methods Wu Fengguang
@ 2006-03-19 2:34 ` Wu Fengguang
2006-03-19 2:34 ` [PATCH 18/23] readahead: laptop mode Wu Fengguang
` (6 subsequent siblings)
23 siblings, 0 replies; 37+ messages in thread
From: Wu Fengguang @ 2006-03-19 2:34 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-kernel, Wu Fengguang
[-- Attachment #1: readahead-call-scheme.patch --]
[-- Type: text/plain, Size: 9514 bytes --]
The read-ahead logic is called when the reading hits
- a look-ahead mark;
- a non-present page.
ra.prev_page should be properly set up on entrance, and readahead_cache_hit()
should be called on every page reference to maintain the cache_hits counter.
This call scheme achieves the following goals:
- makes all stateful/stateless methods happy;
- eliminates the cache hit problem naturally;
- lives in harmony with application managed read-aheads via
fadvise/madvise.
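Condensed from the do_generic_mapping_read() hunk below, the caller-side
contract looks like this (a sketch, not a third call site; the real code
also re-does the page lookup after a cache miss):

	/* 1. keep prev_page current before asking for read-ahead */
	ra.prev_page = prev_index;

	page = find_get_page(mapping, index);

	/* 2. the two trigger points: cache miss, or look-ahead mark */
	if (!page || PageReadahead(page))
		page_cache_readahead_adaptive(mapping, &ra, filp,
					prev_page, page,
					begin_index, index, end_index);

	/* 3. feedback on every page actually referenced */
	if (page)
		readahead_cache_hit(&ra, page);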
Signed-off-by: Wu Fengguang <wfg@mail.ustc.edu.cn>
---
include/linux/mm.h | 7 ++
mm/filemap.c | 47 ++++++++++++++--
mm/readahead.c | 155 +++++++++++++++++++++++++++++++++++++++++++++++++++++
3 files changed, 205 insertions(+), 4 deletions(-)
--- linux-2.6.16-rc6-mm2.orig/include/linux/mm.h
+++ linux-2.6.16-rc6-mm2/include/linux/mm.h
@@ -1019,10 +1019,17 @@ unsigned long page_cache_readahead(struc
void handle_ra_miss(struct address_space *mapping,
struct file_ra_state *ra, pgoff_t offset);
unsigned long max_sane_readahead(unsigned long nr);
+unsigned long
+page_cache_readahead_adaptive(struct address_space *mapping,
+ struct file_ra_state *ra, struct file *filp,
+ struct page *prev_page, struct page *page,
+ pgoff_t first_index, pgoff_t index, pgoff_t last_index);
#ifdef CONFIG_ADAPTIVE_READAHEAD
+void fastcall readahead_cache_hit(struct file_ra_state *ra, struct page *page);
extern int readahead_ratio;
#else
+#define readahead_cache_hit(ra, page) do { } while (0)
#define readahead_ratio 1
#endif /* CONFIG_ADAPTIVE_READAHEAD */
--- linux-2.6.16-rc6-mm2.orig/mm/filemap.c
+++ linux-2.6.16-rc6-mm2/mm/filemap.c
@@ -833,14 +833,32 @@ void do_generic_mapping_read(struct addr
nr = nr - offset;
cond_resched();
- if (index == next_index)
+
+ if (!prefer_adaptive_readahead() && index == next_index)
next_index = page_cache_readahead(mapping, &ra, filp,
index, last_index - index);
find_page:
page = find_get_page(mapping, index);
+ if (prefer_adaptive_readahead()) {
+ if (unlikely(page == NULL)) {
+ ra.prev_page = prev_index;
+ page_cache_readahead_adaptive(mapping, &ra,
+ filp, prev_page, NULL,
+ *ppos >> PAGE_CACHE_SHIFT,
+ index, last_index);
+ page = find_get_page(mapping, index);
+ } else if (PageReadahead(page)) {
+ ra.prev_page = prev_index;
+ page_cache_readahead_adaptive(mapping, &ra,
+ filp, prev_page, page,
+ *ppos >> PAGE_CACHE_SHIFT,
+ index, last_index);
+ }
+ }
if (unlikely(page == NULL)) {
- handle_ra_miss(mapping, &ra, index);
+ if (!prefer_adaptive_readahead())
+ handle_ra_miss(mapping, &ra, index);
goto no_cached_page;
}
@@ -848,6 +866,7 @@ find_page:
page_cache_release(prev_page);
prev_page = page;
+ readahead_cache_hit(&ra, page);
if (!PageUptodate(page))
goto page_not_up_to_date;
page_ok:
@@ -991,6 +1010,8 @@ no_cached_page:
out:
*_ra = ra;
+ if (prefer_adaptive_readahead())
+ _ra->prev_page = prev_index;
*ppos = ((loff_t) index << PAGE_CACHE_SHIFT) + offset;
if (cached_page)
@@ -1275,6 +1296,7 @@ struct page *filemap_nopage(struct vm_ar
unsigned long size, pgoff;
int did_readaround = 0, majmin = VM_FAULT_MINOR;
+ ra->flags |= RA_FLAG_MMAP;
pgoff = ((address-area->vm_start) >> PAGE_CACHE_SHIFT) + area->vm_pgoff;
retry_all:
@@ -1292,19 +1314,33 @@ retry_all:
*
* For sequential accesses, we use the generic readahead logic.
*/
- if (VM_SequentialReadHint(area))
+ if (!prefer_adaptive_readahead() && VM_SequentialReadHint(area))
page_cache_readahead(mapping, ra, file, pgoff, 1);
+
/*
* Do we have something in the page cache already?
*/
retry_find:
page = find_get_page(mapping, pgoff);
+ if (prefer_adaptive_readahead() && VM_SequentialReadHint(area)) {
+ if (!page) {
+ page_cache_readahead_adaptive(mapping, ra,
+ file, NULL, NULL,
+ pgoff, pgoff, pgoff + 1);
+ page = find_get_page(mapping, pgoff);
+ } else if (PageReadahead(page)) {
+ page_cache_readahead_adaptive(mapping, ra,
+ file, NULL, page,
+ pgoff, pgoff, pgoff + 1);
+ }
+ }
if (!page) {
unsigned long ra_pages;
if (VM_SequentialReadHint(area)) {
- handle_ra_miss(mapping, ra, pgoff);
+ if (!prefer_adaptive_readahead())
+ handle_ra_miss(mapping, ra, pgoff);
goto no_cached_page;
}
ra->mmap_miss++;
@@ -1341,6 +1377,7 @@ retry_find:
if (!did_readaround)
ra->mmap_hit++;
+ readahead_cache_hit(ra, page);
/*
* Ok, found a page in the page cache, now we need to check
* that it's up-to-date.
@@ -1355,6 +1392,8 @@ success:
mark_page_accessed(page);
if (type)
*type = majmin;
+ if (prefer_adaptive_readahead())
+ ra->prev_page = page->index;
return page;
outside_data_content:
--- linux-2.6.16-rc6-mm2.orig/mm/readahead.c
+++ linux-2.6.16-rc6-mm2/mm/readahead.c
@@ -1858,4 +1858,159 @@ static inline void get_readahead_bounds(
*ra_min = min(min(MIN_RA_PAGES + (pages>>13), KB(128)), *ra_max/2);
}
+/**
+ * page_cache_readahead_adaptive - adaptive read-ahead main function
+ * @mapping, @ra, @filp: the same as page_cache_readahead()
+ * @prev_page: the page at @index-1, may be NULL to let the function find it
+ * @page: the page at @index, or NULL if non-present
+ * @begin_index, @index, @end_index: offsets into @mapping
+ * [@begin_index, @end_index) is the read the caller is performing
+ * @index indicates the page to be read now
+ *
+ * page_cache_readahead_adaptive() is the entry point of the adaptive
+ * read-ahead logic. It tries a set of methods in turn to determine the
+ * appropriate readahead action and submits the readahead I/O.
+ *
+ * The caller is expected to point ra->prev_page to the previously accessed
+ * page, and to call it on two conditions:
+ * 1. @page == NULL
+ * A cache miss happened, some pages have to be read in
+ * 2. @page != NULL && PageReadahead(@page)
+ * A look-ahead mark encountered, this is set by a previous read-ahead
+ * invocation to instruct the caller to give the function a chance to
+ * check up and do the next read-ahead in advance.
+ */
+unsigned long
+page_cache_readahead_adaptive(struct address_space *mapping,
+ struct file_ra_state *ra, struct file *filp,
+ struct page *prev_page, struct page *page,
+ pgoff_t begin_index, pgoff_t index, pgoff_t end_index)
+{
+ unsigned long size;
+ unsigned long ra_min;
+ unsigned long ra_max;
+ int ret;
+
+ might_sleep();
+
+ if (page) {
+ if (!TestClearPageReadahead(page))
+ return 0;
+ if (bdi_read_congested(mapping->backing_dev_info)) {
+ ra_account(ra, RA_EVENT_IO_CONGESTION,
+ end_index - index);
+ return 0;
+ }
+ }
+
+ if (page)
+ ra_account(ra, RA_EVENT_LOOKAHEAD_HIT,
+ ra->readahead_index - ra->lookahead_index);
+ else if (index)
+ ra_account(ra, RA_EVENT_CACHE_MISS, end_index - begin_index);
+
+ size = end_index - index;
+ get_readahead_bounds(ra, &ra_min, &ra_max);
+
+ /* readahead disabled? */
+ if (unlikely(!ra_max || !readahead_ratio)) {
+ size = max_sane_readahead(size);
+ goto readit;
+ }
+
+ /*
+ * Start of file.
+ */
+ if (index == 0)
+ return newfile_readahead(mapping, filp, ra, end_index, ra_min);
+
+ /*
+ * State based sequential read-ahead.
+ */
+ if (!disable_stateful_method &&
+ index == ra->lookahead_index && ra_cache_hit_ok(ra))
+ return state_based_readahead(mapping, filp, ra, page,
+ index, size, ra_max);
+
+ /*
+ * Recover from possible thrashing.
+ */
+ if (!page && index == ra->prev_page + 1 && ra_has_index(ra, index))
+ return thrashing_recovery_readahead(mapping, filp, ra,
+ index, ra_max);
+
+ /*
+ * Backward read-ahead.
+ */
+ if (!page && begin_index == index &&
+ try_read_backward(ra, index, size, ra_max))
+ return ra_dispatch(ra, mapping, filp);
+
+ /*
+ * Context based sequential read-ahead.
+ */
+ ret = try_context_based_readahead(mapping, ra, prev_page, page,
+ index, ra_min, ra_max);
+ if (ret > 0)
+ return ra_dispatch(ra, mapping, filp);
+ if (ret < 0)
+ return 0;
+
+ /* No action on look ahead time? */
+ if (page) {
+ ra_account(ra, RA_EVENT_LOOKAHEAD_NOACTION,
+ ra->readahead_index - index);
+ return 0;
+ }
+
+ /*
+ * Random read that follows a sequential one.
+ */
+ if (try_readahead_on_seek(ra, index, size, ra_max))
+ return ra_dispatch(ra, mapping, filp);
+
+ /*
+ * Random read.
+ */
+ if (size > ra_max)
+ size = ra_max;
+
+readit:
+ size = __do_page_cache_readahead(mapping, filp, index, size, 0);
+
+ ra_account(ra, RA_EVENT_READRANDOM, size);
+ dprintk("readrandom(ino=%lu, pages=%lu, index=%lu-%lu-%lu) = %lu\n",
+ mapping->host->i_ino, mapping->nrpages,
+ begin_index, index, end_index, size);
+
+ return size;
+}
+
+/**
+ * readahead_cache_hit - adaptive read-ahead feedback function
+ * @ra: file_ra_state which holds the readahead state
+ * @page: the page just accessed
+ *
+ * readahead_cache_hit() is the feedback route of the adaptive read-ahead
+ * logic. It must be called on every access on the read-ahead pages.
+ */
+void fastcall readahead_cache_hit(struct file_ra_state *ra, struct page *page)
+{
+ if (PageActive(page) || PageReferenced(page))
+ return;
+
+ if (!PageUptodate(page))
+ ra_account(ra, RA_EVENT_IO_BLOCK, 1);
+
+ if (!ra_has_index(ra, page->index))
+ return;
+
+ ra->cache_hits++;
+
+ if (page->index >= ra->ra_index)
+ ra_account(ra, RA_EVENT_READAHEAD_HIT, 1);
+ else
+ ra_account(ra, RA_EVENT_READAHEAD_HIT, -1);
+}
+
#endif /* CONFIG_ADAPTIVE_READAHEAD */
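To make the calling convention documented above concrete, a caller in the
read path would behave roughly as follows (a simplified sketch based on the
kernel-doc, not the exact code of this patch; the surrounding variable
names are assumptions):

	struct page *page;

	/* call on a cache miss, or when a look-ahead mark is hit */
	page = find_get_page(mapping, index);
	if (!page || PageReadahead(page))
		page_cache_readahead_adaptive(mapping, &filp->f_ra, filp,
				prev_page, page,
				begin_index, index, end_index);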
--
^ permalink raw reply [flat|nested] 37+ messages in thread
* [PATCH 18/23] readahead: laptop mode
2006-03-19 2:34 [PATCH 00/23] Adaptive read-ahead V11 Wu Fengguang
` (16 preceding siblings ...)
2006-03-19 2:34 ` [PATCH 17/23] readahead: call scheme Wu Fengguang
@ 2006-03-19 2:34 ` Wu Fengguang
2006-03-19 2:34 ` [PATCH 19/23] readahead: loop case Wu Fengguang
` (5 subsequent siblings)
23 siblings, 0 replies; 37+ messages in thread
From: Wu Fengguang @ 2006-03-19 2:34 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-kernel, Bart Samwel
[-- Attachment #1: readahead-laptop-mode.patch --]
[-- Type: text/plain, Size: 3063 bytes --]
When the laptop drive is spun down, defer look-ahead to spin-up time.
The implementation employs a poll based method, since performance is not a
concern in this code path. The poll interval is 64KB, which should be small
enough for movies/music. The user space application is responsible for
proper caching to hide the spin-up-and-read delay.
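The 64KB interval presumably maps to the LAPTOP_POLL_INTERVAL constant used
in the mm/readahead.c hunk below. Its definition is not part of this patch
as quoted; expressed in pages it would be something like (an assumption):

	/* assumed definition: one poll step covers 64KB worth of pages */
	#define LAPTOP_POLL_INTERVAL (64 * 1024 / PAGE_CACHE_SIZE)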
------------------------------------------------------------------------
For crazy laptop users who prefer aggressive read-ahead, here is the way:
# echo 1000 > /proc/sys/vm/readahead_ratio
# blockdev --setra 524280 /dev/hda # this is the max possible value
Notes:
- It is still an untested feature.
- It is safer to use blockdev+fadvise to increase ra-max for a single file,
which requires patching your movie player (see the sketch below).
- Be sure to restore them to sane values for normal operation!
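The fadvise half of that recipe would look roughly like this inside the
player (a minimal userspace sketch; the file name is made up and error
handling is elided):

	#define _XOPEN_SOURCE 600
	#include <fcntl.h>

	int main(void)
	{
		int fd = open("/video/movie.avi", O_RDONLY);

		if (fd >= 0)
			/* hint sequential access; the kernel may enlarge
			 * the readahead window for this file descriptor */
			posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);
		return 0;
	}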
Signed-off-by: Wu Fengguang <wfg@mail.ustc.edu.cn>
---
include/linux/writeback.h | 6 ++++++
mm/page-writeback.c | 2 +-
mm/readahead.c | 30 ++++++++++++++++++++++++++++++
3 files changed, 37 insertions(+), 1 deletion(-)
--- linux-2.6.16-rc6-mm2.orig/include/linux/writeback.h
+++ linux-2.6.16-rc6-mm2/include/linux/writeback.h
@@ -85,6 +85,12 @@ void laptop_io_completion(void);
void laptop_sync_completion(void);
void throttle_vm_writeout(void);
+extern struct timer_list laptop_mode_wb_timer;
+static inline int laptop_spinned_down(void)
+{
+ return !timer_pending(&laptop_mode_wb_timer);
+}
+
/* These are exported to sysctl. */
extern int dirty_background_ratio;
extern int vm_dirty_ratio;
--- linux-2.6.16-rc6-mm2.orig/mm/page-writeback.c
+++ linux-2.6.16-rc6-mm2/mm/page-writeback.c
@@ -377,7 +377,7 @@ static void wb_timer_fn(unsigned long un
static void laptop_timer_fn(unsigned long unused);
static DEFINE_TIMER(wb_timer, wb_timer_fn, 0, 0);
-static DEFINE_TIMER(laptop_mode_wb_timer, laptop_timer_fn, 0, 0);
+DEFINE_TIMER(laptop_mode_wb_timer, laptop_timer_fn, 0, 0);
/*
* Periodic writeback of "old" data.
--- linux-2.6.16-rc6-mm2.orig/mm/readahead.c
+++ linux-2.6.16-rc6-mm2/mm/readahead.c
@@ -1017,6 +1017,31 @@ out:
}
/*
+ * Set a new look-ahead mark at @new_index.
+ * Return 0 if the new mark is successfully set.
+ */
+static inline int renew_lookahead(struct address_space *mapping,
+ struct file_ra_state *ra,
+ pgoff_t index, pgoff_t new_index)
+{
+ struct page *page;
+
+ if (index == ra->lookahead_index &&
+ new_index >= ra->readahead_index)
+ return 1;
+
+ page = find_page(mapping, new_index);
+ if (!page)
+ return 1;
+
+ __SetPageReadahead(page);
+ if (ra->lookahead_index == index)
+ ra->lookahead_index = new_index;
+
+ return 0;
+}
+
+/*
* State based calculation of read-ahead request.
*
* This figure shows the meaning of file_ra_state members:
@@ -1901,6 +1926,11 @@ page_cache_readahead_adaptive(struct add
end_index - index);
return 0;
}
+ if (laptop_mode && laptop_spinned_down()) {
+ if (!renew_lookahead(mapping, ra, index,
+ index + LAPTOP_POLL_INTERVAL))
+ return 0;
+ }
}
if (page)
--
^ permalink raw reply [flat|nested] 37+ messages in thread
* [PATCH 19/23] readahead: loop case
2006-03-19 2:34 [PATCH 00/23] Adaptive read-ahead V11 Wu Fengguang
` (17 preceding siblings ...)
2006-03-19 2:34 ` [PATCH 18/23] readahead: laptop mode Wu Fengguang
@ 2006-03-19 2:34 ` Wu Fengguang
2006-03-19 2:34 ` [PATCH 20/23] readahead: nfsd case Wu Fengguang
` (4 subsequent siblings)
23 siblings, 0 replies; 37+ messages in thread
From: Wu Fengguang @ 2006-03-19 2:34 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-kernel, Wu Fengguang
[-- Attachment #1: readahead-loop-case.patch --]
[-- Type: text/plain, Size: 894 bytes --]
Disable look-ahead for loopback files.
Loopback files normally contain filesystems, in which case the upper layer
already does proper look-ahead; more look-ahead on the loopback file itself
only ruins the read-ahead hit rate.
Signed-off-by: Wu Fengguang <wfg@mail.ustc.edu.cn>
---
I'd like to thank Tero Grundström for uncovering the loopback problem.
drivers/block/loop.c | 6 ++++++
1 files changed, 6 insertions(+)
--- linux-2.6.16-rc6-mm2.orig/drivers/block/loop.c
+++ linux-2.6.16-rc6-mm2/drivers/block/loop.c
@@ -779,6 +779,12 @@ static int loop_set_fd(struct loop_devic
mapping = file->f_mapping;
inode = mapping->host;
+ /*
+ * The upper layer should already do proper look-ahead,
+ * one more look-ahead here only ruins the cache hit rate.
+ */
+ file->f_ra.flags |= RA_FLAG_NO_LOOKAHEAD;
+
if (!(file->f_mode & FMODE_WRITE))
lo_flags |= LO_FLAGS_READ_ONLY;
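Presumably the read-ahead code elsewhere in this patchset honours the flag
along these lines (a hypothetical sketch, not a hunk from this patch; only
the flag name is taken from the code above):

	/* hypothetical consumer: suppress look-ahead for this file */
	if (ra->flags & RA_FLAG_NO_LOOKAHEAD)
		la_size = 0;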
--
^ permalink raw reply [flat|nested] 37+ messages in thread
* [PATCH 20/23] readahead: nfsd case
2006-03-19 2:34 [PATCH 00/23] Adaptive read-ahead V11 Wu Fengguang
` (18 preceding siblings ...)
2006-03-19 2:34 ` [PATCH 19/23] readahead: loop case Wu Fengguang
@ 2006-03-19 2:34 ` Wu Fengguang
2006-03-19 2:34 ` [PATCH 21/23] readahead: debug radix tree new functions Wu Fengguang
` (3 subsequent siblings)
23 siblings, 0 replies; 37+ messages in thread
From: Wu Fengguang @ 2006-03-19 2:34 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-kernel, Neil Brown
[-- Attachment #1: readahead-nfsd-case.patch --]
[-- Type: text/plain, Size: 5353 bytes --]
Adaptive read-ahead for nfsd:
1) disable nfsd raparms: the new logic does not rely on it
2) disable look-ahead at the start of file: leave it to the client
For the case of NFS service, the new read-ahead logic
+ can handle out-of-order nfsd requests
+ can handle concurrent sequential requests on large files
with the help of look-ahead
- still has trouble with concurrent requests on small files
--------------------------------
Notes about the concurrent nfsd requests issue:
nfsd read requests can be out of order, concurrent and with no ra-state info.
They are handled by the context based read-ahead method, which does the job
in the following steps:
1. scan in page cache
2. make read-ahead decisions
3. alloc new pages
4. insert new pages to page cache
A single read-ahead chunk on the client side will be disassembled and serviced
by many concurrent nfsd instances on the server side. It is highly likely that
two or more of these parallel nfsd instances are in step 1/2/3 at the same time.
Without knowing that others are working on the same file region, they will issue
overlapping read-ahead requests, which leads to many conflicts at step 4, as
illustrated below.
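For instance, with two nfsd threads working on the same region:

	/*
	 * Illustrative timeline of the step 4 conflicts (not code from
	 * this patch):
	 *
	 *   nfsd A: scan cache .. decide .. alloc pages .. insert: ok
	 *   nfsd B:  scan cache .. decide .. alloc pages .. insert: fails,
	 *            the pages were just inserted by A and must be freed
	 */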
There is little hope of eliminating the concurrency problem in a general and
efficient way. But experiments show that mounting with tcp,rsize=32768 can
cut down the overhead a lot.
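For reference, such a mount looks like this on the client (an example
invocation; substitute your own server, export and mount point):

	# mount -t nfs -o tcp,rsize=32768 server:/export /mnt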
--------------------------------
Here are the benchmark outputs. The test cases cover
- small/big files
- small/big rsize mount option
- serialized/parallel nfsd processing
`serialized' means running the following command to enforce serialized
nfsd request processing:
# for pid in `pidof nfsd`; do taskset -p 1 $pid; done
8 nfsd; local mount with tcp,rsize=8192
=======================================
SERIALIZED, SMALL FILES
readahead_ratio = 0, ra_max = 128kb (old logic, the ra_max is not quite relevant)
96.51s real 11.32s system 3.27s user 160334+2829 cs diff -r $NFSDIR $NFSDIR2
readahead_ratio = 70, ra_max = 1024kb (new read-ahead logic)
94.88s real 11.53s system 3.20s user 152415+3777 cs diff -r $NFSDIR $NFSDIR2
SERIALIZED, BIG FILES
readahead_ratio = 0, ra_max = 128kb
56.52s real 3.38s system 1.23s user 47930+5256 cs diff $NFSFILE $NFSFILE2
readahead_ratio = 70, ra_max = 1024kb
32.54s real 5.71s system 1.38s user 23851+17007 cs diff $NFSFILE $NFSFILE2
PARALLEL, SMALL FILES
readahead_ratio = 0, ra_max = 128kb
99.87s real 11.41s system 3.15s user 173945+9163 cs diff -r $NFSDIR $NFSDIR2
readahead_ratio = 70, ra_max = 1024kb
100.14s real 12.06s system 3.16s user 170865+13406 cs diff -r $NFSDIR $NFSDIR2
PARALLEL, BIG FILES
readahead_ratio = 0, ra_max = 128kb
63.35s real 5.68s system 1.57s user 82594+48747 cs diff $NFSFILE $NFSFILE2
readahead_ratio = 70, ra_max = 1024kb
33.87s real 10.17s system 1.55s user 72291+100079 cs diff $NFSFILE $NFSFILE2
8 nfsd; local mount with tcp,rsize=32768
========================================
Note that the normal (parallel) numbers are now much better, and come close
to those of the serialized runs.
PARALLEL/NORMAL
readahead_ratio = 8, ra_max = 1024kb (old logic)
48.36s real 2.22s system 1.51s user 7209+4110 cs diff $NFSFILE $NFSFILE2
readahead_ratio = 70, ra_max = 1024kb (new logic)
30.04s real 2.46s system 1.33s user 5420+2492 cs diff $NFSFILE $NFSFILE2
readahead_ratio = 8, ra_max = 1024kb
92.99s real 10.32s system 3.23s user 145004+1826 cs diff -r $NFSDIR $NFSDIR2 > /dev/null
readahead_ratio = 70, ra_max = 1024kb
90.96s real 10.68s system 3.22s user 144414+2520 cs diff -r $NFSDIR $NFSDIR2 > /dev/null
SERIALIZED
readahead_ratio = 8, ra_max = 1024kb
47.58s real 2.10s system 1.27s user 7933+1357 cs diff $NFSFILE $NFSFILE2
readahead_ratio = 70, ra_max = 1024kb
29.46s real 2.41s system 1.38s user 5590+2613 cs diff $NFSFILE $NFSFILE2
readahead_ratio = 8, ra_max = 1024kb
93.02s real 10.67s system 3.25s user 144850+2286 cs diff -r $NFSDIR $NFSDIR2 > /dev/null
readahead_ratio = 70, ra_max = 1024kb
91.15s real 11.04s system 3.31s user 144432+2814 cs diff -r $NFSDIR $NFSDIR2 > /dev/null
Signed-off-by: Wu Fengguang <wfg@mail.ustc.edu.cn>
---
Greg Banks gave a valuable recommendation on the test cases, which helped me
get a more complete picture. Thanks!
fs/nfsd/vfs.c | 6 +++++-
mm/readahead.c | 9 +++++++--
2 files changed, 12 insertions(+), 3 deletions(-)
--- linux-2.6.16-rc6-mm2.orig/fs/nfsd/vfs.c
+++ linux-2.6.16-rc6-mm2/fs/nfsd/vfs.c
@@ -833,10 +833,14 @@ nfsd_vfs_read(struct svc_rqst *rqstp, st
#endif
/* Get readahead parameters */
- ra = nfsd_get_raparms(inode->i_sb->s_dev, inode->i_ino);
+ if (prefer_adaptive_readahead())
+ ra = NULL;
+ else
+ ra = nfsd_get_raparms(inode->i_sb->s_dev, inode->i_ino);
if (ra && ra->p_set)
file->f_ra = ra->p_ra;
+ file->f_ra.flags |= RA_FLAG_NFSD;
if (file->f_op->sendfile) {
svc_pushback_unused_pages(rqstp);
--- linux-2.6.16-rc6-mm2.orig/mm/readahead.c
+++ linux-2.6.16-rc6-mm2/mm/readahead.c
@@ -1736,8 +1736,13 @@ newfile_readahead(struct address_space *
if (req_size > ra_min) /* larger value risks thrashing */
req_size = ra_min;
- ra_size = 4 * req_size;
- la_size = 2 * req_size;
+ if (unlikely(ra->flags & RA_FLAG_NFSD)) {
+ ra_size = MIN_NFSD_PAGES;
+ la_size = 0;
+ } else {
+ ra_size = 4 * req_size;
+ la_size = 2 * req_size;
+ }
ra_set_class(ra, RA_CLASS_NEWFILE);
ra_set_index(ra, 0, 0);
--
^ permalink raw reply [flat|nested] 37+ messages in thread
* [PATCH 21/23] readahead: debug radix tree new functions
2006-03-19 2:34 [PATCH 00/23] Adaptive read-ahead V11 Wu Fengguang
` (19 preceding siblings ...)
2006-03-19 2:34 ` [PATCH 20/23] readahead: nfsd case Wu Fengguang
@ 2006-03-19 2:34 ` Wu Fengguang
2006-03-19 2:34 ` [PATCH 22/23] readahead: debug traces showing accessed file names Wu Fengguang
` (2 subsequent siblings)
23 siblings, 0 replies; 37+ messages in thread
From: Wu Fengguang @ 2006-03-19 2:34 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-kernel, Wu Fengguang
[-- Attachment #1: readahead-debug-radix-tree.patch --]
[-- Type: text/plain, Size: 2114 bytes --]
Do some sanity checking on the newly added radix tree code.
It is meant to be removed after a while.
Signed-off-by: Wu Fengguang <wfg@mail.ustc.edu.cn>
---
mm/readahead.c | 24 ++++++++++++++++++++++++
1 files changed, 24 insertions(+)
--- linux-2.6.16-rc6-mm2.orig/mm/readahead.c
+++ linux-2.6.16-rc6-mm2/mm/readahead.c
@@ -80,6 +80,8 @@ enum ra_class {
#include <linux/debugfs.h>
#include <linux/seq_file.h>
+#define DEBUG_READAHEAD_RADIXTREE
+
/* Read-ahead events to be accounted. */
enum ra_event {
RA_EVENT_CACHE_MISS, /* read cache misses */
@@ -1515,6 +1517,13 @@ static unsigned long query_page_cache_se
read_lock_irq(&mapping->tree_lock);
index = radix_tree_scan_hole_backward(&mapping->page_tree,
offset, ra_max);
+#ifdef DEBUG_READAHEAD_RADIXTREE
+ WARN_ON(index > offset);
+ if (index != offset)
+ WARN_ON(!__find_page(mapping, index + 1));
+ if (index && offset - index < ra_max)
+ WARN_ON(__find_page(mapping, index));
+#endif
read_unlock_irq(&mapping->tree_lock);
*remain = offset - index;
@@ -1550,6 +1559,11 @@ static unsigned long query_page_cache_se
struct radix_tree_node *node;
node = radix_tree_cache_lookup_node(&mapping->page_tree,
&cache, offset - count, 1);
+#ifdef DEBUG_READAHEAD_RADIXTREE
+ if (node != radix_tree_lookup_node(&mapping->page_tree,
+ offset - count, 1))
+ BUG();
+#endif
if (!node)
break;
}
@@ -1614,6 +1628,16 @@ static inline pgoff_t find_segtail(struc
cond_resched();
read_lock_irq(&mapping->tree_lock);
ra_index = radix_tree_scan_hole(&mapping->page_tree, index, max_scan);
+#ifdef DEBUG_READAHEAD_RADIXTREE
+ BUG_ON(!__find_page(mapping, index));
+ WARN_ON(ra_index < index);
+ if (ra_index != index && !__find_page(mapping, ra_index - 1))
+ printk(KERN_ERR "radix_tree_scan_hole(index=%lu ra_index=%lu "
+ "max_scan=%lu nrpages=%lu) fooled!\n",
+ index, ra_index, max_scan, mapping->nrpages);
+ if (ra_index != ~0UL && ra_index - index < max_scan)
+ WARN_ON(__find_page(mapping, ra_index));
+#endif
read_unlock_irq(&mapping->tree_lock);
if (ra_index <= index + max_scan)
--
^ permalink raw reply [flat|nested] 37+ messages in thread
* [PATCH 22/23] readahead: debug traces showing accessed file names
2006-03-19 2:34 [PATCH 00/23] Adaptive read-ahead V11 Wu Fengguang
` (20 preceding siblings ...)
2006-03-19 2:34 ` [PATCH 21/23] readahead: debug radix tree new functions Wu Fengguang
@ 2006-03-19 2:34 ` Wu Fengguang
2006-03-19 2:34 ` [PATCH 23/23] readahead: debug traces showing read patterns Wu Fengguang
2006-03-19 3:10 ` [PATCH 00/23] Adaptive read-ahead V11 Jon Smirl
23 siblings, 0 replies; 37+ messages in thread
From: Wu Fengguang @ 2006-03-19 2:34 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-kernel, Wu Fengguang
[-- Attachment #1: readahead-debug-traces-file-list.patch --]
[-- Type: text/plain, Size: 1011 bytes --]
Print file names on their first read-ahead, for tracing file access patterns.
Signed-off-by: Wu Fengguang <wfg@mail.ustc.edu.cn>
---
mm/readahead.c | 14 ++++++++++++++
1 files changed, 14 insertions(+)
--- linux-2.6.16-rc6-mm2.orig/mm/readahead.c
+++ linux-2.6.16-rc6-mm2/mm/readahead.c
@@ -1229,6 +1229,20 @@ static int ra_dispatch(struct file_ra_st
ra_account(ra, RA_EVENT_IO_CACHE_HIT, ra_size - actual);
ra_account(ra, RA_EVENT_READAHEAD, actual);
+ if (!ra->ra_index && filp->f_dentry->d_inode) {
+ char *fn;
+ static char path[1024];
+ unsigned long size;
+
+ size = (i_size_read(filp->f_dentry->d_inode)+1023)/1024;
+ fn = d_path(filp->f_dentry, filp->f_vfsmnt, path, 1000);
+ if (!IS_ERR(fn))
+ ddprintk("ino %lu is %s size %luK by %s(%d)\n",
+ filp->f_dentry->d_inode->i_ino,
+ fn, size,
+ current->comm, current->pid);
+ }
+
dprintk("readahead-%s(ino=%lu, index=%lu, ra=%lu+%lu-%lu) = %d\n",
ra_class_name[ra_class],
mapping->host->i_ino, ra->la_index,
--
^ permalink raw reply [flat|nested] 37+ messages in thread
* [PATCH 23/23] readahead: debug traces showing read patterns
2006-03-19 2:34 [PATCH 00/23] Adaptive read-ahead V11 Wu Fengguang
` (21 preceding siblings ...)
2006-03-19 2:34 ` [PATCH 22/23] readahead: debug traces showing accessed file names Wu Fengguang
@ 2006-03-19 2:34 ` Wu Fengguang
2006-03-19 3:10 ` [PATCH 00/23] Adaptive read-ahead V11 Jon Smirl
23 siblings, 0 replies; 37+ messages in thread
From: Wu Fengguang @ 2006-03-19 2:34 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-kernel, Wu Fengguang
[-- Attachment #1: readahead-debug-traces-access-pattern.patch --]
[-- Type: text/plain, Size: 2681 bytes --]
Print all relevant read requests to help discover the access pattern.
If you are experiencing performance problems, or want to help improve
the read-ahead logic, please send me the trace data. Thanks.
- Preparations
# Compile kernel with option CONFIG_DEBUG_READAHEAD
mkdir /debug
mount -t debugfs none /debug
- For each session with distinct access pattern
echo > /debug/readahead # reset the counters
# echo > /var/log/kern.log # you may want to back it up first
echo 8 > /debug/readahead/debug_level # show verbose printk traces
# do one benchmark/task
echo 0 > /debug/readahead/debug_level # revert to normal value
cp /debug/readahead/events readahead-events-`date +'%F_%R'`
bzip2 -c /var/log/kern.log > kern.log-`date +'%F_%R'`.bz2
Signed-off-by: Wu Fengguang <wfg@mail.ustc.edu.cn>
---
mm/filemap.c | 23 ++++++++++++++++++++++-
1 files changed, 22 insertions(+), 1 deletion(-)
--- linux-2.6.16-rc6-mm2.orig/mm/filemap.c
+++ linux-2.6.16-rc6-mm2/mm/filemap.c
@@ -45,6 +45,12 @@ static ssize_t
generic_file_direct_IO(int rw, struct kiocb *iocb, const struct iovec *iov,
loff_t offset, unsigned long nr_segs);
+#ifdef CONFIG_DEBUG_READAHEAD
+extern u32 readahead_debug_level;
+#else
+#define readahead_debug_level 0
+#endif /* CONFIG_DEBUG_READAHEAD */
+
/*
* Shared mappings implemented 30.11.1994. It's not fully working yet,
* though.
@@ -815,6 +821,10 @@ void do_generic_mapping_read(struct addr
if (!isize)
goto out;
+ if (readahead_debug_level >= 5)
+ printk(KERN_DEBUG "read-file(ino=%lu, req=%lu+%lu)\n",
+ inode->i_ino, index, last_index - index);
+
end_index = (isize - 1) >> PAGE_CACHE_SHIFT;
for (;;) {
struct page *page;
@@ -867,6 +877,11 @@ find_page:
prev_page = page;
readahead_cache_hit(&ra, page);
+ if (readahead_debug_level >= 7)
+ printk(KERN_DEBUG "read-page(ino=%lu, idx=%lu, io=%s)\n",
+ inode->i_ino, index,
+ PageUptodate(page) ? "hit" : "miss");
+
if (!PageUptodate(page))
goto page_not_up_to_date;
page_ok:
@@ -1317,7 +1332,6 @@ retry_all:
if (!prefer_adaptive_readahead() && VM_SequentialReadHint(area))
page_cache_readahead(mapping, ra, file, pgoff, 1);
-
/*
* Do we have something in the page cache already?
*/
@@ -1378,6 +1392,13 @@ retry_find:
ra->mmap_hit++;
readahead_cache_hit(ra, page);
+ if (readahead_debug_level >= 6)
+ printk(KERN_DEBUG "read-mmap(ino=%lu, idx=%lu, hint=%s, io=%s)\n",
+ inode->i_ino, pgoff,
+ VM_RandomReadHint(area) ? "random" :
+ (VM_SequentialReadHint(area) ? "sequential" : "none"),
+ PageUptodate(page) ? "hit" : "miss");
+
/*
* Ok, found a page in the page cache, now we need to check
* that it's up-to-date.
--
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH 00/23] Adaptive read-ahead V11
2006-03-19 2:34 [PATCH 00/23] Adaptive read-ahead V11 Wu Fengguang
` (22 preceding siblings ...)
2006-03-19 2:34 ` [PATCH 23/23] readahead: debug traces showing read patterns Wu Fengguang
@ 2006-03-19 3:10 ` Jon Smirl
2006-03-19 3:47 ` Wu Fengguang
2006-03-27 21:38 ` Matt Heler
23 siblings, 2 replies; 37+ messages in thread
From: Jon Smirl @ 2006-03-19 3:10 UTC (permalink / raw)
To: Wu Fengguang; +Cc: linux-kernel
This is probably a readahead problem. The lighttpd people that are
encountering this problem are not regular lkml readers.
http://bugzilla.kernel.org/show_bug.cgi?id=5949
--
Jon Smirl
jonsmirl@gmail.com
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH 00/23] Adaptive read-ahead V11
2006-03-19 3:10 ` [PATCH 00/23] Adaptive read-ahead V11 Jon Smirl
@ 2006-03-19 3:47 ` Wu Fengguang
2006-03-19 4:10 ` Jon Smirl
2006-03-27 21:38 ` Matt Heler
1 sibling, 1 reply; 37+ messages in thread
From: Wu Fengguang @ 2006-03-19 3:47 UTC (permalink / raw)
To: Jon Smirl; +Cc: linux-kernel
On Sat, Mar 18, 2006 at 10:10:43PM -0500, Jon Smirl wrote:
> This is probably a readahead problem. The lighttpd people that are
> encountering this problem are not regular lkml readers.
>
> http://bugzilla.kernel.org/show_bug.cgi?id=5949
[QUOTE]
My general conclusion is that since they were able to write a user
space implementation that avoids the problem, something must be broken
in the kernel readahead logic for sendfile().
Maybe the user space solution does the trick by using a larger window size?
IMHO, the stock read-ahead is not designed with extremely high concurrency in
mind. However, 100 streams should not be a problem at all.
Wu
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH 00/23] Adaptive read-ahead V11
2006-03-19 3:47 ` Wu Fengguang
@ 2006-03-19 4:10 ` Jon Smirl
2006-03-19 5:09 ` Wu Fengguang
0 siblings, 1 reply; 37+ messages in thread
From: Jon Smirl @ 2006-03-19 4:10 UTC (permalink / raw)
To: Wu Fengguang, linux-kernel
On 3/18/06, Wu Fengguang <wfg@mail.ustc.edu.cn> wrote:
> On Sat, Mar 18, 2006 at 10:10:43PM -0500, Jon Smirl wrote:
> > This is probably a readahead problem. The lighttpd people that are
> > encountering this problem are not regular lkml readers.
> >
> > http://bugzilla.kernel.org/show_bug.cgi?id=5949
>
> [QUOTE]
> My general conclusion is that since they were able to write a user
> space implementation that avoids the problem something must be broken
> in the kernel readahead logic for sendfile().
>
> Maybe the user space solution does the trick by using a larger window size?
>
> IMHO, the stock read-ahead is not designed with extremely high concurrency in
> mind. However, 100 streams should not be a problem at all.
Has anyone checked to see if the readahead logic is working as
expected from sendfile? IO from sendfile is issued from a different type of
context than IO from user space; there could be sendfile-specific
problems. If window size is the trick, shouldn't sendfile
automatically adapt its window size? I don't think you can control
the sendfile window size from user space.
The goal of sendfile is to be the most optimal way to send a file over
the network. This is a case where user space code is easily beating
it.
--
Jon Smirl
jonsmirl@gmail.com
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH 00/23] Adaptive read-ahead V11
2006-03-19 4:10 ` Jon Smirl
@ 2006-03-19 5:09 ` Wu Fengguang
2006-03-19 15:53 ` Jon Smirl
0 siblings, 1 reply; 37+ messages in thread
From: Wu Fengguang @ 2006-03-19 5:09 UTC (permalink / raw)
To: Jon Smirl; +Cc: linux-kernel
On Sat, Mar 18, 2006 at 11:10:33PM -0500, Jon Smirl wrote:
> > Maybe the user space solution does the trick by using a larger window size?
> >
> > IMHO, the stock read-ahead is not designed with extremely high concurrency in
> > mind. However, 100 streams should not be a problem at all.
>
> Has anyone checked to see if the readahead logic is working as
> expected from sendfile? IO from sendfile is issued from a different type of
> context than IO from user space; there could be sendfile-specific
AFAIK, sendfile() and read() use the same readahead logic, which
handles them equally well. And there is another readaround logic
which handles unhinted mmapped reads.
> problems. If window size is the trick, shouldn't sendfile
> automatically adapt its window size? I don't think you can control
> the sendfile window size from user space.
For whole-file reads, the stock readahead logic by default uses a fixed
window size of VM_MAX_READAHEAD=128KB; the adaptive readahead logic
uses an adaptive window size with a high limit of VM_MAX_READAHEAD=1024KB.
The VM_MAX_READAHEAD in the kernel is used to init the .ra_pages
attribute of block devices, which can later be altered at _runtime_.
To set a 512KB window size limit for hda, one can do it in two ways
(blockdev --setra takes 512-byte sectors; the sysfs file takes kilobytes):
1) blockdev --setra 1024 /dev/hda
2) echo 512 > /sys/block/hda/queue/read_ahead_kb
Cheers,
Wu
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH 00/23] Adaptive read-ahead V11
2006-03-19 5:09 ` Wu Fengguang
@ 2006-03-19 15:53 ` Jon Smirl
2006-03-20 13:54 ` Wu Fengguang
0 siblings, 1 reply; 37+ messages in thread
From: Jon Smirl @ 2006-03-19 15:53 UTC (permalink / raw)
To: Wu Fengguang, linux-kernel
On 3/19/06, Wu Fengguang <wfg@mail.ustc.edu.cn> wrote:
> On Sat, Mar 18, 2006 at 11:10:33PM -0500, Jon Smirl wrote:
> > > Maybe the user space solution does the trick by using a larger window size?
> > >
> > > IMHO, the stock read-ahead is not designed with extremely high concurrency in
> > > mind. However, 100 streams should not be a problem at all.
> >
> > Has anyone checked to see if the readahead logic is working as
> > expected from sendfile? IO from sendfile is a different type of
> > context than IO from user space, there could be sendfile specific
>
> AFAIK, sendfile() and read() use the same readahead logic, which
> handles them equally good. And there is another readaround logic
> which handles unhinted mmapped reads.
In another thread someone mentioned that this problem may have
something to do with the pools of memory being used for sendfile. The
readahead from sendfile goes into a moderately sized pool. When
you get 100 of them going at once, the other threads flush the
readahead data out of the pool before it can be used and thus trigger
the thrashing seek storm. Is it true that sendfile data is read
ahead into a fixed-size pool? If so, the readahead algorithms would
need to reduce the sendfile window sizes to stop the pool from
thrashing.
--
Jon Smirl
jonsmirl@gmail.com
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH 00/23] Adaptive read-ahead V11
2006-03-19 15:53 ` Jon Smirl
@ 2006-03-20 13:54 ` Wu Fengguang
0 siblings, 0 replies; 37+ messages in thread
From: Wu Fengguang @ 2006-03-20 13:54 UTC (permalink / raw)
To: Jon Smirl; +Cc: linux-kernel
On Sun, Mar 19, 2006 at 10:53:46AM -0500, Jon Smirl wrote:
> In another thread someone mentioned that this problem may have
> something to do with the pools of memory being used for sendfile. The
> readahead from sendfile goes into a moderately sized pool. When
> you get 100 of them going at once, the other threads flush the
> readahead data out of the pool before it can be used and thus trigger
> the thrashing seek storm. Is it true that sendfile data is read
> ahead into a fixed-size pool? If so, the readahead algorithms would
The pages are kept in a cache pool which is made up of all the free memory.
E.g. the following command shows a system with a 331MB cache pool:
% free -m
             total       used       free     shared    buffers     cached
Mem:           488        482          5          0          7        331
-/+ buffers/cache:        142        345
Swap:          127          0        127
That would be more than enough for the stock read-ahead to handle 100
concurrent readers.
> need to reduce the sendfile window sizes to stop the pool from
> thrashing.
Sure, it is the desired behavior. This patch provides exactly this feature :)
Cheers,
Wu
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH 00/23] Adaptive read-ahead V11
2006-03-19 3:10 ` [PATCH 00/23] Adaptive read-ahead V11 Jon Smirl
2006-03-19 3:47 ` Wu Fengguang
@ 2006-03-27 21:38 ` Matt Heler
2006-03-28 3:44 ` Wu Fengguang
1 sibling, 1 reply; 37+ messages in thread
From: Matt Heler @ 2006-03-27 21:38 UTC (permalink / raw)
To: Jon Smirl; +Cc: Wu Fengguang, linux-kernel
We use lighttpd on our servers, and I can say with 100% certainty that this
problem happens a lot. Because of this, we were forced to use the userspace
mechanism that the lighttpd author made to circumvent this issue. However,
with this patch, I'm unable to reproduce any of the problems we had
experienced before. IO-Wait has dropped significantly, from 80% to 20%.
I'd be happy to send over some benchmarks if need be.
Matt Heler
On Saturday 18 March 2006 10:10 pm, Jon Smirl wrote:
> This is probably a readahead problem. The lighttpd people that are
> encountering this problem are not regular lkml readers.
>
> http://bugzilla.kernel.org/show_bug.cgi?id=5949
>
> --
> Jon Smirl
> jonsmirl@gmail.com
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH 00/23] Adaptive read-ahead V11
2006-03-27 21:38 ` Matt Heler
@ 2006-03-28 3:44 ` Wu Fengguang
0 siblings, 0 replies; 37+ messages in thread
From: Wu Fengguang @ 2006-03-28 3:44 UTC (permalink / raw)
To: Matt Heler; +Cc: Jon Smirl, linux-kernel
On Mon, Mar 27, 2006 at 04:38:33PM -0500, Matt Heler wrote:
> We use lighttpd on our servers, and I can say with 100% certainty that this
> problem happens a lot. Because of this, we were forced to use the userspace
> mechanism that the lighttpd author made to circumvent this issue. However,
> with this patch, I'm unable to reproduce any of the problems we had
> experienced before. IO-Wait has dropped significantly, from 80% to 20%.
> I'd be happy to send over some benchmarks if need be.
Thanks, your production service would be the best benchmark ;)
Would you send some performance numbers and the basic server configuration?
Wu
^ permalink raw reply [flat|nested] 37+ messages in thread