[PATCH 00/28] Adaptive readahead V16

public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed

* [PATCH 00/28] Adaptive readahead V16
       [not found] <20061115075007.832957580@localhost.localdomain>
@ 2006-11-15  7:50 ` Wu Fengguang
       [not found] ` <20061115075024.180138257@localhost.localdomain>
                   ` (20 subsequent siblings)
  21 siblings, 0 replies; 24+ messages in thread
From: Wu Fengguang @ 2006-11-15  7:50 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel

Hi Andrew,

This readahead update focuses on complexity/overhead problems.
It introduces major changes to initial/context/nfsd readahead,
along with many other cleanups.

Patches
=======

One-to-one patching to the existing patches would be impossible,
so I updated the -mm patchset in place. Please replace the -mm patches
from
	readahead-kconfig-options.patch
to
	readahead-remove-the-size-limit-of-max_sectors_kb-on-read_ahead_kb.patch
with this patchset, thanks.

[PATCH 01/28] readahead: kconfig options
[PATCH 02/28] radixtree: introduce scan hole/data functions
[PATCH 03/28] mm: introduce probe_page()
[PATCH 04/28] mm: introduce PG_readahead
[PATCH 05/28] readahead: add look-ahead support to __do_page_cache_readahead()
[PATCH 06/28] readahead: insert cond_resched() calls
[PATCH 07/28] readahead: {MIN,MAX}_RA_PAGES
[PATCH 08/28] readahead: events accounting
[PATCH 09/28] readahead: rescue_pages()
[PATCH 10/28] readahead: sysctl parameters
[PATCH 11/28] readahead: min/max sizes
[PATCH 12/28] readahead: state based method - aging accounting
[PATCH 13/28] readahead: state based method - routines
[PATCH 14/28] readahead: state based method
[PATCH 15/28] readahead: context based method
[PATCH 16/28] readahead: initial method - guiding sizes
[PATCH 17/28] readahead: initial method - thrashing guard size
[PATCH 18/28] readahead: initial method - user recommended size
[PATCH 19/28] readahead: initial method
[PATCH 20/28] readahead: backward prefetching method
[PATCH 21/28] readahead: thrashing recovery method
[PATCH 22/28] readahead: call scheme
[PATCH 23/28] readahead: laptop mode
[PATCH 24/28] readahead: loop case
[PATCH 25/28] readahead: nfsd case
[PATCH 26/28] readahead: turn on by default
[PATCH 27/28] readahead: remove size limit on read_ahead_kb
[PATCH 28/28] readahead: remove size limit of max_sectors_kb on read_ahead_kb

Diffstat
========

 Documentation/sysctl/vm.txt |   46 +
 block/ll_rw_blk.c           |   43 -
 drivers/block/loop.c        |    6
 fs/mpage.c                  |    4
 fs/nfs/client.c             |    3
 fs/nfsd/vfs.c               |    6
 include/linux/backing-dev.h |    2
 include/linux/fs.h          |   60 +
 include/linux/mm.h          |   31
 include/linux/mmzone.h      |    3
 include/linux/page-flags.h  |    6
 include/linux/pagemap.h     |    2
 include/linux/radix-tree.h  |    6
 include/linux/sysctl.h      |    2
 include/linux/writeback.h   |    6
 kernel/sysctl.c             |   27
 lib/radix-tree.c            |   93 ++
 mm/Kconfig                  |   57 +
 mm/filemap.c                |   65 +-
 mm/page-writeback.c         |    2
 mm/page_alloc.c             |   33 +
 mm/readahead.c              | 1404 +++++++++++++++++++++++++++++++++++++++++++-
 mm/vmscan.c                 |    1
 23 files changed, 1862 insertions(+), 46 deletions(-)

Changelog
=========

V16  2006-11-15

OVERHEADS ELIMINATION
- nfsd readahead: now works in peace with the chaotic nfsd requests
- context readahead on random reads: disabled by default
- replace TestClearPageReadahead() with ClearPageReadahead()

CODE REDUCTION
- take 2 function parameters from page_cache_readahead_adaptive()
- drop delayed page_cache_release() patch on filemap.c
- drop smooth aging accounting
- drop cache hit feedback
- drop fixed size random read prefetching
- drop initial readahead size auto detection

READABILITY IMPROVEMENT
- context readahead: document and clearly define cases/subroutines
- rename ra_dispatch() to ra_submit()
- rename zone.aging_total to zone.total_scanned
- move readahead.c/node_readahead_aging() to page_alloc.c/nr_scanned_pages_node()
- some minor cleanups

SEMANTIC CHANGES
- vm.readahead_ratio == 1 (single value) selects stock readahead
- vm.readahead_hit_rate == 0 prevents possible context readahead overheads
- /sys/block/sda/queue/initial_ra_kb = 64 as recommended initial readahead size
- limit ra_min to ra_max/8 (8 = 128k/16k)
- ignore readahead_ratio in aggressive context readahead
- be conservative on backward prefetching

V15  2006-05-28
- apply stream_shift size limits to contexta method
- *remain in query_page_cache_segment() is over counted by 1, fix it
- comment update for adjust_rala_aggressive()
- add use case comment for backward prefetching

V14  2006-05-27
- remove __radix_tree_lookup_parent()
- implement radix_tree_scan_hole*() as dumb and safe
- break file_ra_state.cache_hits into u16s
- rationalize ra_dispatch and move look-ahead/age stuffs here
- move node_free_and_cold_pages() to page_alloc.c/nr_free_inactive_pages_node()
- fix a bug in query_page_cache_segment()
- adjust RA_FLAG_XXX to avoid confliction with ra_class_{new,old}
- random comments

V13  2006-05-26
- remove radix tree look-aside cache
- fix radix tree NULL dereference bug
- fix radix tree bugs on direct embedded data
- add comment on cold_page_refcnt()
- rename find_page() to probe_page()
- replace the non-atomic __SetPageReadahead()
- fix the risky rescue_pages()
- some cleanups recommended by Nick Piggin

V12  2006-05-24
- improve small files case
- allow pausing of events accounting
- disable sparse read-ahead by default
- a bug fix in radix_tree_cache_lookup_parent()
- more cleanups

V11  2006-03-19
- patchset rework
- add kconfig option to make the feature compile-time selectable
- improve radix tree scan functions
- fix bug of using smp_processor_id() in preemptible code
- avoid overflow in compute_thrashing_threshold()
- disable sparse read prefetching if (readahead_hit_rate == 1)
- make thrashing recovery a standalone function
- random cleanups

V10  2005-12-16
- remove delayed page activation
- remove live page protection
- revert mmap readaround to old behavior
- default to original readahead logic
- default to original readahead size
- merge comment fixes from Andreas Mohr
- merge radixtree cleanups from Christoph Lameter
- reduce sizeof(struct file_ra_state) by unnamed union
- stateful method cleanups
- account other read-ahead paths

V9  2005-12-3
- standalone mmap read-around code, a little more smart and tunable
- make stateful method sensible of request size
- decouple readahead_ratio from live pages protection
- let readahead_ratio contribute to ra_size grow speed in stateful method
- account variance of ra_size

V8  2005-11-25

- balance zone aging only in page relaim paths and do it right
- do the aging of slabs in the same way as zones
- add debug code to dump the detailed page reclaim steps
- undo exposing of struct radix_tree_node and uninline related functions
- work better with nfsd
- generalize accelerated context based read-ahead
- account smooth read-ahead aging based on page referenced/activate bits
- avoid divide error in compute_thrashing_threshold()
- more low lantency efforts
- update some comments
- rebase debug actions on debugfs entries instead of magic readahead_ratio values

V7  2005-11-09

- new tunable parameters: readahead_hit_rate/readahead_live_chunk
- support sparse sequential accesses
- delay look-ahead if drive is spinned down in laptop mode
- disable look-ahead for loopback file
- make mandatory thrashing protection more simple and robust
- attempt to improve responsiveness on large read-ahead size

V6  2005-11-01

- cancel look-ahead in laptop mode
- increase read-ahead limit to 0xFFFF pages

V5  2005-10-28

- rewrite context based method to make it clean and robust
- improved accuracy of stateful thrashing threshold estimation
- make page aging equal to the number of code pages scanned
- sort out the thrashing protection logic
- enhanced debug/accounting facilities

V4  2005-10-15

- detect and save live chunks on page reclaim
- support database workload
- support reading backward
- radix tree lookup look-aside cache

V3  2005-10-06

- major code reorganization and documention
- stateful estimation of thrashing-threshold
- context method with accelerated grow up phase
- adaptive look-ahead
- early detection and rescue of pages in danger
- statitics data collection
- synchronized page aging between zones

V2  2005-09-15

- delayed page activation
- look-ahead: towards pipelined read-ahead

V1  2005-09-13

Initial release which features:
        o stateless (for now)
        o adapts to available memory / read speed
        o free of thrashing (in theory)

And handles:
        o large number of slow streams (FTP server)
	o open/read/close access patterns (NFS server)
        o multiple interleaved, sequential streams in one file
	  (multithread / multimedia / database)

Cheers,
Fengguang Wu
--
Dept. Automation                University of Science and Technology of China

^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH 01/28] readahead: kconfig options
       [not found] ` <20061115075024.180138257@localhost.localdomain>
@ 2006-11-15  7:50   ` Wu Fengguang
  0 siblings, 0 replies; 24+ messages in thread
From: Wu Fengguang @ 2006-11-15  7:50 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, Wu Fengguang

[-- Attachment #1: readahead-kconfig-options.patch --]
[-- Type: text/plain, Size: 4701 bytes --]

This patchset introduces a set of adaptive read-ahead methods.
They enable the kernel to better support many important I/O applications.

MAIN FEATURES
=============

- Adaptive read-ahead buffer management
	- aggressive, thrashing safe read-ahead size
		- optimal memory utilisation while achieving good I/O throughput
		- unnecessary to hand tuning VM_MAX_READAHEAD
		- support slow/fast readers at the same time
		- support large number of concurrent readers
	- aggressive read-ahead on start-of-file
		- configurable recommended read-ahead size
		- safeguarded by dynamic estimated thrashing threshold
		- good for lots-of-small-files case
	- shrinkable look-ahead size
		- cut down up to 40% memory consumption on overloaded situation

- Detecting any form of (semi-)sequencial scan
        - parallel / interleaved sequential scans on one fd
        - sequential reads across file open/close lifetime
        - mixed sequential / random accesses
        - sparse / skimming sequential read

- Support more access patterns
        - backward prefetching

- Better special case handling
        - nfs daemon support: the raparams cache is no longer required
	- laptop mode support: defer look-ahead on drive spinned down
        - loopback file support: avoid double look-ahead

DESIGN STRATEGIES
=================

- Dual methods design
        - stateful method: the fast and default one
	- stateless method: the robust and failsafe one
	- if anything abnormal happens, the stateful method bails out, the
	  stateless method queries the page cache and possibly restart the
	  read-ahead process

- Robust feedback design
	- sense and handle important states so that the logic wont run away
	- detect danger of thrashing and prevent it in advance
        - extensive accounting and debugging traces

This patch:

Add kconfig options to enable/disable:
	- adaptive read-ahead logic
	- adaptive read-ahead debug traces and events accounting

The read-ahead introduction text is cited from the well written LWN article
"Adaptive file readahead" <http://lwn.net/Articles/155510/> :)

Signed-off-by: Wu Fengguang <wfg@mail.ustc.edu.cn>
Signed-off-by: Andrew Morton <akpm@osdl.org>
--- linux-2.6.19-rc4-mm1.orig/mm/Kconfig
+++ linux-2.6.19-rc4-mm1/mm/Kconfig
@@ -163,3 +163,60 @@ config ZONE_DMA_FLAG
 	default "0" if !ZONE_DMA
 	default "1"
 
+#
+# Adaptive file readahead
+#
+config ADAPTIVE_READAHEAD
+	bool "Adaptive file readahead (EXPERIMENTAL)"
+	default n
+	depends on EXPERIMENTAL
+	help
+	  Readahead is a technique employed by the kernel in an attempt
+	  to improve file reading performance. If the kernel has reason
+	  to believe that a particular file is being read sequentially,
+	  it will attempt to read blocks from the file into memory before
+	  the application requests them. When readahead works, it speeds
+	  up the system's throughput, since the reading application does
+	  not have to wait for its requests. When readahead fails, instead,
+	  it generates useless I/O and occupies memory pages which are
+	  needed for some other purpose.
+
+	  The kernel already has a stock readahead logic that is well
+	  understood and well tuned. This option enables a more complex and
+	  feature rich one. It tries to be smart and memory efficient.
+	  However, due to the great diversity of real world applications, it
+	  might not fit everyone.
+
+	  Please refer to Documentation/sysctl/vm.txt for tunable parameters.
+
+	  It is known to work well for many desktops, file servers and
+	  postgresql databases. Say Y to try it out for yourself.
+
+config DEBUG_READAHEAD
+	bool "Readahead debug and accounting"
+	default n
+	depends on ADAPTIVE_READAHEAD
+	select DEBUG_FS
+	help
+	  This option injects extra code to dump detailed debug traces and do
+	  readahead events accounting.
+
+	  To actually get the data:
+
+	  mkdir /debug
+	  mount -t debug none /debug
+
+	  After that you can do the following:
+
+	  echo > /debug/readahead/events # reset the counters
+	  cat /debug/readahead/events    # check the counters
+
+	  echo 1 > /debug/readahead/debug_level # start events accounting
+	  echo 0 > /debug/readahead/debug_level # pause events accounting
+
+	  echo 2 > /debug/readahead/debug_level # show printk traces
+	  echo 3 > /debug/readahead/debug_level # show printk traces(verbose)
+	  echo 1 > /debug/readahead/debug_level # stop filling my kern.log
+
+	  Say N for production servers.
+
--- linux-2.6.19-rc4-mm1.orig/mm/readahead.c
+++ linux-2.6.19-rc4-mm1/mm/readahead.c
@@ -5,6 +5,8 @@
  *
  * 09Apr2002	akpm@zip.com.au
  *		Initial version.
+ * 26May2006	Fengguang Wu <wfg@ustc.edu>
+ *		Adaptive read-ahead framework.
  */
 
 #include <linux/kernel.h>

--

^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH 02/28] radixtree: introduce scan hole/data functions
       [not found] ` <20061115075024.503627543@localhost.localdomain>
@ 2006-11-15  7:50   ` Wu Fengguang
  0 siblings, 0 replies; 24+ messages in thread
From: Wu Fengguang @ 2006-11-15  7:50 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, Wu Fengguang

[-- Attachment #1: radixtree-introduce-radix_tree_scan_functions.patch --]
[-- Type: text/plain, Size: 4280 bytes --]

Introduce a set of functions to scan radix tree for hole/data item.
	- radix_tree_scan_hole(root, index, max_scan)
	- radix_tree_scan_hole_backward(root, index, max_scan)
	- radix_tree_scan_data_backward(root, index, max_scan)

The implementation is dumb and obviously correct.
It will help to debug the smart ones to be introduced in future.

Signed-off-by: Wu Fengguang <wfg@mail.ustc.edu.cn>
Signed-off-by: Andrew Morton <akpm@osdl.org>
--- linux-2.6.19-rc4-mm1.orig/include/linux/radix-tree.h
+++ linux-2.6.19-rc4-mm1/include/linux/radix-tree.h
@@ -156,6 +156,12 @@ void *radix_tree_delete(struct radix_tre
 unsigned int
 radix_tree_gang_lookup(struct radix_tree_root *root, void **results,
 			unsigned long first_index, unsigned int max_items);
+unsigned long radix_tree_scan_hole(struct radix_tree_root *root,
+				unsigned long index, unsigned long max_scan);
+unsigned long radix_tree_scan_hole_backward(struct radix_tree_root *root,
+				unsigned long index, unsigned long max_scan);
+unsigned long radix_tree_scan_data_backward(struct radix_tree_root *root,
+				unsigned long index, unsigned long max_scan);
 int radix_tree_preload(gfp_t gfp_mask);
 void radix_tree_init(void);
 void *radix_tree_tag_set(struct radix_tree_root *root,
--- linux-2.6.19-rc4-mm1.orig/lib/radix-tree.c
+++ linux-2.6.19-rc4-mm1/lib/radix-tree.c
@@ -598,6 +598,99 @@ int radix_tree_tag_get(struct radix_tree
 EXPORT_SYMBOL(radix_tree_tag_get);
 #endif
 
+static unsigned long
+radix_tree_scan_hole_dumb(struct radix_tree_root *root,
+				unsigned long index, unsigned long max_scan)
+{
+	unsigned long i;
+
+	for (i = 0; i < max_scan; i++)
+		if (!radix_tree_lookup(root, index) || ++index == 0)
+			break;
+
+	return index;
+}
+
+static unsigned long
+radix_tree_scan_hole_backward_dumb(struct radix_tree_root *root,
+				unsigned long index, unsigned long max_scan)
+{
+	unsigned long i;
+
+	for (i = 0; i < max_scan; i++)
+		if (!radix_tree_lookup(root, index) || --index == ULONG_MAX)
+			break;
+
+	return index;
+}
+
+static unsigned long
+radix_tree_scan_data_backward_dumb(struct radix_tree_root *root,
+				unsigned long index, unsigned long max_scan)
+{
+	unsigned long i;
+
+	for (i = 0; i < max_scan; i++)
+		if (radix_tree_lookup(root, index) || --index == ULONG_MAX)
+			break;
+
+	return index;
+}
+
+/**
+ *	radix_tree_scan_hole    -    scan for hole
+ *	@root:		radix tree root
+ *	@index:		index key
+ *	@max_scan:      advice on max items to scan (it may scan a little more)
+ *
+ *      Scan forward from @index for a hole/empty item, stop when
+ *      - hit hole
+ *      - wrap-around to index 0
+ *      - @max_scan or more items scanned
+ */
+unsigned long radix_tree_scan_hole(struct radix_tree_root *root,
+				unsigned long index, unsigned long max_scan)
+{
+	return radix_tree_scan_hole_dumb(root, index, max_scan);
+}
+EXPORT_SYMBOL(radix_tree_scan_hole);
+
+/**
+ *	radix_tree_scan_hole_backward    -    scan backward for hole
+ *	@root:		radix tree root
+ *	@index:		index key
+ *	@max_scan:      advice on max items to scan (it may scan a little more)
+ *
+ *      Scan backward from @index for a hole/empty item, stop when
+ *      - hit hole
+ *      - wrap-around to index ULONG_MAX
+ *      - @max_scan or more items scanned
+ */
+unsigned long radix_tree_scan_hole_backward(struct radix_tree_root *root,
+				unsigned long index, unsigned long max_scan)
+{
+	return radix_tree_scan_hole_backward_dumb(root, index, max_scan);
+}
+EXPORT_SYMBOL(radix_tree_scan_hole_backward);
+
+/**
+ *	radix_tree_scan_data_backward    -    scan backward for data
+ *	@root:		radix tree root
+ *	@index:		index key
+ *	@max_scan:      advice on max items to scan (it may scan a little more)
+ *
+ *      Scan backward from @index for a data item, stop when
+ *      - hit data
+ *      - wrap-around to index ULONG_MAX
+ *      - @max_scan or more items scanned
+ */
+unsigned long radix_tree_scan_data_backward(struct radix_tree_root *root,
+				unsigned long index, unsigned long max_scan)
+{
+	return radix_tree_scan_data_backward_dumb(root, index, max_scan);
+}
+EXPORT_SYMBOL(radix_tree_scan_data_backward);
+
 static unsigned int
 __lookup(struct radix_tree_node *slot, void **results, unsigned long index,
 	unsigned int max_items, unsigned long *next_index)

--

^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH 03/28] mm: introduce probe_page()
       [not found] ` <20061115075024.850542829@localhost.localdomain>
@ 2006-11-15  7:50   ` Wu Fengguang
  0 siblings, 0 replies; 24+ messages in thread
From: Wu Fengguang @ 2006-11-15  7:50 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, Wu Fengguang

[-- Attachment #1: mm-introduce-probe_page.patch --]
[-- Type: text/plain, Size: 1540 bytes --]

Introduce a pair of functions to probe the existence of file page.
	- int __probe_page(mapping, offset)
	- int probe_page(mapping, offset)

Signed-off-by: Wu Fengguang <wfg@mail.ustc.edu.cn>
Signed-off-by: Andrew Morton <akpm@osdl.org>
--- linux-2.6.19-rc4-mm1.orig/include/linux/pagemap.h
+++ linux-2.6.19-rc4-mm1/include/linux/pagemap.h
@@ -72,6 +72,8 @@ static inline struct page *page_cache_al
 
 typedef int filler_t(void *, struct page *);
 
+extern int __probe_page(struct address_space *mapping, pgoff_t offset);
+extern int probe_page(struct address_space *mapping, pgoff_t offset);
 extern struct page * find_get_page(struct address_space *mapping,
 				unsigned long index);
 extern struct page * find_lock_page(struct address_space *mapping,
--- linux-2.6.19-rc4-mm1.orig/mm/filemap.c
+++ linux-2.6.19-rc4-mm1/mm/filemap.c
@@ -613,6 +613,29 @@ void fastcall __lock_page_nosync(struct 
 							TASK_UNINTERRUPTIBLE);
 }
 
+/*
+ * Probing page existence.
+ */
+int __probe_page(struct address_space *mapping, pgoff_t offset)
+{
+	return !!radix_tree_lookup(&mapping->page_tree, offset);
+}
+
+/*
+ * Here we just do not bother to grab the page, it's meaningless anyway.
+ */
+int probe_page(struct address_space *mapping, pgoff_t offset)
+{
+	int exists;
+
+	read_lock_irq(&mapping->tree_lock);
+	exists = __probe_page(mapping, offset);
+	read_unlock_irq(&mapping->tree_lock);
+
+	return exists;
+}
+EXPORT_SYMBOL(probe_page);
+
 /**
  * find_get_page - find and get a page reference
  * @mapping: the address_space to search

--

^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH 04/28] mm: introduce PG_readahead
       [not found] ` <20061115075025.438524224@localhost.localdomain>
@ 2006-11-15  7:50   ` Wu Fengguang
  0 siblings, 0 replies; 24+ messages in thread
From: Wu Fengguang @ 2006-11-15  7:50 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, Wu Fengguang

[-- Attachment #1: mm-introduce-pg_readahead.patch --]
[-- Type: text/plain, Size: 1627 bytes --]

Introduce a new page flag: PG_readahead.

It acts as a look-ahead mark, which tells the page reader:
	Hey, it's time to invoke the adaptive read-ahead logic!
	For the sake of I/O pipelining, don't wait until it runs out of
	cached pages.  ;-)

Signed-off-by: Wu Fengguang <wfg@mail.ustc.edu.cn>
Signed-off-by: Andrew Morton <akpm@osdl.org>
--- linux-2.6.19-rc5-mm2.orig/include/linux/page-flags.h
+++ linux-2.6.19-rc5-mm2/include/linux/page-flags.h
@@ -91,6 +91,8 @@
 #define PG_nosave_free		18	/* Used for system suspend/resume */
 #define PG_buddy		19	/* Page is free, on buddy lists */
 
+#define PG_readahead		20	/* Reminder to do read-ahead */
+
 
 #if (BITS_PER_LONG > 32)
 /*
@@ -247,6 +249,10 @@ static inline void SetPageUptodate(struc
 #define SetPageUncached(page)	set_bit(PG_uncached, &(page)->flags)
 #define ClearPageUncached(page)	clear_bit(PG_uncached, &(page)->flags)
 
+#define PageReadahead(page)	test_bit(PG_readahead, &(page)->flags)
+#define SetPageReadahead(page)	set_bit(PG_readahead, &(page)->flags)
+#define ClearPageReadahead(page) clear_bit(PG_readahead, &(page)->flags)
+
 struct page;	/* forward declaration */
 
 int test_clear_page_dirty(struct page *page);
--- linux-2.6.19-rc5-mm2.orig/mm/page_alloc.c
+++ linux-2.6.19-rc5-mm2/mm/page_alloc.c
@@ -598,7 +598,7 @@ static int prep_new_page(struct page *pa
 	if (PageReserved(page))
 		return 1;
 
-	page->flags &= ~(1 << PG_uptodate | 1 << PG_error |
+	page->flags &= ~(1 << PG_uptodate | 1 << PG_error | 1 << PG_readahead |
 			1 << PG_referenced | 1 << PG_arch_1 |
 			1 << PG_fs_misc | 1 << PG_mappedtodisk);
 	set_page_private(page, 0);

--

^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH 06/28] readahead: insert cond_resched() calls
       [not found] ` <20061115075026.121499794@localhost.localdomain>
@ 2006-11-15  7:50   ` Wu Fengguang
  0 siblings, 0 replies; 24+ messages in thread
From: Wu Fengguang @ 2006-11-15  7:50 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, Wu Fengguang

[-- Attachment #1: readahead-insert-cond_resched-calls.patch --]
[-- Type: text/plain, Size: 1983 bytes --]

Since the VM_MAX_READAHEAD is greatly enlarged and the algorithm more
complex, it becomes necessary to insert some cond_resched() calls in
the read-ahead path.

If desktop users still feel audio jitters with the new read-ahead code,
please try one of the following ways to get rid of it:

1) compile kernel with CONFIG_PREEMPT_VOLUNTARY/CONFIG_PREEMPT
2) reduce the read-ahead request size by running
	blockdev --setra 256 /dev/hda # or whatever device you are using

Signed-off-by: Wu Fengguang <wfg@mail.ustc.edu.cn>
Signed-off-by: Andrew Morton <akpm@osdl.org>
--- linux-2.6.19-rc5-mm2.orig/fs/mpage.c
+++ linux-2.6.19-rc5-mm2/fs/mpage.c
@@ -407,8 +407,10 @@ mpage_readpages(struct address_space *ma
 					&last_block_in_bio, &map_bh,
 					&first_logical_block,
 					get_block);
-			if (!pagevec_add(&lru_pvec, page))
+			if (!pagevec_add(&lru_pvec, page)) {
+				cond_resched();
 				__pagevec_lru_add(&lru_pvec);
+			}
 		} else {
 			page_cache_release(page);
 		}
--- linux-2.6.19-rc5-mm2.orig/mm/readahead.c
+++ linux-2.6.19-rc5-mm2/mm/readahead.c
@@ -182,8 +182,10 @@ int read_cache_pages(struct address_spac
 			continue;
 		}
 		ret = filler(data, page);
-		if (!pagevec_add(&lru_pvec, page))
+		if (!pagevec_add(&lru_pvec, page)) {
+			cond_resched();
 			__pagevec_lru_add(&lru_pvec);
+		}
 		if (ret) {
 			read_cache_pages_invalidate_pages(mapping, pages);
 			break;
@@ -216,8 +218,10 @@ static int read_pages(struct address_spa
 		if (!add_to_page_cache(page, mapping,
 					page->index, GFP_KERNEL)) {
 			mapping->a_ops->readpage(filp, page);
-			if (!pagevec_add(&lru_pvec, page))
+			if (!pagevec_add(&lru_pvec, page)) {
+				cond_resched();
 				__pagevec_lru_add(&lru_pvec);
+			}
 		} else
 			page_cache_release(page);
 	}
@@ -330,6 +334,7 @@ __do_page_cache_readahead(struct address
 
 		read_unlock_irq(&mapping->tree_lock);
 		page = page_cache_alloc_cold(mapping);
+		cond_resched();
 		read_lock_irq(&mapping->tree_lock);
 		if (!page)
 			break;

--

^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH 09/28] readahead: rescue_pages()
       [not found] ` <20061115075027.139255636@localhost.localdomain>
@ 2006-11-15  7:50   ` Wu Fengguang
  0 siblings, 0 replies; 24+ messages in thread
From: Wu Fengguang @ 2006-11-15  7:50 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, Wu Fengguang

[-- Attachment #1: readahead-rescue_pages.patch --]
[-- Type: text/plain, Size: 3207 bytes --]

Introduce function rescue_pages() to protect pages in danger of thrashing.

Signed-off-by: Wu Fengguang <wfg@mail.ustc.edu.cn>
Signed-off-by: Andrew Morton <akpm@osdl.org>
--- linux-2.6.19-rc5-mm1.orig/mm/readahead.c
+++ linux-2.6.19-rc5-mm1/mm/readahead.c
@@ -708,6 +708,96 @@ unsigned long max_sane_readahead(unsigne
 }
 
 /*
+ * Adaptive read-ahead.
+ *
+ * Good read patterns are compact both in space and time. The read-ahead logic
+ * tries to grant larger read-ahead size to better readers under the constraint
+ * of system memory and load pressure.
+ *
+ * It employs two methods to estimate the max thrashing safe read-ahead size:
+ *   1. state based   - the default one
+ *   2. context based - the failsafe one
+ * The integration of the dual methods has the merit of being agile and robust.
+ * It makes the overall design clean: special cases are handled in general by
+ * the stateless method, leaving the stateful one simple and fast.
+ *
+ * To improve throughput and decrease read delay, the logic 'looks ahead'.
+ * In most read-ahead chunks, one page will be selected and tagged with
+ * PG_readahead. Later when the page with PG_readahead is read, the logic
+ * will be notified to submit the next read-ahead chunk in advance.
+ *
+ *                 a read-ahead chunk
+ *    +-----------------------------------------+
+ *    |       # PG_readahead                    |
+ *    +-----------------------------------------+
+ *            ^ When this page is read, notify me for the next read-ahead.
+ *
+ */
+
+#ifdef CONFIG_ADAPTIVE_READAHEAD
+
+/*
+ * Move pages in danger (of thrashing) to the head of inactive_list.
+ * Not expected to happen frequently.
+ *
+ * @page will be skipped: it's grabbed and won't die away.
+ * The following @nr_pages-1 pages will be protected.
+ */
+static unsigned long rescue_pages(struct page *page, unsigned long nr_pages)
+{
+	int pgrescue = 0;
+	pgoff_t index = page_index(page);
+	struct address_space *mapping = page_mapping(page);
+	struct page *grabbed_page = NULL;
+	struct zone *zone;
+
+	dprintk("rescue_pages(ino=%lu, index=%lu nr=%lu)\n",
+			mapping->host->i_ino, index, nr_pages);
+
+	for(;;) {
+		zone = page_zone(page);
+		spin_lock_irq(&zone->lru_lock);
+
+		if (!PageLRU(page))
+			goto out_unlock;
+
+		while (page_mapping(page) == mapping &&
+				page_index(page) == index) {
+			struct page *the_page = page;
+			page = list_entry((page)->lru.prev, struct page, lru);
+			if (!PageActive(the_page) &&
+					!PageLocked(the_page) &&
+					page_count(the_page) == 1) {
+				list_move(&the_page->lru, &zone->inactive_list);
+				pgrescue++;
+			}
+			index++;
+			if (!--nr_pages)
+				goto out_unlock;
+		}
+
+		spin_unlock_irq(&zone->lru_lock);
+		cond_resched();
+
+		if (grabbed_page)
+			page_cache_release(grabbed_page);
+		grabbed_page = page = find_get_page(mapping, index);
+		if (!page)
+			goto out;
+	}
+
+out_unlock:
+	spin_unlock_irq(&zone->lru_lock);
+out:
+	if (grabbed_page)
+		page_cache_release(grabbed_page);
+	ra_account(NULL, RA_EVENT_READAHEAD_RESCUE, pgrescue);
+	return nr_pages;
+}
+
+#endif /* CONFIG_ADAPTIVE_READAHEAD */
+
+/*
  * Read-ahead events accounting.
  */
 #ifdef CONFIG_DEBUG_READAHEAD

--

^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH 11/28] readahead: min/max sizes
       [not found] ` <20061115075027.832896629@localhost.localdomain>
@ 2006-11-15  7:50   ` Wu Fengguang
  0 siblings, 0 replies; 24+ messages in thread
From: Wu Fengguang @ 2006-11-15  7:50 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, Wu Fengguang

[-- Attachment #1: readahead-min-max-sizes.patch --]
[-- Type: text/plain, Size: 2096 bytes --]

- Enlarge VM_MAX_READAHEAD to 1024 if new read-ahead code is compiled in.
  This value is no longer tightly coupled with the thrashing problem,
  therefore constrained by it. The adaptive read-ahead logic merely takes
  it as an upper bound, and will not stick to it under memory pressure.

- Slightly enlarge minimal/initial read-ahead size on big memory systems.
  Memory bounty systems are less likely to suffer from thrashing on small
  read-ahead sizes. A bigger initial value helps the ra_size scaling up
  progress.

Signed-off-by: Wu Fengguang <wfg@mail.ustc.edu.cn>
Signed-off-by: Andrew Morton <akpm@osdl.org>
--- linux-2.6.19-rc5-mm2.orig/include/linux/mm.h
+++ linux-2.6.19-rc5-mm2/include/linux/mm.h
@@ -1046,7 +1046,11 @@ extern int filemap_populate(struct vm_ar
 int write_one_page(struct page *page, int wait);
 
 /* readahead.c */
+#ifdef CONFIG_ADAPTIVE_READAHEAD
+#define VM_MAX_READAHEAD	1024	/* kbytes */
+#else
 #define VM_MAX_READAHEAD	128	/* kbytes */
+#endif
 #define VM_MIN_READAHEAD	16	/* kbytes (includes current page) */
 #define VM_MAX_CACHE_HIT    	256	/* max pages in a row in cache before
 					 * turning readahead off */
--- linux-2.6.19-rc5-mm2.orig/mm/readahead.c
+++ linux-2.6.19-rc5-mm2/mm/readahead.c
@@ -814,6 +814,30 @@ out:
 	return nr_pages;
 }
 
+/*
+ * ra_min is mainly determined by the size of cache memory. Reasonable?
+ *
+ * Table of concrete numbers for 4KB page size:
+ *   inactive + free (MB):    4   8   16   32   64  128  256  512 1024
+ *            ra_min (KB):   16  16   16   16   20   24   32   48   64
+ */
+static inline void get_readahead_bounds(struct file_ra_state *ra,
+					unsigned long *ra_min,
+					unsigned long *ra_max)
+{
+	unsigned long active;
+	unsigned long inactive;
+	unsigned long free;
+
+	__get_zone_counts(&active, &inactive, &free, NODE_DATA(numa_node_id()));
+
+	free += inactive;
+	*ra_max = min(min(ra->ra_pages, 0xFFFFUL), free / 2);
+	*ra_min = min(min(MIN_RA_PAGES + (free >> 14),
+			  DIV_ROUND_UP(64*1024, PAGE_CACHE_SIZE)),
+			  *ra_max / 8);
+}
+
 #endif /* CONFIG_ADAPTIVE_READAHEAD */
 
 /*

--

^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH 12/28] readahead: state based method - aging accounting
       [not found] ` <20061115075028.178039166@localhost.localdomain>
@ 2006-11-15  7:50   ` Wu Fengguang
  2006-11-15 16:54     ` Christoph Lameter
  0 siblings, 1 reply; 24+ messages in thread
From: Wu Fengguang @ 2006-11-15  7:50 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, Wu Fengguang, Christoph Lameter

[-- Attachment #1: readahead-state-based-method-aging-accounting.patch --]
[-- Type: text/plain, Size: 3447 bytes --]

Collect info about the global available memory and its consumption speed.
The data are used by the stateful method to estimate the thrashing threshold.

They are the decisive factor of the correctness/accuracy of the resulting
read-ahead size.

The accountings are done on a per-node basis. On NUMA systems, it works for
the two common real-world schemes:
	- the reader process allocates caches in a node affined manner;
	- the reader process allocates caches _balancely_ from a set of nodes.

Signed-off-by: Wu Fengguang <wfg@mail.ustc.edu.cn>
DESC
Apply type enum zone_type (readahead)
EDESC
From: Christoph Lameter <clameter@sgi.com>

After we have done this we can now do some typing cleanup.

The memory policy layer keeps a policy_zone that specifies
the zone that gets memory policies applied. This variable
can now be of type enum zone_type.

The check_highest_zone function and the build_zonelists funnctionm must
then also take a enum zone_type parameter.

Plus there are a number of loops over zones that also should use
zone_type.

We run into some troubles at some points with functions that need a
zone_type variable to become -1. Fix that up.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
--- linux-2.6.19-rc5-mm2.orig/include/linux/mmzone.h
+++ linux-2.6.19-rc5-mm2/include/linux/mmzone.h
@@ -218,6 +218,7 @@ struct zone {
 	unsigned long		nr_active;
 	unsigned long		nr_inactive;
 	unsigned long		pages_scanned;	   /* since last reclaim */
+	unsigned long		total_scanned;	   /* accumulated, may overflow */
 	int			all_unreclaimable; /* All pages pinned */
 
 	/* A count of how many reclaimers are scanning this zone */
@@ -464,6 +465,8 @@ void __get_zone_counts(unsigned long *ac
 			unsigned long *free, struct pglist_data *pgdat);
 void get_zone_counts(unsigned long *active, unsigned long *inactive,
 			unsigned long *free);
+unsigned long nr_free_inactive_pages_node(int nid);
+unsigned long nr_scanned_pages_node(int nid);
 void build_all_zonelists(void);
 void wakeup_kswapd(struct zone *zone, int order);
 int zone_watermark_ok(struct zone *z, int order, unsigned long mark,
--- linux-2.6.19-rc5-mm2.orig/mm/page_alloc.c
+++ linux-2.6.19-rc5-mm2/mm/page_alloc.c
@@ -1489,6 +1489,37 @@ unsigned int nr_free_pagecache_pages(voi
 	return nr_free_zone_pages(gfp_zone(GFP_HIGHUSER));
 }
 
+/*
+ * Amount of free+inactive RAM in a node.
+ */
+unsigned long nr_free_inactive_pages_node(int nid)
+{
+	enum zone_type i;
+	unsigned long sum = 0;
+	struct zone *zones = NODE_DATA(nid)->node_zones;
+
+	for (i = 0; i < MAX_NR_ZONES; i++)
+		sum += zones[i].nr_inactive +
+			zones[i].free_pages - zones[i].pages_low;
+
+	return sum;
+}
+
+/*
+ * Accumulated scanned pages in a node.
+ */
+unsigned long nr_scanned_pages_node(int nid)
+{
+       enum zone_type i;
+       unsigned long sum = 0;
+       struct zone *zones = NODE_DATA(nid)->node_zones;
+
+       for (i = 0; i < MAX_NR_ZONES; i++)
+	       sum += zones[i].total_scanned;
+
+       return sum;
+}
+
 static inline void show_node(struct zone *zone)
 {
 	if (NUMA_BUILD)
--- linux-2.6.19-rc5-mm2.orig/mm/vmscan.c
+++ linux-2.6.19-rc5-mm2/mm/vmscan.c
@@ -683,6 +683,7 @@ static unsigned long shrink_inactive_lis
 					     &page_list, &nr_scan);
 		zone->nr_inactive -= nr_taken;
 		zone->pages_scanned += nr_scan;
+		zone->total_scanned += nr_scan;
 		spin_unlock_irq(&zone->lru_lock);
 
 		nr_scanned += nr_scan;

--

^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH 13/28] readahead: state based method - routines
       [not found] ` <20061115075028.494374406@localhost.localdomain>
@ 2006-11-15  7:50   ` Wu Fengguang
  0 siblings, 0 replies; 24+ messages in thread
From: Wu Fengguang @ 2006-11-15  7:50 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, Wu Fengguang

[-- Attachment #1: readahead-state-based-method-routines.patch --]
[-- Type: text/plain, Size: 7812 bytes --]

Extend struct file_ra_state to support the adaptive read-ahead logic.
Also define some helpers for it.

Signed-off-by: Wu Fengguang <wfg@mail.ustc.edu.cn>
Signed-off-by: Andrew Morton <akpm@osdl.org>
--- linux-2.6.19-rc5-mm2.orig/include/linux/fs.h
+++ linux-2.6.19-rc5-mm2/include/linux/fs.h
@@ -688,21 +688,59 @@ struct fown_struct {
 
 /*
  * Track a single file's readahead state
+ *
+ * Diagram for the adaptive readahead logic:
+ *
+ *  |--------- old chunk ------->|-------------- new chunk -------------->|
+ *  +----------------------------+----------------------------------------+
+ *  |               #            |                  #                     |
+ *  +----------------------------+----------------------------------------+
+ *                  ^            ^                  ^                     ^
+ *  file_ra_state.la_index    .ra_index   .lookahead_index      .readahead_index
+ *
+ * Common used deduced sizes:
+ *                               |----------- readahead size ------------>|
+ *  +----------------------------+----------------------------------------+
+ *  |               #            |                  #                     |
+ *  +----------------------------+----------------------------------------+
+ *                  |------- invoke interval ------>|-- lookahead size -->|
  */
 struct file_ra_state {
-	unsigned long start;		/* Current window */
-	unsigned long size;
-	unsigned long flags;		/* ra flags RA_FLAG_xxx*/
-	unsigned long cache_hit;	/* cache hit count*/
-	unsigned long prev_page;	/* Cache last read() position */
-	unsigned long ahead_start;	/* Ahead window */
-	unsigned long ahead_size;
-	unsigned long ra_pages;		/* Maximum readahead window */
+	union {
+		struct { /* stock read-ahead */
+			unsigned long start;		/* Current window */
+			unsigned long size;
+			unsigned long ahead_start;	/* Ahead window */
+			unsigned long ahead_size;
+			unsigned long cache_hit;	/* cache hit count */
+		};
+#ifdef CONFIG_ADAPTIVE_READAHEAD
+		struct { /* adaptive read-ahead */
+			pgoff_t la_index;		/* old chunk */
+			pgoff_t ra_index;
+			pgoff_t lookahead_index;	/* new chunk */
+			pgoff_t readahead_index;
+
+			/*
+			 * Snapshot of the (node's) read-ahead aging value
+			 * on time of I/O submission.
+			 */
+			unsigned long age;
+		};
+#endif
+	};
+
+	/* mmap read-around */
 	unsigned long mmap_hit;		/* Cache hit stat for mmap accesses */
 	unsigned long mmap_miss;	/* Cache miss stat for mmap accesses */
+
+	unsigned long flags;	/* RA_FLAG_xxx | ra_class_old | ra_class_new */
+	unsigned long prev_page;	/* Cache last read() position */
+	unsigned long ra_pages;		/* Maximum readahead window */
 };
-#define RA_FLAG_MISS 0x01	/* a cache miss occured against this file */
-#define RA_FLAG_INCACHE 0x02	/* file is already in cache */
+#define RA_FLAG_MISS	(1UL<<31) /* a cache miss occured against this file */
+#define RA_FLAG_INCACHE	(1UL<<30) /* file is already in cache */
+#define RA_FLAG_MMAP	(1UL<<29) /* mmap page access */
 
 struct file {
 	/*
--- linux-2.6.19-rc5-mm2.orig/mm/readahead.c
+++ linux-2.6.19-rc5-mm2/mm/readahead.c
@@ -815,6 +815,155 @@ out:
 }
 
 /*
+ * Some helpers for querying/building a read-ahead request.
+ *
+ * Diagram for some variable names used frequently:
+ *
+ *                                   |<------- la_size ------>|
+ *                  +-----------------------------------------+
+ *                  |                #                        |
+ *                  +-----------------------------------------+
+ *      ra_index -->|<---------------- ra_size -------------->|
+ *
+ */
+
+static enum ra_class ra_class_new(struct file_ra_state *ra)
+{
+	return ra->flags & RA_CLASS_MASK;
+}
+
+static inline enum ra_class ra_class_old(struct file_ra_state *ra)
+{
+	return (ra->flags >> RA_CLASS_SHIFT) & RA_CLASS_MASK;
+}
+
+static unsigned long ra_readahead_size(struct file_ra_state *ra)
+{
+	return ra->readahead_index - ra->ra_index;
+}
+
+static unsigned long ra_lookahead_size(struct file_ra_state *ra)
+{
+	return ra->readahead_index - ra->lookahead_index;
+}
+
+static unsigned long ra_invoke_interval(struct file_ra_state *ra)
+{
+	return ra->lookahead_index - ra->la_index;
+}
+
+/*
+ * Check if @index falls in the @ra request.
+ */
+static int ra_has_index(struct file_ra_state *ra, pgoff_t index)
+{
+	return (index >= ra->la_index &&
+		index <  ra->readahead_index);
+}
+
+/*
+ * Which method is issuing this read-ahead?
+ */
+static void ra_set_class(struct file_ra_state *ra, enum ra_class ra_class)
+{
+	unsigned long flags_mask;
+	unsigned long flags;
+	unsigned long old_ra_class;
+
+	flags_mask = ~(RA_CLASS_MASK | (RA_CLASS_MASK << RA_CLASS_SHIFT));
+	flags = ra->flags & flags_mask;
+
+	old_ra_class = ra_class_new(ra) << RA_CLASS_SHIFT;
+
+	ra->flags = flags | old_ra_class | ra_class;
+}
+
+/*
+ * Where is the old read-ahead and look-ahead?
+ */
+static void ra_set_index(struct file_ra_state *ra,
+					pgoff_t la_index, pgoff_t ra_index)
+{
+	ra->la_index = la_index;
+	ra->ra_index = ra_index;
+}
+
+/*
+ * Where is the new read-ahead and look-ahead?
+ */
+static void ra_set_size(struct file_ra_state *ra,
+				unsigned long ra_size, unsigned long la_size)
+{
+	ra->readahead_index = ra->ra_index + ra_size;
+	ra->lookahead_index = ra->readahead_index - la_size;
+}
+
+/*
+ * Submit IO for the read-ahead request in file_ra_state.
+ */
+static unsigned long ra_submit(struct file_ra_state *ra,
+			       struct address_space *mapping, struct file *filp)
+{
+	unsigned long ra_size;
+	unsigned long la_size;
+	pgoff_t eof;
+	int actual;
+
+	eof = /* it's a past-the-end index! */
+		DIV_ROUND_UP(i_size_read(mapping->host), PAGE_CACHE_SIZE);
+
+	if (unlikely(ra->ra_index >= eof))
+		return 0;
+
+	/*
+	 * Snap to EOF, if the request
+	 * 	- crossed the EOF boundary;
+	 * 	- is close to EOF(explained below).
+	 *
+	 * Imagine a file sized 18 pages, and we dicided to read-ahead the
+	 * first 16 pages. It is highly possible that in the near future we
+	 * will have to do another read-ahead for the remaining 2 pages,
+	 * which is an unfavorable small I/O.
+	 *
+	 * So we prefer to take a bit risk to enlarge the current read-ahead,
+	 * to eliminate possible future small I/O.
+	 */
+	if (ra->readahead_index + MIN_RA_PAGES
+				+ ra_readahead_size(ra) / 4 > eof) {
+		ra->readahead_index = eof;
+		if (ra->lookahead_index > eof)
+		    ra->lookahead_index = eof;
+	}
+
+	/* Take down the current read-ahead aging value. */
+	ra->age = nr_scanned_pages_node(numa_node_id());
+
+	ra_size = ra_readahead_size(ra);
+	la_size = ra_lookahead_size(ra);
+	actual = __do_page_cache_readahead(mapping, filp,
+					ra->ra_index, ra_size, la_size);
+
+#ifdef CONFIG_DEBUG_READAHEAD
+	if (ra->flags & RA_FLAG_MMAP)
+		ra_account(ra, RA_EVENT_READAHEAD_MMAP, actual);
+	if (ra->readahead_index == eof)
+		ra_account(ra, RA_EVENT_READAHEAD_EOF, actual);
+	if (la_size)
+		ra_account(ra, RA_EVENT_LOOKAHEAD, la_size);
+	if (ra_size > actual)
+		ra_account(ra, RA_EVENT_IO_CACHE_HIT, ra_size - actual);
+	ra_account(ra, RA_EVENT_READAHEAD, actual);
+
+	dprintk("readahead-%s(ino=%lu, index=%lu:%lu, size=%lu-%lu) = %d\n",
+			ra_class_name[ra_class_new(ra)],
+			mapping->host->i_ino, ra->la_index,
+			ra->ra_index, ra_size, la_size, actual);
+#endif /* CONFIG_DEBUG_READAHEAD */
+
+	return actual;
+}
+
+/*
  * ra_min is mainly determined by the size of cache memory. Reasonable?
  *
  * Table of concrete numbers for 4KB page size:
@@ -890,10 +1039,10 @@ static void ra_account(struct file_ra_st
 		return;
 
 	if (e == RA_EVENT_READAHEAD_HIT && pages < 0) {
-		c = (ra->flags >> RA_CLASS_SHIFT) & RA_CLASS_MASK;
+		c = ra_class_old(ra);
 		pages = -pages;
 	} else if (ra)
-		c = ra->flags & RA_CLASS_MASK;
+		c = ra_class_new(ra);
 	else
 		c = RA_CLASS_NONE;
 

--

^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH 14/28] readahead: state based method
       [not found] ` <20061115075028.829507795@localhost.localdomain>
@ 2006-11-15  7:50   ` Wu Fengguang
  0 siblings, 0 replies; 24+ messages in thread
From: Wu Fengguang @ 2006-11-15  7:50 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel

[-- Attachment #1: readahead-state-based-method.patch --]
[-- Type: text/plain, Size: 6894 bytes --]

This is the fast code path of adaptive read-ahead.

MAJOR STEPS
===========

        - estimate a thrashing safe ra_size;
        - assemble the next read-ahead request in file_ra_state;
        - submit it.

THE REFERENCE MODEL
===================

        1. inactive list has constant length and page flow speed
        2. the observed stream receives a steady flow of read requests
        3. no page activation, so that the inactive list forms a pipe

With that we get the picture showed below.

 |<------------------------- constant length ------------------------->|
 <<<<<<<<<<<<<<<<<<<<<<<<< steady flow of pages <<<<<<<<<<<<<<<<<<<<<<<<
 +---------------------------------------------------------------------+
 |tail                        inactive list                        head|
 |   =======                  ==========----                           |
 |   chunk A(stale pages)     chunk B(stale + fresh pages)             |
 +---------------------------------------------------------------------+

REAL WORLD ISSUES
=================

Real world workloads will always have fluctuations (violation of assumption
1 and 2). To counteract it, a tunable parameter readahead_ratio is introduced
to make the estimation conservative enough. Violation of assumption 3 will
not lead to thrashing, it is there just for simplicity of discussion.

Signed-off-by: Wu Fengguang <wfg@mail.ustc.edu.cn>
DESC
readahead: state based method - stand-alone size limit code
EDESC
From: Wu Fengguang <wfg@mail.ustc.edu.cn>

Separate out the readahead/lookahead sizes limiting code, and put them to
stand-alone limit_rala() function.

Signed-off-by: Wu Fengguang <wfg@mail.ustc.edu.cn>
Signed-off-by: Andrew Morton <akpm@osdl.org>
--- linux-2.6.19-rc5-mm2.orig/mm/readahead.c
+++ linux-2.6.19-rc5-mm2/mm/readahead.c
@@ -18,6 +18,8 @@
 #include <linux/pagevec.h>
 #include <linux/buffer_head.h>
 
+#include <asm/div64.h>
+
 /*
  * Convienent macros for min/max read-ahead pages.
  * Note that MAX_RA_PAGES is rounded down, while MIN_RA_PAGES is rounded up.
@@ -964,6 +966,161 @@ static int ra_submit(struct file_ra_stat
 }
 
 /*
+ * Deduce the read-ahead/look-ahead size from primitive values.
+ *
+ * Input:
+ *	- @ra_size stores the estimated thrashing-threshold.
+ *	- @la_size stores the look-ahead size of previous request.
+ */
+static int adjust_rala(unsigned long ra_max,
+			unsigned long *ra_size, unsigned long *la_size)
+{
+	/*
+	 * Substract the old look-ahead to get real safe size for the next
+	 * read-ahead request.
+	 */
+	if (*ra_size > *la_size)
+		*ra_size -= *la_size;
+	else {
+		ra_account(NULL, RA_EVENT_READAHEAD_SHRINK, *ra_size);
+		return 0;
+	}
+
+	/*
+	 * Set new la_size according to the (still large) ra_size.
+	 */
+	*la_size = *ra_size / LOOKAHEAD_RATIO;
+
+	return 1;
+}
+
+static void limit_rala(unsigned long ra_max, unsigned long la_old,
+			unsigned long *ra_size, unsigned long *la_size)
+{
+	unsigned long stream_shift;
+
+	/*
+	 * Apply basic upper limits.
+	 */
+	if (*ra_size > ra_max)
+		*ra_size = ra_max;
+	if (*la_size > *ra_size)
+		*la_size = *ra_size;
+
+	/*
+	 * Make sure stream_shift is not too small.
+	 * (So that the next global_shift will not be too small.)
+	 */
+	stream_shift = la_old + (*ra_size - *la_size);
+	if (stream_shift < *ra_size / 4)
+		*la_size -= (*ra_size / 4 - stream_shift);
+}
+
+/*
+ * The function estimates two values:
+ * 1. thrashing-threshold for the current stream
+ *    It is returned to make the next read-ahead request.
+ * 2. the remained safe space for the current chunk
+ *    It will be checked to ensure that the current chunk is safe.
+ *
+ * The computation will be pretty accurate under heavy load, and will vibrate
+ * more on light load(with small global_shift), so the grow speed of ra_size
+ * must be limited, and a moderate large stream_shift must be insured.
+ *
+ * The following figure illustrates the formula used in the function:
+ * 	While the stream reads stream_shift pages inside the chunks,
+ * 	the chunks are shifted global_shift pages inside inactive_list.
+ * So
+ * 	thrashing_threshold = free_mem * stream_shift / global_shift;
+ *
+ *
+ *      chunk A                    chunk B
+ *                          |<=============== global_shift ================|
+ *  +-------------+         +-------------------+                          |
+ *  |       #     |         |           #       |            inactive_list |
+ *  +-------------+         +-------------------+                     head |
+ *          |---->|         |---------->|
+ *             |                  |
+ *             +-- stream_shift --+
+ */
+static unsigned long compute_thrashing_threshold(struct file_ra_state *ra,
+							unsigned long *remain)
+{
+	unsigned long global_size;
+	unsigned long global_shift;
+	unsigned long stream_shift;
+	unsigned long ra_size;
+	uint64_t ll;
+
+	global_size = nr_free_inactive_pages_node(numa_node_id());
+	global_shift = nr_scanned_pages_node(numa_node_id()) - ra->age;
+	global_shift |= 1UL;
+	stream_shift = ra_invoke_interval(ra);
+
+	/* future safe space */
+	ll = (uint64_t) stream_shift * (global_size >> 9) * readahead_ratio * 5;
+	do_div(ll, global_shift);
+	ra_size = ll;
+
+	/* remained safe space */
+	if (global_size > global_shift) {
+		ll = (uint64_t) stream_shift * (global_size - global_shift);
+		do_div(ll, global_shift);
+		*remain = ll;
+	} else
+		*remain = 0;
+
+	ddprintk("compute_thrashing_threshold: "
+			"at %lu ra %lu=%lu*%lu/%lu, remain %lu for %lu\n",
+			ra->readahead_index, ra_size,
+			stream_shift, global_size, global_shift,
+			*remain, ra_lookahead_size(ra));
+
+	return ra_size;
+}
+
+/*
+ * Main function for file_ra_state based read-ahead.
+ */
+static unsigned long
+state_based_readahead(struct address_space *mapping, struct file *filp,
+			struct file_ra_state *ra,
+			struct page *page, pgoff_t offset,
+			unsigned long req_size, unsigned long ra_max)
+{
+	unsigned long ra_old, ra_size;
+	unsigned long la_old, la_size;
+	unsigned long remain_space;
+	unsigned long growth_limit;
+
+	la_old = la_size = ra->readahead_index - offset;
+	ra_old = ra_readahead_size(ra);
+	ra_size = compute_thrashing_threshold(ra, &remain_space);
+
+	if (page && remain_space <= la_size) {
+		rescue_pages(page, la_size);
+		return 0;
+	}
+
+	growth_limit = req_size;
+	growth_limit += ra_max / 16;
+	growth_limit += (2 + readahead_ratio / 64) * ra_old;
+	if (growth_limit > ra_max)
+	    growth_limit = ra_max;
+
+	if (!adjust_rala(growth_limit, &ra_size, &la_size))
+		return 0;
+
+	limit_rala(growth_limit, la_old, &ra_size, &la_size);
+
+	ra_set_class(ra, RA_CLASS_STATE);
+	ra_set_index(ra, offset, ra->readahead_index);
+	ra_set_size(ra, ra_size, la_size);
+
+	return ra_submit(ra, mapping, filp);
+}
+
+/*
  * ra_min is mainly determined by the size of cache memory. Reasonable?
  *
  * Table of concrete numbers for 4KB page size:

--

^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH 15/28] readahead: context based method
       [not found] ` <20061115075029.205178794@localhost.localdomain>
@ 2006-11-15  7:50   ` Wu Fengguang
  0 siblings, 0 replies; 24+ messages in thread
From: Wu Fengguang @ 2006-11-15  7:50 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, Wu Fengguang

[-- Attachment #1: readahead-context-based-method.patch --]
[-- Type: text/plain, Size: 18079 bytes --]

This is the slow code path of adaptive read-ahead.

No valid state info is available, so the page cache is queried to obtain
the required position/timing info. This kind of estimation is more conservative
than the stateful method, and also fluctuates more on load variance.

HOW IT WORKS
============

It works by peeking into the file cache and check if there are any history
pages present or accessed. In this way it can detect almost all forms of
sequential / semi-sequential read patterns, e.g.
        - parallel / interleaved sequential scans on one file
        - sequential reads across file open/close
        - mixed sequential / random accesses
        - sparse / skimming sequential read

HOW DATABASES CAN BENEFIT FROM IT
=================================

The adaptive readahead might help db performance in the following cases:
        - concurrent sequential scans
        - sequential scan on a fragmented table
        - index scan with clustered matches
        - index scan on majority rows (in case the planner goes wrong)

However some databases do not rely on kernel readahead, so are unlikely to
benefit from it.

ALGORITHM STEPS
===============

        - look back/forward to find the ra_index;
        - look back to estimate a thrashing safe ra_size;
        - assemble the next read-ahead request in file_ra_state;
        - submit it.

ALGORITHM DYNAMICS
==================

* startup
When a sequential read is detected, chunk size is set to readahead-min
and grows up with each readahead.  The grow speed is controlled by
readahead-ratio.  When readahead-ratio == 100, the new logic grows chunk
sizes exponentially -- like the current logic, but lags behind it at
early steps.

* stabilize
When chunk size reaches readahead-max, or comes close to
        (readahead-ratio * thrashing-threshold)
it stops growing and stay there.

The main difference with the stock readahead logic occurs at and after
the time chunk size stops growing:
     -  The current logic grows chunk size exponentially in normal and
        decreases it by 2 each time thrashing is seen. That can lead to
        thrashing with almost every readahead for very slow streams.
     -  The new logic can stop at a size below the thrashing-threshold,
        and stay there stable.

* on stream speed up or system load fall
thrashing-threshold follows up and chunk size is likely to be enlarged.

* on stream slow down or system load rocket up
thrashing-threshold falls down.
If thrashing happened, the next read would be treated as a random read,
and with another read the chunk-size-growing-phase is restarted.

For a slow stream that has (thrashing-threshold < readahead-max):
      - When readahead-ratio = 100, there is only one chunk in cache at
        most time;
      - When readahead-ratio = 50, there are two chunks in cache at most
        time.
      - Lowing readahead-ratio helps gracefully cut down the chunk size
        without thrashing.

OVERHEADS
=========

The context based method has some overheads over the stateful method, due
to more lockings and memory scans.

Running oprofile on the following command shows the following differences:

	# diff sparse sparse1

	total oprofile samples	run1	run2
	stateful method		560482	558696
	stateless method	564463	559413

So the average overhead is about 0.4%.

Detailed diffprofile data:

# diffprofile oprofile.50.stateful oprofile.50.stateless
      2998    41.1% isolate_lru_pages
      2669    26.4% shrink_zone
      1822    14.7% system_call
      1419    27.6% radix_tree_delete
      1376    14.8% _raw_write_lock
      1279    27.4% free_pages_bulk
      1111    12.0% _raw_write_unlock
      1035    43.3% free_hot_cold_page
       849    15.3% unlock_page
       786    29.6% page_referenced
       710     4.6% kmap_atomic
       651    26.4% __pagevec_release_nonlru
       586    16.1% __rmqueue
       578    11.3% find_get_page
       481    15.5% page_waitqueue
       440     6.6% add_to_page_cache
       420    33.7% fget_light
       260     4.3% get_page_from_freelist
       223    13.7% find_busiest_group
       221    35.1% mutex_debug_check_no_locks_freed
       211     0.0% radix_tree_scan_hole
       198    35.5% delay_tsc
       195    14.8% ext3_get_branch
       182    12.6% profile_tick
       173     0.0% radix_tree_cache_lookup_node
       164    22.9% find_next_bit
       162    50.3% page_cache_readahead_adaptive
...
       106     0.0% radix_tree_scan_hole_backward
...
       -51    -7.6% radix_tree_preload
...
       -68    -2.1% radix_tree_insert
...
       -87    -2.0% mark_page_accessed
       -88    -2.0% __pagevec_lru_add
      -103    -7.7% softlockup_tick
      -107   -71.8% free_block
      -122   -77.7% do_IRQ
      -132   -82.0% do_timer
      -140   -47.1% ack_edge_ioapic_vector
      -168   -81.2% handle_IRQ_event
      -192   -35.2% irq_entries_start
      -204   -14.8% rw_verify_area
      -214   -13.2% account_system_time
      -233    -9.5% radix_tree_lookup_node
      -234   -16.6% scheduler_tick
      -259   -58.7% __do_IRQ
      -266    -6.8% put_page
      -318   -29.3% rcu_pending
      -333    -3.0% do_generic_mapping_read
      -337   -28.3% hrtimer_run_queues
      -493   -27.0% __rcu_pending
     -1038    -9.4% default_idle
     -3323    -3.5% __copy_to_user_ll
    -10331    -5.9% do_mpage_readpage

# diffprofile oprofile.50.stateful2 oprofile.50.stateless2
      1739     1.1% do_mpage_readpage
       833     0.9% __copy_to_user_ll
       340    21.3% find_busiest_group
       288     9.5% free_hot_cold_page
       261     4.6% _raw_read_unlock
       239     3.9% get_page_from_freelist
       201     0.0% radix_tree_scan_hole
       163    14.3% raise_softirq
       160     0.0% radix_tree_cache_lookup_node
       160    11.8% update_process_times
       136     9.3% fget_light
       121    35.1% page_cache_readahead_adaptive
       117    36.0% restore_all
       117     2.8% mark_page_accessed
       109     6.4% rebalance_tick
       107     9.4% sys_read
       102     0.0% radix_tree_scan_hole_backward
...
        63     4.0% readahead_cache_hit
...
       -10   -15.9% radix_tree_node_alloc
...
       -39    -1.7% radix_tree_lookup_node
       -39   -10.3% irq_entries_start
       -43    -1.3% radix_tree_insert
...
       -47    -4.6% __do_page_cache_readahead
       -64    -9.3% radix_tree_preload
       -65    -5.4% rw_verify_area
       -65    -2.2% vfs_read
       -70    -4.7% timer_interrupt
       -71    -1.0% __wake_up_bit
       -73    -1.1% radix_tree_delete
       -79   -12.6% __mod_page_state_offset
       -94    -1.8% __find_get_block
       -94    -2.2% __pagevec_lru_add
      -102    -1.7% free_pages_bulk
      -116    -1.3% _raw_read_lock
      -123    -7.4% do_sync_read
      -130    -8.4% ext3_get_blocks_handle
      -142    -3.8% put_page
      -146    -7.9% mpage_readpages
      -147    -5.6% apic_timer_interrupt
      -168    -1.6% _raw_write_unlock
      -172    -5.0% page_referenced
      -206    -3.2% unlock_page
      -212   -15.0% restore_nocheck
      -213    -2.1% default_idle
      -245    -5.0% __rmqueue
      -278    -4.3% find_get_page
      -282    -2.1% system_call
      -287   -11.8% run_timer_softirq
      -300    -2.7% _raw_write_lock
      -420    -3.2% shrink_zone
      -661    -5.7% isolate_lru_pages

Signed-off-by: Wu Fengguang <wfg@mail.ustc.edu.cn>
DESC
readahead: context based method - apply stream_shift size limits to contexta method
EDESC
From: Wu Fengguang <wfg@mail.ustc.edu.cn>

Apply size limits to the contexta method, so that the possible following
stateful method will not be presented too small stream_shift size.

Signed-off-by: Wu Fengguang <wfg@mail.ustc.edu.cn>
DESC
readahead: context based method - fix *remain counting
EDESC
From: Wu Fengguang <wfg@mail.ustc.edu.cn>

*remain in query_page_cache_segment() is over counted by 1, fix it.

Signed-off-by: Wu Fengguang <wfg@mail.ustc.edu.cn>
DESC
readahead: context based method - slow start
EDESC
From: Wu Fengguang <wfg@mail.ustc.edu.cn>

The context method will lead to noticable overhead(readahead miss) on very
sparse random reads.

Having the readahead window to start slowly makes it much better.  But
still startup quick if the user prefers sparse readahead.

Benchmarks of reading randomly 100,000 pages on a 1000,000 pages _sparse_
file:

        ARA before patch          ARA                STOCK
        ================    ================    ================
real    2.779s    2.782s    2.552s    2.606s    2.477s    2.521s
user    1.120s    1.184s    1.133s    1.155s    1.097s    1.159s
sys     1.248s    1.208s    1.093s    1.086s    1.079s    1.064s

Signed-off-by: Wu Fengguang <wfg@mail.ustc.edu.cn>
Signed-off-by: Andrew Morton <akpm@osdl.org>
--- linux-2.6.19-rc5-mm2.orig/mm/readahead.c
+++ linux-2.6.19-rc5-mm2/mm/readahead.c
@@ -1121,6 +1121,300 @@ state_based_readahead(struct address_spa
 }
 
 /*
+ * Page cache context based estimation of read-ahead/look-ahead size/index.
+ *
+ * The logic first looks around to find the start point of next read-ahead,
+ * and then, if necessary, looks backward in the inactive_list to get an
+ * estimation of the thrashing-threshold.
+ *
+ * The estimation theory can be illustrated with figure:
+ *
+ *   chunk A           chunk B                      chunk C                 head
+ *
+ *   l01 l11           l12   l21                    l22
+ *| |-->|-->|       |------>|-->|                |------>|
+ *| +-------+       +-----------+                +-------------+               |
+ *| |   #   |       |       #   |                |       #     |               |
+ *| +-------+       +-----------+                +-------------+               |
+ *| |<==============|<===========================|<============================|
+ *        L0                     L1                            L2
+ *
+ * Let f(l) = L be a map from
+ * 	l: the number of pages read by the stream
+ * to
+ * 	L: the number of pages pushed into inactive_list in the mean time
+ * then
+ * 	f(l01) <= L0
+ * 	f(l11 + l12) = L1
+ * 	f(l21 + l22) = L2
+ * 	...
+ * 	f(l01 + l11 + ...) <= Sum(L0 + L1 + ...)
+ *			   <= Length(inactive_list) = f(thrashing-threshold)
+ *
+ * So the count of countinuous history pages left in the inactive_list is always
+ * a lower estimation of the true thrashing-threshold.
+ */
+
+#if PG_active < PG_referenced
+#  error unexpected page flags order
+#endif
+
+#define PAGE_REFCNT_0           0
+#define PAGE_REFCNT_1           (1 << PG_referenced)
+#define PAGE_REFCNT_2           (1 << PG_active)
+#define PAGE_REFCNT_3           ((1 << PG_active) | (1 << PG_referenced))
+#define PAGE_REFCNT_MASK        PAGE_REFCNT_3
+
+/*
+ * STATUS   REFERENCE COUNT      TYPE
+ *  __                   0      fresh
+ *  _R       PAGE_REFCNT_1      stale
+ *  A_       PAGE_REFCNT_2      disturbed once
+ *  AR       PAGE_REFCNT_3      disturbed twice
+ *
+ *  A/R: Active / Referenced
+ */
+static inline unsigned long page_refcnt(struct page *page)
+{
+        return page->flags & PAGE_REFCNT_MASK;
+}
+
+/*
+ * Now that revisited pages are put into active_list immediately,
+ * we cannot get an accurate estimation of
+ *
+ * 		len(inactive_list) / speed(leader)
+ *
+ * on the situation of two sequential readers that come close enough:
+ *
+ *        chunk 1         chunk 2               chunk 3
+ *      ==========  =============-------  --------------------
+ *                     follower ^                     leader ^
+ *
+ * In this case, using inactive_page_refcnt() in the context based method yields
+ * conservative read-ahead size, while page_refcnt() yields aggressive size.
+ */
+static inline unsigned long inactive_page_refcnt(struct page *page)
+{
+	if (!page || PageActive(page))
+		return 0;
+
+	return page_refcnt(page);
+}
+
+/*
+ * Count/estimate cache hits in range [begin, end).
+ * The estimation is simple and optimistic.
+ */
+#define CACHE_HIT_HASH_KEY	29	/* some prime number */
+static int count_cache_hit(struct address_space *mapping,
+						pgoff_t begin, pgoff_t end)
+{
+	int size = end - begin;
+	int count = 0;
+	int i;
+
+	/*
+	 * The first page may well is chunk head and has been accessed,
+	 * so it is index 0 that makes the estimation optimistic. This
+	 * behavior guarantees a readahead when (size < ra_max) and
+	 * (readahead_hit_rate >= 8).
+	 */
+	read_lock_irq(&mapping->tree_lock);
+	for (i = 0; i < 8;) {
+		struct page *page = radix_tree_lookup(&mapping->page_tree,
+			begin + size * ((i++ * CACHE_HIT_HASH_KEY) & 7) / 8);
+		if (inactive_page_refcnt(page) >= PAGE_REFCNT_1 && ++count >= 2)
+			break;
+	}
+	read_unlock_irq(&mapping->tree_lock);
+
+	return size * count / i;
+}
+
+/*
+ * Look back and check history pages to estimate thrashing-threshold.
+ */
+static unsigned long count_history_pages(struct address_space *mapping,
+					struct file_ra_state *ra,
+					pgoff_t offset, unsigned long ra_max)
+{
+	pgoff_t head;
+	unsigned long count;
+	unsigned long lookback;
+	unsigned long hit_rate;
+
+	/*
+	 * Scan backward and check the near @ra_max pages.
+	 * The count here determines ra_size.
+	 */
+	cond_resched();
+	read_lock_irq(&mapping->tree_lock);
+	head = 1 + radix_tree_scan_hole_backward(&mapping->page_tree,
+							offset - 1, ra_max);
+
+	count = offset - head;
+
+	/*
+	 * Ensure readahead hit rate
+	 */
+	hit_rate = max(readahead_hit_rate, 1);
+	if (count_cache_hit(mapping, head, offset) * hit_rate < count)
+		count = 0;
+
+	/*
+	 * Unnecessary to count more?
+	 */
+	if (count < ra_max)
+		goto out_unlock;
+
+	/*
+	 * Check the far pages coarsely.
+	 * The enlarged count will contribute to the look-ahead size.
+	 */
+	lookback = ra_max * (LOOKAHEAD_RATIO + 1) *
+						100 / (readahead_ratio | 1);
+
+	for (count += ra_max; count < lookback; count += ra_max)
+		if (!__probe_page(mapping, offset - count))
+			break;
+
+out_unlock:
+	read_unlock_irq(&mapping->tree_lock);
+
+	ddprintk("count_history_pages: ino=%lu, idx=%lu, count=%lu\n",
+				mapping->host->i_ino, offset, count);
+
+	return count;
+}
+
+/*
+ * Determine the request parameters for context based read-ahead that extends
+ * from start of file.
+ *
+ * One major weakness of stateless method is the slow scaling up of ra_size.
+ * The logic tries to make up for this in the important case of sequential
+ * reads that extend from start of file. In this case, the ra_size is not
+ * chosen to make the whole next chunk safe (as in normal ones). Only part of
+ * which is safe: the tailing look-ahead part is 'unsafe'. However it will be
+ * safeguarded by rescue_pages() when the previous chunks are lost.
+ */
+static void adjust_rala_aggressive(unsigned long ra_max,
+				unsigned long *ra_size, unsigned long *la_size)
+{
+	pgoff_t offset = *ra_size;
+
+	*ra_size -= min(*ra_size, *la_size);
+	*la_size = offset;
+	*ra_size += *la_size;
+}
+
+/*
+ * Main function for page context based read-ahead.
+ *
+ * RETURN VALUE		HINT
+ *      1		@ra contains a valid ra-request, please submit it
+ *      0		no seq-pattern discovered, please try the next method
+ *     -1		please don't do _any_ readahead
+ */
+static int
+try_context_based_readahead(struct address_space *mapping,
+			struct file_ra_state *ra,
+			struct page *page, pgoff_t offset,
+			unsigned long ra_min, unsigned long ra_max)
+{
+	pgoff_t start;
+	unsigned long ra_size;
+	unsigned long la_size;
+
+	/* Check if there is a segment of history pages, and its end index.
+	 * Based on which we decide whether and where to start read-ahead.
+	 *
+	 * Case 1: we have a current page.
+	 * 	   Search forward for a nearby hole.
+	 */
+	if (page) {
+		unsigned long max_scan = ra_max + ra_min;
+		start = radix_tree_scan_hole(&mapping->page_tree,
+							offset, max_scan);
+		if (start != 0 && start - offset < max_scan)
+			goto has_history_pages;
+		else
+			return -1;
+	}
+
+	/* Case 2: current page is missing; previous page is present.
+	 *         Just do read-ahead from the current index on.
+	 * There's clear sign of sequential reading. It can be
+	 * 	a) seek => read => this read
+	 * 	b) cache hit read(s) => this read
+	 * We either just detected a new sequence of sequential reads,
+	 * or should quickly resume readahead after the cache hit.
+	 */
+	if (offset == ra->prev_page + 1) {
+		start = offset;
+		goto has_history_pages;
+	}
+
+	/* Case 2x: the same context info as 2.
+	 * It can be the early stage of semi-sequential reads(interleaved/nfsd),
+	 * or an ugly random one.  So be conservative.
+	 */
+	if (readahead_hit_rate && probe_page(mapping, offset - 1)) {
+		start = offset;
+		if (ra_min > 2 * readahead_hit_rate)
+		    ra_min = 2 * readahead_hit_rate;
+		goto has_history_pages;
+	}
+
+	/* Case 3: no current/previous pages;
+	 *         sparse read-ahead is enabled: ok, be aggressive.
+	 *         Check if there's any adjecent history pages.
+	 */
+	if (readahead_hit_rate > 1) {
+		start = radix_tree_scan_data_backward(&mapping->page_tree,
+							       offset, ra_min);
+		if (start != ULONG_MAX && offset - start < ra_min) {
+			ra_min += offset - start;
+			offset = ++start; /* pretend the request starts here */
+			goto has_history_pages;
+		}
+	}
+
+	return 0;
+
+has_history_pages:
+	ra_size = count_history_pages(mapping, ra, offset, ra_max);
+	if (!ra_size)
+		return 0;
+
+	la_size = start - offset;
+	if (page && ra_size < la_size) {
+		if (ra_size < offset)
+			rescue_pages(page, la_size);
+		return -1;
+	}
+
+	if (ra_size >= offset) {
+		ra_size = offset;
+		adjust_rala_aggressive(ra_max, &ra_size, &la_size);
+		ra_set_class(ra, RA_CLASS_CONTEXT_AGGRESSIVE);
+	} else {
+		ra_size = max(ra_min, ra_size * readahead_ratio / 100);
+		if (!adjust_rala(ra_max, &ra_size, &la_size))
+			return -1;
+		ra_set_class(ra, RA_CLASS_CONTEXT);
+	}
+
+	limit_rala(ra_max, start - offset, &ra_size, &la_size);
+
+	ra_set_index(ra, offset, start);
+	ra_set_size(ra, ra_size, la_size);
+
+	return 1;
+}
+
+/*
  * ra_min is mainly determined by the size of cache memory. Reasonable?
  *
  * Table of concrete numbers for 4KB page size:

--

^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH 16/28] readahead: initial method - guiding sizes
       [not found] ` <20061115075029.519507130@localhost.localdomain>
@ 2006-11-15  7:50   ` Wu Fengguang
  0 siblings, 0 replies; 24+ messages in thread
From: Wu Fengguang @ 2006-11-15  7:50 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, Wu Fengguang

[-- Attachment #1: readahead-initial-method-guiding-sizes.patch --]
[-- Type: text/plain, Size: 2509 bytes --]

Introduce three guiding sizes for the initial readahead method.
	- ra_pages0:	   min readahead on start-of-file
	- ra_thrash_bytes: estimated thrashing threshold

ra_thrash_bytes defaults to large value:
	- most systems don't have the danger of thrashing
	- it increases slowly and drops rapidly

Signed-off-by: Wu Fengguang <wfg@mail.ustc.edu.cn>
Signed-off-by: Andrew Morton <akpm@osdl.org>
--- linux-2.6.19-rc5-mm2.orig/include/linux/backing-dev.h
+++ linux-2.6.19-rc5-mm2/include/linux/backing-dev.h
@@ -26,6 +26,8 @@ typedef int (congested_fn)(void *, int);
 
 struct backing_dev_info {
 	unsigned long ra_pages;	/* max readahead in PAGE_CACHE_SIZE units */
+	unsigned long ra_pages0; /* min readahead on start of file */
+	unsigned long ra_thrash_bytes;	/* estimated thrashing threshold */
 	unsigned long state;	/* Always use atomic bitops on this */
 	unsigned int capabilities; /* Device capabilities */
 	congested_fn *congested_fn; /* Function pointer if device is md/dm */
--- linux-2.6.19-rc5-mm2.orig/mm/readahead.c
+++ linux-2.6.19-rc5-mm2/mm/readahead.c
@@ -32,6 +32,9 @@
  * Adaptive read-ahead parameters.
  */
 
+/* Default initial read-ahead size. */
+#define INITIAL_RA_PAGES  DIV_ROUND_UP(64*1024, PAGE_CACHE_SIZE)
+
 /* In laptop mode, poll delayed look-ahead on every ## pages read. */
 #define LAPTOP_POLL_INTERVAL 16
 
@@ -114,6 +117,8 @@ EXPORT_SYMBOL(default_unplug_io_fn);
 
 struct backing_dev_info default_backing_dev_info = {
 	.ra_pages	= MAX_RA_PAGES,
+	.ra_pages0	= INITIAL_RA_PAGES,
+	.ra_thrash_bytes = MAX_RA_PAGES * PAGE_CACHE_SIZE,
 	.state		= 0,
 	.capabilities	= BDI_CAP_MAP_COPY,
 	.unplug_io_fn	= default_unplug_io_fn,
--- linux-2.6.19-rc5-mm2.orig/block/ll_rw_blk.c
+++ linux-2.6.19-rc5-mm2/block/ll_rw_blk.c
@@ -214,9 +214,6 @@ void blk_queue_make_request(request_queu
 	blk_queue_max_phys_segments(q, MAX_PHYS_SEGMENTS);
 	blk_queue_max_hw_segments(q, MAX_HW_SEGMENTS);
 	q->make_request_fn = mfn;
-	q->backing_dev_info.ra_pages = (VM_MAX_READAHEAD * 1024) / PAGE_CACHE_SIZE;
-	q->backing_dev_info.state = 0;
-	q->backing_dev_info.capabilities = BDI_CAP_MAP_COPY;
 	blk_queue_max_sectors(q, SAFE_MAX_SECTORS);
 	blk_queue_hardsect_size(q, 512);
 	blk_queue_dma_alignment(q, 511);
@@ -1848,6 +1845,7 @@ request_queue_t *blk_alloc_queue_node(gf
 	q->kobj.ktype = &queue_ktype;
 	kobject_init(&q->kobj);
 
+	q->backing_dev_info = default_backing_dev_info;
 	q->backing_dev_info.unplug_io_fn = blk_backing_dev_unplug;
 	q->backing_dev_info.unplug_io_data = q;
 

--

^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH 17/28] readahead: initial method - thrashing guard size
       [not found] ` <20061115075029.869472273@localhost.localdomain>
@ 2006-11-15  7:50   ` Wu Fengguang
  0 siblings, 0 replies; 24+ messages in thread
From: Wu Fengguang @ 2006-11-15  7:50 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, Wu Fengguang

[-- Attachment #1: readahead-initial-method-thrashing-guard-size.patch --]
[-- Type: text/plain, Size: 1568 bytes --]

backing_dev_info.ra_thrash_bytes is dynamicly updated to be a little above
the thrashing safe read-ahead size. It is used in the initial method where
the thrashing threshold for the particular reader is still unknown.

Signed-off-by: Wu Fengguang <wfg@mail.ustc.edu.cn>
Signed-off-by: Andrew Morton <akpm@osdl.org>
--- linux-2.6.19-rc5-mm2.orig/mm/readahead.c
+++ linux-2.6.19-rc5-mm2/mm/readahead.c
@@ -822,6 +822,22 @@ out:
 }
 
 /*
+ * Update `backing_dev_info.ra_thrash_bytes' to be a _biased_ average of
+ * read-ahead sizes. Which makes it an a-bit-risky(*) estimation of the
+ * _minimal_ read-ahead thrashing threshold on the device.
+ *
+ * (*) Note that being a bit risky can _help_ overall performance.
+ */
+static void update_ra_thrash_bytes(struct backing_dev_info *bdi,
+							unsigned long ra_size)
+{
+	ra_size <<= PAGE_CACHE_SHIFT;
+	bdi->ra_thrash_bytes = (bdi->ra_thrash_bytes < ra_size) ?
+				(ra_size + bdi->ra_thrash_bytes * 127) / 128:
+				(ra_size + bdi->ra_thrash_bytes *   7) /   8;
+}
+
+/*
  * Some helpers for querying/building a read-ahead request.
  *
  * Diagram for some variable names used frequently:
@@ -1118,6 +1134,10 @@ state_based_readahead(struct address_spa
 
 	limit_rala(growth_limit, la_old, &ra_size, &la_size);
 
+	/* ra_size in its _steady_ state reflects thrashing threshold */
+	if (page && ra_old + ra_old / 8 >= ra_size)
+		update_ra_thrash_bytes(mapping->backing_dev_info, ra_size);
+
 	ra_set_class(ra, RA_CLASS_STATE);
 	ra_set_index(ra, offset, ra->readahead_index);
 	ra_set_size(ra, ra_size, la_size);

--

^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH 18/28] readahead: initial method - user recommended size
       [not found] ` <20061115075030.229339867@localhost.localdomain>
@ 2006-11-15  7:50   ` Wu Fengguang
  0 siblings, 0 replies; 24+ messages in thread
From: Wu Fengguang @ 2006-11-15  7:50 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, Wu Fengguang

[-- Attachment #1: readahead-initial-method-user-recommended-size.patch --]
[-- Type: text/plain, Size: 1709 bytes --]

backing_dev_info.ra_pages0 is a user configurable parameter that controls
the readahead size on start-of-file.

Signed-off-by: Wu Fengguang <wfg@mail.ustc.edu.cn>
Signed-off-by: Andrew Morton <akpm@osdl.org>
--- linux-2.6.19-rc5-mm2.orig/block/ll_rw_blk.c
+++ linux-2.6.19-rc5-mm2/block/ll_rw_blk.c
@@ -3795,6 +3795,24 @@ queue_ra_store(struct request_queue *q, 
 	return ret;
 }
 
+static ssize_t queue_initial_ra_show(struct request_queue *q, char *page)
+{
+	int kb = q->backing_dev_info.ra_pages0 << (PAGE_CACHE_SHIFT - 10);
+
+	return queue_var_show(kb, (page));
+}
+
+static ssize_t
+queue_initial_ra_store(struct request_queue *q, const char *page, size_t count)
+{
+	unsigned long kb;
+	ssize_t ret = queue_var_store(&kb, page, count);
+
+	q->backing_dev_info.ra_pages0 = kb >> (PAGE_CACHE_SHIFT - 10);
+
+	return ret;
+}
+
 static ssize_t queue_max_sectors_show(struct request_queue *q, char *page)
 {
 	int max_sectors_kb = q->max_sectors >> 1;
@@ -3852,6 +3870,12 @@ static struct queue_sysfs_entry queue_ra
 	.store = queue_ra_store,
 };
 
+static struct queue_sysfs_entry queue_initial_ra_entry = {
+	.attr = {.name = "initial_ra_kb", .mode = S_IRUGO | S_IWUSR },
+	.show = queue_initial_ra_show,
+	.store = queue_initial_ra_store,
+};
+
 static struct queue_sysfs_entry queue_max_sectors_entry = {
 	.attr = {.name = "max_sectors_kb", .mode = S_IRUGO | S_IWUSR },
 	.show = queue_max_sectors_show,
@@ -3872,6 +3896,7 @@ static struct queue_sysfs_entry queue_io
 static struct attribute *default_attrs[] = {
 	&queue_requests_entry.attr,
 	&queue_ra_entry.attr,
+	&queue_initial_ra_entry.attr,
 	&queue_max_hw_sectors_entry.attr,
 	&queue_max_sectors_entry.attr,
 	&queue_iosched_entry.attr,

--

^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH 20/28] readahead: backward prefetching method
       [not found] ` <20061115075030.942942737@localhost.localdomain>
@ 2006-11-15  7:50   ` Wu Fengguang
  0 siblings, 0 replies; 24+ messages in thread
From: Wu Fengguang @ 2006-11-15  7:50 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, Wu Fengguang

[-- Attachment #1: readahead-backward-prefetching-method.patch --]
[-- Type: text/plain, Size: 1701 bytes --]

Readahead policy for reading backward.

Signed-off-by: Wu Fengguang <wfg@mail.ustc.edu.cn>
DESC
readahead: backward prefetching method - add use case comment
EDESC
From: Wu Fengguang <wfg@mail.ustc.edu.cn>

Backward prefetching is vital to structural analysis and some other
scientific applications.  Comment this use case.

Signed-off-by: Wu Fengguang <wfg@mail.ustc.edu.cn>
Signed-off-by: Andrew Morton <akpm@osdl.org>
--- linux-2.6.19-rc5-mm2.orig/mm/readahead.c
+++ linux-2.6.19-rc5-mm2/mm/readahead.c
@@ -1476,6 +1476,52 @@ initial_readahead(struct address_space *
 }
 
 /*
+ * Backward prefetching.
+ *
+ * No look-ahead and thrashing safety guard: should be unnecessary.
+ *
+ * Important for certain scientific arenas(i.e. structural analysis).
+ */
+static int
+try_backward_prefetching(struct file_ra_state *ra, pgoff_t offset,
+			 unsigned long size, unsigned long ra_max)
+{
+	pgoff_t prev = ra->prev_page;
+
+	/* Reading backward? */
+	if (offset >= prev)
+		return 0;
+
+	/* Close enough? */
+	size += readahead_hit_rate;
+	if (offset + 2 * size <= prev)
+		return 0;
+
+	if (ra_class_new(ra) == RA_CLASS_BACKWARD && ra_has_index(ra, prev)) {
+		prev = ra->la_index;
+		size += 2 * ra_readahead_size(ra);
+	} else
+		size *= 2;
+
+	if (size > ra_max)
+		size = ra_max;
+	if (size > prev)
+		size = prev;
+
+	/* The readahead-request covers the read-request? */
+	if (offset < prev - size)
+		return 0;
+
+	offset = prev - size;
+
+	ra_set_class(ra, RA_CLASS_BACKWARD);
+	ra_set_index(ra, offset, offset);
+	ra_set_size(ra, size, 0);
+
+	return 1;
+}
+
+/*
  * ra_min is mainly determined by the size of cache memory. Reasonable?
  *
  * Table of concrete numbers for 4KB page size:

--

^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH 21/28] readahead: thrashing recovery method
       [not found] ` <20061115075031.286178806@localhost.localdomain>
@ 2006-11-15  7:50   ` Wu Fengguang
  0 siblings, 0 replies; 24+ messages in thread
From: Wu Fengguang @ 2006-11-15  7:50 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, Wu Fengguang

[-- Attachment #1: readahead-thrashing-recovery-method.patch --]
[-- Type: text/plain, Size: 1732 bytes --]

Readahead policy after thrashing.

It tries to recover gracefully from the thrashing.

Signed-off-by: Wu Fengguang <wfg@mail.ustc.edu.cn>
Signed-off-by: Andrew Morton <akpm@osdl.org>
--- linux-2.6.19-rc5-mm2.orig/mm/readahead.c
+++ linux-2.6.19-rc5-mm2/mm/readahead.c
@@ -1522,6 +1522,50 @@ try_backward_prefetching(struct file_ra_
 }
 
 /*
+ * Readahead thrashing recovery.
+ */
+static unsigned long
+thrashing_recovery_readahead(struct address_space *mapping,
+				struct file *filp, struct file_ra_state *ra,
+				pgoff_t offset, unsigned long ra_max)
+{
+	unsigned long ra_size;
+
+#ifdef CONFIG_DEBUG_READAHEAD
+	if (probe_page(mapping, offset - 1))
+		ra_account(ra, RA_EVENT_READAHEAD_MUTILATE,
+						ra->readahead_index - offset);
+	ra_account(ra, RA_EVENT_READAHEAD_THRASHING,
+						ra->readahead_index - offset);
+#endif
+
+	/*
+	 * Some thrashing occur in (ra_index, la_index], in which case the
+	 * old read-ahead chunk is lost soon after the new one is allocated.
+	 * Ensure that we recover all needed pages in the old chunk.
+	 */
+	if (offset < ra->ra_index)
+		ra_size = ra->ra_index - offset;
+	else {
+		/* After thrashing, we know the exact thrashing-threshold. */
+		ra_size = offset - ra->ra_index;
+		update_ra_thrash_bytes(mapping->backing_dev_info, ra_size);
+
+		/* And we'd better be a bit conservative. */
+		ra_size = ra_size * 3 / 4;
+	}
+
+	if (ra_size > ra_max)
+		ra_size = ra_max;
+
+	ra_set_class(ra, RA_CLASS_THRASHING);
+	ra_set_index(ra, offset, offset);
+	ra_set_size(ra, ra_size, ra_size / LOOKAHEAD_RATIO);
+
+	return ra_submit(ra, mapping, filp);
+}
+
+/*
  * ra_min is mainly determined by the size of cache memory. Reasonable?
  *
  * Table of concrete numbers for 4KB page size:

--

^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH 22/28] readahead: call scheme
       [not found] ` <20061115075031.524129110@localhost.localdomain>
@ 2006-11-15  7:50   ` Wu Fengguang
  0 siblings, 0 replies; 24+ messages in thread
From: Wu Fengguang @ 2006-11-15  7:50 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, Wu Fengguang

[-- Attachment #1: readahead-call-scheme.patch --]
[-- Type: text/plain, Size: 12742 bytes --]

The read-ahead logic is called when the reading hits
        - a PG_readahead marked page;
        - a non-present page.

ra.prev_page should be properly setup on entrance, and readahead_cache_hit()
should be called on every page reference as a feedback.

This call scheme achieves the following goals:
        - makes all stateful/stateless methods happy;
        - eliminates the cache hit problem naturally;
        - lives in harmony with application managed read-aheads via
          fadvise/madvise.

Signed-off-by: Wu Fengguang <wfg@mail.ustc.edu.cn>
DESC
readahead: initial method - expected read size - fix fastcall
EDESC
From: Fengguang Wu <wfg@mail.ustc.edu.cn>

Remove 'fastcall' directive for function readahead_close().

It has drawn concerns from Andrew Morton. Now I have some benchmarks
on it, and proved it as a _false_ optimization.

The tests are simple runs of the following command over _cached_ dirs:
                time find / > /dev/null

Table of summary(averages):
		user		sys		cpu		total
fastcall:	1.236		4.39		89%		6.2936
non-fastcall:	1.18		4.14166667	92%		5.75416667
stock:		1.25833333	4.14666667	93.3%		5.75866667

Detailed outputs:

readahead patched kernel with fastcall:
noglob find / > /dev/null  1.21s user 4.58s system 90% cpu 6.378 total
noglob find / > /dev/null  1.25s user 4.47s system 86% cpu 6.623 total
noglob find / > /dev/null  1.23s user 4.36s system 90% cpu 6.173 total
noglob find / > /dev/null  1.25s user 4.33s system 92% cpu 6.067 total
noglob find / > /dev/null  1.24s user 4.21s system 87% cpu 6.227 total

readahead patched kernel without fastcall:
noglob find / > /dev/null  1.21s user 4.46s system 95% cpu 5.962 total
noglob find / > /dev/null  1.26s user 4.58s system 94% cpu 6.142 total
noglob find / > /dev/null  1.10s user 3.80s system 86% cpu 5.661 total
noglob find / > /dev/null  1.13s user 3.98s system 95% cpu 5.355 total
noglob find / > /dev/null  1.18s user 4.00s system 89% cpu 5.805 total
noglob find / > /dev/null  1.22s user 4.03s system 93% cpu 5.600 total

stock kernel:
noglob find / > /dev/null  1.22s user 4.24s system 94% cpu 5.803 total
noglob find / > /dev/null  1.31s user 4.21s system 95% cpu 5.784 total
noglob find / > /dev/null  1.27s user 4.24s system 97% cpu 5.676 total
noglob find / > /dev/null  1.34s user 4.21s system 94% cpu 5.844 total
noglob find / > /dev/null  1.26s user 4.08s system 89% cpu 5.935 total
noglob find / > /dev/null  1.15s user 3.90s system 91% cpu 5.510 total

Similar regression has also been found by Voluspa <lista1@comhem.se>:
> "cd /usr ; time find . -type f -exec md5sum {} \;"
>
> 2.6.17-rc5 ------- 2.6.17-rc5-ar
>
> real 21m21.009s -- 21m37.663s
> user 3m20.784s  -- 3m20.701s
> sys  6m34.261s  -- 6m41.735s

Signed-off-by: Wu Fengguang <wfg@mail.ustc.edu.cn>
DESC
readahead: call scheme - no fastcall for readahead_cache_hit()
EDESC
From: Wu Fengguang <wfg@mail.ustc.edu.cn>

Remove 'fastcall' directive for readahead_cache_hit().

It leads to unfavorable performance in the following micro benchmark on i386
with CONFIG_REGPARM=n:

Command:
        time cp cold /dev/null

Summary:
              user     sys     cpu    total
no-fastcall   1.24    24.88    90.9   28.57
fastcall      1.16    25.69    91.5   29.23

Details:
without fastcall:
cp cold /dev/null  1.27s user 24.63s system 91% cpu 28.348 total
cp cold /dev/null  1.17s user 25.09s system 91% cpu 28.653 total
cp cold /dev/null  1.24s user 24.75s system 91% cpu 28.448 total
cp cold /dev/null  1.20s user 25.04s system 91% cpu 28.614 total
cp cold /dev/null  1.31s user 24.67s system 91% cpu 28.499 total
cp cold /dev/null  1.30s user 24.87s system 91% cpu 28.530 total
cp cold /dev/null  1.26s user 24.84s system 91% cpu 28.542 total
cp cold /dev/null  1.16s user 25.15s system 90% cpu 28.925 total

with fastcall:
cp cold /dev/null  1.16s user 26.39s system 91% cpu 30.061 total
cp cold /dev/null  1.25s user 26.53s system 91% cpu 30.378 total
cp cold /dev/null  1.10s user 25.32s system 92% cpu 28.679 total
cp cold /dev/null  1.15s user 25.20s system 91% cpu 28.747 total
cp cold /dev/null  1.19s user 25.38s system 92% cpu 28.841 total
cp cold /dev/null  1.11s user 25.75s system 92% cpu 29.126 total
cp cold /dev/null  1.17s user 25.49s system 91% cpu 29.042 total
cp cold /dev/null  1.17s user 25.49s system 92% cpu 28.970 total

Signed-off-by: Wu Fengguang <wfg@mail.ustc.edu.cn>
DESC
readahead-call-scheme fix
EDESC
From: Mike Galbraith <efault@gmx.de>

On Thu, 2006-08-10 at 02:19 -0700, Andrew Morton wrote:

> It would be interesting to try disabling CONFIG_ADAPTIVE_READAHEAD -
> perhaps that got broken.

A typo was pinning pagecache.  Fixes leak encountered with rpm -qaV.

Signed-off-by: Mike Galbraith <efault@gmx.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
--- linux-2.6.19-rc5-mm2.orig/include/linux/mm.h
+++ linux-2.6.19-rc5-mm2/include/linux/mm.h
@@ -1067,6 +1067,22 @@ unsigned long page_cache_readahead(struc
 void handle_ra_miss(struct address_space *mapping, 
 		    struct file_ra_state *ra, pgoff_t offset);
 unsigned long max_sane_readahead(unsigned long nr);
+unsigned long
+page_cache_readahead_adaptive(struct address_space *mapping,
+				struct file_ra_state *ra,
+				struct file *filp,
+				struct page *page,
+				pgoff_t offset,
+				unsigned long size);
+
+#if defined(CONFIG_DEBUG_READAHEAD)
+void readahead_cache_hit(struct file_ra_state *ra, struct page *page);
+#else
+static inline void readahead_cache_hit(struct file_ra_state *ra,
+					struct page *page)
+{
+}
+#endif
 
 #ifdef CONFIG_ADAPTIVE_READAHEAD
 extern int readahead_ratio;
--- linux-2.6.19-rc5-mm2.orig/mm/filemap.c
+++ linux-2.6.19-rc5-mm2/mm/filemap.c
@@ -974,16 +974,33 @@ void do_generic_mapping_read(struct addr
 		nr = nr - offset;
 
 		cond_resched();
-		if (index == next_index)
+
+		if (!prefer_adaptive_readahead() && index == next_index)
 			next_index = page_cache_readahead(mapping, &ra, filp,
 					index, last_index - index);
 
 find_page:
 		page = find_get_page(mapping, index);
+		if (prefer_adaptive_readahead()) {
+			if (unlikely(page == NULL)) {
+				ra.prev_page = prev_index;
+				page_cache_readahead_adaptive(mapping,
+						&ra, filp, NULL,
+						index, last_index - index);
+				page = find_get_page(mapping, index);
+			} else if (PageReadahead(page)) {
+				ra.prev_page = prev_index;
+				page_cache_readahead_adaptive(mapping,
+						&ra, filp, page,
+						index, last_index - index);
+			}
+		}
 		if (unlikely(page == NULL)) {
-			handle_ra_miss(mapping, &ra, index);
+			if (!prefer_adaptive_readahead())
+				handle_ra_miss(mapping, &ra, index);
 			goto no_cached_page;
 		}
+		readahead_cache_hit(&ra, page);
 		if (!PageUptodate(page))
 			goto page_not_up_to_date;
 page_ok:
@@ -1131,6 +1148,8 @@ no_cached_page:
 
 out:
 	*_ra = ra;
+	if (prefer_adaptive_readahead())
+		_ra->prev_page = prev_index;
 
 	*ppos = ((loff_t) index << PAGE_CACHE_SHIFT) + offset;
 	if (cached_page)
@@ -1401,6 +1420,7 @@ struct page *filemap_nopage(struct vm_ar
 	unsigned long size, pgoff;
 	int did_readaround = 0, majmin = VM_FAULT_MINOR;
 
+	ra->flags |= RA_FLAG_MMAP;
 	pgoff = ((address-area->vm_start) >> PAGE_CACHE_SHIFT) + area->vm_pgoff;
 
 retry_all:
@@ -1418,7 +1438,7 @@ retry_all:
 	 *
 	 * For sequential accesses, we use the generic readahead logic.
 	 */
-	if (VM_SequentialReadHint(area))
+	if (!prefer_adaptive_readahead() && VM_SequentialReadHint(area))
 		page_cache_readahead(mapping, ra, file, pgoff, 1);
 
 	/*
@@ -1426,11 +1446,22 @@ retry_all:
 	 */
 retry_find:
 	page = find_get_page(mapping, pgoff);
+	if (prefer_adaptive_readahead() && VM_SequentialReadHint(area)) {
+		if (!page) {
+			page_cache_readahead_adaptive(mapping, ra, file, NULL,
+								   pgoff, 1);
+			page = find_get_page(mapping, pgoff);
+		} else if (PageReadahead(page)) {
+			page_cache_readahead_adaptive(mapping, ra, file, page,
+								   pgoff, 1);
+		}
+	}
 	if (!page) {
 		unsigned long ra_pages;
 
 		if (VM_SequentialReadHint(area)) {
-			handle_ra_miss(mapping, ra, pgoff);
+			if (!prefer_adaptive_readahead())
+				handle_ra_miss(mapping, ra, pgoff);
 			goto no_cached_page;
 		}
 		ra->mmap_miss++;
@@ -1466,6 +1497,7 @@ retry_find:
 
 	if (!did_readaround)
 		ra->mmap_hit++;
+	readahead_cache_hit(ra, page);
 
 	/*
 	 * Ok, found a page in the page cache, now we need to check
@@ -1481,6 +1513,8 @@ success:
 	mark_page_accessed(page);
 	if (type)
 		*type = majmin;
+	if (prefer_adaptive_readahead())
+		ra->prev_page = page->index;
 	return page;
 
 outside_data_content:
--- linux-2.6.19-rc5-mm2.orig/mm/readahead.c
+++ linux-2.6.19-rc5-mm2/mm/readahead.c
@@ -1591,6 +1591,149 @@ static inline void get_readahead_bounds(
 
 #endif /* CONFIG_ADAPTIVE_READAHEAD */
 
+/**
+ * page_cache_readahead_adaptive - thrashing safe adaptive read-ahead
+ * @mapping, @ra, @filp, @offset, @req_size: the same as page_cache_readahead()
+ * @page: the page at @offset, or NULL if non-present
+ *
+ * page_cache_readahead_adaptive() is the entry point of the adaptive
+ * read-ahead logic. It tries a set of methods in turn to determine the
+ * appropriate readahead action and submits the readahead I/O.
+ *
+ * This function is expected to be called on two conditions:
+ * 1. @page == NULL
+ *    A cache miss happened, some pages have to be read in
+ * 2. @page != NULL && PageReadahead(@page)
+ *    A look-ahead mark encountered, this is set by a previous read-ahead
+ *    invocation to instruct the caller to give the function a chance to
+ *    check up and do next read-ahead in advance.
+ */
+unsigned long
+page_cache_readahead_adaptive(struct address_space *mapping,
+				struct file_ra_state *ra, struct file *filp,
+				struct page *page,
+				pgoff_t offset, unsigned long req_size)
+{
+	unsigned long ra_size;
+	unsigned long ra_min;
+	unsigned long ra_max;
+	int ret;
+
+	if (page) {
+		ClearPageReadahead(page);
+
+		/*
+		 * Defer read-ahead on IO congestion.
+		 */
+		if (bdi_read_congested(mapping->backing_dev_info)) {
+			ra_account(ra, RA_EVENT_IO_CONGESTION, req_size);
+			return 0;
+		}
+	}
+
+	if (page)
+		ra_account(ra, RA_EVENT_LOOKAHEAD_HIT, ra_lookahead_size(ra));
+	else if (offset)
+		ra_account(ra, RA_EVENT_CACHE_MISS, req_size);
+
+	get_readahead_bounds(ra, &ra_min, &ra_max);
+
+	/* read-ahead disabled? */
+	if (unlikely(!ra_max || !readahead_ratio)) {
+		ra_size = max_sane_readahead(req_size);
+		goto readit;
+	}
+
+	/*
+	 * Start of file.
+	 */
+	if (offset == 0)
+		return initial_readahead(mapping, filp, ra, req_size);
+
+	/*
+	 * State based sequential read-ahead.
+	 */
+	if (offset == ra->prev_page + 1 &&
+	    offset == ra->lookahead_index &&
+					!debug_option(disable_stateful_method))
+		return state_based_readahead(mapping, filp, ra, page,
+						offset, req_size, ra_max);
+
+	/*
+	 * Recover from possible thrashing.
+	 */
+	if (!page && offset == ra->prev_page + 1 && ra_has_index(ra, offset))
+		return thrashing_recovery_readahead(mapping, filp, ra,
+								offset, ra_max);
+
+	/*
+	 * Backward read-ahead.
+	 */
+	if (!page && try_backward_prefetching(ra, offset, req_size, ra_max))
+		return ra_submit(ra, mapping, filp);
+
+	/*
+	 * Context based sequential read-ahead.
+	 */
+	ret = try_context_based_readahead(mapping, ra, page,
+							offset, ra_min, ra_max);
+	if (ret > 0)
+		return ra_submit(ra, mapping, filp);
+	if (ret < 0)
+		return 0;
+
+	/* No action on look-ahead time? */
+	if (page) {
+		ra_account(ra, RA_EVENT_LOOKAHEAD_NOACTION,
+						ra->readahead_index - offset);
+		return 0;
+	}
+
+	/*
+	 * Random read.
+	 */
+	ra_size = min(req_size, ra_max);
+readit:
+	ra_size = __do_page_cache_readahead(mapping, filp, offset, ra_size, 0);
+
+	ra_account(ra, RA_EVENT_RANDOM_READ, ra_size);
+	dprintk("random_read(ino=%lu, req=%lu+%lu) = %lu\n",
+		mapping->host->i_ino, offset, req_size, ra_size);
+
+	return ra_size;
+}
+EXPORT_SYMBOL_GPL(page_cache_readahead_adaptive);
+
+#if CONFIG_DEBUG_READAHEAD
+/**
+ * readahead_cache_hit - adaptive read-ahead feedback function
+ * @ra: file_ra_state which holds the readahead state
+ * @page: the page just accessed
+ *
+ * This is the optional feedback route of the adaptive read-ahead logic.
+ * It must be called on every access on the read-ahead pages.
+ */
+void readahead_cache_hit(struct file_ra_state *ra, struct page *page)
+{
+	if (!prefer_adaptive_readahead())
+		return;
+
+	if (PageActive(page) || PageReferenced(page))
+		return;
+
+	if (!PageUptodate(page))
+		ra_account(ra, RA_EVENT_IO_BLOCK, 1);
+
+	if (!ra_has_index(ra, page->index))
+		return;
+
+	if (page->index >= ra->ra_index)
+		ra_account(ra, RA_EVENT_READAHEAD_HIT, 1);
+	else
+		ra_account(ra, RA_EVENT_READAHEAD_HIT, -1);
+}
+#endif /* CONFIG_DEBUG_READAHEAD */
+
 /*
  * Read-ahead events accounting.
  */

--

^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH 23/28] readahead: laptop mode
       [not found] ` <20061115075031.909090639@localhost.localdomain>
@ 2006-11-15  7:50   ` Wu Fengguang
  0 siblings, 0 replies; 24+ messages in thread
From: Wu Fengguang @ 2006-11-15  7:50 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, Wu Fengguang

[-- Attachment #1: readahead-laptop-mode.patch --]
[-- Type: text/plain, Size: 3190 bytes --]

When the laptop drive is spinned down, defer look-ahead to spin up time.

The implementation employs a poll based method, for performance is not a
concern in this code path. The poll interval is 64KB, which should be small
enough for movies/musics. The user space application is responsible for
proper caching to hide the spin-up-and-read delay.

For crazy laptop users who prefer aggressive read-ahead, here is the way:

# echo 1000 > /proc/sys/vm/readahead_ratio
# blockdev --setra 524280 /dev/hda      # this is the max possible value

Notes:
- It is still an untested feature.
- It is safer to use blockdev+fadvise to increase ra-max for a single file,
  which needs patching your movie player.
- Be sure to restore them to sane values in normal operations!

Signed-off-by: Wu Fengguang <wfg@mail.ustc.edu.cn>
Signed-off-by: Andrew Morton <akpm@osdl.org>
--- linux-2.6.19-rc5-mm2.orig/include/linux/writeback.h
+++ linux-2.6.19-rc5-mm2/include/linux/writeback.h
@@ -86,6 +86,12 @@ void laptop_io_completion(void);
 void laptop_sync_completion(void);
 void throttle_vm_writeout(void);
 
+extern struct timer_list laptop_mode_wb_timer;
+static inline int laptop_spinned_down(void)
+{
+	return !timer_pending(&laptop_mode_wb_timer);
+}
+
 /* These are exported to sysctl. */
 extern int dirty_background_ratio;
 extern int vm_dirty_ratio;
--- linux-2.6.19-rc5-mm2.orig/mm/page-writeback.c
+++ linux-2.6.19-rc5-mm2/mm/page-writeback.c
@@ -375,7 +375,7 @@ static void wb_timer_fn(unsigned long un
 static void laptop_timer_fn(unsigned long unused);
 
 static DEFINE_TIMER(wb_timer, wb_timer_fn, 0, 0);
-static DEFINE_TIMER(laptop_mode_wb_timer, laptop_timer_fn, 0, 0);
+DEFINE_TIMER(laptop_mode_wb_timer, laptop_timer_fn, 0, 0);
 
 /*
  * Periodic writeback of "old" data.
--- linux-2.6.19-rc5-mm2.orig/mm/readahead.c
+++ linux-2.6.19-rc5-mm2/mm/readahead.c
@@ -17,6 +17,7 @@
 #include <linux/backing-dev.h>
 #include <linux/pagevec.h>
 #include <linux/buffer_head.h>
+#include <linux/writeback.h>
 
 #include <asm/div64.h>
 
@@ -822,6 +823,32 @@ out:
 }
 
 /*
+ * Set a new look-ahead mark at @next.
+ * Return 0 if the new mark is successfully set.
+ */
+static int renew_lookahead(struct address_space *mapping,
+				struct file_ra_state *ra,
+				pgoff_t offset, pgoff_t next)
+{
+	struct page *page;
+
+	if (offset == ra->lookahead_index &&
+	      next >= ra->readahead_index)
+		return 1;
+
+	if (!(page = find_get_page(mapping, next)))
+		return 1;
+
+	SetPageReadahead(page);
+	page_cache_release(page);
+
+	if (ra->lookahead_index == offset)
+	    ra->lookahead_index = next;
+
+	return 0;
+}
+
+/*
  * Update `backing_dev_info.ra_thrash_bytes' to be a _biased_ average of
  * read-ahead sizes. Which makes it an a-bit-risky(*) estimation of the
  * _minimal_ read-ahead thrashing threshold on the device.
@@ -1629,6 +1656,15 @@ page_cache_readahead_adaptive(struct add
 			ra_account(ra, RA_EVENT_IO_CONGESTION, req_size);
 			return 0;
 		}
+
+		/*
+		 * Defer read-ahead to save energy.
+		 */
+		if (unlikely(laptop_mode && laptop_spinned_down())) {
+			if (!renew_lookahead(mapping, ra, offset,
+						offset + LAPTOP_POLL_INTERVAL))
+				return 0;
+		}
 	}
 
 	if (page)

--

^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH 24/28] readahead: loop case
       [not found] ` <20061115075032.213167260@localhost.localdomain>
@ 2006-11-15  7:50   ` Wu Fengguang
  0 siblings, 0 replies; 24+ messages in thread
From: Wu Fengguang @ 2006-11-15  7:50 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, Wu Fengguang

[-- Attachment #1: readahead-loop-case.patch --]
[-- Type: text/plain, Size: 1672 bytes --]

Disable look-ahead for loop file.

Loopback files normally contain filesystems, in which case there are already
proper look-aheads in the upper layer, more look-aheads on the loopback file
only ruins the read-ahead hit rate.

Signed-off-by: Wu Fengguang <wfg@mail.ustc.edu.cn>
Signed-off-by: Andrew Morton <akpm@osdl.org>
--- linux-2.6.19-rc5-mm2.orig/drivers/block/loop.c
+++ linux-2.6.19-rc5-mm2/drivers/block/loop.c
@@ -755,6 +755,12 @@ static int loop_set_fd(struct loop_devic
 	mapping = file->f_mapping;
 	inode = mapping->host;
 
+	/* Instruct the readahead code to skip look-ahead on loop file.
+	 * The upper layer should already do proper look-ahead,
+	 * one more look-ahead here only ruins the cache hit rate.
+	 */
+	file->f_ra.flags |= RA_FLAG_LOOP;
+
 	if (!(file->f_mode & FMODE_WRITE))
 		lo_flags |= LO_FLAGS_READ_ONLY;
 
--- linux-2.6.19-rc5-mm2.orig/include/linux/fs.h
+++ linux-2.6.19-rc5-mm2/include/linux/fs.h
@@ -741,6 +741,7 @@ struct file_ra_state {
 #define RA_FLAG_MISS	(1UL<<31) /* a cache miss occured against this file */
 #define RA_FLAG_INCACHE	(1UL<<30) /* file is already in cache */
 #define RA_FLAG_MMAP	(1UL<<29) /* mmap page access */
+#define RA_FLAG_LOOP	(1UL<<28) /* loopback file */
 
 struct file {
 	/*
--- linux-2.6.19-rc5-mm2.orig/mm/readahead.c
+++ linux-2.6.19-rc5-mm2/mm/readahead.c
@@ -985,6 +985,10 @@ static int ra_submit(struct file_ra_stat
 		    ra->lookahead_index = eof;
 	}
 
+	/* Disable look-ahead for loopback file. */
+	if (ra->flags & RA_FLAG_LOOP)
+		ra->lookahead_index = ra->readahead_index;
+
 	/* Take down the current read-ahead aging value. */
 	ra->age = nr_scanned_pages_node(numa_node_id());
 

--

^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH 25/28] readahead: nfsd case
       [not found] ` <20061115075032.515501374@localhost.localdomain>
@ 2006-11-15  7:50   ` Wu Fengguang
  0 siblings, 0 replies; 24+ messages in thread
From: Wu Fengguang @ 2006-11-15  7:50 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, Wu Fengguang

[-- Attachment #1: readahead-nfsd-case.patch --]
[-- Type: text/plain, Size: 4875 bytes --]

Bypass nfsd raparms cache -- the new logic do not rely on it.

Also try to work in peace with the chaotic nfsd requests.

For nfsd reads, the new read-ahead logic can handle
+ disordered nfsd requests
+ concurrent sequential requests on large files

Notes about the chaotic nfsd requests issue:

nfsd read requests can be out of order, concurrent and with no ra-state info.
They are handled by the context based read-ahead method, which does the job
in the following steps:

1. scan in page cache
2. make read-ahead decisions
3. alloc new pages
4. insert new pages to page cache

A single read-ahead chunk in the client side will be dissembled and serviced
by many concurrent nfsd in the server side. It is highly possible for two or
more of these parallel nfsd instances to be in step 1/2/3 at the same time.
Without knowing others working on the same file region, they will issue
overlapped read-ahead requests, which lead to many conflicts at step 4.

To work with the tricky situation, readahead decision of nfsd requests is
delayed a bit.

Benchmark results with local mounted nfs(tcp,rsize=32768):

SMALL FILES
readahead_ratio = 8, ra_max = 1024kb
92.99s real  10.32s system  3.23s user  145004+1826 cs  diff -r $NFSDIR $NFSDIR2
readahead_ratio = 70, ra_max = 1024kb
90.96s real  10.68s system  3.22s user  144414+2520 cs  diff -r $NFSDIR $NFSDIR2

BIG FILES
readahead_ratio = 8, ra_max = 1024kb (old logic)
48.36s real  2.22s system  1.51s user  7209+4110 cs  diff $NFSFILE $NFSFILE2
readahead_ratio = 70, ra_max = 1024kb (new logic)
30.04s real  2.46s system  1.33s user  5420+2492 cs  diff $NFSFILE $NFSFILE2

Signed-off-by: Wu Fengguang <wfg@mail.ustc.edu.cn>
Signed-off-by: Andrew Morton <akpm@osdl.org>
--- linux-2.6.19-rc5-mm2.orig/fs/nfsd/vfs.c
+++ linux-2.6.19-rc5-mm2/fs/nfsd/vfs.c
@@ -853,7 +853,11 @@ nfsd_vfs_read(struct svc_rqst *rqstp, st
 #endif
 
 	/* Get readahead parameters */
-	ra = nfsd_get_raparms(inode->i_sb->s_dev, inode->i_ino);
+	if (prefer_adaptive_readahead()) {
+		ra = NULL;
+		file->f_ra.flags |= RA_FLAG_NFSD;
+	} else
+		ra = nfsd_get_raparms(inode->i_sb->s_dev, inode->i_ino);
 
 	if (ra && ra->p_set)
 		file->f_ra = ra->p_ra;
--- linux-2.6.19-rc5-mm2.orig/include/linux/fs.h
+++ linux-2.6.19-rc5-mm2/include/linux/fs.h
@@ -742,6 +742,7 @@ struct file_ra_state {
 #define RA_FLAG_INCACHE	(1UL<<30) /* file is already in cache */
 #define RA_FLAG_MMAP	(1UL<<29) /* mmap page access */
 #define RA_FLAG_LOOP	(1UL<<28) /* loopback file */
+#define RA_FLAG_NFSD	(1UL<<27) /* nfsd read */
 
 struct file {
 	/*
--- linux-2.6.19-rc5-mm2.orig/mm/readahead.c
+++ linux-2.6.19-rc5-mm2/mm/readahead.c
@@ -1298,7 +1298,6 @@ static unsigned long count_history_pages
 	pgoff_t head;
 	unsigned long count;
 	unsigned long lookback;
-	unsigned long hit_rate;
 
 	/*
 	 * Scan backward and check the near @ra_max pages.
@@ -1312,11 +1311,13 @@ static unsigned long count_history_pages
 	count = offset - head;
 
 	/*
-	 * Ensure readahead hit rate
+	 * Ensure readahead hit rate, when it's not a chaotic nfsd read.
 	 */
-	hit_rate = max(readahead_hit_rate, 1);
-	if (count_cache_hit(mapping, head, offset) * hit_rate < count)
-		count = 0;
+	if (!(ra->flags & RA_FLAG_NFSD)) {
+		unsigned long hit_rate = max(readahead_hit_rate, 1);
+		if (count_cache_hit(mapping, head, offset) * hit_rate < count)
+			count = 0;
+	}
 
 	/*
 	 * Unnecessary to count more?
@@ -1669,6 +1670,9 @@ page_cache_readahead_adaptive(struct add
 						offset + LAPTOP_POLL_INTERVAL))
 				return 0;
 		}
+	} else if (ra->flags & RA_FLAG_NFSD) { /* nfsd read */
+		ra_size = max_sane_readahead(req_size);
+		goto readit;
 	}
 
 	if (page)
@@ -1740,6 +1744,21 @@ readit:
 	dprintk("random_read(ino=%lu, req=%lu+%lu) = %lu\n",
 		mapping->host->i_ino, offset, req_size, ra_size);
 
+	/*
+	 * nfsd read-ahead, starting stage.
+	 */
+	if (ra->flags & RA_FLAG_NFSD) {
+		pgoff_t ra_index = offset + ra_size;
+		if (probe_page(mapping, offset - 1) &&
+		   !probe_page(mapping, ra_index)) {
+			ra->prev_page = ra_index - 1;
+			ret = try_context_based_readahead(mapping, ra, NULL,
+						 ra_index, ra_min, ra_max);
+			if (ret > 0)
+				ra_size += ra_submit(ra, mapping, filp);
+		}
+	}
+
 	return ra_size;
 }
 EXPORT_SYMBOL_GPL(page_cache_readahead_adaptive);
--- linux-2.6.19-rc5-mm2.orig/fs/nfs/client.c
+++ linux-2.6.19-rc5-mm2/fs/nfs/client.c
@@ -661,6 +661,9 @@ static void nfs_server_set_fsinfo(struct
 		server->rsize = NFS_MAX_FILE_IO_SIZE;
 	server->rpages = (server->rsize + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT;
 	server->backing_dev_info.ra_pages = server->rpages * NFS_MAX_READAHEAD;
+	server->backing_dev_info.ra_pages0 = min(server->rpages, VM_MIN_READAHEAD
+							>> (PAGE_CACHE_SHIFT - 10));
+	server->backing_dev_info.ra_thrash_bytes = server->rsize * NFS_MAX_READAHEAD;
 
 	if (server->wsize > max_rpc_payload)
 		server->wsize = max_rpc_payload;

--

^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH 26/28] readahead: turn on by default
       [not found] ` <20061115075032.945192537@localhost.localdomain>
@ 2006-11-15  7:50   ` Wu Fengguang
  0 siblings, 0 replies; 24+ messages in thread
From: Wu Fengguang @ 2006-11-15  7:50 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, Wu Fengguang

[-- Attachment #1: readahead-turn-on-by-default.patch --]
[-- Type: text/plain, Size: 543 bytes --]

Enable the adaptive readahead logic by default.

It helps collect more early testers, and is meant to be a -mm only patch.

Signed-off-by: Wu Fengguang <wfg@mail.ustc.edu.cn>
Signed-off-by: Andrew Morton <akpm@osdl.org>
--- linux-2.6.19-rc5-mm1.orig/mm/Kconfig
+++ linux-2.6.19-rc5-mm1/mm/Kconfig
@@ -168,7 +168,7 @@ config ZONE_DMA_FLAG
 #
 config ADAPTIVE_READAHEAD
 	bool "Adaptive file readahead (EXPERIMENTAL)"
-	default n
+	default y
 	depends on EXPERIMENTAL
 	help
 	  Readahead is a technique employed by the kernel in an attempt

--

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 12/28] readahead: state based method - aging accounting
  2006-11-15  7:50   ` [PATCH 12/28] readahead: state based method - aging accounting Wu Fengguang
@ 2006-11-15 16:54     ` Christoph Lameter
       [not found]       ` <20061116133919.GA6645@mail.ustc.edu.cn>
  0 siblings, 1 reply; 24+ messages in thread
From: Christoph Lameter @ 2006-11-15 16:54 UTC (permalink / raw)
  To: Wu Fengguang; +Cc: Andrew Morton, linux-kernel

On Wed, 15 Nov 2006, Wu Fengguang wrote:

> Collect info about the global available memory and its consumption speed.
> The data are used by the stateful method to estimate the thrashing threshold.

Looks like you should use a ZVC counter for total scanned. See 
include/linux/mmzone.h.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 12/28] readahead: state based method - aging accounting
       [not found]       ` <20061116133919.GA6645@mail.ustc.edu.cn>
@ 2006-11-16 13:39         ` Wu Fengguang
  0 siblings, 0 replies; 24+ messages in thread
From: Wu Fengguang @ 2006-11-16 13:39 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: Andrew Morton, linux-kernel

On Wed, Nov 15, 2006 at 08:54:44AM -0800, Christoph Lameter wrote:
> On Wed, 15 Nov 2006, Wu Fengguang wrote:
> 
> > Collect info about the global available memory and its consumption speed.
> > The data are used by the stateful method to estimate the thrashing threshold.
> 
> Looks like you should use a ZVC counter for total scanned. See 
> include/linux/mmzone.h.

OK.

By using zone.total_scanned, I have chose an easy way :)

To do the general vm timing in something like zone.vm_stat[NR_SCAN_INACTIVE],
a set of new functions will be required:

        global_page_state_raw()
        zone_page_state_raw()
        node_page_state_raw()

They do not check overflows, so that we can do

        time_elapsed = new_raw_value - old_raw_value;

However, before introducing the ugly *_raw() functions, I'd like to know if

        #ifdef CONFIG_SMP
                if (x < 0)
                        x = 0;
        #endif  

really helps some big NUMA system. I suspect object counters like
NR_FILE_PAGES will _never_ overflow, and an accumulated counter like
NR_VMSCAN_WRITE is expected to overflow. In either case, it is ok to
return an unsigned long raw counter.

Regards,
Wu

^ permalink raw reply	[flat|nested] 24+ messages in thread

end of thread, other threads:[~2006-11-16 13:39 UTC | newest]

Thread overview: 24+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
     [not found] <20061115075007.832957580@localhost.localdomain>
2006-11-15  7:50 ` [PATCH 00/28] Adaptive readahead V16 Wu Fengguang
     [not found] ` <20061115075024.180138257@localhost.localdomain>
2006-11-15  7:50   ` [PATCH 01/28] readahead: kconfig options Wu Fengguang
     [not found] ` <20061115075024.503627543@localhost.localdomain>
2006-11-15  7:50   ` [PATCH 02/28] radixtree: introduce scan hole/data functions Wu Fengguang
     [not found] ` <20061115075024.850542829@localhost.localdomain>
2006-11-15  7:50   ` [PATCH 03/28] mm: introduce probe_page() Wu Fengguang
     [not found] ` <20061115075025.438524224@localhost.localdomain>
2006-11-15  7:50   ` [PATCH 04/28] mm: introduce PG_readahead Wu Fengguang
     [not found] ` <20061115075026.121499794@localhost.localdomain>
2006-11-15  7:50   ` [PATCH 06/28] readahead: insert cond_resched() calls Wu Fengguang
     [not found] ` <20061115075027.139255636@localhost.localdomain>
2006-11-15  7:50   ` [PATCH 09/28] readahead: rescue_pages() Wu Fengguang
     [not found] ` <20061115075027.832896629@localhost.localdomain>
2006-11-15  7:50   ` [PATCH 11/28] readahead: min/max sizes Wu Fengguang
     [not found] ` <20061115075028.178039166@localhost.localdomain>
2006-11-15  7:50   ` [PATCH 12/28] readahead: state based method - aging accounting Wu Fengguang
2006-11-15 16:54     ` Christoph Lameter
     [not found]       ` <20061116133919.GA6645@mail.ustc.edu.cn>
2006-11-16 13:39         ` Wu Fengguang
     [not found] ` <20061115075028.494374406@localhost.localdomain>
2006-11-15  7:50   ` [PATCH 13/28] readahead: state based method - routines Wu Fengguang
     [not found] ` <20061115075028.829507795@localhost.localdomain>
2006-11-15  7:50   ` [PATCH 14/28] readahead: state based method Wu Fengguang
     [not found] ` <20061115075029.205178794@localhost.localdomain>
2006-11-15  7:50   ` [PATCH 15/28] readahead: context " Wu Fengguang
     [not found] ` <20061115075029.519507130@localhost.localdomain>
2006-11-15  7:50   ` [PATCH 16/28] readahead: initial method - guiding sizes Wu Fengguang
     [not found] ` <20061115075029.869472273@localhost.localdomain>
2006-11-15  7:50   ` [PATCH 17/28] readahead: initial method - thrashing guard size Wu Fengguang
     [not found] ` <20061115075030.229339867@localhost.localdomain>
2006-11-15  7:50   ` [PATCH 18/28] readahead: initial method - user recommended size Wu Fengguang
     [not found] ` <20061115075030.942942737@localhost.localdomain>
2006-11-15  7:50   ` [PATCH 20/28] readahead: backward prefetching method Wu Fengguang
     [not found] ` <20061115075031.286178806@localhost.localdomain>
2006-11-15  7:50   ` [PATCH 21/28] readahead: thrashing recovery method Wu Fengguang
     [not found] ` <20061115075031.524129110@localhost.localdomain>
2006-11-15  7:50   ` [PATCH 22/28] readahead: call scheme Wu Fengguang
     [not found] ` <20061115075031.909090639@localhost.localdomain>
2006-11-15  7:50   ` [PATCH 23/28] readahead: laptop mode Wu Fengguang
     [not found] ` <20061115075032.213167260@localhost.localdomain>
2006-11-15  7:50   ` [PATCH 24/28] readahead: loop case Wu Fengguang
     [not found] ` <20061115075032.515501374@localhost.localdomain>
2006-11-15  7:50   ` [PATCH 25/28] readahead: nfsd case Wu Fengguang
     [not found] ` <20061115075032.945192537@localhost.localdomain>
2006-11-15  7:50   ` [PATCH 26/28] readahead: turn on by default Wu Fengguang

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox