public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed
* [PATCH 00/32] Adaptive readahead V14
       [not found] <20060527154849.927021763@localhost.localdomain>
@ 2006-05-27 15:48 ` Wu Fengguang
  2006-05-27 17:29   ` Michael Tokarev
       [not found] ` <20060527155125.911021581@localhost.localdomain>
                   ` (19 subsequent siblings)
  20 siblings, 1 reply; 30+ messages in thread
From: Wu Fengguang @ 2006-05-27 15:48 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, Wu Fengguang

Andrew,

This is the 14th release of the adaptive readahead patchset.

Thanks to Andrew Morton, Nick Piggin and Peter Zijlstra,
the patchset is reviewed and greatly improved.

It has received tests in a wide range of applications in the past
six months, and polished up considerably.


Performance benefits
====================

Besides file servers and desktops, it is recently found to benefit
postgresql databases a lot.

I explained to pgsql users how the patch may help their db performance:
http://archives.postgresql.org/pgsql-performance/2006-04/msg00491.php
[QUOTE]
	HOW IT WORKS

	In adaptive readahead, the context based method may be of particular
	interest to postgresql users. It works by peeking into the file cache
	and check if there are any history pages present or accessed. In this
	way it can detect almost all forms of sequential / semi-sequential read
	patterns, e.g.
		- parallel / interleaved sequential scans on one file
		- sequential reads across file open/close
		- mixed sequential / random accesses
		- sparse / skimming sequential read

	It also have methods to detect some less common cases:
		- reading backward
		- seeking all over reading N pages

	WAYS TO BENEFIT FROM IT

	As we know, postgresql relies on the kernel to do proper readahead.
	The adaptive readahead might help performance in the following cases:
		- concurrent sequential scans
		- sequential scan on a fragmented table
		  (some DBs suffer from this problem, not sure for pgsql)
		- index scan with clustered matches
		- index scan on majority rows (in case the planner goes wrong)

And received positive responses:
[QUOTE from Michael Stone]
	I've got one DB where the VACUUM ANALYZE generally takes 11M-12M ms;
	with the patch the job took 1.7M ms. Another VACUUM that normally takes
	between 300k-500k ms took 150k. Definately a promising addition.

[QUOTE from Michael Stone]
	>I'm thinking about it, we're already using a fixed read-ahead of 16MB
	>using blockdev on the stock Redhat 2.6.9 kernel, it would be nice to
	>not have to set this so we may try it.

	FWIW, I never saw much performance difference from doing that. Wu's
	patch, OTOH, gave a big boost.

[QUOTE: odbc-bench with Postgresql 7.4.11 on dual Opteron]
	Base kernel:
	 Transactions per second:                92.384758
	 Transactions per second:                99.800896

	After read-ahvm.readahead_ratio = 100:
	 Transactions per second:                105.461952
	 Transactions per second:                105.458664

	vm.readahead_ratio = 100 ; vm.readahead_hit_rate = 1:
	 Transactions per second:                113.055367
	 Transactions per second:                124.815910


Patches
=======

All 32 patches are bisect friendly.

The following 28 patches are only logically seperated -
one should not remove one of them and expect others to compile cleanly:

[patch 01/32] readahead: kconfig options
[patch 02/32] radixtree: introduce radix_tree_scan_hole[_backward]()
[patch 03/32] mm: introduce probe_page()
[patch 04/32] mm: introduce PG_readahead
[patch 05/32] readahead: add look-ahead support to __do_page_cache_readahead() 
[patch 06/32] readahead: delay page release in do_generic_mapping_read()
[patch 07/32] readahead: insert cond_resched() calls
[patch 08/32] readahead: {MIN,MAX}_RA_PAGES
[patch 09/32] readahead: events accounting
[patch 10/32] readahead: rescue_pages()
[patch 11/32] readahead: sysctl parameters
[patch 12/32] readahead: min/max sizes
[patch 13/32] readahead: state based method - aging accounting
[patch 14/32] readahead: state based method - routines
[patch 15/32] readahead: state based method
[patch 16/32] readahead: context based method
[patch 17/32] readahead: initial method - guiding sizes
[patch 18/32] readahead: initial method - thrashing guard size
[patch 19/32] readahead: initial method - expected read size
[patch 20/32] readahead: initial method - user recommended size
[patch 21/32] readahead: initial method
[patch 22/32] readahead: backward prefetching method
[patch 23/32] readahead: seeking reads method
[patch 24/32] readahead: thrashing recovery method
[patch 25/32] readahead: call scheme
[patch 26/32] readahead: laptop mode
[patch 27/32] readahead: loop case
[patch 28/32] readahead: nfsd case

The following 4 patches are for debugging purpose, and for -mm only:

[patch 29/32] readahead: turn on by default
[patch 30/32] readahead: debug radix tree new functions
[patch 31/32] readahead: debug traces showing accessed file names
[patch 32/32] readahead: debug traces showing read patterns

Diffstat
========

 Documentation/sysctl/vm.txt |   37 +
 block/ll_rw_blk.c           |   34 
 drivers/block/loop.c        |    6 
 fs/file_table.c             |    7 
 fs/mpage.c                  |    4 
 fs/nfsd/vfs.c               |    5 
 include/linux/backing-dev.h |    3 
 include/linux/fs.h          |   74 +-
 include/linux/mm.h          |   31 
 include/linux/mmzone.h      |    6 
 include/linux/page-flags.h  |    6 
 include/linux/pagemap.h     |    2 
 include/linux/radix-tree.h  |    4 
 include/linux/sysctl.h      |    2 
 include/linux/writeback.h   |    6 
 kernel/sysctl.c             |   28 
 lib/radix-tree.c            |   71 +
 mm/Kconfig                  |   62 +
 mm/filemap.c                |  112 ++-
 mm/page-writeback.c         |    2 
 mm/page_alloc.c             |   18 
 mm/readahead.c              | 1599 +++++++++++++++++++++++++++++++++++++++++++-
 mm/swap.c                   |    2 
 mm/vmscan.c                 |    4 
 24 files changed, 2088 insertions(+), 37 deletions(-)

Changelog
=========

V14  2006-05-27
- remove __radix_tree_lookup_parent()
- implement radix_tree_scan_hole*() as dumb and safe ones
- break file_ra_state.cache_hits into u16s
- rationalize ra_dispatch() and move look-ahead/age stuffs here
- move node_free_and_cold_pages() to page_alloc.c/nr_free_inactive_pages_node()
- fix a bug in query_page_cache_segment()
- adjust RA_FLAG_XXX to avoid confliction with ra_class_{new,old}
- random comments

V13  2006-05-26
- remove radix tree look-aside cache
- fix radix tree NULL dereference bug
- fix radix tree bugs on direct embedded data
- add comment on cold_page_refcnt()
- rename find_page() to probe_page()
- replace the non-atomic __SetPageReadahead()
- fix the risky rescue_pages()
- some cleanups recommended by Nick Piggin

V12  2006-05-24
- improve small files case
- allow pausing of events accounting
- disable sparse read-ahead by default
- a bug fix in radix_tree_cache_lookup_parent()
- more cleanups

V11  2006-03-19
- patchset rework
- add kconfig option to make the feature compile-time selectable
- improve radix tree scan functions
- fix bug of using smp_processor_id() in preemptible code
- avoid overflow in compute_thrashing_threshold()
- disable sparse read prefetching if (readahead_hit_rate == 1)
- make thrashing recovery a standalone function
- random cleanups

V10  2005-12-16
- remove delayed page activation
- remove live page protection
- revert mmap readaround to old behavior
- default to original readahead logic
- default to original readahead size
- merge comment fixes from Andreas Mohr
- merge radixtree cleanups from Christoph Lameter
- reduce sizeof(struct file_ra_state) by unnamed union
- stateful method cleanups
- account other read-ahead paths

V9  2005-12-3
- standalone mmap read-around code, a little more smart and tunable
- make stateful method sensible of request size
- decouple readahead_ratio from live pages protection
- let readahead_ratio contribute to ra_size grow speed in stateful method
- account variance of ra_size

V8  2005-11-25

- balance zone aging only in page relaim paths and do it right
- do the aging of slabs in the same way as zones
- add debug code to dump the detailed page reclaim steps
- undo exposing of struct radix_tree_node and uninline related functions
- work better with nfsd
- generalize accelerated context based read-ahead
- account smooth read-ahead aging based on page referenced/activate bits
- avoid divide error in compute_thrashing_threshold()
- more low lantency efforts
- update some comments
- rebase debug actions on debugfs entries instead of magic readahead_ratio values

V7  2005-11-09

- new tunable parameters: readahead_hit_rate/readahead_live_chunk
- support sparse sequential accesses
- delay look-ahead if drive is spinned down in laptop mode
- disable look-ahead for loopback file
- make mandatory thrashing protection more simple and robust
- attempt to improve responsiveness on large read-ahead size

V6  2005-11-01

- cancel look-ahead in laptop mode
- increase read-ahead limit to 0xFFFF pages

V5  2005-10-28

- rewrite context based method to make it clean and robust
- improved accuracy of stateful thrashing threshold estimation
- make page aging equal to the number of code pages scanned
- sort out the thrashing protection logic
- enhanced debug/accounting facilities

V4  2005-10-15

- detect and save live chunks on page reclaim
- support database workload
- support reading backward
- radix tree lookup look-aside cache

V3  2005-10-06

- major code reorganization and documention
- stateful estimation of thrashing-threshold
- context method with accelerated grow up phase
- adaptive look-ahead
- early detection and rescue of pages in danger
- statitics data collection
- synchronized page aging between zones

V2  2005-09-15

- delayed page activation
- look-ahead: towards pipelined read-ahead

V1  2005-09-13

Initial release which features:
        o stateless (for now)
        o adapts to available memory / read speed
        o free of thrashing (in theory)

And handles:
        o large number of slow streams (FTP server)
	o open/read/close access patterns (NFS server)
        o multiple interleaved, sequential streams in one file
	  (multithread / multimedia / database)

Cheers,
Wu Fengguang
--
Dept. Automation                University of Science and Technology of China

^ permalink raw reply	[flat|nested] 30+ messages in thread

* [PATCH 01/32] readahead: kconfig options
       [not found] ` <20060527155125.911021581@localhost.localdomain>
@ 2006-05-27 15:48   ` Wu Fengguang
  0 siblings, 0 replies; 30+ messages in thread
From: Wu Fengguang @ 2006-05-27 15:48 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, Wu Fengguang

[-- Attachment #1: readahead-kconfig-options.patch --]
[-- Type: text/plain, Size: 5022 bytes --]

This patchset introduces a set of adaptive read-ahead methods.
They enable the kernel to better support many important I/O applications.

MAIN FEATURES
=============

- Adaptive read-ahead buffer management
	- aggressive, thrashing safe read-ahead size
		- optimal memory utilisation while achieving good I/O throughput
		- unnecessary to hand tuning VM_MAX_READAHEAD
		- support slow/fast readers at the same time
		- support large number of concurrent readers
	- aggressive read-ahead on start-of-file
		- configurable recommended read-ahead size
		- safeguarded by dynamic estimated thrashing threshold
		- safeguarded by dynamic estimated expected read size
		- good for lots-of-small-files case
	- shrinkable look-ahead size
		- cut down up to 40% memory consumption on overloaded situation

- Detecting any form of (semi-)sequencial scan
        - parallel / interleaved sequential scans on one fd
        - sequential reads across file open/close lifetime
        - mixed sequential / random accesses
        - sparse / skimming sequential read

- Support more access patterns
        - backward prefetching
        - seeking around reading N pages

- Better special case handling
        - nfs daemon support: the raparams cache is no longer required
	- laptop mode support: defer look-ahead on drive spinned down
        - loopback file support: avoid double look-ahead


DESIGN STRATEGIES
=================

- Dual methods design
        - stateful method: the fast and default one
	- stateless method: the robust and failsafe one
	- if anything abnormal happens, the stateful method bails out, the
	  stateless method queries the page cache and possibly restart the
	  read-ahead process

- Robust feedback design
	- sense and handle important states so that the logic wont run away
	- detect danger of thrashing and prevent it in advance
        - extensive accounting and debugging traces


This patch:

Add kconfig options to enable/disable:
	- adaptive read-ahead logic
	- adaptive read-ahead debug traces and events accounting

The read-ahead introduction text is cited from the well written LWN article
"Adaptive file readahead" <http://lwn.net/Articles/155510/> :)

Signed-off-by: Wu Fengguang <wfg@mail.ustc.edu.cn>
---

 mm/Kconfig     |   57 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 mm/readahead.c |    2 ++
 2 files changed, 59 insertions(+)

--- linux-2.6.17-rc4-mm3.orig/mm/Kconfig
+++ linux-2.6.17-rc4-mm3/mm/Kconfig
@@ -146,3 +146,60 @@ config MIGRATION
 	  while the virtual addresses are not changed. This is useful for
 	  example on NUMA systems to put pages nearer to the processors accessing
 	  the page.
+
+#
+# Adaptive file readahead
+#
+config ADAPTIVE_READAHEAD
+	bool "Adaptive file readahead (EXPERIMENTAL)"
+	default n
+	depends on EXPERIMENTAL
+	help
+	  Readahead is a technique employed by the kernel in an attempt
+	  to improve file reading performance. If the kernel has reason
+	  to believe that a particular file is being read sequentially,
+	  it will attempt to read blocks from the file into memory before
+	  the application requests them. When readahead works, it speeds
+	  up the system's throughput, since the reading application does
+	  not have to wait for its requests. When readahead fails, instead,
+	  it generates useless I/O and occupies memory pages which are
+	  needed for some other purpose.
+
+	  The kernel already has a stock readahead logic that is well
+	  understood and well tuned. This option enables a more complex and
+	  feature rich one. It tries to be smart and memory efficient.
+	  However, due to the great diversity of real world applications, it
+	  might not fit everyone.
+
+	  Please refer to Documentation/sysctl/vm.txt for tunable parameters.
+
+	  It is known to work well for many desktops, file servers and
+	  postgresql databases. Say Y to try it out for yourself.
+
+config DEBUG_READAHEAD
+	bool "Readahead debug and accounting"
+	default y
+	depends on ADAPTIVE_READAHEAD
+	select DEBUG_FS
+	help
+	  This option injects extra code to dump detailed debug traces and do
+	  readahead events accounting.
+
+	  To actually get the data:
+
+	  mkdir /debug
+	  mount -t debug none /debug
+
+	  After that you can do the following:
+
+	  echo > /debug/readahead/events # reset the counters
+	  cat /debug/readahead/events    # check the counters
+
+	  echo 1 > /debug/readahead/debug_level # start events accounting
+	  echo 0 > /debug/readahead/debug_level # pause events accounting
+
+	  echo 2 > /debug/readahead/debug_level # show printk traces
+	  echo 3 > /debug/readahead/debug_level # show printk traces(verbose)
+	  echo 1 > /debug/readahead/debug_level # stop filling my kern.log
+
+	  Say N for production servers.
--- linux-2.6.17-rc4-mm3.orig/mm/readahead.c
+++ linux-2.6.17-rc4-mm3/mm/readahead.c
@@ -5,6 +5,8 @@
  *
  * 09Apr2002	akpm@zip.com.au
  *		Initial version.
+ * 26May2006	Wu Fengguang <wfg@mail.ustc.edu.cn>
+ *		Adaptive read-ahead framework.
  */
 
 #include <linux/kernel.h>

--

^ permalink raw reply	[flat|nested] 30+ messages in thread

* [PATCH 04/32] mm: introduce PG_readahead
       [not found] ` <20060527155127.522802387@localhost.localdomain>
@ 2006-05-27 15:48   ` Wu Fengguang
  0 siblings, 0 replies; 30+ messages in thread
From: Wu Fengguang @ 2006-05-27 15:48 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, Wu Fengguang

[-- Attachment #1: mm-PG_readahead.patch --]
[-- Type: text/plain, Size: 1726 bytes --]

Introduce a new page flag: PG_readahead.

It acts as a look-ahead mark, which tells the page reader:
	Hey, it's time to invoke the adaptive read-ahead logic!
	For the sake of I/O pipelining, don't wait until it runs out of
	cached pages.  ;-)

Signed-off-by: Wu Fengguang <wfg@mail.ustc.edu.cn>
---

 include/linux/page-flags.h |    6 ++++++
 mm/page_alloc.c            |    2 +-
 2 files changed, 7 insertions(+), 1 deletion(-)

--- linux-2.6.17-rc4-mm3.orig/include/linux/page-flags.h
+++ linux-2.6.17-rc4-mm3/include/linux/page-flags.h
@@ -90,6 +90,8 @@
 #define PG_nosave_free		18	/* Free, should not be written */
 #define PG_buddy		19	/* Page is free, on buddy lists */
 
+#define PG_readahead		20	/* Reminder to do readahead */
+
 
 #if (BITS_PER_LONG > 32)
 /*
@@ -372,6 +374,10 @@ extern void __mod_page_state_offset(unsi
 #define SetPageUncached(page)	set_bit(PG_uncached, &(page)->flags)
 #define ClearPageUncached(page)	clear_bit(PG_uncached, &(page)->flags)
 
+#define PageReadahead(page)	test_bit(PG_readahead, &(page)->flags)
+#define SetPageReadahead(page)	set_bit(PG_readahead, &(page)->flags)
+#define TestClearPageReadahead(page) test_and_clear_bit(PG_readahead, &(page)->flags)
+
 struct page;	/* forward declaration */
 
 int test_clear_page_dirty(struct page *page);
--- linux-2.6.17-rc4-mm3.orig/mm/page_alloc.c
+++ linux-2.6.17-rc4-mm3/mm/page_alloc.c
@@ -564,7 +564,7 @@ static int prep_new_page(struct page *pa
 	if (PageReserved(page))
 		return 1;
 
-	page->flags &= ~(1 << PG_uptodate | 1 << PG_error |
+	page->flags &= ~(1 << PG_uptodate | 1 << PG_error | 1 << PG_readahead |
 			1 << PG_referenced | 1 << PG_arch_1 |
 			1 << PG_checked | 1 << PG_mappedtodisk);
 	set_page_private(page, 0);

--

^ permalink raw reply	[flat|nested] 30+ messages in thread

* [PATCH 06/32] readahead: delay page release in do_generic_mapping_read()
       [not found] ` <20060527155128.472551240@localhost.localdomain>
@ 2006-05-27 15:48   ` Wu Fengguang
  0 siblings, 0 replies; 30+ messages in thread
From: Wu Fengguang @ 2006-05-27 15:48 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, Wu Fengguang

[-- Attachment #1: readahead-delay-page-release-in-do_generic_mapping_read.patch --]
[-- Type: text/plain, Size: 2540 bytes --]

In do_generic_mapping_read(), release accessed pages some time later,
so that it can be passed to and used by the adaptive read-ahead code.

Signed-off-by: Wu Fengguang <wfg@mail.ustc.edu.cn>
---

 mm/filemap.c |   18 ++++++++++++------
 1 files changed, 12 insertions(+), 6 deletions(-)

--- linux-2.6.17-rc4-mm3.orig/mm/filemap.c
+++ linux-2.6.17-rc4-mm3/mm/filemap.c
@@ -835,10 +835,12 @@ void do_generic_mapping_read(struct addr
 	unsigned long prev_index;
 	loff_t isize;
 	struct page *cached_page;
+	struct page *prev_page;
 	int error;
 	struct file_ra_state ra = *_ra;
 
 	cached_page = NULL;
+	prev_page = NULL;
 	index = *ppos >> PAGE_CACHE_SHIFT;
 	next_index = index;
 	prev_index = ra.prev_page;
@@ -877,6 +879,11 @@ find_page:
 			handle_ra_miss(mapping, &ra, index);
 			goto no_cached_page;
 		}
+
+		if (prev_page)
+			page_cache_release(prev_page);
+		prev_page = page;
+
 		if (!PageUptodate(page))
 			goto page_not_up_to_date;
 page_ok:
@@ -911,7 +918,6 @@ page_ok:
 		index += offset >> PAGE_CACHE_SHIFT;
 		offset &= ~PAGE_CACHE_MASK;
 
-		page_cache_release(page);
 		if (ret == nr && desc->count)
 			continue;
 		goto out;
@@ -923,7 +929,6 @@ page_not_up_to_date:
 		/* Did it get unhashed before we got the lock? */
 		if (!page->mapping) {
 			unlock_page(page);
-			page_cache_release(page);
 			continue;
 		}
 
@@ -953,7 +958,6 @@ readpage:
 					 * invalidate_inode_pages got it
 					 */
 					unlock_page(page);
-					page_cache_release(page);
 					goto find_page;
 				}
 				unlock_page(page);
@@ -974,7 +978,6 @@ readpage:
 		isize = i_size_read(inode);
 		end_index = (isize - 1) >> PAGE_CACHE_SHIFT;
 		if (unlikely(!isize || index > end_index)) {
-			page_cache_release(page);
 			goto out;
 		}
 
@@ -983,7 +986,6 @@ readpage:
 		if (index == end_index) {
 			nr = ((isize - 1) & ~PAGE_CACHE_MASK) + 1;
 			if (nr <= offset) {
-				page_cache_release(page);
 				goto out;
 			}
 		}
@@ -993,7 +995,6 @@ readpage:
 readpage_error:
 		/* UHHUH! A synchronous read error occurred. Report it */
 		desc->error = error;
-		page_cache_release(page);
 		goto out;
 
 no_cached_page:
@@ -1018,6 +1019,9 @@ no_cached_page:
 		}
 		page = cached_page;
 		cached_page = NULL;
+		if (prev_page)
+			page_cache_release(prev_page);
+		prev_page = page;
 		goto readpage;
 	}
 
@@ -1027,6 +1031,8 @@ out:
 	*ppos = ((loff_t) index << PAGE_CACHE_SHIFT) + offset;
 	if (cached_page)
 		page_cache_release(cached_page);
+	if (prev_page)
+		page_cache_release(prev_page);
 	if (filp)
 		file_accessed(filp);
 }

--

^ permalink raw reply	[flat|nested] 30+ messages in thread

* [PATCH 07/32] readahead: insert cond_resched() calls
       [not found] ` <20060527155129.001886224@localhost.localdomain>
@ 2006-05-27 15:48   ` Wu Fengguang
  0 siblings, 0 replies; 30+ messages in thread
From: Wu Fengguang @ 2006-05-27 15:48 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, Wu Fengguang, Con Kolivas

[-- Attachment #1: readahead-insert-cond_resched-calls.patch --]
[-- Type: text/plain, Size: 2140 bytes --]

Since the VM_MAX_READAHEAD is greatly enlarged and the algorithm more
complex, it becomes necessary to insert some cond_resched() calls in
the read-ahead path.

If desktop users still feel audio jitters with the new read-ahead code,
please try one of the following ways to get rid of it:

1) compile kernel with CONFIG_PREEMPT_VOLUNTARY/CONFIG_PREEMPT
2) reduce the read-ahead request size by running
	blockdev --setra 256 /dev/hda # or whatever device you are using

Signed-off-by: Wu Fengguang <wfg@mail.ustc.edu.cn>
---


This patch is recommended by Con Kolivas to improve respond time for desktop.
Thanks!

 fs/mpage.c     |    4 +++-
 mm/readahead.c |    9 +++++++--
 2 files changed, 10 insertions(+), 3 deletions(-)

--- linux-2.6.17-rc4-mm3.orig/mm/readahead.c
+++ linux-2.6.17-rc4-mm3/mm/readahead.c
@@ -148,8 +148,10 @@ int read_cache_pages(struct address_spac
 			continue;
 		}
 		ret = filler(data, page);
-		if (!pagevec_add(&lru_pvec, page))
+		if (!pagevec_add(&lru_pvec, page)) {
+			cond_resched();
 			__pagevec_lru_add(&lru_pvec);
+		}
 		if (ret) {
 			while (!list_empty(pages)) {
 				struct page *victim;
@@ -186,8 +188,10 @@ static int read_pages(struct address_spa
 		if (!add_to_page_cache(page, mapping,
 					page->index, GFP_KERNEL)) {
 			mapping->a_ops->readpage(filp, page);
-			if (!pagevec_add(&lru_pvec, page))
+			if (!pagevec_add(&lru_pvec, page)) {
+				cond_resched();
 				__pagevec_lru_add(&lru_pvec);
+			}
 		} else
 			page_cache_release(page);
 	}
@@ -299,6 +303,7 @@ __do_page_cache_readahead(struct address
 			continue;
 
 		read_unlock_irq(&mapping->tree_lock);
+		cond_resched();
 		page = page_cache_alloc_cold(mapping);
 		read_lock_irq(&mapping->tree_lock);
 		if (!page)
--- linux-2.6.17-rc4-mm3.orig/fs/mpage.c
+++ linux-2.6.17-rc4-mm3/fs/mpage.c
@@ -407,8 +407,10 @@ mpage_readpages(struct address_space *ma
 					&last_block_in_bio, &map_bh,
 					&first_logical_block,
 					get_block);
-			if (!pagevec_add(&lru_pvec, page))
+			if (!pagevec_add(&lru_pvec, page)) {
+				cond_resched();
 				__pagevec_lru_add(&lru_pvec);
+			}
 		} else {
 			page_cache_release(page);
 		}

--

^ permalink raw reply	[flat|nested] 30+ messages in thread

* [PATCH 08/32] readahead: {MIN,MAX}_RA_PAGES
       [not found] ` <20060527155129.653903854@localhost.localdomain>
@ 2006-05-27 15:48   ` Wu Fengguang
  0 siblings, 0 replies; 30+ messages in thread
From: Wu Fengguang @ 2006-05-27 15:48 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, Wu Fengguang

[-- Attachment #1: readahead-macros-min-max-rapages.patch --]
[-- Type: text/plain, Size: 1569 bytes --]

Define two convenient macros for read-ahead:
	- MAX_RA_PAGES: rounded down counterpart of VM_MAX_READAHEAD
	- MIN_RA_PAGES: rounded _up_ counterpart of VM_MIN_READAHEAD

Note that the rounded up MIN_RA_PAGES will work flawlessly with _large_
page sizes like 64k.

Signed-off-by: Wu Fengguang <wfg@mail.ustc.edu.cn>
---

 mm/readahead.c |   12 ++++++++++--
 1 files changed, 10 insertions(+), 2 deletions(-)

--- linux-2.6.17-rc4-mm3.orig/mm/readahead.c
+++ linux-2.6.17-rc4-mm3/mm/readahead.c
@@ -17,13 +17,21 @@
 #include <linux/backing-dev.h>
 #include <linux/pagevec.h>
 
+/*
+ * Convienent macros for min/max read-ahead pages.
+ * Note that MAX_RA_PAGES is rounded down, while MIN_RA_PAGES is rounded up.
+ * The latter is necessary for systems with large page size(i.e. 64k).
+ */
+#define MAX_RA_PAGES	(VM_MAX_READAHEAD*1024 / PAGE_CACHE_SIZE)
+#define MIN_RA_PAGES	DIV_ROUND_UP(VM_MIN_READAHEAD*1024, PAGE_CACHE_SIZE)
+
 void default_unplug_io_fn(struct backing_dev_info *bdi, struct page *page)
 {
 }
 EXPORT_SYMBOL(default_unplug_io_fn);
 
 struct backing_dev_info default_backing_dev_info = {
-	.ra_pages	= (VM_MAX_READAHEAD * 1024) / PAGE_CACHE_SIZE,
+	.ra_pages	= MAX_RA_PAGES,
 	.state		= 0,
 	.capabilities	= BDI_CAP_MAP_COPY,
 	.unplug_io_fn	= default_unplug_io_fn,
@@ -52,7 +60,7 @@ static inline unsigned long get_max_read
 
 static inline unsigned long get_min_readahead(struct file_ra_state *ra)
 {
-	return (VM_MIN_READAHEAD * 1024) / PAGE_CACHE_SIZE;
+	return MIN_RA_PAGES;
 }
 
 static inline void reset_ahead_window(struct file_ra_state *ra)

--

^ permalink raw reply	[flat|nested] 30+ messages in thread

* [PATCH 09/32] readahead: events accounting
       [not found] ` <20060527155130.013773601@localhost.localdomain>
@ 2006-05-27 15:48   ` Wu Fengguang
  0 siblings, 0 replies; 30+ messages in thread
From: Wu Fengguang @ 2006-05-27 15:48 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, Wu Fengguang, J?rn Engel, Ingo Oeser

[-- Attachment #1: readahead-events-accounting.patch --]
[-- Type: text/plain, Size: 10414 bytes --]

A debugfs file named `readahead/events' is created according to advises from
J?rn Engel, Andrew Morton and Ingo Oeser.

It reveals various read-ahead activities/events, and is vital to the testing.

---------------------------
If you are experiencing performance problems, or want to help improve the
read-ahead logic, please send me the debug data. Thanks.

- Preparations

## First compile kernel with CONFIG_DEBUG_READAHEAD
mkdir /debug
mount -t debug none /debug

- For each session with distinct access pattern

echo > /debug/readahead/events # reset the counters
# echo > /var/log/kern.log # you may want to backup it first
# echo 3 > /debug/readahead/debug_level # show verbose printk traces
## do one benchmark/task
# echo 1 > /debug/readahead/debug_level # revert to normal value
cp /debug/readahead/events readahead-events-`date +'%F_%R'`
# bzip2 -c /var/log/kern.log > kern.log-`date +'%F_%R'`.bz2

The commented out commands can uncover more detailed file accesses,
which are useful sometimes. Note that the log file can grow huge!

Signed-off-by: Wu Fengguang <wfg@mail.ustc.edu.cn>
---

 mm/readahead.c |  292 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++-
 1 files changed, 291 insertions(+), 1 deletion(-)

--- linux-2.6.17-rc4-mm3.orig/mm/readahead.c
+++ linux-2.6.17-rc4-mm3/mm/readahead.c
@@ -25,6 +25,69 @@
 #define MAX_RA_PAGES	(VM_MAX_READAHEAD*1024 / PAGE_CACHE_SIZE)
 #define MIN_RA_PAGES	DIV_ROUND_UP(VM_MIN_READAHEAD*1024, PAGE_CACHE_SIZE)
 
+/*
+ * Detailed classification of read-ahead behaviors.
+ */
+#define RA_CLASS_SHIFT 4
+#define RA_CLASS_MASK  ((1 << RA_CLASS_SHIFT) - 1)
+enum ra_class {
+	RA_CLASS_ALL,
+	RA_CLASS_INITIAL,
+	RA_CLASS_STATE,
+	RA_CLASS_CONTEXT,
+	RA_CLASS_CONTEXT_AGGRESSIVE,
+	RA_CLASS_BACKWARD,
+	RA_CLASS_THRASHING,
+	RA_CLASS_SEEK,
+	RA_CLASS_NONE,
+	RA_CLASS_COUNT
+};
+
+/* Read-ahead events to be accounted. */
+enum ra_event {
+	RA_EVENT_CACHE_MISS,		/* read cache misses */
+	RA_EVENT_RANDOM_READ,		/* random reads */
+	RA_EVENT_IO_CONGESTION,		/* i/o congestion */
+	RA_EVENT_IO_CACHE_HIT,		/* canceled i/o due to cache hit */
+	RA_EVENT_IO_BLOCK,		/* wait for i/o completion */
+
+	RA_EVENT_READAHEAD,		/* read-ahead issued */
+	RA_EVENT_READAHEAD_HIT,		/* read-ahead page hit */
+	RA_EVENT_LOOKAHEAD,		/* look-ahead issued */
+	RA_EVENT_LOOKAHEAD_HIT,		/* look-ahead mark hit */
+	RA_EVENT_LOOKAHEAD_NOACTION,	/* look-ahead mark ignored */
+	RA_EVENT_READAHEAD_MMAP,	/* read-ahead for mmap access */
+	RA_EVENT_READAHEAD_EOF,		/* read-ahead reaches EOF */
+	RA_EVENT_READAHEAD_SHRINK,	/* ra_size falls under previous la_size */
+	RA_EVENT_READAHEAD_THRASHING,	/* read-ahead thrashing happened */
+	RA_EVENT_READAHEAD_MUTILATE,	/* read-ahead mutilated by imbalanced aging */
+	RA_EVENT_READAHEAD_RESCUE,	/* read-ahead rescued */
+
+	RA_EVENT_READAHEAD_CUBE,
+	RA_EVENT_COUNT
+};
+
+#ifdef CONFIG_DEBUG_READAHEAD
+u32 initial_ra_hit;
+u32 initial_ra_miss;
+u32 debug_level = 1;
+u32 disable_stateful_method = 0;
+static const char * const ra_class_name[];
+static void ra_account(struct file_ra_state *ra, enum ra_event e, int pages);
+#  define debug_inc(var)		do { var++; } while (0)
+#  define debug_option(o)		(o)
+#else
+#  define ra_account(ra, e, pages)	do { } while (0)
+#  define debug_inc(var)		do { } while (0)
+#  define debug_option(o)		(0)
+#  define debug_level 			(0)
+#endif /* CONFIG_DEBUG_READAHEAD */
+
+#define dprintk(args...) \
+	do { if (debug_level >= 2) printk(KERN_DEBUG args); } while(0)
+#define ddprintk(args...) \
+	do { if (debug_level >= 3) printk(KERN_DEBUG args); } while(0)
+
 void default_unplug_io_fn(struct backing_dev_info *bdi, struct page *page)
 {
 }
@@ -365,6 +428,9 @@ int force_page_cache_readahead(struct ad
 		offset += this_chunk;
 		nr_to_read -= this_chunk;
 	}
+
+	ra_account(NULL, RA_EVENT_READAHEAD, ret);
+
 	return ret;
 }
 
@@ -400,10 +466,16 @@ static inline int check_ra_success(struc
 int do_page_cache_readahead(struct address_space *mapping, struct file *filp,
 			pgoff_t offset, unsigned long nr_to_read)
 {
+	unsigned long ret;
+
 	if (bdi_read_congested(mapping->backing_dev_info))
 		return -1;
 
-	return __do_page_cache_readahead(mapping, filp, offset, nr_to_read, 0);
+	ret = __do_page_cache_readahead(mapping, filp, offset, nr_to_read, 0);
+
+	ra_account(NULL, RA_EVENT_READAHEAD, ret);
+
+	return ret;
 }
 
 /*
@@ -425,6 +497,10 @@ blockable_page_cache_readahead(struct ad
 
 	actual = __do_page_cache_readahead(mapping, filp, offset, nr_to_read, 0);
 
+	ra_account(NULL, RA_EVENT_READAHEAD, actual);
+	dprintk("blockable-readahead(ino=%lu, ra=%lu+%lu) = %d\n",
+			mapping->host->i_ino, offset, nr_to_read, actual);
+
 	return check_ra_success(ra, nr_to_read, actual);
 }
 
@@ -604,3 +680,217 @@ unsigned long max_sane_readahead(unsigne
 	__get_zone_counts(&active, &inactive, &free, NODE_DATA(numa_node_id()));
 	return min(nr, (inactive + free) / 2);
 }
+
+/*
+ * Read-ahead events accounting.
+ */
+#ifdef CONFIG_DEBUG_READAHEAD
+
+#include <linux/init.h>
+#include <linux/jiffies.h>
+#include <linux/debugfs.h>
+#include <linux/seq_file.h>
+
+static const char * const ra_class_name[] = {
+	"total",
+	"initial",
+	"state",
+	"context",
+	"contexta",
+	"backward",
+	"onthrash",
+	"onseek",
+	"none"
+};
+
+static const char * const ra_event_name[] = {
+	"cache_miss",
+	"random_read",
+	"io_congestion",
+	"io_cache_hit",
+	"io_block",
+	"readahead",
+	"readahead_hit",
+	"lookahead",
+	"lookahead_hit",
+	"lookahead_ignore",
+	"readahead_mmap",
+	"readahead_eof",
+	"readahead_shrink",
+	"readahead_thrash",
+	"readahead_mutilt",
+	"readahead_rescue"
+};
+
+static unsigned long ra_events[RA_CLASS_COUNT][RA_EVENT_COUNT][2];
+
+static void ra_account(struct file_ra_state *ra, enum ra_event e, int pages)
+{
+	enum ra_class c;
+
+	if (!debug_level)
+		return;
+
+	if (e == RA_EVENT_READAHEAD_HIT && pages < 0) {
+		c = (ra->flags >> RA_CLASS_SHIFT) & RA_CLASS_MASK;
+		pages = -pages;
+	} else if (ra)
+		c = ra->flags & RA_CLASS_MASK;
+	else
+		c = RA_CLASS_NONE;
+
+	if (!c)
+		c = RA_CLASS_NONE;
+
+	ra_events[c][e][0] += 1;
+	ra_events[c][e][1] += pages;
+
+	if (e == RA_EVENT_READAHEAD)
+		ra_events[c][RA_EVENT_READAHEAD_CUBE][1] += pages * pages;
+}
+
+static int ra_events_show(struct seq_file *s, void *_)
+{
+	int i;
+	int c;
+	int e;
+	static const char event_fmt[] = "%-16s";
+	static const char class_fmt[] = "%10s";
+	static const char item_fmt[] = "%10lu";
+	static const char percent_format[] = "%9lu%%";
+	static const char * const table_name[] = {
+		"[table requests]",
+		"[table pages]",
+		"[table summary]"};
+
+	for (i = 0; i <= 1; i++) {
+		for (e = 0; e < RA_EVENT_COUNT; e++) {
+			ra_events[RA_CLASS_ALL][e][i] = 0;
+			for (c = RA_CLASS_INITIAL; c < RA_CLASS_NONE; c++)
+				ra_events[RA_CLASS_ALL][e][i] += ra_events[c][e][i];
+		}
+
+		seq_printf(s, event_fmt, table_name[i]);
+		for (c = 0; c < RA_CLASS_COUNT; c++)
+			seq_printf(s, class_fmt, ra_class_name[c]);
+		seq_puts(s, "\n");
+
+		for (e = 0; e < RA_EVENT_COUNT; e++) {
+			if (e == RA_EVENT_READAHEAD_CUBE)
+				continue;
+			if (e == RA_EVENT_READAHEAD_HIT && i == 0)
+				continue;
+			if (e == RA_EVENT_IO_BLOCK && i == 1)
+				continue;
+
+			seq_printf(s, event_fmt, ra_event_name[e]);
+			for (c = 0; c < RA_CLASS_COUNT; c++)
+				seq_printf(s, item_fmt, ra_events[c][e][i]);
+			seq_puts(s, "\n");
+		}
+		seq_puts(s, "\n");
+	}
+
+	seq_printf(s, event_fmt, table_name[2]);
+	for (c = 0; c < RA_CLASS_COUNT; c++)
+		seq_printf(s, class_fmt, ra_class_name[c]);
+	seq_puts(s, "\n");
+
+	seq_printf(s, event_fmt, "random_rate");
+	for (c = 0; c < RA_CLASS_COUNT; c++)
+		seq_printf(s, percent_format,
+			(ra_events[c][RA_EVENT_RANDOM_READ][0] * 100) /
+			((ra_events[c][RA_EVENT_RANDOM_READ][0] +
+			  ra_events[c][RA_EVENT_READAHEAD][0]) | 1));
+	seq_puts(s, "\n");
+
+	seq_printf(s, event_fmt, "ra_hit_rate");
+	for (c = 0; c < RA_CLASS_COUNT; c++)
+		seq_printf(s, percent_format,
+			(ra_events[c][RA_EVENT_READAHEAD_HIT][1] * 100) /
+			(ra_events[c][RA_EVENT_READAHEAD][1] | 1));
+	seq_puts(s, "\n");
+
+	seq_printf(s, event_fmt, "la_hit_rate");
+	for (c = 0; c < RA_CLASS_COUNT; c++)
+		seq_printf(s, percent_format,
+			(ra_events[c][RA_EVENT_LOOKAHEAD_HIT][0] * 100) /
+			(ra_events[c][RA_EVENT_LOOKAHEAD][0] | 1));
+	seq_puts(s, "\n");
+
+	seq_printf(s, event_fmt, "var_ra_size");
+	for (c = 0; c < RA_CLASS_COUNT; c++)
+		seq_printf(s, item_fmt,
+			(ra_events[c][RA_EVENT_READAHEAD_CUBE][1] -
+			 ra_events[c][RA_EVENT_READAHEAD][1] *
+			(ra_events[c][RA_EVENT_READAHEAD][1] /
+			(ra_events[c][RA_EVENT_READAHEAD][0] | 1))) /
+			(ra_events[c][RA_EVENT_READAHEAD][0] | 1));
+	seq_puts(s, "\n");
+
+	seq_printf(s, event_fmt, "avg_ra_size");
+	for (c = 0; c < RA_CLASS_COUNT; c++)
+		seq_printf(s, item_fmt,
+			(ra_events[c][RA_EVENT_READAHEAD][1] +
+			 ra_events[c][RA_EVENT_READAHEAD][0] / 2) /
+			(ra_events[c][RA_EVENT_READAHEAD][0] | 1));
+	seq_puts(s, "\n");
+
+	seq_printf(s, event_fmt, "avg_la_size");
+	for (c = 0; c < RA_CLASS_COUNT; c++)
+		seq_printf(s, item_fmt,
+			(ra_events[c][RA_EVENT_LOOKAHEAD][1] +
+			 ra_events[c][RA_EVENT_LOOKAHEAD][0] / 2) /
+			(ra_events[c][RA_EVENT_LOOKAHEAD][0] | 1));
+	seq_puts(s, "\n");
+
+	return 0;
+}
+
+static int ra_events_open(struct inode *inode, struct file *file)
+{
+	return single_open(file, ra_events_show, NULL);
+}
+
+static ssize_t ra_events_write(struct file *file, const char __user *buf,
+						size_t size, loff_t *offset)
+{
+	memset(ra_events, 0, sizeof(ra_events));
+	return 1;
+}
+
+struct file_operations ra_events_fops = {
+	.owner		= THIS_MODULE,
+	.open		= ra_events_open,
+	.write		= ra_events_write,
+	.read		= seq_read,
+	.llseek		= seq_lseek,
+	.release	= single_release,
+};
+
+#define READAHEAD_DEBUGFS_ENTRY_U32(var) \
+	debugfs_create_u32(__stringify(var), 0644, root, &var)
+
+#define READAHEAD_DEBUGFS_ENTRY_BOOL(var) \
+	debugfs_create_bool(__stringify(var), 0644, root, &var)
+
+static int __init readahead_init(void)
+{
+	struct dentry *root;
+
+	root = debugfs_create_dir("readahead", NULL);
+
+	debugfs_create_file("events", 0644, root, NULL, &ra_events_fops);
+
+	READAHEAD_DEBUGFS_ENTRY_U32(initial_ra_hit);
+	READAHEAD_DEBUGFS_ENTRY_U32(initial_ra_miss);
+
+	READAHEAD_DEBUGFS_ENTRY_U32(debug_level);
+	READAHEAD_DEBUGFS_ENTRY_BOOL(disable_stateful_method);
+
+	return 0;
+}
+
+module_init(readahead_init)
+
+#endif /* CONFIG_DEBUG_READAHEAD */

--

^ permalink raw reply	[flat|nested] 30+ messages in thread

* [PATCH 10/32] readahead: rescue_pages()
       [not found] ` <20060527155130.538411854@localhost.localdomain>
@ 2006-05-27 15:48   ` Wu Fengguang
  0 siblings, 0 replies; 30+ messages in thread
From: Wu Fengguang @ 2006-05-27 15:48 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, Wu Fengguang

[-- Attachment #1: readahead-rescue-pages.patch --]
[-- Type: text/plain, Size: 3165 bytes --]

Introduce function rescue_pages() to protect pages in danger of thrashing.

Signed-off-by: Wu Fengguang <wfg@mail.ustc.edu.cn>
---

 mm/readahead.c |   87 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 files changed, 87 insertions(+)

--- linux-2.6.17-rc4-mm3.orig/mm/readahead.c
+++ linux-2.6.17-rc4-mm3/mm/readahead.c
@@ -682,6 +682,93 @@ unsigned long max_sane_readahead(unsigne
 }
 
 /*
+ * Adaptive read-ahead.
+ *
+ * Good read patterns are compact both in space and time. The read-ahead logic
+ * tries to grant larger read-ahead size to better readers under the constraint
+ * of system memory and load pressure.
+ *
+ * It employs two methods to estimate the max thrashing safe read-ahead size:
+ *   1. state based   - the default one
+ *   2. context based - the failsafe one
+ * The integration of the dual methods has the merit of being agile and robust.
+ * It makes the overall design clean: special cases are handled in general by
+ * the stateless method, leaving the stateful one simple and fast.
+ *
+ * To improve throughput and decrease read delay, the logic 'looks ahead'.
+ * In most read-ahead chunks, one page will be selected and tagged with
+ * PG_readahead. Later when the page with PG_readahead is read, the logic
+ * will be notified to submit the next read-ahead chunk in advance.
+ *
+ *                 a read-ahead chunk
+ *    +-----------------------------------------+
+ *    |       # PG_readahead                    |
+ *    +-----------------------------------------+
+ *            ^ When this page is read, notify me for the next read-ahead.
+ *
+ */
+
+#ifdef CONFIG_ADAPTIVE_READAHEAD
+
+/*
+ * Move pages in danger (of thrashing) to the head of inactive_list.
+ * Not expected to happen frequently.
+ */
+static unsigned long rescue_pages(struct page *page, unsigned long nr_pages)
+{
+	int pgrescue = 0;
+	pgoff_t index = page_index(page);
+	struct address_space *mapping = page_mapping(page);
+	struct page *grabbed_page = NULL;
+	struct zone *zone;
+
+	dprintk("rescue_pages(ino=%lu, index=%lu nr=%lu)\n",
+			mapping->host->i_ino, index, nr_pages);
+
+	for(;;) {
+		zone = page_zone(page);
+		spin_lock_irq(&zone->lru_lock);
+
+		if (!PageLRU(page))
+			goto out_unlock;
+
+		while (page_mapping(page) == mapping &&
+				page_index(page) == index) {
+			struct page *the_page = page;
+			page = list_entry((page)->lru.prev, struct page, lru);
+			if (!PageActive(the_page) &&
+					!PageLocked(the_page) &&
+					page_count(the_page) == 1) {
+				list_move(&the_page->lru, &zone->inactive_list);
+				pgrescue++;
+			}
+			index++;
+			if (!--nr_pages)
+				goto out_unlock;
+		}
+
+		spin_unlock_irq(&zone->lru_lock);
+		cond_resched();
+
+		if (grabbed_page)
+			page_cache_release(grabbed_page);
+		grabbed_page = page = find_get_page(mapping, index);
+		if (!page)
+			goto out;
+	}
+
+out_unlock:
+	spin_unlock_irq(&zone->lru_lock);
+out:
+	if (grabbed_page)
+		page_cache_release(grabbed_page);
+	ra_account(NULL, RA_EVENT_READAHEAD_RESCUE, pgrescue);
+	return nr_pages;
+}
+
+#endif /* CONFIG_ADAPTIVE_READAHEAD */
+
+/*
  * Read-ahead events accounting.
  */
 #ifdef CONFIG_DEBUG_READAHEAD

--

^ permalink raw reply	[flat|nested] 30+ messages in thread

* [PATCH 11/32] readahead: sysctl parameters
       [not found] ` <20060527155131.200177171@localhost.localdomain>
@ 2006-05-27 15:49   ` Wu Fengguang
  0 siblings, 0 replies; 30+ messages in thread
From: Wu Fengguang @ 2006-05-27 15:49 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, Wu Fengguang

[-- Attachment #1: readahead-parameter-sysctl-variables.patch --]
[-- Type: text/plain, Size: 5746 bytes --]

Add new sysctl entries in /proc/sys/vm:

- readahead_ratio = 50
	i.e. set read-ahead size to <=(readahead_ratio%) thrashing threshold
- readahead_hit_rate = 1
	i.e. read-ahead hit ratio >=(1/readahead_hit_rate) is deemed ok

readahead_ratio also provides a way to select read-ahead logic at runtime:

	condition			    action
==========================================================================
readahead_ratio == 0		disable read-ahead
readahead_ratio <= 9		select the (old) stock read-ahead logic
readahead_ratio >= 10		select the (new) adaptive read-ahead logic

Signed-off-by: Wu Fengguang <wfg@mail.ustc.edu.cn>
---

 Documentation/sysctl/vm.txt |   37 +++++++++++++++++++++++++++++++++++++
 include/linux/mm.h          |   11 +++++++++++
 include/linux/sysctl.h      |    2 ++
 kernel/sysctl.c             |   28 ++++++++++++++++++++++++++++
 mm/readahead.c              |   17 +++++++++++++++++
 5 files changed, 95 insertions(+)

--- linux-2.6.17-rc4-mm3.orig/include/linux/mm.h
+++ linux-2.6.17-rc4-mm3/include/linux/mm.h
@@ -1029,6 +1029,17 @@ void handle_ra_miss(struct address_space
 		    struct file_ra_state *ra, pgoff_t offset);
 unsigned long max_sane_readahead(unsigned long nr);
 
+#ifdef CONFIG_ADAPTIVE_READAHEAD
+extern int readahead_ratio;
+#else
+#define readahead_ratio 1
+#endif /* CONFIG_ADAPTIVE_READAHEAD */
+
+static inline int prefer_adaptive_readahead(void)
+{
+	return readahead_ratio >= 10;
+}
+
 /* Do stack extension */
 extern int expand_stack(struct vm_area_struct *vma, unsigned long address);
 #ifdef CONFIG_IA64
--- linux-2.6.17-rc4-mm3.orig/mm/readahead.c
+++ linux-2.6.17-rc4-mm3/mm/readahead.c
@@ -26,6 +26,23 @@
 #define MIN_RA_PAGES	DIV_ROUND_UP(VM_MIN_READAHEAD*1024, PAGE_CACHE_SIZE)
 
 /*
+ * Adaptive read-ahead parameters.
+ */
+
+/* In laptop mode, poll delayed look-ahead on every ## pages read. */
+#define LAPTOP_POLL_INTERVAL 16
+
+/* Set look-ahead size to 1/# of the thrashing-threshold. */
+#define LOOKAHEAD_RATIO 8
+
+/* Set read-ahead size to ##% of the thrashing-threshold. */
+int readahead_ratio = 50;
+EXPORT_SYMBOL_GPL(readahead_ratio);
+
+/* Readahead as long as cache hit ratio keeps above 1/##. */
+int readahead_hit_rate = 1;
+
+/*
  * Detailed classification of read-ahead behaviors.
  */
 #define RA_CLASS_SHIFT 4
--- linux-2.6.17-rc4-mm3.orig/include/linux/sysctl.h
+++ linux-2.6.17-rc4-mm3/include/linux/sysctl.h
@@ -194,6 +194,8 @@ enum
 	VM_ZONE_RECLAIM_INTERVAL=32, /* time period to wait after reclaim failure */
 	VM_PANIC_ON_OOM=33,	/* panic at out-of-memory */
 	VM_SWAP_PREFETCH=34,	/* swap prefetch */
+	VM_READAHEAD_RATIO=35,	/* percent of read-ahead size to thrashing-threshold */
+	VM_READAHEAD_HIT_RATE=36, /* one accessed page legitimizes so many read-ahead pages */
 };
 
 /* CTL_NET names: */
--- linux-2.6.17-rc4-mm3.orig/kernel/sysctl.c
+++ linux-2.6.17-rc4-mm3/kernel/sysctl.c
@@ -77,6 +77,12 @@ extern int percpu_pagelist_fraction;
 extern int compat_log;
 extern int print_fatal_signals;
 
+#if defined(CONFIG_ADAPTIVE_READAHEAD)
+extern int readahead_ratio;
+extern int readahead_hit_rate;
+static int one = 1;
+#endif
+
 #if defined(CONFIG_X86_LOCAL_APIC) && defined(CONFIG_X86)
 int unknown_nmi_panic;
 int nmi_watchdog_enabled;
@@ -987,6 +993,28 @@ static ctl_table vm_table[] = {
 		.proc_handler	= &proc_dointvec,
 	},
 #endif
+#ifdef CONFIG_ADAPTIVE_READAHEAD
+	{
+		.ctl_name	= VM_READAHEAD_RATIO,
+		.procname	= "readahead_ratio",
+		.data		= &readahead_ratio,
+		.maxlen		= sizeof(readahead_ratio),
+		.mode		= 0644,
+		.proc_handler	= &proc_dointvec,
+		.strategy	= &sysctl_intvec,
+		.extra1		= &zero,
+	},
+	{
+		.ctl_name	= VM_READAHEAD_HIT_RATE,
+		.procname	= "readahead_hit_rate",
+		.data		= &readahead_hit_rate,
+		.maxlen		= sizeof(readahead_hit_rate),
+		.mode		= 0644,
+		.proc_handler	= &proc_dointvec,
+		.strategy	= &sysctl_intvec,
+		.extra1		= &one,
+	},
+#endif
 	{ .ctl_name = 0 }
 };
 
--- linux-2.6.17-rc4-mm3.orig/Documentation/sysctl/vm.txt
+++ linux-2.6.17-rc4-mm3/Documentation/sysctl/vm.txt
@@ -31,6 +31,8 @@ Currently, these files are in /proc/sys/
 - zone_reclaim_interval
 - panic_on_oom
 - swap_prefetch
+- readahead_ratio
+- readahead_hit_rate
 
 ==============================================================
 
@@ -202,3 +204,38 @@ copying back pages from swap into the sw
 practice it can take many minutes before the vm is idle enough.
 
 The default value is 1.
+
+==============================================================
+
+readahead_ratio
+
+This limits readahead size to percent of the thrashing threshold.
+The thrashing threshold is dynamicly estimated from the _history_ read
+speed and system load, to deduce the _future_ readahead request size.
+
+Set it to a smaller value if you have not enough memory for all the
+concurrent readers, or the I/O loads fluctuate a lot. But if there's
+plenty of memory(>2MB per reader), a bigger value may help performance.
+
+readahead_ratio also selects the readahead logic:
+	VALUE	CODE PATH
+	-------------------------------------------
+	    0	disable readahead totally
+	  1-9	select the stock readahead logic
+	10-inf	select the adaptive readahead logic
+
+The default value is 50.  Reasonable values would be [50, 100].
+
+==============================================================
+
+readahead_hit_rate
+
+This is the max allowed value of (readahead-pages : accessed-pages).
+Useful only when (readahead_ratio >= 10). If the previous readahead
+request has bad hit rate, the kernel will be reluctant to do the next
+readahead.
+
+Larger values help catch more sparse access patterns. Be aware that
+readahead of the sparse patterns sacrifices memory for speed.
+
+The default value is 1.  It is recommended to keep the value below 16.

--

^ permalink raw reply	[flat|nested] 30+ messages in thread

* [PATCH 14/32] readahead: state based method - routines
       [not found] ` <20060527155132.649338979@localhost.localdomain>
@ 2006-05-27 15:49   ` Wu Fengguang
  0 siblings, 0 replies; 30+ messages in thread
From: Wu Fengguang @ 2006-05-27 15:49 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, Wu Fengguang

[-- Attachment #1: readahead-method-stateful-routines.patch --]
[-- Type: text/plain, Size: 9428 bytes --]

Extend struct file_ra_state to support the adaptive read-ahead logic.
Also define some helpers for it.

Signed-off-by: Wu Fengguang <wfg@mail.ustc.edu.cn>
---

 include/linux/fs.h |   74 +++++++++++++++++---
 mm/readahead.c     |  189 ++++++++++++++++++++++++++++++++++++++++++++++++++++-
 2 files changed, 251 insertions(+), 12 deletions(-)

--- linux-2.6.17-rc4-mm3.orig/include/linux/fs.h
+++ linux-2.6.17-rc4-mm3/include/linux/fs.h
@@ -613,21 +613,75 @@ struct fown_struct {
 
 /*
  * Track a single file's readahead state
+ *
+ * Diagram for the adaptive readahead logic:
+ *
+ *  |--------- old chunk ------->|-------------- new chunk -------------->|
+ *  +----------------------------+----------------------------------------+
+ *  |               #            |                  #                     |
+ *  +----------------------------+----------------------------------------+
+ *                  ^            ^                  ^                     ^
+ *  file_ra_state.la_index    .ra_index   .lookahead_index      .readahead_index
+ *
+ * Common used deduced sizes:
+ *                               |----------- readahead size ------------>|
+ *  +----------------------------+----------------------------------------+
+ *  |               #            |                  #                     |
+ *  +----------------------------+----------------------------------------+
+ *                  |------- invoke interval ------>|-- lookahead size -->|
  */
 struct file_ra_state {
-	unsigned long start;		/* Current window */
-	unsigned long size;
-	unsigned long flags;		/* ra flags RA_FLAG_xxx*/
-	unsigned long cache_hit;	/* cache hit count*/
-	unsigned long prev_page;	/* Cache last read() position */
-	unsigned long ahead_start;	/* Ahead window */
-	unsigned long ahead_size;
-	unsigned long ra_pages;		/* Maximum readahead window */
+	union {
+		struct { /* stock read-ahead */
+			unsigned long start;		/* Current window */
+			unsigned long size;
+			unsigned long ahead_start;	/* Ahead window */
+			unsigned long ahead_size;
+			unsigned long cache_hit;	/* cache hit count */
+		};
+#ifdef CONFIG_ADAPTIVE_READAHEAD
+		struct { /* adaptive read-ahead */
+			pgoff_t la_index;		/* old chunk */
+			pgoff_t ra_index;
+			pgoff_t lookahead_index;	/* new chunk */
+			pgoff_t readahead_index;
+
+			/*
+			 * Read-ahead hits.
+			 * 	i.e. # of distinct read-ahead pages accessed.
+			 *
+			 * What is a read-ahead sequence?
+			 * 	A collection of sequential read-ahead requests.
+			 * To put it simple:
+			 * 	Normally a seek starts a new sequence.
+			 */
+			u16	hit0;	/* for the current request */
+			u16	hit1;	/* for the current sequence */
+			u16	hit2;	/* for the previous sequence */
+			u16	hit3;	/* for the prev-prev sequence */
+
+			/*
+			 * Snapshot of the (node's) read-ahead aging value
+			 * on time of I/O submission.
+			 */
+			unsigned long age;
+		};
+#endif
+	};
+
+	/* mmap read-around */
 	unsigned long mmap_hit;		/* Cache hit stat for mmap accesses */
 	unsigned long mmap_miss;	/* Cache miss stat for mmap accesses */
+
+	unsigned long flags;	/* RA_FLAG_xxx | ra_class_old | ra_class_new */
+	unsigned long prev_page;	/* Cache last read() position */
+	unsigned long ra_pages;		/* Maximum readahead window */
 };
-#define RA_FLAG_MISS 0x01	/* a cache miss occured against this file */
-#define RA_FLAG_INCACHE 0x02	/* file is already in cache */
+#define RA_FLAG_MISS	(1UL<<31) /* a cache miss occured against this file */
+#define RA_FLAG_INCACHE	(1UL<<30) /* file is already in cache */
+#define RA_FLAG_MMAP		(1UL<<29) /* mmaped page access */
+#define RA_FLAG_NO_LOOKAHEAD	(1UL<<28) /* disable look-ahead */
+#define RA_FLAG_EOF		(1UL<<27) /* readahead hits EOF */
 
 struct file {
 	/*
--- linux-2.6.17-rc4-mm3.orig/mm/readahead.c
+++ linux-2.6.17-rc4-mm3/mm/readahead.c
@@ -817,6 +817,191 @@ static unsigned long node_readahead_agin
 }
 
 /*
+ * Some helpers for querying/building a read-ahead request.
+ *
+ * Diagram for some variable names used frequently:
+ *
+ *                                   |<------- la_size ------>|
+ *                  +-----------------------------------------+
+ *                  |                #                        |
+ *                  +-----------------------------------------+
+ *      ra_index -->|<---------------- ra_size -------------->|
+ *
+ */
+
+static enum ra_class ra_class_new(struct file_ra_state *ra)
+{
+	return ra->flags & RA_CLASS_MASK;
+}
+
+static inline enum ra_class ra_class_old(struct file_ra_state *ra)
+{
+	return (ra->flags >> RA_CLASS_SHIFT) & RA_CLASS_MASK;
+}
+
+static unsigned long ra_readahead_size(struct file_ra_state *ra)
+{
+	return ra->readahead_index - ra->ra_index;
+}
+
+static unsigned long ra_lookahead_size(struct file_ra_state *ra)
+{
+	return ra->readahead_index - ra->lookahead_index;
+}
+
+static unsigned long ra_invoke_interval(struct file_ra_state *ra)
+{
+	return ra->lookahead_index - ra->la_index;
+}
+
+/*
+ * The read-ahead is deemed success if cache-hit-rate >= 1/readahead_hit_rate.
+ */
+static int ra_cache_hit_ok(struct file_ra_state *ra)
+{
+	return ra->hit0 * readahead_hit_rate >=
+					(ra->lookahead_index - ra->la_index);
+}
+
+/*
+ * Check if @index falls in the @ra request.
+ */
+static int ra_has_index(struct file_ra_state *ra, pgoff_t index)
+{
+	if (index < ra->la_index || index >= ra->readahead_index)
+		return 0;
+
+	if (index >= ra->ra_index)
+		return 1;
+	else
+		return -1;
+}
+
+/*
+ * Which method is issuing this read-ahead?
+ */
+static void ra_set_class(struct file_ra_state *ra,
+				enum ra_class ra_class)
+{
+	unsigned long flags_mask;
+	unsigned long flags;
+	unsigned long old_ra_class;
+
+	flags_mask = ~(RA_CLASS_MASK | (RA_CLASS_MASK << RA_CLASS_SHIFT));
+	flags = ra->flags & flags_mask;
+
+	old_ra_class = ra_class_new(ra) << RA_CLASS_SHIFT;
+
+	ra->flags = flags | old_ra_class | ra_class;
+
+	/*
+	 * Add request-hit up to sequence-hit and reset the former.
+	 */
+	ra->hit1 += ra->hit0;
+	ra->hit0 = 0;
+
+	/*
+	 * Manage the read-ahead sequences' hit counts.
+	 * 	- the stateful method continues any existing sequence;
+	 * 	- all other methods starts a new one.
+	 */
+	if (ra_class != RA_CLASS_STATE) {
+		ra->hit3 = ra->hit2;
+		ra->hit2 = ra->hit1;
+		ra->hit1 = 0;
+	}
+}
+
+/*
+ * Where is the old read-ahead and look-ahead?
+ */
+static void ra_set_index(struct file_ra_state *ra,
+				pgoff_t la_index, pgoff_t ra_index)
+{
+	ra->la_index = la_index;
+	ra->ra_index = ra_index;
+}
+
+/*
+ * Where is the new read-ahead and look-ahead?
+ */
+static void ra_set_size(struct file_ra_state *ra,
+				unsigned long ra_size, unsigned long la_size)
+{
+	ra->readahead_index = ra->ra_index + ra_size;
+	ra->lookahead_index = ra->readahead_index - la_size;
+}
+
+/*
+ * Submit IO for the read-ahead request in file_ra_state.
+ */
+static int ra_dispatch(struct file_ra_state *ra,
+			struct address_space *mapping, struct file *filp)
+{
+	unsigned long ra_size;
+	unsigned long la_size;
+	pgoff_t eof_index;
+	int actual;
+
+	eof_index = /* it's a past-the-end index! */
+		DIV_ROUND_UP(i_size_read(mapping->host), PAGE_CACHE_SIZE);
+
+	if (unlikely(ra->ra_index >= eof_index))
+		return 0;
+
+	/*
+	 * Snap to EOF, if the request
+	 * 	- crossed the EOF boundary;
+	 * 	- is close to EOF(explained below).
+	 *
+	 * Imagine a file sized 18 pages, and we dicided to read-ahead the
+	 * first 16 pages. It is highly possible that in the near future we
+	 * will have to do another read-ahead for the remaining 2 pages,
+	 * which is an unfavorable small I/O.
+	 *
+	 * So we prefer to take a bit risk to enlarge the current read-ahead,
+	 * to eliminate possible future small I/O.
+	 */
+	if (ra->readahead_index + ra_readahead_size(ra)/4 > eof_index) {
+		ra->readahead_index = eof_index;
+		if (ra->lookahead_index > eof_index)
+			ra->lookahead_index = eof_index;
+		ra->flags |= RA_FLAG_EOF;
+	}
+
+	/* Disable look-ahead for loopback file. */
+	if (unlikely(ra->flags & RA_FLAG_NO_LOOKAHEAD))
+		ra->lookahead_index = ra->readahead_index;
+
+	/* Take down the current read-ahead aging value. */
+	ra->age = node_readahead_aging();
+
+	ra_size = ra_readahead_size(ra);
+	la_size = ra_lookahead_size(ra);
+	actual = __do_page_cache_readahead(mapping, filp,
+					ra->ra_index, ra_size, la_size);
+
+#ifdef CONFIG_DEBUG_READAHEAD
+	if (ra->flags & RA_FLAG_MMAP)
+		ra_account(ra, RA_EVENT_READAHEAD_MMAP, actual);
+	if (ra->readahead_index == eof_index)
+		ra_account(ra, RA_EVENT_READAHEAD_EOF, actual);
+	if (la_size)
+		ra_account(ra, RA_EVENT_LOOKAHEAD, la_size);
+	if (ra_size > actual)
+		ra_account(ra, RA_EVENT_IO_CACHE_HIT, ra_size - actual);
+	ra_account(ra, RA_EVENT_READAHEAD, actual);
+
+	dprintk("readahead-%s(ino=%lu, index=%lu, ra=%lu+%lu-%lu) = %d\n",
+			ra_class_name[ra_class_new(ra)],
+			mapping->host->i_ino, ra->la_index,
+			ra->ra_index, ra_size, la_size, actual);
+#endif /* CONFIG_DEBUG_READAHEAD */
+
+	return actual;
+}
+
+/*
  * ra_min is mainly determined by the size of cache memory. Reasonable?
  *
  * Table of concrete numbers for 4KB page size:
@@ -888,10 +1073,10 @@ static void ra_account(struct file_ra_st
 		return;
 
 	if (e == RA_EVENT_READAHEAD_HIT && pages < 0) {
-		c = (ra->flags >> RA_CLASS_SHIFT) & RA_CLASS_MASK;
+		c = ra_class_old(ra);
 		pages = -pages;
 	} else if (ra)
-		c = ra->flags & RA_CLASS_MASK;
+		c = ra_class_new(ra);
 	else
 		c = RA_CLASS_NONE;
 

--

^ permalink raw reply	[flat|nested] 30+ messages in thread

* [PATCH 15/32] readahead: state based method
       [not found] ` <20060527155133.216888332@localhost.localdomain>
@ 2006-05-27 15:49   ` Wu Fengguang
  0 siblings, 0 replies; 30+ messages in thread
From: Wu Fengguang @ 2006-05-27 15:49 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, Wu Fengguang

[-- Attachment #1: readahead-method-stateful.patch --]
[-- Type: text/plain, Size: 6174 bytes --]

This is the fast code path of adaptive read-ahead.

MAJOR STEPS
===========

        - estimate a thrashing safe ra_size;
        - assemble the next read-ahead request in file_ra_state;
        - submit it.


THE REFERENCE MODEL
===================

        1. inactive list has constant length and page flow speed
        2. the observed stream receives a steady flow of read requests
        3. no page activation, so that the inactive list forms a pipe

With that we get the picture showed below.

|<------------------------- constant length ------------------------->|
<<<<<<<<<<<<<<<<<<<<<<<<< steady flow of pages <<<<<<<<<<<<<<<<<<<<<<<<
+---------------------------------------------------------------------+
|tail                        inactive list                        head|
|   =======                  ==========----                           |
|   chunk A(stale pages)     chunk B(stale + fresh pages)             |
+---------------------------------------------------------------------+


REAL WORLD ISSUES
=================

Real world workloads will always have fluctuations (violation of assumption
1 and 2). To counteract it, a tunable parameter readahead_ratio is introduced
to make the estimation conservative enough. Violation of assumption 3 will
not lead to thrashing, it is there just for simplicity of discussion.

Signed-off-by: Wu Fengguang <wfg@mail.ustc.edu.cn>
---

 mm/readahead.c |  147 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 files changed, 147 insertions(+)

--- linux-2.6.17-rc4-mm3.orig/mm/readahead.c
+++ linux-2.6.17-rc4-mm3/mm/readahead.c
@@ -1002,6 +1002,153 @@ static int ra_dispatch(struct file_ra_st
 }
 
 /*
+ * Deduce the read-ahead/look-ahead size from primitive values.
+ *
+ * Input:
+ *	- @ra_size stores the estimated thrashing-threshold.
+ *	- @la_size stores the look-ahead size of previous request.
+ */
+static int adjust_rala(unsigned long ra_max,
+				unsigned long *ra_size, unsigned long *la_size)
+{
+	unsigned long stream_shift = *la_size;
+
+	/*
+	 * Substract the old look-ahead to get real safe size for the next
+	 * read-ahead request.
+	 */
+	if (*ra_size > *la_size)
+		*ra_size -= *la_size;
+	else {
+		ra_account(NULL, RA_EVENT_READAHEAD_SHRINK, *ra_size);
+		return 0;
+	}
+
+	/*
+	 * Set new la_size according to the (still large) ra_size.
+	 */
+	*la_size = *ra_size / LOOKAHEAD_RATIO;
+
+	/*
+	 * Apply upper limits.
+	 */
+	if (*ra_size > ra_max)
+		*ra_size = ra_max;
+	if (*la_size > *ra_size)
+		*la_size = *ra_size;
+
+	/*
+	 * Make sure stream_shift is not too small.
+	 * (So that the next global_shift will not be too small.)
+	 */
+	stream_shift += (*ra_size - *la_size);
+	if (stream_shift < *ra_size / 4)
+		*la_size -= (*ra_size / 4 - stream_shift);
+
+	return 1;
+}
+
+/*
+ * The function estimates two values:
+ * 1. thrashing-threshold for the current stream
+ *    It is returned to make the next read-ahead request.
+ * 2. the remained safe space for the current chunk
+ *    It will be checked to ensure that the current chunk is safe.
+ *
+ * The computation will be pretty accurate under heavy load, and will vibrate
+ * more on light load(with small global_shift), so the grow speed of ra_size
+ * must be limited, and a moderate large stream_shift must be insured.
+ *
+ * This figure illustrates the formula used in the function:
+ * While the stream reads stream_shift pages inside the chunks,
+ * the chunks are shifted global_shift pages inside inactive_list.
+ *
+ *      chunk A                    chunk B
+ *                          |<=============== global_shift ================|
+ *  +-------------+         +-------------------+                          |
+ *  |       #     |         |           #       |            inactive_list |
+ *  +-------------+         +-------------------+                     head |
+ *          |---->|         |---------->|
+ *             |                  |
+ *             +-- stream_shift --+
+ */
+static unsigned long compute_thrashing_threshold(struct file_ra_state *ra,
+							unsigned long *remain)
+{
+	unsigned long global_size;
+	unsigned long global_shift;
+	unsigned long stream_shift;
+	unsigned long ra_size;
+	uint64_t ll;
+
+	global_size = nr_free_inactive_pages_node(numa_node_id());
+	global_shift = node_readahead_aging() - ra->age;
+	global_shift |= 1UL;
+	stream_shift = ra_invoke_interval(ra);
+
+	/* future safe space */
+	ll = (uint64_t) stream_shift * (global_size >> 9) * readahead_ratio * 5;
+	do_div(ll, global_shift);
+	ra_size = ll;
+
+	/* remained safe space */
+	if (global_size > global_shift) {
+		ll = (uint64_t) stream_shift * (global_size - global_shift);
+		do_div(ll, global_shift);
+		*remain = ll;
+	} else
+		*remain = 0;
+
+	ddprintk("compute_thrashing_threshold: "
+			"at %lu ra %lu=%lu*%lu/%lu, remain %lu for %lu\n",
+			ra->readahead_index, ra_size,
+			stream_shift, global_size, global_shift,
+			*remain, ra_lookahead_size(ra));
+
+	return ra_size;
+}
+
+/*
+ * Main function for file_ra_state based read-ahead.
+ */
+static unsigned long
+state_based_readahead(struct address_space *mapping, struct file *filp,
+			struct file_ra_state *ra,
+			struct page *page, pgoff_t index,
+			unsigned long req_size, unsigned long ra_max)
+{
+	unsigned long ra_old;
+	unsigned long ra_size;
+	unsigned long la_size;
+	unsigned long remain_space;
+	unsigned long growth_limit;
+
+	la_size = ra->readahead_index - index;
+	ra_size = compute_thrashing_threshold(ra, &remain_space);
+
+	if (page && remain_space <= la_size && la_size > 1) {
+		rescue_pages(page, la_size);
+		return 0;
+	}
+
+	ra_old = ra_readahead_size(ra);
+	growth_limit = req_size;
+	growth_limit += ra_max / 16;
+	growth_limit += (2 + readahead_ratio / 64) * ra_old;
+	if (growth_limit > ra_max)
+		growth_limit = ra_max;
+
+	if (!adjust_rala(growth_limit, &ra_size, &la_size))
+		return 0;
+
+	ra_set_class(ra, RA_CLASS_STATE);
+	ra_set_index(ra, index, ra->readahead_index);
+	ra_set_size(ra, ra_size, la_size);
+
+	return ra_dispatch(ra, mapping, filp);
+}
+
+/*
  * ra_min is mainly determined by the size of cache memory. Reasonable?
  *
  * Table of concrete numbers for 4KB page size:

--

^ permalink raw reply	[flat|nested] 30+ messages in thread

* [PATCH 18/32] readahead: initial method - thrashing guard size
       [not found] ` <20060527155134.715578802@localhost.localdomain>
@ 2006-05-27 15:49   ` Wu Fengguang
  0 siblings, 0 replies; 30+ messages in thread
From: Wu Fengguang @ 2006-05-27 15:49 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, Wu Fengguang

[-- Attachment #1: readahead-method-initial-size-thrash.patch --]
[-- Type: text/plain, Size: 1607 bytes --]

backing_dev_info.ra_thrash_bytes is dynamicly updated to be a little above
the thrashing safe read-ahead size. It is used in the initial method where
the thrashing threshold for the particular reader is still unknown.

Signed-off-by: Wu Fengguang <wfg@mail.ustc.edu.cn>
---

 mm/readahead.c |   20 ++++++++++++++++++++
 1 files changed, 20 insertions(+)

--- linux-2.6.17-rc4-mm3.orig/mm/readahead.c
+++ linux-2.6.17-rc4-mm3/mm/readahead.c
@@ -796,6 +796,22 @@ out:
 }
 
 /*
+ * Update `backing_dev_info.ra_thrash_bytes' to be a _biased_ average of
+ * read-ahead sizes. Which makes it an a-bit-risky(*) estimation of the
+ * _minimal_ read-ahead thrashing threshold on the device.
+ *
+ * (*) Note that being a bit risky can _help_ overall performance.
+ */
+static inline void update_ra_thrash_bytes(struct backing_dev_info *bdi,
+						unsigned long ra_size)
+{
+	ra_size <<= PAGE_CACHE_SHIFT;
+	bdi->ra_thrash_bytes = (bdi->ra_thrash_bytes < ra_size) ?
+				(ra_size + bdi->ra_thrash_bytes * 1023) / 1024:
+				(ra_size + bdi->ra_thrash_bytes *    7) /    8;
+}
+
+/*
  * The node's accumulated aging activities.
  */
 static unsigned long node_readahead_aging(void)
@@ -1144,6 +1160,10 @@ state_based_readahead(struct address_spa
 	if (!adjust_rala(growth_limit, &ra_size, &la_size))
 		return 0;
 
+	/* ra_size in its _steady_ state reflects thrashing threshold */
+	if (page && ra_old + ra_old / 8 >= ra_size)
+		update_ra_thrash_bytes(mapping->backing_dev_info, ra_size);
+
 	ra_set_class(ra, RA_CLASS_STATE);
 	ra_set_index(ra, index, ra->readahead_index);
 	ra_set_size(ra, ra_size, la_size);

--

^ permalink raw reply	[flat|nested] 30+ messages in thread

* [PATCH 20/32] readahead: initial method - user recommended size
       [not found] ` <20060527155135.584918734@localhost.localdomain>
@ 2006-05-27 15:49   ` Wu Fengguang
  0 siblings, 0 replies; 30+ messages in thread
From: Wu Fengguang @ 2006-05-27 15:49 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, Wu Fengguang

[-- Attachment #1: readahead-method-initial-size-recommend.patch --]
[-- Type: text/plain, Size: 1890 bytes --]

backing_dev_info.ra_pages0 is a user configurable parameter that controls
the readahead size on start-of-file.

Signed-off-by: Wu Fengguang <wfg@mail.ustc.edu.cn>
---

 block/ll_rw_blk.c |   30 ++++++++++++++++++++++++++++++
 1 files changed, 30 insertions(+)

--- linux-2.6.17-rc4-mm3.orig/block/ll_rw_blk.c
+++ linux-2.6.17-rc4-mm3/block/ll_rw_blk.c
@@ -3811,6 +3811,29 @@ queue_ra_store(struct request_queue *q, 
 	return ret;
 }
 
+static ssize_t queue_initial_ra_show(struct request_queue *q, char *page)
+{
+	int kb = q->backing_dev_info.ra_pages0 << (PAGE_CACHE_SHIFT - 10);
+
+	return queue_var_show(kb, (page));
+}
+
+static ssize_t
+queue_initial_ra_store(struct request_queue *q, const char *page, size_t count)
+{
+	unsigned long kb, ra;
+	ssize_t ret = queue_var_store(&kb, page, count);
+
+	ra = kb >> (PAGE_CACHE_SHIFT - 10);
+	q->backing_dev_info.ra_pages0 = ra;
+
+	ra = kb * 1024;
+	if (q->backing_dev_info.ra_expect_bytes > ra)
+		q->backing_dev_info.ra_expect_bytes = ra;
+
+	return ret;
+}
+
 static ssize_t queue_max_sectors_show(struct request_queue *q, char *page)
 {
 	int max_sectors_kb = q->max_sectors >> 1;
@@ -3868,6 +3891,12 @@ static struct queue_sysfs_entry queue_ra
 	.store = queue_ra_store,
 };
 
+static struct queue_sysfs_entry queue_initial_ra_entry = {
+	.attr = {.name = "initial_ra_kb", .mode = S_IRUGO | S_IWUSR },
+	.show = queue_initial_ra_show,
+	.store = queue_initial_ra_store,
+};
+
 static struct queue_sysfs_entry queue_max_sectors_entry = {
 	.attr = {.name = "max_sectors_kb", .mode = S_IRUGO | S_IWUSR },
 	.show = queue_max_sectors_show,
@@ -3888,6 +3917,7 @@ static struct queue_sysfs_entry queue_io
 static struct attribute *default_attrs[] = {
 	&queue_requests_entry.attr,
 	&queue_ra_entry.attr,
+	&queue_initial_ra_entry.attr,
 	&queue_max_hw_sectors_entry.attr,
 	&queue_max_sectors_entry.attr,
 	&queue_iosched_entry.attr,

--

^ permalink raw reply	[flat|nested] 30+ messages in thread

* [PATCH 22/32] readahead: backward prefetching method
       [not found] ` <20060527155136.503037461@localhost.localdomain>
@ 2006-05-27 15:49   ` Wu Fengguang
  0 siblings, 0 replies; 30+ messages in thread
From: Wu Fengguang @ 2006-05-27 15:49 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, Wu Fengguang

[-- Attachment #1: readahead-method-backward.patch --]
[-- Type: text/plain, Size: 1439 bytes --]

Readahead policy for reading backward.

Signed-off-by: Wu Fengguang <wfg@mail.ustc.edu.cn>
---

 mm/readahead.c |   40 ++++++++++++++++++++++++++++++++++++++++
 1 files changed, 40 insertions(+)

--- linux-2.6.17-rc4-mm3.orig/mm/readahead.c
+++ linux-2.6.17-rc4-mm3/mm/readahead.c
@@ -1536,6 +1536,46 @@ initial_readahead(struct address_space *
 }
 
 /*
+ * Backward prefetching.
+ *
+ * No look-ahead and thrashing safety guard: should be unnecessary.
+ */
+static int
+try_read_backward(struct file_ra_state *ra, pgoff_t begin_index,
+			unsigned long ra_size, unsigned long ra_max)
+{
+	pgoff_t end_index;
+
+	/* Are we reading backward? */
+	if (begin_index > ra->prev_page)
+		return 0;
+
+	if ((ra->flags & RA_CLASS_MASK) == RA_CLASS_BACKWARD &&
+					ra_has_index(ra, ra->prev_page)) {
+		ra_size += 2 * ra->hit0;
+		end_index = ra->la_index;
+	} else {
+		ra_size += ra_size + ra_size * (readahead_hit_rate - 1) / 2;
+		end_index = ra->prev_page;
+	}
+
+	if (ra_size > ra_max)
+		ra_size = ra_max;
+
+	/* Read traces close enough to be covered by the prefetching? */
+	if (end_index > begin_index + ra_size)
+		return 0;
+
+	begin_index = end_index - ra_size;
+
+	ra_set_class(ra, RA_CLASS_BACKWARD);
+	ra_set_index(ra, begin_index, begin_index);
+	ra_set_size(ra, ra_size, 0);
+
+	return 1;
+}
+
+/*
  * ra_min is mainly determined by the size of cache memory. Reasonable?
  *
  * Table of concrete numbers for 4KB page size:

--

^ permalink raw reply	[flat|nested] 30+ messages in thread

* [PATCH 24/32] readahead: thrashing recovery method
       [not found] ` <20060527155137.552915509@localhost.localdomain>
@ 2006-05-27 15:49   ` Wu Fengguang
  2006-05-27 22:04     ` [PATCH 23/32] readahead: seeking reads method Ingo Oeser
  0 siblings, 1 reply; 30+ messages in thread
From: Wu Fengguang @ 2006-05-27 15:49 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, Wu Fengguang

[-- Attachment #1: readahead-method-onthrash.patch --]
[-- Type: text/plain, Size: 1736 bytes --]

Readahead policy after thrashing.

It tries to recover gracefully from the thrashing.

Signed-off-by: Wu Fengguang <wfg@mail.ustc.edu.cn>
---

 mm/readahead.c |   42 ++++++++++++++++++++++++++++++++++++++++++
 1 files changed, 42 insertions(+)

--- linux-2.6.17-rc4-mm3.orig/mm/readahead.c
+++ linux-2.6.17-rc4-mm3/mm/readahead.c
@@ -1619,6 +1619,48 @@ try_readahead_on_seek(struct file_ra_sta
 }
 
 /*
+ * Readahead thrashing recovery.
+ */
+static unsigned long
+thrashing_recovery_readahead(struct address_space *mapping,
+				struct file *filp, struct file_ra_state *ra,
+				pgoff_t index, unsigned long ra_max)
+{
+	unsigned long ra_size;
+
+	if (probe_page(mapping, index - 1))
+		ra_account(ra, RA_EVENT_READAHEAD_MUTILATE,
+						ra->readahead_index - index);
+	ra_account(ra, RA_EVENT_READAHEAD_THRASHING,
+						ra->readahead_index - index);
+
+	/*
+	 * Some thrashing occur in (ra_index, la_index], in which case the
+	 * old read-ahead chunk is lost soon after the new one is allocated.
+	 * Ensure that we recover all needed pages in the old chunk.
+	 */
+	if (index < ra->ra_index)
+		ra_size = ra->ra_index - index;
+	else {
+		/* After thrashing, we know the exact thrashing-threshold. */
+		ra_size = ra->hit0;
+		update_ra_thrash_bytes(mapping->backing_dev_info, ra_size);
+
+		/* And we'd better be a bit conservative. */
+		ra_size = ra_size * 3 / 4;
+	}
+
+	if (ra_size > ra_max)
+		ra_size = ra_max;
+
+	ra_set_class(ra, RA_CLASS_THRASHING);
+	ra_set_index(ra, index, index);
+	ra_set_size(ra, ra_size, ra_size / LOOKAHEAD_RATIO);
+
+	return ra_dispatch(ra, mapping, filp);
+}
+
+/*
  * ra_min is mainly determined by the size of cache memory. Reasonable?
  *
  * Table of concrete numbers for 4KB page size:

--

^ permalink raw reply	[flat|nested] 30+ messages in thread

* [PATCH 25/32] readahead: call scheme
       [not found] ` <20060527155138.046726658@localhost.localdomain>
@ 2006-05-27 15:49   ` Wu Fengguang
  0 siblings, 0 replies; 30+ messages in thread
From: Wu Fengguang @ 2006-05-27 15:49 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, Wu Fengguang

[-- Attachment #1: readahead-call-scheme.patch --]
[-- Type: text/plain, Size: 9534 bytes --]

The read-ahead logic is called when the reading hits
        - a PG_readahead marked page;
        - a non-present page.

ra.prev_page should be properly setup on entrance, and readahead_cache_hit()
should be called on every page reference to maintain the cache_hits counter.

This call scheme achieves the following goals:
        - makes all stateful/stateless methods happy;
        - eliminates the cache hit problem naturally;
        - lives in harmony with application managed read-aheads via
          fadvise/madvise.

Signed-off-by: Wu Fengguang <wfg@mail.ustc.edu.cn>
---

 include/linux/mm.h |    6 ++
 mm/filemap.c       |   51 ++++++++++++++++-
 mm/readahead.c     |  155 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 208 insertions(+), 4 deletions(-)

--- linux-2.6.17-rc4-mm3.orig/include/linux/mm.h
+++ linux-2.6.17-rc4-mm3/include/linux/mm.h
@@ -1033,6 +1033,12 @@ void handle_ra_miss(struct address_space
 		    struct file_ra_state *ra, pgoff_t offset);
 unsigned long max_sane_readahead(unsigned long nr);
 void fastcall readahead_close(struct file *file);
+unsigned long
+page_cache_readahead_adaptive(struct address_space *mapping,
+			struct file_ra_state *ra, struct file *filp,
+			struct page *prev_page, struct page *page,
+			pgoff_t first_index, pgoff_t index, pgoff_t last_index);
+void fastcall readahead_cache_hit(struct file_ra_state *ra, struct page *page);
 
 #ifdef CONFIG_ADAPTIVE_READAHEAD
 extern int readahead_ratio;
--- linux-2.6.17-rc4-mm3.orig/mm/filemap.c
+++ linux-2.6.17-rc4-mm3/mm/filemap.c
@@ -869,14 +869,32 @@ void do_generic_mapping_read(struct addr
 		nr = nr - offset;
 
 		cond_resched();
-		if (index == next_index)
+
+		if (!prefer_adaptive_readahead() && index == next_index)
 			next_index = page_cache_readahead(mapping, &ra, filp,
 					index, last_index - index);
 
 find_page:
 		page = find_get_page(mapping, index);
+		if (prefer_adaptive_readahead()) {
+			if (unlikely(page == NULL)) {
+				ra.prev_page = prev_index;
+				page_cache_readahead_adaptive(mapping, &ra,
+						filp, prev_page, NULL,
+						*ppos >> PAGE_CACHE_SHIFT,
+						index, last_index);
+				page = find_get_page(mapping, index);
+			} else if (PageReadahead(page)) {
+				ra.prev_page = prev_index;
+				page_cache_readahead_adaptive(mapping, &ra,
+						filp, prev_page, page,
+						*ppos >> PAGE_CACHE_SHIFT,
+						index, last_index);
+			}
+		}
 		if (unlikely(page == NULL)) {
-			handle_ra_miss(mapping, &ra, index);
+			if (!prefer_adaptive_readahead())
+				handle_ra_miss(mapping, &ra, index);
 			goto no_cached_page;
 		}
 
@@ -884,6 +902,9 @@ find_page:
 			page_cache_release(prev_page);
 		prev_page = page;
 
+		if (prefer_adaptive_readahead())
+			readahead_cache_hit(&ra, page);
+
 		if (!PageUptodate(page))
 			goto page_not_up_to_date;
 page_ok:
@@ -1027,6 +1048,8 @@ no_cached_page:
 
 out:
 	*_ra = ra;
+	if (prefer_adaptive_readahead())
+		_ra->prev_page = prev_index;
 
 	*ppos = ((loff_t) index << PAGE_CACHE_SHIFT) + offset;
 	if (cached_page)
@@ -1312,6 +1335,7 @@ struct page *filemap_nopage(struct vm_ar
 	unsigned long size, pgoff;
 	int did_readaround = 0, majmin = VM_FAULT_MINOR;
 
+	ra->flags |= RA_FLAG_MMAP;
 	pgoff = ((address-area->vm_start) >> PAGE_CACHE_SHIFT) + area->vm_pgoff;
 
 retry_all:
@@ -1329,19 +1353,33 @@ retry_all:
 	 *
 	 * For sequential accesses, we use the generic readahead logic.
 	 */
-	if (VM_SequentialReadHint(area))
+	if (!prefer_adaptive_readahead() && VM_SequentialReadHint(area))
 		page_cache_readahead(mapping, ra, file, pgoff, 1);
 
+
 	/*
 	 * Do we have something in the page cache already?
 	 */
 retry_find:
 	page = find_get_page(mapping, pgoff);
+	if (prefer_adaptive_readahead() && VM_SequentialReadHint(area)) {
+		if (!page) {
+			page_cache_readahead_adaptive(mapping, ra,
+						file, NULL, NULL,
+						pgoff, pgoff, pgoff + 1);
+			page = find_get_page(mapping, pgoff);
+		} else if (PageReadahead(page)) {
+			page_cache_readahead_adaptive(mapping, ra,
+						file, NULL, page,
+						pgoff, pgoff, pgoff + 1);
+		}
+	}
 	if (!page) {
 		unsigned long ra_pages;
 
 		if (VM_SequentialReadHint(area)) {
-			handle_ra_miss(mapping, ra, pgoff);
+			if (!prefer_adaptive_readahead())
+				handle_ra_miss(mapping, ra, pgoff);
 			goto no_cached_page;
 		}
 		ra->mmap_miss++;
@@ -1378,6 +1416,9 @@ retry_find:
 	if (!did_readaround)
 		ra->mmap_hit++;
 
+	if (prefer_adaptive_readahead())
+		readahead_cache_hit(ra, page);
+
 	/*
 	 * Ok, found a page in the page cache, now we need to check
 	 * that it's up-to-date.
@@ -1392,6 +1433,8 @@ success:
 	mark_page_accessed(page);
 	if (type)
 		*type = majmin;
+	if (prefer_adaptive_readahead())
+		ra->prev_page = page->index;
 	return page;
 
 outside_data_content:
--- linux-2.6.17-rc4-mm3.orig/mm/readahead.c
+++ linux-2.6.17-rc4-mm3/mm/readahead.c
@@ -1679,6 +1679,161 @@ static inline void get_readahead_bounds(
 				(128*1024) / PAGE_CACHE_SIZE), *ra_max / 2);
 }
 
+/**
+ * page_cache_readahead_adaptive - thrashing safe adaptive read-ahead
+ * @mapping, @ra, @filp: the same as page_cache_readahead()
+ * @prev_page: the page at @index-1, may be NULL to let the function find it
+ * @page: the page at @index, or NULL if non-present
+ * @begin_index, @index, @end_index: offsets into @mapping
+ * 		[@begin_index, @end_index) is the read the caller is performing
+ *	 	@index indicates the page to be read now
+ *
+ * page_cache_readahead_adaptive() is the entry point of the adaptive
+ * read-ahead logic. It tries a set of methods in turn to determine the
+ * appropriate readahead action and submits the readahead I/O.
+ *
+ * The caller is expected to point ra->prev_page to the previously accessed
+ * page, and to call it on two conditions:
+ * 1. @page == NULL
+ *    A cache miss happened, some pages have to be read in
+ * 2. @page != NULL && PageReadahead(@page)
+ *    A look-ahead mark encountered, this is set by a previous read-ahead
+ *    invocation to instruct the caller to give the function a chance to
+ *    check up and do next read-ahead in advance.
+ */
+unsigned long
+page_cache_readahead_adaptive(struct address_space *mapping,
+			struct file_ra_state *ra, struct file *filp,
+			struct page *prev_page, struct page *page,
+			pgoff_t begin_index, pgoff_t index, pgoff_t end_index)
+{
+	unsigned long size;
+	unsigned long ra_min;
+	unsigned long ra_max;
+	int ret;
+
+	might_sleep();
+
+	if (page) {
+		if(!TestClearPageReadahead(page))
+			return 0;
+		if (bdi_read_congested(mapping->backing_dev_info)) {
+			ra_account(ra, RA_EVENT_IO_CONGESTION,
+							end_index - index);
+			return 0;
+		}
+	}
+
+	if (page)
+		ra_account(ra, RA_EVENT_LOOKAHEAD_HIT,
+				ra->readahead_index - ra->lookahead_index);
+	else if (index)
+		ra_account(ra, RA_EVENT_CACHE_MISS, end_index - begin_index);
+
+	size = end_index - index;
+	get_readahead_bounds(ra, &ra_min, &ra_max);
+
+	/* readahead disabled? */
+	if (unlikely(!ra_max || !readahead_ratio)) {
+		size = max_sane_readahead(size);
+		goto readit;
+	}
+
+	/*
+	 * Start of file.
+	 */
+	if (index == 0)
+		return initial_readahead(mapping, filp, ra, size);
+
+	/*
+	 * State based sequential read-ahead.
+	 */
+	if (!debug_option(disable_stateful_method) &&
+			index == ra->lookahead_index && ra_cache_hit_ok(ra))
+		return state_based_readahead(mapping, filp, ra, page,
+							index, size, ra_max);
+
+	/*
+	 * Recover from possible thrashing.
+	 */
+	if (!page && index == ra->prev_page + 1 && ra_has_index(ra, index))
+		return thrashing_recovery_readahead(mapping, filp, ra,
+								index, ra_max);
+
+	/*
+	 * Backward read-ahead.
+	 */
+	if (!page && begin_index == index &&
+				try_read_backward(ra, index, size, ra_max))
+		return ra_dispatch(ra, mapping, filp);
+
+	/*
+	 * Context based sequential read-ahead.
+	 */
+	ret = try_context_based_readahead(mapping, ra, prev_page, page,
+							index, ra_min, ra_max);
+	if (ret > 0)
+		return ra_dispatch(ra, mapping, filp);
+	if (ret < 0)
+		return 0;
+
+	/* No action on look ahead time? */
+	if (page) {
+		ra_account(ra, RA_EVENT_LOOKAHEAD_NOACTION,
+						ra->readahead_index - index);
+		return 0;
+	}
+
+	/*
+	 * Random read that follows a sequential one.
+	 */
+	if (try_readahead_on_seek(ra, index, size, ra_max))
+		return ra_dispatch(ra, mapping, filp);
+
+	/*
+	 * Random read.
+	 */
+	if (size > ra_max)
+		size = ra_max;
+
+readit:
+	size = __do_page_cache_readahead(mapping, filp, index, size, 0);
+
+	ra_account(ra, RA_EVENT_RANDOM_READ, size);
+	dprintk("random_read(ino=%lu, pages=%lu, index=%lu-%lu-%lu) = %lu\n",
+			mapping->host->i_ino, mapping->nrpages,
+			begin_index, index, end_index, size);
+
+	return size;
+}
+
+/**
+ * readahead_cache_hit - adaptive read-ahead feedback function
+ * @ra: file_ra_state which holds the readahead state
+ * @page: the page just accessed
+ *
+ * readahead_cache_hit() is the feedback route of the adaptive read-ahead
+ * logic. It must be called on every access on the read-ahead pages.
+ */
+void fastcall readahead_cache_hit(struct file_ra_state *ra, struct page *page)
+{
+	if (PageActive(page) || PageReferenced(page))
+		return;
+
+	if (!PageUptodate(page))
+		ra_account(ra, RA_EVENT_IO_BLOCK, 1);
+
+	if (!ra_has_index(ra, page->index))
+		return;
+
+	ra->hit0++;
+
+	if (page->index >= ra->ra_index)
+		ra_account(ra, RA_EVENT_READAHEAD_HIT, 1);
+	else
+		ra_account(ra, RA_EVENT_READAHEAD_HIT, -1);
+}
+
 /*
  * When closing a normal readonly file,
  * 	- on cache hit:  increase `backing_dev_info.ra_expect_bytes' slowly;

--

^ permalink raw reply	[flat|nested] 30+ messages in thread

* [PATCH 26/32] readahead: laptop mode
       [not found] ` <20060527155138.454809673@localhost.localdomain>
@ 2006-05-27 15:49   ` Wu Fengguang
  0 siblings, 0 replies; 30+ messages in thread
From: Wu Fengguang @ 2006-05-27 15:49 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, Wu Fengguang, Bart Samwel

[-- Attachment #1: readahead-laptop-mode.patch --]
[-- Type: text/plain, Size: 3330 bytes --]

When the laptop drive is spinned down, defer look-ahead to spin up time.

The implementation employs a poll based method, for performance is not a
concern in this code path. The poll interval is 64KB, which should be small
enough for movies/musics. The user space application is responsible for
proper caching to hide the spin-up-and-read delay.

------------------------------------------------------------------------
For crazy laptop users who prefer aggressive read-ahead, here is the way:

# echo 1000 > /proc/sys/vm/readahead_ratio
# blockdev --setra 524280 /dev/hda      # this is the max possible value

Notes:
- It is still an untested feature.
- It is safer to use blockdev+fadvise to increase ra-max for a single file,
  which needs patching your movie player.
- Be sure to restore them to sane values in normal operations!

Signed-off-by: Wu Fengguang <wfg@mail.ustc.edu.cn>
---

 include/linux/writeback.h |    6 ++++++
 mm/page-writeback.c       |    2 +-
 mm/readahead.c            |   31 +++++++++++++++++++++++++++++++
 3 files changed, 38 insertions(+), 1 deletion(-)

--- linux-2.6.17-rc4-mm3.orig/include/linux/writeback.h
+++ linux-2.6.17-rc4-mm3/include/linux/writeback.h
@@ -86,6 +86,12 @@ void laptop_io_completion(void);
 void laptop_sync_completion(void);
 void throttle_vm_writeout(void);
 
+extern struct timer_list laptop_mode_wb_timer;
+static inline int laptop_spinned_down(void)
+{
+	return !timer_pending(&laptop_mode_wb_timer);
+}
+
 /* These are exported to sysctl. */
 extern int dirty_background_ratio;
 extern int vm_dirty_ratio;
--- linux-2.6.17-rc4-mm3.orig/mm/page-writeback.c
+++ linux-2.6.17-rc4-mm3/mm/page-writeback.c
@@ -389,7 +389,7 @@ static void wb_timer_fn(unsigned long un
 static void laptop_timer_fn(unsigned long unused);
 
 static DEFINE_TIMER(wb_timer, wb_timer_fn, 0, 0);
-static DEFINE_TIMER(laptop_mode_wb_timer, laptop_timer_fn, 0, 0);
+DEFINE_TIMER(laptop_mode_wb_timer, laptop_timer_fn, 0, 0);
 
 /*
  * Periodic writeback of "old" data.
--- linux-2.6.17-rc4-mm3.orig/mm/readahead.c
+++ linux-2.6.17-rc4-mm3/mm/readahead.c
@@ -16,6 +16,7 @@
 #include <linux/blkdev.h>
 #include <linux/backing-dev.h>
 #include <linux/pagevec.h>
+#include <linux/writeback.h>
 #include <asm/div64.h>
 
 /*
@@ -796,6 +797,31 @@ out:
 }
 
 /*
+ * Set a new look-ahead mark at @new_index.
+ * Return 0 if the new mark is successfully set.
+ */
+static int renew_lookahead(struct address_space *mapping,
+				struct file_ra_state *ra,
+				pgoff_t index, pgoff_t new_index)
+{
+	struct page *page;
+
+	if (index == ra->lookahead_index &&
+			new_index >= ra->readahead_index)
+		return 1;
+
+	page = radix_tree_lookup(&mapping->page_tree, new_index);
+	if (!page)
+		return 1;
+
+	SetPageReadahead(page);
+	if (ra->lookahead_index == index)
+		ra->lookahead_index = new_index;
+
+	return 0;
+}
+
+/*
  * Update `backing_dev_info.ra_thrash_bytes' to be a _biased_ average of
  * read-ahead sizes. Which makes it an a-bit-risky(*) estimation of the
  * _minimal_ read-ahead thrashing threshold on the device.
@@ -1722,6 +1748,11 @@ page_cache_readahead_adaptive(struct add
 							end_index - index);
 			return 0;
 		}
+		if (laptop_mode && laptop_spinned_down()) {
+			if (!renew_lookahead(mapping, ra, index,
+						index + LAPTOP_POLL_INTERVAL))
+				return 0;
+		}
 	}
 
 	if (page)

--

^ permalink raw reply	[flat|nested] 30+ messages in thread

* [PATCH 27/32] readahead: loop case
       [not found] ` <20060527155140.035991503@localhost.localdomain>
@ 2006-05-27 15:49   ` Wu Fengguang
  0 siblings, 0 replies; 30+ messages in thread
From: Wu Fengguang @ 2006-05-27 15:49 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, Wu Fengguang

[-- Attachment #1: readahead-loop-case.patch --]
[-- Type: text/plain, Size: 894 bytes --]

Disable look-ahead for loop file.

Loopback files normally contain filesystems, in which case there are already
proper look-aheads in the upper layer, more look-aheads on the loopback file
only ruins the read-ahead hit rate.

Signed-off-by: Wu Fengguang <wfg@mail.ustc.edu.cn>
---

I'd like to thank Tero Grundstr?m for uncovering the loopback problem.

 drivers/block/loop.c |    6 ++++++
 1 files changed, 6 insertions(+)

--- linux-2.6.17-rc4-mm3.orig/drivers/block/loop.c
+++ linux-2.6.17-rc4-mm3/drivers/block/loop.c
@@ -779,6 +779,12 @@ static int loop_set_fd(struct loop_devic
 	mapping = file->f_mapping;
 	inode = mapping->host;
 
+	/*
+	 * The upper layer should already do proper look-ahead,
+	 * one more look-ahead here only ruins the cache hit rate.
+	 */
+	file->f_ra.flags |= RA_FLAG_NO_LOOKAHEAD;
+
 	if (!(file->f_mode & FMODE_WRITE))
 		lo_flags |= LO_FLAGS_READ_ONLY;
 

--

^ permalink raw reply	[flat|nested] 30+ messages in thread

* [PATCH 30/32] readahead: debug radix tree new functions
       [not found] ` <20060527155141.697607086@localhost.localdomain>
@ 2006-05-27 15:49   ` Wu Fengguang
  0 siblings, 0 replies; 30+ messages in thread
From: Wu Fengguang @ 2006-05-27 15:49 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, Wu Fengguang

[-- Attachment #1: readahead-debug-radix-tree.patch --]
[-- Type: text/plain, Size: 1653 bytes --]

Do some sanity checkings on the newly added radix tree code.

Signed-off-by: Wu Fengguang <wfg@mail.ustc.edu.cn>
---

 mm/readahead.c |   19 +++++++++++++++++++
 1 files changed, 19 insertions(+)

--- linux-2.6.17-rc4-mm3.orig/mm/readahead.c
+++ linux-2.6.17-rc4-mm3/mm/readahead.c
@@ -70,6 +70,8 @@ enum ra_class {
 	RA_CLASS_COUNT
 };
 
+#define DEBUG_READAHEAD_RADIXTREE
+
 /* Read-ahead events to be accounted. */
 enum ra_event {
 	RA_EVENT_CACHE_MISS,		/* read cache misses */
@@ -1294,6 +1296,16 @@ static pgoff_t find_segtail(struct addre
 	cond_resched();
 	read_lock_irq(&mapping->tree_lock);
 	ra_index = radix_tree_scan_hole(&mapping->page_tree, index, max_scan);
+#ifdef DEBUG_READAHEAD_RADIXTREE
+	BUG_ON(!__probe_page(mapping, index));
+	WARN_ON(ra_index < index);
+	if (ra_index != index && !__probe_page(mapping, ra_index - 1))
+		printk(KERN_ERR "radix_tree_scan_hole(index=%lu ra_index=%lu "
+				"max_scan=%lu nrpages=%lu) fooled!\n",
+				index, ra_index, max_scan, mapping->nrpages);
+	if (ra_index != ~0UL && ra_index - index < max_scan)
+		WARN_ON(__probe_page(mapping, ra_index));
+#endif
 	read_unlock_irq(&mapping->tree_lock);
 
 	if (ra_index <= index + max_scan)
@@ -1377,6 +1389,13 @@ static unsigned long query_page_cache_se
 	read_lock_irq(&mapping->tree_lock);
 	index = radix_tree_scan_hole_backward(&mapping->page_tree,
 							offset - 1, ra_max);
+#ifdef DEBUG_READAHEAD_RADIXTREE
+	WARN_ON(index > offset - 1);
+	if (index != offset - 1)
+		WARN_ON(!__probe_page(mapping, index + 1));
+	if (index && offset - 1 - index < ra_max)
+		WARN_ON(__probe_page(mapping, index));
+#endif
 
 	*remain = offset - index;
 

--

^ permalink raw reply	[flat|nested] 30+ messages in thread

* [PATCH 31/32] readahead: debug traces showing accessed file names
       [not found] ` <20060527155142.129761018@localhost.localdomain>
@ 2006-05-27 15:49   ` Wu Fengguang
  0 siblings, 0 replies; 30+ messages in thread
From: Wu Fengguang @ 2006-05-27 15:49 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, Wu Fengguang

[-- Attachment #1: readahead-debug-traces-file-list.patch --]
[-- Type: text/plain, Size: 1019 bytes --]

Print file names on their first read-ahead, for tracing file access patterns.

Signed-off-by: Wu Fengguang <wfg@mail.ustc.edu.cn>
---

 mm/readahead.c |   14 ++++++++++++++
 1 files changed, 14 insertions(+)

--- linux-2.6.17-rc4-mm3.orig/mm/readahead.c
+++ linux-2.6.17-rc4-mm3/mm/readahead.c
@@ -1039,6 +1039,20 @@ static int ra_dispatch(struct file_ra_st
 		ra_account(ra, RA_EVENT_IO_CACHE_HIT, ra_size - actual);
 	ra_account(ra, RA_EVENT_READAHEAD, actual);
 
+	if (!ra->ra_index && filp->f_dentry->d_inode) {
+		char *fn;
+		static char path[1024];
+		unsigned long size;
+
+		size = (i_size_read(filp->f_dentry->d_inode)+1023)/1024;
+		fn = d_path(filp->f_dentry, filp->f_vfsmnt, path, 1000);
+		if (!IS_ERR(fn))
+			ddprintk("ino %lu is %s size %luK by %s(%d)\n",
+					filp->f_dentry->d_inode->i_ino,
+					fn, size,
+					current->comm, current->pid);
+	}
+
 	dprintk("readahead-%s(ino=%lu, index=%lu, ra=%lu+%lu-%lu) = %d\n",
 			ra_class_name[ra_class_new(ra)],
 			mapping->host->i_ino, ra->la_index,

--

^ permalink raw reply	[flat|nested] 30+ messages in thread

* [PATCH 32/32] readahead: debug traces showing read patterns
       [not found] ` <20060527155142.715530234@localhost.localdomain>
@ 2006-05-27 15:49   ` Wu Fengguang
  0 siblings, 0 replies; 30+ messages in thread
From: Wu Fengguang @ 2006-05-27 15:49 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, Wu Fengguang

[-- Attachment #1: readahead-debug-traces-access-pattern.patch --]
[-- Type: text/plain, Size: 2664 bytes --]

Print all relavant read requests to help discover the access pattern.

If you are experiencing performance problems, or want to help improve
the read-ahead logic, please send me the trace data. Thanks.

- Preparations

# Compile kernel with option CONFIG_DEBUG_READAHEAD
mkdir /debug
mount -t debug none /debug

- For each session with distinct access pattern

echo > /debug/readahead # reset the counters
# echo > /var/log/kern.log # you may want to backup it first
echo 8 > /debug/readahead/debug_level # show verbose printk traces
# do one benchmark/task
echo 0 > /debug/readahead/debug_level # revert to normal value
cp /debug/readahead/events readahead-events-`date +'%F_%R'`
bzip2 -c /var/log/kern.log > kern.log-`date +'%F_%R'`.bz2

Signed-off-by: Wu Fengguang <wfg@mail.ustc.edu.cn>
---

 mm/filemap.c |   23 ++++++++++++++++++++++-
 1 files changed, 22 insertions(+), 1 deletion(-)

--- linux-2.6.17-rc4-mm3.orig/mm/filemap.c
+++ linux-2.6.17-rc4-mm3/mm/filemap.c
@@ -45,6 +45,12 @@ static ssize_t
 generic_file_direct_IO(int rw, struct kiocb *iocb, const struct iovec *iov,
 	loff_t offset, unsigned long nr_segs);
 
+#ifdef CONFIG_DEBUG_READAHEAD
+extern u32 debug_level;
+#else
+#define debug_level 0
+#endif /* CONFIG_DEBUG_READAHEAD */
+
 /*
  * Shared mappings implemented 30.11.1994. It's not fully working yet,
  * though.
@@ -851,6 +857,10 @@ void do_generic_mapping_read(struct addr
 	if (!isize)
 		goto out;
 
+	if (debug_level >= 5)
+		printk(KERN_DEBUG "read-file(ino=%lu, req=%lu+%lu)\n",
+			inode->i_ino, index, last_index - index);
+
 	end_index = (isize - 1) >> PAGE_CACHE_SHIFT;
 	for (;;) {
 		struct page *page;
@@ -905,6 +915,11 @@ find_page:
 		if (prefer_adaptive_readahead())
 			readahead_cache_hit(&ra, page);
 
+		if (debug_level >= 7)
+			printk(KERN_DEBUG "read-page(ino=%lu, idx=%lu, io=%s)\n",
+				inode->i_ino, index,
+				PageUptodate(page) ? "hit" : "miss");
+
 		if (!PageUptodate(page))
 			goto page_not_up_to_date;
 page_ok:
@@ -1356,7 +1371,6 @@ retry_all:
 	if (!prefer_adaptive_readahead() && VM_SequentialReadHint(area))
 		page_cache_readahead(mapping, ra, file, pgoff, 1);
 
-
 	/*
 	 * Do we have something in the page cache already?
 	 */
@@ -1419,6 +1433,13 @@ retry_find:
 	if (prefer_adaptive_readahead())
 		readahead_cache_hit(ra, page);
 
+	if (debug_level >= 6)
+		printk(KERN_DEBUG "read-mmap(ino=%lu, idx=%lu, hint=%s, io=%s)\n",
+			inode->i_ino, pgoff,
+			VM_RandomReadHint(area) ? "random" :
+			(VM_SequentialReadHint(area) ? "sequential" : "none"),
+			PageUptodate(page) ? "hit" : "miss");
+
 	/*
 	 * Ok, found a page in the page cache, now we need to check
 	 * that it's up-to-date.

--

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 00/32] Adaptive readahead V14
  2006-05-27 15:48 ` [PATCH 00/32] Adaptive readahead V14 Wu Fengguang
@ 2006-05-27 17:29   ` Michael Tokarev
       [not found]     ` <20060528120815.GB6478@mail.ustc.edu.cn>
  0 siblings, 1 reply; 30+ messages in thread
From: Michael Tokarev @ 2006-05-27 17:29 UTC (permalink / raw)
  To: Wu Fengguang; +Cc: linux-kernel

Wu Fengguang wrote:
> Andrew,
> 
> This is the 14th release of the adaptive readahead patchset.

A question I wanted to ask for quite some time already but for
some reason didn't ask...

How the new readahead logic works with media read errors?
Current linux behavior is questionable (it killed my dvd drive
for example, due to too many retries to read a single bad block
on a CD-Rom), it - I think - should be to stop reading ahead if
an read error occurs, instead of re-trying, and only retry to
read that block (if at all) when and only when an application
asks for that block.  I'm unsure when it should "resume reading
ahead" again (ie, like, setting ra to 0 on first error, and
restoring it back if we trying to read past the bad block.. or
set it to 0, and try to increase it on subsequent reads one by
one back to the original value, or...) - but that's probably
different story, for now, i think just setting ra to 0 on read
error will be sufficient...

Thanks.

/mjt

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 23/32] readahead: seeking reads method
  2006-05-27 15:49   ` [PATCH 24/32] readahead: thrashing recovery method Wu Fengguang
@ 2006-05-27 22:04     ` Ingo Oeser
  0 siblings, 0 replies; 30+ messages in thread
From: Ingo Oeser @ 2006-05-27 22:04 UTC (permalink / raw)
  To: Wu Fengguang; +Cc: Andrew Morton, linux-kernel

Hello, 

On Saturday, 27. May 2006 17:49, Wu Fengguang wrote:
> Readahead policy on read after seeking.
> 
> It tries to detect sequences like:
> 	seek(), 5*read(); seek(), 6*read(); seek(), 4*read(); ...

This is basically "fast forward" and "fast backward" of media players.
Did you try some tests with that? Hard disk recording people will love you.

Just watch a nice movies while testing this and you can enjoy both: 
Your work succeeding and jumping around in a good movie :-)


Regards

Ingo Oeser

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 00/32] Adaptive readahead V14
       [not found]     ` <20060528120815.GB6478@mail.ustc.edu.cn>
@ 2006-05-28 12:08       ` Wu Fengguang
  2006-05-28 19:23         ` Michael Tokarev
  0 siblings, 1 reply; 30+ messages in thread
From: Wu Fengguang @ 2006-05-28 12:08 UTC (permalink / raw)
  To: Michael Tokarev; +Cc: linux-kernel

On Sat, May 27, 2006 at 09:29:46PM +0400, Michael Tokarev wrote:
> How the new readahead logic works with media read errors?
> Current linux behavior is questionable (it killed my dvd drive
> for example, due to too many retries to read a single bad block
> on a CD-Rom), it - I think - should be to stop reading ahead if
> an read error occurs, instead of re-trying, and only retry to
> read that block (if at all) when and only when an application
> asks for that block.  I'm unsure when it should "resume reading
> ahead" again (ie, like, setting ra to 0 on first error, and
> restoring it back if we trying to read past the bad block.. or
> set it to 0, and try to increase it on subsequent reads one by
> one back to the original value, or...) - but that's probably
> different story, for now, i think just setting ra to 0 on read
> error will be sufficient...

It's not quite reasonable for readahead to worry about media errors.
If the media fails, fix it. Or it will hurt read sooner or later.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 00/32] Adaptive readahead V14
  2006-05-28 12:08       ` Wu Fengguang
@ 2006-05-28 19:23         ` Michael Tokarev
       [not found]           ` <20060529030152.GA5994@mail.ustc.edu.cn>
  0 siblings, 1 reply; 30+ messages in thread
From: Michael Tokarev @ 2006-05-28 19:23 UTC (permalink / raw)
  To: Wu Fengguang, Michael Tokarev, linux-kernel

Wu Fengguang wrote:
> 
> It's not quite reasonable for readahead to worry about media errors.
> If the media fails, fix it. Or it will hurt read sooner or later.

Well... In reality, it is just the opposite.

Suppose there's a CD-rom with a scratch/etc, one sector is unreadable.
In order to "fix" it, one have to read it and write to another CD-rom,
or something.. or just ignore the error (if it's just a skip in a video
stream).  Let's assume the unreadable block is number U.

But current behavior is just insane.  An application requests block
number N, which is before U. Kernel tries to read-ahead blocks N..U.
Cdrom drive tries to read it, re-read it.. for some time.  Finally,
when all the N..U-1 blocks are read, kernel returns block number N
(as requested) to an application, successefully.

Now an app requests block number N+1, and kernel tries to read
blocks N+1..U+1.  Retrying again as in previous step.

And so on, up to when an app requests block number U-1.  And when,
finally, it requests block U, it receives read error.

So, kernel currentry tries to re-read the same failing block as
many times as the current readahead value (256 (times?) by default).

This whole process already killed my cdrom drive (I posted about it
to LKML several months ago) - literally, the drive has fried, and
does not work anymore.  Ofcourse that problem was a bug in firmware
(or whatever) of the drive *too*, but.. main problem with that is
current readahead logic as described above.

With that logic, an app also becomes unkillable (at least for some
time) -- ie, even when I knew something's wrong and the CDrom should
not behave like it was, I wasn't able to stop it until I powered the
machine off (just unplugged the power cable) - but.. too late.

Yes, bad media is just that - a bad thing.  But it's not a reason to
force power unplug to stop the process, and not a reason to burn a
drive (or anything else).  And this is where readahead comes into
play - it IS read-ahead logic who's responsible for the situation.

And there's alot of scratched/whatever CD-Rom drives out there -
unreadable CDrom (or a floppy which is already ancient, or some
other media) - you can't just say to every user out there that
linux isn't compatible with all people's stuff and those people
should "fix" it before ever trying to insert it into their linux
machine...

Thanks.

/mjt

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 00/32] Adaptive readahead V14
       [not found]           ` <20060529030152.GA5994@mail.ustc.edu.cn>
@ 2006-05-29  3:01             ` Wu Fengguang
  2006-05-30  9:23             ` Jens Axboe
  1 sibling, 0 replies; 30+ messages in thread
From: Wu Fengguang @ 2006-05-29  3:01 UTC (permalink / raw)
  To: Michael Tokarev; +Cc: linux-kernel, Jens Axboe

On Sun, May 28, 2006 at 11:23:33PM +0400, Michael Tokarev wrote:
> Wu Fengguang wrote:
> > 
> > It's not quite reasonable for readahead to worry about media errors.
> > If the media fails, fix it. Or it will hurt read sooner or later.
> 
> Well... In reality, it is just the opposite.
> 
> Suppose there's a CD-rom with a scratch/etc, one sector is unreadable.
> In order to "fix" it, one have to read it and write to another CD-rom,
> or something.. or just ignore the error (if it's just a skip in a video
> stream).  Let's assume the unreadable block is number U.
> 
> But current behavior is just insane.  An application requests block
> number N, which is before U. Kernel tries to read-ahead blocks N..U.
> Cdrom drive tries to read it, re-read it.. for some time.  Finally,
> when all the N..U-1 blocks are read, kernel returns block number N
> (as requested) to an application, successefully.
> 
> Now an app requests block number N+1, and kernel tries to read
> blocks N+1..U+1.  Retrying again as in previous step.
> 
> And so on, up to when an app requests block number U-1.  And when,
> finally, it requests block U, it receives read error.
> 
> So, kernel currentry tries to re-read the same failing block as
> many times as the current readahead value (256 (times?) by default).

Good insight... But I'm not sure about it.

Jens, will a bad sector cause the _whole_ request to fail?
Or only the page that contains the bad sector?

> This whole process already killed my cdrom drive (I posted about it
> to LKML several months ago) - literally, the drive has fried, and
> does not work anymore.  Ofcourse that problem was a bug in firmware
> (or whatever) of the drive *too*, but.. main problem with that is
> current readahead logic as described above.
> 
> With that logic, an app also becomes unkillable (at least for some
> time) -- ie, even when I knew something's wrong and the CDrom should
> not behave like it was, I wasn't able to stop it until I powered the
> machine off (just unplugged the power cable) - but.. too late.
> 
> Yes, bad media is just that - a bad thing.  But it's not a reason to
> force power unplug to stop the process, and not a reason to burn a
> drive (or anything else).  And this is where readahead comes into
> play - it IS read-ahead logic who's responsible for the situation.
> 
> And there's alot of scratched/whatever CD-Rom drives out there -
> unreadable CDrom (or a floppy which is already ancient, or some
> other media) - you can't just say to every user out there that
> linux isn't compatible with all people's stuff and those people
> should "fix" it before ever trying to insert it into their linux
> machine...
> 
> Thanks.
> 
> /mjt

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 00/32] Adaptive readahead V14
       [not found]           ` <20060529030152.GA5994@mail.ustc.edu.cn>
  2006-05-29  3:01             ` Wu Fengguang
@ 2006-05-30  9:23             ` Jens Axboe
       [not found]               ` <20060530113221.GA8665@mail.ustc.edu.cn>
  1 sibling, 1 reply; 30+ messages in thread
From: Jens Axboe @ 2006-05-30  9:23 UTC (permalink / raw)
  To: Wu Fengguang, Michael Tokarev, linux-kernel

On Mon, May 29 2006, Wu Fengguang wrote:
> On Sun, May 28, 2006 at 11:23:33PM +0400, Michael Tokarev wrote:
> > Wu Fengguang wrote:
> > > 
> > > It's not quite reasonable for readahead to worry about media errors.
> > > If the media fails, fix it. Or it will hurt read sooner or later.
> > 
> > Well... In reality, it is just the opposite.
> > 
> > Suppose there's a CD-rom with a scratch/etc, one sector is unreadable.
> > In order to "fix" it, one have to read it and write to another CD-rom,
> > or something.. or just ignore the error (if it's just a skip in a video
> > stream).  Let's assume the unreadable block is number U.
> > 
> > But current behavior is just insane.  An application requests block
> > number N, which is before U. Kernel tries to read-ahead blocks N..U.
> > Cdrom drive tries to read it, re-read it.. for some time.  Finally,
> > when all the N..U-1 blocks are read, kernel returns block number N
> > (as requested) to an application, successefully.
> > 
> > Now an app requests block number N+1, and kernel tries to read
> > blocks N+1..U+1.  Retrying again as in previous step.
> > 
> > And so on, up to when an app requests block number U-1.  And when,
> > finally, it requests block U, it receives read error.
> > 
> > So, kernel currentry tries to re-read the same failing block as
> > many times as the current readahead value (256 (times?) by default).
> 
> Good insight... But I'm not sure about it.
> 
> Jens, will a bad sector cause the _whole_ request to fail?
> Or only the page that contains the bad sector?

Depends entirely on the driver, and that point we've typically lost the
fact that this is a read-ahead request and could just be tossed. In
fact, the entire request may consist of read-ahead as well as normal
read entries.

For ide-cd, it tends do only end the first part of the request on a
medium error. So you may see a lot of repeats :/

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 00/32] Adaptive readahead V14
       [not found]               ` <20060530113221.GA8665@mail.ustc.edu.cn>
@ 2006-05-30 11:32                 ` Wu Fengguang
  2006-05-30 12:29                 ` Jens Axboe
  1 sibling, 0 replies; 30+ messages in thread
From: Wu Fengguang @ 2006-05-30 11:32 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Michael Tokarev, linux-kernel, Andrew Morton

On Tue, May 30, 2006 at 11:23:10AM +0200, Jens Axboe wrote:
> On Mon, May 29 2006, Wu Fengguang wrote:
> > On Sun, May 28, 2006 at 11:23:33PM +0400, Michael Tokarev wrote:
> > > Wu Fengguang wrote:
> > > > 
> > > > It's not quite reasonable for readahead to worry about media errors.
> > > > If the media fails, fix it. Or it will hurt read sooner or later.
> > > 
> > > Well... In reality, it is just the opposite.
> > > 
> > > Suppose there's a CD-rom with a scratch/etc, one sector is unreadable.
> > > In order to "fix" it, one have to read it and write to another CD-rom,
> > > or something.. or just ignore the error (if it's just a skip in a video
> > > stream).  Let's assume the unreadable block is number U.
> > > 
> > > But current behavior is just insane.  An application requests block
> > > number N, which is before U. Kernel tries to read-ahead blocks N..U.
> > > Cdrom drive tries to read it, re-read it.. for some time.  Finally,
> > > when all the N..U-1 blocks are read, kernel returns block number N
> > > (as requested) to an application, successefully.
> > > 
> > > Now an app requests block number N+1, and kernel tries to read
> > > blocks N+1..U+1.  Retrying again as in previous step.
> > > 
> > > And so on, up to when an app requests block number U-1.  And when,
> > > finally, it requests block U, it receives read error.
> > > 
> > > So, kernel currentry tries to re-read the same failing block as
> > > many times as the current readahead value (256 (times?) by default).
> > 
> > Good insight... But I'm not sure about it.
> > 
> > Jens, will a bad sector cause the _whole_ request to fail?
> > Or only the page that contains the bad sector?
> 
> Depends entirely on the driver, and that point we've typically lost the
> fact that this is a read-ahead request and could just be tossed. In
> fact, the entire request may consist of read-ahead as well as normal
> read entries.
> 
> For ide-cd, it tends do only end the first part of the request on a
> medium error. So you may see a lot of repeats :/

Another question about it:
        If the block layer issued a request, which happened to contain
        R ranges of B bad blocks, i.e. 3 ranges of 9 bad-blocks:
                ___b_____bb___________bbbbbb____
        How many retries will incur? 1, 3, 9, or something else?
        If it is 3 or more, then we are even more bad luck :(

Will it be suitable to _automatically_ apply the following retracting
policy on I/O error? Please comment if there's better ways:

--- linux-2.6.17-rc4-mm3.orig/mm/filemap.c
+++ linux-2.6.17-rc4-mm3/mm/filemap.c
@@ -983,6 +983,7 @@ readpage:
 				}
 				unlock_page(page);
 				error = -EIO;
+				ra.ra_pages /= 2;
 				goto readpage_error;
 			}
 			unlock_page(page);
@@ -1535,6 +1536,7 @@ page_not_uptodate:
 	 * Things didn't work out. Return zero to tell the
 	 * mm layer so, possibly freeing the page cache page first.
 	 */
+	ra->ra_pages /= 2;
 	page_cache_release(page);
 	return NULL;
 }


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 00/32] Adaptive readahead V14
       [not found]               ` <20060530113221.GA8665@mail.ustc.edu.cn>
  2006-05-30 11:32                 ` Wu Fengguang
@ 2006-05-30 12:29                 ` Jens Axboe
       [not found]                   ` <20060530143417.GA9126@mail.ustc.edu.cn>
  1 sibling, 1 reply; 30+ messages in thread
From: Jens Axboe @ 2006-05-30 12:29 UTC (permalink / raw)
  To: Wu Fengguang, Michael Tokarev, linux-kernel, Andrew Morton

On Tue, May 30 2006, Wu Fengguang wrote:
> On Tue, May 30, 2006 at 11:23:10AM +0200, Jens Axboe wrote:
> > On Mon, May 29 2006, Wu Fengguang wrote:
> > > On Sun, May 28, 2006 at 11:23:33PM +0400, Michael Tokarev wrote:
> > > > Wu Fengguang wrote:
> > > > > 
> > > > > It's not quite reasonable for readahead to worry about media errors.
> > > > > If the media fails, fix it. Or it will hurt read sooner or later.
> > > > 
> > > > Well... In reality, it is just the opposite.
> > > > 
> > > > Suppose there's a CD-rom with a scratch/etc, one sector is unreadable.
> > > > In order to "fix" it, one have to read it and write to another CD-rom,
> > > > or something.. or just ignore the error (if it's just a skip in a video
> > > > stream).  Let's assume the unreadable block is number U.
> > > > 
> > > > But current behavior is just insane.  An application requests block
> > > > number N, which is before U. Kernel tries to read-ahead blocks N..U.
> > > > Cdrom drive tries to read it, re-read it.. for some time.  Finally,
> > > > when all the N..U-1 blocks are read, kernel returns block number N
> > > > (as requested) to an application, successefully.
> > > > 
> > > > Now an app requests block number N+1, and kernel tries to read
> > > > blocks N+1..U+1.  Retrying again as in previous step.
> > > > 
> > > > And so on, up to when an app requests block number U-1.  And when,
> > > > finally, it requests block U, it receives read error.
> > > > 
> > > > So, kernel currentry tries to re-read the same failing block as
> > > > many times as the current readahead value (256 (times?) by default).
> > > 
> > > Good insight... But I'm not sure about it.
> > > 
> > > Jens, will a bad sector cause the _whole_ request to fail?
> > > Or only the page that contains the bad sector?
> > 
> > Depends entirely on the driver, and that point we've typically lost the
> > fact that this is a read-ahead request and could just be tossed. In
> > fact, the entire request may consist of read-ahead as well as normal
> > read entries.
> > 
> > For ide-cd, it tends do only end the first part of the request on a
> > medium error. So you may see a lot of repeats :/
> 
> Another question about it:
>         If the block layer issued a request, which happened to contain
>         R ranges of B bad blocks, i.e. 3 ranges of 9 bad-blocks:
>                 ___b_____bb___________bbbbbb____
>         How many retries will incur? 1, 3, 9, or something else?
>         If it is 3 or more, then we are even more bad luck :(

Again, this is driver specific. But for ide-cd, if it's using PIO the
right thing should happen since we do each chunk individually. For dma
it looks much worse, since we only get an EIO back from the hardware for
the entire range. It wont do the right thing at all, only for the very
last thing when get get past the last bbbbbb block.

> Will it be suitable to _automatically_ apply the following retracting
> policy on I/O error? Please comment if there's better ways:

Probably it should be even more aggressively scaling down. The real
problem is the drivers of course, we should spend some time fixing them
up too.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 00/32] Adaptive readahead V14
       [not found]                   ` <20060530143417.GA9126@mail.ustc.edu.cn>
@ 2006-05-30 14:34                     ` Wu Fengguang
  0 siblings, 0 replies; 30+ messages in thread
From: Wu Fengguang @ 2006-05-30 14:34 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Michael Tokarev, linux-kernel, Andrew Morton

On Tue, May 30, 2006 at 02:29:34PM +0200, Jens Axboe wrote:
> On Tue, May 30 2006, Wu Fengguang wrote:
> > On Tue, May 30, 2006 at 11:23:10AM +0200, Jens Axboe wrote:
> > > On Mon, May 29 2006, Wu Fengguang wrote:
> > > > On Sun, May 28, 2006 at 11:23:33PM +0400, Michael Tokarev wrote:
> > > > > Wu Fengguang wrote:
> > > > > > 
> > > > > > It's not quite reasonable for readahead to worry about media errors.
> > > > > > If the media fails, fix it. Or it will hurt read sooner or later.
> > > > > 
> > > > > Well... In reality, it is just the opposite.
> > > > > 
> > > > > Suppose there's a CD-rom with a scratch/etc, one sector is unreadable.
> > > > > In order to "fix" it, one have to read it and write to another CD-rom,
> > > > > or something.. or just ignore the error (if it's just a skip in a video
> > > > > stream).  Let's assume the unreadable block is number U.
> > > > > 
> > > > > But current behavior is just insane.  An application requests block
> > > > > number N, which is before U. Kernel tries to read-ahead blocks N..U.
> > > > > Cdrom drive tries to read it, re-read it.. for some time.  Finally,
> > > > > when all the N..U-1 blocks are read, kernel returns block number N
> > > > > (as requested) to an application, successefully.
> > > > > 
> > > > > Now an app requests block number N+1, and kernel tries to read
> > > > > blocks N+1..U+1.  Retrying again as in previous step.
> > > > > 
> > > > > And so on, up to when an app requests block number U-1.  And when,
> > > > > finally, it requests block U, it receives read error.
> > > > > 
> > > > > So, kernel currentry tries to re-read the same failing block as
> > > > > many times as the current readahead value (256 (times?) by default).
> > > > 
> > > > Good insight... But I'm not sure about it.
> > > > 
> > > > Jens, will a bad sector cause the _whole_ request to fail?
> > > > Or only the page that contains the bad sector?
> > > 
> > > Depends entirely on the driver, and that point we've typically lost the
> > > fact that this is a read-ahead request and could just be tossed. In
> > > fact, the entire request may consist of read-ahead as well as normal
> > > read entries.
> > > 
> > > For ide-cd, it tends do only end the first part of the request on a
> > > medium error. So you may see a lot of repeats :/
> > 
> > Another question about it:
> >         If the block layer issued a request, which happened to contain
> >         R ranges of B bad blocks, i.e. 3 ranges of 9 bad-blocks:
> >                 ___b_____bb___________bbbbbb____
> >         How many retries will incur? 1, 3, 9, or something else?
> >         If it is 3 or more, then we are even more bad luck :(
> 
> Again, this is driver specific. But for ide-cd, if it's using PIO the
> right thing should happen since we do each chunk individually. For dma
> it looks much worse, since we only get an EIO back from the hardware for
> the entire range. It wont do the right thing at all, only for the very
> last thing when get get past the last bbbbbb block.
> 
> > Will it be suitable to _automatically_ apply the following retracting
> > policy on I/O error? Please comment if there's better ways:
> 
> Probably it should be even more aggressively scaling down. The real
> problem is the drivers of course, we should spend some time fixing them
> up too.

nod, it's so frustrating...

Updated the patch, please comment if necessary.

With this patch, retries are reduced from, say, 256, to 5.

Wu
---

--- linux.orig/mm/filemap.c
+++ linux/mm/filemap.c
@@ -809,6 +809,32 @@ grab_cache_page_nowait(struct address_sp
 EXPORT_SYMBOL(grab_cache_page_nowait);
 
 /*
+ * CD/DVDs are error prone. When a medium error occurs, the driver may fail
+ * a _large_ part of the i/o request. Imagine the worst scenario:
+ *
+ *      ---R__________________________________________B__________
+ *         ^ reading here                             ^ bad block(assume 4k)
+ *
+ * read(R) => miss => readahead(R...B) => media error => frustrating retries
+ * => failing the whole request => read(R) => read(R+1) =>
+ * readahead(R+1...B+1) => bang => read(R+2) => read(R+3) =>
+ * readahead(R+3...B+2) => bang => read(R+3) => read(R+4) =>
+ * readahead(R+4...B+3) => bang => read(R+4) => read(R+5) => ......
+ *
+ * It is going insane. Fix it by quickly scale down the readahead size.
+ */
+static void shrink_readahead_size_eio(struct file *filp,
+					struct file_ra_state *ra)
+{
+	if (!ra->ra_pages)
+		return;
+
+	ra->ra_pages /= 4;
+	printk(KERN_WARNING "Retracting readahead size of %s to %lu\n",
+			filp->f_dentry->d_iname, ra->ra_pages);
+}
+
+/*
  * This is a generic file read routine, and uses the
  * mapping->a_ops->readpage() function for the actual low-level
  * stuff.
@@ -983,6 +1009,7 @@ readpage:
 				}
 				unlock_page(page);
 				error = -EIO;
+				shrink_readahead_size_eio(filp, &ra);
 				goto readpage_error;
 			}
 			unlock_page(page);
@@ -1535,6 +1562,7 @@ page_not_uptodate:
 	 * Things didn't work out. Return zero to tell the
 	 * mm layer so, possibly freeing the page cache page first.
 	 */
+	shrink_readahead_size_eio(file, ra);
 	page_cache_release(page);
 	return NULL;
 }

^ permalink raw reply	[flat|nested] 30+ messages in thread

end of thread, other threads:[~2006-05-30 14:34 UTC | newest]

Thread overview: 30+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
     [not found] <20060527154849.927021763@localhost.localdomain>
2006-05-27 15:48 ` [PATCH 00/32] Adaptive readahead V14 Wu Fengguang
2006-05-27 17:29   ` Michael Tokarev
     [not found]     ` <20060528120815.GB6478@mail.ustc.edu.cn>
2006-05-28 12:08       ` Wu Fengguang
2006-05-28 19:23         ` Michael Tokarev
     [not found]           ` <20060529030152.GA5994@mail.ustc.edu.cn>
2006-05-29  3:01             ` Wu Fengguang
2006-05-30  9:23             ` Jens Axboe
     [not found]               ` <20060530113221.GA8665@mail.ustc.edu.cn>
2006-05-30 11:32                 ` Wu Fengguang
2006-05-30 12:29                 ` Jens Axboe
     [not found]                   ` <20060530143417.GA9126@mail.ustc.edu.cn>
2006-05-30 14:34                     ` Wu Fengguang
     [not found] ` <20060527155125.911021581@localhost.localdomain>
2006-05-27 15:48   ` [PATCH 01/32] readahead: kconfig options Wu Fengguang
     [not found] ` <20060527155127.522802387@localhost.localdomain>
2006-05-27 15:48   ` [PATCH 04/32] mm: introduce PG_readahead Wu Fengguang
     [not found] ` <20060527155128.472551240@localhost.localdomain>
2006-05-27 15:48   ` [PATCH 06/32] readahead: delay page release in do_generic_mapping_read() Wu Fengguang
     [not found] ` <20060527155129.001886224@localhost.localdomain>
2006-05-27 15:48   ` [PATCH 07/32] readahead: insert cond_resched() calls Wu Fengguang
     [not found] ` <20060527155129.653903854@localhost.localdomain>
2006-05-27 15:48   ` [PATCH 08/32] readahead: {MIN,MAX}_RA_PAGES Wu Fengguang
     [not found] ` <20060527155130.013773601@localhost.localdomain>
2006-05-27 15:48   ` [PATCH 09/32] readahead: events accounting Wu Fengguang
     [not found] ` <20060527155130.538411854@localhost.localdomain>
2006-05-27 15:48   ` [PATCH 10/32] readahead: rescue_pages() Wu Fengguang
     [not found] ` <20060527155131.200177171@localhost.localdomain>
2006-05-27 15:49   ` [PATCH 11/32] readahead: sysctl parameters Wu Fengguang
     [not found] ` <20060527155132.649338979@localhost.localdomain>
2006-05-27 15:49   ` [PATCH 14/32] readahead: state based method - routines Wu Fengguang
     [not found] ` <20060527155133.216888332@localhost.localdomain>
2006-05-27 15:49   ` [PATCH 15/32] readahead: state based method Wu Fengguang
     [not found] ` <20060527155134.715578802@localhost.localdomain>
2006-05-27 15:49   ` [PATCH 18/32] readahead: initial method - thrashing guard size Wu Fengguang
     [not found] ` <20060527155135.584918734@localhost.localdomain>
2006-05-27 15:49   ` [PATCH 20/32] readahead: initial method - user recommended size Wu Fengguang
     [not found] ` <20060527155136.503037461@localhost.localdomain>
2006-05-27 15:49   ` [PATCH 22/32] readahead: backward prefetching method Wu Fengguang
     [not found] ` <20060527155137.552915509@localhost.localdomain>
2006-05-27 15:49   ` [PATCH 24/32] readahead: thrashing recovery method Wu Fengguang
2006-05-27 22:04     ` [PATCH 23/32] readahead: seeking reads method Ingo Oeser
     [not found] ` <20060527155138.046726658@localhost.localdomain>
2006-05-27 15:49   ` [PATCH 25/32] readahead: call scheme Wu Fengguang
     [not found] ` <20060527155138.454809673@localhost.localdomain>
2006-05-27 15:49   ` [PATCH 26/32] readahead: laptop mode Wu Fengguang
     [not found] ` <20060527155140.035991503@localhost.localdomain>
2006-05-27 15:49   ` [PATCH 27/32] readahead: loop case Wu Fengguang
     [not found] ` <20060527155141.697607086@localhost.localdomain>
2006-05-27 15:49   ` [PATCH 30/32] readahead: debug radix tree new functions Wu Fengguang
     [not found] ` <20060527155142.129761018@localhost.localdomain>
2006-05-27 15:49   ` [PATCH 31/32] readahead: debug traces showing accessed file names Wu Fengguang
     [not found] ` <20060527155142.715530234@localhost.localdomain>
2006-05-27 15:49   ` [PATCH 32/32] readahead: debug traces showing read patterns Wu Fengguang

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox