linux-kernel.vger.kernel.org archive mirror
* [PATCH 0/9] on-demand readahead
       [not found] <20070516224752.500812933@mail.ustc.edu.cn>
@ 2007-05-16 22:47 ` Fengguang Wu
       [not found] ` <20070516224818.246719596@mail.ustc.edu.cn>
                   ` (8 subsequent siblings)
  9 siblings, 0 replies; 34+ messages in thread
From: Fengguang Wu @ 2007-05-16 22:47 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, Andi Kleen, Jens Axboe, Oleg Nesterov, Steven Pratt,
	Ram Pai

Andrew,

This is the standalone on-demand readahead patchset that applies to
linux-2.6.22-rc1-mm1. It _replaces_ the current readahead algorithm with
the on-demand readahead algorithm that was first introduced in the mail at
http://lkml.org/lkml/2007/4/25/182.

patchset:

[PATCH 1/9] readahead: introduce PG_readahead
[PATCH 2/9] readahead: add look-ahead support to __do_page_cache_readahead()
[PATCH 3/9] readahead: MIN_RA_PAGES/MAX_RA_PAGES macros
[PATCH 4/9] readahead: data structure and routines
[PATCH 5/9] readahead: on-demand readahead logic
[PATCH 6/9] readahead: convert filemap invocations
[PATCH 7/9] readahead: convert splice invocations
[PATCH 8/9] readahead: convert ext3/ext4 invocations
[PATCH 9/9] readahead: remove the old algorithm

diffstat:

 fs/ext3/dir.c              |   14
 fs/ext4/dir.c              |   14
 fs/splice.c                |   18 -
 include/linux/fs.h         |   73 ++++-
 include/linux/mm.h         |    5
 include/linux/page-flags.h |    5
 mm/filemap.c               |   51 ++-
 mm/page_alloc.c            |    2
 mm/readahead.c             |  511 +++++++++++++----------------------
 9 files changed, 322 insertions(+), 371 deletions(-)

Regards,
Fengguang Wu
--


* [PATCH 1/9] readahead: introduce PG_readahead
       [not found] ` <20070516224818.246719596@mail.ustc.edu.cn>
@ 2007-05-16 22:47   ` Fengguang Wu
  2007-05-19  6:28     ` Andrew Morton
  2007-06-12  1:04   ` Rusty Russell
  1 sibling, 1 reply; 34+ messages in thread
From: Fengguang Wu @ 2007-05-16 22:47 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, Andi Kleen, Jens Axboe, Oleg Nesterov, Steven Pratt,
	Ram Pai

[-- Attachment #1: mm-introduce-pg_readahead.patch --]
[-- Type: text/plain, Size: 1907 bytes --]

Introduce a new page flag: PG_readahead.

It acts as a look-ahead mark, which tells the page reader:
Hey, it's time to invoke the read-ahead logic.  For the sake of I/O pipelining,
don't wait until it runs out of cached pages!
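
A minimal sketch of the intended reader-side protocol (illustrative only;
the actual call sites and the readahead entry point are introduced by later
patches in this series):

	page = find_get_page(mapping, index);
	if (page && PageReadahead(page))
		/* look-ahead mark hit: start the next readahead early */
		page_cache_readahead_ondemand(mapping, ra, filp, page,
						index, req_size);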

Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn>
---

 include/linux/page-flags.h |    5 +++++
 mm/page_alloc.c            |    2 +-
 2 files changed, 6 insertions(+), 1 deletion(-)

--- linux-2.6.22-rc1-mm1.orig/include/linux/page-flags.h
+++ linux-2.6.22-rc1-mm1/include/linux/page-flags.h
@@ -92,6 +92,7 @@
 #define PG_lazyfree		20	/* MADV_FREE potential throwaway */
 
 #define PG_booked		21	/* Has blocks reserved on-disk */
+#define PG_readahead		22	/* Reminder to do async read-ahead */
 
 /* PG_owner_priv_1 users should have descriptive aliases */
 #define PG_checked		PG_owner_priv_1 /* Used by some filesystems */
@@ -206,6 +207,10 @@ static inline void SetPageUptodate(struc
 #define SetPageBooked(page)	set_bit(PG_booked, &(page)->flags)
 #define ClearPageBooked(page)	clear_bit(PG_booked, &(page)->flags)
 
+#define PageReadahead(page)	test_bit(PG_readahead, &(page)->flags)
+#define SetPageReadahead(page)	set_bit(PG_readahead, &(page)->flags)
+#define ClearPageReadahead(page) clear_bit(PG_readahead, &(page)->flags)
+
 #define PageReclaim(page)	test_bit(PG_reclaim, &(page)->flags)
 #define SetPageReclaim(page)	set_bit(PG_reclaim, &(page)->flags)
 #define ClearPageReclaim(page)	clear_bit(PG_reclaim, &(page)->flags)
--- linux-2.6.22-rc1-mm1.orig/mm/page_alloc.c
+++ linux-2.6.22-rc1-mm1/mm/page_alloc.c
@@ -659,7 +659,7 @@ static int prep_new_page(struct page *pa
 	if (PageReserved(page))
 		return 1;
 
-	page->flags &= ~(1 << PG_uptodate | 1 << PG_error |
+	page->flags &= ~(1 << PG_uptodate | 1 << PG_error | 1 << PG_readahead |
 			1 << PG_referenced | 1 << PG_arch_1 |
 			1 << PG_owner_priv_1 | 1 << PG_mappedtodisk);
 	set_page_private(page, 0);

--


* [PATCH 2/9] readahead: add look-ahead support to __do_page_cache_readahead()
       [not found] ` <20070516224818.384876951@mail.ustc.edu.cn>
@ 2007-05-16 22:47   ` Fengguang Wu
  0 siblings, 0 replies; 34+ messages in thread
From: Fengguang Wu @ 2007-05-16 22:47 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, Andi Kleen, Jens Axboe, Oleg Nesterov, Steven Pratt,
	Ram Pai

[-- Attachment #1: readahead-add-look-ahead-support-to-__do_page_cache_readahead.patch --]
[-- Type: text/plain, Size: 2998 bytes --]

Add look-ahead support to __do_page_cache_readahead().

It works by
	- marking the page lookahead_size pages before the end of the
	  readahead window with PG_readahead,
	(which instructs the page's first reader to invoke readahead)
	- and doing the marking only for newly allocated pages.
	(to prevent blindly doing readahead on already cached pages)

Look-ahead is a technique to achieve I/O pipelining:
While the application is working through a chunk of cached pages,
the kernel reads ahead the next chunk of pages _before_ they are needed.
It effectively hides low-level I/O latencies from high-level applications.
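
As a concrete instance of the marking rule (numbers illustrative): for a
16-page readahead with an 8-page lookahead_size, the loop below sets
PG_readahead on the page at offset+8, so its reader will trigger the next
readahead while 8 pages are still cached ahead of it.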

Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn>
---

 mm/readahead.c |   15 +++++++++------
 1 file changed, 9 insertions(+), 6 deletions(-)

--- linux-2.6.22-rc1-mm1.orig/mm/readahead.c
+++ linux-2.6.22-rc1-mm1/mm/readahead.c
@@ -265,7 +265,8 @@ out:
  */
 static int
 __do_page_cache_readahead(struct address_space *mapping, struct file *filp,
-			pgoff_t offset, unsigned long nr_to_read)
+			pgoff_t offset, unsigned long nr_to_read,
+			unsigned long lookahead_size)
 {
 	struct inode *inode = mapping->host;
 	struct page *page;
@@ -278,7 +279,7 @@ __do_page_cache_readahead(struct address
 	if (isize == 0)
 		goto out;
 
- 	end_index = ((isize - 1) >> PAGE_CACHE_SHIFT);
+	end_index = ((isize - 1) >> PAGE_CACHE_SHIFT);
 
 	/*
 	 * Preallocate as many pages as we will need.
@@ -301,6 +302,8 @@ __do_page_cache_readahead(struct address
 			break;
 		page->index = page_offset;
 		list_add(&page->lru, &page_pool);
+		if (page_idx == nr_to_read - lookahead_size)
+			SetPageReadahead(page);
 		ret++;
 	}
 	read_unlock_irq(&mapping->tree_lock);
@@ -337,7 +340,7 @@ int force_page_cache_readahead(struct ad
 		if (this_chunk > nr_to_read)
 			this_chunk = nr_to_read;
 		err = __do_page_cache_readahead(mapping, filp,
-						offset, this_chunk);
+						offset, this_chunk, 0);
 		if (err < 0) {
 			ret = err;
 			break;
@@ -384,7 +387,7 @@ int do_page_cache_readahead(struct addre
 	if (bdi_read_congested(mapping->backing_dev_info))
 		return -1;
 
-	return __do_page_cache_readahead(mapping, filp, offset, nr_to_read);
+	return __do_page_cache_readahead(mapping, filp, offset, nr_to_read, 0);
 }
 
 /*
@@ -404,7 +407,7 @@ blockable_page_cache_readahead(struct ad
 	if (!block && bdi_read_congested(mapping->backing_dev_info))
 		return 0;
 
-	actual = __do_page_cache_readahead(mapping, filp, offset, nr_to_read);
+	actual = __do_page_cache_readahead(mapping, filp, offset, nr_to_read, 0);
 
 	return check_ra_success(ra, nr_to_read, actual);
 }
@@ -449,7 +452,7 @@ static int make_ahead_window(struct addr
  * @req_size: hint: total size of the read which the caller is performing in
  *            PAGE_CACHE_SIZE units
  *
- * page_cache_readahead() is the main function.  If performs the adaptive
+ * page_cache_readahead() is the main function.  It performs the adaptive
  * readahead window size management and submits the readahead I/O.
  *
  * Note that @filp is purely used for passing on to the ->readpage[s]()

--


* [PATCH 3/9] readahead: MIN_RA_PAGES/MAX_RA_PAGES macros
       [not found] ` <20070516224818.544351896@mail.ustc.edu.cn>
@ 2007-05-16 22:47   ` Fengguang Wu
  0 siblings, 0 replies; 34+ messages in thread
From: Fengguang Wu @ 2007-05-16 22:47 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, Andi Kleen, Jens Axboe, Oleg Nesterov, Steven Pratt,
	Ram Pai

[-- Attachment #1: readahead-minmax_ra_pages.patch --]
[-- Type: text/plain, Size: 1467 bytes --]

Define two convenient macros for read-ahead:
	- MAX_RA_PAGES: rounded-down counterpart of VM_MAX_READAHEAD
	- MIN_RA_PAGES: rounded-_up_ counterpart of VM_MIN_READAHEAD

Note that rounding MIN_RA_PAGES _up_ is what makes it work flawlessly with
_large_ page sizes like 64k.
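
For instance (assuming the stock VM_MIN_READAHEAD of 16 kbytes): with 64k
pages, rounding down would give 16*1024/65536 = 0 pages and silently
disable the minimum, while DIV_ROUND_UP(16*1024, 65536) = 1 page keeps a
sane lower bound.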

Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn>
---

 mm/readahead.c |   12 ++++++++++--
 1 file changed, 10 insertions(+), 2 deletions(-)

--- linux-2.6.22-rc1-mm1.orig/mm/readahead.c
+++ linux-2.6.22-rc1-mm1/mm/readahead.c
@@ -21,8 +21,16 @@ void default_unplug_io_fn(struct backing
 }
 EXPORT_SYMBOL(default_unplug_io_fn);
 
+/*
+ * Convenient macros for min/max read-ahead pages.
+ * Note that MAX_RA_PAGES is rounded down, while MIN_RA_PAGES is rounded up.
+ * The latter is necessary for systems with large page sizes (e.g. 64k).
+ */
+#define MAX_RA_PAGES	(VM_MAX_READAHEAD*1024 / PAGE_CACHE_SIZE)
+#define MIN_RA_PAGES	DIV_ROUND_UP(VM_MIN_READAHEAD*1024, PAGE_CACHE_SIZE)
+
 struct backing_dev_info default_backing_dev_info = {
-	.ra_pages	= (VM_MAX_READAHEAD * 1024) / PAGE_CACHE_SIZE,
+	.ra_pages	= MAX_RA_PAGES,
 	.state		= 0,
 	.capabilities	= BDI_CAP_MAP_COPY,
 	.unplug_io_fn	= default_unplug_io_fn,
@@ -51,7 +59,7 @@ static inline unsigned long get_max_read
 
 static inline unsigned long get_min_readahead(struct file_ra_state *ra)
 {
-	return (VM_MIN_READAHEAD * 1024) / PAGE_CACHE_SIZE;
+	return MIN_RA_PAGES;
 }
 
 static inline void reset_ahead_window(struct file_ra_state *ra)

--


* [PATCH 4/9] readahead: data structure and routines
       [not found] ` <20070516224818.683288460@mail.ustc.edu.cn>
@ 2007-05-16 22:47   ` Fengguang Wu
  2007-06-12  3:30   ` Rusty Russell
  1 sibling, 0 replies; 34+ messages in thread
From: Fengguang Wu @ 2007-05-16 22:47 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, Andi Kleen, Jens Axboe, Oleg Nesterov, Steven Pratt,
	Ram Pai

[-- Attachment #1: readahead-data-structure-and-routines.patch --]
[-- Type: text/plain, Size: 3850 bytes --]

Extend struct file_ra_state to support the on-demand readahead logic.
Also define some helpers for it.
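
A minimal sketch of how the helpers fit together (page indices are
illustrative; ra_submit() is defined below in this patch):

	/* establish a 16-page window at page 100, with an 8-page look-ahead */
	ra_set_index(ra, 100, 100);	/* la_index = ra_index = 100 */
	ra_set_size(ra, 16, 8);		/* readahead_index = 116, lookahead_index = 108 */
	ra_submit(ra, mapping, filp);	/* reads pages [100, 116), marks page 108 */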

Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn>
---

 include/linux/fs.h |   64 +++++++++++++++++++++++++++++++++++++++++++
 mm/readahead.c     |   19 ++++++++++++
 2 files changed, 83 insertions(+)

--- linux-2.6.22-rc1-mm1.orig/include/linux/fs.h
+++ linux-2.6.22-rc1-mm1/include/linux/fs.h
@@ -702,6 +702,10 @@ struct fown_struct {
 
 /*
  * Track a single file's readahead state
+ *
+ *  ================#============|==================#==================|
+ *                  ^            ^                  ^                  ^
+ *  file_ra_state.la_index    .ra_index   .lookahead_index   .readahead_index
  */
 struct file_ra_state {
 	unsigned long start;		/* Current window */
@@ -711,6 +715,12 @@ struct file_ra_state {
 	unsigned long prev_index;	/* Cache last read() position */
 	unsigned long ahead_start;	/* Ahead window */
 	unsigned long ahead_size;
+
+	pgoff_t la_index;               /* enqueue time */
+	pgoff_t ra_index;               /* begin offset */
+	pgoff_t lookahead_index;        /* time to do next readahead */
+	pgoff_t readahead_index;        /* end offset */
+
 	unsigned long ra_pages;		/* Maximum readahead window */
 	unsigned long mmap_hit;		/* Cache hit stat for mmap accesses */
 	unsigned long mmap_miss;	/* Cache miss stat for mmap accesses */
@@ -719,6 +729,60 @@ struct file_ra_state {
 #define RA_FLAG_MISS 0x01	/* a cache miss occured against this file */
 #define RA_FLAG_INCACHE 0x02	/* file is already in cache */
 
+/*
+ * Measuring read-ahead sizes.
+ *
+ *                  |----------- readahead size ------------>|
+ *  ===#============|==================#=====================|
+ *     |------- invoke interval ------>|-- lookahead size -->|
+ */
+static inline unsigned long ra_readahead_size(struct file_ra_state *ra)
+{
+	return ra->readahead_index - ra->ra_index;
+}
+
+static inline unsigned long ra_lookahead_size(struct file_ra_state *ra)
+{
+	return ra->readahead_index - ra->lookahead_index;
+}
+
+static inline unsigned long ra_invoke_interval(struct file_ra_state *ra)
+{
+	return ra->lookahead_index - ra->la_index;
+}
+
+/*
+ * Check if @index falls in the readahead windows.
+ */
+static inline int ra_has_index(struct file_ra_state *ra, pgoff_t index)
+{
+	return (index >= ra->la_index &&
+		index <  ra->readahead_index);
+}
+
+/*
+ * Where is the old read-ahead and look-ahead?
+ */
+static inline void ra_set_index(struct file_ra_state *ra,
+				pgoff_t la_index, pgoff_t ra_index)
+{
+	ra->la_index = la_index;
+	ra->ra_index = ra_index;
+}
+
+/*
+ * Where is the new read-ahead and look-ahead?
+ */
+static inline void ra_set_size(struct file_ra_state *ra,
+				unsigned long ra_size, unsigned long la_size)
+{
+	ra->readahead_index = ra->ra_index + ra_size;
+	ra->lookahead_index = ra->ra_index + ra_size - la_size;
+}
+
+unsigned long ra_submit(struct file_ra_state *ra,
+		       struct address_space *mapping, struct file *filp);
+
 struct file {
 	/*
 	 * fu_list becomes invalid after file_free is called and queued via
--- linux-2.6.22-rc1-mm1.orig/mm/readahead.c
+++ linux-2.6.22-rc1-mm1/mm/readahead.c
@@ -592,3 +592,22 @@ unsigned long max_sane_readahead(unsigne
 	return min(nr, (node_page_state(numa_node_id(), NR_INACTIVE)
 		+ node_page_state(numa_node_id(), NR_FREE_PAGES)) / 2);
 }
+
+/*
+ * Submit IO for the read-ahead request in file_ra_state.
+ */
+unsigned long ra_submit(struct file_ra_state *ra,
+		       struct address_space *mapping, struct file *filp)
+{
+	unsigned long ra_size;
+	unsigned long la_size;
+	int actual;
+
+	ra_size = ra_readahead_size(ra);
+	la_size = ra_lookahead_size(ra);
+	actual = __do_page_cache_readahead(mapping, filp,
+					ra->ra_index, ra_size, la_size);
+
+	return actual;
+}
+EXPORT_SYMBOL_GPL(ra_submit);

--


* [PATCH 5/9] readahead: on-demand readahead logic
       [not found] ` <20070516224818.841068730@mail.ustc.edu.cn>
@ 2007-05-16 22:47   ` Fengguang Wu
  2007-05-19  6:23     ` Andrew Morton
  2007-06-12  4:36   ` Rusty Russell
  1 sibling, 1 reply; 34+ messages in thread
From: Fengguang Wu @ 2007-05-16 22:47 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, Andi Kleen, Jens Axboe, Oleg Nesterov, Steven Pratt,
	Ram Pai

[-- Attachment #1: readahead-ondemand-minimal.patch --]
[-- Type: text/plain, Size: 15672 bytes --]

This is a minimal readahead algorithm that aims to replace the current one.
It is more flexible and reliable, while maintaining almost the same behavior
and performance. It is also fully integrated with adaptive readahead.

It is designed to be called on demand:
	- on a missing page, to do synchronous readahead
	- on a lookahead page, to do asynchronous readahead

In this way it eliminates the awkward workarounds for cache hit/miss,
readahead thrashing, retried reads, and unaligned reads. It also adopts the
data structure introduced by adaptive readahead, parameterizes readahead
pipelining with `lookahead_index', and reduces the current/ahead windows
to one single window.

HEURISTICS

The logic deals with four cases:

	- sequential-next
		found a consistent readahead window, so push it forward

	- random
		standalone small read, so read as is

	- sequential-first
		create a new readahead window for a sequential/oversize request

	- lookahead-clueless
		hit a lookahead page not associated with the readahead window,
		so create a new readahead window and ramp it up

In each case, three parameters are determined:

	- readahead index: where the next readahead begins
	- readahead size:  how much to readahead
	- lookahead size:  when to do the next readahead (for pipelining)
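
As a worked example of the sizing heuristics (illustrative, assuming 4k
pages and the 1MB max readahead used in the benchmarks below): a 4-page
sequential first read gets a 16-page initial window from get_init_ra_size(),
and successive lookahead hits then ramp it through 32, 64 and 128 up to 256
pages via get_next_ra_size2() (x4 while under max/16, x2 afterwards, capped
at the max).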


BEHAVIORS

The old behaviors are maximally preserved for trivial sequential/random reads.
Notable changes are:

	- It no longer imposes strict sequential checks.
	  It might help some interleaved cases, and clustered random reads.
	  It does introduce the risk of a random lookahead hit triggering an
	  unexpected readahead. But in general it is more likely to do good
	  than evil.

	- Interleaved reads are supported in a minimal way.
	  Their chances of being detected and properly handled are still low.

	- Readahead thrashing is better handled.
	  The current readahead leads to tiny average I/O sizes, because it
	  never turns back for the thrashed pages.  They have to be faulted
	  in by do_generic_mapping_read() one by one.  The on-demand
	  readahead, in contrast, will redo readahead for them.


OVERHEADS

The new code reduces the overheads of

	- excessively calling the readahead routine on small sized reads
	  (the current readahead code insists on seeing all requests)

	- doing a lot of pointless page-cache lookups for small cached files
	  (the current readahead only turns itself off after 256 cache hits;
	  unfortunately most files are < 1MB, so they never reach that threshold)

That accounts for speedups of
	- 0.3% on 1-page sequential reads on a sparse file
	- 1.2% on 1-page cache-hot sequential reads
	- 3.2% on 256-page cache-hot sequential reads
	- 1.3% on a cache-hot `tar /lib`

However, it does introduce one extra page-cache lookup per cache miss, which
impacts random reads slightly. That amounts to a 1% overhead for 1-page
random reads on a sparse file.


PERFORMANCE

The basic benchmark setup is
	- 2.6.20 kernel with on-demand readahead
	- 1MB max readahead size
	- 2.9GHz Intel Core 2 CPU
	- 2GB memory
	- 160G/8M Hitachi SATA II 7200 RPM disk

The benchmarks show that
	- it maintains the same performance for trivial sequential/random reads
	- sysbench/OLTP performance on MySQL gains up to 8%
	- performance on readahead thrashing gains up to 3 times


iozone throughput (KB/s): roughly the same
==========================================
iozone -c -t1 -s 4096m -r 64k

			       2.6.20          on-demand      gain
first run
	  "  Initial write "   61437.27        64521.53      +5.0%
	  "        Rewrite "   47893.02        48335.20      +0.9%
	  "           Read "   62111.84        62141.49      +0.0%
	  "        Re-read "   62242.66        62193.17      -0.1%
	  "   Reverse Read "   50031.46        49989.79      -0.1%
	  "    Stride read "    8657.61         8652.81      -0.1%
	  "    Random read "   13914.28        13898.23      -0.1%
	  " Mixed workload "   19069.27        19033.32      -0.2%
	  "   Random write "   14849.80        14104.38      -5.0%
	  "         Pwrite "   62955.30        65701.57      +4.4%
	  "          Pread "   62209.99        62256.26      +0.1%

second run
	  "  Initial write "   60810.31        66258.69      +9.0%
	  "        Rewrite "   49373.89        57833.66     +17.1%
	  "           Read "   62059.39        62251.28      +0.3%
	  "        Re-read "   62264.32        62256.82      -0.0%
	  "   Reverse Read "   49970.96        50565.72      +1.2%
	  "    Stride read "    8654.81         8638.45      -0.2%
	  "    Random read "   13901.44        13949.91      +0.3%
	  " Mixed workload "   19041.32        19092.04      +0.3%
	  "   Random write "   14019.99        14161.72      +1.0%
	  "         Pwrite "   64121.67        68224.17      +6.4%
	  "          Pread "   62225.08        62274.28      +0.1%

In summary, writes are unstable, reads are pretty close on average:

			  access pattern  2.6.20  on-demand   gain
				   Read  62085.61  62196.38  +0.2%
				Re-read  62253.49  62224.99  -0.0%
			   Reverse Read  50001.21  50277.75  +0.6%
			    Stride read   8656.21   8645.63  -0.1%
			    Random read  13907.86  13924.07  +0.1%
	 		 Mixed workload  19055.29  19062.68  +0.0%
				  Pread  62217.53  62265.27  +0.1%


aio-stress: roughly the same
============================
aio-stress -l -s4096 -r128 -t1 -o1 knoppix511-dvd-cn.iso
aio-stress -l -s4096 -r128 -t1 -o3 knoppix511-dvd-cn.iso

					2.6.20      on-demand  delta
			sequential	 92.57s      92.54s    -0.0%
			random		311.87s     312.15s    +0.1%


sysbench fileio: roughly the same
=================================
sysbench --test=fileio --file-io-mode=async --file-test-mode=rndrw \
	 --file-total-size=4G --file-block-size=64K \
	 --num-threads=001 --max-requests=10000 --max-time=900 run

				threads    2.6.20   on-demand    delta
		first run
				      1   59.1974s    59.2262s  +0.0%
				      2   58.0575s    58.2269s  +0.3%
				      4   48.0545s    47.1164s  -2.0%
				      8   41.0684s    41.2229s  +0.4%
				     16   35.8817s    36.4448s  +1.6%
				     32   32.6614s    32.8240s  +0.5%
				     64   23.7601s    24.1481s  +1.6%
				    128   24.3719s    23.8225s  -2.3%
				    256   23.2366s    22.0488s  -5.1%

		second run
				      1   59.6720s    59.5671s  -0.2%
				      8   41.5158s    41.9541s  +1.1%
				     64   25.0200s    23.9634s  -4.2%
				    256   22.5491s    20.9486s  -7.1%

Note that the numbers are not very stable because of the writes.
The overall performance is close when we sum all seconds up:

                sum all up               495.046s    491.514s   -0.7%


sysbench oltp (trans/sec): up to 8% gain
========================================
sysbench --test=oltp --oltp-table-size=10000000 --oltp-read-only \
	 --mysql-socket=/var/run/mysqld/mysqld.sock \
	 --mysql-user=root --mysql-password=readahead \
	 --num-threads=064 --max-requests=10000 --max-time=900 run

	10000-transactions run
				threads    2.6.20   on-demand    gain
				      1     62.81       64.56   +2.8%
				      2     67.97       70.93   +4.4%
				      4     81.81       85.87   +5.0%
				      8     94.60       97.89   +3.5%
				     16     99.07      104.68   +5.7%
				     32     95.93      104.28   +8.7%
				     64     96.48      103.68   +7.5%
	5000-transactions run
				      1     48.21       48.65   +0.9%
				      8     68.60       70.19   +2.3%
				     64     70.57       74.72   +5.9%
	2000-transactions run
				      1     37.57       38.04   +1.3%
				      2     38.43       38.99   +1.5%
				      4     45.39       46.45   +2.3%
				      8     51.64       52.36   +1.4%
				     16     54.39       55.18   +1.5%
				     32     52.13       54.49   +4.5%
				     64     54.13       54.61   +0.9%

Those are interesting results. Some investigation shows that
	- MySQL is accessing the db file non-uniformly: some parts are
	  hotter than others
	- It is mostly doing 4-page random reads, and sometimes doing two
	  reads in a row, where the latter one triggers a 16-page readahead.
	- The on-demand readahead leaves many lookahead pages (flagged
	  PG_readahead) around. Many of them will be hit, and will trigger
	  more readahead pages, which might save more seeks.
	- Naturally, the readahead windows tend to lie in hot areas,
	  and the lookahead pages in hot areas are more likely to be hit.
	- The higher the overall read density, the more possible gain.

That also explains the adaptive readahead tricks for clustered random reads.


readahead thrashing: 3 times better
===================================
We boot the kernel with "mem=128m single", and start a new 100KB/s stream
every second, until reaching 200 streams.

			      max throughput     min avg I/O size
		2.6.20:            5MB/s               16KB
		on-demand:        15MB/s              140KB

Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn>
---
 include/linux/mm.h |    6 +
 mm/readahead.c     |  174 +++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 180 insertions(+)

--- linux-2.6.22-rc1-mm1.orig/mm/readahead.c
+++ linux-2.6.22-rc1-mm1/mm/readahead.c
@@ -611,3 +611,177 @@ unsigned long ra_submit(struct file_ra_s
 	return actual;
 }
 EXPORT_SYMBOL_GPL(ra_submit);
+
+/*
+ *  Get the previous window size, ramp it up, and
+ *  return it as the new window size.
+ */
+static unsigned long get_next_ra_size2(struct file_ra_state *ra,
+						unsigned long max)
+{
+	unsigned long cur = ra->readahead_index - ra->ra_index;
+	unsigned long newsize;
+
+	if (cur < max / 16)
+		newsize = cur * 4;
+	else
+		newsize = cur * 2;
+
+	return min(newsize, max);
+}
+
+/*
+ * On-demand readahead design.
+ *
+ * The fields in struct file_ra_state represent the most-recently-executed
+ * readahead attempt:
+ *
+ *                    |-------- last readahead window -------->|
+ *       |-- application walking here -->|
+ * ======#============|==================#=====================|
+ *       ^la_index    ^ra_index          ^lookahead_index      ^readahead_index
+ *
+ * [ra_index, readahead_index) represents the last readahead window.
+ *
+ * [la_index, lookahead_index] is where the application would be walking (in
+ * the common case of cache-cold sequential reads): the last window was
+ * established when the application was at la_index, and the next window will
+ * be brought in when the application reaches lookahead_index.
+ *
+ * To overlap application thinking time and disk I/O time, we do
+ * `readahead pipelining': Do not wait until the application consumed all
+ * readahead pages and stalled on the missing page at readahead_index;
+ * Instead, submit an asynchronous readahead I/O as early as the application
+ * reads on the page at lookahead_index. Normally lookahead_index will be
+ * equal to ra_index, for maximum pipelining.
+ *
+ * In interleaved sequential reads, concurrent streams on the same fd can
+ * be invalidating each other's readahead state. So we flag the new readahead
+ * page at lookahead_index with PG_readahead, and use it as readahead
+ * indicator. The flag won't be set on already cached pages, to avoid the
+ * readahead-for-nothing fuss, saving pointless page cache lookups.
+ *
+ * prev_index tracks the last visited page in the _previous_ read request.
+ * It should be maintained by the caller, and will be used for detecting
+ * small random reads. Note that the readahead algorithm checks loosely
+ * for sequential patterns. Hence interleaved reads might be served as
+ * sequential ones.
+ *
+ * There is a special-case: if the first page which the application tries to
+ * read happens to be the first page of the file, it is assumed that a linear
+ * read is about to happen and the window is immediately set to the initial size
+ * based on I/O request size and the max_readahead.
+ *
+ * The code ramps up the readahead size aggressively at first, but slows down
+ * as it approaches max_readahead.
+ */
+
+/*
+ * A minimal readahead algorithm for trivial sequential/random reads.
+ */
+static unsigned long
+ondemand_readahead(struct address_space *mapping,
+		   struct file_ra_state *ra, struct file *filp,
+		   struct page *page, pgoff_t offset,
+		   unsigned long req_size)
+{
+	unsigned long max;	/* max readahead pages */
+	pgoff_t ra_index;	/* readahead index */
+	unsigned long ra_size;	/* readahead size */
+	unsigned long la_size;	/* lookahead size */
+	int sequential;
+
+	max = ra->ra_pages;
+	sequential = (offset - ra->prev_index <= 1UL) || (req_size > max);
+
+	/*
+	 * Lookahead/readahead hit, assume sequential access.
+	 * Ramp up sizes, and push forward the readahead window.
+	 */
+	if (offset && (offset == ra->lookahead_index ||
+			offset == ra->readahead_index)) {
+		ra_index = ra->readahead_index;
+		ra_size = get_next_ra_size2(ra, max);
+		la_size = ra_size;
+		goto fill_ra;
+	}
+
+	/*
+	 * Standalone, small read.
+	 * Read as is, and do not pollute the readahead state.
+	 */
+	if (!page && !sequential) {
+		return __do_page_cache_readahead(mapping, filp,
+						offset, req_size, 0);
+	}
+
+	/*
+	 * It may be one of
+	 * 	- first read on start of file
+	 * 	- sequential cache miss
+	 * 	- oversize random read
+	 * Start readahead for it.
+	 */
+	ra_index = offset;
+	ra_size = get_init_ra_size(req_size, max);
+	la_size = ra_size > req_size ? ra_size - req_size : ra_size;
+
+	/*
+	 * Hit on a lookahead page without valid readahead state.
+	 * E.g. interleaved reads.
+	 * Not knowing its readahead pos/size, bet on the minimal possible one.
+	 */
+	if (page) {
+		ra_index++;
+		ra_size = min(4 * ra_size, max);
+	}
+
+fill_ra:
+	ra_set_index(ra, offset, ra_index);
+	ra_set_size(ra, ra_size, la_size);
+
+	return ra_submit(ra, mapping, filp);
+}
+
+/**
+ * page_cache_readahead_ondemand - generic file readahead
+ * @mapping: address_space which holds the pagecache and I/O vectors
+ * @ra: file_ra_state which holds the readahead state
+ * @filp: passed on to ->readpage() and ->readpages()
+ * @page: the page at @offset, or NULL if non-present
+ * @offset: start offset into @mapping, in PAGE_CACHE_SIZE units
+ * @req_size: hint: total size of the read which the caller is performing in
+ *            PAGE_CACHE_SIZE units
+ *
+ * page_cache_readahead_ondemand() is the entry point of readahead logic.
+ * This function should be called when it is time to perform readahead:
+ * 1) @page == NULL
+ *    A cache miss happened, time for synchronous readahead.
+ * 2) @page != NULL && PageReadahead(@page)
+ *    A look-ahead hit occurred, time for asynchronous readahead.
+ */
+unsigned long
+page_cache_readahead_ondemand(struct address_space *mapping,
+				struct file_ra_state *ra, struct file *filp,
+				struct page *page, pgoff_t offset,
+				unsigned long req_size)
+{
+	/* no read-ahead */
+	if (!ra->ra_pages)
+		return 0;
+
+	if (page) {
+		ClearPageReadahead(page);
+
+		/*
+		 * Defer asynchronous read-ahead on IO congestion.
+		 */
+		if (bdi_read_congested(mapping->backing_dev_info))
+			return 0;
+	}
+
+	/* do read-ahead */
+	return ondemand_readahead(mapping, ra, filp, page,
+					offset, req_size);
+}
+EXPORT_SYMBOL_GPL(page_cache_readahead_ondemand);
--- linux-2.6.22-rc1-mm1.orig/include/linux/mm.h
+++ linux-2.6.22-rc1-mm1/include/linux/mm.h
@@ -1157,6 +1157,12 @@ int do_page_cache_readahead(struct addre
 			pgoff_t offset, unsigned long nr_to_read);
 int force_page_cache_readahead(struct address_space *mapping, struct file *filp,
 			pgoff_t offset, unsigned long nr_to_read);
+unsigned long page_cache_readahead_ondemand(struct address_space *mapping,
+			  struct file_ra_state *ra,
+			  struct file *filp,
+			  struct page *page,
+			  pgoff_t offset,
+			  unsigned long size);
 unsigned long page_cache_readahead(struct address_space *mapping,
 			  struct file_ra_state *ra,
 			  struct file *filp,

--


* [PATCH 6/9] readahead: convert filemap invocations
       [not found] ` <20070516224818.963553696@mail.ustc.edu.cn>
@ 2007-05-16 22:47   ` Fengguang Wu
  0 siblings, 0 replies; 34+ messages in thread
From: Fengguang Wu @ 2007-05-16 22:47 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, Andi Kleen, Jens Axboe, Oleg Nesterov, Steven Pratt,
	Ram Pai

[-- Attachment #1: readahead-invocations-filemap.patch --]
[-- Type: text/plain, Size: 2767 bytes --]

Convert filemap reads to use on-demand readahead.

The new call scheme is to
- call readahead on a non-cached page
- call readahead on a look-ahead page
- update prev_index when finished with the read request

Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn>
---
 mm/filemap.c |   51 +++++++++++++++++++++++++++++--------------------
 1 file changed, 31 insertions(+), 20 deletions(-)

--- linux-2.6.22-rc1-mm1.orig/mm/filemap.c
+++ linux-2.6.22-rc1-mm1/mm/filemap.c
@@ -916,15 +916,20 @@ void do_generic_mapping_read(struct addr
 		nr = nr - offset;
 
 		cond_resched();
-		if (index == next_index)
-			next_index = page_cache_readahead(mapping, &ra, filp,
-					index, last_index - index);
-
 find_page:
 		page = find_get_page(mapping, index);
-		if (unlikely(page == NULL)) {
-			handle_ra_miss(mapping, &ra, index);
-			goto no_cached_page;
+		if (!page) {
+			page_cache_readahead_ondemand(mapping,
+					&ra, filp, page,
+					index, last_index - index);
+			page = find_get_page(mapping, index);
+			if (unlikely(page == NULL))
+				goto no_cached_page;
+		}
+		if (PageReadahead(page)) {
+			page_cache_readahead_ondemand(mapping,
+					&ra, filp, page,
+					index, last_index - index);
 		}
 		if (!PageUptodate(page))
 			goto page_not_up_to_date;
@@ -1075,6 +1080,7 @@ no_cached_page:
 
 out:
 	*_ra = ra;
+	_ra->prev_index = prev_index;
 
 	*ppos = ((loff_t) index << PAGE_CACHE_SHIFT) + offset;
 	if (cached_page)
@@ -1378,26 +1384,30 @@ struct page *filemap_fault(struct vm_are
 		goto no_cached_page;
 
 	/*
-	 * The readahead code wants to be told about each and every page
-	 * so it can build and shrink its windows appropriately
-	 *
-	 * For sequential accesses, we use the generic readahead logic.
-	 */
-	if (VM_SequentialReadHint(vma))
-		page_cache_readahead(mapping, ra, file, fdata->pgoff, 1);
-
-	/*
 	 * Do we have something in the page cache already?
 	 */
 retry_find:
 	page = find_lock_page(mapping, fdata->pgoff);
+	/*
+	 * For sequential accesses, we use the generic readahead logic.
+	 */
+	if (VM_SequentialReadHint(vma)) {
+		if (!page) {
+			page_cache_readahead_ondemand(mapping, ra, file, page,
+							   fdata->pgoff, 1);
+			page = find_lock_page(mapping, fdata->pgoff);
+			if (!page)
+				goto no_cached_page;
+		}
+		if (PageReadahead(page)) {
+			page_cache_readahead_ondemand(mapping, ra, file, page,
+							   fdata->pgoff, 1);
+		}
+	}
+
 	if (!page) {
 		unsigned long ra_pages;
 
-		if (VM_SequentialReadHint(vma)) {
-			handle_ra_miss(mapping, ra, fdata->pgoff);
-			goto no_cached_page;
-		}
 		ra->mmap_miss++;
 
 		/*
@@ -1450,6 +1460,7 @@ retry_find:
 	 * Found the page and have a reference on it.
 	 */
 	mark_page_accessed(page);
+	ra->prev_index = page->index;
 	return page;
 
 outside_data_content:

--


* [PATCH 7/9] readahead: convert splice invocations
       [not found] ` <20070516224819.131727893@mail.ustc.edu.cn>
@ 2007-05-16 22:47   ` Fengguang Wu
  0 siblings, 0 replies; 34+ messages in thread
From: Fengguang Wu @ 2007-05-16 22:47 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, Andi Kleen, Jens Axboe, Oleg Nesterov, Steven Pratt,
	Ram Pai

[-- Attachment #1: readahead-invocations-splice.patch --]
[-- Type: text/plain, Size: 1598 bytes --]

Convert splice reads to use on-demand readahead.

Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn>
---
 fs/splice.c |   18 +++++++-----------
 1 file changed, 7 insertions(+), 11 deletions(-)

--- linux-2.6.22-rc1-mm1.orig/fs/splice.c
+++ linux-2.6.22-rc1-mm1/fs/splice.c
@@ -289,12 +289,6 @@ __generic_file_splice_read(struct file *
 		nr_pages = PIPE_BUFFERS;
 
 	/*
-	 * Don't try to 2nd guess the read-ahead logic, call into
-	 * page_cache_readahead() like the page cache reads would do.
-	 */
-	page_cache_readahead(mapping, &in->f_ra, in, index, nr_pages);
-
-	/*
 	 * Now fill in the holes:
 	 */
 	error = 0;
@@ -317,11 +311,8 @@ __generic_file_splice_read(struct file *
 		 */
 		page = find_get_page(mapping, index);
 		if (!page) {
-			/*
-			 * Make sure the read-ahead engine is notified
-			 * about this failure.
-			 */
-			handle_ra_miss(mapping, &in->f_ra, index);
+			page_cache_readahead_ondemand(mapping, &in->f_ra, in,
+					NULL, index, nr_pages - spd.nr_pages);
 
 			/*
 			 * page didn't exist, allocate one.
@@ -368,6 +359,10 @@ __generic_file_splice_read(struct file *
 		this_len = min_t(unsigned long, len, PAGE_CACHE_SIZE - loff);
 		page = pages[page_nr];
 
+		if (PageReadahead(page))
+			page_cache_readahead_ondemand(mapping, &in->f_ra, in,
+					page, index, nr_pages - page_nr);
+
 		/*
 		 * If the page isn't uptodate, we may need to start io on it
 		 */
@@ -456,6 +451,7 @@ fill_it:
 	 */
 	while (page_nr < nr_pages)
 		page_cache_release(pages[page_nr++]);
+	in->f_ra.prev_index = index;
 
 	if (spd.nr_pages)
 		return splice_to_pipe(pipe, &spd);

--


* [PATCH 8/9] readahead: convert ext3/ext4 invocations
       [not found] ` <20070516224819.281192313@mail.ustc.edu.cn>
@ 2007-05-16 22:48   ` Fengguang Wu
  2007-05-19 12:19   ` Andi Kleen
  1 sibling, 0 replies; 34+ messages in thread
From: Fengguang Wu @ 2007-05-16 22:48 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, Andi Kleen, Jens Axboe, Oleg Nesterov, Steven Pratt,
	Ram Pai

[-- Attachment #1: readahead-invocations-dir.patch --]
[-- Type: text/plain, Size: 2107 bytes --]

Convert ext3/ext4 dir reads to use on-demand readahead.

Readahead for dirs operates _not_ at the file level, but at the blockdev level.
This makes a difference when the data blocks are not contiguous.
And the read routine is somewhat opaque: there's no handy info about the status
of the current page. So a simplified call scheme is employed: call into
readahead whenever the current page falls outside of the readahead window.
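
For reference, the blockdev page index used below is computed as
b_blocknr >> (PAGE_CACHE_SHIFT - i_blkbits); e.g. (illustrative) with 4k
pages and 1k filesystem blocks the shift is 2, so directory block 40 lands
in blockdev page 10.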

Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn>
---
 fs/ext3/dir.c |   14 ++++++++------
 fs/ext4/dir.c |   14 ++++++++------
 2 files changed, 16 insertions(+), 12 deletions(-)

--- linux-2.6.22-rc1-mm1.orig/fs/ext3/dir.c
+++ linux-2.6.22-rc1-mm1/fs/ext3/dir.c
@@ -136,12 +136,14 @@ static int ext3_readdir(struct file * fi
 		err = ext3_get_blocks_handle(NULL, inode, blk, 1,
 						&map_bh, 0, 0);
 		if (err > 0) {
-			page_cache_readahead(sb->s_bdev->bd_inode->i_mapping,
-				&filp->f_ra,
-				filp,
-				map_bh.b_blocknr >>
-					(PAGE_CACHE_SHIFT - inode->i_blkbits),
-				1);
+			pgoff_t index = map_bh.b_blocknr >>
+					(PAGE_CACHE_SHIFT - inode->i_blkbits);
+			if (!ra_has_index(&filp->f_ra, index))
+				page_cache_readahead_ondemand(
+					sb->s_bdev->bd_inode->i_mapping,
+					&filp->f_ra, filp,
+					NULL, index, 1);
+			filp->f_ra.prev_index = index;
 			bh = ext3_bread(NULL, inode, blk, 0, &err);
 		}
 
--- linux-2.6.22-rc1-mm1.orig/fs/ext4/dir.c
+++ linux-2.6.22-rc1-mm1/fs/ext4/dir.c
@@ -135,12 +135,14 @@ static int ext4_readdir(struct file * fi
 		map_bh.b_state = 0;
 		err = ext4_get_blocks_wrap(NULL, inode, blk, 1, &map_bh, 0, 0);
 		if (err > 0) {
-			page_cache_readahead(sb->s_bdev->bd_inode->i_mapping,
-				&filp->f_ra,
-				filp,
-				map_bh.b_blocknr >>
-					(PAGE_CACHE_SHIFT - inode->i_blkbits),
-				1);
+			pgoff_t index = map_bh.b_blocknr >>
+					(PAGE_CACHE_SHIFT - inode->i_blkbits);
+			if (!ra_has_index(&filp->f_ra, index))
+				page_cache_readahead_ondemand(
+					sb->s_bdev->bd_inode->i_mapping,
+					&filp->f_ra, filp,
+					NULL, index, 1);
+			filp->f_ra.prev_index = index;
 			bh = ext4_bread(NULL, inode, blk, 0, &err);
 		}
 

--


* [PATCH 9/9] readahead: remove the old algorithm
       [not found] ` <20070516224819.420933490@mail.ustc.edu.cn>
@ 2007-05-16 22:48   ` Fengguang Wu
  2007-05-19 12:18   ` Andi Kleen
  1 sibling, 0 replies; 34+ messages in thread
From: Fengguang Wu @ 2007-05-16 22:48 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, Andi Kleen, Jens Axboe, Oleg Nesterov, Steven Pratt,
	Ram Pai

[-- Attachment #1: readahead-cleanup.patch --]
[-- Type: text/plain, Size: 16137 bytes --]

Remove the old readahead algorithm.

Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn>
---
 include/linux/fs.h |   11 -
 include/linux/mm.h |    7 
 mm/readahead.c     |  373 ++-----------------------------------------
 3 files changed, 26 insertions(+), 365 deletions(-)

--- linux-2.6.22-rc1-mm1.orig/include/linux/fs.h
+++ linux-2.6.22-rc1-mm1/include/linux/fs.h
@@ -708,14 +708,6 @@ struct fown_struct {
  *  file_ra_state.la_index    .ra_index   .lookahead_index   .readahead_index
  */
 struct file_ra_state {
-	unsigned long start;		/* Current window */
-	unsigned long size;
-	unsigned long flags;		/* ra flags RA_FLAG_xxx*/
-	unsigned long cache_hit;	/* cache hit count*/
-	unsigned long prev_index;	/* Cache last read() position */
-	unsigned long ahead_start;	/* Ahead window */
-	unsigned long ahead_size;
-
 	pgoff_t la_index;               /* enqueue time */
 	pgoff_t ra_index;               /* begin offset */
 	pgoff_t lookahead_index;        /* time to do next readahead */
@@ -724,10 +716,9 @@ struct file_ra_state {
 	unsigned long ra_pages;		/* Maximum readahead window */
 	unsigned long mmap_hit;		/* Cache hit stat for mmap accesses */
 	unsigned long mmap_miss;	/* Cache miss stat for mmap accesses */
+	unsigned long prev_index;	/* Cache last read() position */
 	unsigned int prev_offset;	/* Offset where last read() ended in a page */
 };
-#define RA_FLAG_MISS 0x01	/* a cache miss occured against this file */
-#define RA_FLAG_INCACHE 0x02	/* file is already in cache */
 
 /*
  * Measuring read-ahead sizes.
--- linux-2.6.22-rc1-mm1.orig/include/linux/mm.h
+++ linux-2.6.22-rc1-mm1/include/linux/mm.h
@@ -1163,13 +1163,6 @@ unsigned long page_cache_readahead_ondem
 			  struct page *page,
 			  pgoff_t offset,
 			  unsigned long size);
-unsigned long page_cache_readahead(struct address_space *mapping,
-			  struct file_ra_state *ra,
-			  struct file *filp,
-			  pgoff_t offset,
-			  unsigned long size);
-void handle_ra_miss(struct address_space *mapping, 
-		    struct file_ra_state *ra, pgoff_t offset);
 unsigned long max_sane_readahead(unsigned long nr);
 
 /* Do stack extension */
--- linux-2.6.22-rc1-mm1.orig/mm/readahead.c
+++ linux-2.6.22-rc1-mm1/mm/readahead.c
@@ -49,82 +49,6 @@ file_ra_state_init(struct file_ra_state 
 }
 EXPORT_SYMBOL_GPL(file_ra_state_init);
 
-/*
- * Return max readahead size for this inode in number-of-pages.
- */
-static inline unsigned long get_max_readahead(struct file_ra_state *ra)
-{
-	return ra->ra_pages;
-}
-
-static inline unsigned long get_min_readahead(struct file_ra_state *ra)
-{
-	return MIN_RA_PAGES;
-}
-
-static inline void reset_ahead_window(struct file_ra_state *ra)
-{
-	/*
-	 * ... but preserve ahead_start + ahead_size value,
-	 * see 'recheck:' label in page_cache_readahead().
-	 * Note: We never use ->ahead_size as rvalue without
-	 * checking ->ahead_start != 0 first.
-	 */
-	ra->ahead_size += ra->ahead_start;
-	ra->ahead_start = 0;
-}
-
-static inline void ra_off(struct file_ra_state *ra)
-{
-	ra->start = 0;
-	ra->flags = 0;
-	ra->size = 0;
-	reset_ahead_window(ra);
-	return;
-}
-
-/*
- * Set the initial window size, round to next power of 2 and square
- * for small size, x 4 for medium, and x 2 for large
- * for 128k (32 page) max ra
- * 1-8 page = 32k initial, > 8 page = 128k initial
- */
-static unsigned long get_init_ra_size(unsigned long size, unsigned long max)
-{
-	unsigned long newsize = roundup_pow_of_two(size);
-
-	if (newsize <= max / 32)
-		newsize = newsize * 4;
-	else if (newsize <= max / 4)
-		newsize = newsize * 2;
-	else
-		newsize = max;
-	return newsize;
-}
-
-/*
- * Set the new window size, this is called only when I/O is to be submitted,
- * not for each call to readahead.  If a cache miss occured, reduce next I/O
- * size, else increase depending on how close to max we are.
- */
-static inline unsigned long get_next_ra_size(struct file_ra_state *ra)
-{
-	unsigned long max = get_max_readahead(ra);
-	unsigned long min = get_min_readahead(ra);
-	unsigned long cur = ra->size;
-	unsigned long newsize;
-
-	if (ra->flags & RA_FLAG_MISS) {
-		ra->flags &= ~RA_FLAG_MISS;
-		newsize = max((cur - 2), min);
-	} else if (cur < max / 16) {
-		newsize = 4 * cur;
-	} else {
-		newsize = 2 * cur;
-	}
-	return min(newsize, max);
-}
-
 #define list_to_page(head) (list_entry((head)->prev, struct page, lru))
 
 /**
@@ -201,66 +125,6 @@ out:
 }
 
 /*
- * Readahead design.
- *
- * The fields in struct file_ra_state represent the most-recently-executed
- * readahead attempt:
- *
- * start:	Page index at which we started the readahead
- * size:	Number of pages in that read
- *              Together, these form the "current window".
- *              Together, start and size represent the `readahead window'.
- * prev_index:  The page which the readahead algorithm most-recently inspected.
- *              It is mainly used to detect sequential file reading.
- *              If page_cache_readahead sees that it is again being called for
- *              a page which it just looked at, it can return immediately without
- *              making any state changes.
- * offset:      Offset in the prev_index where the last read ended - used for
- *              detection of sequential file reading.
- * ahead_start,
- * ahead_size:  Together, these form the "ahead window".
- * ra_pages:	The externally controlled max readahead for this fd.
- *
- * When readahead is in the off state (size == 0), readahead is disabled.
- * In this state, prev_index is used to detect the resumption of sequential I/O.
- *
- * The readahead code manages two windows - the "current" and the "ahead"
- * windows.  The intent is that while the application is walking the pages
- * in the current window, I/O is underway on the ahead window.  When the
- * current window is fully traversed, it is replaced by the ahead window
- * and the ahead window is invalidated.  When this copying happens, the
- * new current window's pages are probably still locked.  So
- * we submit a new batch of I/O immediately, creating a new ahead window.
- *
- * So:
- *
- *   ----|----------------|----------------|-----
- *       ^start           ^start+size
- *                        ^ahead_start     ^ahead_start+ahead_size
- *
- *         ^ When this page is read, we submit I/O for the
- *           ahead window.
- *
- * A `readahead hit' occurs when a read request is made against a page which is
- * the next sequential page. Ahead window calculations are done only when it
- * is time to submit a new IO.  The code ramps up the size agressively at first,
- * but slow down as it approaches max_readhead.
- *
- * Any seek/ramdom IO will result in readahead being turned off.  It will resume
- * at the first sequential access.
- *
- * There is a special-case: if the first page which the application tries to
- * read happens to be the first page of the file, it is assumed that a linear
- * read is about to happen and the window is immediately set to the initial size
- * based on I/O request size and the max_readahead.
- *
- * This function is to be called for every read request, rather than when
- * it is time to perform readahead.  It is called only once for the entire I/O
- * regardless of size unless readahead is unable to start enough I/O to satisfy
- * the request (I/O request > max_readahead).
- */
-
-/*
  * do_page_cache_readahead actually reads a chunk of disk.  It allocates all
  * the pages first, then submits them all for I/O. This avoids the very bad
  * behaviour which would occur if page allocations are causing VM writeback.
@@ -295,7 +159,7 @@ __do_page_cache_readahead(struct address
 	read_lock_irq(&mapping->tree_lock);
 	for (page_idx = 0; page_idx < nr_to_read; page_idx++) {
 		pgoff_t page_offset = offset + page_idx;
-		
+
 		if (page_offset > end_index)
 			break;
 
@@ -361,28 +225,6 @@ int force_page_cache_readahead(struct ad
 }
 
 /*
- * Check how effective readahead is being.  If the amount of started IO is
- * less than expected then the file is partly or fully in pagecache and
- * readahead isn't helping.
- *
- */
-static inline int check_ra_success(struct file_ra_state *ra,
-			unsigned long nr_to_read, unsigned long actual)
-{
-	if (actual == 0) {
-		ra->cache_hit += nr_to_read;
-		if (ra->cache_hit >= VM_MAX_CACHE_HIT) {
-			ra_off(ra);
-			ra->flags |= RA_FLAG_INCACHE;
-			return 0;
-		}
-	} else {
-		ra->cache_hit=0;
-	}
-	return 1;
-}
-
-/*
  * This version skips the IO if the queue is read-congested, and will tell the
  * block layer to abandon the readahead if request allocation would block.
  *
@@ -399,191 +241,6 @@ int do_page_cache_readahead(struct addre
 }
 
 /*
- * Read 'nr_to_read' pages starting at page 'offset'. If the flag 'block'
- * is set wait till the read completes.  Otherwise attempt to read without
- * blocking.
- * Returns 1 meaning 'success' if read is successful without switching off
- * readahead mode. Otherwise return failure.
- */
-static int
-blockable_page_cache_readahead(struct address_space *mapping, struct file *filp,
-			pgoff_t offset, unsigned long nr_to_read,
-			struct file_ra_state *ra, int block)
-{
-	int actual;
-
-	if (!block && bdi_read_congested(mapping->backing_dev_info))
-		return 0;
-
-	actual = __do_page_cache_readahead(mapping, filp, offset, nr_to_read, 0);
-
-	return check_ra_success(ra, nr_to_read, actual);
-}
-
-static int make_ahead_window(struct address_space *mapping, struct file *filp,
-				struct file_ra_state *ra, int force)
-{
-	int block, ret;
-
-	ra->ahead_size = get_next_ra_size(ra);
-	ra->ahead_start = ra->start + ra->size;
-
-	block = force || (ra->prev_index >= ra->ahead_start);
-	ret = blockable_page_cache_readahead(mapping, filp,
-			ra->ahead_start, ra->ahead_size, ra, block);
-
-	if (!ret && !force) {
-		/* A read failure in blocking mode, implies pages are
-		 * all cached. So we can safely assume we have taken
-		 * care of all the pages requested in this call.
-		 * A read failure in non-blocking mode, implies we are
-		 * reading more pages than requested in this call.  So
-		 * we safely assume we have taken care of all the pages
-		 * requested in this call.
-		 *
-		 * Just reset the ahead window in case we failed due to
-		 * congestion.  The ahead window will any way be closed
-		 * in case we failed due to excessive page cache hits.
-		 */
-		reset_ahead_window(ra);
-	}
-
-	return ret;
-}
-
-/**
- * page_cache_readahead - generic adaptive readahead
- * @mapping: address_space which holds the pagecache and I/O vectors
- * @ra: file_ra_state which holds the readahead state
- * @filp: passed on to ->readpage() and ->readpages()
- * @offset: start offset into @mapping, in PAGE_CACHE_SIZE units
- * @req_size: hint: total size of the read which the caller is performing in
- *            PAGE_CACHE_SIZE units
- *
- * page_cache_readahead() is the main function.  It performs the adaptive
- * readahead window size management and submits the readahead I/O.
- *
- * Note that @filp is purely used for passing on to the ->readpage[s]()
- * handler: it may refer to a different file from @mapping (so we may not use
- * @filp->f_mapping or @filp->f_path.dentry->d_inode here).
- * Also, @ra may not be equal to &@filp->f_ra.
- *
- */
-unsigned long
-page_cache_readahead(struct address_space *mapping, struct file_ra_state *ra,
-		     struct file *filp, pgoff_t offset, unsigned long req_size)
-{
-	unsigned long max, newsize;
-	int sequential;
-
-	/*
-	 * We avoid doing extra work and bogusly perturbing the readahead
-	 * window expansion logic.
-	 */
-	if (offset == ra->prev_index && --req_size)
-		++offset;
-
-	/* Note that prev_index == -1 if it is a first read */
-	sequential = (offset == ra->prev_index + 1);
-	ra->prev_index = offset;
-	ra->prev_offset = 0;
-
-	max = get_max_readahead(ra);
-	newsize = min(req_size, max);
-
-	/* No readahead or sub-page sized read or file already in cache */
-	if (newsize == 0 || (ra->flags & RA_FLAG_INCACHE))
-		goto out;
-
-	ra->prev_index += newsize - 1;
-
-	/*
-	 * Special case - first read at start of file. We'll assume it's
-	 * a whole-file read and grow the window fast.  Or detect first
-	 * sequential access
-	 */
-	if (sequential && ra->size == 0) {
-		ra->size = get_init_ra_size(newsize, max);
-		ra->start = offset;
-		if (!blockable_page_cache_readahead(mapping, filp, offset,
-							 ra->size, ra, 1))
-			goto out;
-
-		/*
-		 * If the request size is larger than our max readahead, we
-		 * at least want to be sure that we get 2 IOs in flight and
-		 * we know that we will definitly need the new I/O.
-		 * once we do this, subsequent calls should be able to overlap
-		 * IOs,* thus preventing stalls. so issue the ahead window
-		 * immediately.
-		 */
-		if (req_size >= max)
-			make_ahead_window(mapping, filp, ra, 1);
-
-		goto out;
-	}
-
-	/*
-	 * Now handle the random case:
-	 * partial page reads and first access were handled above,
-	 * so this must be the next page otherwise it is random
-	 */
-	if (!sequential) {
-		ra_off(ra);
-		blockable_page_cache_readahead(mapping, filp, offset,
-				 newsize, ra, 1);
-		goto out;
-	}
-
-	/*
-	 * If we get here we are doing sequential IO and this was not the first
-	 * occurence (ie we have an existing window)
-	 */
-	if (ra->ahead_start == 0) {	 /* no ahead window yet */
-		if (!make_ahead_window(mapping, filp, ra, 0))
-			goto recheck;
-	}
-
-	/*
-	 * Already have an ahead window, check if we crossed into it.
-	 * If so, shift windows and issue a new ahead window.
-	 * Only return the #pages that are in the current window, so that
-	 * we get called back on the first page of the ahead window which
-	 * will allow us to submit more IO.
-	 */
-	if (ra->prev_index >= ra->ahead_start) {
-		ra->start = ra->ahead_start;
-		ra->size = ra->ahead_size;
-		make_ahead_window(mapping, filp, ra, 0);
-recheck:
-		/* prev_index shouldn't overrun the ahead window */
-		ra->prev_index = min(ra->prev_index,
-			ra->ahead_start + ra->ahead_size - 1);
-	}
-
-out:
-	return ra->prev_index + 1;
-}
-EXPORT_SYMBOL_GPL(page_cache_readahead);
-
-/*
- * handle_ra_miss() is called when it is known that a page which should have
- * been present in the pagecache (we just did some readahead there) was in fact
- * not found.  This will happen if it was evicted by the VM (readahead
- * thrashing)
- *
- * Turn on the cache miss flag in the RA struct, this will cause the RA code
- * to reduce the RA size on the next read.
- */
-void handle_ra_miss(struct address_space *mapping,
-		struct file_ra_state *ra, pgoff_t offset)
-{
-	ra->flags |= RA_FLAG_MISS;
-	ra->flags &= ~RA_FLAG_INCACHE;
-	ra->cache_hit = 0;
-}
-
-/*
  * Given a desired number of PAGE_CACHE_SIZE readahead pages, return a
  * sensible upper limit.
  */
@@ -613,19 +270,39 @@ unsigned long ra_submit(struct file_ra_s
 EXPORT_SYMBOL_GPL(ra_submit);
 
 /*
+ * Set the initial window size, round to next power of 2 and square
+ * for small size, x 4 for medium, and x 2 for large
+ * for 128k (32 page) max ra
+ * 1-8 page = 32k initial, > 8 page = 128k initial
+ */
+static unsigned long get_init_ra_size(unsigned long size, unsigned long max)
+{
+	unsigned long newsize = roundup_pow_of_two(size);
+
+	if (newsize <= max / 32)
+		newsize = newsize * 4;
+	else if (newsize <= max / 4)
+		newsize = newsize * 2;
+	else
+		newsize = max;
+
+	return newsize;
+}
+
+/*
  *  Get the previous window size, ramp it up, and
  *  return it as the new window size.
  */
-static unsigned long get_next_ra_size2(struct file_ra_state *ra,
+static unsigned long get_next_ra_size(struct file_ra_state *ra,
 						unsigned long max)
 {
 	unsigned long cur = ra->readahead_index - ra->ra_index;
 	unsigned long newsize;
 
 	if (cur < max / 16)
-		newsize = cur * 4;
+		newsize = 4 * cur;
 	else
-		newsize = cur * 2;
+		newsize = 2 * cur;
 
 	return min(newsize, max);
 }
@@ -701,7 +378,7 @@ ondemand_readahead(struct address_space 
 	if (offset && (offset == ra->lookahead_index ||
 			offset == ra->readahead_index)) {
 		ra_index = ra->readahead_index;
-		ra_size = get_next_ra_size2(ra, max);
+		ra_size = get_next_ra_size(ra, max);
 		la_size = ra_size;
 		goto fill_ra;
 	}

--


* Re: [PATCH 5/9] readahead: on-demand readahead logic
  2007-05-16 22:47   ` [PATCH 5/9] readahead: on-demand readahead logic Fengguang Wu
@ 2007-05-19  6:23     ` Andrew Morton
       [not found]       ` <20070519130202.GB6095@mail.ustc.edu.cn>
  0 siblings, 1 reply; 34+ messages in thread
From: Andrew Morton @ 2007-05-19  6:23 UTC (permalink / raw)
  To: Fengguang Wu
  Cc: linux-kernel, Andi Kleen, Jens Axboe, Oleg Nesterov, Steven Pratt,
	Ram Pai

On Thu, 17 May 2007 06:47:57 +0800 Fengguang Wu <wfg@mail.ustc.edu.cn> wrote:

> This is a minimal readahead algorithm that aims to replace the current one.
> It is more flexible and reliable, while maintaining almost the same behavior
> and performance. It is also fully integrated with adaptive readahead.
> 
> It is designed to be called on demand:
> 	- on a missing page, to do synchronous readahead
> 	- on a lookahead page, to do asynchronous readahead
> 
> In this way it eliminates the awkward workarounds for cache hit/miss,
> readahead thrashing, retried reads, and unaligned reads. It also adopts the
> data structure introduced by adaptive readahead, parameterizes readahead
> pipelining with `lookahead_index', and reduces the current/ahead windows
> to one single window.
> 
> HEURISTICS
> 
> The logic deals with four cases:
> 
> ...
>

That would have to be the best changelog I've ever seen ;) Thanks for
persisting with this.

> sysbench oltp (trans/sec): up to 8% gain

Have you given any thought to identifying workloads which may be worsened
by your changes?  Attempt to deliberately expose any weak spots?


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 1/9] readahead: introduce PG_readahead
  2007-05-16 22:47   ` [PATCH 1/9] readahead: introduce PG_readahead Fengguang Wu
@ 2007-05-19  6:28     ` Andrew Morton
  2007-05-19 11:35       ` Andi Kleen
       [not found]       ` <20070519123031.GA6095@mail.ustc.edu.cn>
  0 siblings, 2 replies; 34+ messages in thread
From: Andrew Morton @ 2007-05-19  6:28 UTC (permalink / raw)
  To: Fengguang Wu
  Cc: linux-kernel, Andi Kleen, Jens Axboe, Oleg Nesterov, Steven Pratt,
	Ram Pai

On Thu, 17 May 2007 06:47:53 +0800 Fengguang Wu <wfg@mail.ustc.edu.cn> wrote:

> Introduce a new page flag: PG_readahead.

Is there any way in which we can avoid adding a new page flag?

We have the advantage that if the kernel very occasionally gets the wrong
result for PageReadahead(page), nothing particularly bad will happen, so we
can do racy things.

From a quick peek, it appears that PG_readahead is only ever set against
non-uptodate pages.  If true we could perhaps exploit that: say,
PageReadahead(page) == PG_referenced && !PG_uptodate?

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 1/9] readahead: introduce PG_readahead
  2007-05-19  6:28     ` Andrew Morton
@ 2007-05-19 11:35       ` Andi Kleen
  2007-05-19 15:19         ` Andrew Morton
       [not found]       ` <20070519123031.GA6095@mail.ustc.edu.cn>
  1 sibling, 1 reply; 34+ messages in thread
From: Andi Kleen @ 2007-05-19 11:35 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Fengguang Wu, linux-kernel, Andi Kleen, Jens Axboe, Oleg Nesterov,
	Steven Pratt, Ram Pai

> We have the advantage that if the kernel very occasionally gets the wrong
> result for PageReadahead(page), nothing particularly bad will happen, so we
> can do racy things.
 
On 64bit there is no particular shortage of page flags. If you ever do
racy things please do them 32bit only.

-Andi


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 9/9] readahead: remove the old algorithm
       [not found] ` <20070516224819.420933490@mail.ustc.edu.cn>
  2007-05-16 22:48   ` [PATCH 9/9] readahead: remove the old algorithm Fengguang Wu
@ 2007-05-19 12:18   ` Andi Kleen
       [not found]     ` <20070519131713.GA6510@mail.ustc.edu.cn>
  1 sibling, 1 reply; 34+ messages in thread
From: Andi Kleen @ 2007-05-19 12:18 UTC (permalink / raw)
  To: Fengguang Wu
  Cc: Andrew Morton, linux-kernel, Andi Kleen, Jens Axboe,
	Oleg Nesterov, Steven Pratt, Ram Pai

On Thu, May 17, 2007 at 06:48:01AM +0800, Fengguang Wu wrote:
> Remove the old readahead algorithm.

FWIW - I read the complete patchkit. While I'm not a particular expert
on that code, I didn't see anything I didn't like.

-Andi

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 8/9] readahead: convert ext3/ext4 invocations
       [not found] ` <20070516224819.281192313@mail.ustc.edu.cn>
  2007-05-16 22:48   ` [PATCH 8/9] readahead: convert ext3/ext4 invocations Fengguang Wu
@ 2007-05-19 12:19   ` Andi Kleen
  1 sibling, 0 replies; 34+ messages in thread
From: Andi Kleen @ 2007-05-19 12:19 UTC (permalink / raw)
  To: Fengguang Wu
  Cc: Andrew Morton, linux-kernel, Andi Kleen, Jens Axboe,
	Oleg Nesterov, Steven Pratt, Ram Pai

On Thu, May 17, 2007 at 06:48:00AM +0800, Fengguang Wu wrote:
> Convert ext3/ext4 dir reads to use on-demand readahead.
> 
> Readahead for dirs operates _not_ on the file level, but on the blockdev level.
> This makes a difference when the data blocks are not contiguous.
> And the read routine is somewhat opaque: there's no handy info about the status
> of the current page. So a simplified call scheme is employed: call into
> readahead whenever the current page falls outside the readahead window.

ext2 too would be nice. Also when it goes in it might be worth
contacting linux-fsdevel with a quick pointer to this so that other
file systems can possibly benefit too.
-Andi


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 1/9] readahead: introduce PG_readahead
       [not found]       ` <20070519123031.GA6095@mail.ustc.edu.cn>
@ 2007-05-19 12:30         ` Fengguang Wu
  2007-05-19 15:25           ` Andrew Morton
  0 siblings, 1 reply; 34+ messages in thread
From: Fengguang Wu @ 2007-05-19 12:30 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, Andi Kleen, Jens Axboe, Oleg Nesterov, Steven Pratt,
	Ram Pai, linux-ext4@vger.kernel.org

On Fri, May 18, 2007 at 11:28:24PM -0700, Andrew Morton wrote:
> On Thu, 17 May 2007 06:47:53 +0800 Fengguang Wu <wfg@mail.ustc.edu.cn> wrote:
> 
> > Introduce a new page flag: PG_readahead.
> 
> Is there any way in which we can avoid adding a new page flag?
> 
> We have the advantage that if the kernel very occasionally gets the wrong
> result for PageReadahead(page), nothing particularly bad will happen, so we
> can do racy things.
> 
> From a quick peek, it appears that PG_readahead is only ever set against
> non-uptodate pages.  If true we could perhaps exploit that: say,
> PageReadahead(page) == PG_referenced && !PG_uptodate?

PG_uptodate will flip to 1 before the reader touches the page :(

However, it may be possible to share the same bit with PG_reclaim or PG_booked.
Which one would be preferred?


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 5/9] readahead: on-demand readahead logic
       [not found]       ` <20070519130202.GB6095@mail.ustc.edu.cn>
@ 2007-05-19 13:02         ` Fengguang Wu
  0 siblings, 0 replies; 34+ messages in thread
From: Fengguang Wu @ 2007-05-19 13:02 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, Andi Kleen, Jens Axboe, Oleg Nesterov, Steven Pratt,
	Ram Pai

On Fri, May 18, 2007 at 11:23:35PM -0700, Andrew Morton wrote:
> 
> That would have to be the best changelog I've ever seen ;) Thanks for
> persisting with this.

Thank you :)

> > sysbench oltp (trans/sec): up to 8% gain
> 
> Have you given any thought to identifying workloads which may be worsened
> by your changes?  Attempt to deliberately expose any weak spots?

Yeah. All possible downsides I can imagine are:

- CPU overheads
  Only random reads will be hurt.
  That's a 1% slowdown for _sparse files_, and should be much smaller
  when real I/O is involved.

- Behavior changes
  It does not enforce a strict sequentiality check.
  - it is in general a good behavior for interleaved reads and
    clustered-and-intermixed-random/sequential workloads.
  - it might lead to more readahead misses
    E.g. a random read sequence of 0,1,4,12,28,60,92,124,156,188,220
    that is weird enough to start the readahead and hit all the
    lookahead pages.
    I highly doubt that such patterns happen in the real world. But if
    one ever happens repeatedly for some user, it can easily be worked
    around by tuning readahead_kb to some other value (or per fd, as
    sketched below).

So, it is only a possibility that some random workload may be
worsened. But it's really hard to find a real world example.
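
Independent of the readahead_kb tunable, such a user could also opt out
of readahead on a single descriptor with posix_fadvise() -- a minimal
sketch (the file path is hypothetical):

#define _XOPEN_SOURCE 600
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	int err;
	int fd = open("/some/file", O_RDONLY);	/* hypothetical path */

	if (fd < 0)
		return 1;
	/* POSIX_FADV_RANDOM disables kernel readahead on this fd;
	 * note posix_fadvise() returns an errno value directly */
	err = posix_fadvise(fd, 0, 0, POSIX_FADV_RANDOM);
	if (err)
		fprintf(stderr, "posix_fadvise: %d\n", err);
	close(fd);
	return 0;
}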


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 9/9] readahead: remove the old algorithm
       [not found]     ` <20070519131713.GA6510@mail.ustc.edu.cn>
@ 2007-05-19 13:17       ` Fengguang Wu
  0 siblings, 0 replies; 34+ messages in thread
From: Fengguang Wu @ 2007-05-19 13:17 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Andrew Morton, linux-kernel, Jens Axboe, Oleg Nesterov,
	Steven Pratt, Ram Pai

On Sat, May 19, 2007 at 02:18:55PM +0200, Andi Kleen wrote:
> On Thu, May 17, 2007 at 06:48:01AM +0800, Fengguang Wu wrote:
> > Remove the old readahead algorithm.
> 
> FWIW - I read the complete patchkit. While I'm not a particular expert
> on that code, I didn't see anything I didn't like.

Thank you.

Wu


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 1/9] readahead: introduce PG_readahead
  2007-05-19 11:35       ` Andi Kleen
@ 2007-05-19 15:19         ` Andrew Morton
  0 siblings, 0 replies; 34+ messages in thread
From: Andrew Morton @ 2007-05-19 15:19 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Fengguang Wu, linux-kernel, Jens Axboe, Oleg Nesterov,
	Steven Pratt, Ram Pai

On Sat, 19 May 2007 13:35:01 +0200 Andi Kleen <andi@firstfloor.org> wrote:

> > We have the advantage that if the kernel very occasionally gets the wrong
> > result for PageReadahead(page), nothing particularly bad will happen, so we
> > can do racy things.
>  
> On 64bit there is no particular shortage of page flags.

I think pretty much all of the upper 32 bits got used for ia64 fields,
but it's never been very clear how many were actually used or needed.

> If you ever do racy things please do them 32bit only.

hrm.  It *really* won't matter if we make one suboptimal readahead decision
every second day.  And there's value in having the same behaviour on all
architectures.


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 1/9] readahead: introduce PG_readahead
  2007-05-19 12:30         ` Fengguang Wu
@ 2007-05-19 15:25           ` Andrew Morton
       [not found]             ` <20070520030904.GA9176@mail.ustc.edu.cn>
  0 siblings, 1 reply; 34+ messages in thread
From: Andrew Morton @ 2007-05-19 15:25 UTC (permalink / raw)
  To: Fengguang Wu
  Cc: linux-kernel, Andi Kleen, Jens Axboe, Oleg Nesterov, Steven Pratt,
	Ram Pai, linux-ext4@vger.kernel.org

On Sat, 19 May 2007 20:30:31 +0800 Fengguang Wu <wfg@mail.ustc.edu.cn> wrote:

> On Fri, May 18, 2007 at 11:28:24PM -0700, Andrew Morton wrote:
> > On Thu, 17 May 2007 06:47:53 +0800 Fengguang Wu <wfg@mail.ustc.edu.cn> wrote:
> > 
> > > Introduce a new page flag: PG_readahead.
> > 
> > Is there any way in which we can avoid adding a new page flag?
> > 
> > We have the advantage that if the kernel very occasionally gets the wrong
> > result for PageReadahead(page), nothing particularly bad will happen, so we
> > can do racy things.
> > 
> > From a quick peek, it appears that PG_readahead is only ever set against
> > non-uptodate pages.  If true we could perhaps exploit that: say,
> > PageReadahead(page) == PG_referenced && !PG_uptodate?
> 
> PG_uptodate will flip to 1 before the reader touches the page :(

OK.

> However, it may be possible to share the same bit with PG_reclaim or PG_booked.
> Which one would be preferred?

I'd like PG_booked to go away too - I don't think we've put that under the
microscope yet.  If it remains then its scope will be "defined by the
filesystem", so readahead shouldn't use it.  PG_reclaim belongs to core VFS
so that's much better.

Let's not do anything ugly, slow or silly in there, but please do take a
look, see if there is an opportunity here.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 1/9] readahead: introduce PG_readahead
       [not found]             ` <20070520030904.GA9176@mail.ustc.edu.cn>
@ 2007-05-20  3:09               ` Fengguang Wu
  2007-05-20  7:10                 ` Christoph Lameter
  0 siblings, 1 reply; 34+ messages in thread
From: Fengguang Wu @ 2007-05-20  3:09 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Christoph Lameter, linux-kernel, Andi Kleen, Jens Axboe,
	Oleg Nesterov, Steven Pratt, Ram Pai, linux-ext4@vger.kernel.org

On Sat, May 19, 2007 at 08:25:04AM -0700, Andrew Morton wrote:
> On Sat, 19 May 2007 20:30:31 +0800 Fengguang Wu <wfg@mail.ustc.edu.cn> wrote:
> 
> > On Fri, May 18, 2007 at 11:28:24PM -0700, Andrew Morton wrote:
> > > On Thu, 17 May 2007 06:47:53 +0800 Fengguang Wu <wfg@mail.ustc.edu.cn> wrote:
> > > 
> > > > Introduce a new page flag: PG_readahead.
> > > 
> > > Is there any way in which we can avoid adding a new page flag?
> > > 
> > > We have the advantage that if the kernel very occasionally gets the wrong
> > > result for PageReadahead(page), nothing particularly bad will happen, so we
> > > can do racy things.
> > > 
> > > From a quick peek, it appears that PG_readahead is only ever set against
> > > non-uptodate pages.  If true we could perhaps exploit that: say,
> > > PageReadahead(page) == PG_referenced && !PG_uptodate?
> > 
> > PG_uptodate will flip to 1 before the reader touches the page :(
> 
> OK.
> 
> > However, it may be possible to share the same bit with PG_reclaim or PG_booked.
> > Which one would be preferred?
> 
> I'd like PG_booked to go away too - I don't think we've put that under the
> microscope yet.  If it remains then its scope will be "defined by the
> filesystem", so readahead shouldn't use it.  PG_reclaim belongs to core VFS
> so that's much better.
> 
> Let's not do anything ugly, slow or silly in there, but please do take a
> look, see if there is an opportunity here.

The reuse code would look like the patch attached below.
It still needs more testing, and would fail if Christoph reuses
PG_reclaim in higher order pagecache in the future.

Fengguang Wu
---

Subject: mm: share PG_readahead and PG_reclaim

Share the same page flag bit for PG_readahead and PG_reclaim.

One is used only on file reads, the other only for emergency writes.
One is used mostly for fresh/young pages, the other for old pages.

Combinations of possible interactions are:

a) clear PG_reclaim => implicit clear of PG_readahead
	it will turn an asynchronous readahead into a synchronous one
	it actually does _good_ for readahead:
		the pages will be reclaimed soon, so it's readahead thrashing!
		in this case, synchronous readahead makes more sense.

b) clear PG_readahead => implicit clear of PG_reclaim
	one (and only one) page will not be reclaimed in time
	it can be avoided by checking PageWriteback(page) in readahead first

c) set PG_reclaim => implicit set of PG_readahead
	will confuse readahead and make it restart the size ramp-up process
	it's a trivial problem, and can mostly be avoided by checking
	PageWriteback(page) first in readahead

d) set PG_readahead => implicit set of PG_reclaim
	PG_readahead will never be set on already cached pages.
	PG_reclaim will always be cleared on dirtying a page.
	so not a problem.

In summary,
	a)   we get better behavior
	b,d) possible interactions can be avoided
	c)   a race condition exists that might affect readahead, but the chance
	     is _really_ low, and the impact on readahead is trivial.

Compound pages also use PG_reclaim, but for now they do not interact with
reclaim/readahead code.

Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn>
---
 include/linux/page-flags.h |    4 +++-
 mm/page-writeback.c        |    1 +
 mm/readahead.c             |    4 ++++
 3 files changed, 8 insertions(+), 1 deletion(-)

--- linux-2.6.22-rc1-mm1.orig/include/linux/page-flags.h
+++ linux-2.6.22-rc1-mm1/include/linux/page-flags.h
@@ -92,7 +92,9 @@
 #define PG_lazyfree		20	/* MADV_FREE potential throwaway */
 
 #define PG_booked		21	/* Has blocks reserved on-disk */
-#define PG_readahead		22	/* Reminder to do async read-ahead */
+
+/* PG_readahead is only used for file reads; PG_reclaim is only for writes */
+#define PG_readahead		PG_reclaim	/* Reminder to do async read-ahead */
 
 /* PG_owner_priv_1 users should have descriptive aliases */
 #define PG_checked		PG_owner_priv_1 /* Used by some filesystems */
--- linux-2.6.22-rc1-mm1.orig/mm/page-writeback.c
+++ linux-2.6.22-rc1-mm1/mm/page-writeback.c
@@ -922,6 +922,7 @@ int clear_page_dirty_for_io(struct page 
 
 	BUG_ON(!PageLocked(page));
 
+	ClearPageReclaim(page);
 	if (mapping && mapping_cap_account_dirty(mapping)) {
 		/*
 		 * Yes, Virginia, this is indeed insane.
--- linux-2.6.22-rc1-mm1.orig/mm/readahead.c
+++ linux-2.6.22-rc1-mm1/mm/readahead.c
@@ -447,6 +447,10 @@ page_cache_readahead_ondemand(struct add
 	if (!ra->ra_pages)
 		return 0;
 
+	/* It's PG_reclaim! */
+	if (PageWriteback(page))
+		return 0;
+
 	if (page) {
 		ClearPageReadahead(page);
 


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 1/9] readahead: introduce PG_readahead
  2007-05-20  3:09               ` Fengguang Wu
@ 2007-05-20  7:10                 ` Christoph Lameter
  0 siblings, 0 replies; 34+ messages in thread
From: Christoph Lameter @ 2007-05-20  7:10 UTC (permalink / raw)
  To: Fengguang Wu
  Cc: Andrew Morton, linux-kernel, Andi Kleen, Jens Axboe,
	Oleg Nesterov, Steven Pratt, Ram Pai, linux-ext4@vger.kernel.org

On Sun, 20 May 2007, Fengguang Wu wrote:

> The reuse code would look like the attached one.
> It still needs more testing, and would fail if Christoph reuses
> PG_reclaim in higher order pagecache in the future.

Do not worry about that. All higher order pagecache patchsets remove that
sharing because we will then need these pages on the LRU and thus
PG_reclaim cannot be shared anymore.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 1/9] readahead: introduce PG_readahead
       [not found] ` <20070516224818.246719596@mail.ustc.edu.cn>
  2007-05-16 22:47   ` [PATCH 1/9] readahead: introduce PG_readahead Fengguang Wu
@ 2007-06-12  1:04   ` Rusty Russell
       [not found]     ` <20070612025201.GA6447@mail.ustc.edu.cn>
  1 sibling, 1 reply; 34+ messages in thread
From: Rusty Russell @ 2007-06-12  1:04 UTC (permalink / raw)
  To: Fengguang Wu
  Cc: Andrew Morton, linux-kernel, Andi Kleen, Jens Axboe,
	Oleg Nesterov, Steven Pratt, Ram Pai

On Thu, 2007-05-17 at 06:47 +0800, Fengguang Wu wrote:
> plain text document attachment (mm-introduce-pg_readahead.patch)
> Introduce a new page flag: PG_readahead.
> 
> It acts as a look-ahead mark, which tells the page reader:
> Hey, it's time to invoke the read-ahead logic.  For the sake of I/O pipelining,
> don't wait until it runs out of cached pages!

Hi Fengguang!

	I've been reading your patches, and I have some (possibly dumb!)
questions.

For this patch: why set a bit in the page, rather than keep a value
inside the "struct file_ra_state"?

Thanks,
Rusty.



^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 1/9] readahead: introduce PG_readahead
       [not found]     ` <20070612025201.GA6447@mail.ustc.edu.cn>
@ 2007-06-12  2:52       ` Fengguang Wu
  0 siblings, 0 replies; 34+ messages in thread
From: Fengguang Wu @ 2007-06-12  2:52 UTC (permalink / raw)
  To: Rusty Russell
  Cc: Andrew Morton, linux-kernel, Andi Kleen, Jens Axboe,
	Oleg Nesterov, Steven Pratt, Ram Pai

Hi Rusty,

On Tue, Jun 12, 2007 at 11:04:54AM +1000, Rusty Russell wrote:
> On Thu, 2007-05-17 at 06:47 +0800, Fengguang Wu wrote:
> > plain text document attachment (mm-introduce-pg_readahead.patch)
> > Introduce a new page flag: PG_readahead.
> > 
> > It acts as a look-ahead mark, which tells the page reader:
> > Hey, it's time to invoke the read-ahead logic.  For the sake of I/O pipelining,
> > don't wait until it runs out of cached pages!
> 
> Hi Fengguang!
> 
> 	I've been reading your patches, and I have some (possibly dumb!)
> questions.
> 
> For this patch: why set a bit in the page, rather than keep a value
> inside the "struct file_ra_state"?

Good question. I should have documented it in the patch ;)

The short answer:
there can be multiple read streams per fd, i.e. interleaved reads.

file_ra_state cannot easily track all of the streams. Solaris zfs
does it by managing a list in struct dnode, while PG_readahead plus
the context based readahead (http://lkml.org/lkml/2006/11/15/54) makes
for another solution. The two solutions are comparable in complexity.
The context based readahead is a bit more feature-rich: it is sensitive
to memory pressure, and supports an unlimited number of streams. Note
that random reads can be regarded as a huge number of one-shot streams,
which can interfere badly with zfs's list-based readahead state.
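
A minimal userspace sketch of such an interleaved pattern (the path is
hypothetical): one fd, hence one file_ra_state, but two sequential
streams advancing in lockstep:

#define _XOPEN_SOURCE 600
#include <sys/types.h>
#include <fcntl.h>
#include <unistd.h>

int main(void)
{
	char buf[4096];
	int i;
	int fd = open("/some/file", O_RDONLY);	/* hypothetical path */
	off_t a = 0, b = 1 << 20;	/* stream A at 0, stream B at 1M */

	if (fd < 0)
		return 1;
	for (i = 0; i < 64; i++) {
		pread(fd, buf, sizeof(buf), a);	/* stream A: sequential */
		pread(fd, buf, sizeof(buf), b);	/* stream B: sequential */
		a += sizeof(buf);
		b += sizeof(buf);
	}
	close(fd);
	return 0;
}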

Fengguang


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 4/9] readahead: data structure and routines
       [not found] ` <20070516224818.683288460@mail.ustc.edu.cn>
  2007-05-16 22:47   ` [PATCH 4/9] readahead: data structure and routines Fengguang Wu
@ 2007-06-12  3:30   ` Rusty Russell
       [not found]     ` <20070612120706.GB9624@mail.ustc.edu.cn>
  1 sibling, 1 reply; 34+ messages in thread
From: Rusty Russell @ 2007-06-12  3:30 UTC (permalink / raw)
  To: Fengguang Wu
  Cc: Andrew Morton, linux-kernel, Andi Kleen, Jens Axboe,
	Oleg Nesterov, Steven Pratt, Ram Pai

On Thu, 2007-05-17 at 06:47 +0800, Fengguang Wu wrote: 
> /*
>   * Track a single file's readahead state
> + *
> + *  ================#============|==================#==================|
> + *                  ^            ^                  ^                  ^
> + *  file_ra_state.la_index    .ra_index   .lookahead_index   .readahead_index
>   */
>  struct file_ra_state {
>  	unsigned long start;		/* Current window */
> @@ -711,6 +715,12 @@ struct file_ra_state {
>  	unsigned long prev_index;	/* Cache last read() position */
>  	unsigned long ahead_start;	/* Ahead window */
>  	unsigned long ahead_size;
> +
> +	pgoff_t la_index;               /* enqueue time */
> +	pgoff_t ra_index;               /* begin offset */
> +	pgoff_t lookahead_index;        /* time to do next readahead */
> +	pgoff_t readahead_index;        /* end offset */
> +

Hi Fengguang,

I found these variables a little confusing.  la_index is the last offset
passed to ondemand_readahead, so perhaps "last_request_start" is a
better name?  The comment "enqueue time" seems strange, too.

ra_index seems ok, although "readahead_start" might be better.  Perhaps
readahead_index should be expressed as readahead_size, which is how it
seems to be used.  Perhaps "lookahead_index" is best expressed as a
buffer at the end of the readahead zone (readahead_min?).

ie:
	pgoff_t last_request_start;     /* start of req which triggered readahead */
	pgoff_t readahead_start;        /* Where readahead started */
	pgoff_t readahead_size;         /* PAGE_CACHE_SIZE units of readahead */
	pgoff_t readahead_min;          /* readahead_size left before we recalc */

This gets rid of many of the accessors, I think, and avoids introducing
a new term to understand (lookahead).

> +/*
> + * Where is the old read-ahead and look-ahead?
> + */
> +static inline void ra_set_index(struct file_ra_state *ra,
> +				pgoff_t la_index, pgoff_t ra_index)
> +{
> +	ra->la_index = la_index;
> +	ra->ra_index = ra_index;
> +}
> +
> +/*
> + * Where is the new read-ahead and look-ahead?
> + */
> +static inline void ra_set_size(struct file_ra_state *ra,
> +				unsigned long ra_size, unsigned long la_size)
> +{
> +	ra->readahead_index = ra->ra_index + ra_size;
> +	ra->lookahead_index = ra->ra_index + ra_size - la_size;
> +}

These are only called in one place, so I think it's clearer to do this
there directly.  But I see you exported ra_submit, too, even though it's
only used in the same file.  Are there plans for other users?

Thanks,
Rusty.


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 5/9] readahead: on-demand readahead logic
       [not found] ` <20070516224818.841068730@mail.ustc.edu.cn>
  2007-05-16 22:47   ` [PATCH 5/9] readahead: on-demand readahead logic Fengguang Wu
@ 2007-06-12  4:36   ` Rusty Russell
       [not found]     ` <20070612103518.GA9624@mail.ustc.edu.cn>
  1 sibling, 1 reply; 34+ messages in thread
From: Rusty Russell @ 2007-06-12  4:36 UTC (permalink / raw)
  To: Fengguang Wu
  Cc: Andrew Morton, linux-kernel, Andi Kleen, Jens Axboe,
	Oleg Nesterov, Steven Pratt, Ram Pai

On Thu, 2007-05-17 at 06:47 +0800, Fengguang Wu wrote:
> +static unsigned long
> +ondemand_readahead(struct address_space *mapping,
> +		   struct file_ra_state *ra, struct file *filp,
> +		   struct page *page, pgoff_t offset,
> +		   unsigned long req_size)
> +{
> +	unsigned long max;	/* max readahead pages */
> +	pgoff_t ra_index;	/* readahead index */
> +	unsigned long ra_size;	/* readahead size */
> +	unsigned long la_size;	/* lookahead size */
> +	int sequential;
> +
> +	max = ra->ra_pages;
> +	sequential = (offset - ra->prev_index <= 1UL) || (req_size > max);

Hi again!

This <= 1UL seems weird.  prev_index is end of last request, so I'd
expect offset == prev_index + 1 for sequential reads?  Does offset ==
ra->prev_index happen?  If not, this would be clearer as (offset ==
ra->prev_index + 1).

(prev_index is not a great name either, but that's not your patch 8).

> +	/*
> +	 * Lookahead/readahead hit, assume sequential access.
> +	 * Ramp up sizes, and push forward the readahead window.
> +	 */
> +	if (offset && (offset == ra->lookahead_index ||
> +			offset == ra->readahead_index)) {
> +		ra_index = ra->readahead_index;
> +		ra_size = get_next_ra_size2(ra, max);
> +		la_size = ra_size;
> +		goto fill_ra;
> +	}

Will offset hit lookahead_index or readahead_index exactly?  Should this
be checking the range from offset to offset + req_size?

> +	ra_index = offset;
> +	ra_size = get_init_ra_size(req_size, max);
> +	la_size = ra_size > req_size ? ra_size - req_size : ra_size;

So if we're doing a big sequential read, ra_size < req_size, so next
time offset will be > ra->readahead_index and the "ramp up sizes" code
won't get run?

> +	/*
> +	 * Hit on a lookahead page without valid readahead state.
> +	 * E.g. interleaved reads.
> +	 * Not knowing its readahead pos/size, bet on the minimal possible one.
> +	 */
> +	if (page) {
> +		ra_index++;
> +		ra_size = min(4 * ra_size, max);
> +	}

If I understand correctly, it's expected to happen when we have multiple
streams: we previously marked the lookahead page, but then the other
stream changed the ra to somewhere else in the file.  We now change it
back to our stream, but we've lost information so we make it up.

This seems a little like two functions crammed into one.  Do you think
page_cache_readahead_ondemand() should be split into
"page_cache_readahead()" which doesn't take a page*, and
"page_cache_check_readahead_page()" which is an inline which does the
PageReadahead(page) check as well?

Thanks,
Rusty.


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 5/9] readahead: on-demand readahead logic
       [not found]     ` <20070612103518.GA9624@mail.ustc.edu.cn>
@ 2007-06-12 10:35       ` Fengguang Wu
  2007-06-13  1:40       ` Rusty Russell
  1 sibling, 0 replies; 34+ messages in thread
From: Fengguang Wu @ 2007-06-12 10:35 UTC (permalink / raw)
  To: Rusty Russell
  Cc: Andrew Morton, linux-kernel, Andi Kleen, Jens Axboe,
	Oleg Nesterov, Steven Pratt, Ram Pai

Hi Rusty,

On Tue, Jun 12, 2007 at 02:36:26PM +1000, Rusty Russell wrote:
> On Thu, 2007-05-17 at 06:47 +0800, Fengguang Wu wrote:
> > +static unsigned long
> > +ondemand_readahead(struct address_space *mapping,
> > +		   struct file_ra_state *ra, struct file *filp,
> > +		   struct page *page, pgoff_t offset,
> > +		   unsigned long req_size)
> > +{
> > +	unsigned long max;	/* max readahead pages */
> > +	pgoff_t ra_index;	/* readahead index */
> > +	unsigned long ra_size;	/* readahead size */
> > +	unsigned long la_size;	/* lookahead size */
> > +	int sequential;
> > +
> > +	max = ra->ra_pages;
> > +	sequential = (offset - ra->prev_index <= 1UL) || (req_size > max);
>
> This <= 1UL seems weird.  prev_index is end of last request, so I'd
> expect offset == prev_index + 1 for sequential reads?  Does offset ==
> ra->prev_index happen?  If not, this would be clearer as (offset ==
> ra->prev_index + 1).

It's possible to have (offset == ra->prev_index) when someone is doing
1K reads or 10K reads, which do not always align to page boundaries.
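
For concreteness, a standalone sketch of that arithmetic with 4k pages:

#include <stdio.h>

int main(void)
{
	unsigned long pos;

	/* 1k sequential reads on 4k pages: the page index advances only
	 * every fourth read, so offset - prev_index is 0 three times out
	 * of four and 1 otherwise -- hence the <= 1UL test */
	for (pos = 0; pos < 8192; pos += 1024)
		printf("read at %luk -> page index %lu\n",
		       pos / 1024, pos / 4096);
	return 0;
}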

> (prev_index is not a great name either, but that's not your patch 8).

It was just renamed from `prev_page', hehe.

> > +	/*
> > +	 * Lookahead/readahead hit, assume sequential access.
> > +	 * Ramp up sizes, and push forward the readahead window.
> > +	 */
> > +	if (offset && (offset == ra->lookahead_index ||
> > +			offset == ra->readahead_index)) {
> > +		ra_index = ra->readahead_index;
> > +		ra_size = get_next_ra_size2(ra, max);
> > +		la_size = ra_size;
> > +		goto fill_ra;
> > +	}
>
> Will offset hit lookahead_index or readahead_index exactly?  Should this
> be checking the range from offset to offset + req_size?

Yes, normally lookahead_index will be hit (1). But if readahead was
canceled because of I/O congestion at that time, readahead_index will
be hit later (2).

The readahead code is called on two possible conditions:
(1) page != NULL and PageReadahead(page)
It will be an asynchronous readahead.
In this case, (offset == ra->lookahead_index) indicates sequential
reads that have been associated with a valid readahead window.

(2) page == NULL
It will be a synchronous readahead.
In this case, (offset == ra->readahead_index) indicates sequential
reads that have just consumed all of the readahead pages.

> > +	ra_index = offset;
> > +	ra_size = get_init_ra_size(req_size, max);
> > +	la_size = ra_size > req_size ? ra_size - req_size : ra_size;
>
> So if we're doing a big sequential read, ra_size < req_size, so next
> time offset will be > ra->readahead_index and the "ramp up sizes" code
> won't get run?

For big reads, (ra_size == max) and (max < req_size), so la_size will be
equal to ra_size, i.e. max. After this readahead invocation submits
max pages of I/O and returns to do_generic_mapping_read(), it will
*immediately* be called again because of lookahead hit:

                if (!page) {
                        page_cache_readahead_ondemand(mapping,
                                        &ra, filp, page,
                                        index, last_index - index);
                        page = find_get_page(mapping, index);
                        if (unlikely(page == NULL))
                                goto no_cached_page;
                }
lookahead hit:  if (PageReadahead(page)) {
                        page_cache_readahead_ondemand(mapping,
                                        &ra, filp, page,
                                        index, last_index - index);
                }

Then it will submit another max pages of readahead I/O, no matter which
branch of the size ramp-up code is executed: either the remaining request
size is still > max and get_init_ra_size() is called, or the remaining
request size is <= max and get_next_ra_size() is called; in both cases
they will return max.  This behavior is inherited from the current
readahead, and makes sense.

> > +	/*
> > +	 * Hit on a lookahead page without valid readahead state.
> > +	 * E.g. interleaved reads.
> > +	 * Not knowing its readahead pos/size, bet on the minimal possible one.
> > +	 */
> > +	if (page) {
> > +		ra_index++;
> > +		ra_size = min(4 * ra_size, max);
> > +	}
>
> If I understand correctly, it's expected to happen when we have multiple
> streams: we previously marked the lookahead page, but then the other
> stream changed the ra to somewhere else in the file.  We now change it
> back to our stream, but we've lost information so we make it up.

Yeah, exactly!

> This seems a little like two functions crammed into one.  Do you think
> page_cache_readahead_ondemand() should be split into
> "page_cache_readahead()" which doesn't take a page*, and
> "page_cache_check_readahead_page()" which is an inline which does the
> PageReadahead(page) check as well?

page_cache_check_readahead_page(..., page) is a good idea.
But which part of the code should we put into the page_cache_readahead()
variant that does not take a page param?

Thank you,
Fengguang


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 4/9] readahead: data structure and routines
       [not found]     ` <20070612120706.GB9624@mail.ustc.edu.cn>
@ 2007-06-12 12:07       ` Fengguang Wu
  2007-06-13  0:27       ` Rusty Russell
  1 sibling, 0 replies; 34+ messages in thread
From: Fengguang Wu @ 2007-06-12 12:07 UTC (permalink / raw)
  To: Rusty Russell
  Cc: Andrew Morton, linux-kernel, Andi Kleen, Jens Axboe,
	Oleg Nesterov, Steven Pratt, Ram Pai

Hi Rusty,

On Tue, Jun 12, 2007 at 01:30:50PM +1000, Rusty Russell wrote:
> On Thu, 2007-05-17 at 06:47 +0800, Fengguang Wu wrote: 
> > /*
> >   * Track a single file's readahead state
> > + *
> > + *  ================#============|==================#==================|
> > + *                  ^            ^                  ^                  ^
> > + *  file_ra_state.la_index    .ra_index   .lookahead_index   .readahead_index
> >   */
> >  struct file_ra_state {
> >  	unsigned long start;		/* Current window */
> > @@ -711,6 +715,12 @@ struct file_ra_state {
> >  	unsigned long prev_index;	/* Cache last read() position */
> >  	unsigned long ahead_start;	/* Ahead window */
> >  	unsigned long ahead_size;
> > +
> > +	pgoff_t la_index;               /* enqueue time */
> > +	pgoff_t ra_index;               /* begin offset */
> > +	pgoff_t lookahead_index;        /* time to do next readahead */
> > +	pgoff_t readahead_index;        /* end offset */
> > +
> 
> I found these variables a little confusing.  la_index is the last offset
> passed to ondemand_readahead, so perhaps "last_request_start" is a
> better name?  The comment "enqueue time" seems strange, too.

Yes, they are a bit confusing. Sorry for the bad naming and comments!

The precise meanings can be:
la_index - the time (where we are reading) when the readahead window is established
ra_index - where the readahead window starts
lookahead_index - the time (on reading of which) to push forward the readahead window
readahead_index - where the readahead window ends, or
                  where the next readahead should start with

In the normal case, when the readahead window is pushed forward, the
following holds:
                la_index = lookahead_index; lookahead_index = new-value;
                ra_index = readahead_index; readahead_index = new-value;

> ra_index seems ok, although "readahead_start" might be better.  Perhaps
> readahead_index should be expressed as readahead_size, which is how it
> seems to be used.  Perhaps "lookahead_index" is best expressed as a
> buffer at the end of the readahead zone (readahead_min?).
> 
> ie:
> 	pgoff_t last_request_start;     /* start of req which triggered readahead */
> 	pgoff_t readahead_start;        /* Where readahead started */
> 	pgoff_t readahead_size;         /* PAGE_CACHE_SIZE units of readahead */
> 	pgoff_t readahead_min;          /* readahead_size left before we recalc */
> 
> This gets rid of many of the accessors, I think, and avoids introducing
> a new term to understand (lookahead).

Both indexes and sizes will be used in the code, so conversions may
always be necessary somewhere. If there's a good naming scheme, either
form (index/size based) is OK with me :)

In fact, there are two kinds of windows and one buffer:

                    |---------- readahead window ----------->|
    ===#============|==================#=====================|
       |--- reader walking window ---->|--- async buffer --->|

'lookahead' is not a standard term, while 'readahead_min' may be
confusing to some people.  Anyway, I'd like to propose two more
possible schemes:

	pgoff_t ahead_start;     /* readahead window */
        pgoff_t ahead_end;
	pgoff_t reader_start;    /* on read of which the ahead window was established */
	pgoff_t reader_end;      /* on read of which the ahead window will be pushed forward */

or preferably:

	pgoff_t start;                 /* where readahead started */
        unsigned long size;            /* # of readahead pages */
	unsigned long async_size;      /* do asynchronous readahead when there are only # of pages ahead */

	unsigned long async_size_old;  /* TODO: this one is not needed for now */

Any opinions? Thanks.
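
A standalone sketch of the preferred scheme (the struct and helper
names here are illustrative, not from any patch):

typedef unsigned long pgoff_t;		/* stand-in for the kernel type */

struct ra_window {			/* hypothetical name */
	pgoff_t start;			/* where readahead started */
	unsigned long size;		/* # of readahead pages */
	unsigned long async_size;	/* do async readahead when only
					 * this many pages are left */
};

/* where the next readahead should start, a.k.a. readahead_index */
static pgoff_t ra_window_end(const struct ra_window *ra)
{
	return ra->start + ra->size;
}

/* the page carrying the PG_readahead mark, a.k.a. lookahead_index */
static pgoff_t ra_lookahead_page(const struct ra_window *ra)
{
	return ra->start + ra->size - ra->async_size;
}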

> > +/*
> > + * Where is the old read-ahead and look-ahead?
> > + */
> > +static inline void ra_set_index(struct file_ra_state *ra,
> > +				pgoff_t la_index, pgoff_t ra_index)
> > +{
> > +	ra->la_index = la_index;
> > +	ra->ra_index = ra_index;
> > +}
> > +
> > +/*
> > + * Where is the new read-ahead and look-ahead?
> > + */
> > +static inline void ra_set_size(struct file_ra_state *ra,
> > +				unsigned long ra_size, unsigned long la_size)
> > +{
> > +	ra->readahead_index = ra->ra_index + ra_size;
> > +	ra->lookahead_index = ra->ra_index + ra_size - la_size;
> > +}
> 
> These are only called in one place, so I think it's clearer to do this
> there directly.  But I see you exported ra_submit, too, even though it's
> only used in the same file.  Are there plans for other users?

Yes, if we are to re-introduce the adaptive readahead, the functions
will be reused.

Thank you,
Fengguang


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 4/9] readahead: data structure and routines
       [not found]     ` <20070612120706.GB9624@mail.ustc.edu.cn>
  2007-06-12 12:07       ` Fengguang Wu
@ 2007-06-13  0:27       ` Rusty Russell
       [not found]         ` <20070613030701.GA6494@mail.ustc.edu.cn>
  1 sibling, 1 reply; 34+ messages in thread
From: Rusty Russell @ 2007-06-13  0:27 UTC (permalink / raw)
  To: Fengguang Wu
  Cc: Andrew Morton, linux-kernel, Andi Kleen, Jens Axboe,
	Oleg Nesterov, Steven Pratt, Ram Pai

On Tue, 2007-06-12 at 20:07 +0800, Fengguang Wu wrote:
> Hi Rusty,

Hi Fengguang,

> or preferably:
> 
> 	pgoff_t start;                 /* where readahead started */
>         unsigned long size;            /* # of readahead pages */
> 	unsigned long async_size;      /* do asynchronous readahead when there are only # of pages ahead */
> 
> 	unsigned long async_size_old;  /* TODO: this one is not needed for now */
> 
> Any opinions? Thanks.

These names and comments are really nice.  I think the code will become
more readable after this, too.

Did you want me to try to make this patch, or did you want to do it?

Thanks,
Rusty.



^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 5/9] readahead: on-demand readahead logic
       [not found]     ` <20070612103518.GA9624@mail.ustc.edu.cn>
  2007-06-12 10:35       ` Fengguang Wu
@ 2007-06-13  1:40       ` Rusty Russell
       [not found]         ` <20070613040044.GB6494@mail.ustc.edu.cn>
  1 sibling, 1 reply; 34+ messages in thread
From: Rusty Russell @ 2007-06-13  1:40 UTC (permalink / raw)
  To: Fengguang Wu
  Cc: Andrew Morton, linux-kernel, Andi Kleen, Jens Axboe,
	Oleg Nesterov, Steven Pratt, Ram Pai

On Tue, 2007-06-12 at 18:35 +0800, Fengguang Wu wrote:
> > This seems a little like two functions crammed into one.  Do you think
> > page_cache_readahead_ondemand() should be split into
> > "page_cache_readahead()" which doesn't take a page*, and
> > "page_cache_check_readahead_page()" which is an inline which does the
> > PageReadahead(page) check as well?
> 
> page_cache_check_readahead_page(..., page) is a good idea.
> But which part of the code should we put to page_cache_readahead()
> that does not take a page param?

OK, here's my attempt.  I also made them return void, since none of the
callers seem to mind.  I didn't change the internals much: I think
they're pretty clear and I didn't want to clash if you decided to rename
the "ra" fields.  It compiles and boots.

Thoughts?

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>

diff -r 3c8ae0c37063 fs/ext3/dir.c
--- a/fs/ext3/dir.c	Tue Jun 12 17:17:25 2007 +1000
+++ b/fs/ext3/dir.c	Wed Jun 13 11:17:42 2007 +1000
@@ -139,10 +139,10 @@ static int ext3_readdir(struct file * fi
 			pgoff_t index = map_bh.b_blocknr >>
 					(PAGE_CACHE_SHIFT - inode->i_blkbits);
 			if (!ra_has_index(&filp->f_ra, index))
-				page_cache_readahead_ondemand(
+				page_cache_sync_readahead(
 					sb->s_bdev->bd_inode->i_mapping,
 					&filp->f_ra, filp,
-					NULL, index, 1);
+					index, 1);
 			filp->f_ra.prev_index = index;
 			bh = ext3_bread(NULL, inode, blk, 0, &err);
 		}
diff -r 3c8ae0c37063 fs/ext4/dir.c
--- a/fs/ext4/dir.c	Tue Jun 12 17:17:25 2007 +1000
+++ b/fs/ext4/dir.c	Wed Jun 13 11:19:56 2007 +1000
@@ -138,10 +138,10 @@ static int ext4_readdir(struct file * fi
 			pgoff_t index = map_bh.b_blocknr >>
 					(PAGE_CACHE_SHIFT - inode->i_blkbits);
 			if (!ra_has_index(&filp->f_ra, index))
-				page_cache_readahead_ondemand(
+				page_cache_sync_readahead(
 					sb->s_bdev->bd_inode->i_mapping,
 					&filp->f_ra, filp,
-					NULL, index, 1);
+					index, 1);
 			filp->f_ra.prev_index = index;
 			bh = ext4_bread(NULL, inode, blk, 0, &err);
 		}
diff -r 3c8ae0c37063 fs/splice.c
--- a/fs/splice.c	Tue Jun 12 17:17:25 2007 +1000
+++ b/fs/splice.c	Wed Jun 13 11:18:38 2007 +1000
@@ -312,8 +312,8 @@ __generic_file_splice_read(struct file *
 		 */
 		page = find_get_page(mapping, index);
 		if (!page) {
-			page_cache_readahead_ondemand(mapping, &in->f_ra, in,
-					NULL, index, nr_pages - spd.nr_pages);
+			page_cache_sync_readahead(mapping, &in->f_ra, in,
+					index, nr_pages - spd.nr_pages);
 
 			/*
 			 * page didn't exist, allocate one.
@@ -360,8 +360,7 @@ __generic_file_splice_read(struct file *
 		this_len = min_t(unsigned long, len, PAGE_CACHE_SIZE - loff);
 		page = pages[page_nr];
 
-		if (PageReadahead(page))
-			page_cache_readahead_ondemand(mapping, &in->f_ra, in,
+		page_cache_check_readahead_page(mapping, &in->f_ra, in,
 					page, index, nr_pages - page_nr);
 
 		/*
diff -r 3c8ae0c37063 include/linux/mm.h
--- a/include/linux/mm.h	Tue Jun 12 17:17:25 2007 +1000
+++ b/include/linux/mm.h	Wed Jun 13 11:25:10 2007 +1000
@@ -1146,12 +1146,34 @@ int do_page_cache_readahead(struct addre
 			pgoff_t offset, unsigned long nr_to_read);
 int force_page_cache_readahead(struct address_space *mapping, struct file *filp,
 			pgoff_t offset, unsigned long nr_to_read);
-unsigned long page_cache_readahead_ondemand(struct address_space *mapping,
-			  struct file_ra_state *ra,
-			  struct file *filp,
-			  struct page *page,
-			  pgoff_t offset,
-			  unsigned long size);
+
+void page_cache_sync_readahead(struct address_space *mapping,
+			       struct file_ra_state *ra,
+			       struct file *filp,
+			       pgoff_t offset,
+			       unsigned long size);
+
+void page_cache_async_readahead(struct address_space *mapping,
+				struct file_ra_state *ra,
+				struct file *filp,
+				struct page *pg,
+				pgoff_t offset,
+				unsigned long size);
+
+/* If page has PG_readahead flag set, call async readahead logic. */
+static inline void
+page_cache_check_readahead_page(struct address_space *mapping,
+				struct file_ra_state *ra,
+				struct file *filp,
+				struct page *pg,
+				pgoff_t offset,
+				unsigned long size)
+{
+	if (!PageReadahead(pg))
+		return;
+	page_cache_async_readahead(mapping, ra, filp, pg, offset, size);
+}
+
 unsigned long max_sane_readahead(unsigned long nr);
 
 /* Do stack extension */
diff -r 3c8ae0c37063 mm/filemap.c
--- a/mm/filemap.c	Tue Jun 12 17:17:25 2007 +1000
+++ b/mm/filemap.c	Wed Jun 13 11:22:04 2007 +1000
@@ -913,18 +913,16 @@ find_page:
 find_page:
 		page = find_get_page(mapping, index);
 		if (!page) {
-			page_cache_readahead_ondemand(mapping,
-					&ra, filp, page,
+			page_cache_sync_readahead(mapping,
+					&ra, filp,
 					index, last_index - index);
 			page = find_get_page(mapping, index);
 			if (unlikely(page == NULL))
 				goto no_cached_page;
 		}
-		if (PageReadahead(page)) {
-			page_cache_readahead_ondemand(mapping,
+		page_cache_check_readahead_page(mapping,
 					&ra, filp, page,
 					index, last_index - index);
-		}
 		if (!PageUptodate(page))
 			goto page_not_up_to_date;
 page_ok:
@@ -1382,16 +1380,14 @@ retry_find:
 	 */
 	if (VM_SequentialReadHint(vma)) {
 		if (!page) {
-			page_cache_readahead_ondemand(mapping, ra, file, page,
+			page_cache_sync_readahead(mapping, ra, file,
 							   fdata->pgoff, 1);
 			page = find_lock_page(mapping, fdata->pgoff);
 			if (!page)
 				goto no_cached_page;
 		}
-		if (PageReadahead(page)) {
-			page_cache_readahead_ondemand(mapping, ra, file, page,
+		page_cache_check_readahead_page(mapping, ra, file, page,
 							   fdata->pgoff, 1);
-		}
 	}
 
 	if (!page) {
diff -r 3c8ae0c37063 mm/readahead.c
--- a/mm/readahead.c	Tue Jun 12 17:17:25 2007 +1000
+++ b/mm/readahead.c	Wed Jun 13 11:09:39 2007 +1000
@@ -351,7 +351,7 @@ static unsigned long
 static unsigned long
 ondemand_readahead(struct address_space *mapping,
 		   struct file_ra_state *ra, struct file *filp,
-		   struct page *page, pgoff_t offset,
+		   bool hit_lookahead_marker, pgoff_t offset,
 		   unsigned long req_size)
 {
 	unsigned long max;	/* max readahead pages */
@@ -379,7 +379,7 @@ ondemand_readahead(struct address_space 
 	 * Standalone, small read.
 	 * Read as is, and do not pollute the readahead state.
 	 */
-	if (!page && !sequential) {
+	if (!hit_lookahead_marker && !sequential) {
 		return __do_page_cache_readahead(mapping, filp,
 						offset, req_size, 0);
 	}
@@ -400,7 +400,7 @@ ondemand_readahead(struct address_space 
 	 * E.g. interleaved reads.
 	 * Not knowing its readahead pos/size, bet on the minimal possible one.
 	 */
-	if (page) {
+	if (hit_lookahead_marker) {
 		ra_index++;
 		ra_size = min(4 * ra_size, max);
 	}
@@ -413,50 +413,71 @@ fill_ra:
 }
 
 /**
- * page_cache_readahead_ondemand - generic file readahead
+ * page_cache_sync_readahead - generic file readahead
  * @mapping: address_space which holds the pagecache and I/O vectors
  * @ra: file_ra_state which holds the readahead state
  * @filp: passed on to ->readpage() and ->readpages()
- * @page: the page at @offset, or NULL if non-present
  * @offset: start offset into @mapping, in PAGE_CACHE_SIZE units
  * @req_size: hint: total size of the read which the caller is performing in
  *            PAGE_CACHE_SIZE units
  *
- * page_cache_readahead_ondemand() is the entry point of readahead logic.
- * This function should be called when it is time to perform readahead:
- * 1) @page == NULL
- *    A cache miss happened, time for synchronous readahead.
- * 2) @page != NULL && PageReadahead(@page)
- *    A look-ahead hit occured, time for asynchronous readahead.
- */
-unsigned long
-page_cache_readahead_ondemand(struct address_space *mapping,
-				struct file_ra_state *ra, struct file *filp,
-				struct page *page, pgoff_t offset,
-				unsigned long req_size)
+ * page_cache_sync_readahead() should be called when a cache miss happened:
+ * it will submit the read.  The readahead logic may decide to piggyback more
+ * pages onto the read request if access patterns suggest it will improve
+ * performance.
+ */
+void page_cache_sync_readahead(struct address_space *mapping,
+			       struct file_ra_state *ra, struct file *filp,
+			       pgoff_t offset, unsigned long req_size)
 {
 	/* no read-ahead */
 	if (!ra->ra_pages)
-		return 0;
-
-	if (page) {
-		/*
-		 * It can be PG_reclaim.
-		 */
-		if (PageWriteback(page))
-			return 0;
-
-		ClearPageReadahead(page);
-
-		/*
-		 * Defer asynchronous read-ahead on IO congestion.
-		 */
-		if (bdi_read_congested(mapping->backing_dev_info))
-			return 0;
-	}
+		return;
 
 	/* do read-ahead */
-	return ondemand_readahead(mapping, ra, filp, page,
-					offset, req_size);
-}
-EXPORT_SYMBOL_GPL(page_cache_readahead_ondemand);
+	ondemand_readahead(mapping, ra, filp, false, offset, req_size);
+}
+EXPORT_SYMBOL_GPL(page_cache_sync_readahead);
+
+/**
+ * page_cache_async_readahead - file readahead for marked pages
+ * @mapping: address_space which holds the pagecache and I/O vectors
+ * @ra: file_ra_state which holds the readahead state
+ * @filp: passed on to ->readpage() and ->readpages()
+ * @page: the page at @offset which has the PageReadahead flag set
+ * @offset: start offset into @mapping, in PAGE_CACHE_SIZE units
+ * @req_size: hint: total size of the read which the caller is performing in
+ *            PAGE_CACHE_SIZE units
+ *
+ * page_cache_async_readahead() should be called when a page is used which
+ * has the PageReadahead flag: this is a marker to suggest that the application
+ * has used up enough of the readahead window that we should start pulling in
+ * more pages. */
+void
+page_cache_async_readahead(struct address_space *mapping,
+			   struct file_ra_state *ra, struct file *filp,
+			   struct page *page, pgoff_t offset,
+			   unsigned long req_size)
+{
+	/* no read-ahead */
+	if (!ra->ra_pages)
+		return;
+
+	/*
+	 * Same bit is used for PG_readahead and PG_reclaim.
+	 */
+	if (PageWriteback(page))
+		return;
+
+	ClearPageReadahead(page);
+
+	/*
+	 * Defer asynchronous read-ahead on IO congestion.
+	 */
+	if (bdi_read_congested(mapping->backing_dev_info))
+		return;
+
+	/* do read-ahead */
+	ondemand_readahead(mapping, ra, filp, true, offset, req_size);
+}
+EXPORT_SYMBOL_GPL(page_cache_async_readahead);



^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 4/9] readahead: data structure and routines
       [not found]         ` <20070613030701.GA6494@mail.ustc.edu.cn>
@ 2007-06-13  3:07           ` Fengguang Wu
  0 siblings, 0 replies; 34+ messages in thread
From: Fengguang Wu @ 2007-06-13  3:07 UTC (permalink / raw)
  To: Rusty Russell
  Cc: Andrew Morton, linux-kernel, Andi Kleen, Jens Axboe,
	Oleg Nesterov, Steven Pratt, Ram Pai

Hi Rusty,

On Wed, Jun 13, 2007 at 10:27:19AM +1000, Rusty Russell wrote:
> On Tue, 2007-06-12 at 20:07 +0800, Fengguang Wu wrote:
> > or preferably:
> > 
> > 	pgoff_t start;                 /* where readahead started */
> >         unsigned long size;            /* # of readahead pages */
> > 	unsigned long async_size;      /* do asynchronous readahead when there are only # of pages ahead */
> > 
> > 	unsigned long async_size_old;  /* TODO: this one is not needed for now */
> > 
> > Any opinions? Thanks.
> 
> These names and comments are really nice.  I think the code will become
> more readable after this, too.

Thank you!

> Did you want me to try to make this patch, or did you want to do it?

I'll make it, after your patch, hehe.

Fengguang


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 5/9] readahead: on-demand readahead logic
       [not found]         ` <20070613040044.GB6494@mail.ustc.edu.cn>
@ 2007-06-13  4:00           ` Fengguang Wu
  2007-06-13  5:51           ` Rusty Russell
  1 sibling, 0 replies; 34+ messages in thread
From: Fengguang Wu @ 2007-06-13  4:00 UTC (permalink / raw)
  To: Rusty Russell
  Cc: Andrew Morton, linux-kernel, Andi Kleen, Jens Axboe,
	Oleg Nesterov, Steven Pratt, Ram Pai

On Wed, Jun 13, 2007 at 11:40:33AM +1000, Rusty Russell wrote:
> On Tue, 2007-06-12 at 18:35 +0800, Fengguang Wu wrote:
> > > This seems a little like two functions crammed into one.  Do you think
> > > page_cache_readahead_ondemand() should be split into
> > > "page_cache_readahead()" which doesn't take a page*, and
> > > "page_cache_check_readahead_page()" which is an inline which does the
> > > PageReadahead(page) check as well?
> > 
> > page_cache_check_readahead_page(..., page) is a good idea.
> > But which part of the code should we put to page_cache_readahead()
> > that does not take a page param?
> 
> OK, here's my attempt.  I also made them return void, since none of the
> callers seem to mind.  I didn't change the internals much: I think
> they're pretty clear and I didn't want to clash if you decided to rename
> the "ra" fields.  It compiles and boots.
> 
> Thoughts?

Thank you, comments follow.

> +void page_cache_sync_readahead(struct address_space *mapping,
> +			       struct file_ra_state *ra,
> +			       struct file *filp,
> +			       pgoff_t offset,
> +			       unsigned long size);
> +
> +void page_cache_async_readahead(struct address_space *mapping,
> +				struct file_ra_state *ra,
> +				struct file *filp,
> +				struct page *pg,
> +				pgoff_t offset,
> +				unsigned long size);

Got your idea: it looks like a nice split.

> +/* If page has PG_readahead flag set, call async readahead logic. */
> +static inline void
> +page_cache_check_readahead_page(struct address_space *mapping,
> +				struct file_ra_state *ra,
> +				struct file *filp,
> +				struct page *pg,
> +				pgoff_t offset,
> +				unsigned long size)
> +{
> +	if (!PageReadahead(pg))
> +		return;
> +	page_cache_async_readahead(mapping, ra, filp, pg, offset, size);
> +}

This function might not be necessary?
I guess the explicit code is clear enough.

>  static unsigned long
>  ondemand_readahead(struct address_space *mapping,
>  		   struct file_ra_state *ra, struct file *filp,
> -		   struct page *page, pgoff_t offset,
> +		   bool hit_lookahead_marker, pgoff_t offset,

Or use names like async/is_async for hit_lookahead_marker?
Seems that you have accepted the 'lookahead' term ;)

>  		   unsigned long req_size)
>  {
>  	unsigned long max;	/* max readahead pages */
> @@ -379,7 +379,7 @@ ondemand_readahead(struct address_space 
>  	 * Standalone, small read.
>  	 * Read as is, and do not pollute the readahead state.
>  	 */
> -	if (!page && !sequential) {
> +	if (!hit_lookahead_marker && !sequential) {
>  		return __do_page_cache_readahead(mapping, filp,
>  						offset, req_size, 0);
>  	}
> @@ -400,7 +400,7 @@ ondemand_readahead(struct address_space 
>  	 * E.g. interleaved reads.
>  	 * Not knowing its readahead pos/size, bet on the minimal possible one.
>  	 */
> -	if (page) {
> +	if (hit_lookahead_marker) {
>  		ra_index++;
>  		ra_size = min(4 * ra_size, max);
>  	}

> +/**
> + * page_cache_async_readahead - file readahead for marked pages
> + * @mapping: address_space which holds the pagecache and I/O vectors
> + * @ra: file_ra_state which holds the readahead state
> + * @filp: passed on to ->readpage() and ->readpages()
> + * @page: the page at @offset which has the PageReadahead flag set

                                               ^PG_readahead

> + * @offset: start offset into @mapping, in PAGE_CACHE_SIZE units
> + * @req_size: hint: total size of the read which the caller is performing in
> + *            PAGE_CACHE_SIZE units

                 ^in pagecache pages?
//Christoph is changing the fixed PAGE_CACHE_SIZE to variable ones.

> + *
> + * page_cache_async_readahead() should be called when a page is used which
> + * has the PageReadahead flag: this is a marker to suggest that the application

              ^PG_readahead

> + * has used up enough of the readahead window that we should start pulling in
> + * more pages. */
> +void
> +page_cache_async_readahead(struct address_space *mapping,
> +			   struct file_ra_state *ra, struct file *filp,
> +			   struct page *page, pgoff_t offset,
> +			   unsigned long req_size)
> +{
> +	/* no read-ahead */
> +	if (!ra->ra_pages)
> +		return;
> +
> +	/*
> +	 * Same bit is used for PG_readahead and PG_reclaim.

I like this new comment!

> +	 */
> +	if (PageWriteback(page))
> +		return;
> +
> +	ClearPageReadahead(page);

Thank you,
Fengguang


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 5/9] readahead: on-demand readahead logic
       [not found]         ` <20070613040044.GB6494@mail.ustc.edu.cn>
  2007-06-13  4:00           ` Fengguang Wu
@ 2007-06-13  5:51           ` Rusty Russell
       [not found]             ` <20070613070724.GA6146@mail.ustc.edu.cn>
  1 sibling, 1 reply; 34+ messages in thread
From: Rusty Russell @ 2007-06-13  5:51 UTC (permalink / raw)
  To: Fengguang Wu
  Cc: Andrew Morton, linux-kernel, Andi Kleen, Jens Axboe,
	Oleg Nesterov, Steven Pratt, Ram Pai

On Wed, 2007-06-13 at 12:00 +0800, Fengguang Wu wrote:
> On Wed, Jun 13, 2007 at 11:40:33AM +1000, Rusty Russell wrote:
> > +/* If page has PG_readahead flag set, call async readahead logic. */
> > +static inline void
> > +page_cache_check_readahead_page(struct address_space *mapping,
> > +				struct file_ra_state *ra,
> > +				struct file *filp,
> > +				struct page *pg,
> > +				pgoff_t offset,
> > +				unsigned long size)
> > +{
> > +	if (!PageReadahead(pg))
> > +		return;
> > +	page_cache_async_readahead(mapping, ra, filp, pg, offset, size);
> > +}
> 
> This function might not be necessary?
> I guess the explicit code is clear enough.

Hi Fengguang,

	Yes, I think you're right.

> >  static unsigned long
> >  ondemand_readahead(struct address_space *mapping,
> >  		   struct file_ra_state *ra, struct file *filp,
> > -		   struct page *page, pgoff_t offset,
> > +		   bool hit_lookahead_marker, pgoff_t offset,
> 
> Or use names like async/is_async for hit_lookahead_marker?

I wasn't sure.  The argument says "we've hit a marker, so do readahead
even if it doesn't look sequential".  Now, you and I know that only
happens if it's an async readahead, but that's not really relevant to
this function.

> Seems that you have accepted the 'lookahead' term ;)

Yes, but I shouldn't, because the marker is still called PG_readahead 8)
I changed this to "hit_readahead_marker" instead.

> > +/**
> > + * page_cache_async_readahead - file readahead for marked pages
> > + * @mapping: address_space which holds the pagecache and I/O vectors
> > + * @ra: file_ra_state which holds the readahead state
> > + * @filp: passed on to ->readpage() and ->readpages()
> > + * @page: the page at @offset which has the PageReadahead flag set
> 
>                                                ^PG_readahead

Oh, yes.

> > + * @offset: start offset into @mapping, in PAGE_CACHE_SIZE units
> > + * @req_size: hint: total size of the read which the caller is performing in
> > + *            PAGE_CACHE_SIZE units
> 
>                  ^in pagecache pages?
> //Christoph is changing the fixed PAGE_CACHE_SIZE to a variable size.

This came from your patch, but it sounds like a good change.  I've also
made the same fix to the PAGE_CACHE_SIZE units above.

> I like this new comment!

Heh, maybe that means I finally understand the code 8).

OK, here is revised patch with your changes:

====
Split ondemand readahead interface into two functions.  I think this
makes it a little clearer for non-readahead experts (like Rusty).

Internally they both call ondemand_readahead(), but the page argument
is changed to an obvious boolean flag.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>

diff -r e96f11f61577 fs/ext3/dir.c
--- a/fs/ext3/dir.c	Sun Jun 10 18:25:28 2007 +1000
+++ b/fs/ext3/dir.c	Wed Jun 13 15:16:49 2007 +1000
@@ -139,10 +139,10 @@ static int ext3_readdir(struct file * fi
 			pgoff_t index = map_bh.b_blocknr >>
 					(PAGE_CACHE_SHIFT - inode->i_blkbits);
 			if (!ra_has_index(&filp->f_ra, index))
-				page_cache_readahead_ondemand(
+				page_cache_sync_readahead(
 					sb->s_bdev->bd_inode->i_mapping,
 					&filp->f_ra, filp,
-					NULL, index, 1);
+					index, 1);
 			filp->f_ra.prev_index = index;
 			bh = ext3_bread(NULL, inode, blk, 0, &err);
 		}
diff -r e96f11f61577 fs/ext4/dir.c
--- a/fs/ext4/dir.c	Sun Jun 10 18:25:28 2007 +1000
+++ b/fs/ext4/dir.c	Wed Jun 13 15:16:49 2007 +1000
@@ -138,10 +138,10 @@ static int ext4_readdir(struct file * fi
 			pgoff_t index = map_bh.b_blocknr >>
 					(PAGE_CACHE_SHIFT - inode->i_blkbits);
 			if (!ra_has_index(&filp->f_ra, index))
-				page_cache_readahead_ondemand(
+				page_cache_sync_readahead(
 					sb->s_bdev->bd_inode->i_mapping,
 					&filp->f_ra, filp,
-					NULL, index, 1);
+					index, 1);
 			filp->f_ra.prev_index = index;
 			bh = ext4_bread(NULL, inode, blk, 0, &err);
 		}
diff -r e96f11f61577 fs/splice.c
--- a/fs/splice.c	Sun Jun 10 18:25:28 2007 +1000
+++ b/fs/splice.c	Wed Jun 13 15:16:49 2007 +1000
@@ -312,8 +312,8 @@ __generic_file_splice_read(struct file *
 		 */
 		page = find_get_page(mapping, index);
 		if (!page) {
-			page_cache_readahead_ondemand(mapping, &in->f_ra, in,
-					NULL, index, nr_pages - spd.nr_pages);
+			page_cache_sync_readahead(mapping, &in->f_ra, in,
+					index, nr_pages - spd.nr_pages);
 
 			/*
 			 * page didn't exist, allocate one.
@@ -361,7 +361,7 @@ __generic_file_splice_read(struct file *
 		page = pages[page_nr];
 
 		if (PageReadahead(page))
-			page_cache_readahead_ondemand(mapping, &in->f_ra, in,
+			page_cache_async_readahead(mapping, &in->f_ra, in,
 					page, index, nr_pages - page_nr);
 
 		/*
diff -r e96f11f61577 include/linux/mm.h
--- a/include/linux/mm.h	Sun Jun 10 18:25:28 2007 +1000
+++ b/include/linux/mm.h	Wed Jun 13 15:16:49 2007 +1000
@@ -1146,12 +1146,20 @@ int do_page_cache_readahead(struct addre
 			pgoff_t offset, unsigned long nr_to_read);
 int force_page_cache_readahead(struct address_space *mapping, struct file *filp,
 			pgoff_t offset, unsigned long nr_to_read);
-unsigned long page_cache_readahead_ondemand(struct address_space *mapping,
-			  struct file_ra_state *ra,
-			  struct file *filp,
-			  struct page *page,
-			  pgoff_t offset,
-			  unsigned long size);
+
+void page_cache_sync_readahead(struct address_space *mapping,
+			       struct file_ra_state *ra,
+			       struct file *filp,
+			       pgoff_t offset,
+			       unsigned long size);
+
+void page_cache_async_readahead(struct address_space *mapping,
+				struct file_ra_state *ra,
+				struct file *filp,
+				struct page *pg,
+				pgoff_t offset,
+				unsigned long size);
+
 unsigned long max_sane_readahead(unsigned long nr);
 
 /* Do stack extension */
diff -r e96f11f61577 mm/filemap.c
--- a/mm/filemap.c	Sun Jun 10 18:25:28 2007 +1000
+++ b/mm/filemap.c	Wed Jun 13 15:16:49 2007 +1000
@@ -913,15 +913,15 @@ find_page:
 find_page:
 		page = find_get_page(mapping, index);
 		if (!page) {
-			page_cache_readahead_ondemand(mapping,
-					&ra, filp, page,
+			page_cache_sync_readahead(mapping,
+					&ra, filp,
 					index, last_index - index);
 			page = find_get_page(mapping, index);
 			if (unlikely(page == NULL))
 				goto no_cached_page;
 		}
 		if (PageReadahead(page)) {
-			page_cache_readahead_ondemand(mapping,
+			page_cache_async_readahead(mapping,
 					&ra, filp, page,
 					index, last_index - index);
 		}
@@ -1382,14 +1382,14 @@ retry_find:
 	 */
 	if (VM_SequentialReadHint(vma)) {
 		if (!page) {
-			page_cache_readahead_ondemand(mapping, ra, file, page,
+			page_cache_sync_readahead(mapping, ra, file,
 							   fdata->pgoff, 1);
 			page = find_lock_page(mapping, fdata->pgoff);
 			if (!page)
 				goto no_cached_page;
 		}
 		if (PageReadahead(page)) {
-			page_cache_readahead_ondemand(mapping, ra, file, page,
+			page_cache_async_readahead(mapping, ra, file, page,
 							   fdata->pgoff, 1);
 		}
 	}
diff -r e96f11f61577 mm/readahead.c
--- a/mm/readahead.c	Sun Jun 10 18:25:28 2007 +1000
+++ b/mm/readahead.c	Wed Jun 13 15:16:49 2007 +1000
@@ -351,7 +351,7 @@ static unsigned long
 static unsigned long
 ondemand_readahead(struct address_space *mapping,
 		   struct file_ra_state *ra, struct file *filp,
-		   struct page *page, pgoff_t offset,
+		   bool hit_readahead_marker, pgoff_t offset,
 		   unsigned long req_size)
 {
 	unsigned long max;	/* max readahead pages */
@@ -379,7 +379,7 @@ ondemand_readahead(struct address_space 
 	 * Standalone, small read.
 	 * Read as is, and do not pollute the readahead state.
 	 */
-	if (!page && !sequential) {
+	if (!hit_readahead_marker && !sequential) {
 		return __do_page_cache_readahead(mapping, filp,
 						offset, req_size, 0);
 	}
@@ -400,7 +400,7 @@ ondemand_readahead(struct address_space 
 	 * E.g. interleaved reads.
 	 * Not knowing its readahead pos/size, bet on the minimal possible one.
 	 */
-	if (page) {
+	if (hit_readahead_marker) {
 		ra_index++;
 		ra_size = min(4 * ra_size, max);
 	}
@@ -413,50 +413,71 @@ fill_ra:
 }
 
 /**
- * page_cache_readahead_ondemand - generic file readahead
+ * page_cache_sync_readahead - generic file readahead
  * @mapping: address_space which holds the pagecache and I/O vectors
  * @ra: file_ra_state which holds the readahead state
  * @filp: passed on to ->readpage() and ->readpages()
- * @page: the page at @offset, or NULL if non-present
- * @offset: start offset into @mapping, in PAGE_CACHE_SIZE units
+ * @offset: start offset into @mapping, in pagecache page-sized units
  * @req_size: hint: total size of the read which the caller is performing in
- *            PAGE_CACHE_SIZE units
- *
- * page_cache_readahead_ondemand() is the entry point of readahead logic.
- * This function should be called when it is time to perform readahead:
- * 1) @page == NULL
- *    A cache miss happened, time for synchronous readahead.
- * 2) @page != NULL && PageReadahead(@page)
- *    A look-ahead hit occured, time for asynchronous readahead.
- */
-unsigned long
-page_cache_readahead_ondemand(struct address_space *mapping,
-				struct file_ra_state *ra, struct file *filp,
-				struct page *page, pgoff_t offset,
-				unsigned long req_size)
+ *            pagecache pages
+ *
+ * page_cache_sync_readahead() should be called when a cache miss happened:
+ * it will submit the read.  The readahead logic may decide to piggyback more
+ * pages onto the read request if access patterns suggest it will improve
+ * performance.
+ */
+void page_cache_sync_readahead(struct address_space *mapping,
+			       struct file_ra_state *ra, struct file *filp,
+			       pgoff_t offset, unsigned long req_size)
 {
 	/* no read-ahead */
 	if (!ra->ra_pages)
-		return 0;
-
-	if (page) {
-		/*
-		 * It can be PG_reclaim.
-		 */
-		if (PageWriteback(page))
-			return 0;
-
-		ClearPageReadahead(page);
-
-		/*
-		 * Defer asynchronous read-ahead on IO congestion.
-		 */
-		if (bdi_read_congested(mapping->backing_dev_info))
-			return 0;
-	}
+		return;
 
 	/* do read-ahead */
-	return ondemand_readahead(mapping, ra, filp, page,
-					offset, req_size);
-}
-EXPORT_SYMBOL_GPL(page_cache_readahead_ondemand);
+	ondemand_readahead(mapping, ra, filp, false, offset, req_size);
+}
+EXPORT_SYMBOL_GPL(page_cache_sync_readahead);
+
+/**
+ * page_cache_async_readahead - file readahead for marked pages
+ * @mapping: address_space which holds the pagecache and I/O vectors
+ * @ra: file_ra_state which holds the readahead state
+ * @filp: passed on to ->readpage() and ->readpages()
+ * @page: the page at @offset which has the PG_readahead flag set
+ * @offset: start offset into @mapping, in pagecache page-sized units
+ * @req_size: hint: total size of the read which the caller is performing in
+ *            pagecache pages
+ *
+ * page_cache_async_readahead() should be called when a page is used which
+ * has the PG_readahead flag: this is a marker to suggest that the application
+ * has used up enough of the readahead window that we should start pulling in
+ * more pages. */
+void
+page_cache_async_readahead(struct address_space *mapping,
+			   struct file_ra_state *ra, struct file *filp,
+			   struct page *page, pgoff_t offset,
+			   unsigned long req_size)
+{
+	/* no read-ahead */
+	if (!ra->ra_pages)
+		return;
+
+	/*
+	 * Same bit is used for PG_readahead and PG_reclaim.
+	 */
+	if (PageWriteback(page))
+		return;
+
+	ClearPageReadahead(page);
+
+	/*
+	 * Defer asynchronous read-ahead on IO congestion.
+	 */
+	if (bdi_read_congested(mapping->backing_dev_info))
+		return;
+
+	/* do read-ahead */
+	ondemand_readahead(mapping, ra, filp, true, offset, req_size);
+}
+EXPORT_SYMBOL_GPL(page_cache_async_readahead);



^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 5/9] readahead: on-demand readahead logic
       [not found]             ` <20070613070724.GA6146@mail.ustc.edu.cn>
@ 2007-06-13  7:07               ` Fengguang Wu
  0 siblings, 0 replies; 34+ messages in thread
From: Fengguang Wu @ 2007-06-13  7:07 UTC (permalink / raw)
  To: Rusty Russell
  Cc: Andrew Morton, linux-kernel, Andi Kleen, Jens Axboe,
	Oleg Nesterov, Steven Pratt, Ram Pai

On Wed, Jun 13, 2007 at 03:51:14PM +1000, Rusty Russell wrote:
> > >  static unsigned long
> > >  ondemand_readahead(struct address_space *mapping,
> > >  		   struct file_ra_state *ra, struct file *filp,
> > > -		   struct page *page, pgoff_t offset,
> > > +		   bool hit_lookahead_marker, pgoff_t offset,
> > 
> > Or use names like async/is_async for hit_lookahead_marker?
> 
> I wasn't sure.  The argument says "we've hit a marker, so do readahead
> even if it doesn't look sequential".  Now, you and I know that only
> happens if it's an async readahead, but that's not really relevant to
> this function.

Yeah, it's an interface for the message "hey I hit a readahead marker".
That's the point!
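
I.e. the caller only reports what it observed; roughly (a condensed
sketch of the converted call sites):

	page = find_get_page(mapping, index);
	if (!page)			/* missing page: sync readahead */
		page_cache_sync_readahead(mapping, ra, filp,
					  index, req_size);
	else if (PageReadahead(page))	/* marked page: async readahead */
		page_cache_async_readahead(mapping, ra, filp,
					   page, index, req_size);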

> OK, here is revised patch with your changes:
> 
> ====
> Split ondemand readahead interface into two functions.  I think this
> makes it a little clearer for non-readahead experts (like Rusty).
> 
> Internally they both call ondemand_readahead(), but the page argument
> is changed to an obvious boolean flag.
> 
> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>

Acked-by: Fengguang Wu <wfg@mail.ustc.edu.cn>

Thank you,
Fengguang


^ permalink raw reply	[flat|nested] 34+ messages in thread

end of thread, other threads:[~2007-06-13  7:07 UTC | newest]

Thread overview: 34+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <20070516224752.500812933@mail.ustc.edu.cn>
2007-05-16 22:47 ` [PATCH 0/9] on-demand readahead Fengguang Wu
     [not found] ` <20070516224818.246719596@mail.ustc.edu.cn>
2007-05-16 22:47   ` [PATCH 1/9] readahead: introduce PG_readahead Fengguang Wu
2007-05-19  6:28     ` Andrew Morton
2007-05-19 11:35       ` Andi Kleen
2007-05-19 15:19         ` Andrew Morton
     [not found]       ` <20070519123031.GA6095@mail.ustc.edu.cn>
2007-05-19 12:30         ` Fengguang Wu
2007-05-19 15:25           ` Andrew Morton
     [not found]             ` <20070520030904.GA9176@mail.ustc.edu.cn>
2007-05-20  3:09               ` Fengguang Wu
2007-05-20  7:10                 ` Christoph Lameter
2007-06-12  1:04   ` Rusty Russell
     [not found]     ` <20070612025201.GA6447@mail.ustc.edu.cn>
2007-06-12  2:52       ` Fengguang Wu
     [not found] ` <20070516224818.384876951@mail.ustc.edu.cn>
2007-05-16 22:47   ` [PATCH 2/9] readahead: add look-ahead support to __do_page_cache_readahead() Fengguang Wu
     [not found] ` <20070516224818.544351896@mail.ustc.edu.cn>
2007-05-16 22:47   ` [PATCH 3/9] readahead: MIN_RA_PAGES/MAX_RA_PAGES macros Fengguang Wu
     [not found] ` <20070516224818.841068730@mail.ustc.edu.cn>
2007-05-16 22:47   ` [PATCH 5/9] readahead: on-demand readahead logic Fengguang Wu
2007-05-19  6:23     ` Andrew Morton
     [not found]       ` <20070519130202.GB6095@mail.ustc.edu.cn>
2007-05-19 13:02         ` Fengguang Wu
2007-06-12  4:36   ` Rusty Russell
     [not found]     ` <20070612103518.GA9624@mail.ustc.edu.cn>
2007-06-12 10:35       ` Fengguang Wu
2007-06-13  1:40       ` Rusty Russell
     [not found]         ` <20070613040044.GB6494@mail.ustc.edu.cn>
2007-06-13  4:00           ` Fengguang Wu
2007-06-13  5:51           ` Rusty Russell
     [not found]             ` <20070613070724.GA6146@mail.ustc.edu.cn>
2007-06-13  7:07               ` Fengguang Wu
     [not found] ` <20070516224818.963553696@mail.ustc.edu.cn>
2007-05-16 22:47   ` [PATCH 6/9] readahead: convert filemap invocations Fengguang Wu
     [not found] ` <20070516224819.131727893@mail.ustc.edu.cn>
2007-05-16 22:47   ` [PATCH 7/9] readahead: convert splice invocations Fengguang Wu
     [not found] ` <20070516224819.281192313@mail.ustc.edu.cn>
2007-05-16 22:48   ` [PATCH 8/9] readahead: convert ext3/ext4 invocations Fengguang Wu
2007-05-19 12:19   ` Andi Kleen
     [not found] ` <20070516224819.420933490@mail.ustc.edu.cn>
2007-05-16 22:48   ` [PATCH 9/9] readahead: remove the old algorithm Fengguang Wu
2007-05-19 12:18   ` Andi Kleen
     [not found]     ` <20070519131713.GA6510@mail.ustc.edu.cn>
2007-05-19 13:17       ` Fengguang Wu
     [not found] ` <20070516224818.683288460@mail.ustc.edu.cn>
2007-05-16 22:47   ` [PATCH 4/9] readahead: data structure and routines Fengguang Wu
2007-06-12  3:30   ` Rusty Russell
     [not found]     ` <20070612120706.GB9624@mail.ustc.edu.cn>
2007-06-12 12:07       ` Fengguang Wu
2007-06-13  0:27       ` Rusty Russell
     [not found]         ` <20070613030701.GA6494@mail.ustc.edu.cn>
2007-06-13  3:07           ` Fengguang Wu

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).