public inbox for linux-kernel@vger.kernel.org
* [RFC][PATCH] on-demand readahead
       [not found] <20070425131133.GA26863@mail.ustc.edu.cn>
@ 2007-04-25 13:11 ` Fengguang Wu
  2007-04-25 14:37   ` Andi Kleen
  0 siblings, 1 reply; 9+ messages in thread
From: Fengguang Wu @ 2007-04-25 13:11 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Oleg Nesterov, Steven Pratt, Ram Pai, linux-kernel

Andrew,

This is a minimal readahead algorithm that aims to replace the current one.
It is more flexible and reliable, while maintaining almost the same behavior
and performance. It is also fully integrated with adaptive readahead.

It is designed to be called on demand:
	- on a missing page, to do synchronous readahead
	- on a lookahead page, to do asynchronous readahead

In this way it eliminates the awkward workarounds for cache hit/miss,
readahead thrashing, retried reads, and unaligned reads. It also adopts the
data structure introduced by adaptive readahead, parameterizes readahead
pipelining with `lookahead_index', and reduces the current/ahead windows
to one single window.

The patch is made easy to test out.
Do a
	# echo 2 > /proc/sys/vm/readahead_ratio
and it is selected.
Do a
	# echo 1 > /proc/sys/vm/readahead_ratio
and the vanilla readahead is selected.

Comments and benchmark numbers are welcome, thank you.


HEURISTICS

The logic deals with four cases:

	- sequential-next
		found a consistent readahead window, so push it forward

	- random
		standalone small read, so read as is

	- sequential-first
		create a new readahead window for a sequential/oversize request

	- lookahead-clueless
		hit a lookahead page not associated with the readahead window,
		so create a new readahead window and ramp it up

In each case, three parameters are determined:

	- readahead index: where the next readahead begins
	- readahead size:  how much to readahead
	- lookahead size:  when to do the next readahead (for pipelining)


BEHAVIORS

The old behaviors are maximally preserved for trivial sequential/random reads.
Notable changes are:

	- It no longer imposes strict sequential checks.
	  This might help some interleaved cases and clustered random reads.
	  It does introduce the risk of a random lookahead hit triggering an
	  unexpected readahead, but in general it is more likely to do good
	  than evil.

	- Interleaved reads are supported in a minimal way.
	  Their chances of being detected and properly handled are still low.

	- Readahead thrashing is better handled.
	  The current readahead leads to tiny average I/O sizes, because it
	  never turns back for the thrashed pages.  They have to be faulted
	  in by do_generic_mapping_read() one by one.  Whereas the on-demand
	  readahead will redo readahead for them.


OVERHEADS

The new code reduces the overheads of

	- excessively calling the readahead routine on small sized reads
	  (the current readahead code insists on seeing all requests)

	- doing a lot of pointless page-cache lookups for small cached files
	  (the current readahead only turns itself off after 256 cache hits;
	  unfortunately most files are < 1MB, so they never get that chance)

That accounts for speedups of
	- 0.3% on 1-page sequential reads on a sparse file
	- 1.2% on 1-page cache hot sequential reads
	- 3.2% on 256-page cache hot sequential reads
	- 1.3% on cache hot `tar /lib`

However, it does introduce one extra page-cache lookup per cache miss, which
impacts random reads slightly. That amounts to a 1% overhead for 1-page random
reads on a sparse file.


PERFORMANCE

The basic benchmark setup is
	- 2.6.20 kernel with on-demand readahead
	- 1MB max readahead size
	- 2.9GHz Intel Core 2 CPU
	- 2GB memory
	- 160G/8M Hitachi SATA II 7200 RPM disk

The benchmarks show that
	- it maintains the same performance for trivial sequential/random reads
	- sysbench/OLTP performance on MySQL gains up to 8%
	- performance on readahead thrashing gains up to 3 times


iozone throughput (KB/s): roughly the same
==========================================
iozone -c -t1 -s 4096m -r 64k

			       2.6.20          on-demand      gain
first run
	  "  Initial write "   61437.27        64521.53      +5.0%
	  "        Rewrite "   47893.02        48335.20      +0.9%
	  "           Read "   62111.84        62141.49      +0.0%
	  "        Re-read "   62242.66        62193.17      -0.1%
	  "   Reverse Read "   50031.46        49989.79      -0.1%
	  "    Stride read "    8657.61         8652.81      -0.1%
	  "    Random read "   13914.28        13898.23      -0.1%
	  " Mixed workload "   19069.27        19033.32      -0.2%
	  "   Random write "   14849.80        14104.38      -5.0%
	  "         Pwrite "   62955.30        65701.57      +4.4%
	  "          Pread "   62209.99        62256.26      +0.1%

second run
	  "  Initial write "   60810.31        66258.69      +9.0%
	  "        Rewrite "   49373.89        57833.66     +17.1%
	  "           Read "   62059.39        62251.28      +0.3%
	  "        Re-read "   62264.32        62256.82      -0.0%
	  "   Reverse Read "   49970.96        50565.72      +1.2%
	  "    Stride read "    8654.81         8638.45      -0.2%
	  "    Random read "   13901.44        13949.91      +0.3%
	  " Mixed workload "   19041.32        19092.04      +0.3%
	  "   Random write "   14019.99        14161.72      +1.0%
	  "         Pwrite "   64121.67        68224.17      +6.4%
	  "          Pread "   62225.08        62274.28      +0.1%

In summary, writes are unstable, reads are pretty close on average:

			  access pattern  2.6.20  on-demand   gain
				   Read  62085.61  62196.38  +0.2%
				Re-read  62253.49  62224.99  -0.0%
			   Reverse Read  50001.21  50277.75  +0.6%
			    Stride read   8656.21   8645.63  -0.1%
			    Random read  13907.86  13924.07  +0.1%
			 Mixed workload  19055.29  19062.68  +0.0%
				  Pread  62217.53  62265.27  +0.1%


aio-stress: roughly the same
============================
aio-stress -l -s4096 -r128 -t1 -o1 knoppix511-dvd-cn.iso
aio-stress -l -s4096 -r128 -t1 -o3 knoppix511-dvd-cn.iso

					2.6.20      on-demand  delta
			sequential	 92.57s      92.54s    -0.0%
			random		311.87s     312.15s    +0.1%


sysbench fileio: roughly the same
=================================
sysbench --test=fileio --file-io-mode=async --file-test-mode=rndrw \
	 --file-total-size=4G --file-block-size=64K \
	 --num-threads=001 --max-requests=10000 --max-time=900 run

				threads    2.6.20   on-demand    delta
		first run
				      1   59.1974s    59.2262s  +0.0%
				      2   58.0575s    58.2269s  +0.3%
				      4   48.0545s    47.1164s  -2.0%
				      8   41.0684s    41.2229s  +0.4%
				     16   35.8817s    36.4448s  +1.6%
				     32   32.6614s    32.8240s  +0.5%
				     64   23.7601s    24.1481s  +1.6%
				    128   24.3719s    23.8225s  -2.3%
				    256   23.2366s    22.0488s  -5.1%

		second run
				      1   59.6720s    59.5671s  -0.2%
				      8   41.5158s    41.9541s  +1.1%
				     64   25.0200s    23.9634s  -4.2%
				    256   22.5491s    20.9486s  -7.1%

Note that the numbers are not very stable because of the writes.
The overall performance is close when we sum all seconds up:

                sum all up               495.046s    491.514s   -0.7%


sysbench oltp (trans/sec): up to 8% gain
========================================
sysbench --test=oltp --oltp-table-size=10000000 --oltp-read-only \
	 --mysql-socket=/var/run/mysqld/mysqld.sock \
	 --mysql-user=root --mysql-password=readahead \
	 --num-threads=064 --max-requests=10000 --max-time=900 run

	10000-transactions run
				threads    2.6.20   on-demand    gain
				      1     62.81       64.56   +2.8%
				      2     67.97       70.93   +4.4%
				      4     81.81       85.87   +5.0%
				      8     94.60       97.89   +3.5%
				     16     99.07      104.68   +5.7%
				     32     95.93      104.28   +8.7%
				     64     96.48      103.68   +7.5%
	5000-transactions run
				      1     48.21       48.65   +0.9%
				      8     68.60       70.19   +2.3%
				     64     70.57       74.72   +5.9%
	2000-transactions run
				      1     37.57       38.04   +1.3%
				      2     38.43       38.99   +1.5%
				      4     45.39       46.45   +2.3%
				      8     51.64       52.36   +1.4%
				     16     54.39       55.18   +1.5%
				     32     52.13       54.49   +4.5%
				     64     54.13       54.61   +0.9%

These are interesting results. Some investigation shows that
	- MySQL is accessing the db file non-uniformly: some parts are
	  hotter than others
	- It is mostly doing 4-page random reads, and sometimes doing two
	  reads in a row; the latter one triggers a 16-page readahead.
	- The on-demand readahead leaves many lookahead pages (flagged
	  PG_readahead) there. Many of them will be hit, and trigger
	  more readahead pages, which might save more seeks.
	- Naturally, the readahead windows tend to lie in hot areas,
	  and the lookahead pages in hot areas are more likely to be hit.
	- The higher the overall read density, the greater the possible gain.

That also explains the adaptive readahead tricks for clustered random reads.


readahead thrashing: 3 times better
===================================
We boot the kernel with "mem=128m single", and start a 100KB/s stream every
second, until reaching 200 streams.

			      max throughput     min avg I/O size
		2.6.20:            5MB/s               16KB
		on-demand:        15MB/s              140KB

Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn>
---
 mm/filemap.c   |   11 +++--
 mm/readahead.c |  101 +++++++++++++++++++++++++++++++++++++++++++++--
 2 files changed, 105 insertions(+), 7 deletions(-)

--- linux-2.6.21-rc7-mm1.orig/mm/readahead.c
+++ linux-2.6.21-rc7-mm1/mm/readahead.c
@@ -733,6 +733,11 @@ unsigned long max_sane_readahead(unsigne
 
 #ifdef CONFIG_ADAPTIVE_READAHEAD
 
+static int prefer_ondemand_readahead(void)
+{
+	return readahead_ratio == 2;
+}
+
 /*
  * Move pages in danger (of thrashing) to the head of inactive_list.
  * Not expected to happen frequently.
@@ -1608,6 +1613,92 @@ thrashing_recovery_readahead(struct addr
 	return ra_submit(ra, mapping, filp);
 }
 
+/*
+ *  Get the previous window size, ramp it up, and
+ *  return it as the new window size.
+ */
+static inline unsigned long get_next_ra_size2(struct file_ra_state *ra,
+						unsigned long max)
+{
+	unsigned long cur = ra->readahead_index - ra->ra_index;
+	unsigned long newsize;
+
+        if (cur < max / 16) {
+                newsize = 4 * cur;
+        } else {
+                newsize = 2 * cur;
+        }
+
+	return min(newsize, max);
+}
+
+/*
+ * On-demand readahead.
+ * A minimal readahead algorithm for trivial sequential/random reads.
+ */
+unsigned long
+ondemand_readahead(struct address_space *mapping,
+		   struct file_ra_state *ra, struct file *filp,
+		   struct page *page, pgoff_t offset,
+		   unsigned long req_size, unsigned long max)
+{
+	pgoff_t ra_index;	/* readahead index */
+	unsigned long ra_size;	/* readahead size */
+	unsigned long la_size;	/* lookahead size */
+	int sequential;
+
+	sequential = (offset - ra->prev_page <= 1UL) || (req_size > max);
+
+	/*
+	 * Lookahead/readahead hit, assume sequential access.
+	 * Ramp up sizes, and push forward the readahead window.
+	 */
+	if (offset && (offset == ra->lookahead_index ||
+			offset == ra->readahead_index)) {
+		ra_index = ra->readahead_index;
+		ra_size = get_next_ra_size2(ra, max);
+		la_size = ra_size;
+		goto fill_ra;
+	}
+
+	/*
+	 * Standalone, small read.
+	 * Read as is, and do not pollute the readahead state.
+	 */
+	if (!page && !sequential) {
+		return __do_page_cache_readahead(mapping, filp,
+						offset, req_size, 0);
+	}
+
+	/*
+	 * It may be one of
+	 * 	- first read on start of file
+	 * 	- sequential cache miss
+	 * 	- oversize random read
+	 * Start readahead for it.
+	 */
+	ra_index = offset;
+	ra_size = get_init_ra_size(req_size, max);
+	la_size = ra_size > req_size ? ra_size - req_size : ra_size;
+
+	/*
+	 * Hit on a lookahead page without valid readahead state.
+	 * E.g. interleaved reads.
+	 * Not knowing its readahead pos/size, bet on the minimal possible one.
+	 */
+	if (page) {
+		ra_index++;
+		ra_size = min(4 * ra_size, max);
+	}
+
+fill_ra:
+	ra_set_index(ra, offset, ra_index);
+	ra_set_size(ra, ra_size, la_size);
+	ra_set_class(ra, RA_CLASS_NONE);
+
+	return ra_submit(ra, mapping, filp);
+}
+
 /**
  * page_cache_readahead_adaptive - thrashing safe adaptive read-ahead
  * @mapping, @ra, @filp, @offset, @req_size: the same as page_cache_readahead()
@@ -1675,6 +1766,11 @@ page_cache_readahead_adaptive(struct add
 	if (!page && (ra->flags & RA_FLAG_NFSD))
 		goto readit;
 
+	/* on-demand read-ahead */
+	if (prefer_ondemand_readahead())
+		return ondemand_readahead(mapping, ra, filp, page,
+					  offset, req_size, ra_max);
+
 	/*
 	 * Start of file.
 	 */
@@ -1684,14 +1780,13 @@ page_cache_readahead_adaptive(struct add
 	/*
 	 * Recover from possible thrashing.
 	 */
-	if (!page && offset - ra->prev_index <= 1 && ra_has_index(ra, offset))
+	if (!page && ra_has_index(ra, offset))
 		return thrashing_recovery_readahead(mapping, filp, ra, offset);
 
 	/*
 	 * State based sequential read-ahead.
 	 */
-	if (offset == ra->prev_index + 1 &&
-	    offset == ra->lookahead_index &&
+	if (offset == ra->lookahead_index &&
 					!debug_option(disable_clock_readahead))
 		return clock_based_readahead(mapping, filp, ra, page,
 						offset, req_size, ra_max);
--- linux-2.6.21-rc7-mm1.orig/mm/filemap.c
+++ linux-2.6.21-rc7-mm1/mm/filemap.c
@@ -946,13 +946,16 @@ void do_generic_mapping_read(struct addr
 find_page:
 		page = find_get_page(mapping, index);
 		if (prefer_adaptive_readahead()) {
-			if (!page || PageReadahead(page)) {
-				ra.prev_index = prev_index;
+			if (!page) {
+				page_cache_readahead_adaptive(mapping,
+						&ra, filp, page,
+						index, last_index - index);
+				page = find_get_page(mapping, index);
+			}
+			if (page && PageReadahead(page)) {
 				page_cache_readahead_adaptive(mapping,
 						&ra, filp, page,
 						index, last_index - index);
-				if (!page)
-					page = find_get_page(mapping, index);
 			}
 		}
 		if (unlikely(page == NULL)) {


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [RFC][PATCH] on-demand readahead
  2007-04-25 13:11 ` [RFC][PATCH] on-demand readahead Fengguang Wu
@ 2007-04-25 14:37   ` Andi Kleen
       [not found]     ` <20070425160400.GA27954@mail.ustc.edu.cn>
  0 siblings, 1 reply; 9+ messages in thread
From: Andi Kleen @ 2007-04-25 14:37 UTC (permalink / raw)
  To: Fengguang Wu
  Cc: Andrew Morton, Oleg Nesterov, Steven Pratt, Ram Pai, linux-kernel

Fengguang Wu <fengguang.wu@gmail.com> writes:

> OVERHEADS
> 
> The new code reduced the overheads of
> 
> 	- excessively calling the readahead routine on small sized reads
> 	  (the current readahead code insists on seeing all requests)
> 
> 	- doing a lot of pointless page-cache lookups for small cached files
> 	  (the current readahead only turns itself off after 256 cache hits,
> 	  unfortunately most files are < 1MB, so never see that chance)

Would it make sense to keep track in the AS of whether the file is completely in cache?
Then you could probably avoid a lot of these lookups for small in-cache files.

> --- linux-2.6.21-rc7-mm1.orig/mm/readahead.c
> +++ linux-2.6.21-rc7-mm1/mm/readahead.c
> @@ -733,6 +733,11 @@ unsigned long max_sane_readahead(unsigne

Quite simple patch, why is it that much simpler than your earlier patchkits?
Or is that on top of them?

You seem to have a lot of magic numbers. They probably all need symbols and 
explanations.

Your white space also needs some work.

-Andi


* Re: [RFC][PATCH] on-demand readahead
       [not found]     ` <20070425160400.GA27954@mail.ustc.edu.cn>
@ 2007-04-25 16:04       ` Fengguang Wu
  2007-04-26  6:58         ` Andrew Morton
  2007-04-25 16:08       ` Andi Kleen
  1 sibling, 1 reply; 9+ messages in thread
From: Fengguang Wu @ 2007-04-25 16:04 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Andrew Morton, Oleg Nesterov, Steven Pratt, Ram Pai, linux-kernel

On Wed, Apr 25, 2007 at 04:37:41PM +0200, Andi Kleen wrote:
> Fengguang Wu <fengguang.wu@gmail.com> writes:
> 
> > OVERHEADS
> > 
> > The new code reduced the overheads of
> > 
> > 	- excessively calling the readahead routine on small sized reads
> > 	  (the current readahead code insists on seeing all requests)
> > 
> > 	- doing a lot of pointless page-cache lookups for small cached files
> > 	  (the current readahead only turns itself off after 256 cache hits,
> > 	  unfortunately most files are < 1MB, so never see that chance)
> 
> Would it make sense to keep track in the AS if the file is completely in cache?
> Then you could probably avoid a lot of these lookups for small in cache files

Yeah, the on-demand readahead can avoid _all_ lookups for small in-cache files.
But what do you mean by AS?

> > --- linux-2.6.21-rc7-mm1.orig/mm/readahead.c
> > +++ linux-2.6.21-rc7-mm1/mm/readahead.c
> > @@ -733,6 +733,11 @@ unsigned long max_sane_readahead(unsigne
> 
> Quite simple patch, why is it that much simpler than your earlier patchkits?
> Or is that on top of them?

The earlier ones focus on features, while this one aims to be simple.
It is even simpler than the current readahead, while keeping the same feature set.

The on-demand readahead is now on top of them, but can/will be made
independent.

> You seem to have a lot of magic numbers. They probably all need symbols and 
> explanations.

The magic numbers are for easier testing, and will be removed in the
future.  For now, they enable convenient comparison of the two
algorithms in one kernel.

Once this new algorithm has been further tested and approved, I'll
re-submit the patch in a cleaner, standalone form. The adaptive
readahead patches can be dropped then; they may be better reworked as
a kernel module.

> Your white space also needs some work.

White space in the patch description?
OK, thanks.

Wu



* Re: [RFC][PATCH] on-demand readahead
       [not found]     ` <20070425160400.GA27954@mail.ustc.edu.cn>
  2007-04-25 16:04       ` Fengguang Wu
@ 2007-04-25 16:08       ` Andi Kleen
       [not found]         ` <20070426011655.GA6373@mail.ustc.edu.cn>
  1 sibling, 1 reply; 9+ messages in thread
From: Andi Kleen @ 2007-04-25 16:08 UTC (permalink / raw)
  To: Andi Kleen, Andrew Morton, Oleg Nesterov, Steven Pratt, Ram Pai,
	linux-kernel

> Yeah, the on-demand readahead can avoid _all_ lookups for small in-cache files.

How?

> But what do you mean by AS?

struct address_space

> > You seem to have a lot of magic numbers. They probably all need symbols and 
> > explanations.
> 
> The magic numbers are for easier testings, and will be removed in
> future.  For now, they enables convenient comparing of the two
> algorithms in one kernel.

I mean the 16 and 4, not the sysctl.

> 
> If this new algorithm has been further tested and approved, I'll
> re-submit the patch in a cleaner, standalone form. The adaptive
> readahead patches can be dropped then. They may better be reworked as
> a kernel module.

If they actually help and don't cause regressions they shouldn't be a module,
but integrated eventually. It just has to be done step by step.
> 
> > Your white space also needs some work.
> 
> White space in patch description?

In the code indentation.

-Andi


* Re: [RFC][PATCH] on-demand readahead
       [not found]         ` <20070426011655.GA6373@mail.ustc.edu.cn>
@ 2007-04-26  1:16           ` Fengguang Wu
  2007-05-02 10:02             ` [RFC] splice() and readahead interaction Eric Dumazet
  0 siblings, 1 reply; 9+ messages in thread
From: Fengguang Wu @ 2007-04-26  1:16 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Andrew Morton, Oleg Nesterov, Steven Pratt, Ram Pai, linux-kernel

On Wed, Apr 25, 2007 at 06:08:44PM +0200, Andi Kleen wrote:
> > Yeah, the on-demand readahead can avoid _all_ lookups for small in-cache files.
> 
> How?

In filemap.c:
                if (!page) {
                        page_cache_readahead_adaptive(mapping,
                                        &ra, filp, page,
                                        index, last_index - index);
                        page = find_get_page(mapping, index);
                }
                if (page && PageReadahead(page)) {
                        page_cache_readahead_adaptive(mapping,
                                        &ra, filp, page,
                                        index, last_index - index);
                }
        
Cache-hot files have neither missing pages (!page) nor lookahead
pages (PageReadahead(page)), so it will not even be called.

> > > You seem to have a lot of magic numbers. They probably all need symbols and 
> > > explanations.
> > 
> > The magic numbers are for easier testings, and will be removed in
> > future.  For now, they enables convenient comparing of the two
> > algorithms in one kernel.
> 
> I mean the 16 and 4 not the sysctl

The numbers and the code in get_next_ra_size2() are simply copied from
get_next_ra_size():

        if (cur < max / 16) {
                newsize = 4 * cur;
        } else {
                newsize = 2 * cur;
        }

It's a trick to ramp up small sizes more quickly.
That trick is documented in the related get_init_ra_size().
So, it would be better to put the two routines together to make it clear.

> > 
> > If this new algorithm has been further tested and approved, I'll
> > re-submit the patch in a cleaner, standalone form. The adaptive
> > readahead patches can be dropped then. They may better be reworked as
> > a kernel module.
> 
> If they actually help and don't cause regressions they shouldn't be a module, 
> but integrated eventually Just it has to be all step by step.

Yeah, the adaptive readahead is complex and the possible workloads are
diverse.  It has become obvious that there is a long way to go, and a
kernel module makes life easier.

> > > Your white space also needs some work.
> > 
> > White space in patch description?
> 
> In the code indentation.

Ah, got it: a silly copy/paste mistake.

Thank you,
Wu



* Re: [RFC][PATCH] on-demand readahead
  2007-04-25 16:04       ` Fengguang Wu
@ 2007-04-26  6:58         ` Andrew Morton
  0 siblings, 0 replies; 9+ messages in thread
From: Andrew Morton @ 2007-04-26  6:58 UTC (permalink / raw)
  To: Fengguang Wu
  Cc: Andi Kleen, Andrew Morton, Oleg Nesterov, Steven Pratt, Ram Pai,
	linux-kernel

On Thu, 26 Apr 2007 00:04:00 +0800 Fengguang Wu <fengguang.wu@gmail.com> wrote:

> If this new algorithm has been further tested and approved, I'll
> re-submit the patch in a cleaner, standalone form. The adaptive
> readahead patches can be dropped then. They may better be reworked as
> a kernel module.

That would be appreciated - I just don't know how to move the current patches
forward, and they do get in the way quite regularly.


* [RFC] splice() and readahead interaction
  2007-04-26  1:16           ` Fengguang Wu
@ 2007-05-02 10:02             ` Eric Dumazet
       [not found]               ` <f6b15c890705050204l11045ba3w66c8c4ae0ac3407f@mail.gmail.com>
  0 siblings, 1 reply; 9+ messages in thread
From: Eric Dumazet @ 2007-05-02 10:02 UTC (permalink / raw)
  To: Fengguang Wu
  Cc: Andi Kleen, Andrew Morton, Oleg Nesterov, Steven Pratt, Ram Pai,
	linux-kernel, Ingo Molnar

Hi Wu

Since you work on readahead, could you please find the reason the following program triggers a problem in the splice() syscall?

Description :

I tried to use splice(SPLICE_F_NONBLOCK) in a non-blocking environment, in an attempt to implement cheap AIO and the zero-copy splice() feature.

I quickly found that readahead in splice() is not really working.

To demonstrate the problem, just compile the attached program, and use it to pipe a big file (not yet in cache) to /dev/null :

$ gcc -o spliceout spliceout.c
$ spliceout -d BIGFILE  | cat >/dev/null
offset=49152 ret=49152
offset=65536 ret=16384
offset=131072 ret=65536
...no more progress...   (splice() returns -1 and EAGAIN)

Reading the splice(SPLICE_F_NONBLOCK) syscall implementation, I expected to exploit its ability to call readahead() and make some progress if pages are ready in the cache.

But apparently, even on an idle machine, it is not working as expected.

Thank you

/*
 * Usage :
 *          spliceout [-d] file | some_other_program
 */
#ifndef _LARGEFILE64_SOURCE
# define _LARGEFILE64_SOURCE 1
#endif
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>
#include <stdlib.h>
#include <stdio.h>
#include <sys/poll.h>

#ifndef __do_splice_syscall_h__
#define __do_splice_syscall_h__

#include <sys/syscall.h>
#include <unistd.h>

#if defined(__i386__)

/* From kernel tree include/asm-i386/unistd.h
*/
#ifndef __NR_splice
#define __NR_splice 313
#endif
#ifndef __NR_vmsplice
#define __NR_vmsplice 316
#endif

#elif defined(__x86_64__)

/* From kernel tree include/asm-x86_64/unistd.h
*/
#ifndef __NR_splice
#define __NR_splice 275
#endif
#ifndef __NR_vmsplice
#define __NR_vmsplice 278
#endif

#else
#error unsupported architecture
#endif

/* From kernel tree include/linux/pipe_fs_i.h
*/
#define SPLICE_F_MOVE (0x01) /* move pages instead of copying */
#define SPLICE_F_NONBLOCK (0x02) /* don't block on the pipe splicing (but */
				 /* we may still block on the fd we splice */
				 /* from/to, of course */
#define SPLICE_F_MORE (0x04) /* expect more data */
#define SPLICE_F_GIFT (0x08) /* pages passed in are a gift */

#ifndef SYS_splice
#define SYS_splice __NR_splice
#endif
#ifndef SYS_vmsplice
#define SYS_vmsplice __NR_vmsplice
#endif


static inline
int splice(int fd_in, off64_t *off_in, int fd_out, off64_t *off_out,
	   size_t len, unsigned int flags)
{
	return syscall(SYS_splice, fd_in, off_in, fd_out, off_out, len, flags);
}

struct iovec;

static inline
int vmsplice(int fd, const struct iovec *iov,
	     unsigned long nr_segs, unsigned int flags)
{
	return syscall(SYS_vmsplice, fd, iov, nr_segs, flags);
}


#endif /* __do_splice_syscall_h__ */


void usage(int code)
{
	fprintf(stderr, "Usage : spliceout [-d] file\n");
	exit(code);
}

int main(int argc, char *argv[])
{
	int ret;
	int opt;
	int fd_in;
	int dflg = 0;
	loff_t offset = 0;
	loff_t lastoffset = ~0;
	struct stat st;

	while ((opt = getopt(argc, argv, "d")) != EOF) {
		if (opt == 'd')
			dflg++;
	}
	if (optind == argc)
		usage(1);
	if (fstat(1, &st) == -1)
		usage(1);
	if (!S_ISFIFO(st.st_mode)) {
		fprintf(stderr, "stdout is not a pipe\n");
		exit(1);
	}
	fd_in = open(argv[optind], O_RDONLY);
	if (fd_in == -1) {
		perror(argv[optind]);
		exit(1);
	}
	for (;;) {
		struct pollfd pfd;
		pfd.fd = fd_in;
		pfd.events = POLLIN;
		poll(&pfd, 1, -1); /* just in case we support poll() on this file to avoid a loop */
		ret = splice(fd_in, &offset,
			     1, NULL,
			     16*4096, SPLICE_F_NONBLOCK);
		if (ret == 0)
			break;
		if (dflg && lastoffset != offset) {
			fprintf(stderr, "offset=%lu ret=%d\n", (unsigned long)offset, ret);
			lastoffset = offset;
		}
	}
	return 0;
}



* Re: [RFC] splice() and readahead interaction
       [not found]               ` <f6b15c890705050204l11045ba3w66c8c4ae0ac3407f@mail.gmail.com>
@ 2007-05-07 21:54                 ` Andrew Morton
  2007-05-10 19:53                 ` Eric Dumazet
  1 sibling, 0 replies; 9+ messages in thread
From: Andrew Morton @ 2007-05-07 21:54 UTC (permalink / raw)
  To: Fengguang Wu
  Cc: Eric Dumazet, Andi Kleen, Oleg Nesterov, Steven Pratt, Ram Pai,
	linux-kernel, Ingo Molnar

On Sat, 5 May 2007 05:04:29 -0400
"Fengguang Wu" <fengguang.wu@gmail.com> wrote:

> Readahead logic somehow fails to populate the page range with data.
> It can be because
> 1) the readahead routine is not always called in the following lines of
> fs/splice.c:
>         if (!loff || nr_pages > 1)
>                 page_cache_readahead(mapping, &in->f_ra, in, index,
> nr_pages);
> 2) even called, page_cache_readahead() wont guarantee the pages are there.
> It wont submit readahead I/O for pages already in the radix tree, or when
> (ra_pages == 0), or after 256 cache hits.
> 
> In your case, it should be because of the retried reads, which lead to
> excessive cache hits, and disables readahead at some time.
> 
> And that _one_ failure of readahead blocks the whole read process.
> The application receives EAGAIN and retries the read, but
> __generic_file_splice_read() refuses to make progress:
> - in the previous invocation, it has allocated a blank page and inserted it
> into the radix tree, but never has the chance to start I/O for it: the test
> of SPLICE_F_NONBLOCK goes before that.
> - in the retried invocation, the readahead code will neither get out of the
> cache hit mode, nor will it submit I/O for an already existing page.
> 
> The attached patch should fix the critical splice bug. Sorry for not being
> able to test it locally for now - I'm at home and running knoppix. And the
> readahead bug will be fixed by the upcoming on-demand readahead patch. I
> should be back and submit it after a week.
> 
> Thank you,
> Fengguang Wu
> 
> 
> [splice-nonblock-fix.patch  text/x-patch (506B)]
> --- linux-2.6.21.1/fs/splice.c.old	2007-05-05 04:40:38.000000000 -0400
> +++ linux-2.6.21.1/fs/splice.c	2007-05-05 04:41:59.000000000 -0400
> @@ -378,10 +378,11 @@
>  			 * If in nonblock mode then dont block on waiting
>  			 * for an in-flight io page
>  			 */
> -			if (flags & SPLICE_F_NONBLOCK)
> -				break;
> -
> -			lock_page(page);
> +			if (flags & SPLICE_F_NONBLOCK) {
> +				if (TestSetPageLocked(page))
> +					break;
> +			} else
> +				lock_page(page);
>  
>  			/*
>  			 * page was truncated, stop here. if this isn't the

So.. afaik we're awaiting testing results for this change?



* Re: [RFC] splice() and readahead interaction
       [not found]               ` <f6b15c890705050204l11045ba3w66c8c4ae0ac3407f@mail.gmail.com>
  2007-05-07 21:54                 ` Andrew Morton
@ 2007-05-10 19:53                 ` Eric Dumazet
  1 sibling, 0 replies; 9+ messages in thread
From: Eric Dumazet @ 2007-05-10 19:53 UTC (permalink / raw)
  To: Fengguang Wu
  Cc: Andi Kleen, Andrew Morton, Oleg Nesterov, Steven Pratt, Ram Pai,
	linux-kernel, Ingo Molnar

Fengguang Wu wrote:
> 2007/5/2, Eric Dumazet <dada1@cosmosbay.com <mailto:dada1@cosmosbay.com>>:
> 
>     Since you work on readahead, could you please find the reason
>     following program triggers a problem in splice() syscall ?
> 
>     Description :
> 
>     I tried to use splice(SPLICE_F_NONBLOCK) in a non-blocking
>     environment, in an attempt to implement cheap AIO, and zero-copy
>     splice() feature.
> 
>     I quickly found that readahead in splice() is not really working.
> 
>     To demonstrate the problem, just compile the attached program, and
>     use it to pipe a big file (not yet in cache) to /dev/null :
> 
>     $ gcc -o spliceout spliceout.c
>     $ spliceout -d BIGFILE | cat >/dev/null
>     offset=49152 ret=49152
>     offset=65536 ret=16384
>     offset=131072 ret=65536
>     ...no more progress...   (splice() returns -1 and EAGAIN)
> 
>     reading splice(SPLICE_F_NONBLOCK) syscall implementation, I expected
>     to exploit its ability to call readahead(), and do some progress if
>     pages are ready in cache.
> 
>     But apparently, even on an idle machine, it is not working as expected.
> 
> 
> 
> Eric Dumazet, thank you for disclosing this bug.
> 
> Readahead logic somehow fails to populate the page range with data.
> It can be because
> 1) the readahead routine is not always called in the following lines of 
> fs/splice.c:
>         if (!loff || nr_pages > 1)
>                 page_cache_readahead(mapping, &in->f_ra, in, index, 
> nr_pages);
> 2) even called, page_cache_readahead() won't guarantee the pages are there.
> It won't submit readahead I/O for pages already in the radix tree, or
> when (ra_pages == 0), or after 256 cache hits.
> 
> In your case, it should be because of the retried reads, which lead to 
> excessive cache hits, and disables readahead at some time.
> 
> And that _one_ failure of readahead blocks the whole read process.
> The application receives EAGAIN and retries the read, but
> __generic_file_splice_read() refuses to make progress:
> - in the previous invocation, it has allocated a blank page and inserted 
> it into the radix tree, but never has the chance to start I/O for it: 
> the test of SPLICE_F_NONBLOCK goes before that.
> - in the retried invocation, the readahead code will neither get out of 
> the cache hit mode, nor will it submit I/O for an already existing page.
> 
> The attached patch should fix the critical splice bug. Sorry for not 
> being able to test it locally for now - I'm at home and running knoppix. 
> And the readahead bug will be fixed by the upcoming on-demand readahead 
> patch. I should be back and submit it after a week.
> 
> Thank you,
> Fengguang Wu
> 
> 
> ------------------------------------------------------------------------
> 
> --- linux-2.6.21.1/fs/splice.c.old	2007-05-05 04:40:38.000000000 -0400
> +++ linux-2.6.21.1/fs/splice.c	2007-05-05 04:41:59.000000000 -0400
> @@ -378,10 +378,11 @@
>  			 * If in nonblock mode then dont block on waiting
>  			 * for an in-flight io page
>  			 */
> -			if (flags & SPLICE_F_NONBLOCK)
> -				break;
> -
> -			lock_page(page);
> +			if (flags & SPLICE_F_NONBLOCK) {
> +				if (TestSetPageLocked(page))
> +					break;
> +			} else
> +				lock_page(page);
>  
>  			/*
>  			 * page was truncated, stop here. if this isn't the

Sorry for the delay.

This patch solves the problem, thank you!



