* [RFC][PATCH] on-demand readahead
[not found] <20070425131133.GA26863@mail.ustc.edu.cn>
@ 2007-04-25 13:11 ` Fengguang Wu
2007-04-25 14:37 ` Andi Kleen
0 siblings, 1 reply; 9+ messages in thread
From: Fengguang Wu @ 2007-04-25 13:11 UTC (permalink / raw)
To: Andrew Morton; +Cc: Oleg Nesterov, Steven Pratt, Ram Pai, linux-kernel
Andrew,
This is a minimal readahead algorithm that aims to replace the current one.
It is more flexible and reliable, while maintaining almost the same behavior
and performance. It is also fully integrated with adaptive readahead.
It is designed to be called on demand:
- on a missing page, to do synchronous readahead
- on a lookahead page, to do asynchronous readahead
In this way it eliminates the awkward workarounds for cache hit/miss,
readahead thrashing, retried read, and unaligned read. It also adopts the
data structure introduced by adaptive readahead, parameterizes readahead
pipelining with `lookahead_index', and reduces the current/ahead windows
to one single window.
The patch is made convenient for testing.
Do a
# echo 2 > /proc/sys/vm/readahead_ratio
and it is selected.
Do a
# echo 1 > /proc/sys/vm/readahead_ratio
and the vanilla readahead is selected.
Comments and benchmark numbers are welcome, thank you.
HEURISTICS
The logic deals with four cases:
- sequential-next
found a consistent readahead window, so push it forward
- random
standalone small read, so read as is
- sequential-first
create a new readahead window for a sequential/oversize request
- lookahead-clueless
hit a lookahead page not associated with the readahead window,
so create a new readahead window and ramp it up
In each case, three parameters are determined:
- readahead index: where the next readahead begins
- readahead size: how much to readahead
- lookahead size: when to do the next readahead (for pipelining)
BEHAVIORS
The old behaviors are maximally preserved for trivial sequential/random reads.
Notable changes are:
- It no longer imposes strict sequential checks.
It might help some interleaved cases, and clustered random reads.
It does introduce risks of a random lookahead hit triggering an
unexpected readahead. But in general it is more likely to do good
than to do evil.
- Interleaved reads are supported in a minimal way.
Their chances of being detected and properly handled are still low.
- Readahead thrashing is better handled.
The current readahead leads to tiny average I/O sizes, because it
never turns back for the thrashed pages. They have to be faulted in
by do_generic_mapping_read() one by one. The on-demand readahead,
by contrast, will redo readahead for them.
OVERHEADS
The new code reduces the overheads of
- excessively calling the readahead routine on small sized reads
(the current readahead code insists on seeing all requests)
- doing a lot of pointless page-cache lookups for small cached files
(the current readahead only turns itself off after 256 cache hits;
unfortunately most files are < 1MB, so they never get that chance)
That accounts for speedups of
- 0.3% on 1-page sequential reads on sparse file
- 1.2% on 1-page cache hot sequential reads
- 3.2% on 256-page cache hot sequential reads
- 1.3% on cache hot `tar /lib`
However, it does introduce one extra page-cache lookup per cache miss, which
impacts random reads slightly. That amounts to 1% overhead for 1-page random
reads on a sparse file.
PERFORMANCE
The basic benchmark setup is
- 2.6.20 kernel with on-demand readahead
- 1MB max readahead size
- 2.9GHz Intel Core 2 CPU
- 2GB memory
- 160G/8M Hitachi SATA II 7200 RPM disk
The benchmarks show that
- it maintains the same performance for trivial sequential/random reads
- sysbench/OLTP performance on MySQL gains up to 8%
- performance on readahead thrashing gains up to 3 times
iozone throughput (KB/s): roughly the same
==========================================
iozone -c -t1 -s 4096m -r 64k
2.6.20 on-demand gain
first run
" Initial write " 61437.27 64521.53 +5.0%
" Rewrite " 47893.02 48335.20 +0.9%
" Read " 62111.84 62141.49 +0.0%
" Re-read " 62242.66 62193.17 -0.1%
" Reverse Read " 50031.46 49989.79 -0.1%
" Stride read " 8657.61 8652.81 -0.1%
" Random read " 13914.28 13898.23 -0.1%
" Mixed workload " 19069.27 19033.32 -0.2%
" Random write " 14849.80 14104.38 -5.0%
" Pwrite " 62955.30 65701.57 +4.4%
" Pread " 62209.99 62256.26 +0.1%
second run
" Initial write " 60810.31 66258.69 +9.0%
" Rewrite " 49373.89 57833.66 +17.1%
" Read " 62059.39 62251.28 +0.3%
" Re-read " 62264.32 62256.82 -0.0%
" Reverse Read " 49970.96 50565.72 +1.2%
" Stride read " 8654.81 8638.45 -0.2%
" Random read " 13901.44 13949.91 +0.3%
" Mixed workload " 19041.32 19092.04 +0.3%
" Random write " 14019.99 14161.72 +1.0%
" Pwrite " 64121.67 68224.17 +6.4%
" Pread " 62225.08 62274.28 +0.1%
In summary, writes are unstable, reads are pretty close on average:
access pattern 2.6.20 on-demand gain
Read 62085.61 62196.38 +0.2%
Re-read 62253.49 62224.99 -0.0%
Reverse Read 50001.21 50277.75 +0.6%
Stride read 8656.21 8645.63 -0.1%
Random read 13907.86 13924.07 +0.1%
Mixed workload 19055.29 19062.68 +0.0%
Pread 62217.53 62265.27 +0.1%
aio-stress: roughly the same
============================
aio-stress -l -s4096 -r128 -t1 -o1 knoppix511-dvd-cn.iso
aio-stress -l -s4096 -r128 -t1 -o3 knoppix511-dvd-cn.iso
2.6.20 on-demand delta
sequential 92.57s 92.54s -0.0%
random 311.87s 312.15s +0.1%
sysbench fileio: roughly the same
=================================
sysbench --test=fileio --file-io-mode=async --file-test-mode=rndrw \
--file-total-size=4G --file-block-size=64K \
--num-threads=001 --max-requests=10000 --max-time=900 run
threads 2.6.20 on-demand delta
first run
1 59.1974s 59.2262s +0.0%
2 58.0575s 58.2269s +0.3%
4 48.0545s 47.1164s -2.0%
8 41.0684s 41.2229s +0.4%
16 35.8817s 36.4448s +1.6%
32 32.6614s 32.8240s +0.5%
64 23.7601s 24.1481s +1.6%
128 24.3719s 23.8225s -2.3%
256 23.2366s 22.0488s -5.1%
second run
1 59.6720s 59.5671s -0.2%
8 41.5158s 41.9541s +1.1%
64 25.0200s 23.9634s -4.2%
256 22.5491s 20.9486s -7.1%
Note that the numbers are not very stable because of the writes.
The overall performance is close when we sum all seconds up:
sum all up 495.046s 491.514s -0.7%
sysbench oltp (trans/sec): up to 8% gain
========================================
sysbench --test=oltp --oltp-table-size=10000000 --oltp-read-only \
--mysql-socket=/var/run/mysqld/mysqld.sock \
--mysql-user=root --mysql-password=readahead \
--num-threads=064 --max-requests=10000 --max-time=900 run
10000-transactions run
threads 2.6.20 on-demand gain
1 62.81 64.56 +2.8%
2 67.97 70.93 +4.4%
4 81.81 85.87 +5.0%
8 94.60 97.89 +3.5%
16 99.07 104.68 +5.7%
32 95.93 104.28 +8.7%
64 96.48 103.68 +7.5%
5000-transactions run
1 48.21 48.65 +0.9%
8 68.60 70.19 +2.3%
64 70.57 74.72 +5.9%
2000-transactions run
1 37.57 38.04 +1.3%
2 38.43 38.99 +1.5%
4 45.39 46.45 +2.3%
8 51.64 52.36 +1.4%
16 54.39 55.18 +1.5%
32 52.13 54.49 +4.5%
64 54.13 54.61 +0.9%
These are interesting results. Some investigation shows that
- MySQL is accessing the db file non-uniformly: some parts are
hotter than others
- It is mostly doing 4-page random reads, and sometimes doing two
reads in a row; the latter triggers a 16-page readahead.
- The on-demand readahead leaves many lookahead pages (flagged
PG_readahead) behind. Many of them will be hit, and trigger
more readahead pages, which might save more seeks.
- Naturally, the readahead windows tend to lie in hot areas,
and the lookahead pages in hot areas are more likely to be hit.
- The higher the overall read density, the larger the possible gain.
That also explains the adaptive readahead tricks for clustered random reads.
readahead thrashing: 3 times better
===================================
We boot the kernel with "mem=128m single", and start a new 100KB/s stream
every second, until reaching 200 streams.
max throughput min avg I/O size
2.6.20: 5MB/s 16KB
on-demand: 15MB/s 140KB
Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn>
---
mm/filemap.c | 11 +++--
mm/readahead.c | 101 +++++++++++++++++++++++++++++++++++++++++++++--
2 files changed, 105 insertions(+), 7 deletions(-)
--- linux-2.6.21-rc7-mm1.orig/mm/readahead.c
+++ linux-2.6.21-rc7-mm1/mm/readahead.c
@@ -733,6 +733,11 @@ unsigned long max_sane_readahead(unsigne
#ifdef CONFIG_ADAPTIVE_READAHEAD
+static int prefer_ondemand_readahead(void)
+{
+ return readahead_ratio == 2;
+}
+
/*
* Move pages in danger (of thrashing) to the head of inactive_list.
* Not expected to happen frequently.
@@ -1608,6 +1613,92 @@ thrashing_recovery_readahead(struct addr
return ra_submit(ra, mapping, filp);
}
+/*
+ * Get the previous window size, ramp it up, and
+ * return it as the new window size.
+ */
+static inline unsigned long get_next_ra_size2(struct file_ra_state *ra,
+ unsigned long max)
+{
+ unsigned long cur = ra->readahead_index - ra->ra_index;
+ unsigned long newsize;
+
+ if (cur < max / 16) {
+ newsize = 4 * cur;
+ } else {
+ newsize = 2 * cur;
+ }
+
+ return min(newsize, max);
+}
+
+/*
+ * On-demand readahead.
+ * A minimal readahead algorithm for trivial sequential/random reads.
+ */
+unsigned long
+ondemand_readahead(struct address_space *mapping,
+ struct file_ra_state *ra, struct file *filp,
+ struct page *page, pgoff_t offset,
+ unsigned long req_size, unsigned long max)
+{
+ pgoff_t ra_index; /* readahead index */
+ unsigned long ra_size; /* readahead size */
+ unsigned long la_size; /* lookahead size */
+ int sequential;
+
+ sequential = (offset - ra->prev_page <= 1UL) || (req_size > max);
+
+ /*
+ * Lookahead/readahead hit, assume sequential access.
+ * Ramp up sizes, and push forward the readahead window.
+ */
+ if (offset && (offset == ra->lookahead_index ||
+ offset == ra->readahead_index)) {
+ ra_index = ra->readahead_index;
+ ra_size = get_next_ra_size2(ra, max);
+ la_size = ra_size;
+ goto fill_ra;
+ }
+
+ /*
+ * Standalone, small read.
+ * Read as is, and do not pollute the readahead state.
+ */
+ if (!page && !sequential) {
+ return __do_page_cache_readahead(mapping, filp,
+ offset, req_size, 0);
+ }
+
+ /*
+ * It may be one of
+ * - first read on start of file
+ * - sequential cache miss
+ * - oversize random read
+ * Start readahead for it.
+ */
+ ra_index = offset;
+ ra_size = get_init_ra_size(req_size, max);
+ la_size = ra_size > req_size ? ra_size - req_size : ra_size;
+
+ /*
+ * Hit on a lookahead page without valid readahead state.
+ * E.g. interleaved reads.
+ * Not knowing its readahead pos/size, bet on the minimal possible one.
+ */
+ if (page) {
+ ra_index++;
+ ra_size = min(4 * ra_size, max);
+ }
+
+fill_ra:
+ ra_set_index(ra, offset, ra_index);
+ ra_set_size(ra, ra_size, la_size);
+ ra_set_class(ra, RA_CLASS_NONE);
+
+ return ra_submit(ra, mapping, filp);
+}
+
/**
* page_cache_readahead_adaptive - thrashing safe adaptive read-ahead
* @mapping, @ra, @filp, @offset, @req_size: the same as page_cache_readahead()
@@ -1675,6 +1766,11 @@ page_cache_readahead_adaptive(struct add
if (!page && (ra->flags & RA_FLAG_NFSD))
goto readit;
+ /* on-demand read-ahead */
+ if (prefer_ondemand_readahead())
+ return ondemand_readahead(mapping, ra, filp, page,
+ offset, req_size, ra_max);
+
/*
* Start of file.
*/
@@ -1684,14 +1780,13 @@ page_cache_readahead_adaptive(struct add
/*
* Recover from possible thrashing.
*/
- if (!page && offset - ra->prev_index <= 1 && ra_has_index(ra, offset))
+ if (!page && ra_has_index(ra, offset))
return thrashing_recovery_readahead(mapping, filp, ra, offset);
/*
* State based sequential read-ahead.
*/
- if (offset == ra->prev_index + 1 &&
- offset == ra->lookahead_index &&
+ if (offset == ra->lookahead_index &&
!debug_option(disable_clock_readahead))
return clock_based_readahead(mapping, filp, ra, page,
offset, req_size, ra_max);
--- linux-2.6.21-rc7-mm1.orig/mm/filemap.c
+++ linux-2.6.21-rc7-mm1/mm/filemap.c
@@ -946,13 +946,16 @@ void do_generic_mapping_read(struct addr
find_page:
page = find_get_page(mapping, index);
if (prefer_adaptive_readahead()) {
- if (!page || PageReadahead(page)) {
- ra.prev_index = prev_index;
+ if (!page) {
+ page_cache_readahead_adaptive(mapping,
+ &ra, filp, page,
+ index, last_index - index);
+ page = find_get_page(mapping, index);
+ }
+ if (page && PageReadahead(page)) {
page_cache_readahead_adaptive(mapping,
&ra, filp, page,
index, last_index - index);
- if (!page)
- page = find_get_page(mapping, index);
}
}
if (unlikely(page == NULL)) {
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [RFC][PATCH] on-demand readahead
2007-04-25 13:11 ` [RFC][PATCH] on-demand readahead Fengguang Wu
@ 2007-04-25 14:37 ` Andi Kleen
[not found] ` <20070425160400.GA27954@mail.ustc.edu.cn>
0 siblings, 1 reply; 9+ messages in thread
From: Andi Kleen @ 2007-04-25 14:37 UTC (permalink / raw)
To: Fengguang Wu
Cc: Andrew Morton, Oleg Nesterov, Steven Pratt, Ram Pai, linux-kernel
Fengguang Wu <fengguang.wu@gmail.com> writes:
> OVERHEADS
>
> The new code reduced the overheads of
>
> - excessively calling the readahead routine on small sized reads
> (the current readahead code insists on seeing all requests)
>
> - doing a lot of pointless page-cache lookups for small cached files
> (the current readahead only turns itself off after 256 cache hits,
> unfortunately most files are < 1MB, so never see that chance)
Would it make sense to keep track in the AS if the file is completely in cache?
Then you could probably avoid a lot of these lookups for small in-cache files.
> --- linux-2.6.21-rc7-mm1.orig/mm/readahead.c
> +++ linux-2.6.21-rc7-mm1/mm/readahead.c
> @@ -733,6 +733,11 @@ unsigned long max_sane_readahead(unsigne
Quite simple patch, why is it that much simpler than your earlier patchkits?
Or is that on top of them?
You seem to have a lot of magic numbers. They probably all need symbols and
explanations.
Your white space also needs some work.
-Andi
* Re: [RFC][PATCH] on-demand readahead
[not found] ` <20070425160400.GA27954@mail.ustc.edu.cn>
@ 2007-04-25 16:04 ` Fengguang Wu
2007-04-26 6:58 ` Andrew Morton
2007-04-25 16:08 ` Andi Kleen
1 sibling, 1 reply; 9+ messages in thread
From: Fengguang Wu @ 2007-04-25 16:04 UTC (permalink / raw)
To: Andi Kleen
Cc: Andrew Morton, Oleg Nesterov, Steven Pratt, Ram Pai, linux-kernel
On Wed, Apr 25, 2007 at 04:37:41PM +0200, Andi Kleen wrote:
> Fengguang Wu <fengguang.wu@gmail.com> writes:
>
> > OVERHEADS
> >
> > The new code reduced the overheads of
> >
> > - excessively calling the readahead routine on small sized reads
> > (the current readahead code insists on seeing all requests)
> >
> > - doing a lot of pointless page-cache lookups for small cached files
> > (the current readahead only turns itself off after 256 cache hits,
> > unfortunately most files are < 1MB, so never see that chance)
>
> Would it make sense to keep track in the AS if the file is completely in cache?
> Then you could probably avoid a lot of these lookups for small in cache files
Yeah, the on-demand readahead can avoid _all_ lookups for small in-cache files.
But what do you mean by AS?
> > --- linux-2.6.21-rc7-mm1.orig/mm/readahead.c
> > +++ linux-2.6.21-rc7-mm1/mm/readahead.c
> > @@ -733,6 +733,11 @@ unsigned long max_sane_readahead(unsigne
>
> Quite simple patch, why is it that much simpler than your earlier patchkits?
> Or is that on top of them?
The earlier ones focus on features, while this one aims to be simple.
Even simpler than the current readahead, while keeping the same feature set.
The on-demand readahead is now on top of them, but can/will be made
independent.
> You seem to have a lot of magic numbers. They probably all need symbols and
> explanations.
The magic numbers are for easier testing, and will be removed in the
future. For now, they enable convenient comparison of the two
algorithms in one kernel.
Once this new algorithm has been further tested and approved, I'll
re-submit the patch in a cleaner, standalone form. The adaptive
readahead patches can be dropped then; they may be better reworked as
a kernel module.
> Your white space also needs some work.
White space in patch description?
OK, thanks.
Wu
* Re: [RFC][PATCH] on-demand readahead
[not found] ` <20070425160400.GA27954@mail.ustc.edu.cn>
2007-04-25 16:04 ` Fengguang Wu
@ 2007-04-25 16:08 ` Andi Kleen
[not found] ` <20070426011655.GA6373@mail.ustc.edu.cn>
1 sibling, 1 reply; 9+ messages in thread
From: Andi Kleen @ 2007-04-25 16:08 UTC (permalink / raw)
To: Andi Kleen, Andrew Morton, Oleg Nesterov, Steven Pratt, Ram Pai,
linux-kernel
> Yeah, the on-demand readahead can avoid _all_ lookups for small in-cache files.
How?
> But what do you mean by AS?
struct address_space
> > You seem to have a lot of magic numbers. They probably all need symbols and
> > explanations.
>
> The magic numbers are for easier testings, and will be removed in
> future. For now, they enables convenient comparing of the two
> algorithms in one kernel.
I mean the 16 and 4, not the sysctl.
>
> If this new algorithm has been further tested and approved, I'll
> re-submit the patch in a cleaner, standalone form. The adaptive
> readahead patches can be dropped then. They may better be reworked as
> a kernel module.
If they actually help and don't cause regressions they shouldn't be a module,
but integrated eventually. It just has to be done step by step.
>
> > Your white space also needs some work.
>
> White space in patch description?
In the code indentation.
-Andi
* Re: [RFC][PATCH] on-demand readahead
[not found] ` <20070426011655.GA6373@mail.ustc.edu.cn>
@ 2007-04-26 1:16 ` Fengguang Wu
2007-05-02 10:02 ` [RFC] splice() and readahead interaction Eric Dumazet
0 siblings, 1 reply; 9+ messages in thread
From: Fengguang Wu @ 2007-04-26 1:16 UTC (permalink / raw)
To: Andi Kleen
Cc: Andrew Morton, Oleg Nesterov, Steven Pratt, Ram Pai, linux-kernel
On Wed, Apr 25, 2007 at 06:08:44PM +0200, Andi Kleen wrote:
> > Yeah, the on-demand readahead can avoid _all_ lookups for small in-cache files.
>
> How?
In filemap.c:
if (!page) {
page_cache_readahead_adaptive(mapping,
&ra, filp, page,
index, last_index - index);
page = find_get_page(mapping, index);
}
if (page && PageReadahead(page)) {
page_cache_readahead_adaptive(mapping,
&ra, filp, page,
index, last_index - index);
}
Cache-hot files have neither missing pages (!page) nor lookahead
pages (PageReadahead(page)), so it will not even be called.
> > > You seem to have a lot of magic numbers. They probably all need symbols and
> > > explanations.
> >
> > The magic numbers are for easier testings, and will be removed in
> > future. For now, they enables convenient comparing of the two
> > algorithms in one kernel.
>
> I mean the 16 and 4 not the sysctl
The numbers and the code in get_next_ra_size2() are simply copied from
get_next_ra_size():
if (cur < max / 16) {
newsize = 4 * cur;
} else {
newsize = 2 * cur;
}
It's a trick to ramp up small sizes more quickly.
That trick is documented in the related get_init_ra_size().
So, it would be better to put the two routines together to make it clear.
> >
> > If this new algorithm has been further tested and approved, I'll
> > re-submit the patch in a cleaner, standalone form. The adaptive
> > readahead patches can be dropped then. They may better be reworked as
> > a kernel module.
>
> If they actually help and don't cause regressions they shouldn't be a module,
> but integrated eventually. It just has to be done step by step.
Yeah, the adaptive readahead is complex and the possible workloads diverse.
It becomes obvious that there is a long way to go, and a kernel module makes
life easier.
> > > Your white space also needs some work.
> >
> > White space in patch description?
>
> In the code indentation.
Ah, got it: a silly copy/paste mistake.
Thank you,
Wu
* Re: [RFC][PATCH] on-demand readahead
2007-04-25 16:04 ` Fengguang Wu
@ 2007-04-26 6:58 ` Andrew Morton
0 siblings, 0 replies; 9+ messages in thread
From: Andrew Morton @ 2007-04-26 6:58 UTC (permalink / raw)
To: Fengguang Wu
Cc: Andi Kleen, Andrew Morton, Oleg Nesterov, Steven Pratt, Ram Pai,
linux-kernel
On Thu, 26 Apr 2007 00:04:00 +0800 Fengguang Wu <fengguang.wu@gmail.com> wrote:
> If this new algorithm has been further tested and approved, I'll
> re-submit the patch in a cleaner, standalone form. The adaptive
> readahead patches can be dropped then. They may better be reworked as
> a kernel module.
that would be appreciated - I just don't know how to move the current patches
forward, and they do get in the way quite regularly.
* [RFC] splice() and readahead interaction
2007-04-26 1:16 ` Fengguang Wu
@ 2007-05-02 10:02 ` Eric Dumazet
[not found] ` <f6b15c890705050204l11045ba3w66c8c4ae0ac3407f@mail.gmail.com>
0 siblings, 1 reply; 9+ messages in thread
From: Eric Dumazet @ 2007-05-02 10:02 UTC (permalink / raw)
To: Fengguang Wu
Cc: Andi Kleen, Andrew Morton, Oleg Nesterov, Steven Pratt, Ram Pai,
linux-kernel, Ingo Molnar
Hi Wu
Since you work on readahead, could you please find the reason the following program triggers a problem in the splice() syscall?
Description :
I tried to use splice(SPLICE_F_NONBLOCK) in a non-blocking environment, in an attempt to implement cheap AIO and the zero-copy splice() feature.
I quickly found that readahead in splice() is not really working.
To demonstrate the problem, just compile the attached program, and use it to pipe a big file (not yet in cache) to /dev/null:
$ gcc -o spliceout spliceout.c
$ spliceout -d BIGFILE | cat >/dev/null
offset=49152 ret=49152
offset=65536 ret=16384
offset=131072 ret=65536
...no more progress... (splice() returns -1 and EAGAIN)
Reading the splice(SPLICE_F_NONBLOCK) syscall implementation, I expected to exploit its ability to call readahead(), and make some progress once pages are ready in the cache.
But apparently, even on an idle machine, it is not working as expected.
Thank you
/*
 * Usage :
 * spliceout [-d] file | some_other_program
 */
#ifndef _LARGEFILE64_SOURCE
# define _LARGEFILE64_SOURCE 1
#endif

#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>
#include <stdlib.h>
#include <stdio.h>
#include <sys/poll.h>

#ifndef __do_splice_syscall_h__
#define __do_splice_syscall_h__
#include <sys/syscall.h>
#include <unistd.h>

#if defined(__i386__)
/* From kernel tree include/asm-i386/unistd.h */
#ifndef __NR_splice
#define __NR_splice 313
#endif
#ifndef __NR_vmsplice
#define __NR_vmsplice 316
#endif
#elif defined(__x86_64__)
/* From kernel tree include/asm-x86_64/unistd.h */
#ifndef __NR_splice
#define __NR_splice 275
#endif
#ifndef __NR_vmsplice
#define __NR_vmsplice 278
#endif
#else
#error unsupported architecture
#endif

/* From kernel tree include/linux/pipe_fs_i.h */
#define SPLICE_F_MOVE     (0x01) /* move pages instead of copying */
#define SPLICE_F_NONBLOCK (0x02) /* don't block on the pipe splicing
                                    (but we may still block on the fd
                                    we splice from/to, of course) */
#define SPLICE_F_MORE     (0x04) /* expect more data */
#define SPLICE_F_GIFT     (0x08) /* pages passed in are a gift */

#ifndef SYS_splice
#define SYS_splice __NR_splice
#endif
#ifndef SYS_vmsplice
#define SYS_vmsplice __NR_vmsplice
#endif

static inline
int splice(int fd_in, off64_t *off_in, int fd_out, off64_t *off_out,
	   size_t len, unsigned int flags)
{
	return syscall(SYS_splice, fd_in, off_in, fd_out, off_out, len, flags);
}

struct iovec;
static inline
int vmsplice(int fd, const struct iovec *iov,
	     unsigned long nr_segs, unsigned int flags)
{
	return syscall(SYS_vmsplice, fd, iov, nr_segs, flags);
}
#endif /* __do_splice_syscall_h__ */

void usage(int code)
{
	fprintf(stderr, "Usage : spliceout [-d] file\n");
	exit(code);
}

int main(int argc, char *argv[])
{
	int ret;
	int opt;
	int fd_in;
	int dflg = 0;
	loff_t offset = 0;
	loff_t lastoffset = ~0;
	struct stat st;

	while ((opt = getopt(argc, argv, "d")) != EOF) {
		if (opt == 'd')
			dflg++;
	}
	if (optind == argc)
		usage(1);
	if (fstat(1, &st) == -1)
		usage(1);
	if (!S_ISFIFO(st.st_mode)) {
		fprintf(stderr, "stdout is not a pipe\n");
		exit(1);
	}
	fd_in = open(argv[optind], O_RDONLY);
	if (fd_in == -1) {
		perror(argv[optind]);
		exit(1);
	}
	for (;;) {
		struct pollfd pfd;

		pfd.fd = fd_in;
		pfd.events = POLLIN;
		/* just in case we support poll() on this file, to avoid a loop */
		poll(&pfd, 1, -1);
		ret = splice(fd_in, &offset, 1, NULL,
			     16 * 4096, SPLICE_F_NONBLOCK);
		if (ret == 0)
			break;
		if (dflg && lastoffset != offset) {
			fprintf(stderr, "offset=%lu ret=%d\n",
				(unsigned long)offset, ret);
			lastoffset = offset;
		}
	}
	return 0;
}
* Re: [RFC] splice() and readahead interaction
[not found] ` <f6b15c890705050204l11045ba3w66c8c4ae0ac3407f@mail.gmail.com>
@ 2007-05-07 21:54 ` Andrew Morton
2007-05-10 19:53 ` Eric Dumazet
1 sibling, 0 replies; 9+ messages in thread
From: Andrew Morton @ 2007-05-07 21:54 UTC (permalink / raw)
To: Fengguang Wu
Cc: Eric Dumazet, Andi Kleen, Oleg Nesterov, Steven Pratt, Ram Pai,
linux-kernel, Ingo Molnar
On Sat, 5 May 2007 05:04:29 -0400
"Fengguang Wu" <fengguang.wu@gmail.com> wrote:
> Readahead logic somehow fails to populate the page range with data.
> It can be because
> 1) the readahead routine is not always called in the following lines of
> fs/splice.c:
> if (!loff || nr_pages > 1)
> page_cache_readahead(mapping, &in->f_ra, in, index,
> nr_pages);
> 2) even called, page_cache_readahead() won't guarantee the pages are there.
> It won't submit readahead I/O for pages already in the radix tree, or when
> (ra_pages == 0), or after 256 cache hits.
>
> In your case, it should be because of the retried reads, which lead to
> excessive cache hits, and disables readahead at some time.
>
> And that _one_ failure of readahead blocks the whole read process.
> The application receives EAGAIN and retries the read, but
> __generic_file_splice_read() refuses to make progress:
> - in the previous invocation, it has allocated a blank page and inserted it
> into the radix tree, but never has the chance to start I/O for it: the test
> of SPLICE_F_NONBLOCK goes before that.
> - in the retried invocation, the readahead code will neither get out of the
> cache hit mode, nor will it submit I/O for an already existing page.
>
> The attached patch should fix the critical splice bug. Sorry for not being
> able to test it locally for now - I'm at home and running knoppix. And the
> readahead bug will be fixed by the upcoming on-demand readahead patch. I
> should be back and submit it after a week.
>
> Thank you,
> Fengguang Wu
>
>
> [splice-nonblock-fix.patch text/x-patch (506B)]
> --- linux-2.6.21.1/fs/splice.c.old 2007-05-05 04:40:38.000000000 -0400
> +++ linux-2.6.21.1/fs/splice.c 2007-05-05 04:41:59.000000000 -0400
> @@ -378,10 +378,11 @@
> * If in nonblock mode then dont block on waiting
> * for an in-flight io page
> */
> - if (flags & SPLICE_F_NONBLOCK)
> - break;
> -
> - lock_page(page);
> + if (flags & SPLICE_F_NONBLOCK) {
> + if (TestSetPageLocked(page))
> + break;
> + } else
> + lock_page(page);
>
> /*
> * page was truncated, stop here. if this isn't the
So.. afaik we're awaiting testing results for this change?
* Re: [RFC] splice() and readahead interaction
[not found] ` <f6b15c890705050204l11045ba3w66c8c4ae0ac3407f@mail.gmail.com>
2007-05-07 21:54 ` Andrew Morton
@ 2007-05-10 19:53 ` Eric Dumazet
1 sibling, 0 replies; 9+ messages in thread
From: Eric Dumazet @ 2007-05-10 19:53 UTC (permalink / raw)
To: Fengguang Wu
Cc: Andi Kleen, Andrew Morton, Oleg Nesterov, Steven Pratt, Ram Pai,
linux-kernel, Ingo Molnar
Fengguang Wu wrote:
> 2007/5/2, Eric Dumazet <dada1@cosmosbay.com <mailto:dada1@cosmosbay.com>>:
>
> Since you work on readahead, could you please find the reason
> following program triggers a problem in splice() syscall ?
>
> Description :
>
> I tried to use splice(SPLICE_F_NONBLOCK) in a non blocking
> environnement, in an attempt to implement cheap AIO, and zero-copy
> splice() feature.
>
> I quicky found that readahead in splice() is not really working.
>
> To demonstrate the problem, just compile the attached program, and
> use it to pipe a big file (not yet in cache) to /dev/null :
>
> $ gcc -o spliceout spliceout.c
> $ spliceout -d BIGFILE | cat >/dev/null
> offset=49152 ret=49152
> offset=65536 ret=16384
> offset=131072 ret=65536
> ...no more progress... (splice() returns -1 and EAGAIN)
>
> reading splice(SPLICE_F_NONBLOCK) syscall implementation, I expected
> to exploit its ability to call readahead(), and do some progress if
> pages are ready in cache.
>
> But apparently, even on an idle machine, it is not working as expected.
>
>
>
> Eric Dumazet, thank you for disclosing this bug.
>
> Readahead logic somehow fails to populate the page range with data.
> It can be because
> 1) the readahead routine is not always called in the following lines of
> fs/splice.c:
> if (!loff || nr_pages > 1)
> page_cache_readahead(mapping, &in->f_ra, in, index,
> nr_pages);
> 2) even called, page_cache_readahead() won't guarantee the pages are there.
> It won't submit readahead I/O for pages already in the radix tree, or
> when (ra_pages == 0), or after 256 cache hits.
>
> In your case, it should be because of the retried reads, which lead to
> excessive cache hits, and disables readahead at some time.
>
> And that _one_ failure of readahead blocks the whole read process.
> The application receives EAGAIN and retries the read, but
> __generic_file_splice_read() refuses to make progress:
> - in the previous invocation, it has allocated a blank page and inserted
> it into the radix tree, but never has the chance to start I/O for it:
> the test of SPLICE_F_NONBLOCK goes before that.
> - in the retried invocation, the readahead code will neither get out of
> the cache hit mode, nor will it submit I/O for an already existing page.
>
> The attached patch should fix the critical splice bug. Sorry for not
> being able to test it locally for now - I'm at home and running knoppix.
> And the readahead bug will be fixed by the upcoming on-demand readahead
> patch. I should be back and submit it after a week.
>
> Thank you,
> Fengguang Wu
>
>
> ------------------------------------------------------------------------
>
> --- linux-2.6.21.1/fs/splice.c.old 2007-05-05 04:40:38.000000000 -0400
> +++ linux-2.6.21.1/fs/splice.c 2007-05-05 04:41:59.000000000 -0400
> @@ -378,10 +378,11 @@
> * If in nonblock mode then dont block on waiting
> * for an in-flight io page
> */
> - if (flags & SPLICE_F_NONBLOCK)
> - break;
> -
> - lock_page(page);
> + if (flags & SPLICE_F_NONBLOCK) {
> + if (TestSetPageLocked(page))
> + break;
> + } else
> + lock_page(page);
>
> /*
> * page was truncated, stop here. if this isn't the
Sorry for the delay.
This patch solves the problem, thank you!