From: Wu Fengguang <fengguang.wu@intel.com>
To: Jens Axboe <jens.axboe@oracle.com>
Cc: Linux Kernel <linux-kernel@vger.kernel.org>
Subject: Re: [PATCH] readahead even for FMODE_RANDOM
Date: Fri, 2 Apr 2010 09:23:38 +0800 [thread overview]
Message-ID: <20100402012338.GA6393@localhost> (raw)
In-Reply-To: <20100401183151.GO23510@kernel.dk>
Hi Jens,
On Fri, Apr 02, 2010 at 02:31:51AM +0800, Jens Axboe wrote:
> Hi,
>
> I got a problem report with fio where larger block size random reads
> where markedly slower with buffered IO than with O_DIRECT, and the
> initial thought was that perhaps this was some fio oddity. The reporter
> eventually discovered that turning off the fadvise hint made it work
> fine. So I took a look, and it seems we never do readahead for
> FMODE_RANDOM even if the request size is larger than 1 page. That seems
> like a bug, if an application is doing eg 16kb random reads, you want to
> readahead the 12kb remaining data. On devices where smaller transfer
> sizes are slower than larger ones, this can make a large difference.
>
> This patch makes us readahead even for FMODE_RANDOM, iff we'll be
> reading more pages in that single read. I ran a quick test here, and it
> appears to fix the problem (no difference with fadvise POSIX_FADV_RANDOM
> being passed in or not).
I guess the application is doing (at least partial) sequential reads,
while at the same time tell kernel with POSIX_FADV_RANDOM that it's
doing random reads.
If so, it's mainly the application's fault.
However the kernel can behave more smart and less "dumb" :)
It can inherit the current good behavior of "submit one single 16kb io
request for a 16kb random read() syscall", while still be able to
start _larger sized_ readahead if it's actually a sequential one.
> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
>
> diff --git a/mm/readahead.c b/mm/readahead.c
> index 337b20e..d4b201c 100644
> --- a/mm/readahead.c
> +++ b/mm/readahead.c
> @@ -501,8 +501,11 @@ void page_cache_sync_readahead(struct address_space *mapping,
> if (!ra->ra_pages)
> return;
>
> - /* be dumb */
> - if (filp->f_mode & FMODE_RANDOM) {
> + /*
> + * Be dumb for files marked as randomly accessed, but do readahead
> + * inside the original request (req_size > 1).
> + */
> + if ((filp->f_mode & FMODE_RANDOM) && req_size == 1) {
> force_page_cache_readahead(mapping, filp, offset, req_size);
> return;
> }
The patch only fixes the (req_size != 1) case that exposed by your
application. A complete fix would be
@@ -820,12 +825,6 @@ void page_cache_sync_readahead(struct ad
if (!ra->ra_pages)
return;
- /* be dumb */
- if (filp->f_mode & FMODE_RANDOM) {
- force_page_cache_readahead(mapping, filp, offset, req_size);
- return;
- }
-
/* do read-ahead */
ondemand_readahead(mapping, ra, filp, false, offset, req_size);
}
And a more optimized patch would look like this. Note that only the
last chunk is necessary for bug fixing, and only this chunk can be
applied to vanilla 2.6.34-rc3.
If no problem, I'll submit a patch with only the last chunk for
2.6.34, and submit the remaining chunks for 2.6.35.
Thanks,
Fengguang
---
Subject: readahead: more smart readahead on POSIX_FADV_RANDOM
From: Wu Fengguang <fengguang.wu@intel.com>
Date: Fri Apr 02 08:52:42 CST 2010
Some times user space applications will tell POSIX_FADV_RANDOM
while still doing some sequential reads.
The kernel can behave a bit smarter in this case, by letting the
readahead heuristics handle the POSIX_FADV_RANDOM case, but with
less aggressive assumption on sequential reads.
CC: Jens Axboe <jens.axboe@oracle.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
mm/readahead.c | 17 ++++++++---------
1 file changed, 8 insertions(+), 9 deletions(-)
--- linux.orig/mm/readahead.c 2010-04-02 08:43:53.000000000 +0800
+++ linux/mm/readahead.c 2010-04-02 09:00:51.000000000 +0800
@@ -664,6 +664,7 @@ ondemand_readahead(struct address_space
unsigned long max = max_sane_readahead(ra->ra_pages);
unsigned long tt; /* thrashing shreshold */
pgoff_t start;
+ bool random_hint = (filp && (filp->f_mode & FMODE_RANDOM));
/*
* start of file
@@ -671,7 +672,8 @@ ondemand_readahead(struct address_space
if (!offset) {
ra_set_pattern(ra, RA_PATTERN_INITIAL);
ra->start = offset;
- if ((ra->ra_flags & READAHEAD_LSEEK) && req_size <= max) {
+ if ((random_hint || (ra->ra_flags & READAHEAD_LSEEK)) &&
+ req_size <= max) {
ra->size = req_size;
ra->async_size = 0;
goto readit;
@@ -743,8 +745,11 @@ context_readahead:
} else
start = offset;
- tt = count_history_pages(mapping, ra, offset,
- READAHEAD_ASYNC_RATIO * max);
+ if (unlikely(random_hint))
+ tt = 0;
+ else
+ tt = count_history_pages(mapping, ra, offset,
+ READAHEAD_ASYNC_RATIO * max);
/*
* no history pages cached, could be
* - a random read
@@ -820,12 +825,6 @@ void page_cache_sync_readahead(struct ad
if (!ra->ra_pages)
return;
- /* be dumb */
- if (filp->f_mode & FMODE_RANDOM) {
- force_page_cache_readahead(mapping, filp, offset, req_size);
- return;
- }
-
/* do read-ahead */
ondemand_readahead(mapping, ra, filp, false, offset, req_size);
}
next prev parent reply other threads:[~2010-04-02 1:23 UTC|newest]
Thread overview: 6+ messages / expand[flat|nested] mbox.gz Atom feed top
2010-04-01 18:31 [PATCH] readahead even for FMODE_RANDOM Jens Axboe
2010-04-02 1:23 ` Wu Fengguang [this message]
2010-04-02 6:38 ` Jens Axboe
2010-04-02 6:52 ` Wu Fengguang
2010-04-02 6:59 ` Jens Axboe
2010-04-02 7:21 ` Wu Fengguang
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20100402012338.GA6393@localhost \
--to=fengguang.wu@intel.com \
--cc=jens.axboe@oracle.com \
--cc=linux-kernel@vger.kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.