All of lore.kernel.org
 help / color / mirror / Atom feed
From: Jens Axboe <jens.axboe@oracle.com>
To: Wu Fengguang <fengguang.wu@intel.com>
Cc: Linux Kernel <linux-kernel@vger.kernel.org>
Subject: Re: [PATCH] readahead even for FMODE_RANDOM
Date: Fri, 2 Apr 2010 08:38:30 +0200	[thread overview]
Message-ID: <20100402063830.GR23510@kernel.dk> (raw)
In-Reply-To: <20100402012338.GA6393@localhost>

On Fri, Apr 02 2010, Wu Fengguang wrote:
> Hi Jens,
> 
> On Fri, Apr 02, 2010 at 02:31:51AM +0800, Jens Axboe wrote:
> > Hi,
> > 
> > I got a problem report with fio where larger block size random reads
> > where markedly slower with buffered IO than with O_DIRECT, and the
> > initial thought was that perhaps this was some fio oddity. The reporter
> > eventually discovered that turning off the fadvise hint made it work
> > fine. So I took a look, and it seems we never do readahead for
> > FMODE_RANDOM even if the request size is larger than 1 page. That seems
> > like a bug, if an application is doing eg 16kb random reads, you want to
> > readahead the 12kb remaining data. On devices where smaller transfer
> > sizes are slower than larger ones, this can make a large difference.
> > 
> > This patch makes us readahead even for FMODE_RANDOM, iff we'll be
> > reading more pages in that single read. I ran a quick test here, and it
> > appears to fix the problem (no difference with fadvise POSIX_FADV_RANDOM
> > being passed in or not).
>  
> I guess the application is doing (at least partial) sequential reads,
> while at the same time tell kernel with POSIX_FADV_RANDOM that it's
> doing random reads.
> 
> If so, it's mainly the application's fault.

The application is doing large random reads. It's purely random, so
the POSIX_FADV_RANDOM hint is correct. However, thinking about it, it
may be that we later hit a random "block" that has now been cached due
to this read-ahead. Let me try and rule that out completely and see if
there's still the difference. The original reporter observed 4kb reads
hitting the driver, where 128kb was expected.

> However the kernel can behave more smart and less "dumb" :)
> It can inherit the current good behavior of "submit one single 16kb io
> request for a 16kb random read() syscall", while still be able to
> start _larger sized_ readahead if it's actually a sequential one.

Yeah, that's an ancient issue and pretty sad.

> > diff --git a/mm/readahead.c b/mm/readahead.c
> > index 337b20e..d4b201c 100644
> > --- a/mm/readahead.c
> > +++ b/mm/readahead.c
> > @@ -501,8 +501,11 @@ void page_cache_sync_readahead(struct address_space *mapping,
> >  	if (!ra->ra_pages)
> >  		return;
> >  
> > -	/* be dumb */
> > -	if (filp->f_mode & FMODE_RANDOM) {
> > +	/*
> > +	 * Be dumb for files marked as randomly accessed, but do readahead
> > +	 * inside the original request (req_size > 1).
> > +	 */
> > +	if ((filp->f_mode & FMODE_RANDOM) && req_size == 1) {
> >  		force_page_cache_readahead(mapping, filp, offset, req_size);
> >  		return;
> >  	}
> 
> The patch only fixes the (req_size != 1) case that exposed by your
> application. A complete fix would be 
> 
> @@ -820,12 +825,6 @@ void page_cache_sync_readahead(struct ad
>  	if (!ra->ra_pages)
>  		return;
>  
> -	/* be dumb */
> -	if (filp->f_mode & FMODE_RANDOM) {
> -		force_page_cache_readahead(mapping, filp, offset, req_size);
> -		return;
> -	}
> -

Hmm, are we talking about the same thing? I want to hit read-ahead for
the remaining pages inside that random read, eg ensure that read-ahead
gets activated inside that window of the random request.

>  	/* do read-ahead */
>  	ondemand_readahead(mapping, ra, filp, false, offset, req_size);
>  }
> 
> And a more optimized patch would look like this.  Note that only the
> last chunk is necessary for bug fixing, and only this chunk can be
> applied to vanilla 2.6.34-rc3.
> 
> If no problem, I'll submit a patch with only the last chunk for
> 2.6.34, and submit the remaining chunks for 2.6.35.
> 
> Thanks,
> Fengguang
> ---
> Subject: readahead: more smart readahead on POSIX_FADV_RANDOM
> From: Wu Fengguang <fengguang.wu@intel.com>
> Date: Fri Apr 02 08:52:42 CST 2010
> 
> Some times user space applications will tell POSIX_FADV_RANDOM
> while still doing some sequential reads.
> 
> The kernel can behave a bit smarter in this case, by letting the
> readahead heuristics handle the POSIX_FADV_RANDOM case, but with
> less aggressive assumption on sequential reads.

I'll try and give this a spin. On the laptop, I cannot reproduce the
problem of smaller < reqsize ios, so hard to say just now.

> 
> CC: Jens Axboe <jens.axboe@oracle.com>
> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
> ---
>  mm/readahead.c |   17 ++++++++---------
>  1 file changed, 8 insertions(+), 9 deletions(-)
> 
> --- linux.orig/mm/readahead.c	2010-04-02 08:43:53.000000000 +0800
> +++ linux/mm/readahead.c	2010-04-02 09:00:51.000000000 +0800
> @@ -664,6 +664,7 @@ ondemand_readahead(struct address_space 
>  	unsigned long max = max_sane_readahead(ra->ra_pages);
>  	unsigned long tt;  /* thrashing shreshold */
>  	pgoff_t start;
> +	bool random_hint = (filp && (filp->f_mode & FMODE_RANDOM));
>  
>  	/*
>  	 * start of file
> @@ -671,7 +672,8 @@ ondemand_readahead(struct address_space 
>  	if (!offset) {
>  		ra_set_pattern(ra, RA_PATTERN_INITIAL);
>  		ra->start = offset;
> -		if ((ra->ra_flags & READAHEAD_LSEEK) && req_size <= max) {
> +		if ((random_hint || (ra->ra_flags & READAHEAD_LSEEK)) &&
> +		    req_size <= max) {
>  			ra->size = req_size;
>  			ra->async_size = 0;
>  			goto readit;
> @@ -743,8 +745,11 @@ context_readahead:
>  	} else
>  		start = offset;
>  
> -	tt = count_history_pages(mapping, ra, offset,
> -				 READAHEAD_ASYNC_RATIO * max);
> +	if (unlikely(random_hint))
> +		tt = 0;
> +	else
> +		tt = count_history_pages(mapping, ra, offset,
> +					 READAHEAD_ASYNC_RATIO * max);
>  	/*
>  	 * no history pages cached, could be
>  	 * 	- a random read
> @@ -820,12 +825,6 @@ void page_cache_sync_readahead(struct ad
>  	if (!ra->ra_pages)
>  		return;
>  
> -	/* be dumb */
> -	if (filp->f_mode & FMODE_RANDOM) {
> -		force_page_cache_readahead(mapping, filp, offset, req_size);
> -		return;
> -	}
> -
>  	/* do read-ahead */
>  	ondemand_readahead(mapping, ra, filp, false, offset, req_size);
>  }

-- 
Jens Axboe


  reply	other threads:[~2010-04-02  6:38 UTC|newest]

Thread overview: 6+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2010-04-01 18:31 [PATCH] readahead even for FMODE_RANDOM Jens Axboe
2010-04-02  1:23 ` Wu Fengguang
2010-04-02  6:38   ` Jens Axboe [this message]
2010-04-02  6:52     ` Wu Fengguang
2010-04-02  6:59       ` Jens Axboe
2010-04-02  7:21         ` Wu Fengguang

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20100402063830.GR23510@kernel.dk \
    --to=jens.axboe@oracle.com \
    --cc=fengguang.wu@intel.com \
    --cc=linux-kernel@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.