From: Wu Fengguang <fengguang.wu@intel.com>
To: Andrew Morton <akpm@linux-foundation.org>
Cc: "vst@vlnb.net" <vst@vlnb.net>,
"jens.axboe@oracle.com" <jens.axboe@oracle.com>,
"jmoyer@redhat.com" <jmoyer@redhat.com>,
"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>
Subject: Re: [PATCH 3/3] readahead: introduce context readahead algorithm
Date: Sun, 12 Apr 2009 15:11:08 +0800 [thread overview]
Message-ID: <20090412071108.GA14034@localhost> (raw)
In-Reply-To: <20090410171652.96bceb90.akpm@linux-foundation.org>
On Sat, Apr 11, 2009 at 08:16:52AM +0800, Andrew Morton wrote:
> On Fri, 10 Apr 2009 21:12:50 +0800
> Wu Fengguang <fengguang.wu@intel.com> wrote:
>
> > Introduce page cache context based readahead algorithm.
> > This is to better support concurrent read streams in general.
> >
> > RATIONALE
> > ---------
> > The current readahead algorithm detects interleaved reads in a _passive_ way.
> > Given a sequence of interleaved streams 1,1001,2,1002,3,4,1003,5,1004,1005,6,...
> > By checking for (offset == prev_offset + 1), it will discover the sequentialness
> > between 3,4 and between 1004,1005, and start doing sequential readahead for the
> > individual streams since page 4 and page 1005.
> >
> > The context readahead algorithm guarantees to discover the sequentialness no
> > matter how the streams are interleaved. For the above example, it will start
> > sequential readahead since page 2 and 1002.
> >
> > The trick is to poke for page @offset-1 in the page cache when it has no other
> > clues on the sequentialness of request @offset: if the current requenst belongs
> > to a sequential stream, that stream must have accessed page @offset-1 recently,
> > and the page will still be cached now. So if page @offset-1 is there, we can
> > take request @offset as a sequential access.
> >
> > BENEFICIARIES
> > -------------
> > - strictly interleaved reads i.e. 1,1001,2,1002,3,1003,...
> > the current readahead will take them as silly random reads;
> > the context readahead will take them as two sequential streams.
> >
> > - seeky _column_ iterations on a huge matrix
> > Yes it can be regard as _massively_ interleaved streams!
> > Context readahead could transform the 1-page IOs (@offset+@size):
> > 0+1, 1000+1, 2000+1, 3000+1, ...,
> > 1+1, 1001+1, 2001+1, 3001+1, ...,
> > 2+1, 1002+1, 2002+1, 3002+1, ...
> > into larger sized IOs:
> > 0+1, 1000+1, 2000+1, 3000+1, ...,
> > 1+4, 1001+4, 2001+4, 3001+4, ...,
> > 5+8, 1005+8, 2005+8, 3005+8, ...
> >
> > - cooperative IO processes i.e. NFS and SCST
> > They create a thread pool, farming off (sequential) IO requests to different
> > threads which will be performing interleaved IO.
> >
> > It was not easy(or possible) to reliably tell from file->f_ra all those
> > cooperative processes working on the same sequential stream, since they will
> > have different file->f_ra instances. And NFSD's file->f_ra is particularly
> > unusable, since their file objects are dynamically created for each request.
> > The nfsd does have code trying to restore the f_ra bits, but not satisfactory.
> >
> > The new scheme is to detect the sequential pattern via looking up the page
> > cache, which provides one single and consistent view of the pages recently
> > accessed. That makes sequential detection for cooperative processes possible.
> >
> > USER REPORT
> > -----------
> > Vladislav recommends the addition of context readahead as a result of his SCST
> > benchmarks. It leads to 6%~40% performance gains in various cases and achieves
> > equal performance in others. http://lkml.org/lkml/2009/3/19/239
> >
> > OVERHEADS
> > ---------
> > In theory, it introduces one extra page cache lookup per random read. However
> > the below benchmark shows context readahead to be slightly faster, wondering..
> >
> > Randomly reading 200MB amount of data on a sparse file, repeat 20 times for
> > each block size. The average throughputs are:
> >
> > original ra context ra gain
> > 4K random reads: 65.561MB/s 65.648MB/s +0.1%
> > 16K random reads: 124.767MB/s 124.951MB/s +0.1%
> > 64K random reads: 162.123MB/s 162.278MB/s +0.1%
> >
> > Cc: Jens Axboe <jens.axboe@oracle.com>
> > Cc: Jeff Moyer <jmoyer@redhat.com>
> > Tested-by: Vladislav Bolkhovitin <vst@vlnb.net>
> > Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
>
> > ---
> > mm/readahead.c | 60 +++++++++++++++++++++++++++++++++++++++++++++++
> > 1 file changed, 60 insertions(+)
> >
> > --- mm.orig/mm/readahead.c
> > +++ mm/mm/readahead.c
> > @@ -330,6 +330,59 @@ static unsigned long get_next_ra_size(st
> > */
> >
> > /*
> > + * Count continuously cached pages from @offset-1 to @offset-@max,
>
> You meant "contiguously" here, yes?
Ah yes, continuously for time and contiguously for space?
> > + * this count is a conservative estimation of
> > + * - length of the sequential read sequence, or
> > + * - thrashing threshold in memory tight systems
> > + */
> > +static unsigned long count_history_pages(struct address_space *mapping,
> > + struct file_ra_state *ra,
> > + pgoff_t offset, unsigned long max)
> > +{
> > + pgoff_t head;
> > +
> > + rcu_read_lock();
> > + head = radix_tree_prev_hole(&mapping->page_tree, offset - 1, max);
> > + rcu_read_unlock();
> > +
> > + return offset - 1 - head;
> > +}
>
> Doesn't matter much, but perhaps this should return pgoff_t.
Do you indicate to use pgoff_t for size?
> > +/*
> > + * page cache context based read-ahead
> > + */
> > +static int try_context_readahead(struct address_space *mapping,
> > + struct file_ra_state *ra,
> > + pgoff_t offset,
> > + unsigned long req_size,
> > + unsigned long max)
> > +{
> > + unsigned long size;
>
> And this could be pgoff_t too.
OK. I'll repost the whole series.
Thanks,
Fengguang
next prev parent reply other threads:[~2009-04-12 7:24 UTC|newest]
Thread overview: 22+ messages / expand[flat|nested] mbox.gz Atom feed top
2009-04-10 13:12 [PATCH 0/3] context readahead for concurrent IO Wu Fengguang
2009-04-10 13:12 ` [PATCH 1/3] radix-tree: add radix_tree_next_hole() Wu Fengguang
2009-04-10 13:29 ` Wu Fengguang
2009-04-10 13:31 ` [PATCH 1/3] radix-tree: add radix_tree_prev_hole() Wu Fengguang
2009-04-10 13:12 ` [PATCH 2/3] readahead: move the random read case to bottom Wu Fengguang
2009-04-10 13:12 ` [PATCH 3/3] readahead: introduce context readahead algorithm Wu Fengguang
2009-04-11 0:16 ` Andrew Morton
2009-04-12 7:11 ` Wu Fengguang [this message]
-- strict thread matches above, loose matches on Subject: below --
2009-04-12 7:19 [PATCH 0/3] context readahead for concurrent IO take 2 Wu Fengguang
2009-04-12 7:19 ` [PATCH 3/3] readahead: introduce context readahead algorithm Wu Fengguang
2009-04-12 8:48 ` Ingo Molnar
2009-04-12 12:35 ` Wu Fengguang
2009-04-12 12:35 ` Wu Fengguang
2009-04-16 17:12 ` Vladislav Bolkhovitin
2009-04-16 17:12 ` Vladislav Bolkhovitin
[not found] ` <87zlej7kwf.fsf@basil.nowhere.org>
2009-04-14 9:27 ` Wu Fengguang
2009-04-14 10:00 ` Andi Kleen
2009-04-14 10:58 ` Wu Fengguang
2009-04-14 11:11 ` Wu Fengguang
2009-04-15 3:43 ` Jeff Moyer
2009-04-15 4:43 ` Wu Fengguang
2009-04-15 17:55 ` Jeff Moyer
2009-04-27 4:48 ` Wu Fengguang
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20090412071108.GA14034@localhost \
--to=fengguang.wu@intel.com \
--cc=akpm@linux-foundation.org \
--cc=jens.axboe@oracle.com \
--cc=jmoyer@redhat.com \
--cc=linux-kernel@vger.kernel.org \
--cc=vst@vlnb.net \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.