From: Wu Fengguang <fengguang.wu@intel.com>
To: Vladislav Bolkhovitin <vst@vlnb.net>
Cc: Ronald Moesbergen <intercommit@gmail.com>,
Andrew Morton <akpm@linux-foundation.org>,
"kosaki.motohiro@jp.fujitsu.com" <kosaki.motohiro@jp.fujitsu.com>,
"Alan.Brunelle@hp.com" <Alan.Brunelle@hp.com>,
"hifumi.hisashi@oss.ntt.co.jp" <hifumi.hisashi@oss.ntt.co.jp>,
"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
"linux-fsdevel@vger.kernel.org" <linux-fsdevel@vger.kernel.org>,
"jens.axboe@oracle.com" <jens.axboe@oracle.com>,
"randy.dunlap@oracle.com" <randy.dunlap@oracle.com>,
Bart Van Assche <bart.vanassche@gmail.com>
Subject: Re: [RESEND] [PATCH] readahead:add blk_run_backing_dev
Date: Mon, 29 Jun 2009 21:28:41 +0800 [thread overview]
Message-ID: <20090629132841.GA26171@localhost> (raw)
In-Reply-To: <20090629131326.GA23668@localhost>
[-- Attachment #1: Type: text/plain, Size: 639 bytes --]
On Mon, Jun 29, 2009 at 09:13:27PM +0800, Wu Fengguang wrote:
> On Mon, Jun 29, 2009 at 09:04:57PM +0800, Vladislav Bolkhovitin wrote:
> > Wu Fengguang, on 06/29/2009 04:54 PM wrote:
> > >
> > > Why not 2.6.30? :)
> >
> > We started with 2.6.29, so why not complete with it (to save additional
> > Ronald's effort to move on 2.6.30)?
>
> OK, that's fair enough.
btw, I backported the 2.6.31 context readahead patches to 2.6.29, just
in case it will help the SCST performance.
Ronald, if you run context readahead, please make sure that the server
side readahead size is bigger than the client side readahead size.
Thanks,
Fengguang
[-- Attachment #2: readahead-context-2.6.29.patch --]
[-- Type: text/x-diff, Size: 6651 bytes --]
--- linux.orig/mm/readahead.c
+++ linux/mm/readahead.c
@@ -337,6 +337,59 @@ static unsigned long get_next_ra_size(st
*/
/*
+ * Count contiguously cached pages from @offset-1 to @offset-@max,
+ * this count is a conservative estimation of
+ * - length of the sequential read sequence, or
+ * - thrashing threshold in memory tight systems
+ */
+static pgoff_t count_history_pages(struct address_space *mapping,
+ struct file_ra_state *ra,
+ pgoff_t offset, unsigned long max)
+{
+ pgoff_t head;
+
+ rcu_read_lock();
+ head = radix_tree_prev_hole(&mapping->page_tree, offset - 1, max);
+ rcu_read_unlock();
+
+ return offset - 1 - head;
+}
+
+/*
+ * page cache context based read-ahead
+ */
+static int try_context_readahead(struct address_space *mapping,
+ struct file_ra_state *ra,
+ pgoff_t offset,
+ unsigned long req_size,
+ unsigned long max)
+{
+ pgoff_t size;
+
+ size = count_history_pages(mapping, ra, offset, max);
+
+ /*
+ * no history pages:
+ * it could be a random read
+ */
+ if (!size)
+ return 0;
+
+ /*
+ * starts from beginning of file:
+ * it is a strong indication of long-run stream (or whole-file-read)
+ */
+ if (size >= offset)
+ size *= 2;
+
+ ra->start = offset;
+ ra->size = get_init_ra_size(size + req_size, max);
+ ra->async_size = ra->size;
+
+ return 1;
+}
+
+/*
* A minimal readahead algorithm for trivial sequential/random reads.
*/
static unsigned long
@@ -345,34 +398,26 @@ ondemand_readahead(struct address_space
bool hit_readahead_marker, pgoff_t offset,
unsigned long req_size)
{
- int max = ra->ra_pages; /* max readahead pages */
- pgoff_t prev_offset;
- int sequential;
+ unsigned long max = max_sane_readahead(ra->ra_pages);
+
+ /*
+ * start of file
+ */
+ if (!offset)
+ goto initial_readahead;
/*
* It's the expected callback offset, assume sequential access.
* Ramp up sizes, and push forward the readahead window.
*/
- if (offset && (offset == (ra->start + ra->size - ra->async_size) ||
- offset == (ra->start + ra->size))) {
+ if ((offset == (ra->start + ra->size - ra->async_size) ||
+ offset == (ra->start + ra->size))) {
ra->start += ra->size;
ra->size = get_next_ra_size(ra, max);
ra->async_size = ra->size;
goto readit;
}
- prev_offset = ra->prev_pos >> PAGE_CACHE_SHIFT;
- sequential = offset - prev_offset <= 1UL || req_size > max;
-
- /*
- * Standalone, small read.
- * Read as is, and do not pollute the readahead state.
- */
- if (!hit_readahead_marker && !sequential) {
- return __do_page_cache_readahead(mapping, filp,
- offset, req_size, 0);
- }
-
/*
* Hit a marked page without valid readahead state.
* E.g. interleaved reads.
@@ -383,7 +428,7 @@ ondemand_readahead(struct address_space
pgoff_t start;
rcu_read_lock();
- start = radix_tree_next_hole(&mapping->page_tree, offset,max+1);
+ start = radix_tree_next_hole(&mapping->page_tree, offset+1,max);
rcu_read_unlock();
if (!start || start - offset > max)
@@ -391,23 +436,53 @@ ondemand_readahead(struct address_space
ra->start = start;
ra->size = start - offset; /* old async_size */
+ ra->size += req_size;
ra->size = get_next_ra_size(ra, max);
ra->async_size = ra->size;
goto readit;
}
/*
- * It may be one of
- * - first read on start of file
- * - sequential cache miss
- * - oversize random read
- * Start readahead for it.
+ * oversize read
+ */
+ if (req_size > max)
+ goto initial_readahead;
+
+ /*
+ * sequential cache miss
*/
+ if (offset - (ra->prev_pos >> PAGE_CACHE_SHIFT) <= 1UL)
+ goto initial_readahead;
+
+ /*
+ * Query the page cache and look for the traces(cached history pages)
+ * that a sequential stream would leave behind.
+ */
+ if (try_context_readahead(mapping, ra, offset, req_size, max))
+ goto readit;
+
+ /*
+ * standalone, small random read
+ * Read as is, and do not pollute the readahead state.
+ */
+ return __do_page_cache_readahead(mapping, filp, offset, req_size, 0);
+
+initial_readahead:
ra->start = offset;
ra->size = get_init_ra_size(req_size, max);
ra->async_size = ra->size > req_size ? ra->size - req_size : ra->size;
readit:
+ /*
+ * Will this read hit the readahead marker made by itself?
+ * If so, trigger the readahead marker hit now, and merge
+ * the resulted next readahead window into the current one.
+ */
+ if (offset == ra->start && ra->size == ra->async_size) {
+ ra->async_size = get_next_ra_size(ra, max);
+ ra->size += ra->async_size;
+ }
+
return ra_submit(ra, mapping, filp);
}
--- linux.orig/lib/radix-tree.c
+++ linux/lib/radix-tree.c
@@ -666,6 +666,43 @@ unsigned long radix_tree_next_hole(struc
}
EXPORT_SYMBOL(radix_tree_next_hole);
+/**
+ * radix_tree_prev_hole - find the prev hole (not-present entry)
+ * @root: tree root
+ * @index: index key
+ * @max_scan: maximum range to search
+ *
+ * Search backwards in the range [max(index-max_scan+1, 0), index]
+ * for the first hole.
+ *
+ * Returns: the index of the hole if found, otherwise returns an index
+ * outside of the set specified (in which case 'index - return >= max_scan'
+ * will be true). In rare cases of wrap-around, LONG_MAX will be returned.
+ *
+ * radix_tree_next_hole may be called under rcu_read_lock. However, like
+ * radix_tree_gang_lookup, this will not atomically search a snapshot of
+ * the tree at a single point in time. For example, if a hole is created
+ * at index 10, then subsequently a hole is created at index 5,
+ * radix_tree_prev_hole covering both indexes may return 5 if called under
+ * rcu_read_lock.
+ */
+unsigned long radix_tree_prev_hole(struct radix_tree_root *root,
+ unsigned long index, unsigned long max_scan)
+{
+ unsigned long i;
+
+ for (i = 0; i < max_scan; i++) {
+ if (!radix_tree_lookup(root, index))
+ break;
+ index--;
+ if (index == LONG_MAX)
+ break;
+ }
+
+ return index;
+}
+EXPORT_SYMBOL(radix_tree_prev_hole);
+
static unsigned int
__lookup(struct radix_tree_node *slot, void ***results, unsigned long index,
unsigned int max_items, unsigned long *next_index)
--- linux.orig/include/linux/radix-tree.h
+++ linux/include/linux/radix-tree.h
@@ -167,6 +167,8 @@ radix_tree_gang_lookup_slot(struct radix
unsigned long first_index, unsigned int max_items);
unsigned long radix_tree_next_hole(struct radix_tree_root *root,
unsigned long index, unsigned long max_scan);
+unsigned long radix_tree_prev_hole(struct radix_tree_root *root,
+ unsigned long index, unsigned long max_scan);
int radix_tree_preload(gfp_t gfp_mask);
void radix_tree_init(void);
void *radix_tree_tag_set(struct radix_tree_root *root,
next prev parent reply other threads:[~2009-06-29 13:28 UTC|newest]
Thread overview: 65+ messages / expand[flat|nested] mbox.gz Atom feed top
2009-05-29 5:35 [RESEND] [PATCH] readahead:add blk_run_backing_dev Hisashi Hifumi
2009-06-01 0:36 ` Andrew Morton
2009-06-01 1:04 ` Hisashi Hifumi
2009-06-05 15:15 ` Alan D. Brunelle
2009-06-06 14:36 ` KOSAKI Motohiro
2009-06-06 22:45 ` Wu Fengguang
2009-06-18 19:04 ` Andrew Morton
2009-06-20 3:55 ` Wu Fengguang
2009-06-20 12:29 ` Vladislav Bolkhovitin
2009-06-29 9:34 ` Wu Fengguang
2009-06-29 10:26 ` Ronald Moesbergen
2009-06-29 10:55 ` Vladislav Bolkhovitin
2009-06-29 12:54 ` Wu Fengguang
2009-06-29 12:58 ` Bart Van Assche
2009-06-29 13:01 ` Wu Fengguang
2009-06-29 13:04 ` Vladislav Bolkhovitin
2009-06-29 13:13 ` Wu Fengguang
2009-06-29 13:28 ` Wu Fengguang [this message]
2009-06-29 14:43 ` Ronald Moesbergen
2009-06-29 14:51 ` Wu Fengguang
2009-06-29 14:56 ` Ronald Moesbergen
2009-06-29 15:37 ` Vladislav Bolkhovitin
2009-06-29 14:00 ` Ronald Moesbergen
2009-06-29 14:21 ` Wu Fengguang
2009-06-29 15:01 ` Wu Fengguang
2009-06-29 15:37 ` Vladislav Bolkhovitin
[not found] ` <20090630010414.GB31418@localhost>
[not found] ` <4A49EEF9.6010205@vlnb.net>
[not found] ` <a0272b440907030214l4016422bxbc98fd003bfe1b3d@mail.gmail.com>
[not found] ` <4A4DE3C1.5080307@vlnb.net>
[not found] ` <a0272b440907040819l5289483cp44b37d967440ef73@mail.gmail.com>
2009-07-06 11:12 ` Vladislav Bolkhovitin
2009-07-06 14:37 ` Ronald Moesbergen
2009-07-06 17:48 ` Vladislav Bolkhovitin
2009-07-07 6:49 ` Ronald Moesbergen
[not found] ` <4A5395FD.2040507@vlnb.net>
[not found] ` <a0272b440907080149j3eeeb9bat13f942520db059a8@mail.gmail.com>
2009-07-08 12:40 ` Vladislav Bolkhovitin
2009-07-10 6:32 ` Ronald Moesbergen
2009-07-10 8:43 ` Vladislav Bolkhovitin
2009-07-10 9:27 ` Vladislav Bolkhovitin
2009-07-13 12:12 ` Ronald Moesbergen
2009-07-13 12:36 ` Wu Fengguang
2009-07-13 12:47 ` Ronald Moesbergen
2009-07-13 12:52 ` Wu Fengguang
2009-07-14 18:52 ` Vladislav Bolkhovitin
2009-07-15 7:06 ` Wu Fengguang
2009-07-14 18:52 ` Vladislav Bolkhovitin
2009-07-15 6:30 ` Vladislav Bolkhovitin
2009-07-16 7:32 ` Ronald Moesbergen
2009-07-16 10:36 ` Vladislav Bolkhovitin
2009-07-16 14:54 ` Ronald Moesbergen
2009-07-16 16:03 ` Vladislav Bolkhovitin
2009-07-17 14:15 ` Ronald Moesbergen
2009-07-17 18:23 ` Vladislav Bolkhovitin
2009-07-20 7:20 ` Vladislav Bolkhovitin
2009-07-22 8:44 ` Ronald Moesbergen
2009-07-27 13:11 ` Vladislav Bolkhovitin
2009-07-28 9:51 ` Ronald Moesbergen
2009-07-28 19:07 ` Vladislav Bolkhovitin
2009-07-29 12:48 ` Ronald Moesbergen
2009-07-31 18:32 ` Vladislav Bolkhovitin
2009-08-03 9:15 ` Ronald Moesbergen
2009-08-03 9:20 ` Vladislav Bolkhovitin
2009-08-03 11:44 ` Ronald Moesbergen
2009-07-15 20:52 ` Kurt Garloff
2009-07-16 10:38 ` Vladislav Bolkhovitin
2009-06-30 10:22 ` Vladislav Bolkhovitin
2009-06-29 10:55 ` Vladislav Bolkhovitin
2009-06-29 13:00 ` Wu Fengguang
2009-09-22 20:58 ` Andrew Morton
-- strict thread matches above, loose matches on Subject: below --
2009-05-22 0:09 [RESEND][PATCH] " Hisashi Hifumi
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20090629132841.GA26171@localhost \
--to=fengguang.wu@intel.com \
--cc=Alan.Brunelle@hp.com \
--cc=akpm@linux-foundation.org \
--cc=bart.vanassche@gmail.com \
--cc=hifumi.hisashi@oss.ntt.co.jp \
--cc=intercommit@gmail.com \
--cc=jens.axboe@oracle.com \
--cc=kosaki.motohiro@jp.fujitsu.com \
--cc=linux-fsdevel@vger.kernel.org \
--cc=linux-kernel@vger.kernel.org \
--cc=randy.dunlap@oracle.com \
--cc=vst@vlnb.net \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).