From: Rusty Russell <rusty@rustcorp.com.au>
To: "Jörn Engel" <joern@logfs.org>
Cc: Jens Axboe <jens.axboe@oracle.com>,
	Peter Zijlstra <a.p.zijlstra@chello.nl>,
	linux-kernel <linux-kernel@vger.kernel.org>,
	Fengguang Wu <wfg@mail.ustc.edu.cn>, riel <riel@redhat.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	Tim Pepper <lnxninja@us.ibm.com>, Chris Snook <csnook@redhat.com>
Subject: Re: [PATCH 3/3] readahead: scale max readahead size depending on memory size
Date: Tue, 24 Jul 2007 08:44:21 +1000
Message-ID: <1185230661.1803.11.camel@localhost.localdomain>
In-Reply-To: <20070723100438.GA13963@lazybastard.org>

On Mon, 2007-07-23 at 12:04 +0200, Jörn Engel wrote:
> I believe this whole thing is fundamentally flawed.  The perfect
> readahead size depends on the device in question.  If we set a single
> system-wide value depending on memory size, it may easily be too small
> and too large at the same time.  Think hard disk and SSD.

Well, I think the filesystem should get the first shot at altering the
readahead heuristic (we should add a hook so it can then ask the
underlying device).

Something like:
===
Hook for filesystems to tweak readahead logic

It was suggested at Fengguang's OLS talk that filesystems can make a
useful contribution to the readahead logic: avoiding seeks and lock
boundaries (on cluster filesystems) and other such filesystem-specific
things.

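The alter_readahead hook is passed the file, the requested offset and
size, and the proposed readahead start and size (all in pages).  It
should return the number of pages to read.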

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn>

---
 Documentation/filesystems/vfs.txt |   10 ++++++----
 include/linux/fs.h                |    1 +
 mm/readahead.c                    |   12 ++++++++++++
 3 files changed, 19 insertions(+), 4 deletions(-)

--- linux-2.6.22-rc6-mm1.orig/include/linux/fs.h
+++ linux-2.6.22-rc6-mm1/include/linux/fs.h
@@ -1204,7 +1204,8 @@ struct file_operations {
 	ssize_t (*splice_write)(struct pipe_inode_info *, struct file *, loff_t *, size_t, unsigned int);
 	ssize_t (*splice_read)(struct file *, loff_t *, struct pipe_inode_info *, size_t, unsigned int);
 	int (*revoke)(struct file *, struct address_space *);
+	unsigned long (*alter_readahead)(struct file *, pgoff_t, unsigned long, pgoff_t, unsigned long);
 };
 
 struct inode_operations {
--- linux-2.6.22-rc6-mm1.orig/mm/readahead.c
+++ linux-2.6.22-rc6-mm1/mm/readahead.c
@@ -394,7 +394,19 @@ ondemand_readahead(struct address_space 
 	}
 
 readit:
+	/*
+	 * Give the filesystem a chance to tweak the readahead size (e.g.
+	 * to align it to a contiguous block boundary).  It should probably
+	 * not change the size by more than 50%, otherwise it's really
+	 * ignoring our readahead advice.
+	 */
+	if (filp->f_op->alter_readahead) {
+		ra->size = filp->f_op->alter_readahead(filp, offset, req_size,
+						       ra->start, ra->size);
+		if (ra->async_size > ra->size)
+			ra->async_size = ra->size;
+	}
 	return ra_submit(ra, mapping, filp);
 }
 
--- linux-2.6.22-rc6-mm1.orig/Documentation/filesystems/vfs.txt
+++ linux-2.6.22-rc6-mm1/Documentation/filesystems/vfs.txt
@@ -777,11 +777,10 @@ struct file_operations {
 	int (*check_flags)(int);
 	int (*dir_notify)(struct file *filp, unsigned long arg);
 	int (*flock) (struct file *, int, struct file_lock *);
-	ssize_t (*splice_write)(struct pipe_inode_info *, struct file *, size_t, unsigned 
-int);
-	ssize_t (*splice_read)(struct file *, struct pipe_inode_info *, size_t, unsigned  
-int);
+	ssize_t (*splice_write)(struct pipe_inode_info *, struct file *, size_t, unsigned int);
+	ssize_t (*splice_read)(struct file *, struct pipe_inode_info *, size_t, unsigned int);
 	int (*revoke)(struct file *);
+	unsigned long (*alter_readahead)(struct file *, pgoff_t, unsigned long, pgoff_t, unsigned long);
 };
 
 Again, all methods are called without any locks being held, unless
@@ -859,6 +858,9 @@ otherwise noted.
 	  to an open file. This method must ensure that all currently blocked
 	  writes are flushed and reads will fail.
 
+  alter_readahead: called when the readahead logic is about to submit a
+                   readahead request for I/O
+
 Note that the file operations are implemented by the specific
 filesystem in which the inode resides. When opening a device node
 (character or block special) most filesystems will call special
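
For illustration only, here is a rough userspace sketch of what a
filesystem-side hook might look like, plus the caller-side clamp from
the readahead.c hunk above.  Everything here other than the hook
signature and the start/size/async_size fields is an assumption made up
for the example: the stand-in struct file, its extent_pages field, the
names example_alter_readahead() and example_apply_hook(), and the
round-to-nearest-extent policy.  It is not taken from any real
filesystem.

```c
#include <assert.h>

/* Userspace stand-ins for kernel types; assumptions for illustration. */
typedef unsigned long pgoff_t;

struct file {
	unsigned long extent_pages;	/* hypothetical: pages per contiguous extent */
};

/* Only the three fields the hunk above touches. */
struct file_ra_state {
	pgoff_t start;
	unsigned long size;
	unsigned long async_size;
};

/*
 * Hypothetical hook with the proposed signature.  Moves the end of the
 * readahead window to the nearest extent boundary, but keeps the
 * suggested size if that would change it by more than 50%.
 */
static unsigned long example_alter_readahead(struct file *filp,
					     pgoff_t offset,
					     unsigned long req_size,
					     pgoff_t ra_start,
					     unsigned long ra_size)
{
	unsigned long extent = filp->extent_pages;
	unsigned long min_size = ra_size - ra_size / 2;
	unsigned long max_size = ra_size + ra_size / 2;
	unsigned long end, aligned_end, new_size;

	(void)offset;		/* unused in this simple policy */
	(void)req_size;

	assert(extent > 0);

	/* Round the window end to the nearest multiple of the extent size. */
	end = ra_start + ra_size;
	aligned_end = ((end + extent / 2) / extent) * extent;
	if (aligned_end <= ra_start)
		return ra_size;		/* rounding would empty the window */

	new_size = aligned_end - ra_start;
	if (new_size < min_size || new_size > max_size)
		return ra_size;		/* more than 50% off: keep the advice */

	return new_size;
}

/* Mirrors the caller-side logic added at the readit: label. */
static void example_apply_hook(struct file *filp, struct file_ra_state *ra,
			       pgoff_t offset, unsigned long req_size)
{
	ra->size = example_alter_readahead(filp, offset, req_size,
					   ra->start, ra->size);
	if (ra->async_size > ra->size)
		ra->async_size = ra->size;
}
```

With 16-page extents, a suggested window of 20 pages starting at page 0
would be trimmed to 16 pages, and an async_size of 18 would be clamped
to 16 to match.  A suggested window of 9 pages would be left alone,
since growing it to 16 is more than a 50% change.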



