From: "Vladimir V. Saveliev" <vs@clusterfs.com>
To: Andreas Dilger <adilger@clusterfs.com>
Cc: Valerie Henson <val.henson@gmail.com>,
Theodore Ts'o <tytso@mit.edu>, Ric Wheeler <ric@emc.com>,
linux-ext4 <linux-ext4@vger.kernel.org>
Subject: Re: Threaded readahead strawman
Date: Thu, 11 Oct 2007 20:41:14 +0300
Message-ID: <470E603A.2080203@clusterfs.com>
In-Reply-To: <20071011052736.GF8122@schatzie.adilger.int>
[-- Attachment #1: Type: text/plain, Size: 4396 bytes --]
Hello
Andreas Dilger wrote:
> On Oct 10, 2007 20:09 -0700, Valerie Henson wrote:
>> I need to get started on a mergeable version of the threaded readahead
>> patch for e2fsck. I intend for it to be compatible with Andreas'
>> sys_readahead() for block devices that support it. Here's a first
>> draft proposal - your thoughts? Note that it's not really that
>> anything is being read *ahead* per se, but that it's being read
>> simultaneously. Single-threaded readahead doesn't go any faster.
>
> We've been fiddling with this as well. I'd attach some patches but
> bugzilla is down as I write this :(. I also asked Vladimir (working on
> these patches) to forward them to you and the linux-ext4 mailing list.
>
The patch is attached.
If an application can foresee what it is going to read, it can call
io_channel_readahead for that data beforehand. Even if
io_channel_readahead is called right before the data is actually needed,
it may still help on multi-disk devices, because the reads can proceed
in parallel.
For example, using io_channel_readahead to read ahead the upcoming inode
tables in the done_group callback of ext2_inode_scan cut the inode table
scan in my local quick test from 34 seconds to 26 (on a two-disk IDE
RAID0).
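To make the done_group idea concrete, here is a minimal sketch that works
on a plain file descriptor instead of going through libext2fs. The names
prefetch_blocks and scan_groups, and the fixed "group" layout, are
assumptions made for illustration only; they are not part of the attached
patch. The hint for group g+1 is issued just before group g is read, so
the kernel may fetch the next table while the current one is processed.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif

#include <fcntl.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

/* Hint that `count` blocks starting at `block` will be needed soon.
 * Returns 0 on success or an error number, as posix_fadvise() does. */
static int prefetch_blocks(int fd, unsigned long block, int count,
			   int block_size)
{
	return posix_fadvise(fd, (off_t)block * block_size,
			     (off_t)count * block_size,
			     POSIX_FADV_WILLNEED);
}

/* Hypothetical scan: each "group" is `blocks_per_group` consecutive
 * blocks. Returns the total number of bytes read, or -1 on error. */
static long scan_groups(int fd, int ngroups, int blocks_per_group,
			int block_size)
{
	size_t len = (size_t)blocks_per_group * block_size;
	char *buf = malloc(len);
	long total = 0;
	int g;

	if (buf == NULL)
		return -1;
	for (g = 0; g < ngroups; g++) {
		ssize_t n;

		/* Ask for the next group before touching this one.
		 * The hint is advisory, so its result is ignored. */
		if (g + 1 < ngroups)
			(void)prefetch_blocks(fd,
				(unsigned long)(g + 1) * blocks_per_group,
				blocks_per_group, block_size);

		n = pread(fd, buf, len, (off_t)g * (off_t)len);
		if (n < 0) {
			free(buf);
			return -1;
		}
		total += n;	/* a real scanner would parse buf here */
	}
	free(buf);
	return total;
}
```

Whether this helps depends on the device: on a single disk the hint
mostly overlaps I/O with processing, while on a striped device it can
also keep more than one spindle busy.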
> We added a "readahead" method to the io_manager interface (no-op for
> Win/DOS) that can be used generically. This is currently done via
> posix_fadvise(POSIX_FADV_WILLNEED). We haven't done any multi-threading
> yet, but there is some hope that the block layer could sort it out?
> It would still be beneficial to have multiple user-space threads do
> the reading of the data, to get parallel memcpy() into userspace.
>
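The multi-threaded part is not in the attached patch. As a rough sketch
of what "multiple user-space threads doing the reading" could look like,
the following splits one request into contiguous slices and lets each
POSIX thread pread() its own slice, so the copies into userspace run in
parallel. The names read_job, read_worker and parallel_read are
hypothetical and only serve the illustration.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif

#include <pthread.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

/* One worker's share of the request. */
struct read_job {
	int fd;
	char *buf;	/* destination for this worker's slice */
	size_t len;	/* bytes to read */
	off_t offset;	/* file offset of the slice */
	int ok;		/* set to 1 once the whole slice was read */
};

static void *read_worker(void *arg)
{
	struct read_job *job = arg;
	size_t done = 0;

	while (done < job->len) {
		ssize_t n = pread(job->fd, job->buf + done,
				  job->len - done,
				  job->offset + (off_t)done);
		if (n <= 0)
			return NULL;	/* error or unexpected EOF */
		done += (size_t)n;
	}
	job->ok = 1;
	return NULL;
}

/* Read `len` bytes at `offset` into `buf` using `nthreads` workers.
 * Returns 0 on success, -1 on any error. */
static int parallel_read(int fd, char *buf, size_t len, off_t offset,
			 int nthreads)
{
	pthread_t *tid;
	struct read_job *job;
	size_t slice;
	int i, started = 0, ret = 0;

	if (nthreads < 1)
		return -1;
	tid = malloc((size_t)nthreads * sizeof(*tid));
	job = malloc((size_t)nthreads * sizeof(*job));
	if (tid == NULL || job == NULL) {
		free(tid);
		free(job);
		return -1;
	}
	slice = (len + (size_t)nthreads - 1) / (size_t)nthreads;

	for (i = 0; i < nthreads; i++) {
		size_t start = (size_t)i * slice;

		if (start > len)
			start = len;	/* trailing workers get nothing */
		job[i].fd = fd;
		job[i].ok = 0;
		job[i].buf = buf + start;
		job[i].offset = offset + (off_t)start;
		job[i].len = len - start < slice ? len - start : slice;
		if (pthread_create(&tid[i], NULL, read_worker, &job[i]) != 0) {
			ret = -1;
			break;
		}
		started++;
	}
	/* Wait for every thread that was actually started. */
	for (i = 0; i < started; i++) {
		pthread_join(tid[i], NULL);
		if (!job[i].ok)
			ret = -1;
	}
	free(tid);
	free(job);
	return ret;
}
```

pread() is used so that the workers do not share a file position. Build
with -pthread where the C library still requires it.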
>> The major global parameters to the system are:
>>
>> 1. Optimal number of concurrent requests - number of underlying read
>> heads times some N of best number of outstanding requests. Default to
>> one.
>>
>> 2. Stripe size, or more generally which areas can be read concurrently
>> and which cannot.
>
> There are new parameters in the superblock (s_raid_stride and
> s_raid_stripe_width) but as yet only s_raid_stride is initialized by
> mke2fs. There is a library in xfstools (libdisk or somesuch) that
> can get a lot more disk geometry info and it would be good to leverage
> that for mke2fs also.
>
>> 3. Maximum memory to use. We have to keep the readahead from
>> outrunning the actual processing (though so far, that hasn't been a
>> problem) and having bits of our buffer cache kicked out before they
>> are used. This can be set to some percentage of available memory by
>> default.
>
> Agreed. I'd proposed in the past that fsck could call fsck.{fstype}
> with a parameter like --expected-memory to determine the expected memory
> usage of fsck.{fstype} based on the filesystem geometry, and it could
> also supply --max-memory so we don't have parallel fscks stomping on
> each other.
>
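On parameter 2 (which areas can be read concurrently): for a plain RAID0
the answer follows from the stride alone. The sketch below assumes that
the stride is the number of filesystem blocks per chunk on one member
disk and that chunks are laid out round-robin; stripe_member and
can_read_concurrently are hypothetical helper names, not existing
libext2fs functions. Other RAID levels and layouts would need a
different mapping.

```c
/* Index of the RAID0 member disk that holds `block`, assuming chunks of
 * `stride` blocks distributed round-robin over `ndisks` disks.
 * Returns 0 if the geometry is unknown (stride or ndisks is 0). */
static unsigned int stripe_member(unsigned long block, unsigned int stride,
				  unsigned int ndisks)
{
	if (stride == 0 || ndisks == 0)
		return 0;
	return (unsigned int)((block / stride) % ndisks);
}

/* Two blocks can be fetched at the same time without competing for a
 * disk head when they live on different member disks. */
static int can_read_concurrently(unsigned long a, unsigned long b,
				 unsigned int stride, unsigned int ndisks)
{
	return stripe_member(a, stride, ndisks) !=
	       stripe_member(b, stride, ndisks);
}
```

With a stride of 16 blocks on two disks, blocks 0 and 16 land on
different members, while blocks 0 and 32 share one.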
>> I see two main ways to do this: One is a straightforward offset plus
>> size, telling it what to read. The other is to make libext2 do all
>> the interpretation of ondisk format, and design the interface in terms
>> of kinds of metadata to read. Given that libext2 functions like
>> ext2fs_get_next_inode_full() should be aware of what's going on in
>> readahead, this argues for a metadata-aware, in-library
>> implementation. Something like:
>>
>> /* Creates the threads, sets some variables. Returns a handle. */
>> handle = ext2fs_readahead_init(concurrent_requests, stripe_size, max_memory);
>>
>> /* Readahead inode tables and inode indirect blocks - can't really be
>> separated */
>> ext2fs_readahead_inodes(handle, fs);
>
> Well, there's something to be said for allowing the inode tables and
> corresponding bitmaps to be read in a single shot. Also, not all users
> require the indirect blocks, so I would make that an option.
>
>> /* Read the directory block list (pass 2) */
>> ext2fs_readahead_dblist(handle, fs);
>
> We're working on this as part of e2scan (in bug 13108 above), not sure if
> there is a patch available or not.
>
>> /* Read bitmaps (pass 5) */
>> ext2fs_readahead_bitmaps(handle, fs);
>
> This is a big one, because of the many seeks for small data read. Using
> the FLEX_BG feature (which is really a tiny kernel patch) could improve
> this many times.
>
> Cheers, Andreas
> --
> Andreas Dilger
> Principal Software Engineer
> Cluster File Systems, Inc.
>
>
[-- Attachment #2: e2fsprogs-add-io_channel_readahead.patch --]
[-- Type: text/x-patch, Size: 5136 bytes --]
This patch adds a "readahead" method to the io_manager interface.
Signed-off-by: Vladimir V. Saveliev <vs@clusterfs.com>
Index: e2fsprogs-1.40.2/lib/ext2fs/ext2_io.h
===================================================================
--- e2fsprogs-1.40.2.orig/lib/ext2fs/ext2_io.h
+++ e2fsprogs-1.40.2/lib/ext2fs/ext2_io.h
@@ -68,6 +68,8 @@ struct struct_io_manager {
 	errcode_t (*set_blksize)(io_channel channel, int blksize);
 	errcode_t (*read_blk)(io_channel channel, unsigned long block,
 			      int count, void *data);
+	errcode_t (*readahead)(io_channel channel, unsigned long block,
+			       int count);
 	errcode_t (*write_blk)(io_channel channel, unsigned long block,
 			       int count, const void *data);
 	errcode_t (*flush)(io_channel channel);
@@ -89,6 +91,7 @@ struct struct_io_manager {
 #define io_channel_close(c) ((c)->manager->close((c)))
 #define io_channel_set_blksize(c,s) ((c)->manager->set_blksize((c),s))
 #define io_channel_read_blk(c,b,n,d) ((c)->manager->read_blk((c),b,n,d))
+#define io_channel_readahead(c,b,n) ((c)->manager->readahead((c),b,n))
 #define io_channel_write_blk(c,b,n,d) ((c)->manager->write_blk((c),b,n,d))
 #define io_channel_flush(c) ((c)->manager->flush((c)))
 #define io_channel_bumpcount(c) ((c)->refcount++)
@@ -99,6 +102,8 @@ extern errcode_t io_channel_set_options(
 extern errcode_t io_channel_write_byte(io_channel channel,
 				       unsigned long offset,
 				       int count, const void *data);
+extern errcode_t readahead_noop(io_channel channel, unsigned long block,
+				int count);
 
 /* unix_io.c */
 extern io_manager unix_io_manager;
Index: e2fsprogs-1.40.2/lib/ext2fs/unix_io.c
===================================================================
--- e2fsprogs-1.40.2.orig/lib/ext2fs/unix_io.c
+++ e2fsprogs-1.40.2/lib/ext2fs/unix_io.c
@@ -15,6 +15,8 @@
  * %End-Header%
  */
 
+#define _XOPEN_SOURCE 600
+#define _FILE_OFFSET_BITS 64
 #define _LARGEFILE_SOURCE
 #define _LARGEFILE64_SOURCE
 
@@ -78,6 +80,8 @@ static errcode_t unix_close(io_channel c
 static errcode_t unix_set_blksize(io_channel channel, int blksize);
 static errcode_t unix_read_blk(io_channel channel, unsigned long block,
 			       int count, void *data);
+static errcode_t unix_readahead(io_channel channel, unsigned long block,
+				int count);
 static errcode_t unix_write_blk(io_channel channel, unsigned long block,
 				int count, const void *data);
 static errcode_t unix_flush(io_channel channel);
@@ -106,6 +110,7 @@ static struct struct_io_manager struct_u
 	unix_close,
 	unix_set_blksize,
 	unix_read_blk,
+	unix_readahead,
 	unix_write_blk,
 	unix_flush,
 #ifdef NEED_BOUNCE_BUFFER
@@ -611,6 +616,18 @@ static errcode_t unix_read_blk(io_channe
 #endif /* NO_IO_CACHE */
 }
 
+static errcode_t unix_readahead(io_channel channel, unsigned long block,
+				int count)
+{
+	struct unix_private_data *data;
+
+	data = (struct unix_private_data *)channel->private_data;
+	posix_fadvise(data->dev, (ext2_loff_t)block * channel->block_size,
+		      (ext2_loff_t)count * channel->block_size,
+		      POSIX_FADV_WILLNEED);
+	return 0;
+}
+
 static errcode_t unix_write_blk(io_channel channel, unsigned long block,
 				int count, const void *buf)
 {
Index: e2fsprogs-1.40.2/lib/ext2fs/inode_io.c
===================================================================
--- e2fsprogs-1.40.2.orig/lib/ext2fs/inode_io.c
+++ e2fsprogs-1.40.2/lib/ext2fs/inode_io.c
@@ -64,6 +64,7 @@ static struct struct_io_manager struct_i
 	inode_close,
 	inode_set_blksize,
 	inode_read_blk,
+	readahead_noop,
 	inode_write_blk,
 	inode_flush,
 	inode_write_byte
Index: e2fsprogs-1.40.2/lib/ext2fs/dosio.c
===================================================================
--- e2fsprogs-1.40.2.orig/lib/ext2fs/dosio.c
+++ e2fsprogs-1.40.2/lib/ext2fs/dosio.c
@@ -64,6 +64,7 @@ static struct struct_io_manager struct_d
 	dos_close,
 	dos_set_blksize,
 	dos_read_blk,
+	readahead_noop,
 	dos_write_blk,
 	dos_flush
 };
Index: e2fsprogs-1.40.2/lib/ext2fs/nt_io.c
===================================================================
--- e2fsprogs-1.40.2.orig/lib/ext2fs/nt_io.c
+++ e2fsprogs-1.40.2/lib/ext2fs/nt_io.c
@@ -236,6 +236,7 @@ static struct struct_io_manager struct_n
 	nt_close,
 	nt_set_blksize,
 	nt_read_blk,
+	readahead_noop,
 	nt_write_blk,
 	nt_flush
 };
Index: e2fsprogs-1.40.2/lib/ext2fs/test_io.c
===================================================================
--- e2fsprogs-1.40.2.orig/lib/ext2fs/test_io.c
+++ e2fsprogs-1.40.2/lib/ext2fs/test_io.c
@@ -74,6 +74,7 @@ static struct struct_io_manager struct_t
 	test_close,
 	test_set_blksize,
 	test_read_blk,
+	readahead_noop,
 	test_write_blk,
 	test_flush,
 	test_write_byte,
Index: e2fsprogs-1.40.2/lib/ext2fs/io_manager.c
===================================================================
--- e2fsprogs-1.40.2.orig/lib/ext2fs/io_manager.c
+++ e2fsprogs-1.40.2/lib/ext2fs/io_manager.c
@@ -67,3 +67,9 @@ errcode_t io_channel_write_byte(io_chann
 	return EXT2_ET_UNIMPLEMENTED;
 }
+
+errcode_t readahead_noop(io_channel channel, unsigned long block,
+			 int count)
+{
+	return 0;
+}