From: Andrew Morton <akpm@osdl.org>
To: linux@horizon.com
Cc: linux-kernel@vger.kernel.org, sct@redhat.com, linux@horizon.com
Subject: Re: msync() behaviour broken for MS_ASYNC, revert patch?
Date: Thu, 9 Feb 2006 00:18:50 -0800 [thread overview]
Message-ID: <20060209001850.18ca135f.akpm@osdl.org> (raw)
In-Reply-To: <20060209071832.10500.qmail@science.horizon.com>
linux@horizon.com wrote:
>
> Sorry to bring this up only after a 2-year hiatus, but I'm trying to
> port an application from Solaris and Linux 2.4 to 2.6 and finding amazing
> performance regression due to this. (For referemce, as of 2.5.something,
> msync(MS_ASYNC) just puts the pages on a dirty list but doesn't actually
> start the I/O until some fairly coarse timer in the VM system fires.)
>
> It uses msync(MS_ASYNC) and msync(MS_SYNC) as a poor man's portable
> async IO. It's appending data to an on-disk log. When a page is full
> or a transaction complete, the page will not be modified any further and
> it uses MS_ASYNC to start the I/O as early as possible. (When compiled
> with debugging, it also uses remaps the page read-only.)
>
> Then it accomplishes as much as it can without needing the transaction
> committed (typically 25-100 ms), and when it's blocked until the
> transaction is known to be durable, it calls msync(MS_SYNC). Which
> should, if everything went right, return immediately, because the page
> is clean.
2.4:
MS_ASYNC: dirty the pagecache pages, start I/O
MS_SYNC: dirty the pagecache pages, start I/O, wait on I/O
2.6:
MS_ASYNC: dirty the pagecache pages
MS_SYNC: dirty the pagecache pages, start I/O, wait on I/O.
So you're saying that doing the I/O in that 25-100msec window allowed your
app to do more pipelining.
I think for most scenarios, what we have in 2.6 is better: it gives the app
more control over when the I/O should be started. But not for you, because
you have this handy 25-100ms window in which to do other stuff, which
eliminates the need to create a new thread to do the I/O.
Something like this? (Needs a triple-check).
Add two new linux-specific fadvise extensions():
LINUX_FADV_ASYNC_WRITE: start async writeout of any dirty pages between file
offsets `offset' and `offset+len'.
LINUX_FADV_SYNC_WRITE: start and wait upon writeout of any dirty pages between
file offsets `offset' and `offset+len'.
The patch also regularises the filemap_write_and_wait_range() API. Make it
look like the __filemap_fdatawrite_range() one: the `end' argument points at
the first byte beyond the range being written.
Signed-off-by: Andrew Morton <akpm@osdl.org>
---
fs/direct-io.c | 2 +-
include/linux/fadvise.h | 6 ++++++
include/linux/fs.h | 3 +++
mm/fadvise.c | 13 +++++++++++--
mm/filemap.c | 18 +++++++++++-------
5 files changed, 32 insertions(+), 10 deletions(-)
diff -puN mm/fadvise.c~fadvise-async-write-commands mm/fadvise.c
--- devel/mm/fadvise.c~fadvise-async-write-commands 2006-02-08 23:55:42.000000000 -0800
+++ devel-akpm/mm/fadvise.c 2006-02-09 00:16:58.000000000 -0800
@@ -15,6 +15,7 @@
#include <linux/backing-dev.h>
#include <linux/pagevec.h>
#include <linux/fadvise.h>
+#include <linux/writeback.h>
#include <linux/syscalls.h>
#include <asm/unistd.h>
@@ -96,11 +97,19 @@ asmlinkage long sys_fadvise64_64(int fd,
filemap_flush(mapping);
/* First and last FULL page! */
- start_index = (offset + (PAGE_CACHE_SIZE-1)) >> PAGE_CACHE_SHIFT;
+ start_index = (offset+(PAGE_CACHE_SIZE-1)) >> PAGE_CACHE_SHIFT;
end_index = (endbyte >> PAGE_CACHE_SHIFT);
if (end_index > start_index)
- invalidate_mapping_pages(mapping, start_index, end_index-1);
+ invalidate_mapping_pages(mapping, start_index,
+ end_index - 1);
+ break;
+ case LINUX_FADV_ASYNC_WRITE:
+ ret = __filemap_fdatawrite_range(mapping, offset, endbyte,
+ WB_SYNC_NONE);
+ break;
+ case LINUX_FADV_SYNC_WRITE:
+ ret = filemap_write_and_wait_range(mapping, offset, endbyte);
break;
default:
ret = -EINVAL;
diff -puN include/linux/fadvise.h~fadvise-async-write-commands include/linux/fadvise.h
--- devel/include/linux/fadvise.h~fadvise-async-write-commands 2006-02-08 23:55:42.000000000 -0800
+++ devel-akpm/include/linux/fadvise.h 2006-02-08 23:56:55.000000000 -0800
@@ -18,4 +18,10 @@
#define POSIX_FADV_NOREUSE 5 /* Data will be accessed once. */
#endif
+/*
+ * Linux-specific fadvise() extensions:
+ */
+#define LINUX_FADV_ASYNC_WRITE 32 /* Start writeout on range */
+#define LINUX_FADV_SYNC_WRITE 33 /* Write out and wait upon range */
+
#endif /* FADVISE_H_INCLUDED */
diff -puN mm/filemap.c~fadvise-async-write-commands mm/filemap.c
--- devel/mm/filemap.c~fadvise-async-write-commands 2006-02-08 23:59:01.000000000 -0800
+++ devel-akpm/mm/filemap.c 2006-02-09 00:10:40.000000000 -0800
@@ -174,7 +174,8 @@ static int sync_page(void *word)
* dirty pages that lie within the byte offsets <start, end>
* @mapping: address space structure to write
* @start: offset in bytes where the range starts
- * @end: offset in bytes where the range ends
+ * @end: offset in bytes where the range ends (+1: we write end-start
+ * bytes)
* @sync_mode: enable synchronous operation
*
* If sync_mode is WB_SYNC_ALL then this is a "data integrity" operation, as
@@ -182,8 +183,8 @@ static int sync_page(void *word)
* these two operations is that if a dirty page/buffer is encountered, it must
* be waited upon, and not just skipped over.
*/
-static int __filemap_fdatawrite_range(struct address_space *mapping,
- loff_t start, loff_t end, int sync_mode)
+int __filemap_fdatawrite_range(struct address_space *mapping, loff_t start,
+ loff_t end, int sync_mode)
{
int ret;
struct writeback_control wbc = {
@@ -212,8 +213,8 @@ int filemap_fdatawrite(struct address_sp
}
EXPORT_SYMBOL(filemap_fdatawrite);
-static int filemap_fdatawrite_range(struct address_space *mapping,
- loff_t start, loff_t end)
+static int filemap_fdatawrite_range(struct address_space *mapping, loff_t start,
+ loff_t end)
{
return __filemap_fdatawrite_range(mapping, start, end, WB_SYNC_ALL);
}
@@ -367,19 +368,22 @@ int filemap_write_and_wait(struct addres
}
EXPORT_SYMBOL(filemap_write_and_wait);
+/*
+ * Write out and wait upon all the bytes between lstart and (lend-1)
+ */
int filemap_write_and_wait_range(struct address_space *mapping,
loff_t lstart, loff_t lend)
{
int err = 0;
- if (mapping->nrpages) {
+ if (mapping->nrpages && lend > lstart) {
err = __filemap_fdatawrite_range(mapping, lstart, lend,
WB_SYNC_ALL);
/* See comment of filemap_write_and_wait() */
if (err != -EIO) {
int err2 = wait_on_page_writeback_range(mapping,
lstart >> PAGE_CACHE_SHIFT,
- lend >> PAGE_CACHE_SHIFT);
+ (lend - 1) >> PAGE_CACHE_SHIFT);
if (!err)
err = err2;
}
diff -puN include/linux/fs.h~fadvise-async-write-commands include/linux/fs.h
--- devel/include/linux/fs.h~fadvise-async-write-commands 2006-02-08 23:59:24.000000000 -0800
+++ devel-akpm/include/linux/fs.h 2006-02-09 00:03:22.000000000 -0800
@@ -1476,6 +1476,9 @@ extern int filemap_fdatawait(struct addr
extern int filemap_write_and_wait(struct address_space *mapping);
extern int filemap_write_and_wait_range(struct address_space *mapping,
loff_t lstart, loff_t lend);
+extern int __filemap_fdatawrite_range(struct address_space *mapping,
+ loff_t start, loff_t end, int sync_mode);
+
extern void sync_supers(void);
extern void sync_filesystems(int wait);
extern void emergency_sync(void);
diff -puN fs/direct-io.c~fadvise-async-write-commands fs/direct-io.c
--- devel/fs/direct-io.c~fadvise-async-write-commands 2006-02-09 00:09:54.000000000 -0800
+++ devel-akpm/fs/direct-io.c 2006-02-09 00:10:06.000000000 -0800
@@ -1240,7 +1240,7 @@ __blockdev_direct_IO(int rw, struct kioc
}
retval = filemap_write_and_wait_range(mapping, offset,
- end - 1);
+ end);
if (retval) {
kfree(dio);
goto out;
_
next prev parent reply other threads:[~2006-02-09 8:24 UTC|newest]
Thread overview: 79+ messages / expand[flat|nested] mbox.gz Atom feed top
2006-02-09 7:18 msync() behaviour broken for MS_ASYNC, revert patch? linux
2006-02-09 8:18 ` Andrew Morton [this message]
2006-02-09 8:35 ` Nick Piggin
2006-02-09 8:42 ` Andrew Morton
2006-02-09 12:38 ` Nick Piggin
2006-02-09 12:39 ` Nick Piggin
2006-02-09 17:48 ` Andrew Morton
2006-02-10 3:36 ` Nick Piggin
2006-02-10 3:50 ` Andrew Morton
2006-02-10 3:57 ` Nick Piggin
2006-02-10 4:13 ` Andrew Morton
2006-02-10 4:30 ` Nick Piggin
2006-02-10 4:43 ` Andrew Morton
2006-02-10 4:52 ` Nick Piggin
2006-02-10 5:13 ` Andrew Morton
2006-02-10 5:29 ` Nick Piggin
2006-02-10 5:50 ` Andrew Morton
2006-02-10 6:03 ` Nick Piggin
2006-02-10 6:13 ` Andrew Morton
2006-02-10 6:31 ` Nick Piggin
2006-02-10 6:46 ` Andrew Morton
2006-02-10 6:57 ` Nick Piggin
2006-02-10 7:14 ` Andrew Morton
2006-02-10 12:41 ` Nick Piggin
2006-02-10 16:19 ` Linus Torvalds
2006-02-10 17:00 ` Nick Piggin
2006-02-10 17:12 ` Linus Torvalds
2006-02-10 17:35 ` Linus Torvalds
2006-02-10 17:59 ` Nick Piggin
2006-02-10 18:55 ` Linus Torvalds
2006-02-10 19:29 ` Nick Piggin
2006-02-10 19:44 ` Linus Torvalds
2006-02-10 19:52 ` Nick Piggin
2006-02-10 20:03 ` Linus Torvalds
2006-02-11 5:49 ` Nick Piggin
2006-02-10 16:05 ` Linus Torvalds
2006-02-10 16:37 ` Nick Piggin
2006-02-10 17:03 ` Linus Torvalds
2006-02-10 17:37 ` Nick Piggin
2006-02-10 18:01 ` Linus Torvalds
2006-02-10 18:38 ` Nick Piggin
2006-02-10 19:05 ` Linus Torvalds
2006-02-10 19:34 ` Oliver Neukum
2006-02-10 19:59 ` Linus Torvalds
2006-02-10 20:11 ` Andrew Morton
2006-02-10 21:15 ` Linus Torvalds
2006-02-10 21:28 ` Andrew Morton
2006-02-10 20:03 ` Nick Piggin
2006-02-10 21:10 ` Linus Torvalds
2006-02-10 21:55 ` Trond Myklebust
2006-02-10 22:46 ` Linus Torvalds
2006-02-10 23:02 ` Trond Myklebust
2006-02-10 23:15 ` Linus Torvalds
2006-02-11 19:07 ` Trond Myklebust
2006-02-10 17:29 ` linux
2006-02-10 17:42 ` Linus Torvalds
2006-02-10 18:57 ` Nick Piggin
2006-02-10 8:00 ` linux
2006-02-10 13:18 ` Nick Piggin
2006-02-10 7:15 ` linux
2006-02-10 7:28 ` Andrew Morton
2006-02-09 11:18 ` linux
-- strict thread matches above, loose matches on Subject: below --
2004-03-31 22:16 Stephen C. Tweedie
2004-03-31 22:37 ` Linus Torvalds
2004-03-31 23:41 ` Stephen C. Tweedie
2004-04-01 0:08 ` Linus Torvalds
2004-04-01 0:30 ` Andrew Morton
2004-04-01 15:40 ` Stephen C. Tweedie
2004-04-01 16:02 ` Linus Torvalds
2004-04-01 16:33 ` Stephen C. Tweedie
2004-04-01 16:19 ` Jamie Lokier
2004-04-01 16:57 ` Stephen C. Tweedie
2004-04-01 18:51 ` Andrew Morton
2004-03-31 22:53 ` Andrew Morton
2004-03-31 23:20 ` Stephen C. Tweedie
2004-04-16 22:35 ` Jamie Lokier
2004-04-19 21:54 ` Stephen C. Tweedie
2004-04-21 2:10 ` Jamie Lokier
2004-04-21 9:52 ` Stephen C. Tweedie
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20060209001850.18ca135f.akpm@osdl.org \
--to=akpm@osdl.org \
--cc=linux-kernel@vger.kernel.org \
--cc=linux@horizon.com \
--cc=sct@redhat.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox