public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed
From: Andrew Morton <akpm@osdl.org>
To: linux@horizon.com
Cc: linux-kernel@vger.kernel.org, sct@redhat.com, linux@horizon.com
Subject: Re: msync() behaviour broken for MS_ASYNC, revert patch?
Date: Thu, 9 Feb 2006 00:18:50 -0800	[thread overview]
Message-ID: <20060209001850.18ca135f.akpm@osdl.org> (raw)
In-Reply-To: <20060209071832.10500.qmail@science.horizon.com>

linux@horizon.com wrote:
>
> Sorry to bring this up only after a 2-year hiatus, but I'm trying to
>  port an application from Solaris and Linux 2.4 to 2.6 and finding amazing
>  performance regression due to this.  (For referemce, as of 2.5.something,
>  msync(MS_ASYNC) just puts the pages on a dirty list but doesn't actually
>  start the I/O until some fairly coarse timer in the VM system fires.)
> 
>  It uses msync(MS_ASYNC) and msync(MS_SYNC) as a poor man's portable
>  async IO.  It's appending data to an on-disk log.  When a page is full
>  or a transaction complete, the page will not be modified any further and
>  it uses MS_ASYNC to start the I/O as early as possible.  (When compiled
>  with debugging, it also uses remaps the page read-only.)
> 
>  Then it accomplishes as much as it can without needing the transaction
>  committed (typically 25-100 ms), and when it's blocked until the
>  transaction is known to be durable, it calls msync(MS_SYNC).  Which
>  should, if everything went right, return immediately, because the page
>  is clean.

2.4:

	MS_ASYNC: dirty the pagecache pages, start I/O
	MS_SYNC: dirty the pagecache pages, start I/O, wait on I/O

2.6:

	MS_ASYNC: dirty the pagecache pages
	MS_SYNC: dirty the pagecache pages, start I/O, wait on I/O.

So you're saying that doing the I/O in that 25-100msec window allowed your
app to do more pipelining.

I think for most scenarios, what we have in 2.6 is better: it gives the app
more control over when the I/O should be started.  But not for you, because
you have this handy 25-100ms window in which to do other stuff, which
eliminates the need to create a new thread to do the I/O.

Something like this?   (Needs a triple-check).



Add two new linux-specific fadvise extensions():

LINUX_FADV_ASYNC_WRITE: start async writeout of any dirty pages between file
offsets `offset' and `offset+len'.

LINUX_FADV_SYNC_WRITE: start and wait upon writeout of any dirty pages between
file offsets `offset' and `offset+len'.

The patch also regularises the filemap_write_and_wait_range() API.  Make it
look like the __filemap_fdatawrite_range() one: the `end' argument points at
the first byte beyond the range being written.


Signed-off-by: Andrew Morton <akpm@osdl.org>
---

 fs/direct-io.c          |    2 +-
 include/linux/fadvise.h |    6 ++++++
 include/linux/fs.h      |    3 +++
 mm/fadvise.c            |   13 +++++++++++--
 mm/filemap.c            |   18 +++++++++++-------
 5 files changed, 32 insertions(+), 10 deletions(-)

diff -puN mm/fadvise.c~fadvise-async-write-commands mm/fadvise.c
--- devel/mm/fadvise.c~fadvise-async-write-commands	2006-02-08 23:55:42.000000000 -0800
+++ devel-akpm/mm/fadvise.c	2006-02-09 00:16:58.000000000 -0800
@@ -15,6 +15,7 @@
 #include <linux/backing-dev.h>
 #include <linux/pagevec.h>
 #include <linux/fadvise.h>
+#include <linux/writeback.h>
 #include <linux/syscalls.h>
 
 #include <asm/unistd.h>
@@ -96,11 +97,19 @@ asmlinkage long sys_fadvise64_64(int fd,
 			filemap_flush(mapping);
 
 		/* First and last FULL page! */
-		start_index = (offset + (PAGE_CACHE_SIZE-1)) >> PAGE_CACHE_SHIFT;
+		start_index = (offset+(PAGE_CACHE_SIZE-1)) >> PAGE_CACHE_SHIFT;
 		end_index = (endbyte >> PAGE_CACHE_SHIFT);
 
 		if (end_index > start_index)
-			invalidate_mapping_pages(mapping, start_index, end_index-1);
+			invalidate_mapping_pages(mapping, start_index,
+						end_index - 1);
+		break;
+	case LINUX_FADV_ASYNC_WRITE:
+		ret = __filemap_fdatawrite_range(mapping, offset, endbyte,
+						WB_SYNC_NONE);
+		break;
+	case LINUX_FADV_SYNC_WRITE:
+		ret = filemap_write_and_wait_range(mapping, offset, endbyte);
 		break;
 	default:
 		ret = -EINVAL;
diff -puN include/linux/fadvise.h~fadvise-async-write-commands include/linux/fadvise.h
--- devel/include/linux/fadvise.h~fadvise-async-write-commands	2006-02-08 23:55:42.000000000 -0800
+++ devel-akpm/include/linux/fadvise.h	2006-02-08 23:56:55.000000000 -0800
@@ -18,4 +18,10 @@
 #define POSIX_FADV_NOREUSE	5 /* Data will be accessed once.  */
 #endif
 
+/*
+ * Linux-specific fadvise() extensions:
+ */
+#define LINUX_FADV_ASYNC_WRITE	32	/* Start writeout on range */
+#define LINUX_FADV_SYNC_WRITE	33	/* Write out and wait upon range */
+
 #endif	/* FADVISE_H_INCLUDED */
diff -puN mm/filemap.c~fadvise-async-write-commands mm/filemap.c
--- devel/mm/filemap.c~fadvise-async-write-commands	2006-02-08 23:59:01.000000000 -0800
+++ devel-akpm/mm/filemap.c	2006-02-09 00:10:40.000000000 -0800
@@ -174,7 +174,8 @@ static int sync_page(void *word)
  * dirty pages that lie within the byte offsets <start, end>
  * @mapping:	address space structure to write
  * @start:	offset in bytes where the range starts
- * @end:	offset in bytes where the range ends
+ * @end:	offset in bytes where the range ends (+1: we write end-start
+ *		bytes)
  * @sync_mode:	enable synchronous operation
  *
  * If sync_mode is WB_SYNC_ALL then this is a "data integrity" operation, as
@@ -182,8 +183,8 @@ static int sync_page(void *word)
  * these two operations is that if a dirty page/buffer is encountered, it must
  * be waited upon, and not just skipped over.
  */
-static int __filemap_fdatawrite_range(struct address_space *mapping,
-	loff_t start, loff_t end, int sync_mode)
+int __filemap_fdatawrite_range(struct address_space *mapping, loff_t start,
+				loff_t end, int sync_mode)
 {
 	int ret;
 	struct writeback_control wbc = {
@@ -212,8 +213,8 @@ int filemap_fdatawrite(struct address_sp
 }
 EXPORT_SYMBOL(filemap_fdatawrite);
 
-static int filemap_fdatawrite_range(struct address_space *mapping,
-	loff_t start, loff_t end)
+static int filemap_fdatawrite_range(struct address_space *mapping, loff_t start,
+				loff_t end)
 {
 	return __filemap_fdatawrite_range(mapping, start, end, WB_SYNC_ALL);
 }
@@ -367,19 +368,22 @@ int filemap_write_and_wait(struct addres
 }
 EXPORT_SYMBOL(filemap_write_and_wait);
 
+/*
+ * Write out and wait upon all the bytes between lstart and (lend-1)
+ */
 int filemap_write_and_wait_range(struct address_space *mapping,
 				 loff_t lstart, loff_t lend)
 {
 	int err = 0;
 
-	if (mapping->nrpages) {
+	if (mapping->nrpages && lend > lstart) {
 		err = __filemap_fdatawrite_range(mapping, lstart, lend,
 						 WB_SYNC_ALL);
 		/* See comment of filemap_write_and_wait() */
 		if (err != -EIO) {
 			int err2 = wait_on_page_writeback_range(mapping,
 						lstart >> PAGE_CACHE_SHIFT,
-						lend >> PAGE_CACHE_SHIFT);
+						(lend - 1) >> PAGE_CACHE_SHIFT);
 			if (!err)
 				err = err2;
 		}
diff -puN include/linux/fs.h~fadvise-async-write-commands include/linux/fs.h
--- devel/include/linux/fs.h~fadvise-async-write-commands	2006-02-08 23:59:24.000000000 -0800
+++ devel-akpm/include/linux/fs.h	2006-02-09 00:03:22.000000000 -0800
@@ -1476,6 +1476,9 @@ extern int filemap_fdatawait(struct addr
 extern int filemap_write_and_wait(struct address_space *mapping);
 extern int filemap_write_and_wait_range(struct address_space *mapping,
 				        loff_t lstart, loff_t lend);
+extern int __filemap_fdatawrite_range(struct address_space *mapping,
+				loff_t start, loff_t end, int sync_mode);
+
 extern void sync_supers(void);
 extern void sync_filesystems(int wait);
 extern void emergency_sync(void);
diff -puN fs/direct-io.c~fadvise-async-write-commands fs/direct-io.c
--- devel/fs/direct-io.c~fadvise-async-write-commands	2006-02-09 00:09:54.000000000 -0800
+++ devel-akpm/fs/direct-io.c	2006-02-09 00:10:06.000000000 -0800
@@ -1240,7 +1240,7 @@ __blockdev_direct_IO(int rw, struct kioc
 			}
 
 			retval = filemap_write_and_wait_range(mapping, offset,
-							      end - 1);
+							      end);
 			if (retval) {
 				kfree(dio);
 				goto out;
_


  reply	other threads:[~2006-02-09  8:24 UTC|newest]

Thread overview: 79+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2006-02-09  7:18 msync() behaviour broken for MS_ASYNC, revert patch? linux
2006-02-09  8:18 ` Andrew Morton [this message]
2006-02-09  8:35   ` Nick Piggin
2006-02-09  8:42     ` Andrew Morton
2006-02-09 12:38       ` Nick Piggin
2006-02-09 12:39       ` Nick Piggin
2006-02-09 17:48         ` Andrew Morton
2006-02-10  3:36           ` Nick Piggin
2006-02-10  3:50             ` Andrew Morton
2006-02-10  3:57               ` Nick Piggin
2006-02-10  4:13                 ` Andrew Morton
2006-02-10  4:30                   ` Nick Piggin
2006-02-10  4:43                     ` Andrew Morton
2006-02-10  4:52                       ` Nick Piggin
2006-02-10  5:13                         ` Andrew Morton
2006-02-10  5:29                           ` Nick Piggin
2006-02-10  5:50                             ` Andrew Morton
2006-02-10  6:03                               ` Nick Piggin
2006-02-10  6:13                                 ` Andrew Morton
2006-02-10  6:31                                   ` Nick Piggin
2006-02-10  6:46                                     ` Andrew Morton
2006-02-10  6:57                                       ` Nick Piggin
2006-02-10  7:14                                         ` Andrew Morton
2006-02-10 12:41                                           ` Nick Piggin
2006-02-10 16:19                                             ` Linus Torvalds
2006-02-10 17:00                                               ` Nick Piggin
2006-02-10 17:12                                                 ` Linus Torvalds
2006-02-10 17:35                                                   ` Linus Torvalds
2006-02-10 17:59                                                   ` Nick Piggin
2006-02-10 18:55                                                     ` Linus Torvalds
2006-02-10 19:29                                                       ` Nick Piggin
2006-02-10 19:44                                                         ` Linus Torvalds
2006-02-10 19:52                                                           ` Nick Piggin
2006-02-10 20:03                                                             ` Linus Torvalds
2006-02-11  5:49                                                               ` Nick Piggin
2006-02-10 16:05                                         ` Linus Torvalds
2006-02-10 16:37                                           ` Nick Piggin
2006-02-10 17:03                                             ` Linus Torvalds
2006-02-10 17:37                                               ` Nick Piggin
2006-02-10 18:01                                                 ` Linus Torvalds
2006-02-10 18:38                                                   ` Nick Piggin
2006-02-10 19:05                                                     ` Linus Torvalds
2006-02-10 19:34                                                       ` Oliver Neukum
2006-02-10 19:59                                                         ` Linus Torvalds
2006-02-10 20:11                                                           ` Andrew Morton
2006-02-10 21:15                                                             ` Linus Torvalds
2006-02-10 21:28                                                               ` Andrew Morton
2006-02-10 20:03                                                       ` Nick Piggin
2006-02-10 21:10                                                         ` Linus Torvalds
2006-02-10 21:55                                                           ` Trond Myklebust
2006-02-10 22:46                                                             ` Linus Torvalds
2006-02-10 23:02                                                               ` Trond Myklebust
2006-02-10 23:15                                                                 ` Linus Torvalds
2006-02-11 19:07                                                                   ` Trond Myklebust
2006-02-10 17:29                                           ` linux
2006-02-10 17:42                                             ` Linus Torvalds
2006-02-10 18:57                                               ` Nick Piggin
2006-02-10  8:00                                       ` linux
2006-02-10 13:18                                         ` Nick Piggin
2006-02-10  7:15                   ` linux
2006-02-10  7:28                     ` Andrew Morton
2006-02-09 11:18   ` linux
  -- strict thread matches above, loose matches on Subject: below --
2004-03-31 22:16 Stephen C. Tweedie
2004-03-31 22:37 ` Linus Torvalds
2004-03-31 23:41   ` Stephen C. Tweedie
2004-04-01  0:08     ` Linus Torvalds
2004-04-01  0:30       ` Andrew Morton
2004-04-01 15:40       ` Stephen C. Tweedie
2004-04-01 16:02         ` Linus Torvalds
2004-04-01 16:33           ` Stephen C. Tweedie
2004-04-01 16:19         ` Jamie Lokier
2004-04-01 16:57           ` Stephen C. Tweedie
2004-04-01 18:51         ` Andrew Morton
2004-03-31 22:53 ` Andrew Morton
2004-03-31 23:20   ` Stephen C. Tweedie
2004-04-16 22:35 ` Jamie Lokier
2004-04-19 21:54   ` Stephen C. Tweedie
2004-04-21  2:10     ` Jamie Lokier
2004-04-21  9:52       ` Stephen C. Tweedie

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20060209001850.18ca135f.akpm@osdl.org \
    --to=akpm@osdl.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux@horizon.com \
    --cc=sct@redhat.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox