public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed
* msync() behaviour broken for MS_ASYNC, revert patch?
@ 2004-03-31 22:16 Stephen C. Tweedie
  2004-03-31 22:37 ` Linus Torvalds
                   ` (2 more replies)
  0 siblings, 3 replies; 80+ messages in thread
From: Stephen C. Tweedie @ 2004-03-31 22:16 UTC (permalink / raw)
  To: linux-mm, Andrew Morton; +Cc: linux-kernel, Stephen Tweedie, Ulrich Drepper

[-- Attachment #1: Type: text/plain, Size: 3170 bytes --]

Hi,

I've been looking at a discrepancy between msync() behaviour on 2.4.9
and newer 2.4 kernels, and it looks like things changed again in
2.5.68.  From the ChangeLog:

ChangeSet 1.971.76.156 2003/04/09 11:31:36 akpm@digeo.com
  [PATCH] Make msync(MS_ASYNC) no longer start the I/O
  
  MS_ASYNC will currently wait on previously-submitted I/O, then start new I/O
  and not wait on it.  This can cause undesirable blocking if msync is called
  rapidly against the same memory.
  
  So instead, change msync(MS_ASYNC) to not start any IO at all.  Just flush
  the pte dirty bits into the pageframe and leave it at that.
  
  The IO _will_ happen within a kupdate period.  And the application can use
  fsync() or fadvise(FADV_DONTNEED) if it actually wants to schedule the IO
  immediately.

Unfortunately, this seems to contradict SingleUnix requirements, which
state:

        When MS_ASYNC is specified, msync() shall return immediately
        once all the write operations are initiated or queued for
        servicing
        
although I can't find an unambiguous definition of "queued for service"
in the online standard.  I'm reading it as requiring that the I/O has
reached the block device layer, not simply that it has been marked dirty
for some future writeback pass to catch; Uli agrees with that
interpretation.

The comment that was added with this change in 2.5.68 states:

 * MS_ASYNC does not start I/O (it used to, up to 2.5.67).  Instead, it just
 * marks the relevant pages dirty.  The application may now run fsync() to
 * write out the dirty pages and wait on the writeout and check the result.
 * Or the application may run fadvise(FADV_DONTNEED) against the fd to start
 * async writeout immediately.
 * So my _not_ starting I/O in MS_ASYNC we provide complete flexibility to
 * applications.

but that's actually misleading --- in every kernel I've looked at as far
back as 2.4.9, the "new" behaviour of simply queueing the pages for
kupdated writeback is already available by specifying flags==0 for
msync(), so there's no additional flexibility added by making MS_ASYNC
do the same thing.  And FADV_DONTNEED doesn't make any guarantees that
anything gets queued for writeback: it only does a filemap writeback if
there's no congestion on the backing device.

So we're actually _more_ flexible, as well as more compliant, under the
old behaviour --- flags==0 gives the "defer to kupdated" behaviour,
MS_ASYNC guarantees that IOs have been queued, and MS_SYNC waits for
synchronised completion.

All that's really missing is documentation to define how Linux deals
with flags==0 as an extension to SingleUnix (which requires either
MS_SYNC, MS_ASYNC or MS_INVALIDATE to be set, and which does not define
a behaviour for flags==0.)
  
The 2.5.68 changeset also includes the comment:

  (This has triggered an ext3 bug - the page's buffers get dirtied so fast
  that kjournald keeps writing the buffers over and over for 10-20 seconds
  before deciding to give up for some reason)

Was that ever resolved?  If it's still there, I should have a look at it
if we're restoring the old trigger.

Patch below reverts the behaviour in 2.6.

--Stephen


[-- Attachment #2: msync-async-writeout.patch --]
[-- Type: text/plain, Size: 1935 bytes --]

--- linux-2.6-tmp/mm/msync.c.=K0000=.orig
+++ linux-2.6-tmp/mm/msync.c
@@ -127,13 +127,10 @@ static int filemap_sync(struct vm_area_s
 /*
  * MS_SYNC syncs the entire file - including mappings.
  *
- * MS_ASYNC does not start I/O (it used to, up to 2.5.67).  Instead, it just
- * marks the relevant pages dirty.  The application may now run fsync() to
- * write out the dirty pages and wait on the writeout and check the result.
- * Or the application may run fadvise(FADV_DONTNEED) against the fd to start
- * async writeout immediately.
- * So my _not_ starting I/O in MS_ASYNC we provide complete flexibility to
- * applications.
+ * MS_ASYNC once again starts I/O (it did not between 2.5.68 and 2.6.4.)
+ * SingleUnix requires it.  If an application wants to queue dirty pages
+ * for normal asychronous writeback, msync with flags==0 should achieve
+ * that on all kernels at least as far back as 2.4.
  */
 static int msync_interval(struct vm_area_struct * vma,
 	unsigned long start, unsigned long end, int flags)
@@ -147,20 +144,22 @@ static int msync_interval(struct vm_area
 	if (file && (vma->vm_flags & VM_SHARED)) {
 		ret = filemap_sync(vma, start, end-start, flags);
 
-		if (!ret && (flags & MS_SYNC)) {
+		if (!ret && (flags & (MS_SYNC|MS_ASYNC))) {
 			struct address_space *mapping = file->f_mapping;
 			int err;
 
 			down(&mapping->host->i_sem);
 			ret = filemap_fdatawrite(mapping);
-			if (file->f_op && file->f_op->fsync) {
-				err = file->f_op->fsync(file,file->f_dentry,1);
-				if (err && !ret)
+			if (flags & MS_SYNC) {
+				if (file->f_op && file->f_op->fsync) {
+					err = file->f_op->fsync(file, file->f_dentry, 1);
+					if (err && !ret)
+						ret = err;
+				}
+				err = filemap_fdatawait(mapping);
+				if (!ret)
 					ret = err;
 			}
-			err = filemap_fdatawait(mapping);
-			if (!ret)
-				ret = err;
 			up(&mapping->host->i_sem);
 		}
 	}

^ permalink raw reply	[flat|nested] 80+ messages in thread
* Re: msync() behaviour broken for MS_ASYNC, revert patch?
@ 2006-02-09  7:18 linux
  2006-02-09  8:18 ` Andrew Morton
  0 siblings, 1 reply; 80+ messages in thread
From: linux @ 2006-02-09  7:18 UTC (permalink / raw)
  To: akpm, linux-kernel, sct; +Cc: linux

Sorry to bring this up only after a 2-year hiatus, but I'm trying to
port an application from Solaris and Linux 2.4 to 2.6 and finding amazing
performance regression due to this.  (For referemce, as of 2.5.something,
msync(MS_ASYNC) just puts the pages on a dirty list but doesn't actually
start the I/O until some fairly coarse timer in the VM system fires.)

It uses msync(MS_ASYNC) and msync(MS_SYNC) as a poor man's portable
async IO.  It's appending data to an on-disk log.  When a page is full
or a transaction complete, the page will not be modified any further and
it uses MS_ASYNC to start the I/O as early as possible.  (When compiled
with debugging, it also uses remaps the page read-only.)

Then it accomplishes as much as it can without needing the transaction
committed (typically 25-100 ms), and when it's blocked until the
transaction is known to be durable, it calls msync(MS_SYNC).  Which
should, if everything went right, return immediately, because the page
is clean.

Reading the spec, this seems like exactly what msync() is designed for.

But looking at the 2.6 code, I see that it does't actually start the
write promptly, and makes the kooky and unusable suggestions to either
use fdatasync(fd), which will block on all the *following* transactions,
or fadvise(FADV_DONTNEED), which is emphatically lying to the kernel.
The data *is* needed in future (by the database *readers*), so discarding
it from memory is a stupid idea.  The only thing we don't need to do to
the page any more is write to it.


Now, I know I could research when async-IO support became reasonably
stable and have the code do async I/O when on a recent enough kernel,
but I really wonder who the genius was who managed to misunderstand
"initiated or queued for servicing" enough to think it involves sleeping
with an idle disk.

Yes, it's just timing, but I hadn't noticed Linux developers being
ivory-tower academic types who only care about correctness or big-O
performance measures.  In fact, "no, we can't prove it's starvation-free,
but damn it's fast on benchmarks!" is more of the attitude I've come
to expect.


Anyway, for future reference, Linux's current non-implementation of
msync(MS_ASYNC) is an outright BUG.  It "computes the correct result",
but totally buggers performance.


(Deep breath)  Now that I've finished complaining, I need to ask for
help.  I may be inspired to fix the kernel, but first I have to fix my
application, which has to run on existing Linux kernels.

Can anyone advise me on the best way to perform this sort of split-
transaction disk write on extant 2.6 kernels?  Preferably without using
glibc's pthread-based aio emulation?  Will O_DIRECT and !O_SYNC writes
do what I want?  Or will that interact bady with mmap()?

For my application, all transactions are completed in-order, so there is
never any question of which order to wait for completion.  My current
guess is that I'm going to have to call io_submit directly; is there
any documentation with more detail than the man pages but less than the
source code?  The former is silent on how the semantics of the various
IOCB_CMD_* opcodes, while the latter doesn't distinguish clearly between
the promises the interface is intended to keep and the properties of
the current implementation.

Thanks for any suggestions.

^ permalink raw reply	[flat|nested] 80+ messages in thread

end of thread, other threads:[~2006-02-11 19:07 UTC | newest]

Thread overview: 80+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2004-03-31 22:16 msync() behaviour broken for MS_ASYNC, revert patch? Stephen C. Tweedie
2004-03-31 22:37 ` Linus Torvalds
2004-03-31 23:41   ` Stephen C. Tweedie
2004-04-01  0:08     ` Linus Torvalds
2004-04-01  0:30       ` Andrew Morton
2004-04-01 15:40       ` Stephen C. Tweedie
2004-04-01 16:02         ` Linus Torvalds
2004-04-01 16:33           ` Stephen C. Tweedie
2004-04-01 16:19         ` Jamie Lokier
2004-04-01 16:56           ` s390 storage key inconsistency? [was Re: msync() behaviour broken for MS_ASYNC, revert patch?] Stephen C. Tweedie
2004-04-01 16:57           ` msync() behaviour broken for MS_ASYNC, revert patch? Stephen C. Tweedie
2004-04-01 18:51         ` Andrew Morton
2004-03-31 22:53 ` Andrew Morton
2004-03-31 23:20   ` Stephen C. Tweedie
2004-04-16 22:35 ` Jamie Lokier
2004-04-19 21:54   ` Stephen C. Tweedie
2004-04-21  2:10     ` Jamie Lokier
2004-04-21  9:52       ` Stephen C. Tweedie
  -- strict thread matches above, loose matches on Subject: below --
2006-02-09  7:18 linux
2006-02-09  8:18 ` Andrew Morton
2006-02-09  8:35   ` Nick Piggin
2006-02-09  8:42     ` Andrew Morton
2006-02-09 12:38       ` Nick Piggin
2006-02-09 12:39       ` Nick Piggin
2006-02-09 17:48         ` Andrew Morton
2006-02-10  3:36           ` Nick Piggin
2006-02-10  3:50             ` Andrew Morton
2006-02-10  3:57               ` Nick Piggin
2006-02-10  4:13                 ` Andrew Morton
2006-02-10  4:30                   ` Nick Piggin
2006-02-10  4:43                     ` Andrew Morton
2006-02-10  4:52                       ` Nick Piggin
2006-02-10  5:13                         ` Andrew Morton
2006-02-10  5:29                           ` Nick Piggin
2006-02-10  5:50                             ` Andrew Morton
2006-02-10  6:03                               ` Nick Piggin
2006-02-10  6:13                                 ` Andrew Morton
2006-02-10  6:31                                   ` Nick Piggin
2006-02-10  6:46                                     ` Andrew Morton
2006-02-10  6:57                                       ` Nick Piggin
2006-02-10  7:14                                         ` Andrew Morton
2006-02-10 12:41                                           ` Nick Piggin
2006-02-10 16:19                                             ` Linus Torvalds
2006-02-10 17:00                                               ` Nick Piggin
2006-02-10 17:12                                                 ` Linus Torvalds
2006-02-10 17:35                                                   ` Linus Torvalds
2006-02-10 17:59                                                   ` Nick Piggin
2006-02-10 18:55                                                     ` Linus Torvalds
2006-02-10 19:29                                                       ` Nick Piggin
2006-02-10 19:44                                                         ` Linus Torvalds
2006-02-10 19:52                                                           ` Nick Piggin
2006-02-10 20:03                                                             ` Linus Torvalds
2006-02-11  5:49                                                               ` Nick Piggin
2006-02-10 16:05                                         ` Linus Torvalds
2006-02-10 16:37                                           ` Nick Piggin
2006-02-10 17:03                                             ` Linus Torvalds
2006-02-10 17:37                                               ` Nick Piggin
2006-02-10 18:01                                                 ` Linus Torvalds
2006-02-10 18:38                                                   ` Nick Piggin
2006-02-10 19:05                                                     ` Linus Torvalds
2006-02-10 19:34                                                       ` Oliver Neukum
2006-02-10 19:59                                                         ` Linus Torvalds
2006-02-10 20:11                                                           ` Andrew Morton
2006-02-10 21:15                                                             ` Linus Torvalds
2006-02-10 21:28                                                               ` Andrew Morton
2006-02-10 20:03                                                       ` Nick Piggin
2006-02-10 21:10                                                         ` Linus Torvalds
2006-02-10 21:55                                                           ` Trond Myklebust
2006-02-10 22:46                                                             ` Linus Torvalds
2006-02-10 23:02                                                               ` Trond Myklebust
2006-02-10 23:15                                                                 ` Linus Torvalds
2006-02-11 19:07                                                                   ` Trond Myklebust
2006-02-10 17:29                                           ` linux
2006-02-10 17:42                                             ` Linus Torvalds
2006-02-10 18:57                                               ` Nick Piggin
2006-02-10  8:00                                       ` linux
2006-02-10 13:18                                         ` Nick Piggin
2006-02-10  7:15                   ` linux
2006-02-10  7:28                     ` Andrew Morton
2006-02-09 11:18   ` linux

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox