linux-fsdevel.vger.kernel.org archive mirror
From: Andrew Morton <akpm@linux-foundation.org>
To: Jamie Lokier <jamie@shareable.org>
Cc: "Jörn Engel" <joern@logfs.org>,
	"Nick Piggin" <nickpiggin@yahoo.com.au>,
	linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org,
	"Chris Wedgwood" <cw@f00f.org>
Subject: Re: Proposal for "proper" durable fsync() and fdatasync()
Date: Tue, 26 Feb 2008 08:27:25 -0800
Message-ID: <20080226082725.b65365f0.akpm@linux-foundation.org>
In-Reply-To: <20080226150745.GA18118@shareable.org>

On Tue, 26 Feb 2008 15:07:45 +0000 Jamie Lokier <jamie@shareable.org> wrote:

> SYNC_FILE_RANGE_WRITE scans all pages in the range, looking for dirty
> pages which aren't already queued for write-out.  It marks those with
> a "write-out" flag, and starts write I/Os at some unspecified time in
> the near future; it can be assumed writes for all the pages will
> complete eventually if there's no errors.  When I/O completes on a
> page, it cleans the page and also clears the write-out flag.
> 
> SYNC_FILE_RANGE_WAIT_AFTER waits until all pages in the range don't
> have the "write-out" flag set.
> 
> SYNC_FILE_RANGE_WAIT_BEFORE does the same wait, but before marking
> pages for write-out.  I don't actually see the point in this.  Isn't a
> preceding call with SYNC_FILE_RANGE_WAIT_AFTER equivalent, making
> BEFORE a redundant flag?

Consider the case of pages which are dirty but are already under writeout,
i.e. someone redirtied the page after someone else started writing it out.
For these pages the kernel needs to

a) wait for the current writeout to complete

b) start new writeout

c) wait for that writeout to complete.

Those are the three stages of sync_file_range().  They are independently
selectable and various combinations provide various results.
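
As a rough sketch of how the three stages map onto the flag bits: stage a)
corresponds to SYNC_FILE_RANGE_WAIT_BEFORE, b) to SYNC_FILE_RANGE_WRITE and
c) to SYNC_FILE_RANGE_WAIT_AFTER.  The helper names stage_flags() and
sync_stages() below are made up for illustration and are not part of any
kernel or libc interface.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stdbool.h>

/* Hypothetical helper: build the sync_file_range() flags word from the
 * three independently selectable stages. */
unsigned int stage_flags(bool wait_before, bool start_write, bool wait_after)
{
    unsigned int flags = 0;

    if (wait_before)
        flags |= SYNC_FILE_RANGE_WAIT_BEFORE;  /* a) wait for writeout already in flight */
    if (start_write)
        flags |= SYNC_FILE_RANGE_WRITE;        /* b) start new writeout of dirty pages */
    if (wait_after)
        flags |= SYNC_FILE_RANGE_WAIT_AFTER;   /* c) wait for that writeout to complete */
    return flags;
}

/* Hypothetical wrapper: run the selected stages over one byte range.
 * Returns 0 on success, or -1 with errno set by sync_file_range(). */
int sync_stages(int fd, off64_t offset, off64_t nbytes,
                bool wait_before, bool start_write, bool wait_after)
{
    assert(offset >= 0 && nbytes >= 0);
    return sync_file_range(fd, offset, nbytes,
                           stage_flags(wait_before, start_write, wait_after));
}
```

For example, sync_stages(fd, 0, 4096, false, true, false) would request
stage b) alone for the first 4096 bytes of the file.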

The reason for providing b) only (SYNC_FILE_RANGE_WRITE) is so that
userspace can get as much data into the queue as possible, to permit the
kernel to optimise IO scheduling better.
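
A minimal sketch of the b)-only usage, assuming the application already
tracks which byte ranges it has dirtied.  The struct segment type and the
queue_segments() function are hypothetical names introduced for this
example.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>

/* Hypothetical description of one dirtied region of the file. */
struct segment {
    off64_t offset;
    off64_t length;
};

/* Start writeout for every segment without waiting for any of it, so the
 * IO scheduler sees as many requests as possible at once.
 * Returns 0 on success, or -1 with errno set by sync_file_range(). */
int queue_segments(int fd, const struct segment *segs, size_t n)
{
    assert(segs != NULL || n == 0);

    for (size_t i = 0; i < n; i++) {
        if (sync_file_range(fd, segs[i].offset, segs[i].length,
                            SYNC_FILE_RANGE_WRITE) != 0)
            return -1;
    }
    return 0;
}
```

When this returns 0 the writes have only been submitted, not completed.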

If you perform a) and b) together
(SYNC_FILE_RANGE_WAIT_BEFORE|SYNC_FILE_RANGE_WRITE) then you are guaranteed
that all data which was dirty when sync_file_range() executed will be sent
into the queue, but you won't get as much data into the queue if the kernel
encounters dirty, under-writeout pages.  This is especially hurtful if
you're trying to feed a lot of little segments into the queue.  In that
case perhaps userspace should do an asynchronous pass
(SYNC_FILE_RANGE_WRITE) to stuff as much data as possible into the queue, then
a SYNC_FILE_RANGE_WAIT_AFTER pass, then a
SYNC_FILE_RANGE_WAIT_BEFORE|SYNC_FILE_RANGE_WRITE|SYNC_FILE_RANGE_WAIT_AFTER
pass to clean up any stragglers.  Which mode is best very much depends on
the application's file dirtying patterns.  One would have to experiment
with it, and tuning of sync_file_range() usage would occur alongside tuning
of the application's write() design.
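
The three-pass idea above could be sketched roughly as follows.  This is
only one possible arrangement, untuned, and the names struct dirty_range,
for_each_range() and flush_ranges_three_pass() are hypothetical.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>

/* Hypothetical description of one dirtied region of the file. */
struct dirty_range {
    off64_t offset;
    off64_t length;
};

/* Apply the same sync_file_range() flags to every range in turn.
 * Returns 0 on success, or -1 with errno set by sync_file_range(). */
static int for_each_range(int fd, const struct dirty_range *r, size_t n,
                          unsigned int flags)
{
    for (size_t i = 0; i < n; i++) {
        if (sync_file_range(fd, r[i].offset, r[i].length, flags) != 0)
            return -1;
    }
    return 0;
}

int flush_ranges_three_pass(int fd, const struct dirty_range *r, size_t n)
{
    assert(r != NULL || n == 0);

    /* Pass 1: asynchronous, queue as much writeout as possible. */
    if (for_each_range(fd, r, n, SYNC_FILE_RANGE_WRITE) != 0)
        return -1;

    /* Pass 2: wait for the writeout that is now in flight. */
    if (for_each_range(fd, r, n, SYNC_FILE_RANGE_WAIT_AFTER) != 0)
        return -1;

    /* Pass 3: pick up stragglers, i.e. pages that were redirtied while
     * already under writeout and so were not fully handled above. */
    return for_each_range(fd, r, n,
                          SYNC_FILE_RANGE_WAIT_BEFORE |
                          SYNC_FILE_RANGE_WRITE |
                          SYNC_FILE_RANGE_WAIT_AFTER);
}
```

Note that sync_file_range() does not write out file metadata, so a sketch
like this is about scheduling data writeout and is not by itself a
replacement for fsync() or fdatasync().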

It's an interesting problem, with potentially high payback.
