public inbox for linux-kernel@vger.kernel.org
From: Jamie Lokier <jamie@shareable.org>
To: "Jörn Engel" <joern@logfs.org>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>,
	Andrew Morton <akpm@linux-foundation.org>,
	linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org,
	Chris Wedgwood <cw@f00f.org>
Subject: Re: Proposal for "proper" durable fsync() and fdatasync()
Date: Tue, 26 Feb 2008 15:28:10 +0000
Message-ID: <20080226152810.GB18118@shareable.org>
In-Reply-To: <20080226140925.GB20428@lazybastard.org>

Jörn Engel wrote:
> On Tue, 26 February 2008 20:16:11 +1100, Nick Piggin wrote:
> > Yeah, sync_file_range has slightly unusual semantics and introduces
> > a new concept, "writeout", to userspace (does "writeout" include
> > "in drive cache"? the kernel doesn't think so, but the only way to
> > make sync_file_range "safe" is if you do consider it writeout).
> 
> If sync_file_range isn't safe, it should get replaced by a noop
> implementation.  There really is no point in promising "a little"
> safety.

Sometimes there is a point in "a little" safety.

There's a spectrum of durability (meaning how safely stored the data
is).  In the cases we're imagining, it's application -> main memory
cache -> disk cache -> disk surface.  There are others.

_None_ of those provide perfect safety for your data.  They are a
spectrum, and how far along you want data to be committed before you
say "fine, the data is safe enough for me" depends on your application.

For example, there are users who like to turn _off_ fdatasync() with
their SQL database of choice.  They prefer speed over safety, and they
don't mind losing an hour's data because, we assume, they take regular
backups ;-)  Some blogs fall into this category: who cares if a rare
crash costs you a comment or two and a restore from backup?  It's an
acceptable price for the speed.

There are users who would really like fdatasync() to commit data to the
drive platters, so that once their database says "done", they can be
confident that a power failure won't cause committed data to be lost.
Accepting credit cards is nearer this end.  Anyone using a virtual
machine of any kind without a journalling fs in the guest should be
here too!

And there are users who like it where it is right now: a compromise,
where a system crash won't lose committed data, but a power failure
might.  (I'm making assumptions about drive behaviour on reset here.)

My problem with fdatasync() at the moment is, I can't choose what I
want from it, and there's no mechanism to give me the safest option.
Most annoyingly, in-kernel filesystems _do_ have a mechanism; it just
isn't exported to userspace.
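
To make the choice concrete, here is a minimal userspace sketch of the
two levels an application can actually select today.  The names
`enum durability`, `write_all()` and `commit_record()` are made up for
illustration; only write() and fdatasync() are real calls.  There is
deliberately no "platter" level in the enum, because that is the option
this message says userspace has no way to request.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <unistd.h>

/* Hypothetical names: how far along the spectrum a commit must get. */
enum durability {
    DUR_PAGE_CACHE, /* write() only: survives an application crash */
    DUR_FDATASYNC   /* write() + fdatasync(): today's compromise */
};

/* Write the whole buffer, retrying on short writes and EINTR. */
static int write_all(int fd, const char *buf, size_t len)
{
    while (len > 0) {
        ssize_t n = write(fd, buf, len);
        if (n < 0) {
            if (errno == EINTR)
                continue;
            return -1;
        }
        buf += n;
        len -= (size_t)n;
    }
    return 0;
}

/* Commit one record at the requested level; 0 on success, -1 on error. */
int commit_record(int fd, const char *buf, size_t len, enum durability level)
{
    assert(buf != NULL || len == 0);

    if (write_all(fd, buf, len) < 0)
        return -1;
    if (level == DUR_FDATASYNC)
        return fdatasync(fd); /* how far this reaches is up to the kernel */
    return 0;
}
```

A blog engine might pass DUR_PAGE_CACHE everywhere, while a payment
system would pass DUR_FDATASYNC and still wish for something stronger.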

(A quick aside: fdatasync() et al. are actually used for two
_different_ things.  1: Durability.  When a program says "I've written
it", it can say so with confidence, e.g. when announcing email
receipt.  2: Ordering.  It's used for write ordering with write-ahead
logging: write, fdatasync, write.  When you tease out the details,
efficient implementations of the two are different...  Think SCSI
tagged commands versus cache flushes.)
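
The second use, "write, fdatasync, write", can be sketched as follows.
This is only an illustration under simplifying assumptions:
`wal_update()` is a hypothetical helper, the log record is just the raw
payload (a real log would also record the target offset and a
checksum), the caller is assumed to have opened log_fd with O_APPEND,
and a short write is treated as an error.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: log the intended update, force the log out,
 * and only then modify the data file in place.
 * Returns 0 on success, -1 on error.
 */
int wal_update(int log_fd, int data_fd, off_t offset,
               const char *buf, size_t len)
{
    assert(buf != NULL && len > 0);

    /* 1. write: append the record to the log. */
    if (write(log_fd, buf, len) != (ssize_t)len)
        return -1;

    /*
     * 2. fdatasync: used here purely as an ordering point.  The log
     *    record must reach storage before the in-place write starts.
     */
    if (fdatasync(log_fd) < 0)
        return -1;

    /* 3. write: the in-place update, recoverable from the log. */
    if (pwrite(data_fd, buf, len, offset) != (ssize_t)len)
        return -1;

    return 0;
}
```

Note that step 2 only needs ordering, not an "it is on disk now"
answer, which is why the efficient implementations of the two uses
differ.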

> One interesting aspect of this comes with COW filesystems like btrfs or
> logfs.  Writing out data pages is not sufficient, because those will get
> lost unless their referencing metadata is written as well.  So either we
> have to call fsync for those filesystems or add another callback and let
> filesystems override the default implementation.

Doesn't the ->fsync callback get called in the sys_fdatasync() case,
with appropriate arguments?

Barriers/flushes certainly make things a bit more complicated for
those filesystems.  You have to flush not just the disks holding the
data pages, but also the _other_ disks in a software RAID that hold
the metadata pages pointing to that data, and ideally not every disk
in the array (think database journal commit).

That can be implemented with per-buffer pending-barrier/flush flags
(like I described for pages in the first mail), which are equally
useful when a database-like application uses a block device.
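
As a rough illustration of the per-buffer flag idea, here is a small
userspace model.  It is not kernel code and not the design from the
first mail: `struct buffer`, `buffer_written()` and
`collect_pending_flushes()` are hypothetical names, and devices are
just small integer indices.  The point it shows is that a commit only
needs to flush the devices that hold still-unflushed buffers from its
own set, not every disk in the array.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of a cached buffer on one member disk. */
struct buffer {
    int  device;        /* index of the disk holding this buffer (0..31) */
    bool flush_pending; /* written to the disk, but its cache not flushed */
};

/* Record that a buffer was written to its device but not yet flushed. */
void buffer_written(struct buffer *b)
{
    b->flush_pending = true;
}

/*
 * Return a bitmask of the devices that need a cache flush to make the
 * given buffers durable, and clear their pending flags.  The caller is
 * assumed to issue the flushes for every bit set in the result.
 */
unsigned collect_pending_flushes(struct buffer **bufs, size_t n)
{
    unsigned mask = 0;

    for (size_t i = 0; i < n; i++) {
        if (bufs[i]->flush_pending) {
            mask |= 1u << bufs[i]->device;
            bufs[i]->flush_pending = false;
        }
    }
    return mask;
}
```

With buffers on disks 0, 1 and 2 where only the first and last have
been written since their last flush, the result selects disks 0 and 2
and leaves disk 1 alone.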

-- Jamie
