Re: Proposal for "proper" durable fsync() and fdatasync()

From: Jamie Lokier <jamie@shareable.org>
To: Ric Wheeler <ric@emc.com>
Cc: Jeff Garzik <jeff@garzik.org>,
	linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org,
	Chris Wedgwood <cw@f00f.org>
Subject: Re: Proposal for "proper" durable fsync() and fdatasync()
Date: Tue, 26 Feb 2008 15:43:15 +0000
Message-ID: <20080226154315.GC18118@shareable.org>
In-Reply-To: <47C40269.7060309@emc.com>

Ric Wheeler wrote:
> >>I was surprised that fsync() doesn't do this already.  There was a lot
> >>of effort put into block I/O write barriers during 2.5, so that
> >>journalling filesystems can force correct write ordering, using disk
> >>flush cache commands.
> >>
> >>After all that effort, I was very surprised to notice that Linux 2.6.x
> >>doesn't use that capability to ensure fsync() flushes the disk cache
> >>onto stable storage.
> >
> >It's surprising you are surprised, given that this [lame] fsync behavior 
> >has remained consistently lame throughout Linux's history.
> 
> Maybe I am confused, but isn't this what fsync() does today whenever 
> barriers are enabled (the fsync() invalidates the drive's write cache)?

No, fsync() doesn't always flush the drive's write cache.  It often
does, and I think many people are under the impression it always does,
but it doesn't.

Try this code on ext3:

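	#include <fcntl.h>
	#include <unistd.h>

	int main (void)
	{
		char byte = 0;
		int fd = open ("test_file", O_RDWR | O_CREAT | O_TRUNC, 0666);

		while (1) {
			usleep (100000);		/* about 10 rounds per second */
			pwrite (fd, &byte, 1, 0);	/* rewrite the same byte */
			fsync (fd);
		}
	}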

It will do just over 10 write ops per second on an idle system (13 on
mine), and 1 flush op per second.

That's because ext3 fsync() only does a journal commit when the inode
has changed.  The inode mtime is changed by write only with 1 second
granularity.  Without a journal commit, there's no barrier, which
translates to not flushing disk write cache.
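
The 1 second mtime granularity can be seen from user space.  The sketch
below is only an illustration, not part of the original test:
count_mtime_changes() is a hypothetical helper which repeats the write
loop and counts how often the whole-second mtime reported by fstat()
moves on.  It shows the timestamp as an application sees it, and says
nothing directly about when ext3 decides to commit its journal.

```c
#include <fcntl.h>
#include <sys/stat.h>
#include <time.h>
#include <unistd.h>

/*
 * Hypothetical helper: rewrite one byte `iterations` times, sleeping
 * `delay_us` microseconds before each write, and count how often the
 * whole-second mtime (st_mtime) differs from the previous round.
 * Returns that count, or -1 on error.
 */
int count_mtime_changes (const char *path, int iterations, unsigned delay_us)
{
	struct stat st;
	char byte = 0;
	int changes = 0;
	int fd = open (path, O_RDWR | O_CREAT | O_TRUNC, 0666);

	if (fd < 0)
		return -1;
	if (fstat (fd, &st) < 0) {
		close (fd);
		return -1;
	}
	time_t last = st.st_mtime;

	for (int i = 0; i < iterations; i++) {
		usleep (delay_us);
		if (pwrite (fd, &byte, 1, 0) != 1 || fstat (fd, &st) < 0) {
			close (fd);
			return -1;
		}
		if (st.st_mtime != last) {	/* a new second has started */
			changes++;
			last = st.st_mtime;
		}
	}
	close (fd);
	return changes;
}
```

Calling count_mtime_changes ("test_file", 50, 100000) runs for about
five seconds, so the count should come out at roughly one change per
second rather than one per write.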

If you add "fchmod (fd, 0644); fchmod (fd, 0664);" between the write
and fsync, you'll see at least 20 write ops and 20 flush ops per
second, and you'll hear the disk seeking more.  That's because the
fchmod dirties the inode, so fsync() writes the inode with a journal
commit.
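
For anyone who wants to try the fchmod variant, here is a minimal
sketch of it as a bounded loop.  The function name, the iteration
count and the delay parameter are made up for illustration.  Counting
the write and flush ops themselves needs a block-layer tracing tool
and is outside what this code does.

```c
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Hypothetical helper: the test loop with the two fchmod() calls added.
 * Runs up to `iterations` rounds, sleeping `delay_us` microseconds
 * before each, and returns the number of rounds whose fsync() succeeded
 * (-1 if the file cannot be opened).
 */
int fsync_loop_dirty_inode (const char *path, int iterations, unsigned delay_us)
{
	char byte = 0;
	int synced = 0;
	int fd = open (path, O_RDWR | O_CREAT | O_TRUNC, 0666);

	if (fd < 0)
		return -1;
	for (int i = 0; i < iterations; i++) {
		usleep (delay_us);
		if (pwrite (fd, &byte, 1, 0) != 1)
			break;
		/* Toggle the mode so the inode is dirty on every round. */
		if (fchmod (fd, 0644) < 0 || fchmod (fd, 0664) < 0)
			break;
		if (fsync (fd) == 0)
			synced++;
	}
	close (fd);
	return synced;
}
```

fsync_loop_dirty_inode ("test_file", 100, 100000) gives about ten
seconds of activity to watch, at the same pace as the loop above.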

It turns out even _that_ is not sufficient according to the kernel
internals.  A journal commit uses an ordered request, which is not
necessarily the same as a flush; it just happens to be implemented
with a flush in this instance.  I'm not sure whether any drivers
actually implement ordered requests (as distinct from flushes) at the
moment.  If not now, they will one day.

We could change ext3 fsync() to always do a journal commit, and depend
on the non-existence of block drivers which do ordered (not flush)
barrier requests.  But there are lots of things wrong with that.  Not
least, it hurts performance badly for database-like applications and
virtual machines, largely because of unnecessary seeks.  That way lies
wrongness.

Rightness is to make fdatasync() work well: a genuine flush (or an
equivalent such as FUA) issued only when required, rather than a mere
ordered barrier, and no inode write.  Rightness is also to make
sync_file_range()[*] offer the fancier applications finer controls
which reflect what they actually need.

[*] - or whatever.
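
To make the fdatasync() case concrete, this is roughly the calling
pattern a database-like application would use: overwrite data in place
in an already allocated part of the file, then call fdatasync().  The
helper name is hypothetical, and the sketch shows only the application
side.  Whether the call reaches the disk platters is exactly what
depends on the kernel and filesystem behaviour discussed above.

```c
#include <fcntl.h>
#include <unistd.h>

/*
 * Hypothetical helper: overwrite `len` bytes at `offset` in `fd`, then
 * ask for the data to be made durable.  Returns 0 on success, -1 on
 * error.
 */
int overwrite_and_datasync (int fd, const void *buf, size_t len, off_t offset)
{
	const char *p = buf;

	while (len > 0) {
		ssize_t n = pwrite (fd, p, len, offset);

		if (n <= 0)		/* error, or no progress */
			return -1;
		p += n;
		len -= (size_t) n;
		offset += n;
	}
	/*
	 * fdatasync() may skip metadata, such as mtime, that is not
	 * needed to read the data back.
	 */
	return fdatasync (fd);
}
```

Because the file size does not change on an in-place overwrite, the
only metadata touched is the timestamp, which is the case where
skipping the inode write matters most.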

-- Jamie
