linux-fsdevel.vger.kernel.org archive mirror
From: "Sachin Gaikwad" <sachin.kernel@gmail.com>
To: "Jamie Lokier" <jamie@shareable.org>
Cc: "Ric Wheeler" <ric@emc.com>, "Jeff Garzik" <jeff@garzik.org>,
	linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org,
	"Chris Wedgwood" <cw@f00f.org>
Subject: Re: Proposal for "proper" durable fsync() and fdatasync()
Date: Mon, 24 Nov 2008 16:10:48 -0500
Message-ID: <f5524d840811241310m52fe0d30u2bcfaf3981f7368f@mail.gmail.com>
In-Reply-To: <20080226154315.GC18118@shareable.org>

Hi Jamie,

On Tue, Feb 26, 2008 at 10:43 AM, Jamie Lokier <jamie@shareable.org> wrote:
> Ric Wheeler wrote:
>> >>I was surprised that fsync() doesn't do this already.  There was a lot
>> >>of effort put into block I/O write barriers during 2.5, so that
>> >>journalling filesystems can force correct write ordering, using disk
>> >>flush cache commands.
>> >>
>> >>After all that effort, I was very surprised to notice that Linux 2.6.x
>> >>doesn't use that capability to ensure fsync() flushes the disk cache
>> >>onto stable storage.
>> >
>> >It's surprising you are surprised, given that this [lame] fsync behavior
>> >has remained consistently lame throughout Linux's history.
>>
>> Maybe I am confused, but isn't this what fsync() does today whenever
>> barriers are enabled (the fsync() invalidates the drive's write cache)?
>
> No, fsync() doesn't always flush the drive's write cache.  It often
> does, and I think many people are under the impression it always does,
> but it doesn't.
>
> Try this code on ext3:
>
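>        /* needs <fcntl.h> and <unistd.h> */
>        int fd = open ("test_file", O_RDWR | O_CREAT | O_TRUNC, 0666);
>        while (1) {
>                char byte = 0;            /* initialized so the byte written is well defined */
>                usleep (100000);          /* 100 ms pause: about 10 iterations per second */
>                pwrite (fd, &byte, 1, 0); /* rewrite the same byte at offset 0 */
>                fsync (fd);
>        }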
>
> It will do just over 10 write ops per second on an idle system (13 on
> mine), and 1 flush op per second.

How did you measure write ops and flush ops? Is there any tool that
can be used? I tried looking at what CONFIG_BSD_PROCESS_ACCT
provides, but had no luck.

Sachin
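
One possible way to watch the write-op rate is sketched below. It is not
the method used in the quoted message, which does not say how the numbers
were obtained. The sketch assumes a kernel that exposes /proc/diskstats
with the per-disk layout in which "writes completed" is the eighth
whitespace-separated column (major, minor, name, then the fifth statistic).
The function names are made up for illustration. Flush commands are not
broken out in that counter, so this only covers the write-op half of the
question; a block-layer tracer such as blktrace is one place to look for
the rest.

```c
#include <stdio.h>
#include <string.h>

/* Parse one /proc/diskstats line.  If the device name matches `dev`,
 * store the "writes completed" counter in *writes and return 1;
 * otherwise return 0.  (Hypothetical helper, assumed field layout.) */
int parse_writes_completed(const char *line, const char *dev,
                           unsigned long long *writes)
{
    unsigned int major, minor;
    char name[64];
    unsigned long long rd_ios, rd_merges, rd_sectors, rd_ticks, wr_ios;

    if (sscanf(line, "%u %u %63s %llu %llu %llu %llu %llu",
               &major, &minor, name,
               &rd_ios, &rd_merges, &rd_sectors, &rd_ticks, &wr_ios) != 8)
        return 0;
    if (strcmp(name, dev) != 0)
        return 0;
    *writes = wr_ios;
    return 1;
}

/* Read the current counter for `dev` from /proc/diskstats.
 * Returns 0 on success, -1 if the file or the device is not found. */
int read_writes_completed(const char *dev, unsigned long long *writes)
{
    char line[512];
    int found = 0;
    FILE *f = fopen("/proc/diskstats", "r");

    if (!f)
        return -1;
    while (!found && fgets(line, sizeof line, f))
        found = parse_writes_completed(line, dev, writes);
    fclose(f);
    return found ? 0 : -1;
}
```

Calling read_writes_completed() for the disk under test once before and
once after a fixed interval, and dividing the difference by the interval
length, gives an approximate writes-per-second figure on an otherwise
idle system.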

>
> That's because ext3 fsync() only does a journal commit when the inode
> has changed.  The inode mtime is changed by write only with 1 second
> granularity.  Without a journal commit, there's no barrier, which
> translates to not flushing disk write cache.
>
> If you add "fchmod (fd, 0644); fchmod (fd, 0664);" between the write
> and fsync, you'll see at least 20 write ops and 20 flush ops per
> second, and you'll hear the disk seeking more.  That's because the
> fchmod dirties the inode, so fsync() writes the inode with a journal
> commit.
>
> It turns out even _that_ is not sufficient according to the kernel
> internals.  A journal commit uses an ordered request, which is
> potentially not the same as a flush; it just happens to use a flush in
> this instance.  I'm not sure if ordered requests are actually implemented
> by any drivers at the moment.  If not now, they will be one day.
>
> We could change ext3 fsync() to always do a journal commit, and depend
> on the non-existence of block drivers which do ordered (not flush)
> barrier requests.  But there are lots of things wrong with that.  Not
> least, it hurts performance a lot for database-like applications and
> virtual machines, due to unnecessary seeks.  That way lies wrongness.
>
> Rightness is to make fdatasync() work well, with a genuine flush (or
> equivalent (see FUA), only when required, and not a mere ordered
> barrier), no inode write, and to make sync_file_range()[*] offer the
> fancier applications finer controls which reflect what they actually
> need.
>
> [*] - or whatever.
>
> -- Jamie
>
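
For anyone wanting to repeat the experiment quoted above, here is a
minimal self-contained sketch of it, with the fchmod() variant selectable
by a flag. The function names, the dirty_inode parameter and the bounded
iteration count are additions for illustration and are not from the
original snippet, which loops forever. The sketch only generates the I/O
pattern; it does not count write or flush ops itself.

```c
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

/* One round: write a single byte at offset 0, optionally dirty the
 * inode with a pair of fchmod() calls, then fsync().
 * Returns 0 on success, -1 on any failure. */
int write_and_sync(int fd, int dirty_inode)
{
    char byte = 0;

    if (pwrite(fd, &byte, 1, 0) != 1)
        return -1;
    if (dirty_inode) {
        /* Changing the mode and changing it again dirties the inode,
         * which is what the quoted message uses to force a commit. */
        if (fchmod(fd, 0644) != 0 || fchmod(fd, 0664) != 0)
            return -1;
    }
    return fsync(fd);
}

/* Run `iterations` rounds, pausing 100 ms before each one (about ten
 * rounds per second).  Returns the number of rounds completed, or -1
 * if the file cannot be opened. */
int run_fsync_test(const char *path, int iterations, int dirty_inode)
{
    int i;
    int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0666);

    if (fd < 0)
        return -1;
    for (i = 0; i < iterations; i++) {
        usleep(100000);
        if (write_and_sync(fd, dirty_inode) != 0)
            break;
    }
    close(fd);
    return i;
}
```

Running run_fsync_test("test_file", 100, 0) and then
run_fsync_test("test_file", 100, 1) while watching the disk's counters
should make the difference between the two cases visible, if the
filesystem behaves as the quoted message describes.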

