From: "Sachin Gaikwad" <sachin.kernel@gmail.com>
To: "Jamie Lokier" <jamie@shareable.org>
Cc: "Ric Wheeler" <ric@emc.com>, "Jeff Garzik" <jeff@garzik.org>,
linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org,
"Chris Wedgwood" <cw@f00f.org>
Subject: Re: Proposal for "proper" durable fsync() and fdatasync()
Date: Mon, 24 Nov 2008 16:10:48 -0500
Message-ID: <f5524d840811241310m52fe0d30u2bcfaf3981f7368f@mail.gmail.com>
In-Reply-To: <20080226154315.GC18118@shareable.org>

Hi Jamie,

On Tue, Feb 26, 2008 at 10:43 AM, Jamie Lokier <jamie@shareable.org> wrote:
> Ric Wheeler wrote:
>> >>I was surprised that fsync() doesn't do this already. There was a lot
>> >>of effort put into block I/O write barriers during 2.5, so that
>> >>journalling filesystems can force correct write ordering, using disk
>> >>flush cache commands.
>> >>
>> >>After all that effort, I was very surprised to notice that Linux 2.6.x
>> >>doesn't use that capability to ensure fsync() flushes the disk cache
>> >>onto stable storage.
>> >
>> >It's surprising you are surprised, given that this [lame] fsync behavior
> >> >has remained consistently lame throughout Linux's history.
>>
> >> Maybe I am confused, but isn't this what fsync() does today whenever
> >> barriers are enabled (the fsync() invalidates the drive's write cache)?
>
> No, fsync() doesn't always flush the drive's write cache. It often
> does, and I think many people are under the impression it always does,
> but it doesn't.
>
> Try this code on ext3:
>
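> #include <fcntl.h>
> #include <unistd.h>
>
> int main (void)
> {
>     int fd = open ("test_file", O_RDWR | O_CREAT | O_TRUNC, 0666);
>     while (1) {
>         char byte = 0;            /* initialised; the value is irrelevant */
>         usleep (100000);          /* roughly 10 iterations per second */
>         pwrite (fd, &byte, 1, 0); /* rewrite the same byte at offset 0 */
>         fsync (fd);
>     }
> }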
>
> It will do just over 10 write ops per second on an idle system (13 on
> mine), and 1 flush op per second.

How did you measure write ops and flush ops? Is there a tool that can
be used for this? I tried looking at what CONFIG_BSD_PROCESS_ACCT
provides, but had no luck.

Sachin

>
> That's because ext3 fsync() only does a journal commit when the inode
> has changed. The inode mtime is changed by write only with 1 second
> granularity. Without a journal commit, there's no barrier, which
> translates to not flushing disk write cache.
>
> If you add "fchmod (fd, 0644); fchmod (fd, 0664);" between the write
> and fsync, you'll see at least 20 write ops and 20 flush ops per
> second, and you'll hear the disk seeking more. That's because the
> fchmod dirties the inode, so fsync() writes the inode with a journal
> commit.
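
For anyone wanting to reproduce the two cases quoted above side by side, here is a minimal sketch, assuming a POSIX system. The function name fsync_loop and its parameters are hypothetical, and nanosleep() stands in for the usleep() of the original snippet. It only drives the I/O pattern; it does not count write or flush ops.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <time.h>
#include <unistd.h>

/*
 * Rewrite one byte at offset 0 and fsync() it, `iterations` times.
 * If dirty_inode is non-zero, toggle the file mode before each fsync()
 * so the inode is dirty on every pass (the fchmod step described above).
 * Returns the number of fsync() calls that succeeded, or -1 if the
 * file could not be opened.
 */
int fsync_loop(const char *path, int iterations, long delay_ns,
               int dirty_inode)
{
    assert(path != NULL && iterations >= 0);
    assert(delay_ns >= 0 && delay_ns < 1000000000L);

    int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0666);
    if (fd < 0)
        return -1;

    struct timespec delay = { .tv_sec = 0, .tv_nsec = delay_ns };
    int synced = 0;

    for (int i = 0; i < iterations; i++) {
        char byte = 0;

        nanosleep(&delay, NULL);
        if (pwrite(fd, &byte, 1, 0) != 1)
            break;
        if (dirty_inode &&
            (fchmod(fd, 0644) != 0 || fchmod(fd, 0664) != 0))
            break;
        if (fsync(fd) != 0)
            break;
        synced++;
    }

    close(fd);
    return synced;
}
```

Calling fsync_loop("test_file", 100, 100000000L, 0) and then the same with the last argument set to 1 should approximate the two experiments, with a 100 ms delay per pass. Whether each fsync() reaches the disk cache still depends on the filesystem and kernel, which is the point of the thread.
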
>
> It turns out even _that_ is not sufficient according to the kernel
> internals. A journal commit uses an ordered request, which is not
> necessarily the same as a flush; it just happens to use a flush in
> this instance. I'm not sure if ordered requests are actually
> implemented by any drivers at the moment. If not now, they will be
> one day.
>
> We could change ext3 fsync() to always do a journal commit, and depend
> on the non-existence of block drivers which do ordered (not flush)
> barrier requests. But there's lots of things wrong with that. Not
> least, it hurts performance a lot for database-like applications and
> virtual machines, due to unnecessary seeks. That way lies wrongness.
>
> Rightness is to make fdatasync() work well, with a genuine flush (or
> an equivalent, see FUA) issued only when required, not a mere ordered
> barrier, and no inode write, and to make sync_file_range()[*] offer
> the fancier applications finer controls which reflect what they
> actually need.
>
> [*] - or whatever.
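
As an illustration of the application side of this, here is a minimal sketch of an in-place overwrite followed by fdatasync(), the pattern a database-like program would use when the file size does not change. The function name overwrite_and_datasync is hypothetical. The sketch assumes POSIX fdatasync() semantics only; it makes no claim about whether a given kernel or filesystem turns the call into a real disk cache flush.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Overwrite `len` bytes at offset `off` in place, then call fdatasync().
 * Per POSIX, fdatasync() need not write metadata such as mtime unless
 * that metadata is required to read the data back, so an overwrite that
 * does not change the file size can avoid the inode write.
 * Returns 0 on success, -1 on error.
 */
int overwrite_and_datasync(int fd, const void *buf, size_t len, off_t off)
{
    assert(fd >= 0 && buf != NULL);

    const char *p = buf;

    /* pwrite() may write fewer bytes than requested; loop until done. */
    while (len > 0) {
        ssize_t n = pwrite(fd, p, len, off);
        if (n <= 0)
            return -1;
        p += n;
        off += n;
        len -= (size_t)n;
    }

    return fdatasync(fd);
}
```

A caller would open the file once, preallocate it to its final size, and then use this function for each update, so that only data blocks need to reach storage per call.
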
>
> -- Jamie