From: jw schultz <jw@pegasys.ws>
To: linux-kernel@vger.kernel.org
Subject: Re: [PATCH] 2.4.19-rc1/2.5.25 provide dummy fsync() routine for directories on NFS mounts
Date: Mon, 15 Jul 2002 18:40:29 -0700
Message-ID: <20020716014029.GC18703@pegasys.ws>
In-Reply-To: <s5gofd8sq4i.fsf@egghead.curl.com>
On Mon, Jul 15, 2002 at 04:17:01PM -0400, Patrick J. LoPresti wrote:
> Alan Cox <alan@lxorguk.ukuu.org.uk> writes:
>
> > Documentation/fs/fsync.txt or similar sounds a good idea
>
> OK, attached is my first attempt at such a document.
>
> What do you think?
>
> - Pat
>
>
Nice and clear. I expect it also applies to unlink(2) and
rename(2).
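If that expectation holds, the same directory fsync() would follow
an unlink(2) or rename(2). A minimal sketch, not taken from Pat's
document: sync_dir_of(), durable_unlink() and durable_rename() are
hypothetical names made up for this illustration.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <fcntl.h>
#include <libgen.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper: fsync() the directory that contains `path`.
 * Returns 0 on success, -1 on failure. */
static int sync_dir_of(const char *path)
{
    char *copy = strdup(path);      /* dirname() may modify its argument */
    if (copy == NULL) return -1;

    int dfd = open(dirname(copy), O_RDONLY | O_DIRECTORY);
    int saved = errno;              /* keep open()'s errno across free() */
    free(copy);
    errno = saved;
    if (dfd < 0) return -1;

    int rc = fsync(dfd);
    if (close(dfd) < 0) rc = -1;
    return rc;
}

/* Remove a name, then commit the directory change. */
int durable_unlink(const char *path)
{
    if (unlink(path) < 0) return -1;
    return sync_dir_of(path);
}

/* Rename, then commit both directories involved. */
int durable_rename(const char *from, const char *to)
{
    if (rename(from, to) < 0) return -1;
    if (sync_dir_of(to) < 0) return -1;
    return sync_dir_of(from);   /* redundant if both names share a directory */
}
```

A spooler would call durable_rename("tmp/msg", "new/msg") where it
calls rename() today, and treat a nonzero return as "not yet safe".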
A simplified version of this with a list of popular "broken"
MTAs and other spooling utilities might also go into the faq
with a strong emphasis on the chattr and mount options.
Content-Description: fsync.txt
> Linux fsync() semantics
> (or, "How to create a file reliably")
>
>
> Introduction
> ============
>
> Consider the following C program:
>
> #include <unistd.h>
> #include <stdio.h>
> #include <fcntl.h>
> #include <string.h>
>
> int
> main (int argc, char *argv[]) {
>     int fd;
>     char *s = "Hello, world!\n";
>
>     /* O_CREAT requires a third (mode) argument. */
>     fd = open ("/tmp/foo", O_WRONLY|O_CREAT|O_EXCL, 0666);
>     if (fd < 0) return 1;
>
>     if (write (fd, s, strlen(s)) < 0) return 3;
>     if (fsync (fd) < 0) return 4;
>     if (close (fd) < 0) return 5;
>
>     return 0;
> }
>
> Question: If you compile and run this program, and it exits zero
> (success), and your machine then crashes, is it guaranteed that the
> file /tmp/foo will exist upon reboot?
>
> Answer: On many Unices, including *BSD, yes.
> On Linux, NO.
>
> How could this be? And what can you do about it?
>
>
> History
> =======
>
> In the beginning was BSD with its Fast File System (FFS). Under FFS,
> changes to directories were "synchronous", meaning they were committed
> to disk before the system call (open/link/rename/etc.) returned.
> Changes to files (write()) were asynchronous. The fsync() system call
> allowed an application to force a file's pending writes to be
> committed to persistent media.
>
> In general, disks have reasonable throughput but horrible latency, so
> it is much faster to write many things all at once rather than one at
> a time. In other words, synchronous operations are slow.
>
> Enter Linux. By default, Linux makes all operations, including
> directory updates, asynchronous. Early file system benchmarks showed
> Linux beating the pants off of BSD, especially when lots of directory
> operations were involved. This annoyed the BSD folks, who claimed
> that synchronous directory updates are required for reliable
> operation. (As with most points of contention between Linux and BSD,
> this is both true and false... See below.)
>
> The problem with making directory operations asynchronous is that you
> then need to provide a way for the application to commit those changes
> to disk. Otherwise, it is impossible to write reliable applications.
>
>
> BSD softupdates
> ===============
>
> Sometime during the 90s, the BSD developers introduced "soft updates"
> to improve performance. These do two things. First, they make all
> file system operations asynchronous (like Linux). Second, they extend
> the fsync() system call so that it commits to disk BOTH the file's
> data AND any directories via which the file might be accessed.
>
> In other words, BSD with soft updates requires that you call fsync()
> on a file to commit any changes to its containing directory. This is
> why the program above "works" on BSD.
>
> Many programs are written these days to expect soft update semantics,
> because such algorithms will also work correctly under traditional
> FFS.
>
> The problem with the softupdates approach is that finding all paths to
> a file is complex, and the Linux developers hate complexity. Linux
> does NOT support this behavior for fsync() and probably never will.
>
>
> Standards
> =========
>
> Quick aside: What do the relevant standards (POSIX, SuS) say? Is
> Linux violating some standard here?
>
> Well, different people, having read the standards, disagree on this
> point. This itself means the standards are not clear (which is a bad
> thing for a standard). This is probably because the standards were
> written when synchronous directory updates were the norm, and the
> authors did not even consider asynchronous directory updates.
>
>
> The Linux Solution
> ==================
>
> The Linux answer is simple: If you want to flush a modified directory
> to disk, call fsync() on the directory.
>
> In other words, to reliably create a file on Linux, you need to do
> something like this:
>
> #include <unistd.h>
> #include <stdio.h>
> #include <fcntl.h>
> #include <string.h>
>
> int
> main (int argc, char *argv[]) {
>     int fd, dirfd;
>     char *s = "Hello, world!\n";
>
>     /* O_CREAT requires a third (mode) argument. */
>     fd = open ("/tmp/foo", O_WRONLY|O_CREAT|O_EXCL, 0666);
>     if (fd < 0) return 1;
>
>     /* Open the containing directory so it can be fsync()ed too. */
>     dirfd = open ("/tmp", O_RDONLY);
>     if (dirfd < 0) return 2;
>
>     if (write (fd, s, strlen(s)) < 0) return 3;
>     if (fsync (fd) < 0) return 4;
>     if (close (fd) < 0) return 5;
>     if (fsync (dirfd) < 0) return 6;
>     if (close (dirfd) < 0) return 7;
>
>     return 0;
> }
>
> If this program exits zero, the file /tmp/foo is guaranteed to be on
> disk and to have the correct contents. This is true for ALL versions
> of the Linux kernel and ALL file systems.
>
>
> Other choices
> =============
>
> So you have written to the authors of your favorite MTA asking them to
> support Linux properly by using fsync() on directories. They have
> responded saying that "Linux is broken". (Be sure to ask them to
> justify this claim with chapter and verse from a standard. It is sure
> to be interesting.) What can you do?
>
> If the application does all its work in one directory, or a few
> directories, you can do "chattr +S" on the directory. This will cause
> all operations on that directory to be synchronous.
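For anyone who would rather set that flag from a program than from
the command line, here is a sketch of what "chattr +S" amounts to,
assuming a Linux system with <linux/fs.h> and a file system that
supports the inode-flag ioctls. set_dir_sync_flag() is a
hypothetical name, and whether the flag is honored depends on the
file system.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <linux/fs.h>
#include <sys/ioctl.h>
#include <unistd.h>

/* Hypothetical helper: set the "synchronous updates" inode flag
 * (the S attribute) on a directory. Returns 0 on success, -1 on
 * failure, e.g. if the file system lacks these ioctls. */
int set_dir_sync_flag(const char *dir)
{
    int fd = open(dir, O_RDONLY | O_DIRECTORY);
    if (fd < 0) return -1;

    int flags;
    int rc = ioctl(fd, FS_IOC_GETFLAGS, &flags);   /* read current flags */
    if (rc == 0) {
        flags |= FS_SYNC_FL;                       /* add the S attribute */
        rc = ioctl(fd, FS_IOC_SETFLAGS, &flags);   /* write them back */
    }

    if (close(fd) < 0) rc = -1;
    return rc < 0 ? -1 : 0;
}
```

The result can be checked with lsattr -d on the directory, which
lists S among the attributes once the flag is set.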
>
> You can use the "-o sync" mount option. This will cause ALL
> operations on that partition to be synchronous. This solves the
> problem, but is likely to be slow.
>
> In the current version of Linux, you can use the ext3 or ReiserFS file
> systems. These happen to commit their journals to disk whenever
> fsync() is called, which has the side-effect of providing semantics
> like BSD's soft updates. But note: This behavior is not guaranteed,
> and may change in future releases!
>
> But really, the best idea is to convince application authors to
> support the "Linux way" for committing directory updates. The
> semantics are simple, clear, and extremely efficient. So go bug those
> MTA authors until they listen :-).
>
>
> - Patrick LoPresti <patl@curl.com>
> July 2002
--
________________________________________________________________
J.W. Schultz Pegasystems Technologies
email address: jw@pegasys.ws
Remember Cernan and Schmitt