From: jw schultz <jw@pegasys.ws>
To: linux-kernel@vger.kernel.org
Subject: Re: [PATCH] 2.4.19-rc1/2.5.25 provide dummy fsync() routine for directories on NFS mounts
Date: Mon, 15 Jul 2002 18:40:29 -0700 [thread overview]
Message-ID: <20020716014029.GC18703@pegasys.ws> (raw)
In-Reply-To: <s5gofd8sq4i.fsf@egghead.curl.com>
On Mon, Jul 15, 2002 at 04:17:01PM -0400, Patrick J. LoPresti wrote:
> Alan Cox <alan@lxorguk.ukuu.org.uk> writes:
>
> > Documentation/fs/fsync.txt or similar sounds a good idea
>
> OK, attached is my first attempt at such a document.
>
> What do you think?
>
> - Pat
>
>
Nice and clear. I expect it also applies to unlink(2) and
rename(2).
A simplified version of this with a list of popular "broken"
MTAs and other spooling utilities might also go into the faq
with a strong emphasis on the chattr and mount options.
Content-Description: fsync.txt
> Linux fsync() semantics
> (or, "How to create a file reliably")
>
>
> Introduction
> ============
>
> Consider the following C program:
>
> #include <unistd.h>
> #include <stdio.h>
> #include <fcntl.h>
> #include <string.h>
>
> int
> main (int argc, char *argv[]) {
> int fd;
> char *s = "Hello, world!\n";
>
> fd = open ("/tmp/foo", O_WRONLY|O_CREAT|O_EXCL);
> if (fd < 0) return 1;
>
> if (write (fd, s, strlen(s)) < 0) return 3;
> if (fsync (fd) < 0) return 4;
> if (close (fd) < 0) return 5;
>
> return 0;
> }
>
> Question: If you compile and run this program, and it exits zero
> (success), and your machine then crashes, is it guaranteed that the
> file /tmp/foo will exist upon reboot?
>
> Answer: On many Unices, including *BSD, yes.
> On Linux, NO.
>
> How could this be? And what can you do about it?
>
>
> History
> =======
>
> In the beginning was BSD with its Fast File System (FFS). Under FFS,
> changes to directories were "synchronous", meaning they were committed
> to disk before the system call (open/link/rename/etc.) returned.
> Changes to files (write()) were asynchronous. The fsync() system call
> allowed an application to force a file's pending writes to be
> committed to persistent media.
>
> In general, disks have reasonble throughput but horrible latency, so
> it is much faster to write many things all at once rather than one at
> a time. In other words, synchronous operations are slow.
>
> Enter Linux. By default, Linux makes all operations, including
> directory updates, asynchronous. Early file system benchmarks showed
> Linux beating the pants off of BSD, especially when lots of directory
> operations were involved. This annoyed the BSD folks, who claimed
> that synchronous directory updates are required for reliable
> operation. (As with most points of contention between Linux and BSD,
> this is both true and false... See below.)
>
> The problem with making directory operations asynchronous is that you
> then need to provide a way for the application to commit those changes
> to disk. Otherwise, it is impossible to write reliable applications.
>
>
> BSD softupdates
> ===============
>
> Sometime during the 90s, the BSD developers introduced "soft updates"
> to improve performance. These do two things. First, they make all
> file system operations asynchronous (like Linux). Second, they extend
> the fsync() system call so that it commits to disk BOTH the file's
> data AND any directories via which the file might be accessed.
>
> In other words, BSD with soft updates requires that you call fsync()
> on a file to commit any changes to its containing directory. This is
> why the program above "works" on BSD.
>
> Many programs are written these days to expect soft update semantics,
> because such algorithms will also work correctly under traditional
> FFS.
>
> The problem with the softupdates approach is that finding all paths to
> a file is complex, and the Linux developers hate complexity. Linux
> does NOT support this behavior for fsync() and probably never will.
>
>
> Standards
> =========
>
> Quick aside: What do the relevant standards (POSIX, SuS) say? Is
> Linux violating some standard here?
>
> Well, different people, having read the standards, disagree on this
> point. This itself means the standards are not clear (which is a bad
> thing for a standard). This is probably because the standards were
> written when synchronous directory updates were the norm, and the
> authors did not even consider asynchronous directory updates.
>
>
> The Linux Solution
> ==================
>
> The Linux answer is simple: If you want to flush a modified directory
> to disk, call fsync() on the directory.
>
> In other words, to reliably create a file on Linux, you need to do
> something like this:
>
> #include <unistd.h>
> #include <stdio.h>
> #include <fcntl.h>
> #include <string.h>
>
> int
> main (int argc, char *argv[]) {
> int fd, dirfd;
> char *s = "Hello, world!\n";
>
> fd = open ("/tmp/foo", O_WRONLY|O_CREAT|O_EXCL);
> if (fd < 0) return 1;
>
> dirfd = open ("/tmp", O_RDONLY);
> if (dirfd < 0) return 2;
>
> if (write (fd, s, strlen(s)) < 0) return 3;
> if (fsync (fd) < 0) return 4;
> if (close (fd) < 0) return 5;
> if (fsync (dirfd) < 0) return 6;
> if (close (dirfd) < 0) return 7;
>
> return 0;
> }
>
> If this program exits zero, the file /tmp/foo is guaranteed to be on
> disk and to have the correct contents. This is true for ALL versions
> of the Linux kernel and ALL file systems.
>
>
> Other choices
> =============
>
> So you have written to the authors of your favorite MTA asking them to
> support Linux properly by using fsync() on directories. They have
> responded saying that "Linux is broken". (Be sure to ask them to
> justify this claim with chapter and verse from a standard. It is sure
> to be interesting.) What can you do?
>
> If the application does all its work in one directory, or a few
> directories, you can do "chattr +S" on the directory. This will cause
> all operations on that directory to be synchronous.
>
> You can use the "-o sync" mount option. This will cause ALL
> operations on that partition to be synchronous. This solves the
> problem, but is likely to be slow.
>
> In the current version of Linux, you can use the ext3 or ReiserFS file
> systems. These happen to commit their journals to disk whenever
> fsync() is called, which has the side-effect of providing semantics
> like BSD's soft updates. But note: This behavior is not guaranteed,
> and may change in future releases!
>
> But really, the best idea is to convince application authors to
> support the "Linux way" for committing directory updates. The
> semantics are simple, clear, and extremely efficient. So go bug those
> MTA authors until they listen :-).
>
>
> - Patrick LoPresti <patl@curl.com>
> July 2002
--
________________________________________________________________
J.W. Schultz Pegasystems Technologies
email address: jw@pegasys.ws
Remember Cernan and Schmitt
next prev parent reply other threads:[~2002-07-16 1:37 UTC|newest]
Thread overview: 31+ messages / expand[flat|nested] mbox.gz Atom feed top
[not found] <20020715075221.GC21470@uncarved.com>
2002-07-15 12:45 ` [PATCH] 2.4.19-rc1/2.5.25 provide dummy fsync() routine for directories on NFS mounts Richard B. Johnson
2002-07-15 13:35 ` Matthias Andree
[not found] ` <mit.lcs.mail.linux-kernel/20020715133507.GF32155@merlin.emma.line.org>
2002-07-15 14:49 ` Patrick J. LoPresti
2002-07-15 15:18 ` Matthias Andree
[not found] ` <mit.lcs.mail.linux-kernel/20020715151833.GA22828@merlin.emma.line.org>
2002-07-15 16:10 ` Patrick J. LoPresti
2002-07-15 18:16 ` Matthias Andree
[not found] ` <mit.lcs.mail.linux-kernel/20020715181650.GA20665@merlin.emma.line.org>
2002-07-15 18:56 ` Patrick J. LoPresti
2002-07-15 20:50 ` Matthias Andree
2002-07-15 16:16 ` Alan Cox
2002-07-15 15:19 ` Matthias Andree
2002-07-15 16:45 ` Alan Cox
2002-07-15 15:38 ` Patrick J. LoPresti
2002-07-15 16:55 ` Alan Cox
2002-07-15 15:29 ` [PATCH] 2.4.19-rc1/2.5.25 provide dummy fsync() routine fordirectories " Sandy Harris
2002-07-15 20:17 ` [PATCH] 2.4.19-rc1/2.5.25 provide dummy fsync() routine for directories " Patrick J. LoPresti
2002-07-16 1:40 ` jw schultz [this message]
2002-07-15 15:20 ` Bill Rugolsky Jr.
2002-07-15 15:35 ` Matthias Andree
2002-07-15 16:14 ` Bill Rugolsky Jr.
2002-07-09 13:49 Trond Myklebust
2002-07-09 14:06 ` Richard B. Johnson
2002-07-09 14:08 ` Trond Myklebust
2002-07-09 15:06 ` Richard B. Johnson
2002-07-09 16:56 ` Alan Cox
2002-07-09 17:22 ` Richard B. Johnson
2002-07-09 19:11 ` Alan Cox
2002-07-09 19:13 ` Richard B. Johnson
2002-07-09 19:59 ` Alan Cox
2002-07-09 19:50 ` Richard B. Johnson
2002-07-10 6:33 ` Alex Riesen
2002-07-10 11:20 ` Richard B. Johnson
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20020716014029.GC18703@pegasys.ws \
--to=jw@pegasys.ws \
--cc=linux-kernel@vger.kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox