public inbox for linux-kernel@vger.kernel.org
From: jw schultz <jw@pegasys.ws>
To: linux-kernel@vger.kernel.org
Subject: Re: [PATCH] 2.4.19-rc1/2.5.25 provide dummy fsync() routine for directories on NFS mounts
Date: Mon, 15 Jul 2002 18:40:29 -0700
Message-ID: <20020716014029.GC18703@pegasys.ws>
In-Reply-To: <s5gofd8sq4i.fsf@egghead.curl.com>

On Mon, Jul 15, 2002 at 04:17:01PM -0400, Patrick J. LoPresti wrote:
> Alan Cox <alan@lxorguk.ukuu.org.uk> writes:
> 
> > Documentation/fs/fsync.txt or similar sounds a good idea
> 
> OK, attached is my first attempt at such a document.
> 
> What do you think?
> 
>  - Pat
> 
> 

Nice and clear.  I expect it also applies to unlink(2) and
rename(2).
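
As a rough illustration of that point (a sketch only, not part of
Pat's document): the same directory fsync() would follow a rename(2)
or unlink(2).  The names rename_durably(), unlink_durably() and
commit_and_close() are made up for this example.  renameat(2) and
unlinkat(2) are used only so the directory is opened once; plain
rename()/unlink() followed by fsync() on the directory should amount
to the same thing.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Hypothetical helper: if the directory operation succeeded (rc == 0),
 * fsync() the directory, then close it while preserving errno. */
static int commit_and_close(int dirfd, int rc)
{
    if (rc == 0)
        rc = fsync(dirfd);
    int saved = errno;
    close(dirfd);
    errno = saved;
    return rc;
}

/* Rename oldname to newname inside dir, then flush the directory.
 * Both names are relative to dir.  Returns 0 on success, -1 on error. */
int rename_durably(const char *dir, const char *oldname, const char *newname)
{
    int dirfd = open(dir, O_RDONLY);
    if (dirfd < 0)
        return -1;
    return commit_and_close(dirfd, renameat(dirfd, oldname, dirfd, newname));
}

/* Remove name from dir, then flush the directory. */
int unlink_durably(const char *dir, const char *name)
{
    int dirfd = open(dir, O_RDONLY);
    if (dirfd < 0)
        return -1;
    return commit_and_close(dirfd, unlinkat(dirfd, name, 0));
}
```

A spooler that writes to a temporary name and then renames it into
place would fsync() the file first and call something like
rename_durably() afterwards.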

A simplified version of this, with a list of popular "broken"
MTAs and other spooling utilities, might also go into the FAQ
with a strong emphasis on the chattr and mount options.



Content-Description: fsync.txt
>                        Linux fsync() semantics
>                 (or, "How to create a file reliably")
> 
> 
> Introduction
> ============
> 
> Consider the following C program:
> 
>     #include <unistd.h>
>     #include <stdio.h>
>     #include <fcntl.h>
>     #include <string.h>
> 
>     int
>     main (int argc, char *argv[]) {
>       int fd;
>       const char *s = "Hello, world!\n";
> 
>       /* O_CREAT requires a third (mode) argument; 0644 is an arbitrary choice. */
>       fd = open ("/tmp/foo", O_WRONLY|O_CREAT|O_EXCL, 0644);
>       if (fd < 0) return 1;
> 
>       if (write (fd, s, strlen(s)) < 0) return 3;
>       if (fsync (fd) < 0) return 4;
>       if (close (fd) < 0) return 5;
> 
>       return 0;
>     }
> 
> Question: If you compile and run this program, and it exits zero
> (success), and your machine then crashes, is it guaranteed that the
> file /tmp/foo will exist upon reboot?
> 
> Answer: On many Unices, including *BSD, yes.
>         On Linux, NO.
> 
> How could this be?  And what can you do about it?
> 
> 
> History
> =======
> 
> In the beginning was BSD with its Fast File System (FFS).  Under FFS,
> changes to directories were "synchronous", meaning they were committed
> to disk before the system call (open/link/rename/etc.) returned.
> Changes to files (write()) were asynchronous.  The fsync() system call
> allowed an application to force a file's pending writes to be
> committed to persistent media.
> 
> In general, disks have reasonable throughput but horrible latency, so
> it is much faster to write many things all at once rather than one at
> a time.  In other words, synchronous operations are slow.
> 
> Enter Linux.  By default, Linux makes all operations, including
> directory updates, asynchronous.  Early file system benchmarks showed
> Linux beating the pants off of BSD, especially when lots of directory
> operations were involved.  This annoyed the BSD folks, who claimed
> that synchronous directory updates are required for reliable
> operation.  (As with most points of contention between Linux and BSD,
> this is both true and false...  See below.)
> 
> The problem with making directory operations asynchronous is that you
> then need to provide a way for the application to commit those changes
> to disk.  Otherwise, it is impossible to write reliable applications.
> 
> 
> BSD softupdates
> ===============
> 
> Sometime during the 90s, the BSD developers introduced "soft updates"
> to improve performance.  These do two things.  First, they make all
> file system operations asynchronous (like Linux).  Second, they extend
> the fsync() system call so that it commits to disk BOTH the file's
> data AND any directories via which the file might be accessed.
> 
> In other words, BSD with soft updates requires that you call fsync()
> on a file to commit any changes to its containing directory.  This is
> why the program above "works" on BSD.
> 
> Many programs are written these days to expect soft update semantics,
> because such algorithms will also work correctly under traditional
> FFS.
> 
> The problem with the softupdates approach is that finding all paths to
> a file is complex, and the Linux developers hate complexity.  Linux
> does NOT support this behavior for fsync() and probably never will.
> 
> 
> Standards
> =========
> 
> Quick aside: What do the relevant standards (POSIX, SUS) say?  Is
> Linux violating some standard here?
> 
> Well, different people, having read the standards, disagree on this
> point.  This itself means the standards are not clear (which is a bad
> thing for a standard).  This is probably because the standards were
> written when synchronous directory updates were the norm, and the
> authors did not even consider asynchronous directory updates.
> 
> 
> The Linux Solution
> ==================
> 
> The Linux answer is simple: If you want to flush a modified directory
> to disk, call fsync() on the directory.
> 
> In other words, to reliably create a file on Linux, you need to do
> something like this:
> 
>     #include <unistd.h>
>     #include <stdio.h>
>     #include <fcntl.h>
>     #include <string.h>
> 
>     int
>     main (int argc, char *argv[]) {
>       int fd, dirfd;
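>       const char *s = "Hello, world!\n";
> 
>       /* O_CREAT requires a third (mode) argument; 0644 is an arbitrary choice. */
>       fd = open ("/tmp/foo", O_WRONLY|O_CREAT|O_EXCL, 0644);
>       if (fd < 0) return 1;
> 
>       /* Open the containing directory so it can be fsync()ed below. */
>       dirfd = open ("/tmp", O_RDONLY);
>       if (dirfd < 0) return 2;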
> 
>       if (write (fd, s, strlen(s)) < 0) return 3;
>       if (fsync (fd) < 0) return 4;
>       if (close (fd) < 0) return 5;
>       if (fsync (dirfd) < 0) return 6;
>       if (close (dirfd) < 0) return 7;
> 
>       return 0;
>     }
> 
> If this program exits zero, the file /tmp/foo is guaranteed to be on
> disk and to have the correct contents.  This is true for ALL versions
> of the Linux kernel and ALL file systems.
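
The sequence above could be wrapped for reuse roughly as follows.
This is only a sketch under a few assumptions: the function names
(fsync_parent_dir, write_all, create_file_durably) are invented here,
the 0644 mode is arbitrary, and dirname(3) is used to find the
containing directory, which covers only the one path the caller
passed in, not other hard links to the file.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <fcntl.h>
#include <libgen.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: fsync() the directory that contains path. */
int fsync_parent_dir(const char *path)
{
    char *copy = strdup(path);          /* dirname() may modify its argument */
    if (copy == NULL)
        return -1;
    int dirfd = open(dirname(copy), O_RDONLY);
    free(copy);
    if (dirfd < 0)
        return -1;

    int rc = fsync(dirfd);
    int saved = errno;
    close(dirfd);
    errno = saved;
    return rc;
}

/* Write the whole buffer, retrying on short writes and EINTR. */
int write_all(int fd, const char *buf, size_t len)
{
    while (len > 0) {
        ssize_t n = write(fd, buf, len);
        if (n < 0) {
            if (errno == EINTR)
                continue;
            return -1;
        }
        buf += n;
        len -= (size_t)n;
    }
    return 0;
}

/* Create path exclusively, write data, and flush both the file and
 * its directory.  Returns 0 on success, -1 with errno set on failure. */
int create_file_durably(const char *path, const char *data, size_t len)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_EXCL, 0644);
    if (fd < 0)
        return -1;
    if (write_all(fd, data, len) < 0 || fsync(fd) < 0) {
        int saved = errno;
        close(fd);
        errno = saved;
        return -1;
    }
    if (close(fd) < 0)
        return -1;
    return fsync_parent_dir(path);
}
```

A caller would treat any -1 return as "the file may not survive a
crash" and report errno.
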
> 
> 
> Other choices
> =============
> 
> So you have written to the authors of your favorite MTA asking them to
> support Linux properly by using fsync() on directories.  They have
> responded saying that "Linux is broken".  (Be sure to ask them to
> justify this claim with chapter and verse from a standard.  It is sure
> to be interesting.)  What can you do?
> 
> If the application does all its work in one directory, or a few
> directories, you can do "chattr +S" on the directory.  This will cause
> all operations on that directory to be synchronous.
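
For reference, chattr(1) sets this attribute through an ioctl on the
directory.  A minimal sketch of doing the same from C, assuming a
file system that supports the FS_IOC_GETFLAGS/FS_IOC_SETFLAGS ioctls
and the FS_SYNC_FL flag from current <linux/fs.h> (ext2 and ext3 do;
others may simply return an error).  set_sync_attr() is a made-up
name, and the caller must own the directory or be root.

```c
#include <errno.h>
#include <fcntl.h>
#include <linux/fs.h>
#include <sys/ioctl.h>
#include <unistd.h>

/* Hypothetical equivalent of "chattr +S dir".
 * Returns 0 on success, -1 with errno set if the directory cannot be
 * opened or the file system rejects the request. */
int set_sync_attr(const char *dir)
{
    int fd = open(dir, O_RDONLY);
    if (fd < 0)
        return -1;

    int flags = 0;
    int rc = ioctl(fd, FS_IOC_GETFLAGS, &flags);
    if (rc == 0 && !(flags & FS_SYNC_FL)) {
        flags |= FS_SYNC_FL;            /* the 'S' attribute */
        rc = ioctl(fd, FS_IOC_SETFLAGS, &flags);
    }

    int saved = errno;
    close(fd);
    errno = saved;
    return rc;
}
```

Since support varies by file system, a failure here is best treated
as "fall back to the mount option or fix the application".
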
> 
> You can use the "-o sync" mount option.  This will cause ALL
> operations on that partition to be synchronous.  This solves the
> problem, but is likely to be slow.
> 
> In the current version of Linux, you can use the ext3 or ReiserFS file
> systems.  These happen to commit their journals to disk whenever
> fsync() is called, which has the side-effect of providing semantics
> like BSD's soft updates.  But note: This behavior is not guaranteed,
> and may change in future releases!
> 
> But really, the best idea is to convince application authors to
> support the "Linux way" for committing directory updates.  The
> semantics are simple, clear, and extremely efficient.  So go bug those
> MTA authors until they listen :-).
> 
> 
>  - Patrick LoPresti <patl@curl.com>
>    July 2002


-- 
________________________________________________________________
	J.W. Schultz            Pegasystems Technologies
	email address:		jw@pegasys.ws

		Remember Cernan and Schmitt
