* Re: [PATCH] 2.4.19-rc1/2.5.25 provide dummy fsync() routine for directories on NFS mounts
       [not found] <20020715075221.GC21470@uncarved.com>
@ 2002-07-15 12:45 ` Richard B. Johnson
  2002-07-15 13:35   ` Matthias Andree
  0 siblings, 1 reply; 31+ messages in thread
From: Richard B. Johnson @ 2002-07-15 12:45 UTC (permalink / raw)
  To: Sean Hunter; +Cc: Alan Cox, Trond Myklebust, nfs, linux-kernel

On Mon, 15 Jul 2002, Sean Hunter wrote:

> On Tue, Jul 09, 2002 at 03:50:17PM -0400, Richard B. Johnson wrote:
> > On Tue, 9 Jul 2002, Alan Cox wrote:
> >
> > > > That is what it's supposed to do with files. The attached code clearly
> > > > shows that it doesn't work with directories. The fsync() instantly
> > > > returns, even though there is buffered data still to be written.
> > >
> > > Your understanding or code is wrong. It's hard to tell which.
> > >
> > > fsync on the directory syncs the directory metadata not the file metadata
> > >
> >
> > Well the original complaint was that Linux NFS didn't allow a directory to
> > be fsync()ed. I showed that POSIX.4 doesn't provide for fsync()ing
> > directories, only files, that you have to fsync() individual files, not
> > the directories that contain them. Others said that fsync()ing individual
> > files was not necessary, that you only have to fsync() the directory. I
> > explained that you have to cheat to even get a fd that can be used
> > to fsync() a directory. Then I showed that fsync()ing a directory in this
> > manner doesn't work so, we are actually in violent agreement.
>
> I'm not sure whether or not you've got the gist with all the flamage and
> shrapnel flying about, however as I understand it, fsync on a directory fd
> ensures that all directory ops such as rename()s unlinks(), links() etc are
> committed, not that all data pending to all files in that dir are flushed.
>
> To get all changes you need to fsync the dirfd and all the fds of the files as
> well.
>
> Because directory changes (such as renames, unlinks etc) are synchronous on NFS
> any way, fsync() on a dir fd on an NFS mount can simply return. There will
> never be any outstanding dir ops to flush. ergo: no bug.
>
> Hope that's clear.
>
> Sean
>

NFS has characteristics that seem to make it 'special'. For instance,
you have a server that performs local actions on behalf of a remote
client. As long as the local server doesn't crash, everything it did
for the remote client is safe even if the remote client crashes and
burns. From the perspective of the remote client, it really doesn't
make much difference if it ever calls fsync() on anything as long as
the server doesn't crash.

Therefore, for discussion I will ignore NFS and other Client Server
file access systems. But just because they are special, it doesn't
mean that they should be treated specially.

Given the following:

    /1/2/3/4/5/6/7/8/9/file

... I suggest that it MUST be sufficient to fsync() 'file' to assure
that file data can be recovered. That's what POSIX.4 states. If the
implementation doesn't allow this, i.e., 'file' will end up in
'lost+found', then there is a problem that should be addressed.

This is because a local file user's program may not know the entire
directory tree. For example, in a chrooted environment. Also, the task
has no way of knowing what, if any, of these directory entries have
already been flushed to disk. A directory tree could, in principle, be
up to _POSIX_PATH_MAX entries in length.

In the beginning, when God created Unix, files and directories were
all the same. I could fix a bad directory entry with an editor. Over
the years, certain rules were established to prevent users from
accessing directories as files. They still are files, but the
Operating System(s) try their best to make sure you don't muck with
directories as files. So now you have to read a directory with
getdents(), actually that's not even POSIX, you need to use readdir().
Also, the directory will fail to be opened in other than read-only.
These are all artificial constraints, imposed to make sure you follow
the rules.

So, you get a read-only file-descriptor and fsync() it! What does that
mean? Obviously, the file must have existed previously to open it
read-only. Since I can't change its contents, because I opened it
read-only, fsync() can't do anything because I could not have altered
its contents.

So, let's say two tasks open the same file. One opens it read-only and
the other read-write. The read-write task is happily writing to the
file. The read-only task executes fsync(). Does this cause the writer
to wait until the file has been flushed to disk? I don't know, but if
it does, we have a very broken system where an unprivileged reader can
severely affect the performance of a file-server with a
denial-of-service attack.

So, I suggest that a read-only file-descriptor CANNOT cause the
contents of a file to be written. If it does, it's broken. Given this,
fsync() on a directory entry, accessed by a read-only file-descriptor,
can't do anything.

These are things that should be addressed rather than flamed away. I
think that the intent of fsync() on a file is to make certain that it
is on the physical media in a state from which it can be accessed
after a crash. If this is the intent, then playing games with
individual directories is not useful and fsync() on the read/write
file-descriptor actually updating the file should be sufficient.

Cheers,
Dick Johnson

Penguin : Linux version 2.4.18 on an i686 machine (797.90 BogoMips).

                 Windows-2000/Professional isn't.

^ permalink raw reply [flat|nested] 31+ messages in thread
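To make the mechanics under dispute concrete, here is a minimal sketch of how a program obtains a read-only descriptor for a directory and calls fsync() on it. The helper name sync_directory() is hypothetical and is not taken from any message in the thread. Whether the call actually flushes anything is exactly what the participants disagree about, and it may depend on the file system; the sketch only shows that the call sequence itself is ordinary.

```c
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper (not from the thread): open a directory read-only
 * and call fsync() on the resulting descriptor.
 * Returns 0 on success, -1 if open() or fsync() failed. */
int sync_directory(const char *path)
{
    /* A directory can only be opened read-only; this is the
     * "read-only file-descriptor" the message above talks about. */
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    /* What this flushes, if anything, is file-system dependent. */
    int rc = fsync(fd);

    close(fd);
    return rc < 0 ? -1 : 0;
}
```

A caller would simply write sync_directory("/var/spool/example"), where the path is again only an illustration.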
* Re: [PATCH] 2.4.19-rc1/2.5.25 provide dummy fsync() routine for directories on NFS mounts
  2002-07-15 12:45 ` [PATCH] 2.4.19-rc1/2.5.25 provide dummy fsync() routine for directories on NFS mounts Richard B. Johnson
@ 2002-07-15 13:35   ` Matthias Andree
       [not found]     ` <mit.lcs.mail.linux-kernel/20020715133507.GF32155@merlin.emma.line.org>
  2002-07-15 15:20     ` Bill Rugolsky Jr.
  0 siblings, 2 replies; 31+ messages in thread
From: Matthias Andree @ 2002-07-15 13:35 UTC (permalink / raw)
  To: linux-kernel

On Mon, 15 Jul 2002, Richard B. Johnson wrote:

> These are things that should be addressed rather than flamed
> away. I think that the intent of fsync() on a file is to make
> certain that it is on the physical media in a state from which
> it can be accessed after a crash. If this is the intent, then
> playing games with individual directories is not useful and
> fsync() on the read/write file-descriptor actually updating the
> file should be sufficient.

We had a similar discussion along the lines of an MTA roughly a year
ago, but without your (unquoted) objection that fsync() on a file
without write permit should be impossible.

The essence was that Linux 2.4 ext3fs and reiserfs guarantee that on
fsync(), the file is recoverable from the place it was created, 2.2 was
halfway there; but beware: only data=ordered or data=journal (in ext3fs,
as beta patch for reiserfs from
ftp.suse.com:/pub/people/mason/patches/data-logging/ <- from memory)
will guarantee that your file contents are recoverable.

This does not constitute any statement on JFS or XFS. I'm unaware of
their characteristics in fsync and directory update issues.

That aside, it would really be useful to get this "hog a writer" issue
ironed out either way, and that the illogical "fsync() a O_RDONLY"
file be resolved somehow.

For the data of users not acquainted with kernel intrinsics, the way
things are now is most dangerous, and I'd really ask that Andrew
Morton's dirsync() patches (where still necessary) and tool patches
(chattr, mount) be deployed NOW and that -o dirsync (call it noasync
for compatibility) be the default. A safety-speed tradeoff should
sacrifice safety only on explicit request, and mke2fs should be told
to generate ext3fs by default NOW.

The argumentation that Linux leaves the choice of when to sync
directory data to the application is nice, but not more, and having
this as tuning option is fine, but to quote Wietse Venema "it's
interesting to see that out of the box, Linux handles logging more
securely (sync writes) than email (async directory updates)". And
right he is.

Is fsync()ing directories any portable?

-- archived at: http://groups.google.com/groups?selm=89uj5c%242h2s%241%40FreeBSD.csie.NCTU.edu.tw&oe=utf-8&output=gplain

-- 
Matthias Andree

^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH] 2.4.19-rc1/2.5.25 provide dummy fsync() routine for directories on NFS mounts
       [not found]     ` <mit.lcs.mail.linux-kernel/20020715133507.GF32155@merlin.emma.line.org>
@ 2002-07-15 14:49       ` Patrick J. LoPresti
  2002-07-15 15:18         ` Matthias Andree
  2002-07-15 16:16         ` Alan Cox
  0 siblings, 2 replies; 31+ messages in thread
From: Patrick J. LoPresti @ 2002-07-15 14:49 UTC (permalink / raw)
  To: linux-kernel; +Cc: Matthias Andree

Matthias Andree <matthias.andree@stud.uni-dortmund.de> writes:

> We had a similar discussion along the lines of an MTA roughly a year
> ago, but without your (unquoted) objection that fsync() on a file
> without write permit should be impossible.

It was a long thread:

  http://groups.google.com/groups?threadm=linux.kernel.3B5FC7FB.D5AF0932%40zip.com.au
  http://lists.insecure.org/linux-kernel/2001/Aug/index.html#39

> The essence was that Linux 2.4 ext3fs and reiserfs guarantee that on
> fsync(), the file is recoverable from the place it was created, 2.2 was
> halfway there; but beware: only data=ordered or data=journal (in ext3fs,
> as beta patch for reiserfs from
> ftp.suse.com:/pub/people/mason/patches/data-logging/ <- from memory)
> will guarantee that your file contents are recoverable.

I do not recall anything about data=ordered or data=journal mode being
required. I thought someone authoritative (Stephen Tweedie?) said
that ext3 happens to commit the journal on fsync(), independent of the
journaling mode, but that this behavior was an implementation
coincidence and not guaranteed. (Unfortunately, I am having trouble
finding that message... Can someone familiar with the source confirm
or deny this?)

I would love to know what IS guaranteed. This fsync() question keeps
cropping up, and as far as I know there is no authoritative statement
anywhere about what Linux promises. "Read the source code" is the
wrong answer; implementations can change at any time. This is a
question about the interface, not the implementation. "See post XXX
on linux-kernel" is almost as bad.

> That aside, it would really be useful to get this "hog a writer" issue
> ironed out either way, and that the illogical "fsync() a O_RDONLY"
> file be resolved somehow.

It is a non-issue; no resolution is necessary. If I can even read or
write a single file on the same DISK (or bus) that some server process
uses, I can "hog its resources" and slow it down. Horrors! Is there
any solution??? Oh yeah, don't let me do that.

The only interesting question here is what the relevant standards say.
And if they allow fsync() at all on a read-only descriptor, then there
is pretty clearly only one thing that can mean. If you have a problem
with this behavior, then configure your precious servers to keep their
data unreadable by untrusted parties.

> Is fsync()ing directories any portable?

No, but apparently it is what Linux supports. If this were documented
clearly somewhere, maybe application authors could be convinced to
support it.

 - Pat

^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH] 2.4.19-rc1/2.5.25 provide dummy fsync() routine for directories on NFS mounts
  2002-07-15 14:49       ` Patrick J. LoPresti
@ 2002-07-15 15:18         ` Matthias Andree
       [not found]           ` <mit.lcs.mail.linux-kernel/20020715151833.GA22828@merlin.emma.line.org>
  2002-07-15 16:16         ` Alan Cox
  1 sibling, 1 reply; 31+ messages in thread
From: Matthias Andree @ 2002-07-15 15:18 UTC (permalink / raw)
  To: linux-kernel

On Mon, 15 Jul 2002, Patrick J. LoPresti wrote:

> I do not recall anything about data=ordered or data=journal mode being
> required. I thought someone authoritative (Stephen Tweedie?) said
> that ext3 happens to commit the journal on fsync(), independent of the
> journaling mode, but that this behavior was an implementation
> coincidence and not guaranteed. (Unfortunately, I am having trouble
> finding that message... Can someone familiar with the source confirm
> or deny this?)

I know about the "happens to...", but I think after that discussion,
they'd keep it that way.

The data= mode was not part of the past discussion, that's why I
brought this up now. However, reiserfs or ext3fs with data=writeback
only journal the fsync() metadata involved, not the order of data
(file contents) versus directory contents, so you can end up with a
"crash - journal replay - file with bogus contents" scenario. I've
seen this happen on ReiserFS and I was not too fond of it,
particularly not as I don't have "fast-access" backups, I need to read
a full file from SLR tape up to the point where the desired file is
stored.

> I would love to know what IS guaranteed. This fsync() question keeps
> cropping up, and as far as I know there is no authoritative statement
> anywhere about what Linux promises. "Read the source code" is the

Indeed not. A "file system codex" documenting these guarantees with
respect to path names (link, rename, directory updates) should be
written authoritatively and should remain valid from one kernel
revision until the next version (i.e. if documented for 2.4.18+, it
must not change before 2.5.x).

> > That aside, it would really be useful to get this "hog a writer" issue
> > ironed out either way, and that the illogical "fsync() a O_RDONLY"
> > file be resolved somehow.
>
> It is a non-issue; no resolution is necessary. If I can even read or
> write a single file on the same DISK (or bus) that some server process
> uses, I can "hog its resources" and slow it down. Horrors! Is there
> any solution??? Oh yeah, don't let me do that.

[IRONY DETECTED]

Seriously: imagine another process that opens the file your process is
writing into, but it itself has no write permission -- and busy loops
on fsync(). Should this fsync process really trigger flushing your
blocks although it has no write permissions, this _is_ a problem
unless you have some decent tagged queueing in place.

fsync() as per Open Group Base Specifications Issue 6 is allowed to
return EBADF, EINTR, EINVAL, EIO. Returning EINVAL for fsync(fd) after
fd = open("blah", O_RDONLY) does not sound unreasonable. You have
nothing to write in O_RDONLY, use O_RDWR or O_WRONLY instead.

> The only interesting question here is what the relevant standards say.
> And if they allow fsync() at all on a read-only descriptor, then there
> is pretty clearly only one thing that can mean. If you have a problem
> with this behavior, then configure your precious servers to keep their
> data unreadable by untrusted parties.

Or make fsync() a no-op, meaning "your process (group) has no data to
write", or return error... EINVAL.

> > Is fsync()ing directories any portable?
>
> No, but apparently it is what Linux supports. If this were documented
> clearly somewhere, maybe application authors could be convinced to
> support it.

I don't think so. They'd rather declare ReiserFS unsupported and go with
chattr +S. Seen that.

New implementations (Courier's maildrop) still rely on BSD FFS
"synchronous directory" semantics.

-- 
Matthias Andree

^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH] 2.4.19-rc1/2.5.25 provide dummy fsync() routine for directories on NFS mounts
       [not found]           ` <mit.lcs.mail.linux-kernel/20020715151833.GA22828@merlin.emma.line.org>
@ 2002-07-15 16:10             ` Patrick J. LoPresti
  2002-07-15 18:16               ` Matthias Andree
  0 siblings, 1 reply; 31+ messages in thread
From: Patrick J. LoPresti @ 2002-07-15 16:10 UTC (permalink / raw)
  To: linux-kernel

Matthias Andree <matthias.andree@stud.uni-dortmund.de> writes:

> The data= mode was not part of the past discussion, that's why I
> brought this up now. However, reiserfs or ext3fs with data=writeback
> only journal the fsync() metadata involved, not the order of data
> (file contents) versus directory contents, so you can end up with a
> "crash - journal replay - file with bogus contents" scenario.

This should not happen with a properly written application. fsync()
flushes a bunch of stuff to disk, but it normally makes no promise
about the ORDER in which that stuff goes out. fsync() itself is how
application authors can enforce an ordering on disk operations.

For example, a typical MTA might follow this paradigm:

  write temp file
  fsync()
  rename temp file to destination
  fsync()
  report success

(Yes, I know, "link/unlink" is more common in practice than rename().
But the principle is the same.)

Or, in the case of Postfix:

  write message file
  fsync()
  chmod +x message file
  fsync()
  report success

The first paradigm uses the presence of a directory entry to represent
"committed" data. The second uses a mode bit on the file.

Both of these paradigms work fine with data=writeback. Yes, they
require calling fsync() twice, but that is exactly what you need to
enforce the ordering constraints!

An MTA has two ordering constraints:

  1) Data must be flushed to disk before it is marked on disk as
     "committed". This is to ensure that, after a crash, the MTA does
     not read a corrupted mail file.

  2) Data must be marked on disk as "committed" before a success code
     is reported to the remote MTA. This is to ensure that no mail is
     lost.

The ext3 data=ordered mode enforces the first constraint for mailers
using the "rename" paradigm, eliminating the need for the first
fsync() call. But any MTA which relies on data=ordered semantics is
not only Linux-specific, but ext3/reiserfs specific!

Synchronous directory updates, a la FFS, enforce the second constraint
(again for the "rename" paradigm), eliminating the need for the second
fsync().

But to be robust across platforms and file systems, a mailer needs
both fsync() calls. (On Linux, you actually need to fsync() the
*directory*, not the file, for the "rename" paradigm. It would be
nice if we could convince MTA authors to do this.)

> I don't think so. They'd rather declare ReiserFS unsupported and go with
> chattr +S. Seen that.
>
> New implementations (Courier's maildrop) still rely on BSD FFS
> "synchronous directory" semantics.

Are you sure? Because that is ridiculous... Modern BSDs like to use
"soft updates", which need that second fsync() to commit the metadata.

So as long as fsync() commits the journal, either paradigm above
should work fine under any journaling mode.

Summary: *All* MTAs should call fsync() twice. The only issue is what
descriptors they should call it on, exactly :-).

 - Pat

^ permalink raw reply [flat|nested] 31+ messages in thread
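The "rename" paradigm from the message above, with the Linux variant in which the second fsync() goes to the directory descriptor, could be sketched roughly as follows. This is an illustration under assumptions, not code from any MTA: the function name deliver_message(), its parameters, the 0600 mode, and the fixed path buffer size are all hypothetical, and error handling is reduced to "clean up and return -1".

```c
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Hypothetical helper illustrating the "rename" paradigm:
 * write temp file, fsync() it, rename() it, fsync() the directory.
 * Returns 0 once both the data and the new name were handed to
 * fsync() successfully, -1 on any failure. */
int deliver_message(const char *dir, const char *tmp_name,
                    const char *final_name, const char *data, size_t len)
{
    char tmp_path[1024], final_path[1024];
    int n1 = snprintf(tmp_path, sizeof tmp_path, "%s/%s", dir, tmp_name);
    int n2 = snprintf(final_path, sizeof final_path, "%s/%s", dir, final_name);
    if (n1 < 0 || n1 >= (int)sizeof tmp_path ||
        n2 < 0 || n2 >= (int)sizeof final_path)
        return -1;

    int fd = open(tmp_path, O_WRONLY | O_CREAT | O_EXCL, 0600);
    if (fd < 0)
        return -1;

    /* Write the whole buffer; write() may accept fewer bytes than asked. */
    size_t off = 0;
    while (off < len) {
        ssize_t n = write(fd, data + off, len - off);
        if (n < 0) {
            close(fd);
            unlink(tmp_path);
            return -1;
        }
        off += (size_t)n;
    }

    /* Constraint 1: data is flushed before it is marked "committed". */
    if (fsync(fd) < 0) {
        close(fd);
        unlink(tmp_path);
        return -1;
    }
    if (close(fd) < 0) {
        unlink(tmp_path);
        return -1;
    }

    /* The directory entry under the final name is the commit marker. */
    if (rename(tmp_path, final_path) < 0) {
        unlink(tmp_path);
        return -1;
    }

    /* Constraint 2: the marker is flushed before success is reported.
     * On Linux this fsync() goes to the directory descriptor. */
    int dfd = open(dir, O_RDONLY);
    if (dfd < 0)
        return -1;
    int rc = fsync(dfd);
    close(dfd);
    return rc < 0 ? -1 : 0;
}
```

Only after deliver_message() returns 0 would a caller report success to the remote side; on a platform where fsync() of the file descriptor also commits the rename, the second call is redundant but harmless.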
* Re: [PATCH] 2.4.19-rc1/2.5.25 provide dummy fsync() routine for directories on NFS mounts
  2002-07-15 16:10             ` Patrick J. LoPresti
@ 2002-07-15 18:16               ` Matthias Andree
       [not found]                 ` <mit.lcs.mail.linux-kernel/20020715181650.GA20665@merlin.emma.line.org>
  0 siblings, 1 reply; 31+ messages in thread
From: Matthias Andree @ 2002-07-15 18:16 UTC (permalink / raw)
  To: linux-kernel

On Mon, 15 Jul 2002, Patrick J. LoPresti wrote:

> Matthias Andree <matthias.andree@stud.uni-dortmund.de> writes:
>
> > The data= mode was not part of the past discussion, that's why I
> > brought this up now. However, reiserfs or ext3fs with data=writeback
> > only journal the fsync() metadata involved, not the order of data
> > (file contents) versus directory contents, so you can end up with a
> > "crash - journal replay - file with bogus contents" scenario.
>
> This should not happen with a properly written application. fsync()
> flushes a bunch of stuff to disk, but it normally makes no promise
> about the ORDER in which that stuff goes out. fsync() itself is how
> application authors can enforce an ordering on disk operations.

Well, to some extent.

> For example, a typical MTA might follow this paradigm:
>
>   write temp file
>   fsync()
>   rename temp file to destination
>   fsync()

So does fsync() guarantee rename() persistence across crash on all file
systems and kernel versions? IIRC, no.

We might want to fill in a table, on the rows kernel release and file
system, on the columns whether

  1. fsync() syncs all directory updates up to the root,
  2. fsync() syncs rename properly,
  3. fsync() syncs link,
  4. fsync() syncs unlink (not too important, at least not for an MTA,
     if you ask me),
  5. offers dirsync,
  6. has dirsync on by default.

Very raw draft:

Linux 2.0
    ext2
    ufs
Linux 2.2
    ufs
    ext2
    ext3 0.0.7<mumble>
    reiserfs 3.5
    jfs
    xfs? don't think so.
Linux 2.4
    ufs
    ext2
    ext3 0.9.x
        1. yes  2. yes  3. yes  4. ?
        5. use patch, use sync, use chattr +S  6. no
    reiserfs 3.5
    reiserfs 3.6
        1. yes  2. yes  3. yes  4. ?  5. no, use sync  6. no
    jfs 1.0
    xfs 1.0
    xfs 1.1
    you name it

And for completeness:

Free/Net/OpenBSD
    ffs
        1. yes  2. yes  3. yes  4. yes  5. yes  6. yes
    ffs softupdates
        1. yes  2. yes  3. yes  4. ?  5. no  6. no
    ext2
    ufs
    lfs
    ^ editor vacancy...

>   report success
>
> (Yes, I know, "link/unlink" is more common in practice than rename().
> But the principle is the same.)

Doesn't matter, except that unlink over a crash is usually unsafe, the
file may reappear.

> Or, in the case of Postfix:
>
>   write message file
>   fsync()
>   chmod +x message file
>   fsync()
>   report success

That'd be inefficient for the double fsync().

Postfix is ahead of that: it omits the first fsync() you suggest,
because the +x flag, while necessary, is not sufficient to mark the
mail as "complete, further processing allowed". The "message file"
is a structured file format that has an "end" record at the end of
the file. The +x flag must be set AND this end marker must be present
for Postfix to process the message file. So the +x flag is just an
accelerator for the concurrent reader that won't even bother to look
into the file that lacks the +x flag.

  write - fchmod - fsync - close -> 250 Ok

is therefore sound in Postfix. (But beware of chmod: in publicly
accessible places like /tmp, this can be prone to races; use fchmod if
you have an open file descriptor at hand.)

> An MTA has two ordering constraints:
>
>   1) Data must be flushed to disk before it is marked on disk as
>      "committed". This is to ensure that, after a crash, the MTA does
>      not read a corrupted mail file.
>
>   2) Data must be marked on disk as "committed" before a success code
>      is reported to the remote MTA. This is to ensure that no mail is
>      lost.
>
> The ext3 data=ordered mode enforces the first constraint for mailers
> using the "rename" paradigm, eliminating the need for the first
> fsync() call. But any MTA which relies on data=ordered semantics is
> not only Linux-specific, but ext3/reiserfs specific!

You're right for the MTA AFAICT.

But let's keep this unspecific to the MTA. Unless fsync() is used to
enforce ordering, without data=ordered, journalled file systems can
"recreate" files that are not there. Undead you may call them if you
so like...

Let me claim that fsync() is beyond the common hobbyist hacker. Yes, I
have just put Asbestos underwear on :-)

> Synchronous directory updates, a la FFS, enforce the second constraint
> (again for the "rename" paradigm), eliminating the need for the second
> fsync().

...or for systems that don't sync the "new" path name created with
rename(2) from an open file descriptor...

> But to be robust across platforms and file systems, a mailer needs
> both fsync() calls. (On Linux, you actually need to fsync() the
> *directory*, not the file, for the "rename" paradigm. It would be
> nice if we could convince MTA authors to do this.)

...and this will not likely happen with Postfix. Wietse uses chattr +S,
and the Postfix queue only works reliably on systems that either (any
one alone is sufficient):

  1. mount the file system containing /var/spool/postfix with -o sync
  2. support chattr +S /var/spool/postfix
  3. behave the way BSD softdeps do, where fsync() also syncs all
     directory changes involved in a rename(2), all the way up to the
     mount point.

Postfix' local(8) daemon additionally relies on rename(2) being
synchronous (in Maildir delivery), it does not fsync() after rename.
OTOH, the file is completely in Maildir/tmp/somename, so it's not
really lost, just invisible.

It'd be interesting to know whether chattr +S Maildir/tmp/ would be
sufficient to make the rename ("tmp/somefile", "cur/somefile")
persistent.

> > New implementations (Courier's maildrop) still rely on BSD FFS
> > "synchronous directory" semantics.
>
> Are you sure? Because that is ridiculous... Modern BSDs like to use
> "soft updates", which need that second fsync() to commit the metadata.

Unless I misread maildrop, yes. Anyone is free to show otherwise, and
I will apologize for this false claim.

> Summary: *All* MTAs should call fsync() twice. The only issue is what
> descriptors they should call it on, exactly :-).

See above. Before that, we must know that fsync() syncs all directory
and file data and metadata (that makes four) all the way up to the mount
point. For Linux 2.0, 2.2, 2.4. For any file system and any mount
option. See the table project above ;-)

-- 
Matthias Andree

^ permalink raw reply [flat|nested] 31+ messages in thread
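The "write - fchmod - fsync - close" sequence described in the message above can be sketched as below. This is not Postfix source code: the helper names write_and_mark() and is_marked_complete(), the modes 0600 and 0700, and the idea of testing S_IXUSR with stat() are assumptions made for illustration, and the "end" record that the message says Postfix also requires is left out. Note that the next reply in the thread disputes whether a single fsync() after the mode change is enough.

```c
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical writer: create the file, write it, set the owner-execute
 * bit through the open descriptor, then fsync() once and close.
 * Returns 0 on success, -1 on failure. */
int write_and_mark(const char *path, const char *data, size_t len)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_EXCL, 0600);
    if (fd < 0)
        return -1;

    size_t off = 0;
    while (off < len) {
        ssize_t n = write(fd, data + off, len - off);
        if (n < 0) {
            close(fd);
            return -1;
        }
        off += (size_t)n;
    }

    /* fchmod() acts on the descriptor we already hold, so the mode
     * change cannot be redirected by someone replacing the path name,
     * which is the race the message warns about for chmod(). */
    if (fchmod(fd, 0700) < 0 || fsync(fd) < 0) {
        close(fd);
        return -1;
    }
    return close(fd) < 0 ? -1 : 0;
}

/* Hypothetical reader-side shortcut: skip files that lack the +x bit.
 * Returns 1 if the owner-execute bit is set, 0 otherwise. */
int is_marked_complete(const char *path)
{
    struct stat st;
    if (stat(path, &st) < 0)
        return 0;
    return (st.st_mode & S_IXUSR) != 0;
}
```

In this sketch the mode bit is only the cheap first check; a reader that follows the scheme described in the message would still have to look for the end marker inside the file.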
* Re: [PATCH] 2.4.19-rc1/2.5.25 provide dummy fsync() routine for directories on NFS mounts
       [not found]                 ` <mit.lcs.mail.linux-kernel/20020715181650.GA20665@merlin.emma.line.org>
@ 2002-07-15 18:56                   ` Patrick J. LoPresti
  2002-07-15 20:50                     ` Matthias Andree
  0 siblings, 1 reply; 31+ messages in thread
From: Patrick J. LoPresti @ 2002-07-15 18:56 UTC (permalink / raw)
  To: linux-kernel

Matthias Andree <matthias.andree@stud.uni-dortmund.de> writes:

> > For example, a typical MTA might follow this paradigm:
> >
> >   write temp file
> >   fsync()
> >   rename temp file to destination
> >   fsync()
>
> So does fsync() guarantee rename() persistence across crash on all file
> systems and kernel versions? IIRC, no.

It depends on what you fsync() :-).

On BSD, fsync() of a file's descriptor will commit the rename of that
file to disk. On Linux, fsync() of the *directory's* descriptor is
required. And yes, this will work across file systems and Linux
versions, according to Linus/Alan/etc.

> That'd be inefficient for the double fsync().

But it is necessary. See below.

> Postfix is ahead of that: it omits the first fsync() you suggest,
> because the +x flag, while necessary, is not sufficient to mark the
> mail as "complete, further processing allowed". The "message file"
> is a structured file format that has an "end" record at the end of
> the file.

This is not sufficient! Data writes are NOT guaranteed to be ordered.
It is permissible for the file system to flush the first and last
block of a file to disk BEFORE flushing the middle. You either need
the double fsync() or you need a checksum in the file; simple markers
are not enough to make a real guarantee. And MTAs should be making
real guarantees!

> But let's keep this unspecific to the MTA. Unless fsync() is used to
> enforce ordering, without data=ordered, journalled file systems can
> "recreate" files that are not there. Undead you may call them if you
> so like...

No, data=ordered has nothing to do with recreating dead files. What
data=ordered does is make sure bogus blocks do not appear in new files
(or in new extents of old files). Failing to call fsync() at all
(i.e., failing to commit metadata updates) is what can recreate dead
files.

> Postfix' local(8) daemon additionally relies on rename(2) being
> synchronous (in Maildir delivery), it does not fsync() after rename.
> OTOH, the file is completely in Maildir/tmp/somename, so it's not
> really lost, just invisible.

No, it is lost, because the file's creation is not guaranteed to have
happened at all! (Well, depending on the file system and the
semantics. I think I need to write this up more clearly.)

> > Summary: *All* MTAs should call fsync() twice. The only issue is what
> > descriptors they should call it on, exactly :-).
>
> See above. Before that, we must know that fsync() syncs all directory
> and file data and metadata (that makes four) all the way up to the mount
> point. For Linux 2.0, 2.2, 2.4. For any file system and any mount
> option. See the table project above ;-)

As I said, the issue is what descriptors they should call fsync() on.
On Linux, fsync() on a file's descriptor will commit the file's
contents; a second fsync() on the containing directory's descriptor
will commit the rename()/link().

 - Pat

^ permalink raw reply [flat|nested] 31+ messages in thread
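The point made above, that an end marker alone cannot show that the middle of a file reached the disk while a checksum over the contents can, might be illustrated with the following in-memory sketch. Everything here is an assumption for illustration: the record layout (payload followed by a 4-byte big-endian trailer), the function names, and the choice of the FNV-1a hash, which is picked only because it is short; a real queue format would more likely use CRC32 or something stronger.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TRAILER_LEN 4  /* hypothetical layout: payload + 4-byte checksum */

/* 32-bit FNV-1a hash, a simple non-cryptographic checksum. */
static uint32_t fnv1a32(const unsigned char *p, size_t n)
{
    uint32_t h = 2166136261u;
    for (size_t i = 0; i < n; i++) {
        h ^= p[i];
        h *= 16777619u;
    }
    return h;
}

/* Copy the payload into out and append the checksum trailer.
 * Returns the total record length, or 0 if out is too small. */
size_t seal_record(const unsigned char *payload, size_t n,
                   unsigned char *out, size_t cap)
{
    if (cap < TRAILER_LEN || n > cap - TRAILER_LEN)
        return 0;
    uint32_t h = fnv1a32(payload, n);
    memcpy(out, payload, n);
    out[n]     = (unsigned char)(h >> 24);
    out[n + 1] = (unsigned char)(h >> 16);
    out[n + 2] = (unsigned char)(h >> 8);
    out[n + 3] = (unsigned char)h;
    return n + TRAILER_LEN;
}

/* Returns 1 if the trailer matches the payload, 0 otherwise.
 * A record whose trailer exists but whose middle blocks never
 * reached the disk fails this check. */
int record_is_intact(const unsigned char *rec, size_t n)
{
    if (n < TRAILER_LEN)
        return 0;
    size_t body = n - TRAILER_LEN;
    uint32_t stored = ((uint32_t)rec[body] << 24) |
                      ((uint32_t)rec[body + 1] << 16) |
                      ((uint32_t)rec[body + 2] << 8) |
                      (uint32_t)rec[body + 3];
    return stored == fnv1a32(rec, body);
}
```

A reader that finds the trailer present but record_is_intact() returning 0 would treat the message as not yet committed, which is the guarantee a bare end marker cannot give.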
* Re: [PATCH] 2.4.19-rc1/2.5.25 provide dummy fsync() routine for directories on NFS mounts
  2002-07-15 18:56                   ` Patrick J. LoPresti
@ 2002-07-15 20:50                     ` Matthias Andree
  0 siblings, 0 replies; 31+ messages in thread
From: Matthias Andree @ 2002-07-15 20:50 UTC (permalink / raw)
  To: linux-kernel

[-- Attachment #1: msg.pgp --]
[-- Type: application/pgp, Size: 2472 bytes --]

^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH] 2.4.19-rc1/2.5.25 provide dummy fsync() routine for directories on NFS mounts
  2002-07-15 14:49       ` Patrick J. LoPresti
  2002-07-15 15:18         ` Matthias Andree
@ 2002-07-15 16:16         ` Alan Cox
  2002-07-15 15:19           ` Matthias Andree
  2002-07-15 15:38           ` Patrick J. LoPresti
  1 sibling, 2 replies; 31+ messages in thread
From: Alan Cox @ 2002-07-15 16:16 UTC (permalink / raw)
  To: Patrick J. LoPresti; +Cc: linux-kernel, Matthias Andree

On Mon, 2002-07-15 at 15:49, Patrick J. LoPresti wrote:

> I would love to know what IS guaranteed. This fsync() question keeps
> cropping up, and as far as I know there is no authoritative statement

Linus has explicitly stated what fsync on a directory does, during
several of the thousands of cycling repeated flamewars generated by MTA
authors

If that isn't definitive I don't know what is

^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH] 2.4.19-rc1/2.5.25 provide dummy fsync() routine for directories on NFS mounts
  2002-07-15 16:16         ` Alan Cox
@ 2002-07-15 15:19           ` Matthias Andree
  2002-07-15 16:45             ` Alan Cox
  2002-07-15 15:38           ` Patrick J. LoPresti
  1 sibling, 1 reply; 31+ messages in thread
From: Matthias Andree @ 2002-07-15 15:19 UTC (permalink / raw)
  To: linux-kernel

On Mon, 15 Jul 2002, Alan Cox wrote:

> On Mon, 2002-07-15 at 15:49, Patrick J. LoPresti wrote:
> > I would love to know what IS guaranteed. This fsync() question keeps
> > cropping up, and as far as I know there is no authoritative statement
>
> Linus has explicitly stated what fsync on a directory does, during
> several of the thousands of cycling repeated flamewars generated by MTA
> authors

That requires explicitly porting applications to Linux and is
unreasonable to expect from a usability point of view.

-- 
Matthias Andree

^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH] 2.4.19-rc1/2.5.25 provide dummy fsync() routine for directories on NFS mounts
  2002-07-15 15:19           ` Matthias Andree
@ 2002-07-15 16:45             ` Alan Cox
  0 siblings, 0 replies; 31+ messages in thread
From: Alan Cox @ 2002-07-15 16:45 UTC (permalink / raw)
  To: Matthias Andree; +Cc: linux-kernel

On Mon, 2002-07-15 at 16:19, Matthias Andree wrote:

> > Linus has explicitly stated what fsync on a directory does, during
> > several of the thousands of cycling repeated flamewars generated by MTA
> > authors
>
> That requires explicitly porting applications to Linux and is
> unreasonable to expect from a usability point of view.

Well bad luck then. POSIX and SuS forgot to specify a standard on this.
I've pointed other people at the standards committee to go fix it but
heard a deafening silence. Until then I'll run nice fast Linux ported
mail apps

^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH] 2.4.19-rc1/2.5.25 provide dummy fsync() routine for directories on NFS mounts
  2002-07-15 16:16         ` Alan Cox
  2002-07-15 15:19           ` Matthias Andree
@ 2002-07-15 15:38           ` Patrick J. LoPresti
  2002-07-15 16:55             ` Alan Cox
  1 sibling, 1 reply; 31+ messages in thread
From: Patrick J. LoPresti @ 2002-07-15 15:38 UTC (permalink / raw)
  To: Alan Cox; +Cc: linux-kernel, Matthias Andree

Alan Cox <alan@lxorguk.ukuu.org.uk> writes:

> Linus has explicitly stated what fsync on a directory does, during
> several of the thousands of cycling repeated flamewars generated by MTA
> authors
>
> If that isn't definitive I don't know what is

Documentation/fsync.txt would be better.

I mean, suppose I write to some MTA's authors to inform them that
their product is "broken on Linux" and tell them how to fix it.
They might think I am nuts, or that this behavior is an implementation
coincidence. (Some of them even seem to think Linux is not complying
with the relevant standards. That there is even an argument here
means that the standards themselves are broken; standards are supposed
to be very clear.)

To where should I refer these authors to convince them that this
really is how Linux behaves, by definition, now and forever? Should I
point them at the flamewars in various mailing list archives? Should
I suggest they write to Linus personally?

I would rather refer them to Documentation/fsync.txt. Do you agree?
Would you accept a patch to add it?

 - Pat

^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH] 2.4.19-rc1/2.5.25 provide dummy fsync() routine for directories on NFS mounts 2002-07-15 15:38 ` Patrick J. LoPresti @ 2002-07-15 16:55 ` Alan Cox 2002-07-15 15:29 ` [PATCH] 2.4.19-rc1/2.5.25 provide dummy fsync() routine fordirectories " Sandy Harris 2002-07-15 20:17 ` [PATCH] 2.4.19-rc1/2.5.25 provide dummy fsync() routine for directories " Patrick J. LoPresti 0 siblings, 2 replies; 31+ messages in thread From: Alan Cox @ 2002-07-15 16:55 UTC (permalink / raw) To: Patrick J. LoPresti; +Cc: linux-kernel, Matthias Andree On Mon, 2002-07-15 at 16:38, Patrick J. LoPresti wrote: > Alan Cox <alan@lxorguk.ukuu.org.uk> writes: > > > Linus has explicitly stated what fsync on a directory does, during > > several of the thousands of cycling repeated flamewars generated by MTA > > authors > > > > If that isnt definitive I don't know what is > > Documentation/fsync.txt would be better. Documentation/fs/fsync.txt or similar sounds a good idea ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH] 2.4.19-rc1/2.5.25 provide dummy fsync() routine fordirectories on NFS mounts 2002-07-15 16:55 ` Alan Cox @ 2002-07-15 15:29 ` Sandy Harris 2002-07-15 20:17 ` [PATCH] 2.4.19-rc1/2.5.25 provide dummy fsync() routine for directories " Patrick J. LoPresti 1 sibling, 0 replies; 31+ messages in thread From: Sandy Harris @ 2002-07-15 15:29 UTC (permalink / raw) To: linux-kernel Alan Cox wrote: > > On Mon, 2002-07-15 at 16:38, Patrick J. LoPresti wrote: > > Alan Cox <alan@lxorguk.ukuu.org.uk> writes: > > > > > Linus has explicitly stated what fsync on a directory does, during > > > several of the thousands of cycling repeated flamewars generated by MTA > > > authors > > > > > > If that isnt definitive I don't know what is > > > > Documentation/fsync.txt would be better. > > Documentation/fs/fsync.txt or similar sounds a good idea Why not just put it in the man page for fsync? ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH] 2.4.19-rc1/2.5.25 provide dummy fsync() routine for directories on NFS mounts
2002-07-15 16:55 ` Alan Cox
2002-07-15 15:29 ` [PATCH] 2.4.19-rc1/2.5.25 provide dummy fsync() routine fordirectories " Sandy Harris
@ 2002-07-15 20:17 ` Patrick J. LoPresti
2002-07-16 1:40 ` jw schultz
1 sibling, 1 reply; 31+ messages in thread
From: Patrick J. LoPresti @ 2002-07-15 20:17 UTC (permalink / raw)
To: Alan Cox; +Cc: linux-kernel, Matthias Andree
[-- Attachment #1: Type: text/plain, Size: 188 bytes --]
Alan Cox <alan@lxorguk.ukuu.org.uk> writes:

> Documentation/fs/fsync.txt or similar sounds a good idea

OK, attached is my first attempt at such a document.

What do you think?

 - Pat

[-- Attachment #2: fsync.txt --]
[-- Type: text/plain, Size: 5768 bytes --]
Linux fsync() semantics
(or, "How to create a file reliably")


Introduction
============

Consider the following C program:

    #include <unistd.h>
    #include <stdio.h>
    #include <fcntl.h>
    #include <string.h>

    int
    main (int argc, char *argv[]) {
      int fd;
      char *s = "Hello, world!\n";

      /* O_CREAT requires a mode argument */
      fd = open ("/tmp/foo", O_WRONLY|O_CREAT|O_EXCL, 0644);
      if (fd < 0) return 1;

      if (write (fd, s, strlen(s)) < 0) return 3;
      if (fsync (fd) < 0) return 4;
      if (close (fd) < 0) return 5;

      return 0;
    }

Question: If you compile and run this program, and it exits zero
(success), and your machine then crashes, is it guaranteed that the
file /tmp/foo will exist upon reboot?

Answer: On many Unices, including *BSD, yes.
        On Linux, NO.

How could this be? And what can you do about it?


History
=======

In the beginning was BSD with its Fast File System (FFS). Under FFS,
changes to directories were "synchronous", meaning they were committed
to disk before the system call (open/link/rename/etc.) returned.
Changes to files (write()) were asynchronous. The fsync() system call
allowed an application to force a file's pending writes to be
committed to persistent media.

In general, disks have reasonable throughput but horrible latency, so
it is much faster to write many things all at once rather than one at
a time. In other words, synchronous operations are slow.

Enter Linux. By default, Linux makes all operations, including
directory updates, asynchronous. Early file system benchmarks showed
Linux beating the pants off of BSD, especially when lots of directory
operations were involved. This annoyed the BSD folks, who claimed
that synchronous directory updates are required for reliable
operation. (As with most points of contention between Linux and BSD,
this is both true and false... See below.)

The problem with making directory operations asynchronous is that you
then need to provide a way for the application to commit those changes
to disk. Otherwise, it is impossible to write reliable applications.


BSD softupdates
===============

Sometime during the 90s, the BSD developers introduced "soft updates"
to improve performance. These do two things. First, they make all
file system operations asynchronous (like Linux). Second, they extend
the fsync() system call so that it commits to disk BOTH the file's
data AND any directories via which the file might be accessed.

In other words, BSD with soft updates requires that you call fsync()
on a file to commit any changes to its containing directory. This is
why the program above "works" on BSD.

Many programs are written these days to expect soft update semantics,
because such algorithms will also work correctly under traditional
FFS.

The problem with the softupdates approach is that finding all paths to
a file is complex, and the Linux developers hate complexity. Linux
does NOT support this behavior for fsync() and probably never will.


Standards
=========

Quick aside: What do the relevant standards (POSIX, SuS) say? Is
Linux violating some standard here?

Well, different people, having read the standards, disagree on this
point. This itself means the standards are not clear (which is a bad
thing for a standard). This is probably because the standards were
written when synchronous directory updates were the norm, and the
authors did not even consider asynchronous directory updates.


The Linux Solution
==================

The Linux answer is simple: If you want to flush a modified directory
to disk, call fsync() on the directory.

In other words, to reliably create a file on Linux, you need to do
something like this:

    #include <unistd.h>
    #include <stdio.h>
    #include <fcntl.h>
    #include <string.h>

    int
    main (int argc, char *argv[]) {
      int fd, dirfd;
      char *s = "Hello, world!\n";

      /* O_CREAT requires a mode argument */
      fd = open ("/tmp/foo", O_WRONLY|O_CREAT|O_EXCL, 0644);
      if (fd < 0) return 1;

      dirfd = open ("/tmp", O_RDONLY);
      if (dirfd < 0) return 2;

      if (write (fd, s, strlen(s)) < 0) return 3;
      if (fsync (fd) < 0) return 4;
      if (close (fd) < 0) return 5;
      if (fsync (dirfd) < 0) return 6;
      if (close (dirfd) < 0) return 7;

      return 0;
    }

If this program exits zero, the file /tmp/foo is guaranteed to be on
disk and to have the correct contents. This is true for ALL versions
of the Linux kernel and ALL file systems.


Other choices
=============

So you have written to the authors of your favorite MTA asking them to
support Linux properly by using fsync() on directories. They have
responded saying that "Linux is broken". (Be sure to ask them to
justify this claim with chapter and verse from a standard. It is sure
to be interesting.) What can you do?

If the application does all its work in one directory, or a few
directories, you can do "chattr +S" on the directory. This will cause
all operations on that directory to be synchronous.

You can use the "-o sync" mount option. This will cause ALL
operations on that partition to be synchronous. This solves the
problem, but is likely to be slow.

In the current version of Linux, you can use the ext3 or ReiserFS file
systems. These happen to commit their journals to disk whenever
fsync() is called, which has the side-effect of providing semantics
like BSD's soft updates. But note: This behavior is not guaranteed,
and may change in future releases!

But really, the best idea is to convince application authors to
support the "Linux way" for committing directory updates. The
semantics are simple, clear, and extremely efficient. So go bug those
MTA authors until they listen :-).


 - Patrick LoPresti <patl@curl.com>
   July 2002
^ permalink raw reply [flat|nested] 31+ messages in thread
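The sequence fsync.txt describes (write the file, fsync() it, then fsync() the containing directory) can be packaged as a reusable helper. What follows is only a sketch under stated assumptions: create_file_durably() is a hypothetical name that appears nowhere in the thread, the write loop and the O_DIRECTORY flag are additions for robustness, and the durability it buys is whatever the underlying filesystem actually gives fsync().

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <fcntl.h>
#include <limits.h>
#include <stdio.h>
#include <unistd.h>

/*
 * Hypothetical helper (not from fsync.txt): create dir/name with the given
 * contents.  Returns 0 only after fsync() has succeeded on both the file
 * and its directory; returns -1 with errno set otherwise.
 */
int create_file_durably(const char *dir, const char *name,
                        const void *data, size_t len)
{
    char path[PATH_MAX];
    const char *p = data;
    int fd, dirfd, saved, n;

    n = snprintf(path, sizeof path, "%s/%s", dir, name);
    if (n < 0 || (size_t)n >= sizeof path) {
        errno = ENAMETOOLONG;
        return -1;
    }

    /* A directory can only be opened read-only; that suffices for fsync(). */
    dirfd = open(dir, O_RDONLY | O_DIRECTORY);
    if (dirfd < 0)
        return -1;

    /* O_CREAT needs a mode argument. */
    fd = open(path, O_WRONLY | O_CREAT | O_EXCL, 0644);
    if (fd < 0)
        goto fail_dir;

    /* write() may be short or interrupted, so loop until done. */
    while (len > 0) {
        ssize_t w = write(fd, p, len);
        if (w < 0) {
            if (errno == EINTR)
                continue;
            goto fail_file;
        }
        p += w;
        len -= (size_t)w;
    }

    if (fsync(fd) < 0)          /* the file's data */
        goto fail_file;
    if (close(fd) < 0)
        goto fail_dir;
    if (fsync(dirfd) < 0)       /* the directory entry naming the file */
        goto fail_dir;
    return close(dirfd);

fail_file:
    saved = errno;
    close(fd);
    errno = saved;
fail_dir:
    saved = errno;
    close(dirfd);
    errno = saved;
    return -1;
}
```

A caller would use it as create_file_durably("/tmp", "foo", s, strlen(s)) and treat any nonzero return as "not known to be on disk".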
* Re: [PATCH] 2.4.19-rc1/2.5.25 provide dummy fsync() routine for directories on NFS mounts
2002-07-15 20:17 ` [PATCH] 2.4.19-rc1/2.5.25 provide dummy fsync() routine for directories " Patrick J. LoPresti
@ 2002-07-16 1:40 ` jw schultz
0 siblings, 0 replies; 31+ messages in thread
From: jw schultz @ 2002-07-16 1:40 UTC (permalink / raw)
To: linux-kernel
On Mon, Jul 15, 2002 at 04:17:01PM -0400, Patrick J. LoPresti wrote:
> Alan Cox <alan@lxorguk.ukuu.org.uk> writes:
>
> > Documentation/fs/fsync.txt or similar sounds a good idea
>
> OK, attached is my first attempt at such a document.
>
> What do you think?
>
> - Pat
>

Nice and clear. I expect it also applies to unlink(2) and rename(2).
A simplified version of this with a list of popular "broken" MTAs and
other spooling utilities might also go into the faq with a strong
emphasis on the chattr and mount options.

Content-Description: fsync.txt
> Linux fsync() semantics
> (or, "How to create a file reliably")
> [...]

--
________________________________________________________________
J.W. Schultz Pegasystems Technologies
email address: jw@pegasys.ws
Remember Cernan and Schmitt
^ permalink raw reply [flat|nested] 31+ messages in thread
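If, as suggested above, the same rule applies to unlink(2) and rename(2), the pattern would be to perform the directory operation and then fsync() the directory it happened in. A minimal sketch under that assumption follows; rename_durably(), unlink_durably() and the two static helpers are hypothetical names, and only the same-directory case is handled (a rename across directories would presumably need both directories synced).

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <fcntl.h>
#include <limits.h>
#include <stdio.h>
#include <unistd.h>

/* Hypothetical helper: fsync() the directory at dirpath. */
static int sync_dir(const char *dirpath)
{
    int fd = open(dirpath, O_RDONLY | O_DIRECTORY);
    int rc, saved;

    if (fd < 0)
        return -1;
    rc = fsync(fd);
    saved = errno;
    close(fd);
    errno = saved;
    return rc;
}

/* Hypothetical helper: build "dir/name" into out, failing if it does not fit. */
static int join_path(char *out, size_t outlen, const char *dir, const char *name)
{
    int n = snprintf(out, outlen, "%s/%s", dir, name);

    if (n < 0 || (size_t)n >= outlen) {
        errno = ENAMETOOLONG;
        return -1;
    }
    return 0;
}

/* Rename dir/from to dir/to, then commit the directory change. */
int rename_durably(const char *dir, const char *from, const char *to)
{
    char src[PATH_MAX], dst[PATH_MAX];

    if (join_path(src, sizeof src, dir, from) < 0 ||
        join_path(dst, sizeof dst, dir, to) < 0)
        return -1;
    if (rename(src, dst) < 0)
        return -1;
    return sync_dir(dir);
}

/* Remove dir/name, then commit the directory change. */
int unlink_durably(const char *dir, const char *name)
{
    char path[PATH_MAX];

    if (join_path(path, sizeof path, dir, name) < 0)
        return -1;
    if (unlink(path) < 0)
        return -1;
    return sync_dir(dir);
}
```

An MTA that spools to a temporary name and renames into place would call rename_durably() as the last step before acknowledging the message.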
* Re: [PATCH] 2.4.19-rc1/2.5.25 provide dummy fsync() routine for directories on NFS mounts
2002-07-15 13:35 ` Matthias Andree
[not found] ` <mit.lcs.mail.linux-kernel/20020715133507.GF32155@merlin.emma.line.org>
@ 2002-07-15 15:20 ` Bill Rugolsky Jr.
2002-07-15 15:35 ` Matthias Andree
1 sibling, 1 reply; 31+ messages in thread
From: Bill Rugolsky Jr. @ 2002-07-15 15:20 UTC (permalink / raw)
To: linux-kernel; +Cc: Matthias Andree
On Mon, Jul 15, 2002 at 03:35:07PM +0200, Matthias Andree wrote:
> For the data of users not acquainted with kernel intrinsics, the way
> things are now are most dangerous, and I'd really ask that Andrew
> Morton's dirsync() patches (where still necessary) and tool patches
> (chattr, mount) be deployed NOW and that -o dirsync (call it noasync for
> compatibility) be the default. A safety-speed tradeoff should only
> sacrifice safety at the explicit request and mke2fs should be told to
> generate ext3fs by default NOW.

Put dirsync in 2.4? Sure, good idea. Dangerous without it? To whom?

Explain how it is dangerous? The journalling filesystems perform
directory updates as transactions. It's dangerous to your MTA
perhaps. Andrew Morton has bent over backwards to find and fix bugs in
the synchronous write logic and to provide what you wanted, i.e.,
dirsync. He and Chris Mason fixed performance problems in ext3 and
Reiserfs. Reread the thread -- you insisted repeatedly that you just
wanted dirsync. Or was that just the opening gambit?

> The argumentation that Linux leaves the choice of when to sync directory
> data to the application is nice, but not more, and having this as tuning
> option is fine, but to quote Wietse Venema "it's interesting to see that
> out of the box, Linux handles logging more securely (sync writes) than
> email (async directory updates)". And right he is.

With all due respect to Wietse, that's nonsense: synchronous write
in syslog or other logging facilities is a *userspace* policy issue.
Default synchronous directory updates is a *kernel* policy issue.

I don't have dirsync handy at the moment, so I can't test, but
I have to ask: have you tried the simple (and IMHO devastating) benchmark
that I posted back on 2001-08-02 comparing Linux to Solaris file creation,

http://marc.theaimsgroup.com/?l=linux-kernel&m=99678208121947&w=2

i.e., copy a file tree (XFree86-4.1, 33027 files) with hard links.

Recall:

Solaris: 363.46s real 0.84s user 10.13s system
Ext2: real 0m3.823s user 0m0.240s sys 0m3.570s
Ext3: real 0m5.106s user 0m0.200s sys 0m3.700s

"dirsync" gives you what you want; please mount /var (or wherever)
-o dirsync and leave the kernel defaults as they are.

Regards,

Bill Rugolsky
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH] 2.4.19-rc1/2.5.25 provide dummy fsync() routine for directories on NFS mounts 2002-07-15 15:20 ` Bill Rugolsky Jr. @ 2002-07-15 15:35 ` Matthias Andree 2002-07-15 16:14 ` Bill Rugolsky Jr. 0 siblings, 1 reply; 31+ messages in thread From: Matthias Andree @ 2002-07-15 15:35 UTC (permalink / raw) To: linux-kernel On Mon, 15 Jul 2002, Bill Rugolsky Jr. wrote: > Put dirsync in 2.4? Sure, good idea. Dangerous without it? To whom? > > Explain how it is dangerous? The journalling filesystems perform > directory updates as transactions. It's dangerous to your MTA > perhaps. Andrew Morton has bent over backwards to find and fix bugs in > the synchronous write logic and to provide what you wanted, i.e., > dirsync. He and Chris Mason fixed performance problems in ext3 and > Reiserfs. Reread the thread -- you insisted repeatedly that you just > wanted dirsync. Or was that just the opening gambit? The code is there, for ext3, but not for reiserfs. A year has passed, but still, dirsync is not the default. This is directed towards the maintainers of the kernel, not towards Andrew Morton. > With all due respect to Wieste, that's nonsense: synchronous write > in syslog or other logging facilities is a *userspace* policy issue. > Default synchronous directory updates is a *kernel* policy issue. I'm well aware of this, and that _by_default_ user-space is more cautious than kernel-space is beyond my horizon, I'm afraid. Of course, these things are not really related, as syslog and Linux kernel are separate projects, but still, it looks strange from the outside. > I don't have dirsync handy at the moment, so I can't test, but > I have to ask: have you tried the simple (and IMHO devastating) benchmark > that I posted back on 2001-08-02 comparing Linux to Solaris file creation, > > http://marc.theaimsgroup.com/?l=linux-kernel&m=99678208121947&w=2 > > i.e., copy a file tree (XFree86-4.1, 33027 files) with hard links. 
Nope, I prefer not to play disk hogging games on my Solaris boxen, both of which are in production :-) > Recall: > > Solaris: 363.46s real 0.84s user 10.13s system > Ext2: real 0m3.823s user 0m0.240s sys 0m3.570s > Ext3: real 0m5.106s user 0m0.200s sys 0m3.700s > > "dirsync" gives you what you want; please mount /var (or wherever) > -o dirsync and leave the kernel defaults as they are. /var and /home, indeed. So you prefer speed over safety. That's fine. But that's not sane for a kernel to do. Cheating benchmarks is what others may call it. I just call it sad. -- Matthias Andree ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH] 2.4.19-rc1/2.5.25 provide dummy fsync() routine for directories on NFS mounts
2002-07-15 15:35 ` Matthias Andree
@ 2002-07-15 16:14 ` Bill Rugolsky Jr.
0 siblings, 0 replies; 31+ messages in thread
From: Bill Rugolsky Jr. @ 2002-07-15 16:14 UTC (permalink / raw)
To: linux-kernel; +Cc: Matthias Andree
On Mon, Jul 15, 2002 at 05:35:53PM +0200, Matthias Andree wrote:
> The code is there, for ext3, but not for reiserfs. A year has passed,
> but still, dirsync is not the default. This is directed towards the
> maintainers of the kernel, not towards Andrew Morton.

I'm in violent agreement that it should go into 2.4 *now that it is
merged in 2.5*. You may have noticed that Marcelo has been occupied
with a few other issues (VM, IDE).

> > I don't have dirsync handy at the moment, so I can't test, but
> > I have to ask: have you tried the simple (and IMHO devastating) benchmark
> > that I posted back on 2001-08-02 comparing Linux to Solaris file creation,
> >
> > http://marc.theaimsgroup.com/?l=linux-kernel&m=99678208121947&w=2
> >
> > i.e., copy a file tree (XFree86-4.1, 33027 files) with hard links.
>
> Nope, I prefer not to play disk hogging games on my Solaris boxen, both
> of which are in production :-)

I'm not asking you to do it on your Solaris boxen -- I couldn't give a
damn about slow, buggy Solaris. I'm asking whether you have tested this
on ext2/ext3 with/without dirsync. The gentlemanly thing to do when
asking for a change to the kernel is to (honestly) assess its impact.

> So you prefer speed over safety. That's fine. But that's not sane for a
> kernel to do. Cheating benchmarks is what others may call it. I just
> call it sad.

Cheating benchmarks -- bah! Safety for *one* (naive) application class!
dirsync buys me no useful safety on my build host, all it will do is
slow down things like rpmbuild --rebuild.

This is all rather silly. An MTA requires configuration, so what is the
difficulty in using -o dirsync, or alternatively, and quite a bit more
simply, executing chattr +D on the spool directory?

It's quite simple: put dirsync in the kernel and tools, then add
chattr +D to the post-install scripts for your favorite package manager.

 - Bill Rugolsky
^ permalink raw reply [flat|nested] 31+ messages in thread
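For a package that would rather set the attribute from C than shell out to chattr, the following is a rough sketch of what "chattr +D" amounts to on a current Linux system. It rests on assumptions that are not from the thread: it uses the generic FS_IOC_GETFLAGS / FS_IOC_SETFLAGS ioctls and FS_DIRSYNC_FL from <linux/fs.h> (the 2002-era kernels discussed here would have used filesystem-specific equivalents), set_dirsync() is a hypothetical name, and whether the flag is honoured depends on the filesystem.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <fcntl.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/fs.h>   /* FS_IOC_GETFLAGS, FS_IOC_SETFLAGS, FS_DIRSYNC_FL */

/*
 * Hypothetical helper: set the "synchronous directory updates" inode flag
 * on dirpath.  Returns 0 on success, -1 with errno set on failure (for
 * example when the filesystem does not implement these ioctls).
 */
int set_dirsync(const char *dirpath)
{
    int fd, flags = 0, rc = -1, saved;

    fd = open(dirpath, O_RDONLY | O_DIRECTORY);
    if (fd < 0)
        return -1;

    /* Read the current inode flags, add DIRSYNC, write them back. */
    if (ioctl(fd, FS_IOC_GETFLAGS, &flags) == 0) {
        flags |= FS_DIRSYNC_FL;
        rc = ioctl(fd, FS_IOC_SETFLAGS, &flags);
    }

    saved = errno;
    close(fd);
    errno = saved;
    return rc;
}
```

A post-install step could call set_dirsync() on the spool directory and fall back to documenting the "-o dirsync" mount option if it fails.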
* [PATCH] 2.4.19-rc1/2.5.25 provide dummy fsync() routine for directories on NFS mounts
@ 2002-07-09 13:49 Trond Myklebust
2002-07-09 14:06 ` Richard B. Johnson
0 siblings, 1 reply; 31+ messages in thread
From: Trond Myklebust @ 2002-07-09 13:49 UTC (permalink / raw)
To: nfs, linux-kernel
[-- Attachment #1: Type: text/plain, Size: 361 bytes --]
Hi,
There was a bug reported on the 'exim' user list a couple of months ago:
the Linux NFS client reports -EINVAL if you try to fsync() a directory.
The correct response would be to return a dummy '0' for success, since all
NFS operations that change the directory are supposed to be performed
synchronously on the server anyway...
Cheers,
Trond
[-- Attachment #2: linux-2.4.19-fsync_dir.dif --]
[-- Type: text/plain, Size: 1071 bytes --]
diff -u --recursive --new-file linux-2.4.19-rc1/fs/nfs/dir.c linux-2.4.19-fsync_dir/fs/nfs/dir.c
--- linux-2.4.19-rc1/fs/nfs/dir.c Tue Mar 12 16:35:02 2002
+++ linux-2.4.19-fsync_dir/fs/nfs/dir.c Tue Jul 9 15:41:29 2002
@@ -45,12 +45,14 @@
static int nfs_mknod(struct inode *, struct dentry *, int, int);
static int nfs_rename(struct inode *, struct dentry *,
struct inode *, struct dentry *);
+static int nfs_fsync_dir(struct file *, struct dentry *, int);
struct file_operations nfs_dir_operations = {
read: generic_read_dir,
readdir: nfs_readdir,
open: nfs_open,
release: nfs_release,
+ fsync: nfs_fsync_dir
};
struct inode_operations nfs_dir_inode_operations = {
@@ -401,6 +403,15 @@
return 0;
}
+/*
+ * All directory operations under NFS are synchronous, so fsync()
+ * is a dummy operation.
+ */
+int nfs_fsync_dir(struct file *filp, struct dentry *dentry, int datasync)
+{
+ return 0;
+}
+
/*
* A check for whether or not the parent directory has changed.
* In the case it has, we assume that the dentries are untrustworthy
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH] 2.4.19-rc1/2.5.25 provide dummy fsync() routine for directories on NFS mounts
2002-07-09 13:49 Trond Myklebust
@ 2002-07-09 14:06 ` Richard B. Johnson
2002-07-09 14:08 ` Trond Myklebust
2002-07-10 6:33 ` Alex Riesen
0 siblings, 2 replies; 31+ messages in thread
From: Richard B. Johnson @ 2002-07-09 14:06 UTC (permalink / raw)
To: Trond Myklebust; +Cc: nfs, linux-kernel
On Tue, 9 Jul 2002, Trond Myklebust wrote:
> Hi,
>
> There was a bug reported on the 'exim' user list a couple of months ago:
> the Linux NFS client reports -EINVAL if you try to fsync() a directory.
>
> The correct response would be to return a dummy '0' for success, since all
> NFS operations that change the directory are supposed to be performed
> synchronously on the server anyway...
>
> Cheers,
> Trond
>
>

Isn't it supposed to return EINVAL if "fd is bound to a file which doesn't
support synchronization..." That's what POSIX 4 says.

Errors:
EBADF   fildes is not a valid file descriptor.
EINVAL  The file descriptor is valid, but the system doesn't support
        fsync on this particular file.

I think code that opens a directory as a file is broken. We have
opendir() for that and it returns a DIR pointer, not a file descriptor.
If the directory was properly opened, one would never attempt to
fsync() it.

Cheers,
Dick Johnson

Penguin : Linux version 2.4.18 on an i686 machine (797.90 BogoMips).

Windows-2000/Professional isn't.
^ permalink raw reply [flat|nested] 31+ messages in thread
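Until a fix like the patch above is deployed, an application that wants to run on both kinds of system has to cope with either answer from fsync() on a directory. A minimal sketch of how it might tell the cases apart follows; fsync_dir_tolerant() and the DIR_SYNC_* names are hypothetical, and whether "unsupported" can safely be ignored is an assumption that only holds where directory operations are already synchronous, as is argued for NFS above.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical result codes for fsync_dir_tolerant(). */
enum {
    DIR_SYNC_ERROR = -1,        /* open() or fsync() failed; errno is set  */
    DIR_SYNC_OK = 0,            /* fsync() on the directory succeeded      */
    DIR_SYNC_UNSUPPORTED = 1    /* fsync() returned EINVAL for this object */
};

/*
 * Hypothetical helper: try to fsync() a directory and report EINVAL
 * separately so that the caller can decide what to do about it.
 */
int fsync_dir_tolerant(const char *dirpath)
{
    int fd = open(dirpath, O_RDONLY | O_DIRECTORY);
    int rc, saved;

    if (fd < 0)
        return DIR_SYNC_ERROR;

    rc = fsync(fd);
    saved = errno;
    close(fd);
    errno = saved;

    if (rc == 0)
        return DIR_SYNC_OK;
    if (saved == EINVAL)
        return DIR_SYNC_UNSUPPORTED;
    return DIR_SYNC_ERROR;
}
```

The point of the third return value is that the caller, not the helper, decides whether an unsupported directory fsync() is fatal.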
* Re: [PATCH] 2.4.19-rc1/2.5.25 provide dummy fsync() routine for directories on NFS mounts 2002-07-09 14:06 ` Richard B. Johnson @ 2002-07-09 14:08 ` Trond Myklebust 2002-07-09 15:06 ` Richard B. Johnson 2002-07-10 6:33 ` Alex Riesen 1 sibling, 1 reply; 31+ messages in thread From: Trond Myklebust @ 2002-07-09 14:08 UTC (permalink / raw) To: root; +Cc: nfs, linux-kernel >>>>> " " == Richard B Johnson <root@chaos.analogic.com> writes: > I think code that opens a directory as a file is broken. We > have opendir() for that and it returns a DIR pointer, not a > file descriptor. If the directory was properly opened, one > would never attempt to fsync() it. fsync() is supported on directories on local filesystems as a way of ensuring that changes (due to file creation etc) are committed to disk. Where is the POSIX violation in that? There is no reason why NFS, which ensures this anyway, should not adhere to this convention. Cheers, Trond ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH] 2.4.19-rc1/2.5.25 provide dummy fsync() routine for directories on NFS mounts
2002-07-09 14:08 ` Trond Myklebust
@ 2002-07-09 15:06 ` Richard B. Johnson
2002-07-09 16:56 ` Alan Cox
0 siblings, 1 reply; 31+ messages in thread
From: Richard B. Johnson @ 2002-07-09 15:06 UTC (permalink / raw)
To: Trond Myklebust; +Cc: nfs, linux-kernel
On Tue, 9 Jul 2002, Trond Myklebust wrote:
> >>>>> " " == Richard B Johnson <root@chaos.analogic.com> writes:
>
> > I think code that opens a directory as a file is broken. We
> > have opendir() for that and it returns a DIR pointer, not a
> > file descriptor. If the directory was properly opened, one
> > would never attempt to fsync() it.
>
> fsync() is supported on directories on local filesystems as a way of
> ensuring that changes (due to file creation etc) are committed to
> disk. Where is the POSIX violation in that?
>
> There is no reason why NFS, which ensures this anyway, should
> not adhere to this convention.
>
> Cheers,
> Trond
> -

Well, no. It's not supported. You can't get a valid file-descriptor...

    #include <stdio.h>
    #include <unistd.h>
    #include <fcntl.h>

    int main()
    {
        int fd;
        fd = open("/", O_RDWR, 0);
        fsync(fd);
    }

execve("./xxx", ["xxx"], [/* 32 vars */]) = 0
brk(0) = 0x804966c
open("/etc/ld.so.preload", O_RDONLY) = -1 ENOENT (No such file or directory)
open("/lib/libc.so.6", O_RDONLY) = 3
old_mmap(NULL, 4096, PROT_READ, MAP_PRIVATE, 3, 0) = 0x4000c000
munmap(0x4000c000, 4096) = 0
old_mmap(NULL, 644232, PROT_READ|PROT_EXEC, MAP_PRIVATE, 3, 0) = 0x4000c000
mprotect(0x40097000, 74888, PROT_NONE) = 0
old_mmap(0x40097000, 24576, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED, 3, 0x8b000) = 0x40097000
old_mmap(0x4009d000, 50312, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x4009d000
close(3) = 0
mprotect(0x4000c000, 569344, PROT_READ|PROT_WRITE) = 0
mprotect(0x4000c000, 569344, PROT_READ|PROT_EXEC) = 0
personality(PER_LINUX) = 0
getpid() = 27544
open("/", O_RDWR) = -1 EISDIR (Is a directory)

There are ways to 'cheat' and obtain a file-descriptor that references
a directory, but cheating is against POSIX rules, also. You can open it
read-only. But, Read-Only means that you can't update it, so fsync means
nothing, will return 0 because it is already "whatever it was" since you
can't modify it...

getpid() = 27568
open("/", O_RDONLY) = 3
fsync(3) = 0
_exit(0) = ?

My reading is that you need to fsync() every file within a directory
to fsync() a directory. Playing tricks with a directory inode doesn't
do it.

Regardless, POSIX.4 declines to state exactly what "successfully
transferred" means when it states that fsync() doesn't return until
all data has been successfully transferred to the disk or underlying
hardware. This is a real problem for a network file-system where data
that will eventually get to a file-server in the Congo may be en-route
for several minutes.

If an application insists, it is up to the application to determine,
probably once upon startup, just what kind of file synchronization
is supported.

Cheers,
Dick Johnson

Penguin : Linux version 2.4.18 on an i686 machine (797.90 BogoMips).

Windows-2000/Professional isn't.
^ permalink raw reply [flat|nested] 31+ messages in thread
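The two observations in the trace above (O_RDWR on a directory fails with EISDIR, O_RDONLY succeeds) can be reproduced without strace, and a descriptor can also be obtained from a properly opened DIR stream. The sketch below assumes a system that provides dirfd(), which current POSIX does; whether it was available on the systems discussed in 2002 is not something the thread states. Both function names are hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <dirent.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Hypothetical helper: report the errno produced by opening a directory
 * read-write, or 0 if the open unexpectedly succeeds.
 */
int open_dir_rdwr_errno(const char *dirpath)
{
    int fd = open(dirpath, O_RDWR);

    if (fd >= 0) {
        close(fd);
        return 0;
    }
    return errno;
}

/*
 * Hypothetical helper: open the directory with opendir(), then fsync()
 * the descriptor that backs the DIR stream.
 */
int fsync_via_opendir(const char *dirpath)
{
    DIR *d = opendir(dirpath);
    int rc, saved;

    if (d == NULL)
        return -1;

    rc = fsync(dirfd(d));   /* dirfd() exposes the stream's descriptor */
    saved = errno;
    closedir(d);            /* also closes the underlying descriptor */
    errno = saved;
    return rc;
}
```

Neither route requires write access to the directory; what the fsync() call then commits is the subject of the replies that follow.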
* Re: [PATCH] 2.4.19-rc1/2.5.25 provide dummy fsync() routine for directories on NFS mounts 2002-07-09 15:06 ` Richard B. Johnson @ 2002-07-09 16:56 ` Alan Cox 2002-07-09 17:22 ` Richard B. Johnson 0 siblings, 1 reply; 31+ messages in thread From: Alan Cox @ 2002-07-09 16:56 UTC (permalink / raw) To: root; +Cc: Trond Myklebust, nfs, linux-kernel > > not adhere to this convention. > > Well, no. It's not supported. You can't get a valid file-descriptor... Wrong (as usual) > If an application insists, it is up to the application to determine, > probably once upon startup, just what kind of file synchronization > is supported. Linux defines fsync for directories ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH] 2.4.19-rc1/2.5.25 provide dummy fsync() routine for directories on NFS mounts 2002-07-09 16:56 ` Alan Cox @ 2002-07-09 17:22 ` Richard B. Johnson 2002-07-09 19:11 ` Alan Cox 0 siblings, 1 reply; 31+ messages in thread From: Richard B. Johnson @ 2002-07-09 17:22 UTC (permalink / raw) To: Alan Cox; +Cc: Trond Myklebust, nfs, linux-kernel On Tue, 9 Jul 2002, Alan Cox wrote: > > > not adhere to this convention. > > > > Well, no. It's not supported. You can't get a valid file-descriptor... > > Wrong (as usual) Really? Then what is the meaning of fsync() on a read-only file- descriptor? You can't update the information you can't change. This is (as usual) just an example of your helpful responses. Cheers, Dick Johnson Penguin : Linux version 2.4.18 on an i686 machine (797.90 BogoMips). Windows-2000/Professional isn't. ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH] 2.4.19-rc1/2.5.25 provide dummy fsync() routine for directories on NFS mounts 2002-07-09 17:22 ` Richard B. Johnson @ 2002-07-09 19:11 ` Alan Cox 2002-07-09 19:13 ` Richard B. Johnson 0 siblings, 1 reply; 31+ messages in thread From: Alan Cox @ 2002-07-09 19:11 UTC (permalink / raw) To: root; +Cc: Alan Cox, Trond Myklebust, nfs, linux-kernel > Really? Then what is the meaning of fsync() on a read-only file- > descriptor? You can't update the information you can't change. fsync ensures the data for that inode/file content is on stable storage - note _the_ _data_ not only random things written by this specific file handle. ^ permalink raw reply [flat|nested] 31+ messages in thread
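Going by this description, fsync() acts on the file rather than on the descriptor, so a read-only descriptor is enough to flush data written through a different one. A minimal sketch of that usage follows; write_then_sync_readonly() is a hypothetical name, and the code can only show that the call is accepted, not that the data reached the platter. It also leaves the directory entry alone, which would need its own fsync() on the directory.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Hypothetical helper: write data through one descriptor, then call
 * fsync() through a second, read-only descriptor for the same file.
 * Returns 0 on success, -1 with errno set on failure.
 */
int write_then_sync_readonly(const char *path, const void *data, size_t len)
{
    int wfd, rfd, rc = -1, saved;
    ssize_t n;

    wfd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (wfd < 0)
        return -1;

    rfd = open(path, O_RDONLY);
    if (rfd < 0)
        goto out_w;

    n = write(wfd, data, len);
    if (n < 0)
        goto out_r;
    if ((size_t)n != len) {     /* short write: report it as an I/O error */
        errno = EIO;
        goto out_r;
    }

    /* Flush through the read-only descriptor, not the one that wrote. */
    rc = fsync(rfd);

out_r:
    saved = errno;
    close(rfd);
    errno = saved;
out_w:
    saved = errno;
    close(wfd);
    errno = saved;
    return rc;
}
```

The same reasoning is what makes fsync() on a read-only directory descriptor meaningful: the descriptor only names the object to be committed.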
* Re: [PATCH] 2.4.19-rc1/2.5.25 provide dummy fsync() routine for directories on NFS mounts 2002-07-09 19:11 ` Alan Cox @ 2002-07-09 19:13 ` Richard B. Johnson 2002-07-09 19:59 ` Alan Cox 0 siblings, 1 reply; 31+ messages in thread From: Richard B. Johnson @ 2002-07-09 19:13 UTC (permalink / raw) To: Alan Cox; +Cc: Trond Myklebust, nfs, linux-kernel

On Tue, 9 Jul 2002, Alan Cox wrote:

> > Really? Then what is the meaning of fsync() on a read-only
> > file-descriptor? You can't update the information you can't change.
>
> fsync ensures the data for that inode/file content is on stable storage - note
> _the_ _data_ not only random things written by this specific file handle.
>

That is what it's supposed to do with files. The attached code clearly
shows that it doesn't work with directories. The fsync() instantly
returns, even though there is buffered data still to be written.

    #include <stdio.h>
    #include <unistd.h>
    #include <fcntl.h>

    #define NR_WRITES 0x1000

    int main(void)
    {
        char foo[0x10000] = {0};    /* zero-filled 64 KiB write buffer */
        int dirfd, outfd;
        int flags, i;

        outfd = open("/foo", O_WRONLY|O_TRUNC|O_CREAT, 0644);
        dirfd = open("/", O_RDONLY, 0);

        /* Try to turn the read-only directory fd into a read-write one.
           Note: O_RDONLY is 0, so the mask below clears nothing, and
           Linux ignores access-mode bits passed to F_SETFL. */
        flags = fcntl(dirfd, F_GETFL);
        flags &= ~O_RDONLY;
        flags |= O_RDWR;
        fcntl(dirfd, F_SETFL, flags);

        fprintf(stderr, "Write %lu bytes\n",
                (unsigned long)(sizeof(foo) * NR_WRITES));
        for(i=0; i< NR_WRITES; i++)
            write(outfd, foo, sizeof(foo));
        fprintf(stderr, "Write complete\n");
        fprintf(stderr, "Sync the directory\n");
        fsync(dirfd);               /* flushes the directory, not /foo's data */
        fprintf(stderr, "Done, returns immediately!\n");
        close(outfd);
        fprintf(stderr, "Now execute sync and see if your disk is active!\n");
        // unlink("/foo");
        return 0;
    }

Again, to assure that file-data is written to storage, one must execute
fsync on files, not directories. The dummy return of 0 that Linux provides
is a database bug waiting to happen.

Cheers,
Dick Johnson

Penguin : Linux version 2.4.18 on an i686 machine (797.90 BogoMips).
Windows-2000/Professional isn't.

^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH] 2.4.19-rc1/2.5.25 provide dummy fsync() routine for directories on NFS mounts 2002-07-09 19:13 ` Richard B. Johnson @ 2002-07-09 19:59 ` Alan Cox 2002-07-09 19:50 ` Richard B. Johnson 0 siblings, 1 reply; 31+ messages in thread From: Alan Cox @ 2002-07-09 19:59 UTC (permalink / raw) To: root; +Cc: Alan Cox, Trond Myklebust, nfs, linux-kernel > That is what it's supposed to do with files. The attached code clearly > shows that it doesn't work with directories. The fsync() instantly > returns, even though there is buffered data still to be written. Your understanding or code is wrong. It's hard to tell which. fsync on the directory syncs the directory metadata not the file metadata ^ permalink raw reply [flat|nested] 31+ messages in thread
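The distinction drawn above (fsync() on a file covers that file's data, fsync() on a directory covers the directory's entries) suggests a two-step pattern for a newly created file. The following is only a sketch under that reading, assuming a local Linux filesystem; `create_durably` and its parameters are hypothetical names, not anything proposed in the thread.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper: create dir/name holding `text`, then flush the file
 * and the directory entry separately. Returns 0 on success, -1 on error. */
int create_durably(const char *dir, const char *name, const char *text)
{
    char path[4096];
    int n = snprintf(path, sizeof(path), "%s/%s", dir, name);
    if (n < 0 || (size_t)n >= sizeof(path))
        return -1;

    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;

    size_t len = strlen(text);
    int ok = write(fd, text, len) == (ssize_t)len;

    /* Step 1: fsync() on the file covers its data and inode. */
    ok = ok && fsync(fd) == 0;
    if (close(fd) != 0)
        ok = 0;
    if (!ok)
        return -1;

    /* Step 2: fsync() on the parent directory covers the new name. */
    int dfd = open(dir, O_RDONLY | O_DIRECTORY);
    if (dfd < 0)
        return -1;
    int rc = fsync(dfd);
    close(dfd);
    return rc;
}
```

Note that the directory is opened with plain O_RDONLY; no change of access mode is attempted before calling fsync() on it.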
* Re: [PATCH] 2.4.19-rc1/2.5.25 provide dummy fsync() routine for directories on NFS mounts 2002-07-09 19:59 ` Alan Cox @ 2002-07-09 19:50 ` Richard B. Johnson 0 siblings, 0 replies; 31+ messages in thread From: Richard B. Johnson @ 2002-07-09 19:50 UTC (permalink / raw) To: Alan Cox; +Cc: Trond Myklebust, nfs, linux-kernel On Tue, 9 Jul 2002, Alan Cox wrote: > > That is what it's supposed to do with files. The attached code clearly > > shows that it doesn't work with directories. The fsync() instantly > > returns, even though there is buffered data still to be written. > > Your understanding or code is wrong. It's hard to tell which. > > fsync on the directory syncs the directory metadata not the file metadata > Well, the original complaint was that Linux NFS didn't allow a directory to be fsync()ed. I showed that POSIX.4 doesn't provide for fsync()ing directories, only files, and that you have to fsync() individual files, not the directories that contain them. Others said that fsync()ing individual files was not necessary, that you only have to fsync() the directory. I explained that you have to cheat to even get a fd that can be used to fsync() a directory. Then I showed that fsync()ing a directory in this manner doesn't work, so we are actually in violent agreement. Cheers, Dick Johnson Penguin : Linux version 2.4.18 on an i686 machine (797.90 BogoMips). Windows-2000/Professional isn't. ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH] 2.4.19-rc1/2.5.25 provide dummy fsync() routine for directories on NFS mounts 2002-07-09 14:06 ` Richard B. Johnson 2002-07-09 14:08 ` Trond Myklebust @ 2002-07-10 6:33 ` Alex Riesen 2002-07-10 11:20 ` Richard B. Johnson 1 sibling, 1 reply; 31+ messages in thread From: Alex Riesen @ 2002-07-10 6:33 UTC (permalink / raw) To: Richard B. Johnson; +Cc: linux-kernel On Tue, Jul 09, 2002 at 10:06:45AM -0400, Richard B. Johnson wrote: > I think code that opens a directory as a file is broken. We have > opendir() for that and it returns a DIR pointer, not a file descriptor. > If the directory was properly opened, one would never attempt to > fsync() it. It's the libc which defines it. There is no syscall "opendir". How do you think you can return what SUS defines as "DIR*" from the kernel? offtopic: on AIX you can do this: "cat ." ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH] 2.4.19-rc1/2.5.25 provide dummy fsync() routine for directories on NFS mounts 2002-07-10 6:33 ` Alex Riesen @ 2002-07-10 11:20 ` Richard B. Johnson 0 siblings, 0 replies; 31+ messages in thread From: Richard B. Johnson @ 2002-07-10 11:20 UTC (permalink / raw) To: Alex Riesen; +Cc: linux-kernel

On Wed, 10 Jul 2002, Alex Riesen wrote:

> On Tue, Jul 09, 2002 at 10:06:45AM -0400, Richard B. Johnson wrote:
> > I think code that opens a directory as a file is broken. We have
> > opendir() for that and it returns a DIR pointer, not a file descriptor.
> > If the directory was properly opened, one would never attempt to
> > fsync() it.
>
> It's the libc which defines it. There is no syscall "opendir". How do you
> think you can return what SUS defines as "DIR*" from the kernel?
>
> offtopic: on AIX you can do this: "cat ."
>

Any attempt to open a directory as a file and read it on Linux up to
version 2.4.18 (at least), or on Sun (up to) SunOS 5.5.1, returns -1
with errno set to EISDIR (21). As mentioned several times, there are
ways to 'cheat', but I was (and have been) talking about POSIX
conformance.

Script started on Wed Jul 10 07:15:46 2002
# od .
od: .: Is a directory
0000000
# cat .
cat: .: Is a directory
# exit
exit
Script done on Wed Jul 10 07:15:58 2002

Cheers,
Dick Johnson

Penguin : Linux version 2.4.18 on an i686 machine (797.90 BogoMips).
Windows-2000/Professional isn't.

^ permalink raw reply [flat|nested] 31+ messages in thread
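The "Is a directory" errors in the transcript above can be probed directly. The sketch below, assuming Linux behavior, separates the two steps: open() on a directory with O_RDONLY is expected to succeed, and it is the subsequent read() that fails with EISDIR. The function name is hypothetical, and other systems (AIX was mentioned earlier in the thread) may behave differently.

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical probe: returns -1 if open() fails, 0 if read() unexpectedly
 * succeeds, otherwise the errno value reported by read(). */
int errno_of_directory_read(const char *path)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    char buf[64];
    errno = 0;
    ssize_t n = read(fd, buf, sizeof(buf));

    /* Save errno before close() has a chance to overwrite it. */
    int err = (n < 0) ? errno : 0;

    close(fd);
    return err;
}
```

On Linux, `errno_of_directory_read("/")` should return EISDIR, which shows that a valid descriptor for the directory was obtained before the read was refused.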
end of thread, other threads:[~2002-07-16 1:37 UTC | newest]
Thread overview: 31+ messages
[not found] <20020715075221.GC21470@uncarved.com>
2002-07-15 12:45 ` [PATCH] 2.4.19-rc1/2.5.25 provide dummy fsync() routine for directories on NFS mounts Richard B. Johnson
2002-07-15 13:35 ` Matthias Andree
[not found] ` <mit.lcs.mail.linux-kernel/20020715133507.GF32155@merlin.emma.line.org>
2002-07-15 14:49 ` Patrick J. LoPresti
2002-07-15 15:18 ` Matthias Andree
[not found] ` <mit.lcs.mail.linux-kernel/20020715151833.GA22828@merlin.emma.line.org>
2002-07-15 16:10 ` Patrick J. LoPresti
2002-07-15 18:16 ` Matthias Andree
[not found] ` <mit.lcs.mail.linux-kernel/20020715181650.GA20665@merlin.emma.line.org>
2002-07-15 18:56 ` Patrick J. LoPresti
2002-07-15 20:50 ` Matthias Andree
2002-07-15 16:16 ` Alan Cox
2002-07-15 15:19 ` Matthias Andree
2002-07-15 16:45 ` Alan Cox
2002-07-15 15:38 ` Patrick J. LoPresti
2002-07-15 16:55 ` Alan Cox
2002-07-15 15:29 ` [PATCH] 2.4.19-rc1/2.5.25 provide dummy fsync() routine fordirectories " Sandy Harris
2002-07-15 20:17 ` [PATCH] 2.4.19-rc1/2.5.25 provide dummy fsync() routine for directories " Patrick J. LoPresti
2002-07-16 1:40 ` jw schultz
2002-07-15 15:20 ` Bill Rugolsky Jr.
2002-07-15 15:35 ` Matthias Andree
2002-07-15 16:14 ` Bill Rugolsky Jr.
2002-07-09 13:49 Trond Myklebust
2002-07-09 14:06 ` Richard B. Johnson
2002-07-09 14:08 ` Trond Myklebust
2002-07-09 15:06 ` Richard B. Johnson
2002-07-09 16:56 ` Alan Cox
2002-07-09 17:22 ` Richard B. Johnson
2002-07-09 19:11 ` Alan Cox
2002-07-09 19:13 ` Richard B. Johnson
2002-07-09 19:59 ` Alan Cox
2002-07-09 19:50 ` Richard B. Johnson
2002-07-10 6:33 ` Alex Riesen
2002-07-10 11:20 ` Richard B. Johnson