* Status of fsync() wrt mail servers @ 2003-09-10 5:29 lists 2003-09-10 10:51 ` Bennett Todd 0 siblings, 1 reply; 12+ messages in thread From: lists @ 2003-09-10 5:29 UTC (permalink / raw) To: reiserfs-list Hello, I'm in the process of researching OSes and filesystems for a new mail system. I'm hoping to use linux+reiserfs+postfix, and I'm wondering where reiserfs stands wrt to fsync(). Does reiserfs provide a mechanism to have truly synchronous writes with a single fsync() call? Thanks I've read a very long linux-kernel thread[1] from 2001 where Matthias Andree was petitioning for changes in the fsync() behavior, and I'm having trouble following what happened since then. Thanks [1] http://lists.insecure.org/lists/linux-kernel/2001/Jul/3545.html -- Cameron Moore [ The early bird gets the worm, but the second mouse gets the cheese. ] ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: Status of fsync() wrt mail servers 2003-09-10 5:29 Status of fsync() wrt mail servers lists @ 2003-09-10 10:51 ` Bennett Todd 2003-09-10 11:14 ` Chris Mason 0 siblings, 1 reply; 12+ messages in thread From: Bennett Todd @ 2003-09-10 10:51 UTC (permalink / raw) To: reiserfs-list [-- Attachment #1: Type: text/plain, Size: 863 bytes --] 2003-09-10T01:29:53 lists@unbeatenpath.net: > I'm in the process of researching OSes and filesystems for a new mail > system. I'm hoping to use linux+reiserfs+postfix, and I'm wondering > where reiserfs stands wrt to fsync(). Does reiserfs provide a mechanism > to have truly synchronous writes with a single fsync() call? Thanks I'm not really fond of the phrase "truly synchronous writes"; it can be read different ways by different people. What postfix demands (if you wish to adhere strictly to some peoples' interpretations of RFCs) is that when fsync returns, the filesystem guarantees that even if there's a crash an instant after, the file, data as well as metadata, will be intact when the machine comes up again. This is in support of a desire to positively commit to the sender that the receiving MTA has accepted receipt for a message. -Bennett [-- Attachment #2: Type: application/pgp-signature, Size: 189 bytes --] ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: Status of fsync() wrt mail servers 2003-09-10 10:51 ` Bennett Todd @ 2003-09-10 11:14 ` Chris Mason 2003-09-10 11:41 ` Bennett Todd 0 siblings, 1 reply; 12+ messages in thread From: Chris Mason @ 2003-09-10 11:14 UTC (permalink / raw) To: Bennett Todd; +Cc: reiserfs-list On Wed, 2003-09-10 at 06:51, Bennett Todd wrote: > 2003-09-10T01:29:53 lists@unbeatenpath.net: > > I'm in the process of researching OSes and filesystems for a new mail > > system. I'm hoping to use linux+reiserfs+postfix, and I'm wondering > > where reiserfs stands wrt to fsync(). Does reiserfs provide a mechanism > > to have truly synchronous writes with a single fsync() call? Thanks > > I'm not really fond of the phrase "truly synchronous writes"; it can > be read different ways by different people. > > What postfix demands (if you wish to adhere strictly to some > peoples' interpretations of RFCs) is that when fsync returns, the > filesystem guarantees that even if there's a crash an instant after, > the file, data as well as metadata, will be intact when the machine > comes up again. This is in support of a desire to positively commit > to the sender that the receiving MTA has accepted receipt for a > message. This is what reiserfs does, the metadata is on disk after an fsync, including any renames. -chris ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: Status of fsync() wrt mail servers 2003-09-10 11:14 ` Chris Mason @ 2003-09-10 11:41 ` Bennett Todd 2003-09-10 12:30 ` Chris Mason 0 siblings, 1 reply; 12+ messages in thread From: Bennett Todd @ 2003-09-10 11:41 UTC (permalink / raw) To: Chris Mason; +Cc: reiserfs-list [-- Attachment #1: Type: text/plain, Size: 764 bytes --] 2003-09-10T07:14:34 Chris Mason: > On Wed, 2003-09-10 at 06:51, Bennett Todd wrote: > > What postfix demands (if you wish to adhere strictly to some > > peoples' interpretations of RFCs) is that when fsync returns, the > > filesystem guarantees that even if there's a crash an instant after, > > the file, data as well as metadata, will be intact when the machine > > comes up again. This is in support of a desire to positively commit > > to the sender that the receiving MTA has accepted receipt for a > > message. > > This is what reiserfs does, the metadata is on disk after an fsync, > including any renames. Metadata, yes, I've got that. How about the data? Does return from fsync guarantee that the data will be intact as well? -Bennett [-- Attachment #2: Type: application/pgp-signature, Size: 189 bytes --] ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: Status of fsync() wrt mail servers 2003-09-10 11:41 ` Bennett Todd @ 2003-09-10 12:30 ` Chris Mason 2003-09-10 15:18 ` Cameron Moore 0 siblings, 1 reply; 12+ messages in thread From: Chris Mason @ 2003-09-10 12:30 UTC (permalink / raw) To: Bennett Todd; +Cc: reiserfs-list On Wed, 2003-09-10 at 07:41, Bennett Todd wrote: > 2003-09-10T07:14:34 Chris Mason: > > On Wed, 2003-09-10 at 06:51, Bennett Todd wrote: > > > What postfix demands (if you wish to adhere strictly to some > > > peoples' interpretations of RFCs) is that when fsync returns, the > > > filesystem guarantees that even if there's a crash an instant after, > > > the file, data as well as metadata, will be intact when the machine > > > comes up again. This is in support of a desire to positively commit > > > to the sender that the receiving MTA has accepted receipt for a > > > message. > > > > This is what reiserfs does, the metadata is on disk after an fsync, > > including any renames. > > Metadata, yes, I've got that. How about the data? Does return from > fsync guarantee that the data will be intact as well? Yes -chris ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: Status of fsync() wrt mail servers 2003-09-10 12:30 ` Chris Mason @ 2003-09-10 15:18 ` Cameron Moore 2003-09-10 21:32 ` Mike Fedyk 0 siblings, 1 reply; 12+ messages in thread From: Cameron Moore @ 2003-09-10 15:18 UTC (permalink / raw) To: reiserfs-list * mason@suse.com (Chris Mason) [2003.09.10 07:31]: > On Wed, 2003-09-10 at 07:41, Bennett Todd wrote: > > 2003-09-10T07:14:34 Chris Mason: > > > On Wed, 2003-09-10 at 06:51, Bennett Todd wrote: > > > > What postfix demands (if you wish to adhere strictly to some > > > > peoples' interpretations of RFCs) is that when fsync returns, the > > > > filesystem guarantees that even if there's a crash an instant after, > > > > the file, data as well as metadata, will be intact when the machine > > > > comes up again. This is in support of a desire to positively commit > > > > to the sender that the receiving MTA has accepted receipt for a > > > > message. > > > > > > This is what reiserfs does, the metadata is on disk after an fsync, > > > including any renames. > > > > Metadata, yes, I've got that. How about the data? Does return from > > fsync guarantee that the data will be intact as well? > > Yes Thanks for hashing this out while I was asleep. :-) Guess I'll go morph into a die-hard Reiser fan now. Thanks again -- Cameron Moore $\="Hacker";$,="another ";print"Just ","Perl "; ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: Status of fsync() wrt mail servers 2003-09-10 15:18 ` Cameron Moore @ 2003-09-10 21:32 ` Mike Fedyk 2003-09-10 22:33 ` Cameron Moore 0 siblings, 1 reply; 12+ messages in thread From: Mike Fedyk @ 2003-09-10 21:32 UTC (permalink / raw) To: reiserfs-list On Wed, Sep 10, 2003 at 10:18:21AM -0500, Cameron Moore wrote: > * mason@suse.com (Chris Mason) [2003.09.10 07:31]: > > On Wed, 2003-09-10 at 07:41, Bennett Todd wrote: > > > 2003-09-10T07:14:34 Chris Mason: > > > > On Wed, 2003-09-10 at 06:51, Bennett Todd wrote: > > > > > What postfix demands (if you wish to adhere strictly to some > > > > > peoples' interpretations of RFCs) is that when fsync returns, the > > > > > filesystem guarantees that even if there's a crash an instant after, > > > > > the file, data as well as metadata, will be intact when the machine > > > > > comes up again. This is in support of a desire to positively commit > > > > > to the sender that the receiving MTA has accepted receipt for a > > > > > message. > > > > > > > > This is what reiserfs does, the metadata is on disk after an fsync, > > > > including any renames. > > > > > > Metadata, yes, I've got that. How about the data? Does return from > > > fsync guarantee that the data will be intact as well? > > > > Yes > > Thanks for hashing this out while I was asleep. :-) Guess I'll go > morph into a die-hard Reiser fan now. Thanks again The whole purpose of fsync is to flush the data to the disk. That works even with ext2, but it has the possibility of not flushing the meta-data. With a journaled filesystem and fsync, you will have the data and meta-data on the disk after the call returns. Isn't that part of POSIX or SUS?
* Re: Status of fsync() wrt mail servers 2003-09-10 21:32 ` Mike Fedyk @ 2003-09-10 22:33 ` Cameron Moore 2003-09-10 23:49 ` Mike Fedyk 0 siblings, 1 reply; 12+ messages in thread From: Cameron Moore @ 2003-09-10 22:33 UTC (permalink / raw) To: reiserfs-list * mfedyk@matchmail.com (Mike Fedyk) [2003.09.10 16:32]: > On Wed, Sep 10, 2003 at 10:18:21AM -0500, Cameron Moore wrote: > > * mason@suse.com (Chris Mason) [2003.09.10 07:31]: > > > On Wed, 2003-09-10 at 07:41, Bennett Todd wrote: > > > > 2003-09-10T07:14:34 Chris Mason: > > > > > On Wed, 2003-09-10 at 06:51, Bennett Todd wrote: > > > > > > What postfix demands (if you wish to adhere strictly to some > > > > > > peoples' interpretations of RFCs) is that when fsync returns, the > > > > > > filesystem guarantees that even if there's a crash an instant after, > > > > > > the file, data as well as metadata, will be intact when the machine > > > > > > comes up again. This is in support of a desire to positively commit > > > > > > to the sender that the receiving MTA has accepted receipt for a > > > > > > message. > > > > > > > > > > This is what reiserfs does, the metadata is on disk after an fsync, > > > > > including any renames. > > > > > > > > Metadata, yes, I've got that. How about the data? Does return from > > > > fsync guarantee that the data will be intact as well? > > > > > > Yes > > > > Thanks for hashing this out while I was asleep. :-) Guess I'll go > > morph into a die-hard Reiser fan now. Thanks again > > The whole perpose of fsync, is to flush the data to the disk. That works > even with ext2, but it has the possibility of not flushing the meta-data. > > With a journaled filesystem and fsync, you will have the data and meta-data > on the disk after the call returns. > > Isn't that part of Posix or sus? 
I'm not an expert on this, but my reading of the linux-kernel discussion I cited was that ext3 (at least at that revision point) only guaranteed that metadata would be written to disk when you fsync()'d a file. You had to do a second fsync() on the parent directory to guarantee that the file's data was written to disk. -- Cameron Moore [ Smoking cures weight problems... eventually. ] ^ permalink raw reply [flat|nested] 12+ messages in thread
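[Editorial sketch, not part of the original message.] The "second fsync() on the parent directory" mentioned above is a common defensive pattern when a program cannot assume that fsync() on the file also commits its directory entry. A minimal illustration follows, assuming a Linux-like system where a directory can be opened read-only and passed to fsync(); the function name `durable_write` and its parameters are hypothetical and do not come from Postfix, qmail, or the thread. Whether the directory fsync is actually required depends on the filesystem and kernel version, which is exactly what this thread is trying to establish.

```python
import os


def durable_write(directory: str, name: str, data: bytes) -> str:
    """Create directory/name, write data, and fsync the file and its directory.

    Hypothetical helper for illustration only.
    """
    path = os.path.join(directory, name)
    # O_EXCL: fail instead of overwriting an existing queue file.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o600)
    try:
        view = memoryview(data)
        while view:
            # os.write may write fewer bytes than requested, so loop.
            written = os.write(fd, view)
            view = view[written:]
        # First fsync: the file's data and its own metadata.
        os.fsync(fd)
    finally:
        os.close(fd)

    # Second fsync: the parent directory, so the new directory entry
    # is also requested to be on stable storage.
    dir_fd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
    return path
```

An MTA-like caller would acknowledge the message to the sender only after `durable_write` returns without raising; any OSError from write or fsync would be treated as "message not accepted".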
* Re: Status of fsync() wrt mail servers 2003-09-10 22:33 ` Cameron Moore @ 2003-09-10 23:49 ` Mike Fedyk 2003-09-11 12:33 ` Matthias Andree 0 siblings, 1 reply; 12+ messages in thread From: Mike Fedyk @ 2003-09-10 23:49 UTC (permalink / raw) To: reiserfs-list On Wed, Sep 10, 2003 at 05:33:43PM -0500, Cameron Moore wrote: > * mfedyk@matchmail.com (Mike Fedyk) [2003.09.10 16:32]: > > On Wed, Sep 10, 2003 at 10:18:21AM -0500, Cameron Moore wrote: > > > * mason@suse.com (Chris Mason) [2003.09.10 07:31]: > > > > On Wed, 2003-09-10 at 07:41, Bennett Todd wrote: > > > > > Metadata, yes, I've got that. How about the data? Does return from > > > > > fsync guarantee that the data will be intact as well? > > > > > > > > Yes > > > > > > Thanks for hashing this out while I was asleep. :-) Guess I'll go > > > morph into a die-hard Reiser fan now. Thanks again > > > > The whole perpose of fsync, is to flush the data to the disk. That works > > even with ext2, but it has the possibility of not flushing the meta-data. > > > > With a journaled filesystem and fsync, you will have the data and meta-data > > on the disk after the call returns. > > > > Isn't that part of Posix or sus? > > I'm not an expert on this, but my reading of the linux-kernel discussion > I cited was that ext3 (at least at that revision point) only guaranteed > that metadata would be written to disk when you fsync()'d a file. You > had to do a second fsync() on the parent directory to guarantee that the > file's data was written to disk. Ok, I've read through part of the thread, but I remember reading it before, so... What Matthias is asking for is to have any directory operation within the same filesystem to be on the disk when the directory operation call has completed. At the time, the only way to get that was to mount the filesystem in sync mode. That meant that any operation on that filesystem wouldn't return until it was on the disk, including data writes. 
The drawback of that is that each write() call (typically 4k) would wait until it was on the disk, and that's very slow. What Matthias wanted was sync mode, but only for directory operations. That's where ext3's dirsync mount option came from.

With fsync() you write the file like normal: the write is buffered in memory and is not guaranteed to be on the disk yet. It may or may not have been written out, depending on memory pressure (in virtual memory terms). Basically, at this point it is in memory. When fsync() is called, all of the buffered data is sent to the disk, and the call doesn't return until the disk signals that it has received the data. You get that with or without dirsync.

During the processing of a message the MTA will do several renames, moves, and other calls that manipulate its directory entry. Without dirsync, it is up to the filesystem and memory pressure to determine when the meta-data from those calls actually makes it to the disk (5 seconds with ext3 and 30 seconds with reiserfs3).

With dirsync, once the directory operation call is made, it will not return to the userspace program until the meta-data has made it to the disk. (During the rename and other directory operation calls there is no data, only meta-data, which is filesystem accounting data such as directory entries.) Or, more likely, until it has made it to the journal in a journaling filesystem, which is all that is needed to guarantee that all state will be kept intact after the journal recovery (which is automatic at boot time).

I don't know if reiserfs has a similar option. (And are there modes for the other POSIX filesystems, so that this could be moved up to the VFS level?)

So nothing about the effect of fsync() was mentioned, only that with -o sync it was pointless, since each write() call was already synchronous, and that without -o sync you would have the data, but not necessarily know what its delivery state is (if the crash is at the wrong time).
Anyone please point out any errors I may have made... Thanks, Mike
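[Editorial sketch, not part of the original message.] The cost difference described above, between waiting on every write() and buffering writes followed by a single fsync(), can be observed per file without remounting anything. The sketch below assumes a Unix system where `os.O_SYNC` is available; opening one file with O_SYNC is used here only as a rough stand-in for what `-o sync` does to a whole filesystem. The helper name `write_chunks` and the constant `CHUNK` are hypothetical. Timings vary widely with hardware, filesystem, and write caching, so no expected numbers are given.

```python
import os
import time

CHUNK = 4096  # the "typically 4k" write size mentioned in the message


def write_chunks(path: str, chunks: int, per_write_sync: bool) -> float:
    """Write `chunks` blocks of CHUNK bytes to path; return elapsed seconds.

    per_write_sync=True  : every write() waits for the disk (O_SYNC).
    per_write_sync=False : writes are buffered, then one fsync() at the end.
    """
    flags = os.O_WRONLY | os.O_CREAT | os.O_TRUNC
    if per_write_sync:
        flags |= os.O_SYNC
    block = b"x" * CHUNK
    start = time.perf_counter()
    fd = os.open(path, flags, 0o600)
    try:
        for _ in range(chunks):
            # Assumption: a 4k write to a regular file completes in one
            # call; a production writer would loop on short writes.
            os.write(fd, block)
        if not per_write_sync:
            os.fsync(fd)
    finally:
        os.close(fd)
    return time.perf_counter() - start
```

Calling `write_chunks(path, 256, True)` and `write_chunks(path, 256, False)` on the filesystem in question gives a first impression of how much the per-write waiting costs there. Note that neither variant says anything about directory operations, which is the gap dirsync is meant to cover.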
* Re: Status of fsync() wrt mail servers 2003-09-10 23:49 ` Mike Fedyk @ 2003-09-11 12:33 ` Matthias Andree 2003-09-11 17:25 ` Mike Fedyk 0 siblings, 1 reply; 12+ messages in thread From: Matthias Andree @ 2003-09-11 12:33 UTC (permalink / raw) To: Mike Fedyk; +Cc: reiserfs-list Mike Fedyk <mfedyk@matchmail.com> writes: > During the processing of a message the MTA will do several renames, moves, > and other calls that manipulate its directory entry. Different MTAs implement their queue differently. Postfix doesn't rename the file into place, unlike qmail, it just drops a file, fsync()s it and that's it. > Without dirsync, it is up to the filesystem and memory pressure to > determine when the meta-data from those calls actually makes it to the > disk. (5 seconds with ext3 and 30 seconds with reiserfs3). With > dirsync, once the directory operation call is made, it will not return > to the userspace program until the meta-data has made it the disk > (because during the rename and directory operation calls, there is no > data only meta-data which is filesystem accounting data (directory > entries and etc.)) Does reiserfs3.6 support dirsync? I thought it was ext3-specific until now. Please take care to distinguish (file) meta data from directory data. > So nothing about the effect of fsync() was mentioned, only that with -o sync > it was pointless, since each write() call was already syncronous, and > without -o sync, you would have the data, but not nessicarily know what its > delivery state is (if the crash is at the wrong time). Basically, what we know is that with Linux 2.4, ext3fs, reiserfs and XFS will flush all pending transactions (per file system) that were requested prior to a synchronous operation (fsync, sync, umount, ...) out to disk. This can heftily bite your back if you're running your MTA's queue on a large file system that has other sustained write load (logging, data bases, ...). 
I recently helped one qmail user debug this; the symptom was that the first mail in a burst of mails took 2 seconds to queue, subsequent mails were queued much quicker (70 ms). He was using ext3fs, and had one huge / (root) file system and so the "synch the whole file system" behaviour made his qmail-queue synch *all* his dirty blocks to disk... -- Matthias Andree Encrypt your mail: my GnuPG key ID is 0x052E7D95 ^ permalink raw reply [flat|nested] 12+ messages in thread
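[Editorial sketch, not part of the original message.] To check whether a queue directory shows the symptom described above (a slow first message in a burst, faster ones after it), one could time each write-plus-fsync in a burst. The sketch below follows the Postfix-style "drop a file and fsync() it" step in simplified form; it is not code from Postfix or qmail, and the function name `queue_burst` and the `msgN` file naming are hypothetical. Results depend on what else is writing to the same filesystem at the time.

```python
import os
import time
from typing import List


def queue_burst(queue_dir: str, count: int, body: bytes) -> List[float]:
    """Write `count` messages into queue_dir, fsync each, return latencies.

    Each latency covers create + write + fsync for one message, in seconds.
    """
    latencies = []
    for i in range(count):
        path = os.path.join(queue_dir, f"msg{i}")
        start = time.perf_counter()
        with open(path, "wb") as f:
            f.write(body)
            f.flush()             # push Python's buffer to the kernel
            os.fsync(f.fileno())  # ask the kernel to push it to disk
        latencies.append(time.perf_counter() - start)
    return latencies
```

Running this once while the filesystem is otherwise idle and once while another process is writing heavily to the same filesystem would show whether the fsync() of a small queue file is being delayed by unrelated dirty data, as in the qmail case above.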
* Re: Status of fsync() wrt mail servers 2003-09-11 12:33 ` Matthias Andree @ 2003-09-11 17:25 ` Mike Fedyk 2003-09-12 0:22 ` Matthias Andree 0 siblings, 1 reply; 12+ messages in thread From: Mike Fedyk @ 2003-09-11 17:25 UTC (permalink / raw) To: Matthias Andree; +Cc: reiserfs-list, linux-kernel On Thu, Sep 11, 2003 at 02:33:25PM +0200, Matthias Andree wrote: > Does reiserfs3.6 support dirsync? I thought it was ext3-specific until > now. > That was what I was asking too. > Please take care to distinguish (file) meta data from directory data. > Hmm, it seems to me, that all meta-data relating to the file fsync() was called on should be sent to the disk and waited for by the call. > Basically, what we know is that with Linux 2.4, ext3fs, reiserfs and XFS > will flush all pending transactions (per file system) that were > requested prior to a synchronous operation (fsync, sync, umount, ...) > out to disk. > > This can heftily bite your back if you're running your MTA's queue on a > large file system that has other sustained write load (logging, data > bases, ...). > > I recently helped one qmail user debug this; the symptom was that the > first mail in a burst of mails took 2 seconds to queue, subsequent mails > were queued much quicker (70 ms). He was using ext3fs, and had one huge > / (root) file system and so the "synch the whole file system" behaviour > made his qmail-queue synch *all* his dirty blocks to disk... Can you be sure the MTA wasn't calling sync() just to be sure (Many MTAs are funny in that they think the spool is on a seperate disk and filesystem). fsync() shouldn't be flushing anything not relating to the file it was called on (that includes directory entries related to the file also IMHO). Also, if the MTA wasn't running as root, it shouldn't be able to make sync() affect the entire system. Is there anything that says that sync() can't just flush the user's buffers unless you're running as root or with some CAP_ capability? 
Mike
* Re: Status of fsync() wrt mail servers 2003-09-11 17:25 ` Mike Fedyk @ 2003-09-12 0:22 ` Matthias Andree 0 siblings, 0 replies; 12+ messages in thread From: Matthias Andree @ 2003-09-12 0:22 UTC (permalink / raw) To: Matthias Andree; +Cc: reiserfs-list, linux-kernel Mike Fedyk <mfedyk@matchmail.com> writes: >> I recently helped one qmail user debug this; the symptom was that the >> first mail in a burst of mails took 2 seconds to queue, subsequent mails >> were queued much quicker (70 ms). He was using ext3fs, and had one huge >> / (root) file system and so the "synch the whole file system" behaviour >> made his qmail-queue synch *all* his dirty blocks to disk... > > Can you be sure the MTA wasn't calling sync() just to be sure (Many MTAs are > funny in that they think the spool is on a seperate disk and > filesystem). For qmail and Postfix I can be. sync(8) isn't remotely useful, because it's allowed to return before completion. > fsync() shouldn't be flushing anything not relating to the file it was > called on (that includes directory entries related to the file also > IMHO). It "should", but current implementations on Linux do exactly that: flush everything. Maybe you've got better luck with BSD softupdates, but that's going to munch disk I/O big time next time you reboot after a crash: fsck needed. It runs niced in the background so the machine boots up, but the box won't satisfy higher I/O demands. Looks like a "ex duobus malis" game. > Also, if the MTA wasn't running as root, it shouldn't be able to make sync() > affect the entire system. I'd like to see your plans that prevent DoS by local users... One machine's light load is another one's DoS attack. > Is there anything that says that sync() can't just flush the user's > buffers unless you're running as root or with some CAP_ capability? Does the kernel track "whose dirty buffer is this" (uid_t) at all? 
-- Matthias Andree Encrypt your mail: my GnuPG key ID is 0x052E7D95