* Status of fsync() wrt mail servers
@ 2003-09-10 5:29 lists
2003-09-10 10:51 ` Bennett Todd
0 siblings, 1 reply; 12+ messages in thread
From: lists @ 2003-09-10 5:29 UTC (permalink / raw)
To: reiserfs-list
Hello,
I'm in the process of researching OSes and filesystems for a new mail
system. I'm hoping to use linux+reiserfs+postfix, and I'm wondering
where reiserfs stands wrt fsync(). Does reiserfs provide a mechanism
to have truly synchronous writes with a single fsync() call? Thanks
I've read a very long linux-kernel thread[1] from 2001 where Matthias
Andree was petitioning for changes in the fsync() behavior, and I'm
having trouble following what happened since then.
Thanks
[1] http://lists.insecure.org/lists/linux-kernel/2001/Jul/3545.html
--
Cameron Moore
[ The early bird gets the worm, but the second mouse gets the cheese. ]
* Re: Status of fsync() wrt mail servers
2003-09-10 5:29 Status of fsync() wrt mail servers lists
@ 2003-09-10 10:51 ` Bennett Todd
2003-09-10 11:14 ` Chris Mason
0 siblings, 1 reply; 12+ messages in thread
From: Bennett Todd @ 2003-09-10 10:51 UTC (permalink / raw)
To: reiserfs-list
2003-09-10T01:29:53 lists@unbeatenpath.net:
> I'm in the process of researching OSes and filesystems for a new mail
> system. I'm hoping to use linux+reiserfs+postfix, and I'm wondering
> where reiserfs stands wrt fsync(). Does reiserfs provide a mechanism
> to have truly synchronous writes with a single fsync() call? Thanks
I'm not really fond of the phrase "truly synchronous writes"; it can
be read different ways by different people.
What postfix demands (if you wish to adhere strictly to some
people's interpretations of RFCs) is that when fsync returns, the
filesystem guarantees that even if there's a crash an instant after,
the file, data as well as metadata, will be intact when the machine
comes up again. This is in support of a desire to positively commit
to the sender that the receiving MTA has accepted receipt for a
message.
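The requirement described above can be sketched in Python. This is only
an illustration, not Postfix's actual code: the function name
store_message and the 0o600 file mode are assumptions made for the
example. The point is that the function returns, and the caller may
acknowledge the message, only after os.fsync() has succeeded.

```python
import os


def store_message(path, data):
    """Write data to a new file and return only after fsync() succeeds.

    Hypothetical helper: if any step raises, the caller should not
    acknowledge receipt of the message.
    """
    # O_EXCL: refuse to overwrite an existing queue file.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o600)
    try:
        view = memoryview(data)
        while view:
            # os.write() may write fewer bytes than requested.
            written = os.write(fd, view)
            view = view[written:]
        # Block until the kernel reports the file has been flushed.
        os.fsync(fd)
    finally:
        os.close(fd)
    return len(data)
```

A caller would send its "accepted" reply to the sending MTA only after
store_message() returns without raising.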
-Bennett
* Re: Status of fsync() wrt mail servers
2003-09-10 10:51 ` Bennett Todd
@ 2003-09-10 11:14 ` Chris Mason
2003-09-10 11:41 ` Bennett Todd
0 siblings, 1 reply; 12+ messages in thread
From: Chris Mason @ 2003-09-10 11:14 UTC (permalink / raw)
To: Bennett Todd; +Cc: reiserfs-list
On Wed, 2003-09-10 at 06:51, Bennett Todd wrote:
> 2003-09-10T01:29:53 lists@unbeatenpath.net:
> > I'm in the process of researching OSes and filesystems for a new mail
> > system. I'm hoping to use linux+reiserfs+postfix, and I'm wondering
> > where reiserfs stands wrt fsync(). Does reiserfs provide a mechanism
> > to have truly synchronous writes with a single fsync() call? Thanks
>
> I'm not really fond of the phrase "truly synchronous writes"; it can
> be read different ways by different people.
>
> What postfix demands (if you wish to adhere strictly to some
> peoples' interpretations of RFCs) is that when fsync returns, the
> filesystem guarantees that even if there's a crash an instant after,
> the file, data as well as metadata, will be intact when the machine
> comes up again. This is in support of a desire to positively commit
> to the sender that the receiving MTA has accepted receipt for a
> message.
This is what reiserfs does, the metadata is on disk after an fsync,
including any renames.
-chris
* Re: Status of fsync() wrt mail servers
2003-09-10 11:14 ` Chris Mason
@ 2003-09-10 11:41 ` Bennett Todd
2003-09-10 12:30 ` Chris Mason
0 siblings, 1 reply; 12+ messages in thread
From: Bennett Todd @ 2003-09-10 11:41 UTC (permalink / raw)
To: Chris Mason; +Cc: reiserfs-list
2003-09-10T07:14:34 Chris Mason:
> On Wed, 2003-09-10 at 06:51, Bennett Todd wrote:
> > What postfix demands (if you wish to adhere strictly to some
> > peoples' interpretations of RFCs) is that when fsync returns, the
> > filesystem guarantees that even if there's a crash an instant after,
> > the file, data as well as metadata, will be intact when the machine
> > comes up again. This is in support of a desire to positively commit
> > to the sender that the receiving MTA has accepted receipt for a
> > message.
>
> This is what reiserfs does, the metadata is on disk after an fsync,
> including any renames.
Metadata, yes, I've got that. How about the data? Does return from
fsync guarantee that the data will be intact as well?
-Bennett
* Re: Status of fsync() wrt mail servers
2003-09-10 11:41 ` Bennett Todd
@ 2003-09-10 12:30 ` Chris Mason
2003-09-10 15:18 ` Cameron Moore
0 siblings, 1 reply; 12+ messages in thread
From: Chris Mason @ 2003-09-10 12:30 UTC (permalink / raw)
To: Bennett Todd; +Cc: reiserfs-list
On Wed, 2003-09-10 at 07:41, Bennett Todd wrote:
> 2003-09-10T07:14:34 Chris Mason:
> > On Wed, 2003-09-10 at 06:51, Bennett Todd wrote:
> > > What postfix demands (if you wish to adhere strictly to some
> > > peoples' interpretations of RFCs) is that when fsync returns, the
> > > filesystem guarantees that even if there's a crash an instant after,
> > > the file, data as well as metadata, will be intact when the machine
> > > comes up again. This is in support of a desire to positively commit
> > > to the sender that the receiving MTA has accepted receipt for a
> > > message.
> >
> > This is what reiserfs does, the metadata is on disk after an fsync,
> > including any renames.
>
> Metadata, yes, I've got that. How about the data? Does return from
> fsync guarantee that the data will be intact as well?
Yes
-chris
* Re: Status of fsync() wrt mail servers
2003-09-10 12:30 ` Chris Mason
@ 2003-09-10 15:18 ` Cameron Moore
2003-09-10 21:32 ` Mike Fedyk
0 siblings, 1 reply; 12+ messages in thread
From: Cameron Moore @ 2003-09-10 15:18 UTC (permalink / raw)
To: reiserfs-list
* mason@suse.com (Chris Mason) [2003.09.10 07:31]:
> On Wed, 2003-09-10 at 07:41, Bennett Todd wrote:
> > 2003-09-10T07:14:34 Chris Mason:
> > > On Wed, 2003-09-10 at 06:51, Bennett Todd wrote:
> > > > What postfix demands (if you wish to adhere strictly to some
> > > > peoples' interpretations of RFCs) is that when fsync returns, the
> > > > filesystem guarantees that even if there's a crash an instant after,
> > > > the file, data as well as metadata, will be intact when the machine
> > > > comes up again. This is in support of a desire to positively commit
> > > > to the sender that the receiving MTA has accepted receipt for a
> > > > message.
> > >
> > > This is what reiserfs does, the metadata is on disk after an fsync,
> > > including any renames.
> >
> > Metadata, yes, I've got that. How about the data? Does return from
> > fsync guarantee that the data will be intact as well?
>
> Yes
Thanks for hashing this out while I was asleep. :-) Guess I'll go
morph into a die-hard Reiser fan now. Thanks again
--
Cameron Moore
$\="Hacker";$,="another ";print"Just ","Perl ";
* Re: Status of fsync() wrt mail servers
2003-09-10 15:18 ` Cameron Moore
@ 2003-09-10 21:32 ` Mike Fedyk
2003-09-10 22:33 ` Cameron Moore
0 siblings, 1 reply; 12+ messages in thread
From: Mike Fedyk @ 2003-09-10 21:32 UTC (permalink / raw)
To: reiserfs-list
On Wed, Sep 10, 2003 at 10:18:21AM -0500, Cameron Moore wrote:
> * mason@suse.com (Chris Mason) [2003.09.10 07:31]:
> > On Wed, 2003-09-10 at 07:41, Bennett Todd wrote:
> > > 2003-09-10T07:14:34 Chris Mason:
> > > > On Wed, 2003-09-10 at 06:51, Bennett Todd wrote:
> > > > > What postfix demands (if you wish to adhere strictly to some
> > > > > peoples' interpretations of RFCs) is that when fsync returns, the
> > > > > filesystem guarantees that even if there's a crash an instant after,
> > > > > the file, data as well as metadata, will be intact when the machine
> > > > > comes up again. This is in support of a desire to positively commit
> > > > > to the sender that the receiving MTA has accepted receipt for a
> > > > > message.
> > > >
> > > > This is what reiserfs does, the metadata is on disk after an fsync,
> > > > including any renames.
> > >
> > > Metadata, yes, I've got that. How about the data? Does return from
> > > fsync guarantee that the data will be intact as well?
> >
> > Yes
>
> Thanks for hashing this out while I was asleep. :-) Guess I'll go
> morph into a die-hard Reiser fan now. Thanks again
The whole purpose of fsync is to flush the data to the disk. That works
even with ext2, but it has the possibility of not flushing the meta-data.
With a journaled filesystem and fsync, you will have the data and meta-data
on the disk after the call returns.
Isn't that part of POSIX or SUS?
* Re: Status of fsync() wrt mail servers
2003-09-10 21:32 ` Mike Fedyk
@ 2003-09-10 22:33 ` Cameron Moore
2003-09-10 23:49 ` Mike Fedyk
0 siblings, 1 reply; 12+ messages in thread
From: Cameron Moore @ 2003-09-10 22:33 UTC (permalink / raw)
To: reiserfs-list
* mfedyk@matchmail.com (Mike Fedyk) [2003.09.10 16:32]:
> On Wed, Sep 10, 2003 at 10:18:21AM -0500, Cameron Moore wrote:
> > * mason@suse.com (Chris Mason) [2003.09.10 07:31]:
> > > On Wed, 2003-09-10 at 07:41, Bennett Todd wrote:
> > > > 2003-09-10T07:14:34 Chris Mason:
> > > > > On Wed, 2003-09-10 at 06:51, Bennett Todd wrote:
> > > > > > What postfix demands (if you wish to adhere strictly to some
> > > > > > peoples' interpretations of RFCs) is that when fsync returns, the
> > > > > > filesystem guarantees that even if there's a crash an instant after,
> > > > > > the file, data as well as metadata, will be intact when the machine
> > > > > > comes up again. This is in support of a desire to positively commit
> > > > > > to the sender that the receiving MTA has accepted receipt for a
> > > > > > message.
> > > > >
> > > > > This is what reiserfs does, the metadata is on disk after an fsync,
> > > > > including any renames.
> > > >
> > > > Metadata, yes, I've got that. How about the data? Does return from
> > > > fsync guarantee that the data will be intact as well?
> > >
> > > Yes
> >
> > Thanks for hashing this out while I was asleep. :-) Guess I'll go
> > morph into a die-hard Reiser fan now. Thanks again
>
> The whole purpose of fsync is to flush the data to the disk. That works
> even with ext2, but it has the possibility of not flushing the meta-data.
>
> With a journaled filesystem and fsync, you will have the data and meta-data
> on the disk after the call returns.
>
> Isn't that part of POSIX or SUS?
I'm not an expert on this, but my reading of the linux-kernel discussion
I cited was that ext3 (at least at that revision point) only guaranteed
that metadata would be written to disk when you fsync()'d a file. You
had to do a second fsync() on the parent directory to guarantee that the
file's data was written to disk.
--
Cameron Moore
[ Smoking cures weight problems... eventually. ]
* Re: Status of fsync() wrt mail servers
2003-09-10 22:33 ` Cameron Moore
@ 2003-09-10 23:49 ` Mike Fedyk
2003-09-11 12:33 ` Matthias Andree
0 siblings, 1 reply; 12+ messages in thread
From: Mike Fedyk @ 2003-09-10 23:49 UTC (permalink / raw)
To: reiserfs-list
On Wed, Sep 10, 2003 at 05:33:43PM -0500, Cameron Moore wrote:
> * mfedyk@matchmail.com (Mike Fedyk) [2003.09.10 16:32]:
> > On Wed, Sep 10, 2003 at 10:18:21AM -0500, Cameron Moore wrote:
> > > * mason@suse.com (Chris Mason) [2003.09.10 07:31]:
> > > > On Wed, 2003-09-10 at 07:41, Bennett Todd wrote:
> > > > > Metadata, yes, I've got that. How about the data? Does return from
> > > > > fsync guarantee that the data will be intact as well?
> > > >
> > > > Yes
> > >
> > > Thanks for hashing this out while I was asleep. :-) Guess I'll go
> > > morph into a die-hard Reiser fan now. Thanks again
> >
> > The whole purpose of fsync is to flush the data to the disk. That works
> > even with ext2, but it has the possibility of not flushing the meta-data.
> >
> > With a journaled filesystem and fsync, you will have the data and meta-data
> > on the disk after the call returns.
> >
> > Isn't that part of POSIX or SUS?
>
> I'm not an expert on this, but my reading of the linux-kernel discussion
> I cited was that ext3 (at least at that revision point) only guaranteed
> that metadata would be written to disk when you fsync()'d a file. You
> had to do a second fsync() on the parent directory to guarantee that the
> file's data was written to disk.
Ok, I've read through part of the thread, but I remember reading it before,
so...
What Matthias is asking for is to have any directory operation within the
same filesystem be on the disk when the directory operation call has
completed. At the time, the only way to get that was to mount the
filesystem in sync mode. That meant that any operation on that filesystem
wouldn't return until it was on the disk, including data writes.
The drawback of that is that each write() (typically 4k) call would wait
until it was on the disk, and that's very slow. What Matthias wanted was a
combination of sync mode, but only for directory operations. That's where
ext3's dirsync mount option came from.
With fsync() you write the file as normal. Each write() is buffered in
memory and is not guaranteed to be on the disk yet; it may or may not have
been written out, depending on memory pressure (in virtual memory terms).
When fsync() is called, all of the buffered data is sent to the disk, and
the call doesn't return until the disk signals that it has received the
data. You get that with or without dirsync.
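The difference between synchronous writes and buffered writes followed by
one fsync() can be sketched as below. This is an illustration under
assumptions: it uses the per-file O_SYNC open flag as a stand-in for
mounting the filesystem in sync mode, which is not the same mechanism,
and the function names are hypothetical.

```python
import os


def _write_all(fd, chunk):
    # os.write() may write fewer bytes than requested, so loop.
    view = memoryview(chunk)
    while view:
        written = os.write(fd, view)
        view = view[written:]


def write_with_o_sync(path, chunks):
    """Every write() waits for the disk (analogous to sync mode)."""
    flags = os.O_WRONLY | os.O_CREAT | os.O_TRUNC | os.O_SYNC
    fd = os.open(path, flags, 0o600)
    try:
        for chunk in chunks:
            _write_all(fd, chunk)
    finally:
        os.close(fd)


def write_then_fsync(path, chunks):
    """Writes are buffered in memory; only the final fsync() waits."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        for chunk in chunks:
            _write_all(fd, chunk)
        os.fsync(fd)
    finally:
        os.close(fd)
```

Both functions leave the same bytes in the file; they differ only in how
many times the caller waits for the disk.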
During the processing of a message the MTA will do several renames, moves,
and other calls that manipulate its directory entry. Without dirsync, it is
up to the filesystem and memory pressure to determine when the meta-data
from those calls actually makes it to the disk (5 seconds with ext3 and 30
seconds with reiserfs3). With dirsync, once the directory operation call is
made, it will not return to the userspace program until the meta-data has
made it to the disk. (Rename and other directory operation calls involve no
data, only meta-data, which is filesystem accounting data such as directory
entries.) More likely, it will have made it to the journal in a journaling
filesystem, which is all that is needed to guarantee that all state will be
kept intact after the journal recovery (which is automatic at boot time).
I don't know if reiserfs has a similar option. (Are there modes like this
in the other POSIX filesystems, so that it could be moved up to the VFS
level?)
So nothing about the effect of fsync() was mentioned, only that with -o sync
it was pointless, since each write() call was already synchronous, and
without -o sync, you would have the data, but not necessarily know what its
delivery state is (if the crash is at the wrong time).
Anyone please point out any errors I may have made...
Thanks,
Mike
* Re: Status of fsync() wrt mail servers
2003-09-10 23:49 ` Mike Fedyk
@ 2003-09-11 12:33 ` Matthias Andree
2003-09-11 17:25 ` Mike Fedyk
0 siblings, 1 reply; 12+ messages in thread
From: Matthias Andree @ 2003-09-11 12:33 UTC (permalink / raw)
To: Mike Fedyk; +Cc: reiserfs-list
Mike Fedyk <mfedyk@matchmail.com> writes:
> During the processing of a message the MTA will do several renames, moves,
> and other calls that manipulate its directory entry.
Different MTAs implement their queue differently. Postfix doesn't rename
the file into place, unlike qmail, it just drops a file, fsync()s it and
that's it.
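A rename-into-place queue write, of the general kind attributed to qmail
above, might look like the sketch below. It is not taken from qmail or
Postfix; the function name and the temporary/final path layout are
assumptions. It also fsync()s the parent directory after the rename,
which is one way to ask for the directory entry to be on disk when
nothing like dirsync is in use.

```python
import os


def deliver_by_rename(tmp_path, final_path, data):
    """Write to tmp_path, fsync it, rename it to final_path, then
    fsync the directory that now holds the entry (hypothetical helper).
    """
    fd = os.open(tmp_path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o600)
    try:
        view = memoryview(data)
        while view:
            written = os.write(fd, view)
            view = view[written:]
        os.fsync(fd)  # file contents are flushed first
    finally:
        os.close(fd)

    # The message becomes visible under its final name in one step.
    os.rename(tmp_path, final_path)

    # Flush the directory so the new name itself is on disk.
    parent = os.path.dirname(os.path.abspath(final_path))
    dir_fd = os.open(parent, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```

The Postfix-style alternative described above skips the rename and
simply fsync()s the file where it was created.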
> Without dirsync, it is up to the filesystem and memory pressure to
> determine when the meta-data from those calls actually makes it to the
> disk. (5 seconds with ext3 and 30 seconds with reiserfs3). With
> dirsync, once the directory operation call is made, it will not return
> to the userspace program until the meta-data has made it to the disk
> (because during the rename and directory operation calls, there is no
> data only meta-data which is filesystem accounting data (directory
> entries and etc.))
Does reiserfs3.6 support dirsync? I thought it was ext3-specific until
now.
Please take care to distinguish (file) meta data from directory data.
> So nothing about the effect of fsync() was mentioned, only that with -o sync
> it was pointless, since each write() call was already synchronous, and
> without -o sync, you would have the data, but not necessarily know what its
> delivery state is (if the crash is at the wrong time).
Basically, what we know is that with Linux 2.4, ext3fs, reiserfs and XFS
will flush all pending transactions (per file system) that were
requested prior to a synchronous operation (fsync, sync, umount, ...)
out to disk.
This can heftily bite your back if you're running your MTA's queue on a
large file system that has other sustained write load (logging, data
bases, ...).
I recently helped one qmail user debug this; the symptom was that the
first mail in a burst of mails took 2 seconds to queue, subsequent mails
were queued much quicker (70 ms). He was using ext3fs, and had one huge
/ (root) file system and so the "synch the whole file system" behaviour
made his qmail-queue synch *all* his dirty blocks to disk...
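One way to look for the symptom described above is to time each
write-plus-fsync() in a burst. A rough sketch: the function name, the
file naming scheme and the payload are assumptions, and the numbers it
produces depend entirely on the filesystem and on how much other dirty
data is pending.

```python
import os
import time


def time_queue_writes(directory, count, payload):
    """Return the seconds taken by each write+fsync in a burst."""
    latencies = []
    for i in range(count):
        path = os.path.join(directory, f"msg-{i}")
        start = time.perf_counter()
        fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
        try:
            view = memoryview(payload)
            while view:
                written = os.write(fd, view)
                view = view[written:]
            os.fsync(fd)
        finally:
            os.close(fd)
        latencies.append(time.perf_counter() - start)
    return latencies
```

If the first entry is much larger than the rest on a busy filesystem,
that would be consistent with the first fsync() flushing a backlog of
unrelated dirty blocks.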
--
Matthias Andree
Encrypt your mail: my GnuPG key ID is 0x052E7D95
* Re: Status of fsync() wrt mail servers
2003-09-11 12:33 ` Matthias Andree
@ 2003-09-11 17:25 ` Mike Fedyk
2003-09-12 0:22 ` Matthias Andree
0 siblings, 1 reply; 12+ messages in thread
From: Mike Fedyk @ 2003-09-11 17:25 UTC (permalink / raw)
To: Matthias Andree; +Cc: reiserfs-list, linux-kernel
On Thu, Sep 11, 2003 at 02:33:25PM +0200, Matthias Andree wrote:
> Does reiserfs3.6 support dirsync? I thought it was ext3-specific until
> now.
>
That was what I was asking too.
> Please take care to distinguish (file) meta data from directory data.
>
Hmm, it seems to me, that all meta-data relating to the file fsync() was
called on should be sent to the disk and waited for by the call.
> Basically, what we know is that with Linux 2.4, ext3fs, reiserfs and XFS
> will flush all pending transactions (per file system) that were
> requested prior to a synchronous operation (fsync, sync, umount, ...)
> out to disk.
>
> This can heftily bite your back if you're running your MTA's queue on a
> large file system that has other sustained write load (logging, data
> bases, ...).
>
> I recently helped one qmail user debug this; the symptom was that the
> first mail in a burst of mails took 2 seconds to queue, subsequent mails
> were queued much quicker (70 ms). He was using ext3fs, and had one huge
> / (root) file system and so the "synch the whole file system" behaviour
> made his qmail-queue synch *all* his dirty blocks to disk...
Can you be sure the MTA wasn't calling sync() just to be sure? (Many MTAs are
funny in that they think the spool is on a separate disk and filesystem.)
fsync() shouldn't be flushing anything not relating to the file it was
called on (that includes directory entries related to the file also IMHO).
Also, if the MTA wasn't running as root, it shouldn't be able to make sync()
affect the entire system. Is there anything that says that sync() can't
just flush the user's buffers unless you're running as root or with some
CAP_ capability?
Mike
* Re: Status of fsync() wrt mail servers
2003-09-11 17:25 ` Mike Fedyk
@ 2003-09-12 0:22 ` Matthias Andree
0 siblings, 0 replies; 12+ messages in thread
From: Matthias Andree @ 2003-09-12 0:22 UTC (permalink / raw)
To: Matthias Andree; +Cc: reiserfs-list, linux-kernel
Mike Fedyk <mfedyk@matchmail.com> writes:
>> I recently helped one qmail user debug this; the symptom was that the
>> first mail in a burst of mails took 2 seconds to queue, subsequent mails
>> were queued much quicker (70 ms). He was using ext3fs, and had one huge
>> / (root) file system and so the "synch the whole file system" behaviour
>> made his qmail-queue synch *all* his dirty blocks to disk...
>
> Can you be sure the MTA wasn't calling sync() just to be sure? (Many MTAs are
> funny in that they think the spool is on a separate disk and
> filesystem.)
For qmail and Postfix I can be. sync(8) isn't remotely useful, because
it's allowed to return before completion.
> fsync() shouldn't be flushing anything not relating to the file it was
> called on (that includes directory entries related to the file also
> IMHO).
It "should", but current implementations on Linux do exactly that: flush
everything. Maybe you've got better luck with BSD softupdates, but
that's going to munch disk I/O big time next time you reboot after a
crash: fsck needed. It runs niced in the background so the machine boots
up, but the box won't satisfy higher I/O demands. Looks like a "ex
duobus malis" game.
> Also, if the MTA wasn't running as root, it shouldn't be able to make sync()
> affect the entire system.
I'd like to see your plans that prevent DoS by local users...
One machine's light load is another one's DoS attack.
> Is there anything that says that sync() can't just flush the user's
> buffers unless you're running as root or with some CAP_ capability?
Does the kernel track "whose dirty buffer is this" (uid_t) at all?
--
Matthias Andree
Encrypt your mail: my GnuPG key ID is 0x052E7D95