* Syncing a file's metadata in a portable way
From: Alberto Bertogli @ 2004-07-09 3:06 UTC (permalink / raw)
To: linux-kernel
Hi!
I wanted to know if there was a common, portable way of syncing a given
file's metadata.
In particular, I just want to create a file with open() and be sure that
after some operation the file has been properly created and even if there
is a crash, it can be accessed (modulo internal disk caches and all that
stuff).
I know that fsync() provides only data guarantees, and even the manpage
says clearly that in order to sync metadata an "explicit fsync on the file
descriptor of the directory is also needed".
However, the O_DIRECTORY flag is Linux only, making this mechanism not
portable.
Is there a way of doing this in a portable way?
I know that under some filesystems with some mount options this can be
assured just by open() returning, or fsync() on the file, but I was
looking for a more general way to do it.
Also, according to SUSv3, "If _POSIX_SYNCHRONIZED_IO is defined, the
fsync() function shall force all currently queued I/O operations
associated with the file". This seems to imply that metadata gets synced
too, or at least I think "I/O operations associated with the file" can be
interpreted to include metadata.
However, based on a quick grep through the glibc code, it seems that the
flag doesn't make a difference in this case.
Is this really used or enforced?
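For reference, whether an implementation advertises the option can be checked from a program. The following is only a sketch: have_synchronized_io() is a hypothetical helper name, and a positive result says that the option is claimed, not that fsync() on any particular filesystem also flushes the directory entry.

```c
#include <unistd.h>

/* Hypothetical helper: returns 1 if the system advertises the POSIX
 * Synchronized I/O option, 0 otherwise. */
int have_synchronized_io(void)
{
#if defined(_POSIX_SYNCHRONIZED_IO) && (_POSIX_SYNCHRONIZED_IO + 0) > 0
    /* The option is always supported on this implementation. */
    return 1;
#elif defined(_SC_SYNCHRONIZED_IO)
    /* Undefined or 0 at compile time: ask at run time. */
    return sysconf(_SC_SYNCHRONIZED_IO) > 0;
#else
    return 0;
#endif
}
```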
Thanks a lot,
Alberto
* Re: Syncing a file's metadata in a portable way
From: Andrew Morton @ 2004-07-09 9:39 UTC (permalink / raw)
To: Alberto Bertogli; +Cc: linux-kernel
Alberto Bertogli <albertogli@telpin.com.ar> wrote:
>
>
> Hi!
>
> I wanted to know if there was a common, portable way of syncing a given
> file's metadata.
>
> In particular, I just want to create a file with open() and be sure that
> after some operation the file has been properly created and even if there
> is a crash, it can be accessed (modulo internal disk caches and all that
> stuff).
>
> I know that fsync() provides only data guarantees, and even the manpage
> says clearly that in order to sync metadata an "explicit fsync on the file
> descriptor of the directory is also needed".
It depends on the Linux filesystem. On ext3, for example, fsync() will
sync all of the filesystem's metadata (and data in journalled and ordered
data mode).
But on ext2 you'll need to fsync the directory. However, that only needs
to be done once, after the create.
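The ext2 case described above (fsync the file, then fsync its directory once after the create) might look like the sketch below. create_durable() and its two path arguments are hypothetical names chosen for illustration. The directory is opened with plain O_RDONLY, so O_DIRECTORY is not required; whether fsync() accepts a read-only directory descriptor on systems other than Linux is an assumption that would need checking per platform.

```c
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper: create `path` (which must live in `dirpath`) and
 * flush both the new file and its directory entry.
 * Returns 0 on success, -1 on error. */
int create_durable(const char *path, const char *dirpath)
{
    int fd, dfd, rc = -1;

    fd = open(path, O_WRONLY | O_CREAT | O_EXCL, 0644);
    if (fd < 0)
        return -1;

    /* A directory can be opened read-only without O_DIRECTORY. */
    dfd = open(dirpath, O_RDONLY);
    if (dfd < 0)
        goto out_file;

    /* Flush the file first, then the directory that names it. */
    if (fsync(fd) == 0 && fsync(dfd) == 0)
        rc = 0;

    close(dfd);
out_file:
    if (close(fd) != 0)
        rc = -1;
    return rc;
}
```

A call such as create_durable("/some/dir/file", "/some/dir") is needed only once per file; later writes need only fsync() on the file itself.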
> However, the O_DIRECTORY flag is Linux only, making this mechanism not
> portable.
>
> Is there a way of doing this in a portable way?
Doing a create, followed by a system-wide sync(), followed by
write/fsync/write/fsync/... will do what you want on all Linux
filesystems. That might be a bit of a performance problem if you're
creating a lot of files, although probably not.
This method should be portable to other OSes if they implement sync() sanely.
But note that they may not: according to the spec, sync() doesn't _have_ to
wait for all the queued I/O to complete prior to returning. It does on
Linux. Some additional sync()s may be needed on other OSes.
* Re: Syncing a file's metadata in a portable way
From: bert hubert @ 2004-07-10 11:54 UTC (permalink / raw)
To: Andrew Morton; +Cc: Alberto Bertogli, linux-kernel
On Fri, Jul 09, 2004 at 02:39:48AM -0700, Andrew Morton wrote:
> It depends on the Linux filesystem. On ext3, for example, fsync() will
> sync all of the filesystem's metadata (and data in journalled and ordered
> data mode).
I've noticed that on ext3, SQLite transactions are nearly useless, with the
smallest transactions causing 5 megabyte/s writeout activity based on
relatively small writes. kjournald bore a large part of that according to
laptop_mode's block dump.
Do we actually need to flush the journal on fsync? I'm no fs theorist but I
wonder if having data in the journal isn't good enough - in case of failure,
the data will be there on recovery?
--
http://www.PowerDNS.com Open source, database driven DNS Software
http://lartc.org Linux Advanced Routing & Traffic Control HOWTO
* Re: Syncing a file's metadata in a portable way
From: Andrew Morton @ 2004-07-10 20:14 UTC (permalink / raw)
To: bert hubert; +Cc: albertogli, linux-kernel
bert hubert <ahu@ds9a.nl> wrote:
>
> On Fri, Jul 09, 2004 at 02:39:48AM -0700, Andrew Morton wrote:
>
> > It depends on the Linux filesystem. On ext3, for example, fsync() will
> > sync all of the filesystem's metadata (and data in journalled and ordered
> > data mode).
>
> I've noticed that on ext3, SQLite transactions are nearly useless, with the
> smallest transactions causing 5 megabyte/s writeout activity based on
> relatively small writes. kjournald bore a large part of that according to
> laptop_mode's block dump.
If only the one file has been written to, an fsync on ext3 shouldn't
produce any more writeout than an fsync on ext2.
If there are other files on the same fs which have been written to then
they will be accidentally fsynced too, unless you're using data=writeback.
Either that, or SQLite is broken.
> Do we actually need to flush the journal on fsync? I'm no fs theorist but I
> wonder if having data in the journal isn't good enough - in case of failure,
> the data will be there on recovery?
fsync in ordered data mode will sync file data to the main fs and will sync
metadata to the journal. It will not sync previously-journalled metadata
back to the main fs, because that's not required for a successful recovery.
* Re: Syncing a file's metadata in a portable way
From: bert hubert @ 2004-07-11 10:27 UTC (permalink / raw)
To: Andrew Morton; +Cc: albertogli, linux-kernel
On Sat, Jul 10, 2004 at 01:14:59PM -0700, Andrew Morton wrote:
> If only the one file has been written to, an fsync on ext3 shouldn't
> produce any more writeout than an fsync on ext2.
(...)
> Either that, or SQLite is broken.
I'll show strace and vmstat tomorrow - I found very few writes, no mmap,
some fsync and massive writeouts. On ext2, performance was two orders of
magnitude better.
Bert
--
http://www.PowerDNS.com Open source, database driven DNS Software
http://lartc.org Linux Advanced Routing & Traffic Control HOWTO
* Re: Syncing a file's metadata in a portable way
From: Andrew Morton @ 2004-07-11 10:35 UTC (permalink / raw)
To: bert hubert; +Cc: albertogli, linux-kernel
bert hubert <ahu@ds9a.nl> wrote:
>
> On Sat, Jul 10, 2004 at 01:14:59PM -0700, Andrew Morton wrote:
>
> > If only the one file has been written to, an fsync on ext3 shouldn't
> > produce any more writeout than an fsync on ext2.
> (...)
> > Either that, or SQLite is broken.
>
> I'll show strace and vmstat tomorrow - I found very few writes, no mmap,
> some fsync and massive writeouts. On ext2, performance was two orders of
> magnitude better.
>
One scenario which could cause this is if the application is writing a
large amount of data to a file and is repeatedly *overwriting* that data,
while also repeatedly adding new blocks to, and fsyncing, a separate
file.
strace might tell us that, if the traces are skilfully captured and studied.
You should try data=writeback. Given that the app is using fsync() for its
own data integrity purposes anyway, you don't need data=ordered.
It's strange though. Databases often preallocate the file space, so a
regular write won't add new blocks to the file and won't allocate any new
metadata. In this situation, an fsync() will only force a commit once per
second, when the inode mtime changes.
* Re: Syncing a file's metadata in a portable way
From: Alberto Bertogli @ 2004-07-11 14:19 UTC (permalink / raw)
To: Andrew Morton; +Cc: bert hubert, linux-kernel
On Sun, Jul 11, 2004 at 03:35:27AM -0700, Andrew Morton wrote:
> bert hubert <ahu@ds9a.nl> wrote:
> >
> > On Sat, Jul 10, 2004 at 01:14:59PM -0700, Andrew Morton wrote:
> >
> > > If only the one file has been written to, an fsync on ext3 shouldn't
> > > produce any more writeout than an fsync on ext2.
> > (...)
> > > Either that, or SQLite is broken.
> >
> > I'll show strace and vmstat tomorrow - I found very few writes, no mmap,
> > some fsync and massive writeouts. On ext2, performance was two orders of
> > magnitude better.
> >
>
> One scenario which could cause this is if the application is writing a
> large amount of data to a file and is repeatedly *overwriting* that data.
> And the application is repeatedly adding new blocks to, and fsyncing a
> separate file.
I don't know about SQLite, but I've written a small transactional I/O
library and it seems to trigger this behaviour too.
I compared fsx opening files O_SYNC against fsx using the library in a
mode called "lingering transactions", which writes the data synchronously
only once, when the transaction is committed (and fsync()s at the end, which
doesn't seem to make a significant difference).
In this mode the library creates a file for each transaction, writes to it
using pwrite and then fsyncs both the file and the parent directory. Then
it uses pwrite to write to the real file, without syncing it.
I'm using a USB flash drive, so disk seeks are not so costly. Here are the
results, running "fsx -R -W -p 1024 -N 1000 testfile" as root, on the
flash. For more operations (-N) the relation between the tests is pretty
much the same.
Tests are:
* sync: fsx opening everything O_SYNC (uses write())
* linger: fsx using the library with the method described above (uses
pwrite and fsync)
Time is measured with "time" (real), and the time spent in write and fsync
with ltrace -S -c (in seconds), taken in different runs so ltrace overhead
doesn't show up in time. The other functions and system calls don't make a
significant difference.
I tested ext2 and ext3 with data=ordered and data=writeback, with no other
mount options.
test    fs      total time  write (s)   fsync (s)   ltrace total (s)
sync    ext2    0m22.956s   69.007234   ---         153.888504
linger  ext2    0m27.358s   ---         81.107975   191.014929
sync    ext3-o  0m23.709s   69.143989   ---         162.130448
linger  ext3-o  0m37.234s   ---         109.51823   243.963197
sync    ext3-w  0m22.622s   71.071572   ---         160.095286
linger  ext3-w  0m26.429s   ---         76.482683   188.377637
So ext3 in writeback mode has almost the same numbers as ext2, but ordered
mode is much slower in the library case.
Thanks,
Alberto