Syncing a file's metadata in a portable way

public inbox for linux-kernel@vger.kernel.org

* Syncing a file's metadata in a portable way
@ 2004-07-09  3:06 Alberto Bertogli
  2004-07-09  9:39 ` Andrew Morton
  0 siblings, 1 reply; 7+ messages in thread
From: Alberto Bertogli @ 2004-07-09  3:06 UTC
  To: linux-kernel

Hi!

I wanted to know if there was a common, portable way of syncing a given
file's metadata.

In particular, I just want to create a file with open() and be sure that
after some operation the file has been properly created and even if there
is a crash, it can be accessed (modulo internal disk caches and all that
stuff).

I know that fsync() provides only data guarantees, and even the manpage
says clearly that in order to sync metadata an "explicit fsync on the file
descriptor of the directory is also needed".

However, the O_DIRECTORY flag is Linux only, making this mechanism not
portable.

Is there a way of doing this in a portable way?

I know that under some filesystems with some mount options this can be
assured just by open() returning, or fsync() on the file, but I was
looking for a more general way to do it.

Also, according to SUSv3, "If _POSIX_SYNCHRONIZED_IO is defined, the
fsync() function shall force all currently queued I/O operations
associated with the file". This seems to imply that metadata gets synced
too, or at least I think "I/O operations associated with the file" can be
interpreted to include metadata.

However, judging from a quick grep through the glibc code, the flag doesn't
seem to make a difference in this case.

Is this really used or enforced?
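
For reference, a program can at least ask the running system whether the
Synchronized I/O option is supported. The sketch below uses the standard
sysconf() call with _SC_SYNCHRONIZED_IO; the helper name
synchronized_io_option() is made up for illustration. It only reports
whether the option is advertised and says nothing about whether fsync()
actually covers metadata on a given filesystem.

```c
#include <assert.h>
#include <errno.h>
#include <unistd.h>

/* Hypothetical helper (not from glibc or this thread): ask the running
 * system whether the POSIX Synchronized I/O option is supported.
 * Returns 1 if supported, 0 if not, -1 if sysconf() itself failed.
 * The raw sysconf() value is stored in *value_out. */
int synchronized_io_option(long *value_out)
{
    assert(value_out != NULL);

    errno = 0;
    long v = sysconf(_SC_SYNCHRONIZED_IO);
    *value_out = v;

    if (v > 0)
        return 1;   /* option supported; v is usually a POSIX version number */
    if (v == -1 && errno != 0)
        return -1;  /* sysconf() reported an error */
    return 0;       /* option not supported */
}
```

A caller would pass the address of a long and inspect both the return
value and the stored number.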

Thanks a lot,
		Alberto

* Re: Syncing a file's metadata in a portable way
  2004-07-09  3:06 Syncing a file's metadata in a portable way Alberto Bertogli
@ 2004-07-09  9:39 ` Andrew Morton
  2004-07-10 11:54   ` bert hubert
  0 siblings, 1 reply; 7+ messages in thread
From: Andrew Morton @ 2004-07-09  9:39 UTC
  To: Alberto Bertogli; +Cc: linux-kernel

Alberto Bertogli <albertogli@telpin.com.ar> wrote:
>
> 
> Hi!
> 
> I wanted to know if there was a common, portable way of syncing a given
> file's metadata.
> 
> In particular, I just want to create a file with open() and be sure that
> after some operation the file has been properly created and even if there
> is a crash, it can be accessed (modulo internal disk caches and all that
> stuff).
> 
> I know that fsync() provides only data guarantees, and even the manpage
> says clearly that in order to sync metadata an "explicit fsync on the file
> descriptor of the directory is also needed".

It depends on the Linux filesystem.  On ext3, for example, fsync() will
sync all of the filesystem's metadata (and data in journalled and ordered
data mode).

But on ext2 you'll need to fsync the directory.  However, that only needs
to be done once, after the create.
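
A minimal sketch of the create-then-fsync-the-directory sequence, assuming
the directory can be opened with plain O_RDONLY (so O_DIRECTORY is not
needed) and that the system accepts fsync() on a directory descriptor.
That works on Linux; other systems may refuse it, so treat this as an
illustration rather than a portable guarantee. The names create_durably()
and sync_and_close() are hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* fsync() a descriptor, then close it, reporting the first failure. */
static int sync_and_close(int fd)
{
    int rc = fsync(fd);
    int saved = errno;

    if (close(fd) < 0 && rc == 0)
        return -1;          /* close() failed; errno comes from close() */
    if (rc < 0)
        errno = saved;      /* keep the errno from the failed fsync() */
    return rc;
}

/* Hypothetical helper: create `path` (which must live in directory `dir`)
 * and flush both the new file and the directory that names it.
 * Returns 0 on success, -1 on error with errno set. */
int create_durably(const char *dir, const char *path)
{
    assert(dir != NULL && path != NULL);

    int fd = open(path, O_WRONLY | O_CREAT | O_EXCL, 0644);
    if (fd < 0)
        return -1;
    if (sync_and_close(fd) < 0)
        return -1;

    /* Assumption: a read-only descriptor on the directory can be fsync()ed. */
    int dfd = open(dir, O_RDONLY);
    if (dfd < 0)
        return -1;
    return sync_and_close(dfd);
}
```

The directory fsync is only needed once, right after the create; later
writes to the file need only fsync() on the file itself.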

> However, the O_DIRECTORY flag is Linux only, making this mechanism not
> portable.
>
> Is there a way of doing this in a portable way?

Doing a create, followed by a system-wide sync(), followed by
write/fsync/write/fsync/...  will do what you want on all Linux
filesystems.  That might be a bit of a performance problem if you're
creating a lot of files, although probably not.

This method should be portable to other OSes if they implement sync() sanely.

But note that they may not: according to the spec, sync() doesn't _have_ to
wait for all the queued I/O to complete prior to returning.  It does on
Linux.  Some additional sync()s may be needed on other OSes.
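
The create / sync() / write / fsync sequence could be sketched roughly as
follows. The function names are invented for illustration, and the single
sync() call reflects Linux behaviour, where sync() waits for the queued
I/O; on other systems it may return early, as noted above.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: create (or open) `path` for appending, then issue a
 * system-wide sync() so the new directory entry is pushed out.
 * Returns the open descriptor, or -1 on error. */
int create_with_sync(const char *path)
{
    assert(path != NULL);

    int fd = open(path, O_WRONLY | O_CREAT | O_APPEND, 0644);
    if (fd < 0)
        return -1;

    sync();     /* returns void; waits for completion on Linux only */
    return fd;
}

/* Append `len` bytes and fsync() the file. Returns 0 or -1. */
int append_and_fsync(int fd, const void *buf, size_t len)
{
    assert(buf != NULL);

    const char *p = buf;
    while (len > 0) {
        ssize_t n = write(fd, p, len);
        if (n < 0) {
            if (errno == EINTR)
                continue;   /* interrupted before writing anything; retry */
            return -1;
        }
        p += n;
        len -= (size_t)n;
    }
    return fsync(fd);
}
```

After create_with_sync() returns, each later update is just a call to
append_and_fsync() on the same descriptor.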

* Re: Syncing a file's metadata in a portable way
  2004-07-09  9:39 ` Andrew Morton
@ 2004-07-10 11:54   ` bert hubert
  2004-07-10 20:14     ` Andrew Morton
  0 siblings, 1 reply; 7+ messages in thread
From: bert hubert @ 2004-07-10 11:54 UTC
  To: Andrew Morton; +Cc: Alberto Bertogli, linux-kernel

On Fri, Jul 09, 2004 at 02:39:48AM -0700, Andrew Morton wrote:

> It depends on the Linux filesystem.  On ext3, for example, fsync() will
> sync all of the filesystem's metadata (and data in journalled and ordered
> data mode).

I've noticed that on ext3, SQLite transactions are nearly useless, with the
smallest transactions causing 5 megabytes/s of writeout activity based on
relatively small writes. kjournald bore a large part of that according to
laptop_mode's block dump.

Do we actually need to flush the journal on fsync? I'm no fs theorist but I
wonder if having data in the journal isn't good enough - in case of failure,
the data will be there on recovery?

-- 
http://www.PowerDNS.com      Open source, database driven DNS Software 
http://lartc.org           Linux Advanced Routing & Traffic Control HOWTO

* Re: Syncing a file's metadata in a portable way
  2004-07-10 11:54   ` bert hubert
@ 2004-07-10 20:14     ` Andrew Morton
  2004-07-11 10:27       ` bert hubert
  0 siblings, 1 reply; 7+ messages in thread
From: Andrew Morton @ 2004-07-10 20:14 UTC
  To: bert hubert; +Cc: albertogli, linux-kernel

bert hubert <ahu@ds9a.nl> wrote:
>
> On Fri, Jul 09, 2004 at 02:39:48AM -0700, Andrew Morton wrote:
> 
> > It depends on the Linux filesystem.  On ext3, for example, fsync() will
> > sync all of the filesystem's metadata (and data in journalled and ordered
> > data mode).
> 
> I've noticed that on ext3, SQLite transactions are nearly useless, with the
> smallest transactions causing 5 megabytes/s of writeout activity based on
> relatively small writes. kjournald bore a large part of that according to
> laptop_mode's block dump.

If only the one file has been written to, an fsync on ext3 shouldn't
produce any more writeout than an fsync on ext2.

If there are other files on the same fs which have been written to then
they will be accidentally fsynced too, unless you're using data=writeback.

Either that, or SQLite is broken.

> Do we actually need to flush the journal on fsync? I'm no fs theorist but I
> wonder if having data in the journal isn't good enough - in case of failure,
> the data will be there on recovery?

fsync in ordered data mode will sync file data to the main fs and will sync
metadata to the journal.  It will not sync previously-journalled metadata
back to the main fs, because that's not required for a successful recovery.

* Re: Syncing a file's metadata in a portable way
  2004-07-10 20:14     ` Andrew Morton
@ 2004-07-11 10:27       ` bert hubert
  2004-07-11 10:35         ` Andrew Morton
  0 siblings, 1 reply; 7+ messages in thread
From: bert hubert @ 2004-07-11 10:27 UTC
  To: Andrew Morton; +Cc: albertogli, linux-kernel

On Sat, Jul 10, 2004 at 01:14:59PM -0700, Andrew Morton wrote:

> If only the one file has been written to, an fsync on ext3 shouldn't
> produce any more writeout than an fsync on ext2.
(...)
> Either that, or SQLite is broken.

I'll show strace and vmstat tomorrow - I found very few writes, no mmap,
some fsync and massive writeouts. On ext2, performance was two orders of
magnitude better.

	Bert

* Re: Syncing a file's metadata in a portable way
  2004-07-11 10:27       ` bert hubert
@ 2004-07-11 10:35         ` Andrew Morton
  2004-07-11 14:19           ` Alberto Bertogli
  0 siblings, 1 reply; 7+ messages in thread
From: Andrew Morton @ 2004-07-11 10:35 UTC
  To: bert hubert; +Cc: albertogli, linux-kernel

bert hubert <ahu@ds9a.nl> wrote:
>
> On Sat, Jul 10, 2004 at 01:14:59PM -0700, Andrew Morton wrote:
> 
> > If only the one file has been written to, an fsync on ext3 shouldn't
> > produce any more writeout than an fsync on ext2.
> (...)
> > Either that, or SQLite is broken.
> 
> I'll show strace and vmstat tomorrow - I found very few writes, no mmap,
> some fsync and massive writeouts. On ext2, performance was two orders of
> magnitude better.
> 

One scenario which could cause this is if the application is writing a
large amount of data to a file and is repeatedly *overwriting* that data. 
And the application is repeatedly adding new blocks to, and fsyncing a
separate file.

strace might tell us that, if the traces are skilfully captured and studied.

You should try data=writeback.  Given that the app is using fsync() for its
own data integrity purposes anyway, you don't need data=ordered.

It's strange though.  Databases often preallocate the file space, so a
regular write won't add new blocks to the file and won't allocate any new
metadata.  In this situation, an fsync() will only force a commit once per
second, when the inode mtime changes.
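
As an illustration of the preallocation pattern (not taken from SQLite or
any particular database), the sketch below reserves the file's blocks up
front with posix_fallocate() and then only overwrites inside that range.
The helper names are hypothetical, and how much journal traffic this
actually saves depends on the filesystem and mount options.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: reserve `size` bytes for `fd` so that later writes
 * inside that range do not need to allocate new blocks. Returns 0 or -1. */
int preallocate_file(int fd, off_t size)
{
    /* posix_fallocate() returns an error number instead of setting errno. */
    int err = posix_fallocate(fd, 0, size);
    if (err != 0) {
        errno = err;
        return -1;
    }
    return fsync(fd);   /* flush the new size and block allocation once */
}

/* Overwrite `len` bytes at `off` inside the preallocated range and fsync(). */
int overwrite_and_fsync(int fd, const void *buf, size_t len, off_t off)
{
    assert(buf != NULL);

    const char *p = buf;
    while (len > 0) {
        ssize_t n = pwrite(fd, p, len, off);
        if (n < 0) {
            if (errno == EINTR)
                continue;   /* interrupted before writing anything; retry */
            return -1;
        }
        p += n;
        off += n;
        len -= (size_t)n;
    }
    return fsync(fd);
}
```

With the space reserved once by preallocate_file(), each subsequent
overwrite_and_fsync() call leaves the file size and block map unchanged.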

* Re: Syncing a file's metadata in a portable way
  2004-07-11 10:35         ` Andrew Morton
@ 2004-07-11 14:19           ` Alberto Bertogli
  0 siblings, 0 replies; 7+ messages in thread
From: Alberto Bertogli @ 2004-07-11 14:19 UTC
  To: Andrew Morton; +Cc: bert hubert, linux-kernel

On Sun, Jul 11, 2004 at 03:35:27AM -0700, Andrew Morton wrote:
> bert hubert <ahu@ds9a.nl> wrote:
> >
> > On Sat, Jul 10, 2004 at 01:14:59PM -0700, Andrew Morton wrote:
> > 
> > > If only the one file has been written to, an fsync on ext3 shouldn't
> > > produce any more writeout than an fsync on ext2.
> > (...)
> > > Either that, or SQLite is broken.
> > 
> > > I'll show strace and vmstat tomorrow - I found very few writes, no mmap,
> > some fsync and massive writeouts. On ext2, performance was two orders of
> > magnitude better.
> > 
> 
> One scenario which could cause this is if the application is writing a
> large amount of data to a file and is repeatedly *overwriting* that data. 
> And the application is repeatedly adding new blocks to, and fsyncing a
> separate file.

I don't know about SQLite, but I've written a small transactional I/O
library and it seems to trigger this behaviour too.

I compare fsx opening files with O_SYNC against fsx using the library in a
mode called "lingering transactions", which writes the data synchronously
only once, when the transaction is committed (and fsync()s at the end, which
doesn't seem to make a significant difference).

In this mode the library creates a file for each transaction, writes to it
using pwrite and then fsyncs both the file and the parent directory. Then
it uses pwrite to write to the real file, without syncing it.
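
The sequence just described might look roughly like the sketch below. This
is not the library's actual code: the function names are invented, the
transaction file here holds only the raw payload (a real one would also
need to record the target offset and length so it could be replayed), and
cleanup of the transaction file is left out.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Write all `len` bytes at `off`, retrying on short writes and EINTR. */
static int pwrite_all(int fd, const void *buf, size_t len, off_t off)
{
    const char *p = buf;
    while (len > 0) {
        ssize_t n = pwrite(fd, p, len, off);
        if (n < 0) {
            if (errno == EINTR)
                continue;
            return -1;
        }
        p += n;
        off += n;
        len -= (size_t)n;
    }
    return 0;
}

/* Open `path` read-only and fsync() it (used here for the directory). */
static int fsync_path(const char *path)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;
    int rc = fsync(fd);
    close(fd);
    return rc;
}

/* Hypothetical commit step: log the payload to a new transaction file in
 * `dir`, make that file and its directory entry durable, then apply the
 * write to the real file without syncing it. Returns 0 or -1. */
int lingering_commit(const char *dir, const char *txn_path, int real_fd,
                     const void *buf, size_t len, off_t off)
{
    assert(dir != NULL && txn_path != NULL && buf != NULL);

    /* 1. Create the per-transaction file and write the payload to it. */
    int tfd = open(txn_path, O_WRONLY | O_CREAT | O_EXCL, 0600);
    if (tfd < 0)
        return -1;
    if (pwrite_all(tfd, buf, len, 0) < 0 || fsync(tfd) < 0) {
        close(tfd);
        return -1;
    }
    if (close(tfd) < 0)
        return -1;

    /* 2. fsync the parent directory so the new file's name is durable. */
    if (fsync_path(dir) < 0)
        return -1;

    /* 3. Apply the write to the real file, deliberately unsynced. */
    return pwrite_all(real_fd, buf, len, off);
}
```

Each commit therefore costs two fsync() calls (transaction file and
directory), which is where the fsync time in the "linger" rows comes from.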

I'm using a USB flash device, so disk seeks are not so costly. Here are the
results, running "fsx -R -W -p 1024 -N 1000 testfile" as root, on the
flash. For more operations (-N) the relation between the tests is pretty
much the same.

Tests are:
* sync: fsx opening everything O_SYNC (uses write())
* linger: fsx using the library with the method described above (uses
	pwrite and fsync)

Time is measured with "time" (real), and the time spent in write and fsync
with ltrace -S -c (in seconds), taken in different runs so ltrace overhead
doesn't show up in time. The other functions and system calls don't make a
significant difference.

I tested ext2, and ext3 with data=ordered (ext3-o below) and data=writeback
(ext3-w below), without any other mount options.

test    fs      total time  write      fsync      ltrace total

sync    ext2    0m22.956s   69.007234  ---        153.888504
linger  ext2    0m27.358s   ---        81.107975  191.014929

sync    ext3-o  0m23.709s   69.143989  ---        162.130448
linger  ext3-o  0m37.234s   ---        109.51823  243.963197

sync    ext3-w  0m22.622s   71.071572  ---        160.095286
linger  ext3-w  0m26.429s   ---        76.482683  188.377637

So ext3 in writeback mode has almost the same numbers as ext2, but
ordered mode is much slower in the library case.

Thanks,
		Alberto
