* [rfc] fsync_range?
@ 2009-01-20 16:47 Nick Piggin
2009-01-20 18:31 ` Jamie Lokier
0 siblings, 1 reply; 42+ messages in thread
From: Nick Piggin @ 2009-01-20 16:47 UTC (permalink / raw)
To: linux-fsdevel
Just wondering if we should add an fsync_range syscall like AIX and
some BSDs have? It's pretty simple for the pagecache, since it
already implements the full sync in terms of range syncs anyway. For
filesystems and user programs, I imagine it is a bit easier to convert
from fsync to fsync_range than to use the sync_file_range syscall.

Having a flags argument is nice, but AIX seems to use O_SYNC as the
flag; I wonder if we should follow?
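(For concreteness, a minimal sketch of how the proposed call might look from
userspace, assuming the AIX-style convention of passing O_SYNC/O_DSYNC as the
"how" argument. The wrapper, the syscall number and the naive 64-bit argument
passing are all placeholders - nothing is wired up, and a real 32-bit ABI
would need the usual loff_t splitting.)

/* Hypothetical userspace wrapper for the proposed syscall.
 * __NR_fsync_range is a made-up placeholder; no number has been assigned. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

#ifndef __NR_fsync_range
#define __NR_fsync_range 334	/* placeholder only */
#endif

static inline int fsync_range(int fd, int how, loff_t start, loff_t length)
{
	/* how is O_SYNC or O_DSYNC, as in the AIX interface */
	return syscall(__NR_fsync_range, fd, how, start, length);
}

/* e.g. flush just the last megabyte appended to a log file:
 *	fsync_range(log_fd, O_DSYNC, log_size - (1 << 20), 1 << 20);
 */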
Patch isn't complete...
---
fs/sync.c | 54 ++++++++++++++++++++++++++++++++++++++++++++++++++----
1 file changed, 50 insertions(+), 4 deletions(-)
Index: linux-2.6/fs/sync.c
===================================================================
--- linux-2.6.orig/fs/sync.c
+++ linux-2.6/fs/sync.c
@@ -76,10 +76,12 @@ int file_fsync(struct file *filp, struct
}
/**
- * vfs_fsync - perform a fsync or fdatasync on a file
+ * vfs_fsync_range - perform a fsync or fdatasync on part of a file
* @file: file to sync
* @dentry: dentry of @file
 * @datasync: only perform a fdatasync operation
+ * @start: first byte to be synced
+ * @end: last byte to be synced
*
* Write back data and metadata for @file to disk. If @datasync is
* set only metadata needed to access modified file data is written.
@@ -88,7 +90,8 @@ int file_fsync(struct file *filp, struct
* only @dentry is set. This can only happen when the filesystem
* implements the export_operations API.
*/
-int vfs_fsync(struct file *file, struct dentry *dentry, int datasync)
+int vfs_fsync_range(struct file *file, struct dentry *dentry, int datasync,
+ loff_t start, loff_t end)
{
const struct file_operations *fop;
struct address_space *mapping;
@@ -112,7 +115,7 @@ int vfs_fsync(struct file *file, struct
goto out;
}
- ret = filemap_fdatawrite(mapping);
+ ret = filemap_fdatawrite_range(mapping, start, end);
/*
* We need to protect against concurrent writers, which could cause
@@ -123,12 +126,32 @@ int vfs_fsync(struct file *file, struct
if (!ret)
ret = err;
mutex_unlock(&mapping->host->i_mutex);
- err = filemap_fdatawait(mapping);
+ err = wait_on_page_writeback_range(mapping,
+ start >> PAGE_CACHE_SHIFT, end >> PAGE_CACHE_SHIFT);
if (!ret)
ret = err;
out:
return ret;
}
+EXPORT_SYMBOL(vfs_fsync_range);
+
+/**
+ * vfs_fsync - perform a fsync or fdatasync on a file
+ * @file: file to sync
+ * @dentry: dentry of @file
+ * @datasync: only perform a fdatasync operation
+ *
+ * Write back data and metadata for @file to disk. If @datasync is
+ * set only metadata needed to access modified file data is written.
+ *
+ * In case this function is called from nfsd @file may be %NULL and
+ * only @dentry is set. This can only happen when the filesystem
+ * implements the export_operations API.
+ */
+int vfs_fsync(struct file *file, struct dentry *dentry, int datasync)
+{
+ return vfs_fsync_range(file, dentry, datasync, 0, LLONG_MAX);
+}
EXPORT_SYMBOL(vfs_fsync);
static int do_fsync(unsigned int fd, int datasync)
@@ -154,6 +177,29 @@ SYSCALL_DEFINE1(fdatasync, unsigned int,
return do_fsync(fd, 1);
}
+SYSCALL_DEFINE(fsync_range)(int fd, int how, loff_t start, loff_t length)
+{
+ struct file *file;
+ loff_t end;
+ int ret = -EBADF;
+
+ if (how != O_DSYNC && how != O_SYNC)
+ return -EINVAL;
+
+ if (length == 0)
+ end = LLONG_MAX;
+ else
+ end = start + length - 1;
+
+ file = fget(fd);
+ if (file) {
+ ret = vfs_fsync_range(file, file->f_path.dentry, how == O_DSYNC,
+ start, end);
+ fput(file);
+ }
+ return ret;
+}
+
/*
* sys_sync_file_range() permits finely controlled syncing over a segment of
* a file in the range offset .. (offset+nbytes-1) inclusive. If nbytes is
^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [rfc] fsync_range?
2009-01-20 16:47 [rfc] fsync_range? Nick Piggin
@ 2009-01-20 18:31 ` Jamie Lokier
2009-01-20 21:25 ` Bryan Henderson
2009-01-21 1:29 ` Nick Piggin
0 siblings, 2 replies; 42+ messages in thread
From: Jamie Lokier @ 2009-01-20 18:31 UTC (permalink / raw)
To: Nick Piggin; +Cc: linux-fsdevel

Nick Piggin wrote:
> Just wondering if we should add an fsync_range syscall like AIX and
> some BSDs have? It's pretty simple for the pagecache since it
> already implements the full sync with range syncs anyway. For
> filesystems and user programs, I imagine it is a bit easier to
> convert to fsync_range from fsync rather than use the sync_file_range
> syscall.
>
> Having a flags argument is nice, but AIX seems to use O_SYNC as a
> flag, I wonder if we should follow?

I like the idea. It's much easier to understand than sync_file_range,
whose man page doesn't really explain how to use it correctly.

But how is fsync_range different from the sync_file_range syscall with
all its flags set?

For database writes, you typically write a bunch of stuff in various
regions of a big file (or multiple files), then ideally fdatasync
some/all of the written ranges - with writes committed to disk in the
best order determined by the OS and I/O scheduler.

For this, taking a vector of multiple ranges would be nice.
Alternatively, issuing parallel fsync_range calls from multiple
threads would approximate the same thing - if (big if) they aren't
serialised by the kernel.

--
Jamie

^ permalink raw reply	[flat|nested] 42+ messages in thread
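(A rough sketch of the multi-threaded approximation described above, reusing
the hypothetical fsync_range() wrapper from the first message; error handling
is omitted, and whether this buys anything depends entirely on the calls not
being serialised in the kernel.)

#include <pthread.h>

struct sync_req {
	int	fd;
	loff_t	start;
	loff_t	length;
};

static void *sync_one(void *arg)
{
	struct sync_req *r = arg;

	/* hypothetical wrapper sketched earlier in the thread */
	fsync_range(r->fd, O_DSYNC, r->start, r->length);
	return NULL;
}

/* Push N written ranges to disk in parallel and wait for all of them. */
static void sync_ranges(struct sync_req *req, int n)
{
	pthread_t tid[n];
	int i;

	for (i = 0; i < n; i++)
		pthread_create(&tid[i], NULL, sync_one, &req[i]);
	for (i = 0; i < n; i++)
		pthread_join(tid[i], NULL);
}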
* Re: [rfc] fsync_range?
2009-01-20 18:31 ` Jamie Lokier
@ 2009-01-20 21:25 ` Bryan Henderson
2009-01-20 22:42 ` Jamie Lokier
2009-01-21 1:36 ` Nick Piggin
1 sibling, 2 replies; 42+ messages in thread
From: Bryan Henderson @ 2009-01-20 21:25 UTC (permalink / raw)
To: Jamie Lokier; +Cc: linux-fsdevel, Nick Piggin

> For database writes, you typically write a bunch of stuff in various
> regions of a big file (or multiple files), then ideally fdatasync
> some/all of the written ranges - with writes committed to disk in the
> best order determined by the OS and I/O scheduler.
>
> For this, taking a vector of multiple ranges would be nice.
> Alternatively, issuing parallel fsync_range calls from multiple
> threads would approximate the same thing - if (big if) they aren't
> serialised by the kernel.

That sounds like a job for fadvise(). A new FADV_WILLSYNC says you're
planning to sync that data soon. The kernel responds by scheduling the
I/O immediately. fsync_range() takes a single range and in this case is
just a wait. I think it would be easier for the user as well as more
flexible for the kernel than a multi-range fsync_range() or multiple
threads.

--
Bryan Henderson                     IBM Almaden Research Center
San Jose CA                         Storage Systems

^ permalink raw reply	[flat|nested] 42+ messages in thread
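(Illustration of the usage pattern being proposed here. FADV_WILLSYNC does
not exist, so the advice value below is a pure placeholder; fsync_range() is
the hypothetical wrapper and struct sync_req the struct from the earlier
sketches.)

#include <fcntl.h>

#ifndef POSIX_FADV_WILLSYNC
#define POSIX_FADV_WILLSYNC	1024	/* proposed advice, placeholder value */
#endif

/* Hint every range first so the kernel can start writeout at once... */
static void sync_ranges_advised(int fd, const struct sync_req *req, int n)
{
	int i;

	for (i = 0; i < n; i++)
		posix_fadvise(fd, req[i].start, req[i].length,
			      POSIX_FADV_WILLSYNC);

	/* ...then wait; by now each fsync_range() is (mostly) just a wait. */
	for (i = 0; i < n; i++)
		fsync_range(fd, O_DSYNC, req[i].start, req[i].length);
}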
* Re: [rfc] fsync_range?
2009-01-20 21:25 ` Bryan Henderson
@ 2009-01-20 22:42 ` Jamie Lokier
2009-01-21 19:43 ` Bryan Henderson
0 siblings, 1 reply; 42+ messages in thread
From: Jamie Lokier @ 2009-01-20 22:42 UTC (permalink / raw)
To: Bryan Henderson; +Cc: linux-fsdevel, Nick Piggin

Bryan Henderson wrote:
> > For database writes, you typically write a bunch of stuff in various
> > regions of a big file (or multiple files), then ideally fdatasync
> > some/all of the written ranges - with writes committed to disk in the
> > best order determined by the OS and I/O scheduler.
> >
> > For this, taking a vector of multiple ranges would be nice.
> > Alternatively, issuing parallel fsync_range calls from multiple
> > threads would approximate the same thing - if (big if) they aren't
> > serialised by the kernel.
>
> That sounds like a job for fadvise(). A new FADV_WILLSYNC says you're
> planning to sync that data soon. The kernel responds by scheduling the
> I/O immediately. fsync_range() takes a single range and in this case is
> just a wait. I think it would be easier for the user as well as more
> flexible for the kernel than a multi-range fsync_range() or multiple
> threads.

FADV_WILLSYNC is already implemented: sync_file_range() with
SYNC_FILE_RANGE_WAIT_BEFORE|SYNC_FILE_RANGE_WRITE. That will block in
a few circumstances, but maybe that's inevitable.

If you called FADV_WILLSYNC on a few ranges to mean "soon", how do you
wait until those ranges are properly committed? How do you ensure the
right low-level I/O barriers are sent for those ranges before you
start writing post-barrier data?

I think you're saying call FADV_WILLSYNC first on all the ranges, then
call fsync_range() on each range in turn to wait for the I/O to be
complete - although that will cause unnecessary I/O barriers, one per
fsync_range(). You can do something like that with sync_file_range()
at the moment, except no way to ask for the barrier.

--
Jamie

^ permalink raw reply	[flat|nested] 42+ messages in thread
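(For reference, the existing calls Jamie is referring to; sync_file_range()
has been in Linux since 2.6.17. As noted in the thread, none of this syncs
metadata or asks the device to flush its write cache.)

#define _GNU_SOURCE
#include <fcntl.h>

/* Start writeout of a dirty range without waiting for it to finish
 * (the "FADV_WILLSYNC-like" call mentioned above). */
static int start_range_writeout(int fd, off64_t off, off64_t len)
{
	return sync_file_range(fd, off, len,
			       SYNC_FILE_RANGE_WAIT_BEFORE |
			       SYNC_FILE_RANGE_WRITE);
}

/* Later: wait for the pages in the range to finish writeout.  This still
 * does not sync metadata or flush the device cache. */
static int wait_range_writeout(int fd, off64_t off, off64_t len)
{
	return sync_file_range(fd, off, len, SYNC_FILE_RANGE_WAIT_AFTER);
}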
* Re: [rfc] fsync_range? 2009-01-20 22:42 ` Jamie Lokier @ 2009-01-21 19:43 ` Bryan Henderson 2009-01-21 21:08 ` Jamie Lokier 0 siblings, 1 reply; 42+ messages in thread From: Bryan Henderson @ 2009-01-21 19:43 UTC (permalink / raw) To: Jamie Lokier; +Cc: linux-fsdevel, Nick Piggin > Bryan Henderson wrote: > > > For this, taking a vector of multiple ranges would be nice. > > > Alternatively, issuing parallel fsync_range calls from multiple > > > threads would approximate the same thing - if (big if) they aren't > > > serialised by the kernel. > > > > That sounds like a job for fadvise(). A new FADV_WILLSYNC says you're > > planning to sync that data soon. The kernel responds by scheduling the > > I/O immediately. fsync_range() takes a single range and in this case is > > just a wait. I think it would be easier for the user as well as more > > flexible for the kernel than a multi-range fsync_range() or multiple > > threads. > > FADV_WILLSYNC is already implemented: sync_file_range() with > SYNC_FILE_RANGE_WAIT_BEFORE|SYNC_FILE_RANGE_WRITE. That will block in > a few circumstances, but maybe that's inevitable. There's actually a basic difference between a system call that says "initiate writeout" and one that says, "I plan to sync this soon," even if they have the same results in practice. (And Nick's "I won't write any time soon" idea is yet another). Though reasonable minds differ, the advice-to-kernel approach to managing file caches seems to be winning over instructions-to-kernel, and I personally much prefer it. > I think you're saying call FADV_WILLSYNC first on all the ranges, then > call fsync_range() on each range in turn to wait for the I/O to be > complete Right. The later calls tend to return immediately, of course. >- although that will cause unnecessary I/O barriers, one per >fsync_range(). What do I/O barriers have to do with it? An I/O barrier says, "don't harden later writes before these have hardened," whereas fsync_range() says, "harden these writes now." Does Linux these days send an I/O barrier to the block subsystem and/or device as part of fsync()? Or are we talking about the command to the device to harden all earlier writes (now) against a device power loss? Does fsync() do that? Either way, I can see that multiple fsync_ranges's in a row would be a little worse than just one, but it's pretty bad problem anyway, so I don't know if you could tell the difference. -- Bryan Henderson IBM Almaden Research Center San Jose CA Storage Systems ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [rfc] fsync_range? 2009-01-21 19:43 ` Bryan Henderson @ 2009-01-21 21:08 ` Jamie Lokier 2009-01-21 22:44 ` Bryan Henderson 0 siblings, 1 reply; 42+ messages in thread From: Jamie Lokier @ 2009-01-21 21:08 UTC (permalink / raw) To: Bryan Henderson; +Cc: linux-fsdevel, Nick Piggin Bryan Henderson wrote: > >- although that will cause unnecessary I/O barriers, one per > >fsync_range(). > > What do I/O barriers have to do with it? An I/O barrier says, "don't > harden later writes before these have hardened," whereas fsync_range() > says, "harden these writes now." Does Linux these days send an I/O > barrier to the block subsystem and/or device as part of fsync()? For better or worse, I/O barriers and I/O flushes are the same thing in the Linux block layer. I've argued for treating them distinctly, because there are different I/O scheduling opportunities around each of them, but there wasn't much interest. > Or are we talking about the command to the device to harden all earlier > writes (now) against a device power loss? Does fsync() do that? Ultimately that's what we're talking about, yes. Imho fsync() should do that, because a userspace database/filesystem should have access to the same integrity guarantees as an in-kernel filesystem. Linux fsync() doesn't always send the command - it's a bit unpredictable last time I looked. There are other opinions. MacOSX fsync() doesn't - because it has an fcntl() which is a stronger version of fsync() documented for that case. They preferred reduced integrity of fsync() to keep benchmarks on par with other OSes which don't send the command. Interestingly, Windows _does_ have the option to send the command to the device, controlled by userspace. If you set the Windows equivalents to O_DSYNC and O_DIRECT at the same time, then calls to the Windows equivalent to fdatasync() cause an I/O barrier command to be sent to the disk if necessary. The Windows documentation even explain the different between OS caching and device caching and when each one occurs, too. Wow - it looks like Windows (later versions) has the edge in doing the right thing here for quite some time... http://www.microsoft.com/sql/alwayson/storage-requirements.mspx http://www.microsoft.com/technet/prodtechnol/sql/2000/maintain/sqlIObasics.mspx > Either way, I can see that multiple fsync_ranges's in a row would be a > little worse than just one, but it's pretty bad problem anyway, so I don't > know if you could tell the difference. A little? It's the difference between letting the disk schedule 100 scattered writes itself, and forcing the disk to write them in the order you sent them from userspace, aside from the doubling the rate of device commands... -- Jamie ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [rfc] fsync_range? 2009-01-21 21:08 ` Jamie Lokier @ 2009-01-21 22:44 ` Bryan Henderson 2009-01-21 23:31 ` Jamie Lokier 0 siblings, 1 reply; 42+ messages in thread From: Bryan Henderson @ 2009-01-21 22:44 UTC (permalink / raw) To: Jamie Lokier; +Cc: linux-fsdevel, Nick Piggin Jamie Lokier <jamie@shareable.org> wrote on 01/21/2009 01:08:55 PM: > For better or worse, I/O barriers and I/O flushes are the same thing > in the Linux block layer. I've argued for treating them distinctly, > because there are different I/O scheduling opportunities around each > of them, but there wasn't much interest. It's hard to see how they could be combined -- flushing (waiting for the queue of writes to drain) is what you do -- at great performance cost -- when you don't have barriers available. The point of a barrier is to avoid having the queue run dry. But I don't suppose it matters for this discussion. > > Or are we talking about the command to the device to harden all earlier > > writes (now) against a device power loss? Does fsync() do that? > > Ultimately that's what we're talking about, yes. Imho fsync() should > do that, because a userspace database/filesystem should have access to > the same integrity guarantees as an in-kernel filesystem. Linux > fsync() doesn't always send the command - it's a bit unpredictable > last time I looked. Yes, it's the old performance vs integrity issue. Drives long ago came out with features to defeat operating system integrity efforts, in exchange for performance, by doing write caching by default, ignoring explicit demands to write through, etc. Obviously, some people want that, but I _have_ seen Linux developers escalate the battle for control of the disk drive. I can just never remember where it stands at any moment. But it doesn't matter in this discussion because my point is that if you accept the performance hit for integrity (I suppose we're saying that in current Linux, in some configurations, if a process does frequent fsyncs of a file, every process writing to every drive that file touches will slow to write-through speed), it will be about the same with 100 fsync_ranges in quick succession as for 1. > A little? It's the difference between letting the disk schedule 100 > scattered writes itself, and forcing the disk to write them in the > order you sent them from userspace, aside from the doubling the rate > of device commands... Again, in the scenario I'm talking about, all the writes were in the Linux I/O queue before the first fsync_range() (thanks to fadvises) , so this doesn't happen. -- Bryan Henderson IBM Almaden Research Center San Jose CA Storage Systems ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [rfc] fsync_range? 2009-01-21 22:44 ` Bryan Henderson @ 2009-01-21 23:31 ` Jamie Lokier 0 siblings, 0 replies; 42+ messages in thread From: Jamie Lokier @ 2009-01-21 23:31 UTC (permalink / raw) To: Bryan Henderson; +Cc: linux-fsdevel, Nick Piggin Bryan Henderson wrote: > Jamie Lokier <jamie@shareable.org> wrote on 01/21/2009 01:08:55 PM: > > > For better or worse, I/O barriers and I/O flushes are the same thing > > in the Linux block layer. I've argued for treating them distinctly, > > because there are different I/O scheduling opportunities around each > > of them, but there wasn't much interest. > > It's hard to see how they could be combined -- flushing (waiting for the > queue of writes to drain) is what you do -- at great performance cost -- > when you don't have barriers available. The point of a barrier is to > avoid having the queue run dry. Linux has a combined flush+barrier primitve in the block layer. Actually it's not a primitive op, it's a flag on a write meaning "do flush+barrier before and after this write", but that dates from fs transaction commits, and isn't appropriate for fsync. > Yes, it's the old performance vs integrity issue. Drives long ago came > out with features to defeat operating system integrity efforts, in > exchange for performance, by doing write caching by default, ignoring > explicit demands to write through, etc. Obviously, some people want that, > but I _have_ seen Linux developers escalate the battle for control of the > disk drive. I can just never remember where it stands at any moment. Last time I read about it, a few drives did it for a little while, then they stopped doing it and such drives are rare, if they exist at all, now. Forget about "Linux battling for control". Windows does this barrier stuff too, as does every other major OS, and Microsoft documents it in some depth. Upmarket systems use battery-backed disk controllers of course, to get speed and integrity together. Or increasingly SSDs. Certain downmarket (= cheapo) systems benefit noticably from the right barriers. Pull the power on a cheap Linux-based media player with a hard disk inside, and if it's using ext3 with barriers off, expect filesystem corruption from time to time. I and others working on such things have seen it. With barriers on, never see any corruption. This is with the cheapest small consumer disks you can find. > But it doesn't matter in this discussion because my point is that if you > accept the performance hit for integrity (I suppose we're saying that in > current Linux, in some configurations, if a process does frequent fsyncs > of a file, every process writing to every drive that file touches will > slow to write-through speed), it will be about the same with 100 > fsync_ranges in quick succession as for 1. Write-through speed depends _heavily_ on head seeking with a rotational disk. 100 fsync_ranges _for one commited app-level transaction_ is different from a succession of 100 transactions to commit. If an app requires one transaction which happens to modify 100 different places in a database file, you want those written in the best head seeking order. > > A little? It's the difference between letting the disk schedule 100 > > scattered writes itself, and forcing the disk to write them in the > > order you sent them from userspace, aside from the doubling the rate > > of device commands... 
> > Again, in the scenario I'm talking about, all the writes were in the Linux > I/O queue before the first fsync_range() (thanks to fadvises) , so this > doesn't happen. Maybe you're right about this. :-) (Persuaded). fadvise() which blocks is rather overloading the "hint" meaning of fadvise(). It could work though. It smells more like sync_file_range(), where userspace is responsible for deciding what order to submit the ranges in (because of the blocking), than fsyncv(), where the kernel uses any heuristic it likes including knowledge of filesystem block layout (higher level than elevator, but lower level than plain file offset). For userspace, that's not much different from what databases using O_DIRECT have to do _already_. They_ have to decide what order to submit I/O ranges in, one range at a time, and with AIO they get about the same amount of block elevator flexibility. Which is exactly one full block queue's worth of sorting at the head of a streaming pump of file offsets. So maybe the fadvise() method is ok... It does mean two system calls per file range, though. One fadvise() per range to submit I/O, one fsync_range() to wait for all of it afterwards. That smells like sync_file_range() too. Back to fsyncv() again? Which does have the benefit of being easy to understand too :-) -- Jamie ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [rfc] fsync_range?
2009-01-20 21:25 ` Bryan Henderson
2009-01-20 22:42 ` Jamie Lokier
@ 2009-01-21 1:36 ` Nick Piggin
2009-01-21 19:58 ` Bryan Henderson
1 sibling, 1 reply; 42+ messages in thread
From: Nick Piggin @ 2009-01-21 1:36 UTC (permalink / raw)
To: Bryan Henderson; +Cc: Jamie Lokier, linux-fsdevel

On Tue, Jan 20, 2009 at 01:25:59PM -0800, Bryan Henderson wrote:
> > For database writes, you typically write a bunch of stuff in various
> > regions of a big file (or multiple files), then ideally fdatasync
> > some/all of the written ranges - with writes committed to disk in the
> > best order determined by the OS and I/O scheduler.
> >
> > For this, taking a vector of multiple ranges would be nice.
> > Alternatively, issuing parallel fsync_range calls from multiple
> > threads would approximate the same thing - if (big if) they aren't
> > serialised by the kernel.
>
> That sounds like a job for fadvise(). A new FADV_WILLSYNC says you're
> planning to sync that data soon. The kernel responds by scheduling the
> I/O immediately. fsync_range() takes a single range and in this case is
> just a wait. I think it would be easier for the user as well as more
> flexible for the kernel than a multi-range fsync_range() or multiple
> threads.

A problem is that the kernel will not always be able to schedule the
IO without blocking (various mutexes or block device queues full etc).
And it takes multiple system calls.

If this is important functionality, I think we could do an fsyncv.

Having an FADV_ for asynchronous writeout wouldn't hurt either.
POSIX_FADV_DONTNEED basically does that, except it also drops the
cache afterwards, whereas FADV_WONTDIRTY or something doesn't
necessarily want that. It would be easy to add one to DTRT. (And I
notice FADV_DONTNEED is not taking notice of the given range when
starting writeout.)

^ permalink raw reply	[flat|nested] 42+ messages in thread
* Re: [rfc] fsync_range? 2009-01-21 1:36 ` Nick Piggin @ 2009-01-21 19:58 ` Bryan Henderson 2009-01-21 20:53 ` Jamie Lokier 0 siblings, 1 reply; 42+ messages in thread From: Bryan Henderson @ 2009-01-21 19:58 UTC (permalink / raw) To: Nick Piggin; +Cc: Jamie Lokier, linux-fsdevel Nick Piggin <npiggin@suse.de> wrote on 01/20/2009 05:36:06 PM: > On Tue, Jan 20, 2009 at 01:25:59PM -0800, Bryan Henderson wrote: > > > For this, taking a vector of multiple ranges would be nice. > > > Alternatively, issuing parallel fsync_range calls from multiple > > > threads would approximate the same thing - if (big if) they aren't > > > serialised by the kernel. > > > > That sounds like a job for fadvise(). A new FADV_WILLSYNC says you're > > planning to sync that data soon. The kernel responds by scheduling the > > I/O immediately. fsync_range() takes a single range and in this case is > > just a wait. I think it would be easier for the user as well as more > > flexible for the kernel than a multi-range fsync_range() or multiple > > threads. > > A problem is that the kernel will not always be able to schedule the > IO without blocking (various mutexes or block device queues full etc). I don't really see the problem with that. We're talking about a program that is doing device-synchronous I/O. Blocking is a way of life. Plus, the beauty of advice is that if it's hard occasionally, the kernel can just ignore it. > And it takes multiple system calls. When you're reading, the system call overhead is significant in an operation that just copies from memory to memory, so we have readv() and accept the added complexity and harder to use interface. When you're syncing a file, you're running so slowly that I doubt the overhead of multiple system calls is noticeable. There are a lot of other multiple system call sequences I would try to replace with one complex one before worrying about multi-range file sync. -- Bryan Henderson IBM Almaden Research Center San Jose CA Storage Systems ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [rfc] fsync_range? 2009-01-21 19:58 ` Bryan Henderson @ 2009-01-21 20:53 ` Jamie Lokier 2009-01-21 22:14 ` Bryan Henderson 0 siblings, 1 reply; 42+ messages in thread From: Jamie Lokier @ 2009-01-21 20:53 UTC (permalink / raw) To: Bryan Henderson; +Cc: Nick Piggin, linux-fsdevel Bryan Henderson wrote: > Nick Piggin <npiggin@suse.de> wrote on 01/20/2009 05:36:06 PM: > > > On Tue, Jan 20, 2009 at 01:25:59PM -0800, Bryan Henderson wrote: > > > > For this, taking a vector of multiple ranges would be nice. > > > > Alternatively, issuing parallel fsync_range calls from multiple > > > > threads would approximate the same thing - if (big if) they aren't > > > > serialised by the kernel. > > > > > > That sounds like a job for fadvise(). A new FADV_WILLSYNC says you're > > > > planning to sync that data soon. The kernel responds by scheduling > the > > > I/O immediately. fsync_range() takes a single range and in this case > is > > > just a wait. I think it would be easier for the user as well as more > > > flexible for the kernel than a multi-range fsync_range() or multiple > > > threads. > > > > A problem is that the kernel will not always be able to schedule the > > IO without blocking (various mutexes or block device queues full etc). > > I don't really see the problem with that. We're talking about a program > that is doing device-synchronous I/O. Blocking is a way of life. Plus, > the beauty of advice is that if it's hard occasionally, the kernel can > just ignore it. If you have 100 file regions, each one a few pages in size, and you do 100 fsync_range() calls, that results in potentally far from optimal I/O scheduling (e.g. all over the disk) *and* 100 low-level disk cache flushes (I/O barriers) instead of just one at the end. 100 head seeks and 100 cache flush ops can be very expensive. This is the point of taking a vector of ranges to flush - or some other way to "plug" the I/O and only wait for it after submitting it all. -- Jamie ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [rfc] fsync_range? 2009-01-21 20:53 ` Jamie Lokier @ 2009-01-21 22:14 ` Bryan Henderson 2009-01-21 22:30 ` Jamie Lokier 0 siblings, 1 reply; 42+ messages in thread From: Bryan Henderson @ 2009-01-21 22:14 UTC (permalink / raw) To: Jamie Lokier; +Cc: linux-fsdevel, Nick Piggin Jamie Lokier <jamie@shareable.org> wrote on 01/21/2009 12:53:56 PM: > Bryan Henderson wrote: > > Nick Piggin <npiggin@suse.de> wrote on 01/20/2009 05:36:06 PM: > > > > > On Tue, Jan 20, 2009 at 01:25:59PM -0800, Bryan Henderson wrote: > > > > > For this, taking a vector of multiple ranges would be nice. > > > > > Alternatively, issuing parallel fsync_range calls from multiple > > > > > threads would approximate the same thing - if (big if) they aren't > > > > > serialised by the kernel. > > > > > > > > That sounds like a job for fadvise(). A new FADV_WILLSYNC says you're > > > > > > planning to sync that data soon. The kernel responds by scheduling > > the > > > > I/O immediately. fsync_range() takes a single range and in this case > > is > > > > just a wait. I think it would be easier for the user as well as more > > > > flexible for the kernel than a multi-range fsync_range() or multiple > > > > threads. > > > > > > A problem is that the kernel will not always be able to schedule the > > > IO without blocking (various mutexes or block device queues full etc). > > > > I don't really see the problem with that. We're talking about a program > > that is doing device-synchronous I/O. Blocking is a way of life. Plus, > > the beauty of advice is that if it's hard occasionally, the kernel can > > just ignore it. > > If you have 100 file regions, each one a few pages in size, and you do > 100 fsync_range() calls, that results in potentally far from optimal > I/O scheduling (e.g. all over the disk) *and* 100 low-level disk cache > flushes (I/O barriers) instead of just one at the end. 100 head seeks > and 100 cache flush ops can be very expensive. You got lost in the thread here. I proposed a fadvise() that would result in I/O scheduling; Nick said the fadvise() might have to block; I said so what? Now you seem to be talking about 100 fsync_range() calls, each of which starts and then waits for a sync of one range. Getting back to I/O scheduled as a result of an fadvise(): if it blocks because the block queue is full, then it's going to block with a multi-range fsync_range() as well. The other blocks are kind of vague, but I assume they're rare and about the same as for multi-range fsync_range(). > This is the point of taking a vector of ranges to flush - or some > other way to "plug" the I/O and only wait for it after submitting it > all. My fadvise-based proposal waits for I/O only after it has all been submitted. But plugging (delaying the start of I/O even though it is ready to go and the device is idle) is rarely a good idea. It can help for short bursts to a mostly idle device (typically saves half a seek per burst), but a busy device provides a natural plug. It thus can't help throughput, but can improve the response time of a burst. -- Bryan Henderson IBM Almaden Research Center San Jose CA Storage Systems ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [rfc] fsync_range? 2009-01-21 22:14 ` Bryan Henderson @ 2009-01-21 22:30 ` Jamie Lokier 2009-01-22 1:52 ` Bryan Henderson 0 siblings, 1 reply; 42+ messages in thread From: Jamie Lokier @ 2009-01-21 22:30 UTC (permalink / raw) To: Bryan Henderson; +Cc: linux-fsdevel, Nick Piggin Bryan Henderson wrote: > > If you have 100 file regions, each one a few pages in size, and you do > > 100 fsync_range() calls, that results in potentally far from optimal > > I/O scheduling (e.g. all over the disk) *and* 100 low-level disk cache > > flushes (I/O barriers) instead of just one at the end. 100 head seeks > > and 100 cache flush ops can be very expensive. > > You got lost in the thread here. I proposed a fadvise() that would result > in I/O scheduling; Nick said the fadvise() might have to block; I said so > what? Now you seem to be talking about 100 fsync_range() calls, each of > which starts and then waits for a sync of one range. > > Getting back to I/O scheduled as a result of an fadvise(): if it blocks > because the block queue is full, then it's going to block with a > multi-range fsync_range() as well. No, why would it block? The block queue has room for (say) 100 small file ranges. If you submit 1000 ranges, sure the first 900 may block, then you've got 100 left in the queue. Then you call fsync_range() 1000 times, the first 900 are NOPs as you say because the data has been written. The remaining 100 (size of the block queue) are forced to write serially. They're even written to the disk platter in order. > My fadvise-based proposal waits for I/O only after it has all been > submitted. Are you saying one call to fsync_range() should wait for all the writes which have been queued by the fadvice to different ranges? > But plugging (delaying the start of I/O even though it is ready to go and > the device is idle) is rarely a good idea. It can help for short bursts > to a mostly idle device (typically saves half a seek per burst), but a > busy device provides a natural plug. It thus can't help throughput, but > can improve the response time of a burst. I agree, plugging doesn't make a big difference. However, letting the disk or elevator reorder the writes it has room for does sometimes make a big difference. That's the point. We're not talking about forcibly _delaying_ I/O, we're talking about giving the block elevator, and disk's own elevator, freedom to do their job by not forcibly _flushing_ and _waiting_ between each individual request for the length of the queue. -- Jamie ^ permalink raw reply [flat|nested] 42+ messages in thread
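(A sketch of the submit-everything-then-wait pattern being argued over here,
expressed with today's sync_file_range(); no waiting happens until every
range has been handed to the block layer, so the elevator and the drive are
free to reorder within each phase. The struct is illustrative, and the usual
caveats apply - no metadata sync, no device cache flush.)

struct file_range {
	off64_t	start;
	off64_t	length;
};

/* Phase 1: queue writeout for every range.  Phase 2: wait for them all. */
static int sync_many_ranges(int fd, const struct file_range *r, int n)
{
	int i, err = 0;

	for (i = 0; i < n; i++)
		if (sync_file_range(fd, r[i].start, r[i].length,
				    SYNC_FILE_RANGE_WRITE) < 0)
			err = -1;

	for (i = 0; i < n; i++)
		if (sync_file_range(fd, r[i].start, r[i].length,
				    SYNC_FILE_RANGE_WAIT_AFTER) < 0)
			err = -1;

	return err;
}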
* Re: [rfc] fsync_range? 2009-01-21 22:30 ` Jamie Lokier @ 2009-01-22 1:52 ` Bryan Henderson 2009-01-22 3:41 ` Jamie Lokier 0 siblings, 1 reply; 42+ messages in thread From: Bryan Henderson @ 2009-01-22 1:52 UTC (permalink / raw) To: Jamie Lokier; +Cc: linux-fsdevel, Nick Piggin Jamie Lokier <jamie@shareable.org> wrote on 01/21/2009 02:30:03 PM: > > Getting back to I/O scheduled as a result of an fadvise(): if it blocks > > because the block queue is full, then it's going to block with a > > multi-range fsync_range() as well. > > No, why would it block? The block queue has room for (say) 100 small > file ranges. If you submit 1000 ranges, sure the first 900 may block, > then you've got 100 left in the queue. Yes, those are the blocks Nick mentioned. They're the same as with multi-range fsync_range(), in which the one system call submits 1000 ranges. > Then you call fsync_range() 1000 times, the first 900 are NOPs as you > say because the data has been written. The remaining 100 (size of the > block queue) are forced to write serially. They're even written to > the disk platter in order. I don't see why they would go serially or in any particular order. They're in the Linux queue in sorted, coalesced form and go down to the disk in batches for the drive to do its own coalescing and ordering. Same as with multi-range fsync_range(). The Linux I/O scheduler isn't going to wait for the forthcoming fsync_range() to start any I/O that's in its queue. >Linux has a combined flush+barrier primitve in the block layer. >Actually it's not a primitive op, it's a flag on a write meaning "do >flush+barrier before and after this write", I think you said that wrong, because a barrier isn't something you do. The flag says, "put a barrier before and after this write," and I think you're saying it also implies that as the barrier passes, Linux does a device flush (e.g. SCSI Synchronize Cache) command. That would make sense as a poor man's way of propagating the barrier into the device. SCSI devices have barriers too, but they would be harder for Linux to use than a Synchronize Cache, so maybe Linux doesn't yet. I can also see that it makes sense for fsync() to use this combination. I was confused before because both the device and Linux block layer have barriers and flushes and I didn't know which ones we were talking about. >> Yes, it's the old performance vs integrity issue. Drives long ago came >> out with features to defeat operating system integrity efforts, in >> exchange for performance, by doing write caching by default, ignoring >> explicit demands to write through, etc. Obviously, some people want that, >> but I _have_ seen Linux developers escalate the battle for control of the >> disk drive. I can just never remember where it stands at any moment. > ... >Forget about "Linux battling for control". Windows does this barrier >stuff too, as does every other major OS, and Microsoft documents it in >some depth. Not sure what you want me to forget about; you seem to be confirming that Linux, as well as all other OSes are engaged in this battle (with disk designers), and it seems like a natural state of engineering practice to me. >fadvise() which blocks is rather overloading the "hint" meaning of fadvise(). Yeah, I'm not totally comfortable with that either. I've been pretty much assuming that all the ranges from this database transaction generally fit in the I/O queue. I wonder what existing fadvise(FADV_DONTNEED) does, since Linux has the same "schedule the I/O right now" response to that. 
Just ignore the hint after the queue is full? >It does mean two system calls per file range, though. One fadvise() >per range to submit I/O, one fsync_range() to wait for all of it >afterwards. That smells like sync_file_range() too. > >Back to fsyncv() again? I said in another subthread that I don't think system call overhead is at all noticeable in a program that is doing device-synchronous I/O. Not enough to justify a fsyncv() like we justify readv(). Hey, here's a small example of how the flexibility of the single range fadvise plus single range fsync_range beats a multi-range fsyncv/fsync_range: Early in this thread, we noted the value of feeding multiple ranges of the file to the block layer at once for syncing. Then we noticed that it would also be useful to feed multiple ranges of multiple files, requiring a different interface. With the two system calls per range, that latter requirement was already met without thinking about it. -- Bryan Henderson IBM Almaden Research Center San Jose CA Storage Systems ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [rfc] fsync_range? 2009-01-22 1:52 ` Bryan Henderson @ 2009-01-22 3:41 ` Jamie Lokier 0 siblings, 0 replies; 42+ messages in thread From: Jamie Lokier @ 2009-01-22 3:41 UTC (permalink / raw) To: Bryan Henderson; +Cc: linux-fsdevel, Nick Piggin Bryan Henderson wrote: > > No, why would it block? The block queue has room for (say) 100 small > > file ranges. If you submit 1000 ranges, sure the first 900 may block, > > then you've got 100 left in the queue. > > Yes, those are the blocks Nick mentioned. They're the same as with > multi-range fsync_range(), > in which the one system call submits 1000 ranges. Yes, except that fsync_range() theoretically has flexibility to order them prior to the block queue with filesystem internal knowledge. I doubt if that would ever be implemented, but you never know. > > Then you call fsync_range() 1000 times, the first 900 are NOPs as you > > say because the data has been written. The remaining 100 (size of the > > block queue) are forced to write serially. They're even written to > > the disk platter in order. > > I don't see why they would go serially or in any particular order. You're right, please ignore my brain fart. > >Linux has a combined flush+barrier primitve in the block layer. > >Actually it's not a primitive op, it's a flag on a write meaning "do > >flush+barrier before and after this write", > > I think you said that wrong, because a barrier isn't something you do. The > flag says, "put a barrier before and after this write," and I think you're > saying it also implies that as the barrier passes, Linux does a device > flush (e.g. SCSI Synchronize Cache) command. That would make sense as a > poor man's way of propagating the barrier into the device. SCSI devices > have barriers too, but they would be harder for Linux to use than a > Synchronize Cache, so maybe Linux doesn't yet. That's all correct. Linux does a device flush on PATA if the device write cache is enabled; I don't know if it does one on SCSI. Two flushes are done, before and after the flagged write I/O. There's a "softbarrier" aspect too: other writes cannot be reordered around these I/Os, and on devices which accept overlapping commands, the device queue is drained around the softbarrier. On PATA that's really all you can do. On SATA with NCQ, and on SCSI, if the device accepts1 enough commands in flight at once, it's cheaper to disable the device write cache. It's a balancing act, depending on how often you flush. I don't think Linux has ever used the SCSI barrier capabilities. One other thing it can do is synchronous write, called FUA on SATA, so flush+write+flush becomes flush+syncwrite. The only thing Linux would gain from separating flush ops from barrier ops in the block request queue, is different reordering and coalescing opportunities. It's not permitted to move writes in either direction around a barrier, but it is permitted to move writes earlier past a flush, and that may allow flushes to coalesce. > Yeah, I'm not totally comfortable with that either. I've been pretty much > assuming that all the ranges from this database transaction generally fit > in the I/O queue. I wouldn't assume that. It's legitimate to write gigabytes of data in a transaction, then want to fsync it all before writing a commit block. Only about 1 x system RAM's worth of dirty data will need flushing at that point :-) > I said in another subthread that I don't think system call overhead is at > all noticeable in a program that is doing device-synchronous I/O. 
Not > enough to justify a fsyncv() like we justify readv(). Btw, historically the justification for readv() was for sockets, not files. Separate reads don't work the same. Yes, system call overhead is quite small. But I saw recently on the QEMU list that they want to add preadv() and pwritev() to Linux, because of the difference it makes to performance compared with a sequence of pread() and pwrite() calls. That surprises me. (I wonder if they measured it). fsync_range() does less work per-page than read/write. In some scenarios, fsync_range() is scanning over large numbers of pages as quickly as possible, skipping the clean+not-writing pages. I wonder if that justifies fsyncv() :-) > Hey, here's a small example of how the flexibility of the single range > fadvise plus single range fsync_range beats a multi-range > fsyncv/fsync_range: Early in this thread, we noted the value of feeding > multiple ranges of the file to the block layer at once for syncing. Then > we noticed that it would also be useful to feed multiple ranges of > multiple files, requiring a different interface. With the two system > calls per range, that latter requirement was already met without thinking > about it. That's why Nick proposed fsyncv take (file, start, length) tuples, to sync multiple files :-) If you do it the blocking-fadvise way, the blocking bits. You'll block while feeding requests for the first file, until you get started on the second, and so on. No chance for parallelism - e.g. what if the files are on different devices in btrfs? :-) (Same for different extents in the same file actually). That said, I'll be very surprised if fsyncv() is implemented smarter than that. As an API allowing the possibility, it sort of works though. Who knows, it might just pass the work on to btrfs or Tux3 to optimise cleverly :-) (That said #2, an AIO based API would _in principle_ provide yet more freedom to convery what the app wants without overconstraining. Overlap, parallelism, and not having to batch things up, but submit them as needs come in. Is that realistic?) -- Jamie ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [rfc] fsync_range?
2009-01-20 18:31 ` Jamie Lokier
2009-01-20 21:25 ` Bryan Henderson
@ 2009-01-21 1:29 ` Nick Piggin
2009-01-21 3:15 ` Jamie Lokier
2009-01-21 3:25 ` Jamie Lokier
1 sibling, 2 replies; 42+ messages in thread
From: Nick Piggin @ 2009-01-21 1:29 UTC (permalink / raw)
To: Jamie Lokier; +Cc: linux-fsdevel

On Tue, Jan 20, 2009 at 06:31:21PM +0000, Jamie Lokier wrote:
> Nick Piggin wrote:
> > Just wondering if we should add an fsync_range syscall like AIX and
> > some BSDs have? It's pretty simple for the pagecache since it
> > already implements the full sync with range syncs anyway. For
> > filesystems and user programs, I imagine it is a bit easier to
> > convert to fsync_range from fsync rather than use the sync_file_range
> > syscall.
> >
> > Having a flags argument is nice, but AIX seems to use O_SYNC as a
> > flag, I wonder if we should follow?
>
> I like the idea. It's much easier to understand than sync_file_range,
> whose man page doesn't really explain how to use it correctly.
>
> But how is fsync_range different from the sync_file_range syscall with
> all its flags set?

sync_file_range would have to wait, then write, then wait. It also
does not call into the filesystem's ->fsync function; I don't know
what the wider consequences of that are for all filesystems, but
for some it means that metadata required to read back the data is
not synced properly, and often it means that metadata sync will not
work.

Filesystems could also much more easily get converted to a ->fsync_range
function if that would be beneficial to any of them.

> For database writes, you typically write a bunch of stuff in various
> regions of a big file (or multiple files), then ideally fdatasync
> some/all of the written ranges - with writes committed to disk in the
> best order determined by the OS and I/O scheduler.

Do you know which databases do this? It would be nice to ask for their
input and see whether it helps them (I presume it is an OSS database
because the "big" ones just use direct IO and manage their own
buffers, right?)

Today, they will have to just fsync the whole file. So they first must
identify which parts of the file need syncing, and then gather those
parts as a vector.

> For this, taking a vector of multiple ranges would be nice.
> Alternatively, issuing parallel fsync_range calls from multiple
> threads would approximate the same thing - if (big if) they aren't
> serialised by the kernel.

I was thinking about doing something like that, but I just wanted to
get basic fsync_range... OTOH, we could do an fsyncv syscall and glibc
could implement fsync_range on top of that?

^ permalink raw reply	[flat|nested] 42+ messages in thread
* Re: [rfc] fsync_range? 2009-01-21 1:29 ` Nick Piggin @ 2009-01-21 3:15 ` Jamie Lokier 2009-01-21 3:48 ` Nick Piggin 2009-01-21 4:16 ` Nick Piggin 2009-01-21 3:25 ` Jamie Lokier 1 sibling, 2 replies; 42+ messages in thread From: Jamie Lokier @ 2009-01-21 3:15 UTC (permalink / raw) To: Nick Piggin; +Cc: linux-fsdevel Nick Piggin wrote: > > I like the idea. It's much easier to understand than sync_file_range, > > whose man page doesn't really explain how to use it correctly. > > > > But how is fsync_range different from the sync_file_range syscall with > > all its flags set? > > sync_file_range would have to wait, then write, then wait. It also > does not call into the filesystem's ->fsync function, I don't know > what the wider consequences of that are for all filesystems, but > for some it means that metadata required to read back the data is > not synced properly, and often it means that metadata sync will not > work. fsync_range() must also wait, write, then wait again. The reason is this sequence of events: 1. App calls write() on a page, dirtying it. 2. Data writeout is initiated by usual kernel task. 3. App calls write() on the page again, dirtying it again. 4. App calls fsync_range() on the page. 5. ... Dum de dum, time passes ... 6. Writeout from step 2 completes. 7. fsync_range() initiates another writeout, because the in-progress writeout from step 2 might not include the changes from step 3. 7. fsync_range() waits for writout from step 7. 8. fsync_range() requests a device cache flush if needed (we hope!). 9. Returns to app. Therefore fsync_range() must wait for in-progress writeout to complete, before initiating more writeout and waiting again. This is the reason sync_file_range() has all those flags. As I said, the man page doesn't really explain how to use it properly. An optimisation would be to detect I/O that's been queued on an elevator, but where the page has not actually been read (i.e. no DMA or bounce buffer copy done yet). Most queued I/O presumably falls into this category, and the second writeout would not be required. But perhaps this doesn't happen much in real life? Also the kernel is in a better position to decide which order to do everything in, and how best to batch it. Also, during the first wait (for in-progress writeout) the kernel could skip ahead to queuing some of the other pages for writeout as long as there is room in the request queue, and come back to the other pages later. > Filesystems could also much more easily get converted to a ->fsync_range > function if that would be beneficial to any of them. > > > > For database writes, you typically write a bunch of stuff in various > > regions of a big file (or multiple files), then ideally fdatasync > > some/all of the written ranges - with writes committed to disk in the > > best order determined by the OS and I/O scheduler. > > Do you know which databases do this? It will be nice to ask their > input and see whether it helps them (I presume it is an OSS database > because the "big" ones just use direct IO and manage their own > buffers, right?) I don't know if anyone uses sync_file_range(), or if it even works reliably, since it's not going to get much testing. I don't use it myself yet. My interest is in developing (yet another?) high performance but reliable database engine, not an SQL one though. That's why I keep noticing the issues with fsync, sync_file_range, barriers etc. 
Take a look at this, though: http://linux.derkeiler.com/Mailing-Lists/Kernel/2007-04/msg00811.html "The results show fadvise + sync_file_range is on par or better than O_DIRECT. Detailed results are attached." By the way, direct I/O is nice but (a) not always possible, and (b) you don't get the integrity barriers, do you? > Today, they will have to just fsync the whole file. So they first must > identify which parts of the file need syncing, and then gather those > parts as a vector. Having to fsync the whole file is one reason that some databases use separate journal files - so fsync only flushes the journal file, not the big data file which can sometimes be more relaxed. It's also a reason some databases recommend splitting the database into multiple files of limited size - so the hit from fsync is reduced. When a single file is used for journal and data (like e.g. ext3-in-a-file), every transaction (actually coalesced set of transactions) forces the disk head back and forth between two data areas. If the journal can be synced by itself, the disk head doesn't need to move back and forth as much. Identifying which parts to sync isn't much different than a modern filesystem needs to do with its barriers, journals and journal-trees. They have a lot in common. This is bread and butter stuff for database engines. fsync_range would remove those reasons for using separate files, making the database-in-a-single-file implementations more efficient. That is administratively much nicer, imho. Similar for userspace filesystem-in-a-file, which is basically the same. > > For this, taking a vector of multiple ranges would be nice. > > Alternatively, issuing parallel fsync_range calls from multiple > > threads would approximate the same thing - if (big if) they aren't > > serialised by the kernel. > > I was thinking about doing something like that, but I just wanted to > get basic fsync_range... OTOH, we could do an fsyncv syscall and gcc > could implement fsync_range on top of that? Rather than fsyncv, is there some way to separate the fsync into parts? 1. A sequence of system calls to designate ranges. 2. A call to say "commit and wait on all those ranges given in step 1". It seems sync_file_range() isn't _that_ far off doing that, except it doesn't get the metadata right, as you say, and it doesn't have a place for the I/O barrier either. An additional couple of flags to sync_file_range() would sort out the API: SYNC_FILE_RANGE_METADATA Commit the file metadata such as modification time and attributes. Think fsync() versus fdatasync(). SYNC_FILE_RANGE_IO_BARRIER Include a block device cache flush if needed, same as normal fsync() and fdatasync() are expected to. The flag gives the syscall some flexibility to not do so. For the filesystem metadata, which you noticed is needed to access the data on some filesystems, that should _always_ be committed. Not doing so is a bug in sync_file_range() to be fixed. fdatasync() must commit the metadata needed to access the file data, by the way. In case it wasn't obvious. :-) This includes the file size, if that's grown. Many OSes have an O_DSYNC which is equivalent to fdatasync() after each write, and is documented to write the inode and other metadata needed to access flushed data if the file size has increased. With sync_file_range() fixed, all the other syscalls fsync(), fdatasync() and fsync_range() could be implemented in terms of it - possibly simplifying the code. Maybe O_SYNC and O_DSYNC could use it too. 
-- Jamie ^ permalink raw reply [flat|nested] 42+ messages in thread
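(If the two flags proposed above existed, a full-integrity range sync could
collapse into a single call along these lines; both flag values are
placeholders for the proposal, not part of any kernel.)

/* Proposed flags only -- neither exists; values are placeholders that do
 * not clash with the real WAIT_BEFORE/WRITE/WAIT_AFTER bits (1, 2, 4). */
#ifndef SYNC_FILE_RANGE_METADATA
#define SYNC_FILE_RANGE_METADATA	0x8
#endif
#ifndef SYNC_FILE_RANGE_IO_BARRIER
#define SYNC_FILE_RANGE_IO_BARRIER	0x10
#endif

/* With the proposed flags, this would behave like fsync_range(). */
static int fsync_range_via_sfr(int fd, off64_t start, off64_t len)
{
	return sync_file_range(fd, start, len,
			       SYNC_FILE_RANGE_WAIT_BEFORE |
			       SYNC_FILE_RANGE_WRITE |
			       SYNC_FILE_RANGE_WAIT_AFTER |
			       SYNC_FILE_RANGE_METADATA |
			       SYNC_FILE_RANGE_IO_BARRIER);
}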
* Re: [rfc] fsync_range? 2009-01-21 3:15 ` Jamie Lokier @ 2009-01-21 3:48 ` Nick Piggin 2009-01-21 5:24 ` Jamie Lokier 2009-01-21 4:16 ` Nick Piggin 1 sibling, 1 reply; 42+ messages in thread From: Nick Piggin @ 2009-01-21 3:48 UTC (permalink / raw) To: Jamie Lokier; +Cc: linux-fsdevel On Wed, Jan 21, 2009 at 03:15:00AM +0000, Jamie Lokier wrote: > Nick Piggin wrote: > > > I like the idea. It's much easier to understand than sync_file_range, > > > whose man page doesn't really explain how to use it correctly. > > > > > > But how is fsync_range different from the sync_file_range syscall with > > > all its flags set? > > > > sync_file_range would have to wait, then write, then wait. It also > > does not call into the filesystem's ->fsync function, I don't know > > what the wider consequences of that are for all filesystems, but > > for some it means that metadata required to read back the data is > > not synced properly, and often it means that metadata sync will not > > work. > > fsync_range() must also wait, write, then wait again. > > The reason is this sequence of events: > > 1. App calls write() on a page, dirtying it. > 2. Data writeout is initiated by usual kernel task. > 3. App calls write() on the page again, dirtying it again. > 4. App calls fsync_range() on the page. > 5. ... Dum de dum, time passes ... > 6. Writeout from step 2 completes. > > 7. fsync_range() initiates another writeout, because the > in-progress writeout from step 2 might not include the changes from > step 3. > > 7. fsync_range() waits for writout from step 7. > 8. fsync_range() requests a device cache flush if needed (we hope!). > 9. Returns to app. > > Therefore fsync_range() must wait for in-progress writeout to > complete, before initiating more writeout and waiting again. That's only in rare cases where writeout is started but not completed before we last dirty it and before we call the next fsync. I'd say in most cases, we won't have to wait (it should often remain clean). > This is the reason sync_file_range() has all those flags. As I said, > the man page doesn't really explain how to use it properly. Well, one can read what the code does. Aside from that extra wait, and the problem of not syncing metadata, one thing I dislike about it is that it exposes the new concept of "writeout" to the userspace ABI. Previously all we cared about is whether something is safe on disk or not. So I think it is reasonable to augment the traditional data integrity APIs which will probably be more easily used by existing apps. > An optimisation would be to detect I/O that's been queued on an > elevator, but where the page has not actually been read (i.e. no DMA > or bounce buffer copy done yet). Most queued I/O presumably falls > into this category, and the second writeout would not be required. > > But perhaps this doesn't happen much in real life? I doubt it would be worth the complexity. It would probably be pretty fiddly and ugly change to the pagecache. > Also the kernel is in a better position to decide which order to do > everything in, and how best to batch it. Better position than what? I proposed fsync_range (or fsyncv) to be in-kernel too, of course. > Also, during the first wait (for in-progress writeout) the kernel > could skip ahead to queuing some of the other pages for writeout as > long as there is room in the request queue, and come back to the other > pages later. Sure it could. 
That adds yet more complexity and opens possibility for livelock (you go back to the page you were waiting for to find it was since redirtied and under writeout again). > > > For database writes, you typically write a bunch of stuff in various > > > regions of a big file (or multiple files), then ideally fdatasync > > > some/all of the written ranges - with writes committed to disk in the > > > best order determined by the OS and I/O scheduler. > > > > Do you know which databases do this? It will be nice to ask their > > input and see whether it helps them (I presume it is an OSS database > > because the "big" ones just use direct IO and manage their own > > buffers, right?) > > I don't know if anyone uses sync_file_range(), or if it even works > reliably, since it's not going to get much testing. The problem is that it is hard to verify. Even if it is getting lots of testing, it is not getting enough testing with the block device being shut off or throwing errors at exactly the right time. In 2.6.29 I just fixed a handful of data integrity and error reporting bugs in sync that have been there for basically all of 2.6. > I don't use it myself yet. My interest is in developing (yet > another?) high performance but reliable database engine, not an SQL > one though. That's why I keep noticing the issues with fsync, > sync_file_range, barriers etc. > > Take a look at this, though: > > http://linux.derkeiler.com/Mailing-Lists/Kernel/2007-04/msg00811.html > > "The results show fadvise + sync_file_range is on par or better than > O_DIRECT. Detailed results are attached." That's not to say fsync would be any worse. And it's just a microbenchmark anyway. > By the way, direct I/O is nice but (a) not always possible, and (b) > you don't get the integrity barriers, do you? It should. But I wasn't advocating it versus pagecache + syncing, just wondering what databases could use fsyncv so we can see if they can test. > > Today, they will have to just fsync the whole file. So they first must > > identify which parts of the file need syncing, and then gather those > > parts as a vector. > > Having to fsync the whole file is one reason that some databases use > separate journal files - so fsync only flushes the journal file, not > the big data file which can sometimes be more relaxed. > > It's also a reason some databases recommend splitting the database > into multiple files of limited size - so the hit from fsync is reduced. > > When a single file is used for journal and data (like > e.g. ext3-in-a-file), every transaction (actually coalesced set of > transactions) forces the disk head back and forth between two data > areas. If the journal can be synced by itself, the disk head doesn't > need to move back and forth as much. > > Identifying which parts to sync isn't much different than a modern > filesystem needs to do with its barriers, journals and journal-trees. > They have a lot in common. This is bread and butter stuff for > database engines. > > fsync_range would remove those reasons for using separate files, > making the database-in-a-single-file implementations more efficient. > That is administratively much nicer, imho. > > Similar for userspace filesystem-in-a-file, which is basically the same. Although I think a large part is IOPs rather than data throughput, so cost of fsync_range often might not be much better. > > > For this, taking a vector of multiple ranges would be nice. 
> > > Alternatively, issuing parallel fsync_range calls from multiple > > > threads would approximate the same thing - if (big if) they aren't > > > serialised by the kernel. > > > > I was thinking about doing something like that, but I just wanted to > > get basic fsync_range... OTOH, we could do an fsyncv syscall and gcc > > could implement fsync_range on top of that? > > Rather than fsyncv, is there some way to separate the fsync into parts? > > 1. A sequence of system calls to designate ranges. > 2. A call to say "commit and wait on all those ranges given in step 1". What's the problem with fsyncv? The problem with your proposal is that it takes multiple syscalls and that it requires the kernel to build up state over syscalls which is nasty. ^ permalink raw reply [flat|nested] 42+ messages in thread
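For comparison, here is a minimal userspace sketch of the wait/write/wait sequence discussed above, using the sync_file_range() flags that exist today; the fd, offset and length are illustrative, and, as pointed out above, this does not commit the metadata needed to retrieve the data, so it is not a full integrity guarantee on its own.

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>

/* Flush one byte range of an already-open fd with sync_file_range().
 * This is the wait-before / write / wait-after sequence; unlike
 * fsync()/fdatasync() or the proposed fsync_range(), it does not go
 * through ->fsync, so no file metadata is committed. */
static int flush_range(int fd, off64_t offset, off64_t nbytes)
{
	unsigned int flags = SYNC_FILE_RANGE_WAIT_BEFORE |
			     SYNC_FILE_RANGE_WRITE |
			     SYNC_FILE_RANGE_WAIT_AFTER;

	if (sync_file_range(fd, offset, nbytes, flags) < 0) {
		perror("sync_file_range");
		return -1;
	}
	return 0;
}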
* Re: [rfc] fsync_range? 2009-01-21 3:48 ` Nick Piggin @ 2009-01-21 5:24 ` Jamie Lokier 2009-01-21 6:16 ` Nick Piggin 0 siblings, 1 reply; 42+ messages in thread From: Jamie Lokier @ 2009-01-21 5:24 UTC (permalink / raw) To: Nick Piggin; +Cc: linux-fsdevel Nick Piggin wrote: > > > sync_file_range would have to wait, then write, then wait. It also > > > does not call into the filesystem's ->fsync function, I don't know > > > what the wider consequences of that are for all filesystems, but > > > for some it means that metadata required to read back the data is > > > not synced properly, and often it means that metadata sync will not > > > work. > > > > fsync_range() must also wait, write, then wait again. > > > > The reason is this sequence of events: > > > > 1. App calls write() on a page, dirtying it. > > 2. Data writeout is initiated by usual kernel task. > > 3. App calls write() on the page again, dirtying it again. > > 4. App calls fsync_range() on the page. > > 5. ... Dum de dum, time passes ... > > 6. Writeout from step 2 completes. > > > > 7. fsync_range() initiates another writeout, because the > > in-progress writeout from step 2 might not include the changes from > > step 3. > > > > 7. fsync_range() waits for writout from step 7. > > 8. fsync_range() requests a device cache flush if needed (we hope!). > > 9. Returns to app. > > > > Therefore fsync_range() must wait for in-progress writeout to > > complete, before initiating more writeout and waiting again. > > That's only in rare cases where writeout is started but not completed > before we last dirty it and before we call the next fsync. I'd say in > most cases, we won't have to wait (it should often remain clean). Agreed it's rare. In those cases, sync_file_range() doesn't wait twice either. Both functions are the same in this part. > > This is the reason sync_file_range() has all those flags. As I said, > > the man page doesn't really explain how to use it properly. > > Well, one can read what the code does. Aside from that extra wait, There shouldn't be an extra wait. > and the problem of not syncing metadata, A bug. > one thing I dislike about it is that it exposes the new concept of > "writeout" to the userspace ABI. Previously all we cared about was > whether something is safe on disk or not. So I think it is > reasonable to augment the traditional data integrity APIs which will > probably be more easily used by existing apps. I agree entirely. Everyone knows what fsync_range() does, just from the name. Was there some reason, perhaps for performance or flexibility, for exposing the "writeout" concept to userspace? > > Also the kernel is in a better position to decide which order to do > > everything in, and how best to batch it. > > Better position than what? I proposed fsync_range (or fsyncv) to be > in-kernel too, of course. I mean the kernel is in a better position than userspace's lame attempts to call sync_file_range() in a clever way for optimal performance :-) > > Also, during the first wait (for in-progress writeout) the kernel > > could skip ahead to queuing some of the other pages for writeout as > > long as there is room in the request queue, and come back to the other > > pages later. > > Sure it could. That adds yet more complexity and opens possibility for > livelock (you go back to the page you were waiting for to find it was > since redirtied and under writeout again). Didn't you have a patch that fixed a similar livelock against other apps in fsync()? I agree about the complexity. 
It's probably such a rare case. It must be handled correctly, though - two waits when needed, one wait usually. > > > > For database writes, you typically write a bunch of stuff in various > > > > regions of a big file (or multiple files), then ideally fdatasync > > > > some/all of the written ranges - with writes committed to disk in the > > > > best order determined by the OS and I/O scheduler. > > > > > > Do you know which databases do this? It will be nice to ask their > > > input and see whether it helps them (I presume it is an OSS database > > > because the "big" ones just use direct IO and manage their own > > > buffers, right?) > > > > I don't know if anyone uses sync_file_range(), or if it even works > > reliably, since it's not going to get much testing. > > The problem is that it is hard to verify. Even if it is getting lots > of testing, it is not getting enough testing with the block device > being shut off or throwing errors at exactly the right time. QEMU would be good for testing this sort of thing, but it doesn't sound like an easy test to write. > In 2.6.29 I just fixed a handful of data integrity and error reporting > bugs in sync that have been there for basically all of 2.6. Thank you so much! When I started work on a database engine, I cared about storage integrity a lot. I looked into fsync integrity on Linux and came out running because the smell was so bad. > > Take a look at this, though: > > > > http://linux.derkeiler.com/Mailing-Lists/Kernel/2007-04/msg00811.html > > > > "The results show fadvise + sync_file_range is on par or better than > > O_DIRECT. Detailed results are attached." > > That's not to say fsync would be any worse. And it's just a microbenchmark > anyway. In the end he was using O_DIRECT synchronously. You have to overlap O_DIRECT with AIO (the only time AIO on Linux really works) to get sensible performance. So ignore that result. > > By the way, direct I/O is nice but (a) not always possible, and (b) > > you don't get the integrity barriers, do you? > > It should. O_DIRECT can't do an I/O barrier after every write because performance would suck. Really badly. However, a database engine with any self-respect would want I/O barriers at certain points for data integrity. I suggest fdatasync() et al. should issue the barrier if there have been any writes, including O_DIRECT writes, since the last barrier. That could be a file-wide single flag "there have been writes since last barrier". > > fsync_range would remove those reasons for using separate files, > > making the database-in-a-single-file implementations more efficient. > > That is administratively much nicer, imho. > > > > Similar for userspace filesystem-in-a-file, which is basically the same. > > Although I think a large part is IOPs rather than data throughput, > so cost of fsync_range often might not be much better. IOPs are affected by head seeking. If the head is forced to seek between journal area and main data on every serial transaction, IOPs drops substantially. fsync_range() would reduce that seeking, for databases (and filesystems) which store both in the same file. > > > > For this, taking a vector of multiple ranges would be nice. > > > > Alternatively, issuing parallel fsync_range calls from multiple > > > > threads would approximate the same thing - if (big if) they aren't > > > > serialised by the kernel. > > > > > > I was thinking about doing something like that, but I just wanted to > > > get basic fsync_range... 
OTOH, we could do an fsyncv syscall and gcc > > > could implement fsync_range on top of that? > > > > Rather than fsyncv, is there some way to separate the fsync into parts? > > > > 1. A sequence of system calls to designate ranges. > > 2. A call to say "commit and wait on all those ranges given in step 1". > > What's the problem with fsyncv? The problem with your proposal is that > it takes multiple syscalls and that it requires the kernel to build up > state over syscalls which is nasty. I guess I'm coming back to sync_file_range(), which sort of does that separation :-) Also, see the other mail, about the PostgreSQL folks wanting to sync optimally multiple files at once, not serialised. I don't have a problem with fsyncv() per se. Should it take a single file descriptor and list of file-ranges, or a list of file descriptors with ranges? The latter is more general, but too vectory without justification is a good way to get syscalls NAKd by Linus. In theory, pluggable Linux-AIO would be a great multiple-request submission mechanism. There's IOCB_CMD_FDSYNC (AIO request), just add IOCB_CMD_FDSYNC_RANGE. There's room under the hood of that API for batching sensibly, and putting the waits and barriers in the best places. But Linux-AIO does not have a reputation for actually working, though the API looks good in theory. -- Jamie ^ permalink raw reply [flat|nested] 42+ messages in thread
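Since Linux-AIO comes up here, a rough sketch of what submitting IOCB_CMD_FDSYNC through libaio looks like, under the assumption that the filesystem actually implements the aio fsync hook (many do not, in which case io_submit() simply fails), and with no range variant available:

#include <libaio.h>

/* Submit an asynchronous fdsync on fd and wait for it to complete.
 * Returns 0 on success, -1 if AIO setup fails or the filesystem does
 * not support IOCB_CMD_FDSYNC (io_submit() then reports an error). */
static int aio_fdsync(int fd)
{
	io_context_t ctx = 0;
	struct iocb cb, *cbs[1] = { &cb };
	struct io_event ev;
	int ret = -1;

	if (io_setup(1, &ctx) < 0)
		return -1;

	io_prep_fdsync(&cb, fd);
	if (io_submit(ctx, 1, cbs) == 1 &&
	    io_getevents(ctx, 1, 1, &ev, NULL) == 1 && ev.res == 0)
		ret = 0;

	io_destroy(ctx);
	return ret;
}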
* Re: [rfc] fsync_range? 2009-01-21 5:24 ` Jamie Lokier @ 2009-01-21 6:16 ` Nick Piggin 2009-01-21 11:18 ` Jamie Lokier 0 siblings, 1 reply; 42+ messages in thread From: Nick Piggin @ 2009-01-21 6:16 UTC (permalink / raw) To: Jamie Lokier; +Cc: linux-fsdevel On Wed, Jan 21, 2009 at 05:24:01AM +0000, Jamie Lokier wrote: > Nick Piggin wrote: > > That's only in rare cases where writeout is started but not completed > > before we last dirty it and before we call the next fsync. I'd say in > > most cases, we won't have to wait (it should often remain clean). > > Agreed it's rare. In those cases, sync_file_range() doesn't wait > twice either. Both functions are the same in this part. > > > > This is the reason sync_file_range() has all those flags. As I said, > > > the man page doesn't really explain how to use it properly. > > > > Well, one can read what the code does. Aside from that extra wait, > > There shouldn't be an extra wait. Of course there is because it has to wait on writeout of clean pages, then writeout dirty pages, then wait on writeout of dirty pages. > > one thing I dislike about it is that it exposes the new concept of > > "writeout" to the userspace ABI. Previously all we cared about was > > whether something is safe on disk or not. So I think it is > > reasonable to augment the traditional data integrity APIs which will > > probably be more easily used by existing apps. > > I agree entirely. > > Everyone knows what fsync_range() does, just from the name. > > Was there some reason, perhaps for performance or flexibility, for > exposing the "writeout" concept to userspace? I don't think I ever saw actual numbers to justify it. The async writeout part of it I guess is one aspect, but one could just add an async flag to fsync (like msync) to get mostly the same result. > > > Also the kernel is in a better position to decide which order to do > > > everything in, and how best to batch it. > > > > Better position than what? I proposed fsync_range (or fsyncv) to be > > in-kernel too, of course. > > I mean the kernel is in a better position than userspace's lame > attempts to call sync_file_range() in a clever way for optimal > performance :-) OK, agreed. In which case, fsyncv is a winner because you'd be able to sync multiple files and multiple ranges within each file. > > > Also, during the first wait (for in-progress writeout) the kernel > > > could skip ahead to queuing some of the other pages for writeout as > > > long as there is room in the request queue, and come back to the other > > > pages later. > > > > Sure it could. That adds yet more complexity and opens possibility for > > livelock (you go back to the page you were waiting for to find it was > > since redirtied and under writeout again). > > Didn't you have a patch that fixed a similar livelock against other apps > in fsync()? Well, that was more of "really slow progress". This could actually be a real livelock because progress may never be made. > > > The problem is that it is hard to verify. Even if it is getting lots > > > of testing, it is not getting enough testing with the block device > > > being shut off or throwing errors at exactly the right time. > > > > QEMU would be good for testing this sort of thing, but it doesn't > > sound like an easy test to write. > > > > > In 2.6.29 I just fixed a handful of data integrity and error reporting > > > bugs in sync that have been there for basically all of 2.6. > > > > Thank you so much! > > > > When I started work on a database engine, I cared about storage > > integrity a lot. 
I looked into fsync integrity on Linux and came out > running because the smell was so bad. I guess that abruptly shutting down the block device queue could be used to pick up some bugs. That could be done using a real host and brd quite easily. The problem with some of those bugs I fixed is that some could take quite a rare and transient situation before the window even opens for possible data corruption. Then you have to crash the machine at that time, and hope the pattern that was written out is in fact one that will cause corruption. I tried to write some debug infrastructure; basically putting sequence counts in the struct page and going BUG if the page is found to be still dirty after the last fsync event but before the next dirty page event... that kind of handles the simple case of the pagecache, but not really the filesystem or block device parts of the equation, which seem to be more difficult. > > > Take a look at this, though: > > > > > > http://linux.derkeiler.com/Mailing-Lists/Kernel/2007-04/msg00811.html > > > > > > "The results show fadvise + sync_file_range is on par or better than > > > O_DIRECT. Detailed results are attached." > > > > That's not to say fsync would be any worse. And it's just a microbenchmark > > anyway. > > In the end he was using O_DIRECT synchronously. You have to overlap > O_DIRECT with AIO (the only time AIO on Linux really works) to get > sensible performance. So ignore that result. Ah OK. > > > By the way, direct I/O is nice but (a) not always possible, and (b) > > > you don't get the integrity barriers, do you? > > > > It should. > > O_DIRECT can't do an I/O barrier after every write because performance > would suck. Really badly. However, a database engine with any > self-respect would want I/O barriers at certain points for data integrity. Hmm, I don't follow why that should be the case. Doesn't any self-respecting storage controller tell us the data is safe when it hits its non-volatile RAM? > I suggest fdatasync() et al. should issue the barrier if there have > been any writes, including O_DIRECT writes, since the last barrier. > That could be a file-wide single flag "there have been writes since > last barrier". Well, I'd say the less that simpler applications have to care about, the better. For Oracle and DB2 etc. I think we could have a mode that turns off intermediate block device barriers and give them a syscall or ioctl to issue the barrier manually. If that helps them significantly. > > > fsync_range would remove those reasons for using separate files, > > > making the database-in-a-single-file implementations more efficient. > > > That is administratively much nicer, imho. > > > > > > Similar for userspace filesystem-in-a-file, which is basically the same. > > > > Although I think a large part is IOPs rather than data throughput, > > so cost of fsync_range often might not be much better. > > IOPs are affected by head seeking. If the head is forced to seek > between journal area and main data on every serial transaction, IOPs > drops substantially. fsync_range() would reduce that seeking, for > databases (and filesystems) which store both in the same file. OK I see your point. But that's not to say you couldn't have two files or partitions laid out next to one another. But yes, no question that fsync_range is more flexible. > > > What's the problem with fsyncv? The problem with your proposal is that > > it takes multiple syscalls and that it requires the kernel to build up > > state over syscalls which is nasty. 
> > I guess I'm coming back to sync_file_range(), which sort of does that > separation :-) > > Also, see the other mail, about the PostgreSQL folks wanting to sync > optimally multiple files at once, not serialised. > > I don't have a problem with fsyncv() per se. Should it take a single > file descriptor and list of file-ranges, or a list of file descriptors > with ranges? The latter is more general, but too vectory without > justification is a good way to get syscalls NAKd by Linus. The latter, I think. It is indeed much more useful (you could sync a hundred files and have them share a lot of the block device flushes / barriers). > In theory, pluggable Linux-AIO would be a great multiple-request > submission mechanism. There's IOCB_CMD_FDSYNC (AIO request), just add > IOCB_CMD_FDSYNC_RANGE. There's room under the hood of that API for > batching sensibly, and putting the waits and barriers in the best > places. But Linux-AIO does not have a reputation for actually > working, though the API looks good in theory. ^ permalink raw reply [flat|nested] 42+ messages in thread
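To make the O_DIRECT barrier point above concrete, a sketch of the pattern being described: direct writes for the bulk of the I/O, with an explicit fdatasync() only at commit points to get the cache flush. The path, alignment and sizes are made-up assumptions for the example.

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define BLK 4096	/* assumed alignment for O_DIRECT */

int main(void)
{
	void *buf;
	int fd = open("datafile", O_RDWR | O_DIRECT);	/* illustrative path */

	if (fd < 0 || posix_memalign(&buf, BLK, BLK) != 0)
		return 1;
	memset(buf, 0xab, BLK);

	/* Direct write: bypasses the page cache, but by itself says
	 * nothing about the drive's volatile write cache. */
	if (pwrite(fd, buf, BLK, 0) != BLK)
		return 1;

	/* Commit point: fdatasync() is what asks the block layer for
	 * the flush/barrier being discussed above. */
	if (fdatasync(fd) < 0)
		return 1;

	free(buf);
	close(fd);
	return 0;
}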
* Re: [rfc] fsync_range? 2009-01-21 6:16 ` Nick Piggin @ 2009-01-21 11:18 ` Jamie Lokier 2009-01-21 11:41 ` Nick Piggin 0 siblings, 1 reply; 42+ messages in thread From: Jamie Lokier @ 2009-01-21 11:18 UTC (permalink / raw) To: Nick Piggin; +Cc: linux-fsdevel Nick Piggin wrote: > > > That's only in rare cases where writeout is started but not completed > > > before we last dirty it and before we call the next fsync. I'd say in > > > most cases, we won't have to wait (it should often remain clean). > > > There shouldn't be an extra wait. [in sync_file_range] > > Of course there is because it has to wait on writeout of clean pages, > then writeout dirty pages, then wait on writeout of dirty pages. Eh? How is that different from the "only in rare cases where writeout is started but not completed" in your code? Oh, let me guess. sync_file_range() will wait for writeout to complete on pages where the dirty bit was cleared when they were queued for writeout and have not been dirtied since, while fsync_range() will not wait for those? I distinctly remember someone... yes, Andrew Morton, explaining why the double wait is needed for integrity. http://www.mail-archive.com/linux-kernel@vger.kernel.org/msg272270.html That's how I learned what (at least one person thinks) is the intended semantics of sync_file_range(). I'll just quote one line from Andrew's post: >> It's an interesting problem, with potentially high payback. Back to that subtlety of waiting, and integrity. If fsync_range does not wait at all on a page which is under writeout and clean (not dirtied since the writeout was queued), it will not achieve integrity. That can happen due to the following events: 1. App calls write(), dirties page. 2. Background dirty flushing starts writeout, clears dirty bit. 3. App calls fsync_range() on the page. 4. fsync_range() doesn't wait on it because it's clean. 5. Bang, app thinks the write is committed when it isn't. On the other hand, if I've misunderstood and it will wait on that page, but not twice, then I think it's the same as what sync_file_range() is _supposed_ to do. sync_file_range() is misunderstood. Possibly due to the man page, hand-waving and implementation. I don't think the flags mean "wait on all writeouts" _then_ "initiate all dirty writeouts" _then_ "wait on all writeouts". I think they mean *for each page in parallel* do that, or at least do its best with those constraints. In other words, no double-waiting or excessive serialisation. Don't get me wrong, I think fsync_range() is a much cleaner idea, and much more likely to be used. If fsync_range() is coming, it wouldn't do any harm, imho, to delete sync_file_range() completely, and replace it with a stub which calls fsync_range(). Or ENOSYS, then we'll find out if anyone used it :-) Your implementation will obviously be better, given all your kind attention to fsync integrity generally. Andrew Morton did write, though: >>The reason for providing b) only (SYNC_FILE_RANGE_WRITE) is so that >>userspace can get as much data into the queue as possible, to permit the >>kernel to optimise IO scheduling better. I wonder if there is something to that, or if it was just wishful thinking. -- Jamie 
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [rfc] fsync_range? 2009-01-21 11:18 ` Jamie Lokier @ 2009-01-21 11:41 ` Nick Piggin 2009-01-21 12:09 ` Jamie Lokier 0 siblings, 1 reply; 42+ messages in thread From: Nick Piggin @ 2009-01-21 11:41 UTC (permalink / raw) To: Jamie Lokier; +Cc: linux-fsdevel On Wed, Jan 21, 2009 at 11:18:02AM +0000, Jamie Lokier wrote: > Nick Piggin wrote: > > > > That's only in rare cases where writeout is started but not completed > > > > before we last dirty it and before we call the next fsync. I'd say in > > > > most cases, we won't have to wait (it should often remain clean). > > > > > There shouldn't be an extra wait. [in sync_file_range] > > > > Of course there is because it has to wait on writeout of clean pages, > > then writeout dirty pages, then wait on writeout of dirty pages. > > Eh? How is that different from the "only in rare cases where writeout > is started but not completed" in your code? No, in my code it is where writeout is started and the page has been redirtied. If writeout has started and the page is still clean (which should be the more common case of the two), then it doesn't have to. > Oh, let me guess. sync_file_range() will wait for writeout to > complete on pages where the dirty bit was cleared when they were > queued for writeout and have not been dirtied since, while > fsync_range() will not wait for those? > > I distinctly remember someone... yes, Andrew Morton, explaining why > the double wait is needed for integrity. > > http://www.mail-archive.com/linux-kernel@vger.kernel.org/msg272270.html > > That's how I learned what (at least one person thinks) is the > intended semantics of sync_file_range(). The double wait is needed by sync_file_range, firstly because the first flag is simply defined to wait for writeout, but also because the "start writeout" flag is defined not to start writeout on pages which are dirty but already under writeout. > I'll just quote one line from Andrew's post: > >> It's an interesting problem, with potentially high payback. > > Back to that subtlety of waiting, and integrity. > > If fsync_range does not wait at all on a page which is under writeout > and clean (not dirtied since the writeout was queued), it will not > achieve integrity. > > That can happen due to the following events: > > 1. App calls write(), dirties page. > 2. Background dirty flushing starts writeout, clears dirty bit. > 3. App calls fsync_range() on the page. > 4. fsync_range() doesn't wait on it because it's clean. > 5. Bang, app thinks the write is committed when it isn't. No, because fsync_range still has to wait for writeout pages *after* it has submitted dirty pages for writeout. This includes all pages, not just ones it has submitted just now. > I don't think the flags mean "wait on all writeouts" _then_ "initiate > all dirty writeouts" _then_ "wait on all writeouts". They do. It is explicitly stated and that is exactly how it is implemented (except "initiate writeout against all dirty pages" is "initiate writeout against all dirty pages not already under writeout"). > I think they mean *for each page in parallel* do that, or at least do > its best with those constraints. > > In other words, no double-waiting or excessive serialisation. Well, you can do it for each page in parallel, yes. This is what we discussed about starting writeout against *other* pages if we find a page under writeout that we have to wait for. And then coming back to that page to process it. This opens the whole livelock and complexity thing. 
> Don't get me wrong, I think fsync_range() is a much cleaner idea, and > much more likely to be used. > > If fsync_range() is coming, it wouldn't do any harm, imho, to delete > sync_file_range() completely, and replace it with a stub which calls > fsync_range(). Or ENOSYS, then we'll find out if anyone used it :-) > Your implementation will obviously be better, given all your kind > attention to fsync integrity generally. Well, given that postgresql posted that they need to sync multiple files, I think fsyncv is a nice way forward. It can be used to implement fsync_range too, which is slightly portable. > Andrew Morton did write, though: > >>The reason for providing b) only (SYNC_FILE_RANGE_WRITE) is so that > >>userspace can get as much data into the queue as possible, to permit the > >>kernel to optimise IO scheduling better. > > I wonder if there is something to that, or if it was just wishful > thinking. Another problem with that, precisely because it is tied up with the idea of writeout mixed with data integrity semantics, is that the kernel is *not* free to do what it thinks best. It has to start writeout on all those pages and not return until writeout is started. If the queue fills up it has to block. It cannot schedule a thread to write out asynchronously, etc. Because userspace is directing how the implementation should work rather than the high level intention. Andrew and I have a well-archived difference of opinion on this ;) I have no interest in ripping out sync_file_range. I can't say it is wrong or always going to be suboptimal. But I think it is fine to extend the more traditional fsync APIs too. ^ permalink raw reply [flat|nested] 42+ messages in thread
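For concreteness, one possible shape of the fsyncv() idea being discussed; nothing like this was actually posted or merged, so the structure, names and flags below are purely hypothetical.

#include <sys/types.h>

/* Hypothetical interface sketch only. One call syncs several byte
 * ranges, possibly across several file descriptors, so the kernel can
 * batch writeback and share device cache flushes between them. */
struct fsync_extent {
	int	fd;	/* file to sync */
	off_t	start;	/* first byte of the range */
	off_t	length;	/* 0 could mean "to end of file" */
	int	how;	/* O_SYNC- or O_DSYNC-style behaviour */
};

int fsyncv(const struct fsync_extent *vec, unsigned int count,
	   unsigned int flags);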
* Re: [rfc] fsync_range? 2009-01-21 11:41 ` Nick Piggin @ 2009-01-21 12:09 ` Jamie Lokier 0 siblings, 0 replies; 42+ messages in thread From: Jamie Lokier @ 2009-01-21 12:09 UTC (permalink / raw) To: Nick Piggin; +Cc: linux-fsdevel Nick Piggin wrote: > Well, given that postgresql posted that they need to sync multiple > files, I think fsyncv is a nice way forward. It can be used to > implement fsync_range too, which is slightly portable. Also, fsyncv on multiple files could issue just the one disk cache flush, if they're all to the same disk... [about sync_file_range] > If the queue fills up it has to block. It cannot schedule a thread > to write out asynchronously, etc. Because userspace is directing > how the implementation should work rather than the high level > intention. I agree that it's overly constraining, and pushes unnecessary tuning work into userspace. All these calls, btw, would be much more "optimisable" in the kernel if they were AIOs. Let the kernel decide things like how much to batch, how much to parallelise, and still have the hint which comes from AIO submission order (userspace threads doing synchronous I/O lose this bit). But that doesn't seem likely to happen because it's really quite hard. -- Jamie ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [rfc] fsync_range? 2009-01-21 3:15 ` Jamie Lokier 2009-01-21 3:48 ` Nick Piggin @ 2009-01-21 4:16 ` Nick Piggin 2009-01-21 4:59 ` Jamie Lokier 1 sibling, 1 reply; 42+ messages in thread From: Nick Piggin @ 2009-01-21 4:16 UTC (permalink / raw) To: Jamie Lokier; +Cc: linux-fsdevel On Wed, Jan 21, 2009 at 03:15:00AM +0000, Jamie Lokier wrote: > Nick Piggin wrote: > An additional couple of flags to sync_file_range() would sort out the > API: > > SYNC_FILE_RANGE_METADATA > > Commit the file metadata such as modification time and > attributes. Think fsync() versus fdatasync(). Note that the problem with sync_file_range is not that it lacks a metadata flag like fsync vs fdatasync. It is that it does not even sync the metadata required to retrieve the data (which of course fdatasync must do, otherwise it would be useless). This is just another reason why I prefer to just try to evolve the traditional fsync interface slowly. ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [rfc] fsync_range? 2009-01-21 4:16 ` Nick Piggin @ 2009-01-21 4:59 ` Jamie Lokier 2009-01-21 6:23 ` Nick Piggin 0 siblings, 1 reply; 42+ messages in thread From: Jamie Lokier @ 2009-01-21 4:59 UTC (permalink / raw) To: Nick Piggin; +Cc: linux-fsdevel Nick Piggin wrote: > On Wed, Jan 21, 2009 at 03:15:00AM +0000, Jamie Lokier wrote: > > Nick Piggin wrote: > > An additional couple of flags to sync_file_range() would sort out the > > API: > > > > SYNC_FILE_RANGE_METADATA > > > > Commit the file metadata such as modification time and > > attributes. Think fsync() versus fdatasync(). > > Note that the problem with sync_file_range is not that it lacks a > metadata flag like fsync vs fdatasync. It is that it does not even > sync the metadata required to retrieve the data (which of course > fdatasync must do, otherwise it would be useless). Oh, I agree about that. (Different meaning of metadata, btw. That's the term used in O_SYNC vs. O_DSYNC documentation for other unixes that I've read; that's why I used it in that flag, for consistency with other unixes.) > This is just another reason why I prefer to just try to evolve the > traditional fsync interface slowly. But sync_file_range() has a bug, which you've pointed out - the missing _data-retrieval_ metadata isn't synced. In other words, it's completely useless. If that bug isn't going to be fixed, delete sync_file_range() altogether. There's no point keeping it if it's broken. And if it's fixed, it'll do what your fsync_range() does, so why have both? -- Jamie ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [rfc] fsync_range? 2009-01-21 4:59 ` Jamie Lokier @ 2009-01-21 6:23 ` Nick Piggin 2009-01-21 12:02 ` Jamie Lokier 2009-01-21 12:13 ` Theodore Tso 0 siblings, 2 replies; 42+ messages in thread From: Nick Piggin @ 2009-01-21 6:23 UTC (permalink / raw) To: Jamie Lokier; +Cc: linux-fsdevel On Wed, Jan 21, 2009 at 04:59:21AM +0000, Jamie Lokier wrote: > Nick Piggin wrote: > > This is just another reason why I prefer to just try to evolve the > > traditional fsync interface slowly. > > But sync_file_range() has a bug, which you've pointed out - the > missing _data-retrieval_ metadata isn't synced. In other words, it's > completely useless. I don't know. I don't think this is a newly discovered problem. I think it's been known for a while, so I don't know what's going on. > If that bug isn't going to be fixed, delete sync_file_range() > altogether. There's no point keeping it if it's broken. And if it's > fixed, it'll do what your fsync_range() does, so why have both? Well the thing is it doesn't. Again it comes back to the whole writeout thing, which makes it more constraining on the kernel to optimise. For example, my fsync "livelock" avoidance patches did the following: 1. find all pages which are dirty or under writeout first. 2. write out the dirty pages. 3. wait for our set of pages. Simple, obvious, and the kernel can optimise this well because the userspace has asked for a high level request "make this data safe" rather than low level directives. We can't do this same nice simple sequence with sync_file_range because SYNC_FILE_RANGE_WAIT_AFTER means we have to wait for all writeout pages in the range, including unrelated ones, after the dirty writeout. SYNC_FILE_RANGE_WAIT_BEFORE means we have to wait for clean writeout pages before we even start doing real work. ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [rfc] fsync_range? 2009-01-21 6:23 ` Nick Piggin @ 2009-01-21 12:02 ` Jamie Lokier 0 siblings, 0 replies; 42+ messages in thread From: Jamie Lokier @ 2009-01-21 12:02 UTC (permalink / raw) To: Nick Piggin; +Cc: linux-fsdevel Nick Piggin wrote: > Again it comes back to the whole writeout thing, which makes it more > constraining on the kernel to optimise. Cute :-) It was intended to make it easier to optimise, but maybe it failed. > For example, my fsync "livelock" avoidance patches did the following: > > 1. find all pages which are dirty or under writeout first. > 2. write out the dirty pages. > 3. wait for our set of pages. > > Simple, obvious, and the kernel can optimise this well because the > userspace has asked for a high level request "make this data safe" > rather than low level directives. We can't do this same nice simple > sequence with sync_file_range because SYNC_FILE_RANGE_WAIT_AFTER > means we have to wait for all writeout pages in the range, including > unrelated ones, after the dirty writeout. SYNC_FILE_RANGE_WAIT_BEFORE > means we have to wait for clean writeout pages before we even start > doing real work. As noted in my other mail just now, although sync_file_range() is described as though it does the three bulk operations consecutively, I think it wouldn't be too shocking to think the intended semantics _could_ be: "wait and initiate writeouts _as if_ we did, for each page _in parallel_ { if (SYNC_FILE_RANGE_WAIT_BEFORE && page->writeout) wait(page) if (SYNC_FILE_RANGE_WRITE) start_writeout(page) if (SYNC_FILE_RANGE_WAIT_AFTER && page->writeout) wait(page) }" That permits many strategies, and I think one of them is the nice livelock-avoiding fsync you describe up above. You might be able to squeeze the sync_file_range() flags into that by chopping it up like this. Btw, you omitted step 1.5 "wait for dirty pages which are already under writeout", but it's made explicit here: 1. find all pages which are dirty or under writeout first, and remember which of them are dirty _and_ under writeout (DW). 2. if (SYNC_FILE_RANGE_WRITE) write out the dirty pages not in DW. 3. if (SYNC_FILE_RANGE_WAIT_BEFORE) { wait for the set of pages in DW. write out the pages in DW. } 4. if (SYNC_FILE_RANGE_WAIT_BEFORE || SYNC_FILE_RANGE_WAIT_AFTER) wait for our set of pages. However, maybe the flags aren't all that useful really, and maybe sync_file_range() could be replaced by a stub which ignores the flags and calls fsync_range(). -- Jamie ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [rfc] fsync_range? 2009-01-21 6:23 ` Nick Piggin 2009-01-21 12:02 ` Jamie Lokier @ 2009-01-21 12:13 ` Theodore Tso 2009-01-21 12:37 ` Jamie Lokier 1 sibling, 1 reply; 42+ messages in thread From: Theodore Tso @ 2009-01-21 12:13 UTC (permalink / raw) To: Nick Piggin; +Cc: Jamie Lokier, linux-fsdevel On Wed, Jan 21, 2009 at 07:23:06AM +0100, Nick Piggin wrote: > > > > But sync_file_range() has a bug, which you've pointed out - the > > missing _data-retrieval_ metadata isn't synced. In other words, it's > > completely useless. > > I don't know. I don't think this is a newly discovered problem. > I think it's been known for a while, so I don't know what's > going on. We should ask if anyone is actually using sync_file_range (cough, <Oracle>, cough, cough). But if I had to guess, for those people who are using it, they don't much care, because 99% of the time they are overwriting data blocks within a file which isn't changing in size, so there is no data-retrieval metadata to sync. That is, the database file is only rarely grown in size, and when they do that, they can either preallocate via pre-filling or via posix_fallocate(), and then follow it up with a normal fsync(); but most of the time, they aren't mucking with the data-retrieval metadata, so it simply isn't an issue for them.... It's not general purpose, but the question is whether or not any of the primary users of this interface require the more general-purpose functionality. - Ted ^ permalink raw reply [flat|nested] 42+ messages in thread
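As a concrete illustration of the pattern Ted describes - grow and preallocate once, fsync() once, then overwrite blocks in place so that later syncs touch no data-retrieval metadata - here is a small sketch; the file name and sizes are invented for the example.

#define _GNU_SOURCE
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
	char block[4096];
	int fd = open("table.db", O_RDWR | O_CREAT, 0600);	/* illustrative */

	if (fd < 0)
		return 1;

	/* Grow the file once, up front, and commit the new size and
	 * block mappings with a full fsync(). */
	if (posix_fallocate(fd, 0, 1 << 20) != 0 || fsync(fd) < 0)
		return 1;

	/* Steady state: overwrite existing blocks in place.  No
	 * data-retrieval metadata changes, so fdatasync() (or a range
	 * sync) only has to push the data blocks themselves. */
	memset(block, 0x5a, sizeof(block));
	if (pwrite(fd, block, sizeof(block), 64 * 4096) != (ssize_t)sizeof(block))
		return 1;
	if (fdatasync(fd) < 0)
		return 1;

	close(fd);
	return 0;
}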
* Re: [rfc] fsync_range? 2009-01-21 12:13 ` Theodore Tso @ 2009-01-21 12:37 ` Jamie Lokier 2009-01-21 14:12 ` Theodore Tso 0 siblings, 1 reply; 42+ messages in thread From: Jamie Lokier @ 2009-01-21 12:37 UTC (permalink / raw) To: Theodore Tso; +Cc: Nick Piggin, linux-fsdevel Theodore Tso wrote: > On Wed, Jan 21, 2009 at 07:23:06AM +0100, Nick Piggin wrote: > > > > > > But sync_file_range() has a bug, which you've pointed out - the > > > missing _data-retrieval_ metadata isn't synced. In other words, it's > > > completely useless. > > > > I don't know. I don't think this is a newly discovered problem. > > I think it's been known for a while, so I don't know what's > > going on. > > We should ask if anyone is actually using sync_file_range (cough, > <Oracle>, cough, cough). But if I had to guess, for those people who > are using it, they don't much care, because 99% of the time they are > overwriting data blocks within a file which isn't changing in size, so > there is no data-retrieval metadata to sync. What about btrfs with data checksums? Doesn't that count among data-retrieval metadata? What about nilfs, which always writes data to a new place? Etc. I'm wondering what exactly sync_file_range() definitely writes, and what it doesn't write. If it's just in use by Oracle, and nobody's sure what it does, that smacks of those secret APIs in Windows that made Word run a bit faster than everyone else's word processor... sort of. :-) -- Jamie ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [rfc] fsync_range? 2009-01-21 12:37 ` Jamie Lokier @ 2009-01-21 14:12 ` Theodore Tso 2009-01-21 14:35 ` Chris Mason 2009-01-22 21:18 ` Florian Weimer 0 siblings, 2 replies; 42+ messages in thread From: Theodore Tso @ 2009-01-21 14:12 UTC (permalink / raw) To: Jamie Lokier; +Cc: Nick Piggin, linux-fsdevel On Wed, Jan 21, 2009 at 12:37:11PM +0000, Jamie Lokier wrote: > > What about btrfs with data checksums? Doesn't that count among > data-retrieval metadata? What about nilfs, which always writes data > to a new place? Etc. > > I'm wondering what exactly sync_file_range() definitely writes, and > what it doesn't write. > > If it's just in use by Oracle, and nobody's sure what it does, that > smacks of those secret APIs in Windows that made Word run a bit faster > than everyone else's word processor... sort of. :-) Actually, I take that back; Oracle (and most other enterprise databases; the world is not just Oracle --- there's also DB2, for example) generally uses Direct I/O, so I wonder if they are using sync_file_range() at all. I do wonder though how well or poorly Oracle will work on btrfs, or indeed any filesystem that uses WAFL-like or log-structured filesystem-like algorithms. Most of the enterprise databases have been optimized for use on block devices and filesystems where you do write-in-place accesses; and some enterprise databases do their own data checksumming. So if I had to guess, I suspect the answer to the question I posed is "disastrously". :-) After all, such db's generally are happiest when the OS acts as a program loader that then gets the heck out of the way of the filesystem, hence their use of DIO. Which again brings me back to the question --- I wonder who is actually using sync_file_range, and what for? I would assume it is some database, most likely; so maybe we should check with MySQL or Postgres? - Ted ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [rfc] fsync_range? 2009-01-21 14:12 ` Theodore Tso @ 2009-01-21 14:35 ` Chris Mason 2009-01-21 15:58 ` Eric Sandeen 2009-01-21 20:41 ` Jamie Lokier 1 sibling, 2 replies; 42+ messages in thread From: Chris Mason @ 2009-01-21 14:35 UTC (permalink / raw) To: Theodore Tso; +Cc: Jamie Lokier, Nick Piggin, linux-fsdevel, Eric Sandeen On Wed, 2009-01-21 at 09:12 -0500, Theodore Tso wrote: > On Wed, Jan 21, 2009 at 12:37:11PM +0000, Jamie Lokier wrote: > > > > What about btrfs with data checksums? Doesn't that count among > > data-retrieval metadata? What about nilfs, which always writes data > > to a new place? Etc. > > > > I'm wondering what exactly sync_file_range() definitely writes, and > > what it doesn't write. > > > > If it's just in use by Oracle, and nobody's sure what it does, that > > smacks of those secret APIs in Windows that made Word run a bit faster > > than everyone else's word processor... sort of. :-) > > Actually, I take that back; Oracle (and most other enterprise > databases; the world is not just Oracle --- there's also DB2, for > example) generally uses Direct I/O, so I wonder if they are using > sync_file_range() at all. Usually if they don't use O_DIRECT, they use O_SYNC. > > I do wonder though how well or poorly Oracle will work on btrfs, or > indeed any filesystem that uses WAFL-like or log-structured > filesystem-like algorithms. Most of the enterprise databases have > been optimized for use on block devices and filesystems where you do > write-in-place accesses; and some enterprise databases do their own > data checksumming. So if I had to guess, I suspect the answer to the > question I posed is "disastrously". :-) Yes, I think btrfs' nodatacow option is pretty important for database use. > After all, such db's > generally are happiest when the OS acts as a program loader that then > gets the heck out of the way of the filesystem, hence their use of > DIO. > > Which again brings me back to the question --- I wonder who is > actually using sync_file_range, and what for? I would assume it is > some database, most likely; so maybe we should check with MySQL or > Postgres? Eric, didn't you have a magic script for grepping the sources/binaries in fedora for syscalls? -chris ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [rfc] fsync_range? 2009-01-21 14:35 ` Chris Mason @ 2009-01-21 15:58 ` Eric Sandeen 2009-01-21 20:41 ` Jamie Lokier 1 sibling, 0 replies; 42+ messages in thread From: Eric Sandeen @ 2009-01-21 15:58 UTC (permalink / raw) To: Chris Mason; +Cc: Theodore Tso, Jamie Lokier, Nick Piggin, linux-fsdevel Chris Mason wrote: ... >> Which again brings me back to the question --- I wonder who is >> actually using sync_file_range, and what for? I would assume it is >> some database, most likely; so maybe we should check with MySQL or >> Postgres? > > Eric, didn't you have a magic script for grepping the sources/binaries > in fedora for syscalls? > > -chris Yep (binaries) - http://sandeen.fedorapeople.org/utilities/summarise-stat64.pl Thanks to Greg Banks! I don't currently have an exploded fedora tree to run it over, but could do so after I un-busy myself again... -Eric ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [rfc] fsync_range? 2009-01-21 14:35 ` Chris Mason 2009-01-21 15:58 ` Eric Sandeen @ 2009-01-21 20:41 ` Jamie Lokier 2009-01-21 21:23 ` jim owens 1 sibling, 1 reply; 42+ messages in thread From: Jamie Lokier @ 2009-01-21 20:41 UTC (permalink / raw) To: Chris Mason; +Cc: Theodore Tso, Nick Piggin, linux-fsdevel, Eric Sandeen Chris Mason wrote: > On Wed, 2009-01-21 at 09:12 -0500, Theodore Tso wrote: > > On Wed, Jan 21, 2009 at 12:37:11PM +0000, Jamie Lokier wrote: > > > > > > What about btrfs with data checksums? Doesn't that count among > > > data-retrieval metadata? What about nilfs, which always writes data > > > to a new place? Etc. > > > > > > I'm wondering what exactly sync_file_range() definitely writes, and > > > what it doesn't write. > > > > > > If it's just in use by Oracle, and nobody's sure what it does, that > > > smacks of those secret APIs in Windows that made Word run a bit faster > > > than everyone else's word processor... sort of. :-) > > > > Actually, I take that back; Oracle (and most other enterprise > > databases; the world is not just Oracle --- there's also DB2, for > > example) generally uses Direct I/O, so I wonder if they are using > > sync_file_range() at all. > > Usually if they don't use O_DIRECT, they use O_SYNC. There's a case for using both together. An O_DIRECT write can convert to non-direct under some conditions. When that happens, you want the properties of O_SYNC. It is documented to happen on some other OSes - and maybe for VxFS on Linux. Linux is nicer than some other platforms in usually returning EINVAL for O_DIRECT I/O whose alignment isn't satisfactory, but it can still fall back to buffered I/O in some circumstances. I think current kernels do a sync in that case, but some earlier 2.6 kernels failed to. Oh, you'd use O_DSYNC instead of course... No point committing inode updates all the time, only size increases, and most OSes document that O_DSYNC does commit size increases. By the way, emulators/VMs like QEMU and KVM use much the same methods to access virtual disk images as databases do, for the same reasons. > > I do wonder though how well or poorly Oracle will work on btrfs, or > > indeed any filesystem that uses WAFL-like or log-structured > > filesystem-like algorithms. Most of the enterprise databases have > > been optimized for use on block devices and filesystems where you do > > write-in-place accesses; and some enterprise databases do their own > > data checksumming. So if I had to guess, I suspect the answer to the > > question I posed is "disastrously". :-) > > Yes, I think btrfs' nodatacow option is pretty important for database > use. Does O_DIRECT on btrfs still allocate new data blocks? That's not very direct :-) I'm thinking if O_DIRECT is set, considering what's likely to request it, it may be reasonable for it to mean "overwrite in place" too (except for files which are actually COW-shared with others of course). > > After all, such db's > > generally are happiest when the OS acts as a program loader that then > > gets the heck out of the way of the filesystem, hence their use of > > DIO. > > > > Which again brings me back to the question --- I wonder who is > > actually using sync_file_range, and what for? I would assume it is > > some database, most likely; so maybe we should check with MySQL or > > Postgres? > > Eric, didn't you have a magic script for grepping the sources/binaries > in fedora for syscalls? 
sync_file_range does not appear anywhere in db-4.7.25, mysql-dfsg-5.0.67, postgresql-8.3.5, or sqlite3-3.5.9 (on Ubuntu; presumably the same in other distros). -- Jamie ^ permalink raw reply [flat|nested] 42+ messages in thread
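A one-line illustration of the "use both together" point above: open with O_DIRECT plus O_DSYNC so that any write the kernel quietly turns into buffered I/O (or that extends the file) still gets committed. This is a sketch of intent only; glibc of that era defined O_DSYNC identically to O_SYNC, so the distinction only became real on later kernels.

#define _GNU_SOURCE
#include <fcntl.h>

/* Direct I/O for the common path, O_DSYNC as the safety net for any
 * write that falls back to buffered I/O. */
int open_db_file(const char *path)
{
	return open(path, O_RDWR | O_DIRECT | O_DSYNC);
}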
* Re: [rfc] fsync_range? 2009-01-21 20:41 ` Jamie Lokier @ 2009-01-21 21:23 ` jim owens 2009-01-21 21:59 ` Jamie Lokier 0 siblings, 1 reply; 42+ messages in thread From: jim owens @ 2009-01-21 21:23 UTC (permalink / raw) To: Jamie Lokier Cc: Chris Mason, Theodore Tso, Nick Piggin, linux-fsdevel, Eric Sandeen Jamie Lokier wrote: > > Does O_DIRECT on btrfs still allocate new data blocks? > That's not very direct :-) > > I'm thinking if O_DIRECT is set, considering what's likely to request > it, it may be reasonable for it to mean "overwrite in place" too > (except for files which are actually COW-shared with others of course). O_DIRECT for databases is to bypass the OS file data cache. Those (oracle) who have long experience with it on unix know that the physical storage location can change on a filesystem. I do not think we want to make a special case, it should be up to the db admin to choose cow/nocow because if they want SNAPSHOTS they need cow. jim ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [rfc] fsync_range? 2009-01-21 21:23 ` jim owens @ 2009-01-21 21:59 ` Jamie Lokier 2009-01-21 23:08 ` btrfs O_DIRECT was " jim owens 0 siblings, 1 reply; 42+ messages in thread From: Jamie Lokier @ 2009-01-21 21:59 UTC (permalink / raw) To: jim owens Cc: Chris Mason, Theodore Tso, Nick Piggin, linux-fsdevel, Eric Sandeen jim owens wrote: > Jamie Lokier wrote: > > > >Does O_DIRECT on btrfs still allocate new data blocks? > >That's not very direct :-) > > > >I'm thinking if O_DIRECT is set, considering what's likely to request > >it, it may be reasonable for it to mean "overwrite in place" too > >(except for files which are actually COW-shared with others of course). > > O_DIRECT for databases is to bypass the OS file data cache. > > Those (oracle) who have long experience with it on unix > know that the physical storage location can change on > a filesystem. > > I do not think we want to make a special case, > it should be up to the db admin to choose cow/nocow > because if they want SNAPSHOTS they need cow. SNAPSHOTS is what "except for files which are actually COW-shared with others of course" refers to. An option to "choose" to corrupt snapshots would be very silly. Writing in place or new-place on a *non-shared* (i.e. non-snapshotted) file is the choice which is useful. It's a filesystem implementation detail, not a semantic difference. I'm suggesting writing in place may do no harm and be more like the expected behaviour with programs that use O_DIRECT, which are usually databases. How about a btrfs mount option? in_place_write=never/always/direct_only. (Default direct_only). -- Jamie ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: btrfs O_DIRECT was [rfc] fsync_range? 2009-01-21 21:59 ` Jamie Lokier @ 2009-01-21 23:08 ` jim owens 2009-01-22 0:06 ` Jamie Lokier 0 siblings, 1 reply; 42+ messages in thread From: jim owens @ 2009-01-21 23:08 UTC (permalink / raw) To: Jamie Lokier; +Cc: Chris Mason, linux-fsdevel Jamie Lokier wrote: > > Writing in place or new-place on a *non-shared* (i.e. non-snapshotted) > file is the choice which is useful. It's a filesystem implementation > detail, not a semantic difference. I'm suggesting writing in place > may do no harm and be more like the expected behaviour with programs > that use O_DIRECT, which are usually databases. > > How about a btrfs mount option? > in_place_write=never/always/direct_only. (Default direct_only). The harm is creating a special guarantee for just one case of "don't move my data" based on a transient file open mode. What about defragmenting or moving the extent to another device for performance or for (failing) device removal? We are on a slippery slope for presumed expectations. jim ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: btrfs O_DIRECT was [rfc] fsync_range?
2009-01-21 23:08 ` btrfs O_DIRECT was " jim owens
@ 2009-01-22 0:06 ` Jamie Lokier
2009-01-22 13:50 ` jim owens
0 siblings, 1 reply; 42+ messages in thread
From: Jamie Lokier @ 2009-01-22 0:06 UTC (permalink / raw)
To: jim owens; +Cc: Chris Mason, linux-fsdevel

jim owens wrote:
> Jamie Lokier wrote:
> >
> >Writing in place or new-place on a *non-shared* (i.e. non-snapshotted)
> >file is the choice which is useful.  It's a filesystem implementation
> >detail, not a semantic difference.  I'm suggesting writing in place
> >may do no harm and be more like the expected behaviour with programs
> >that use O_DIRECT, which are usually databases.
> >
> >How about a btrfs mount option?
> >in_place_write=never/always/direct_only.  (Default direct_only).
>
> The harm is creating a special guarantee for just one case
> of "don't move my data" based on a transient file open mode.
>
> What about defragmenting or moving the extent to another
> device for performance or for (failing) device removal?
>
> We are on a slippery slope for presumed expectations.

Don't make it a guarantee, just a hint to filesystem write strategy.

It's ok to move data around when useful; we're not talking about a
hard requirement, but a performance knob.

The question is just what performance and fragmentation
characteristics do programs that use O_DIRECT have?

They are nearly all databases, filesystems-in-a-file, or virtual
machine disks.  I'm guessing virtually all of those _particular_
application programs would perform significantly differently with a
write-in-place strategy for most writes, although you'd still want
access to the bells and whistles of snapshots and COW and so on when
requested.

Note I said differently :-)  I'm not sure write-in-place performs
better for those sorts of applications.  It's just a guess.

Oracle probably has a really good idea how it performs on ZFS compared
with a block device (which is always in place) - and knows whether ZFS
does in-place writes with O_DIRECT or not.  Chris?

--
Jamie
* Re: btrfs O_DIRECT was [rfc] fsync_range?
2009-01-22 0:06 ` Jamie Lokier
@ 2009-01-22 13:50 ` jim owens
0 siblings, 0 replies; 42+ messages in thread
From: jim owens @ 2009-01-22 13:50 UTC (permalink / raw)
To: Jamie Lokier; +Cc: Chris Mason, linux-fsdevel

Jamie Lokier wrote:
> jim owens wrote:
>> Jamie Lokier wrote:
>>> Writing in place or new-place on a *non-shared* (i.e. non-snapshotted)
>>> file is the choice which is useful.  It's a filesystem implementation
>>> detail, not a semantic difference.  I'm suggesting writing in place
>>> may do no harm and be more like the expected behaviour with programs
>>> that use O_DIRECT, which are usually databases.
>>>
>>> How about a btrfs mount option?
>>> in_place_write=never/always/direct_only.  (Default direct_only).
>> The harm is creating a special guarantee for just one case
>> of "don't move my data" based on a transient file open mode.
>>
>> What about defragmenting or moving the extent to another
>> device for performance or for (failing) device removal?
>>
>> We are on a slippery slope for presumed expectations.
>
> Don't make it a guarantee, just a hint to filesystem write strategy.
>
> It's ok to move data around when useful; we're not talking about a
> hard requirement, but a performance knob.
>
> The question is just what performance and fragmentation
> characteristics do programs that use O_DIRECT have?
>
> They are nearly all databases, filesystems-in-a-file, or virtual
> machine disks.  I'm guessing virtually all of those _particular_
> application programs would perform significantly differently with a
> write-in-place strategy for most writes, although you'd still want
> access to the bells and whistles of snapshots and COW and so on when
> requested.
>
> Note I said differently :-)  I'm not sure write-in-place performs
> better for those sorts of applications.  It's just a guess.

I'm very certain that write-in-place performs much better than cow
because, as we all know, doing storage allocation is expensive.
So many databases preallocate their files.

> Oracle probably has a really good idea how it performs on ZFS compared
> with a block device (which is always in place) - and knows whether ZFS
> does in-place writes with O_DIRECT or not.  Chris?

We only disagree on how the rule for write-in-place is defined and,
more importantly, documented so it is easy to understand.

Btrfs allows each individual file to have "nodatacow" set as an
attribute.  That is an easy rule to document for the db admin.

Much easier than "if nothing else takes precedence to make it cow,
O_DIRECT will write-in-place".

jim
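A side note on the per-file attribute jim refers to: on kernels newer than the ones in this thread, the usual way for a program to mark a file nodatacow is the generic inode-flags ioctl, the same bit that chattr +C toggles. The sketch below assumes such a kernel and a btrfs filesystem; the path is made up, and the flag has to be set while the file is still empty (or on a directory, so new files inherit it).

/* Sketch (assumes a kernel newer than this 2009 thread): set the
 * NOCOW attribute on an empty file via FS_IOC_SETFLAGS. */
#include <fcntl.h>
#include <linux/fs.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>

int main(void)
{
	int fd = open("dbfile", O_RDONLY);
	int flags;

	if (fd < 0) {
		perror("open");
		return 1;
	}
	if (ioctl(fd, FS_IOC_GETFLAGS, &flags) == 0) {
		flags |= FS_NOCOW_FL;	/* the bit behind "nodatacow" */
		if (ioctl(fd, FS_IOC_SETFLAGS, &flags))
			perror("FS_IOC_SETFLAGS");
	} else {
		perror("FS_IOC_GETFLAGS");
	}
	close(fd);
	return 0;
}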
* Re: [rfc] fsync_range?
2009-01-21 14:12 ` Theodore Tso
2009-01-21 14:35 ` Chris Mason
@ 2009-01-22 21:18 ` Florian Weimer
2009-01-22 21:23 ` Florian Weimer
1 sibling, 1 reply; 42+ messages in thread
From: Florian Weimer @ 2009-01-22 21:18 UTC (permalink / raw)
To: Theodore Tso; +Cc: Jamie Lokier, Nick Piggin, linux-fsdevel

* Theodore Tso:

> Actually, I take that back; Oracle (and most other enterprise
> databases; the world is not just Oracle --- there's also DB2, for
> example) generally uses Direct I/O, so I wonder if they are using
> sync_file_range() at all.

Recent PostgreSQL might use it because it has got a single-threaded
background writer which benefits from non-blocking fsync().  I'll have
to check to be sure, though.
* Re: [rfc] fsync_range?
2009-01-22 21:18 ` Florian Weimer
@ 2009-01-22 21:23 ` Florian Weimer
0 siblings, 0 replies; 42+ messages in thread
From: Florian Weimer @ 2009-01-22 21:23 UTC (permalink / raw)
To: Theodore Tso; +Cc: Jamie Lokier, Nick Piggin, linux-fsdevel

* Florian Weimer:

> * Theodore Tso:
>
>> Actually, I take that back; Oracle (and most other enterprise
>> databases; the world is not just Oracle --- there's also DB2, for
>> example) generally uses Direct I/O, so I wonder if they are using
>> sync_file_range() at all.
>
> Recent PostgreSQL might use it because it has got a single-threaded
> background writer which benefits from non-blocking fsync().  I'll have
> to check to be sure, though.

Uhm, it doesn't.
* Re: [rfc] fsync_range?
2009-01-21 1:29 ` Nick Piggin
2009-01-21 3:15 ` Jamie Lokier
@ 2009-01-21 3:25 ` Jamie Lokier
2009-01-21 3:52 ` Nick Piggin
1 sibling, 1 reply; 42+ messages in thread
From: Jamie Lokier @ 2009-01-21 3:25 UTC (permalink / raw)
To: Nick Piggin; +Cc: linux-fsdevel

Nick Piggin wrote:
> > For database writes, you typically write a bunch of stuff in various
> > regions of a big file (or multiple files), then ideally fdatasync
> > some/all of the written ranges - with writes committed to disk in the
> > best order determined by the OS and I/O scheduler.
>
> Do you know which databases do this? It will be nice to ask their
> input and see whether it helps them (I presume it is an OSS database
> because the "big" ones just use direct IO and manage their own
> buffers, right?)

I just found this:

http://markmail.org/message/injyo7coein7o3xz
(Postgresql)

Tom Lane writes (on org.postgresql.pgsql-hackers):
>Greg Stark <gsst...@mit.edu> writes:
>> Come to think of it I wonder whether there's anything to be gained by
>> using smaller files for tables.  Instead of 1G files maybe 256M files
>> or something like that to reduce the hit of fsyncing a file.
>>
>> Actually probably not.  The weak part of our current approach is that
>> we tell the kernel "sync this file", then "sync that file", etc, in a
>> more or less random order.  This leads to a probably non-optimal
>> sequence of disk accesses to complete a checkpoint.  What we would
>> really like is a way to tell the kernel "sync all these files, and let
>> me know when you're done" --- then the kernel and hardware have some
>> shot at scheduling all the writes in an intelligent fashion.
>>
>> sync_file_range() is not that exactly, but since it lets you request
>> syncing and then go back and wait for the syncs later, we could get
>> the desired effect with two passes over the file list.  (If the file
>> list is longer than our allowed number of open files, though, the
>> extra opens/closes could hurt.)
>>
>> Smaller files would make the I/O scheduling problem worse not better.

So if you can make
commit-to-multiple-files-in-optimal-I/O-scheduling-order work, that
would be even better ;-)

Seems to me the Postgresql thing could be improved by issuing parallel
fdatasync() calls, each in their own thread.  Not optimal, exactly, but
more parallelism to schedule around.  (But limited by the I/O request
queue being full with big flushes, so potentially one fdatasync()
starving the others.)

--
Jamie
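For concreteness, the two-pass checkpoint flush Tom Lane describes maps onto sync_file_range() roughly as below. This is only a sketch: the fds[] array is assumed to be supplied by the caller, a zero nbytes means "from offset to end of file", and sync_file_range() makes no promises about metadata or the drive's write cache.

#define _GNU_SOURCE
#include <fcntl.h>

/* Pass 1 starts writeback on every file; pass 2 comes back and waits,
 * so the block layer gets to schedule all the writes together. */
int checkpoint_sync(int *fds, int nfds)
{
	int i, err = 0;

	for (i = 0; i < nfds; i++)
		if (sync_file_range(fds[i], 0, 0, SYNC_FILE_RANGE_WRITE))
			err = -1;

	for (i = 0; i < nfds; i++)
		if (sync_file_range(fds[i], 0, 0,
				    SYNC_FILE_RANGE_WAIT_BEFORE |
				    SYNC_FILE_RANGE_WRITE |
				    SYNC_FILE_RANGE_WAIT_AFTER))
			err = -1;

	return err;
}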
* Re: [rfc] fsync_range?
2009-01-21 3:25 ` Jamie Lokier
@ 2009-01-21 3:52 ` Nick Piggin
0 siblings, 0 replies; 42+ messages in thread
From: Nick Piggin @ 2009-01-21 3:52 UTC (permalink / raw)
To: Jamie Lokier; +Cc: linux-fsdevel

On Wed, Jan 21, 2009 at 03:25:20AM +0000, Jamie Lokier wrote:
> Nick Piggin wrote:
> > > For database writes, you typically write a bunch of stuff in various
> > > regions of a big file (or multiple files), then ideally fdatasync
> > > some/all of the written ranges - with writes committed to disk in the
> > > best order determined by the OS and I/O scheduler.
> >
> > Do you know which databases do this? It will be nice to ask their
> > input and see whether it helps them (I presume it is an OSS database
> > because the "big" ones just use direct IO and manage their own
> > buffers, right?)
>
> I just found this:
>
> http://markmail.org/message/injyo7coein7o3xz
> (Postgresql)
>
> Tom Lane writes (on org.postgresql.pgsql-hackers):
> >Greg Stark <gsst...@mit.edu> writes:
> >> Come to think of it I wonder whether there's anything to be gained by
> >> using smaller files for tables.  Instead of 1G files maybe 256M files
> >> or something like that to reduce the hit of fsyncing a file.
> >>
> >> Actually probably not.  The weak part of our current approach is that
> >> we tell the kernel "sync this file", then "sync that file", etc, in a
> >> more or less random order.  This leads to a probably non-optimal
> >> sequence of disk accesses to complete a checkpoint.  What we would
> >> really like is a way to tell the kernel "sync all these files, and let
> >> me know when you're done" --- then the kernel and hardware have some
> >> shot at scheduling all the writes in an intelligent fashion.
> >>
> >> sync_file_range() is not that exactly, but since it lets you request
> >> syncing and then go back and wait for the syncs later, we could get
> >> the desired effect with two passes over the file list.  (If the file
> >> list is longer than our allowed number of open files, though, the
> >> extra opens/closes could hurt.)
> >>
> >> Smaller files would make the I/O scheduling problem worse not better.

Interesting.

> So if you can make
> commit-to-multiple-files-in-optimal-I/O-scheduling-order work, that
> would be even better ;-)

fsyncv? Send multiple inode,range tuples to the kernel to sync.
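To make the "multiple fd,range tuples" suggestion concrete, here is one possible user-space view of such an interface. Everything in it is hypothetical: no fsyncv() syscall exists, and the struct layout, name, and semantics are invented purely for illustration.

#define _GNU_SOURCE
#include <sys/types.h>
#include <fcntl.h>	/* loff_t */

/* Hypothetical vectored sync: one entry per file range to flush. */
struct fsync_vec {
	int	fd;		/* file to sync */
	int	datasync;	/* non-zero: fdatasync-like behaviour */
	loff_t	start;		/* first byte of the range */
	loff_t	length;		/* 0 would mean "to end of file" */
};

/* Would submit writeback for all ranges at once, let the I/O
 * scheduler order them, and return when every range is durable. */
int fsyncv(const struct fsync_vec *vec, int count);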
Thread overview: 42+ messages (newest: ~2009-01-22 21:55 UTC)

2009-01-20 16:47 [rfc] fsync_range? Nick Piggin
2009-01-20 18:31 ` Jamie Lokier
2009-01-20 21:25 ` Bryan Henderson
2009-01-20 22:42 ` Jamie Lokier
2009-01-21 19:43 ` Bryan Henderson
2009-01-21 21:08 ` Jamie Lokier
2009-01-21 22:44 ` Bryan Henderson
2009-01-21 23:31 ` Jamie Lokier
2009-01-21 1:36 ` Nick Piggin
2009-01-21 19:58 ` Bryan Henderson
2009-01-21 20:53 ` Jamie Lokier
2009-01-21 22:14 ` Bryan Henderson
2009-01-21 22:30 ` Jamie Lokier
2009-01-22 1:52 ` Bryan Henderson
2009-01-22 3:41 ` Jamie Lokier
2009-01-21 1:29 ` Nick Piggin
2009-01-21 3:15 ` Jamie Lokier
2009-01-21 3:48 ` Nick Piggin
2009-01-21 5:24 ` Jamie Lokier
2009-01-21 6:16 ` Nick Piggin
2009-01-21 11:18 ` Jamie Lokier
2009-01-21 11:41 ` Nick Piggin
2009-01-21 12:09 ` Jamie Lokier
2009-01-21 4:16 ` Nick Piggin
2009-01-21 4:59 ` Jamie Lokier
2009-01-21 6:23 ` Nick Piggin
2009-01-21 12:02 ` Jamie Lokier
2009-01-21 12:13 ` Theodore Tso
2009-01-21 12:37 ` Jamie Lokier
2009-01-21 14:12 ` Theodore Tso
2009-01-21 14:35 ` Chris Mason
2009-01-21 15:58 ` Eric Sandeen
2009-01-21 20:41 ` Jamie Lokier
2009-01-21 21:23 ` jim owens
2009-01-21 21:59 ` Jamie Lokier
2009-01-21 23:08 ` btrfs O_DIRECT was " jim owens
2009-01-22 0:06 ` Jamie Lokier
2009-01-22 13:50 ` jim owens
2009-01-22 21:18 ` Florian Weimer
2009-01-22 21:23 ` Florian Weimer
2009-01-21 3:25 ` Jamie Lokier
2009-01-21 3:52 ` Nick Piggin