* [rfc] fsync_range?
@ 2009-01-20 16:47 Nick Piggin
2009-01-20 18:31 ` Jamie Lokier
0 siblings, 1 reply; 42+ messages in thread
From: Nick Piggin @ 2009-01-20 16:47 UTC (permalink / raw)
To: linux-fsdevel
Just wondering if we should add an fsync_range syscall like AIX and
some BSDs have? It's pretty simple for the pagecache, since it
already implements the full sync in terms of range syncs anyway. For
filesystems and user programs, I imagine it is a bit easier to convert
from fsync to fsync_range than to use the sync_file_range syscall.

Having a flags argument is nice, but AIX seems to use O_SYNC as the
flag; I wonder if we should follow?
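(For concreteness, a minimal sketch of how the proposed call might look from
userspace, assuming the AIX-style convention of passing O_SYNC/O_DSYNC as the
"how" argument. The wrapper, the syscall number and the naive 64-bit argument
passing are all placeholders - nothing is wired up, and a real 32-bit ABI
would need the usual loff_t splitting.)

/* Hypothetical userspace wrapper for the proposed syscall.
 * __NR_fsync_range is a made-up placeholder; no number has been assigned. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

#ifndef __NR_fsync_range
#define __NR_fsync_range 334	/* placeholder only */
#endif

static inline int fsync_range(int fd, int how, loff_t start, loff_t length)
{
	/* how is O_SYNC or O_DSYNC, as in the AIX interface */
	return syscall(__NR_fsync_range, fd, how, start, length);
}

/* e.g. flush just the last megabyte appended to a log file:
 *	fsync_range(log_fd, O_DSYNC, log_size - (1 << 20), 1 << 20);
 */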
Patch isn't complete...
---
fs/sync.c | 54 ++++++++++++++++++++++++++++++++++++++++++++++++++----
1 file changed, 50 insertions(+), 4 deletions(-)
Index: linux-2.6/fs/sync.c
===================================================================
--- linux-2.6.orig/fs/sync.c
+++ linux-2.6/fs/sync.c
@@ -76,10 +76,12 @@ int file_fsync(struct file *filp, struct
}
/**
- * vfs_fsync - perform a fsync or fdatasync on a file
+ * vfs_fsync_range - perform a fsync or fdatasync on part of a file
* @file: file to sync
* @dentry: dentry of @file
 * @datasync: only perform a fdatasync operation
+ * @start: first byte to be synced
+ * @end: last byte to be synced
*
* Write back data and metadata for @file to disk. If @datasync is
* set only metadata needed to access modified file data is written.
@@ -88,7 +90,8 @@ int file_fsync(struct file *filp, struct
* only @dentry is set. This can only happen when the filesystem
* implements the export_operations API.
*/
-int vfs_fsync(struct file *file, struct dentry *dentry, int datasync)
+int vfs_fsync_range(struct file *file, struct dentry *dentry, int datasync,
+ loff_t start, loff_t end)
{
const struct file_operations *fop;
struct address_space *mapping;
@@ -112,7 +115,7 @@ int vfs_fsync(struct file *file, struct
goto out;
}
- ret = filemap_fdatawrite(mapping);
+ ret = filemap_fdatawrite_range(mapping, start, end);
/*
* We need to protect against concurrent writers, which could cause
@@ -123,12 +126,32 @@ int vfs_fsync(struct file *file, struct
if (!ret)
ret = err;
mutex_unlock(&mapping->host->i_mutex);
- err = filemap_fdatawait(mapping);
+ err = wait_on_page_writeback_range(mapping,
+ start >> PAGE_CACHE_SHIFT, end >> PAGE_CACHE_SHIFT);
if (!ret)
ret = err;
out:
return ret;
}
+EXPORT_SYMBOL(vfs_fsync_range);
+
+/**
+ * vfs_fsync - perform a fsync or fdatasync on a file
+ * @file: file to sync
+ * @dentry: dentry of @file
+ * @datasync: only perform a fdatasync operation
+ *
+ * Write back data and metadata for @file to disk. If @datasync is
+ * set only metadata needed to access modified file data is written.
+ *
+ * In case this function is called from nfsd @file may be %NULL and
+ * only @dentry is set. This can only happen when the filesystem
+ * implements the export_operations API.
+ */
+int vfs_fsync(struct file *file, struct dentry *dentry, int datasync)
+{
+ return vfs_fsync_range(file, dentry, datasync, 0, LLONG_MAX);
+}
EXPORT_SYMBOL(vfs_fsync);
static int do_fsync(unsigned int fd, int datasync)
@@ -154,6 +177,29 @@ SYSCALL_DEFINE1(fdatasync, unsigned int,
return do_fsync(fd, 1);
}
+SYSCALL_DEFINE(fsync_range)(int fd, int how, loff_t start, loff_t length)
+{
+ struct file *file;
+ loff_t end;
+ int ret = -EBADF;
+
+ if (how != O_DSYNC && how != O_SYNC)
+ return -EINVAL;
+
+ if (length == 0)
+ end = LLONG_MAX;
+ else
+ end = start + length - 1;
+
+ file = fget(fd);
+ if (file) {
+ ret = vfs_fsync_range(file, file->f_path.dentry, how == O_DSYNC,
+ start, end);
+ fput(file);
+ }
+ return ret;
+}
+
/*
* sys_sync_file_range() permits finely controlled syncing over a segment of
* a file in the range offset .. (offset+nbytes-1) inclusive. If nbytes is
^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [rfc] fsync_range?
2009-01-20 16:47 [rfc] fsync_range? Nick Piggin
@ 2009-01-20 18:31 ` Jamie Lokier
2009-01-20 21:25 ` Bryan Henderson
2009-01-21 1:29 ` Nick Piggin
0 siblings, 2 replies; 42+ messages in thread
From: Jamie Lokier @ 2009-01-20 18:31 UTC (permalink / raw)
To: Nick Piggin; +Cc: linux-fsdevel

Nick Piggin wrote:
> Just wondering if we should add an fsync_range syscall like AIX and
> some BSDs have? It's pretty simple for the pagecache since it
> already implements the full sync with range syncs anyway. For
> filesystems and user programs, I imagine it is a bit easier to
> convert to fsync_range from fsync rather than use the sync_file_range
> syscall.
>
> Having a flags argument is nice, but AIX seems to use O_SYNC as a
> flag, I wonder if we should follow?

I like the idea. It's much easier to understand than sync_file_range,
whose man page doesn't really explain how to use it correctly.

But how is fsync_range different from the sync_file_range syscall with
all its flags set?

For database writes, you typically write a bunch of stuff in various
regions of a big file (or multiple files), then ideally fdatasync
some/all of the written ranges - with writes committed to disk in the
best order determined by the OS and I/O scheduler.

For this, taking a vector of multiple ranges would be nice.
Alternatively, issuing parallel fsync_range calls from multiple
threads would approximate the same thing - if (big if) they aren't
serialised by the kernel.

--
Jamie

^ permalink raw reply	[flat|nested] 42+ messages in thread
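(A rough sketch of the multi-threaded approximation described above, reusing
the hypothetical fsync_range() wrapper from the first message; error handling
is omitted, and whether this buys anything depends entirely on the calls not
being serialised in the kernel.)

#include <pthread.h>

struct sync_req {
	int	fd;
	loff_t	start;
	loff_t	length;
};

static void *sync_one(void *arg)
{
	struct sync_req *r = arg;

	/* hypothetical wrapper sketched earlier in the thread */
	fsync_range(r->fd, O_DSYNC, r->start, r->length);
	return NULL;
}

/* Push N written ranges to disk in parallel and wait for all of them. */
static void sync_ranges(struct sync_req *req, int n)
{
	pthread_t tid[n];
	int i;

	for (i = 0; i < n; i++)
		pthread_create(&tid[i], NULL, sync_one, &req[i]);
	for (i = 0; i < n; i++)
		pthread_join(tid[i], NULL);
}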
* Re: [rfc] fsync_range?
2009-01-20 18:31 ` Jamie Lokier
@ 2009-01-20 21:25 ` Bryan Henderson
2009-01-20 22:42 ` Jamie Lokier
2009-01-21 1:36 ` Nick Piggin
1 sibling, 2 replies; 42+ messages in thread
From: Bryan Henderson @ 2009-01-20 21:25 UTC (permalink / raw)
To: Jamie Lokier; +Cc: linux-fsdevel, Nick Piggin

> For database writes, you typically write a bunch of stuff in various
> regions of a big file (or multiple files), then ideally fdatasync
> some/all of the written ranges - with writes committed to disk in the
> best order determined by the OS and I/O scheduler.
>
> For this, taking a vector of multiple ranges would be nice.
> Alternatively, issuing parallel fsync_range calls from multiple
> threads would approximate the same thing - if (big if) they aren't
> serialised by the kernel.

That sounds like a job for fadvise(). A new FADV_WILLSYNC says you're
planning to sync that data soon. The kernel responds by scheduling the
I/O immediately. fsync_range() takes a single range and in this case is
just a wait. I think it would be easier for the user as well as more
flexible for the kernel than a multi-range fsync_range() or multiple
threads.

--
Bryan Henderson                     IBM Almaden Research Center
San Jose CA                         Storage Systems

^ permalink raw reply	[flat|nested] 42+ messages in thread
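(Illustration of the usage pattern being proposed here. FADV_WILLSYNC does
not exist, so the advice value below is a pure placeholder; fsync_range() is
the hypothetical wrapper and struct sync_req the struct from the earlier
sketches.)

#include <fcntl.h>

#ifndef POSIX_FADV_WILLSYNC
#define POSIX_FADV_WILLSYNC	1024	/* proposed advice, placeholder value */
#endif

/* Hint every range first so the kernel can start writeout at once... */
static void sync_ranges_advised(int fd, const struct sync_req *req, int n)
{
	int i;

	for (i = 0; i < n; i++)
		posix_fadvise(fd, req[i].start, req[i].length,
			      POSIX_FADV_WILLSYNC);

	/* ...then wait; by now each fsync_range() is (mostly) just a wait. */
	for (i = 0; i < n; i++)
		fsync_range(fd, O_DSYNC, req[i].start, req[i].length);
}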
* Re: [rfc] fsync_range?
2009-01-20 21:25 ` Bryan Henderson
@ 2009-01-20 22:42 ` Jamie Lokier
2009-01-21 19:43 ` Bryan Henderson
0 siblings, 1 reply; 42+ messages in thread
From: Jamie Lokier @ 2009-01-20 22:42 UTC (permalink / raw)
To: Bryan Henderson; +Cc: linux-fsdevel, Nick Piggin

Bryan Henderson wrote:
> > For database writes, you typically write a bunch of stuff in various
> > regions of a big file (or multiple files), then ideally fdatasync
> > some/all of the written ranges - with writes committed to disk in the
> > best order determined by the OS and I/O scheduler.
> >
> > For this, taking a vector of multiple ranges would be nice.
> > Alternatively, issuing parallel fsync_range calls from multiple
> > threads would approximate the same thing - if (big if) they aren't
> > serialised by the kernel.
>
> That sounds like a job for fadvise(). A new FADV_WILLSYNC says you're
> planning to sync that data soon. The kernel responds by scheduling the
> I/O immediately. fsync_range() takes a single range and in this case is
> just a wait. I think it would be easier for the user as well as more
> flexible for the kernel than a multi-range fsync_range() or multiple
> threads.

FADV_WILLSYNC is already implemented: sync_file_range() with
SYNC_FILE_RANGE_WAIT_BEFORE|SYNC_FILE_RANGE_WRITE. That will block in
a few circumstances, but maybe that's inevitable.

If you called FADV_WILLSYNC on a few ranges to mean "soon", how do you
wait until those ranges are properly committed? How do you ensure the
right low-level I/O barriers are sent for those ranges before you
start writing post-barrier data?

I think you're saying call FADV_WILLSYNC first on all the ranges, then
call fsync_range() on each range in turn to wait for the I/O to be
complete - although that will cause unnecessary I/O barriers, one per
fsync_range(). You can do something like that with sync_file_range()
at the moment, except no way to ask for the barrier.

--
Jamie

^ permalink raw reply	[flat|nested] 42+ messages in thread
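(For reference, the existing calls Jamie is referring to; sync_file_range()
has been in Linux since 2.6.17. As noted in the thread, none of this syncs
metadata or asks the device to flush its write cache.)

#define _GNU_SOURCE
#include <fcntl.h>

/* Start writeout of a dirty range without waiting for it to finish
 * (the "FADV_WILLSYNC-like" call mentioned above). */
static int start_range_writeout(int fd, off64_t off, off64_t len)
{
	return sync_file_range(fd, off, len,
			       SYNC_FILE_RANGE_WAIT_BEFORE |
			       SYNC_FILE_RANGE_WRITE);
}

/* Later: wait for the pages in the range to finish writeout.  This still
 * does not sync metadata or flush the device cache. */
static int wait_range_writeout(int fd, off64_t off, off64_t len)
{
	return sync_file_range(fd, off, len, SYNC_FILE_RANGE_WAIT_AFTER);
}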
* Re: [rfc] fsync_range? 2009-01-20 22:42 ` Jamie Lokier @ 2009-01-21 19:43 ` Bryan Henderson 2009-01-21 21:08 ` Jamie Lokier 0 siblings, 1 reply; 42+ messages in thread From: Bryan Henderson @ 2009-01-21 19:43 UTC (permalink / raw) To: Jamie Lokier; +Cc: linux-fsdevel, Nick Piggin > Bryan Henderson wrote: > > > For this, taking a vector of multiple ranges would be nice. > > > Alternatively, issuing parallel fsync_range calls from multiple > > > threads would approximate the same thing - if (big if) they aren't > > > serialised by the kernel. > > > > That sounds like a job for fadvise(). A new FADV_WILLSYNC says you're > > planning to sync that data soon. The kernel responds by scheduling the > > I/O immediately. fsync_range() takes a single range and in this case is > > just a wait. I think it would be easier for the user as well as more > > flexible for the kernel than a multi-range fsync_range() or multiple > > threads. > > FADV_WILLSYNC is already implemented: sync_file_range() with > SYNC_FILE_RANGE_WAIT_BEFORE|SYNC_FILE_RANGE_WRITE. That will block in > a few circumstances, but maybe that's inevitable. There's actually a basic difference between a system call that says "initiate writeout" and one that says, "I plan to sync this soon," even if they have the same results in practice. (And Nick's "I won't write any time soon" idea is yet another). Though reasonable minds differ, the advice-to-kernel approach to managing file caches seems to be winning over instructions-to-kernel, and I personally much prefer it. > I think you're saying call FADV_WILLSYNC first on all the ranges, then > call fsync_range() on each range in turn to wait for the I/O to be > complete Right. The later calls tend to return immediately, of course. >- although that will cause unnecessary I/O barriers, one per >fsync_range(). What do I/O barriers have to do with it? An I/O barrier says, "don't harden later writes before these have hardened," whereas fsync_range() says, "harden these writes now." Does Linux these days send an I/O barrier to the block subsystem and/or device as part of fsync()? Or are we talking about the command to the device to harden all earlier writes (now) against a device power loss? Does fsync() do that? Either way, I can see that multiple fsync_ranges's in a row would be a little worse than just one, but it's pretty bad problem anyway, so I don't know if you could tell the difference. -- Bryan Henderson IBM Almaden Research Center San Jose CA Storage Systems ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [rfc] fsync_range? 2009-01-21 19:43 ` Bryan Henderson @ 2009-01-21 21:08 ` Jamie Lokier 2009-01-21 22:44 ` Bryan Henderson 0 siblings, 1 reply; 42+ messages in thread From: Jamie Lokier @ 2009-01-21 21:08 UTC (permalink / raw) To: Bryan Henderson; +Cc: linux-fsdevel, Nick Piggin Bryan Henderson wrote: > >- although that will cause unnecessary I/O barriers, one per > >fsync_range(). > > What do I/O barriers have to do with it? An I/O barrier says, "don't > harden later writes before these have hardened," whereas fsync_range() > says, "harden these writes now." Does Linux these days send an I/O > barrier to the block subsystem and/or device as part of fsync()? For better or worse, I/O barriers and I/O flushes are the same thing in the Linux block layer. I've argued for treating them distinctly, because there are different I/O scheduling opportunities around each of them, but there wasn't much interest. > Or are we talking about the command to the device to harden all earlier > writes (now) against a device power loss? Does fsync() do that? Ultimately that's what we're talking about, yes. Imho fsync() should do that, because a userspace database/filesystem should have access to the same integrity guarantees as an in-kernel filesystem. Linux fsync() doesn't always send the command - it's a bit unpredictable last time I looked. There are other opinions. MacOSX fsync() doesn't - because it has an fcntl() which is a stronger version of fsync() documented for that case. They preferred reduced integrity of fsync() to keep benchmarks on par with other OSes which don't send the command. Interestingly, Windows _does_ have the option to send the command to the device, controlled by userspace. If you set the Windows equivalents to O_DSYNC and O_DIRECT at the same time, then calls to the Windows equivalent to fdatasync() cause an I/O barrier command to be sent to the disk if necessary. The Windows documentation even explain the different between OS caching and device caching and when each one occurs, too. Wow - it looks like Windows (later versions) has the edge in doing the right thing here for quite some time... http://www.microsoft.com/sql/alwayson/storage-requirements.mspx http://www.microsoft.com/technet/prodtechnol/sql/2000/maintain/sqlIObasics.mspx > Either way, I can see that multiple fsync_ranges's in a row would be a > little worse than just one, but it's pretty bad problem anyway, so I don't > know if you could tell the difference. A little? It's the difference between letting the disk schedule 100 scattered writes itself, and forcing the disk to write them in the order you sent them from userspace, aside from the doubling the rate of device commands... -- Jamie ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [rfc] fsync_range? 2009-01-21 21:08 ` Jamie Lokier @ 2009-01-21 22:44 ` Bryan Henderson 2009-01-21 23:31 ` Jamie Lokier 0 siblings, 1 reply; 42+ messages in thread From: Bryan Henderson @ 2009-01-21 22:44 UTC (permalink / raw) To: Jamie Lokier; +Cc: linux-fsdevel, Nick Piggin Jamie Lokier <jamie@shareable.org> wrote on 01/21/2009 01:08:55 PM: > For better or worse, I/O barriers and I/O flushes are the same thing > in the Linux block layer. I've argued for treating them distinctly, > because there are different I/O scheduling opportunities around each > of them, but there wasn't much interest. It's hard to see how they could be combined -- flushing (waiting for the queue of writes to drain) is what you do -- at great performance cost -- when you don't have barriers available. The point of a barrier is to avoid having the queue run dry. But I don't suppose it matters for this discussion. > > Or are we talking about the command to the device to harden all earlier > > writes (now) against a device power loss? Does fsync() do that? > > Ultimately that's what we're talking about, yes. Imho fsync() should > do that, because a userspace database/filesystem should have access to > the same integrity guarantees as an in-kernel filesystem. Linux > fsync() doesn't always send the command - it's a bit unpredictable > last time I looked. Yes, it's the old performance vs integrity issue. Drives long ago came out with features to defeat operating system integrity efforts, in exchange for performance, by doing write caching by default, ignoring explicit demands to write through, etc. Obviously, some people want that, but I _have_ seen Linux developers escalate the battle for control of the disk drive. I can just never remember where it stands at any moment. But it doesn't matter in this discussion because my point is that if you accept the performance hit for integrity (I suppose we're saying that in current Linux, in some configurations, if a process does frequent fsyncs of a file, every process writing to every drive that file touches will slow to write-through speed), it will be about the same with 100 fsync_ranges in quick succession as for 1. > A little? It's the difference between letting the disk schedule 100 > scattered writes itself, and forcing the disk to write them in the > order you sent them from userspace, aside from the doubling the rate > of device commands... Again, in the scenario I'm talking about, all the writes were in the Linux I/O queue before the first fsync_range() (thanks to fadvises) , so this doesn't happen. -- Bryan Henderson IBM Almaden Research Center San Jose CA Storage Systems ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [rfc] fsync_range? 2009-01-21 22:44 ` Bryan Henderson @ 2009-01-21 23:31 ` Jamie Lokier 0 siblings, 0 replies; 42+ messages in thread From: Jamie Lokier @ 2009-01-21 23:31 UTC (permalink / raw) To: Bryan Henderson; +Cc: linux-fsdevel, Nick Piggin Bryan Henderson wrote: > Jamie Lokier <jamie@shareable.org> wrote on 01/21/2009 01:08:55 PM: > > > For better or worse, I/O barriers and I/O flushes are the same thing > > in the Linux block layer. I've argued for treating them distinctly, > > because there are different I/O scheduling opportunities around each > > of them, but there wasn't much interest. > > It's hard to see how they could be combined -- flushing (waiting for the > queue of writes to drain) is what you do -- at great performance cost -- > when you don't have barriers available. The point of a barrier is to > avoid having the queue run dry. Linux has a combined flush+barrier primitve in the block layer. Actually it's not a primitive op, it's a flag on a write meaning "do flush+barrier before and after this write", but that dates from fs transaction commits, and isn't appropriate for fsync. > Yes, it's the old performance vs integrity issue. Drives long ago came > out with features to defeat operating system integrity efforts, in > exchange for performance, by doing write caching by default, ignoring > explicit demands to write through, etc. Obviously, some people want that, > but I _have_ seen Linux developers escalate the battle for control of the > disk drive. I can just never remember where it stands at any moment. Last time I read about it, a few drives did it for a little while, then they stopped doing it and such drives are rare, if they exist at all, now. Forget about "Linux battling for control". Windows does this barrier stuff too, as does every other major OS, and Microsoft documents it in some depth. Upmarket systems use battery-backed disk controllers of course, to get speed and integrity together. Or increasingly SSDs. Certain downmarket (= cheapo) systems benefit noticably from the right barriers. Pull the power on a cheap Linux-based media player with a hard disk inside, and if it's using ext3 with barriers off, expect filesystem corruption from time to time. I and others working on such things have seen it. With barriers on, never see any corruption. This is with the cheapest small consumer disks you can find. > But it doesn't matter in this discussion because my point is that if you > accept the performance hit for integrity (I suppose we're saying that in > current Linux, in some configurations, if a process does frequent fsyncs > of a file, every process writing to every drive that file touches will > slow to write-through speed), it will be about the same with 100 > fsync_ranges in quick succession as for 1. Write-through speed depends _heavily_ on head seeking with a rotational disk. 100 fsync_ranges _for one commited app-level transaction_ is different from a succession of 100 transactions to commit. If an app requires one transaction which happens to modify 100 different places in a database file, you want those written in the best head seeking order. > > A little? It's the difference between letting the disk schedule 100 > > scattered writes itself, and forcing the disk to write them in the > > order you sent them from userspace, aside from the doubling the rate > > of device commands... 
> > Again, in the scenario I'm talking about, all the writes were in the Linux > I/O queue before the first fsync_range() (thanks to fadvises) , so this > doesn't happen. Maybe you're right about this. :-) (Persuaded). fadvise() which blocks is rather overloading the "hint" meaning of fadvise(). It could work though. It smells more like sync_file_range(), where userspace is responsible for deciding what order to submit the ranges in (because of the blocking), than fsyncv(), where the kernel uses any heuristic it likes including knowledge of filesystem block layout (higher level than elevator, but lower level than plain file offset). For userspace, that's not much different from what databases using O_DIRECT have to do _already_. They_ have to decide what order to submit I/O ranges in, one range at a time, and with AIO they get about the same amount of block elevator flexibility. Which is exactly one full block queue's worth of sorting at the head of a streaming pump of file offsets. So maybe the fadvise() method is ok... It does mean two system calls per file range, though. One fadvise() per range to submit I/O, one fsync_range() to wait for all of it afterwards. That smells like sync_file_range() too. Back to fsyncv() again? Which does have the benefit of being easy to understand too :-) -- Jamie ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [rfc] fsync_range?
2009-01-20 21:25 ` Bryan Henderson
2009-01-20 22:42 ` Jamie Lokier
@ 2009-01-21 1:36 ` Nick Piggin
2009-01-21 19:58 ` Bryan Henderson
1 sibling, 1 reply; 42+ messages in thread
From: Nick Piggin @ 2009-01-21 1:36 UTC (permalink / raw)
To: Bryan Henderson; +Cc: Jamie Lokier, linux-fsdevel

On Tue, Jan 20, 2009 at 01:25:59PM -0800, Bryan Henderson wrote:
> > For database writes, you typically write a bunch of stuff in various
> > regions of a big file (or multiple files), then ideally fdatasync
> > some/all of the written ranges - with writes committed to disk in the
> > best order determined by the OS and I/O scheduler.
> >
> > For this, taking a vector of multiple ranges would be nice.
> > Alternatively, issuing parallel fsync_range calls from multiple
> > threads would approximate the same thing - if (big if) they aren't
> > serialised by the kernel.
>
> That sounds like a job for fadvise(). A new FADV_WILLSYNC says you're
> planning to sync that data soon. The kernel responds by scheduling the
> I/O immediately. fsync_range() takes a single range and in this case is
> just a wait. I think it would be easier for the user as well as more
> flexible for the kernel than a multi-range fsync_range() or multiple
> threads.

A problem is that the kernel will not always be able to schedule the
IO without blocking (various mutexes or block device queues full etc).
And it takes multiple system calls.

If this is important functionality, I think we could do an fsyncv.

Having an FADV_ for asynchronous writeout wouldn't hurt either.
POSIX_FADV_DONTNEED basically does that, except it also drops the
cache afterwards, whereas FADV_WONTDIRTY or something doesn't
necessarily want that. It would be easy to add one to DTRT. (And I
notice FADV_DONTNEED is not taking notice of the given range when
starting writeout.)

^ permalink raw reply	[flat|nested] 42+ messages in thread
* Re: [rfc] fsync_range? 2009-01-21 1:36 ` Nick Piggin @ 2009-01-21 19:58 ` Bryan Henderson 2009-01-21 20:53 ` Jamie Lokier 0 siblings, 1 reply; 42+ messages in thread From: Bryan Henderson @ 2009-01-21 19:58 UTC (permalink / raw) To: Nick Piggin; +Cc: Jamie Lokier, linux-fsdevel Nick Piggin <npiggin@suse.de> wrote on 01/20/2009 05:36:06 PM: > On Tue, Jan 20, 2009 at 01:25:59PM -0800, Bryan Henderson wrote: > > > For this, taking a vector of multiple ranges would be nice. > > > Alternatively, issuing parallel fsync_range calls from multiple > > > threads would approximate the same thing - if (big if) they aren't > > > serialised by the kernel. > > > > That sounds like a job for fadvise(). A new FADV_WILLSYNC says you're > > planning to sync that data soon. The kernel responds by scheduling the > > I/O immediately. fsync_range() takes a single range and in this case is > > just a wait. I think it would be easier for the user as well as more > > flexible for the kernel than a multi-range fsync_range() or multiple > > threads. > > A problem is that the kernel will not always be able to schedule the > IO without blocking (various mutexes or block device queues full etc). I don't really see the problem with that. We're talking about a program that is doing device-synchronous I/O. Blocking is a way of life. Plus, the beauty of advice is that if it's hard occasionally, the kernel can just ignore it. > And it takes multiple system calls. When you're reading, the system call overhead is significant in an operation that just copies from memory to memory, so we have readv() and accept the added complexity and harder to use interface. When you're syncing a file, you're running so slowly that I doubt the overhead of multiple system calls is noticeable. There are a lot of other multiple system call sequences I would try to replace with one complex one before worrying about multi-range file sync. -- Bryan Henderson IBM Almaden Research Center San Jose CA Storage Systems ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [rfc] fsync_range? 2009-01-21 19:58 ` Bryan Henderson @ 2009-01-21 20:53 ` Jamie Lokier 2009-01-21 22:14 ` Bryan Henderson 0 siblings, 1 reply; 42+ messages in thread From: Jamie Lokier @ 2009-01-21 20:53 UTC (permalink / raw) To: Bryan Henderson; +Cc: Nick Piggin, linux-fsdevel Bryan Henderson wrote: > Nick Piggin <npiggin@suse.de> wrote on 01/20/2009 05:36:06 PM: > > > On Tue, Jan 20, 2009 at 01:25:59PM -0800, Bryan Henderson wrote: > > > > For this, taking a vector of multiple ranges would be nice. > > > > Alternatively, issuing parallel fsync_range calls from multiple > > > > threads would approximate the same thing - if (big if) they aren't > > > > serialised by the kernel. > > > > > > That sounds like a job for fadvise(). A new FADV_WILLSYNC says you're > > > > planning to sync that data soon. The kernel responds by scheduling > the > > > I/O immediately. fsync_range() takes a single range and in this case > is > > > just a wait. I think it would be easier for the user as well as more > > > flexible for the kernel than a multi-range fsync_range() or multiple > > > threads. > > > > A problem is that the kernel will not always be able to schedule the > > IO without blocking (various mutexes or block device queues full etc). > > I don't really see the problem with that. We're talking about a program > that is doing device-synchronous I/O. Blocking is a way of life. Plus, > the beauty of advice is that if it's hard occasionally, the kernel can > just ignore it. If you have 100 file regions, each one a few pages in size, and you do 100 fsync_range() calls, that results in potentally far from optimal I/O scheduling (e.g. all over the disk) *and* 100 low-level disk cache flushes (I/O barriers) instead of just one at the end. 100 head seeks and 100 cache flush ops can be very expensive. This is the point of taking a vector of ranges to flush - or some other way to "plug" the I/O and only wait for it after submitting it all. -- Jamie ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [rfc] fsync_range? 2009-01-21 20:53 ` Jamie Lokier @ 2009-01-21 22:14 ` Bryan Henderson 2009-01-21 22:30 ` Jamie Lokier 0 siblings, 1 reply; 42+ messages in thread From: Bryan Henderson @ 2009-01-21 22:14 UTC (permalink / raw) To: Jamie Lokier; +Cc: linux-fsdevel, Nick Piggin Jamie Lokier <jamie@shareable.org> wrote on 01/21/2009 12:53:56 PM: > Bryan Henderson wrote: > > Nick Piggin <npiggin@suse.de> wrote on 01/20/2009 05:36:06 PM: > > > > > On Tue, Jan 20, 2009 at 01:25:59PM -0800, Bryan Henderson wrote: > > > > > For this, taking a vector of multiple ranges would be nice. > > > > > Alternatively, issuing parallel fsync_range calls from multiple > > > > > threads would approximate the same thing - if (big if) they aren't > > > > > serialised by the kernel. > > > > > > > > That sounds like a job for fadvise(). A new FADV_WILLSYNC says you're > > > > > > planning to sync that data soon. The kernel responds by scheduling > > the > > > > I/O immediately. fsync_range() takes a single range and in this case > > is > > > > just a wait. I think it would be easier for the user as well as more > > > > flexible for the kernel than a multi-range fsync_range() or multiple > > > > threads. > > > > > > A problem is that the kernel will not always be able to schedule the > > > IO without blocking (various mutexes or block device queues full etc). > > > > I don't really see the problem with that. We're talking about a program > > that is doing device-synchronous I/O. Blocking is a way of life. Plus, > > the beauty of advice is that if it's hard occasionally, the kernel can > > just ignore it. > > If you have 100 file regions, each one a few pages in size, and you do > 100 fsync_range() calls, that results in potentally far from optimal > I/O scheduling (e.g. all over the disk) *and* 100 low-level disk cache > flushes (I/O barriers) instead of just one at the end. 100 head seeks > and 100 cache flush ops can be very expensive. You got lost in the thread here. I proposed a fadvise() that would result in I/O scheduling; Nick said the fadvise() might have to block; I said so what? Now you seem to be talking about 100 fsync_range() calls, each of which starts and then waits for a sync of one range. Getting back to I/O scheduled as a result of an fadvise(): if it blocks because the block queue is full, then it's going to block with a multi-range fsync_range() as well. The other blocks are kind of vague, but I assume they're rare and about the same as for multi-range fsync_range(). > This is the point of taking a vector of ranges to flush - or some > other way to "plug" the I/O and only wait for it after submitting it > all. My fadvise-based proposal waits for I/O only after it has all been submitted. But plugging (delaying the start of I/O even though it is ready to go and the device is idle) is rarely a good idea. It can help for short bursts to a mostly idle device (typically saves half a seek per burst), but a busy device provides a natural plug. It thus can't help throughput, but can improve the response time of a burst. -- Bryan Henderson IBM Almaden Research Center San Jose CA Storage Systems ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [rfc] fsync_range? 2009-01-21 22:14 ` Bryan Henderson @ 2009-01-21 22:30 ` Jamie Lokier 2009-01-22 1:52 ` Bryan Henderson 0 siblings, 1 reply; 42+ messages in thread From: Jamie Lokier @ 2009-01-21 22:30 UTC (permalink / raw) To: Bryan Henderson; +Cc: linux-fsdevel, Nick Piggin Bryan Henderson wrote: > > If you have 100 file regions, each one a few pages in size, and you do > > 100 fsync_range() calls, that results in potentally far from optimal > > I/O scheduling (e.g. all over the disk) *and* 100 low-level disk cache > > flushes (I/O barriers) instead of just one at the end. 100 head seeks > > and 100 cache flush ops can be very expensive. > > You got lost in the thread here. I proposed a fadvise() that would result > in I/O scheduling; Nick said the fadvise() might have to block; I said so > what? Now you seem to be talking about 100 fsync_range() calls, each of > which starts and then waits for a sync of one range. > > Getting back to I/O scheduled as a result of an fadvise(): if it blocks > because the block queue is full, then it's going to block with a > multi-range fsync_range() as well. No, why would it block? The block queue has room for (say) 100 small file ranges. If you submit 1000 ranges, sure the first 900 may block, then you've got 100 left in the queue. Then you call fsync_range() 1000 times, the first 900 are NOPs as you say because the data has been written. The remaining 100 (size of the block queue) are forced to write serially. They're even written to the disk platter in order. > My fadvise-based proposal waits for I/O only after it has all been > submitted. Are you saying one call to fsync_range() should wait for all the writes which have been queued by the fadvice to different ranges? > But plugging (delaying the start of I/O even though it is ready to go and > the device is idle) is rarely a good idea. It can help for short bursts > to a mostly idle device (typically saves half a seek per burst), but a > busy device provides a natural plug. It thus can't help throughput, but > can improve the response time of a burst. I agree, plugging doesn't make a big difference. However, letting the disk or elevator reorder the writes it has room for does sometimes make a big difference. That's the point. We're not talking about forcibly _delaying_ I/O, we're talking about giving the block elevator, and disk's own elevator, freedom to do their job by not forcibly _flushing_ and _waiting_ between each individual request for the length of the queue. -- Jamie ^ permalink raw reply [flat|nested] 42+ messages in thread
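(A sketch of the submit-everything-then-wait pattern being argued over here,
expressed with today's sync_file_range(); no waiting happens until every
range has been handed to the block layer, so the elevator and the drive are
free to reorder within each phase. The struct is illustrative, and the usual
caveats apply - no metadata sync, no device cache flush.)

struct file_range {
	off64_t	start;
	off64_t	length;
};

/* Phase 1: queue writeout for every range.  Phase 2: wait for them all. */
static int sync_many_ranges(int fd, const struct file_range *r, int n)
{
	int i, err = 0;

	for (i = 0; i < n; i++)
		if (sync_file_range(fd, r[i].start, r[i].length,
				    SYNC_FILE_RANGE_WRITE) < 0)
			err = -1;

	for (i = 0; i < n; i++)
		if (sync_file_range(fd, r[i].start, r[i].length,
				    SYNC_FILE_RANGE_WAIT_AFTER) < 0)
			err = -1;

	return err;
}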
* Re: [rfc] fsync_range? 2009-01-21 22:30 ` Jamie Lokier @ 2009-01-22 1:52 ` Bryan Henderson 2009-01-22 3:41 ` Jamie Lokier 0 siblings, 1 reply; 42+ messages in thread From: Bryan Henderson @ 2009-01-22 1:52 UTC (permalink / raw) To: Jamie Lokier; +Cc: linux-fsdevel, Nick Piggin Jamie Lokier <jamie@shareable.org> wrote on 01/21/2009 02:30:03 PM: > > Getting back to I/O scheduled as a result of an fadvise(): if it blocks > > because the block queue is full, then it's going to block with a > > multi-range fsync_range() as well. > > No, why would it block? The block queue has room for (say) 100 small > file ranges. If you submit 1000 ranges, sure the first 900 may block, > then you've got 100 left in the queue. Yes, those are the blocks Nick mentioned. They're the same as with multi-range fsync_range(), in which the one system call submits 1000 ranges. > Then you call fsync_range() 1000 times, the first 900 are NOPs as you > say because the data has been written. The remaining 100 (size of the > block queue) are forced to write serially. They're even written to > the disk platter in order. I don't see why they would go serially or in any particular order. They're in the Linux queue in sorted, coalesced form and go down to the disk in batches for the drive to do its own coalescing and ordering. Same as with multi-range fsync_range(). The Linux I/O scheduler isn't going to wait for the forthcoming fsync_range() to start any I/O that's in its queue. >Linux has a combined flush+barrier primitve in the block layer. >Actually it's not a primitive op, it's a flag on a write meaning "do >flush+barrier before and after this write", I think you said that wrong, because a barrier isn't something you do. The flag says, "put a barrier before and after this write," and I think you're saying it also implies that as the barrier passes, Linux does a device flush (e.g. SCSI Synchronize Cache) command. That would make sense as a poor man's way of propagating the barrier into the device. SCSI devices have barriers too, but they would be harder for Linux to use than a Synchronize Cache, so maybe Linux doesn't yet. I can also see that it makes sense for fsync() to use this combination. I was confused before because both the device and Linux block layer have barriers and flushes and I didn't know which ones we were talking about. >> Yes, it's the old performance vs integrity issue. Drives long ago came >> out with features to defeat operating system integrity efforts, in >> exchange for performance, by doing write caching by default, ignoring >> explicit demands to write through, etc. Obviously, some people want that, >> but I _have_ seen Linux developers escalate the battle for control of the >> disk drive. I can just never remember where it stands at any moment. > ... >Forget about "Linux battling for control". Windows does this barrier >stuff too, as does every other major OS, and Microsoft documents it in >some depth. Not sure what you want me to forget about; you seem to be confirming that Linux, as well as all other OSes are engaged in this battle (with disk designers), and it seems like a natural state of engineering practice to me. >fadvise() which blocks is rather overloading the "hint" meaning of fadvise(). Yeah, I'm not totally comfortable with that either. I've been pretty much assuming that all the ranges from this database transaction generally fit in the I/O queue. I wonder what existing fadvise(FADV_DONTNEED) does, since Linux has the same "schedule the I/O right now" response to that. 
Just ignore the hint after the queue is full? >It does mean two system calls per file range, though. One fadvise() >per range to submit I/O, one fsync_range() to wait for all of it >afterwards. That smells like sync_file_range() too. > >Back to fsyncv() again? I said in another subthread that I don't think system call overhead is at all noticeable in a program that is doing device-synchronous I/O. Not enough to justify a fsyncv() like we justify readv(). Hey, here's a small example of how the flexibility of the single range fadvise plus single range fsync_range beats a multi-range fsyncv/fsync_range: Early in this thread, we noted the value of feeding multiple ranges of the file to the block layer at once for syncing. Then we noticed that it would also be useful to feed multiple ranges of multiple files, requiring a different interface. With the two system calls per range, that latter requirement was already met without thinking about it. -- Bryan Henderson IBM Almaden Research Center San Jose CA Storage Systems ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [rfc] fsync_range? 2009-01-22 1:52 ` Bryan Henderson @ 2009-01-22 3:41 ` Jamie Lokier 0 siblings, 0 replies; 42+ messages in thread From: Jamie Lokier @ 2009-01-22 3:41 UTC (permalink / raw) To: Bryan Henderson; +Cc: linux-fsdevel, Nick Piggin Bryan Henderson wrote: > > No, why would it block? The block queue has room for (say) 100 small > > file ranges. If you submit 1000 ranges, sure the first 900 may block, > > then you've got 100 left in the queue. > > Yes, those are the blocks Nick mentioned. They're the same as with > multi-range fsync_range(), > in which the one system call submits 1000 ranges. Yes, except that fsync_range() theoretically has flexibility to order them prior to the block queue with filesystem internal knowledge. I doubt if that would ever be implemented, but you never know. > > Then you call fsync_range() 1000 times, the first 900 are NOPs as you > > say because the data has been written. The remaining 100 (size of the > > block queue) are forced to write serially. They're even written to > > the disk platter in order. > > I don't see why they would go serially or in any particular order. You're right, please ignore my brain fart. > >Linux has a combined flush+barrier primitve in the block layer. > >Actually it's not a primitive op, it's a flag on a write meaning "do > >flush+barrier before and after this write", > > I think you said that wrong, because a barrier isn't something you do. The > flag says, "put a barrier before and after this write," and I think you're > saying it also implies that as the barrier passes, Linux does a device > flush (e.g. SCSI Synchronize Cache) command. That would make sense as a > poor man's way of propagating the barrier into the device. SCSI devices > have barriers too, but they would be harder for Linux to use than a > Synchronize Cache, so maybe Linux doesn't yet. That's all correct. Linux does a device flush on PATA if the device write cache is enabled; I don't know if it does one on SCSI. Two flushes are done, before and after the flagged write I/O. There's a "softbarrier" aspect too: other writes cannot be reordered around these I/Os, and on devices which accept overlapping commands, the device queue is drained around the softbarrier. On PATA that's really all you can do. On SATA with NCQ, and on SCSI, if the device accepts1 enough commands in flight at once, it's cheaper to disable the device write cache. It's a balancing act, depending on how often you flush. I don't think Linux has ever used the SCSI barrier capabilities. One other thing it can do is synchronous write, called FUA on SATA, so flush+write+flush becomes flush+syncwrite. The only thing Linux would gain from separating flush ops from barrier ops in the block request queue, is different reordering and coalescing opportunities. It's not permitted to move writes in either direction around a barrier, but it is permitted to move writes earlier past a flush, and that may allow flushes to coalesce. > Yeah, I'm not totally comfortable with that either. I've been pretty much > assuming that all the ranges from this database transaction generally fit > in the I/O queue. I wouldn't assume that. It's legitimate to write gigabytes of data in a transaction, then want to fsync it all before writing a commit block. Only about 1 x system RAM's worth of dirty data will need flushing at that point :-) > I said in another subthread that I don't think system call overhead is at > all noticeable in a program that is doing device-synchronous I/O. 
Not > enough to justify a fsyncv() like we justify readv(). Btw, historically the justification for readv() was for sockets, not files. Separate reads don't work the same. Yes, system call overhead is quite small. But I saw recently on the QEMU list that they want to add preadv() and pwritev() to Linux, because of the difference it makes to performance compared with a sequence of pread() and pwrite() calls. That surprises me. (I wonder if they measured it). fsync_range() does less work per-page than read/write. In some scenarios, fsync_range() is scanning over large numbers of pages as quickly as possible, skipping the clean+not-writing pages. I wonder if that justifies fsyncv() :-) > Hey, here's a small example of how the flexibility of the single range > fadvise plus single range fsync_range beats a multi-range > fsyncv/fsync_range: Early in this thread, we noted the value of feeding > multiple ranges of the file to the block layer at once for syncing. Then > we noticed that it would also be useful to feed multiple ranges of > multiple files, requiring a different interface. With the two system > calls per range, that latter requirement was already met without thinking > about it. That's why Nick proposed fsyncv take (file, start, length) tuples, to sync multiple files :-) If you do it the blocking-fadvise way, the blocking bits. You'll block while feeding requests for the first file, until you get started on the second, and so on. No chance for parallelism - e.g. what if the files are on different devices in btrfs? :-) (Same for different extents in the same file actually). That said, I'll be very surprised if fsyncv() is implemented smarter than that. As an API allowing the possibility, it sort of works though. Who knows, it might just pass the work on to btrfs or Tux3 to optimise cleverly :-) (That said #2, an AIO based API would _in principle_ provide yet more freedom to convery what the app wants without overconstraining. Overlap, parallelism, and not having to batch things up, but submit them as needs come in. Is that realistic?) -- Jamie ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [rfc] fsync_range?
2009-01-20 18:31 ` Jamie Lokier
2009-01-20 21:25 ` Bryan Henderson
@ 2009-01-21 1:29 ` Nick Piggin
2009-01-21 3:15 ` Jamie Lokier
2009-01-21 3:25 ` Jamie Lokier
1 sibling, 2 replies; 42+ messages in thread
From: Nick Piggin @ 2009-01-21 1:29 UTC (permalink / raw)
To: Jamie Lokier; +Cc: linux-fsdevel

On Tue, Jan 20, 2009 at 06:31:21PM +0000, Jamie Lokier wrote:
> Nick Piggin wrote:
> > Just wondering if we should add an fsync_range syscall like AIX and
> > some BSDs have? It's pretty simple for the pagecache since it
> > already implements the full sync with range syncs anyway. For
> > filesystems and user programs, I imagine it is a bit easier to
> > convert to fsync_range from fsync rather than use the sync_file_range
> > syscall.
> >
> > Having a flags argument is nice, but AIX seems to use O_SYNC as a
> > flag, I wonder if we should follow?
>
> I like the idea. It's much easier to understand than sync_file_range,
> whose man page doesn't really explain how to use it correctly.
>
> But how is fsync_range different from the sync_file_range syscall with
> all its flags set?

sync_file_range would have to wait, then write, then wait. It also
does not call into the filesystem's ->fsync function; I don't know
what the wider consequences of that are for all filesystems, but
for some it means that metadata required to read back the data is
not synced properly, and often it means that metadata sync will not
work.

Filesystems could also much more easily get converted to a ->fsync_range
function if that would be beneficial to any of them.

> For database writes, you typically write a bunch of stuff in various
> regions of a big file (or multiple files), then ideally fdatasync
> some/all of the written ranges - with writes committed to disk in the
> best order determined by the OS and I/O scheduler.

Do you know which databases do this? It would be nice to ask for their
input and see whether it helps them (I presume it is an OSS database
because the "big" ones just use direct IO and manage their own
buffers, right?)

Today, they will have to just fsync the whole file. So they first must
identify which parts of the file need syncing, and then gather those
parts as a vector.

> For this, taking a vector of multiple ranges would be nice.
> Alternatively, issuing parallel fsync_range calls from multiple
> threads would approximate the same thing - if (big if) they aren't
> serialised by the kernel.

I was thinking about doing something like that, but I just wanted to
get basic fsync_range... OTOH, we could do an fsyncv syscall and glibc
could implement fsync_range on top of that?

^ permalink raw reply	[flat|nested] 42+ messages in thread
* Re: [rfc] fsync_range? 2009-01-21 1:29 ` Nick Piggin @ 2009-01-21 3:15 ` Jamie Lokier 2009-01-21 3:48 ` Nick Piggin 2009-01-21 4:16 ` Nick Piggin 2009-01-21 3:25 ` Jamie Lokier 1 sibling, 2 replies; 42+ messages in thread From: Jamie Lokier @ 2009-01-21 3:15 UTC (permalink / raw) To: Nick Piggin; +Cc: linux-fsdevel Nick Piggin wrote: > > I like the idea. It's much easier to understand than sync_file_range, > > whose man page doesn't really explain how to use it correctly. > > > > But how is fsync_range different from the sync_file_range syscall with > > all its flags set? > > sync_file_range would have to wait, then write, then wait. It also > does not call into the filesystem's ->fsync function, I don't know > what the wider consequences of that are for all filesystems, but > for some it means that metadata required to read back the data is > not synced properly, and often it means that metadata sync will not > work. fsync_range() must also wait, write, then wait again. The reason is this sequence of events: 1. App calls write() on a page, dirtying it. 2. Data writeout is initiated by usual kernel task. 3. App calls write() on the page again, dirtying it again. 4. App calls fsync_range() on the page. 5. ... Dum de dum, time passes ... 6. Writeout from step 2 completes. 7. fsync_range() initiates another writeout, because the in-progress writeout from step 2 might not include the changes from step 3. 7. fsync_range() waits for writout from step 7. 8. fsync_range() requests a device cache flush if needed (we hope!). 9. Returns to app. Therefore fsync_range() must wait for in-progress writeout to complete, before initiating more writeout and waiting again. This is the reason sync_file_range() has all those flags. As I said, the man page doesn't really explain how to use it properly. An optimisation would be to detect I/O that's been queued on an elevator, but where the page has not actually been read (i.e. no DMA or bounce buffer copy done yet). Most queued I/O presumably falls into this category, and the second writeout would not be required. But perhaps this doesn't happen much in real life? Also the kernel is in a better position to decide which order to do everything in, and how best to batch it. Also, during the first wait (for in-progress writeout) the kernel could skip ahead to queuing some of the other pages for writeout as long as there is room in the request queue, and come back to the other pages later. > Filesystems could also much more easily get converted to a ->fsync_range > function if that would be beneficial to any of them. > > > > For database writes, you typically write a bunch of stuff in various > > regions of a big file (or multiple files), then ideally fdatasync > > some/all of the written ranges - with writes committed to disk in the > > best order determined by the OS and I/O scheduler. > > Do you know which databases do this? It will be nice to ask their > input and see whether it helps them (I presume it is an OSS database > because the "big" ones just use direct IO and manage their own > buffers, right?) I don't know if anyone uses sync_file_range(), or if it even works reliably, since it's not going to get much testing. I don't use it myself yet. My interest is in developing (yet another?) high performance but reliable database engine, not an SQL one though. That's why I keep noticing the issues with fsync, sync_file_range, barriers etc. 
Take a look at this, though: http://linux.derkeiler.com/Mailing-Lists/Kernel/2007-04/msg00811.html "The results show fadvise + sync_file_range is on par or better than O_DIRECT. Detailed results are attached." By the way, direct I/O is nice but (a) not always possible, and (b) you don't get the integrity barriers, do you? > Today, they will have to just fsync the whole file. So they first must > identify which parts of the file need syncing, and then gather those > parts as a vector. Having to fsync the whole file is one reason that some databases use separate journal files - so fsync only flushes the journal file, not the big data file which can sometimes be more relaxed. It's also a reason some databases recommend splitting the database into multiple files of limited size - so the hit from fsync is reduced. When a single file is used for journal and data (like e.g. ext3-in-a-file), every transaction (actually coalesced set of transactions) forces the disk head back and forth between two data areas. If the journal can be synced by itself, the disk head doesn't need to move back and forth as much. Identifying which parts to sync isn't much different than a modern filesystem needs to do with its barriers, journals and journal-trees. They have a lot in common. This is bread and butter stuff for database engines. fsync_range would remove those reasons for using separate files, making the database-in-a-single-file implementations more efficient. That is administratively much nicer, imho. Similar for userspace filesystem-in-a-file, which is basically the same. > > For this, taking a vector of multiple ranges would be nice. > > Alternatively, issuing parallel fsync_range calls from multiple > > threads would approximate the same thing - if (big if) they aren't > > serialised by the kernel. > > I was thinking about doing something like that, but I just wanted to > get basic fsync_range... OTOH, we could do an fsyncv syscall and gcc > could implement fsync_range on top of that? Rather than fsyncv, is there some way to separate the fsync into parts? 1. A sequence of system calls to designate ranges. 2. A call to say "commit and wait on all those ranges given in step 1". It seems sync_file_range() isn't _that_ far off doing that, except it doesn't get the metadata right, as you say, and it doesn't have a place for the I/O barrier either. An additional couple of flags to sync_file_range() would sort out the API: SYNC_FILE_RANGE_METADATA Commit the file metadata such as modification time and attributes. Think fsync() versus fdatasync(). SYNC_FILE_RANGE_IO_BARRIER Include a block device cache flush if needed, same as normal fsync() and fdatasync() are expected to. The flag gives the syscall some flexibility to not do so. For the filesystem metadata, which you noticed is needed to access the data on some filesystems, that should _always_ be committed. Not doing so is a bug in sync_file_range() to be fixed. fdatasync() must commit the metadata needed to access the file data, by the way. In case it wasn't obvious. :-) This includes the file size, if that's grown. Many OSes have an O_DSYNC which is equivalent to fdatasync() after each write, and is documented to write the inode and other metadata needed to access flushed data if the file size has increased. With sync_file_range() fixed, all the other syscalls fsync(), fdatasync() and fsync_range() could be implemented in terms of it - possibly simplifying the code. Maybe O_SYNC and O_DSYNC could use it too. 
-- Jamie ^ permalink raw reply [flat|nested] 42+ messages in thread
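(If the two flags proposed above existed, a full-integrity range sync could
collapse into a single call along these lines; both flag values are
placeholders for the proposal, not part of any kernel.)

/* Proposed flags only -- neither exists; values are placeholders that do
 * not clash with the real WAIT_BEFORE/WRITE/WAIT_AFTER bits (1, 2, 4). */
#ifndef SYNC_FILE_RANGE_METADATA
#define SYNC_FILE_RANGE_METADATA	0x8
#endif
#ifndef SYNC_FILE_RANGE_IO_BARRIER
#define SYNC_FILE_RANGE_IO_BARRIER	0x10
#endif

/* With the proposed flags, this would behave like fsync_range(). */
static int fsync_range_via_sfr(int fd, off64_t start, off64_t len)
{
	return sync_file_range(fd, start, len,
			       SYNC_FILE_RANGE_WAIT_BEFORE |
			       SYNC_FILE_RANGE_WRITE |
			       SYNC_FILE_RANGE_WAIT_AFTER |
			       SYNC_FILE_RANGE_METADATA |
			       SYNC_FILE_RANGE_IO_BARRIER);
}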
* Re: [rfc] fsync_range? 2009-01-21 3:15 ` Jamie Lokier @ 2009-01-21 3:48 ` Nick Piggin 2009-01-21 5:24 ` Jamie Lokier 2009-01-21 4:16 ` Nick Piggin 1 sibling, 1 reply; 42+ messages in thread From: Nick Piggin @ 2009-01-21 3:48 UTC (permalink / raw) To: Jamie Lokier; +Cc: linux-fsdevel On Wed, Jan 21, 2009 at 03:15:00AM +0000, Jamie Lokier wrote: > Nick Piggin wrote: > > > I like the idea. It's much easier to understand than sync_file_range, > > > whose man page doesn't really explain how to use it correctly. > > > > > > But how is fsync_range different from the sync_file_range syscall with > > > all its flags set? > > > > sync_file_range would have to wait, then write, then wait. It also > > does not call into the filesystem's ->fsync function, I don't know > > what the wider consequences of that are for all filesystems, but > > for some it means that metadata required to read back the data is > > not synced properly, and often it means that metadata sync will not > > work. > > fsync_range() must also wait, write, then wait again. > > The reason is this sequence of events: > > 1. App calls write() on a page, dirtying it. > 2. Data writeout is initiated by usual kernel task. > 3. App calls write() on the page again, dirtying it again. > 4. App calls fsync_range() on the page. > 5. ... Dum de dum, time passes ... > 6. Writeout from step 2 completes. > > 7. fsync_range() initiates another writeout, because the > in-progress writeout from step 2 might not include the changes from > step 3. > > 7. fsync_range() waits for writout from step 7. > 8. fsync_range() requests a device cache flush if needed (we hope!). > 9. Returns to app. > > Therefore fsync_range() must wait for in-progress writeout to > complete, before initiating more writeout and waiting again. That's only in rare cases where writeout is started but not completed before we last dirty it and before we call the next fsync. I'd say in most cases, we won't have to wait (it should often remain clean). > This is the reason sync_file_range() has all those flags. As I said, > the man page doesn't really explain how to use it properly. Well, one can read what the code does. Aside from that extra wait, and the problem of not syncing metadata, one thing I dislike about it is that it exposes the new concept of "writeout" to the userspace ABI. Previously all we cared about is whether something is safe on disk or not. So I think it is reasonable to augment the traditional data integrity APIs which will probably be more easily used by existing apps. > An optimisation would be to detect I/O that's been queued on an > elevator, but where the page has not actually been read (i.e. no DMA > or bounce buffer copy done yet). Most queued I/O presumably falls > into this category, and the second writeout would not be required. > > But perhaps this doesn't happen much in real life? I doubt it would be worth the complexity. It would probably be pretty fiddly and ugly change to the pagecache. > Also the kernel is in a better position to decide which order to do > everything in, and how best to batch it. Better position than what? I proposed fsync_range (or fsyncv) to be in-kernel too, of course. > Also, during the first wait (for in-progress writeout) the kernel > could skip ahead to queuing some of the other pages for writeout as > long as there is room in the request queue, and come back to the other > pages later. Sure it could. 
That adds yet more complexity and opens possibility for livelock (you go back to the page you were waiting for to find it was since redirtied and under writeout again). > > > For database writes, you typically write a bunch of stuff in various > > > regions of a big file (or multiple files), then ideally fdatasync > > > some/all of the written ranges - with writes committed to disk in the > > > best order determined by the OS and I/O scheduler. > > > > Do you know which databases do this? It will be nice to ask their > > input and see whether it helps them (I presume it is an OSS database > > because the "big" ones just use direct IO and manage their own > > buffers, right?) > > I don't know if anyone uses sync_file_range(), or if it even works > reliably, since it's not going to get much testing. The problem is that it is hard to verify. Even if it is getting lots of testing, it is not getting enough testing with the block device being shut off or throwing errors at exactly the right time. In 2.6.29 I just fixed a handful of data integrity and error reporting bugs in sync that have been there for basically all of 2.6. > I don't use it myself yet. My interest is in developing (yet > another?) high performance but reliable database engine, not an SQL > one though. That's why I keep noticing the issues with fsync, > sync_file_range, barriers etc. > > Take a look at this, though: > > http://linux.derkeiler.com/Mailing-Lists/Kernel/2007-04/msg00811.html > > "The results show fadvise + sync_file_range is on par or better than > O_DIRECT. Detailed results are attached." That's not to say fsync would be any worse. And it's just a microbenchmark anyway. > By the way, direct I/O is nice but (a) not always possible, and (b) > you don't get the integrity barriers, do you? It should. But I wasn't advocating it versus pagecache + syncing, just wondering what databases could use fsyncv so we can see if they can test. > > Today, they will have to just fsync the whole file. So they first must > > identify which parts of the file need syncing, and then gather those > > parts as a vector. > > Having to fsync the whole file is one reason that some databases use > separate journal files - so fsync only flushes the journal file, not > the big data file which can sometimes be more relaxed. > > It's also a reason some databases recommend splitting the database > into multiple files of limited size - so the hit from fsync is reduced. > > When a single file is used for journal and data (like > e.g. ext3-in-a-file), every transaction (actually coalesced set of > transactions) forces the disk head back and forth between two data > areas. If the journal can be synced by itself, the disk head doesn't > need to move back and forth as much. > > Identifying which parts to sync isn't much different than a modern > filesystem needs to do with its barriers, journals and journal-trees. > They have a lot in common. This is bread and butter stuff for > database engines. > > fsync_range would remove those reasons for using separate files, > making the database-in-a-single-file implementations more efficient. > That is administratively much nicer, imho. > > Similar for userspace filesystem-in-a-file, which is basically the same. Although I think a large part is IOPs rather than data throughput, so cost of fsync_range often might not be much better. > > > For this, taking a vector of multiple ranges would be nice. 
> > > Alternatively, issuing parallel fsync_range calls from multiple > > > threads would approximate the same thing - if (big if) they aren't > > > serialised by the kernel. > > > > I was thinking about doing something like that, but I just wanted to > > get basic fsync_range... OTOH, we could do an fsyncv syscall and gcc > > could implement fsync_range on top of that? > > Rather than fsyncv, is there some way to separate the fsync into parts? > > 1. A sequence of system calls to designate ranges. > 2. A call to say "commit and wait on all those ranges given in step 1". What's the problem with fsyncv? The problem with your proposal is that it takes multiple syscalls and that it requires the kernel to build up state over syscalls which is nasty. ^ permalink raw reply [flat|nested] 42+ messages in thread
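For comparison, here is a minimal userspace sketch of the wait/write/wait sequence discussed above, using the sync_file_range() flags that exist today; the fd, offset and length are illustrative, and, as pointed out above, this does not commit the metadata needed to retrieve the data, so it is not a full integrity guarantee on its own.

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>

/* Flush one byte range of an already-open fd with sync_file_range().
 * This is the wait-before / write / wait-after sequence; unlike
 * fsync()/fdatasync() or the proposed fsync_range(), it does not go
 * through ->fsync, so no file metadata is committed. */
static int flush_range(int fd, off64_t offset, off64_t nbytes)
{
	unsigned int flags = SYNC_FILE_RANGE_WAIT_BEFORE |
			     SYNC_FILE_RANGE_WRITE |
			     SYNC_FILE_RANGE_WAIT_AFTER;

	if (sync_file_range(fd, offset, nbytes, flags) < 0) {
		perror("sync_file_range");
		return -1;
	}
	return 0;
}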
* Re: [rfc] fsync_range? 2009-01-21 3:48 ` Nick Piggin @ 2009-01-21 5:24 ` Jamie Lokier 2009-01-21 6:16 ` Nick Piggin 0 siblings, 1 reply; 42+ messages in thread From: Jamie Lokier @ 2009-01-21 5:24 UTC (permalink / raw) To: Nick Piggin; +Cc: linux-fsdevel Nick Piggin wrote: > > > sync_file_range would have to wait, then write, then wait. It also > > > does not call into the filesystem's ->fsync function, I don't know > > > what the wider consequences of that are for all filesystems, but > > > for some it means that metadata required to read back the data is > > > not synced properly, and often it means that metadata sync will not > > > work. > > > > fsync_range() must also wait, write, then wait again. > > > > The reason is this sequence of events: > > > > 1. App calls write() on a page, dirtying it. > > 2. Data writeout is initiated by usual kernel task. > > 3. App calls write() on the page again, dirtying it again. > > 4. App calls fsync_range() on the page. > > 5. ... Dum de dum, time passes ... > > 6. Writeout from step 2 completes. > > > > 7. fsync_range() initiates another writeout, because the > > in-progress writeout from step 2 might not include the changes from > > step 3. > > > > 7. fsync_range() waits for writout from step 7. > > 8. fsync_range() requests a device cache flush if needed (we hope!). > > 9. Returns to app. > > > > Therefore fsync_range() must wait for in-progress writeout to > > complete, before initiating more writeout and waiting again. > > That's only in rare cases where writeout is started but not completed > before we last dirty it and before we call the next fsync. I'd say in > most cases, we won't have to wait (it should often remain clean). Agreed it's rare. In those cases, sync_file_range() doesn't wait twice either. Both functions are the same in this part. > > This is the reason sync_file_range() has all those flags. As I said, > > the man page doesn't really explain how to use it properly. > > Well, one can read what the code does. Aside from that extra wait, There shouldn't be an extra wait. > and the problem of not syncing metadata, A bug. > one thing I dislike about it is that it exposes the new concept of > "writeout" to the userspace ABI. Previously all we cared about was > whether something is safe on disk or not. So I think it is > reasonable to augment the traditional data integrity APIs which will > probably be more easily used by existing apps. I agree entirely. Everyone knows what fsync_range() does, just from the name. Was there some reason, perhaps for performance or flexibility, for exposing the "writeout" concept to userspace? > > Also the kernel is in a better position to decide which order to do > > everything in, and how best to batch it. > > Better position than what? I proposed fsync_range (or fsyncv) to be > in-kernel too, of course. I mean the kernel is in a better position than userspace's lame attempts to call sync_file_range() in a clever way for optimal performance :-) > > Also, during the first wait (for in-progress writeout) the kernel > > could skip ahead to queuing some of the other pages for writeout as > > long as there is room in the request queue, and come back to the other > > pages later. > > Sure it could. That adds yet more complexity and opens possibility for > livelock (you go back to the page you were waiting for to find it was > since redirtied and under writeout again). Didn't you have a patch that fixed a similar livelock against other apps in fsync()? I agree about the complexity. 
It's probably such a rare case. It must be handled correctly, though - two waits when needed, one wait usually. > > > > For database writes, you typically write a bunch of stuff in various > > > > regions of a big file (or multiple files), then ideally fdatasync > > > > some/all of the written ranges - with writes committed to disk in the > > > > best order determined by the OS and I/O scheduler. > > > > > > Do you know which databases do this? It will be nice to ask their > > > input and see whether it helps them (I presume it is an OSS database > > > because the "big" ones just use direct IO and manage their own > > > buffers, right?) > > > > I don't know if anyone uses sync_file_range(), or if it even works > > reliably, since it's not going to get much testing. > > The problem is that it is hard to verify. Even if it is getting lots > of testing, it is not getting enough testing with the block device > being shut off or throwing errors at exactly the right time. QEMU would be good for testing this sort of thing, but it doesn't sound like an easy test to write. > In 2.6.29 I just fixed a handful of data integrity and error reporting > bugs in sync that have been there for basically all of 2.6. Thank you so much! When I started work on a database engine, I cared about storage integrity a lot. I looked into fsync integrity on Linux and came out running because the smell was so bad. > > Take a look at this, though: > > > > http://linux.derkeiler.com/Mailing-Lists/Kernel/2007-04/msg00811.html > > > > "The results show fadvise + sync_file_range is on par or better than > > O_DIRECT. Detailed results are attached." > > That's not to say fsync would be any worse. And it's just a microbenchmark > anyway. In the end he was using O_DIRECT synchronously. You have to overlap O_DIRECT with AIO (the only time AIO on Linux really works) to get sensible performance. So ignore that result. > > By the way, direct I/O is nice but (a) not always possible, and (b) > > you don't get the integrity barriers, do you? > > It should. O_DIRECT can't do an I/O barrier after every write because performance would suck. Really badly. However, a database engine with any self-respect would want I/O barriers at certain points for data integrity. I suggest fdatasync() et al. should issue the barrier if there have been any writes, including O_DIRECT writes, since the last barrier. That could be a file-wide single flag "there have been writes since last barrier". > > fsync_range would remove those reasons for using separate files, > > making the database-in-a-single-file implementations more efficient. > > That is administratively much nicer, imho. > > > > Similar for userspace filesystem-in-a-file, which is basically the same. > > Although I think a large part is IOPs rather than data throughput, > so cost of fsync_range often might not be much better. IOPs are affected by head seeking. If the head is forced to seek between journal area and main data on every serial transaction, IOPs drops substantially. fsync_range() would reduce that seeking, for databases (and filesystems) which store both in the same file. > > > > For this, taking a vector of multiple ranges would be nice. > > > > Alternatively, issuing parallel fsync_range calls from multiple > > > > threads would approximate the same thing - if (big if) they aren't > > > > serialised by the kernel. > > > > > > I was thinking about doing something like that, but I just wanted to > > > get basic fsync_range... 
OTOH, we could do an fsyncv syscall and gcc > > > could implement fsync_range on top of that? > > > > Rather than fsyncv, is there some way to separate the fsync into parts? > > > > 1. A sequence of system calls to designate ranges. > > 2. A call to say "commit and wait on all those ranges given in step 1". > > What's the problem with fsyncv? The problem with your proposal is that > it takes multiple syscalls and that it requires the kernel to build up > state over syscalls which is nasty. I guess I'm coming back to sync_file_range(), which sort of does that separation :-) Also, see the other mail, about the PostgreSQL folks wanting to sync optimally multiple files at once, not serialised. I don't have a problem with fsyncv() per se. Should it take a single file descriptor and list of file-ranges, or a list of file descriptors with ranges? The latter is more general, but too vectory without justification is a good way to get syscalls NAKd by Linus. In theory, pluggable Linux-AIO would be a great multiple-request submission mechanism. There's IOCB_CMD_FDSYNC (AIO request), just add IOCB_CMD_FDSYNC_RANGE. There's room under the hood of that API for batching sensibly, and putting the waits and barriers in the best places. But Linux-AIO does not have a reputation for actually working, though the API looks good in theory. -- Jamie ^ permalink raw reply [flat|nested] 42+ messages in thread
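Since Linux-AIO comes up here, a rough sketch of what submitting IOCB_CMD_FDSYNC through libaio looks like, under the assumption that the filesystem actually implements the aio fsync hook (many do not, in which case io_submit() simply fails), and with no range variant available:

#include <libaio.h>

/* Submit an asynchronous fdsync on fd and wait for it to complete.
 * Returns 0 on success, -1 if AIO setup fails or the filesystem does
 * not support IOCB_CMD_FDSYNC (io_submit() then reports an error). */
static int aio_fdsync(int fd)
{
	io_context_t ctx = 0;
	struct iocb cb, *cbs[1] = { &cb };
	struct io_event ev;
	int ret = -1;

	if (io_setup(1, &ctx) < 0)
		return -1;

	io_prep_fdsync(&cb, fd);
	if (io_submit(ctx, 1, cbs) == 1 &&
	    io_getevents(ctx, 1, 1, &ev, NULL) == 1 && ev.res == 0)
		ret = 0;

	io_destroy(ctx);
	return ret;
}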
* Re: [rfc] fsync_range? 2009-01-21 5:24 ` Jamie Lokier @ 2009-01-21 6:16 ` Nick Piggin 2009-01-21 11:18 ` Jamie Lokier 0 siblings, 1 reply; 42+ messages in thread From: Nick Piggin @ 2009-01-21 6:16 UTC (permalink / raw) To: Jamie Lokier; +Cc: linux-fsdevel On Wed, Jan 21, 2009 at 05:24:01AM +0000, Jamie Lokier wrote: > Nick Piggin wrote: > > That's only in rare cases where writeout is started but not completed > > before we last dirty it and before we call the next fsync. I'd say in > > most cases, we won't have to wait (it should often remain clean). > > Agreed it's rare. In those cases, sync_file_range() doesn't wait > twice either. Both functions are the same in this part. > > > > This is the reason sync_file_range() has all those flags. As I said, > > > the man page doesn't really explain how to use it properly. > > > > Well, one can read what the code does. Aside from that extra wait, > > There shouldn't be an extra wait. Of course there is because it has to wait on writeout of clean pages, then writeout dirty pages, then wait on writeout of dirty pages. > > one thing I dislike about it is that it exposes the new concept of > > "writeout" to the userspace ABI. Previously all we cared about was > > whether something is safe on disk or not. So I think it is > > reasonable to augment the traditional data integrity APIs which will > > probably be more easily used by existing apps. > > I agree entirely. > > Everyone knows what fsync_range() does, just from the name. > > Was there some reason, perhaps for performance or flexibility, for > exposing the "writeout" concept to userspace? I don't think I ever saw actual numbers to justify it. The async writeout part of it I guess is one aspect, but one could just add an async flag to fsync (like msync) to get mostly the same result. > > > Also the kernel is in a better position to decide which order to do > > > everything in, and how best to batch it. > > > > Better position than what? I proposed fsync_range (or fsyncv) to be > > in-kernel too, of course. > > I mean the kernel is in a better position than userspace's lame > attempts to call sync_file_range() in a clever way for optimal > performance :-) OK, agreed. In which case, fsyncv is a winner because you'd be able to sync multiple files and multiple ranges within each file. > > > Also, during the first wait (for in-progress writeout) the kernel > > > could skip ahead to queuing some of the other pages for writeout as > > > long as there is room in the request queue, and come back to the other > > > pages later. > > > > Sure it could. That adds yet more complexity and opens possibility for > > livelock (you go back to the page you were waiting for to find it was > > since redirtied and under writeout again). > > Didn't you have a patch that fixed a similar livelock against other apps > in fsync()? Well, that was more of "really slow progress". This could actually be a real livelock because progress may never be made. > > > The problem is that it is hard to verify. Even if it is getting lots > > > of testing, it is not getting enough testing with the block device > > > being shut off or throwing errors at exactly the right time. > > > > QEMU would be good for testing this sort of thing, but it doesn't > > sound like an easy test to write. > > > > > In 2.6.29 I just fixed a handful of data integrity and error reporting > > > bugs in sync that have been there for basically all of 2.6. > > > > Thank you so much! > > > > When I started work on a database engine, I cared about storage > > integrity a lot. 
I looked into fsync integrity on Linux and came out > running because the smell was so bad. I guess that abruptly shutting down the block device queue could be used to pick up some bugs. That could be done using a real host and brd quite easily. The problem with some of those bugs I fixed is that some could take quite a rare and transient situation before the window even opens for possible data corruption. Then you have to crash the machine at that time, and hope the pattern that was written out is in fact one that will cause corruption. I tried to write some debug infrastructure; basically putting sequence counts in the struct page and going BUG if the page is found to be still dirty after the last fsync event but before the next dirty page event... that kind of handles the simple case of the pagecache, but not really the filesystem or block device parts of the equation, which seem to be more difficult. > > > Take a look at this, though: > > > > > > http://linux.derkeiler.com/Mailing-Lists/Kernel/2007-04/msg00811.html > > > > > > "The results show fadvise + sync_file_range is on par or better than > > > O_DIRECT. Detailed results are attached." > > > > That's not to say fsync would be any worse. And it's just a microbenchmark > > anyway. > > In the end he was using O_DIRECT synchronously. You have to overlap > O_DIRECT with AIO (the only time AIO on Linux really works) to get > sensible performance. So ignore that result. Ah OK. > > > By the way, direct I/O is nice but (a) not always possible, and (b) > > > you don't get the integrity barriers, do you? > > > > It should. > > O_DIRECT can't do an I/O barrier after every write because performance > would suck. Really badly. However, a database engine with any > self-respect would want I/O barriers at certain points for data integrity. Hmm, I don't follow why that should be the case. Doesn't any self-respecting storage controller tell us the data is safe when it hits its non-volatile RAM? > I suggest fdatasync() et al. should issue the barrier if there have > been any writes, including O_DIRECT writes, since the last barrier. > That could be a file-wide single flag "there have been writes since > last barrier". Well, I'd say the less that simpler applications have to care about, the better. For Oracle and DB2 etc. I think we could have a mode that turns off intermediate block device barriers and give them a syscall or ioctl to issue the barrier manually. If that helps them significantly. > > > fsync_range would remove those reasons for using separate files, > > > making the database-in-a-single-file implementations more efficient. > > > That is administratively much nicer, imho. > > > > > > Similar for userspace filesystem-in-a-file, which is basically the same. > > > > Although I think a large part is IOPs rather than data throughput, > > so cost of fsync_range often might not be much better. > > IOPs are affected by head seeking. If the head is forced to seek > between journal area and main data on every serial transaction, IOPs > drops substantially. fsync_range() would reduce that seeking, for > databases (and filesystems) which store both in the same file. OK I see your point. But that's not to say you couldn't have two files or partitions laid out next to one another. But yes, no question that fsync_range is more flexible. > > > What's the problem with fsyncv? The problem with your proposal is that > > it takes multiple syscalls and that it requires the kernel to build up > > state over syscalls which is nasty. 
> > I guess I'm coming back to sync_file_range(), which sort of does that > separation :-) > > Also, see the other mail, about the PostgreSQL folks wanting to sync > optimally multiple files at once, not serialised. > > I don't have a problem with fsyncv() per se. Should it take a single > file descriptor and list of file-ranges, or a list of file descriptors > with ranges? The latter is more general, but too vectory without > justification is a good way to get syscalls NAKd by Linus. The latter, I think. It is indeed much more useful (you could sync a hundred files and have them share a lot of the block device flushes / barriers). > In theory, pluggable Linux-AIO would be a great multiple-request > submission mechanism. There's IOCB_CMD_FDSYNC (AIO request), just add > IOCB_CMD_FDSYNC_RANGE. There's room under the hood of that API for > batching sensibly, and putting the waits and barriers in the best > places. But Linux-AIO does not have a reputation for actually > working, though the API looks good in theory. ^ permalink raw reply [flat|nested] 42+ messages in thread
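To make the O_DIRECT barrier point above concrete, a sketch of the pattern being described: direct writes for the bulk of the I/O, with an explicit fdatasync() only at commit points to get the cache flush. The path, alignment and sizes are made-up assumptions for the example.

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define BLK 4096	/* assumed alignment for O_DIRECT */

int main(void)
{
	void *buf;
	int fd = open("datafile", O_RDWR | O_DIRECT);	/* illustrative path */

	if (fd < 0 || posix_memalign(&buf, BLK, BLK) != 0)
		return 1;
	memset(buf, 0xab, BLK);

	/* Direct write: bypasses the page cache, but by itself says
	 * nothing about the drive's volatile write cache. */
	if (pwrite(fd, buf, BLK, 0) != BLK)
		return 1;

	/* Commit point: fdatasync() is what asks the block layer for
	 * the flush/barrier being discussed above. */
	if (fdatasync(fd) < 0)
		return 1;

	free(buf);
	close(fd);
	return 0;
}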
* Re: [rfc] fsync_range? 2009-01-21 6:16 ` Nick Piggin @ 2009-01-21 11:18 ` Jamie Lokier 2009-01-21 11:41 ` Nick Piggin 0 siblings, 1 reply; 42+ messages in thread From: Jamie Lokier @ 2009-01-21 11:18 UTC (permalink / raw) To: Nick Piggin; +Cc: linux-fsdevel Nick Piggin wrote: > > > That's only in rare cases where writeout is started but not completed > > > before we last dirty it and before we call the next fsync. I'd say in > > > most cases, we won't have to wait (it should often remain clean). > > > There shouldn't be an extra wait. [in sync_file_range] > > Of course there is because it has to wait on writeout of clean pages, > then writeout dirty pages, then wait on writeout of dirty pages. Eh? How is that different from the "only in rare cases where writeout is started but not completed" in your code? Oh, let me guess. sync_file_range() will wait for writeout to complete on pages where the dirty bit was cleared when they were queued for writeout and have not been dirtied since, while fsync_range() will not wait for those? I distinctly remember someone... yes, Andrew Morton, explaining why the double wait is needed for integrity. http://www.mail-archive.com/linux-kernel@vger.kernel.org/msg272270.html That's how I learned what (at least one person thinks) is the intended semantics of sync_file_range(). I'll just quote one line from Andrew's post: >> It's an interesting problem, with potentially high payback. Back to that subtlety of waiting, and integrity. If fsync_range does not wait at all on a page which is under writeout and clean (not dirtied since the writeout was queued), it will not achieve integrity. That can happen due to the following events: 1. App calls write(), dirties page. 2. Background dirty flushing starts writeout, clears dirty bit. 3. App calls fsync_range() on the page. 4. fsync_range() doesn't wait on it because it's clean. 5. Bang, app thinks the write is committed when it isn't. On the other hand, if I've misunderstood and it will wait on that page, but not twice, then I think it's the same as what sync_file_range() is _supposed_ to do. sync_file_range() is misunderstood. Possibly due to the man page, hand-waving and implementation. I don't think the flags mean "wait on all writeouts" _then_ "initiate all dirty writeouts" _then_ "wait on all writeouts". I think they mean *for each page in parallel* do that, or at least do its best with those constraints. In other words, no double-waiting or excessive serialisation. Don't get me wrong, I think fsync_range() is a much cleaner idea, and much more likely to be used. If fsync_range() is coming, it wouldn't do any harm, imho, to delete sync_file_range() completely, and replace it with a stub which calls fsync_range(). Or ENOSYS, then we'll find out if anyone used it :-) Your implementation will obviously be better, given all your kind attention to fsync integrity generally. Andrew Morton did write, though: >>The reason for providing b) only (SYNC_FILE_RANGE_WRITE) is so that >>userspace can get as much data into the queue as possible, to permit the >>kernel to optimise IO scheduling better. I wonder if there is something to that, or if it was just wishful thinking. -- Jamie 
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [rfc] fsync_range? 2009-01-21 11:18 ` Jamie Lokier @ 2009-01-21 11:41 ` Nick Piggin 2009-01-21 12:09 ` Jamie Lokier 0 siblings, 1 reply; 42+ messages in thread From: Nick Piggin @ 2009-01-21 11:41 UTC (permalink / raw) To: Jamie Lokier; +Cc: linux-fsdevel On Wed, Jan 21, 2009 at 11:18:02AM +0000, Jamie Lokier wrote: > Nick Piggin wrote: > > > > That's only in rare cases where writeout is started but not completed > > > > before we last dirty it and before we call the next fsync. I'd say in > > > > most cases, we won't have to wait (it should often remain clean). > > > > > There shouldn't be an extra wait. [in sync_file_range] > > > > Of course there is because it has to wait on writeout of clean pages, > > then writeout dirty pages, then wait on writeout of dirty pages. > > Eh? How is that different from the "only in rare cases where writeout > is started but not completed" in your code? No, in my code it is where writeout is started and the page has been redirtied. If writeout has started and the page is still clean (which should be the more common case of the two), then it doesn't have to. > Oh, let me guess. sync_file_range() will wait for writeout to > complete on pages where the dirty bit was cleared when they were > queued for writeout and have not been dirtied since, while > fsync_range() will not wait for those? > > I distinctly remember someone... yes, Andrew Morton, explaining why > the double wait is needed for integrity. > > http://www.mail-archive.com/linux-kernel@vger.kernel.org/msg272270.html > > That's how I learned what (at least one person thinks) is the > intended semantics of sync_file_range(). The double wait is needed by sync_file_range, firstly because the first flag is simply defined to wait for writeout, but also because the "start writeout" flag is defined not to start writeout on pages which are dirty but already under writeout. > I'll just quote one line from Andrew's post: > >> It's an interesting problem, with potentially high payback. > > Back to that subtlety of waiting, and integrity. > > If fsync_range does not wait at all on a page which is under writeout > and clean (not dirtied since the writeout was queued), it will not > achieve integrity. > > That can happen due to the following events: > > 1. App calls write(), dirties page. > 2. Background dirty flushing starts writeout, clears dirty bit. > 3. App calls fsync_range() on the page. > 4. fsync_range() doesn't wait on it because it's clean. > 5. Bang, app thinks the write is committed when it isn't. No, because fsync_range still has to wait for writeout pages *after* it has submitted dirty pages for writeout. This includes all pages, not just ones it has submitted just now. > I don't think the flags mean "wait on all writeouts" _then_ "initiate > all dirty writeouts" _then_ "wait on all writeouts". They do. It is explicitly stated and that is exactly how it is implemented (except "initiate writeout against all dirty pages" is "initiate writeout against all dirty pages not already under writeout"). > I think they mean *for each page in parallel* do that, or at least do > its best with those constraints. > > In other words, no double-waiting or excessive serialisation. Well, you can do it for each page in parallel, yes. This is what we discussed about starting writeout against *other* pages if we find a page under writeout that we have to wait for. And then coming back to that page to process it. This opens the whole livelock and complexity thing. 
> Don't get me wrong, I think fsync_range() is a much cleaner idea, and > much more likely to be used. > > If fsync_range() is coming, it wouldn't do any harm, imho, to delete > sync_file_range() completely, and replace it with a stub which calls > fsync_range(). Or ENOSYS, then we'll find out if anyone used it :-) > Your implementation will obviously be better, given all your kind > attention to fsync integrity generally. Well, given that postgresql posted that they need to sync multiple files, I think fsyncv is a nice way forward. It can be used to implement fsync_range too, which is slightly portable. > Andrew Morton did write, though: > >>The reason for providing b) only (SYNC_FILE_RANGE_WRITE) is so that > >>userspace can get as much data into the queue as possible, to permit the > >>kernel to optimise IO scheduling better. > > I wonder if there is something to that, or if it was just wishful > thinking. Another problem with that, precisely because it is tied up with the idea of writeout mixed with data integrity semantics, is that the kernel is *not* free to do what it thinks best. It has to start writeout on all those pages and not return until writeout is started. If the queue fills up it has to block. It cannot schedule a thread to write out asynchronously, etc. Because userspace is directing how the implementation should work rather than the high level intention. Andrew and I have a well-archived difference of opinion on this ;) I have no interest in ripping out sync_file_range. I can't say it is wrong or always going to be suboptimal. But I think it is fine to extend the more traditional fsync APIs too. ^ permalink raw reply [flat|nested] 42+ messages in thread
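For concreteness, one possible shape of the fsyncv() idea being discussed; nothing like this was actually posted or merged, so the structure, names and flags below are purely hypothetical.

#include <sys/types.h>

/* Hypothetical interface sketch only. One call syncs several byte
 * ranges, possibly across several file descriptors, so the kernel can
 * batch writeback and share device cache flushes between them. */
struct fsync_extent {
	int	fd;	/* file to sync */
	off_t	start;	/* first byte of the range */
	off_t	length;	/* 0 could mean "to end of file" */
	int	how;	/* O_SYNC- or O_DSYNC-style behaviour */
};

int fsyncv(const struct fsync_extent *vec, unsigned int count,
	   unsigned int flags);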
* Re: [rfc] fsync_range? 2009-01-21 11:41 ` Nick Piggin @ 2009-01-21 12:09 ` Jamie Lokier 0 siblings, 0 replies; 42+ messages in thread From: Jamie Lokier @ 2009-01-21 12:09 UTC (permalink / raw) To: Nick Piggin; +Cc: linux-fsdevel Nick Piggin wrote: > Well, given that postgresql posted that they need to sync multiple > files, I think fsyncv is a nice way forward. It can be used to > implement fsync_range too, which is slightly portable. Also, fsyncv on multiple files could issue just the one disk cache flush, if they're all to the same disk... [about sync_file_range] > If the queue fills up it has to block. It cannot schedule a thread > to write out asynchronously, etc. Because userspace is directing > how the implementation should work rather than the high level > intention. I agree that it's overly constraining, and pushes unnecessary tuning work into userspace. All these calls, btw, would be much more "optimisable" in the kernel if they were AIOs. Let the kernel decide things like how much to batch, how much to parallelise, and still have the hint which comes from AIO submission order (userspace threads doing synchronous I/O lose this bit). But that doesn't seem likely to happen because it's really quite hard. -- Jamie ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [rfc] fsync_range? 2009-01-21 3:15 ` Jamie Lokier 2009-01-21 3:48 ` Nick Piggin @ 2009-01-21 4:16 ` Nick Piggin 2009-01-21 4:59 ` Jamie Lokier 1 sibling, 1 reply; 42+ messages in thread From: Nick Piggin @ 2009-01-21 4:16 UTC (permalink / raw) To: Jamie Lokier; +Cc: linux-fsdevel On Wed, Jan 21, 2009 at 03:15:00AM +0000, Jamie Lokier wrote: > Nick Piggin wrote: > An additional couple of flags to sync_file_range() would sort out the > API: > > SYNC_FILE_RANGE_METADATA > > Commit the file metadata such as modification time and > attributes. Think fsync() versus fdatasync(). Note that the problem with sync_file_range is not that it lacks a metadata flag like fsync vs fdatasync. It is that it does not even sync the metadata required to retrieve the data (which of course fdatasync must do, otherwise it would be useless). This is just another reason why I prefer to just try to evolve the traditional fsync interface slowly. ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [rfc] fsync_range? 2009-01-21 4:16 ` Nick Piggin @ 2009-01-21 4:59 ` Jamie Lokier 2009-01-21 6:23 ` Nick Piggin 0 siblings, 1 reply; 42+ messages in thread From: Jamie Lokier @ 2009-01-21 4:59 UTC (permalink / raw) To: Nick Piggin; +Cc: linux-fsdevel Nick Piggin wrote: > On Wed, Jan 21, 2009 at 03:15:00AM +0000, Jamie Lokier wrote: > > Nick Piggin wrote: > > An additional couple of flags to sync_file_range() would sort out the > > API: > > > > SYNC_FILE_RANGE_METADATA > > > > Commit the file metadata such as modification time and > > attributes. Think fsync() versus fdatasync(). > > Note that the problem with sync_file_range is not that it lacks a > metadata flag like fsync vs fdatasync. It is that it does not even > sync the metadata required to retrieve the data (which of course > fdatasync must do, otherwise it would be useless). Oh, I agree about that. (Different meaning of metadata, btw. That's the term used in O_SYNC vs. O_DSYNC documentation for other unixes that I've read; that's why I used it in that flag, for consistency with other unixes.) > This is just another reason why I prefer to just try to evolve the > traditional fsync interface slowly. But sync_file_range() has a bug, which you've pointed out - the missing _data-retrieval_ metadata isn't synced. In other words, it's completely useless. If that bug isn't going to be fixed, delete sync_file_range() altogether. There's no point keeping it if it's broken. And if it's fixed, it'll do what your fsync_range() does, so why have both? -- Jamie ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [rfc] fsync_range? 2009-01-21 4:59 ` Jamie Lokier @ 2009-01-21 6:23 ` Nick Piggin 2009-01-21 12:02 ` Jamie Lokier 2009-01-21 12:13 ` Theodore Tso 0 siblings, 2 replies; 42+ messages in thread From: Nick Piggin @ 2009-01-21 6:23 UTC (permalink / raw) To: Jamie Lokier; +Cc: linux-fsdevel On Wed, Jan 21, 2009 at 04:59:21AM +0000, Jamie Lokier wrote: > Nick Piggin wrote: > > This is just another reason why I prefer to just try to evolve the > > traditional fsync interface slowly. > > But sync_file_range() has a bug, which you've pointed out - the > missing _data-retrieval_ metadata isn't synced. In other words, it's > completely useless. I don't know. I don't think this is a newly discovered problem. I think it's been known for a while, so I don't know what's going on. > If that bug isn't going to be fixed, delete sync_file_range() > altogether. There's no point keeping it if it's broken. And if it's > fixed, it'll do what your fsync_range() does, so why have both? Well the thing is it doesn't. Again it comes back to the whole writeout thing, which makes it more constraining on the kernel to optimise. For example, my fsync "livelock" avoidance patches did the following: 1. find all pages which are dirty or under writeout first. 2. write out the dirty pages. 3. wait for our set of pages. Simple, obvious, and the kernel can optimise this well because the userspace has asked for a high level request "make this data safe" rather than low level directives. We can't do this same nice simple sequence with sync_file_range because SYNC_FILE_RANGE_WAIT_AFTER means we have to wait for all writeout pages in the range, including unrelated ones, after the dirty writeout. SYNC_FILE_RANGE_WAIT_BEFORE means we have to wait for clean writeout pages before we even start doing real work. ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [rfc] fsync_range? 2009-01-21 6:23 ` Nick Piggin @ 2009-01-21 12:02 ` Jamie Lokier 0 siblings, 0 replies; 42+ messages in thread From: Jamie Lokier @ 2009-01-21 12:02 UTC (permalink / raw) To: Nick Piggin; +Cc: linux-fsdevel Nick Piggin wrote: > Again it comes back to the whole writeout thing, which makes it more > constraining on the kernel to optimise. Cute :-) It was intended to make it easier to optimise, but maybe it failed. > For example, my fsync "livelock" avoidance patches did the following: > > 1. find all pages which are dirty or under writeout first. > 2. write out the dirty pages. > 3. wait for our set of pages. > > Simple, obvious, and the kernel can optimise this well because the > userspace has asked for a high level request "make this data safe" > rather than low level directives. We can't do this same nice simple > sequence with sync_file_range because SYNC_FILE_RANGE_WAIT_AFTER > means we have to wait for all writeout pages in the range, including > unrelated ones, after the dirty writeout. SYNC_FILE_RANGE_WAIT_BEFORE > means we have to wait for clean writeout pages before we even start > doing real work. As noted in my other mail just now, although sync_file_range() is described as though it does the three bulk operations consecutively, I think it wouldn't be too shocking to think the intended semantics _could_ be: "wait and initiate writeouts _as if_ we did, for each page _in parallel_ { if (SYNC_FILE_RANGE_WAIT_BEFORE && page->writeout) wait(page) if (SYNC_FILE_RANGE_WRITE) start_writeout(page) if (SYNC_FILE_RANGE_WAIT_AFTER && page->writeout) wait(page) }" That permits many strategies, and I think one of them is the nice livelock-avoiding fsync you describe up above. You might be able to squeeze the sync_file_range() flags into that by chopping it up like this. Btw, you omitted step 1.5 "wait for dirty pages which are already under writeout", but it's made explicit here: 1. find all pages which are dirty or under writeout first, and remember which of them are dirty _and_ under writeout (DW). 2. if (SYNC_FILE_RANGE_WRITE) write out the dirty pages not in DW. 3. if (SYNC_FILE_RANGE_WAIT_BEFORE) { wait for the set of pages in DW. write out the pages in DW. } 4. if (SYNC_FILE_RANGE_WAIT_BEFORE || SYNC_FILE_RANGE_WAIT_AFTER) wait for our set of pages. However, maybe the flags aren't all that useful really, and maybe sync_file_range() could be replaced by a stub which ignores the flags and calls fsync_range(). -- Jamie ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [rfc] fsync_range? 2009-01-21 6:23 ` Nick Piggin 2009-01-21 12:02 ` Jamie Lokier @ 2009-01-21 12:13 ` Theodore Tso 2009-01-21 12:37 ` Jamie Lokier 1 sibling, 1 reply; 42+ messages in thread From: Theodore Tso @ 2009-01-21 12:13 UTC (permalink / raw) To: Nick Piggin; +Cc: Jamie Lokier, linux-fsdevel On Wed, Jan 21, 2009 at 07:23:06AM +0100, Nick Piggin wrote: > > > > But sync_file_range() has a bug, which you've pointed out - the > > missing _data-retrieval_ metadata isn't synced. In other words, it's > > completely useless. > > I don't know. I don't think this is a newly discovered problem. > I think it's been known for a while, so I don't know what's > going on. We should ask if anyone is actually using sync_file_range (cough, <Oracle>, cough, cough). But if I had to guess, for those people who are using it, they don't much care, because 99% of the time they are overwriting data blocks within a file which isn't changing in size, so there is no data-retrieval metadata to sync. That is, the database file is only rarely grown in size, and when they do that, they can either preallocate via pre-filling or via posix_fallocate(), and then follow it up with a normal fsync(); but most of the time, they aren't mucking with the data-retrieval metadata, so it simply isn't an issue for them.... It's not general purpose, but the question is whether or not any of the primary users of this interface require the more general-purpose functionality. - Ted ^ permalink raw reply [flat|nested] 42+ messages in thread
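As a concrete illustration of the pattern Ted describes - grow and preallocate once, fsync() once, then overwrite blocks in place so that later syncs touch no data-retrieval metadata - here is a small sketch; the file name and sizes are invented for the example.

#define _GNU_SOURCE
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
	char block[4096];
	int fd = open("table.db", O_RDWR | O_CREAT, 0600);	/* illustrative */

	if (fd < 0)
		return 1;

	/* Grow the file once, up front, and commit the new size and
	 * block mappings with a full fsync(). */
	if (posix_fallocate(fd, 0, 1 << 20) != 0 || fsync(fd) < 0)
		return 1;

	/* Steady state: overwrite existing blocks in place.  No
	 * data-retrieval metadata changes, so fdatasync() (or a range
	 * sync) only has to push the data blocks themselves. */
	memset(block, 0x5a, sizeof(block));
	if (pwrite(fd, block, sizeof(block), 64 * 4096) != (ssize_t)sizeof(block))
		return 1;
	if (fdatasync(fd) < 0)
		return 1;

	close(fd);
	return 0;
}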
* Re: [rfc] fsync_range? 2009-01-21 12:13 ` Theodore Tso @ 2009-01-21 12:37 ` Jamie Lokier 2009-01-21 14:12 ` Theodore Tso 0 siblings, 1 reply; 42+ messages in thread From: Jamie Lokier @ 2009-01-21 12:37 UTC (permalink / raw) To: Theodore Tso; +Cc: Nick Piggin, linux-fsdevel Theodore Tso wrote: > On Wed, Jan 21, 2009 at 07:23:06AM +0100, Nick Piggin wrote: > > > > > > But sync_file_range() has a bug, which you've pointed out - the > > > missing _data-retrieval_ metadata isn't synced. In other words, it's > > > completely useless. > > > > I don't know. I don't think this is a newly discovered problem. > > I think it's been known for a while, so I don't know what's > > going on. > > We should ask if anyone is actually using sync_file_range (cough, > <Oracle>, cough, cough). But if I had to guess, for those people who > are using it, they don't much care, because 99% of the time they are > overwriting data blocks within a file which isn't changing in size, so > there is no data-retrieval metadata to sync. What about btrfs with data checksums? Doesn't that count among data-retrieval metadata? What about nilfs, which always writes data to a new place? Etc. I'm wondering what exactly sync_file_range() definitely writes, and what it doesn't write. If it's just in use by Oracle, and nobody's sure what it does, that smacks of those secret APIs in Windows that made Word run a bit faster than everyone else's word processor... sort of. :-) -- Jamie ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [rfc] fsync_range? 2009-01-21 12:37 ` Jamie Lokier @ 2009-01-21 14:12 ` Theodore Tso 2009-01-21 14:35 ` Chris Mason 2009-01-22 21:18 ` Florian Weimer 0 siblings, 2 replies; 42+ messages in thread From: Theodore Tso @ 2009-01-21 14:12 UTC (permalink / raw) To: Jamie Lokier; +Cc: Nick Piggin, linux-fsdevel On Wed, Jan 21, 2009 at 12:37:11PM +0000, Jamie Lokier wrote: > > What about btrfs with data checksums? Doesn't that count among > data-retrieval metadata? What about nilfs, which always writes data > to a new place? Etc. > > I'm wondering what exactly sync_file_range() definitely writes, and > what it doesn't write. > > If it's just in use by Oracle, and nobody's sure what it does, that > smacks of those secret APIs in Windows that made Word run a bit faster > than everyone else's word processor... sort of. :-) Actually, I take that back; Oracle (and most other enterprise databases; the world is not just Oracle --- there's also DB2, for example) generally uses Direct I/O, so I wonder if they are using sync_file_range() at all. I do wonder though how well or poorly Oracle will work on btrfs, or indeed any filesystem that uses WAFL-like or log-structured filesystem-like algorithms. Most of the enterprise databases have been optimized for use on block devices and filesystems where you do write-in-place accesses; and some enterprise databases do their own data checksumming. So if I had to guess, I suspect the answer to the question I posed is "disastrously". :-) After all, such db's generally are happiest when the OS acts as a program loader that then gets the heck out of the way of the filesystem, hence their use of DIO. Which again brings me back to the question --- I wonder who is actually using sync_file_range, and what for? I would assume it is some database, most likely; so maybe we should check with MySQL or Postgres? - Ted ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [rfc] fsync_range? 2009-01-21 14:12 ` Theodore Tso @ 2009-01-21 14:35 ` Chris Mason 2009-01-21 15:58 ` Eric Sandeen 2009-01-21 20:41 ` Jamie Lokier 1 sibling, 2 replies; 42+ messages in thread From: Chris Mason @ 2009-01-21 14:35 UTC (permalink / raw) To: Theodore Tso; +Cc: Jamie Lokier, Nick Piggin, linux-fsdevel, Eric Sandeen On Wed, 2009-01-21 at 09:12 -0500, Theodore Tso wrote: > On Wed, Jan 21, 2009 at 12:37:11PM +0000, Jamie Lokier wrote: > > > > What about btrfs with data checksums? Doesn't that count among > > data-retrieval metadata? What about nilfs, which always writes data > > to a new place? Etc. > > > > I'm wondering what exactly sync_file_range() definitely writes, and > > what it doesn't write. > > > > If it's just in use by Oracle, and nobody's sure what it does, that > > smacks of those secret APIs in Windows that made Word run a bit faster > > than everyone else's word processor... sort of. :-) > > Actually, I take that back; Oracle (and most other enterprise > databases; the world is not just Oracle --- there's also DB2, for > example) generally uses Direct I/O, so I wonder if they are using > sync_file_range() at all. Usually if they don't use O_DIRECT, they use O_SYNC. > > I do wonder though how well or poorly Oracle will work on btrfs, or > indeed any filesystem that uses WAFL-like or log-structured > filesystem-like algorithms. Most of the enterprise databases have > been optimized for use on block devices and filesystems where you do > write-in-place accesses; and some enterprise databases do their own > data checksumming. So if I had to guess, I suspect the answer to the > question I posed is "disastrously". :-) Yes, I think btrfs' nodatacow option is pretty important for database use. > After all, such db's > generally are happiest when the OS acts as a program loader that then > gets the heck out of the way of the filesystem, hence their use of > DIO. > > Which again brings me back to the question --- I wonder who is > actually using sync_file_range, and what for? I would assume it is > some database, most likely; so maybe we should check with MySQL or > Postgres? Eric, didn't you have a magic script for grepping the sources/binaries in fedora for syscalls? -chris ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [rfc] fsync_range? 2009-01-21 14:35 ` Chris Mason @ 2009-01-21 15:58 ` Eric Sandeen 2009-01-21 20:41 ` Jamie Lokier 1 sibling, 0 replies; 42+ messages in thread From: Eric Sandeen @ 2009-01-21 15:58 UTC (permalink / raw) To: Chris Mason; +Cc: Theodore Tso, Jamie Lokier, Nick Piggin, linux-fsdevel Chris Mason wrote: ... >> Which again brings me back to the question --- I wonder who is >> actually using sync_file_range, and what for? I would assume it is >> some database, most likely; so maybe we should check with MySQL or >> Postgres? > > Eric, didn't you have a magic script for grepping the sources/binaries > in fedora for syscalls? > > -chris Yep (binaries) - http://sandeen.fedorapeople.org/utilities/summarise-stat64.pl Thanks to Greg Banks! I don't currently have an exploded fedora tree to run it over, but could do so after I un-busy myself again... -Eric ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [rfc] fsync_range? 2009-01-21 14:35 ` Chris Mason 2009-01-21 15:58 ` Eric Sandeen @ 2009-01-21 20:41 ` Jamie Lokier 2009-01-21 21:23 ` jim owens 1 sibling, 1 reply; 42+ messages in thread From: Jamie Lokier @ 2009-01-21 20:41 UTC (permalink / raw) To: Chris Mason; +Cc: Theodore Tso, Nick Piggin, linux-fsdevel, Eric Sandeen Chris Mason wrote: > On Wed, 2009-01-21 at 09:12 -0500, Theodore Tso wrote: > > On Wed, Jan 21, 2009 at 12:37:11PM +0000, Jamie Lokier wrote: > > > > > > What about btrfs with data checksums? Doesn't that count among > > > data-retrieval metadata? What about nilfs, which always writes data > > > to a new place? Etc. > > > > > > I'm wondering what exactly sync_file_range() definitely writes, and > > > what it doesn't write. > > > > > > If it's just in use by Oracle, and nobody's sure what it does, that > > > smacks of those secret APIs in Windows that made Word run a bit faster > > > than everyone else's word processor... sort of. :-) > > > > Actually, I take that back; Oracle (and most other enterprise > > databases; the world is not just Oracle --- there's also DB2, for > > example) generally uses Direct I/O, so I wonder if they are using > > sync_file_range() at all. > > Usually if they don't use O_DIRECT, they use O_SYNC. There's a case for using both together. An O_DIRECT write can convert to non-direct under some conditions. When that happens, you want the properties of O_SYNC. It is documented to happen on some other OSes - and maybe for VxFS on Linux. Linux is nicer than some other platforms in usually returning EINVAL for O_DIRECT I/O whose alignment isn't satisfactory, but it can still fall back to buffered I/O in some circumstances. I think current kernels do a sync in that case, but some earlier 2.6 kernels failed to. Oh, you'd use O_DSYNC instead of course... No point committing inode updates all the time, only size increases, and most OSes document that O_DSYNC does commit size increases. By the way, emulators/VMs like QEMU and KVM use much the same methods to access virtual disk images as databases do, for the same reasons. > > I do wonder though how well or poorly Oracle will work on btrfs, or > > indeed any filesystem that uses WAFL-like or log-structured > > filesystem-like algorithms. Most of the enterprise databases have > > been optimized for use on block devices and filesystems where you do > > write-in-place accesses; and some enterprise databases do their own > > data checksumming. So if I had to guess, I suspect the answer to the > > question I posed is "disastrously". :-) > > Yes, I think btrfs' nodatacow option is pretty important for database > use. Does O_DIRECT on btrfs still allocate new data blocks? That's not very direct :-) I'm thinking if O_DIRECT is set, considering what's likely to request it, it may be reasonable for it to mean "overwrite in place" too (except for files which are actually COW-shared with others of course). > > After all, such db's > > generally are happiest when the OS acts as a program loader that then > > gets the heck out of the way of the filesystem, hence their use of > > DIO. > > > > Which again brings me back to the question --- I wonder who is > > actually using sync_file_range, and what for? I would assume it is > > some database, most likely; so maybe we should check with MySQL or > > Postgres? > > Eric, didn't you have a magic script for grepping the sources/binaries > in fedora for syscalls? 
sync_file_range does not appear anywhere in db-4.7.25, mysql-dfsg-5.0.67, postgresql-8.3.5, or sqlite3-3.5.9 (on Ubuntu; presumably the same in other distros). -- Jamie ^ permalink raw reply [flat|nested] 42+ messages in thread
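A one-line illustration of the "use both together" point above: open with O_DIRECT plus O_DSYNC so that any write the kernel quietly turns into buffered I/O (or that extends the file) still gets committed. This is a sketch of intent only; glibc of that era defined O_DSYNC identically to O_SYNC, so the distinction only became real on later kernels.

#define _GNU_SOURCE
#include <fcntl.h>

/* Direct I/O for the common path, O_DSYNC as the safety net for any
 * write that falls back to buffered I/O. */
int open_db_file(const char *path)
{
	return open(path, O_RDWR | O_DIRECT | O_DSYNC);
}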
* Re: [rfc] fsync_range? 2009-01-21 20:41 ` Jamie Lokier @ 2009-01-21 21:23 ` jim owens 2009-01-21 21:59 ` Jamie Lokier 0 siblings, 1 reply; 42+ messages in thread From: jim owens @ 2009-01-21 21:23 UTC (permalink / raw) To: Jamie Lokier Cc: Chris Mason, Theodore Tso, Nick Piggin, linux-fsdevel, Eric Sandeen Jamie Lokier wrote: > > Does O_DIRECT on btrfs still allocate new data blocks? > That's not very direct :-) > > I'm thinking if O_DIRECT is set, considering what's likely to request > it, it may be reasonable for it to mean "overwrite in place" too > (except for files which are actually COW-shared with others of course). O_DIRECT for databases is to bypass the OS file data cache. Those (oracle) who have long experience with it on unix know that the physical storage location can change on a filesystem. I do not think we want to make a special case, it should be up to the db admin to choose cow/nocow because if they want SNAPSHOTS they need cow. jim ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [rfc] fsync_range? 2009-01-21 21:23 ` jim owens @ 2009-01-21 21:59 ` Jamie Lokier 2009-01-21 23:08 ` btrfs O_DIRECT was " jim owens 0 siblings, 1 reply; 42+ messages in thread From: Jamie Lokier @ 2009-01-21 21:59 UTC (permalink / raw) To: jim owens Cc: Chris Mason, Theodore Tso, Nick Piggin, linux-fsdevel, Eric Sandeen jim owens wrote: > Jamie Lokier wrote: > > > >Does O_DIRECT on btrfs still allocate new data blocks? > >That's not very direct :-) > > > >I'm thinking if O_DIRECT is set, considering what's likely to request > >it, it may be reasonable for it to mean "overwrite in place" too > >(except for files which are actually COW-shared with others of course). > > O_DIRECT for databases is to bypass the OS file data cache. > > Those (oracle) who have long experience with it on unix > know that the physical storage location can change on > a filesystem. > > I do not think we want to make a special case, > it should be up to the db admin to choose cow/nocow > because if they want SNAPSHOTS they need cow. SNAPSHOTS is what "except for files which are actually COW-shared with others of course" refers to. An option to "choose" to corrupt snapshots would be very silly. Writing in place or new-place on a *non-shared* (i.e. non-snapshotted) file is the choice which is useful. It's a filesystem implementation detail, not a semantic difference. I'm suggesting writing in place may do no harm and be more like the expected behaviour with programs that use O_DIRECT, which are usually databases. How about a btrfs mount option? in_place_write=never/always/direct_only. (Default direct_only). -- Jamie ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: btrfs O_DIRECT was [rfc] fsync_range? 2009-01-21 21:59 ` Jamie Lokier @ 2009-01-21 23:08 ` jim owens 2009-01-22 0:06 ` Jamie Lokier 0 siblings, 1 reply; 42+ messages in thread From: jim owens @ 2009-01-21 23:08 UTC (permalink / raw) To: Jamie Lokier; +Cc: Chris Mason, linux-fsdevel Jamie Lokier wrote: > > Writing in place or new-place on a *non-shared* (i.e. non-snapshotted) > file is the choice which is useful. It's a filesystem implementation > detail, not a semantic difference. I'm suggesting writing in place > may do no harm and be more like the expected behaviour with programs > that use O_DIRECT, which are usually databases. > > How about a btrfs mount option? > in_place_write=never/always/direct_only. (Default direct_only). The harm is creating a special guarantee for just one case of "don't move my data" based on a transient file open mode. What about defragmenting or moving the extent to another device for performance or for (failing) device removal? We are on a slippery slope for presumed expectations. jim ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: btrfs O_DIRECT was [rfc] fsync_range?
2009-01-21 23:08 ` btrfs O_DIRECT was " jim owens
@ 2009-01-22 0:06 ` Jamie Lokier
2009-01-22 13:50 ` jim owens
0 siblings, 1 reply; 42+ messages in thread
From: Jamie Lokier @ 2009-01-22 0:06 UTC (permalink / raw)
To: jim owens; +Cc: Chris Mason, linux-fsdevel

jim owens wrote:
> Jamie Lokier wrote:
> >
> >Writing in place or new-place on a *non-shared* (i.e. non-snapshotted)
> >file is the choice which is useful.  It's a filesystem implementation
> >detail, not a semantic difference.  I'm suggesting writing in place
> >may do no harm and be more like the expected behaviour with programs
> >that use O_DIRECT, which are usually databases.
> >
> >How about a btrfs mount option?
> >in_place_write=never/always/direct_only.  (Default direct_only).
>
> The harm is creating a special guarantee for just one case
> of "don't move my data" based on a transient file open mode.
>
> What about defragmenting or moving the extent to another
> device for performance or for (failing) device removal?
>
> We are on a slippery slope for presumed expectations.

Don't make it a guarantee, just a hint to filesystem write strategy.

It's ok to move data around when useful; we're not talking about a
hard requirement, but a performance knob.

The question is just what performance and fragmentation
characteristics do programs that use O_DIRECT have?

They are nearly all databases, filesystems-in-a-file, or virtual
machine disks.  I'm guessing virtually all of those _particular_
application programs would perform significantly differently with a
write-in-place strategy for most writes, although you'd still want
access to the bells and whistles of snapshots and COW and so on when
requested.

Note I said differently :-)  I'm not sure write-in-place performs
better for those sorts of applications.  It's just a guess.

Oracle probably has a really good idea how it performs on ZFS compared
with a block device (which is always in place) - and knows whether ZFS
does in-place writes with O_DIRECT or not.  Chris?

--
Jamie
* Re: btrfs O_DIRECT was [rfc] fsync_range?
2009-01-22 0:06 ` Jamie Lokier
@ 2009-01-22 13:50 ` jim owens
0 siblings, 0 replies; 42+ messages in thread
From: jim owens @ 2009-01-22 13:50 UTC (permalink / raw)
To: Jamie Lokier; +Cc: Chris Mason, linux-fsdevel

Jamie Lokier wrote:
> jim owens wrote:
>> Jamie Lokier wrote:
>>> Writing in place or new-place on a *non-shared* (i.e. non-snapshotted)
>>> file is the choice which is useful.  It's a filesystem implementation
>>> detail, not a semantic difference.  I'm suggesting writing in place
>>> may do no harm and be more like the expected behaviour with programs
>>> that use O_DIRECT, which are usually databases.
>>>
>>> How about a btrfs mount option?
>>> in_place_write=never/always/direct_only.  (Default direct_only).
>> The harm is creating a special guarantee for just one case
>> of "don't move my data" based on a transient file open mode.
>>
>> What about defragmenting or moving the extent to another
>> device for performance or for (failing) device removal?
>>
>> We are on a slippery slope for presumed expectations.
>
> Don't make it a guarantee, just a hint to filesystem write strategy.
>
> It's ok to move data around when useful; we're not talking about a
> hard requirement, but a performance knob.
>
> The question is just what performance and fragmentation
> characteristics do programs that use O_DIRECT have?
>
> They are nearly all databases, filesystems-in-a-file, or virtual
> machine disks.  I'm guessing virtually all of those _particular_
> application programs would perform significantly differently with a
> write-in-place strategy for most writes, although you'd still want
> access to the bells and whistles of snapshots and COW and so on when
> requested.
>
> Note I said differently :-)  I'm not sure write-in-place performs
> better for those sorts of applications.  It's just a guess.

I'm very certain that write-in-place performs much better than cow
because, as we all know, doing storage allocation is expensive.
So many databases preallocate their files.

> Oracle probably has a really good idea how it performs on ZFS compared
> with a block device (which is always in place) - and knows whether ZFS
> does in-place writes with O_DIRECT or not.  Chris?

We only disagree on how the rule for write-in-place is defined and,
more importantly, documented so it is easy to understand.

Btrfs allows each individual file to have "nodatacow" set as an
attribute.  That is an easy rule to document for the db admin.

Much easier than "if nothing else takes precedence to make it cow,
O_DIRECT will write-in-place".

jim
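A side note on the per-file attribute jim refers to: on kernels newer than the ones in this thread, the usual way for a program to mark a file nodatacow is the generic inode-flags ioctl, the same bit that chattr +C toggles. The sketch below assumes such a kernel and a btrfs filesystem; the path is made up, and the flag has to be set while the file is still empty (or on a directory, so new files inherit it).

/* Sketch (assumes a kernel newer than this 2009 thread): set the
 * NOCOW attribute on an empty file via FS_IOC_SETFLAGS. */
#include <fcntl.h>
#include <linux/fs.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>

int main(void)
{
	int fd = open("dbfile", O_RDONLY);
	int flags;

	if (fd < 0) {
		perror("open");
		return 1;
	}
	if (ioctl(fd, FS_IOC_GETFLAGS, &flags) == 0) {
		flags |= FS_NOCOW_FL;	/* the bit behind "nodatacow" */
		if (ioctl(fd, FS_IOC_SETFLAGS, &flags))
			perror("FS_IOC_SETFLAGS");
	} else {
		perror("FS_IOC_GETFLAGS");
	}
	close(fd);
	return 0;
}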
* Re: [rfc] fsync_range?
2009-01-21 14:12 ` Theodore Tso
2009-01-21 14:35 ` Chris Mason
@ 2009-01-22 21:18 ` Florian Weimer
2009-01-22 21:23 ` Florian Weimer
1 sibling, 1 reply; 42+ messages in thread
From: Florian Weimer @ 2009-01-22 21:18 UTC (permalink / raw)
To: Theodore Tso; +Cc: Jamie Lokier, Nick Piggin, linux-fsdevel

* Theodore Tso:

> Actually, I take that back; Oracle (and most other enterprise
> databases; the world is not just Oracle --- there's also DB2, for
> example) generally uses Direct I/O, so I wonder if they are using
> sync_file_range() at all.

Recent PostgreSQL might use it because it has got a single-threaded
background writer which benefits from non-blocking fsync().  I'll have
to check to be sure, though.
* Re: [rfc] fsync_range?
2009-01-22 21:18 ` Florian Weimer
@ 2009-01-22 21:23 ` Florian Weimer
0 siblings, 0 replies; 42+ messages in thread
From: Florian Weimer @ 2009-01-22 21:23 UTC (permalink / raw)
To: Theodore Tso; +Cc: Jamie Lokier, Nick Piggin, linux-fsdevel

* Florian Weimer:

> * Theodore Tso:
>
>> Actually, I take that back; Oracle (and most other enterprise
>> databases; the world is not just Oracle --- there's also DB2, for
>> example) generally uses Direct I/O, so I wonder if they are using
>> sync_file_range() at all.
>
> Recent PostgreSQL might use it because it has got a single-threaded
> background writer which benefits from non-blocking fsync().  I'll have
> to check to be sure, though.

Uhm, it doesn't.
* Re: [rfc] fsync_range?
2009-01-21 1:29 ` Nick Piggin
2009-01-21 3:15 ` Jamie Lokier
@ 2009-01-21 3:25 ` Jamie Lokier
2009-01-21 3:52 ` Nick Piggin
1 sibling, 1 reply; 42+ messages in thread
From: Jamie Lokier @ 2009-01-21 3:25 UTC (permalink / raw)
To: Nick Piggin; +Cc: linux-fsdevel

Nick Piggin wrote:
> > For database writes, you typically write a bunch of stuff in various
> > regions of a big file (or multiple files), then ideally fdatasync
> > some/all of the written ranges - with writes committed to disk in the
> > best order determined by the OS and I/O scheduler.
>
> Do you know which databases do this? It will be nice to ask their
> input and see whether it helps them (I presume it is an OSS database
> because the "big" ones just use direct IO and manage their own
> buffers, right?)

I just found this:

http://markmail.org/message/injyo7coein7o3xz
(Postgresql)

Tom Lane writes (on org.postgresql.pgsql-hackers):
>Greg Stark <gsst...@mit.edu> writes:
>> Come to think of it I wonder whether there's anything to be gained by
>> using smaller files for tables.  Instead of 1G files maybe 256M files
>> or something like that to reduce the hit of fsyncing a file.
>>
>> Actually probably not.  The weak part of our current approach is that
>> we tell the kernel "sync this file", then "sync that file", etc, in a
>> more or less random order.  This leads to a probably non-optimal
>> sequence of disk accesses to complete a checkpoint.  What we would
>> really like is a way to tell the kernel "sync all these files, and let
>> me know when you're done" --- then the kernel and hardware have some
>> shot at scheduling all the writes in an intelligent fashion.
>>
>> sync_file_range() is not that exactly, but since it lets you request
>> syncing and then go back and wait for the syncs later, we could get
>> the desired effect with two passes over the file list.  (If the file
>> list is longer than our allowed number of open files, though, the
>> extra opens/closes could hurt.)
>>
>> Smaller files would make the I/O scheduling problem worse not better.

So if you can make
commit-to-multiple-files-in-optimal-I/O-scheduling-order work, that
would be even better ;-)

Seems to me the Postgresql thing could be improved by issuing parallel
fdatasync() calls, each in their own thread.  Not optimal, exactly, but
more parallelism to schedule around.  (But limited by the I/O request
queue being full with big flushes, so potentially one fdatasync()
starving the others.)

--
Jamie
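For concreteness, the two-pass checkpoint flush Tom Lane describes maps onto sync_file_range() roughly as below. This is only a sketch: the fds[] array is assumed to be supplied by the caller, a zero nbytes means "from offset to end of file", and sync_file_range() makes no promises about metadata or the drive's write cache.

#define _GNU_SOURCE
#include <fcntl.h>

/* Pass 1 starts writeback on every file; pass 2 comes back and waits,
 * so the block layer gets to schedule all the writes together. */
int checkpoint_sync(int *fds, int nfds)
{
	int i, err = 0;

	for (i = 0; i < nfds; i++)
		if (sync_file_range(fds[i], 0, 0, SYNC_FILE_RANGE_WRITE))
			err = -1;

	for (i = 0; i < nfds; i++)
		if (sync_file_range(fds[i], 0, 0,
				    SYNC_FILE_RANGE_WAIT_BEFORE |
				    SYNC_FILE_RANGE_WRITE |
				    SYNC_FILE_RANGE_WAIT_AFTER))
			err = -1;

	return err;
}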
* Re: [rfc] fsync_range?
2009-01-21 3:25 ` Jamie Lokier
@ 2009-01-21 3:52 ` Nick Piggin
0 siblings, 0 replies; 42+ messages in thread
From: Nick Piggin @ 2009-01-21 3:52 UTC (permalink / raw)
To: Jamie Lokier; +Cc: linux-fsdevel

On Wed, Jan 21, 2009 at 03:25:20AM +0000, Jamie Lokier wrote:
> Nick Piggin wrote:
> > > For database writes, you typically write a bunch of stuff in various
> > > regions of a big file (or multiple files), then ideally fdatasync
> > > some/all of the written ranges - with writes committed to disk in the
> > > best order determined by the OS and I/O scheduler.
> >
> > Do you know which databases do this? It will be nice to ask their
> > input and see whether it helps them (I presume it is an OSS database
> > because the "big" ones just use direct IO and manage their own
> > buffers, right?)
>
> I just found this:
>
> http://markmail.org/message/injyo7coein7o3xz
> (Postgresql)
>
> Tom Lane writes (on org.postgresql.pgsql-hackers):
> >Greg Stark <gsst...@mit.edu> writes:
> >> Come to think of it I wonder whether there's anything to be gained by
> >> using smaller files for tables.  Instead of 1G files maybe 256M files
> >> or something like that to reduce the hit of fsyncing a file.
> >>
> >> Actually probably not.  The weak part of our current approach is that
> >> we tell the kernel "sync this file", then "sync that file", etc, in a
> >> more or less random order.  This leads to a probably non-optimal
> >> sequence of disk accesses to complete a checkpoint.  What we would
> >> really like is a way to tell the kernel "sync all these files, and let
> >> me know when you're done" --- then the kernel and hardware have some
> >> shot at scheduling all the writes in an intelligent fashion.
> >>
> >> sync_file_range() is not that exactly, but since it lets you request
> >> syncing and then go back and wait for the syncs later, we could get
> >> the desired effect with two passes over the file list.  (If the file
> >> list is longer than our allowed number of open files, though, the
> >> extra opens/closes could hurt.)
> >>
> >> Smaller files would make the I/O scheduling problem worse not better.

Interesting.

> So if you can make
> commit-to-multiple-files-in-optimal-I/O-scheduling-order work, that
> would be even better ;-)

fsyncv? Send multiple inode,range tuples to the kernel to sync.
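To make the "multiple fd,range tuples" suggestion concrete, here is one possible user-space view of such an interface. Everything in it is hypothetical: no fsyncv() syscall exists, and the struct layout, name, and semantics are invented purely for illustration.

#define _GNU_SOURCE
#include <sys/types.h>
#include <fcntl.h>	/* loff_t */

/* Hypothetical vectored sync: one entry per file range to flush. */
struct fsync_vec {
	int	fd;		/* file to sync */
	int	datasync;	/* non-zero: fdatasync-like behaviour */
	loff_t	start;		/* first byte of the range */
	loff_t	length;		/* 0 would mean "to end of file" */
};

/* Would submit writeback for all ranges at once, let the I/O
 * scheduler order them, and return when every range is durable. */
int fsyncv(const struct fsync_vec *vec, int count);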
Thread overview: 42+ messages (newest: ~2009-01-22 21:55 UTC)

2009-01-20 16:47 [rfc] fsync_range? Nick Piggin
2009-01-20 18:31 ` Jamie Lokier
2009-01-20 21:25 ` Bryan Henderson
2009-01-20 22:42 ` Jamie Lokier
2009-01-21 19:43 ` Bryan Henderson
2009-01-21 21:08 ` Jamie Lokier
2009-01-21 22:44 ` Bryan Henderson
2009-01-21 23:31 ` Jamie Lokier
2009-01-21 1:36 ` Nick Piggin
2009-01-21 19:58 ` Bryan Henderson
2009-01-21 20:53 ` Jamie Lokier
2009-01-21 22:14 ` Bryan Henderson
2009-01-21 22:30 ` Jamie Lokier
2009-01-22 1:52 ` Bryan Henderson
2009-01-22 3:41 ` Jamie Lokier
2009-01-21 1:29 ` Nick Piggin
2009-01-21 3:15 ` Jamie Lokier
2009-01-21 3:48 ` Nick Piggin
2009-01-21 5:24 ` Jamie Lokier
2009-01-21 6:16 ` Nick Piggin
2009-01-21 11:18 ` Jamie Lokier
2009-01-21 11:41 ` Nick Piggin
2009-01-21 12:09 ` Jamie Lokier
2009-01-21 4:16 ` Nick Piggin
2009-01-21 4:59 ` Jamie Lokier
2009-01-21 6:23 ` Nick Piggin
2009-01-21 12:02 ` Jamie Lokier
2009-01-21 12:13 ` Theodore Tso
2009-01-21 12:37 ` Jamie Lokier
2009-01-21 14:12 ` Theodore Tso
2009-01-21 14:35 ` Chris Mason
2009-01-21 15:58 ` Eric Sandeen
2009-01-21 20:41 ` Jamie Lokier
2009-01-21 21:23 ` jim owens
2009-01-21 21:59 ` Jamie Lokier
2009-01-21 23:08 ` btrfs O_DIRECT was " jim owens
2009-01-22 0:06 ` Jamie Lokier
2009-01-22 13:50 ` jim owens
2009-01-22 21:18 ` Florian Weimer
2009-01-22 21:23 ` Florian Weimer
2009-01-21 3:25 ` Jamie Lokier
2009-01-21 3:52 ` Nick Piggin