From: Christoph Hellwig <hch@infradead.org>
To: Jamie Lokier <jamie@shareable.org>
Cc: Christoph Hellwig <hch@infradead.org>,
"Michael Kerrisk (man-pages)" <mtk.manpages@gmail.com>,
Heinrich Schuchardt <xypron.glpk@gmx.de>,
linux-man@vger.kernel.org, Dave Chinner <david@fromorbit.com>,
Theodore T'so <tytso@mit.edu>,
Linux-Fsdevel <linux-fsdevel@vger.kernel.org>,
Miklos Szeredi <miklos@szeredi.hu>
Subject: [PATCH] fsync_range, was: Re: munmap, msync: synchronization
Date: Tue, 22 Apr 2014 02:28:37 -0700 [thread overview]
Message-ID: <20140422092837.GA6191@infradead.org> (raw)
In-Reply-To: <20140422070421.GI30215@jl-vm1.vm.bytemark.co.uk>
[-- Attachment #1: Type: text/plain, Size: 2198 bytes --]
On Tue, Apr 22, 2014 at 08:04:21AM +0100, Jamie Lokier wrote:
> Hi Christoph,
>
> Hardly research, I just did a quick Google and was surprised to find
> some results. AIX API differs from the BSDs; the BSDs seem to agree
> with each other. fsync_range(), with a flag parameter saying what type
> of sync, and whether it flushes the storage device write cache as well
> (because they couldn't agree that was good - similar to the barriers
> debate).
There is no FreeBSD implementation, I think you were confused by FreeBSD
also hosting NetBSD man pages on their site, just as I initially was.
The APIs are mostly the same, except that AIX reuses O_ flags as
argument and NetBSD has a separate namespace. Following the latter
seems more sensible, and also allows developer to define the separate
name to the O_ flag for portability.
> As for me doing it, no, sorry, I haven't touched the kernel in a few
> years, life's been complicated for non-technical reasons, and I don't
> have time to get back into it now.
I've cooked up a patch, but I really need someone to test it and promote
it. Find the patch attached. There are two differences to the NetBSD
one:
1) It doesn't fail for read-only FDs. fsync doesn't, and while
standards used to have fdatasync and aio_fsync fail for them,
Linux never did and the standards are catching up:
http://austingroupbugs.net/view.php?id=501
http://austingroupbugs.net/view.php?id=671
2) I don't implement the FDISKSYNC. Requiring it is utterly broken,
and we wouldn't even have the infrastructure for it. It might make
sense to provide it defined to 0 so that we have the identifier but
make it a no-op.
> In the kernel, I was always under the impression the simple part of
> fsync_range - writing out data pages - was solved years ago, but being
> sure the filesystem's updated its metadata in the proper way, that
> begs for a little research into what filesystems do when asked,
> doesn't it?
The filesystems I care about handle it fine, and while I don't know
the details of others they better handle it properly, given that we
use vfs_fsync_range to implement O_SNYC/O_DSYNC writes and commits
from the nfs server.
[-- Attachment #2: 0001-fs-implement-fsync_range.patch --]
[-- Type: text/plain, Size: 5898 bytes --]
>From b63881cac84b35ce3d6a61a33e33ac795a5c583c Mon Sep 17 00:00:00 2001
From: Christoph Hellwig <hch@lst.de>
Date: Tue, 22 Apr 2014 11:24:51 +0200
Subject: fs: implement fsync_range
Implement a fsync/fdatasync variant that takes a range to sync. This follow the
NetBSD implementation:
http://www.freebsd.org/cgi/man.cgi?query=fsync&apropos=0&sektion=0&manpath=NetBSD+5.0&format=html
and is fairly close the AIX implementation that the NetBSD one is based on:
http://publib.boulder.ibm.com/infocenter/aix/v6r1/index.jsp?topic=%2Fcom.ibm.aix.basetechref%2Fdoc%2Fbasetrf1%2Ffsync.htm
The implementation is very simple because the VFS already offers a ranged
fsync infrastrucute, which is most prominently used to implement O_SYNC
and O_DSYNC writes.
Differences from NetBSD are:
1) It doesn't fail for read-only FDs. fsync doesn't, and while standards
used require fdatasync and aio_fsync to fail for read-only file
descriptors Linux never did and the standards are catching up:
http://austingroupbugs.net/view.php?id=501
http://austingroupbugs.net/view.php?id=671
2) It doesn't implement the FDISKSYNC. Requiring a flag to actuall make
data persistant is completely broken, and the Linux infrastructure
doesn't support it anyway. We could provide it as a no-op if we
really need to.
Signed-off-by: Christoph Hellwig <hch@lst.de>
---
arch/x86/syscalls/syscall_32.tbl | 1 +
arch/x86/syscalls/syscall_64.tbl | 1 +
fs/sync.c | 101 ++++++++++++++++++++++++--------------
include/uapi/linux/fs.h | 6 ++-
4 files changed, 72 insertions(+), 37 deletions(-)
diff --git a/arch/x86/syscalls/syscall_32.tbl b/arch/x86/syscalls/syscall_32.tbl
index d6b8679..e239d46 100644
--- a/arch/x86/syscalls/syscall_32.tbl
+++ b/arch/x86/syscalls/syscall_32.tbl
@@ -360,3 +360,4 @@
351 i386 sched_setattr sys_sched_setattr
352 i386 sched_getattr sys_sched_getattr
353 i386 renameat2 sys_renameat2
+354 i386 fsync_range sys_fsync_range
diff --git a/arch/x86/syscalls/syscall_64.tbl b/arch/x86/syscalls/syscall_64.tbl
index 04376ac..006d57f 100644
--- a/arch/x86/syscalls/syscall_64.tbl
+++ b/arch/x86/syscalls/syscall_64.tbl
@@ -323,6 +323,7 @@
314 common sched_setattr sys_sched_setattr
315 common sched_getattr sys_sched_getattr
316 common renameat2 sys_renameat2
+317 common fsync_range sys_fsync_range
#
# x32-specific system call numbers start at 512 to avoid cache impact
diff --git a/fs/sync.c b/fs/sync.c
index b28d1dd..58f9ca7 100644
--- a/fs/sync.c
+++ b/fs/sync.c
@@ -197,13 +197,13 @@ int vfs_fsync(struct file *file, int datasync)
}
EXPORT_SYMBOL(vfs_fsync);
-static int do_fsync(unsigned int fd, int datasync)
+static int do_fsync(unsigned int fd, loff_t start, loff_t end, int datasync)
{
struct fd f = fdget(fd);
int ret = -EBADF;
if (f.file) {
- ret = vfs_fsync(f.file, datasync);
+ ret = vfs_fsync_range(f.file, start, end, datasync);
fdput(f);
}
return ret;
@@ -211,12 +211,69 @@ static int do_fsync(unsigned int fd, int datasync)
SYSCALL_DEFINE1(fsync, unsigned int, fd)
{
- return do_fsync(fd, 0);
+ return do_fsync(fd, 0, LLONG_MAX, 0);
}
SYSCALL_DEFINE1(fdatasync, unsigned int, fd)
{
- return do_fsync(fd, 1);
+ return do_fsync(fd, 0, LLONG_MAX, 1);
+}
+
+static loff_t end_offset(loff_t offset, loff_t nbytes)
+{
+ loff_t endbyte = offset + nbytes;
+
+ if ((s64)offset < 0)
+ return -EINVAL;
+ if ((s64)endbyte < 0)
+ return -EINVAL;
+ if (endbyte < offset)
+ return -EINVAL;
+
+ if (sizeof(pgoff_t) == 4) {
+ if (offset >= (0x100000000ULL << PAGE_CACHE_SHIFT)) {
+ /*
+ * The range starts outside a 32 bit machine's
+ * pagecache addressing capabilities. Let it "succeed"
+ */
+ return 0;
+ }
+ if (endbyte >= (0x100000000ULL << PAGE_CACHE_SHIFT)) {
+ /*
+ * Out to EOF
+ */
+ return LLONG_MAX;
+ }
+ }
+
+ if (nbytes == 0)
+ endbyte = LLONG_MAX;
+ else
+ endbyte--; /* inclusive */
+
+ return endbyte;
+}
+
+SYSCALL_DEFINE4(fsync_range, unsigned int, fd, int, how,
+ loff_t, start, loff_t, length)
+{
+ int datasync = 0;
+ loff_t end;
+
+ switch (how) {
+ case FDATASYNC:
+ datasync = 1;
+ break;
+ case FFILESYNC:
+ break;
+ default:
+ return -EINVAL;
+ }
+
+ end = end_offset(start, length);
+ if (end <= 0)
+ return end;
+ return do_fsync(fd, start, end, datasync);
}
/*
@@ -275,40 +332,12 @@ SYSCALL_DEFINE4(sync_file_range, int, fd, loff_t, offset, loff_t, nbytes,
loff_t endbyte; /* inclusive */
umode_t i_mode;
- ret = -EINVAL;
if (flags & ~VALID_FLAGS)
- goto out;
-
- endbyte = offset + nbytes;
-
- if ((s64)offset < 0)
- goto out;
- if ((s64)endbyte < 0)
- goto out;
- if (endbyte < offset)
- goto out;
-
- if (sizeof(pgoff_t) == 4) {
- if (offset >= (0x100000000ULL << PAGE_CACHE_SHIFT)) {
- /*
- * The range starts outside a 32 bit machine's
- * pagecache addressing capabilities. Let it "succeed"
- */
- ret = 0;
- goto out;
- }
- if (endbyte >= (0x100000000ULL << PAGE_CACHE_SHIFT)) {
- /*
- * Out to EOF
- */
- nbytes = 0;
- }
- }
+ return -EINVAL;
- if (nbytes == 0)
- endbyte = LLONG_MAX;
- else
- endbyte--; /* inclusive */
+ endbyte = end_offset(offset, nbytes);
+ if (endbyte <= 0)
+ return endbyte;
ret = -EBADF;
f = fdget(fd);
diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
index ca1a11b..491d9fe 100644
--- a/include/uapi/linux/fs.h
+++ b/include/uapi/linux/fs.h
@@ -199,9 +199,13 @@ struct inodes_stat_t {
#define FS_FL_USER_VISIBLE 0x0003DFFF /* User visible flags */
#define FS_FL_USER_MODIFIABLE 0x000380FF /* User modifiable flags */
-
+/* flags for sync_file_range */
#define SYNC_FILE_RANGE_WAIT_BEFORE 1
#define SYNC_FILE_RANGE_WRITE 2
#define SYNC_FILE_RANGE_WAIT_AFTER 4
+/* flags for fsync_range */
+#define FDATASYNC 0x0010
+#define FFILESYNC 0x0020
+
#endif /* _UAPI_LINUX_FS_H */
--
1.7.10.4
next prev parent reply other threads:[~2014-04-22 9:28 UTC|newest]
Thread overview: 17+ messages / expand[flat|nested] mbox.gz Atom feed top
[not found] <5353A158.9050009@gmx.de>
2014-04-21 10:16 ` munmap, msync: synchronization Michael Kerrisk (man-pages)
[not found] ` <5354F00E.8050609-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
2014-04-21 18:14 ` Christoph Hellwig
2014-04-21 19:54 ` Michael Kerrisk (man-pages)
2014-04-21 21:34 ` Jamie Lokier
[not found] ` <20140421213418.GH30215-DqlFc3psUjeg7Qil/0GVWOc42C6kRsbE@public.gmane.org>
2014-04-22 6:03 ` Christoph Hellwig
2014-04-22 7:04 ` Jamie Lokier
2014-04-22 9:28 ` Christoph Hellwig [this message]
2014-04-23 14:33 ` [PATCH] fsync_range, was: " Michael Kerrisk (man-pages)
2014-04-23 15:45 ` Christoph Hellwig
[not found] ` <20140423154550.GA21014-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org>
2014-04-23 22:20 ` Jamie Lokier
[not found] ` <20140423222011.GM30215-DqlFc3psUjeg7Qil/0GVWOc42C6kRsbE@public.gmane.org>
2014-04-25 6:07 ` Christoph Hellwig
2014-04-24 9:34 ` Michael Kerrisk (man-pages)
[not found] ` <20140422092837.GA6191-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org>
2014-04-23 22:15 ` Jamie Lokier
[not found] ` <20140423221402.GL30215-DqlFc3psUjeg7Qil/0GVWOc42C6kRsbE@public.gmane.org>
2014-04-25 6:26 ` Christoph Hellwig
2014-04-24 1:34 ` Dave Chinner
2014-04-25 6:06 ` Christoph Hellwig
2014-04-23 14:03 ` Matthew Wilcox
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20140422092837.GA6191@infradead.org \
--to=hch@infradead.org \
--cc=david@fromorbit.com \
--cc=jamie@shareable.org \
--cc=linux-fsdevel@vger.kernel.org \
--cc=linux-man@vger.kernel.org \
--cc=miklos@szeredi.hu \
--cc=mtk.manpages@gmail.com \
--cc=tytso@mit.edu \
--cc=xypron.glpk@gmx.de \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).