* Re: munmap, msync: synchronization
       [not found] <5353A158.9050009@gmx.de>
@ 2014-04-21 10:16 ` Michael Kerrisk (man-pages)
       [not found]   ` <5354F00E.8050609-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
  0 siblings, 1 reply; 17+ messages in thread
From: Michael Kerrisk (man-pages) @ 2014-04-21 10:16 UTC
To: Heinrich Schuchardt, linux-man
Cc: mtk.manpages, Christoph Hellwig, Dave Chinner, Theodore T'so, Linux-Fsdevel, Miklos Szeredi, jamie

[CCing a few people who may correct my errors; perhaps there are some
improvements that are needed for the mmap() and msync() man pages]

Hello Heinrich,

On 04/20/2014 12:28 PM, Heinrich Schuchardt wrote:
> Hello Michael,
>
> when analyzing how the fanotify API interacts with mmap(2) I stumbled
> over the following issues in the manpages:
>
>
> The manpage of msync(2) says:
> "msync() flushes changes made to the in-core copy of a file that was
> mapped into memory using mmap(2) back to disk."
>
> "back to disk" implies that the file system is forced to actually write
> to the hard disk, somewhat equivalent to invoking sync(1). Is that
> guaranteed for all file systems?
>
> Not all file systems are necessarily disk based (e.g. davfs, tmpfs).
>
> So shouldn't we write:
> "... back to the file system."

Yes, that seems better to me. Done.

> http://pubs.opengroup.org/onlinepubs/007904875/functions/msync.html
> says
> "... to permanent storage locations, if any,"
>
>
> The manpage of munmap(2) leaves it unclear, if copying back to the
> filesystem is synchronous or asynchronous.

In fact, the page says nearly nothing about whether it synchs at all.
That is (I think) more or less deliberate. See below.

> This bit of information is important, because, if munmap is
> asynchronous, applications might want to call msync(,,MS_SYNC), before
> calling munmap. If munmap is synchronous it might block until the file
> system responds (think of waiting for a tape to be loaded, or a webdav
> server to respond).
>
>
> What happens to an unfinished prior asynchronous update by
> msync(,,MS_ASYNC) when munmap is called?

I believe the answer is: On Linux, nothing special; the asynchronous
update will still be done. (I'm not sure that anything needs to be said
in the man page... But, if you have a good argument about why something
should be said, I'm open to hearing it.)

> Will munmap "invalidate other mappings of the same file (so that they
> can be updated with the fresh values just written)" like
> msync(,,MS_INVALIDATE) does?

I don't believe there's any requirement that it does. (Again, I'm not
sure that anything needs to be said in the man page... But, if you have
a good argument...)

So, here's how things are as I understand them.

1. In the bad old days (even on Linux, AFAIK, but that was in days
   before I looked closely at what goes on), the page cache and
   the buffer cache were not unified. That meant that a page from
   a file might both be in the buffer cache (because of file I/O
   syscalls) and in the page cache (because of mmap()).

2. In a non-unified cache system, pages can naturally get out of
   synch in the two locations. Before it had a unified cache, Linux
   used to jump through some hoops to ensure that contents in the
   two locations remained consistent.

3. Nowadays Linux--like most (all?) UNIX systems--has a
   unified cache: file I/O, mmap(), and the paging system all
   use the same cache. If a file is mmap()-ed and also subject
   to file I/O, there will be only one copy of each file page
   in the cache. Ergo, the inconsistency problem goes away.

4. IIUC, the pieces like msync(MS_ASYNC) and msync(MS_INVALIDATE)
   exist only because of the bad old non-unified cache days.
   MS_INVALIDATE was a way of saying: make sure that writes
   to the file by other processes are visible in this mapping.
   msync() without the MS_INVALIDATE flag was a way of saying:
   make sure that read()s from the file see the changes made
   via this mapping. Using either MS_SYNC or MS_ASYNC
   was the way of saying: "I either want to wait until the file
   updates have been completed", or "please start the updates
   now, but I don't want to wait until they're completed".

5. On systems with a unified cache, msync(MS_INVALIDATE)
   is a no-op. (That is so on Linux.)

6. On Linux, MS_ASYNC is also a no-op. That's fine on a unified
   cache system. Filesystem I/O always sees a consistent view,
   and MS_ASYNC never undertook to give a guarantee about *when*
   the update would occur. (The Linux buffer cache logic will
   ensure that it is flushed out sometime in the near future.)

7. On Linux (and probably many other modern systems), the only
   call that has any real use is msync(MS_SYNC), meaning
   "flush the buffers *now*, and I want to wait for that to
   complete, so that I can then continue safe in the knowledge
   that my data has landed on a device". That's useful if we
   want insurance for our data in the event of a system crash.

8. POSIX makes no mandate for a unified cache system. Thus,
   we have MS_ASYNC and MS_INVALIDATE in the standard, and
   the standard says nothing (AFAIK) about whether munmap()
   will flush data. On Linux (and probably most modern systems),
   we're fine, but portable applications that care about
   standards and nonunified caches need to use msync().

   My advice: To ensure that the contents of a shared file
   mapping are written to the underlying file--even on bad old
   implementations--a call to msync() should be made before
   unmapping a mapping with munmap().

9. The mmap() man page says this:

       MAP_SHARED
           Share this mapping. Updates to the mapping are
           visible to other processes that map this file, and
           are carried through to the underlying file. The file
           may not actually be updated until msync(2) or
           munmap() is called.

   I believe the piece "or munmap()" is misleading. It implies
   that munmap() must trigger a sync action. I don't think this
   is true. All that it is required to do is remove some range
   of pages from the process's virtual address space. I'm
   inclined to remove those words, but I'd like to see if any
   FS person has a correction to my understanding first.

Cheers,

Michael

--
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
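A minimal sketch of the advice in point 8 above, assuming a POSIX system. The helper name write_via_mapping and its parameters are hypothetical and do not come from the thread. It stores data through a MAP_SHARED mapping and calls msync(MS_SYNC) before munmap(), so it does not depend on munmap() flushing anything:

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper (not from the thread): resize the file `path` to
 * `len` bytes, store `data` into it through a MAP_SHARED mapping, and
 * call msync(MS_SYNC) before munmap().  `len` must be greater than zero.
 * Returns 0 on success, -1 on failure (errno is set by the failing call).
 */
int write_via_mapping(const char *path, const char *data, size_t len)
{
	int fd = open(path, O_RDWR | O_CREAT, 0600);
	if (fd < 0)
		return -1;

	/* The file must be at least as long as the range we store to. */
	if (ftruncate(fd, (off_t)len) < 0) {
		close(fd);
		return -1;
	}

	char *map = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (map == MAP_FAILED) {
		close(fd);
		return -1;
	}

	memcpy(map, data, len);

	/* Do not rely on munmap() to push the data to the file. */
	int rc = msync(map, len, MS_SYNC);

	if (munmap(map, len) < 0)
		rc = -1;
	if (close(fd) < 0)
		rc = -1;
	return rc;
}
```

On Linux the msync() call mainly buys durability across a crash; on an implementation without a unified cache it would also be what makes the data visible to read().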
[parent not found: <5354F00E.8050609-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>]
* Re: munmap, msync: synchronization
       [not found] ` <5354F00E.8050609-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
@ 2014-04-21 18:14   ` Christoph Hellwig
  2014-04-21 19:54     ` Michael Kerrisk (man-pages)
  2014-04-23 14:03     ` Matthew Wilcox
  0 siblings, 2 replies; 17+ messages in thread
From: Christoph Hellwig @ 2014-04-21 18:14 UTC
To: Michael Kerrisk (man-pages)
Cc: Heinrich Schuchardt, linux-man-u79uwXL29TY76Z2rM5mHXA, Christoph Hellwig, Dave Chinner, Theodore T'so, Linux-Fsdevel, Miklos Szeredi, jamie-yetKDKU6eevNLxjTenLetw

On Mon, Apr 21, 2014 at 12:16:46PM +0200, Michael Kerrisk (man-pages) wrote:
> 1. In the bad old days (even on Linux, AFAIK, but that was in days
>    before I looked closely at what goes on), the page cache and
>    the buffer cache were not unified. That meant that a page from
>    a file might both be in the buffer cache (because of file I/O
>    syscalls) and in the page cache (because of mmap()).

Correct.

> 2. In a non-unified cache system, pages can naturally get out of
>    synch in the two locations. Before it had a unified cache, Linux
>    used to jump through some hoops to ensure that contents in the
>    two locations remained consistent.

Yeah.

> 3. Nowadays Linux--like most (all?) UNIX systems--has a
>    unified cache: file I/O, mmap(), and the paging system all
>    use the same cache. If a file is mmap()-ed and also subject
>    to file I/O, there will be only one copy of each file page
>    in the cache. Ergo, the inconsistency problem goes away.

Mostly true, except for FreeBSD and Solaris when they use ZFS, which has
its own file cache that is not coherent with the VM cache at the
implementation level. Not sure how much of this leaks to userspace,
though.

> 4. IIUC, the pieces like msync(MS_ASYNC) and msync(MS_INVALIDATE)
>    exist only because of the bad old non-unified cache days.
>    MS_INVALIDATE was a way of saying: make sure that writes
>    to the file by other processes are visible in this mapping.
>    msync() without the MS_INVALIDATE flag was a way of saying:
>    make sure that read()s from the file see the changes made
>    via this mapping. Using either MS_SYNC or MS_ASYNC
>    was the way of saying: "I either want to wait until the file
>    updates have been completed", or "please start the updates
>    now, but I don't want to wait until they're completed".

Right.

> 5. On systems with a unified cache, msync(MS_INVALIDATE)
>    is a no-op. (That is so on Linux.)

Almost. It returns EBUSY if it hits any mlock()ed region. Don't ask me
why, though.

> 6. On Linux, MS_ASYNC is also a no-op. That's fine on a unified
>    cache system. Filesystem I/O always sees a consistent view,
>    and MS_ASYNC never undertook to give a guarantee about *when*
>    the update would occur. (The Linux buffer cache logic will
>    ensure that it is flushed out sometime in the near future.)

Right. It's a fairly inefficient no-op, though - it actually loops
over all vmas to do nothing with them.

> 7. On Linux (and probably many other modern systems), the only
>    call that has any real use is msync(MS_SYNC), meaning
>    "flush the buffers *now*, and I want to wait for that to
>    complete, so that I can then continue safe in the knowledge
>    that my data has landed on a device". That's useful if we
>    want insurance for our data in the event of a system crash.

Right. It's basically another way to call fsync, which is used to
implement it underneath. It actually should be a ranged-fdatasync,
but right now it's implemented horribly inefficiently in that it
does an fsync call for each vma that it encounters in the range
specified.

> 8. POSIX makes no mandate for a unified cache system. Thus,
>    we have MS_ASYNC and MS_INVALIDATE in the standard, and
>    the standard says nothing (AFAIK) about whether munmap()
>    will flush data. On Linux (and probably most modern systems),
>    we're fine, but portable applications that care about
>    standards and nonunified caches need to use msync().
>
>    My advice: To ensure that the contents of a shared file
>    mapping are written to the underlying file--even on bad old
>    implementations--a call to msync() should be made before
>    unmapping a mapping with munmap().

Agreed.

> 9. The mmap() man page says this:
>
>        MAP_SHARED
>            Share this mapping. Updates to the mapping are
>            visible to other processes that map this file, and
>            are carried through to the underlying file. The file
>            may not actually be updated until msync(2) or
>            munmap() is called.
>
>    I believe the piece "or munmap()" is misleading. It implies
>    that munmap() must trigger a sync action. I don't think this
>    is true. All that it is required to do is remove some range
>    of pages from the process's virtual address space. I'm
>    inclined to remove those words, but I'd like to see if any
>    FS person has a correction to my understanding first.

I would expect non-coherent systems to update their caches on munmap,
but POSIX does not seem to require this, and I can't find any language
towards that in the HP-UX man page, which was a system that I remember
as non-coherent until the end.
* Re: munmap, msync: synchronization
  2014-04-21 18:14   ` Christoph Hellwig
@ 2014-04-21 19:54     ` Michael Kerrisk (man-pages)
  2014-04-21 21:34       ` Jamie Lokier
  2014-04-23 14:03     ` Matthew Wilcox
  1 sibling, 1 reply; 17+ messages in thread
From: Michael Kerrisk (man-pages) @ 2014-04-21 19:54 UTC
To: Christoph Hellwig
Cc: mtk.manpages, Heinrich Schuchardt, linux-man, Dave Chinner, Theodore T'so, Linux-Fsdevel, Miklos Szeredi, jamie

Christoph,

On 04/21/2014 08:14 PM, Christoph Hellwig wrote:
> On Mon, Apr 21, 2014 at 12:16:46PM +0200, Michael Kerrisk (man-pages) wrote:
>> 1. In the bad old days (even on Linux, AFAIK, but that was in days
>>    before I looked closely at what goes on), the page cache and
>>    the buffer cache were not unified. That meant that a page from
>>    a file might both be in the buffer cache (because of file I/O
>>    syscalls) and in the page cache (because of mmap()).
>
> Correct.
>
>> 2. In a non-unified cache system, pages can naturally get out of
>>    synch in the two locations. Before it had a unified cache, Linux
>>    used to jump through some hoops to ensure that contents in the
>>    two locations remained consistent.
>
> Yeah.
>
>> 3. Nowadays Linux--like most (all?) UNIX systems--has a
>>    unified cache: file I/O, mmap(), and the paging system all
>>    use the same cache. If a file is mmap()-ed and also subject
>>    to file I/O, there will be only one copy of each file page
>>    in the cache. Ergo, the inconsistency problem goes away.
>
> Mostly true, except for FreeBSD and Solaris when they use ZFS, which has
> its own file cache that is not coherent with the VM cache at the
> implementation level. Not sure how much of this leaks to userspace,
> though.

Thanks for that detail.

>> 4. IIUC, the pieces like msync(MS_ASYNC) and msync(MS_INVALIDATE)
>>    exist only because of the bad old non-unified cache days.
>>    MS_INVALIDATE was a way of saying: make sure that writes
>>    to the file by other processes are visible in this mapping.
>>    msync() without the MS_INVALIDATE flag was a way of saying:
>>    make sure that read()s from the file see the changes made
>>    via this mapping. Using either MS_SYNC or MS_ASYNC
>>    was the way of saying: "I either want to wait until the file
>>    updates have been completed", or "please start the updates
>>    now, but I don't want to wait until they're completed".
>
> Right.
>
>> 5. On systems with a unified cache, msync(MS_INVALIDATE)
>>    is a no-op. (That is so on Linux.)
>
> Almost. It returns EBUSY if it hits any mlock()ed region. Don't ask me
> why, though.

Ahhh yes, I was aware of that detail, but overlooked it in the point
above.

>> 6. On Linux, MS_ASYNC is also a no-op. That's fine on a unified
>>    cache system. Filesystem I/O always sees a consistent view,
>>    and MS_ASYNC never undertook to give a guarantee about *when*
>>    the update would occur. (The Linux buffer cache logic will
>>    ensure that it is flushed out sometime in the near future.)
>
> Right. It's a fairly inefficient no-op, though - it actually loops
> over all vmas to do nothing with them.
>
>> 7. On Linux (and probably many other modern systems), the only
>>    call that has any real use is msync(MS_SYNC), meaning
>>    "flush the buffers *now*, and I want to wait for that to
>>    complete, so that I can then continue safe in the knowledge
>>    that my data has landed on a device". That's useful if we
>>    want insurance for our data in the event of a system crash.
>
> Right. It's basically another way to call fsync, which is used to
> implement it underneath. It actually should be a ranged-fdatasync,
> but right now it's implemented horribly inefficiently in that it
> does an fsync call for each vma that it encounters in the range
> specified.
>
>> 8. POSIX makes no mandate for a unified cache system. Thus,
>>    we have MS_ASYNC and MS_INVALIDATE in the standard, and
>>    the standard says nothing (AFAIK) about whether munmap()
>>    will flush data. On Linux (and probably most modern systems),
>>    we're fine, but portable applications that care about
>>    standards and nonunified caches need to use msync().
>>
>>    My advice: To ensure that the contents of a shared file
>>    mapping are written to the underlying file--even on bad old
>>    implementations--a call to msync() should be made before
>>    unmapping a mapping with munmap().
>
> Agreed.

Thanks for checking all of this over and thanks also
for confirming that I learned my lessons well in the
"Jamie Lokier school of tough technical reviewing" ;-).

>> 9. The mmap() man page says this:
>>
>>        MAP_SHARED
>>            Share this mapping. Updates to the mapping are
>>            visible to other processes that map this file, and
>>            are carried through to the underlying file. The file
>>            may not actually be updated until msync(2) or
>>            munmap() is called.
>>
>>    I believe the piece "or munmap()" is misleading. It implies
>>    that munmap() must trigger a sync action. I don't think this
>>    is true. All that it is required to do is remove some range
>>    of pages from the process's virtual address space. I'm
>>    inclined to remove those words, but I'd like to see if any
>>    FS person has a correction to my understanding first.
>
> I would expect non-coherent systems to update their caches on munmap,
> but POSIX does not seem to require this, and I can't find any language
> towards that in the HP-UX man page, which was a system that I remember
> as non-coherent until the end.

Yes, that's how I read it too. POSIX seems to have no requirements here,
so I assume it was catering to the lowest common denominator.

Cheers,

Michael

--
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/
* Re: munmap, msync: synchronization
  2014-04-21 19:54     ` Michael Kerrisk (man-pages)
@ 2014-04-21 21:34       ` Jamie Lokier
       [not found]         ` <20140421213418.GH30215-DqlFc3psUjeg7Qil/0GVWOc42C6kRsbE@public.gmane.org>
  0 siblings, 1 reply; 17+ messages in thread
From: Jamie Lokier @ 2014-04-21 21:34 UTC
To: Michael Kerrisk (man-pages)
Cc: Christoph Hellwig, Heinrich Schuchardt, linux-man, Dave Chinner, Theodore T'so, Linux-Fsdevel, Miklos Szeredi

Michael Kerrisk (man-pages) wrote:
> >> 7. On Linux (and probably many other modern systems), the only
> >>    call that has any real use is msync(MS_SYNC), meaning
> >>    "flush the buffers *now*, and I want to wait for that to
> >>    complete, so that I can then continue safe in the knowledge
> >>    that my data has landed on a device". That's useful if we
> >>    want insurance for our data in the event of a system crash.
> >
> > Right. It's basically another way to call fsync, which is used to
> > implement it underneath. It actually should be a ranged-fdatasync,
> > but right now it's implemented horribly inefficiently in that it
> > does an fsync call for each vma that it encounters in the range
> > specified.

A ranged-fdatasync, for databases with little logs inside the big data
file, would be nice. AIX, NetBSD and FreeBSD all have one :) Any
likelihood of that ever appearing in Linux? sync_file_range() comes
with its Warning in the man page which basically means "don't trust me
unless you know the filesystem exactly".

> Thanks for checking all of this over and thanks also
> for confirming that I learned my lessons well in the
> "Jamie Lokier school of tough technical reviewing" ;-).

Hi! That was a long time ago :)

> >> 9. The mmap() man page says this:
> >>
> >>        MAP_SHARED
> >>            Share this mapping. Updates to the mapping are
> >>            visible to other processes that map this file, and
> >>            are carried through to the underlying file. The file
> >>            may not actually be updated until msync(2) or
> >>            munmap() is called.
> >>
> >>    I believe the piece "or munmap()" is misleading. It implies
> >>    that munmap() must trigger a sync action. I don't think this
> >>    is true. All that it is required to do is remove some range
> >>    of pages from the process's virtual address space. I'm
> >>    inclined to remove those words, but I'd like to see if any
> >>    FS person has a correction to my understanding first.
> >
> > I would expect non-coherent systems to update their caches on munmap,
> > but POSIX does not seem to require this, and I can't find any language
> > towards that in the HP-UX man page, which was a system that I remember
> > as non-coherent until the end.
>
> Yes, that's how I read it too. POSIX seems to have no requirements here,
> so I assume it was catering to the lowest common denominator.

According to this:

   http://h30499.www3.hp.com/t5/System-Administration/2-second-delays-in-fsync-msync-munmap/td-p/3092785/page/2#.U1WBw8dSI1-

and the conclusion of the following page:

   - munmap() does _something_ on HP-UX, but it might be just a poorly
     implemented artifact rather than equivalent to msync.

   - While we're there, the lowest common denominator for HP-UX was
     that pwrite() followed by mmap() does not provide the data
     recently written, even with fsync() between. The thread ended
     there, but I would guess either it's a bug _or_ perhaps
     write+mmap+msync(MS_INVALIDATE) are needed in that order despite
     the write being before the mmap, perhaps if the shared segment
     was maintained by another process.

   - To keep it exciting, if you look at the HP-UX man page, 32-bit
     and 64-bit processes have separate mmap caches - writing to
     shared memory in one of them won't be seen immediately by the
     other.

Then there's this, about Linux NFS incoherency with msync() and O_DIRECT:

   - https://groups.google.com/d/msg/comp.os.linux.development.apps/B49Rej6KV24/xEouZOVXs9gJ

I don't know if any of the above are _true_ though :)

Best,
-- Jamie
[parent not found: <20140421213418.GH30215-DqlFc3psUjeg7Qil/0GVWOc42C6kRsbE@public.gmane.org>]
* Re: munmap, msync: synchronization
       [not found]         ` <20140421213418.GH30215-DqlFc3psUjeg7Qil/0GVWOc42C6kRsbE@public.gmane.org>
@ 2014-04-22  6:03           ` Christoph Hellwig
  2014-04-22  7:04             ` Jamie Lokier
  0 siblings, 1 reply; 17+ messages in thread
From: Christoph Hellwig @ 2014-04-22 6:03 UTC
To: Jamie Lokier
Cc: Michael Kerrisk (man-pages), Christoph Hellwig, Heinrich Schuchardt, linux-man-u79uwXL29TY76Z2rM5mHXA, Dave Chinner, Theodore T'so, Linux-Fsdevel, Miklos Szeredi

On Mon, Apr 21, 2014 at 10:34:18PM +0100, Jamie Lokier wrote:
> A ranged-fdatasync, for databases with little logs inside the big data
> file, would be nice. AIX, NetBSD and FreeBSD all have one :) Any
> likelihood of that ever appearing in Linux? sync_file_range() comes
> with its Warning in the man page which basically means "don't trust me
> unless you know the filesystem exactly".

We have the infrastructure for range fsync and fdatasync in the kernel;
it's just not exposed. Given that you've already done the research,
how about you send a patch to wire it up? Do the above implementations
at least agree on an API for it?

sync_file_range() unfortunately only writes out pagecache data and never
the metadata needed to actually find it. While we could multiplex a
range fsync over it, that seems to be very confusing (and would be more
complicated than just adding new syscalls).

> Then there's this, about Linux NFS incoherency with msync() and O_DIRECT:
>
>    - https://groups.google.com/d/msg/comp.os.linux.development.apps/B49Rej6KV24/xEouZOVXs9gJ

That mail is utterly confused. Yes, NFS has less coherency than normal
filesystems (google for close to open), but msync actually does its
proper job on NFS.
* Re: munmap, msync: synchronization
  2014-04-22  6:03           ` Christoph Hellwig
@ 2014-04-22  7:04             ` Jamie Lokier
  2014-04-22  9:28               ` [PATCH] fsync_range, was: " Christoph Hellwig
  0 siblings, 1 reply; 17+ messages in thread
From: Jamie Lokier @ 2014-04-22 7:04 UTC
To: Christoph Hellwig
Cc: Michael Kerrisk (man-pages), Heinrich Schuchardt, linux-man, Dave Chinner, Theodore T'so, Linux-Fsdevel, Miklos Szeredi

Christoph Hellwig wrote:
> On Mon, Apr 21, 2014 at 10:34:18PM +0100, Jamie Lokier wrote:
> > A ranged-fdatasync, for databases with little logs inside the big data
> > file, would be nice. AIX, NetBSD and FreeBSD all have one :) Any
> > likelihood of that ever appearing in Linux? sync_file_range() comes
> > with its Warning in the man page which basically means "don't trust me
> > unless you know the filesystem exactly".
>
> We have the infrastructure for range fsync and fdatasync in the kernel;
> it's just not exposed. Given that you've already done the research,
> how about you send a patch to wire it up? Do the above implementations
> at least agree on an API for it?

Hi Christoph,

Hardly research, I just did a quick Google and was surprised to find
some results. AIX API differs from the BSDs; the BSDs seem to agree
with each other. fsync_range(), with a flag parameter saying what type
of sync, and whether it flushes the storage device write cache as well
(because they couldn't agree that was good - similar to the barriers
debate).

As for me doing it, no, sorry, I haven't touched the kernel in a few
years, life's been complicated for non-technical reasons, and I don't
have time to get back into it now.

> sync_file_range() unfortunately only writes out pagecache data and never
> the metadata needed to actually find it. While we could multiplex a
> range fsync over it, that seems to be very confusing (and would be more
> complicated than just adding new syscalls).

I agree. I never saw the point in sync_file_range() except to mislead,
whereas fsync_range() always seemed obvious!

In the kernel, I was always under the impression the simple part of
fsync_range - writing out data pages - was solved years ago, but being
sure the filesystem's updated its metadata in the proper way, that
begs for a little research into what filesystems do when asked,
doesn't it?

For example, imagine two dirty pages 0 and 1, two disk blocks A and B,
and a non-overwriting filesystem (similar to btrfs) which knows about
the dirty flags and has formulated a plan to journal a single metadata
change containing two pointers, from [0->A,1->B] to [0->C,1->D] when
it flushes metadata _after_ pages 0 and 1 are written to new disk
blocks C and D. And you do fsync_range just on page 1.

Now if only page 1 gets written and page 0 does not, it's important
that a different metadata change is journalled: [0->A,1->D] (or just
[1->D]).

Now hopefully, all filesystems are sane enough to just do that, by
calculating what to journal as a response to only data I/O that's in
flight and behind a barrier.

But I wouldn't like to _assume_ that no filesystem's algorithms queue
up the joint [0->C,1->D] metadata change somehow, having seen the
dirty flags, in a way that gets confused by a forced metadata flush
after partial dirty data flush. After all it might be a legitimate
thing to do in the current scheme.

(Similar things apply to converting preallocated-but-unwritten regions
to written.)

So I have this weird idea that to do it carefully needs a little
checking what filesystems do with carefully ordered block-pointer
metadata writes.

> > Then there's this, about Linux NFS incoherency with msync() and O_DIRECT:
> >
> >    - https://groups.google.com/d/msg/comp.os.linux.development.apps/B49Rej6KV24/xEouZOVXs9gJ
>
> That mail is utterly confused. Yes, NFS has less coherency than normal
> filesystems (google for close to open), but msync actually does its
> proper job on NFS.

Good to know :)

-- Jamie
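The two-page example in the message above can be written out as a toy model. This is only an illustrative sketch, not kernel or filesystem code; the function name metadata_to_journal and the use of single characters as block names are assumptions made for the illustration. It computes which page-to-block pointers may safely be journalled when only some of the dirty pages were actually written to their new blocks:

```c
#include <stddef.h>

/*
 * Toy model (hypothetical, not from any filesystem): old_map[i] and
 * planned_map[i] name the disk block holding page i before and after a
 * copy-on-write flush.  written[i] is nonzero if page i's data actually
 * reached its new block.  journal_map[i] receives the pointer that is
 * safe to journal: the new block for written pages, the old block for
 * pages whose data has not been written yet.
 */
void metadata_to_journal(const char *old_map, const char *planned_map,
			 const int *written, char *journal_map,
			 size_t npages)
{
	for (size_t i = 0; i < npages; i++)
		journal_map[i] = written[i] ? planned_map[i] : old_map[i];
}
```

With old_map "AB", planned_map "CD", and only page 1 written, the result is [0->A,1->D], the mixed state described above, rather than the joint [0->C,1->D] change that was planned before the partial flush.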
* [PATCH] fsync_range, was: Re: munmap, msync: synchronization 2014-04-22 7:04 ` Jamie Lokier @ 2014-04-22 9:28 ` Christoph Hellwig 2014-04-23 14:33 ` Michael Kerrisk (man-pages) [not found] ` <20140422092837.GA6191-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org> 0 siblings, 2 replies; 17+ messages in thread From: Christoph Hellwig @ 2014-04-22 9:28 UTC (permalink / raw) To: Jamie Lokier Cc: Christoph Hellwig, Michael Kerrisk (man-pages), Heinrich Schuchardt, linux-man, Dave Chinner, Theodore T'so, Linux-Fsdevel, Miklos Szeredi [-- Attachment #1: Type: text/plain, Size: 2198 bytes --] On Tue, Apr 22, 2014 at 08:04:21AM +0100, Jamie Lokier wrote: > Hi Christoph, > > Hardly research, I just did a quick Google and was surprised to find > some results. AIX API differs from the BSDs; the BSDs seem to agree > with each other. fsync_range(), with a flag parameter saying what type > of sync, and whether it flushes the storage device write cache as well > (because they couldn't agree that was good - similar to the barriers > debate). There is no FreeBSD implementation, I think you were confused by FreeBSD also hosting NetBSD man pages on their site, just as I initially was. The APIs are mostly the same, except that AIX reuses O_ flags as argument and NetBSD has a separate namespace. Following the latter seems more sensible, and also allows developer to define the separate name to the O_ flag for portability. > As for me doing it, no, sorry, I haven't touched the kernel in a few > years, life's been complicated for non-technical reasons, and I don't > have time to get back into it now. I've cooked up a patch, but I really need someone to test it and promote it. Find the patch attached. There are two differences to the NetBSD one: 1) It doesn't fail for read-only FDs. 
    fsync doesn't, and while
    standards used to have fdatasync and aio_fsync fail for them,
    Linux never did and the standards are catching up:

	http://austingroupbugs.net/view.php?id=501
	http://austingroupbugs.net/view.php?id=671

 2) I don't implement the FDISKSYNC.  Requiring it is utterly broken,
    and we wouldn't even have the infrastructure for it.  It might make
    sense to provide it defined to 0 so that we have the identifier but
    make it a no-op.

> In the kernel, I was always under the impression the simple part of
> fsync_range - writing out data pages - was solved years ago, but being
> sure the filesystem's updated its metadata in the proper way, that
> begs for a little research into what filesystems do when asked,
> doesn't it?

The filesystems I care about handle it fine, and while I don't know
the details of others they better handle it properly, given that we
use vfs_fsync_range to implement O_SYNC/O_DSYNC writes and commits
from the nfs server.

[-- Attachment #2: 0001-fs-implement-fsync_range.patch --]
[-- Type: text/plain, Size: 5898 bytes --]

>From b63881cac84b35ce3d6a61a33e33ac795a5c583c Mon Sep 17 00:00:00 2001
From: Christoph Hellwig <hch@lst.de>
Date: Tue, 22 Apr 2014 11:24:51 +0200
Subject: fs: implement fsync_range

Implement a fsync/fdatasync variant that takes a range to sync.  This
follows the NetBSD implementation:

  http://www.freebsd.org/cgi/man.cgi?query=fsync&apropos=0&sektion=0&manpath=NetBSD+5.0&format=html

and is fairly close to the AIX implementation that the NetBSD one is
based on:

  http://publib.boulder.ibm.com/infocenter/aix/v6r1/index.jsp?topic=%2Fcom.ibm.aix.basetechref%2Fdoc%2Fbasetrf1%2Ffsync.htm

The implementation is very simple because the VFS already offers a ranged
fsync infrastructure, which is most prominently used to implement O_SYNC and
O_DSYNC writes.

Differences from NetBSD are:

 1) It doesn't fail for read-only FDs.  fsync doesn't, and while standards
    used to require fdatasync and aio_fsync to fail for read-only file
    descriptors Linux never did and the standards are catching up:

	http://austingroupbugs.net/view.php?id=501
	http://austingroupbugs.net/view.php?id=671

 2) It doesn't implement the FDISKSYNC.  Requiring a flag to actually make
    data persistent is completely broken, and the Linux infrastructure
    doesn't support it anyway.  We could provide it as a no-op if we
    really need to.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 arch/x86/syscalls/syscall_32.tbl |    1 +
 arch/x86/syscalls/syscall_64.tbl |    1 +
 fs/sync.c                        |  101 ++++++++++++++++++++++++--------------
 include/uapi/linux/fs.h          |    6 ++-
 4 files changed, 72 insertions(+), 37 deletions(-)

diff --git a/arch/x86/syscalls/syscall_32.tbl b/arch/x86/syscalls/syscall_32.tbl
index d6b8679..e239d46 100644
--- a/arch/x86/syscalls/syscall_32.tbl
+++ b/arch/x86/syscalls/syscall_32.tbl
@@ -360,3 +360,4 @@
 351	i386	sched_setattr		sys_sched_setattr
 352	i386	sched_getattr		sys_sched_getattr
 353	i386	renameat2		sys_renameat2
+354	i386	fsync_range		sys_fsync_range
diff --git a/arch/x86/syscalls/syscall_64.tbl b/arch/x86/syscalls/syscall_64.tbl
index 04376ac..006d57f 100644
--- a/arch/x86/syscalls/syscall_64.tbl
+++ b/arch/x86/syscalls/syscall_64.tbl
@@ -323,6 +323,7 @@
 314	common	sched_setattr		sys_sched_setattr
 315	common	sched_getattr		sys_sched_getattr
 316	common	renameat2		sys_renameat2
+317	common	fsync_range		sys_fsync_range
 
 #
 # x32-specific system call numbers start at 512 to avoid cache impact
diff --git a/fs/sync.c b/fs/sync.c
index b28d1dd..58f9ca7 100644
--- a/fs/sync.c
+++ b/fs/sync.c
@@ -197,13 +197,13 @@ int vfs_fsync(struct file *file, int datasync)
 }
 EXPORT_SYMBOL(vfs_fsync);
 
-static int do_fsync(unsigned int fd, int datasync)
+static int do_fsync(unsigned int fd, loff_t start, loff_t end, int datasync)
 {
 	struct fd f = fdget(fd);
 	int ret = -EBADF;
 
 	if (f.file) {
-		ret = vfs_fsync(f.file, datasync);
+		ret = vfs_fsync_range(f.file, start, end, datasync);
 		fdput(f);
 	}
 	return ret;
@@ -211,12 +211,69 @@ static int do_fsync(unsigned int fd, int datasync)
 
 SYSCALL_DEFINE1(fsync, unsigned int, fd)
 {
-	return do_fsync(fd, 0);
+	return do_fsync(fd, 0, LLONG_MAX, 0);
 }
 
 SYSCALL_DEFINE1(fdatasync, unsigned int, fd)
 {
-	return do_fsync(fd, 1);
+	return do_fsync(fd, 0, LLONG_MAX, 1);
+}
+
+static loff_t end_offset(loff_t offset, loff_t nbytes)
+{
+	loff_t endbyte = offset + nbytes;
+
+	if ((s64)offset < 0)
+		return -EINVAL;
+	if ((s64)endbyte < 0)
+		return -EINVAL;
+	if (endbyte < offset)
+		return -EINVAL;
+
+	if (sizeof(pgoff_t) == 4) {
+		if (offset >= (0x100000000ULL << PAGE_CACHE_SHIFT)) {
+			/*
+			 * The range starts outside a 32 bit machine's
+			 * pagecache addressing capabilities.  Let it "succeed"
+			 */
+			return 0;
+		}
+		if (endbyte >= (0x100000000ULL << PAGE_CACHE_SHIFT)) {
+			/*
+			 * Out to EOF
+			 */
+			return LLONG_MAX;
+		}
+	}
+
+	if (nbytes == 0)
+		endbyte = LLONG_MAX;
+	else
+		endbyte--;		/* inclusive */
+
+	return endbyte;
+}
+
+SYSCALL_DEFINE4(fsync_range, unsigned int, fd, int, how,
+		loff_t, start, loff_t, length)
+{
+	int datasync = 0;
+	loff_t end;
+
+	switch (how) {
+	case FDATASYNC:
+		datasync = 1;
+		break;
+	case FFILESYNC:
+		break;
+	default:
+		return -EINVAL;
+	}
+
+	end = end_offset(start, length);
+	if (end <= 0)
+		return end;
+	return do_fsync(fd, start, end, datasync);
 }
 
 /*
@@ -275,40 +332,12 @@ SYSCALL_DEFINE4(sync_file_range, int, fd, loff_t, offset, loff_t, nbytes,
 	loff_t endbyte;			/* inclusive */
 	umode_t i_mode;
 
-	ret = -EINVAL;
 	if (flags & ~VALID_FLAGS)
-		goto out;
-
-	endbyte = offset + nbytes;
-
-	if ((s64)offset < 0)
-		goto out;
-	if ((s64)endbyte < 0)
-		goto out;
-	if (endbyte < offset)
-		goto out;
-
-	if (sizeof(pgoff_t) == 4) {
-		if (offset >= (0x100000000ULL << PAGE_CACHE_SHIFT)) {
-			/*
-			 * The range starts outside a 32 bit machine's
-			 * pagecache addressing capabilities.  Let it "succeed"
-			 */
-			ret = 0;
-			goto out;
-		}
-		if (endbyte >= (0x100000000ULL << PAGE_CACHE_SHIFT)) {
-			/*
-			 * Out to EOF
-			 */
-			nbytes = 0;
-		}
-	}
+		return -EINVAL;
 
-	if (nbytes == 0)
-		endbyte = LLONG_MAX;
-	else
-		endbyte--;		/* inclusive */
+	endbyte = end_offset(offset, nbytes);
+	if (endbyte <= 0)
+		return endbyte;
 
 	ret = -EBADF;
 	f = fdget(fd);
diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
index ca1a11b..491d9fe 100644
--- a/include/uapi/linux/fs.h
+++ b/include/uapi/linux/fs.h
@@ -199,9 +199,13 @@ struct inodes_stat_t {
 #define FS_FL_USER_VISIBLE		0x0003DFFF /* User visible flags */
 #define FS_FL_USER_MODIFIABLE		0x000380FF /* User modifiable flags */
 
-
+/* flags for sync_file_range */
 #define SYNC_FILE_RANGE_WAIT_BEFORE	1
 #define SYNC_FILE_RANGE_WRITE		2
 #define SYNC_FILE_RANGE_WAIT_AFTER	4
 
+/* flags for fsync_range */
+#define FDATASYNC	0x0010
+#define FFILESYNC	0x0020
+
 #endif /* _UAPI_LINUX_FS_H */
-- 
1.7.10.4

^ permalink raw reply related [flat|nested] 17+ messages in thread
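For readers following the range arithmetic, here is a minimal userspace sketch of what end_offset() in the patch computes on a machine with a 64-bit pgoff_t. It is a model only, not kernel code: model_end_offset is a hypothetical name, the 32-bit page cache special cases are left out, and the overflow test is written out explicitly instead of relying on signed wraparound.

```c
#include <errno.h>
#include <stdint.h>

/*
 * Userspace model of the patch's end_offset() (hypothetical name).
 * Returns the inclusive end byte of the range, or -EINVAL.
 * Assumes a 64-bit pgoff_t, so the 32-bit branches are omitted.
 */
static int64_t model_end_offset(int64_t offset, int64_t nbytes)
{
	/* Same rejections as the patch: negative offset, negative or
	 * wrapped end byte, end byte before the start. */
	if (offset < 0 || nbytes < 0 || nbytes > INT64_MAX - offset)
		return -EINVAL;

	if (nbytes == 0)
		return INT64_MAX;		/* "to end of file" */

	return offset + nbytes - 1;		/* inclusive end */
}
```

Under these assumptions a one-byte range at offset 0 has an inclusive end of 0. The callers in the patch return early when the result is <= 0, so that particular range looks like it would return success without syncing; this may be worth checking against the real code.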
* Re: [PATCH] fsync_range, was: Re: munmap, msync: synchronization 2014-04-22 9:28 ` [PATCH] fsync_range, was: " Christoph Hellwig @ 2014-04-23 14:33 ` Michael Kerrisk (man-pages) 2014-04-23 15:45 ` Christoph Hellwig [not found] ` <20140422092837.GA6191-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org> 1 sibling, 1 reply; 17+ messages in thread From: Michael Kerrisk (man-pages) @ 2014-04-23 14:33 UTC (permalink / raw) To: Christoph Hellwig, Jamie Lokier Cc: mtk.manpages, Heinrich Schuchardt, linux-man, Dave Chinner, Theodore T'so, Linux-Fsdevel, Miklos Szeredi On 04/22/2014 11:28 AM, Christoph Hellwig wrote: > On Tue, Apr 22, 2014 at 08:04:21AM +0100, Jamie Lokier wrote: >> Hi Christoph, >> >> Hardly research, I just did a quick Google and was surprised to find >> some results. AIX API differs from the BSDs; the BSDs seem to agree >> with each other. fsync_range(), with a flag parameter saying what type >> of sync, and whether it flushes the storage device write cache as well >> (because they couldn't agree that was good - similar to the barriers >> debate). > > There is no FreeBSD implementation, I think you were confused by FreeBSD > also hosting NetBSD man pages on their site, just as I initially was. > > The APIs are mostly the same, except that AIX reuses O_ flags as > argument and NetBSD has a separate namespace. Following the latter > seems more sensible, and also allows developer to define the separate > name to the O_ flag for portability. > >> As for me doing it, no, sorry, I haven't touched the kernel in a few >> years, life's been complicated for non-technical reasons, and I don't >> have time to get back into it now. > > I've cooked up a patch, but I really need someone to test it and promote > it. Find the patch attached. There are two differences to the NetBSD > one: > > 1) It doesn't fail for read-only FDs. 
fsync doesn't, and while > standards used to have fdatasync and aio_fsync fail for them, > Linux never did and the standards are catching up: > > http://austingroupbugs.net/view.php?id=501 > http://austingroupbugs.net/view.php?id=671 > > 2) I don't implement the FDISKSYNC. Requiring it is utterly broken, > and we wouldn't even have the infrastructure for it. It might make > sense to provide it defined to 0 so that we have the identifier but > make it a no-op. > >> In the kernel, I was always under the impression the simple part of >> fsync_range - writing out data pages - was solved years ago, but being >> sure the filesystem's updated its metadata in the proper way, that >> begs for a little research into what filesystems do when asked, >> doesn't it? > > The filesystems I care about handle it fine, and while I don't know > the details of others they better handle it properly, given that we > use vfs_fsync_range to implement O_SNYC/O_DSYNC writes and commits > from the nfs server. The functionality sounds like it would be worthwhile. I've applied the patch against 3.15-rc2, and employed the test program below, with test files on standard laptop HDD (ext4). The test program repeatedly a) overwrites a specified region of a file b) does an fsync_range() on a specified range of the file (need not be the same region that was written). 
The CLI is crude, but the arguments are:

1: pathname
2: number of loops
3: Starting point for writes each time round loop
4: Length of region to write
5: Either 'f' for FFILESYNC or 'd' for FDATASYNC
6: start offset for fsync_range()
7: length for fsync_range()

It seems that the patch does roughly what it says on the tin:

# Precreate a 1MB file

$ sync; time ./t_fsync_range /testfs/f 100 0 1000000 d 0 1000000^C
$ dd of=/testfs/f bs=1000 count=1000 if=/dev/full
1000+0 records in
1000+0 records out
1000000 bytes (1.0 MB) copied, 0.00575843 s, 174 MB/s

# Take journaling and atime out of the equation:

$ sudo umount /dev/sdb6
$ sudo tune2fs -O ^has_journal /dev/sdb6$
[sudo] password for mtk:
tune2fs 1.42.8 (20-Jun-2013)
$ sudo mount -o norelatime,strictatime /dev/sdb6 /testfs

# Filesystem unmounted and remounted (with above options) before
# each of the following tests

===

# 1000 loops, writing 1 MB, syncing entire 1MB range, with FFILESYNC:

$ time ./t_fsync_range /testfs/f 1000 0 1000000 f 0 1000000
fsync_range(3, 0x20, 0, 1000000)
Performed 16000 writes
Performed 1000 sync operations

real	0m10.677s
user	0m0.011s
sys	0m0.816s

# 1000 loops, writing 1MB, syncing entire 1MB range, with FDATASYNC:
# (Takes less time, as expected)

$ time ./t_fsync_range /testfs/f 1000 0 1000000 d 0 1000000
fsync_range(3, 0x10, 0, 1000000)
Performed 16000 writes
Performed 1000 sync operations

real	0m8.685s
user	0m0.017s
sys	0m0.825s

===

# 1000 loops, writing 1 MB, syncing just 100kB, with FFILESYNC:
# (Takes less time than syncing entire 1MB range, as expected)

$ time ./t_fsync_range /testfs/f 1000 0 1000000 f 0 100000
fsync_range(3, 0x20, 0, 100000)
Performed 16000 writes
Performed 1000 sync operations

real	0m1.501s
user	0m0.005s
sys	0m0.339s

# 1000 loops, writing 1 MB, syncing just 10kB, with FFILESYNC:

$ time ./t_fsync_range /testfs/f 1000 0 1000000 f 0 10000
fsync_range(3, 0x20, 0, 10000)
Performed 16000 writes
Performed 1000 sync operations

real	0m0.616s
user	0m0.004s
sys	0m0.240s

=======

But I have a question:

When I precreate a 10MB file, and repeat the tests (this time with
100 loops), I no longer see any significant difference between
FFILESYNC and FDATASYNC. What am I missing? Sample runs here,
though I did the tests repeatedly with broadly similar results
each time:

# FFILESYNC

$ time ./t_fsync_range /testfs/f 100 0 10000000 f 0 10000000
fsync_range(3, 0x20, 0, 10000000)
Performed 15300 writes
Performed 100 sync operations

real	0m17.575s
user	0m0.001s
sys	0m0.656s

# FDATASYNC

$ time ./t_fsync_range /testfs/f 100 0 10000000 d 0 10000000
fsync_range(3, 0x10, 0, 10000000)
Performed 15300 writes
Performed 100 sync operations

real	0m17.228s
user	0m0.005s
sys	0m0.624s

======

Add another question: is there any piece of sync_file_range()
functionality that could or should be incorporated in this API?

======

Tested-by: Michael Kerrisk <mtk.manpages@gmail.com>

Cheers,

Michael

-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/

^ permalink raw reply	[flat|nested] 17+ messages in thread
* Re: [PATCH] fsync_range, was: Re: munmap, msync: synchronization 2014-04-23 14:33 ` Michael Kerrisk (man-pages) @ 2014-04-23 15:45 ` Christoph Hellwig [not found] ` <20140423154550.GA21014-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org> 2014-04-24 9:34 ` Michael Kerrisk (man-pages) 0 siblings, 2 replies; 17+ messages in thread From: Christoph Hellwig @ 2014-04-23 15:45 UTC (permalink / raw) To: Michael Kerrisk (man-pages) Cc: Christoph Hellwig, Jamie Lokier, Heinrich Schuchardt, linux-man, Dave Chinner, Theodore T'so, Linux-Fsdevel, Miklos Szeredi On Wed, Apr 23, 2014 at 04:33:06PM +0200, Michael Kerrisk (man-pages) wrote: > # Take journaling and atime out of the equation: > > $ sudo umount /dev/sdb6 > $ sudo tune2fs -O ^has_journal /dev/sdb6$ > [sudo] password for mtk: > tune2fs 1.42.8 (20-Jun-2013) > $ sudo mount -o norelatime,strictatime /dev/sdb6 /testfs The second strictatime argument overrides the earlier norelatime, so you put it into the picture. > > But I have a question: > > When I precreate a 10MB file, and repeat the tests (this time with > 100 loops), I no longer see any significant difference between > FFILESYNC and FDATASYNC. What am I missing? Sample runs here, > though I did the tests repeatedly with broadly similar results > each time: Not sure. Do you also see this on other filesystems? > Add another question: is there any piece of sync_file_range() > functionality that could or should be incorporated in this API? I don't think so. sync_file_range is a complete mess and impossible to use correctly for data integrity operations. Especially the whole notion that submitting I/O and waiting for it are separate operations is incompatible with a data integrity call. ^ permalink raw reply [flat|nested] 17+ messages in thread
[parent not found: <20140423154550.GA21014-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org>]
* Re: [PATCH] fsync_range, was: Re: munmap, msync: synchronization [not found] ` <20140423154550.GA21014-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org> @ 2014-04-23 22:20 ` Jamie Lokier [not found] ` <20140423222011.GM30215-DqlFc3psUjeg7Qil/0GVWOc42C6kRsbE@public.gmane.org> 0 siblings, 1 reply; 17+ messages in thread From: Jamie Lokier @ 2014-04-23 22:20 UTC (permalink / raw) To: Christoph Hellwig Cc: Michael Kerrisk (man-pages), Heinrich Schuchardt, linux-man-u79uwXL29TY76Z2rM5mHXA, Dave Chinner, Theodore T'so, Linux-Fsdevel, Miklos Szeredi > > Add another question: is there any piece of sync_file_range() > > functionality that could or should be incorporated in this API? > > I don't think so. sync_file_range is a complete mess and impossible > to use correctly for data integrity operations. Especially the whole > notion that submitting I/O and waiting for it are separate operations > is incompatible with a data integrity call. I guess it's also to give the application a way to nudge a preferred asynchronous writeback order, prior to a synchronous wait. If the application knows there's a lot of dirty data being generated over time prior to needing a short fdatasync, it might see it as beneficial to tell the kernel to start writing that data sooner, so the fdatasync delay will be shorter. -- Jamie -- To unsubscribe from this list: send the line "unsubscribe linux-man" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 17+ messages in thread
[parent not found: <20140423222011.GM30215-DqlFc3psUjeg7Qil/0GVWOc42C6kRsbE@public.gmane.org>]
* Re: [PATCH] fsync_range, was: Re: munmap, msync: synchronization [not found] ` <20140423222011.GM30215-DqlFc3psUjeg7Qil/0GVWOc42C6kRsbE@public.gmane.org> @ 2014-04-25 6:07 ` Christoph Hellwig 0 siblings, 0 replies; 17+ messages in thread From: Christoph Hellwig @ 2014-04-25 6:07 UTC (permalink / raw) To: Jamie Lokier Cc: Christoph Hellwig, Michael Kerrisk (man-pages), Heinrich Schuchardt, linux-man-u79uwXL29TY76Z2rM5mHXA, Dave Chinner, Theodore T'so, Linux-Fsdevel, Miklos Szeredi On Wed, Apr 23, 2014 at 11:20:11PM +0100, Jamie Lokier wrote: > I guess it's also to give the application a way to nudge a preferred > asynchronous writeback order, prior to a synchronous wait. If the > application knows there's a lot of dirty data being generated over > time prior to needing a short fdatasync, it might see it as beneficial > to tell the kernel to start writing that data sooner, so the fdatasync > delay will be shorter. If they want to do an async writeback pass first they can just use sync_file_range for it, that's the only thing it's actually useful for. -- To unsubscribe from this list: send the line "unsubscribe linux-man" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 17+ messages in thread
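The split described in this exchange can be sketched as follows: sync_file_range(2) with SYNC_FILE_RANGE_WRITE is used only as an early writeback hint, and fdatasync(2) remains the actual data integrity point. Both calls are existing Linux interfaces; the helper names start_writeback and commit_data are made up for illustration, and the sketch makes no claim about how much the hint helps on any given filesystem.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>

/*
 * Hint only: ask the kernel to start writing dirty pages in
 * [offset, offset + nbytes) without waiting for completion.
 * This gives no data integrity guarantee by itself.
 */
static int start_writeback(int fd, off_t offset, off_t nbytes)
{
	return sync_file_range(fd, offset, nbytes, SYNC_FILE_RANGE_WRITE);
}

/*
 * Integrity point: returns only after the file data (and the metadata
 * needed to read it back) has been handed to stable storage.
 */
static int commit_data(int fd)
{
	return fdatasync(fd);
}
```

An application generating dirty data over time could call start_writeback() as it goes and commit_data() once at the point where it needs the guarantee.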
* Re: [PATCH] fsync_range, was: Re: munmap, msync: synchronization
  2014-04-23 15:45       ` Christoph Hellwig
       [not found]         ` <20140423154550.GA21014-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org>
@ 2014-04-24  9:34         ` Michael Kerrisk (man-pages)
  1 sibling, 0 replies; 17+ messages in thread
From: Michael Kerrisk (man-pages) @ 2014-04-24  9:34 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: mtk.manpages, Jamie Lokier, Heinrich Schuchardt, linux-man,
	Dave Chinner, Theodore T'so, Linux-Fsdevel, Miklos Szeredi

(Oops -- I see that I forgot to attach the test program in my last mail.
Appended below, now.)

On 04/23/2014 05:45 PM, Christoph Hellwig wrote:
> On Wed, Apr 23, 2014 at 04:33:06PM +0200, Michael Kerrisk (man-pages) wrote:
>> # Take journaling and atime out of the equation:
>>
>> $ sudo umount /dev/sdb6
>> $ sudo tune2fs -O ^has_journal /dev/sdb6$
>> [sudo] password for mtk:
>> tune2fs 1.42.8 (20-Jun-2013)
>> $ sudo mount -o norelatime,strictatime /dev/sdb6 /testfs
>
> The second strictatime argument overrides the earlier norelatime,
> so you put it into the picture.

Oh -- have I misunderstood something? I was wanting classical behavior:
atime always updated (but only synced to disk by FFILESYNC). Is that not
what I should get with norelatime+strictatime?

>> But I have a question:
>>
>> When I precreate a 10MB file, and repeat the tests (this time with
>> 100 loops), I no longer see any significant difference between
>> FFILESYNC and FDATASYNC. What am I missing? Sample runs here,
>> though I did the tests repeatedly with broadly similar results
>> each time:
>
> Not sure.  Do you also see this on other filesystems?

=======

So, here's some results from XFS:

# 1000 loops. 1MB file, 1MB fsync_range()
# As with ext4, FDATASYNC is faster than FFILESYNC (as expected)

$ sudo umount /dev/sdb6; sudo mount -o norelatime,strictatime /dev/sdb6 /testfs
$ time ./t_fsync_range /testfs/f 1000 0 1000000 f 0 1000000
fsync_range(3, 0x20, 0, 1000000)
Performed 16000 writes
Performed 1000 sync operations

real	0m52.264s
user	0m0.018s
sys	0m0.926s

$ sudo umount /dev/sdb6; sudo mount -o norelatime,strictatime /dev/sdb6 /testfs
$ time ./t_fsync_range /testfs/f 1000 0 1000000 d 0 1000000
fsync_range(3, 0x10, 0, 1000000)
Performed 16000 writes
Performed 1000 sync operations

real	0m33.689s
user	0m0.002s
sys	0m0.915s

# (Note that I did not disable XFS journalling--it's not possible to
# do so, right?)

====

# 100 loops, 100MB file, 100MB fsync_range()
# FDATASYNC and FFILESYNC times are again similar

$ time ./t_fsync_range /testfs/f 100 0 100000000 f 0 100000000
fsync_range(3, 0x20, 0, 100000000)
Performed 152600 writes
Performed 100 sync operations

real	4m45.257s
user	0m0.004s
sys	0m5.607s

$ time ./t_fsync_range /testfs/f 100 0 100000000 d 0 100000000
fsync_range(3, 0x10, 0, 100000000)
Performed 152600 writes
Performed 100 sync operations

real	4m43.925s
user	0m0.010s
sys	0m3.824s

# Again, the same pattern: no difference between FFILESYNC and FDATASYNC

=====

On JFS, I get

1000 loops, 1MB file, 1MB fsync_range, FFILESYNC:
    * Quite a lot of variability (11.3 to 16.5 secs)
1000 loops, 1MB file, 1MB fsync_range, FDATASYNC:
    * Quite a lot of variability (8.6 to 10.9 secs)
==> FDATASYNC is on average faster than FFILESYNC

100 loops, 100 MB file, 100MB fsync_range, FFILESYNC:
    281 seconds (just a single test)
100 loops, 100 MB file, 100MB fsync_range, FDATASYNC:
    280 seconds (just a single test)

So, again, it seems like for a large file sync, there's no difference
between FFILESYNC and FDATASYNC

>> Add another question: is there any piece of sync_file_range()
>> functionality that could or should be incorporated in this API?
>
> I don't think so.
> sync_file_range is a complete mess and impossible
> to use correctly for data integrity operations.  Especially the whole
> notion that submitting I/O and waiting for it are separate operations
> is incompatible with a data integrity call.

Okay -- I just thought it worth checking.

Cheers,

Michael

========

#define _GNU_SOURCE
#include <unistd.h>
#include <sys/syscall.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <stdlib.h>

#define errExit(msg)    do { perror(msg); exit(EXIT_FAILURE); \
                        } while (0)

/* flags for fsync_range */
#define FDATASYNC       0x0010
#define FFILESYNC       0x0020

/* x86-64 syscall number assigned by the patch; only meaningful on a
   kernel that has the patch applied */
#define SYS_fsync_range 317

static int
fsync_range(unsigned int fd, int how, loff_t start, loff_t length)
{
    return syscall(SYS_fsync_range, fd, how, start, length);
}

#define BUF_SIZE 65536
static char buf[BUF_SIZE];

int
main(int argc, char *argv[])
{
    int j, fd, nloops, how;
    size_t writeLen, syncLen, wlen;
    size_t bufSize;
    off_t writeOffset, syncOffset;
    int scnt, wcnt;

    if (argc != 8 || strcmp(argv[1], "--help") == 0) {
        fprintf(stderr, "%s pathname nloops write-offset write-length {f|d} "
                "sync-offset sync-len\n", argv[0]);
        exit(EXIT_SUCCESS);
    }

    fd = open(argv[1], O_RDWR | O_CREAT, S_IRUSR | S_IWUSR);
    if (fd == -1)
        errExit("open");

    /* atoll() rather than atoi(), so offsets and lengths of 2GB and
       more are not truncated */
    nloops = atoi(argv[2]);
    writeOffset = atoll(argv[3]);
    writeLen = atoll(argv[4]);

    /* Any letter other than 'f' or 'd' means: write only, never sync */
    how = (argv[5][0] == 'd') ? FDATASYNC :
          (argv[5][0] == 'f') ? FFILESYNC : 0;

    syncOffset = atoll(argv[6]);
    syncLen = atoll(argv[7]);

    if (how != 0)
        fprintf(stderr, "fsync_range(%d, 0x%x, %lld, %zu)\n",
                fd, how, (long long) syncOffset, syncLen);

    scnt = 0;
    wcnt = 0;

    for (j = 0; j < nloops; j++) {

        /* Different fill byte on each loop, so every pass dirties data */
        memset(buf, j % 256, BUF_SIZE);

        if (lseek(fd, writeOffset, SEEK_SET) == -1)
            errExit("lseek");

        /* Overwrite the region in chunks of at most BUF_SIZE bytes */
        wlen = writeLen;
        while (wlen > 0) {
            bufSize = (wlen > BUF_SIZE) ? BUF_SIZE : wlen;
            wlen -= bufSize;

            if (write(fd, buf, bufSize) != (ssize_t) bufSize) {
                fprintf(stderr, "Write failed\n");
                exit(EXIT_FAILURE);
            }
            wcnt++;
        }

        if (how != 0) {
            scnt++;
            if (fsync_range(fd, how, syncOffset, syncLen) == -1)
                errExit("fsync_range");
        }
    }

    fprintf(stderr, "Performed %d writes\n", wcnt);
    fprintf(stderr, "Performed %d sync operations\n", scnt);

    exit(EXIT_SUCCESS);
}

-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/

^ permalink raw reply	[flat|nested] 17+ messages in thread
[parent not found: <20140422092837.GA6191-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org>]
* Re: [PATCH] fsync_range, was: Re: munmap, msync: synchronization [not found] ` <20140422092837.GA6191-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org> @ 2014-04-23 22:15 ` Jamie Lokier [not found] ` <20140423221402.GL30215-DqlFc3psUjeg7Qil/0GVWOc42C6kRsbE@public.gmane.org> 2014-04-24 1:34 ` Dave Chinner 1 sibling, 1 reply; 17+ messages in thread From: Jamie Lokier @ 2014-04-23 22:15 UTC (permalink / raw) To: Christoph Hellwig Cc: Michael Kerrisk (man-pages), Heinrich Schuchardt, linux-man-u79uwXL29TY76Z2rM5mHXA, Dave Chinner, Theodore T'so, Linux-Fsdevel, Miklos Szeredi Christoph Hellwig wrote: > On Tue, Apr 22, 2014 at 08:04:21AM +0100, Jamie Lokier wrote: > > Hi Christoph, > > > > Hardly research, I just did a quick Google and was surprised to find > > some results. AIX API differs from the BSDs; the BSDs seem to agree > > with each other. fsync_range(), with a flag parameter saying what type > > of sync, and whether it flushes the storage device write cache as well > > (because they couldn't agree that was good - similar to the barriers > > debate). > > There is no FreeBSD implementation, I think you were confused by FreeBSD > also hosting NetBSD man pages on their site, just as I initially was. Yes, especially with the headings on the man pages saying FreeBSD :) Just checked a FreeBSD 8.2 system, doesn't have it. > The APIs are mostly the same, except that AIX reuses O_ flags as > argument and NetBSD has a separate namespace. Following the latter > seems more sensible, and also allows developer to define the separate > name to the O_ flag for portability. ... > I've cooked up a patch, but I really need someone to test it and promote > it. Find the patch attached. There are two differences to the NetBSD > one: > > 1) It doesn't fail for read-only FDs. 
fsync doesn't, and while > standards used to have fdatasync and aio_fsync fail for them, > Linux never did and the standards are catching up: > > http://austingroupbugs.net/view.php?id=501 > http://austingroupbugs.net/view.php?id=671 See also for maybe why: http://www.eivanov.com/2011/06/using-fsync-and-fsyncrange-with.html > 2) I don't implement the FDISKSYNC. Requiring it is utterly broken, > and we wouldn't even have the infrastructure for it. It might make > sense to provide it defined to 0 so that we have the identifier but > make it a no-op. I presume Linux does the equivalent without needing FDISKSYNC, if and only if the filesystem is mounted with barriers enabled, which is the default nowadays? > > In the kernel, I was always under the impression the simple part of > > fsync_range - writing out data pages - was solved years ago, but being > > sure the filesystem's updated its metadata in the proper way, that > > begs for a little research into what filesystems do when asked, > > doesn't it? > > The filesystems I care about handle it fine, and while I don't know > the details of others they better handle it properly, given that we > use vfs_fsync_range to implement O_SNYC/O_DSYNC writes and commits > from the nfs server. Excellent. This really looks like it should have gone in as a system call years ago, since vfs_fsync_range was there all along waiting to be used! > Differences from NetBSD are: > > 1) It doesn't fail for read-only FDs. fsync doesn't, and while standards > used require fdatasync and aio_fsync to fail for read-only file > descriptors Linux never did and the standards are catching up: > > http://austingroupbugs.net/view.php?id=501 > http://austingroupbugs.net/view.php?id=671 > > 2) It doesn't implement the FDISKSYNC. Requiring a flag to actuall make > data persistant is completely broken, and the Linux infrastructure > doesn't support it anyway. We could provide it as a no-op if we > really need to. 
Ah, more differences, which I think should be dropped actually. 3) Does not implement NetBSD's documented behaviour when length == 0. NetBSD says "If the length parameter is zero, fsync_range() will synchronize all of the file data". This path does from offset. 4) Other weird range stuff inherited from sync_file_range() on 32 bit machines only. May not be correct with O_DIRECT or filesystems that don't use page cache. See: > +static loff_t end_offset(loff_t offset, loff_t nbytes) > +{ > + loff_t endbyte = offset + nbytes; > + > + if ((s64)offset < 0) > + return -EINVAL; > + if ((s64)endbyte < 0) > + return -EINVAL; > + if (endbyte < offset) > + return -EINVAL; > + > + if (sizeof(pgoff_t) == 4) { > + if (offset >= (0x100000000ULL << PAGE_CACHE_SHIFT)) { > + /* > + * The range starts outside a 32 bit machine's > + * pagecache addressing capabilities. Let it "succeed" > + */ > + return 0; > + } > + if (endbyte >= (0x100000000ULL << PAGE_CACHE_SHIFT)) { > + /* > + * Out to EOF > + */ > + return LLONG_MAX; > + } > + } > + > + if (nbytes == 0) > + endbyte = LLONG_MAX; > + else > + endbyte--; /* inclusive */ > + > + return endbyte; > +} That was in sync_file_range(), where I think it might have made more sense as that's obviously tied to the page cache only. So: a) Giving zero length results in sync from offset..LLONG_MAX. (NetBSD would have it be 0..LLONG_MAX, according to man page.) b) If the offset is "too large" for page cache on a 32-bit machine, it won't do anything -- including no metadata side-effects. c) If the length is "too large" for page cache on a 32-bit machine, it extends the length to LLONG_MAX. The desired behaviour with zero length, that's obviously a judgement call. I guess that provided NetBSD applications the option to use FDISKSYNC without a range :) About b) and c) they both look dubious, because it's not a given that a filesystem is using page cache, or only using page cache. For example FUSE using O_DIRECT. 
(Not that I've checked if you can actually write anything in those ranges though.) b) looks worse because it means side effects are also quietly not done, and a file might legitimately not use the page cache (consider a FUSE-mounted file accessed with O_DIRECT). So, would it not make sense to just check the offset, length and offset+length fit into s64; and if length is zero change the range to 0..LLONG_MAX, and simply match NetBSD that way? (Or, call me crazy, just return if length is zero.) Best, -- Jamie -- To unsubscribe from this list: send the line "unsubscribe linux-man" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 17+ messages in thread
[parent not found: <20140423221402.GL30215-DqlFc3psUjeg7Qil/0GVWOc42C6kRsbE@public.gmane.org>]
* Re: [PATCH] fsync_range, was: Re: munmap, msync: synchronization [not found] ` <20140423221402.GL30215-DqlFc3psUjeg7Qil/0GVWOc42C6kRsbE@public.gmane.org> @ 2014-04-25 6:26 ` Christoph Hellwig 0 siblings, 0 replies; 17+ messages in thread From: Christoph Hellwig @ 2014-04-25 6:26 UTC (permalink / raw) To: Jamie Lokier Cc: Michael Kerrisk (man-pages), Heinrich Schuchardt, linux-man-u79uwXL29TY76Z2rM5mHXA, Dave Chinner, Theodore T'so, Linux-Fsdevel, Miklos Szeredi On Wed, Apr 23, 2014 at 11:15:27PM +0100, Jamie Lokier wrote: > > 1) It doesn't fail for read-only FDs. fsync doesn't, and while > > standards used to have fdatasync and aio_fsync fail for them, > > Linux never did and the standards are catching up: > > > > http://austingroupbugs.net/view.php?id=501 > > http://austingroupbugs.net/view.php?id=671 > > See also for maybe why: > > http://www.eivanov.com/2011/06/using-fsync-and-fsyncrange-with.html I don't really see a "why" there, just the observation that fsync and fsync_range behavior different on NetBSD, which is odd but documented behavior. > > 2) I don't implement the FDISKSYNC. Requiring it is utterly broken, > > and we wouldn't even have the infrastructure for it. It might make > > sense to provide it defined to 0 so that we have the identifier but > > make it a no-op. > > I presume Linux does the equivalent without needing FDISKSYNC, if and > only if the filesystem is mounted with barriers enabled, which is the > default nowadays? That's correct, at least for modern mainstream filesystems. Either way the filesystem would have to implement the cache flush, so those that don't support it couldn't support FDISKSYNC either. > Ah, more differences, which I think should be dropped actually. > > 3) Does not implement NetBSD's documented behaviour when length == 0. > NetBSD says "If the length parameter is zero, fsync_range() will > synchronize all of the file data". This path does from offset. Indeed. AIX also documents the same behavior. 
> 4) Other weird range stuff inherited from sync_file_range() on 32 > bit machines only. May not be correct with O_DIRECT or > filesystems that don't use page cache. It's not really possible to implement a full Linux filesystem without touching the pagecache, but I agree that this probably doesn't belong into the VFS. sync_file_range is one of these odd layering violations that calls straight into the pagecache without going into the filesystem first (readahead is the other one that comes to mind). > The desired behaviour with zero length, that's obviously a judgement > call. I guess that provided NetBSD applications the option to use > FDISKSYNC without a range :) It seems to originate from the earlier AIX version, but I think it's just their way to sync the whole range. I prefer our 0, LLONG_MAX notation, but given the existing user interface we should stick to it. -- To unsubscribe from this list: send the line "unsubscribe linux-man" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 17+ messages in thread
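The zero-length behaviour agreed on in this exchange (NetBSD and AIX: a length of 0 means the whole file), together with the simpler checks Jamie suggested, might look roughly like this in isolation. This is a userspace sketch under stated assumptions, not the patch and not a proposed kernel change: struct sync_range and normalize_range are hypothetical names, and the 32-bit page cache special cases are dropped on purpose.

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical helper type: a byte range with an inclusive end. */
struct sync_range {
	int64_t start;
	int64_t end;
};

/*
 * Validate (offset, length) and convert it to an inclusive range.
 * A zero length selects the whole file, 0..INT64_MAX, regardless of
 * offset, following the documented NetBSD semantics quoted above.
 * Returns 0 on success, -EINVAL if the values do not fit in s64.
 */
static int normalize_range(int64_t offset, int64_t length,
			   struct sync_range *out)
{
	if (offset < 0 || length < 0 || length > INT64_MAX - offset)
		return -EINVAL;

	if (length == 0) {
		out->start = 0;
		out->end = INT64_MAX;
	} else {
		out->start = offset;
		out->end = offset + length - 1;
	}
	return 0;
}
```

Keeping the error code separate from the computed end avoids overloading a return value of 0, which in this sketch is a valid inclusive end for a one-byte range at offset 0.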
* Re: [PATCH] fsync_range, was: Re: munmap, msync: synchronization [not found] ` <20140422092837.GA6191-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org> 2014-04-23 22:15 ` Jamie Lokier @ 2014-04-24 1:34 ` Dave Chinner 2014-04-25 6:06 ` Christoph Hellwig 1 sibling, 1 reply; 17+ messages in thread From: Dave Chinner @ 2014-04-24 1:34 UTC (permalink / raw) To: Christoph Hellwig Cc: Jamie Lokier, Michael Kerrisk (man-pages), Heinrich Schuchardt, linux-man-u79uwXL29TY76Z2rM5mHXA, Theodore T'so, Linux-Fsdevel, Miklos Szeredi On Tue, Apr 22, 2014 at 02:28:37AM -0700, Christoph Hellwig wrote: > On Tue, Apr 22, 2014 at 08:04:21AM +0100, Jamie Lokier wrote: > > Hi Christoph, > > > > Hardly research, I just did a quick Google and was surprised to find > > some results. AIX API differs from the BSDs; the BSDs seem to agree > > with each other. fsync_range(), with a flag parameter saying what type > > of sync, and whether it flushes the storage device write cache as well > > (because they couldn't agree that was good - similar to the barriers > > debate). > > There is no FreeBSD implementation, I think you were confused by FreeBSD > also hosting NetBSD man pages on their site, just as I initially was. > > The APIs are mostly the same, except that AIX reuses O_ flags as > argument and NetBSD has a separate namespace. Following the latter > seems more sensible, and also allows developer to define the separate > name to the O_ flag for portability. > > > As for me doing it, no, sorry, I haven't touched the kernel in a few > > years, life's been complicated for non-technical reasons, and I don't > > have time to get back into it now. > > I've cooked up a patch, but I really need someone to test it and promote > it. Find the patch attached. There are two differences to the NetBSD > one: ..... 
> From b63881cac84b35ce3d6a61a33e33ac795a5c583c Mon Sep 17 00:00:00 2001 > From: Christoph Hellwig <hch-jcswGhMUV9g@public.gmane.org> > Date: Tue, 22 Apr 2014 11:24:51 +0200 > Subject: fs: implement fsync_range Christoph, if this is going into the kernel, can you add support for xfs_io and write a couple of xfstests to test it? I'm not comfortable with adding new data integrity primitives to the kernel without having robust validation infrastructure already in place for it. It might also be worthwhile looking to extend Josef's fsync-tester.c to be able to use ranged fsyncs so to test all the various corner cases that we need to.... Cheers, Dave. -- Dave Chinner david-FqsqvQoI3Ljby3iVrkZq2A@public.gmane.org -- To unsubscribe from this list: send the line "unsubscribe linux-man" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH] fsync_range, was: Re: munmap, msync: synchronization
  2014-04-24 1:34 ` Dave Chinner
@ 2014-04-25 6:06 ` Christoph Hellwig
  0 siblings, 0 replies; 17+ messages in thread
From: Christoph Hellwig @ 2014-04-25 6:06 UTC
To: Dave Chinner
Cc: Christoph Hellwig, Jamie Lokier, Michael Kerrisk (man-pages), Heinrich Schuchardt, linux-man-u79uwXL29TY76Z2rM5mHXA, Theodore T'so, Linux-Fsdevel, Miklos Szeredi

On Thu, Apr 24, 2014 at 11:34:35AM +1000, Dave Chinner wrote:
> Christoph, if this is going into the kernel, can you add support for
> xfs_io and write a couple of xfstests to test it? I'm not
> comfortable with adding new data integrity primitives to the kernel
> without having robust validation infrastructure already in place for
> it. It might also be worthwhile looking to extend Josef's
> fsync-tester.c to be able to use ranged fsyncs so to test all the
> various corner cases that we need to....

If we actually want to add it, it will obviously need test coverage.
Seems like I can't really get people excited enough to make this more
than a PoC so far, though.
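For readers following the fsync_range discussion above: the NetBSD call has the shape `fsync_range(int fd, int how, off_t start, off_t length)`, and the patch in this thread was described by its author as a proof of concept only. A minimal sketch of what a caller can do without such a primitive is shown below. The names `range_sync`, `range_sync_how`, and `range_sync_demo` are hypothetical and exist only for this illustration; the fallback simply syncs the whole file, which covers the requested range but may write back more than was asked for.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical flag names for this sketch; not a kernel or libc API. */
enum range_sync_how {
    RANGE_SYNC_DATA, /* data only, like fdatasync() */
    RANGE_SYNC_FILE  /* data and metadata, like fsync() */
};

/*
 * Hypothetical helper: request that [start, start + length) of fd reach
 * stable storage. Without a ranged primitive we sync the whole file,
 * which covers the range but may write more than was asked for.
 */
int range_sync(int fd, enum range_sync_how how, off_t start, off_t length)
{
    if (start < 0 || length < 0) {
        errno = EINVAL;
        return -1;
    }
    return how == RANGE_SYNC_DATA ? fdatasync(fd) : fsync(fd);
}

/* Self-check: write a few bytes to a temporary file and sync a sub-range. */
int range_sync_demo(void)
{
    char path[] = "/tmp/range_sync_XXXXXX";
    int fd = mkstemp(path);
    if (fd < 0)
        return -1;

    int rc = -1;
    if (write(fd, "hello", 5) == 5)
        rc = range_sync(fd, RANGE_SYNC_DATA, 0, 5);

    close(fd);
    unlink(path);
    return rc;
}
```

Keeping the offset and length in the signature means a caller written against this wrapper would not need to change if a real ranged call became available later; that is a design assumption of the sketch, not something the thread settles.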
* Re: munmap, msync: synchronization
  2014-04-21 18:14 ` Christoph Hellwig
  2014-04-21 19:54 ` Michael Kerrisk (man-pages)
@ 2014-04-23 14:03 ` Matthew Wilcox
  1 sibling, 0 replies; 17+ messages in thread
From: Matthew Wilcox @ 2014-04-23 14:03 UTC
To: Christoph Hellwig
Cc: Michael Kerrisk (man-pages), Heinrich Schuchardt, linux-man, Dave Chinner, Theodore T'so, Linux-Fsdevel, Miklos Szeredi, jamie

On Mon, Apr 21, 2014 at 11:14:31AM -0700, Christoph Hellwig wrote:
> > 6. On Linux, MS_ASYNC is also a no-op. That's fine on a unified
> >    cache system. Filesystem I/O always sees a consistent view,
> >    and MS_ASYNC never undertook to give a guarantee about *when*
> >    the update would occur. (The Linux buffer cache logic will
> >    ensure that it is flushed out sometime in the near future.)
>
> Right. It's a fairly inefficient noop, though - it actually loops
> over all vmas to do nothing with them.

This will probably change for Persistent Memory. The reason it works
today is that we have a page cache which tracks dirty bits and
periodically writes dirty pages to storage. If we bypass the page
cache, we have to ensure that everything does still eventually get
synced. I don't quite know how this is going to work yet ... I have
a number of ideas in my head. It probably won't be asynchronous though!

> > 7. On Linux (and probably many other modern systems), the only
> >    call that has any real use is msync(MS_SYNC), meaning
> >    "flush the buffers *now*, and I want to wait for that to
> >    complete, so that I can then continue safe in the knowledge
> >    that my data has landed on a device". That's useful if we
> >    want insurance for our data in the event of a system crash.
>
> Right. It's basically another way to call fsync, which is used to
> implement it underneath. It actually should be a ranged-fdatasync
> but right now it's implemented horribly inefficiently in that it
> does a fsync call for each vma that it encounters in the range
> specified.
See also:

From: Matthew Wilcox <matthew.r.wilcox@intel.com>
To: linux-mm@kvack.org, linux-fsdevel@vger.kernel.org
Cc: Matthew Wilcox <matthew.r.wilcox@intel.com>, willy@linux.intel.com
Subject: [PATCH] Sync only the requested range in msync
Date: Thu, 27 Mar 2014 19:02:41 -0400
Message-Id: <1395961361-21307-1-git-send-email-matthew.r.wilcox@intel.com>
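As an illustration of point 7 quoted above (msync(MS_SYNC) as the one call with real use), here is a minimal sketch of the pattern an application might follow: write through a shared mapping, wait for writeback with MS_SYNC, and only then unmap. The helper names `store_durably` and `read_back` are hypothetical, and the sketch makes no claim about how any particular filesystem or device handles the flush.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: store msg in the file at path through a shared
 * mapping and wait for writeback before unmapping. Returns 0 on success.
 */
int store_durably(const char *path, const char *msg)
{
    size_t len = strlen(msg);
    if (len == 0)
        return -1; /* mmap() rejects zero-length mappings */

    int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
    if (fd < 0)
        return -1;

    int rc = -1;
    if (ftruncate(fd, (off_t)len) == 0) {
        char *map = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (map != MAP_FAILED) {
            memcpy(map, msg, len);
            /* MS_SYNC blocks until the dirty pages have been written back. */
            rc = msync(map, len, MS_SYNC);
            if (munmap(map, len) != 0)
                rc = -1;
        }
    }
    if (close(fd) != 0)
        rc = -1;
    return rc;
}

/* Read the file back with read(2) to observe what the mapping wrote. */
ssize_t read_back(const char *path, char *buf, size_t cap)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;
    ssize_t n = read(fd, buf, cap);
    close(fd);
    return n;
}
```

On a unified-cache system the read() in `read_back` would see the mapped writes even without the msync() call; the MS_SYNC step is there for the crash-safety case described in the quoted text, not for visibility between the mapping and file I/O.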