From: "Michael Kerrisk (man-pages)" <mtk.manpages@gmail.com>
To: Christoph Hellwig <hch@infradead.org>
Cc: mtk.manpages@gmail.com, Heinrich Schuchardt <xypron.glpk@gmx.de>,
linux-man@vger.kernel.org, Dave Chinner <david@fromorbit.com>,
Theodore T'so <tytso@mit.edu>,
Linux-Fsdevel <linux-fsdevel@vger.kernel.org>,
Miklos Szeredi <miklos@szeredi.hu>,
jamie@shareable.org
Subject: Re: munmap, msync: synchronization
Date: Mon, 21 Apr 2014 21:54:16 +0200
Message-ID: <53557768.5070905@gmail.com>
In-Reply-To: <20140421181431.GA17125@infradead.org>
Christoph,
On 04/21/2014 08:14 PM, Christoph Hellwig wrote:
> On Mon, Apr 21, 2014 at 12:16:46PM +0200, Michael Kerrisk (man-pages) wrote:
>> 1. In the bad old days (even on Linux, AFAIK, but that was in days
>> before I looked closely at what goes on), the page cache and
>> the buffer cache were not unified. That meant that a page from
>> a file might both be in the buffer cache (because of file I/O
>> syscalls) and in the page cache (because of mmap()).
>
> Correct.
>
>> 2. In a non-unified cache system, pages can naturally get out of
>> synch in the two locations. Before it had a unified cache, Linux
>> used to jump through some hoops to ensure that contents in the two
>> locations remained consistent.
>
> Yeah.
>
>> 3. Nowadays Linux--like most (all?) UNIX systems--has a
>> unified cache: file I/O, mmap(), and the paging system all
>> use the same cache. If a file is mmap()-ed and also subject
>> to file I/O, there will be only one copy of each file page
>> in the cache. Ergo, the inconsistency problem goes away.
>
> Mostly true, except for FreeBSD and Solaris when they use ZFS, which has
> its own file cache that is not coherent with the VM cache at the
> implementation level. Not sure how much of this leaks to userspace,
> though.
Thanks for that detail.
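To make point 3 concrete, here is a minimal sketch (not from the original thread) of what a unified cache means for userspace: a store through a MAP_SHARED mapping should be visible to a subsequent pread() on the same file without any msync() call. The function name, the scratch path under /tmp, and the 128-byte limit are assumptions chosen for illustration only.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Hypothetical helper: store msg through a shared file mapping, then read
 * the same bytes back with pread(), without calling msync() in between.
 * Returns 1 if pread() sees the mapped store, 0 if it does not,
 * and -1 if any setup step fails.
 */
int mapped_write_visible_to_read(const char *msg)
{
    assert(msg != NULL);

    char path[] = "/tmp/unified_cache_XXXXXX";  /* assumed scratch location */
    char buf[128];
    size_t len = strlen(msg);
    long page = sysconf(_SC_PAGESIZE);
    int result = -1;

    if (len == 0 || len > sizeof buf || page < (long)sizeof buf)
        return -1;

    int fd = mkstemp(path);
    if (fd < 0)
        return -1;
    unlink(path);                               /* file lives until close() */

    if (ftruncate(fd, (off_t)page) == 0) {
        char *map = mmap(NULL, (size_t)page, PROT_READ | PROT_WRITE,
                         MAP_SHARED, fd, 0);
        if (map != MAP_FAILED) {
            memcpy(map, msg, len);              /* store via the mapping */
            if (pread(fd, buf, len, 0) == (ssize_t)len)
                result = (memcmp(buf, msg, len) == 0);
            munmap(map, (size_t)page);
        }
    }
    close(fd);
    return result;
}
```

On a system with a unified cache, as described above for Linux, a call such as mapped_write_visible_to_read("hello") would be expected to return 1. This says nothing about when the data reaches the device, only about what file I/O sees.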
>> 4. IIUC, the pieces like msync(MS_ASYNC) and msync(MS_INVALIDATE)
>> exist only because of the bad old non-unified cache days.
>> MS_INVALIDATE was a way of saying: make sure that writes
>> to the file by other processes are visible in this mapping.
>> msync() without the MS_INVALIDATE flag was a way of saying:
>> make sure that read()s from the file see the changes made
>> via this mapping. Using either MS_SYNC or MS_ASYNC
>> was the way of saying: "I either want to wait until the file
>> updates have been completed", or "please start the updates
>> now, but I don't want to wait until they're completed".
>
> Right.
>
>> 5. On systems with a unified cache, msync(MS_INVALIDATE)
>> is a no-op. (That is so on Linux.)
>
> Almost. It returns EBUSY if it hits any mlock()ed region. Don't ask me
> why, though..
Ahhh yes, I was aware of that detail, but overlooked it in the point
above.
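The EBUSY detail can be probed with a small sketch like the one below (not from the original thread). It locks one page of a shared file mapping with mlock() and then calls msync(MS_INVALIDATE) on it. The function name and the /tmp scratch path are assumptions; the result depends on the kernel and on whether mlock() is permitted, so the sketch reports what happened rather than assuming an outcome.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Hypothetical probe: returns -1 if setup (including mlock()) failed,
 * 0 if msync(MS_INVALIDATE) succeeded on the locked page, and otherwise
 * the errno value that msync() reported.
 */
int invalidate_locked_page(void)
{
    char path[] = "/tmp/msync_inval_XXXXXX";    /* assumed scratch location */
    long page = sysconf(_SC_PAGESIZE);
    int result = -1;

    if (page <= 0)
        return -1;

    int fd = mkstemp(path);
    if (fd < 0)
        return -1;
    unlink(path);

    if (ftruncate(fd, (off_t)page) != 0) {
        close(fd);
        return -1;
    }

    void *p = mmap(NULL, (size_t)page, PROT_READ | PROT_WRITE,
                   MAP_SHARED, fd, 0);
    close(fd);                                  /* mapping keeps the file alive */
    if (p == MAP_FAILED)
        return -1;

    if (mlock(p, (size_t)page) == 0) {          /* may fail under RLIMIT_MEMLOCK */
        result = (msync(p, (size_t)page, MS_INVALIDATE) == 0) ? 0 : errno;
        munlock(p, (size_t)page);
    }
    munmap(p, (size_t)page);
    return result;
}
```

If the behavior described above holds on the running kernel, the return value would be EBUSY; a return of -1 only means the probe could not lock the page.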
>> 6. On Linux, MS_ASYNC is also a no-op. That's fine on a unified
>> cache system. Filesystem I/O always sees a consistent view,
>> and MS_ASYNC never undertook to give a guarantee about *when*
>> the update would occur. (The Linux buffer cache logic will
>> ensure that it is flushed out sometime in the near future.)
>
> Right. It's a fairly inefficient noop, though - it actually loops
> over all vmas to do nothing with them.
>
>> 7. On Linux (and probably many other modern systems), the only
>> call that has any real use is msync(MS_SYNC), meaning
>> "flush the buffers *now*, and I want to wait for that to
>> complete, so that I can then continue safe in the knowledge
>> that my data has landed on a device". That's useful if we
>> want insurance for our data in the event of a system crash.
>
> Right. It's basically another way to call fsync, which is used to
> implement it underneath. It actually should be a ranged-fdatasync
> but right now it's implemented horribly inefficiently in that it
> does a fsync call for each vma that it encounters in the range
> specified.
>
>> 8. POSIX makes no mandate for a unified cache system. Thus,
>> we have MS_ASYNC and MS_INVALIDATE in the standard, and
>> the standard says nothing (AFAIK) about whether munmap()
>> will flush data. On Linux (and probably most modern systems),
>> we're fine, but portable applications that care about
>> standards and nonunified caches need to use msync().
>>
>> My advice: To ensure that the contents of a shared file
>> mapping are written to the underlying file--even on bad old
>> implementations--a call to msync() should be made before
>> unmapping a mapping with munmap().
>
> Agreed.
Thanks for checking all of this over and thanks also
for confirming that I learned my lessons well in the
"Jamie Lokier school of tough technical reviewing" ;-).
>> 9. The mmap() man page says this:
>>
>> MAP_SHARED
>> Share this mapping. Updates to the mapping are visible
>> to other processes that map this file, and are
>> carried through to the underlying file. The file
>> may not actually be updated until msync(2) or
>> munmap() is called.
>>
>> I believe the piece "or munmap()" is misleading. It implies
>> that munmap() must trigger a sync action. I don't think this
>> is true. All that it is required to do is remove some range
>> of pages from the process's virtual address space. I'm
>> inclined to remove those words, but I'd like to see if any
>> FS person has a correction to my understanding first.
>
> I would expect non-coherent systems to update their caches on munmap,
> but POSIX does not seem to require this, and I can't find any language
> towards that in the HP-UX man page, which was a system that I remember
> as non-coherent until the end.
Yes, that's how I read it too. POSIX seems to have no requirements here,
so I assume it was catering to the lowest common denominator.
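The advice in point 8, together with the observation that POSIX does not require munmap() to flush anything, can be sketched as follows (an illustration, not from the original thread). The helper names flush_and_unmap() and write_via_mapping() are hypothetical, and the sketch assumes the file at path already exists and may be resized.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Flush a shared file mapping synchronously, then unmap it.
 * The mapping is unmapped even if msync() fails.
 * Returns 0 if both calls succeed, -1 otherwise.
 */
int flush_and_unmap(void *addr, size_t len)
{
    int rc = msync(addr, len, MS_SYNC);         /* wait until data is written */
    if (munmap(addr, len) != 0)
        rc = -1;
    return rc;
}

/*
 * Hypothetical usage: replace the contents of an existing file with msg
 * by writing through a MAP_SHARED mapping.
 * Returns 0 on success, -1 on failure.
 */
int write_via_mapping(const char *path, const char *msg)
{
    assert(path != NULL && msg != NULL);

    size_t len = strlen(msg);
    int rc = -1;

    if (len == 0)
        return -1;

    int fd = open(path, O_RDWR);
    if (fd < 0)
        return -1;

    if (ftruncate(fd, (off_t)len) == 0) {
        char *map = mmap(NULL, len, PROT_READ | PROT_WRITE,
                         MAP_SHARED, fd, 0);
        if (map != MAP_FAILED) {
            memcpy(map, msg, len);
            rc = flush_and_unmap(map, len);     /* msync() before munmap() */
        }
    }
    close(fd);
    return rc;
}
```

On Linux the explicit msync(MS_SYNC) mainly buys durability against a crash; on an implementation without a unified cache it would also be what makes the update visible to file I/O.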
Cheers,
Michael
--
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/