From: Minchan Kim <minchan@kernel.org>
To: Johannes Weiner <hannes@cmpxchg.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>,
	John Stultz <john.stultz@linaro.org>,
	LKML <linux-kernel@vger.kernel.org>,
	Andrew Morton <akpm@linux-foundation.org>,
	Android Kernel Team <kernel-team@android.com>,
	Robert Love <rlove@google.com>, Mel Gorman <mel@csn.ul.ie>,
	Hugh Dickins <hughd@google.com>, Dave Hansen <dave@sr71.net>,
	Rik van Riel <riel@redhat.com>,
	Dmitry Adamushko <dmitry.adamushko@gmail.com>,
	Neil Brown <neilb@suse.de>, Mike Hommey <mh@glandium.org>,
	Taras Glek <tglek@mozilla.com>, Jan Kara <jack@suse.cz>,
	KOSAKI Motohiro <kosaki.motohiro@gmail.com>,
	Michel Lespinasse <walken@google.com>,
	"H. Peter Anvin" <hpa@zytor.com>,
	"linux-mm@kvack.org" <linux-mm@kvack.org>
Subject: Re: [PATCH 0/5] Volatile Ranges (v12) & LSF-MM discussion fodder
Date: Mon, 7 Apr 2014 15:19:32 +0900
Message-ID: <20140407061932.GF12144@bbox>
In-Reply-To: <20140402192744.GU14688@cmpxchg.org>

On Wed, Apr 02, 2014 at 03:27:44PM -0400, Johannes Weiner wrote:
> On Wed, Apr 02, 2014 at 08:31:13PM +0200, Andrea Arcangeli wrote:
> > Hi everyone,
> > 
> > On Tue, Apr 01, 2014 at 09:03:57PM -0700, John Stultz wrote:
> > > So between zero-fill and SIGBUS, I think SIGBUS makes the most sense. If
> > > you have a third option you're thinking of, I'd of course be interested
> > > in hearing it.
> > 
> > I actually thought the way of being notified with a page fault (sigbus
> > or whatever) was the most efficient way of using volatile ranges.
> > 
> > Why should you have to call a syscall to know if you can still access
> > the volatile range, if there was no VM pressure before the access?
> > Syscalls are expensive; accessing the memory directly is not. Only if
> > the page was actually missing and a page fault fired would you take
> > the slowpath.
> 
> Not everybody wants to actually come back for the data in the range,
> allocators and message passing applications just want to be able to
> reuse the memory mapping.
> 
> By tying the volatility to the dirty bit in the page tables, an
> allocator could simply clear those bits once on free().  When malloc()
> hands out this region again, the user is expected to write, which will
> either overwrite the old page, or, if it was purged, fault in a fresh
> zero page.  But there is no second syscall needed to clear volatility.
> 
> > > Now... once you've chosen SIGBUS semantics, there will be folks who will
> > > try to exploit the fact that we get SIGBUS on purged page access (at
> > > least on the user-space side) and will try to access pages that are
> > > volatile until they are purged and try to then handle the SIGBUS to fix
> > > things up. Those folks exploiting that will have to be particularly
> > > careful not to pass volatile data to the kernel, and if they do they'll
> > > have to be smart enough to handle the EFAULT, etc. That's really all
> > > their problem, because they're being clever. :)
> > 
> > I'm actually working on a feature that would solve the problem for
> > syscalls accessing missing volatile pages. You'd never see
> > -EFAULT, because syscalls won't return it even if they encounter a
> > missing page in the volatile range that was dropped under VM pressure.
> > 
> > It's called userfaultfd. You call sys_userfaultfd(flags) and it
> > connects the current mm to a pseudo filedescriptor. The filedescriptor
> > works similarly to eventfd but with a different protocol.
> > 
> > You need a thread that will never access the userfault area with the
> > CPU, and that is responsible for polling the userfaultfd and speaking
> > the userfaultfd protocol to fill in missing pages. After a POLLIN
> > event, the userfault thread reads the virtual addresses of the fault
> > that must have happened on some other thread of the same mm, and then
> > writes back a "handled" virtual range into the fd, after the page (or
> > pages if multiple) have been regenerated and mapped in with
> > sys_remap_anon_pages(), mremap or equivalent atomic pagetable page
> > swapping. Then, depending on the "handled" range written back into the
> > fd, the kernel will wake up the thread or threads that were waiting in
> > kernel mode on that virtual range, and retry the fault
> > without ever exiting kernel mode.
> > 
> > We need this in KVM for running the guest on memory that is on other
> > nodes or other processes (postcopy live migration is the most common
> > use case but there are others like memory externalization and
> > cross-node KSM in the cloud, to keep a single copy of memory across
> > multiple nodes and externalized to the VM and to the host node).
> > 
> > This thread made me wonder if we could mix the two features and you
> > would then depend on MADV_USERFAULT and userfaultfd to deliver to
> > userland the "faults" happening on the volatile pages that have been
> > purged as result of VM pressure.
> > 
> > I'm just saying this after Johannes mentioned the issue with syscalls
> > returning -EFAULT. Because that is the very issue that the userfaultfd
> > is going to solve for the KVM migration thread.
> > 
> > What I'm thinking now would be to mark the volatile range also
> > MADV_USERFAULT and then calling userfaultfd and instead of having the
> > cache regeneration "slow path" inside the SIGBUS handler, to run it in
> > the userfault thread that polls the userfaultfd. Then you could write
> > the volatile ranges to disk with a write() syscall (or use any other
> > syscall on the volatile ranges), without having to worry about -EFAULT
> > being returned because one page was discarded. And if MADV_USERFAULT
> > is not called in combination with vrange syscalls, then it'd still
> > work without the userfault, but with the vrange syscalls only.
> > 
> > In short, the idea would be to let the userfault code solve the fault
> > delivery to userland for you, and make the vrange syscalls focus only
> > on the page purging problem, without having to worry about what
> > happens when something accesses a missing page.
> 
> Yes, the two seem certainly combinable to me.
> 
> madvise(MADV_FREE | MADV_USERFAULT) to allow purging and userspace
> fault handling.  In the fault slowpath, you can then regenerate any
> missing data and do MADV_FREE again if it should remain volatile.  And
> again, any actual writes to the region would clear volatility because
> now the cache copy changed and discarding it would mean losing state.

There is another scenario that the above can't cover.
Someone might want to keep a range volatile until they explicitly unmark
it, so they can generate cache pages in that range freely without any
further syscalls.

I mean the above suggestion covers pages that were already mapped
when the syscall was called, but it can't cover pages that are faulted in
later, so I think the vrange syscall is still needed.
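
To make the distinction concrete, here is a minimal sketch of the
allocator pattern discussed above. It assumes a kernel and libc that
provide MADV_FREE for private anonymous mappings (in this thread the flag
is only a proposal), and the helper names cache_map(), cache_release()
and cache_reuse() are hypothetical, chosen for illustration only. The
madvise() call affects only pages that are mapped at the time of the
call, so pages faulted in afterwards are not purgeable until the call is
repeated, which is the gap a range-based vrange syscall would close.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#ifndef MADV_FREE
#define MADV_FREE 8 /* Linux value; assumed when libc headers predate it */
#endif

/* Hypothetical helper: map a private anonymous region to use as a cache. */
static void *cache_map(size_t len)
{
	void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	return p == MAP_FAILED ? NULL : p;
}

/*
 * Hypothetical helper: tell the kernel that the pages currently mapped in
 * [p, p + len) may be discarded under memory pressure. Only pages present
 * at the time of the call are affected. Returns 0 on success and -1 if the
 * kernel rejects the advice (for example because MADV_FREE is unsupported).
 */
static int cache_release(void *p, size_t len)
{
	return madvise(p, len, MADV_FREE) == 0 ? 0 : -1;
}

/*
 * Hypothetical helper: reuse the region by writing to it. The write either
 * lands on the old page or, if that page was purged, faults in a fresh
 * zero page. Either way the page is dirty again, so no second syscall is
 * needed to make it non-volatile. Reading before writing is not safe: the
 * caller may see the old contents or zeroes.
 */
static void cache_reuse(void *p, size_t len, unsigned char fill)
{
	assert(p != NULL);
	memset(p, fill, len);
}
```

A caller would map the region once with cache_map(), call
cache_release() from its free() path, and simply write through
cache_reuse() (or any ordinary store) when the region is handed out
again.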

> 
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

-- 
Kind regards,
Minchan Kim


