Re: [PATCH 0/5] Volatile Ranges (v12) & LSF-MM discussion fodder

From: Minchan Kim <minchan@kernel.org>
To: Andrea Arcangeli <aarcange@redhat.com>
Cc: John Stultz <john.stultz@linaro.org>,
	Johannes Weiner <hannes@cmpxchg.org>,
	LKML <linux-kernel@vger.kernel.org>,
	Andrew Morton <akpm@linux-foundation.org>,
	Android Kernel Team <kernel-team@android.com>,
	Robert Love <rlove@google.com>, Mel Gorman <mel@csn.ul.ie>,
	Hugh Dickins <hughd@google.com>, Dave Hansen <dave@sr71.net>,
	Rik van Riel <riel@redhat.com>,
	Dmitry Adamushko <dmitry.adamushko@gmail.com>,
	Neil Brown <neilb@suse.de>, Mike Hommey <mh@glandium.org>,
	Taras Glek <tglek@mozilla.com>, Jan Kara <jack@suse.cz>,
	KOSAKI Motohiro <kosaki.motohiro@gmail.com>,
	Michel Lespinasse <walken@google.com>,
	"H. Peter Anvin" <hpa@zytor.com>,
	"linux-mm@kvack.org" <linux-mm@kvack.org>
Subject: Re: [PATCH 0/5] Volatile Ranges (v12) & LSF-MM discussion fodder
Date: Mon, 7 Apr 2014 15:11:01 +0900
Message-ID: <20140407061101.GE12144@bbox>
In-Reply-To: <20140402183113.GL1500@redhat.com>

Hello Andrea,

On Wed, Apr 02, 2014 at 08:31:13PM +0200, Andrea Arcangeli wrote:
> Hi everyone,
> 
> On Tue, Apr 01, 2014 at 09:03:57PM -0700, John Stultz wrote:
> > So between zero-fill and SIGBUS, I think SIGBUS makes the most sense. If
> > you have a third option you're thinking of, I'd of course be interested
> > in hearing it.
> 
> I actually thought the way of being notified with a page fault (sigbus
> or whatever) was the most efficient way of using volatile ranges.
> 
> Why have to call a syscall to know if you can still access the
> volatile range, if there was no VM pressure before the access?
> Syscalls are expensive, accessing the memory directly is not. Only if
> the page was actually missing and a page fault fired would you take
> the slowpath.

True.
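
A minimal sketch of that fast path/slow path split, under loud
assumptions: the vrange syscall is not available in mainline, so
mprotect(PROT_NONE) stands in for a purge and SIGSEGV for the proposed
SIGBUS. The helper names (install_fault_handler, read_cached,
fault_path_demo) are hypothetical. The point is only that the common
case is a plain load with no syscall, and the handler runs only when
the page is really gone.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <setjmp.h>
#include <signal.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

static sigjmp_buf fault_env;

/* Slow path entry: the access faulted, jump back into read_cached(). */
static void on_fault(int sig)
{
    (void)sig;
    siglongjmp(fault_env, 1);
}

/* Install the handler once. SA_NODEFER keeps the signal unblocked, so
 * read_cached() can use sigsetjmp(env, 0) and stay syscall-free. */
static int install_fault_handler(void)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = on_fault;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = SA_NODEFER;
    if (sigaction(SIGSEGV, &sa, NULL) < 0)
        return -1;
    return sigaction(SIGBUS, &sa, NULL);
}

/* Read one byte directly. Returns 0 on the fast path, -1 if it faulted. */
static int read_cached(const volatile unsigned char *p, unsigned char *out)
{
    assert(out != NULL);
    if (sigsetjmp(fault_env, 0) == 0) {
        *out = *p;      /* plain memory access, no syscall */
        return 0;
    }
    return -1;          /* page was missing: caller must regenerate */
}

/* Hypothetical demo: returns 0 if both paths behave as described. */
static int fault_path_demo(void)
{
    size_t len = (size_t)sysconf(_SC_PAGESIZE);
    unsigned char byte = 0;
    unsigned char *cache = mmap(NULL, len, PROT_READ | PROT_WRITE,
                                MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

    if (cache == MAP_FAILED || install_fault_handler() < 0)
        return -1;
    cache[0] = 42;
    if (read_cached(cache, &byte) != 0 || byte != 42)
        return -1;                          /* fast path must succeed */
    if (mprotect(cache, len, PROT_NONE) < 0)
        return -1;                          /* stand-in for a purge */
    if (read_cached(cache, &byte) != -1)
        return -1;                          /* access must fault */
    if (mprotect(cache, len, PROT_READ | PROT_WRITE) < 0)
        return -1;
    cache[0] = 42;                          /* "regenerate" the data */
    if (read_cached(cache, &byte) != 0 || byte != 42)
        return -1;
    munmap(cache, len);
    return 0;
}
```

Calling fault_path_demo() from main() exercises both paths. A real
user would regenerate the data in the slow path rather than rewrite a
single byte, and a fault outside read_cached() would need separate
handling.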

> 
> The usages I see for this are plenty, like for maintaining caches in
> memory that may be big and would be nice to discard if there's VM
> pressure, jpeg uncompressed images sounds like a candidate too. So the
> browser size would shrink if there's VM pressure, instead of ending up
> swapping out uncompressed image data that can be regenerated more
> quickly with the CPU than with swapins.

That's really the typical case vrange is targeting.

> 
> > Now... once you've chosen SIGBUS semantics, there will be folks who will
> > try to exploit the fact that we get SIGBUS on purged page access (at
> > least on the user-space side) and will try to access pages that are
> > volatile until they are purged and try to then handle the SIGBUS to fix
> > things up. Those folks exploiting that will have to be particularly
> > careful not to pass volatile data to the kernel, and if they do they'll
> > have to be smart enough to handle the EFAULT, etc. That's really all
> > their problem, because they're being clever. :)
> 
> I'm actually working on a feature that would solve the problem for the
> syscalls accessing missing volatile pages. So you'd never see a
> -EFAULT, because syscalls won't return it even if they encounter a
> missing page in the volatile range dropped by the VM pressure.
> 
> It's called userfaultfd. You call sys_userfaultfd(flags) and it
> connects the current mm to a pseudo filedescriptor. The filedescriptor
> works similarly to eventfd but with a different protocol.
> 
> You need a thread that will never access the userfault area with the
> CPU, that is responsible for polling the userfaultfd and talking the
> userfaultfd protocol to fill in missing pages. The userfault thread
> after a POLLIN event reads the virtual addresses of the fault that
> must have happened on some other thread of the same mm, and then
> writes back a "handled" virtual range into the fd, after the page (or
> pages if multiple) have been regenerated and mapped in with
> sys_remap_anon_pages(), mremap or equivalent atomic pagetable page
> swapping. Then depending on the "solved" range written back into the
> fd, the kernel will wake up the thread or threads that were waiting in
> kernel mode on the "handled" virtual range, and retry the fault
> without ever exiting kernel mode.

Sounds flexible.
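
A rough sketch of that poll/read/fill loop, with loud assumptions: it
uses the ioctl-based userfaultfd interface that later appeared in
mainline Linux (UFFDIO_API, UFFDIO_REGISTER, UFFDIO_COPY), not the
write-back protocol or sys_remap_anon_pages() described in the quoted
proposal. The names uffd_ctx, userfault_thread and userfault_demo are
hypothetical, and the memset of 'R' stands in for regenerating real
data.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <fcntl.h>
#include <linux/userfaultfd.h>
#include <poll.h>
#include <pthread.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef UFFD_USER_MODE_ONLY
#define UFFD_USER_MODE_ONLY 1
#endif

struct uffd_ctx {
    int uffd;       /* userfaultfd descriptor */
    size_t page;    /* page size */
    char *fresh;    /* regenerated data, outside the registered area */
};

/* Userfault thread: never touches the registered area itself. */
static void *userfault_thread(void *arg)
{
    struct uffd_ctx *ctx = arg;
    struct pollfd pfd = { .fd = ctx->uffd, .events = POLLIN };
    struct uffd_msg msg;
    struct uffdio_copy copy;

    assert(ctx != NULL);
    if (poll(&pfd, 1, -1) == 1 &&
        read(ctx->uffd, &msg, sizeof(msg)) == (ssize_t)sizeof(msg) &&
        msg.event == UFFD_EVENT_PAGEFAULT) {
        memset(&copy, 0, sizeof(copy));
        copy.dst = msg.arg.pagefault.address & ~((__u64)ctx->page - 1);
        copy.src = (unsigned long)ctx->fresh;
        copy.len = ctx->page;
        ioctl(ctx->uffd, UFFDIO_COPY, &copy);   /* fill page, wake faulter */
    }
    close(ctx->uffd);   /* on any failure this also releases waiters */
    return NULL;
}

/* Returns 0 on success, 1 if userfaultfd is unavailable, -1 on error. */
static int userfault_demo(void)
{
    struct uffd_ctx ctx;
    struct uffdio_api api = { .api = UFFD_API };
    struct uffdio_register reg;
    pthread_t tid;
    char *area;
    int ok;

    ctx.page = (size_t)sysconf(_SC_PAGESIZE);
    ctx.uffd = (int)syscall(SYS_userfaultfd,
                            O_CLOEXEC | O_NONBLOCK | UFFD_USER_MODE_ONLY);
    if (ctx.uffd < 0)   /* older kernels reject the extra flag */
        ctx.uffd = (int)syscall(SYS_userfaultfd, O_CLOEXEC | O_NONBLOCK);
    if (ctx.uffd < 0)
        return 1;

    /* Page 0 is the userfault area, page 1 holds the regenerated data. */
    area = mmap(NULL, 2 * ctx.page, PROT_READ | PROT_WRITE,
                MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (area == MAP_FAILED) {
        close(ctx.uffd);
        return -1;
    }
    ctx.fresh = area + ctx.page;
    memset(ctx.fresh, 'R', ctx.page);

    memset(&reg, 0, sizeof(reg));
    reg.range.start = (unsigned long)area;
    reg.range.len = ctx.page;
    reg.mode = UFFDIO_REGISTER_MODE_MISSING;
    if (ioctl(ctx.uffd, UFFDIO_API, &api) < 0 ||
        ioctl(ctx.uffd, UFFDIO_REGISTER, &reg) < 0) {
        close(ctx.uffd);
        munmap(area, 2 * ctx.page);
        return 1;
    }
    if (pthread_create(&tid, NULL, userfault_thread, &ctx) != 0) {
        close(ctx.uffd);
        munmap(area, 2 * ctx.page);
        return -1;
    }
    ok = (*(volatile char *)area == 'R');   /* blocks until page is filled */
    pthread_join(tid, NULL);
    munmap(area, 2 * ctx.page);
    return ok ? 0 : -1;
}
```

Build with -pthread. The faulting load in userfault_demo() sleeps in
the kernel until the userfault thread supplies the page, which is the
"retry the fault without ever exiting kernel mode" behaviour the
quoted text describes. Many systems restrict unprivileged userfaultfd,
hence the "unavailable" return value.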

> 
> We need this in KVM for running the guest on memory that is on other
> nodes or other processes (postcopy live migration is the most common
> use case but there are others like memory externalization and
> cross-node KSM in the cloud, to keep a single copy of memory across
> multiple nodes and externalized to the VM and to the host node).
> 
> This thread made me wonder if we could mix the two features and you
> would then depend on MADV_USERFAULT and userfaultfd to deliver to
> userland the "faults" happening on the volatile pages that have been
> purged as a result of VM pressure.
> 
> I'm just saying this after Johannes mentioned the issue with syscalls
> returning -EFAULT. Because that is the very issue that the userfaultfd
> is going to solve for the KVM migration thread.
> 
> What I'm thinking now would be to mark the volatile range also
> MADV_USERFAULT and then calling userfaultfd and instead of having the
> cache regeneration "slow path" inside the SIGBUS handler, to run it in
> the userfault thread that polls the userfaultfd. Then you could write
> the volatile ranges to disk with a write() syscall (or use any other
> syscall on the volatile ranges), without having to worry about -EFAULT
> being returned because one page was discarded. And if MADV_USERFAULT
> is not called in combination with vrange syscalls, then it'd still
> work without the userfault, but with the vrange syscalls only.
> 
> In short the idea would be to let the userfault code solve the fault
> delivery to userland for you, and make the vrange syscalls only focus
> on the page purging problem, without having to worry about what
> happens when something accesses a missing page.
> 
> But if you don't intend to solve the syscall -EFAULT problem, well
> then probably the overlap is still as thin as I thought it was before
> (like also mentioned in the below link).

Sounds doable. I will look into your patch.
Thanks for the reminder!

> 
> Thanks,
> Andrea
> 
> PS. my last email about this from a more KVM centric point of view:
> 
> http://www.spinics.net/lists/kvm/msg101449.html

-- 
Kind regards,
Minchan Kim
