* [RFC PATCH 0/2] vmpslice support for zero-copy gifting of pages
@ 2013-07-25 17:21 Robert Jennings
  2013-07-25 17:21 ` [RFC PATCH 1/2] vmsplice unmap gifted pages for recipient Robert Jennings
  2013-07-25 17:21 ` [RFC PATCH 2/2] Add limited zero copy to vmsplice Robert Jennings
  0 siblings, 2 replies; 6+ messages in thread
From: Robert Jennings @ 2013-07-25 17:21 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-fsdevel, linux-mm, Alexander Viro, Rik van Riel,
	Andrea Arcangeli, Dave Hansen, Robert Jennings, Matt Helsley,
	Anthony Liguori, Michael Roth, Lei Li, Leonardo Garcia

This patch set would add the ability to move anonymous user pages from one
process to another through vmsplice without copying data.  Moving pages
rather than copying is implemented for a narrow case in this RFC to meet
the needs of QEMU's usage (below).

Among the restrictions: the source and destination addresses must be
page aligned, the size argument must be a multiple of the page size, and
by the time the reader calls vmsplice, the page must no longer be mapped
in the source.  If a move is not possible, the code transparently falls
back to copying data.

This comes from work in QEMU[1] to migrate a VM from one QEMU instance
to another with minimal down-time for the VM.  This would allow for an
update of the QEMU executable under the VM.

New flag usage

This introduces use of the SPLICE_F_MOVE flag for vmsplice, previously
unused.  Proposed usage is as follows:

 Writer gifts pages to pipe, cannot access original contents after gift:
    vmsplice(fd, iov, nr_segs, SPLICE_F_GIFT | SPLICE_F_MOVE);
 Reader asks kernel to move pages from pipe to memory described by iovec:
    vmsplice(fd, iov, nr_segs, SPLICE_F_MOVE);

For older kernels, SPLICE_F_MOVE would be ignored and a copy would occur.

[1] QEMU localhost live migration:
http://lists.gnu.org/archive/html/qemu-devel/2013-06/msg02540.html
http://lists.gnu.org/archive/html/qemu-devel/2013-06/msg02577.html
_______________________________________________________

  RFC: vmsplice unmap gifted pages for recipient
  RFC: Add limited zero copy to vmsplice

 fs/splice.c            | 88 +++++++++++++++++++++++++++++++++++++++++++++++++-
 include/linux/splice.h |  1 +
 2 files changed, 88 insertions(+), 1 deletion(-)

-- 
1.8.1.2

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in the body
to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 6+ messages in thread
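
The writer-side usage proposed in the cover letter could be sketched from user space roughly as follows. This is only an illustration under stated assumptions: iov_is_move_candidate() and gift_pages() are hypothetical helper names, not part of the patch set, and the check covers only the alignment and size restrictions listed above. Whether the kernel actually moves a page is decided in the kernel, and on kernels without this RFC the SPLICE_F_MOVE flag is expected to be ignored by vmsplice.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>      /* vmsplice(), SPLICE_F_GIFT, SPLICE_F_MOVE */
#include <stdint.h>
#include <sys/types.h>
#include <sys/uio.h>    /* struct iovec */
#include <unistd.h>     /* sysconf() */

/*
 * Hypothetical helper (not from the patch set): returns 1 when an iovec
 * meets the alignment restrictions named in the cover letter, else 0.
 * page_size must be a non-zero power of two.
 */
int iov_is_move_candidate(const struct iovec *iov, size_t page_size)
{
	uintptr_t base = (uintptr_t)iov->iov_base;

	assert(page_size && !(page_size & (page_size - 1)));

	if (iov->iov_len == 0)
		return 0;
	if (base & (page_size - 1))          /* start is not page aligned */
		return 0;
	if (iov->iov_len & (page_size - 1))  /* length is not whole pages */
		return 0;
	return 1;
}

/*
 * Hypothetical writer-side wrapper: gift the pages and request a move
 * only if every segment qualifies; otherwise request an ordinary copy.
 * After a successful gift the caller must not rely on the old contents.
 */
ssize_t gift_pages(int pipe_wr_fd, const struct iovec *iov, size_t nr_segs)
{
	size_t page_size = (size_t)sysconf(_SC_PAGESIZE);
	unsigned int flags = SPLICE_F_GIFT | SPLICE_F_MOVE;
	size_t i;

	for (i = 0; i < nr_segs; i++)
		if (!iov_is_move_candidate(&iov[i], page_size))
			flags = 0;

	return vmsplice(pipe_wr_fd, iov, nr_segs, flags);
}
```

The reader side would pass SPLICE_F_MOVE alone, with a page-aligned destination iovec, as described in the cover letter.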
* [RFC PATCH 1/2] vmsplice unmap gifted pages for recipient
  2013-07-25 17:21 [RFC PATCH 0/2] vmpslice support for zero-copy gifting of pages Robert Jennings
@ 2013-07-25 17:21 ` Robert Jennings
  2013-07-25 17:30   ` Dave Hansen
  2013-07-25 17:21 ` [RFC PATCH 2/2] Add limited zero copy to vmsplice Robert Jennings
  1 sibling, 1 reply; 6+ messages in thread
From: Robert Jennings @ 2013-07-25 17:21 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-fsdevel, linux-mm, Alexander Viro, Rik van Riel,
	Andrea Arcangeli, Dave Hansen, Robert Jennings, Matt Helsley,
	Anthony Liguori, Michael Roth, Lei Li, Leonardo Garcia

From: Matt Helsley <matthltc@us.ibm.com>

Introduce use of the unused SPLICE_F_MOVE flag for vmsplice to zap
pages.

When vmsplice is called with flags (SPLICE_F_GIFT | SPLICE_F_MOVE) the
writer's gift'ed pages would be zapped.  This patch supports further work
to move vmsplice'd pages rather than copying them.  That patch has the
restriction that the page must not be mapped by the source for the move,
otherwise it will fall back to copying the page.

Signed-off-by: Matt Helsley <matthltc@us.ibm.com>
Signed-off-by: Matt Helsley <matt.helsley@gmail.com>
Signed-off-by: Robert Jennings <rcj@linux.vnet.ibm.com>
---
 fs/splice.c            | 25 ++++++++++++++++++++++++-
 include/linux/splice.h |  1 +
 2 files changed, 25 insertions(+), 1 deletion(-)

diff --git a/fs/splice.c b/fs/splice.c
index 3b7ee65..6aa964f 100644
--- a/fs/splice.c
+++ b/fs/splice.c
@@ -172,6 +172,18 @@ static void wakeup_pipe_readers(struct pipe_inode_info *pipe)
 	kill_fasync(&pipe->fasync_readers, SIGIO, POLL_IN);
 }
 
+static void zap_buf_page(unsigned long useraddr)
+{
+	struct vm_area_struct *vma;
+
+	down_read(&current->mm->mmap_sem);
+	vma = find_vma_intersection(current->mm, useraddr,
+			useraddr + PAGE_SIZE);
+	if (!IS_ERR_OR_NULL(vma))
+		zap_page_range(vma, useraddr, PAGE_SIZE, NULL);
+	up_read(&current->mm->mmap_sem);
+}
+
 /**
  * splice_to_pipe - fill passed data into a pipe
  * @pipe:	pipe to fill
@@ -212,8 +224,16 @@ ssize_t splice_to_pipe(struct pipe_inode_info *pipe,
 			buf->len = spd->partial[page_nr].len;
 			buf->private = spd->partial[page_nr].private;
 			buf->ops = spd->ops;
-			if (spd->flags & SPLICE_F_GIFT)
+			if (spd->flags & SPLICE_F_GIFT) {
+				unsigned long useraddr =
+						spd->partial[page_nr].useraddr;
+
+				if ((spd->flags & SPLICE_F_MOVE) &&
+						!buf->offset && (buf->len == PAGE_SIZE))
+					/* Can move page aligned buf */
+					zap_buf_page(useraddr);
 				buf->flags |= PIPE_BUF_FLAG_GIFT;
+			}
 
 			pipe->nrbufs++;
 			page_nr++;
@@ -485,6 +505,7 @@ fill_it:
 
 		spd.partial[page_nr].offset = loff;
 		spd.partial[page_nr].len = this_len;
+		spd.partial[page_nr].useraddr = index << PAGE_CACHE_SHIFT;
 		len -= this_len;
 		loff = 0;
 		spd.nr_pages++;
@@ -656,6 +677,7 @@ ssize_t default_file_splice_read(struct file *in, loff_t *ppos,
 		this_len = min_t(size_t, vec[i].iov_len, res);
 		spd.partial[i].offset = 0;
 		spd.partial[i].len = this_len;
+		spd.partial[i].useraddr = (unsigned long)vec[i].iov_base;
 		if (!this_len) {
 			__free_page(spd.pages[i]);
 			spd.pages[i] = NULL;
@@ -1475,6 +1497,7 @@ static int get_iovec_page_array(const struct iovec __user *iov,
 
 			partial[buffers].offset = off;
 			partial[buffers].len = plen;
+			partial[buffers].useraddr = (unsigned long)base;
 
 			off = 0;
 			len -= plen;
diff --git a/include/linux/splice.h b/include/linux/splice.h
index 74575cb..56661e3 100644
--- a/include/linux/splice.h
+++ b/include/linux/splice.h
@@ -44,6 +44,7 @@ struct partial_page {
 	unsigned int offset;
 	unsigned int len;
 	unsigned long private;
+	unsigned long useraddr;
 };
 
 /*
-- 
1.8.1.2

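
The per-buffer test that patch 1/2 adds to splice_to_pipe() can be modelled outside the kernel to see which buffers would be zapped. The sketch below is a user-space model only: struct partial_page_model, would_zap() and count_zapped() are hypothetical names, and the MODEL_ flag constants simply repeat the SPLICE_F_MOVE (0x01) and SPLICE_F_GIFT (0x08) values so the block compiles without kernel headers.

```c
#include <assert.h>
#include <stddef.h>

/* Same numeric values as SPLICE_F_MOVE and SPLICE_F_GIFT. */
#define MODEL_SPLICE_F_MOVE 0x01u
#define MODEL_SPLICE_F_GIFT 0x08u

/* Hypothetical stand-in for the fields of struct partial_page used here. */
struct partial_page_model {
	unsigned int offset;     /* offset of the data within its page */
	unsigned int len;        /* number of bytes in this buffer */
	unsigned long useraddr;  /* user address recorded by the patch */
};

/*
 * Mirrors the condition in the patch: zap only when both flags are set
 * and the buffer covers exactly one whole page.
 */
int would_zap(const struct partial_page_model *pp, unsigned int flags,
	      unsigned int page_size)
{
	if (!(flags & MODEL_SPLICE_F_GIFT))
		return 0;
	if (!(flags & MODEL_SPLICE_F_MOVE))
		return 0;
	return pp->offset == 0 && pp->len == page_size;
}

/* Counts how many of n buffers would be zapped under the same rule. */
size_t count_zapped(const struct partial_page_model *pp, size_t n,
		    unsigned int flags, unsigned int page_size)
{
	size_t i, zapped = 0;

	assert(pp != NULL || n == 0);
	for (i = 0; i < n; i++)
		zapped += (size_t)would_zap(&pp[i], flags, page_size);
	return zapped;
}
```

In this model a partial first or last page of an unaligned iovec is never zapped, which matches the "Can move page aligned buf" comment in the patch.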
* Re: [RFC PATCH 1/2] vmsplice unmap gifted pages for recipient
  2013-07-25 17:21 ` [RFC PATCH 1/2] vmsplice unmap gifted pages for recipient Robert Jennings
@ 2013-07-25 17:30   ` Dave Hansen
  2013-07-26 15:16     ` Robert Jennings
  0 siblings, 1 reply; 6+ messages in thread
From: Dave Hansen @ 2013-07-25 17:30 UTC (permalink / raw)
  To: Robert Jennings
  Cc: linux-kernel, linux-fsdevel, linux-mm, Alexander Viro,
	Rik van Riel, Andrea Arcangeli, Matt Helsley, Anthony Liguori,
	Michael Roth, Lei Li, Leonardo Garcia

On 07/25/2013 10:21 AM, Robert Jennings wrote:
> +static void zap_buf_page(unsigned long useraddr)
> +{
> +	struct vm_area_struct *vma;
> +
> +	down_read(&current->mm->mmap_sem);
> +	vma = find_vma_intersection(current->mm, useraddr,
> +			useraddr + PAGE_SIZE);
> +	if (!IS_ERR_OR_NULL(vma))
> +		zap_page_range(vma, useraddr, PAGE_SIZE, NULL);
> +	up_read(&current->mm->mmap_sem);
> +}
> +
>  /**
>   * splice_to_pipe - fill passed data into a pipe
>   * @pipe:	pipe to fill
> @@ -212,8 +224,16 @@ ssize_t splice_to_pipe(struct pipe_inode_info *pipe,
>  			buf->len = spd->partial[page_nr].len;
>  			buf->private = spd->partial[page_nr].private;
>  			buf->ops = spd->ops;
> -			if (spd->flags & SPLICE_F_GIFT)
> +			if (spd->flags & SPLICE_F_GIFT) {
> +				unsigned long useraddr =
> +						spd->partial[page_nr].useraddr;
> +
> +				if ((spd->flags & SPLICE_F_MOVE) &&
> +						!buf->offset && (buf->len == PAGE_SIZE))
> +					/* Can move page aligned buf */
> +					zap_buf_page(useraddr);
>  				buf->flags |= PIPE_BUF_FLAG_GIFT;
> +			}

There isn't quite enough context here, but is it going to do this
zap_buf_page() very often?  Seems a bit wasteful to do the up/down and
find_vma() every trip through the loop.

* Re: [RFC PATCH 1/2] vmsplice unmap gifted pages for recipient
  2013-07-25 17:30   ` Dave Hansen
@ 2013-07-26 15:16     ` Robert Jennings
  2013-07-26 16:47       ` Dave Hansen
  0 siblings, 1 reply; 6+ messages in thread
From: Robert Jennings @ 2013-07-26 15:16 UTC (permalink / raw)
  To: Dave Hansen
  Cc: linux-kernel, linux-fsdevel, linux-mm, Alexander Viro,
	Rik van Riel, Andrea Arcangeli, Matt Helsley, Anthony Liguori,
	Michael Roth, Lei Li, Leonardo Garcia

* Dave Hansen (dave@sr71.net) wrote:
> On 07/25/2013 10:21 AM, Robert Jennings wrote:
> > +static void zap_buf_page(unsigned long useraddr)
> > +{
> > +	struct vm_area_struct *vma;
> > +
> > +	down_read(&current->mm->mmap_sem);
> > +	vma = find_vma_intersection(current->mm, useraddr,
> > +			useraddr + PAGE_SIZE);
> > +	if (!IS_ERR_OR_NULL(vma))
> > +		zap_page_range(vma, useraddr, PAGE_SIZE, NULL);
> > +	up_read(&current->mm->mmap_sem);
> > +}
> > +
> >  /**
> >   * splice_to_pipe - fill passed data into a pipe
> >   * @pipe:	pipe to fill
> > @@ -212,8 +224,16 @@ ssize_t splice_to_pipe(struct pipe_inode_info *pipe,
> >  			buf->len = spd->partial[page_nr].len;
> >  			buf->private = spd->partial[page_nr].private;
> >  			buf->ops = spd->ops;
> > -			if (spd->flags & SPLICE_F_GIFT)
> > +			if (spd->flags & SPLICE_F_GIFT) {
> > +				unsigned long useraddr =
> > +						spd->partial[page_nr].useraddr;
> > +
> > +				if ((spd->flags & SPLICE_F_MOVE) &&
> > +						!buf->offset && (buf->len == PAGE_SIZE))
> > +					/* Can move page aligned buf */
> > +					zap_buf_page(useraddr);
> >  				buf->flags |= PIPE_BUF_FLAG_GIFT;
> > +			}
>
> There isn't quite enough context here, but is it going to do this
> zap_buf_page() very often?  Seems a bit wasteful to do the up/down and
> find_vma() every trip through the loop.

The call to zap_buf_page() is in a loop where each pipe buffer is being
processed, but in that loop we have a pipe_wait() where we schedule().
So as things are structured I don't have the ability to hold mmap_sem
for multiple find_vma() calls.

* Re: [RFC PATCH 1/2] vmsplice unmap gifted pages for recipient
  2013-07-26 15:16     ` Robert Jennings
@ 2013-07-26 16:47       ` Dave Hansen
  0 siblings, 0 replies; 6+ messages in thread
From: Dave Hansen @ 2013-07-26 16:47 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel, linux-mm, Alexander Viro,
	Rik van Riel, Andrea Arcangeli, Matt Helsley, Anthony Liguori,
	Michael Roth, Lei Li, Leonardo Garcia

On 07/26/2013 08:16 AM, Robert Jennings wrote:
>>> > > +				if ((spd->flags & SPLICE_F_MOVE) &&
>>> > > +						!buf->offset && (buf->len == PAGE_SIZE))
>>> > > +					/* Can move page aligned buf */
>>> > > +					zap_buf_page(useraddr);
>>> > >  				buf->flags |= PIPE_BUF_FLAG_GIFT;
>>> > > +			}
>> >
>> > There isn't quite enough context here, but is it going to do this
>> > zap_buf_page() very often?  Seems a bit wasteful to do the up/down and
>> > find_vma() every trip through the loop.
> The call to zap_buf_page() is in a loop where each pipe buffer is being
> processed, but in that loop we have a pipe_wait() where we schedule().
> So as things are structured I don't have the ability to hold mmap_sem
> for multiple find_vma() calls.

You can hold a semaphore over a schedule(). :)

You could also theoretically hold mmap_sem and only drop it on actual
cases when you reschedule if you were afraid of holding mmap_sem for
long periods of time (even though it's a read).

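
The trade-off discussed in this reply, taking the lock once per buffer versus holding it across the loop and dropping it only around a sleep, can be illustrated with a small user-space model. This is not kernel code and makes no claim about what is safe for mmap_sem; struct lock_model and both process_* functions are hypothetical, and the "lock" is just a counter so that the number of acquisitions can be compared.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical counting lock: records how often it is taken. */
struct lock_model {
	int held;
	unsigned long acquisitions;
};

void model_lock(struct lock_model *l)
{
	assert(!l->held);
	l->held = 1;
	l->acquisitions++;
}

void model_unlock(struct lock_model *l)
{
	assert(l->held);
	l->held = 0;
}

/* Lock and unlock around every buffer, as zap_buf_page() does per call. */
unsigned long process_locking_each(size_t nbufs)
{
	struct lock_model l = { 0, 0 };
	size_t i;

	for (i = 0; i < nbufs; i++) {
		model_lock(&l);
		/* the VMA lookup and zap would happen here */
		model_unlock(&l);
	}
	return l.acquisitions;
}

/*
 * Hold the lock across iterations and drop it only when the loop would
 * actually sleep. must_wait[i] != 0 means a wait precedes buffer i.
 */
unsigned long process_locking_batched(const int *must_wait, size_t nbufs)
{
	struct lock_model l = { 0, 0 };
	size_t i;

	if (nbufs == 0)
		return 0;

	model_lock(&l);
	for (i = 0; i < nbufs; i++) {
		if (must_wait[i]) {
			model_unlock(&l);
			/* the pipe_wait()-style sleep would happen here */
			model_lock(&l);
		}
		/* the VMA lookup and zap would happen here */
	}
	model_unlock(&l);
	return l.acquisitions;
}
```

With six buffers and a single wait, the per-buffer variant takes the lock six times while the batched variant takes it twice.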
* [RFC PATCH 2/2] Add limited zero copy to vmsplice
  2013-07-25 17:21 [RFC PATCH 0/2] vmpslice support for zero-copy gifting of pages Robert Jennings
  2013-07-25 17:21 ` [RFC PATCH 1/2] vmsplice unmap gifted pages for recipient Robert Jennings
@ 2013-07-25 17:21 ` Robert Jennings
  1 sibling, 0 replies; 6+ messages in thread
From: Robert Jennings @ 2013-07-25 17:21 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-fsdevel, linux-mm, Alexander Viro, Rik van Riel,
	Andrea Arcangeli, Dave Hansen, Robert Jennings, Matt Helsley,
	Anthony Liguori, Michael Roth, Lei Li, Leonardo Garcia

From: Matt Helsley <matthltc@us.ibm.com>

It is sometimes useful to move anonymous pages over a pipe rather than
save/swap them. Check the SPLICE_F_GIFT and SPLICE_F_MOVE flags to see
if userspace would like to move such pages. This differs from plain
SPLICE_F_GIFT in that the memory written to the pipe will no longer
have the same contents as the original -- it effectively faults in new,
empty anonymous pages.

On the read side, the page written to the pipe is moved only if
SPLICE_F_MOVE is used; otherwise it is copied and the page is
reclaimed. Note that so long as there is a mapping to the page, copies
will be done instead because rmap will have upped the map count for each
anonymous mapping; this can happen due to fork(), for example. This is
necessary because moving the page will usually change the anonymous
page's nonlinear index and that can only be done if it's unmapped.

Signed-off-by: Matt Helsley <matthltc@us.ibm.com>
Signed-off-by: Matt Helsley <matt.helsley@gmail.com>
Signed-off-by: Robert Jennings <rcj@linux.vnet.ibm.com>
---
 fs/splice.c | 63 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 63 insertions(+)

diff --git a/fs/splice.c b/fs/splice.c
index 6aa964f..0a715c3 100644
--- a/fs/splice.c
+++ b/fs/splice.c
@@ -32,6 +32,10 @@
 #include <linux/gfp.h>
 #include <linux/socket.h>
 #include <linux/compat.h>
+#include <linux/page-flags.h>
+#include <linux/hugetlb.h>
+#include <linux/ksm.h>
+#include <linux/swapops.h>
 #include "internal.h"
 
 /*
@@ -1536,6 +1540,65 @@ static int pipe_to_user(struct pipe_inode_info *pipe, struct pipe_buffer *buf,
 	char *src;
 	int ret;
 
+	if (!buf->offset && (buf->len == PAGE_SIZE) &&
+	    (buf->flags & PIPE_BUF_FLAG_GIFT) && (sd->flags & SPLICE_F_MOVE)) {
+		struct page *page = buf->page;
+		struct mm_struct *mm;
+		struct vm_area_struct *vma;
+		spinlock_t *ptl;
+		pte_t *ptep, pte;
+		unsigned long useraddr;
+
+		if (!PageAnon(page))
+			goto copy;
+		if (PageCompound(page))
+			goto copy;
+		if (PageHuge(page) || PageTransHuge(page))
+			goto copy;
+		if (page_mapped(page))
+			goto copy;
+		useraddr = (unsigned long)sd->u.userptr;
+		mm = current->mm;
+
+		ret = -EAGAIN;
+		down_read(&mm->mmap_sem);
+		vma = find_vma_intersection(mm, useraddr, useraddr + PAGE_SIZE);
+		if (IS_ERR_OR_NULL(vma))
+			goto up_copy;
+		if (!vma->anon_vma) {
+			ret = anon_vma_prepare(vma);
+			if (ret)
+				goto up_copy;
+		}
+		zap_page_range(vma, useraddr, PAGE_SIZE, NULL);
+		ret = lock_page_killable(page);
+		if (ret)
+			goto up_copy;
+		ptep = get_locked_pte(mm, useraddr, &ptl);
+		if (!ptep)
+			goto unlock_up_copy;
+		pte = *ptep;
+		if (pte_present(pte))
+			goto unlock_up_copy;
+		get_page(page);
+		page_add_anon_rmap(page, vma, useraddr);
+		pte = mk_pte(page, vma->vm_page_prot);
+		set_pte_at(mm, useraddr, ptep, pte);
+		update_mmu_cache(vma, useraddr, ptep);
+		pte_unmap_unlock(ptep, ptl);
+		ret = 0;
+unlock_up_copy:
+		unlock_page(page);
+up_copy:
+		up_read(&mm->mmap_sem);
+		if (!ret) {
+			ret = sd->len;
+			goto out;
+		}
+		/* else ret < 0 and we should fallback to copying */
+		VM_BUG_ON(ret > 0);
+	}
+copy:
 	/*
 	 * See if we can use the atomic maps, by prefaulting in the
 	 * pages and doing an atomic copy
-- 
1.8.1.2

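
The chain of early checks at the top of the new pipe_to_user() block decides whether a move is even attempted. A user-space model of just that decision is sketched below; struct page_model, struct buf_model, enum xfer and pick_transfer() are hypothetical names, and the boolean fields stand in for the PageAnon(), PageCompound(), PageHuge()/PageTransHuge() and page_mapped() tests. The model stops at "try to move": the later steps in the patch (VMA lookup, page lock, PTE install) can still fail and fall back to a copy.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the page state the patch inspects. */
struct page_model {
	bool anon;      /* PageAnon() */
	bool compound;  /* PageCompound() */
	bool huge;      /* PageHuge() || PageTransHuge() */
	bool mapped;    /* page_mapped() */
};

struct buf_model {
	unsigned int offset;  /* buf->offset */
	unsigned int len;     /* buf->len */
	bool gifted;          /* PIPE_BUF_FLAG_GIFT set by the writer */
	struct page_model page;
};

enum xfer { XFER_COPY, XFER_TRY_MOVE };

/* Applies the patch's preconditions in the same order as the code. */
enum xfer pick_transfer(const struct buf_model *buf, bool reader_wants_move,
			unsigned int page_size)
{
	assert(page_size > 0);

	/* Whole, page-aligned, gifted buffer and SPLICE_F_MOVE on the reader. */
	if (buf->offset != 0 || buf->len != page_size)
		return XFER_COPY;
	if (!buf->gifted || !reader_wants_move)
		return XFER_COPY;

	/* Only unmapped, ordinary anonymous pages are candidates. */
	if (!buf->page.anon)
		return XFER_COPY;
	if (buf->page.compound || buf->page.huge)
		return XFER_COPY;
	if (buf->page.mapped)
		return XFER_COPY;

	return XFER_TRY_MOVE;
}
```

A page that is still mapped anywhere, for example after fork(), takes the copy path in this model, matching the note in the patch description.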
end of thread (newest message: 2013-07-26 16:47 UTC)

Thread overview: 6+ messages
2013-07-25 17:21 [RFC PATCH 0/2] vmpslice support for zero-copy gifting of pages Robert Jennings
2013-07-25 17:21 ` [RFC PATCH 1/2] vmsplice unmap gifted pages for recipient Robert Jennings
2013-07-25 17:30   ` Dave Hansen
2013-07-26 15:16     ` Robert Jennings
2013-07-26 16:47       ` Dave Hansen
2013-07-25 17:21 ` [RFC PATCH 2/2] Add limited zero copy to vmsplice Robert Jennings