* [PATCH 0/2] vmsplice support for zero-copy gifting of pages
From: Robert C Jennings @ 2013-10-07 20:21 UTC
To: linux-kernel
Cc: linux-fsdevel, linux-mm, Alexander Viro, Rik van Riel, Andrea Arcangeli,
    Dave Hansen, Robert C Jennings, Matt Helsley, Anthony Liguori,
    Michael Roth, Lei Li, Leonardo Garcia, Vlastimil Babka

This patch set adds the ability to move anonymous user pages from one
process to another through vmsplice without copying data. Moving pages
rather than copying is implemented only for a narrow case in this RFC,
enough to meet the needs of QEMU's usage (below). Among the restrictions:
the source and destination addresses must be page aligned, the size
argument must be a multiple of the page size, and by the time the reader
calls vmsplice the page must no longer be mapped in the source. If a move
is not possible the code transparently falls back to copying data.

This comes from work in QEMU[1] to migrate a VM from one QEMU instance to
another with minimal down-time for the VM, which would allow an update of
the QEMU executable underneath a running VM.

New flag usage

This introduces use of the SPLICE_F_MOVE flag for vmsplice, previously
unused. Proposed usage is as follows:

Writer gifts pages to the pipe and cannot access the original contents
after the gift:

    vmsplice(fd, iov, nr_segs, SPLICE_F_GIFT | SPLICE_F_MOVE);

Reader asks the kernel to move pages from the pipe into the memory
described by its iovec:

    vmsplice(fd, iov, nr_segs, SPLICE_F_MOVE);

On older kernels SPLICE_F_MOVE is ignored by vmsplice and a copy occurs.

[1] QEMU localhost live migration:
    http://lists.gnu.org/archive/html/qemu-devel/2013-06/msg02540.html
    http://lists.gnu.org/archive/html/qemu-devel/2013-06/msg02577.html

 vmsplice: Add limited zero copy to vmsplice
 vmsplice: unmap gifted pages for recipient

 fs/splice.c            | 114 ++++++++++++++++++++++++++++++++++++++++++++++++-
 include/linux/splice.h |   1 +
 2 files changed, 114 insertions(+), 1 deletion(-)

-- 
1.8.1.2
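For illustration, the proposed call pattern from user space would look
roughly like the sketch below. This program is not part of the patch set:
it assumes a kernel carrying this RFC, and whether the pages are actually
moved or silently copied depends on the restrictions listed above. For
brevity the writer and reader are the same process here; in the QEMU use
case they would be two processes sharing the pipe.

#define _GNU_SOURCE
#include <fcntl.h>      /* vmsplice(), SPLICE_F_GIFT, SPLICE_F_MOVE */
#include <stdlib.h>
#include <string.h>
#include <sys/uio.h>
#include <unistd.h>

int main(void)
{
	long psz = sysconf(_SC_PAGESIZE);
	int pfd[2];
	struct iovec iov;

	/* Both buffers must be page aligned, page-size multiples. */
	char *src = aligned_alloc(psz, psz);
	char *dst = aligned_alloc(psz, psz);

	if (!src || !dst || pipe(pfd))
		return 1;
	memset(src, 0x5a, psz);

	/* Writer: gift the page; src must not be used again. */
	iov.iov_base = src;
	iov.iov_len = psz;
	if (vmsplice(pfd[1], &iov, 1, SPLICE_F_GIFT | SPLICE_F_MOVE) != psz)
		return 1;

	/* Reader: ask the kernel to move the page into dst. */
	iov.iov_base = dst;
	iov.iov_len = psz;
	if (vmsplice(pfd[0], &iov, 1, SPLICE_F_MOVE) != psz)
		return 1;

	return dst[0] == 0x5a ? 0 : 1;
}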
* [PATCH 1/2] vmsplice: unmap gifted pages for recipient
From: Robert C Jennings @ 2013-10-07 20:21 UTC
To: linux-kernel
Cc: linux-fsdevel, linux-mm, Alexander Viro, Rik van Riel, Andrea Arcangeli,
    Dave Hansen, Robert C Jennings, Matt Helsley, Anthony Liguori,
    Michael Roth, Lei Li, Leonardo Garcia, Vlastimil Babka

Introduce use of the unused SPLICE_F_MOVE flag for vmsplice to zap
pages.

When vmsplice is called with flags (SPLICE_F_GIFT | SPLICE_F_MOVE) the
writer's gifted pages are zapped from its address space. This patch
supports further work to move vmsplice'd pages rather than copying them.
That patch has the restriction that a page must not be mapped by the
source at the time of the move; otherwise it falls back to copying the
page.

Signed-off-by: Matt Helsley <matt.helsley@gmail.com>
Signed-off-by: Robert C Jennings <rcj@linux.vnet.ibm.com>
---
Since the RFC went out I have coalesced the zap_page_range() calls to
operate on whole VMAs rather than calling zap_page_range() for each page.
For a 256MB vmsplice this reduced the write-side time by 50% from the
RFC.
---
 fs/splice.c            | 51 +++++++++++++++++++++++++++++++++++++++++++++++++-
 include/linux/splice.h |  1 +
 2 files changed, 51 insertions(+), 1 deletion(-)

diff --git a/fs/splice.c b/fs/splice.c
index 3b7ee65..a62d61e 100644
--- a/fs/splice.c
+++ b/fs/splice.c
@@ -188,12 +188,17 @@ ssize_t splice_to_pipe(struct pipe_inode_info *pipe,
 {
 	unsigned int spd_pages = spd->nr_pages;
 	int ret, do_wakeup, page_nr;
+	struct vm_area_struct *vma;
+	unsigned long user_start, user_end;
 
 	ret = 0;
 	do_wakeup = 0;
 	page_nr = 0;
+	vma = NULL;
+	user_start = user_end = 0;
 
 	pipe_lock(pipe);
+	down_read(&current->mm->mmap_sem);
 
 	for (;;) {
 		if (!pipe->readers) {
@@ -212,8 +217,44 @@ ssize_t splice_to_pipe(struct pipe_inode_info *pipe,
 			buf->len = spd->partial[page_nr].len;
 			buf->private = spd->partial[page_nr].private;
 			buf->ops = spd->ops;
-			if (spd->flags & SPLICE_F_GIFT)
+			if (spd->flags & SPLICE_F_GIFT) {
+				unsigned long useraddr =
+					spd->partial[page_nr].useraddr;
+
+				if ((spd->flags & SPLICE_F_MOVE) &&
+				    !buf->offset &&
+				    (buf->len == PAGE_SIZE)) {
+					/* Can move page aligned buf, gather
+					 * requests to make a single
+					 * zap_page_range() call per VMA
+					 */
+					if (vma && (useraddr == user_end) &&
+					    ((useraddr + PAGE_SIZE) <=
+					     vma->vm_end)) {
+						/* same vma, no holes */
+						user_end += PAGE_SIZE;
+					} else {
+						if (vma)
+							zap_page_range(vma,
+								user_start,
+								(user_end -
+								 user_start),
+								NULL);
+						vma = find_vma_intersection(
+							current->mm,
+							useraddr,
+							(useraddr +
+							 PAGE_SIZE));
+						if (!IS_ERR_OR_NULL(vma)) {
+							user_start = useraddr;
+							user_end = (useraddr +
+								    PAGE_SIZE);
+						} else
+							vma = NULL;
+					}
+				}
 				buf->flags |= PIPE_BUF_FLAG_GIFT;
+			}
 
 			pipe->nrbufs++;
 			page_nr++;
@@ -255,6 +296,10 @@ ssize_t splice_to_pipe(struct pipe_inode_info *pipe,
 			pipe->waiting_writers--;
 	}
 
+	if (vma)
+		zap_page_range(vma, user_start, (user_end - user_start), NULL);
+
+	up_read(&current->mm->mmap_sem);
 	pipe_unlock(pipe);
 
 	if (do_wakeup)
@@ -485,6 +530,7 @@ fill_it:
 
 		spd.partial[page_nr].offset = loff;
 		spd.partial[page_nr].len = this_len;
+		spd.partial[page_nr].useraddr = index << PAGE_CACHE_SHIFT;
 		len -= this_len;
 		loff = 0;
 		spd.nr_pages++;
@@ -656,6 +702,7 @@ ssize_t default_file_splice_read(struct file *in, loff_t *ppos,
 		this_len = min_t(size_t, vec[i].iov_len, res);
 		spd.partial[i].offset = 0;
 		spd.partial[i].len = this_len;
+		spd.partial[i].useraddr = (unsigned long)vec[i].iov_base;
 		if (!this_len) {
 			__free_page(spd.pages[i]);
 			spd.pages[i] = NULL;
@@ -1475,6 +1522,8 @@ static int get_iovec_page_array(const struct iovec __user *iov,
 
 		partial[buffers].offset = off;
 		partial[buffers].len = plen;
+		partial[buffers].useraddr = (unsigned long)base;
+		base = (void*)((unsigned long)base + PAGE_SIZE);
 
 		off = 0;
 		len -= plen;
diff --git a/include/linux/splice.h b/include/linux/splice.h
index 74575cb..56661e3 100644
--- a/include/linux/splice.h
+++ b/include/linux/splice.h
@@ -44,6 +44,7 @@ struct partial_page {
 	unsigned int offset;
 	unsigned int len;
 	unsigned long private;
+	unsigned long useraddr;
 };
 
 /*
-- 
1.8.1.2
* Re: [PATCH 1/2] vmsplice: unmap gifted pages for recipient
From: Dave Hansen @ 2013-10-08 16:14 UTC
To: Robert C Jennings, linux-kernel
Cc: linux-fsdevel, linux-mm, Alexander Viro, Rik van Riel, Andrea Arcangeli,
    Matt Helsley, Anthony Liguori, Michael Roth, Lei Li, Leonardo Garcia,
    Vlastimil Babka

On 10/07/2013 01:21 PM, Robert C Jennings wrote:
> +					} else {
> +						if (vma)
> +							zap_page_range(vma,
> +								user_start,
> +								(user_end -
> +								 user_start),
> +								NULL);
> +						vma = find_vma_intersection(
> +							current->mm,
> +							useraddr,
> +							(useraddr +
> +							 PAGE_SIZE));
> +						if (!IS_ERR_OR_NULL(vma)) {
> +							user_start = useraddr;
> +							user_end = (useraddr +
> +								    PAGE_SIZE);
> +						} else
> +							vma = NULL;
> +					}

This is pretty unspeakably hideous. Was there truly no better way to do
this?
* Re: [PATCH 1/2] vmsplice: unmap gifted pages for recipient
From: Robert Jennings @ 2013-10-08 19:48 UTC
To: Dave Hansen
Cc: linux-kernel, linux-fsdevel, linux-mm, Alexander Viro, Rik van Riel,
    Andrea Arcangeli, Matt Helsley, Anthony Liguori, Michael Roth, Lei Li,
    Leonardo Garcia, Vlastimil Babka

* Dave Hansen (dave@sr71.net) wrote:
> On 10/07/2013 01:21 PM, Robert C Jennings wrote:
> > +					} else {
> > +						if (vma)
> > +							zap_page_range(vma,
[...]
> > +						} else
> > +							vma = NULL;
> > +					}
>
> This is pretty unspeakably hideous. Was there truly no better way to do
> this?

I was hoping to find a better way to coalesce pipe buffers and zap
entire VMAs. (It also needs better documentation, but your complaint is
with the structure and I agree.) I would love suggestions for improving
this; that is not to say I've abandoned it, and I'm still looking for
ways to make it cleaner.

Doing find_vma() once per VMA rather than once per page, and then
zapping the whole range at once, provides a 50% runtime reduction for
the writer when tested with a 256MB vmsplice operation. Based on that
result I felt the coalescing was justified, but the implementation is
ugly.
* Re: [PATCH 1/2] vmsplice: unmap gifted pages for recipient
From: Dave Hansen @ 2013-10-08 21:22 UTC
To: linux-kernel, linux-fsdevel, linux-mm, Alexander Viro, Rik van Riel,
    Andrea Arcangeli, Matt Helsley, Anthony Liguori, Michael Roth, Lei Li,
    Leonardo Garcia, Vlastimil Babka

On 10/08/2013 12:48 PM, Robert Jennings wrote:
> * Dave Hansen (dave@sr71.net) wrote:
> > On 10/07/2013 01:21 PM, Robert C Jennings wrote:
[...]
> > This is pretty unspeakably hideous. Was there truly no better way to
> > do this?
>
> I was hoping to find a better way to coalesce pipe buffers and zap
> entire VMAs. (It also needs better documentation, but your complaint is
> with the structure and I agree.) I would love suggestions for improving
> this; that is not to say I've abandoned it, and I'm still looking for
> ways to make it cleaner.

Doing the VMA search each and every time seems a bit silly. Do one
find_vma(), then look at the _end_ virtual address of the VMA. You can
continue to collect your set of zap_page_range() addresses as long as
you do not hit the end of that address range. If and only if you hit the
end of the VMA, do the zap_page_range(), and then look up the VMA again.

Storing the .useraddr still seems odd to me, and you haven't fully
explained why you're doing it, how it is safe, or why you store both
virtual addresses and file locations in it.
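For illustration, the loop body restructured along the lines Hansen
suggests might look roughly like the fragment below. This is a sketch,
not code from any posted patch; it reuses the vma, useraddr, user_start
and user_end variables from the patch above.

	/*
	 * Batch one zap_page_range() per VMA: keep extending the
	 * [user_start, user_end) range while it stays inside the VMA
	 * we already looked up, and only re-do the lookup (flushing
	 * the pending zap first) when we run off the end of it.
	 */
	if (vma && useraddr == user_end &&
	    user_end + PAGE_SIZE <= vma->vm_end) {
		/* Same VMA, contiguous: just grow the pending range. */
		user_end += PAGE_SIZE;
	} else {
		if (vma)	/* flush the range gathered so far */
			zap_page_range(vma, user_start,
				       user_end - user_start, NULL);
		vma = find_vma(current->mm, useraddr);
		if (vma && useraddr >= vma->vm_start) {
			user_start = useraddr;
			user_end = useraddr + PAGE_SIZE;
		} else {
			vma = NULL;	/* address sits in a hole */
		}
	}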
* Re: [PATCH 1/2] vmsplice: unmap gifted pages for recipient
From: Dave Hansen @ 2013-10-08 16:23 UTC
To: Robert C Jennings, linux-kernel
Cc: linux-fsdevel, linux-mm, Alexander Viro, Rik van Riel, Andrea Arcangeli,
    Matt Helsley, Anthony Liguori, Michael Roth, Lei Li, Leonardo Garcia,
    Vlastimil Babka

On 10/07/2013 01:21 PM, Robert C Jennings wrote:
> 		spd.partial[page_nr].offset = loff;
> 		spd.partial[page_nr].len = this_len;
> +		spd.partial[page_nr].useraddr = index << PAGE_CACHE_SHIFT;
> 		len -= this_len;
> 		loff = 0;
> 		spd.nr_pages++;
> @@ -656,6 +702,7 @@ ssize_t default_file_splice_read(struct file *in, loff_t *ppos,
> 		this_len = min_t(size_t, vec[i].iov_len, res);
> 		spd.partial[i].offset = 0;
> 		spd.partial[i].len = this_len;
> +		spd.partial[i].useraddr = (unsigned long)vec[i].iov_base;
> 		if (!this_len) {
> 			__free_page(spd.pages[i]);
> 			spd.pages[i] = NULL;
> @@ -1475,6 +1522,8 @@ static int get_iovec_page_array(const struct iovec __user *iov,
>
> 		partial[buffers].offset = off;
> 		partial[buffers].len = plen;
> +		partial[buffers].useraddr = (unsigned long)base;
> +		base = (void*)((unsigned long)base + PAGE_SIZE);
>
> 		off = 0;
> 		len -= plen;
> diff --git a/include/linux/splice.h b/include/linux/splice.h
> index 74575cb..56661e3 100644
> --- a/include/linux/splice.h
> +++ b/include/linux/splice.h
> @@ -44,6 +44,7 @@ struct partial_page {
> 	unsigned int offset;
> 	unsigned int len;
> 	unsigned long private;
> +	unsigned long useraddr;
> };

"useraddr" confuses me. You make it an 'unsigned long', yet two of the
three assignments are from "void __user *". The other assignment:

	spd.partial[page_nr].useraddr = index << PAGE_CACHE_SHIFT;

'index' looks to be the offset inside the file, not a user address, so
I'm confused what that is doing.

Could you elaborate a little more on why 'useraddr' is suddenly needed
in these patches? How is "index << PAGE_CACHE_SHIFT" a virtual address?
Also, are we losing any of the advantages of sparse checking since
'useraddr' is without the __user annotation?
* Re: [PATCH 1/2] vmsplice: unmap gifted pages for recipient
From: Robert Jennings @ 2013-10-17 13:54 UTC
To: Dave Hansen
Cc: linux-kernel, linux-fsdevel, linux-mm, Alexander Viro, Rik van Riel,
    Andrea Arcangeli, Matt Helsley, Anthony Liguori, Michael Roth, Lei Li,
    Leonardo Garcia, Vlastimil Babka

* Dave Hansen (dave@sr71.net) wrote:
> On 10/07/2013 01:21 PM, Robert C Jennings wrote:
> > 		spd.partial[page_nr].offset = loff;
> > 		spd.partial[page_nr].len = this_len;
> > +		spd.partial[page_nr].useraddr = index << PAGE_CACHE_SHIFT;
[...]
>
> "useraddr" confuses me. You make it an 'unsigned long', yet two of the
> three assignments are from "void __user *". The other assignment:
>
> 	spd.partial[page_nr].useraddr = index << PAGE_CACHE_SHIFT;
>
> 'index' looks to be the offset inside the file, not a user address, so
> I'm confused what that is doing.
>
> Could you elaborate a little more on why 'useraddr' is suddenly needed
> in these patches? How is "index << PAGE_CACHE_SHIFT" a virtual address?
> Also, are we losing any of the advantages of sparse checking since
> 'useraddr' is without the __user annotation?

I'm working on cleaning this up, trying to remove useraddr altogether by
using the existing 'private' field just for the
splice_to_user/pipe_to_user flow, without upsetting other uses of that
field.
* Re: [PATCH 1/2] vmsplice: unmap gifted pages for recipient
From: Vlastimil Babka @ 2013-10-17 10:20 UTC
To: Robert C Jennings, linux-kernel
Cc: linux-fsdevel, linux-mm, Alexander Viro, Rik van Riel, Andrea Arcangeli,
    Dave Hansen, Matt Helsley, Anthony Liguori, Michael Roth, Lei Li,
    Leonardo Garcia

On 10/07/2013 10:21 PM, Robert C Jennings wrote:
> Introduce use of the unused SPLICE_F_MOVE flag for vmsplice to zap
> pages.
[...]
> @@ -188,12 +188,17 @@ ssize_t splice_to_pipe(struct pipe_inode_info *pipe,
[...]
> 	pipe_lock(pipe);
> +	down_read(&current->mm->mmap_sem);

Seems like you could take the mmap_sem only when GIFT and MOVE are set.
Maybe it won't help performance that much, but it would at least
document the reason it's needed?

Vlastimil

[...]
* Re: [PATCH 1/2] vmsplice: unmap gifted pages for recipient
From: Robert Jennings @ 2013-10-17 13:48 UTC
To: Vlastimil Babka
Cc: linux-kernel, linux-fsdevel, linux-mm, Alexander Viro, Rik van Riel,
    Andrea Arcangeli, Dave Hansen, Matt Helsley, Anthony Liguori,
    Michael Roth, Lei Li, Leonardo Garcia

* Vlastimil Babka (vbabka@suse.cz) wrote:
> On 10/07/2013 10:21 PM, Robert C Jennings wrote:
> > 	pipe_lock(pipe);
> > +	down_read(&current->mm->mmap_sem);
>
> Seems like you could take the mmap_sem only when GIFT and MOVE are set.
> Maybe it won't help performance that much, but it would at least
> document the reason it's needed?

I had been doing that previously, but moving this outside the loop and
acquiring it once improved performance. I'll add a comment on the
down_read() giving the reason for taking it, though.

-Rob

[...]
* Re: [PATCH 1/2] vmsplice: unmap gifted pages for recipient
From: Vlastimil Babka @ 2013-10-18 8:10 UTC
To: linux-kernel, linux-fsdevel, linux-mm, Alexander Viro, Rik van Riel,
    Andrea Arcangeli, Dave Hansen, Matt Helsley, Anthony Liguori,
    Michael Roth, Lei Li, Leonardo Garcia

On 10/17/2013 03:48 PM, Robert Jennings wrote:
> * Vlastimil Babka (vbabka@suse.cz) wrote:
> > Seems like you could take the mmap_sem only when GIFT and MOVE are
> > set. Maybe it won't help performance that much, but it would at least
> > document the reason it's needed?
>
> I had been doing that previously, but moving this outside the loop and
> acquiring it once improved performance. I'll add a comment on the
> down_read() giving the reason for taking it, though.

Hm, perhaps in light of the recent patches that reduce mmap_sem usage to
only the really critical regions, it shouldn't be taken at all when it's
not needed.

Vlastimil

[...]
* [PATCH 2/2] vmsplice: Add limited zero copy to vmsplice
From: Robert C Jennings @ 2013-10-07 20:21 UTC
To: linux-kernel
Cc: linux-fsdevel, linux-mm, Alexander Viro, Rik van Riel, Andrea Arcangeli,
    Dave Hansen, Robert C Jennings, Matt Helsley, Anthony Liguori,
    Michael Roth, Lei Li, Leonardo Garcia, Vlastimil Babka

From: Matt Helsley <matt.helsley@gmail.com>

It is sometimes useful to move anonymous pages over a pipe rather than
save/swap them. Check the SPLICE_F_GIFT and SPLICE_F_MOVE flags to see
if userspace would like to move such pages. This differs from plain
SPLICE_F_GIFT in that the memory written to the pipe will no longer have
the same contents as the original -- it effectively faults in new,
empty anonymous pages.

On the read side the page written to the pipe will be moved into the
recipient's address space if SPLICE_F_MOVE is used; otherwise copying
will be performed and the page will be reclaimed. Note that so long as
there is a mapping to the page, copies will be done instead, because
rmap will have upped the map count for each anonymous mapping; this can
happen due to fork(), for example. This is necessary because moving the
page will usually change the anonymous page's nonlinear index, and that
can only be done if it's unmapped.

Signed-off-by: Matt Helsley <matt.helsley@gmail.com>
Signed-off-by: Robert C Jennings <rcj@linux.vnet.ibm.com>
---
 fs/splice.c | 63 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 63 insertions(+)

diff --git a/fs/splice.c b/fs/splice.c
index a62d61e..9d2ed128 100644
--- a/fs/splice.c
+++ b/fs/splice.c
@@ -32,6 +32,10 @@
 #include <linux/gfp.h>
 #include <linux/socket.h>
 #include <linux/compat.h>
+#include <linux/page-flags.h>
+#include <linux/hugetlb.h>
+#include <linux/ksm.h>
+#include <linux/swapops.h>
 #include "internal.h"
 
 /*
@@ -1562,6 +1566,65 @@ static int pipe_to_user(struct pipe_inode_info *pipe, struct pipe_buffer *buf,
 	char *src;
 	int ret;
 
+	if (!buf->offset && (buf->len == PAGE_SIZE) &&
+	    (buf->flags & PIPE_BUF_FLAG_GIFT) && (sd->flags & SPLICE_F_MOVE)) {
+		struct page *page = buf->page;
+		struct mm_struct *mm;
+		struct vm_area_struct *vma;
+		spinlock_t *ptl;
+		pte_t *ptep, pte;
+		unsigned long useraddr;
+
+		if (!PageAnon(page))
+			goto copy;
+		if (PageCompound(page))
+			goto copy;
+		if (PageHuge(page) || PageTransHuge(page))
+			goto copy;
+		if (page_mapped(page))
+			goto copy;
+		useraddr = (unsigned long)sd->u.userptr;
+		mm = current->mm;
+
+		ret = -EAGAIN;
+		down_read(&mm->mmap_sem);
+		vma = find_vma_intersection(mm, useraddr, useraddr + PAGE_SIZE);
+		if (IS_ERR_OR_NULL(vma))
+			goto up_copy;
+		if (!vma->anon_vma) {
+			ret = anon_vma_prepare(vma);
+			if (ret)
+				goto up_copy;
+		}
+		zap_page_range(vma, useraddr, PAGE_SIZE, NULL);
+		ret = lock_page_killable(page);
+		if (ret)
+			goto up_copy;
+		ptep = get_locked_pte(mm, useraddr, &ptl);
+		if (!ptep)
+			goto unlock_up_copy;
+		pte = *ptep;
+		if (pte_present(pte))
+			goto unlock_up_copy;
+		get_page(page);
+		page_add_anon_rmap(page, vma, useraddr);
+		pte = mk_pte(page, vma->vm_page_prot);
+		set_pte_at(mm, useraddr, ptep, pte);
+		update_mmu_cache(vma, useraddr, ptep);
+		pte_unmap_unlock(ptep, ptl);
+		ret = 0;
+unlock_up_copy:
+		unlock_page(page);
+up_copy:
+		up_read(&mm->mmap_sem);
+		if (!ret) {
+			ret = sd->len;
+			goto out;
+		}
+		/* else ret < 0 and we should fallback to copying */
+		VM_BUG_ON(ret > 0);
+	}
+copy:
 	/*
 	 * See if we can use the atomic maps, by prefaulting in the
 	 * pages and doing an atomic copy
-- 
1.8.1.2
* Re: [PATCH 2/2] vmsplice: Add limited zero copy to vmsplice
From: Dave Hansen @ 2013-10-08 16:45 UTC
To: Robert C Jennings, linux-kernel
Cc: linux-fsdevel, linux-mm, Alexander Viro, Rik van Riel, Andrea Arcangeli,
    Matt Helsley, Anthony Liguori, Michael Roth, Lei Li, Leonardo Garcia,
    Vlastimil Babka

On 10/07/2013 01:21 PM, Robert C Jennings wrote:
> +	if (!buf->offset && (buf->len == PAGE_SIZE) &&
> +	    (buf->flags & PIPE_BUF_FLAG_GIFT) && (sd->flags & SPLICE_F_MOVE)) {
> +		struct page *page = buf->page;
> +		struct mm_struct *mm;
> +		struct vm_area_struct *vma;
> +		spinlock_t *ptl;
> +		pte_t *ptep, pte;
> +		unsigned long useraddr;
> +
> +		if (!PageAnon(page))
> +			goto copy;
> +		if (PageCompound(page))
> +			goto copy;
> +		if (PageHuge(page) || PageTransHuge(page))
> +			goto copy;
> +		if (page_mapped(page))
> +			goto copy;

I'd really like to see some comments about those cases. You touched on
page_mapped() above, but could you replicate some of that in a comment?

Also, considering that this is being targeted at QEMU VMs, I would
imagine that you're going to want to support PageTransHuge() in here
pretty fast. Do you anticipate that being very much trouble? Have you
planned for it in here?

> +		useraddr = (unsigned long)sd->u.userptr;
> +		mm = current->mm;
> +
> +		ret = -EAGAIN;
> +		down_read(&mm->mmap_sem);
> +		vma = find_vma_intersection(mm, useraddr, useraddr + PAGE_SIZE);

If you are only doing these a page at a time, why bother with
find_vma_intersection()? Why not a plain find_vma()?

Also, if we fail to find a VMA, won't this return -EAGAIN? That seems
like a rather uninformative error code to get returned back out to
userspace, especially since retrying won't help.

> +		if (IS_ERR_OR_NULL(vma))
> +			goto up_copy;
> +		if (!vma->anon_vma) {
> +			ret = anon_vma_prepare(vma);
> +			if (ret)
> +				goto up_copy;
> +		}

The first thing anon_vma_prepare() does is check vma->anon_vma. This
extra check seems unnecessary.

> +		zap_page_range(vma, useraddr, PAGE_SIZE, NULL);
> +		ret = lock_page_killable(page);
> +		if (ret)
> +			goto up_copy;
> +		ptep = get_locked_pte(mm, useraddr, &ptl);
> +		if (!ptep)
> +			goto unlock_up_copy;
> +		pte = *ptep;
> +		if (pte_present(pte))
> +			goto unlock_up_copy;
> +		get_page(page);
> +		page_add_anon_rmap(page, vma, useraddr);
> +		pte = mk_pte(page, vma->vm_page_prot);

'pte' is getting used for two different things here, which makes it a
bit confusing. I'd probably just skip this first assignment and directly
do:

	if (pte_present(*ptep))
		goto unlock_up_copy;

> +		set_pte_at(mm, useraddr, ptep, pte);
> +		update_mmu_cache(vma, useraddr, ptep);
> +		pte_unmap_unlock(ptep, ptl);
> +		ret = 0;
> +unlock_up_copy:
> +		unlock_page(page);
> +up_copy:
> +		up_read(&mm->mmap_sem);
> +		if (!ret) {
> +			ret = sd->len;
> +			goto out;
> +		}
> +		/* else ret < 0 and we should fallback to copying */
> +		VM_BUG_ON(ret > 0);
> +	}

This also screams to be broken out into a helper function instead of
just being thrown in with the existing code.
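For illustration, a helper extraction along the lines suggested above
might look roughly like the sketch below. It is not code from any posted
patch: the name pipe_to_user_move() is hypothetical, and it folds in the
other review points (find_vma() instead of find_vma_intersection(),
-EFAULT for a missing VMA, no redundant anon_vma check, and no reuse of
'pte'). pipe_to_user() would call it first and fall back to copying
whenever it returns a negative value.

	static int pipe_to_user_move(struct pipe_inode_info *pipe,
				     struct pipe_buffer *buf,
				     struct splice_desc *sd)
	{
		struct page *page = buf->page;
		unsigned long useraddr = (unsigned long)sd->u.userptr;
		struct mm_struct *mm = current->mm;
		struct vm_area_struct *vma;
		spinlock_t *ptl;
		pte_t *ptep;
		int ret;

		/* Only a whole, gifted, unshared, order-0 anonymous page
		 * can change owners; PageCompound() also covers the huge
		 * page cases. */
		if (buf->offset || buf->len != PAGE_SIZE ||
		    !(buf->flags & PIPE_BUF_FLAG_GIFT) ||
		    !(sd->flags & SPLICE_F_MOVE))
			return -EAGAIN;
		if (!PageAnon(page) || PageCompound(page) || page_mapped(page))
			return -EAGAIN;

		down_read(&mm->mmap_sem);
		ret = -EFAULT;
		vma = find_vma(mm, useraddr);
		if (!vma || useraddr < vma->vm_start)
			goto out_mm;
		ret = anon_vma_prepare(vma);	/* checks vma->anon_vma itself */
		if (ret)
			goto out_mm;
		zap_page_range(vma, useraddr, PAGE_SIZE, NULL);
		ret = lock_page_killable(page);
		if (ret)
			goto out_mm;
		ret = -EAGAIN;
		ptep = get_locked_pte(mm, useraddr, &ptl);
		if (!ptep)
			goto out_page;
		if (pte_present(*ptep))
			goto out_pte;
		/* Map the pipe buffer's page straight into the reader. */
		get_page(page);
		page_add_anon_rmap(page, vma, useraddr);
		set_pte_at(mm, useraddr, ptep, mk_pte(page, vma->vm_page_prot));
		update_mmu_cache(vma, useraddr, ptep);
		ret = sd->len;
	out_pte:
		pte_unmap_unlock(ptep, ptl);
	out_page:
		unlock_page(page);
	out_mm:
		up_read(&mm->mmap_sem);
		return ret;
	}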
* Re: [PATCH 2/2] vmsplice: Add limited zero copy to vmsplice
From: Robert Jennings @ 2013-10-08 17:35 UTC
To: Dave Hansen
Cc: linux-kernel, linux-fsdevel, linux-mm, Alexander Viro, Rik van Riel,
    Andrea Arcangeli, Matt Helsley, Anthony Liguori, Michael Roth, Lei Li,
    Leonardo Garcia, Vlastimil Babka

* Dave Hansen (dave@sr71.net) wrote:
> On 10/07/2013 01:21 PM, Robert C Jennings wrote:
> > +		if (!PageAnon(page))
> > +			goto copy;
> > +		if (PageCompound(page))
> > +			goto copy;
> > +		if (PageHuge(page) || PageTransHuge(page))
> > +			goto copy;
> > +		if (page_mapped(page))
> > +			goto copy;
>
> I'd really like to see some comments about those cases. You touched on
> page_mapped() above, but could you replicate some of that in a comment?

Yes, I'll add comments in the code for these cases.

> Also, considering that this is being targeted at QEMU VMs, I would
> imagine that you're going to want to support PageTransHuge() in here
> pretty fast. Do you anticipate that being very much trouble? Have you
> planned for it in here?

My focus with this patch set was to get agreement on the change to the
vmsplice syscall flags in the first patch, to perform page flipping
rather than copying. I am working on support for PageTransHuge() but it
is not complete. It reworks this function to coalesce PAGE_SIZE pipe
buffers into THP-sized units and operate on those.

> > +		useraddr = (unsigned long)sd->u.userptr;
> > +		mm = current->mm;
> > +
> > +		ret = -EAGAIN;
> > +		down_read(&mm->mmap_sem);
> > +		vma = find_vma_intersection(mm, useraddr, useraddr + PAGE_SIZE);
>
> If you are only doing these a page at a time, why bother with
> find_vma_intersection()? Why not a plain find_vma()?

You're correct, I can change this to use find_vma().

> Also, if we fail to find a VMA, won't this return -EAGAIN? That seems
> like a rather uninformative error code to get returned back out to
> userspace, especially since retrying won't help.

Yes, -EAGAIN is not good for this case; I will use -EFAULT.

> > +		if (IS_ERR_OR_NULL(vma))
> > +			goto up_copy;
> > +		if (!vma->anon_vma) {
> > +			ret = anon_vma_prepare(vma);
> > +			if (ret)
> > +				goto up_copy;
> > +		}
>
> The first thing anon_vma_prepare() does is check vma->anon_vma. This
> extra check seems unnecessary.

I'll fix this, thanks.

> 'pte' is getting used for two different things here, which makes it a
> bit confusing. I'd probably just skip this first assignment and
> directly do:
>
> 	if (pte_present(*ptep))
> 		goto unlock_up_copy;

I'll fix this, thanks.

> This also screams to be broken out into a helper function instead of
> just being thrown in with the existing code.

You're right, it's very self-contained already. I'll pull it out.
* Re: [PATCH 2/2] vmsplice: Add limited zero copy to vmsplice
From: Vlastimil Babka @ 2013-10-17 11:23 UTC
To: Robert C Jennings, linux-kernel
Cc: linux-fsdevel, linux-mm, Alexander Viro, Rik van Riel, Andrea Arcangeli,
    Dave Hansen, Matt Helsley, Anthony Liguori, Michael Roth, Lei Li,
    Leonardo Garcia

On 10/07/2013 10:21 PM, Robert C Jennings wrote:
> From: Matt Helsley <matt.helsley@gmail.com>
>
> It is sometimes useful to move anonymous pages over a pipe rather than
> save/swap them. Check the SPLICE_F_GIFT and SPLICE_F_MOVE flags to see
> if userspace would like to move such pages.
[...]

You might want to update the comments of vmsplice_to_user() and
SYSCALL_DEFINE4(vmsplice), as they both explain that the transfer is
done only via copying.

Vlastimil

[...]
* Re: [PATCH 2/2] vmsplice: Add limited zero copy to vmsplice
From: Robert Jennings @ 2013-10-17 13:44 UTC
To: Vlastimil Babka
Cc: linux-kernel, linux-fsdevel, linux-mm, Alexander Viro, Rik van Riel,
    Andrea Arcangeli, Dave Hansen, Matt Helsley, Anthony Liguori,
    Michael Roth, Lei Li, Leonardo Garcia

* Vlastimil Babka (vbabka@suse.cz) wrote:
> You might want to update the comments of vmsplice_to_user() and
> SYSCALL_DEFINE4(vmsplice), as they both explain that the transfer is
> done only via copying.

I will update those as well. Thanks.

[...]