From: Michael Kerrisk <mtk.manpages@gmail.com>
To: Shaohua Li <shli@fb.com>
Cc: linux-mm <linux-mm@kvack.org>,
danielmicay@gmail.com, Kernel-team@fb.com,
Rik van Riel <riel@redhat.com>,
Andrew Morton <akpm@linux-foundation.org>,
Hugh Dickins <hughd@google.com>,
Andy Lutomirski <luto@amacapital.net>,
Linux API <linux-api@vger.kernel.org>
Subject: Re: [RFC] mremap: add MREMAP_NOHOLE flag
Date: Wed, 4 Feb 2015 11:22:40 +0100 [thread overview]
Message-ID: <CAHO5Pa2WZuSNuwhX77fsS+q1KFLJ0Y2s7_f14zTdZWRrG65TdA@mail.gmail.com> (raw)
In-Reply-To: <7064772f72049de8a79383105f49b5db84a946e5.1422990665.git.shli@fb.com>
[CC += linux-api]
Hello Shaohua Li,
Since this is an API change, please CC linux-api@. (The kernel source
file Documentation/SubmitChecklist notes that all Linux kernel patches
that change userspace interfaces should be CCed to
linux-api@vger.kernel.org. See also
https://www.kernel.org/doc/man-pages/linux-api-ml.html)
Thanks,
Michael
On Tue, Feb 3, 2015 at 8:19 PM, Shaohua Li <shli@fb.com> wrote:
> There was a similar patch posted before, but it doesn't get merged. I'd like
> to try again if there are more discussions.
> http://marc.info/?l=linux-mm&m=141230769431688&w=2
>
> mremap can be used to accelerate realloc. The problem is mremap will
> punch a hole in original VMA, which makes specific memory allocator
> unable to utilize it. Jemalloc is an example. It manages memory in 4M
> chunks. mremap a range of the chunk will punch a hole, which other
> mmap() syscall can fill into. The 4M chunk is then fragmented, jemalloc
> can't handle it.
>
> This patch adds a new flag for mremap. With it, mremap will not punch the
> hole. page tables of original vma will be zapped in the same way, but
> vma is still there. That is original vma will look like a vma without
> pagefault. Behavior of new vma isn't changed.
>
> For private vma, accessing original vma will cause
> page fault and just like the address of the vma has never been accessed.
> So for anonymous, new page/zero page will be fault in. For file mapping,
> new page will be allocated with file reading for cow, or pagefault will
> use existing page cache.
>
> For shared vma, original and new vma will map to the same file. We can
> optimize this without zaping original vma's page table in this case, but
> this patch doesn't do it yet.
>
> Since with MREMAP_NOHOLE, original vma still exists. pagefault handler
> for special vma might not able to handle pagefault for mremap'd area.
> The patch doesn't allow vmas with VM_PFNMAP|VM_MIXEDMAP flags do NOHOLE
> mremap.
>
> Cc: Rik van Riel <riel@redhat.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Hugh Dickins <hughd@google.com>
> Cc: Andy Lutomirski <luto@amacapital.net>
> Signed-off-by: Shaohua Li <shli@fb.com>
> ---
> include/uapi/linux/mman.h | 1 +
> mm/mremap.c | 97 ++++++++++++++++++++++++++++++++---------------
> 2 files changed, 67 insertions(+), 31 deletions(-)
>
> diff --git a/include/uapi/linux/mman.h b/include/uapi/linux/mman.h
> index ade4acd..9ee9a15 100644
> --- a/include/uapi/linux/mman.h
> +++ b/include/uapi/linux/mman.h
> @@ -5,6 +5,7 @@
>
> #define MREMAP_MAYMOVE 1
> #define MREMAP_FIXED 2
> +#define MREMAP_NOHOLE 4
>
> #define OVERCOMMIT_GUESS 0
> #define OVERCOMMIT_ALWAYS 1
> diff --git a/mm/mremap.c b/mm/mremap.c
> index 3b886dc..ea3f40d 100644
> --- a/mm/mremap.c
> +++ b/mm/mremap.c
> @@ -236,7 +236,8 @@ unsigned long move_page_tables(struct vm_area_struct *vma,
>
> static unsigned long move_vma(struct vm_area_struct *vma,
> unsigned long old_addr, unsigned long old_len,
> - unsigned long new_len, unsigned long new_addr, bool *locked)
> + unsigned long new_len, unsigned long new_addr, bool *locked,
> + bool nohole)
> {
> struct mm_struct *mm = vma->vm_mm;
> struct vm_area_struct *new_vma;
> @@ -292,7 +293,7 @@ static unsigned long move_vma(struct vm_area_struct *vma,
> vma->vm_file->f_op->mremap(vma->vm_file, new_vma);
>
> /* Conceal VM_ACCOUNT so old reservation is not undone */
> - if (vm_flags & VM_ACCOUNT) {
> + if ((vm_flags & VM_ACCOUNT) && !nohole) {
> vma->vm_flags &= ~VM_ACCOUNT;
> excess = vma->vm_end - vma->vm_start - old_len;
> if (old_addr > vma->vm_start &&
> @@ -312,11 +313,18 @@ static unsigned long move_vma(struct vm_area_struct *vma,
> hiwater_vm = mm->hiwater_vm;
> vm_stat_account(mm, vma->vm_flags, vma->vm_file, new_len>>PAGE_SHIFT);
>
> - if (do_munmap(mm, old_addr, old_len) < 0) {
> + if (!nohole && do_munmap(mm, old_addr, old_len) < 0) {
> /* OOM: unable to split vma, just get accounts right */
> vm_unacct_memory(excess >> PAGE_SHIFT);
> excess = 0;
> }
> +
> + if (nohole && (new_addr & ~PAGE_MASK)) {
> + /* caller will unaccount */
> + vma->vm_flags &= ~VM_ACCOUNT;
> + do_munmap(mm, old_addr, old_len);
> + }
> +
> mm->hiwater_vm = hiwater_vm;
>
> /* Restore VM_ACCOUNT if one or two pieces of vma left */
> @@ -334,14 +342,13 @@ static unsigned long move_vma(struct vm_area_struct *vma,
> return new_addr;
> }
>
> -static struct vm_area_struct *vma_to_resize(unsigned long addr,
> - unsigned long old_len, unsigned long new_len, unsigned long *p)
> +static unsigned long validate_vma_and_charge(struct vm_area_struct *vma,
> + unsigned long addr,
> + unsigned long old_len, unsigned long new_len, unsigned long *p,
> + bool nohole)
> {
> struct mm_struct *mm = current->mm;
> - struct vm_area_struct *vma = find_vma(mm, addr);
> -
> - if (!vma || vma->vm_start > addr)
> - goto Efault;
> + unsigned long diff;
>
> if (is_vm_hugetlb_page(vma))
> goto Einval;
> @@ -350,6 +357,9 @@ static struct vm_area_struct *vma_to_resize(unsigned long addr,
> if (old_len > vma->vm_end - addr)
> goto Efault;
>
> + if (nohole && (vma->vm_flags & (VM_PFNMAP | VM_MIXEDMAP)))
> + goto Einval;
> +
> /* Need to be careful about a growing mapping */
> if (new_len > old_len) {
> unsigned long pgoff;
> @@ -362,39 +372,45 @@ static struct vm_area_struct *vma_to_resize(unsigned long addr,
> goto Einval;
> }
>
> + if (nohole)
> + diff = new_len;
> + else
> + diff = new_len - old_len;
> +
> if (vma->vm_flags & VM_LOCKED) {
> unsigned long locked, lock_limit;
> locked = mm->locked_vm << PAGE_SHIFT;
> lock_limit = rlimit(RLIMIT_MEMLOCK);
> - locked += new_len - old_len;
> + locked += diff;
> if (locked > lock_limit && !capable(CAP_IPC_LOCK))
> goto Eagain;
> }
>
> - if (!may_expand_vm(mm, (new_len - old_len) >> PAGE_SHIFT))
> + if (!may_expand_vm(mm, diff >> PAGE_SHIFT))
> goto Enomem;
>
> if (vma->vm_flags & VM_ACCOUNT) {
> - unsigned long charged = (new_len - old_len) >> PAGE_SHIFT;
> + unsigned long charged = diff >> PAGE_SHIFT;
> if (security_vm_enough_memory_mm(mm, charged))
> goto Efault;
> *p = charged;
> }
>
> - return vma;
> + return 0;
>
> Efault: /* very odd choice for most of the cases, but... */
> - return ERR_PTR(-EFAULT);
> + return -EFAULT;
> Einval:
> - return ERR_PTR(-EINVAL);
> + return -EINVAL;
> Enomem:
> - return ERR_PTR(-ENOMEM);
> + return -ENOMEM;
> Eagain:
> - return ERR_PTR(-EAGAIN);
> + return -EAGAIN;
> }
>
> static unsigned long mremap_to(unsigned long addr, unsigned long old_len,
> - unsigned long new_addr, unsigned long new_len, bool *locked)
> + unsigned long new_addr, unsigned long new_len, bool *locked,
> + bool nohole)
> {
> struct mm_struct *mm = current->mm;
> struct vm_area_struct *vma;
> @@ -422,17 +438,23 @@ static unsigned long mremap_to(unsigned long addr, unsigned long old_len,
> goto out;
>
> if (old_len >= new_len) {
> - ret = do_munmap(mm, addr+new_len, old_len - new_len);
> - if (ret && old_len != new_len)
> - goto out;
> + if (!nohole) {
> + ret = do_munmap(mm, addr+new_len, old_len - new_len);
> + if (ret && old_len != new_len)
> + goto out;
> + }
> old_len = new_len;
> }
>
> - vma = vma_to_resize(addr, old_len, new_len, &charged);
> - if (IS_ERR(vma)) {
> - ret = PTR_ERR(vma);
> + vma = find_vma(mm, addr);
> + if (!vma || vma->vm_start > addr) {
> + ret = -EFAULT;
> goto out;
> }
> + ret = validate_vma_and_charge(vma, addr, old_len, new_len, &charged,
> + nohole);
> + if (ret)
> + goto out;
>
> map_flags = MAP_FIXED;
> if (vma->vm_flags & VM_MAYSHARE)
> @@ -444,7 +466,7 @@ static unsigned long mremap_to(unsigned long addr, unsigned long old_len,
> if (ret & ~PAGE_MASK)
> goto out1;
>
> - ret = move_vma(vma, addr, old_len, new_len, new_addr, locked);
> + ret = move_vma(vma, addr, old_len, new_len, new_addr, locked, nohole);
> if (!(ret & ~PAGE_MASK))
> goto out;
> out1:
> @@ -483,8 +505,9 @@ SYSCALL_DEFINE5(mremap, unsigned long, addr, unsigned long, old_len,
> unsigned long ret = -EINVAL;
> unsigned long charged = 0;
> bool locked = false;
> + bool nohole = flags & MREMAP_NOHOLE;
>
> - if (flags & ~(MREMAP_FIXED | MREMAP_MAYMOVE))
> + if (flags & ~(MREMAP_FIXED | MREMAP_MAYMOVE | MREMAP_NOHOLE))
> return ret;
>
> if (flags & MREMAP_FIXED && !(flags & MREMAP_MAYMOVE))
> @@ -508,7 +531,7 @@ SYSCALL_DEFINE5(mremap, unsigned long, addr, unsigned long, old_len,
>
> if (flags & MREMAP_FIXED) {
> ret = mremap_to(addr, old_len, new_addr, new_len,
> - &locked);
> + &locked, nohole);
> goto out;
> }
>
> @@ -528,9 +551,9 @@ SYSCALL_DEFINE5(mremap, unsigned long, addr, unsigned long, old_len,
> /*
> * Ok, we need to grow..
> */
> - vma = vma_to_resize(addr, old_len, new_len, &charged);
> - if (IS_ERR(vma)) {
> - ret = PTR_ERR(vma);
> + vma = find_vma(mm, addr);
> + if (!vma || vma->vm_start > addr) {
> + ret = -EFAULT;
> goto out;
> }
>
> @@ -541,6 +564,12 @@ SYSCALL_DEFINE5(mremap, unsigned long, addr, unsigned long, old_len,
> if (vma_expandable(vma, new_len - old_len)) {
> int pages = (new_len - old_len) >> PAGE_SHIFT;
>
> + ret = validate_vma_and_charge(vma, addr, old_len, new_len,
> + &charged, false);
> + if (ret) {
> + BUG_ON(charged != 0);
> + goto out;
> + }
> if (vma_adjust(vma, vma->vm_start, addr + new_len,
> vma->vm_pgoff, NULL)) {
> ret = -ENOMEM;
> @@ -558,6 +587,11 @@ SYSCALL_DEFINE5(mremap, unsigned long, addr, unsigned long, old_len,
> }
> }
>
> + ret = validate_vma_and_charge(vma, addr, old_len, new_len,
> + &charged, nohole);
> + if (ret)
> + goto out;
> +
> /*
> * We weren't able to just expand or shrink the area,
> * we need to create a new one and move it..
> @@ -577,7 +611,8 @@ SYSCALL_DEFINE5(mremap, unsigned long, addr, unsigned long, old_len,
> goto out;
> }
>
> - ret = move_vma(vma, addr, old_len, new_len, new_addr, &locked);
> + ret = move_vma(vma, addr, old_len, new_len, new_addr, &locked,
> + nohole);
> }
> out:
> if (ret & ~PAGE_MASK)
> --
> 1.8.1
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org. For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
--
Michael Kerrisk Linux man-pages maintainer;
http://www.kernel.org/doc/man-pages/
Author of "The Linux Programming Interface", http://blog.man7.org/
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
next prev parent reply other threads:[~2015-02-04 10:23 UTC|newest]
Thread overview: 4+ messages / expand[flat|nested] mbox.gz Atom feed top
2015-02-03 19:19 [RFC] mremap: add MREMAP_NOHOLE flag Shaohua Li
2015-02-03 23:02 ` Daniel Micay
2015-02-04 10:22 ` Michael Kerrisk [this message]
2015-02-23 22:10 ` Shaohua Li
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=CAHO5Pa2WZuSNuwhX77fsS+q1KFLJ0Y2s7_f14zTdZWRrG65TdA@mail.gmail.com \
--to=mtk.manpages@gmail.com \
--cc=Kernel-team@fb.com \
--cc=akpm@linux-foundation.org \
--cc=danielmicay@gmail.com \
--cc=hughd@google.com \
--cc=linux-api@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=luto@amacapital.net \
--cc=riel@redhat.com \
--cc=shli@fb.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).