* [PATCH 0/9] Avoid populating unbounded num of ptes with mmap_sem held
@ 2012-12-21  0:49 Michel Lespinasse
  2012-12-21  0:49 ` [PATCH 1/9] mm: make mlockall preserve flags other than VM_LOCKED in def_flags Michel Lespinasse
  ` (11 more replies)
  0 siblings, 12 replies; 40+ messages in thread
From: Michel Lespinasse @ 2012-12-21  0:49 UTC (permalink / raw)
  To: Andy Lutomirski, Ingo Molnar, Al Viro, Hugh Dickins, Jorn_Engel, Rik van Riel
  Cc: Andrew Morton, linux-mm, linux-kernel

We have many vma manipulation functions that are fast in the typical case,
but can optionally be instructed to populate an unbounded number of ptes
within the region they work on:

- mmap with MAP_POPULATE or MAP_LOCKED flags;
- remap_file_pages() with MAP_NONBLOCK not set or when working on a
  VM_LOCKED vma;
- mmap_region() and all its wrappers when mlock(MCL_FUTURE) is in effect;
- brk() when mlock(MCL_FUTURE) is in effect.

Current code handles these pte operations locally, while the surrounding
code has to hold the mmap_sem write side since it's manipulating vmas.
This means we're doing an unbounded amount of pte population work with
mmap_sem held, and this causes problems as Andy Lutomirski reported
(we've hit this at Google as well, though it's not entirely clear why
people keep trying to use mlock(MCL_FUTURE) in the first place).

I propose introducing a new mm_populate() function to do this pte
population work after the mmap_sem has been released. mm_populate() does
need to acquire the mmap_sem read side, but critically, it doesn't need
to hold it continuously for the entire duration of the operation - it can
drop it whenever things take too long (such as when hitting disk for a
file read) and re-acquire it later on.

The following patches are against v3.7:

- Patches 1-2 fix some issues I noticed while working on the existing
  code. If needed, they could potentially go in before the rest of the
  patches.
- Patch 3 introduces the new mm_populate() function and changes
  mmap_region() call sites to use it after they drop mmap_sem. This is
  inspired by Andy Lutomirski's proposal and is built as an extension of
  the work I had previously done for mlock() and mlockall() around
  v2.6.38-rc1. I had tried doing something similar at the time but had
  given up as there were so many do_mmap() call sites; the recent
  cleanups by Linus and Viro are a tremendous help here.

- Patches 4-6 convert some of the less-obvious places doing unbounded
  pte populates to the new mm_populate() mechanism.

- Patches 7-8 are code cleanups that are made possible by the
  mm_populate() work. In particular, they remove more code than the
  entire patch series added, which should be a good thing :)

- Patch 9 is optional to this entire series. It only helps to deal more
  nicely with racy userspace programs that might modify their mappings
  while we're trying to populate them. It adds a new VM_POPULATE flag on
  the mappings we do want to populate, so that if userspace replaces them
  with mappings it doesn't want populated, mm_populate() won't populate
  those replacement mappings.

Michel Lespinasse (9):
  mm: make mlockall preserve flags other than VM_LOCKED in def_flags
  mm: remap_file_pages() fixes
  mm: introduce mm_populate() for populating new vmas
  mm: use mm_populate() for blocking remap_file_pages()
  mm: use mm_populate() when adjusting brk with MCL_FUTURE in effect.
  mm: use mm_populate() for mremap() of VM_LOCKED vmas
  mm: remove flags argument to mmap_region
  mm: directly use __mlock_vma_pages_range() in find_extend_vma()
  mm: introduce VM_POPULATE flag to better deal with racy userspace programs

 arch/tile/mm/elf.c   |  1 -
 fs/aio.c             |  6 +++-
 include/linux/mm.h   | 23 +++++++++---
 include/linux/mman.h |  4 ++-
 ipc/shm.c            | 12 ++++---
 mm/fremap.c          | 51 ++++++++++++++-------------
 mm/internal.h        |  4 +-
 mm/memory.c          | 24 -------------
 mm/mlock.c           | 94 +++++++++++++------------------------------------
 mm/mmap.c            | 77 ++++++++++++++++++++++++----------------
 mm/mremap.c          | 25 +++++++------
 mm/nommu.c           |  5 ++-
 mm/util.c            |  6 +++-
 13 files changed, 154 insertions(+), 178 deletions(-)

--
1.7.7.3

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: email@kvack.org
* [PATCH 1/9] mm: make mlockall preserve flags other than VM_LOCKED in def_flags
  2012-12-21  0:49 [PATCH 0/9] Avoid populating unbounded num of ptes with mmap_sem held Michel Lespinasse
@ 2012-12-21  0:49 ` Michel Lespinasse
  2012-12-22  4:25   ` Rik van Riel
  2012-12-21  0:49 ` [PATCH 2/9] mm: remap_file_pages() fixes Michel Lespinasse
  ` (10 subsequent siblings)
  11 siblings, 1 reply; 40+ messages in thread
From: Michel Lespinasse @ 2012-12-21  0:49 UTC (permalink / raw)
  To: Andy Lutomirski, Ingo Molnar, Al Viro, Hugh Dickins, Jorn_Engel, Rik van Riel
  Cc: Andrew Morton, linux-mm, linux-kernel

On most architectures, def_flags is either 0 or VM_LOCKED depending on
whether mlockall(MCL_FUTURE) was called. However, this is not an absolute
rule as kvm support on s390 may set the VM_NOHUGEPAGE flag in def_flags.
We don't want mlockall to clear that.

Signed-off-by: Michel Lespinasse <walken@google.com>
---
 mm/mlock.c | 5 +++--
 1 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/mm/mlock.c b/mm/mlock.c
index f0b9ce572fc7..a2ee45c030fa 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -517,10 +517,11 @@ SYSCALL_DEFINE2(munlock, unsigned long, start, size_t, len)
 static int do_mlockall(int flags)
 {
 	struct vm_area_struct * vma, * prev = NULL;
-	unsigned int def_flags = 0;
+	unsigned int def_flags;
 
+	def_flags = current->mm->def_flags & ~VM_LOCKED;
 	if (flags & MCL_FUTURE)
-		def_flags = VM_LOCKED;
+		def_flags |= VM_LOCKED;
 	current->mm->def_flags = def_flags;
 	if (flags == MCL_FUTURE)
 		goto out;
--
1.7.7.3
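The flag arithmetic in the patch above is easy to check in isolation. A minimal sketch follows; `mlockall_def_flags` is a hypothetical helper, and the flag values are illustrative placeholders rather than the kernel's definitions. The point is that only the VM_LOCKED bit is recomputed, so an unrelated bit such as VM_NOHUGEPAGE survives the call.

```c
#include <assert.h>

/* Illustrative flag values, not necessarily the kernel's. */
#define VM_LOCKED     0x00002000u
#define VM_NOHUGEPAGE 0x40000000u

/*
 * Patched do_mlockall() logic: start from the current def_flags with
 * only VM_LOCKED cleared, so all unrelated bits are preserved.
 */
static unsigned int mlockall_def_flags(unsigned int cur, int mcl_future)
{
	unsigned int def_flags = cur & ~VM_LOCKED;

	if (mcl_future)
		def_flags |= VM_LOCKED;
	return def_flags;
}
```

The pre-patch code effectively computed `mcl_future ? VM_LOCKED : 0`, discarding any bit like VM_NOHUGEPAGE that s390 kvm may have set.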
* Re: [PATCH 1/9] mm: make mlockall preserve flags other than VM_LOCKED in def_flags
  2012-12-21  0:49 ` [PATCH 1/9] mm: make mlockall preserve flags other than VM_LOCKED in def_flags Michel Lespinasse
@ 2012-12-22  4:25   ` Rik van Riel
  0 siblings, 0 replies; 40+ messages in thread
From: Rik van Riel @ 2012-12-22  4:25 UTC (permalink / raw)
  To: Michel Lespinasse
  Cc: Andy Lutomirski, Ingo Molnar, Al Viro, Hugh Dickins, Jorn_Engel,
      Andrew Morton, linux-mm, linux-kernel

On 12/20/2012 07:49 PM, Michel Lespinasse wrote:
> On most architectures, def_flags is either 0 or VM_LOCKED depending on
> whether mlockall(MCL_FUTURE) was called. However, this is not an absolute
> rule as kvm support on s390 may set the VM_NOHUGEPAGE flag in def_flags.
> We don't want mlockall to clear that.
>
> Signed-off-by: Michel Lespinasse <walken@google.com>

Reviewed-by: Rik van Riel <riel@redhat.com>
* [PATCH 2/9] mm: remap_file_pages() fixes
  2012-12-21  0:49 [PATCH 0/9] Avoid populating unbounded num of ptes with mmap_sem held Michel Lespinasse
  2012-12-21  0:49 ` [PATCH 1/9] mm: make mlockall preserve flags other than VM_LOCKED in def_flags Michel Lespinasse
@ 2012-12-21  0:49 ` Michel Lespinasse
  2013-01-03  0:31   ` Rik van Riel
  2012-12-21  0:49 ` [PATCH 3/9] mm: introduce mm_populate() for populating new vmas Michel Lespinasse
  ` (9 subsequent siblings)
  11 siblings, 1 reply; 40+ messages in thread
From: Michel Lespinasse @ 2012-12-21  0:49 UTC (permalink / raw)
  To: Andy Lutomirski, Ingo Molnar, Al Viro, Hugh Dickins, Jorn_Engel, Rik van Riel
  Cc: Andrew Morton, linux-mm, linux-kernel

Assorted small fixes. The first two are quite small:

- Move the check for vma->vm_private_data && !(vma->vm_flags & VM_NONLINEAR)
  inside the existing if (!(vma->vm_flags & VM_NONLINEAR)) block.
  Purely cosmetic.

- In the VM_LOCKED case, when dropping PG_Mlocked for the over-mapped
  range, make sure we own the mmap_sem write lock around the
  munlock_vma_pages_range call, as this manipulates the vma's vm_flags.

The last fix requires a longer explanation. remap_file_pages() can do its
work either through VM_NONLINEAR manipulation or by creating extra vmas.
These two cases were inconsistent with each other (and ultimately, both
wrong) as to exactly when they faulted in the newly mapped file pages:

- In the VM_NONLINEAR case, new file pages would be populated if the
  MAP_NONBLOCK flag wasn't passed. If MAP_NONBLOCK was passed, new file
  pages wouldn't be populated even if the vma was already marked as
  VM_LOCKED.

- In the linear (emulated) case, the work is passed to the mmap_region()
  function, which would populate the pages if the vma was marked as
  VM_LOCKED, and would not otherwise - regardless of the value of the
  MAP_NONBLOCK flag, because MAP_POPULATE wasn't being passed to
  mmap_region().
The desired behavior is that we want the pages to be populated and locked if the vma is marked as VM_LOCKED, or to be populated if the MAP_NONBLOCK flag is not passed to remap_file_pages(). Signed-off-by: Michel Lespinasse <walken@google.com> --- mm/fremap.c | 22 ++++++++++++++-------- 1 files changed, 14 insertions(+), 8 deletions(-) diff --git a/mm/fremap.c b/mm/fremap.c index a0aaf0e56800..2db886e31044 100644 --- a/mm/fremap.c +++ b/mm/fremap.c @@ -160,15 +160,11 @@ SYSCALL_DEFINE5(remap_file_pages, unsigned long, start, unsigned long, size, /* * Make sure the vma is shared, that it supports prefaulting, * and that the remapped range is valid and fully within - * the single existing vma. vm_private_data is used as a - * swapout cursor in a VM_NONLINEAR vma. + * the single existing vma. */ if (!vma || !(vma->vm_flags & VM_SHARED)) goto out; - if (vma->vm_private_data && !(vma->vm_flags & VM_NONLINEAR)) - goto out; - if (!vma->vm_ops || !vma->vm_ops->remap_pages) goto out; @@ -177,6 +173,13 @@ SYSCALL_DEFINE5(remap_file_pages, unsigned long, start, unsigned long, size, /* Must set VM_NONLINEAR before any pages are populated. */ if (!(vma->vm_flags & VM_NONLINEAR)) { + /* + * vm_private_data is used as a swapout cursor + * in a VM_NONLINEAR vma. 
+ */ + if (vma->vm_private_data) + goto out; + /* Don't need a nonlinear mapping, exit success */ if (pgoff == linear_page_index(vma, start)) { err = 0; @@ -184,6 +187,7 @@ SYSCALL_DEFINE5(remap_file_pages, unsigned long, start, unsigned long, size, } if (!has_write_lock) { +get_write_lock: up_read(&mm->mmap_sem); down_write(&mm->mmap_sem); has_write_lock = 1; @@ -199,7 +203,7 @@ SYSCALL_DEFINE5(remap_file_pages, unsigned long, start, unsigned long, size, unsigned long addr; struct file *file = get_file(vma->vm_file); - flags &= MAP_NONBLOCK; + flags = (flags & MAP_NONBLOCK) | MAP_POPULATE; addr = mmap_region(file, start, size, flags, vma->vm_flags, pgoff); fput(file); @@ -225,6 +229,8 @@ SYSCALL_DEFINE5(remap_file_pages, unsigned long, start, unsigned long, size, * drop PG_Mlocked flag for over-mapped range */ vm_flags_t saved_flags = vma->vm_flags; + if (!has_write_lock) + goto get_write_lock; munlock_vma_pages_range(vma, start, start + size); vma->vm_flags = saved_flags; } @@ -232,13 +238,13 @@ SYSCALL_DEFINE5(remap_file_pages, unsigned long, start, unsigned long, size, mmu_notifier_invalidate_range_start(mm, start, start + size); err = vma->vm_ops->remap_pages(vma, start, size, pgoff); mmu_notifier_invalidate_range_end(mm, start, start + size); - if (!err && !(flags & MAP_NONBLOCK)) { + if (!err) { if (vma->vm_flags & VM_LOCKED) { /* * might be mapping previously unmapped range of file */ mlock_vma_pages_range(vma, start, start + size); - } else { + } else if (!(flags & MAP_NONBLOCK)) { if (unlikely(has_write_lock)) { downgrade_write(&mm->mmap_sem); has_write_lock = 0; -- 1.7.7.3 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply related [flat|nested] 40+ messages in thread
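The desired population rule stated in the changelog above can be written as a single predicate. This is a hedged sketch: `remap_should_populate` is a hypothetical helper, and the flag values are illustrative, not the kernel's definitions.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative flag values. */
#define VM_LOCKED    0x2000ul
#define MAP_NONBLOCK 0x10000ul

/*
 * Desired remap_file_pages() behavior after the fix: populate (and lock)
 * the pages when the vma is VM_LOCKED, and populate them whenever the
 * caller did not pass MAP_NONBLOCK.
 */
static bool remap_should_populate(unsigned long vm_flags, unsigned long flags)
{
	return (vm_flags & VM_LOCKED) || !(flags & MAP_NONBLOCK);
}
```

Note that VM_LOCKED wins over MAP_NONBLOCK, which is exactly the case the pre-patch VM_NONLINEAR path got wrong.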
* Re: [PATCH 2/9] mm: remap_file_pages() fixes 2012-12-21 0:49 ` [PATCH 2/9] mm: remap_file_pages() fixes Michel Lespinasse @ 2013-01-03 0:31 ` Rik van Riel 0 siblings, 0 replies; 40+ messages in thread From: Rik van Riel @ 2013-01-03 0:31 UTC (permalink / raw) To: Michel Lespinasse Cc: Andy Lutomirski, Ingo Molnar, Al Viro, Hugh Dickins, Jorn_Engel, Andrew Morton, linux-mm, linux-kernel On 12/20/2012 07:49 PM, Michel Lespinasse wrote: > Assorted small fixes. The first two are quite small: > > - Move check for vma->vm_private_data && !(vma->vm_flags & VM_NONLINEAR) > within existing if (!(vma->vm_flags & VM_NONLINEAR)) block. > Purely cosmetic. > > - In the VM_LOCKED case, when dropping PG_Mlocked for the over-mapped > range, make sure we own the mmap_sem write lock around the > munlock_vma_pages_range call as this manipulates the vma's vm_flags. > > Last fix requires a longer explanation. remap_file_pages() can do its work > either through VM_NONLINEAR manipulation or by creating extra vmas. > These two cases were inconsistent with each other (and ultimately, both wrong) > as to exactly when did they fault in the newly mapped file pages: > > - In the VM_NONLINEAR case, new file pages would be populated if > the MAP_NONBLOCK flag wasn't passed. If MAP_NONBLOCK was passed, > new file pages wouldn't be populated even if the vma is already > marked as VM_LOCKED. > > - In the linear (emulated) case, the work is passed to the mmap_region() > function which would populate the pages if the vma is marked as > VM_LOCKED, and would not otherwise - regardless of the value of the > MAP_NONBLOCK flag, because MAP_POPULATE wasn't being passed to > mmap_region(). > > The desired behavior is that we want the pages to be populated and locked > if the vma is marked as VM_LOCKED, or to be populated if the MAP_NONBLOCK > flag is not passed to remap_file_pages(). 
> > Signed-off-by: Michel Lespinasse <walken@google.com> Acked-by: Rik van Riel <riel@redhat.com> -- All rights reversed -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 40+ messages in thread
* [PATCH 3/9] mm: introduce mm_populate() for populating new vmas
  2012-12-21  0:49 [PATCH 0/9] Avoid populating unbounded num of ptes with mmap_sem held Michel Lespinasse
  2012-12-21  0:49 ` [PATCH 1/9] mm: make mlockall preserve flags other than VM_LOCKED in def_flags Michel Lespinasse
  2012-12-21  0:49 ` [PATCH 2/9] mm: remap_file_pages() fixes Michel Lespinasse
@ 2012-12-21  0:49 ` Michel Lespinasse
  2013-01-03  2:14   ` Rik van Riel
  2012-12-21  0:49 ` [PATCH 4/9] mm: use mm_populate() for blocking remap_file_pages() Michel Lespinasse
  ` (8 subsequent siblings)
  11 siblings, 1 reply; 40+ messages in thread
From: Michel Lespinasse @ 2012-12-21  0:49 UTC (permalink / raw)
  To: Andy Lutomirski, Ingo Molnar, Al Viro, Hugh Dickins, Jorn_Engel, Rik van Riel
  Cc: Andrew Morton, linux-mm, linux-kernel

When creating new mappings using the MAP_POPULATE / MAP_LOCKED flags
(or with MCL_FUTURE in effect), we want to populate the pages within the
newly created vmas. This may take a while as we may have to read pages
from disk, so ideally we want to do this outside of the write-locked
mmap_sem region.

This change introduces mm_populate(), which is used to defer populating
such mappings until after the mmap_sem write lock has been released.
This is implemented as a generalization of the former do_mlock_pages(),
which accomplished the same task but was used during mlock() /
mlockall().
Reported-by: Andy Lutomirski <luto@amacapital.net> Signed-off-by: Michel Lespinasse <walken@google.com> --- fs/aio.c | 6 +++++- include/linux/mm.h | 18 +++++++++++++++--- ipc/shm.c | 12 +++++++----- mm/mlock.c | 17 +++++++++++------ mm/mmap.c | 20 +++++++++++++++----- mm/nommu.c | 5 ++++- mm/util.c | 6 +++++- 7 files changed, 62 insertions(+), 22 deletions(-) diff --git a/fs/aio.c b/fs/aio.c index 71f613cf4a85..82eec7c7b4bb 100644 --- a/fs/aio.c +++ b/fs/aio.c @@ -103,6 +103,7 @@ static int aio_setup_ring(struct kioctx *ctx) unsigned nr_events = ctx->max_reqs; unsigned long size; int nr_pages; + bool populate; /* Compensate for the ring buffer's head/tail overlap entry */ nr_events += 2; /* 1 is required, 2 for good luck */ @@ -129,7 +130,8 @@ static int aio_setup_ring(struct kioctx *ctx) down_write(&ctx->mm->mmap_sem); info->mmap_base = do_mmap_pgoff(NULL, 0, info->mmap_size, PROT_READ|PROT_WRITE, - MAP_ANONYMOUS|MAP_PRIVATE, 0); + MAP_ANONYMOUS|MAP_PRIVATE, 0, + &populate); if (IS_ERR((void *)info->mmap_base)) { up_write(&ctx->mm->mmap_sem); info->mmap_size = 0; @@ -147,6 +149,8 @@ static int aio_setup_ring(struct kioctx *ctx) aio_free_ring(ctx); return -EAGAIN; } + if (populate) + mm_populate(info->mmap_base, info->mmap_size); ctx->user_id = info->mmap_base; diff --git a/include/linux/mm.h b/include/linux/mm.h index bcaab4e6fe91..fea461cd9027 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -1444,11 +1444,23 @@ extern unsigned long get_unmapped_area(struct file *, unsigned long, unsigned lo extern unsigned long mmap_region(struct file *file, unsigned long addr, unsigned long len, unsigned long flags, vm_flags_t vm_flags, unsigned long pgoff); -extern unsigned long do_mmap_pgoff(struct file *, unsigned long, - unsigned long, unsigned long, - unsigned long, unsigned long); +extern unsigned long do_mmap_pgoff(struct file *file, unsigned long addr, + unsigned long len, unsigned long prot, unsigned long flags, + unsigned long pgoff, bool *populate); extern 
int do_munmap(struct mm_struct *, unsigned long, size_t); +#ifdef CONFIG_MMU +extern int __mm_populate(unsigned long addr, unsigned long len, + int ignore_errors); +static inline void mm_populate(unsigned long addr, unsigned long len) +{ + /* Ignore errors */ + (void) __mm_populate(addr, len, 1); +} +#else +static inline void mm_populate(unsigned long addr, unsigned long len) {} +#endif + /* These take the mm semaphore themselves */ extern unsigned long vm_brk(unsigned long, unsigned long); extern int vm_munmap(unsigned long, size_t); diff --git a/ipc/shm.c b/ipc/shm.c index dff40c9f73c9..ee2dde1f94d1 100644 --- a/ipc/shm.c +++ b/ipc/shm.c @@ -966,11 +966,11 @@ long do_shmat(int shmid, char __user *shmaddr, int shmflg, ulong *raddr, unsigned long flags; unsigned long prot; int acc_mode; - unsigned long user_addr; struct ipc_namespace *ns; struct shm_file_data *sfd; struct path path; fmode_t f_mode; + bool populate = false; err = -EINVAL; if (shmid < 0) @@ -1069,13 +1069,15 @@ long do_shmat(int shmid, char __user *shmaddr, int shmflg, ulong *raddr, goto invalid; } - user_addr = do_mmap_pgoff(file, addr, size, prot, flags, 0); - *raddr = user_addr; + addr = do_mmap_pgoff(file, addr, size, prot, flags, 0, &populate); + *raddr = addr; err = 0; - if (IS_ERR_VALUE(user_addr)) - err = (long)user_addr; + if (IS_ERR_VALUE(addr)) + err = (long)addr; invalid: up_write(¤t->mm->mmap_sem); + if (populate) + mm_populate(addr, size); out_fput: fput(file); diff --git a/mm/mlock.c b/mm/mlock.c index a2ee45c030fa..7f94bc3b46ef 100644 --- a/mm/mlock.c +++ b/mm/mlock.c @@ -416,7 +416,14 @@ static int do_mlock(unsigned long start, size_t len, int on) return error; } -static int do_mlock_pages(unsigned long start, size_t len, int ignore_errors) +/* + * __mm_populate - populate and/or mlock pages within a range of address space. + * + * This is used to implement mlock() and the MAP_POPULATE / MAP_LOCKED mmap + * flags. 
VMAs must be already marked with the desired vm_flags, and + * mmap_sem must not be held. + */ +int __mm_populate(unsigned long start, unsigned long len, int ignore_errors) { struct mm_struct *mm = current->mm; unsigned long end, nstart, nend; @@ -498,7 +505,7 @@ SYSCALL_DEFINE2(mlock, unsigned long, start, size_t, len) error = do_mlock(start, len, 1); up_write(¤t->mm->mmap_sem); if (!error) - error = do_mlock_pages(start, len, 0); + error = __mm_populate(start, len, 0); return error; } @@ -565,10 +572,8 @@ SYSCALL_DEFINE1(mlockall, int, flags) capable(CAP_IPC_LOCK)) ret = do_mlockall(flags); up_write(¤t->mm->mmap_sem); - if (!ret && (flags & MCL_CURRENT)) { - /* Ignore errors */ - do_mlock_pages(0, TASK_SIZE, 1); - } + if (!ret && (flags & MCL_CURRENT)) + mm_populate(0, TASK_SIZE); out: return ret; } diff --git a/mm/mmap.c b/mm/mmap.c index 9a796c41e7d9..a16fc499dbd1 100644 --- a/mm/mmap.c +++ b/mm/mmap.c @@ -1001,12 +1001,15 @@ static inline unsigned long round_hint_to_min(unsigned long hint) unsigned long do_mmap_pgoff(struct file *file, unsigned long addr, unsigned long len, unsigned long prot, - unsigned long flags, unsigned long pgoff) + unsigned long flags, unsigned long pgoff, + bool *populate) { struct mm_struct * mm = current->mm; struct inode *inode; vm_flags_t vm_flags; + *populate = false; + /* * Does the application expect PROT_READ to imply PROT_EXEC? 
* @@ -1127,7 +1130,12 @@ unsigned long do_mmap_pgoff(struct file *file, unsigned long addr, } } - return mmap_region(file, addr, len, flags, vm_flags, pgoff); + addr = mmap_region(file, addr, len, flags, vm_flags, pgoff); + if (!IS_ERR_VALUE(addr) && + ((vm_flags & VM_LOCKED) || + (flags & (MAP_POPULATE | MAP_NONBLOCK)) == MAP_POPULATE)) + *populate = true; + return addr; } SYSCALL_DEFINE6(mmap_pgoff, unsigned long, addr, unsigned long, len, @@ -1373,10 +1381,12 @@ out: vm_stat_account(mm, vm_flags, file, len >> PAGE_SHIFT); if (vm_flags & VM_LOCKED) { - if (!mlock_vma_pages_range(vma, addr, addr + len)) + if (!((vm_flags & VM_SPECIAL) || is_vm_hugetlb_page(vma) || + vma == get_gate_vma(current->mm))) mm->locked_vm += (len >> PAGE_SHIFT); - } else if ((flags & MAP_POPULATE) && !(flags & MAP_NONBLOCK)) - make_pages_present(addr, addr + len); + else + vma->vm_flags &= ~VM_LOCKED; + } if (file) uprobe_mmap(vma); diff --git a/mm/nommu.c b/mm/nommu.c index 45131b41bcdb..d7690b97b81d 100644 --- a/mm/nommu.c +++ b/mm/nommu.c @@ -1234,7 +1234,8 @@ unsigned long do_mmap_pgoff(struct file *file, unsigned long len, unsigned long prot, unsigned long flags, - unsigned long pgoff) + unsigned long pgoff, + bool *populate) { struct vm_area_struct *vma; struct vm_region *region; @@ -1244,6 +1245,8 @@ unsigned long do_mmap_pgoff(struct file *file, kenter(",%lx,%lx,%lx,%lx,%lx", addr, len, prot, flags, pgoff); + *populate = false; + /* decide whether we should attempt the mapping, and if so what sort of * mapping */ ret = validate_mmap_request(file, addr, len, prot, flags, pgoff, diff --git a/mm/util.c b/mm/util.c index dc3036cdcc6a..44f006ac2ccd 100644 --- a/mm/util.c +++ b/mm/util.c @@ -355,12 +355,16 @@ unsigned long vm_mmap_pgoff(struct file *file, unsigned long addr, { unsigned long ret; struct mm_struct *mm = current->mm; + bool populate; ret = security_mmap_file(file, prot, flag); if (!ret) { down_write(&mm->mmap_sem); - ret = do_mmap_pgoff(file, addr, len, prot, flag, pgoff); 
+ ret = do_mmap_pgoff(file, addr, len, prot, flag, pgoff, + &populate); up_write(&mm->mmap_sem); + if (!IS_ERR_VALUE(ret) && populate) + mm_populate(ret, len); } return ret; } -- 1.7.7.3 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply related [flat|nested] 40+ messages in thread
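The condition the patch adds to do_mmap_pgoff() - set `*populate` for VM_LOCKED mappings, or for MAP_POPULATE without MAP_NONBLOCK - can be exercised on its own. A minimal sketch; `mmap_sets_populate` is a hypothetical helper and the flag values are illustrative (x86-like), not authoritative.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative flag values. */
#define VM_LOCKED    0x2000ul
#define MAP_POPULATE 0x8000ul
#define MAP_NONBLOCK 0x10000ul

/*
 * Condition under which the patched do_mmap_pgoff() reports *populate =
 * true, telling the caller to run mm_populate() after dropping mmap_sem.
 */
static bool mmap_sets_populate(unsigned long vm_flags, unsigned long flags)
{
	return (vm_flags & VM_LOCKED) ||
	       (flags & (MAP_POPULATE | MAP_NONBLOCK)) == MAP_POPULATE;
}
```

The masked comparison is what makes MAP_NONBLOCK cancel MAP_POPULATE while leaving VM_LOCKED mappings unconditionally populated.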
* Re: [PATCH 3/9] mm: introduce mm_populate() for populating new vmas 2012-12-21 0:49 ` [PATCH 3/9] mm: introduce mm_populate() for populating new vmas Michel Lespinasse @ 2013-01-03 2:14 ` Rik van Riel 0 siblings, 0 replies; 40+ messages in thread From: Rik van Riel @ 2013-01-03 2:14 UTC (permalink / raw) To: Michel Lespinasse Cc: Andy Lutomirski, Ingo Molnar, Al Viro, Hugh Dickins, Jorn_Engel, Andrew Morton, linux-mm, linux-kernel On 12/20/2012 07:49 PM, Michel Lespinasse wrote: > When creating new mappings using the MAP_POPULATE / MAP_LOCKED flags > (or with MCL_FUTURE in effect), we want to populate the pages within the > newly created vmas. This may take a while as we may have to read pages > from disk, so ideally we want to do this outside of the write-locked > mmap_sem region. > > This change introduces mm_populate(), which is used to defer populating > such mappings until after the mmap_sem write lock has been released. > This is implemented as a generalization of the former do_mlock_pages(), > which accomplished the same task but was using during mlock() / mlockall(). > > Reported-by: Andy Lutomirski <luto@amacapital.net> > Signed-off-by: Michel Lespinasse <walken@google.com> Acked-by: Rik van Riel <riel@redhat.com> -- All rights reversed -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 40+ messages in thread
* [PATCH 4/9] mm: use mm_populate() for blocking remap_file_pages() 2012-12-21 0:49 [PATCH 0/9] Avoid populating unbounded num of ptes with mmap_sem held Michel Lespinasse ` (2 preceding siblings ...) 2012-12-21 0:49 ` [PATCH 3/9] mm: introduce mm_populate() for populating new vmas Michel Lespinasse @ 2012-12-21 0:49 ` Michel Lespinasse 2013-01-03 2:25 ` Rik van Riel 2013-03-10 18:55 ` Tommi Rantala 2012-12-21 0:49 ` [PATCH 5/9] mm: use mm_populate() when adjusting brk with MCL_FUTURE in effect Michel Lespinasse ` (7 subsequent siblings) 11 siblings, 2 replies; 40+ messages in thread From: Michel Lespinasse @ 2012-12-21 0:49 UTC (permalink / raw) To: Andy Lutomirski, Ingo Molnar, Al Viro, Hugh Dickins, Jorn_Engel, Rik van Riel Cc: Andrew Morton, linux-mm, linux-kernel Signed-off-by: Michel Lespinasse <walken@google.com> --- mm/fremap.c | 22 ++++++---------------- 1 files changed, 6 insertions(+), 16 deletions(-) diff --git a/mm/fremap.c b/mm/fremap.c index 2db886e31044..b42e32171530 100644 --- a/mm/fremap.c +++ b/mm/fremap.c @@ -129,6 +129,7 @@ SYSCALL_DEFINE5(remap_file_pages, unsigned long, start, unsigned long, size, struct vm_area_struct *vma; int err = -EINVAL; int has_write_lock = 0; + vm_flags_t vm_flags; if (prot) return err; @@ -228,30 +229,16 @@ get_write_lock: /* * drop PG_Mlocked flag for over-mapped range */ - vm_flags_t saved_flags = vma->vm_flags; if (!has_write_lock) goto get_write_lock; + vm_flags = vma->vm_flags; munlock_vma_pages_range(vma, start, start + size); - vma->vm_flags = saved_flags; + vma->vm_flags = vm_flags; } mmu_notifier_invalidate_range_start(mm, start, start + size); err = vma->vm_ops->remap_pages(vma, start, size, pgoff); mmu_notifier_invalidate_range_end(mm, start, start + size); - if (!err) { - if (vma->vm_flags & VM_LOCKED) { - /* - * might be mapping previously unmapped range of file - */ - mlock_vma_pages_range(vma, start, start + size); - } else if (!(flags & MAP_NONBLOCK)) { - if (unlikely(has_write_lock)) { - 
downgrade_write(&mm->mmap_sem); - has_write_lock = 0; - } - make_pages_present(start, start+size); - } - } /* * We can't clear VM_NONLINEAR because we'd have to do @@ -260,10 +247,13 @@ get_write_lock: */ out: + vm_flags = vma->vm_flags; if (likely(!has_write_lock)) up_read(&mm->mmap_sem); else up_write(&mm->mmap_sem); + if (!err && ((vm_flags & VM_LOCKED) || !(flags & MAP_NONBLOCK))) + mm_populate(start, size); return err; } -- 1.7.7.3 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply related [flat|nested] 40+ messages in thread
* Re: [PATCH 4/9] mm: use mm_populate() for blocking remap_file_pages() 2012-12-21 0:49 ` [PATCH 4/9] mm: use mm_populate() for blocking remap_file_pages() Michel Lespinasse @ 2013-01-03 2:25 ` Rik van Riel 2013-03-10 18:55 ` Tommi Rantala 1 sibling, 0 replies; 40+ messages in thread From: Rik van Riel @ 2013-01-03 2:25 UTC (permalink / raw) To: Michel Lespinasse Cc: Andy Lutomirski, Ingo Molnar, Al Viro, Hugh Dickins, Jorn_Engel, Andrew Morton, linux-mm, linux-kernel On 12/20/2012 07:49 PM, Michel Lespinasse wrote: > Signed-off-by: Michel Lespinasse <walken@google.com> Changelog could use some help :) Other than that: Reviewed-by: Rik van Riel <riel@redhat.com> -- All rights reversed -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH 4/9] mm: use mm_populate() for blocking remap_file_pages() 2012-12-21 0:49 ` [PATCH 4/9] mm: use mm_populate() for blocking remap_file_pages() Michel Lespinasse 2013-01-03 2:25 ` Rik van Riel @ 2013-03-10 18:55 ` Tommi Rantala 2013-03-11 23:03 ` Andrew Morton 1 sibling, 1 reply; 40+ messages in thread From: Tommi Rantala @ 2013-03-10 18:55 UTC (permalink / raw) To: Michel Lespinasse Cc: Andy Lutomirski, Ingo Molnar, Al Viro, Hugh Dickins, Jorn_Engel, Rik van Riel, Andrew Morton, linux-mm@kvack.org, LKML, Dave Jones 2012/12/21 Michel Lespinasse <walken@google.com>: > Signed-off-by: Michel Lespinasse <walken@google.com> Hello, this patch introduced the following bug, seen while fuzzing with trinity: [ 396.825414] BUG: unable to handle kernel NULL pointer dereference at 0000000000000050 [ 396.826013] IP: [<ffffffff81176efb>] sys_remap_file_pages+0xbb/0x3e0 [ 396.826013] PGD 61e65067 PUD 3fb4067 PMD 0 [ 396.826013] Oops: 0000 [#8] SMP [ 396.826013] CPU 0 [ 396.826013] Pid: 27553, comm: trinity-child53 Tainted: G D W 3.9.0-rc1+ #108 Bochs Bochs [ 396.826013] RIP: 0010:[<ffffffff81176efb>] [<ffffffff81176efb>] sys_remap_file_pages+0xbb/0x3e0 [ 396.826013] RSP: 0018:ffff880071a23f08 EFLAGS: 00010246 [ 396.826013] RAX: 0000000000000000 RBX: ffffffff00000000 RCX: 0000000000000001 [ 396.826013] RDX: 0000000000000000 RSI: ffffffff00000000 RDI: ffff8800679657c0 [ 396.826013] RBP: ffff880071a23f78 R08: 0000000000000002 R09: 0000000000000000 [ 396.826013] R10: 0000000026dad294 R11: 0000000000000000 R12: 0000000000000000 [ 396.826013] R13: ffff880067965870 R14: ffffffffffffffea R15: 0000000000000000 [ 396.826013] FS: 00007f6691a57700(0000) GS:ffff88007f800000(0000) knlGS:0000000000000000 [ 396.826013] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 396.826013] CR2: 0000000000000050 CR3: 0000000068ab3000 CR4: 00000000000006f0 [ 396.826013] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 396.826013] DR3: 0000000000000000 DR6: 00000000ffff0ff0 
DR7: 0000000000000400 [ 396.826013] Process trinity-child53 (pid: 27553, threadinfo ffff880071a22000, task ffff88006a360000) [ 396.826013] Stack: [ 396.826013] 0000000000000000 ffffffff810f33b6 0000000000000035 0000000000000000 [ 396.826013] 000000000000f000 0000000000000000 0000000026dad294 ffff8800679657c0 [ 396.826013] a80006367e000000 ffffffff00000000 00000000000006c0 00000000000000d8 [ 396.826013] Call Trace: [ 396.826013] [<ffffffff810f33b6>] ? trace_hardirqs_on_caller+0x16/0x1f0 [ 396.826013] [<ffffffff81faf169>] system_call_fastpath+0x16/0x1b [ 396.826013] Code: 43 e3 00 48 8b 45 a8 25 00 00 01 00 48 89 45 b8 48 8b 7d c8 48 89 de e8 74 9b 00 00 48 85 c0 49 89 c7 75 1c 49 c7 c6 ea ff ff ff <48> 8b 14 25 50 00 00 00 44 89 f0 e9 7f 02 00 00 0f 1f 44 00 00 [ 396.826013] RIP [<ffffffff81176efb>] sys_remap_file_pages+0xbb/0x3e0 [ 396.826013] RSP <ffff880071a23f08> [ 396.826013] CR2: 0000000000000050 [ 396.876275] ---[ end trace 0444599b5c1ba02b ]--- > --- > mm/fremap.c | 22 ++++++---------------- > 1 files changed, 6 insertions(+), 16 deletions(-) > > diff --git a/mm/fremap.c b/mm/fremap.c > index 2db886e31044..b42e32171530 100644 > --- a/mm/fremap.c > +++ b/mm/fremap.c > @@ -129,6 +129,7 @@ SYSCALL_DEFINE5(remap_file_pages, unsigned long, start, unsigned long, size, > struct vm_area_struct *vma; > int err = -EINVAL; > int has_write_lock = 0; > + vm_flags_t vm_flags; > > if (prot) > return err; > @@ -228,30 +229,16 @@ get_write_lock: > /* > * drop PG_Mlocked flag for over-mapped range > */ > - vm_flags_t saved_flags = vma->vm_flags; > if (!has_write_lock) > goto get_write_lock; > + vm_flags = vma->vm_flags; > munlock_vma_pages_range(vma, start, start + size); > - vma->vm_flags = saved_flags; > + vma->vm_flags = vm_flags; > } > > mmu_notifier_invalidate_range_start(mm, start, start + size); > err = vma->vm_ops->remap_pages(vma, start, size, pgoff); > mmu_notifier_invalidate_range_end(mm, start, start + size); > - if (!err) { > - if (vma->vm_flags & VM_LOCKED) { > 
- /* > - * might be mapping previously unmapped range of file > - */ > - mlock_vma_pages_range(vma, start, start + size); > - } else if (!(flags & MAP_NONBLOCK)) { > - if (unlikely(has_write_lock)) { > - downgrade_write(&mm->mmap_sem); > - has_write_lock = 0; > - } > - make_pages_present(start, start+size); > - } > - } > > /* > * We can't clear VM_NONLINEAR because we'd have to do > @@ -260,10 +247,13 @@ get_write_lock: > */ > > out: > + vm_flags = vma->vm_flags; When find_vma() fails, vma is NULL here. > if (likely(!has_write_lock)) > up_read(&mm->mmap_sem); > else > up_write(&mm->mmap_sem); > + if (!err && ((vm_flags & VM_LOCKED) || !(flags & MAP_NONBLOCK))) > + mm_populate(start, size); > > return err; > } > -- > 1.7.7.3 > > -- > To unsubscribe, send a message with 'unsubscribe linux-mm' in > the body to majordomo@kvack.org. For more info on Linux MM, > see: http://www.linux-mm.org/ . > Don't email: <dont@kvack.org>
* Re: [PATCH 4/9] mm: use mm_populate() for blocking remap_file_pages() 2013-03-10 18:55 ` Tommi Rantala @ 2013-03-11 23:03 ` Andrew Morton 2013-03-12 0:24 ` Michel Lespinasse 0 siblings, 1 reply; 40+ messages in thread From: Andrew Morton @ 2013-03-11 23:03 UTC (permalink / raw) To: Tommi Rantala Cc: Michel Lespinasse, Andy Lutomirski, Ingo Molnar, Al Viro, Hugh Dickins, Jorn_Engel, Rik van Riel, linux-mm@kvack.org, LKML, Dave Jones On Sun, 10 Mar 2013 20:55:21 +0200 Tommi Rantala <tt.rantala@gmail.com> wrote: > 2012/12/21 Michel Lespinasse <walken@google.com>: > > Signed-off-by: Michel Lespinasse <walken@google.com> > > Hello, this patch introduced the following bug, seen while fuzzing with trinity: > > [ 396.825414] BUG: unable to handle kernel NULL pointer dereference > at 0000000000000050 > > ... > > > + vm_flags = vma->vm_flags; > > When find_vma() fails, vma is NULL here. Yup, thanks. This, methinks: From: Andrew Morton <akpm@linux-foundation.org> Subject: mm/fremap.c: fix oops on error path If find_vma() fails, sys_remap_file_pages() will dereference `vma', which contains NULL. Fix it by checking the pointer. (We could alternatively check for err==0, but this seems more direct) (The vm_flags change is to squish a bogus used-uninitialised warning without adding extra code). Reported-by: Tommi Rantala <tt.rantala@gmail.com> Cc: Michel Lespinasse <walken@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> --- mm/fremap.c | 6 ++++-- 1 file changed, 4 insertions(+), 2 deletions(-) diff -puN mm/fremap.c~mm-fremapc-fix-oops-on-error-path mm/fremap.c --- a/mm/fremap.c~mm-fremapc-fix-oops-on-error-path +++ a/mm/fremap.c @@ -163,7 +163,8 @@ SYSCALL_DEFINE5(remap_file_pages, unsign * and that the remapped range is valid and fully within * the single existing vma. 
*/ - if (!vma || !(vma->vm_flags & VM_SHARED)) + vm_flags = vma->vm_flags; + if (!vma || !(vm_flags & VM_SHARED)) goto out; if (!vma->vm_ops || !vma->vm_ops->remap_pages) @@ -254,7 +255,8 @@ get_write_lock: */ out: - vm_flags = vma->vm_flags; + if (vma) + vm_flags = vma->vm_flags; if (likely(!has_write_lock)) up_read(&mm->mmap_sem); else _
* Re: [PATCH 4/9] mm: use mm_populate() for blocking remap_file_pages() 2013-03-11 23:03 ` Andrew Morton @ 2013-03-12 0:24 ` Michel Lespinasse 2013-03-12 4:23 ` Hillf Danton 2013-03-12 20:47 ` Andrew Morton 0 siblings, 2 replies; 40+ messages in thread From: Michel Lespinasse @ 2013-03-12 0:24 UTC (permalink / raw) To: Andrew Morton Cc: Tommi Rantala, Andy Lutomirski, Ingo Molnar, Al Viro, Hugh Dickins, Jorn_Engel, Rik van Riel, linux-mm@kvack.org, LKML, Dave Jones (Sorry for the late reply) On Mon, Mar 11, 2013 at 4:03 PM, Andrew Morton <akpm@linux-foundation.org> wrote: > On Sun, 10 Mar 2013 20:55:21 +0200 Tommi Rantala <tt.rantala@gmail.com> wrote: > >> 2012/12/21 Michel Lespinasse <walken@google.com>: >> > Signed-off-by: Michel Lespinasse <walken@google.com> >> >> Hello, this patch introduced the following bug, seen while fuzzing with trinity: >> >> [ 396.825414] BUG: unable to handle kernel NULL pointer dereference >> at 0000000000000050 Good catch... > From: Andrew Morton <akpm@linux-foundation.org> > Subject: mm/fremap.c: fix oops on error path > > If find_vma() fails, sys_remap_file_pages() will dereference `vma', which > contains NULL. Fix it by checking the pointer. > > (We could alternatively check for err==0, but this seems more direct) > > (The vm_flags change is to squish a bogus used-uninitialised warning > without adding extra code). > > Reported-by: Tommi Rantala <tt.rantala@gmail.com> > Cc: Michel Lespinasse <walken@google.com> > Signed-off-by: Andrew Morton <akpm@linux-foundation.org> > --- > > mm/fremap.c | 6 ++++-- > 1 file changed, 4 insertions(+), 2 deletions(-) > > diff -puN mm/fremap.c~mm-fremapc-fix-oops-on-error-path mm/fremap.c > --- a/mm/fremap.c~mm-fremapc-fix-oops-on-error-path > +++ a/mm/fremap.c > @@ -163,7 +163,8 @@ SYSCALL_DEFINE5(remap_file_pages, unsign > * and that the remapped range is valid and fully within > * the single existing vma. 
> */ > - if (!vma || !(vma->vm_flags & VM_SHARED)) > + vm_flags = vma->vm_flags; > + if (!vma || !(vm_flags & VM_SHARED)) > goto out; Your commit message indicates the vm_flags load here doesn't generate any code, but this seems very brittle and compiler dependent. If the compiler was to generate an actual load here, the issue with vma == NULL would reappear. > if (!vma->vm_ops || !vma->vm_ops->remap_pages) > @@ -254,7 +255,8 @@ get_write_lock: > */ > > out: > - vm_flags = vma->vm_flags; > + if (vma) > + vm_flags = vma->vm_flags; > if (likely(!has_write_lock)) > up_read(&mm->mmap_sem); > else Would the following work ? I think it's simpler, and with the compiler I'm using here it doesn't emit warnings: diff --git a/mm/fremap.c b/mm/fremap.c index 0cd4c11488ed..329507e832fb 100644 --- a/mm/fremap.c +++ b/mm/fremap.c @@ -254,7 +254,8 @@ get_write_lock: */ out: - vm_flags = vma->vm_flags; + if (!err) + vm_flags = vma->vm_flags; if (likely(!has_write_lock)) up_read(&mm->mmap_sem); else -- Michel "Walken" Lespinasse A program is never fully debugged until the last user dies.
* Re: [PATCH 4/9] mm: use mm_populate() for blocking remap_file_pages() 2013-03-12 0:24 ` Michel Lespinasse @ 2013-03-12 4:23 ` Hillf Danton 2013-03-12 5:01 ` Michel Lespinasse 2013-03-12 20:47 ` Andrew Morton 1 sibling, 1 reply; 40+ messages in thread From: Hillf Danton @ 2013-03-12 4:23 UTC (permalink / raw) To: Michel Lespinasse Cc: Andrew Morton, Tommi Rantala, Andy Lutomirski, Ingo Molnar, Al Viro, Hugh Dickins, Jorn_Engel, Rik van Riel, Linux-MM, LKML On Tue, Mar 12, 2013 at 8:24 AM, Michel Lespinasse <walken@google.com> wrote: > (Sorry for the late reply) > > On Mon, Mar 11, 2013 at 4:03 PM, Andrew Morton <akpm@linux-foundation.org> wrote: >> On Sun, 10 Mar 2013 20:55:21 +0200 Tommi Rantala <tt.rantala@gmail.com> wrote: >> >>> 2012/12/21 Michel Lespinasse <walken@google.com>: >>> > Signed-off-by: Michel Lespinasse <walken@google.com> >>> >>> Hello, this patch introduced the following bug, seen while fuzzing with trinity: >>> >>> [ 396.825414] BUG: unable to handle kernel NULL pointer dereference >>> at 0000000000000050 > > Good catch... > >> From: Andrew Morton <akpm@linux-foundation.org> >> Subject: mm/fremap.c: fix oops on error path >> >> If find_vma() fails, sys_remap_file_pages() will dereference `vma', which >> contains NULL. Fix it by checking the pointer. >> >> (We could alternatively check for err==0, but this seems more direct) >> >> (The vm_flags change is to squish a bogus used-uninitialised warning >> without adding extra code). 
>> >> Reported-by: Tommi Rantala <tt.rantala@gmail.com> >> Cc: Michel Lespinasse <walken@google.com> >> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> >> --- >> >> mm/fremap.c | 6 ++++-- >> 1 file changed, 4 insertions(+), 2 deletions(-) >> >> diff -puN mm/fremap.c~mm-fremapc-fix-oops-on-error-path mm/fremap.c >> --- a/mm/fremap.c~mm-fremapc-fix-oops-on-error-path >> +++ a/mm/fremap.c >> @@ -163,7 +163,8 @@ SYSCALL_DEFINE5(remap_file_pages, unsign >> * and that the remapped range is valid and fully within >> * the single existing vma. >> */ >> - if (!vma || !(vma->vm_flags & VM_SHARED)) >> + vm_flags = vma->vm_flags; >> + if (!vma || !(vm_flags & VM_SHARED)) >> goto out; > > Your commit message indicates the vm_flags load here doesn't generate any code, but this seems very brittle and compiler dependent. If the compiler was to generate an actual load here, the issue with vma == NULL would reappear. > >> if (!vma->vm_ops || !vma->vm_ops->remap_pages) >> @@ -254,7 +255,8 @@ get_write_lock: >> */ >> >> out: >> - vm_flags = vma->vm_flags; >> + if (vma) >> + vm_flags = vma->vm_flags; >> if (likely(!has_write_lock)) >> up_read(&mm->mmap_sem); >> else > > > > Would the following work ? I think it's simpler, and with the compiler > I'm using here it doesn't emit warnings: > > diff --git a/mm/fremap.c b/mm/fremap.c > index 0cd4c11488ed..329507e832fb 100644 > --- a/mm/fremap.c > +++ b/mm/fremap.c > @@ -254,7 +254,8 @@ get_write_lock: > */ > > out: > - vm_flags = vma->vm_flags; > + if (!err) > + vm_flags = vma->vm_flags; > if (likely(!has_write_lock)) > up_read(&mm->mmap_sem); > else > Is it still necessary to populate mm if bail out due > to a linear mapping encountered? Hillf
* Re: [PATCH 4/9] mm: use mm_populate() for blocking remap_file_pages() 2013-03-12 4:23 ` Hillf Danton @ 2013-03-12 5:01 ` Michel Lespinasse 0 siblings, 0 replies; 40+ messages in thread From: Michel Lespinasse @ 2013-03-12 5:01 UTC (permalink / raw) To: Hillf Danton Cc: Andrew Morton, Tommi Rantala, Andy Lutomirski, Ingo Molnar, Al Viro, Hugh Dickins, Jorn_Engel, Rik van Riel, Linux-MM, LKML On Mon, Mar 11, 2013 at 9:23 PM, Hillf Danton <dhillf@gmail.com> wrote: > Is it still necessary to populate mm if bail out due > to a linear mapping encountered? Yes. mmap_region() doesn't do it on its own, and we want the emulated linear case to behave similarly to the true nonlinear case. -- Michel "Walken" Lespinasse A program is never fully debugged until the last user dies.
* Re: [PATCH 4/9] mm: use mm_populate() for blocking remap_file_pages() 2013-03-12 0:24 ` Michel Lespinasse 2013-03-12 4:23 ` Hillf Danton @ 2013-03-12 20:47 ` Andrew Morton 1 sibling, 0 replies; 40+ messages in thread From: Andrew Morton @ 2013-03-12 20:47 UTC (permalink / raw) To: Michel Lespinasse Cc: Tommi Rantala, Andy Lutomirski, Ingo Molnar, Al Viro, Hugh Dickins, Jorn_Engel, Rik van Riel, linux-mm@kvack.org, LKML, Dave Jones On Mon, 11 Mar 2013 17:24:29 -0700 Michel Lespinasse <walken@google.com> wrote: > > --- a/mm/fremap.c~mm-fremapc-fix-oops-on-error-path > > +++ a/mm/fremap.c > > @@ -163,7 +163,8 @@ SYSCALL_DEFINE5(remap_file_pages, unsign > > * and that the remapped range is valid and fully within > > * the single existing vma. > > */ > > - if (!vma || !(vma->vm_flags & VM_SHARED)) > > + vm_flags = vma->vm_flags; > > + if (!vma || !(vm_flags & VM_SHARED)) > > goto out; > > Your commit message indicates the vm_flags load here doesn't generate any code, but this seems very brittle and compiler dependent. If the compiler was to generate an actual load here, the issue with vma == NULL would reappear. I didn't try very hard. I have a surprisingly strong dislike of adding "= 0" everywhere just to squish warnings. There are actually quite a lot of places where this function could use s/vma->vm_flags/vm_flags/ and might save a bit of code as a result. But the function's pretty straggly and I stopped doing it. > > > if (!vma->vm_ops || !vma->vm_ops->remap_pages) > > @@ -254,7 +255,8 @@ get_write_lock: > > */ > > > > out: > > - vm_flags = vma->vm_flags; > > + if (vma) > > + vm_flags = vma->vm_flags; > > if (likely(!has_write_lock)) > > up_read(&mm->mmap_sem); > > else > > > > Would the following work ? 
I think it's simpler, and with the compiler > I'm using here it doesn't emit warnings: > > diff --git a/mm/fremap.c b/mm/fremap.c > index 0cd4c11488ed..329507e832fb 100644 > --- a/mm/fremap.c > +++ b/mm/fremap.c > @@ -254,7 +254,8 @@ get_write_lock: > */ > > out: > - vm_flags = vma->vm_flags; > + if (!err) > + vm_flags = vma->vm_flags; > if (likely(!has_write_lock)) > up_read(&mm->mmap_sem); > else Yes, this will work. gcc-4.4.4 does generate the warning with this. Testing `err' was my v1, but it is not obvious that err==0 always correlates with vma != NULL. This is true (I checked), and it had better be true in the future, but it just feels safer and simpler to test `vma' directly.
* [PATCH 5/9] mm: use mm_populate() when adjusting brk with MCL_FUTURE in effect. 2012-12-21 0:49 [PATCH 0/9] Avoid populating unbounded num of ptes with mmap_sem held Michel Lespinasse ` (3 preceding siblings ...) 2012-12-21 0:49 ` [PATCH 4/9] mm: use mm_populate() for blocking remap_file_pages() Michel Lespinasse @ 2012-12-21 0:49 ` Michel Lespinasse 2013-01-03 2:56 ` Rik van Riel 2012-12-21 0:49 ` [PATCH 6/9] mm: use mm_populate() for mremap() of VM_LOCKED vmas Michel Lespinasse ` (6 subsequent siblings) 11 siblings, 1 reply; 40+ messages in thread From: Michel Lespinasse @ 2012-12-21 0:49 UTC (permalink / raw) To: Andy Lutomirski, Ingo Molnar, Al Viro, Hugh Dickins, Jorn_Engel, Rik van Riel Cc: Andrew Morton, linux-mm, linux-kernel Signed-off-by: Michel Lespinasse <walken@google.com> --- mm/mmap.c | 18 ++++++++++++++---- 1 files changed, 14 insertions(+), 4 deletions(-) diff --git a/mm/mmap.c b/mm/mmap.c index a16fc499dbd1..4c8d39e64e80 100644 --- a/mm/mmap.c +++ b/mm/mmap.c @@ -240,6 +240,7 @@ SYSCALL_DEFINE1(brk, unsigned long, brk) unsigned long newbrk, oldbrk; struct mm_struct *mm = current->mm; unsigned long min_brk; + bool populate; down_write(&mm->mmap_sem); @@ -289,8 +290,15 @@ SYSCALL_DEFINE1(brk, unsigned long, brk) /* Ok, looks good - let it rip. 
*/ if (do_brk(oldbrk, newbrk-oldbrk) != oldbrk) goto out; + set_brk: mm->brk = brk; + populate = newbrk > oldbrk && (mm->def_flags & VM_LOCKED) != 0; + up_write(&mm->mmap_sem); + if (populate) + mm_populate(oldbrk, newbrk - oldbrk); + return brk; + out: retval = mm->brk; up_write(&mm->mmap_sem); @@ -2274,10 +2282,8 @@ static unsigned long do_brk(unsigned long addr, unsigned long len) out: perf_event_mmap(vma); mm->total_vm += len >> PAGE_SHIFT; - if (flags & VM_LOCKED) { - if (!mlock_vma_pages_range(vma, addr, addr + len)) - mm->locked_vm += (len >> PAGE_SHIFT); - } + if (flags & VM_LOCKED) + mm->locked_vm += (len >> PAGE_SHIFT); return addr; } @@ -2285,10 +2291,14 @@ unsigned long vm_brk(unsigned long addr, unsigned long len) { struct mm_struct *mm = current->mm; unsigned long ret; + bool populate; down_write(&mm->mmap_sem); ret = do_brk(addr, len); + populate = ((mm->def_flags & VM_LOCKED) != 0); up_write(&mm->mmap_sem); + if (populate) + mm_populate(addr, len); return ret; } EXPORT_SYMBOL(vm_brk); -- 1.7.7.3
* Re: [PATCH 5/9] mm: use mm_populate() when adjusting brk with MCL_FUTURE in effect. 2012-12-21 0:49 ` [PATCH 5/9] mm: use mm_populate() when adjusting brk with MCL_FUTURE in effect Michel Lespinasse @ 2013-01-03 2:56 ` Rik van Riel 0 siblings, 0 replies; 40+ messages in thread From: Rik van Riel @ 2013-01-03 2:56 UTC (permalink / raw) To: Michel Lespinasse Cc: Andy Lutomirski, Ingo Molnar, Al Viro, Hugh Dickins, Jorn_Engel, Andrew Morton, linux-mm, linux-kernel On 12/20/2012 07:49 PM, Michel Lespinasse wrote: > Signed-off-by: Michel Lespinasse <walken@google.com> Acked-by: Rik van Riel <riel@redhat.com> -- All rights reversed
* [PATCH 6/9] mm: use mm_populate() for mremap() of VM_LOCKED vmas 2012-12-21 0:49 [PATCH 0/9] Avoid populating unbounded num of ptes with mmap_sem held Michel Lespinasse ` (4 preceding siblings ...) 2012-12-21 0:49 ` [PATCH 5/9] mm: use mm_populate() when adjusting brk with MCL_FUTURE in effect Michel Lespinasse @ 2012-12-21 0:49 ` Michel Lespinasse 2013-01-03 5:47 ` Rik van Riel 2012-12-21 0:49 ` [PATCH 7/9] mm: remove flags argument to mmap_region Michel Lespinasse ` (5 subsequent siblings) 11 siblings, 1 reply; 40+ messages in thread From: Michel Lespinasse @ 2012-12-21 0:49 UTC (permalink / raw) To: Andy Lutomirski, Ingo Molnar, Al Viro, Hugh Dickins, Jorn_Engel, Rik van Riel Cc: Andrew Morton, linux-mm, linux-kernel Signed-off-by: Michel Lespinasse <walken@google.com> --- mm/mremap.c | 25 +++++++++++++------------ 1 files changed, 13 insertions(+), 12 deletions(-) diff --git a/mm/mremap.c b/mm/mremap.c index 1b61c2d3307a..c5a8bf344b1f 100644 --- a/mm/mremap.c +++ b/mm/mremap.c @@ -208,7 +208,7 @@ unsigned long move_page_tables(struct vm_area_struct *vma, static unsigned long move_vma(struct vm_area_struct *vma, unsigned long old_addr, unsigned long old_len, - unsigned long new_len, unsigned long new_addr) + unsigned long new_len, unsigned long new_addr, bool *locked) { struct mm_struct *mm = vma->vm_mm; struct vm_area_struct *new_vma; @@ -299,9 +299,7 @@ static unsigned long move_vma(struct vm_area_struct *vma, if (vm_flags & VM_LOCKED) { mm->locked_vm += new_len >> PAGE_SHIFT; - if (new_len > old_len) - mlock_vma_pages_range(new_vma, new_addr + old_len, - new_addr + new_len); + *locked = true; } return new_addr; @@ -366,9 +364,8 @@ Eagain: return ERR_PTR(-EAGAIN); } -static unsigned long mremap_to(unsigned long addr, - unsigned long old_len, unsigned long new_addr, - unsigned long new_len) +static unsigned long mremap_to(unsigned long addr, unsigned long old_len, + unsigned long new_addr, unsigned long new_len, bool *locked) { struct mm_struct *mm = 
current->mm; struct vm_area_struct *vma; @@ -418,7 +415,7 @@ static unsigned long mremap_to(unsigned long addr, if (ret & ~PAGE_MASK) goto out1; - ret = move_vma(vma, addr, old_len, new_len, new_addr); + ret = move_vma(vma, addr, old_len, new_len, new_addr, locked); if (!(ret & ~PAGE_MASK)) goto out; out1: @@ -456,6 +453,7 @@ SYSCALL_DEFINE5(mremap, unsigned long, addr, unsigned long, old_len, struct vm_area_struct *vma; unsigned long ret = -EINVAL; unsigned long charged = 0; + bool locked = false; down_write(&current->mm->mmap_sem); @@ -478,7 +476,8 @@ SYSCALL_DEFINE5(mremap, unsigned long, addr, unsigned long, old_len, if (flags & MREMAP_FIXED) { if (flags & MREMAP_MAYMOVE) - ret = mremap_to(addr, old_len, new_addr, new_len); + ret = mremap_to(addr, old_len, new_addr, new_len, + &locked); goto out; } @@ -520,8 +519,8 @@ SYSCALL_DEFINE5(mremap, unsigned long, addr, unsigned long, old_len, vm_stat_account(mm, vma->vm_flags, vma->vm_file, pages); if (vma->vm_flags & VM_LOCKED) { mm->locked_vm += pages; - mlock_vma_pages_range(vma, addr + old_len, - addr + new_len); + locked = true; + new_addr = addr; } ret = addr; goto out; @@ -547,11 +546,13 @@ SYSCALL_DEFINE5(mremap, unsigned long, addr, unsigned long, old_len, goto out; } - ret = move_vma(vma, addr, old_len, new_len, new_addr); + ret = move_vma(vma, addr, old_len, new_len, new_addr, &locked); } out: if (ret & ~PAGE_MASK) vm_unacct_memory(charged); up_write(&current->mm->mmap_sem); + if (locked && new_len > old_len) + mm_populate(new_addr + old_len, new_len - old_len); return ret; } -- 1.7.7.3
* Re: [PATCH 6/9] mm: use mm_populate() for mremap() of VM_LOCKED vmas 2012-12-21 0:49 ` [PATCH 6/9] mm: use mm_populate() for mremap() of VM_LOCKED vmas Michel Lespinasse @ 2013-01-03 5:47 ` Rik van Riel 0 siblings, 0 replies; 40+ messages in thread From: Rik van Riel @ 2013-01-03 5:47 UTC (permalink / raw) To: Michel Lespinasse Cc: Andy Lutomirski, Ingo Molnar, Al Viro, Hugh Dickins, Jorn_Engel, Andrew Morton, linux-mm, linux-kernel On 12/20/2012 07:49 PM, Michel Lespinasse wrote: > Signed-off-by: Michel Lespinasse <walken@google.com> Changelog? Acked-by: Rik van Riel <riel@redhat.com> -- All rights reversed
* [PATCH 7/9] mm: remove flags argument to mmap_region 2012-12-21 0:49 [PATCH 0/9] Avoid populating unbounded num of ptes with mmap_sem held Michel Lespinasse ` (5 preceding siblings ...) 2012-12-21 0:49 ` [PATCH 6/9] mm: use mm_populate() for mremap() of VM_LOCKED vmas Michel Lespinasse @ 2012-12-21 0:49 ` Michel Lespinasse 2013-01-03 5:49 ` Rik van Riel 2012-12-21 0:49 ` [PATCH 8/9] mm: directly use __mlock_vma_pages_range() in find_extend_vma() Michel Lespinasse ` (4 subsequent siblings) 11 siblings, 1 reply; 40+ messages in thread From: Michel Lespinasse @ 2012-12-21 0:49 UTC (permalink / raw) To: Andy Lutomirski, Ingo Molnar, Al Viro, Hugh Dickins, Jorn_Engel, Rik van Riel Cc: Andrew Morton, linux-mm, linux-kernel After the MAP_POPULATE handling has been moved to mmap_region() call sites, the only remaining use of the flags argument is to pass the MAP_NORESERVE flag. This can be just as easily handled by do_mmap_pgoff(), so do that and remove the mmap_region() flags parameter. Signed-off-by: Michel Lespinasse <walken@google.com> --- arch/tile/mm/elf.c | 1 - include/linux/mm.h | 3 +-- mm/fremap.c | 3 +-- mm/mmap.c | 33 ++++++++++++++++----------------- 4 files changed, 18 insertions(+), 22 deletions(-) diff --git a/arch/tile/mm/elf.c b/arch/tile/mm/elf.c index 3cfa98bf9125..743c951c61b0 100644 --- a/arch/tile/mm/elf.c +++ b/arch/tile/mm/elf.c @@ -130,7 +130,6 @@ int arch_setup_additional_pages(struct linux_binprm *bprm, if (!retval) { unsigned long addr = MEM_USER_INTRPT; addr = mmap_region(NULL, addr, INTRPT_SIZE, - MAP_FIXED|MAP_ANONYMOUS|MAP_PRIVATE, VM_READ|VM_EXEC| VM_MAYREAD|VM_MAYWRITE|VM_MAYEXEC, 0); if (addr > (unsigned long) -PAGE_SIZE) diff --git a/include/linux/mm.h b/include/linux/mm.h index fea461cd9027..3b2912f6e91a 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -1442,8 +1442,7 @@ extern int install_special_mapping(struct mm_struct *mm, extern unsigned long get_unmapped_area(struct file *, unsigned long, unsigned long, unsigned long, 
unsigned long); extern unsigned long mmap_region(struct file *file, unsigned long addr, - unsigned long len, unsigned long flags, - vm_flags_t vm_flags, unsigned long pgoff); + unsigned long len, vm_flags_t vm_flags, unsigned long pgoff); extern unsigned long do_mmap_pgoff(struct file *file, unsigned long addr, unsigned long len, unsigned long prot, unsigned long flags, unsigned long pgoff, bool *populate); diff --git a/mm/fremap.c b/mm/fremap.c index b42e32171530..503a72387087 100644 --- a/mm/fremap.c +++ b/mm/fremap.c @@ -204,9 +204,8 @@ get_write_lock: unsigned long addr; struct file *file = get_file(vma->vm_file); - flags = (flags & MAP_NONBLOCK) | MAP_POPULATE; addr = mmap_region(file, start, size, - flags, vma->vm_flags, pgoff); + vma->vm_flags, pgoff); fput(file); if (IS_ERR_VALUE(addr)) { err = addr; diff --git a/mm/mmap.c b/mm/mmap.c index 4c8d39e64e80..b0a341e5685f 100644 --- a/mm/mmap.c +++ b/mm/mmap.c @@ -1138,7 +1138,21 @@ unsigned long do_mmap_pgoff(struct file *file, unsigned long addr, } } - addr = mmap_region(file, addr, len, flags, vm_flags, pgoff); + /* + * Set 'VM_NORESERVE' if we should not account for the + * memory use of this mapping. 
+ */ + if ((flags & MAP_NORESERVE)) { + /* We honor MAP_NORESERVE if allowed to overcommit */ + if (sysctl_overcommit_memory != OVERCOMMIT_NEVER) + vm_flags |= VM_NORESERVE; + + /* hugetlb applies strict overcommit unless MAP_NORESERVE */ + if (file && is_file_hugepages(file)) + vm_flags |= VM_NORESERVE; + } + + addr = mmap_region(file, addr, len, vm_flags, pgoff); if (!IS_ERR_VALUE(addr) && ((vm_flags & VM_LOCKED) || (flags & (MAP_POPULATE | MAP_NONBLOCK)) == MAP_POPULATE)) @@ -1257,8 +1271,7 @@ static inline int accountable_mapping(struct file *file, vm_flags_t vm_flags) } unsigned long mmap_region(struct file *file, unsigned long addr, - unsigned long len, unsigned long flags, - vm_flags_t vm_flags, unsigned long pgoff) + unsigned long len, vm_flags_t vm_flags, unsigned long pgoff) { struct mm_struct *mm = current->mm; struct vm_area_struct *vma, *prev; @@ -1282,20 +1295,6 @@ munmap_back: return -ENOMEM; /* - * Set 'VM_NORESERVE' if we should not account for the - * memory use of this mapping. - */ - if ((flags & MAP_NORESERVE)) { - /* We honor MAP_NORESERVE if allowed to overcommit */ - if (sysctl_overcommit_memory != OVERCOMMIT_NEVER) - vm_flags |= VM_NORESERVE; - - /* hugetlb applies strict overcommit unless MAP_NORESERVE */ - if (file && is_file_hugepages(file)) - vm_flags |= VM_NORESERVE; - } - - /* * Private writable mapping: check memory availability */ if (accountable_mapping(file, vm_flags)) { -- 1.7.7.3
* Re: [PATCH 7/9] mm: remove flags argument to mmap_region 2012-12-21 0:49 ` [PATCH 7/9] mm: remove flags argument to mmap_region Michel Lespinasse @ 2013-01-03 5:49 ` Rik van Riel 0 siblings, 0 replies; 40+ messages in thread From: Rik van Riel @ 2013-01-03 5:49 UTC (permalink / raw) To: Michel Lespinasse Cc: Andy Lutomirski, Ingo Molnar, Al Viro, Hugh Dickins, Jorn_Engel, Andrew Morton, linux-mm, linux-kernel On 12/20/2012 07:49 PM, Michel Lespinasse wrote: > After the MAP_POPULATE handling has been moved to mmap_region() call sites, > the only remaining use of the flags argument is to pass the MAP_NORESERVE > flag. This can be just as easily handled by do_mmap_pgoff(), so do that > and remove the mmap_region() flags parameter. > > Signed-off-by: Michel Lespinasse <walken@google.com> Acked-by: Rik van Riel <riel@redhat.com> -- All rights reversed
* [PATCH 8/9] mm: directly use __mlock_vma_pages_range() in find_extend_vma() 2012-12-21 0:49 [PATCH 0/9] Avoid populating unbounded num of ptes with mmap_sem held Michel Lespinasse ` (6 preceding siblings ...) 2012-12-21 0:49 ` [PATCH 7/9] mm: remove flags argument to mmap_region Michel Lespinasse @ 2012-12-21 0:49 ` Michel Lespinasse 2013-01-03 5:50 ` Rik van Riel 2012-12-21 0:49 ` [PATCH 9/9] mm: introduce VM_POPULATE flag to better deal with racy userspace programs Michel Lespinasse ` (3 subsequent siblings) 11 siblings, 1 reply; 40+ messages in thread From: Michel Lespinasse @ 2012-12-21 0:49 UTC (permalink / raw) To: Andy Lutomirski, Ingo Molnar, Al Viro, Hugh Dickins, Jorn_Engel, Rik van Riel Cc: Andrew Morton, linux-mm, linux-kernel In find_extend_vma(), we don't need mlock_vma_pages_range() to verify the vma type - we know we're working with a stack. So, we can call directly into __mlock_vma_pages_range(), and remove the last make_pages_present() call site. Note that we don't use mm_populate() here, so we can't release the mmap_sem while allocating new stack pages. This is deemed acceptable, because the stack vmas grow by a bounded number of pages at a time, and these are anon pages so we don't have to read from disk to populate them. 
Signed-off-by: Michel Lespinasse <walken@google.com> --- include/linux/mm.h | 1 - mm/internal.h | 4 +- mm/memory.c | 24 --------------------- mm/mlock.c | 57 ++------------------------------------------------- mm/mmap.c | 10 +++----- 5 files changed, 9 insertions(+), 87 deletions(-) diff --git a/include/linux/mm.h b/include/linux/mm.h index 3b2912f6e91a..d32ace5fba93 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -1007,7 +1007,6 @@ static inline int fixup_user_fault(struct task_struct *tsk, } #endif -extern int make_pages_present(unsigned long addr, unsigned long end); extern int access_process_vm(struct task_struct *tsk, unsigned long addr, void *buf, int len, int write); extern int access_remote_vm(struct mm_struct *mm, unsigned long addr, void *buf, int len, int write); diff --git a/mm/internal.h b/mm/internal.h index a4fa284f6bc2..e646c46c0d63 100644 --- a/mm/internal.h +++ b/mm/internal.h @@ -158,8 +158,8 @@ void __vma_link_list(struct mm_struct *mm, struct vm_area_struct *vma, struct vm_area_struct *prev, struct rb_node *rb_parent); #ifdef CONFIG_MMU -extern long mlock_vma_pages_range(struct vm_area_struct *vma, - unsigned long start, unsigned long end); +extern long __mlock_vma_pages_range(struct vm_area_struct *vma, + unsigned long start, unsigned long end, int *nonblocking); extern void munlock_vma_pages_range(struct vm_area_struct *vma, unsigned long start, unsigned long end); static inline void munlock_vma_pages_all(struct vm_area_struct *vma) diff --git a/mm/memory.c b/mm/memory.c index 221fc9ffcab1..e4ab66b94bb8 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -3629,30 +3629,6 @@ int __pmd_alloc(struct mm_struct *mm, pud_t *pud, unsigned long address) } #endif /* __PAGETABLE_PMD_FOLDED */ -int make_pages_present(unsigned long addr, unsigned long end) -{ - int ret, len, write; - struct vm_area_struct * vma; - - vma = find_vma(current->mm, addr); - if (!vma) - return -ENOMEM; - /* - * We want to touch writable mappings with a write fault in 
order - * to break COW, except for shared mappings because these don't COW - * and we would not want to dirty them for nothing. - */ - write = (vma->vm_flags & (VM_WRITE | VM_SHARED)) == VM_WRITE; - BUG_ON(addr >= end); - BUG_ON(end > vma->vm_end); - len = DIV_ROUND_UP(end, PAGE_SIZE) - addr/PAGE_SIZE; - ret = get_user_pages(current, current->mm, addr, - len, write, 0, NULL, NULL); - if (ret < 0) - return ret; - return ret == len ? 0 : -EFAULT; -} - #if !defined(__HAVE_ARCH_GATE_AREA) #if defined(AT_SYSINFO_EHDR) diff --git a/mm/mlock.c b/mm/mlock.c index 7f94bc3b46ef..ab0cfe21f538 100644 --- a/mm/mlock.c +++ b/mm/mlock.c @@ -155,9 +155,8 @@ void munlock_vma_page(struct page *page) * * vma->vm_mm->mmap_sem must be held for at least read. */ -static long __mlock_vma_pages_range(struct vm_area_struct *vma, - unsigned long start, unsigned long end, - int *nonblocking) +long __mlock_vma_pages_range(struct vm_area_struct *vma, + unsigned long start, unsigned long end, int *nonblocking) { struct mm_struct *mm = vma->vm_mm; unsigned long addr = start; @@ -202,56 +201,6 @@ static int __mlock_posix_error_return(long retval) return retval; } -/** - * mlock_vma_pages_range() - mlock pages in specified vma range. - * @vma - the vma containing the specfied address range - * @start - starting address in @vma to mlock - * @end - end address [+1] in @vma to mlock - * - * For mmap()/mremap()/expansion of mlocked vma. - * - * return 0 on success for "normal" vmas. - * - * return number of pages [> 0] to be removed from locked_vm on success - * of "special" vmas. 
- */ -long mlock_vma_pages_range(struct vm_area_struct *vma, - unsigned long start, unsigned long end) -{ - int nr_pages = (end - start) / PAGE_SIZE; - BUG_ON(!(vma->vm_flags & VM_LOCKED)); - - /* - * filter unlockable vmas - */ - if (vma->vm_flags & (VM_IO | VM_PFNMAP)) - goto no_mlock; - - if (!((vma->vm_flags & VM_DONTEXPAND) || - is_vm_hugetlb_page(vma) || - vma == get_gate_vma(current->mm))) { - - __mlock_vma_pages_range(vma, start, end, NULL); - - /* Hide errors from mmap() and other callers */ - return 0; - } - - /* - * User mapped kernel pages or huge pages: - * make these pages present to populate the ptes, but - * fall thru' to reset VM_LOCKED--no need to unlock, and - * return nr_pages so these don't get counted against task's - * locked limit. huge pages are already counted against - * locked vm limit. - */ - make_pages_present(start, end); - -no_mlock: - vma->vm_flags &= ~VM_LOCKED; /* and don't come back! */ - return nr_pages; /* error or pages NOT mlocked */ -} - /* * munlock_vma_pages_range() - munlock all pages in the vma range.' * @vma - vma containing range to be munlock()ed. @@ -303,7 +252,7 @@ void munlock_vma_pages_range(struct vm_area_struct *vma, * * Filters out "special" vmas -- VM_LOCKED never gets set for these, and * munlock is a no-op. However, for some special vmas, we go ahead and - * populate the ptes via make_pages_present(). + * populate the ptes. * * For vmas that pass the filters, merge/split as appropriate. 
*/ diff --git a/mm/mmap.c b/mm/mmap.c index b0a341e5685f..290d023632e6 100644 --- a/mm/mmap.c +++ b/mm/mmap.c @@ -1881,9 +1881,8 @@ find_extend_vma(struct mm_struct *mm, unsigned long addr) return vma; if (!prev || expand_stack(prev, addr)) return NULL; - if (prev->vm_flags & VM_LOCKED) { - mlock_vma_pages_range(prev, addr, prev->vm_end); - } + if (prev->vm_flags & VM_LOCKED) + __mlock_vma_pages_range(prev, addr, prev->vm_end, NULL); return prev; } #else @@ -1909,9 +1908,8 @@ find_extend_vma(struct mm_struct * mm, unsigned long addr) start = vma->vm_start; if (expand_stack(vma, addr)) return NULL; - if (vma->vm_flags & VM_LOCKED) { - mlock_vma_pages_range(vma, addr, start); - } + if (vma->vm_flags & VM_LOCKED) + __mlock_vma_pages_range(vma, addr, start, NULL); return vma; } #endif -- 1.7.7.3 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply related [flat|nested] 40+ messages in thread
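The write/shared test in the make_pages_present() code removed by this patch is subtle enough to be worth isolating. A standalone sketch of just that predicate (the helper name is hypothetical; the flag values are the usual ones from include/linux/mm.h):

```c
#include <assert.h>

/* Subset of the vma flag values from include/linux/mm.h. */
#define VM_WRITE  0x00000002UL
#define VM_SHARED 0x00000008UL

/*
 * Decide whether population should use a write fault: only for
 * writable *private* mappings, where the fault breaks COW up front.
 * Shared mappings don't COW, and write-faulting them would dirty
 * their pages for nothing. Mirrors the test that lived in
 * make_pages_present().
 */
static int want_write_fault(unsigned long vm_flags)
{
	return (vm_flags & (VM_WRITE | VM_SHARED)) == VM_WRITE;
}
```

Note the equality test, not a simple mask: a mapping that is both writable and shared must come out false.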
* Re: [PATCH 8/9] mm: directly use __mlock_vma_pages_range() in find_extend_vma() 2012-12-21 0:49 ` [PATCH 8/9] mm: directly use __mlock_vma_pages_range() in find_extend_vma() Michel Lespinasse @ 2013-01-03 5:50 ` Rik van Riel 0 siblings, 0 replies; 40+ messages in thread From: Rik van Riel @ 2013-01-03 5:50 UTC (permalink / raw) To: Michel Lespinasse Cc: Andy Lutomirski, Ingo Molnar, Al Viro, Hugh Dickins, Jorn_Engel, Andrew Morton, linux-mm, linux-kernel On 12/20/2012 07:49 PM, Michel Lespinasse wrote: > In find_extend_vma(), we don't need mlock_vma_pages_range() to verify the > vma type - we know we're working with a stack. So, we can call directly > into __mlock_vma_pages_range(), and remove the last make_pages_present() > call site. > > Note that we don't use mm_populate() here, so we can't release the mmap_sem > while allocating new stack pages. This is deemed acceptable, because the > stack vmas grow by a bounded number of pages at a time, and these are > anon pages so we don't have to read from disk to populate them. > > Signed-off-by: Michel Lespinasse <walken@google.com> Acked-by: Rik van Riel <riel@redhat.com> -- All rights reversed ^ permalink raw reply [flat|nested] 40+ messages in thread
* [PATCH 9/9] mm: introduce VM_POPULATE flag to better deal with racy userspace programs 2012-12-21 0:49 [PATCH 0/9] Avoid populating unbounded num of ptes with mmap_sem held Michel Lespinasse ` (7 preceding siblings ...) 2012-12-21 0:49 ` [PATCH 8/9] mm: directly use __mlock_vma_pages_range() in find_extend_vma() Michel Lespinasse @ 2012-12-21 0:49 ` Michel Lespinasse 2013-01-03 6:20 ` Rik van Riel 2012-12-21 10:46 ` [PATCH 0/9] Avoid populating unbounded num of ptes with mmap_sem held Michel Lespinasse ` (2 subsequent siblings) 11 siblings, 1 reply; 40+ messages in thread From: Michel Lespinasse @ 2012-12-21 0:49 UTC (permalink / raw) To: Andy Lutomirski, Ingo Molnar, Al Viro, Hugh Dickins, Jorn_Engel, Rik van Riel Cc: Andrew Morton, linux-mm, linux-kernel The vm_populate() code populates user mappings without constantly holding the mmap_sem. This makes it susceptible to racy userspace programs: the user mappings may change while vm_populate() is running, and in this case vm_populate() may end up populating the new mapping instead of the old one. In order to reduce the possibility of userspace getting surprised by this behavior, this change introduces the VM_POPULATE vma flag which gets set on vmas we want vm_populate() to work on. This way vm_populate() may still end up populating the new mapping after such a race, but only if the new mapping is also one that the user has requested (using MAP_SHARED, MAP_LOCKED or mlock) to be populated. 
Signed-off-by: Michel Lespinasse <walken@google.com> --- include/linux/mm.h | 1 + include/linux/mman.h | 4 +++- mm/fremap.c | 12 ++++++++++-- mm/mlock.c | 19 ++++++++++--------- mm/mmap.c | 4 +--- 5 files changed, 25 insertions(+), 15 deletions(-) diff --git a/include/linux/mm.h b/include/linux/mm.h index d32ace5fba93..77311274f0b5 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -87,6 +87,7 @@ extern unsigned int kobjsize(const void *objp); #define VM_PFNMAP 0x00000400 /* Page-ranges managed without "struct page", just pure PFN */ #define VM_DENYWRITE 0x00000800 /* ETXTBSY on write attempts.. */ +#define VM_POPULATE 0x00001000 #define VM_LOCKED 0x00002000 #define VM_IO 0x00004000 /* Memory mapped I/O or similar */ diff --git a/include/linux/mman.h b/include/linux/mman.h index d09dde1e57fb..e4ad758962e5 100644 --- a/include/linux/mman.h +++ b/include/linux/mman.h @@ -77,6 +77,8 @@ calc_vm_flag_bits(unsigned long flags) { return _calc_vm_trans(flags, MAP_GROWSDOWN, VM_GROWSDOWN ) | _calc_vm_trans(flags, MAP_DENYWRITE, VM_DENYWRITE ) | - _calc_vm_trans(flags, MAP_LOCKED, VM_LOCKED ); + ((flags & MAP_LOCKED) ? (VM_LOCKED | VM_POPULATE) : 0) | + (((flags & (MAP_POPULATE | MAP_NONBLOCK)) == MAP_POPULATE) ? 
+ VM_POPULATE : 0); } #endif /* _LINUX_MMAN_H */ diff --git a/mm/fremap.c b/mm/fremap.c index 503a72387087..0cd4c11488ed 100644 --- a/mm/fremap.c +++ b/mm/fremap.c @@ -204,8 +204,10 @@ get_write_lock: unsigned long addr; struct file *file = get_file(vma->vm_file); - addr = mmap_region(file, start, size, - vma->vm_flags, pgoff); + vm_flags = vma->vm_flags; + if (!(flags & MAP_NONBLOCK)) + vm_flags |= VM_POPULATE; + addr = mmap_region(file, start, size, vm_flags, pgoff); fput(file); if (IS_ERR_VALUE(addr)) { err = addr; @@ -224,6 +226,12 @@ get_write_lock: mutex_unlock(&mapping->i_mmap_mutex); } + if (!(flags & MAP_NONBLOCK) && !(vma->vm_flags & VM_POPULATE)) { + if (!has_write_lock) + goto get_write_lock; + vma->vm_flags |= VM_POPULATE; + } + if (vma->vm_flags & VM_LOCKED) { /* * drop PG_Mlocked flag for over-mapped range diff --git a/mm/mlock.c b/mm/mlock.c index ab0cfe21f538..b1647fbd6bce 100644 --- a/mm/mlock.c +++ b/mm/mlock.c @@ -340,9 +340,9 @@ static int do_mlock(unsigned long start, size_t len, int on) /* Here we know that vma->vm_start <= nstart < vma->vm_end. */ - newflags = vma->vm_flags | VM_LOCKED; - if (!on) - newflags &= ~VM_LOCKED; + newflags = vma->vm_flags & ~VM_LOCKED; + if (on) + newflags |= VM_LOCKED | VM_POPULATE; tmp = vma->vm_end; if (tmp > end) @@ -402,7 +402,8 @@ int __mm_populate(unsigned long start, unsigned long len, int ignore_errors) * range with the first VMA. Also, skip undesirable VMA types. 
*/ nend = min(end, vma->vm_end); - if (vma->vm_flags & (VM_IO | VM_PFNMAP)) + if ((vma->vm_flags & (VM_IO | VM_PFNMAP | VM_POPULATE)) != + VM_POPULATE) continue; if (nstart < vma->vm_start) nstart = vma->vm_start; @@ -475,9 +476,9 @@ static int do_mlockall(int flags) struct vm_area_struct * vma, * prev = NULL; unsigned int def_flags; - def_flags = current->mm->def_flags & ~VM_LOCKED; + def_flags = current->mm->def_flags & ~(VM_LOCKED | VM_POPULATE); if (flags & MCL_FUTURE) - def_flags |= VM_LOCKED; + def_flags |= (VM_LOCKED | VM_POPULATE); current->mm->def_flags = def_flags; if (flags == MCL_FUTURE) goto out; @@ -485,9 +486,9 @@ static int do_mlockall(int flags) for (vma = current->mm->mmap; vma ; vma = prev->vm_next) { vm_flags_t newflags; - newflags = vma->vm_flags | VM_LOCKED; - if (!(flags & MCL_CURRENT)) - newflags &= ~VM_LOCKED; + newflags = vma->vm_flags & ~VM_LOCKED; + if (flags & MCL_CURRENT) + newflags |= VM_LOCKED | VM_POPULATE; /* Ignore errors */ mlock_fixup(vma, &prev, vma->vm_start, vma->vm_end, newflags); diff --git a/mm/mmap.c b/mm/mmap.c index 290d023632e6..27f98850fa8c 100644 --- a/mm/mmap.c +++ b/mm/mmap.c @@ -1153,9 +1153,7 @@ unsigned long do_mmap_pgoff(struct file *file, unsigned long addr, } addr = mmap_region(file, addr, len, vm_flags, pgoff); - if (!IS_ERR_VALUE(addr) && - ((vm_flags & VM_LOCKED) || - (flags & (MAP_POPULATE | MAP_NONBLOCK)) == MAP_POPULATE)) + if (!IS_ERR_VALUE(addr) && (vm_flags & VM_POPULATE)) *populate = true; return addr; } -- 1.7.7.3 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply related [flat|nested] 40+ messages in thread
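The calc_vm_flag_bits() hunk in this patch is pure bit translation and can be mirrored in a small userspace sketch (the function name is hypothetical; the MAP_* constants are the asm-generic/x86 values, the VM_* values those after this patch):

```c
#include <assert.h>

/* mmap() flag values (asm-generic / x86). */
#define MAP_LOCKED   0x2000UL
#define MAP_POPULATE 0x8000UL
#define MAP_NONBLOCK 0x10000UL

/* vma flag values, as defined after this patch. */
#define VM_POPULATE  0x00001000UL
#define VM_LOCKED    0x00002000UL

/*
 * Sketch of the calc_vm_flag_bits() change: MAP_LOCKED now implies
 * both VM_LOCKED and VM_POPULATE, while MAP_POPULATE translates to
 * VM_POPULATE only when MAP_NONBLOCK is not also set.
 */
static unsigned long populate_bits(unsigned long map_flags)
{
	unsigned long vm_flags = 0;

	if (map_flags & MAP_LOCKED)
		vm_flags |= VM_LOCKED | VM_POPULATE;
	if ((map_flags & (MAP_POPULATE | MAP_NONBLOCK)) == MAP_POPULATE)
		vm_flags |= VM_POPULATE;
	return vm_flags;
}
```

This is exactly why do_mmap_pgoff() can later test a single bit: the MAP_POPULATE-versus-MAP_NONBLOCK decision has already been folded into VM_POPULATE at translation time.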
* Re: [PATCH 9/9] mm: introduce VM_POPULATE flag to better deal with racy userspace programs 2012-12-21 0:49 ` [PATCH 9/9] mm: introduce VM_POPULATE flag to better deal with racy userspace programs Michel Lespinasse @ 2013-01-03 6:20 ` Rik van Riel 0 siblings, 0 replies; 40+ messages in thread From: Rik van Riel @ 2013-01-03 6:20 UTC (permalink / raw) To: Michel Lespinasse Cc: Andy Lutomirski, Ingo Molnar, Al Viro, Hugh Dickins, Jorn_Engel, Andrew Morton, linux-mm, linux-kernel On 12/20/2012 07:49 PM, Michel Lespinasse wrote: > The vm_populate() code populates user mappings without constantly > holding the mmap_sem. This makes it susceptible to racy userspace > programs: the user mappings may change while vm_populate() is running, > and in this case vm_populate() may end up populating the new mapping > instead of the old one. > > In order to reduce the possibility of userspace getting surprised by > this behavior, this change introduces the VM_POPULATE vma flag which > gets set on vmas we want vm_populate() to work on. This way > vm_populate() may still end up populating the new mapping after such a > race, but only if the new mapping is also one that the user has > requested (using MAP_SHARED, MAP_LOCKED or mlock) to be populated. > > Signed-off-by: Michel Lespinasse <walken@google.com> Acked-by: Rik van Riel <riel@redhat.com> -- All rights reversed ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH 0/9] Avoid populating unbounded num of ptes with mmap_sem held 2012-12-21 0:49 [PATCH 0/9] Avoid populating unbounded num of ptes with mmap_sem held Michel Lespinasse ` (8 preceding siblings ...) 2012-12-21 0:49 ` [PATCH 9/9] mm: introduce VM_POPULATE flag to better deal with racy userspace programs Michel Lespinasse @ 2012-12-21 10:46 ` Michel Lespinasse 2012-12-22 15:02 ` Greg Ungerer 2013-01-23 13:37 ` Greg Ungerer 2012-12-22 0:36 ` Andy Lutomirski 2013-01-04 18:16 ` Andy Lutomirski 11 siblings, 2 replies; 40+ messages in thread From: Michel Lespinasse @ 2012-12-21 10:46 UTC (permalink / raw) To: Greg Ungerer Cc: Andy Lutomirski, Ingo Molnar, Al Viro, Hugh Dickins, Jorn_Engel, Rik van Riel, Andrew Morton, linux-mm, linux-kernel Hi Greg, I wanted to ask if you could check the sanity of the following patches in nommu configurations. My understanding is that these always populate mappings when they are created, so that MAP_POPULATE and MAP_LOCKED are actually no-ops. Is this an accurate description ? Thanks, On Thu, Dec 20, 2012 at 4:49 PM, Michel Lespinasse <walken@google.com> wrote: > We have many vma manipulation functions that are fast in the typical case, > but can optionally be instructed to populate an unbounded number of ptes > within the region they work on: > - mmap with MAP_POPULATE or MAP_LOCKED flags; > - remap_file_pages() with MAP_NONBLOCK not set or when working on a > VM_LOCKED vma; > - mmap_region() and all its wrappers when mlock(MCL_FUTURE) is in effect; > - brk() when mlock(MCL_FUTURE) is in effect. > > Current code handles these pte operations locally, while the sourrounding > code has to hold the mmap_sem write side since it's manipulating vmas. > This means we're doing an unbounded amount of pte population work with > mmap_sem held, and this causes problems as Andy Lutomirski reported > (we've hit this at Google as well, though it's not entirely clear why > people keep trying to use mlock(MCL_FUTURE) in the first place). 
> > I propose introducing a new mm_populate() function to do this pte > population work after the mmap_sem has been released. mm_populate() > does need to acquire the mmap_sem read side, but critically, it > doesn't need to hold continuously for the entire duration of the > operation - it can drop it whenever things take too long (such as when > hitting disk for a file read) and re-acquire it later on. > > The following patches are against v3.7: > > - Patches 1-2 fix some issues I noticed while working on the existing code. > If needed, they could potentially go in before the rest of the patches. > > - Patch 3 introduces the new mm_populate() function and changes > mmap_region() call sites to use it after they drop mmap_sem. This is > inspired from Andy Lutomirski's proposal and is built as an extension > of the work I had previously done for mlock() and mlockall() around > v2.6.38-rc1. I had tried doing something similar at the time but had > given up as there were so many do_mmap() call sites; the recent cleanups > by Linus and Viro are a tremendous help here. > > - Patches 4-6 convert some of the less-obvious places doing unbounded > pte populates to the new mm_populate() mechanism. > > - Patches 7-8 are code cleanups that are made possible by the > mm_populate() work. In particular, they remove more code than the > entire patch series added, which should be a good thing :) > > - Patch 9 is optional to this entire series. It only helps to deal more > nicely with racy userspace programs that might modify their mappings > while we're trying to populate them. It adds a new VM_POPULATE flag > on the mappings we do want to populate, so that if userspace replaces > them with mappings it doesn't want populated, mm_populate() won't > populate those replacement mappings. 
> > Michel Lespinasse (9): > mm: make mlockall preserve flags other than VM_LOCKED in def_flags > mm: remap_file_pages() fixes > mm: introduce mm_populate() for populating new vmas > mm: use mm_populate() for blocking remap_file_pages() > mm: use mm_populate() when adjusting brk with MCL_FUTURE in effect. > mm: use mm_populate() for mremap() of VM_LOCKED vmas > mm: remove flags argument to mmap_region > mm: directly use __mlock_vma_pages_range() in find_extend_vma() > mm: introduce VM_POPULATE flag to better deal with racy userspace programs > > arch/tile/mm/elf.c | 1 - > fs/aio.c | 6 +++- > include/linux/mm.h | 23 +++++++++--- > include/linux/mman.h | 4 ++- > ipc/shm.c | 12 ++++--- > mm/fremap.c | 51 ++++++++++++++------------- > mm/internal.h | 4 +- > mm/memory.c | 24 ------------- > mm/mlock.c | 94 +++++++++++++------------------------------------ > mm/mmap.c | 77 ++++++++++++++++++++++++---------------- > mm/mremap.c | 25 +++++++------ > mm/nommu.c | 5 ++- > mm/util.c | 6 +++- > 13 files changed, 154 insertions(+), 178 deletions(-) > > -- > 1.7.7.3 -- Michel "Walken" Lespinasse A program is never fully debugged until the last user dies. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH 0/9] Avoid populating unbounded num of ptes with mmap_sem held 2012-12-21 10:46 ` [PATCH 0/9] Avoid populating unbounded num of ptes with mmap_sem held Michel Lespinasse @ 2012-12-22 15:02 ` Greg Ungerer 2013-01-23 13:37 ` Greg Ungerer 1 sibling, 0 replies; 40+ messages in thread From: Greg Ungerer @ 2012-12-22 15:02 UTC (permalink / raw) To: Michel Lespinasse Cc: Greg Ungerer, Andy Lutomirski, Ingo Molnar, Al Viro, Hugh Dickins, Jorn_Engel, Rik van Riel, Andrew Morton, linux-mm, linux-kernel, David Howells Hi Michel, On 12/21/2012 08:46 PM, Michel Lespinasse wrote: > I wanted to ask if you could check the sanity of the following patches > in nommu configurations. My understanding is that these always Sure. I think it is worth CC'ing David Howells on these as well, he has spent a fair bit of time in the mmap code for nommu. Regards Greg > populate mappings when they are created, so that MAP_POPULATE and > MAP_LOCKED are actually no-ops. Is this an accurate description ? > > Thanks, > > On Thu, Dec 20, 2012 at 4:49 PM, Michel Lespinasse <walken@google.com> wrote: >> We have many vma manipulation functions that are fast in the typical case, >> but can optionally be instructed to populate an unbounded number of ptes >> within the region they work on: >> - mmap with MAP_POPULATE or MAP_LOCKED flags; >> - remap_file_pages() with MAP_NONBLOCK not set or when working on a >> VM_LOCKED vma; >> - mmap_region() and all its wrappers when mlock(MCL_FUTURE) is in effect; >> - brk() when mlock(MCL_FUTURE) is in effect. >> >> Current code handles these pte operations locally, while the sourrounding >> code has to hold the mmap_sem write side since it's manipulating vmas. >> This means we're doing an unbounded amount of pte population work with >> mmap_sem held, and this causes problems as Andy Lutomirski reported >> (we've hit this at Google as well, though it's not entirely clear why >> people keep trying to use mlock(MCL_FUTURE) in the first place). 
>> >> I propose introducing a new mm_populate() function to do this pte >> population work after the mmap_sem has been released. mm_populate() >> does need to acquire the mmap_sem read side, but critically, it >> doesn't need to hold continuously for the entire duration of the >> operation - it can drop it whenever things take too long (such as when >> hitting disk for a file read) and re-acquire it later on. >> >> The following patches are against v3.7: >> >> - Patches 1-2 fix some issues I noticed while working on the existing code. >> If needed, they could potentially go in before the rest of the patches. >> >> - Patch 3 introduces the new mm_populate() function and changes >> mmap_region() call sites to use it after they drop mmap_sem. This is >> inspired from Andy Lutomirski's proposal and is built as an extension >> of the work I had previously done for mlock() and mlockall() around >> v2.6.38-rc1. I had tried doing something similar at the time but had >> given up as there were so many do_mmap() call sites; the recent cleanups >> by Linus and Viro are a tremendous help here. >> >> - Patches 4-6 convert some of the less-obvious places doing unbounded >> pte populates to the new mm_populate() mechanism. >> >> - Patches 7-8 are code cleanups that are made possible by the >> mm_populate() work. In particular, they remove more code than the >> entire patch series added, which should be a good thing :) >> >> - Patch 9 is optional to this entire series. It only helps to deal more >> nicely with racy userspace programs that might modify their mappings >> while we're trying to populate them. It adds a new VM_POPULATE flag >> on the mappings we do want to populate, so that if userspace replaces >> them with mappings it doesn't want populated, mm_populate() won't >> populate those replacement mappings. 
>> >> Michel Lespinasse (9): >> mm: make mlockall preserve flags other than VM_LOCKED in def_flags >> mm: remap_file_pages() fixes >> mm: introduce mm_populate() for populating new vmas >> mm: use mm_populate() for blocking remap_file_pages() >> mm: use mm_populate() when adjusting brk with MCL_FUTURE in effect. >> mm: use mm_populate() for mremap() of VM_LOCKED vmas >> mm: remove flags argument to mmap_region >> mm: directly use __mlock_vma_pages_range() in find_extend_vma() >> mm: introduce VM_POPULATE flag to better deal with racy userspace programs >> >> arch/tile/mm/elf.c | 1 - >> fs/aio.c | 6 +++- >> include/linux/mm.h | 23 +++++++++--- >> include/linux/mman.h | 4 ++- >> ipc/shm.c | 12 ++++--- >> mm/fremap.c | 51 ++++++++++++++------------- >> mm/internal.h | 4 +- >> mm/memory.c | 24 ------------- >> mm/mlock.c | 94 +++++++++++++------------------------------------ >> mm/mmap.c | 77 ++++++++++++++++++++++++---------------- >> mm/mremap.c | 25 +++++++------ >> mm/nommu.c | 5 ++- >> mm/util.c | 6 +++- >> 13 files changed, 154 insertions(+), 178 deletions(-) >> >> -- >> 1.7.7.3 > > > -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH 0/9] Avoid populating unbounded num of ptes with mmap_sem held 2012-12-21 10:46 ` [PATCH 0/9] Avoid populating unbounded num of ptes with mmap_sem held Michel Lespinasse 2012-12-22 15:02 ` Greg Ungerer @ 2013-01-23 13:37 ` Greg Ungerer 1 sibling, 0 replies; 40+ messages in thread From: Greg Ungerer @ 2013-01-23 13:37 UTC (permalink / raw) To: Michel Lespinasse Cc: Andy Lutomirski, Ingo Molnar, Al Viro, Hugh Dickins, Jorn_Engel, Rik van Riel, Andrew Morton, linux-mm, linux-kernel Hi Michel, On 12/21/2012 08:46 PM, Michel Lespinasse wrote: > I wanted to ask if you could check the sanity of the following patches > in nommu configurations. My understanding is that these always > populate mappings when they are created, so that MAP_POPULATE and > MAP_LOCKED are actually no-ops. Is this an accurate description ? Sorry, I have been very slow at getting around to this. MAP_LOCKED and MAP_POPULATE are effectively no-ops, we don't deal with them on the no-mmu mmap path. I had a look over the patches and they look ok to me with regard to no-mmu. So if you still want it: Acked-by: Greg Ungerer <gerg@uclinux.org> Regards Greg > On Thu, Dec 20, 2012 at 4:49 PM, Michel Lespinasse <walken@google.com> wrote: >> We have many vma manipulation functions that are fast in the typical case, >> but can optionally be instructed to populate an unbounded number of ptes >> within the region they work on: >> - mmap with MAP_POPULATE or MAP_LOCKED flags; >> - remap_file_pages() with MAP_NONBLOCK not set or when working on a >> VM_LOCKED vma; >> - mmap_region() and all its wrappers when mlock(MCL_FUTURE) is in effect; >> - brk() when mlock(MCL_FUTURE) is in effect. >> >> Current code handles these pte operations locally, while the sourrounding >> code has to hold the mmap_sem write side since it's manipulating vmas. 
>> This means we're doing an unbounded amount of pte population work with >> mmap_sem held, and this causes problems as Andy Lutomirski reported >> (we've hit this at Google as well, though it's not entirely clear why >> people keep trying to use mlock(MCL_FUTURE) in the first place). >> >> I propose introducing a new mm_populate() function to do this pte >> population work after the mmap_sem has been released. mm_populate() >> does need to acquire the mmap_sem read side, but critically, it >> doesn't need to hold continuously for the entire duration of the >> operation - it can drop it whenever things take too long (such as when >> hitting disk for a file read) and re-acquire it later on. >> >> The following patches are against v3.7: >> >> - Patches 1-2 fix some issues I noticed while working on the existing code. >> If needed, they could potentially go in before the rest of the patches. >> >> - Patch 3 introduces the new mm_populate() function and changes >> mmap_region() call sites to use it after they drop mmap_sem. This is >> inspired from Andy Lutomirski's proposal and is built as an extension >> of the work I had previously done for mlock() and mlockall() around >> v2.6.38-rc1. I had tried doing something similar at the time but had >> given up as there were so many do_mmap() call sites; the recent cleanups >> by Linus and Viro are a tremendous help here. >> >> - Patches 4-6 convert some of the less-obvious places doing unbounded >> pte populates to the new mm_populate() mechanism. >> >> - Patches 7-8 are code cleanups that are made possible by the >> mm_populate() work. In particular, they remove more code than the >> entire patch series added, which should be a good thing :) >> >> - Patch 9 is optional to this entire series. It only helps to deal more >> nicely with racy userspace programs that might modify their mappings >> while we're trying to populate them. 
It adds a new VM_POPULATE flag >> on the mappings we do want to populate, so that if userspace replaces >> them with mappings it doesn't want populated, mm_populate() won't >> populate those replacement mappings. >> >> Michel Lespinasse (9): >> mm: make mlockall preserve flags other than VM_LOCKED in def_flags >> mm: remap_file_pages() fixes >> mm: introduce mm_populate() for populating new vmas >> mm: use mm_populate() for blocking remap_file_pages() >> mm: use mm_populate() when adjusting brk with MCL_FUTURE in effect. >> mm: use mm_populate() for mremap() of VM_LOCKED vmas >> mm: remove flags argument to mmap_region >> mm: directly use __mlock_vma_pages_range() in find_extend_vma() >> mm: introduce VM_POPULATE flag to better deal with racy userspace programs >> >> arch/tile/mm/elf.c | 1 - >> fs/aio.c | 6 +++- >> include/linux/mm.h | 23 +++++++++--- >> include/linux/mman.h | 4 ++- >> ipc/shm.c | 12 ++++--- >> mm/fremap.c | 51 ++++++++++++++------------- >> mm/internal.h | 4 +- >> mm/memory.c | 24 ------------- >> mm/mlock.c | 94 +++++++++++++------------------------------------ >> mm/mmap.c | 77 ++++++++++++++++++++++++---------------- >> mm/mremap.c | 25 +++++++------ >> mm/nommu.c | 5 ++- >> mm/util.c | 6 +++- >> 13 files changed, 154 insertions(+), 178 deletions(-) >> >> -- >> 1.7.7.3 > > > -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH 0/9] Avoid populating unbounded num of ptes with mmap_sem held 2012-12-21 0:49 [PATCH 0/9] Avoid populating unbounded num of ptes with mmap_sem held Michel Lespinasse ` (9 preceding siblings ...) 2012-12-21 10:46 ` [PATCH 0/9] Avoid populating unbounded num of ptes with mmap_sem held Michel Lespinasse @ 2012-12-22 0:36 ` Andy Lutomirski 2012-12-22 0:59 ` Michel Lespinasse 2013-01-04 18:16 ` Andy Lutomirski 11 siblings, 1 reply; 40+ messages in thread From: Andy Lutomirski @ 2012-12-22 0:36 UTC (permalink / raw) To: Michel Lespinasse Cc: Ingo Molnar, Al Viro, Hugh Dickins, Jorn_Engel, Rik van Riel, Andrew Morton, linux-mm, linux-kernel On Thu, Dec 20, 2012 at 4:49 PM, Michel Lespinasse <walken@google.com> wrote: > We have many vma manipulation functions that are fast in the typical case, > but can optionally be instructed to populate an unbounded number of ptes > within the region they work on: > - mmap with MAP_POPULATE or MAP_LOCKED flags; > - remap_file_pages() with MAP_NONBLOCK not set or when working on a > VM_LOCKED vma; > - mmap_region() and all its wrappers when mlock(MCL_FUTURE) is in effect; > - brk() when mlock(MCL_FUTURE) is in effect. > Something's buggy here. My evil test case is stuck with lots of threads spinning at 100% system time. Stack traces look like: [<0000000000000000>] __mlock_vma_pages_range+0x66/0x70 [<0000000000000000>] __mm_populate+0xf9/0x150 [<0000000000000000>] vm_mmap_pgoff+0x9f/0xc0 [<0000000000000000>] sys_mmap_pgoff+0x7e/0x150 [<0000000000000000>] sys_mmap+0x22/0x30 [<0000000000000000>] system_call_fastpath+0x16/0x1b [<0000000000000000>] 0xffffffffffffffff perf top says: 38.45% [kernel] [k] __mlock_vma_pages_range 33.04% [kernel] [k] __get_user_pages 28.18% [kernel] [k] __mm_populate The tasks in question use MCL_FUTURE but not MAP_POPULATE. These tasks are immune to SIGKILL. --Andy ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH 0/9] Avoid populating unbounded num of ptes with mmap_sem held 2012-12-22 0:36 ` Andy Lutomirski @ 2012-12-22 0:59 ` Michel Lespinasse 2012-12-22 1:09 ` Andy Lutomirski 0 siblings, 1 reply; 40+ messages in thread From: Michel Lespinasse @ 2012-12-22 0:59 UTC (permalink / raw) To: Andy Lutomirski Cc: Ingo Molnar, Al Viro, Hugh Dickins, Jorn_Engel, Rik van Riel, Andrew Morton, linux-mm, linux-kernel On Fri, Dec 21, 2012 at 4:36 PM, Andy Lutomirski <luto@amacapital.net> wrote: > On Thu, Dec 20, 2012 at 4:49 PM, Michel Lespinasse <walken@google.com> wrote: >> We have many vma manipulation functions that are fast in the typical case, >> but can optionally be instructed to populate an unbounded number of ptes >> within the region they work on: >> - mmap with MAP_POPULATE or MAP_LOCKED flags; >> - remap_file_pages() with MAP_NONBLOCK not set or when working on a >> VM_LOCKED vma; >> - mmap_region() and all its wrappers when mlock(MCL_FUTURE) is in effect; >> - brk() when mlock(MCL_FUTURE) is in effect. >> > > Something's buggy here. My evil test case is stuck with lots of > threads spinning at 100% system time. Stack traces look like: > > [<0000000000000000>] __mlock_vma_pages_range+0x66/0x70 > [<0000000000000000>] __mm_populate+0xf9/0x150 > [<0000000000000000>] vm_mmap_pgoff+0x9f/0xc0 > [<0000000000000000>] sys_mmap_pgoff+0x7e/0x150 > [<0000000000000000>] sys_mmap+0x22/0x30 > [<0000000000000000>] system_call_fastpath+0x16/0x1b > [<0000000000000000>] 0xffffffffffffffff > > perf top says: > > 38.45% [kernel] [k] __mlock_vma_pages_range > 33.04% [kernel] [k] __get_user_pages > 28.18% [kernel] [k] __mm_populate > > The tasks in question use MCL_FUTURE but not MAP_POPULATE. These > tasks are immune to SIGKILL. Looking into it. 
There seems to be a problem with mlockall - the following program fails in an unkillable way even before my changes: #include <sys/mman.h> #include <stdio.h> #include <stdint.h> int main(void) { void *p = mmap(NULL, 0x100000000000, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANON | MAP_NORESERVE, -1, 0); printf("p: %p\n", p); mlockall(MCL_CURRENT); return 0; } I think my changes propagate this existing problem so it now shows up in more places :/ -- Michel "Walken" Lespinasse A program is never fully debugged until the last user dies. ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH 0/9] Avoid populating unbounded num of ptes with mmap_sem held 2012-12-22 0:59 ` Michel Lespinasse @ 2012-12-22 1:09 ` Andy Lutomirski 2012-12-22 1:59 ` Michel Lespinasse 0 siblings, 1 reply; 40+ messages in thread From: Andy Lutomirski @ 2012-12-22 1:09 UTC (permalink / raw) To: Michel Lespinasse Cc: Ingo Molnar, Al Viro, Hugh Dickins, Jorn_Engel, Rik van Riel, Andrew Morton, linux-mm, linux-kernel On Fri, Dec 21, 2012 at 4:59 PM, Michel Lespinasse <walken@google.com> wrote: > On Fri, Dec 21, 2012 at 4:36 PM, Andy Lutomirski <luto@amacapital.net> wrote: >> On Thu, Dec 20, 2012 at 4:49 PM, Michel Lespinasse <walken@google.com> wrote: >>> We have many vma manipulation functions that are fast in the typical case, >>> but can optionally be instructed to populate an unbounded number of ptes >>> within the region they work on: >>> - mmap with MAP_POPULATE or MAP_LOCKED flags; >>> - remap_file_pages() with MAP_NONBLOCK not set or when working on a >>> VM_LOCKED vma; >>> - mmap_region() and all its wrappers when mlock(MCL_FUTURE) is in effect; >>> - brk() when mlock(MCL_FUTURE) is in effect. >>> >> >> Something's buggy here. My evil test case is stuck with lots of >> threads spinning at 100% system time. Stack traces look like: >> >> [<0000000000000000>] __mlock_vma_pages_range+0x66/0x70 >> [<0000000000000000>] __mm_populate+0xf9/0x150 >> [<0000000000000000>] vm_mmap_pgoff+0x9f/0xc0 >> [<0000000000000000>] sys_mmap_pgoff+0x7e/0x150 >> [<0000000000000000>] sys_mmap+0x22/0x30 >> [<0000000000000000>] system_call_fastpath+0x16/0x1b >> [<0000000000000000>] 0xffffffffffffffff >> >> perf top says: >> >> 38.45% [kernel] [k] __mlock_vma_pages_range >> 33.04% [kernel] [k] __get_user_pages >> 28.18% [kernel] [k] __mm_populate >> >> The tasks in question use MCL_FUTURE but not MAP_POPULATE. These >> tasks are immune to SIGKILL. > > Looking into it. 
>
> There seems to be a problem with mlockall - the following program
> fails in an unkillable way even before my changes:
>
> #include <sys/mman.h>
> #include <stdio.h>
> #include <stdint.h>
>
> int main(void) {
>         void *p = mmap(NULL, 0x100000000000,
>                        PROT_READ | PROT_WRITE,
>                        MAP_PRIVATE | MAP_ANON | MAP_NORESERVE,
>                        -1, 0);
>         printf("p: %p\n", p);
>         mlockall(MCL_CURRENT);
>         return 0;
> }
>
> I think my changes propagate this existing problem so it now shows up
> in more places :/

Hmm.  I'm using MCL_FUTURE with MAP_NORESERVE, but those mappings are
not insanely large.  Should MAP_NORESERVE negate MCL_FUTURE?

I'm doing MAP_NORESERVE, PROT_NONE to prevent pages from being
allocated in the future -- I have no intention of ever using them.

The other odd thing I do is use MAP_FIXED to replace MAP_NORESERVE pages.

--Andy
* Re: [PATCH 0/9] Avoid populating unbounded num of ptes with mmap_sem held
  2012-12-22  1:09 ` Andy Lutomirski
@ 2012-12-22  1:59 ` Michel Lespinasse
  2012-12-22  2:16 ` Andy Lutomirski
  0 siblings, 1 reply; 40+ messages in thread
From: Michel Lespinasse @ 2012-12-22  1:59 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Ingo Molnar, Al Viro, Hugh Dickins, Jorn_Engel, Rik van Riel,
      Andrew Morton, linux-mm, linux-kernel

On Fri, Dec 21, 2012 at 5:09 PM, Andy Lutomirski <luto@amacapital.net> wrote:
> On Fri, Dec 21, 2012 at 4:59 PM, Michel Lespinasse <walken@google.com> wrote:
>> On Fri, Dec 21, 2012 at 4:36 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>>> Something's buggy here.  My evil test case is stuck with lots of
>>> threads spinning at 100% system time.
>>>
>>> The tasks in question use MCL_FUTURE but not MAP_POPULATE.  These
>>> tasks are immune to SIGKILL.
>>
>> Looking into it.
>>
>> There seems to be a problem with mlockall - the following program
>> fails in an unkillable way even before my changes:
>>
>> #include <sys/mman.h>
>> #include <stdio.h>
>> #include <stdint.h>
>>
>> int main(void) {
>>         void *p = mmap(NULL, 0x100000000000,
>>                        PROT_READ | PROT_WRITE,
>>                        MAP_PRIVATE | MAP_ANON | MAP_NORESERVE,
>>                        -1, 0);
>>         printf("p: %p\n", p);
>>         mlockall(MCL_CURRENT);
>>         return 0;
>> }
>>
>> I think my changes propagate this existing problem so it now shows up
>> in more places :/

So in my test case, the issue was caused by the mapping being 2^32
pages, which overflowed the integer 'nr_pages' argument to
__get_user_pages, which caused an infinite loop as __get_user_pages()
would return 0 so __mm_populate() would make no progress.

When dropping one zero from that humongous size in the test case, the
test case becomes at least killable.

> Hmm.  I'm using MCL_FUTURE with MAP_NORESERVE, but those mappings are
> not insanely large.
> I'm doing MAP_NORESERVE, PROT_NONE to prevent pages from being
> allocated in the future -- I have no intention of ever using them.

MAP_NORESERVE doesn't prevent page allocation, but PROT_NONE does
(precisely because people use it the same way as you do :)

> The other odd thing I do is use MAP_FIXED to replace MAP_NORESERVE pages.

Yes, I've seen people do that here too.

Could you share your test case so I can try reproducing the issue
you're seeing?

Thanks,

-- 
Michel "Walken" Lespinasse
A program is never fully debugged until the last user dies.
* Re: [PATCH 0/9] Avoid populating unbounded num of ptes with mmap_sem held 2012-12-22 1:59 ` Michel Lespinasse @ 2012-12-22 2:16 ` Andy Lutomirski 2012-12-22 9:37 ` Michel Lespinasse 0 siblings, 1 reply; 40+ messages in thread From: Andy Lutomirski @ 2012-12-22 2:16 UTC (permalink / raw) To: Michel Lespinasse Cc: Ingo Molnar, Al Viro, Hugh Dickins, Jorn_Engel, Rik van Riel, Andrew Morton, linux-mm, linux-kernel On Fri, Dec 21, 2012 at 5:59 PM, Michel Lespinasse <walken@google.com> wrote: > On Fri, Dec 21, 2012 at 5:09 PM, Andy Lutomirski <luto@amacapital.net> wrote: >> On Fri, Dec 21, 2012 at 4:59 PM, Michel Lespinasse <walken@google.com> wrote: >>> On Fri, Dec 21, 2012 at 4:36 PM, Andy Lutomirski <luto@amacapital.net> wrote: >>>> Something's buggy here. My evil test case is stuck with lots of >>>> threads spinning at 100% system time. >>>> >>>> The tasks in question use MCL_FUTURE but not MAP_POPULATE. These >>>> tasks are immune to SIGKILL. >>> >>> Looking into it. >>> >>> There seems to be a problem with mlockall - the following program >>> fails in an unkillable way even before my changes: >>> >>> #include <sys/mman.h> >>> #include <stdio.h> >>> #include <stdint.h> >>> >>> int main(void) { >>> void *p = mmap(NULL, 0x100000000000, >>> PROT_READ | PROT_WRITE, >>> MAP_PRIVATE | MAP_ANON | MAP_NORESERVE, >>> -1, 0); >>> printf("p: %p\n", p); >>> mlockall(MCL_CURRENT); >>> return 0; >>> } >>> >>> I think my changes propagate this existing problem so it now shows up >>> in more places :/ > > So in my test case, the issue was caused by the mapping being 2^32 > pages, which overflowed the integer 'nr_pages' argument to > __get_user_pages, which caused an infinite loop as __get_user_pages() > would return 0 so __mm_populate() would make no progress. > > When dropping one zero from that humongous size in the test case, the > test case becomes at least killable. > >> Hmm. I'm using MCL_FUTURE with MAP_NORESERVE, but those mappings are >> not insanely large. 
>> Should MAP_NORESERVE negate MCL_FUTURE?
>>
>> I'm doing MAP_NORESERVE, PROT_NONE to prevent pages from being
>> allocated in the future -- I have no intention of ever using them.
>
> MAP_NORESERVE doesn't prevent page allocation, but PROT_NONE does
> (precisely because people use it the same way as you do :)
>
>> The other odd thing I do is use MAP_FIXED to replace MAP_NORESERVE pages.
> Yes, I've seen people do that here too.
>
> Could you share your test case so I can try reproducing the issue
> you're seeing?

Not so easy.  My test case is a large chunk of a high-frequency
trading system :)

I just tried it again.  Now I have a task stuck in
mlockall(MCL_CURRENT|MCL_FUTURE).  The stack is:

[<0000000000000000>] flush_work+0x1c2/0x280
[<0000000000000000>] schedule_on_each_cpu+0xe3/0x130
[<0000000000000000>] lru_add_drain_all+0x15/0x20
[<0000000000000000>] sys_mlockall+0x125/0x1a0
[<0000000000000000>] tracesys+0xd0/0xd5
[<0000000000000000>] 0xffffffffffffffff

The sequence of mmap and munmap calls, according to strace, is:

6084 mmap(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f550a0e4000
6084 mmap(NULL, 8388744, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7f55096c3000
6084 mmap(0x7f5509ec2000, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x3ff000) = 0x7f5509ec2000
6084 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f550a0e3000
6084 mmap(NULL, 2413688, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7f5509475000
6084 mmap(0x7f550969c000, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x27000) = 0x7f550969c000
6084 mmap(0x7f550969e000, 148600, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7f550969e000
6084 mmap(NULL, 12636304, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7f5508867000
6084 mmap(0x7f5509420000, 327680, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x7b9000) = 0x7f5509420000
6084 mmap(0x7f5509470000, 16528, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7f5509470000 6084 mmap(NULL, 8409224, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7f5508061000 6084 mmap(0x7f550885b000, 36864, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x3fa000) = 0x7f550885b000 6084 mmap(0x7f5508864000, 8328, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7f5508864000 6084 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f5508060000 6084 mmap(NULL, 8404144, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7f550785c000 6084 mmap(0x7f5508054000, 45056, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x3f8000) = 0x7f5508054000 6084 mmap(0x7f550805f000, 3248, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7f550805f000 6084 mmap(NULL, 8390584, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7f550705b000 6084 mmap(0x7f5507859000, 12288, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x3fe000) = 0x7f5507859000 6084 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f550705a000 6084 mmap(NULL, 8393296, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7f5506858000 6084 mmap(0x7f5507055000, 16384, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x3fd000) = 0x7f5507055000 6084 mmap(0x7f5507059000, 592, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7f5507059000 6084 mmap(NULL, 8393224, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7f5506056000 6084 mmap(0x7f5506855000, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x3ff000) = 0x7f5506855000 6084 mmap(0x7f5506857000, 520, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7f5506857000 6084 mmap(NULL, 46777, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7f550604a000 6084 mmap(NULL, 2109672, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 
0) = 0x7f5505e46000 6084 mmap(0x7f5506048000, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x2000) = 0x7f5506048000 6084 mmap(NULL, 3150176, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7f5505b44000 6084 mmap(0x7f5505e3e000, 28672, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0xfa000) = 0x7f5505e3e000 6084 mmap(0x7f5505e45000, 352, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7f5505e45000 6084 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f550a0e2000 6084 mmap(NULL, 2217600, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7f5505926000 6084 mmap(0x7f5505b42000, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x1c000) = 0x7f5505b42000 6084 mmap(NULL, 2509488, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7f55056c1000 6084 mmap(0x7f5505922000, 16384, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x61000) = 0x7f5505922000 6084 mmap(NULL, 3073904, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7f55053d2000 6084 mmap(0x7f55056bc000, 20480, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0xea000) = 0x7f55056bc000 6084 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f550a0e1000 6084 mmap(NULL, 3517816, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7f5505077000 6084 mmap(0x7f55053c7000, 40960, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x150000) = 0x7f55053c7000 6084 mmap(0x7f55053d1000, 3448, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7f55053d1000 6084 mmap(NULL, 3142544, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7f5504d77000 6084 mmap(0x7f5505058000, 40960, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0xe1000) = 0x7f5505058000 6084 mmap(0x7f5505062000, 82832, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7f5505062000 6084 mmap(NULL, 
3125544, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7f5504a7b000 6084 mmap(0x7f5504d75000, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0xfa000) = 0x7f5504d75000 6084 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f550a0e0000 6084 mmap(NULL, 2184216, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7f5504865000 6084 mmap(0x7f5504a79000, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x14000) = 0x7f5504a79000 6084 mmap(NULL, 2212904, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7f5504648000 6084 mmap(0x7f550485f000, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x17000) = 0x7f550485f000 6084 mmap(0x7f5504861000, 13352, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7f5504861000 6084 mmap(NULL, 3925208, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7f5504289000 6084 mmap(0x7f550463d000, 24576, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x1b4000) = 0x7f550463d000 6084 mmap(0x7f5504643000, 17624, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7f5504643000 6084 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f550a0df000 6084 mmap(NULL, 8392040, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7f5503a88000 6084 mmap(0x7f5504287000, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x3ff000) = 0x7f5504287000 6084 mmap(NULL, 2196528, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7f550386f000 6084 mmap(0x7f5503a85000, 12288, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x16000) = 0x7f5503a85000 6084 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f550a0de000 6084 mmap(NULL, 2708032, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7f55035d9000 6084 mmap(0x7f5503867000, 32768, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 
0x8e000) = 0x7f5503867000 6084 mmap(NULL, 2408864, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7f550338c000 6084 mmap(0x7f55035d7000, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x4b000) = 0x7f55035d7000 6084 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f550a0dd000 6084 mmap(NULL, 8388656, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7f5502b8b000 6084 mmap(0x7f550338a000, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x3ff000) = 0x7f550338a000 6084 mmap(NULL, 2518464, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7f5502924000 6084 mmap(0x7f5502b82000, 32768, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x5e000) = 0x7f5502b82000 6084 mmap(0x7f5502b8a000, 3520, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7f5502b8a000 6084 mmap(NULL, 2162064, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7f5502714000 6084 mmap(0x7f5502922000, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0xe000) = 0x7f5502922000 6084 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f5502713000 6084 mmap(NULL, 4454088, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7f55022d3000 6084 mmap(0x7f5502701000, 69632, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x22e000) = 0x7f5502701000 6084 mmap(0x7f5502712000, 1736, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7f5502712000 6084 mmap(NULL, 2249496, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7f55020ad000 6084 mmap(0x7f55022be000, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x11000) = 0x7f55022be000 6084 mmap(0x7f55022c0000, 74520, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7f55022c0000 6084 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f55020ac000 6084 mmap(NULL, 2232416, 
PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7f5501e8a000 6084 mmap(0x7f55020aa000, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x20000) = 0x7f55020aa000 6084 mmap(NULL, 2113944, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7f5501c85000 6084 mmap(0x7f5501e88000, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x3000) = 0x7f5501e88000 6084 mmap(NULL, 2128984, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7f5501a7d000 6084 mmap(0x7f5501c83000, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x6000) = 0x7f5501c83000 6084 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f550a0dc000 6084 mmap(NULL, 2109704, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7f5501879000 6084 mmap(0x7f5501a7b000, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x2000) = 0x7f5501a7b000 6084 mmap(NULL, 2366184, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7f5501637000 6084 mmap(0x7f5501870000, 32768, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x39000) = 0x7f5501870000 6084 mmap(0x7f5501878000, 2792, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7f5501878000 6084 mmap(NULL, 3578368, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7f55012cd000 6084 mmap(0x7f5501622000, 69632, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x155000) = 0x7f5501622000 6084 mmap(0x7f5501633000, 14848, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7f5501633000 6084 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f550a0db000 6084 mmap(NULL, 3964496, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7f5500f05000 6084 mmap(0x7f55012bf000, 57344, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x1ba000) = 0x7f55012bf000 6084 mmap(NULL, 2187824, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 
0x7f5500cee000 6084 mmap(0x7f5500f03000, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x15000) = 0x7f5500f03000 6084 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f550a0da000 6084 mmap(NULL, 2469936, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7f5500a92000 6084 mmap(0x7f5500ce4000, 36864, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x52000) = 0x7f5500ce4000 6084 mmap(0x7f5500ced000, 48, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7f5500ced000 6084 mmap(NULL, 2134448, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7f5500888000 6084 mmap(0x7f5500a90000, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x8000) = 0x7f5500a90000 6084 mmap(NULL, 2205832, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7f550066d000 6084 mmap(0x7f5500886000, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x19000) = 0x7f5500886000 6084 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f550a0d9000 6084 mmap(NULL, 2864104, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7f55003b1000 6084 mmap(0x7f5500665000, 28672, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0xb4000) = 0x7f5500665000 6084 mmap(0x7f550066c000, 1000, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7f550066c000 6084 mmap(NULL, 2255936, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7f550018a000 6084 mmap(0x7f55003ac000, 20480, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x22000) = 0x7f55003ac000 6084 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f5500189000 6084 mmap(NULL, 20377632, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7f54fee19000 6084 mmap(0x7f5500187000, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x116e000) = 0x7f5500187000 6084 mmap(NULL, 3963320, PROT_READ|PROT_EXEC, 
MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7f54fea51000 6084 mmap(0x7f54fedef000, 155648, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x19e000) = 0x7f54fedef000 6084 mmap(0x7f54fee15000, 14776, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7f54fee15000 6084 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f54fea50000 6084 mmap(NULL, 2612480, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7f54fe7d2000 6084 mmap(0x7f54fea4c000, 16384, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x7a000) = 0x7f54fea4c000 6084 mmap(NULL, 2302408, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7f54fe59f000 6084 mmap(0x7f54fe7d0000, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x31000) = 0x7f54fe7d0000 6084 mmap(NULL, 2125992, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7f54fe397000 6084 mmap(0x7f54fe59d000, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x6000) = 0x7f54fe59d000 6084 mmap(NULL, 2348544, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7f54fe159000 6084 mmap(0x7f54fe394000, 12288, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x3b000) = 0x7f54fe394000 6084 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f550a0d8000 6084 mmap(NULL, 2163496, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7f54fdf48000 6084 mmap(0x7f54fe157000, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0xf000) = 0x7f54fe157000 6084 mmap(NULL, 2168048, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7f54fdd36000 6084 mmap(0x7f54fdf46000, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x10000) = 0x7f54fdf46000 6084 mmap(NULL, 2109528, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7f54fdb32000 6084 mmap(0x7f54fdd34000, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x2000) = 0x7f54fdd34000 6084 
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f550a0d7000 6084 mmap(NULL, 2940512, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7f54fd864000 6084 mmap(0x7f54fdb27000, 45056, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0xc3000) = 0x7f54fdb27000 6084 mmap(NULL, 2256944, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7f54fd63c000 6084 mmap(0x7f54fd861000, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x25000) = 0x7f54fd861000 6084 mmap(0x7f54fd863000, 48, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7f54fd863000 6084 mmap(NULL, 2109864, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7f54fd438000 6084 mmap(0x7f54fd63a000, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x2000) = 0x7f54fd63a000 6084 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f550a0d6000 6084 mmap(NULL, 2126576, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7f54fd230000 6084 mmap(0x7f54fd436000, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x6000) = 0x7f54fd436000 6084 mmap(NULL, 2109536, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7f54fd02c000 6084 mmap(0x7f54fd22e000, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x2000) = 0x7f54fd22e000 6084 mmap(NULL, 2210424, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7f54fce10000 6084 mmap(0x7f54fd028000, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x18000) = 0x7f54fd028000 6084 mmap(0x7f54fd02a000, 6776, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7f54fd02a000 6084 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f550a0d5000 6084 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f550a0d4000 6084 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f550a0d3000 6084 
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f550a0d2000 6084 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f550a0d1000 6084 mmap(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f550a0cf000 6084 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f550a0ce000 6084 mmap(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f550a0cc000 6084 munmap(0x7f550604a000, 46777) = 0 6084 mmap(NULL, 139264, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0) = 0x7f550a0aa000 6084 mmap(0x7f550a0ab000, 131072, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7f550a0ab000 6084 mmap(NULL, 139264, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0) = 0x7f550a088000 6084 mmap(0x7f550a089000, 131072, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7f550a089000 6084 mmap(NULL, 139264, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0) = 0x7f550a066000 6084 mmap(0x7f550a067000, 131072, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7f550a067000 6084 mmap(NULL, 548864, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0) = 0x7f5509fe0000 6084 mmap(0x7f5509fe1000, 540672, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7f5509fe1000 6084 mmap(NULL, 40960, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0) = 0x7f5509fd6000 6084 mmap(0x7f5509fd7000, 32768, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7f5509fd7000 6084 mmap(NULL, 139264, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0) = 0x7f5509fb4000 6084 mmap(0x7f5509fb5000, 131072, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7f5509fb5000 6084 mmap(NULL, 4198400, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0) = 0x7f54fca0f000 6084 mmap(0x7f54fcc00000, 2097152, PROT_READ|PROT_WRITE, 
MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS|0x40000, -1, 0) = 0x7f54fcc00000 6084 munmap(0x7f54fca0f000, 2031616) = 0 6084 munmap(0x7f54fce01000, 61440) = 0 6084 mmap(NULL, 40960, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0) = 0x7f5509faa000 6084 mmap(0x7f5509fab000, 32768, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7f5509fab000 6084 mmap(NULL, 24576, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0) = 0x7f5509fa4000 6084 mmap(0x7f5509fa5000, 16384, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7f5509fa5000 6084 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f5509fa3000 6084 munmap(0x7f5509fa3000, 4096) = 0 6084 mmap(NULL, 2919792, PROT_READ, MAP_PRIVATE, 4, 0) = 0x7f54fc936000 6084 mmap(NULL, 26258, PROT_READ, MAP_SHARED, 4, 0) = 0x7f5509f9d000 6084 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f5509f9c000 6084 munmap(0x7f5509f9c000, 4096) = 0 6084 mlockall(MCL_CURRENT|MCL_FUTURE This task is unkillable. Two other tasks are stuck spinning. Note the line with 0x40000, which is MAP_HUGETLB. Could that be the problem? --Andy -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH 0/9] Avoid populating unbounded num of ptes with mmap_sem held
  2012-12-22  2:16 ` Andy Lutomirski
@ 2012-12-22  9:37 ` Michel Lespinasse
  2012-12-22  9:45 ` [PATCH 10/9] mm: make do_mmap_pgoff return populate as a size in bytes, not as a bool Michel Lespinasse
  2013-01-03 18:56 ` [PATCH 0/9] Avoid populating unbounded num of ptes with mmap_sem held Andy Lutomirski
  0 siblings, 2 replies; 40+ messages in thread
From: Michel Lespinasse @ 2012-12-22  9:37 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Ingo Molnar, Al Viro, Hugh Dickins, Jorn_Engel, Rik van Riel,
      Andrew Morton, linux-mm, linux-kernel

On Fri, Dec 21, 2012 at 6:16 PM, Andy Lutomirski <luto@amacapital.net> wrote:
> On Fri, Dec 21, 2012 at 5:59 PM, Michel Lespinasse <walken@google.com> wrote:
>> Could you share your test case so I can try reproducing the issue
>> you're seeing ?
>
> Not so easy.  My test case is a large chunk of a high-frequency
> trading system :)

Huh, it's probably better if I don't see it then :)

> I just tried it again.  Now I have a task stuck in
> mlockall(MCL_CURRENT|MCL_FUTURE).  The stack is:
>
> [<0000000000000000>] flush_work+0x1c2/0x280
> [<0000000000000000>] schedule_on_each_cpu+0xe3/0x130
> [<0000000000000000>] lru_add_drain_all+0x15/0x20
> [<0000000000000000>] sys_mlockall+0x125/0x1a0
> [<0000000000000000>] tracesys+0xd0/0xd5
> [<0000000000000000>] 0xffffffffffffffff
>
> The sequence of mmap and munmap calls, according to strace, is:
> [...]
> 6084 mmap(0x7f54fd02a000, 6776, PROT_READ|PROT_WRITE,
> MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7f54fd02a000

So I noticed you use mmap with a size that is not a multiple of
PAGE_SIZE.  This is perfectly legal, but I hadn't tested that case,
and lo and behold, it's something I got wrong.  Patch to be sent as a
reply to this.  Without this patch, vm_populate() will show a debug
message if you have CONFIG_DEBUG_VM set, and likely spin in an
infinite loop if you don't.
> 6084 mmap(NULL, 26258, PROT_READ, MAP_SHARED, 4, 0) = 0x7f5509f9d000
> 6084 mmap(NULL, 4096, PROT_READ|PROT_WRITE,
> MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f5509f9c000
> 6084 munmap(0x7f5509f9c000, 4096) = 0
> 6084 mlockall(MCL_CURRENT|MCL_FUTURE
>
> This task is unkillable.  Two other tasks are stuck spinning.

Now I'm confused, because:

1- your trace shows the hang occurs during mlockall(), and this code
   really wasn't touched much in my series (besides renaming
   do_mlock_pages into __mm_populate())

2- the backtrace above showed sys_mlockall() -> lru_add_drain_all(),
   which is the very beginning of mlockall(), before anything of
   importance happens (and in particular, before the MCL_FUTURE flag
   takes action).

So, I'm going to assume that it's one of the other spinning threads
that is breaking things.  If one of the spinning threads got stuck
within vm_populate(), this could even be explained by the bug I
mentioned above.

Could you check if the fix I'm going to send as a reply to this works
for you, and if not, where the two spinning threads are stuck?

-- 
Michel "Walken" Lespinasse
A program is never fully debugged until the last user dies.
* [PATCH 10/9] mm: make do_mmap_pgoff return populate as a size in bytes, not as a bool 2012-12-22 9:37 ` Michel Lespinasse @ 2012-12-22 9:45 ` Michel Lespinasse 2013-01-03 6:21 ` Rik van Riel 2013-01-03 18:54 ` Andy Lutomirski 2013-01-03 18:56 ` [PATCH 0/9] Avoid populating unbounded num of ptes with mmap_sem held Andy Lutomirski 1 sibling, 2 replies; 40+ messages in thread From: Michel Lespinasse @ 2012-12-22 9:45 UTC (permalink / raw) To: Andy Lutomirski Cc: Ingo Molnar, Al Viro, Hugh Dickins, Jorn_Engel, Rik van Riel, Andrew Morton, linux-mm, linux-kernel do_mmap_pgoff() rounds up the desired size to the next PAGE_SIZE multiple; however, there was no equivalent code in mm_populate(), which caused issues. This could be fixed by introducing the same rounding in mm_populate(); however, I think it's preferable to make do_mmap_pgoff() return populate as a size rather than as a boolean, so we don't have to duplicate the size rounding logic in mm_populate(). Signed-off-by: Michel Lespinasse <walken@google.com> --- fs/aio.c | 5 ++--- include/linux/mm.h | 2 +- ipc/shm.c | 4 ++-- mm/mmap.c | 6 +++--- mm/nommu.c | 4 ++-- mm/util.c | 6 +++--- 6 files changed, 13 insertions(+), 14 deletions(-) diff --git a/fs/aio.c b/fs/aio.c index 82eec7c7b4bb..064bfbe37566 100644 --- a/fs/aio.c +++ b/fs/aio.c @@ -101,9 +101,8 @@ static int aio_setup_ring(struct kioctx *ctx) struct aio_ring *ring; struct aio_ring_info *info = &ctx->ring_info; unsigned nr_events = ctx->max_reqs; - unsigned long size; + unsigned long size, populate; int nr_pages; - bool populate; /* Compensate for the ring buffer's head/tail overlap entry */ nr_events += 2; /* 1 is required, 2 for good luck */ @@ -150,7 +149,7 @@ static int aio_setup_ring(struct kioctx *ctx) return -EAGAIN; } if (populate) - mm_populate(info->mmap_base, info->mmap_size); + mm_populate(info->mmap_base, populate); ctx->user_id = info->mmap_base; diff --git a/include/linux/mm.h b/include/linux/mm.h index 77311274f0b5..5fe5b86908ab 100644 --- 
a/include/linux/mm.h +++ b/include/linux/mm.h @@ -1445,7 +1445,7 @@ extern unsigned long mmap_region(struct file *file, unsigned long addr, unsigned long len, vm_flags_t vm_flags, unsigned long pgoff); extern unsigned long do_mmap_pgoff(struct file *file, unsigned long addr, unsigned long len, unsigned long prot, unsigned long flags, - unsigned long pgoff, bool *populate); + unsigned long pgoff, unsigned long *populate); extern int do_munmap(struct mm_struct *, unsigned long, size_t); #ifdef CONFIG_MMU diff --git a/ipc/shm.c b/ipc/shm.c index ee2dde1f94d1..2171b3d43943 100644 --- a/ipc/shm.c +++ b/ipc/shm.c @@ -970,7 +970,7 @@ long do_shmat(int shmid, char __user *shmaddr, int shmflg, ulong *raddr, struct shm_file_data *sfd; struct path path; fmode_t f_mode; - bool populate = false; + unsigned long populate = 0; err = -EINVAL; if (shmid < 0) @@ -1077,7 +1077,7 @@ long do_shmat(int shmid, char __user *shmaddr, int shmflg, ulong *raddr, invalid: up_write(&current->mm->mmap_sem); if (populate) - mm_populate(addr, size); + mm_populate(addr, populate); out_fput: fput(file); diff --git a/mm/mmap.c b/mm/mmap.c index 27f98850fa8c..0baf8a42b822 100644 --- a/mm/mmap.c +++ b/mm/mmap.c @@ -1010,13 +1010,13 @@ static inline unsigned long round_hint_to_min(unsigned long hint) unsigned long do_mmap_pgoff(struct file *file, unsigned long addr, unsigned long len, unsigned long prot, unsigned long flags, unsigned long pgoff, - bool *populate) + unsigned long *populate) { struct mm_struct * mm = current->mm; struct inode *inode; vm_flags_t vm_flags; - *populate = false; + *populate = 0; /* * Does the application expect PROT_READ to imply PROT_EXEC? 
@@ -1154,7 +1154,7 @@ unsigned long do_mmap_pgoff(struct file *file, unsigned long addr, addr = mmap_region(file, addr, len, vm_flags, pgoff); if (!IS_ERR_VALUE(addr) && (vm_flags & VM_POPULATE)) - *populate = true; + *populate = len; return addr; } diff --git a/mm/nommu.c b/mm/nommu.c index d7690b97b81d..0622fb9d083a 100644 --- a/mm/nommu.c +++ b/mm/nommu.c @@ -1235,7 +1235,7 @@ unsigned long do_mmap_pgoff(struct file *file, unsigned long prot, unsigned long flags, unsigned long pgoff, - bool *populate) + unsigned long *populate) { struct vm_area_struct *vma; struct vm_region *region; @@ -1245,7 +1245,7 @@ unsigned long do_mmap_pgoff(struct file *file, kenter(",%lx,%lx,%lx,%lx,%lx", addr, len, prot, flags, pgoff); - *populate = false; + *populate = 0; /* decide whether we should attempt the mapping, and if so what sort of * mapping */ diff --git a/mm/util.c b/mm/util.c index 44f006ac2ccd..463ef33b32fa 100644 --- a/mm/util.c +++ b/mm/util.c @@ -355,7 +355,7 @@ unsigned long vm_mmap_pgoff(struct file *file, unsigned long addr, { unsigned long ret; struct mm_struct *mm = current->mm; - bool populate; + unsigned long populate; ret = security_mmap_file(file, prot, flag); if (!ret) { @@ -363,8 +363,8 @@ unsigned long vm_mmap_pgoff(struct file *file, unsigned long addr, ret = do_mmap_pgoff(file, addr, len, prot, flag, pgoff, &populate); up_write(&mm->mmap_sem); - if (!IS_ERR_VALUE(ret) && populate) - mm_populate(ret, len); + if (populate) + mm_populate(ret, populate); } return ret; } -- 1.7.7.3
* Re: [PATCH 10/9] mm: make do_mmap_pgoff return populate as a size in bytes, not as a bool 2012-12-22 9:45 ` [PATCH 10/9] mm: make do_mmap_pgoff return populate as a size in bytes, not as a bool Michel Lespinasse @ 2013-01-03 6:21 ` Rik van Riel 2013-01-03 18:54 ` Andy Lutomirski 1 sibling, 0 replies; 40+ messages in thread From: Rik van Riel @ 2013-01-03 6:21 UTC (permalink / raw) To: Michel Lespinasse Cc: Andy Lutomirski, Ingo Molnar, Al Viro, Hugh Dickins, Jorn_Engel, Andrew Morton, linux-mm, linux-kernel On 12/22/2012 04:45 AM, Michel Lespinasse wrote: > do_mmap_pgoff() rounds up the desired size to the next PAGE_SIZE multiple; > however, there was no equivalent code in mm_populate(), which caused issues. > > This could be fixed by introducing the same rounding in mm_populate(); > however, I think it's preferable to make do_mmap_pgoff() return populate > as a size rather than as a boolean, so we don't have to duplicate the > size rounding logic in mm_populate(). > > Signed-off-by: Michel Lespinasse <walken@google.com> Acked-by: Rik van Riel <riel@redhat.com> -- All rights reversed
* Re: [PATCH 10/9] mm: make do_mmap_pgoff return populate as a size in bytes, not as a bool 2012-12-22 9:45 ` [PATCH 10/9] mm: make do_mmap_pgoff return populate as a size in bytes, not as a bool Michel Lespinasse 2013-01-03 6:21 ` Rik van Riel @ 2013-01-03 18:54 ` Andy Lutomirski 1 sibling, 0 replies; 40+ messages in thread From: Andy Lutomirski @ 2013-01-03 18:54 UTC (permalink / raw) To: Michel Lespinasse Cc: Ingo Molnar, Al Viro, Hugh Dickins, Jorn_Engel, Rik van Riel, Andrew Morton, linux-mm, linux-kernel On Sat, Dec 22, 2012 at 1:45 AM, Michel Lespinasse <walken@google.com> wrote: > do_mmap_pgoff() rounds up the desired size to the next PAGE_SIZE multiple; > however, there was no equivalent code in mm_populate(), which caused issues. > > This could be fixed by introducing the same rounding in mm_populate(); > however, I think it's preferable to make do_mmap_pgoff() return populate > as a size rather than as a boolean, so we don't have to duplicate the > size rounding logic in mm_populate(). > > Signed-off-by: Michel Lespinasse <walken@google.com> This appears to fix my hangs. --Andy
* Re: [PATCH 0/9] Avoid populating unbounded num of ptes with mmap_sem held 2012-12-22 9:37 ` Michel Lespinasse 2012-12-22 9:45 ` [PATCH 10/9] mm: make do_mmap_pgoff return populate as a size in bytes, not as a bool Michel Lespinasse @ 2013-01-03 18:56 ` Andy Lutomirski 1 sibling, 0 replies; 40+ messages in thread From: Andy Lutomirski @ 2013-01-03 18:56 UTC (permalink / raw) To: Michel Lespinasse Cc: Ingo Molnar, Al Viro, Hugh Dickins, Jorn_Engel, Rik van Riel, Andrew Morton, linux-mm, linux-kernel On Sat, Dec 22, 2012 at 1:37 AM, Michel Lespinasse <walken@google.com> wrote: > On Fri, Dec 21, 2012 at 6:16 PM, Andy Lutomirski <luto@amacapital.net> wrote: >> On Fri, Dec 21, 2012 at 5:59 PM, Michel Lespinasse <walken@google.com> wrote: >>> Could you share your test case so I can try reproducing the issue >>> you're seeing ? >> >> Not so easy. My test case is a large chunk of a high-frequency >> trading system :) > > Huh, it's probably better if I don't see it then :) > >> I just tried it again. Now I have a task stuck in >> mlockall(MCL_CURRENT|MCL_FUTURE). The stack is: >> >> [<0000000000000000>] flush_work+0x1c2/0x280 >> [<0000000000000000>] schedule_on_each_cpu+0xe3/0x130 >> [<0000000000000000>] lru_add_drain_all+0x15/0x20 >> [<0000000000000000>] sys_mlockall+0x125/0x1a0 >> [<0000000000000000>] tracesys+0xd0/0xd5 >> [<0000000000000000>] 0xffffffffffffffff >> >> The sequence of mmap and munmap calls, according to strace, is: >> > [...] >> 6084 mmap(0x7f54fd02a000, 6776, PROT_READ|PROT_WRITE, >> MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7f54fd02a000 > > So I noticed you use mmap with a size that is not a multiple of > PAGE_SIZE. This is perfectly legal, but I hadn't tested that case, and > lo and behold, it's something I got wrong. Patch to be sent as a reply > to this. Without this patch, vm_populate() will show a debug message > if you have CONFIG_DEBUG_VM set, and likely spin in an infinite loop > if you don't. 
> >> 6084 mmap(NULL, 26258, PROT_READ, MAP_SHARED, 4, 0) = 0x7f5509f9d000 >> 6084 mmap(NULL, 4096, PROT_READ|PROT_WRITE, >> MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f5509f9c000 >> 6084 munmap(0x7f5509f9c000, 4096) = 0 >> 6084 mlockall(MCL_CURRENT|MCL_FUTURE >> >> This task is unkillable. Two other tasks are stuck spinning. > > Now I'm confused, because: > > 1- your trace shows the hang occurs during mlockall(), and this code > really wasn't touched much in my series (besides renaming > do_mlock_pages into __mm_populate()) > > 2- the backtrace above showed sys_mlockall() -> lru_add_drain_all(), > which is the very beginning of mlockall(), before anything of > importance happens (and in particular, before the MCL_FUTURE flag > takes action). So, I'm going to assume that it's one of the other > spinning threads that is breaking things. If one of the spinning > threads got stuck within vm_populate(), this could even be explained > by the bug I mentioned above. > > Could you check if the fix I'm going to send as a reply to this works > for you, and if not, where the two spinning threads are being stuck ? > It works. In case anyone cares, the whole series is Tested-by: Andy Lutomirski <luto@amacapital.net> I'll let you know if anything else breaks. I'll be pounding on a kernel with this patched in for the next couple of days, I expect. --Andy
* Re: [PATCH 0/9] Avoid populating unbounded num of ptes with mmap_sem held 2012-12-21 0:49 [PATCH 0/9] Avoid populating unbounded num of ptes with mmap_sem held Michel Lespinasse ` (10 preceding siblings ...) 2012-12-22 0:36 ` Andy Lutomirski @ 2013-01-04 18:16 ` Andy Lutomirski 2013-01-04 22:58 ` Michel Lespinasse 11 siblings, 1 reply; 40+ messages in thread From: Andy Lutomirski @ 2013-01-04 18:16 UTC (permalink / raw) To: Michel Lespinasse Cc: Ingo Molnar, Al Viro, Hugh Dickins, Jorn_Engel, Rik van Riel, Andrew Morton, linux-mm, linux-kernel On Thu, Dec 20, 2012 at 4:49 PM, Michel Lespinasse <walken@google.com> wrote: > We have many vma manipulation functions that are fast in the typical case, > but can optionally be instructed to populate an unbounded number of ptes > within the region they work on: > - mmap with MAP_POPULATE or MAP_LOCKED flags; > - remap_file_pages() with MAP_NONBLOCK not set or when working on a > VM_LOCKED vma; > - mmap_region() and all its wrappers when mlock(MCL_FUTURE) is in effect; > - brk() when mlock(MCL_FUTURE) is in effect. > > Current code handles these pte operations locally, while the surrounding > code has to hold the mmap_sem write side since it's manipulating vmas. > This means we're doing an unbounded amount of pte population work with > mmap_sem held, and this causes problems as Andy Lutomirski reported > (we've hit this at Google as well, though it's not entirely clear why > people keep trying to use mlock(MCL_FUTURE) in the first place). > > I propose introducing a new mm_populate() function to do this pte > population work after the mmap_sem has been released. mm_populate() > does need to acquire the mmap_sem read side, but critically, it > doesn't need to hold it continuously for the entire duration of the > operation - it can drop it whenever things take too long (such as when > hitting disk for a file read) and re-acquire it later on. 
> I still have quite a few instances of 2-6 ms of latency due to "call_rwsem_down_read_failed __do_page_fault do_page_fault page_fault". Any idea why? I don't know any great way to figure out who is holding mmap_sem at the time. Given what my code is doing, I suspect the contention is due to mmap or munmap on a file. MCL_FUTURE is set, and MAP_POPULATE is not set. It could be the other thread calling mmap and getting preempted (or otherwise calling schedule()). Grr. --Andy
* Re: [PATCH 0/9] Avoid populating unbounded num of ptes with mmap_sem held 2013-01-04 18:16 ` Andy Lutomirski @ 2013-01-04 22:58 ` Michel Lespinasse 0 siblings, 0 replies; 40+ messages in thread From: Michel Lespinasse @ 2013-01-04 22:58 UTC (permalink / raw) To: Andy Lutomirski Cc: Ingo Molnar, Al Viro, Hugh Dickins, Jorn_Engel, Rik van Riel, Andrew Morton, linux-mm, linux-kernel On Fri, Jan 4, 2013 at 10:16 AM, Andy Lutomirski <luto@amacapital.net> wrote: > I still have quite a few instances of 2-6 ms of latency due to > "call_rwsem_down_read_failed __do_page_fault do_page_fault > page_fault". Any idea why? I don't know any great way to figure out > who is holding mmap_sem at the time. Given what my code is doing, I > suspect the contention is due to mmap or munmap on a file. MCL_FUTURE > is set, and MAP_POPULATE is not set. > > It could be the other thread calling mmap and getting preempted (or > otherwise calling schedule()). Grr. The simplest way to find out who's holding the lock too long might be to enable CONFIG_LOCK_STAT. This will slow things down a little, but give you lots of useful information including which threads hold mmap_sem the longest and the call stack for where they grab it from. See Documentation/lockstat.txt I think munmap is a likely culprit, as it still happens with mmap_sem held for write (I do plan to go work on this next). But it's hard to be sure without lockstats :) -- Michel "Walken" Lespinasse A program is never fully debugged until the last user dies.
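As a procedural sketch of Michel's suggestion (assumes a kernel built with CONFIG_LOCK_STAT=y; the interface is described in Documentation/lockstat.txt, and these commands need root):

```shell
# Reset the accumulated lock statistics before the run of interest.
echo 0 > /proc/lock_stat

# ... run the latency-sensitive workload here ...

# Show per-lock contention data for mmap_sem: wait/hold times and the
# call sites that acquired it, which identifies the long holders.
grep -A 10 mmap_sem /proc/lock_stat
```

The hold-time maxima and the acquisition call stacks in the output are what distinguish a slow munmap() writer from, say, a populate path that is still holding the semaphore too long.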