* [RFC] mlock: release mmap_sem every 256 faulted pages
@ 2010-11-23 5:00 Michel Lespinasse
2010-11-23 5:57 ` Andrew Morton
` (2 more replies)
0 siblings, 3 replies; 7+ messages in thread
From: Michel Lespinasse @ 2010-11-23 5:00 UTC (permalink / raw)
To: linux-mm
Cc: linux-kernel, Andrew Morton, Hugh Dickins, KOSAKI Motohiro,
Nick Piggin, Rik van Riel
Hi,
I'd like to solicit comments on this proposal:
Currently mlock() holds mmap_sem in exclusive mode while the pages get
faulted in. In the case of a large mlock, this can potentially take a
very long time.
I propose that mlock() could release mmap_sem after the VM_LOCKED bits
have been set in all appropriate VMAs. Then a second pass could be done
to actually mlock the pages, in small batches, never holding mmap_sem
for longer than it takes to process one single batch. We need to recheck
the vma flags whenever we re-acquire mmap_sem, but this is not difficult.
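
In rough terms, the second pass described above amounts to a loop of the
following shape. This is only a simplified illustration of what the patch
below implements; BATCH_PAGES is a placeholder (256 pages in this RFC), and
the special-vma and error handling are left out:

static void mlock_second_pass(struct mm_struct *mm,
			      unsigned long start, unsigned long end)
{
	unsigned long nstart, nend;
	struct vm_area_struct *vma;

	for (nstart = start; nstart < end; nstart = nend) {
		down_read(&mm->mmap_sem);
		nend = min(end, nstart + BATCH_PAGES * PAGE_SIZE);
		vma = find_vma_intersection(mm, nstart, nend);
		/* re-check the flags: another thread may have munlocked */
		if (vma && (vma->vm_flags & VM_LOCKED))
			__mlock_vma_pages_range(vma,
					max(nstart, vma->vm_start),
					min(nend, vma->vm_end));
		up_read(&mm->mmap_sem);
	}
}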
This is only an RFC rather than an actual submission, as I think this
could / should be completed to handle more than the mlock() and
mlockall() cases (there are many call sites to mlock_vma_pages_range()
that should ideally be converted as well), and maybe use the fault retry
mechanism to drop mmap_sem when blocking on disk access rather than
using an arbitrary page batch size.
Patch is against v2.6.36, but should apply to linus tree too.
------------------------------- 8< -----------------------------
Let mlock / mlockall release mmap_sem after the vmas have been marked
as VM_LOCKED. Then, mark the vmas as mlocked in small batches.
For every batch, we need to grab mmap_sem in read mode, check that the
vma has not been munlocked, and mlock the pages.
In the case where a vma has been munlocked before mlock completes,
pages that were already marked as PageMlocked() are handled by the
munlock() call, and mlock() is careful to not mark new page batches
as PageMlocked() after the munlock() call has cleared the VM_LOCKED
vma flags. So, the end result will be identical to what'd happen if
munlock() had executed after the mlock() call.
Signed-off-by: Michel Lespinasse <walken@google.com>
---
mm/mlock.c | 79 +++++++++++++++++++++++++++++++++++++++++++++++------------
1 files changed, 63 insertions(+), 16 deletions(-)
diff --git a/mm/mlock.c b/mm/mlock.c
index b70919c..0aa4df5 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -373,17 +373,11 @@ static int mlock_fixup(struct vm_area_struct *vma, struct vm_area_struct **prev,
int lock = newflags & VM_LOCKED;
if (newflags == vma->vm_flags ||
- (vma->vm_flags & (VM_IO | VM_PFNMAP)))
+ (vma->vm_flags & (VM_IO | VM_PFNMAP |
+ VM_DONTEXPAND | VM_RESERVED)) ||
+ is_vm_hugetlb_page(vma) || vma == get_gate_vma(current))
goto out; /* don't set VM_LOCKED, don't count */
- if ((vma->vm_flags & (VM_DONTEXPAND | VM_RESERVED)) ||
- is_vm_hugetlb_page(vma) ||
- vma == get_gate_vma(current)) {
- if (lock)
- make_pages_present(start, end);
- goto out; /* don't set VM_LOCKED, don't count */
- }
-
pgoff = vma->vm_pgoff + ((start - vma->vm_start) >> PAGE_SHIFT);
*prev = vma_merge(mm, *prev, start, end, newflags, vma->anon_vma,
vma->vm_file, pgoff, vma_policy(vma));
@@ -419,14 +413,10 @@ success:
* set VM_LOCKED, __mlock_vma_pages_range will bring it back.
*/
- if (lock) {
+ if (lock)
vma->vm_flags = newflags;
- ret = __mlock_vma_pages_range(vma, start, end);
- if (ret < 0)
- ret = __mlock_posix_error_return(ret);
- } else {
+ else
munlock_vma_pages_range(vma, start, end);
- }
out:
*prev = vma;
@@ -439,7 +429,8 @@ static int do_mlock(unsigned long start, size_t len, int on)
struct vm_area_struct * vma, * prev;
int error;
- len = PAGE_ALIGN(len);
+ VM_BUG_ON(start & ~PAGE_MASK);
+ VM_BUG_ON(len != PAGE_ALIGN(len));
end = start + len;
if (end < start)
return -EINVAL;
@@ -482,6 +473,58 @@ static int do_mlock(unsigned long start, size_t len, int on)
return error;
}
+static int do_mlock_pages(unsigned long start, size_t len)
+{
+ struct mm_struct *mm = current->mm;
+ unsigned long end, nstart, nend, nfault;
+ struct vm_area_struct *vma;
+ int error = 0;
+
+ VM_BUG_ON(start & ~PAGE_MASK);
+ VM_BUG_ON(len != PAGE_ALIGN(len));
+ end = start + len;
+
+ for (nstart = start; nstart < end; nstart = nend) {
+ down_read(&mm->mmap_sem);
+ nend = end;
+ vma = find_vma_intersection(mm, nstart, nend);
+ if (!vma)
+ goto up;
+ if (vma->vm_end < nend)
+ nend = vma->vm_end;
+ if (vma->vm_flags & (VM_IO | VM_PFNMAP))
+ goto up;
+ if (nstart < vma->vm_start)
+ nstart = vma->vm_start;
+
+ /*
+ * Limit batch size to 256 pages in order to reduce
+ * mmap_sem hold time.
+ */
+ nfault = nstart + 256 * PAGE_SIZE;
+
+ /*
+ * Now fault in a batch of pages. We need to check the vma
+ * flags again, as we've not been holding mmap_sem.
+ */
+ if ((vma->vm_flags & (VM_DONTEXPAND | VM_RESERVED)) ||
+ is_vm_hugetlb_page(vma) || vma == get_gate_vma(current)) {
+ if (nfault < nend)
+ nend = nfault;
+ make_pages_present(nstart, nend);
+ } else if (vma->vm_flags & VM_LOCKED) {
+ if (nfault < nend)
+ nend = nfault;
+ error = __mlock_vma_pages_range(vma, nstart, nend);
+ }
+ up:
+ up_read(&mm->mmap_sem);
+ if (error)
+ return __mlock_posix_error_return(error);
+ }
+ return 0;
+}
+
SYSCALL_DEFINE2(mlock, unsigned long, start, size_t, len)
{
unsigned long locked;
@@ -507,6 +550,8 @@ SYSCALL_DEFINE2(mlock, unsigned long, start, size_t, len)
if ((locked <= lock_limit) || capable(CAP_IPC_LOCK))
error = do_mlock(start, len, 1);
up_write(&current->mm->mmap_sem);
+ if (!error)
+ error = do_mlock_pages(start, len);
return error;
}
@@ -571,6 +616,8 @@ SYSCALL_DEFINE1(mlockall, int, flags)
capable(CAP_IPC_LOCK))
ret = do_mlockall(flags);
up_write(&current->mm->mmap_sem);
+ if (!ret && (flags & MCL_CURRENT))
+ ret = do_mlock_pages(0, TASK_SIZE);
out:
return ret;
}
--
Michel "Walken" Lespinasse
A program is never fully debugged until the last user dies.

* Re: [RFC] mlock: release mmap_sem every 256 faulted pages
  2 siblings, 2 replies; 7+ messages in thread
From: Andrew Morton @ 2010-11-23 5:57 UTC (permalink / raw)
To: Michel Lespinasse
Cc: linux-mm, linux-kernel, Hugh Dickins, KOSAKI Motohiro,
Nick Piggin, Rik van Riel

On Mon, 22 Nov 2010 21:00:52 -0800 Michel Lespinasse <walken@google.com> wrote:

> Hi,
>
> I'd like to solicit comments on this proposal:
>
> Currently mlock() holds mmap_sem in exclusive mode while the pages get
> faulted in. In the case of a large mlock, this can potentially take a
> very long time.

A more compelling description of why this problem needs addressing
would help things along.

> + /*
> + * Limit batch size to 256 pages in order to reduce
> + * mmap_sem hold time.
> + */
> + nfault = nstart + 256 * PAGE_SIZE;

It would be nicer if there was an rwsem API to ask if anyone is
currently blocked in down_read() or down_write(). That wouldn't be too
hard to do. It wouldn't detect people polling down_read_trylock() or
down_write_trylock() though.
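
Concretely, Andrew's idea would let the second pass drop its fixed 256-page
limit and instead keep the lock only while nobody else is waiting for it. A
minimal sketch of the inner loop, assuming a helper along the lines of
rwsem_is_contended() (no such helper existed in mainline at the time of this
thread; an equivalent was merged much later), with error handling elided:

		down_read(&mm->mmap_sem);
		do {
			/* fault in one small chunk at a time */
			nend = min(end, nstart + 16 * PAGE_SIZE);
			vma = find_vma_intersection(mm, nstart, nend);
			/* re-check the flags: mmap_sem was dropped in between */
			if (vma && (vma->vm_flags & VM_LOCKED))
				__mlock_vma_pages_range(vma,
						max(nstart, vma->vm_start),
						min(nend, vma->vm_end));
			nstart = nend;
			/* hold the lock only while nobody is queued on it */
		} while (nstart < end && !rwsem_is_contended(&mm->mmap_sem));
		up_read(&mm->mmap_sem);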

* Re: [RFC] mlock: release mmap_sem every 256 faulted pages
  1 sibling, 0 replies; 7+ messages in thread
From: KOSAKI Motohiro @ 2010-11-23 7:49 UTC (permalink / raw)
To: Andrew Morton
Cc: kosaki.motohiro, Michel Lespinasse, linux-mm, linux-kernel,
Hugh Dickins, Nick Piggin, Rik van Riel, Michael Rubin

> On Mon, 22 Nov 2010 21:00:52 -0800 Michel Lespinasse <walken@google.com> wrote:
>
> > Hi,
> >
> > I'd like to solicit comments on this proposal:
> >
> > Currently mlock() holds mmap_sem in exclusive mode while the pages get
> > faulted in. In the case of a large mlock, this can potentially take a
> > very long time.
>
> A more compelling description of why this problem needs addressing
> would help things along.

Michel, as far as I know, Michael Rubin (now I'm ccing him) is trying
to make an automatic MM test suite. So, if possible, can you please
make a test case which reproduces your workload?

	http://code.google.com/p/samplergrapher/

I hope to join in solving your issue, and I also hope you can help us
understand and reproduce it. Thanks.

>
> > + /*
> > + * Limit batch size to 256 pages in order to reduce
> > + * mmap_sem hold time.
> > + */
> > + nfault = nstart + 256 * PAGE_SIZE;
>
> It would be nicer if there was an rwsem API to ask if anyone is
> currently blocked in down_read() or down_write(). That wouldn't be too
> hard to do. It wouldn't detect people polling down_read_trylock() or
> down_write_trylock() though.

Andrew, yes, that would certainly be optimal. But I doubt it would
improve mlock performance much, because mlock is a _very_ slooooooow
syscall; lock re-grabbing may be cheap compared to it. So, _if_ you can
allow it, I hope we take the simple method first. Personally, I think
Michel is moving in the right direction, so I don't want to put up the
hardest hurdle.

Thanks.
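
As a rough idea of the kind of reproducer being asked for above, a minimal
userspace sketch follows (not taken from the thread; the mapping size and
sleep are arbitrary, the program needs -pthread, and RLIMIT_MEMLOCK must be
raised for mlock() to succeed). One thread mlock()s a large mapping while
the main thread times how long a read of /proc/self/maps, which needs
mmap_sem for reading, gets stuck behind it; a cold file-backed mapping makes
the stall far longer than anonymous memory does.

#define _GNU_SOURCE
#include <pthread.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/time.h>
#include <fcntl.h>
#include <unistd.h>

#define MAP_LEN (1UL << 30)	/* 1 GB; adjust to taste */

static void *locker(void *p)
{
	/* faults in every page while holding mmap_sem for writing */
	if (mlock(p, MAP_LEN))
		perror("mlock");
	return NULL;
}

int main(void)
{
	void *p = mmap(NULL, MAP_LEN, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	pthread_t th;
	struct timeval t0, t1;
	char buf[4096];
	int fd;

	if (p == MAP_FAILED)
		return 1;
	pthread_create(&th, NULL, locker, p);
	usleep(10000);			/* let the mlock get going */

	gettimeofday(&t0, NULL);
	fd = open("/proc/self/maps", O_RDONLY);
	while (read(fd, buf, sizeof(buf)) > 0)
		;			/* blocks until mlock drops mmap_sem */
	close(fd);
	gettimeofday(&t1, NULL);

	printf("maps read took %.3f s\n",
	       (t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec) / 1e6);
	pthread_join(th, NULL);
	return 0;
}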

* Re: [RFC] mlock: release mmap_sem every 256 faulted pages
  1 sibling, 1 reply; 7+ messages in thread
From: Michel Lespinasse @ 2010-11-23 20:55 UTC (permalink / raw)
To: Andrew Morton
Cc: linux-mm, linux-kernel, Hugh Dickins, KOSAKI Motohiro,
Nick Piggin, Rik van Riel

On Mon, Nov 22, 2010 at 9:57 PM, Andrew Morton <akpm@linux-foundation.org> wrote:
> On Mon, 22 Nov 2010 21:00:52 -0800 Michel Lespinasse <walken@google.com> wrote:
>> I'd like to solicit comments on this proposal:
>>
>> Currently mlock() holds mmap_sem in exclusive mode while the pages get
>> faulted in. In the case of a large mlock, this can potentially take a
>> very long time.
>
> A more compelling description of why this problem needs addressing
> would help things along.

Oh my. It's probably not too useful for desktops, where such large
mlocks are hopefully uncommon.

At Google we have many applications that serve data from memory and
don't want to allow for disk latencies. Some of the simpler ones use
mlock (though there are other ways - anon memory running with swap
disabled is a surprisingly popular choice).

Kosaki is also showing interest in mlock, though I'm not sure what his
use case is.

Due to the large scope of mmap_sem, there are many things that may
block while mlock() runs. If there are other threads running (and most
of our programs are threaded from an early point in their execution),
the threads might block on a page fault that needs to acquire mmap_sem.
Also, various files such as /proc/pid/maps stop working. This is a
problem for us because our cluster software can't monitor what's going
on with that process - not by talking to it, as the required threads
might block, nor by looking at it through /proc files.

A separate, personal interest is that I'm still carrying the
(admittedly poor-taste) down_read_unfair() patches internally, and I
would be able to drop them if only long mmap_sem hold times could be
eliminated.

>> + /*
>> + * Limit batch size to 256 pages in order to reduce
>> + * mmap_sem hold time.
>> + */
>> + nfault = nstart + 256 * PAGE_SIZE;
>
> It would be nicer if there was an rwsem API to ask if anyone is
> currently blocked in down_read() or down_write(). That wouldn't be too
> hard to do. It wouldn't detect people polling down_read_trylock() or
> down_write_trylock() though.

I can do that. I actually thought about it myself, but then dismissed
it as too fancy for version 1. The only problem is that this would go
into per-architecture files which I can't test. But I wouldn't have to
actually write asm, so this may be OK.

down_read_trylock() is no problem, as these calls will succeed unless
there is a queued writer, which we can easily detect.
down_write_trylock() is seldom used; the only caller I could find for
mmap_sem is drivers/infiniband/core/umem.c, and it'll do a regular
down_write() soon enough if the initial try fails.

-- 
Michel "Walken" Lespinasse
A program is never fully debugged until the last user dies.

* Re: [RFC] mlock: release mmap_sem every 256 faulted pages
  0 siblings, 0 replies; 7+ messages in thread
From: KOSAKI Motohiro @ 2010-11-23 23:57 UTC (permalink / raw)
To: Michel Lespinasse
Cc: kosaki.motohiro, Andrew Morton, linux-mm, linux-kernel,
Hugh Dickins, Nick Piggin, Rik van Riel

> > A more compelling description of why this problem needs addressing
> > would help things along.
>
> Oh my. It's probably not too useful for desktops, where such large
> mlocks are hopefully uncommon.
>
> At Google we have many applications that serve data from memory and
> don't want to allow for disk latencies. Some of the simpler ones use
> mlock (though there are other ways - anon memory running with swap
> disabled is a surprisingly popular choice).
>
> Kosaki is also showing interest in mlock, though I'm not sure what his
> use case is.

I don't have any solid use case. Usually server apps only mlock
anonymous memory. But I haven't found any negative effect in your
proposal, so I hope to help your effort, as I always do when a
proposal has no negative impact.

* Re: [RFC] mlock: release mmap_sem every 256 faulted pages
  2 siblings, 0 replies; 7+ messages in thread
From: KOSAKI Motohiro @ 2010-12-01 0:14 UTC (permalink / raw)
To: Michel Lespinasse
Cc: kosaki.motohiro, linux-mm, linux-kernel, Andrew Morton,
Hugh Dickins, Nick Piggin, Rik van Riel

Hi,

Sorry for the delay. I reviewed this one.

> Hi,
>
> I'd like to solicit comments on this proposal:
>
> Currently mlock() holds mmap_sem in exclusive mode while the pages get
> faulted in. In the case of a large mlock, this can potentially take a
> very long time.
>
> I propose that mlock() could release mmap_sem after the VM_LOCKED bits
> have been set in all appropriate VMAs. Then a second pass could be done
> to actually mlock the pages, in small batches, never holding mmap_sem
> for longer than it takes to process one single batch. We need to recheck
> the vma flags whenever we re-acquire mmap_sem, but this is not difficult.
>
> This is only an RFC rather than an actual submission, as I think this
> could / should be completed to handle more than the mlock() and
> mlockall() cases (there are many call sites to mlock_vma_pages_range()
> that should ideally be converted as well), and maybe use the fault retry
> mechanism to drop mmap_sem when blocking on disk access rather than
> using an arbitrary page batch size.
>
> Patch is against v2.6.36, but should apply to linus tree too.
>
> ------------------------------- 8< -----------------------------
>
> Let mlock / mlockall release mmap_sem after the vmas have been marked
> as VM_LOCKED. Then, mark the vmas as mlocked in small batches.
> For every batch, we need to grab mmap_sem in read mode, check that the
> vma has not been munlocked, and mlock the pages.
>
> In the case where a vma has been munlocked before mlock completes,
> pages that were already marked as PageMlocked() are handled by the
> munlock() call, and mlock() is careful to not mark new page batches
> as PageMlocked() after the munlock() call has cleared the VM_LOCKED
> vma flags. So, the end result will be identical to what'd happen if
> munlock() had executed after the mlock() call.
>
> Signed-off-by: Michel Lespinasse <walken@google.com>

Looks good.
	Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>

Some nits follow.

> ---
> mm/mlock.c | 79 +++++++++++++++++++++++++++++++++++++++++++++++------------
> 1 files changed, 63 insertions(+), 16 deletions(-)
>
> diff --git a/mm/mlock.c b/mm/mlock.c
> index b70919c..0aa4df5 100644
> --- a/mm/mlock.c
> +++ b/mm/mlock.c
> @@ -373,17 +373,11 @@ static int mlock_fixup(struct vm_area_struct *vma, struct vm_area_struct **prev,
> int lock = newflags & VM_LOCKED;
>
> if (newflags == vma->vm_flags ||
> - (vma->vm_flags & (VM_IO | VM_PFNMAP)))
> + (vma->vm_flags & (VM_IO | VM_PFNMAP |
> + VM_DONTEXPAND | VM_RESERVED)) ||
> + is_vm_hugetlb_page(vma) || vma == get_gate_vma(current))
> goto out; /* don't set VM_LOCKED, don't count */
>
> - if ((vma->vm_flags & (VM_DONTEXPAND | VM_RESERVED)) ||
> - is_vm_hugetlb_page(vma) ||
> - vma == get_gate_vma(current)) {
> - if (lock)
> - make_pages_present(start, end);
> - goto out; /* don't set VM_LOCKED, don't count */
> - }
> -
> pgoff = vma->vm_pgoff + ((start - vma->vm_start) >> PAGE_SHIFT);
> *prev = vma_merge(mm, *prev, start, end, newflags, vma->anon_vma,
> vma->vm_file, pgoff, vma_policy(vma));
> @@ -419,14 +413,10 @@ success:
> * set VM_LOCKED, __mlock_vma_pages_range will bring it back.
> */
>
> - if (lock) {
> + if (lock)
> vma->vm_flags = newflags;
> - ret = __mlock_vma_pages_range(vma, start, end);
> - if (ret < 0)
> - ret = __mlock_posix_error_return(ret);
> - } else {
> + else
> munlock_vma_pages_range(vma, start, end);
> - }
>
> out:
> *prev = vma;
> @@ -439,7 +429,8 @@ static int do_mlock(unsigned long start, size_t len, int on)
> struct vm_area_struct * vma, * prev;
> int error;
>
> - len = PAGE_ALIGN(len);
> + VM_BUG_ON(start & ~PAGE_MASK);
> + VM_BUG_ON(len != PAGE_ALIGN(len));

Good cleanup, but please separate it out; I think this change is
unrelated to the 256-page batch concept.

> end = start + len;
> if (end < start)
> return -EINVAL;
> @@ -482,6 +473,58 @@ static int do_mlock(unsigned long start, size_t len, int on)
> return error;
> }
>
> +static int do_mlock_pages(unsigned long start, size_t len)
> +{
> + struct mm_struct *mm = current->mm;
> + unsigned long end, nstart, nend, nfault;
> + struct vm_area_struct *vma;
> + int error = 0;
> +
> + VM_BUG_ON(start & ~PAGE_MASK);
> + VM_BUG_ON(len != PAGE_ALIGN(len));
> + end = start + len;
> +
> + for (nstart = start; nstart < end; nstart = nend) {
> + down_read(&mm->mmap_sem);
> + nend = end;
> + vma = find_vma_intersection(mm, nstart, nend);
> + if (!vma)
> + goto up;

Every exception-check branch needs a comment.

> + if (vma->vm_end < nend)
> + nend = vma->vm_end;
> + if (vma->vm_flags & (VM_IO | VM_PFNMAP))
> + goto up;

vm_flags should be checked first, because then we don't need the nend
adjustment.

> + if (nstart < vma->vm_start)
> + nstart = vma->vm_start;
> +
> + /*
> + * Limit batch size to 256 pages in order to reduce
> + * mmap_sem hold time.
> + */
> + nfault = nstart + 256 * PAGE_SIZE;

We don't need the nfault variable, maybe.

> +
> + /*
> + * Now fault in a batch of pages. We need to check the vma
> + * flags again, as we've not been holding mmap_sem.
> + */
> + if ((vma->vm_flags & (VM_DONTEXPAND | VM_RESERVED)) ||
> + is_vm_hugetlb_page(vma) || vma == get_gate_vma(current)) {
> + if (nfault < nend)
> + nend = nfault;
> + make_pages_present(nstart, nend);
> + } else if (vma->vm_flags & VM_LOCKED) {
> + if (nfault < nend)
> + nend = nfault;

Both branches have the same "nfault < nend" check. We can unify it.

> + error = __mlock_vma_pages_range(vma, nstart, nend);
> + }
> + up:
> + up_read(&mm->mmap_sem);
> + if (error)
> + return __mlock_posix_error_return(error);

Now __mlock_posix_error_return() can be moved into
__mlock_vma_pages_range().

> + }
> + return 0;
> +}
> +
> SYSCALL_DEFINE2(mlock, unsigned long, start, size_t, len)
> {
> unsigned long locked;
> @@ -507,6 +550,8 @@ SYSCALL_DEFINE2(mlock, unsigned long, start, size_t, len)
> if ((locked <= lock_limit) || capable(CAP_IPC_LOCK))
> error = do_mlock(start, len, 1);
> up_write(&current->mm->mmap_sem);
> + if (!error)
> + error = do_mlock_pages(start, len);
> return error;
> }
>
> @@ -571,6 +616,8 @@ SYSCALL_DEFINE1(mlockall, int, flags)
> capable(CAP_IPC_LOCK))
> ret = do_mlockall(flags);
> up_write(&current->mm->mmap_sem);
> + if (!ret && (flags & MCL_CURRENT))
> + ret = do_mlock_pages(0, TASK_SIZE);
> out:
> return ret;
> }

So, I'm waiting for your test case. The code below is only for
explanation - you can ignore it. I don't mean that your patch doesn't
work.

Thanks.

From e4253f161d9c17ba67cd9a445c8b8f556f87ad8c Mon Sep 17 00:00:00 2001
From: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Date: Sat, 25 Dec 2010 04:12:21 +0900
Subject: [PATCH] cleanup

Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
---
 mm/mlock.c | 67 ++++++++++++++++++++++++++++++++---------------------------
 1 files changed, 36 insertions(+), 31 deletions(-)

diff --git a/mm/mlock.c b/mm/mlock.c
index 0aa4df5..e44b7ac 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -142,6 +142,18 @@ static inline int stack_guard_page(struct vm_area_struct *vma, unsigned long add
 !vma_stack_continue(vma->vm_prev, addr);
 }

+/*
+ * convert get_user_pages() return value to posix mlock() error
+ */
+static int __mlock_posix_error_return(long retval)
+{
+ if (retval == -EFAULT)
+ retval = -ENOMEM;
+ else if (retval == -ENOMEM)
+ retval = -EAGAIN;
+ return retval;
+}
+
 /**
 * __mlock_vma_pages_range() - mlock a range of pages in the vma.
 * @vma: target vma
@@ -236,19 +248,7 @@ static long __mlock_vma_pages_range(struct vm_area_struct *vma,
 ret = 0;
 }

- return ret; /* 0 or negative error code */
-}
-
-/*
- * convert get_user_pages() return value to posix mlock() error
- */
-static int __mlock_posix_error_return(long retval)
-{
- if (retval == -EFAULT)
- retval = -ENOMEM;
- else if (retval == -ENOMEM)
- retval = -EAGAIN;
- return retval;
+ return __mlock_posix_error_return(ret);
 }

 /**
@@ -476,53 +476,58 @@ static int do_mlock(unsigned long start, size_t len, int on)
 static int do_mlock_pages(unsigned long start, size_t len)
 {
 struct mm_struct *mm = current->mm;
- unsigned long end, nstart, nend, nfault;
+ unsigned long end, nstart, nend;
 struct vm_area_struct *vma;
 int error = 0;

 VM_BUG_ON(start & ~PAGE_MASK);
 VM_BUG_ON(len != PAGE_ALIGN(len));
- end = start + len;
+ end = nend = start + len;

- for (nstart = start; nstart < end; nstart = nend) {
+ for (nstart = start; nstart < end; nstart = nend, nend = end) {
 down_read(&mm->mmap_sem);
- nend = end;
 vma = find_vma_intersection(mm, nstart, nend);
+
+ /* VMA gone. We don't need anything. */
 if (!vma)
 goto up;
+
+ /* IO mapped area can't be populated. skip it. */
+ if (vma->vm_flags & (VM_IO | VM_PFNMAP)) {
 nend = vma->vm_end;
- if (vma->vm_flags & (VM_IO | VM_PFNMAP))
 goto up;
- if (nstart < vma->vm_start)
- nstart = vma->vm_start;
+ }

 /*
 * Limit batch size to 256 pages in order to reduce
 * mmap_sem hold time.
 */
- nfault = nstart + 256 * PAGE_SIZE;
+ nend = nstart + 256 * PAGE_SIZE;
+ if (nstart < vma->vm_start)
+ nstart = vma->vm_start;
+ if (vma->vm_end < nend)
+ nend = vma->vm_end;

 /*
 * Now fault in a batch of pages. We need to check the vma
 * flags again, as we've not been holding mmap_sem.
 */
 if ((vma->vm_flags & (VM_DONTEXPAND | VM_RESERVED)) ||
- is_vm_hugetlb_page(vma) || vma == get_gate_vma(current)) {
- if (nfault < nend)
- nend = nfault;
+ is_vm_hugetlb_page(vma) || vma == get_gate_vma(current))
+ /* The vma has special attributes. Don't use PG_mlocked. */
 make_pages_present(nstart, nend);
- } else if (vma->vm_flags & VM_LOCKED) {
- if (nfault < nend)
- nend = nfault;
+ else if (vma->vm_flags & VM_LOCKED)
 error = __mlock_vma_pages_range(vma, nstart, nend);
- }
+ else
+ /* VM_LOCKED has been lost. we can skip this vma. */
+ nend = vma->vm_end;
 up:
 up_read(&mm->mmap_sem);
 if (error)
- return __mlock_posix_error_return(error);
+ break;
 }
- return 0;
+
+ return error;
 }

 SYSCALL_DEFINE2(mlock, unsigned long, start, size_t, len)
--
1.6.5.2

* Re: [RFC] mlock: release mmap_sem every 256 faulted pages
  2 siblings, 0 replies; 7+ messages in thread
From: KOSAKI Motohiro @ 2010-12-01 4:42 UTC (permalink / raw)
To: Michel Lespinasse
Cc: kosaki.motohiro, linux-mm, linux-kernel, Andrew Morton,
Hugh Dickins, Nick Piggin, Rik van Riel

Minor additional comments.

> + /*
> + * Limit batch size to 256 pages in order to reduce
> + * mmap_sem hold time.
> + */
> + nfault = nstart + 256 * PAGE_SIZE;

You made a 256-page batch here, and __mlock_vma_pages_range() has
another 16-page batch internally. Can we unify these two batches?

Plus, PeterZ has implemented a mutex contention detection method (see
"[PATCH 18/21] mutex: Provide mutex_is_contended"), so you can now
easily implement the efficient batching akpm proposed.

Thanks.

> +
> + /*
> + * Now fault in a batch of pages. We need to check the vma
> + * flags again, as we've not been holding mmap_sem.
> + */
> + if ((vma->vm_flags & (VM_DONTEXPAND | VM_RESERVED)) ||
> + is_vm_hugetlb_page(vma) || vma == get_gate_vma(current)) {
> + if (nfault < nend)
> + nend = nfault;
> + make_pages_present(nstart, nend);
> + } else if (vma->vm_flags & VM_LOCKED) {
> + if (nfault < nend)
> + nend = nfault;
> + error = __mlock_vma_pages_range(vma, nstart, nend);
> + }
> + up:
> + up_read(&mm->mmap_sem);
> + if (error)
> + return __mlock_posix_error_return(error);
> + }
> + return 0;
> +}

end of thread, other threads: [~2010-12-01 4:42 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2010-11-23  5:00 [RFC] mlock: release mmap_sem every 256 faulted pages Michel Lespinasse
2010-11-23  5:57 ` Andrew Morton
2010-11-23  7:49   ` KOSAKI Motohiro
2010-11-23 20:55   ` Michel Lespinasse
2010-11-23 23:57     ` KOSAKI Motohiro
2010-12-01  0:14 ` KOSAKI Motohiro
2010-12-01  4:42 ` KOSAKI Motohiro