[patch 17/20] mlock vma pages under mmap_sem held for read

From: Rik van Riel <riel@redhat.com>
To: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org,
	KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>,
	Lee Schermerhorn <Lee.Schermerhorn@hp.com>,
	Lee Schermerhorn <lee.schermerhorn@hp.com>
Subject: [patch 17/20] mlock vma pages under mmap_sem held for read
Date: Tue, 04 Mar 2008 17:52:14 -0500
Message-ID: <20080304225227.847718844@redhat.com>
In-Reply-To: 20080304225157.573336066@redhat.com

[-- Attachment #1: noreclaim-04.1a-lock-vma-pages-under-read-lock.patch --]
[-- Type: text/plain, Size: 7254 bytes --]

V2 -> V3:
+ rebase to 23-mm1 atop RvR's split lru series [no change]
+ fix function return types [void -> int] to fix build when
  not configured.

New in V2.

We need to hold the mmap_sem for write to initiate mlock()/munlock()
because we may need to merge/split vmas.  However, this can lead to
very long lock hold times while faulting in a large memory region
to mlock it into memory.  This can hold off other faults against the
mm [multithreaded tasks] and other scans of the mm, such as via /proc.
To alleviate this, downgrade the mmap_sem to read mode while populating
the region for locking.  The hold times are especially long if we need
to reclaim memory to lock down the region.  We [probably?]
don't need to do this for unlocking as all of the pages should be
resident--they're already mlocked.

Now, the callers of the mlock functions [mlock_fixup() and
mlock_vma_pages_range()] expect the mmap_sem to be returned in write
mode.  Changing all callers appears to be way too much effort at this
point.  So, restore write mode before returning.  Note that this opens
a window where the mmap list could change in a multithreaded process.
So, at least for mlock_fixup(), where we could be called in a loop over
multiple vmas, we check that a vma still exists at the start address
and that the vma still covers the page range [start,end).  If not, we
return an error, -EAGAIN, and let the caller deal with it.

Return -EAGAIN from mlock_vma_pages_range() function and mlock_fixup()
if the vma at 'start' disappears or changes so that the page range
[start,end) is no longer contained in the vma.  Again, let the caller
deal with it.  Looks like only sys_remap_file_pages() [via mmap_region()]
should actually care.
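
For reference, the range recheck can be modelled outside the kernel.
The sketch below assumes a simplified, hypothetical struct vma_model
and an array sorted by address in place of the real vma list.
find_vma_model() follows the find_vma() lookup rule [first vma with
addr < vm_end, else NULL].  The extra start < vm_start comparison is an
addition of this sketch, not something the patch does; it covers the
case where 'start' now falls in a hole below the returned vma.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical, simplified vma: just the range [vm_start, vm_end). */
struct vma_model {
	unsigned long vm_start;
	unsigned long vm_end;
};

/*
 * Model of the find_vma() lookup rule over an array sorted by address:
 * return the first entry with addr < vm_end, or NULL if there is none.
 * The result may start above addr when addr falls in a hole.
 */
const struct vma_model *find_vma_model(const struct vma_model *vmas,
				       size_t n, unsigned long addr)
{
	for (size_t i = 0; i < n; i++)
		if (addr < vmas[i].vm_end)
			return &vmas[i];
	return NULL;
}

/* Recheck that one entry still covers [start, end); -EAGAIN if not. */
int recheck_range(const struct vma_model *vmas, size_t n,
		  unsigned long start, unsigned long end)
{
	const struct vma_model *vma = find_vma_model(vmas, n, start);

	if (!vma || start < vma->vm_start || end > vma->vm_end)
		return -EAGAIN;
	return 0;
}
```

A caller in this model would treat -EAGAIN as "the layout changed
while the lock was dropped" and decide for itself whether to retry or
give up, which matches the "let the caller deal with it" approach above.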

With this patch, I no longer see processes like ps(1) blocked for seconds
or minutes at a time waiting for a large [multiple gigabyte] region to be
locked down.  

Signed-off-by:  Lee Schermerhorn <lee.schermerhorn@hp.com>
Signed-off-by:  Rik van Riel <riel@redhat.com>

Index: linux-2.6.25-rc3-mm1/mm/mlock.c
===================================================================
--- linux-2.6.25-rc3-mm1.orig/mm/mlock.c	2008-03-04 16:19:46.000000000 -0500
+++ linux-2.6.25-rc3-mm1/mm/mlock.c	2008-03-04 17:29:19.000000000 -0500
@@ -199,6 +199,37 @@ int __mlock_vma_pages_range(struct vm_ar
 	return ret;
 }
 
+/**
+ * mlock_vma_pages_range - lock the pages of a VMA in memory
+ * @vma: vm area to mlock into memory
+ * @start: start address in @vma of range to mlock,
+ * @end: end address in @vma of range
+ *
+ * Called with current->mm->mmap_sem held write locked.  Downgrade to read
+ * for faulting in pages.  This can take a looong time for large segments.
+ *
+ * We need to restore the mmap_sem to write locked because our callers'
+ * callers expect this.	 However, because the mmap could have changed
+ * [in a multi-threaded process], we need to recheck.
+ */
+int mlock_vma_pages_range(struct vm_area_struct *vma,
+			unsigned long start, unsigned long end)
+{
+	struct mm_struct *mm = vma->vm_mm;
+
+	downgrade_write(&mm->mmap_sem);
+	__mlock_vma_pages_range(vma, start, end);
+
+	up_read(&mm->mmap_sem);
+	/* vma can change or disappear */
+	down_write(&mm->mmap_sem);
+	vma = find_vma(mm, start);
+	/* non-NULL vma must contain @start, but need to check @end */
+	if (!vma ||  end > vma->vm_end)
+		return -EAGAIN;
+	return 0;
+}
+
 #else /* CONFIG_NORECLAIM_MLOCK */
 
 /*
@@ -265,14 +296,38 @@ success:
 	mm->locked_vm += nr_pages;
 
 	/*
-	 * vm_flags is protected by the mmap_sem held in write mode.
+	 * vm_flags is protected by the mmap_sem held for write.
 	 * It's okay if try_to_unmap_one unmaps a page just after we
 	 * set VM_LOCKED, __mlock_vma_pages_range will bring it back.
 	 */
 	vma->vm_flags = newflags;
 
+	/*
+	 * mmap_sem is currently held for write.  If we're locking pages,
+	 * downgrade the write lock to a read lock so that other faults,
+	 * mmap scans, ... can proceed while we fault in all pages.
+	 */
+	if (lock)
+		downgrade_write(&mm->mmap_sem);
+
 	__mlock_vma_pages_range(vma, start, end);
 
+	if (lock) {
+		/*
+		 * Need to reacquire mmap sem in write mode, as our callers
+		 * expect this.  We have no support for atomically upgrading
+		 * a sem to write, so we need to recheck the range in case
+		 * it changed while the sem was unlocked.
+		 */
+		up_read(&mm->mmap_sem);
+		/* vma can change or disappear */
+		down_write(&mm->mmap_sem);
+		*prev = find_vma(mm, start);
+		/* non-NULL *prev must contain @start, but need to check @end */
+		if (!(*prev) || end > (*prev)->vm_end)
+			ret = -EAGAIN;
+	}
+
 out:
 	if (ret == -ENOMEM)
 		ret = -EAGAIN;
Index: linux-2.6.25-rc3-mm1/mm/internal.h
===================================================================
--- linux-2.6.25-rc3-mm1.orig/mm/internal.h	2008-03-04 16:19:46.000000000 -0500
+++ linux-2.6.25-rc3-mm1/mm/internal.h	2008-03-04 17:29:19.000000000 -0500
@@ -61,24 +61,21 @@ extern int __mlock_vma_pages_range(struc
 /*
  * mlock all pages in this vma range.  For mmap()/mremap()/...
  */
-static inline void mlock_vma_pages_range(struct vm_area_struct *vma,
-			unsigned long start, unsigned long end)
-{
-	__mlock_vma_pages_range(vma, start, end);
-}
+extern int mlock_vma_pages_range(struct vm_area_struct *vma,
+			unsigned long start, unsigned long end);
 
 /*
  * munlock range of pages.   For munmap() and exit().
  * Always called to operate on a full vma that is being unmapped.
  */
-static inline void munlock_vma_pages_range(struct vm_area_struct *vma,
+static inline int munlock_vma_pages_range(struct vm_area_struct *vma,
 			unsigned long start, unsigned long end)
 {
 // TODO:  verify my assumption.  Should we just drop the start/end args?
 	VM_BUG_ON(start != vma->vm_start || end != vma->vm_end);
 
 	vma->vm_flags &= ~VM_LOCKED;
-	__mlock_vma_pages_range(vma, start, end);
+	return __mlock_vma_pages_range(vma, start, end);
 }
 
 extern void clear_page_mlock(struct page *page);
@@ -90,10 +87,10 @@ static inline int is_mlocked_vma(struct 
 }
 static inline void clear_page_mlock(struct page *page) { }
 static inline void mlock_vma_page(struct page *page) { }
-static inline void mlock_vma_pages_range(struct vm_area_struct *vma,
-			unsigned long start, unsigned long end) { }
-static inline void munlock_vma_pages_range(struct vm_area_struct *vma,
-			unsigned long start, unsigned long end) { }
+static inline int mlock_vma_pages_range(struct vm_area_struct *vma,
+			unsigned long start, unsigned long end) { return 0; }
+static inline int munlock_vma_pages_range(struct vm_area_struct *vma,
+			unsigned long start, unsigned long end) { return 0; }
 
 #endif /* CONFIG_NORECLAIM_MLOCK */
 
Index: linux-2.6.25-rc3-mm1/mm/mmap.c
===================================================================
--- linux-2.6.25-rc3-mm1.orig/mm/mmap.c	2008-03-04 17:29:19.000000000 -0500
+++ linux-2.6.25-rc3-mm1/mm/mmap.c	2008-03-04 17:30:00.000000000 -0500
@@ -2007,8 +2007,9 @@ unsigned long do_brk(unsigned long addr,
 		return -ENOMEM;
 
 	/* Can we just expand an old private anonymous mapping? */
-	if (vma_merge(mm, prev, addr, addr + len, flags,
-					NULL, NULL, pgoff, NULL))
+	vma = vma_merge(mm, prev, addr, addr + len, flags,
+					NULL, NULL, pgoff, NULL);
+	if (vma)
 		goto out;
 
 	/*

-- 
All Rights Reversed
