From: Rik van Riel <riel@redhat.com>
To: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org, lee.shermerhorn@hp.com,
	Lee Schermerhorn <lee.schermerhorn@hp.com>
Subject: [patch 18/20] mlock vma pages under mmap_sem held for read
Date: Tue, 18 Dec 2007 16:15:57 -0500
Message-ID: <20071218211550.292178257@redhat.com>
In-Reply-To: 20071218211539.250334036@redhat.com

[-- Attachment #1: noreclaim-04.1a-lock-vma-pages-under-read-lock.patch --]
[-- Type: text/plain, Size: 6588 bytes --]

V2 -> V3:
+ rebase to 23-mm1 atop RvR's split lru series [no change]
+ fix function return types [void -> int] to fix build when
  not configured.

New in V2.

We need to hold the mmap_sem for write to initiate mlock()/munlock()
because we may need to merge/split vmas.  However, this can lead to
very long lock hold times while faulting in a large memory region
to mlock it into memory, especially if we need to reclaim memory to
lock down the region.  This can hold off other faults against the
mm [multithreaded tasks] and other scans of the mm, such as via /proc.
To alleviate this, downgrade the mmap_sem to read mode during the
population of the region for locking.  We [probably?] don't need to
do this for unlocking as all of the pages should be resident--they're
already mlocked.

Now, the callers of the mlock functions [mlock_fixup() and
mlock_vma_pages_range()] expect the mmap_sem to be returned in write
mode.  Changing all callers appears to be way too much effort at this
point.  So, restore write mode before returning.  Note that this opens
a window where the mmap list could change in a multithreaded process.
So, at least for mlock_fixup(), where we could be called in a loop over
multiple vmas, we check that a vma still exists at the start address
and that the vma still covers the page range [start,end).  If not, we
return an error, -EAGAIN, and let the caller deal with it.
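
The recheck described above can be sketched in plain userspace C.  This is
only an illustration, not code from the patch: struct toy_vma,
toy_find_vma() and toy_recheck_range() are hypothetical stand-ins, and the
lookup merely imitates find_vma() by returning the first area that ends
above the address.  Because such a lookup can return an area that begins
above 'start', the sketch also compares 'start' against vm_start, a check
the patch itself does not make.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for struct vm_area_struct: one [vm_start, vm_end) area. */
struct toy_vma {
	unsigned long vm_start;
	unsigned long vm_end;
};

/*
 * Imitates find_vma(): return the first area whose vm_end lies above addr,
 * or NULL if there is none.  Areas are assumed sorted and non-overlapping.
 */
static const struct toy_vma *toy_find_vma(const struct toy_vma *areas,
					  size_t n, unsigned long addr)
{
	for (size_t i = 0; i < n; i++)
		if (addr < areas[i].vm_end)
			return &areas[i];
	return NULL;
}

/*
 * The recheck done once the lock is held for write again: [start, end)
 * must still lie inside a single area, otherwise report -EAGAIN.
 * The vm_start comparison is an addition made in this sketch only.
 */
static int toy_recheck_range(const struct toy_vma *areas, size_t n,
			     unsigned long start, unsigned long end)
{
	const struct toy_vma *vma = toy_find_vma(areas, n, start);

	if (!vma || start < vma->vm_start || end > vma->vm_end)
		return -EAGAIN;
	return 0;
}
```

A caller would run such a check only after retaking the lock in write mode,
since the layout can change at any point while the lock is dropped.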

Return -EAGAIN from both mlock_vma_pages_range() and mlock_fixup()
if the vma at 'start' disappears or changes so that the page range
[start,end) is no longer contained in the vma.  Again, let the caller
deal with it.  It looks like only sys_remap_file_pages() [via
mmap_region()] should actually care.

With this patch, I no longer see processes like ps(1) blocked for seconds
or minutes at a time waiting for a large [multiple gigabyte] region to be
locked down.  
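
For reference, the kind of userspace request that triggers this path can be
sketched as follows.  It is a rough illustration rather than the test
program used for the observation above: lock_anon_region() is a
hypothetical helper, and the sketch assumes Linux with MAP_ANONYMOUS and an
RLIMIT_MEMLOCK large enough for the requested length.

```c
#define _DEFAULT_SOURCE
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>

/*
 * Hypothetical helper: map len bytes of anonymous memory, mlock() the
 * whole range, then release it.  Returns 0 or a negative errno value.
 */
static int lock_anon_region(size_t len)
{
	void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	int ret = 0;

	if (p == MAP_FAILED)
		return -errno;

	/* mlock() faults in every page of the range before returning. */
	if (mlock(p, len) != 0)
		ret = -errno;
	else
		munlock(p, len);

	munmap(p, len);
	return ret;
}
```

With a multi-gigabyte len, the mlock() call is where the long fault-in
happens while other users of the mm wait.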

Signed-off-by:  Lee Schermerhorn <lee.schermerhorn@hp.com>
Signed-off-by:  Rik van Riel <riel@redhat.com>

Index: Linux/mm/mlock.c
===================================================================
--- Linux.orig/mm/mlock.c	2007-11-12 16:21:59.000000000 -0500
+++ Linux/mm/mlock.c	2007-11-12 16:22:01.000000000 -0500
@@ -215,6 +215,37 @@ int __mlock_vma_pages_range(struct vm_ar
 	return ret;
 }
 
+/**
+ * mlock_vma_pages_range
+ * @vma - vm area to mlock into memory
+ * @start - start address in @vma of range to mlock,
+ * @end   - end address in @vma of range
+ *
+ * Called with current->mm->mmap_sem held write locked.  Downgrade to read
+ * for faulting in pages.  This can take a looong time for large segments.
+ *
+ * We need to restore the mmap_sem to write locked because our callers'
+ * callers expect this.	 However, because the mmap could have changed
+ * [in a multi-threaded process], we need to recheck.
+ */
+int mlock_vma_pages_range(struct vm_area_struct *vma,
+			unsigned long start, unsigned long end)
+{
+	struct mm_struct *mm = vma->vm_mm;
+
+	downgrade_write(&mm->mmap_sem);
+	__mlock_vma_pages_range(vma, start, end, 1);
+
+	up_read(&mm->mmap_sem);
+	/* vma can change or disappear */
+	down_write(&mm->mmap_sem);
+	vma = find_vma(mm, start);
+	/* non-NULL vma must contain @start, but need to check @end */
+	if (!vma ||  end > vma->vm_end)
+		return -EAGAIN;
+	return 0;
+}
+
 #else /* CONFIG_NORECLAIM_MLOCK */
 
 /*
@@ -281,14 +312,38 @@ success:
 	mm->locked_vm += nr_pages;
 
 	/*
-	 * vm_flags is protected by the mmap_sem held in write mode.
+	 * vm_flags is protected by the mmap_sem held for write.
 	 * It's okay if try_to_unmap_one unmaps a page just after we
 	 * set VM_LOCKED, __mlock_vma_pages_range will bring it back.
 	 */
 	vma->vm_flags = newflags;
 
+	/*
+	 * mmap_sem is currently held for write.  If we're locking pages,
+	 * downgrade the write lock to a read lock so that other faults,
+	 * mmap scans, ... can proceed while we fault in all pages.
+	 */
+	if (lock)
+		downgrade_write(&mm->mmap_sem);
+
 	__mlock_vma_pages_range(vma, start, end, lock);
 
+	if (lock) {
+		/*
+		 * Need to reacquire mmap sem in write mode, as our callers
+		 * expect this.  We have no support for atomically upgrading
+		 * a sem to write, so we need to check for changes while sem
+		 * is unlocked.
+		 */
+		up_read(&mm->mmap_sem);
+		/* vma can change or disappear */
+		down_write(&mm->mmap_sem);
+		*prev = find_vma(mm, start);
+		/* non-NULL *prev must contain @start, but need to check @end */
+		if (!(*prev) || end > (*prev)->vm_end)
+			ret = -EAGAIN;
+	}
+
 out:
 	if (ret == -ENOMEM)
 		ret = -EAGAIN;
Index: Linux/mm/internal.h
===================================================================
--- Linux.orig/mm/internal.h	2007-11-12 16:21:59.000000000 -0500
+++ Linux/mm/internal.h	2007-11-12 16:22:01.000000000 -0500
@@ -53,24 +53,21 @@ extern int __mlock_vma_pages_range(struc
 /*
  * mlock all pages in this vma range.  For mmap()/mremap()/...
  */
-static inline void mlock_vma_pages_range(struct vm_area_struct *vma,
-			unsigned long start, unsigned long end)
-{
-	__mlock_vma_pages_range(vma, start, end, 1);
-}
+extern int mlock_vma_pages_range(struct vm_area_struct *vma,
+			unsigned long start, unsigned long end);
 
 /*
  * munlock range of pages.   For munmap() and exit().
  * Always called to operate on a full vma that is being unmapped.
  */
-static inline void munlock_vma_pages_range(struct vm_area_struct *vma,
+static inline int munlock_vma_pages_range(struct vm_area_struct *vma,
 			unsigned long start, unsigned long end)
 {
 // TODO:  verify my assumption.  Should we just drop the start/end args?
 	VM_BUG_ON(start != vma->vm_start || end != vma->vm_end);
 
 	vma->vm_flags &= ~VM_LOCKED;	/* try_to_unlock() needs this */
-	__mlock_vma_pages_range(vma, start, end, 0);
+	return __mlock_vma_pages_range(vma, start, end, 0);
 }
 
 extern void clear_page_mlock(struct page *page);
@@ -82,10 +79,10 @@ static inline int is_mlocked_vma(struct 
 }
 static inline void clear_page_mlock(struct page *page) { }
 static inline void mlock_vma_page(struct page *page) { }
-static inline void mlock_vma_pages_range(struct vm_area_struct *vma,
-			unsigned long start, unsigned long end) { }
-static inline void munlock_vma_pages_range(struct vm_area_struct *vma,
-			unsigned long start, unsigned long end) { }
+static inline int mlock_vma_pages_range(struct vm_area_struct *vma,
+			unsigned long start, unsigned long end) { return 0; }
+static inline int munlock_vma_pages_range(struct vm_area_struct *vma,
+			unsigned long start, unsigned long end) { return 0; }
 
 #endif /* CONFIG_NORECLAIM_MLOCK */
 

-- 
All Rights Reversed

