public inbox for linux-kernel@vger.kernel.org
* [PATCH v2] mm/pagewalk.c: walk_page_range should avoid VM_PFNMAP areas
@ 2013-05-02 12:10 Cliff Wickman
  2013-05-02 16:44 ` Naoya Horiguchi
  0 siblings, 1 reply; 3+ messages in thread
From: Cliff Wickman @ 2013-05-02 12:10 UTC (permalink / raw)
  To: linux-kernel
  Cc: akpm, mgorman, aarcange, dave.hansen, dsterba, hannes,
	kosaki.motohiro, kirill.shutemov, mpm, n-horiguchi, rdunlap


/proc/<pid>/smaps and similar walks through a user page table should not
be looking at VM_PFNMAP areas.

This is v2:
- moves the VM_BUG_ON out of the loop
- adds the needed test for vma->vm_start <= addr

Certain tests in walk_page_range() (specifically split_huge_page_pmd())
assume that all the mapped PFNs are backed with page structures. And this is
not usually true for VM_PFNMAP areas. This can result in panics on kernel
page faults when attempting to address those page structures.

There are a half dozen callers of walk_page_range() that walk through
a task's entire page table (as N. Horiguchi pointed out). So rather than
change all of them, this patch changes just walk_page_range() to ignore 
VM_PFNMAP areas.

The logic of hugetlb_vma() is moved back into walk_page_range(), as we
want to test any vma in the range.

VM_PFNMAP areas are used by:
- graphics memory manager   gpu/drm/drm_gem.c
- global reference unit     sgi-gru/grufile.c
- sgi special memory        char/mspec.c
- and probably several out-of-tree modules

I'm copying everyone who has changed this file recently, in case
there is some reason I am not aware of for providing
/proc/<pid>/smaps|clear_refs|maps|numa_maps for these VM_PFNMAP areas.

Signed-off-by: Cliff Wickman <cpw@sgi.com>
---
 mm/pagewalk.c |   62 ++++++++++++++++++++++++++++++----------------------------
 1 file changed, 33 insertions(+), 29 deletions(-)

Index: linux/mm/pagewalk.c
===================================================================
--- linux.orig/mm/pagewalk.c
+++ linux/mm/pagewalk.c
@@ -127,22 +127,6 @@ static int walk_hugetlb_range(struct vm_
 	return 0;
 }
 
-static struct vm_area_struct* hugetlb_vma(unsigned long addr, struct mm_walk *walk)
-{
-	struct vm_area_struct *vma;
-
-	/* We don't need vma lookup at all. */
-	if (!walk->hugetlb_entry)
-		return NULL;
-
-	VM_BUG_ON(!rwsem_is_locked(&walk->mm->mmap_sem));
-	vma = find_vma(walk->mm, addr);
-	if (vma && vma->vm_start <= addr && is_vm_hugetlb_page(vma))
-		return vma;
-
-	return NULL;
-}
-
 #else /* CONFIG_HUGETLB_PAGE */
 static struct vm_area_struct* hugetlb_vma(unsigned long addr, struct mm_walk *walk)
 {
@@ -198,30 +182,50 @@ int walk_page_range(unsigned long addr,
 	if (!walk->mm)
 		return -EINVAL;
 
+	VM_BUG_ON(!rwsem_is_locked(&walk->mm->mmap_sem));
+
 	pgd = pgd_offset(walk->mm, addr);
 	do {
-		struct vm_area_struct *vma;
+		struct vm_area_struct *vma = NULL;
 
 		next = pgd_addr_end(addr, end);
 
 		/*
-		 * handle hugetlb vma individually because pagetable walk for
-		 * the hugetlb page is dependent on the architecture and
-		 * we can't handled it in the same manner as non-huge pages.
+		 * Check any special vmas within this range.
 		 */
-		vma = hugetlb_vma(addr, walk);
+		vma = find_vma(walk->mm, addr);
 		if (vma) {
-			if (vma->vm_end < next)
+			/*
+			 * There are no page structures backing a VM_PFNMAP
+			 * range, so do not allow split_huge_page_pmd().
+			 */
+			if ((vma->vm_start <= addr) &&
+			    (vma->vm_flags & VM_PFNMAP)) {
 				next = vma->vm_end;
+				pgd = pgd_offset(walk->mm, next);
+				continue;
+			}
 			/*
-			 * Hugepage is very tightly coupled with vma, so
-			 * walk through hugetlb entries within a given vma.
+			 * Handle hugetlb vma individually because pagetable
+			 * walk for the hugetlb page is dependent on the
+			 * architecture and we can't handle it in the same
+			 * manner as non-huge pages.
 			 */
-			err = walk_hugetlb_range(vma, addr, next, walk);
-			if (err)
-				break;
-			pgd = pgd_offset(walk->mm, next);
-			continue;
+			if (walk->hugetlb_entry && (vma->vm_start <= addr) &&
+			    is_vm_hugetlb_page(vma)) {
+				if (vma->vm_end < next)
+					next = vma->vm_end;
+				/*
+				 * Hugepage is very tightly coupled with vma,
+				 * so walk through hugetlb entries within a
+				 * given vma.
+				 */
+				err = walk_hugetlb_range(vma, addr, next, walk);
+				if (err)
+					break;
+				pgd = pgd_offset(walk->mm, next);
+				continue;
+			}
 		}
 
 		if (pgd_none_or_clear_bad(pgd)) {


* Re: [PATCH v2] mm/pagewalk.c: walk_page_range should avoid VM_PFNMAP areas
  2013-05-02 12:10 [PATCH v2] mm/pagewalk.c: walk_page_range should avoid VM_PFNMAP areas Cliff Wickman
@ 2013-05-02 16:44 ` Naoya Horiguchi
  2013-05-02 17:16   ` Cliff Wickman
  0 siblings, 1 reply; 3+ messages in thread
From: Naoya Horiguchi @ 2013-05-02 16:44 UTC (permalink / raw)
  To: Cliff Wickman
  Cc: linux-kernel, akpm, mgorman, aarcange, dave.hansen, dsterba,
	hannes, kosaki.motohiro, kirill.shutemov, mpm, rdunlap

On Thu, May 02, 2013 at 07:10:48AM -0500, Cliff Wickman wrote:
> [patch description snipped]

walk_page_range() does a vma-based walk only for address ranges backed by
hugetlbfs; it doesn't look at the vma for address ranges backed by normal
pages and THPs (in those cases we just walk over the page table hierarchy).

I think that vma-based walk was introduced as a kind of dirty hack to
handle hugetlbfs, and it can be cleaned up in the future. So I'm afraid
it's not a good idea to extend it or to add code that depends heavily on
this hack. I recommend that you check VM_PFNMAP on the possible callers'
side. But this patch seems to solve your problem, so with proper
commenting somewhere, I do not oppose it.

Thanks,
Naoya Horiguchi


* Re: [PATCH v2] mm/pagewalk.c: walk_page_range should avoid VM_PFNMAP areas
  2013-05-02 16:44 ` Naoya Horiguchi
@ 2013-05-02 17:16   ` Cliff Wickman
  0 siblings, 0 replies; 3+ messages in thread
From: Cliff Wickman @ 2013-05-02 17:16 UTC (permalink / raw)
  To: Naoya Horiguchi
  Cc: linux-kernel, akpm, mgorman, aarcange, dave.hansen, dsterba,
	hannes, kosaki.motohiro, kirill.shutemov, mpm, rdunlap

On Thu, May 02, 2013 at 12:44:04PM -0400, Naoya Horiguchi wrote:
> On Thu, May 02, 2013 at 07:10:48AM -0500, Cliff Wickman wrote:
> > [patch description snipped]
> 
> walk_page_range() does a vma-based walk only for address ranges backed by
> hugetlbfs; it doesn't look at the vma for address ranges backed by normal
> pages and THPs (in those cases we just walk over the page table hierarchy).

Agreed, walk_page_range() only checks for a hugetlbfs-type vma as it
scans an address range.

The problem I'm seeing comes in when it calls walk_pud_range() for any address
range that is not within a hugetlbfs vma:
   walk_pmd_range()
     split_huge_page_pmd_mm()
       split_huge_page_pmd()
         __split_huge_page_pmd()
           page = pmd_page(*pmd)
And such a page structure does not exist for a VM_PFNMAP area.
 
> I think that vma-based walk was introduced as a kind of dirty hack to
> handle hugetlbfs, and it can be cleaned up in the future. So I'm afraid
> it's not a good idea to extend it or to add code that depends heavily on
> this hack.

walk_page_range() looks like generic infrastructure for scanning any range
of a user's address space - as in /proc/<pid>/smaps and similar. The
hugetlbfs check seems to have been added as an exception. Huge page
exceptional cases occur further down the chain, and when a corresponding
page structure is needed for those cases we run into the problem.

I'm not depending on walk_page_range(). I'm just trying to survive the
case where it is scanning a VM_PFNMAP range.

> I recommend that you check VM_PFNMAP on the possible callers' side.
> But this patch seems to solve your problem, so with proper commenting
> somewhere, I do not oppose it.

Agreed, it could be handled by checking at several points higher up. But
checking at this common point seems more straightforward to me.

-Cliff
> 
> Thanks,
> Naoya Horiguchi

-- 
Cliff Wickman
SGI
cpw@sgi.com
(651) 683-3824
