linux-mm.kvack.org archive mirror
* [PATCH 0/3] fixes on page table walker and hugepage rmapping
@ 2014-02-27  4:39 Naoya Horiguchi
  2014-02-27  4:39 ` [PATCH 1/3] mm/pagewalk.c: fix end address calculation in walk_page_range() Naoya Horiguchi
                   ` (2 more replies)
  0 siblings, 3 replies; 28+ messages in thread
From: Naoya Horiguchi @ 2014-02-27  4:39 UTC (permalink / raw)
  To: Sasha Levin; +Cc: Andrew Morton, linux-mm, linux-kernel

Hi,

Sasha, could you test whether the bug you reported recently [1] still
reproduces on the latest next tree with this patchset applied? (I'm not
sure it will, because the problem looks different in my own testing...)

[1] http://thread.gmane.org/gmane.linux.kernel.mm/113374/focus=113
---
Summary:

Naoya Horiguchi (3):
      mm/pagewalk.c: fix end address calculation in walk_page_range()
      mm, hugetlbfs: fix rmapping for anonymous hugepages with page_pgoff()
      mm: call vma_adjust_trans_huge() only for thp-enabled vma

 include/linux/pagemap.h | 13 +++++++++++++
 mm/huge_memory.c        |  2 +-
 mm/hugetlb.c            |  5 +++++
 mm/memory-failure.c     |  4 ++--
 mm/mmap.c               |  3 ++-
 mm/pagewalk.c           |  5 +++--
 mm/rmap.c               |  8 ++------
 7 files changed, 28 insertions(+), 12 deletions(-)


* [PATCH 1/3] mm/pagewalk.c: fix end address calculation in walk_page_range()
  2014-02-27  4:39 [PATCH 0/3] fixes on page table walker and hugepage rmapping Naoya Horiguchi
@ 2014-02-27  4:39 ` Naoya Horiguchi
  2014-02-27 21:03   ` Andrew Morton
  2014-02-27  4:39 ` [PATCH 2/3] mm, hugetlbfs: fix rmapping for anonymous hugepages with page_pgoff() Naoya Horiguchi
  2014-02-27  4:39 ` [PATCH 3/3] mm: call vma_adjust_trans_huge() only for thp-enabled vma Naoya Horiguchi
  2 siblings, 1 reply; 28+ messages in thread
From: Naoya Horiguchi @ 2014-02-27  4:39 UTC (permalink / raw)
  To: Sasha Levin; +Cc: Andrew Morton, linux-mm, linux-kernel

When walking within a vma, walk_page_range() tries to walk until
vma->vm_end even if the given end address comes before that point.
So this patch takes the smaller of the two as the end address.
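
For instance (a sketch of the effect, with hypothetical addresses):

	/* vma spans [0x100000, 0x200000) and the caller passes end = 0x101000 */
	next = vma->vm_end;				/* before: 0x200000, overshoots */
	next = min_t(unsigned long, end, vma->vm_end);	/* after:  0x101000 */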

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
---
 mm/pagewalk.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git next-20140220.orig/mm/pagewalk.c next-20140220/mm/pagewalk.c
index 416e981243b1..b418407ff4da 100644
--- next-20140220.orig/mm/pagewalk.c
+++ next-20140220/mm/pagewalk.c
@@ -321,8 +321,9 @@ int walk_page_range(unsigned long start, unsigned long end,
 			next = vma->vm_start;
 		} else { /* inside the found vma */
 			walk->vma = vma;
-			next = vma->vm_end;
-			err = walk_page_test(start, end, walk);
+			next = min_t(unsigned long, end, vma->vm_end);
+
+			err = walk_page_test(start, next, walk);
 			if (skip_lower_level_walking(walk))
 				continue;
 			if (err)
-- 
1.8.5.3


* [PATCH 2/3] mm, hugetlbfs: fix rmapping for anonymous hugepages with page_pgoff()
  2014-02-27  4:39 [PATCH 0/3] fixes on page table walker and hugepage rmapping Naoya Horiguchi
  2014-02-27  4:39 ` [PATCH 1/3] mm/pagewalk.c: fix end address calculation in walk_page_range() Naoya Horiguchi
@ 2014-02-27  4:39 ` Naoya Horiguchi
  2014-02-27 21:19   ` Andrew Morton
  2014-02-27  4:39 ` [PATCH 3/3] mm: call vma_adjust_trans_huge() only for thp-enabled vma Naoya Horiguchi
  2 siblings, 1 reply; 28+ messages in thread
From: Naoya Horiguchi @ 2014-02-27  4:39 UTC (permalink / raw)
  To: Sasha Levin; +Cc: Andrew Morton, linux-mm, linux-kernel

page->index stores the pagecache index when the page is mapped into a file
mapping region, and the index is in pagecache-size units, so it depends on
the page size. Some users of reverse mapping obviously assume that
page->index is in PAGE_CACHE_SHIFT units, so they don't work for anonymous
hugepages.

For example, consider that we have a 3-hugepage vma and try to mbind the
2nd hugepage to migrate it to another node. Then the vma is split and
migrate_page() is called for the 2nd hugepage (belonging to the middle vma.)
In the migrate operation, rmap_walk_anon() tries to find the relevant vma to
which the target hugepage belongs, but here we miscalculate pgoff, so
anon_vma_interval_tree_foreach() grabs an invalid vma, which fires VM_BUG_ON.

This patch introduces a new API, usable for both normal pages and hugepages,
to get the PAGE_SIZE offset from page->index. Users should clearly
distinguish page_index for the pagecache index from page_pgoff for the page
offset.
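
To see the unit mismatch concretely, here is a small userspace sketch of the
arithmetic (assumed values: 2MB hugepages on 4kB base pages, so the hugepage
order is 9; all names here are illustrative only):

	#include <stdio.h>

	int main(void)
	{
		unsigned long index = 1;	/* page->index of the 2nd hugepage */
		unsigned int huge_order = 9;	/* assumed: 2MB hugepage, 4kB base page */

		/* what the buggy callers compute (PAGE_CACHE_SHIFT == PAGE_SHIFT) */
		unsigned long wrong_pgoff = index << 0;
		/* what the anon_vma interval tree lookup actually needs */
		unsigned long right_pgoff = index << huge_order;

		printf("wrong pgoff: %lu, correct pgoff: %lu\n",
		       wrong_pgoff, right_pgoff);
		return 0;
	}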

Reported-by: Sasha Levin <sasha.levin@oracle.com> # if the reported problem is fixed
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: stable@vger.kernel.org # 3.12+
---
 include/linux/pagemap.h | 13 +++++++++++++
 mm/huge_memory.c        |  2 +-
 mm/hugetlb.c            |  5 +++++
 mm/memory-failure.c     |  4 ++--
 mm/rmap.c               |  8 ++------
 5 files changed, 23 insertions(+), 9 deletions(-)

diff --git next-20140220.orig/include/linux/pagemap.h next-20140220/include/linux/pagemap.h
index 4f591df66778..a8bd14f42032 100644
--- next-20140220.orig/include/linux/pagemap.h
+++ next-20140220/include/linux/pagemap.h
@@ -316,6 +316,19 @@ static inline loff_t page_file_offset(struct page *page)
 	return ((loff_t)page_file_index(page)) << PAGE_CACHE_SHIFT;
 }
 
+extern pgoff_t hugepage_pgoff(struct page *page);
+
+/*
+ * page->index stores pagecache index whose unit is not always PAGE_SIZE.
+ * This function converts it into PAGE_SIZE offset.
+ */
+#define page_pgoff(page)					\
+({								\
+	unlikely(PageHuge(page)) ?				\
+		hugepage_pgoff(page) :				\
+		page->index >> (PAGE_CACHE_SHIFT - PAGE_SHIFT);	\
+})
+
 extern pgoff_t linear_hugepage_index(struct vm_area_struct *vma,
 				     unsigned long address);
 
diff --git next-20140220.orig/mm/huge_memory.c next-20140220/mm/huge_memory.c
index 6ac89e9f82ef..ef96763a6abf 100644
--- next-20140220.orig/mm/huge_memory.c
+++ next-20140220/mm/huge_memory.c
@@ -1800,7 +1800,7 @@ static void __split_huge_page(struct page *page,
 			      struct list_head *list)
 {
 	int mapcount, mapcount2;
-	pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
+	pgoff_t pgoff = page_pgoff(page);
 	struct anon_vma_chain *avc;
 
 	BUG_ON(!PageHead(page));
diff --git next-20140220.orig/mm/hugetlb.c next-20140220/mm/hugetlb.c
index 2252cacf98e8..e159e593d99f 100644
--- next-20140220.orig/mm/hugetlb.c
+++ next-20140220/mm/hugetlb.c
@@ -764,6 +764,11 @@ pgoff_t __basepage_index(struct page *page)
 	return (index << compound_order(page_head)) + compound_idx;
 }
 
+pgoff_t hugepage_pgoff(struct page *page)
+{
+	return page->index << huge_page_order(page_hstate(page));
+}
+
 static struct page *alloc_fresh_huge_page_node(struct hstate *h, int nid)
 {
 	struct page *page;
diff --git next-20140220.orig/mm/memory-failure.c next-20140220/mm/memory-failure.c
index 35ef28acf137..5d85a4afb22c 100644
--- next-20140220.orig/mm/memory-failure.c
+++ next-20140220/mm/memory-failure.c
@@ -404,7 +404,7 @@ static void collect_procs_anon(struct page *page, struct list_head *to_kill,
 	if (av == NULL)	/* Not actually mapped anymore */
 		return;
 
-	pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
+	pgoff = page_pgoff(page);
 	read_lock(&tasklist_lock);
 	for_each_process (tsk) {
 		struct anon_vma_chain *vmac;
@@ -437,7 +437,7 @@ static void collect_procs_file(struct page *page, struct list_head *to_kill,
 	mutex_lock(&mapping->i_mmap_mutex);
 	read_lock(&tasklist_lock);
 	for_each_process(tsk) {
-		pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
+		pgoff_t pgoff = page_pgoff(page);
 
 		if (!task_early_kill(tsk))
 			continue;
diff --git next-20140220.orig/mm/rmap.c next-20140220/mm/rmap.c
index 9056a1f00b87..78405051474a 100644
--- next-20140220.orig/mm/rmap.c
+++ next-20140220/mm/rmap.c
@@ -515,11 +515,7 @@ void page_unlock_anon_vma_read(struct anon_vma *anon_vma)
 static inline unsigned long
 __vma_address(struct page *page, struct vm_area_struct *vma)
 {
-	pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
-
-	if (unlikely(is_vm_hugetlb_page(vma)))
-		pgoff = page->index << huge_page_order(page_hstate(page));
-
+	pgoff_t pgoff = page_pgoff(page);
 	return vma->vm_start + ((pgoff - vma->vm_pgoff) << PAGE_SHIFT);
 }
 
@@ -1598,7 +1594,7 @@ static struct anon_vma *rmap_walk_anon_lock(struct page *page,
 static int rmap_walk_anon(struct page *page, struct rmap_walk_control *rwc)
 {
 	struct anon_vma *anon_vma;
-	pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
+	pgoff_t pgoff = page_pgoff(page);
 	struct anon_vma_chain *avc;
 	int ret = SWAP_AGAIN;
 
-- 
1.8.5.3


* [PATCH 3/3] mm: call vma_adjust_trans_huge() only for thp-enabled vma
  2014-02-27  4:39 [PATCH 0/3] fixes on page table walker and hugepage rmapping Naoya Horiguchi
  2014-02-27  4:39 ` [PATCH 1/3] mm/pagewalk.c: fix end address calculation in walk_page_range() Naoya Horiguchi
  2014-02-27  4:39 ` [PATCH 2/3] mm, hugetlbfs: fix rmapping for anonymous hugepages with page_pgoff() Naoya Horiguchi
@ 2014-02-27  4:39 ` Naoya Horiguchi
  2014-02-27 21:23   ` Andrew Morton
  2014-02-27 22:56   ` Kirill A. Shutemov
  2 siblings, 2 replies; 28+ messages in thread
From: Naoya Horiguchi @ 2014-02-27  4:39 UTC (permalink / raw)
  To: Sasha Levin; +Cc: Andrew Morton, linux-mm, linux-kernel

vma_adjust() is also called for vma(VM_HUGETLB), so it could happen that
we try to split a hugetlbfs hugepage. This patch excludes that possibility.

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
---
 mm/mmap.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git next-20140220.orig/mm/mmap.c next-20140220/mm/mmap.c
index f53397806d7f..45a9c0d51e3f 100644
--- next-20140220.orig/mm/mmap.c
+++ next-20140220/mm/mmap.c
@@ -772,7 +772,8 @@ again:			remove_next = 1 + (end > next->vm_end);
 		}
 	}
 
-	vma_adjust_trans_huge(vma, start, end, adjust_next);
+	if (transparent_hugepage_enabled(vma))
+		vma_adjust_trans_huge(vma, start, end, adjust_next);
 
 	anon_vma = vma->anon_vma;
 	if (!anon_vma && adjust_next)
-- 
1.8.5.3


* Re: [PATCH 1/3] mm/pagewalk.c: fix end address calculation in walk_page_range()
  2014-02-27  4:39 ` [PATCH 1/3] mm/pagewalk.c: fix end address calculation in walk_page_range() Naoya Horiguchi
@ 2014-02-27 21:03   ` Andrew Morton
  2014-02-27 21:19     ` Naoya Horiguchi
  0 siblings, 1 reply; 28+ messages in thread
From: Andrew Morton @ 2014-02-27 21:03 UTC (permalink / raw)
  To: Naoya Horiguchi; +Cc: Sasha Levin, linux-mm, linux-kernel

On Wed, 26 Feb 2014 23:39:35 -0500 Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> wrote:

> When walking within a vma, walk_page_range() tries to walk until
> vma->vm_end even if the given end address comes before that point.
> So this patch takes the smaller of the two as the end address.
> 
> ...
>
> --- next-20140220.orig/mm/pagewalk.c
> +++ next-20140220/mm/pagewalk.c
> @@ -321,8 +321,9 @@ int walk_page_range(unsigned long start, unsigned long end,
>  			next = vma->vm_start;
>  		} else { /* inside the found vma */
>  			walk->vma = vma;
> -			next = vma->vm_end;
> -			err = walk_page_test(start, end, walk);
> +			next = min_t(unsigned long, end, vma->vm_end);

min_t is unneeded, isn't it?  Everything here has type unsigned long.

> +			err = walk_page_test(start, next, walk);
>  			if (skip_lower_level_walking(walk))
>  				continue;
>  			if (err)

I'm assuming this is a fix against
pagewalk-update-page-table-walker-core.patch and shall eventually be
folded into that patch.  


* Re: [PATCH 1/3] mm/pagewalk.c: fix end address calculation in walk_page_range()
  2014-02-27 21:03   ` Andrew Morton
@ 2014-02-27 21:19     ` Naoya Horiguchi
  2014-02-27 21:20       ` Kirill A. Shutemov
  0 siblings, 1 reply; 28+ messages in thread
From: Naoya Horiguchi @ 2014-02-27 21:19 UTC (permalink / raw)
  To: akpm; +Cc: sasha.levin, linux-mm, linux-kernel

On Thu, Feb 27, 2014 at 01:03:23PM -0800, Andrew Morton wrote:
> On Wed, 26 Feb 2014 23:39:35 -0500 Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> wrote:
> 
> > When walking within a vma, walk_page_range() tries to walk until
> > vma->vm_end even if the given end address comes before that point.
> > So this patch takes the smaller of the two as the end address.
> > 
> > ...
> >
> > --- next-20140220.orig/mm/pagewalk.c
> > +++ next-20140220/mm/pagewalk.c
> > @@ -321,8 +321,9 @@ int walk_page_range(unsigned long start, unsigned long end,
> >  			next = vma->vm_start;
> >  		} else { /* inside the found vma */
> >  			walk->vma = vma;
> > -			next = vma->vm_end;
> > -			err = walk_page_test(start, end, walk);
> > +			next = min_t(unsigned long, end, vma->vm_end);
> 
> min_t is unneeded, isn't it?  Everything here has type unsigned long.

Yes, so simply (end < vma->vm_end ? end : vma->vm_end) is enough.
# I just used min_t as a simple minimum getter without thinking about the type check.

> > +			err = walk_page_test(start, next, walk);
> >  			if (skip_lower_level_walking(walk))
> >  				continue;
> >  			if (err)
> 
> I'm assuming this is a fix against
> pagewalk-update-page-table-walker-core.patch and shall eventually be
> folded into that patch.  

Yes, next-20140220 has this patch.

Thanks,
Naoya


* Re: [PATCH 2/3] mm, hugetlbfs: fix rmapping for anonymous hugepages with page_pgoff()
  2014-02-27  4:39 ` [PATCH 2/3] mm, hugetlbfs: fix rmapping for anonymous hugepages with page_pgoff() Naoya Horiguchi
@ 2014-02-27 21:19   ` Andrew Morton
  2014-02-27 21:53     ` Naoya Horiguchi
  0 siblings, 1 reply; 28+ messages in thread
From: Andrew Morton @ 2014-02-27 21:19 UTC (permalink / raw)
  To: Naoya Horiguchi; +Cc: Sasha Levin, linux-mm, linux-kernel, Rik van Riel

On Wed, 26 Feb 2014 23:39:36 -0500 Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> wrote:

> page->index stores the pagecache index when the page is mapped into a file
> mapping region, and the index is in pagecache-size units, so it depends on
> the page size. Some users of reverse mapping obviously assume that
> page->index is in PAGE_CACHE_SHIFT units, so they don't work for anonymous
> hugepages.
> 
> For example, consider that we have a 3-hugepage vma and try to mbind the
> 2nd hugepage to migrate it to another node. Then the vma is split and
> migrate_page() is called for the 2nd hugepage (belonging to the middle vma.)
> In the migrate operation, rmap_walk_anon() tries to find the relevant vma to
> which the target hugepage belongs, but here we miscalculate pgoff, so
> anon_vma_interval_tree_foreach() grabs an invalid vma, which fires VM_BUG_ON.
> 
> This patch introduces a new API, usable for both normal pages and hugepages,
> to get the PAGE_SIZE offset from page->index. Users should clearly
> distinguish page_index for the pagecache index from page_pgoff for the page
> offset.

So this patch is really independent of the page-walker changes, but the
page walker changes need it.  So it is appropriate that this patch be
staged before that series, and separately.  Agree?

> Reported-by: Sasha Levin <sasha.levin@oracle.com> # if the reported problem is fixed
> Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
> Cc: stable@vger.kernel.org # 3.12+

Why cc:stable?  This problem will only cause runtime issues when the
page walker patches are present?

> index 4f591df66778..a8bd14f42032 100644
> --- next-20140220.orig/include/linux/pagemap.h
> +++ next-20140220/include/linux/pagemap.h
> @@ -316,6 +316,19 @@ static inline loff_t page_file_offset(struct page *page)
>  	return ((loff_t)page_file_index(page)) << PAGE_CACHE_SHIFT;
>  }
>  
> +extern pgoff_t hugepage_pgoff(struct page *page);
> +
> +/*
> + * page->index stores pagecache index whose unit is not always PAGE_SIZE.
> + * This function converts it into PAGE_SIZE offset.
> + */
> +#define page_pgoff(page)					\
> +({								\
> +	unlikely(PageHuge(page)) ?				\
> +		hugepage_pgoff(page) :				\
> +		page->index >> (PAGE_CACHE_SHIFT - PAGE_SHIFT);	\
> +})

- I don't think this needs to be implemented in a macro?  Can we do
  it in good old C?

- Is PageHuge() the appropriate test?
  /*
   * PageHuge() only returns true for hugetlbfs pages, but not for normal or
   * transparent huge pages.  See the PageTransHuge() documentation for more
   * details.
   */

- Should page->index be shifted right or left?  Probably left - I
  doubt if PAGE_CACHE_SHIFT will ever be less than PAGE_SHIFT.

- I'm surprised we don't have a general what-is-this-page's-order
  function, so you can just do

	static inline pgoff_t page_pgoff(struct page *page)
	{
		return page->index << page_size_order(page);
	}

  And I think this would be a better implementation, as the (new)
  page_size_order() could be used elsewhere.

  page_size_order() would be a crappy name - can't think of anything
  better at present.

> --- next-20140220.orig/mm/memory-failure.c
> +++ next-20140220/mm/memory-failure.c
> @@ -404,7 +404,7 @@ static void collect_procs_anon(struct page *page, struct list_head *to_kill,
>  	if (av == NULL)	/* Not actually mapped anymore */
>  		return;
>  
> -	pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);

See this did a left shift.

> +	pgoff = page_pgoff(page);
>  	read_lock(&tasklist_lock);
>  	for_each_process (tsk) {
>  		struct anon_vma_chain *vmac;
> @@ -437,7 +437,7 @@ static void collect_procs_file(struct page *page, struct list_head *to_kill,
>  	mutex_lock(&mapping->i_mmap_mutex);
>  	read_lock(&tasklist_lock);
>  	for_each_process(tsk) {
> -		pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);

him too.

> +		pgoff_t pgoff = page_pgoff(page);
>  
>  		if (!task_early_kill(tsk))
>  			continue;
> diff --git next-20140220.orig/mm/rmap.c next-20140220/mm/rmap.c
> index 9056a1f00b87..78405051474a 100644
> --- next-20140220.orig/mm/rmap.c
> +++ next-20140220/mm/rmap.c
> @@ -515,11 +515,7 @@ void page_unlock_anon_vma_read(struct anon_vma *anon_vma)
>  static inline unsigned long
>  __vma_address(struct page *page, struct vm_area_struct *vma)
>  {
> -	pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);

And him.

> -
> -	if (unlikely(is_vm_hugetlb_page(vma)))
> -		pgoff = page->index << huge_page_order(page_hstate(page));
> -
> +	pgoff_t pgoff = page_pgoff(page);
>  	return vma->vm_start + ((pgoff - vma->vm_pgoff) << PAGE_SHIFT);
>  }
>  
> @@ -1598,7 +1594,7 @@ static struct anon_vma *rmap_walk_anon_lock(struct page *page,
>  static int rmap_walk_anon(struct page *page, struct rmap_walk_control *rwc)
>  {
>  	struct anon_vma *anon_vma;
> -	pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);

again.

> +	pgoff_t pgoff = page_pgoff(page);
>  	struct anon_vma_chain *avc;
>  	int ret = SWAP_AGAIN;


* Re: [PATCH 1/3] mm/pagewalk.c: fix end address calculation in walk_page_range()
  2014-02-27 21:19     ` Naoya Horiguchi
@ 2014-02-27 21:20       ` Kirill A. Shutemov
  2014-02-27 21:54         ` Naoya Horiguchi
  0 siblings, 1 reply; 28+ messages in thread
From: Kirill A. Shutemov @ 2014-02-27 21:20 UTC (permalink / raw)
  To: Naoya Horiguchi; +Cc: akpm, sasha.levin, linux-mm, linux-kernel

On Thu, Feb 27, 2014 at 04:19:01PM -0500, Naoya Horiguchi wrote:
> On Thu, Feb 27, 2014 at 01:03:23PM -0800, Andrew Morton wrote:
> > On Wed, 26 Feb 2014 23:39:35 -0500 Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> wrote:
> > 
> > > When walking within a vma, walk_page_range() tries to walk until
> > > vma->vm_end even if the given end address comes before that point.
> > > So this patch takes the smaller of the two as the end address.
> > > 
> > > ...
> > >
> > > --- next-20140220.orig/mm/pagewalk.c
> > > +++ next-20140220/mm/pagewalk.c
> > > @@ -321,8 +321,9 @@ int walk_page_range(unsigned long start, unsigned long end,
> > >  			next = vma->vm_start;
> > >  		} else { /* inside the found vma */
> > >  			walk->vma = vma;
> > > -			next = vma->vm_end;
> > > -			err = walk_page_test(start, end, walk);
> > > +			next = min_t(unsigned long, end, vma->vm_end);
> > 
> > min_t is unneeded, isn't it?  Everything here has type unsigned long.
> 
> Yes, so simply (end < vma->vm_end ? end : vma->vm_end) is enough.
> # I just used min_t as a simple minimum getter without thinking about the type check.

We have non-typed min() for that.

-- 
 Kirill A. Shutemov


* Re: [PATCH 3/3] mm: call vma_adjust_trans_huge() only for thp-enabled vma
  2014-02-27  4:39 ` [PATCH 3/3] mm: call vma_adjust_trans_huge() only for thp-enabled vma Naoya Horiguchi
@ 2014-02-27 21:23   ` Andrew Morton
  2014-02-27 22:08     ` Naoya Horiguchi
  2014-02-27 22:56   ` Kirill A. Shutemov
  1 sibling, 1 reply; 28+ messages in thread
From: Andrew Morton @ 2014-02-27 21:23 UTC (permalink / raw)
  To: Naoya Horiguchi; +Cc: Sasha Levin, linux-mm, linux-kernel

On Wed, 26 Feb 2014 23:39:37 -0500 Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> wrote:

> vma_adjust() is also called for vma(VM_HUGETLB), so it could happen that
> we try to split a hugetlbfs hugepage. This patch excludes that possibility.
> 

It would be nice to have a more complete changelog here please.  Under
what circumstances does this cause problems and what are the
user-observable effects?

> --- next-20140220.orig/mm/mmap.c
> +++ next-20140220/mm/mmap.c
> @@ -772,7 +772,8 @@ again:			remove_next = 1 + (end > next->vm_end);
>  		}
>  	}
>  
> -	vma_adjust_trans_huge(vma, start, end, adjust_next);
> +	if (transparent_hugepage_enabled(vma))
> +		vma_adjust_trans_huge(vma, start, end, adjust_next);
>  
>  	anon_vma = vma->anon_vma;
>  	if (!anon_vma && adjust_next)
> -- 
> 1.8.5.3


* Re: [PATCH 2/3] mm, hugetlbfs: fix rmapping for anonymous hugepages with page_pgoff()
  2014-02-27 21:19   ` Andrew Morton
@ 2014-02-27 21:53     ` Naoya Horiguchi
  2014-02-28 19:59       ` [PATCH v2] " Naoya Horiguchi
       [not found]       ` <5310ea8b.c425e00a.2cd9.ffffe097SMTPIN_ADDED_BROKEN@mx.google.com>
  0 siblings, 2 replies; 28+ messages in thread
From: Naoya Horiguchi @ 2014-02-27 21:53 UTC (permalink / raw)
  To: akpm; +Cc: sasha.levin, linux-mm, linux-kernel, riel

On Thu, Feb 27, 2014 at 01:19:57PM -0800, Andrew Morton wrote:
> On Wed, 26 Feb 2014 23:39:36 -0500 Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> wrote:
> 
> > page->index stores the pagecache index when the page is mapped into a file
> > mapping region, and the index is in pagecache-size units, so it depends on
> > the page size. Some users of reverse mapping obviously assume that
> > page->index is in PAGE_CACHE_SHIFT units, so they don't work for anonymous
> > hugepages.
> > 
> > For example, consider that we have a 3-hugepage vma and try to mbind the
> > 2nd hugepage to migrate it to another node. Then the vma is split and
> > migrate_page() is called for the 2nd hugepage (belonging to the middle vma.)
> > In the migrate operation, rmap_walk_anon() tries to find the relevant vma to
> > which the target hugepage belongs, but here we miscalculate pgoff, so
> > anon_vma_interval_tree_foreach() grabs an invalid vma, which fires VM_BUG_ON.
> > 
> > This patch introduces a new API, usable for both normal pages and hugepages,
> > to get the PAGE_SIZE offset from page->index. Users should clearly
> > distinguish page_index for the pagecache index from page_pgoff for the page
> > offset.
> 
> So this patch is really independent of the page-walker changes, but the
> page walker changes need it.  So it is appropriate that this patch be
> staged before that series, and separately.  Agree?

I think that this patch is independent of the page walker patch and vice versa.
But I agree that we can push this patch separately.

> > Reported-by: Sasha Levin <sasha.levin@oracle.com> # if the reported problem is fixed
> > Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
> > Cc: stable@vger.kernel.org # 3.12+
> 
> Why cc:stable?  This problem will only cause runtime issues when the
> page walker patches are present?

I think that we have had this bug since the "hugepage migration extension"
patches were merged in 3.12.
I did reproduce this bug on 3.14-rc1 (without the page walker patches.)

> > index 4f591df66778..a8bd14f42032 100644
> > --- next-20140220.orig/include/linux/pagemap.h
> > +++ next-20140220/include/linux/pagemap.h
> > @@ -316,6 +316,19 @@ static inline loff_t page_file_offset(struct page *page)
> >  	return ((loff_t)page_file_index(page)) << PAGE_CACHE_SHIFT;
> >  }
> >  
> > +extern pgoff_t hugepage_pgoff(struct page *page);
> > +
> > +/*
> > + * page->index stores pagecache index whose unit is not always PAGE_SIZE.
> > + * This function converts it into PAGE_SIZE offset.
> > + */
> > +#define page_pgoff(page)					\
> > +({								\
> > +	unlikely(PageHuge(page)) ?				\
> > +		hugepage_pgoff(page) :				\
> > +		page->index >> (PAGE_CACHE_SHIFT - PAGE_SHIFT);	\
> > +})
> 
> - I don't think this needs to be implemented in a macro?  Can we do
>   it in good old C?

Yes, this gives me a headache, because defining it here as a simple
C function causes a build error.
We can't use PageHuge() in include/linux/pagemap.h, but we can on the
callers' side (they include linux/hugetlb.h.)
But yes, I think this looks like a dirty workaround, so it would be nice
if we could come up with a better idea.

> - Is PageHuge() the appropriate test?
>   /*
>    * PageHuge() only returns true for hugetlbfs pages, but not for normal or
>    * transparent huge pages.  See the PageTransHuge() documentation for more
>    * details.
>    */

I think yes, at least for now, because page->index is in page units for
transparent hugepages. And as far as I know, we don't do rmapping for
transparent hugepages (callers should split thps in advance if they need to
do rmapping.) If this situation changes, we will need to change this function.

> - Should page->index be shifted right or left?  Probably left - I
>   doubt if PAGE_CACHE_SHIFT will ever be less than PAGE_SHIFT.

My big mistake, sorry. It must be a left shift.

> - I'm surprised we don't have a general what-is-this-page's-order
>   function, so you can just do
> 
> 	static inline pgoff_t page_pgoff(struct page *page)
> 	{
> 		return page->index << page_size_order(page);
> 	}
> 
>   And I think this would be a better implementation, as the (new)
>   page_size_order() could be used elsewhere.
> 
>   page_size_order() would be a crappy name - can't think of anything
>   better at present.

I can't, either. If nobody suggests a better one, I'll take yours.

Thanks,
Naoya

> > --- next-20140220.orig/mm/memory-failure.c
> > +++ next-20140220/mm/memory-failure.c
> > @@ -404,7 +404,7 @@ static void collect_procs_anon(struct page *page, struct list_head *to_kill,
> >  	if (av == NULL)	/* Not actually mapped anymore */
> >  		return;
> >  
> > -	pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
> 
> See this did a left shift.
> 
> > +	pgoff = page_pgoff(page);
> >  	read_lock(&tasklist_lock);
> >  	for_each_process (tsk) {
> >  		struct anon_vma_chain *vmac;
> > @@ -437,7 +437,7 @@ static void collect_procs_file(struct page *page, struct list_head *to_kill,
> >  	mutex_lock(&mapping->i_mmap_mutex);
> >  	read_lock(&tasklist_lock);
> >  	for_each_process(tsk) {
> > -		pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
> 
> him too.
> 
> > +		pgoff_t pgoff = page_pgoff(page);
> >  
> >  		if (!task_early_kill(tsk))
> >  			continue;
> > diff --git next-20140220.orig/mm/rmap.c next-20140220/mm/rmap.c
> > index 9056a1f00b87..78405051474a 100644
> > --- next-20140220.orig/mm/rmap.c
> > +++ next-20140220/mm/rmap.c
> > @@ -515,11 +515,7 @@ void page_unlock_anon_vma_read(struct anon_vma *anon_vma)
> >  static inline unsigned long
> >  __vma_address(struct page *page, struct vm_area_struct *vma)
> >  {
> > -	pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
> 
> And him.
> 
> > -
> > -	if (unlikely(is_vm_hugetlb_page(vma)))
> > -		pgoff = page->index << huge_page_order(page_hstate(page));
> > -
> > +	pgoff_t pgoff = page_pgoff(page);
> >  	return vma->vm_start + ((pgoff - vma->vm_pgoff) << PAGE_SHIFT);
> >  }
> >  
> > @@ -1598,7 +1594,7 @@ static struct anon_vma *rmap_walk_anon_lock(struct page *page,
> >  static int rmap_walk_anon(struct page *page, struct rmap_walk_control *rwc)
> >  {
> >  	struct anon_vma *anon_vma;
> > -	pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
> 
> again.
> 
> > +	pgoff_t pgoff = page_pgoff(page);
> >  	struct anon_vma_chain *avc;
> >  	int ret = SWAP_AGAIN;
> 


* Re: [PATCH 1/3] mm/pagewalk.c: fix end address calculation in walk_page_range()
  2014-02-27 21:20       ` Kirill A. Shutemov
@ 2014-02-27 21:54         ` Naoya Horiguchi
  0 siblings, 0 replies; 28+ messages in thread
From: Naoya Horiguchi @ 2014-02-27 21:54 UTC (permalink / raw)
  To: kirill; +Cc: akpm, sasha.levin, linux-mm, linux-kernel

On Thu, Feb 27, 2014 at 11:20:34PM +0200, Kirill A. Shutemov wrote:
> On Thu, Feb 27, 2014 at 04:19:01PM -0500, Naoya Horiguchi wrote:
> > On Thu, Feb 27, 2014 at 01:03:23PM -0800, Andrew Morton wrote:
> > > On Wed, 26 Feb 2014 23:39:35 -0500 Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> wrote:
> > > 
> > > > When walking within a vma, walk_page_range() tries to walk until
> > > > vma->vm_end even if the given end address comes before that point.
> > > > So this patch takes the smaller of the two as the end address.
> > > > 
> > > > ...
> > > >
> > > > --- next-20140220.orig/mm/pagewalk.c
> > > > +++ next-20140220/mm/pagewalk.c
> > > > @@ -321,8 +321,9 @@ int walk_page_range(unsigned long start, unsigned long end,
> > > >  			next = vma->vm_start;
> > > >  		} else { /* inside the found vma */
> > > >  			walk->vma = vma;
> > > > -			next = vma->vm_end;
> > > > -			err = walk_page_test(start, end, walk);
> > > > +			next = min_t(unsigned long, end, vma->vm_end);
> > > 
> > > min_t is unneeded, isn't it?  Everything here has type unsigned long.
> > 
> > Yes, so simply (end < vma->vm_end ? end : vma->vm_end) is enough.
> > # I just used min_t as a simple minimum getter without thinking about the type check.
> 
> We have non-typed min() for that.

Thanks. This is what I wanted :)
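
So presumably the folded hunk will end up reading simply:

	next = min(end, vma->vm_end);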


* Re: [PATCH 3/3] mm: call vma_adjust_trans_huge() only for thp-enabled vma
  2014-02-27 21:23   ` Andrew Morton
@ 2014-02-27 22:08     ` Naoya Horiguchi
  0 siblings, 0 replies; 28+ messages in thread
From: Naoya Horiguchi @ 2014-02-27 22:08 UTC (permalink / raw)
  To: akpm; +Cc: sasha.levin, linux-mm, linux-kernel

On Thu, Feb 27, 2014 at 01:23:50PM -0800, Andrew Morton wrote:
> On Wed, 26 Feb 2014 23:39:37 -0500 Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> wrote:
> 
> > vma_adjust() is also called for vma(VM_HUGETLB), so it could happen that
> > we try to split a hugetlbfs hugepage. This patch excludes that possibility.
> > 
> 
> It would be nice to have a more complete changelog here please.  Under
> what circumstances does this cause problems and what are the
> user-observable effects?

I have seen no user-visible problem.
This is a kind of precaution to ensure that transparent hugepage code
is not called for hugetlbfs workloads.
.. but I think that this patch doesn't work as I intended, because
transparent_hugepage_enabled() can return true for vma(VM_HUGETLB).
So I am separating this patch out from the series for now.

Thanks,
Naoya Horiguchi


* Re: [PATCH 3/3] mm: call vma_adjust_trans_huge() only for thp-enabled vma
  2014-02-27  4:39 ` [PATCH 3/3] mm: call vma_adjust_trans_huge() only for thp-enabled vma Naoya Horiguchi
  2014-02-27 21:23   ` Andrew Morton
@ 2014-02-27 22:56   ` Kirill A. Shutemov
  1 sibling, 0 replies; 28+ messages in thread
From: Kirill A. Shutemov @ 2014-02-27 22:56 UTC (permalink / raw)
  To: Naoya Horiguchi; +Cc: Sasha Levin, Andrew Morton, linux-mm, linux-kernel

On Wed, Feb 26, 2014 at 11:39:37PM -0500, Naoya Horiguchi wrote:
> vma_adjust() is also called for vma(VM_HUGETLB), so it could happen that
> we try to split a hugetlbfs hugepage. This patch excludes that possibility.

NAK.

It can't happen: vma_adjust_trans_huge() checks vma->vm_ops, and hugetlb
VMAs always have it set, unlike THP.
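
For reference, the early return in question is the inline wrapper in
include/linux/huge_mm.h; a sketch reproduced from memory of this era's
source, not a verbatim quote:

	static inline void vma_adjust_trans_huge(struct vm_area_struct *vma,
						 unsigned long start,
						 unsigned long end,
						 long adjust_next)
	{
		/* hugetlb vmas have vm_ops set, so they bail out here */
		if (!vma->anon_vma || vma->vm_ops)
			return;
		__vma_adjust_trans_huge(vma, start, end, adjust_next);
	}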

-- 
 Kirill A. Shutemov


* [PATCH v2] mm, hugetlbfs: fix rmapping for anonymous hugepages with page_pgoff()
  2014-02-27 21:53     ` Naoya Horiguchi
@ 2014-02-28 19:59       ` Naoya Horiguchi
       [not found]       ` <5310ea8b.c425e00a.2cd9.ffffe097SMTPIN_ADDED_BROKEN@mx.google.com>
  1 sibling, 0 replies; 28+ messages in thread
From: Naoya Horiguchi @ 2014-02-28 19:59 UTC (permalink / raw)
  To: akpm; +Cc: sasha.levin, linux-mm, linux-kernel, riel

page->index stores the pagecache index when the page is mapped into a file
mapping region, and the index is in pagecache-size units, so it depends on
the page size. Some users of reverse mapping obviously assume that
page->index is in PAGE_CACHE_SHIFT units, so they don't work for anonymous
hugepages.

For example, consider that we have a 3-hugepage vma and try to mbind the
2nd hugepage to migrate it to another node. Then the vma is split and
migrate_page() is called for the 2nd hugepage (belonging to the middle vma.)
In the migrate operation, rmap_walk_anon() tries to find the relevant vma to
which the target hugepage belongs, but here we miscalculate pgoff, so
anon_vma_interval_tree_foreach() grabs an invalid vma, which fires VM_BUG_ON.

This patch introduces a new API, usable for both normal pages and hugepages,
to get the PAGE_SIZE offset from page->index. Users should clearly
distinguish page_index for the pagecache index from page_pgoff for the page
offset.

ChangeLog v2:
- fix wrong shift direction
- introduce page_size_order() and huge_page_size_order()
- move the declaration of PageHuge() to include/linux/hugetlb_inline.h
  to avoid macro definition.

Reported-by: Sasha Levin <sasha.levin@oracle.com> # if the reported problem is fixed
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: stable@vger.kernel.org # 3.12+
---
 include/linux/hugetlb.h        |  7 -------
 include/linux/hugetlb_inline.h | 13 +++++++++++++
 include/linux/pagemap.h        | 16 ++++++++++++++++
 mm/huge_memory.c               |  2 +-
 mm/hugetlb.c                   |  6 ++++++
 mm/memory-failure.c            |  4 ++--
 mm/rmap.c                      |  8 ++------
 7 files changed, 40 insertions(+), 16 deletions(-)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 8c43cc469d78..91fffa4fbc57 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -31,8 +31,6 @@ extern int hugetlb_max_hstate __read_mostly;
 struct hugepage_subpool *hugepage_new_subpool(long nr_blocks);
 void hugepage_put_subpool(struct hugepage_subpool *spool);
 
-int PageHuge(struct page *page);
-
 void reset_vma_resv_huge_pages(struct vm_area_struct *vma);
 int hugetlb_sysctl_handler(struct ctl_table *, int, void __user *, size_t *, loff_t *);
 int hugetlb_overcommit_handler(struct ctl_table *, int, void __user *, size_t *, loff_t *);
@@ -99,11 +97,6 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
 
 #else /* !CONFIG_HUGETLB_PAGE */
 
-static inline int PageHuge(struct page *page)
-{
-	return 0;
-}
-
 static inline void reset_vma_resv_huge_pages(struct vm_area_struct *vma)
 {
 }
diff --git a/include/linux/hugetlb_inline.h b/include/linux/hugetlb_inline.h
index 2bb681fbeb35..e939975f0cba 100644
--- a/include/linux/hugetlb_inline.h
+++ b/include/linux/hugetlb_inline.h
@@ -10,6 +10,9 @@ static inline int is_vm_hugetlb_page(struct vm_area_struct *vma)
 	return !!(vma->vm_flags & VM_HUGETLB);
 }
 
+int PageHuge(struct page *page);
+unsigned int huge_page_size_order(struct page *page);
+
 #else
 
 static inline int is_vm_hugetlb_page(struct vm_area_struct *vma)
@@ -17,6 +20,16 @@ static inline int is_vm_hugetlb_page(struct vm_area_struct *vma)
 	return 0;
 }
 
+static inline int PageHuge(struct page *page)
+{
+	return 0;
+}
+
+static inline unsigned int huge_page_size_order(struct page *page)
+{
+	return 0;
+}
+
 #endif
 
 #endif
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index f0ef8826acf1..4a01dd3c304d 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -307,6 +307,22 @@ static inline loff_t page_file_offset(struct page *page)
 	return ((loff_t)page_file_index(page)) << PAGE_CACHE_SHIFT;
 }
 
+static inline unsigned int page_size_order(struct page *page)
+{
+	return unlikely(PageHuge(page)) ?
+		huge_page_size_order(page) :
+		(PAGE_CACHE_SHIFT - PAGE_SHIFT);
+}
+
+/*
+ * page->index stores pagecache index whose unit is not always PAGE_SIZE.
+ * This function converts it into PAGE_SIZE offset.
+ */
+static inline pgoff_t page_pgoff(struct page *page)
+{
+        return page->index << page_size_order(page);
+}
+
 extern pgoff_t linear_hugepage_index(struct vm_area_struct *vma,
 				     unsigned long address);
 
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 82166bf974e1..12faa32f4d6d 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1877,7 +1877,7 @@ static void __split_huge_page(struct page *page,
 			      struct list_head *list)
 {
 	int mapcount, mapcount2;
-	pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
+	pgoff_t pgoff = page_pgoff(page);
 	struct anon_vma_chain *avc;
 
 	BUG_ON(!PageHead(page));
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index c01cb9fedb18..39d7461e9ba7 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -29,6 +29,7 @@
 
 #include <linux/io.h>
 #include <linux/hugetlb.h>
+#include <linux/hugetlb_inline.h>
 #include <linux/hugetlb_cgroup.h>
 #include <linux/node.h>
 #include "internal.h"
@@ -727,6 +728,11 @@ pgoff_t __basepage_index(struct page *page)
 	return (index << compound_order(page_head)) + compound_idx;
 }
 
+unsigned int huge_page_size_order(struct page *page)
+{
+	return huge_page_order(page_hstate(page));
+}
+
 static struct page *alloc_fresh_huge_page_node(struct hstate *h, int nid)
 {
 	struct page *page;
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 47e500c2f258..3be725304703 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -409,7 +409,7 @@ static void collect_procs_anon(struct page *page, struct list_head *to_kill,
 	if (av == NULL)	/* Not actually mapped anymore */
 		return;
 
-	pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
+	pgoff = page_pgoff(page);
 	read_lock(&tasklist_lock);
 	for_each_process (tsk) {
 		struct anon_vma_chain *vmac;
@@ -442,7 +442,7 @@ static void collect_procs_file(struct page *page, struct list_head *to_kill,
 	mutex_lock(&mapping->i_mmap_mutex);
 	read_lock(&tasklist_lock);
 	for_each_process(tsk) {
-		pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
+		pgoff_t pgoff = page_pgoff(page);
 
 		if (!task_early_kill(tsk))
 			continue;
diff --git a/mm/rmap.c b/mm/rmap.c
index d9d42316a99a..d3538a913285 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -515,11 +515,7 @@ void page_unlock_anon_vma_read(struct anon_vma *anon_vma)
 static inline unsigned long
 __vma_address(struct page *page, struct vm_area_struct *vma)
 {
-	pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
-
-	if (unlikely(is_vm_hugetlb_page(vma)))
-		pgoff = page->index << huge_page_order(page_hstate(page));
-
+	pgoff_t pgoff = page_pgoff(page);
 	return vma->vm_start + ((pgoff - vma->vm_pgoff) << PAGE_SHIFT);
 }
 
@@ -1588,7 +1584,7 @@ static struct anon_vma *rmap_walk_anon_lock(struct page *page,
 static int rmap_walk_anon(struct page *page, struct rmap_walk_control *rwc)
 {
 	struct anon_vma *anon_vma;
-	pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
+	pgoff_t pgoff = page_pgoff(page);
 	struct anon_vma_chain *avc;
 	int ret = SWAP_AGAIN;
 
-- 
1.8.5.3


* Re: [PATCH v2] mm, hugetlbfs: fix rmapping for anonymous hugepages with page_pgoff()
       [not found]       ` <5310ea8b.c425e00a.2cd9.ffffe097SMTPIN_ADDED_BROKEN@mx.google.com>
@ 2014-02-28 23:14         ` Andrew Morton
  2014-03-01  3:35           ` [PATCH v3] " Naoya Horiguchi
       [not found]           ` <1393644926-49vw3qw9@n-horiguchi@ah.jp.nec.com>
  0 siblings, 2 replies; 28+ messages in thread
From: Andrew Morton @ 2014-02-28 23:14 UTC (permalink / raw)
  To: Naoya Horiguchi; +Cc: sasha.levin, linux-mm, linux-kernel, riel

On Fri, 28 Feb 2014 14:59:02 -0500 Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> wrote:

> page->index stores the pagecache index when the page is mapped into a file
> mapping region, and the index is in pagecache-size units, so it depends on
> the page size. Some users of reverse mapping obviously assume that
> page->index is in PAGE_CACHE_SHIFT units, so they don't work for anonymous
> hugepages.
> 
> For example, consider that we have a 3-hugepage vma and try to mbind the
> 2nd hugepage to migrate it to another node. Then the vma is split and
> migrate_page() is called for the 2nd hugepage (belonging to the middle vma.)
> In the migrate operation, rmap_walk_anon() tries to find the relevant vma to
> which the target hugepage belongs, but here we miscalculate pgoff, so
> anon_vma_interval_tree_foreach() grabs an invalid vma, which fires VM_BUG_ON.
> 
> This patch introduces a new API, usable for both normal pages and hugepages,
> to get the PAGE_SIZE offset from page->index. Users should clearly
> distinguish page_index for the pagecache index from page_pgoff for the page
> offset.
> 
> ..
>
> --- a/include/linux/pagemap.h
> +++ b/include/linux/pagemap.h
> @@ -307,6 +307,22 @@ static inline loff_t page_file_offset(struct page *page)
>  	return ((loff_t)page_file_index(page)) << PAGE_CACHE_SHIFT;
>  }
>  
> +static inline unsigned int page_size_order(struct page *page)
> +{
> +	return unlikely(PageHuge(page)) ?
> +		huge_page_size_order(page) :
> +		(PAGE_CACHE_SHIFT - PAGE_SHIFT);
> +}

Could use some nice documentation, please.  Why it exists, what it
does.  Particularly: what sort of pages it can and can't operate on,
and why.

The presence of PAGE_CACHE_SIZE is unfortunate - it at least implies
that the page is a pagecache page.  I dunno, maybe just use "0"?


* [PATCH v3] mm, hugetlbfs: fix rmapping for anonymous hugepages with page_pgoff()
  2014-02-28 23:14         ` Andrew Morton
@ 2014-03-01  3:35           ` Naoya Horiguchi
       [not found]           ` <1393644926-49vw3qw9@n-horiguchi@ah.jp.nec.com>
  1 sibling, 0 replies; 28+ messages in thread
From: Naoya Horiguchi @ 2014-03-01  3:35 UTC (permalink / raw)
  To: akpm; +Cc: sasha.levin, linux-mm, linux-kernel, riel

On Fri, Feb 28, 2014 at 03:14:27PM -0800, Andrew Morton wrote:
> On Fri, 28 Feb 2014 14:59:02 -0500 Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> wrote:
> 
> > page->index stores the pagecache index when the page is mapped into a file
> > mapping region, and the index is in pagecache-size units, so it depends on
> > the page size. Some users of reverse mapping obviously assume that
> > page->index is in PAGE_CACHE_SHIFT units, so they don't work for anonymous
> > hugepages.
> > 
> > For example, consider that we have a 3-hugepage vma and try to mbind the
> > 2nd hugepage to migrate it to another node. Then the vma is split and
> > migrate_page() is called for the 2nd hugepage (belonging to the middle vma.)
> > In the migrate operation, rmap_walk_anon() tries to find the relevant vma to
> > which the target hugepage belongs, but here we miscalculate pgoff, so
> > anon_vma_interval_tree_foreach() grabs an invalid vma, which fires VM_BUG_ON.
> > 
> > This patch introduces a new API, usable for both normal pages and hugepages,
> > to get the PAGE_SIZE offset from page->index. Users should clearly
> > distinguish page_index for the pagecache index from page_pgoff for the page
> > offset.
> > 
> > ..
> >
> > --- a/include/linux/pagemap.h
> > +++ b/include/linux/pagemap.h
> > @@ -307,6 +307,22 @@ static inline loff_t page_file_offset(struct page *page)
> >  	return ((loff_t)page_file_index(page)) << PAGE_CACHE_SHIFT;
> >  }
> >  
> > +static inline unsigned int page_size_order(struct page *page)
> > +{
> > +	return unlikely(PageHuge(page)) ?
> > +		huge_page_size_order(page) :

I found that we have compound_order(page) for the same purpose, so we don't
have to define this new function.
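(For a hugetlb head page, compound_order(compound_head(page)) returns the
same value as huge_page_order(page_hstate(page)); with 2MB hugepages, for
instance, both evaluate to 9.)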

> > +		(PAGE_CACHE_SHIFT - PAGE_SHIFT);
> > +}
> 
> Could use some nice documentation, please.  Why it exists, what it
> does.  Particularly: what sort of pages it can and can't operate on,
> and why.

OK.

> The presence of PAGE_CACHE_SIZE is unfortunate - it at least implies
> that the page is a pagecache page.  I dunno, maybe just use "0"?

Yes, PAGE_CACHE_SHIFT makes the code messy if PAGE_CACHE_SHIFT is always
PAGE_SHIFT. But I guess that people have recently started thinking about
changing the pagecache size (in the discussion around >4kB sector devices.)
And from readability's perspective, "pagecache size" and "page size" are
different things, so keeping it is better in the long run.

Anyway, I revised the patch again, could you take a look?

Thanks,
Naoya
---
From: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Date: Fri, 28 Feb 2014 21:56:24 -0500
Subject: [PATCH] mm, hugetlbfs: fix rmapping for anonymous hugepages with
 page_pgoff()

page->index stores the pagecache index when the page is mapped into a file
mapping region, and the index is in pagecache-size units, so it depends on
the page size. Some users of reverse mapping obviously assume that
page->index is in PAGE_CACHE_SHIFT units, so they don't work for anonymous
hugepages.

For example, consider that we have a 3-hugepage vma and try to mbind the
2nd hugepage to migrate it to another node. Then the vma is split and
migrate_page() is called for the 2nd hugepage (belonging to the middle vma.)
In the migrate operation, rmap_walk_anon() tries to find the relevant vma to
which the target hugepage belongs, but here we miscalculate pgoff, so
anon_vma_interval_tree_foreach() grabs an invalid vma, which fires VM_BUG_ON.

This patch introduces a new API, usable for both normal pages and hugepages,
to get the PAGE_SIZE offset from page->index. Users should clearly
distinguish page_index for the pagecache index from page_pgoff for the page
offset.

ChangeLog v3:
- add comment on page_size_order()
- use compound_order(compound_head(page)) instead of huge_page_size_order()
- use page_pgoff() in rmap_walk_file() too
- use page_size_order() in kill_proc()
- fix space indent

ChangeLog v2:
- fix wrong shift direction
- introduce page_size_order() and huge_page_size_order()
- move the declaration of PageHuge() to include/linux/hugetlb_inline.h
  to avoid macro definition.

Reported-by: Sasha Levin <sasha.levin@oracle.com> # if the reported problem is fixed
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: stable@vger.kernel.org # 3.12+
---
 include/linux/hugetlb.h        |  7 -------
 include/linux/hugetlb_inline.h |  7 +++++++
 include/linux/pagemap.h        | 28 ++++++++++++++++++++++++++++
 mm/huge_memory.c               |  2 +-
 mm/hugetlb.c                   |  1 +
 mm/memory-failure.c            |  6 +++---
 mm/rmap.c                      | 10 +++-------
 7 files changed, 43 insertions(+), 18 deletions(-)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 8c43cc469d78..91fffa4fbc57 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -31,8 +31,6 @@ extern int hugetlb_max_hstate __read_mostly;
 struct hugepage_subpool *hugepage_new_subpool(long nr_blocks);
 void hugepage_put_subpool(struct hugepage_subpool *spool);
 
-int PageHuge(struct page *page);
-
 void reset_vma_resv_huge_pages(struct vm_area_struct *vma);
 int hugetlb_sysctl_handler(struct ctl_table *, int, void __user *, size_t *, loff_t *);
 int hugetlb_overcommit_handler(struct ctl_table *, int, void __user *, size_t *, loff_t *);
@@ -99,11 +97,6 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
 
 #else /* !CONFIG_HUGETLB_PAGE */
 
-static inline int PageHuge(struct page *page)
-{
-	return 0;
-}
-
 static inline void reset_vma_resv_huge_pages(struct vm_area_struct *vma)
 {
 }
diff --git a/include/linux/hugetlb_inline.h b/include/linux/hugetlb_inline.h
index 2bb681fbeb35..4d60c82e9fda 100644
--- a/include/linux/hugetlb_inline.h
+++ b/include/linux/hugetlb_inline.h
@@ -10,6 +10,8 @@ static inline int is_vm_hugetlb_page(struct vm_area_struct *vma)
 	return !!(vma->vm_flags & VM_HUGETLB);
 }
 
+int PageHuge(struct page *page);
+
 #else
 
 static inline int is_vm_hugetlb_page(struct vm_area_struct *vma)
@@ -17,6 +19,11 @@ static inline int is_vm_hugetlb_page(struct vm_area_struct *vma)
 	return 0;
 }
 
+static inline int PageHuge(struct page *page)
+{
+	return 0;
+}
+
 #endif
 
 #endif
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index f0ef8826acf1..715962f7ea7a 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -307,6 +307,34 @@ static inline loff_t page_file_offset(struct page *page)
 	return ((loff_t)page_file_index(page)) << PAGE_CACHE_SHIFT;
 }
 
+/*
+ * Get the page order of a given page in the context of the pagecache to
+ * which the page belongs.
+ *
+ * Pagecache unit size is not a fixed value (hugetlbfs is an example), but
+ * the vma_interval_tree and anon_vma_interval_tree APIs assume that their
+ * indices are in PAGE_SIZE units. So this routine helps us get normalized indices.
+ *
+ * This routine should be called only for pagecache pages/hugepages and
+ * anonymous pages/hugepages, because pagecache unit size is irrelevant
+ * except for those pages.
+ */
+static inline unsigned int page_size_order(struct page *page)
+{
+	return unlikely(PageHuge(page)) ?
+		compound_order(compound_head(page)) :
+		(PAGE_CACHE_SHIFT - PAGE_SHIFT);
+}
+
+/*
+ * page->index stores pagecache index whose unit is not always PAGE_SIZE.
+ * This function converts it into PAGE_SIZE offset.
+ */
+static inline pgoff_t page_pgoff(struct page *page)
+{
+	return page->index << page_size_order(page);
+}
+
 extern pgoff_t linear_hugepage_index(struct vm_area_struct *vma,
 				     unsigned long address);
 
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 82166bf974e1..12faa32f4d6d 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1877,7 +1877,7 @@ static void __split_huge_page(struct page *page,
 			      struct list_head *list)
 {
 	int mapcount, mapcount2;
-	pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
+	pgoff_t pgoff = page_pgoff(page);
 	struct anon_vma_chain *avc;
 
 	BUG_ON(!PageHead(page));
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index c01cb9fedb18..7222247a590b 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -29,6 +29,7 @@
 
 #include <linux/io.h>
 #include <linux/hugetlb.h>
+#include <linux/hugetlb_inline.h>
 #include <linux/hugetlb_cgroup.h>
 #include <linux/node.h>
 #include "internal.h"
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 47e500c2f258..516dd4202248 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -207,7 +207,7 @@ static int kill_proc(struct task_struct *t, unsigned long addr, int trapno,
 #ifdef __ARCH_SI_TRAPNO
 	si.si_trapno = trapno;
 #endif
-	si.si_addr_lsb = compound_order(compound_head(page)) + PAGE_SHIFT;
+	si.si_addr_lsb = page_size_order(page) + PAGE_SHIFT;
 
 	if ((flags & MF_ACTION_REQUIRED) && t == current) {
 		si.si_code = BUS_MCEERR_AR;
@@ -409,7 +409,7 @@ static void collect_procs_anon(struct page *page, struct list_head *to_kill,
 	if (av == NULL)	/* Not actually mapped anymore */
 		return;
 
-	pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
+	pgoff = page_pgoff(page);
 	read_lock(&tasklist_lock);
 	for_each_process (tsk) {
 		struct anon_vma_chain *vmac;
@@ -442,7 +442,7 @@ static void collect_procs_file(struct page *page, struct list_head *to_kill,
 	mutex_lock(&mapping->i_mmap_mutex);
 	read_lock(&tasklist_lock);
 	for_each_process(tsk) {
-		pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
+		pgoff_t pgoff = page_pgoff(page);
 
 		if (!task_early_kill(tsk))
 			continue;
diff --git a/mm/rmap.c b/mm/rmap.c
index d9d42316a99a..e344012cc20a 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -515,11 +515,7 @@ void page_unlock_anon_vma_read(struct anon_vma *anon_vma)
 static inline unsigned long
 __vma_address(struct page *page, struct vm_area_struct *vma)
 {
-	pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
-
-	if (unlikely(is_vm_hugetlb_page(vma)))
-		pgoff = page->index << huge_page_order(page_hstate(page));
-
+	pgoff_t pgoff = page_pgoff(page);
 	return vma->vm_start + ((pgoff - vma->vm_pgoff) << PAGE_SHIFT);
 }
 
@@ -1588,7 +1584,7 @@ static struct anon_vma *rmap_walk_anon_lock(struct page *page,
 static int rmap_walk_anon(struct page *page, struct rmap_walk_control *rwc)
 {
 	struct anon_vma *anon_vma;
-	pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
+	pgoff_t pgoff = page_pgoff(page);
 	struct anon_vma_chain *avc;
 	int ret = SWAP_AGAIN;
 
@@ -1629,7 +1625,7 @@ static int rmap_walk_anon(struct page *page, struct rmap_walk_control *rwc)
 static int rmap_walk_file(struct page *page, struct rmap_walk_control *rwc)
 {
 	struct address_space *mapping = page->mapping;
-	pgoff_t pgoff = page->index << compound_order(page);
+	pgoff_t pgoff = page_pgoff(page);
 	struct vm_area_struct *vma;
 	int ret = SWAP_AGAIN;
 
-- 
1.8.5.3
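
A worked example of the arithmetic above, as a standalone userspace sketch
rather than kernel code. The PAGE_SHIFT/PAGE_CACHE_SHIFT values and the 2MB
hugepage order are assumptions for x86_64 with 4kB base pages, and
pgoff_of()/vma_address() are hypothetical stand-ins for page_pgoff() and
__vma_address():

	#include <stdio.h>

	#define PAGE_SHIFT		12	/* assumed: 4kB base pages */
	#define PAGE_CACHE_SHIFT	12	/* assumed: pagecache unit == base page */
	#define HPAGE_ORDER		9	/* assumed: 2MB hugepages */

	/* Stand-in for page_pgoff(): convert page->index to a PAGE_SIZE offset. */
	static unsigned long pgoff_of(unsigned long index, int is_huge)
	{
		unsigned int order = is_huge ? HPAGE_ORDER
					     : (PAGE_CACHE_SHIFT - PAGE_SHIFT);
		return index << order;
	}

	/* Stand-in for __vma_address(): map a PAGE_SIZE pgoff to a user address. */
	static unsigned long vma_address(unsigned long vm_start,
					 unsigned long vm_pgoff, unsigned long pgoff)
	{
		return vm_start + ((pgoff - vm_pgoff) << PAGE_SHIFT);
	}

	int main(void)
	{
		unsigned long pgoff;

		/* Normal page: order 0, so pgoff == page->index. */
		printf("normal page, index 3 -> pgoff %lu\n", pgoff_of(3, 0));

		/* 2MB hugepage: index 3 -> 3 << 9 == 1536 in 4kB units. */
		pgoff = pgoff_of(3, 1);
		printf("hugepage,    index 3 -> pgoff %lu\n", pgoff);

		/* Hypothetical vma at 0x700000000000 with vm_pgoff == 0. */
		printf("vma address: 0x%lx\n",
		       vma_address(0x700000000000UL, 0, pgoff));
		return 0;
	}

For an anonymous hugepage, page->index is already in hugepage-size units, so
the old unconditional (PAGE_CACHE_SHIFT - PAGE_SHIFT) shift would have produced
pgoff 3 instead of 1536 here, which is exactly the miscalculation that made the
rmap code search the wrong range of the anon_vma interval tree.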


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* Re: [PATCH v3] mm, hugetlbfs: fix rmapping for anonymous hugepages with page_pgoff()
       [not found]           ` <1393644926-49vw3qw9@n-horiguchi@ah.jp.nec.com>
@ 2014-03-01 23:08             ` Sasha Levin
  2014-03-03  5:02               ` [PATCH] mm: add pte_present() check on existing hugetlb_entry callbacks Naoya Horiguchi
  0 siblings, 1 reply; 28+ messages in thread
From: Sasha Levin @ 2014-03-01 23:08 UTC (permalink / raw)
  To: Naoya Horiguchi, akpm; +Cc: linux-mm, linux-kernel, riel

On 02/28/2014 10:35 PM, Naoya Horiguchi wrote:
> On Fri, Feb 28, 2014 at 03:14:27PM -0800, Andrew Morton wrote:
>> On Fri, 28 Feb 2014 14:59:02 -0500 Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> wrote:
>>
>>> page->index stores the pagecache index when the page is mapped into a file
>>> mapping region, and the index is in pagecache-size units, so it depends on
>>> the page size. Some users of reverse mapping obviously assume that
>>> page->index is in PAGE_CACHE_SHIFT units, so they don't work for anonymous
>>> hugepages.
>>>
>>> For example, consider that we have a 3-hugepage vma and try to mbind the 2nd
>>> hugepage to migrate it to another node. Then the vma is split and
>>> migrate_page() is called for the 2nd hugepage (belonging to the middle vma).
>>> In the migrate operation, rmap_walk_anon() tries to find the relevant vma to
>>> which the target hugepage belongs, but here we miscalculate the pgoff.
>>> So anon_vma_interval_tree_foreach() grabs an invalid vma, which fires VM_BUG_ON.
>>>
>>> This patch introduces a new API, usable for both normal pages and hugepages,
>>> to get the PAGE_SIZE offset from page->index. Users should clearly
>>> distinguish page_index (the pagecache index) from page_pgoff (the page offset).
>>>
>>> ..
>>>
>>> --- a/include/linux/pagemap.h
>>> +++ b/include/linux/pagemap.h
>>> @@ -307,6 +307,22 @@ static inline loff_t page_file_offset(struct page *page)
>>>   	return ((loff_t)page_file_index(page)) << PAGE_CACHE_SHIFT;
>>>   }
>>>   
>>> +static inline unsigned int page_size_order(struct page *page)
>>> +{
>>> +	return unlikely(PageHuge(page)) ?
>>> +		huge_page_size_order(page) :
> 
> I found that we have compound_order(page) for the same purpose, so we don't
> have to define this new function.
> 
>>> +		(PAGE_CACHE_SHIFT - PAGE_SHIFT);
>>> +}
>>
>> Could use some nice documentation, please.  Why it exists, what it
>> does.  Particularly: what sort of pages it can and can't operate on,
>> and why.
> 
> OK.
> 
>> The presence of PAGE_CACHE_SIZE is unfortunate - it at least implies
>> that the page is a pagecache page.  I dunno, maybe just use "0"?
> 
> Yes, PAGE_CACHE_SHIFT makes the code messy if PAGE_CACHE_SHIFT is always
> PAGE_SHIFT. But I guess people have recently started thinking about changing
> the pagecache size (in the discussion around >4kB sector devices).
> And from a readability perspective, "pagecache size" and "page size" are
> different things, so keeping it is better in the long run.
> 
> Anyway, I revised the patch again; could you take a look?
> 
> Thanks,
> Naoya
> ---
> From: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
> Date: Fri, 28 Feb 2014 21:56:24 -0500
> Subject: [PATCH] mm, hugetlbfs: fix rmapping for anonymous hugepages with
>   page_pgoff()
> 
> page->index stores the pagecache index when the page is mapped into a file
> mapping region, and the index is in pagecache-size units, so it depends on
> the page size. Some users of reverse mapping obviously assume that
> page->index is in PAGE_CACHE_SHIFT units, so they don't work for anonymous
> hugepages.
> 
> For example, consider that we have a 3-hugepage vma and try to mbind the 2nd
> hugepage to migrate it to another node. Then the vma is split and
> migrate_page() is called for the 2nd hugepage (belonging to the middle vma).
> In the migrate operation, rmap_walk_anon() tries to find the relevant vma to
> which the target hugepage belongs, but here we miscalculate the pgoff.
> So anon_vma_interval_tree_foreach() grabs an invalid vma, which fires VM_BUG_ON.
> 
> This patch introduces a new API, usable for both normal pages and hugepages,
> to get the PAGE_SIZE offset from page->index. Users should clearly
> distinguish page_index (the pagecache index) from page_pgoff (the page offset).
> 
> ChangeLog v3:
> - add comment on page_size_order()
> - use compound_order(compound_head(page)) instead of huge_page_size_order()
> - use page_pgoff() in rmap_walk_file() too
> - use page_size_order() in kill_proc()
> - fix space indent
> 
> ChangeLog v2:
> - fix wrong shift direction
> - introduce page_size_order() and huge_page_size_order()
> - move the declaration of PageHuge() to include/linux/hugetlb_inline.h
>    to avoid macro definition.
> 
> Reported-by: Sasha Levin <sasha.levin@oracle.com> # if the reported problem is fixed
> Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
> Cc: stable@vger.kernel.org # 3.12+

I can confirm that with this patch the lockdep issue is gone. However, the NULL deref in
walk_pte_range() and the BUG at mm/hugemem.c:3580 still appear.


Thanks,
Sasha
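
For reference, the 3-hugepage scenario described in the changelog above can be
driven from userspace roughly as follows. This is a minimal sketch, not a
reproducer taken from the thread; it assumes x86_64 with 2MB hugepages, some
hugepages reserved via vm.nr_hugepages, a NUMA node 1 to bind to, and linking
with -lnuma for the mbind() wrapper:

	#define _GNU_SOURCE
	#include <sys/mman.h>
	#include <numaif.h>

	#define HPS	(2UL << 20)	/* assumed 2MB hugepage size */

	int main(void)
	{
		char *p = mmap(NULL, 3 * HPS, PROT_READ | PROT_WRITE,
			       MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
		unsigned long nmask = 1UL << 1;	/* nodemask: node 1 only */

		if (p == MAP_FAILED)
			return 1;
		p[0] = p[HPS] = p[2 * HPS] = 1;	/* fault in all three hugepages */

		/* Bind only the middle hugepage: this splits the vma into three
		 * and queues the 2nd hugepage for migration, which is the case
		 * the changelog describes (error handling omitted). */
		mbind(p + HPS, HPS, MPOL_BIND, &nmask, sizeof(nmask) * 8,
		      MPOL_MF_MOVE | MPOL_MF_STRICT);
		return 0;
	}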


^ permalink raw reply	[flat|nested] 28+ messages in thread

* [PATCH] mm: add pte_present() check on existing hugetlb_entry callbacks
  2014-03-01 23:08             ` Sasha Levin
@ 2014-03-03  5:02               ` Naoya Horiguchi
  2014-03-03 20:06                 ` Sasha Levin
  0 siblings, 1 reply; 28+ messages in thread
From: Naoya Horiguchi @ 2014-03-03  5:02 UTC (permalink / raw)
  To: Sasha Levin, Andrew Morton; +Cc: linux-mm, linux-kernel, Rik van Riel

Hi Sasha,

> I can confirm that with this patch the lockdep issue is gone. However, the NULL deref in
> walk_pte_range() and the BUG at mm/hugemem.c:3580 still appear.

I spotted the cause of this problem.
Could you try testing if this patch fixes it?

Thanks,
Naoya
---
The page table walker doesn't check for non-present hugetlb entries in the
common path, so hugetlb_entry() callbacks must check for them. The reason for
this behavior is that some callers want to handle them in their own way.

However, some callers don't check now, which causes unpredictable results,
for example when we have a race between hugepage migration and reading
/proc/pid/numa_maps. This patch fixes it by adding pte_present checks to the
buggy callbacks.

This bug has existed for a long time and became visible with the introduction
of hugepage migration.

Reported-by: Sasha Levin <sasha.levin@oracle.com>
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: stable@vger.kernel.org # 3.12+
---
 fs/proc/task_mmu.c | 3 +++
 mm/mempolicy.c     | 6 +++++-
 2 files changed, 8 insertions(+), 1 deletion(-)

diff --git next-20140228.orig/fs/proc/task_mmu.c next-20140228/fs/proc/task_mmu.c
index 3746d89c768b..a4cecadce867 100644
--- next-20140228.orig/fs/proc/task_mmu.c
+++ next-20140228/fs/proc/task_mmu.c
@@ -1299,6 +1299,9 @@ static int gather_hugetlb_stats(pte_t *pte, unsigned long addr,
 	if (pte_none(*pte))
 		return 0;
 
+	if (pte_present(*pte))
+		return 0;
+
 	page = pte_page(*pte);
 	if (!page)
 		return 0;
diff --git next-20140228.orig/mm/mempolicy.c next-20140228/mm/mempolicy.c
index c0d1cbd68790..1e171186ee6d 100644
--- next-20140228.orig/mm/mempolicy.c
+++ next-20140228/mm/mempolicy.c
@@ -524,8 +524,12 @@ static int queue_pages_hugetlb(pte_t *pte, unsigned long addr,
 	unsigned long flags = qp->flags;
 	int nid;
 	struct page *page;
+	pte_t entry;
 
-	page = pte_page(huge_ptep_get(pte));
+	entry = huge_ptep_get(pte);
+	if (pte_present(entry))
+		return 0;
+	page = pte_page(entry);
 	nid = page_to_nid(page);
 	if (node_isset(nid, *qp->nmask) == !!(flags & MPOL_MF_INVERT))
 		return 0;
-- 
1.8.5.3


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* Re: [PATCH] mm: add pte_present() check on existing hugetlb_entry callbacks
  2014-03-03  5:02               ` [PATCH] mm: add pte_present() check on existing hugetlb_entry callbacks Naoya Horiguchi
@ 2014-03-03 20:06                 ` Sasha Levin
  2014-03-03 21:38                   ` Sasha Levin
  0 siblings, 1 reply; 28+ messages in thread
From: Sasha Levin @ 2014-03-03 20:06 UTC (permalink / raw)
  To: Naoya Horiguchi, Andrew Morton; +Cc: linux-mm, linux-kernel, Rik van Riel

On 03/03/2014 12:02 AM, Naoya Horiguchi wrote:
> Hi Sasha,
>
>> >I can confirm that with this patch the lockdep issue is gone. However, the NULL deref in
>> >walk_pte_range() and the BUG at mm/hugemem.c:3580 still appear.
> I spotted the cause of this problem.
> Could you try testing if this patch fixes it?

I'm seeing a different failure with this patch:

[ 1860.669114] BUG: unable to handle kernel NULL pointer dereference at 0000000000000050
[ 1860.670498] IP: [<ffffffff8129c0bf>] vm_normal_page+0x3f/0x90
[ 1860.672795] PGD 6c1c84067 PUD 6e0a3d067 PMD 0
[ 1860.672795] Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
[ 1860.672795] Dumping ftrace buffer:
[ 1860.672795]    (ftrace buffer empty)
[ 1860.672795] Modules linked in:
[ 1860.672795] CPU: 4 PID: 34914 Comm: trinity-c184 Tainted: G        W    3.14.0-rc4-ne
[ 1860.672795] task: ffff880717d90000 ti: ffff88070b3da000 task.ti: ffff88070b3da000
[ 1860.672795] RIP: 0010:[<ffffffff8129c0bf>]  [<ffffffff8129c0bf>] vm_normal_page+0x3f/
[ 1860.672795] RSP: 0018:ffff88070b3dbba8  EFLAGS: 00010202
[ 1860.672795] RAX: 000000000000767f RBX: ffff88070b3dbdd8 RCX: ffff88070b3dbd78
[ 1860.672795] RDX: 800000000767f225 RSI: 0100000000699000 RDI: 800000000767f225
[ 1860.672795] RBP: ffff88070b3dbba8 R08: 0000000000000000 R09: 0000000000000000
[ 1860.672795] R10: 0000000000000001 R11: 0000000000000000 R12: ffff880717df24c8
[ 1860.672795] R13: 0000000000000020 R14: 0100000000699000 R15: 0100000000800000
[ 1860.672795] FS:  00007f20a3584700(0000) GS:ffff88052b800000(0000) knlGS:0000000000000
[ 1860.672795] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1860.672795] CR2: 0000000000000050 CR3: 00000006d73cf000 CR4: 00000000000006e0
[ 1860.672795] Stack:
[ 1860.672795]  ffff88070b3dbbd8 ffffffff812c2f3d ffffffff812b2dc0 010000000069a000
[ 1860.672795]  ffff880717df24c8 ffff88070b3dbd78 ffff88070b3dbc28 ffffffff812b2e00
[ 1860.672795]  0000000000000000 ffff88072956bcf0 ffff88070b3dbc28 ffff8806e0a3d018
[ 1860.672795] Call Trace:
[ 1860.672795]  [<ffffffff812c2f3d>] queue_pages_pte+0x3d/0xd0
[ 1860.672795]  [<ffffffff812b2dc0>] ? walk_pte_range+0xc0/0x180
[ 1860.672795]  [<ffffffff812b2e00>] walk_pte_range+0x100/0x180
[ 1860.672795]  [<ffffffff812b3091>] walk_pmd_range+0x211/0x240
[ 1860.672795]  [<ffffffff812b31eb>] walk_pud_range+0x12b/0x160
[ 1860.672795]  [<ffffffff812d2ee4>] ? __slab_free+0x384/0x5e0
[ 1860.672795]  [<ffffffff812b3329>] walk_pgd_range+0x109/0x140
[ 1860.672795]  [<ffffffff812b3395>] __walk_page_range+0x35/0x40
[ 1860.672795]  [<ffffffff812b3552>] walk_page_range+0xf2/0x130
[ 1860.672795]  [<ffffffff812c2e41>] queue_pages_range+0x71/0x90
[ 1860.672795]  [<ffffffff812c2f00>] ? queue_pages_hugetlb+0xa0/0xa0
[ 1860.672795]  [<ffffffff812c2e60>] ? queue_pages_range+0x90/0x90
[ 1860.672795]  [<ffffffff812c30d0>] ? change_prot_numa+0x30/0x30
[ 1860.672795]  [<ffffffff812c6c61>] do_mbind+0x321/0x340
[ 1860.672795]  [<ffffffff8129af2f>] ? might_fault+0x9f/0xb0
[ 1860.672795]  [<ffffffff8129aee6>] ? might_fault+0x56/0xb0
[ 1860.672795]  [<ffffffff812c6d09>] SYSC_mbind+0x89/0xb0
[ 1860.672795]  [<ffffffff81268db5>] ? context_tracking_user_exit+0x195/0x1d0
[ 1860.672795]  [<ffffffff812c6d3e>] SyS_mbind+0xe/0x10
[ 1860.672795]  [<ffffffff8447d890>] tracesys+0xdd/0xe2


Thanks,
Sasha


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH] mm: add pte_present() check on existing hugetlb_entry callbacks
  2014-03-03 20:06                 ` Sasha Levin
@ 2014-03-03 21:38                   ` Sasha Levin
  2014-03-04 21:32                     ` Naoya Horiguchi
       [not found]                     ` <1393968743-imrxpynb@n-horiguchi@ah.jp.nec.com>
  0 siblings, 2 replies; 28+ messages in thread
From: Sasha Levin @ 2014-03-03 21:38 UTC (permalink / raw)
  To: Naoya Horiguchi, Andrew Morton; +Cc: linux-mm, linux-kernel, Rik van Riel

On 03/03/2014 03:06 PM, Sasha Levin wrote:
> On 03/03/2014 12:02 AM, Naoya Horiguchi wrote:
>> Hi Sasha,
>>
>>> >I can confirm that with this patch the lockdep issue is gone. However, the NULL deref in
>>> >walk_pte_range() and the BUG at mm/hugemem.c:3580 still appear.
>> I spotted the cause of this problem.
>> Could you try testing if this patch fixes it?
>
> I'm seeing a different failure with this patch:

And the NULL deref still happens.


Thanks,
Sasha


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH] mm: add pte_present() check on existing hugetlb_entry callbacks
  2014-03-03 21:38                   ` Sasha Levin
@ 2014-03-04 21:32                     ` Naoya Horiguchi
       [not found]                     ` <1393968743-imrxpynb@n-horiguchi@ah.jp.nec.com>
  1 sibling, 0 replies; 28+ messages in thread
From: Naoya Horiguchi @ 2014-03-04 21:32 UTC (permalink / raw)
  To: sasha.levin; +Cc: akpm, linux-mm, linux-kernel, riel

# sorry if duplicate message

On Mon, Mar 03, 2014 at 04:38:41PM -0500, Sasha Levin wrote:
> On 03/03/2014 03:06 PM, Sasha Levin wrote:
> >On 03/03/2014 12:02 AM, Naoya Horiguchi wrote:
> >>Hi Sasha,
> >>
> >>>>I can confirm that with this patch the lockdep issue is gone. However, the NULL deref in
> >>>>walk_pte_range() and the BUG at mm/hugemem.c:3580 still appear.
> >>I spotted the cause of this problem.
> >>Could you try testing if this patch fixes it?
> >
> >I'm seeing a different failure with this patch:
> 
> And the NULL deref still happens.

I haven't yet found the root cause of why this issue remains.
So I tried running trinity myself, but the problem didn't reproduce.
(I simply ran something like "./trinity --group vm --dangerous" for a few hours.)
Could you share more details or tips on how the problem occurs?

Thanks,
Naoya


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH] mm: add pte_present() check on existing hugetlb_entry callbacks
       [not found]                     ` <1393968743-imrxpynb@n-horiguchi@ah.jp.nec.com>
@ 2014-03-04 22:46                       ` Sasha Levin
  2014-03-04 23:49                         ` Naoya Horiguchi
       [not found]                         ` <1393976967-lnmm5xcs@n-horiguchi@ah.jp.nec.com>
  0 siblings, 2 replies; 28+ messages in thread
From: Sasha Levin @ 2014-03-04 22:46 UTC (permalink / raw)
  To: Naoya Horiguchi; +Cc: akpm, linux-mm, linux-kernel, riel

On 03/04/2014 04:32 PM, Naoya Horiguchi wrote:
> # sorry if duplicate message
> 
> On Mon, Mar 03, 2014 at 04:38:41PM -0500, Sasha Levin wrote:
>> On 03/03/2014 03:06 PM, Sasha Levin wrote:
>>> On 03/03/2014 12:02 AM, Naoya Horiguchi wrote:
>>>> Hi Sasha,
>>>>
>>>>>> I can confirm that with this patch the lockdep issue is gone. However, the NULL deref in
>>>>>> walk_pte_range() and the BUG at mm/hugemem.c:3580 still appear.
>>>> I spotted the cause of this problem.
>>>> Could you try testing if this patch fixes it?
>>>
>>> I'm seeing a different failure with this patch:
>>
>> And the NULL deref still happens.
> 
> I haven't yet found the root cause of why this issue remains.
> So I tried running trinity myself, but the problem didn't reproduce.
> (I simply ran something like "./trinity --group vm --dangerous" for a few hours.)
> Could you share more details or tips on how the problem occurs?

I run it as root in a disposable VM; that may be the difference here.


Thanks,
Sasha


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH] mm: add pte_present() check on existing hugetlb_entry callbacks
  2014-03-04 22:46                       ` Sasha Levin
@ 2014-03-04 23:49                         ` Naoya Horiguchi
       [not found]                         ` <1393976967-lnmm5xcs@n-horiguchi@ah.jp.nec.com>
  1 sibling, 0 replies; 28+ messages in thread
From: Naoya Horiguchi @ 2014-03-04 23:49 UTC (permalink / raw)
  To: sasha.levin; +Cc: akpm, linux-mm, linux-kernel, riel

On Tue, Mar 04, 2014 at 05:46:52PM -0500, Sasha Levin wrote:
> On 03/04/2014 04:32 PM, Naoya Horiguchi wrote:
> > # sorry if duplicate message
> > 
> > On Mon, Mar 03, 2014 at 04:38:41PM -0500, Sasha Levin wrote:
> >> On 03/03/2014 03:06 PM, Sasha Levin wrote:
> >>> On 03/03/2014 12:02 AM, Naoya Horiguchi wrote:
> >>>> Hi Sasha,
> >>>>
> >>>>>> I can confirm that with this patch the lockdep issue is gone. However, the NULL deref in
> >>>>>> walk_pte_range() and the BUG at mm/hugemem.c:3580 still appear.
> >>>> I spotted the cause of this problem.
> >>>> Could you try testing if this patch fixes it?
> >>>
> >>> I'm seeing a different failure with this patch:
> >>
> >> And the NULL deref still happens.
> > 
> > I haven't yet found the root cause of why this issue remains.
> > So I tried running trinity myself, but the problem didn't reproduce.
> > (I simply ran something like "./trinity --group vm --dangerous" for a few hours.)
> > Could you share more details or tips on how the problem occurs?
> 
> I run it as root in a disposable VM; that may be the difference here.

Sorry, I didn't mention it, but I also run it as root in a VM, so the
conditions are the same. It might depend on the kernel config, so I'm now
trying the config you previously gave me, but it doesn't boot correctly in my
environment (panic during initialization). I may need some time to get past
this.

Thanks,
Naoya


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH] mm: add pte_present() check on existing hugetlb_entry callbacks
       [not found]                         ` <1393976967-lnmm5xcs@n-horiguchi@ah.jp.nec.com>
@ 2014-03-06  4:31                           ` Sasha Levin
  2014-03-06 16:08                             ` Naoya Horiguchi
       [not found]                             ` <1394122113-xsq3i6vw@n-horiguchi@ah.jp.nec.com>
  0 siblings, 2 replies; 28+ messages in thread
From: Sasha Levin @ 2014-03-06  4:31 UTC (permalink / raw)
  To: Naoya Horiguchi; +Cc: akpm, linux-mm, linux-kernel, riel

On 03/04/2014 06:49 PM, Naoya Horiguchi wrote:
> On Tue, Mar 04, 2014 at 05:46:52PM -0500, Sasha Levin wrote:
>> On 03/04/2014 04:32 PM, Naoya Horiguchi wrote:
>>> # sorry if duplicate message
>>>
>>> On Mon, Mar 03, 2014 at 04:38:41PM -0500, Sasha Levin wrote:
>>>> On 03/03/2014 03:06 PM, Sasha Levin wrote:
>>>>> On 03/03/2014 12:02 AM, Naoya Horiguchi wrote:
>>>>>> Hi Sasha,
>>>>>>
>>>>>>>> I can confirm that with this patch the lockdep issue is gone. However, the NULL deref in
>>>>>>>> walk_pte_range() and the BUG at mm/hugemem.c:3580 still appear.
>>>>>> I spotted the cause of this problem.
>>>>>> Could you try testing if this patch fixes it?
>>>>>
>>>>> I'm seeing a different failure with this patch:
>>>>
>>>> And the NULL deref still happens.
>>>
>>> I haven't yet found the root cause of why this issue remains.
>>> So I tried running trinity myself, but the problem didn't reproduce.
>>> (I simply ran something like "./trinity --group vm --dangerous" for a few hours.)
>>> Could you share more details or tips on how the problem occurs?
>>
>> I run it as root in a disposable VM; that may be the difference here.
> 
> Sorry, I didn't mention it, but I also run it as root in a VM, so the
> conditions are the same. It might depend on the kernel config, so I'm now
> trying the config you previously gave me, but it doesn't boot correctly in
> my environment (panic during initialization). I may need some time to get
> past this.

I'd be happy to help with anything off-list; it shouldn't be too difficult
to get that kernel to boot :)

I've also reverted the page walker series for now; it makes it impossible
to test anything else, since it seems that hitting one of the issues is
quite easy.


Thanks,
Sasha


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH] mm: add pte_present() check on existing hugetlb_entry callbacks
  2014-03-06  4:31                           ` Sasha Levin
@ 2014-03-06 16:08                             ` Naoya Horiguchi
       [not found]                             ` <1394122113-xsq3i6vw@n-horiguchi@ah.jp.nec.com>
  1 sibling, 0 replies; 28+ messages in thread
From: Naoya Horiguchi @ 2014-03-06 16:08 UTC (permalink / raw)
  To: sasha.levin; +Cc: akpm, linux-mm, linux-kernel, riel

On Wed, Mar 05, 2014 at 11:31:55PM -0500, Sasha Levin wrote:
...
> > Sorry, I didn't mention it, but I also run it as root in a VM, so the
> > conditions are the same. It might depend on the kernel config, so I'm now
> > trying the config you previously gave me, but it doesn't boot correctly in
> > my environment (panic during initialization). I may need some time to get
> > past this.
> 
> I'd be happy to help with anything off-list; it shouldn't be too difficult
> to get that kernel to boot :)

Thanks. I did reproduce this on my kernel, although only once, and I needed
many trials because I kept hitting other bugs.

And I found my patch was totally wrong: it should check !pte_present(),
not pte_present().
I'm testing a fixed version (see below), and the problem seems not to
reproduce in my environment, at least for now.
But I'm not 100% sure, so I need you to double-check.

> I've also reverted the page walker series for now; it makes it impossible
> to test anything else, since it seems that hitting one of the issues is
> quite easy.

OK. Sorry for the trouble.

Thanks,
Naoya
---
From: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Date: Thu, 6 Mar 2014 07:08:24 -0500
Subject: [PATCH] mm: add pte_present() check on existing hugetlb_entry
 callbacks

The page table walker doesn't check for non-present hugetlb entries in the
common path, so hugetlb_entry() callbacks must check for them. The reason for
this behavior is that some callers want to handle them in their own way.

However, some callers don't check now, which causes unpredictable results,
for example when we have a race between hugepage migration and reading
/proc/pid/numa_maps. This patch fixes it by adding pte_present checks to the
buggy callbacks.

This bug has existed for a long time and became visible with the introduction
of hugepage migration.

ChangeLog v2:
- fix if condition (check !pte_present() instead of pte_present())

Reported-by: Sasha Levin <sasha.levin@oracle.com>
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: stable@vger.kernel.org # 3.12+
---
 fs/proc/task_mmu.c | 3 +++
 mm/mempolicy.c     | 6 +++++-
 2 files changed, 8 insertions(+), 1 deletion(-)

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index f819d0d4a0e8..762026098381 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -1301,6 +1301,9 @@ static int gather_hugetlb_stats(pte_t *pte, unsigned long addr,
 	if (pte_none(*pte))
 		return 0;
 
+	if (!pte_present(*pte))
+		return 0;
+
 	page = pte_page(*pte);
 	if (!page)
 		return 0;
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index b2155b8adbae..494f401bbf6c 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -524,8 +524,12 @@ static int queue_pages_hugetlb(pte_t *pte, unsigned long addr,
 	unsigned long flags = qp->flags;
 	int nid;
 	struct page *page;
+	pte_t entry;
 
-	page = pte_page(huge_ptep_get(pte));
+	entry = huge_ptep_get(pte);
+	if (!pte_present(entry))
+		return 0;
+	page = pte_page(entry);
 	nid = page_to_nid(page);
 	if (node_isset(nid, *qp->nmask) == !!(flags & MPOL_MF_INVERT))
 		return 0;
-- 
1.8.5.3
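
To make the callback contract above concrete, here is a minimal sketch of a
hugetlb_entry() callback that follows the rule. The prototype mirrors the
callbacks shown in the hunks above; the function name and body are
illustrative only, not code from the tree:

	static int sketch_hugetlb_entry(pte_t *pte, unsigned long addr,
					unsigned long end, struct mm_walk *walk)
	{
		pte_t entry = huge_ptep_get(pte);

		/* Empty slot: nothing was ever mapped at this address. */
		if (pte_none(entry))
			return 0;
		/*
		 * Non-present but not none: typically a migration (or hwpoison)
		 * swap-format entry, not a mapping. pte_page() on such an entry
		 * yields a bogus struct page, so a callback that doesn't handle
		 * these entries specially must bail out here.
		 */
		if (!pte_present(entry))
			return 0;
		/* From here on, pte_page(entry) is safe to use. */
		return 0;
	}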


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* Re: [PATCH] mm: add pte_present() check on existing hugetlb_entry callbacks
       [not found]                             ` <1394122113-xsq3i6vw@n-horiguchi@ah.jp.nec.com>
@ 2014-03-06 21:16                               ` Sasha Levin
  2014-03-07  6:35                                 ` Naoya Horiguchi
  0 siblings, 1 reply; 28+ messages in thread
From: Sasha Levin @ 2014-03-06 21:16 UTC (permalink / raw)
  To: Naoya Horiguchi; +Cc: akpm, linux-mm, linux-kernel, riel

On 03/06/2014 11:08 AM, Naoya Horiguchi wrote:
> And I found my patch was totally wrong: it should check !pte_present(),
> not pte_present().
> I'm testing a fixed version (see below), and the problem seems not to
> reproduce in my environment, at least for now.
> But I'm not 100% sure, so I need you to double-check.

Nope, I still see the problem. Same NULL deref and trace as before.


Thanks,
Sasha


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH] mm: add pte_present() check on existing hugetlb_entry callbacks
  2014-03-06 21:16                               ` Sasha Levin
@ 2014-03-07  6:35                                 ` Naoya Horiguchi
  2014-03-15  6:45                                   ` Naoya Horiguchi
  0 siblings, 1 reply; 28+ messages in thread
From: Naoya Horiguchi @ 2014-03-07  6:35 UTC (permalink / raw)
  To: sasha.levin; +Cc: akpm, linux-mm, linux-kernel, riel

On Thu, Mar 06, 2014 at 04:16:29PM -0500, Sasha Levin wrote:
> On 03/06/2014 11:08 AM, Naoya Horiguchi wrote:
> > And I found my patch was totally wrong: it should check !pte_present(),
> > not pte_present().
> > I'm testing a fixed version (see below), and the problem seems not to
> > reproduce in my environment, at least for now.
> > But I'm not 100% sure, so I need you to double-check.
> 
> Nope, I still see the problem. Same NULL deref and trace as before.

Hmm, that's unfortunate.
I tried to find out how this reproduces and what the root cause is, but no
luck. So I suggest adding a !PageHuge check before entering
isolate_huge_page(), which certainly gets around this problem.

I think "[PATCH] mm: add pte_present() check on existing hugetlb_entry"
is correct in itself, although it didn't fix this race.

Thanks,
Naoya
---
From: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Date: Fri, 7 Mar 2014 00:59:41 -0500
Subject: [PATCH] mm/mempolicy.c: add comment in queue_pages_hugetlb()

We have a race where we try to migrate an invalid page, which results in
hitting VM_BUG_ON_PAGE in isolate_huge_page().
queue_pages_hugetlb() is OK to fail, so let's check !PageHuge before
queuing the page, with a comment as a todo reminder.

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
---
 mm/mempolicy.c | 11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 494f401bbf6c..175353eb7396 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -530,6 +530,17 @@ static int queue_pages_hugetlb(pte_t *pte, unsigned long addr,
 	if (!pte_present(entry))
 		return 0;
 	page = pte_page(entry);
+
+	/*
+	 * TODO: Trinity found that the page could be a non-hugepage. This is
+	 * unexpected behavior, but it's not clear how it happens. So let's
+	 * simply skip this corner case. Page migration can fail for various
+	 * reasons, so it's OK to just skip addresses that are unsuitable for
+	 * hugepage migration.
+	 */
+	if (!PageHeadHuge(page))
+		return 0;
+
 	nid = page_to_nid(page);
 	if (node_isset(nid, *qp->nmask) == !!(flags & MPOL_MF_INVERT))
 		return 0;
-- 
1.8.5.3
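
For context, after the nodemask check queue_pages_hugetlb() goes on to
isolate the page for migration, roughly like this (paraphrased from the
queue_pages series, not quoted verbatim from the tree):

	if (flags & (MPOL_MF_MOVE | MPOL_MF_MOVE_ALL))
		isolate_huge_page(page, qp->pagelist);

isolate_huge_page() starts by asserting that the page is a compound head
(the VM_BUG_ON_PAGE the changelog above refers to), so a raced non-hugepage
reaching it would crash the kernel. With the new PageHeadHuge() guard, the
address is simply skipped instead, which is acceptable because hugepage
migration is best-effort anyway.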


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* Re: [PATCH] mm: add pte_present() check on existing hugetlb_entry callbacks
  2014-03-07  6:35                                 ` Naoya Horiguchi
@ 2014-03-15  6:45                                   ` Naoya Horiguchi
  0 siblings, 0 replies; 28+ messages in thread
From: Naoya Horiguchi @ 2014-03-15  6:45 UTC (permalink / raw)
  To: akpm; +Cc: sasha.levin, linux-mm, linux-kernel, riel

On Fri, Mar 07, 2014 at 01:35:02AM -0500, Naoya Horiguchi wrote:
> On Thu, Mar 06, 2014 at 04:16:29PM -0500, Sasha Levin wrote:
> > On 03/06/2014 11:08 AM, Naoya Horiguchi wrote:
> > > And I found my patch was totally wrong: it should check !pte_present(),
> > > not pte_present().
> > > I'm testing a fixed version (see below), and the problem seems not to
> > > reproduce in my environment, at least for now.
> > > But I'm not 100% sure, so I need you to double-check.
> > 
> > Nope, I still see the problem. Same NULL deref and trace as before.
> 
> Hmm, that's unfortunate.
> I tried to find out how this reproduces and what the root cause is, but no
> luck. So I suggest adding a !PageHuge check before entering
> isolate_huge_page(), which certainly gets around this problem.
> 
> I think "[PATCH] mm: add pte_present() check on existing hugetlb_entry"
> is correct in itself, although it didn't fix this race.

Andrew, could you consider picking up this patch (below) and "[PATCH] mm:
add pte_present() check on existing hugetlb_entry" (previously posted in
this thread) into linux-mm?

This patch should be folded into "mempolicy: apply page table walker on
queue_pages_range()," and the other one into "pagewalk: update page
table walker core."

Or do I need to repost them?

Thanks,
Naoya

> Thanks,
> Naoya
> ---
> From: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
> Date: Fri, 7 Mar 2014 00:59:41 -0500
> Subject: [PATCH] mm/mempolicy.c: add comment in queue_pages_hugetlb()
> 
> We have a race where we try to migrate an invalid page, which results in
> hitting VM_BUG_ON_PAGE in isolate_huge_page().
> queue_pages_hugetlb() is OK to fail, so let's check !PageHuge before
> queuing the page, with a comment as a todo reminder.
> 
> Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
> ---
>  mm/mempolicy.c | 11 +++++++++++
>  1 file changed, 11 insertions(+)
> 
> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> index 494f401bbf6c..175353eb7396 100644
> --- a/mm/mempolicy.c
> +++ b/mm/mempolicy.c
> @@ -530,6 +530,17 @@ static int queue_pages_hugetlb(pte_t *pte, unsigned long addr,
>  	if (!pte_present(entry))
>  		return 0;
>  	page = pte_page(entry);
> +
> +	/*
> +	 * TODO: Trinity found that the page could be a non-hugepage. This is
> +	 * unexpected behavior, but it's not clear how it happens. So let's
> +	 * simply skip this corner case. Page migration can fail for various
> +	 * reasons, so it's OK to just skip addresses that are unsuitable for
> +	 * hugepage migration.
> +	 */
> +	if (!PageHeadHuge(page))
> +		return 0;
> +
>  	nid = page_to_nid(page);
>  	if (node_isset(nid, *qp->nmask) == !!(flags & MPOL_MF_INVERT))
>  		return 0;
> -- 
> 1.8.5.3
> 


^ permalink raw reply	[flat|nested] 28+ messages in thread

end of thread, other threads:[~2014-03-15  6:46 UTC | newest]

Thread overview: 28+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-02-27  4:39 [PATCH 0/3] fixes on page table walker and hugepage rmapping Naoya Horiguchi
2014-02-27  4:39 ` [PATCH 1/3] mm/pagewalk.c: fix end address calculation in walk_page_range() Naoya Horiguchi
2014-02-27 21:03   ` Andrew Morton
2014-02-27 21:19     ` Naoya Horiguchi
2014-02-27 21:20       ` Kirill A. Shutemov
2014-02-27 21:54         ` Naoya Horiguchi
2014-02-27  4:39 ` [PATCH 2/3] mm, hugetlbfs: fix rmapping for anonymous hugepages with page_pgoff() Naoya Horiguchi
2014-02-27 21:19   ` Andrew Morton
2014-02-27 21:53     ` Naoya Horiguchi
2014-02-28 19:59       ` [PATCH v2] " Naoya Horiguchi
     [not found]       ` <5310ea8b.c425e00a.2cd9.ffffe097SMTPIN_ADDED_BROKEN@mx.google.com>
2014-02-28 23:14         ` Andrew Morton
2014-03-01  3:35           ` [PATCH v3] " Naoya Horiguchi
     [not found]           ` <1393644926-49vw3qw9@n-horiguchi@ah.jp.nec.com>
2014-03-01 23:08             ` Sasha Levin
2014-03-03  5:02               ` [PATCH] mm: add pte_present() check on existing hugetlb_entry callbacks Naoya Horiguchi
2014-03-03 20:06                 ` Sasha Levin
2014-03-03 21:38                   ` Sasha Levin
2014-03-04 21:32                     ` Naoya Horiguchi
     [not found]                     ` <1393968743-imrxpynb@n-horiguchi@ah.jp.nec.com>
2014-03-04 22:46                       ` Sasha Levin
2014-03-04 23:49                         ` Naoya Horiguchi
     [not found]                         ` <1393976967-lnmm5xcs@n-horiguchi@ah.jp.nec.com>
2014-03-06  4:31                           ` Sasha Levin
2014-03-06 16:08                             ` Naoya Horiguchi
     [not found]                             ` <1394122113-xsq3i6vw@n-horiguchi@ah.jp.nec.com>
2014-03-06 21:16                               ` Sasha Levin
2014-03-07  6:35                                 ` Naoya Horiguchi
2014-03-15  6:45                                   ` Naoya Horiguchi
2014-02-27  4:39 ` [PATCH 3/3] mm: call vma_adjust_trans_huge() only for thp-enabled vma Naoya Horiguchi
2014-02-27 21:23   ` Andrew Morton
2014-02-27 22:08     ` Naoya Horiguchi
2014-02-27 22:56   ` Kirill A. Shutemov
