* [PATCH 0/3] fixes on page table walker and hugepage rmapping
  From: Naoya Horiguchi @ 2014-02-27 4:39 UTC
  To: Sasha Levin; +Cc: Andrew Morton, linux-mm, linux-kernel

Hi Sasha,

could you test whether the bug you reported recently [1] reproduces on the
latest next tree with this patchset? (I'm not sure it will, because the
problem looks different in my own testing...)

[1] http://thread.gmane.org/gmane.linux.kernel.mm/113374/focus=113

---
Summary:

Naoya Horiguchi (3):
      mm/pagewalk.c: fix end address calculation in walk_page_range()
      mm, hugetlbfs: fix rmapping for anonymous hugepages with page_pgoff()
      mm: call vma_adjust_trans_huge() only for thp-enabled vma

 include/linux/pagemap.h | 13 +++++++++++++
 mm/huge_memory.c        |  2 +-
 mm/hugetlb.c            |  5 +++++
 mm/memory-failure.c     |  4 ++--
 mm/mmap.c               |  3 ++-
 mm/pagewalk.c           |  5 +++--
 mm/rmap.c               |  8 ++------
 7 files changed, 28 insertions(+), 12 deletions(-)
* [PATCH 1/3] mm/pagewalk.c: fix end address calculation in walk_page_range()
  From: Naoya Horiguchi @ 2014-02-27 4:39 UTC
  To: Sasha Levin; +Cc: Andrew Morton, linux-mm, linux-kernel

When we walk over only part of a vma, walk_page_range() walks until
vma->vm_end even if the given end address is before that point.
So this patch takes the smaller of the two as the end address.

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
---
 mm/pagewalk.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git next-20140220.orig/mm/pagewalk.c next-20140220/mm/pagewalk.c
index 416e981243b1..b418407ff4da 100644
--- next-20140220.orig/mm/pagewalk.c
+++ next-20140220/mm/pagewalk.c
@@ -321,8 +321,9 @@ int walk_page_range(unsigned long start, unsigned long end,
 			next = vma->vm_start;
 		} else { /* inside the found vma */
 			walk->vma = vma;
-			next = vma->vm_end;
-			err = walk_page_test(start, end, walk);
+			next = min_t(unsigned long, end, vma->vm_end);
+
+			err = walk_page_test(start, next, walk);
 			if (skip_lower_level_walking(walk))
 				continue;
 			if (err)
-- 
1.8.5.3
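To make the fix concrete: the walker visits each vma overlapping [start, end)
and must clamp the per-vma stopping point to whichever of the caller's end
address and vma->vm_end comes first. Below is a minimal userspace sketch of
that clamping; the vma_t type and function names are simplified stand-ins for
illustration, not the kernel's pagewalk API.

    /* Minimal userspace model of the end-address clamping added to
     * walk_page_range(). "vma_t" and clamp_walk_end() are invented
     * names for this demo, not kernel interfaces. */
    #include <stdio.h>

    typedef struct { unsigned long vm_start, vm_end; } vma_t;

    static unsigned long clamp_walk_end(const vma_t *vma, unsigned long end)
    {
            /* Walk only up to the caller's end, never past vma->vm_end. */
            return end < vma->vm_end ? end : vma->vm_end;
    }

    int main(void)
    {
            vma_t vma = { 0x1000, 0x9000 };

            /* Caller asks to stop at 0x5000: pre-patch, the walker would
             * have used vma->vm_end (0x9000) unconditionally. */
            printf("next = %#lx\n", clamp_walk_end(&vma, 0x5000)); /* 0x5000 */
            printf("next = %#lx\n", clamp_walk_end(&vma, 0xf000)); /* 0x9000 */
            return 0;
    }

With the pre-patch behavior (next = vma->vm_end unconditionally), a caller
asking for [0x1000, 0x5000) would have been walked all the way to 0x9000.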
* Re: [PATCH 1/3] mm/pagewalk.c: fix end address calculation in walk_page_range()
  From: Andrew Morton @ 2014-02-27 21:03 UTC
  To: Naoya Horiguchi; +Cc: Sasha Levin, linux-mm, linux-kernel

On Wed, 26 Feb 2014 23:39:35 -0500 Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> wrote:

> When we walk over only part of a vma, walk_page_range() walks until
> vma->vm_end even if the given end address is before that point.
> So this patch takes the smaller of the two as the end address.
>
> ...
>
> --- next-20140220.orig/mm/pagewalk.c
> +++ next-20140220/mm/pagewalk.c
> @@ -321,8 +321,9 @@ int walk_page_range(unsigned long start, unsigned long end,
>  			next = vma->vm_start;
>  		} else { /* inside the found vma */
>  			walk->vma = vma;
> -			next = vma->vm_end;
> -			err = walk_page_test(start, end, walk);
> +			next = min_t(unsigned long, end, vma->vm_end);

min_t is unneeded, isn't it?  Everything here has type unsigned long.

> +			err = walk_page_test(start, next, walk);
>  			if (skip_lower_level_walking(walk))
>  				continue;
>  			if (err)

I'm assuming this is a fix against
pagewalk-update-page-table-walker-core.patch and shall eventually be
folded into that patch.
* Re: [PATCH 1/3] mm/pagewalk.c: fix end address calculation in walk_page_range()
  From: Naoya Horiguchi @ 2014-02-27 21:19 UTC
  To: akpm; +Cc: sasha.levin, linux-mm, linux-kernel

On Thu, Feb 27, 2014 at 01:03:23PM -0800, Andrew Morton wrote:
> On Wed, 26 Feb 2014 23:39:35 -0500 Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> wrote:
>
> > When we walk over only part of a vma, walk_page_range() walks until
> > vma->vm_end even if the given end address is before that point.
> > So this patch takes the smaller of the two as the end address.
> >
> > ...
> >
> > --- next-20140220.orig/mm/pagewalk.c
> > +++ next-20140220/mm/pagewalk.c
> > @@ -321,8 +321,9 @@ int walk_page_range(unsigned long start, unsigned long end,
> >  			next = vma->vm_start;
> >  		} else { /* inside the found vma */
> >  			walk->vma = vma;
> > -			next = vma->vm_end;
> > -			err = walk_page_test(start, end, walk);
> > +			next = min_t(unsigned long, end, vma->vm_end);
>
> min_t is unneeded, isn't it?  Everything here has type unsigned long.

Yes, so simply (end < vma->vm_end ? end : vma->vm_end) is enough.
# I just considered min_t as a simple minimum getter, without thinking
# about the type check it does.

> > +			err = walk_page_test(start, next, walk);
> >  			if (skip_lower_level_walking(walk))
> >  				continue;
> >  			if (err)
>
> I'm assuming this is a fix against
> pagewalk-update-page-table-walker-core.patch and shall eventually be
> folded into that patch.

Yes, next-20140220 has this patch.

Thanks,
Naoya
* Re: [PATCH 1/3] mm/pagewalk.c: fix end address calculation in walk_page_range()
  From: Kirill A. Shutemov @ 2014-02-27 21:20 UTC
  To: Naoya Horiguchi; +Cc: akpm, sasha.levin, linux-mm, linux-kernel

On Thu, Feb 27, 2014 at 04:19:01PM -0500, Naoya Horiguchi wrote:
> On Thu, Feb 27, 2014 at 01:03:23PM -0800, Andrew Morton wrote:
> > On Wed, 26 Feb 2014 23:39:35 -0500 Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> wrote:
> >
> > > When we walk over only part of a vma, walk_page_range() walks until
> > > vma->vm_end even if the given end address is before that point.
> > > So this patch takes the smaller of the two as the end address.
> > >
> > > ...
> > >
> > > --- next-20140220.orig/mm/pagewalk.c
> > > +++ next-20140220/mm/pagewalk.c
> > > @@ -321,8 +321,9 @@ int walk_page_range(unsigned long start, unsigned long end,
> > >  			next = vma->vm_start;
> > >  		} else { /* inside the found vma */
> > >  			walk->vma = vma;
> > > -			next = vma->vm_end;
> > > -			err = walk_page_test(start, end, walk);
> > > +			next = min_t(unsigned long, end, vma->vm_end);
> >
> > min_t is unneeded, isn't it?  Everything here has type unsigned long.
>
> Yes, so simply (end < vma->vm_end ? end : vma->vm_end) is enough.
> # I just considered min_t as a simple minimum getter, without thinking
> # about the type check it does.

We have non-typed min() for that.

-- 
Kirill A. Shutemov
* Re: [PATCH 1/3] mm/pagewalk.c: fix end address calculation in walk_page_range()
  From: Naoya Horiguchi @ 2014-02-27 21:54 UTC
  To: kirill; +Cc: akpm, sasha.levin, linux-mm, linux-kernel

On Thu, Feb 27, 2014 at 11:20:34PM +0200, Kirill A. Shutemov wrote:
> On Thu, Feb 27, 2014 at 04:19:01PM -0500, Naoya Horiguchi wrote:
> > On Thu, Feb 27, 2014 at 01:03:23PM -0800, Andrew Morton wrote:
> > > On Wed, 26 Feb 2014 23:39:35 -0500 Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> wrote:
> > >
> > > > When we walk over only part of a vma, walk_page_range() walks until
> > > > vma->vm_end even if the given end address is before that point.
> > > > So this patch takes the smaller of the two as the end address.
> > > >
> > > > ...
> > > >
> > > > --- next-20140220.orig/mm/pagewalk.c
> > > > +++ next-20140220/mm/pagewalk.c
> > > > @@ -321,8 +321,9 @@ int walk_page_range(unsigned long start, unsigned long end,
> > > >  			next = vma->vm_start;
> > > >  		} else { /* inside the found vma */
> > > >  			walk->vma = vma;
> > > > -			next = vma->vm_end;
> > > > -			err = walk_page_test(start, end, walk);
> > > > +			next = min_t(unsigned long, end, vma->vm_end);
> > >
> > > min_t is unneeded, isn't it?  Everything here has type unsigned long.
> >
> > Yes, so simply (end < vma->vm_end ? end : vma->vm_end) is enough.
> > # I just considered min_t as a simple minimum getter, without thinking
> > # about the type check it does.
>
> We have non-typed min() for that.

Thanks. This is what I wanted :)
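For readers following the min()/min_t() exchange: the kernel's min_t() casts
both operands to a named type before comparing, while min() compares them
directly (the kernel's min() additionally warns at build time when the
operand types differ). The simplified userspace imitations below mimic the
shape of those macros but are not the kernel definitions.

    /* Simplified userspace imitations of the kernel's min()/min_t();
     * illustrative only (e.g. no protection against multiple evaluation
     * or mismatched types, which the real kernel macros provide). */
    #include <stdio.h>

    /* min_t() casts both sides to the named type before comparing. */
    #define min_t(type, a, b) ((type)(a) < (type)(b) ? (type)(a) : (type)(b))

    /* A plain min() just compares; here both operands are already
     * unsigned long, so no cast is needed. */
    #define min(a, b) ((a) < (b) ? (a) : (b))

    int main(void)
    {
            unsigned long end = 0x5000, vm_end = 0x9000;

            /* Both spellings give the same answer when the types already
             * match, which is why min_t() was unneeded in the patch. */
            printf("%#lx\n", min_t(unsigned long, end, vm_end));
            printf("%#lx\n", min(end, vm_end));
            return 0;
    }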
* [PATCH 2/3] mm, hugetlbfs: fix rmapping for anonymous hugepages with page_pgoff()
  From: Naoya Horiguchi @ 2014-02-27 4:39 UTC
  To: Sasha Levin; +Cc: Andrew Morton, linux-mm, linux-kernel

page->index stores the pagecache index when the page is mapped into a file
mapping region, and the index is in pagecache-size units, so it depends on
the page size. Some users of reverse mapping obviously assume that
page->index is in PAGE_CACHE_SHIFT units, so they don't work for anonymous
hugepages.

For example, consider that we have a 3-hugepage vma and try to mbind the
2nd hugepage to migrate it to another node. Then the vma is split and
migrate_page() is called for the 2nd hugepage (belonging to the middle vma.)
In the migrate operation, rmap_walk_anon() tries to find the relevant vma to
which the target hugepage belongs, but here we miscalculate pgoff, so
anon_vma_interval_tree_foreach() grabs an invalid vma, which fires VM_BUG_ON.

This patch introduces a new API that is usable both for normal pages and
hugepages to get a PAGE_SIZE offset from page->index. Users should clearly
distinguish page_index for the pagecache index and page_pgoff for the page
offset.

Reported-by: Sasha Levin <sasha.levin@oracle.com> # if the reported problem is fixed
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: stable@vger.kernel.org # 3.12+
---
 include/linux/pagemap.h | 13 +++++++++++++
 mm/huge_memory.c        |  2 +-
 mm/hugetlb.c            |  5 +++++
 mm/memory-failure.c     |  4 ++--
 mm/rmap.c               |  8 ++------
 5 files changed, 23 insertions(+), 9 deletions(-)

diff --git next-20140220.orig/include/linux/pagemap.h next-20140220/include/linux/pagemap.h
index 4f591df66778..a8bd14f42032 100644
--- next-20140220.orig/include/linux/pagemap.h
+++ next-20140220/include/linux/pagemap.h
@@ -316,6 +316,19 @@ static inline loff_t page_file_offset(struct page *page)
 	return ((loff_t)page_file_index(page)) << PAGE_CACHE_SHIFT;
 }
 
+extern pgoff_t hugepage_pgoff(struct page *page);
+
+/*
+ * page->index stores pagecache index whose unit is not always PAGE_SIZE.
+ * This function converts it into PAGE_SIZE offset.
+ */
+#define page_pgoff(page)					\
+({								\
+	unlikely(PageHuge(page)) ?				\
+		hugepage_pgoff(page) :				\
+		page->index >> (PAGE_CACHE_SHIFT - PAGE_SHIFT);	\
+})
+
 extern pgoff_t linear_hugepage_index(struct vm_area_struct *vma,
 				     unsigned long address);
 
diff --git next-20140220.orig/mm/huge_memory.c next-20140220/mm/huge_memory.c
index 6ac89e9f82ef..ef96763a6abf 100644
--- next-20140220.orig/mm/huge_memory.c
+++ next-20140220/mm/huge_memory.c
@@ -1800,7 +1800,7 @@ static void __split_huge_page(struct page *page,
 			      struct list_head *list)
 {
 	int mapcount, mapcount2;
-	pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
+	pgoff_t pgoff = page_pgoff(page);
 	struct anon_vma_chain *avc;
 
 	BUG_ON(!PageHead(page));
diff --git next-20140220.orig/mm/hugetlb.c next-20140220/mm/hugetlb.c
index 2252cacf98e8..e159e593d99f 100644
--- next-20140220.orig/mm/hugetlb.c
+++ next-20140220/mm/hugetlb.c
@@ -764,6 +764,11 @@ pgoff_t __basepage_index(struct page *page)
 	return (index << compound_order(page_head)) + compound_idx;
 }
 
+pgoff_t hugepage_pgoff(struct page *page)
+{
+	return page->index << huge_page_order(page_hstate(page));
+}
+
 static struct page *alloc_fresh_huge_page_node(struct hstate *h, int nid)
 {
 	struct page *page;
diff --git next-20140220.orig/mm/memory-failure.c next-20140220/mm/memory-failure.c
index 35ef28acf137..5d85a4afb22c 100644
--- next-20140220.orig/mm/memory-failure.c
+++ next-20140220/mm/memory-failure.c
@@ -404,7 +404,7 @@ static void collect_procs_anon(struct page *page, struct list_head *to_kill,
 	if (av == NULL)	/* Not actually mapped anymore */
 		return;
 
-	pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
+	pgoff = page_pgoff(page);
 	read_lock(&tasklist_lock);
 	for_each_process (tsk) {
 		struct anon_vma_chain *vmac;
@@ -437,7 +437,7 @@ static void collect_procs_file(struct page *page, struct list_head *to_kill,
 	mutex_lock(&mapping->i_mmap_mutex);
 	read_lock(&tasklist_lock);
 	for_each_process(tsk) {
-		pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
+		pgoff_t pgoff = page_pgoff(page);
 
 		if (!task_early_kill(tsk))
 			continue;
diff --git next-20140220.orig/mm/rmap.c next-20140220/mm/rmap.c
index 9056a1f00b87..78405051474a 100644
--- next-20140220.orig/mm/rmap.c
+++ next-20140220/mm/rmap.c
@@ -515,11 +515,7 @@ void page_unlock_anon_vma_read(struct anon_vma *anon_vma)
 static inline unsigned long
 __vma_address(struct page *page, struct vm_area_struct *vma)
 {
-	pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
-
-	if (unlikely(is_vm_hugetlb_page(vma)))
-		pgoff = page->index << huge_page_order(page_hstate(page));
-
+	pgoff_t pgoff = page_pgoff(page);
 	return vma->vm_start + ((pgoff - vma->vm_pgoff) << PAGE_SHIFT);
 }
 
@@ -1598,7 +1594,7 @@ static struct anon_vma *rmap_walk_anon_lock(struct page *page,
 static int rmap_walk_anon(struct page *page, struct rmap_walk_control *rwc)
 {
 	struct anon_vma *anon_vma;
-	pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
+	pgoff_t pgoff = page_pgoff(page);
 	struct anon_vma_chain *avc;
 	int ret = SWAP_AGAIN;
 
-- 
1.8.5.3
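A rough standalone illustration of the unit mismatch this patch addresses:
for a hugetlbfs page, page->index counts hugepages, so converting it to a
PAGE_SIZE-granularity offset needs a shift by the hugepage order rather than
by (PAGE_CACHE_SHIFT - PAGE_SHIFT). The constants below are x86-64
assumptions chosen for the demo, not values taken from the thread.

    /* Demonstrates why page->index must be scaled differently for
     * hugetlbfs pages: their index is in hugepage-sized units, not
     * PAGE_SIZE units. PAGE_SHIFT/PAGE_CACHE_SHIFT are x86-64 assumptions. */
    #include <stdio.h>

    #define PAGE_SHIFT          12
    #define PAGE_CACHE_SHIFT    12      /* pagecache unit == base page */
    #define HUGE_PAGE_ORDER     9       /* 2MB hugepage on x86-64 */

    int main(void)
    {
            unsigned long index = 2;    /* 3rd page of the mapping */

            /* Normal page: index is already in PAGE_SIZE units. */
            unsigned long pgoff_normal =
                    index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);

            /* Anonymous hugepage: index counts hugepages, so it must be
             * scaled by the hugepage order to get a PAGE_SIZE offset. */
            unsigned long pgoff_huge = index << HUGE_PAGE_ORDER;

            printf("normal page pgoff: %lu\n", pgoff_normal);  /* 2 */
            printf("hugepage pgoff:    %lu\n", pgoff_huge);    /* 1024 */
            return 0;
    }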
* Re: [PATCH 2/3] mm, hugetlbfs: fix rmapping for anonymous hugepages with page_pgoff()
  From: Andrew Morton @ 2014-02-27 21:19 UTC
  To: Naoya Horiguchi; +Cc: Sasha Levin, linux-mm, linux-kernel, Rik van Riel

On Wed, 26 Feb 2014 23:39:36 -0500 Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> wrote:

> page->index stores the pagecache index when the page is mapped into a file
> mapping region, and the index is in pagecache-size units, so it depends on
> the page size. Some users of reverse mapping obviously assume that
> page->index is in PAGE_CACHE_SHIFT units, so they don't work for anonymous
> hugepages.
>
> For example, consider that we have a 3-hugepage vma and try to mbind the
> 2nd hugepage to migrate it to another node. Then the vma is split and
> migrate_page() is called for the 2nd hugepage (belonging to the middle vma.)
> In the migrate operation, rmap_walk_anon() tries to find the relevant vma to
> which the target hugepage belongs, but here we miscalculate pgoff, so
> anon_vma_interval_tree_foreach() grabs an invalid vma, which fires VM_BUG_ON.
>
> This patch introduces a new API that is usable both for normal pages and
> hugepages to get a PAGE_SIZE offset from page->index. Users should clearly
> distinguish page_index for the pagecache index and page_pgoff for the page
> offset.

So this patch is really independent of the page-walker changes, but the
page walker changes need it.  So it is appropriate that this patch be
staged before that series, and separately.  Agree?

> Reported-by: Sasha Levin <sasha.levin@oracle.com> # if the reported problem is fixed
> Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
> Cc: stable@vger.kernel.org # 3.12+

Why cc:stable?  This problem will only cause runtime issues when the
page walker patches are present?

> index 4f591df66778..a8bd14f42032 100644
> --- next-20140220.orig/include/linux/pagemap.h
> +++ next-20140220/include/linux/pagemap.h
> @@ -316,6 +316,19 @@ static inline loff_t page_file_offset(struct page *page)
>  	return ((loff_t)page_file_index(page)) << PAGE_CACHE_SHIFT;
>  }
>
> +extern pgoff_t hugepage_pgoff(struct page *page);
> +
> +/*
> + * page->index stores pagecache index whose unit is not always PAGE_SIZE.
> + * This function converts it into PAGE_SIZE offset.
> + */
> +#define page_pgoff(page)					\
> +({								\
> +	unlikely(PageHuge(page)) ?				\
> +		hugepage_pgoff(page) :				\
> +		page->index >> (PAGE_CACHE_SHIFT - PAGE_SHIFT);	\
> +})

- I don't think this needs to be implemented in a macro?  Can we do
  it in good old C?

- Is PageHuge() the appropriate test?

	/*
	 * PageHuge() only returns true for hugetlbfs pages, but not for normal or
	 * transparent huge pages. See the PageTransHuge() documentation for more
	 * details.
	 */

- Should page->index be shifted right or left?  Probably left - I
  doubt if PAGE_CACHE_SHIFT will ever be less than PAGE_SHIFT.

- I'm surprised we don't have a general what-is-this-page's-order
  function, so you can just do

	static inline pgoff_t page_pgoff(struct page *page)
	{
		return page->index << page_size_order(page);
	}

  And I think this would be a better implementation, as the (new)
  page_size_order() could be used elsewhere.

  page_size_order() would be a crappy name - can't think of anything
  better at present.

> --- next-20140220.orig/mm/memory-failure.c
> +++ next-20140220/mm/memory-failure.c
> @@ -404,7 +404,7 @@ static void collect_procs_anon(struct page *page, struct list_head *to_kill,
>  	if (av == NULL)	/* Not actually mapped anymore */
>  		return;
>
> -	pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);

See this did a left shift.

> +	pgoff = page_pgoff(page);
>  	read_lock(&tasklist_lock);
>  	for_each_process (tsk) {
>  		struct anon_vma_chain *vmac;
> @@ -437,7 +437,7 @@ static void collect_procs_file(struct page *page, struct list_head *to_kill,
>  	mutex_lock(&mapping->i_mmap_mutex);
>  	read_lock(&tasklist_lock);
>  	for_each_process(tsk) {
> -		pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);

him too.

> +		pgoff_t pgoff = page_pgoff(page);
>
>  		if (!task_early_kill(tsk))
>  			continue;
> diff --git next-20140220.orig/mm/rmap.c next-20140220/mm/rmap.c
> index 9056a1f00b87..78405051474a 100644
> --- next-20140220.orig/mm/rmap.c
> +++ next-20140220/mm/rmap.c
> @@ -515,11 +515,7 @@ void page_unlock_anon_vma_read(struct anon_vma *anon_vma)
>  static inline unsigned long
>  __vma_address(struct page *page, struct vm_area_struct *vma)
>  {
> -	pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);

And him.

> -
> -	if (unlikely(is_vm_hugetlb_page(vma)))
> -		pgoff = page->index << huge_page_order(page_hstate(page));
> -
> +	pgoff_t pgoff = page_pgoff(page);
>  	return vma->vm_start + ((pgoff - vma->vm_pgoff) << PAGE_SHIFT);
>  }
>
> @@ -1598,7 +1594,7 @@ static struct anon_vma *rmap_walk_anon_lock(struct page *page,
>  static int rmap_walk_anon(struct page *page, struct rmap_walk_control *rwc)
>  {
>  	struct anon_vma *anon_vma;
> -	pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);

again.

> +	pgoff_t pgoff = page_pgoff(page);
>  	struct anon_vma_chain *avc;
>  	int ret = SWAP_AGAIN;
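On the shift-direction point: with PAGE_CACHE_SHIFT equal to PAGE_SHIFT the
shift count is zero, so >> and << behave identically and the v1 bug stays
hidden. A toy example with a hypothetical larger pagecache shift (the value
14 below is invented purely for illustration) makes the difference visible.

    /* Why the v1 macro's right shift is wrong: if the pagecache unit were
     * ever larger than a base page, shifting right would shrink the offset
     * instead of scaling it up. The shift values here are hypothetical. */
    #include <stdio.h>

    int main(void)
    {
            unsigned long index = 3;
            unsigned int page_shift = 12;
            unsigned int page_cache_shift = 14;     /* hypothetical */

            /* wrong direction: 3 >> 2 == 0 */
            printf("wrong (>>): %lu\n",
                   index >> (page_cache_shift - page_shift));
            /* right direction: 3 << 2 == 12 */
            printf("right (<<): %lu\n",
                   index << (page_cache_shift - page_shift));
            return 0;
    }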
* Re: [PATCH 2/3] mm, hugetlbfs: fix rmapping for anonymous hugepages with page_pgoff()
  From: Naoya Horiguchi @ 2014-02-27 21:53 UTC
  To: akpm; +Cc: sasha.levin, linux-mm, linux-kernel, riel

On Thu, Feb 27, 2014 at 01:19:57PM -0800, Andrew Morton wrote:
> On Wed, 26 Feb 2014 23:39:36 -0500 Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> wrote:
>
> > page->index stores the pagecache index when the page is mapped into a file
> > mapping region, and the index is in pagecache-size units, so it depends on
> > the page size. Some users of reverse mapping obviously assume that
> > page->index is in PAGE_CACHE_SHIFT units, so they don't work for anonymous
> > hugepages.
> >
> > For example, consider that we have a 3-hugepage vma and try to mbind the
> > 2nd hugepage to migrate it to another node. Then the vma is split and
> > migrate_page() is called for the 2nd hugepage (belonging to the middle vma.)
> > In the migrate operation, rmap_walk_anon() tries to find the relevant vma to
> > which the target hugepage belongs, but here we miscalculate pgoff, so
> > anon_vma_interval_tree_foreach() grabs an invalid vma, which fires VM_BUG_ON.
> >
> > This patch introduces a new API that is usable both for normal pages and
> > hugepages to get a PAGE_SIZE offset from page->index. Users should clearly
> > distinguish page_index for the pagecache index and page_pgoff for the page
> > offset.
>
> So this patch is really independent of the page-walker changes, but the
> page walker changes need it.  So it is appropriate that this patch be
> staged before that series, and separately.  Agree?

I think that this patch is independent of the page walker patch and vice
versa. But I agree that we can push this patch separately.

> > Reported-by: Sasha Levin <sasha.levin@oracle.com> # if the reported problem is fixed
> > Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
> > Cc: stable@vger.kernel.org # 3.12+
>
> Why cc:stable?  This problem will only cause runtime issues when the
> page walker patches are present?

I think that we have had this bug since the "hugepage migration extension"
patches were merged in 3.12. I did reproduce this bug on 3.14-rc1 (without
the page walker patch.)

> > index 4f591df66778..a8bd14f42032 100644
> > --- next-20140220.orig/include/linux/pagemap.h
> > +++ next-20140220/include/linux/pagemap.h
> > @@ -316,6 +316,19 @@ static inline loff_t page_file_offset(struct page *page)
> >  	return ((loff_t)page_file_index(page)) << PAGE_CACHE_SHIFT;
> >  }
> >
> > +extern pgoff_t hugepage_pgoff(struct page *page);
> > +
> > +/*
> > + * page->index stores pagecache index whose unit is not always PAGE_SIZE.
> > + * This function converts it into PAGE_SIZE offset.
> > + */
> > +#define page_pgoff(page)					\
> > +({								\
> > +	unlikely(PageHuge(page)) ?				\
> > +		hugepage_pgoff(page) :				\
> > +		page->index >> (PAGE_CACHE_SHIFT - PAGE_SHIFT);	\
> > +})
>
> - I don't think this needs to be implemented in a macro?  Can we do
>   it in good old C?

Yes, I have a headache about this, because defining it here as a simple C
function causes a build error. We can't use PageHuge in
include/linux/pagemap.h, but we can on the callers' side (they include
linux/hugetlb.h.) But yes, I think this looks like a dirty workaround, so
it would be nice if we could find a better idea.

> - Is PageHuge() the appropriate test?
>
> 	/*
> 	 * PageHuge() only returns true for hugetlbfs pages, but not for normal or
> 	 * transparent huge pages. See the PageTransHuge() documentation for more
> 	 * details.
> 	 */

I think yes, at least for now, because we have page->index in page units
for transparent hugepages. And as far as I know, we don't do rmapping on
transparent hugepages (callers should split THPs in advance if they need
to do rmapping.) If this situation changes, we will need to change this
function.

> - Should page->index be shifted right or left?  Probably left - I
>   doubt if PAGE_CACHE_SHIFT will ever be less than PAGE_SHIFT.

My big mistake, sorry. It must be a left shift.

> - I'm surprised we don't have a general what-is-this-page's-order
>   function, so you can just do
>
> 	static inline pgoff_t page_pgoff(struct page *page)
> 	{
> 		return page->index << page_size_order(page);
> 	}
>
>   And I think this would be a better implementation, as the (new)
>   page_size_order() could be used elsewhere.
>
>   page_size_order() would be a crappy name - can't think of anything
>   better at present.

I can't, either. If nobody suggests a better one, I'll take yours.

Thanks,
Naoya

> > --- next-20140220.orig/mm/memory-failure.c
> > +++ next-20140220/mm/memory-failure.c
> > @@ -404,7 +404,7 @@ static void collect_procs_anon(struct page *page, struct list_head *to_kill,
> >  	if (av == NULL)	/* Not actually mapped anymore */
> >  		return;
> >
> > -	pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
>
> See this did a left shift.
>
> > +	pgoff = page_pgoff(page);
> >  	read_lock(&tasklist_lock);
> >  	for_each_process (tsk) {
> >  		struct anon_vma_chain *vmac;
> > @@ -437,7 +437,7 @@ static void collect_procs_file(struct page *page, struct list_head *to_kill,
> >  	mutex_lock(&mapping->i_mmap_mutex);
> >  	read_lock(&tasklist_lock);
> >  	for_each_process(tsk) {
> > -		pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
>
> him too.
>
> > +		pgoff_t pgoff = page_pgoff(page);
> >
> >  		if (!task_early_kill(tsk))
> >  			continue;
> > diff --git next-20140220.orig/mm/rmap.c next-20140220/mm/rmap.c
> > index 9056a1f00b87..78405051474a 100644
> > --- next-20140220.orig/mm/rmap.c
> > +++ next-20140220/mm/rmap.c
> > @@ -515,11 +515,7 @@ void page_unlock_anon_vma_read(struct anon_vma *anon_vma)
> >  static inline unsigned long
> >  __vma_address(struct page *page, struct vm_area_struct *vma)
> >  {
> > -	pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
>
> And him.
>
> > -
> > -	if (unlikely(is_vm_hugetlb_page(vma)))
> > -		pgoff = page->index << huge_page_order(page_hstate(page));
> > -
> > +	pgoff_t pgoff = page_pgoff(page);
> >  	return vma->vm_start + ((pgoff - vma->vm_pgoff) << PAGE_SHIFT);
> >  }
> >
> > @@ -1598,7 +1594,7 @@ static struct anon_vma *rmap_walk_anon_lock(struct page *page,
> >  static int rmap_walk_anon(struct page *page, struct rmap_walk_control *rwc)
> >  {
> >  	struct anon_vma *anon_vma;
> > -	pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
>
> again.
>
> > +	pgoff_t pgoff = page_pgoff(page);
> >  	struct anon_vma_chain *avc;
> >  	int ret = SWAP_AGAIN;
* [PATCH v2] mm, hugetlbfs: fix rmapping for anonymous hugepages with page_pgoff()
  From: Naoya Horiguchi @ 2014-02-28 19:59 UTC
  To: akpm; +Cc: sasha.levin, linux-mm, linux-kernel, riel

page->index stores the pagecache index when the page is mapped into a file
mapping region, and the index is in pagecache-size units, so it depends on
the page size. Some users of reverse mapping obviously assume that
page->index is in PAGE_CACHE_SHIFT units, so they don't work for anonymous
hugepages.

For example, consider that we have a 3-hugepage vma and try to mbind the
2nd hugepage to migrate it to another node. Then the vma is split and
migrate_page() is called for the 2nd hugepage (belonging to the middle vma.)
In the migrate operation, rmap_walk_anon() tries to find the relevant vma to
which the target hugepage belongs, but here we miscalculate pgoff, so
anon_vma_interval_tree_foreach() grabs an invalid vma, which fires VM_BUG_ON.

This patch introduces a new API that is usable both for normal pages and
hugepages to get a PAGE_SIZE offset from page->index. Users should clearly
distinguish page_index for the pagecache index and page_pgoff for the page
offset.

ChangeLog v2:
- fix wrong shift direction
- introduce page_size_order() and huge_page_size_order()
- move the declaration of PageHuge() to include/linux/hugetlb_inline.h
  to avoid a macro definition.

Reported-by: Sasha Levin <sasha.levin@oracle.com> # if the reported problem is fixed
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: stable@vger.kernel.org # 3.12+
---
 include/linux/hugetlb.h        |  7 -------
 include/linux/hugetlb_inline.h | 13 +++++++++++++
 include/linux/pagemap.h        | 16 ++++++++++++++++
 mm/huge_memory.c               |  2 +-
 mm/hugetlb.c                   |  6 ++++++
 mm/memory-failure.c            |  4 ++--
 mm/rmap.c                      |  8 ++------
 7 files changed, 40 insertions(+), 16 deletions(-)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 8c43cc469d78..91fffa4fbc57 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -31,8 +31,6 @@ extern int hugetlb_max_hstate __read_mostly;
 struct hugepage_subpool *hugepage_new_subpool(long nr_blocks);
 void hugepage_put_subpool(struct hugepage_subpool *spool);
 
-int PageHuge(struct page *page);
-
 void reset_vma_resv_huge_pages(struct vm_area_struct *vma);
 int hugetlb_sysctl_handler(struct ctl_table *, int, void __user *, size_t *, loff_t *);
 int hugetlb_overcommit_handler(struct ctl_table *, int, void __user *, size_t *, loff_t *);
@@ -99,11 +97,6 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
 
 #else /* !CONFIG_HUGETLB_PAGE */
 
-static inline int PageHuge(struct page *page)
-{
-	return 0;
-}
-
 static inline void reset_vma_resv_huge_pages(struct vm_area_struct *vma)
 {
 }
diff --git a/include/linux/hugetlb_inline.h b/include/linux/hugetlb_inline.h
index 2bb681fbeb35..e939975f0cba 100644
--- a/include/linux/hugetlb_inline.h
+++ b/include/linux/hugetlb_inline.h
@@ -10,6 +10,9 @@ static inline int is_vm_hugetlb_page(struct vm_area_struct *vma)
 	return !!(vma->vm_flags & VM_HUGETLB);
 }
 
+int PageHuge(struct page *page);
+unsigned int huge_page_size_order(struct page *page);
+
 #else
 
 static inline int is_vm_hugetlb_page(struct vm_area_struct *vma)
@@ -17,6 +20,16 @@ static inline int is_vm_hugetlb_page(struct vm_area_struct *vma)
 	return 0;
 }
 
+static inline int PageHuge(struct page *page)
+{
+	return 0;
+}
+
+static inline unsigned int huge_page_size_order(struct page *page)
+{
+	return 0;
+}
+
 #endif
 
 #endif
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index f0ef8826acf1..4a01dd3c304d 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -307,6 +307,22 @@ static inline loff_t page_file_offset(struct page *page)
 	return ((loff_t)page_file_index(page)) << PAGE_CACHE_SHIFT;
 }
 
+static inline unsigned int page_size_order(struct page *page)
+{
+	return unlikely(PageHuge(page)) ?
+		huge_page_size_order(page) :
+		(PAGE_CACHE_SHIFT - PAGE_SHIFT);
+}
+
+/*
+ * page->index stores pagecache index whose unit is not always PAGE_SIZE.
+ * This function converts it into PAGE_SIZE offset.
+ */
+static inline pgoff_t page_pgoff(struct page *page)
+{
+	return page->index << page_size_order(page);
+}
+
 extern pgoff_t linear_hugepage_index(struct vm_area_struct *vma,
 				     unsigned long address);
 
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 82166bf974e1..12faa32f4d6d 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1877,7 +1877,7 @@ static void __split_huge_page(struct page *page,
 			      struct list_head *list)
 {
 	int mapcount, mapcount2;
-	pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
+	pgoff_t pgoff = page_pgoff(page);
 	struct anon_vma_chain *avc;
 
 	BUG_ON(!PageHead(page));
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index c01cb9fedb18..39d7461e9ba7 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -29,6 +29,7 @@
 #include <linux/io.h>
 
 #include <linux/hugetlb.h>
+#include <linux/hugetlb_inline.h>
 #include <linux/hugetlb_cgroup.h>
 #include <linux/node.h>
 #include "internal.h"
@@ -727,6 +728,11 @@ pgoff_t __basepage_index(struct page *page)
 	return (index << compound_order(page_head)) + compound_idx;
 }
 
+unsigned int huge_page_size_order(struct page *page)
+{
+	return huge_page_order(page_hstate(page));
+}
+
 static struct page *alloc_fresh_huge_page_node(struct hstate *h, int nid)
 {
 	struct page *page;
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 47e500c2f258..3be725304703 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -409,7 +409,7 @@ static void collect_procs_anon(struct page *page, struct list_head *to_kill,
 	if (av == NULL)	/* Not actually mapped anymore */
 		return;
 
-	pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
+	pgoff = page_pgoff(page);
 	read_lock(&tasklist_lock);
 	for_each_process (tsk) {
 		struct anon_vma_chain *vmac;
@@ -442,7 +442,7 @@ static void collect_procs_file(struct page *page, struct list_head *to_kill,
 	mutex_lock(&mapping->i_mmap_mutex);
 	read_lock(&tasklist_lock);
 	for_each_process(tsk) {
-		pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
+		pgoff_t pgoff = page_pgoff(page);
 
 		if (!task_early_kill(tsk))
 			continue;
diff --git a/mm/rmap.c b/mm/rmap.c
index d9d42316a99a..d3538a913285 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -515,11 +515,7 @@ void page_unlock_anon_vma_read(struct anon_vma *anon_vma)
 static inline unsigned long
 __vma_address(struct page *page, struct vm_area_struct *vma)
 {
-	pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
-
-	if (unlikely(is_vm_hugetlb_page(vma)))
-		pgoff = page->index << huge_page_order(page_hstate(page));
-
+	pgoff_t pgoff = page_pgoff(page);
 	return vma->vm_start + ((pgoff - vma->vm_pgoff) << PAGE_SHIFT);
 }
 
@@ -1588,7 +1584,7 @@ static struct anon_vma *rmap_walk_anon_lock(struct page *page,
 static int rmap_walk_anon(struct page *page, struct rmap_walk_control *rwc)
 {
 	struct anon_vma *anon_vma;
-	pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
+	pgoff_t pgoff = page_pgoff(page);
 	struct anon_vma_chain *avc;
 	int ret = SWAP_AGAIN;
 
-- 
1.8.5.3
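To see how page_pgoff() feeds the address calculation in __vma_address(),
here is the arithmetic as a standalone program; all addresses and offsets
below are made-up illustrative values, not taken from the thread.

    /* The rmap address calculation from __vma_address():
     *   addr = vma->vm_start + ((pgoff - vma->vm_pgoff) << PAGE_SHIFT)
     * Values are invented for illustration. */
    #include <stdio.h>

    #define PAGE_SHIFT 12

    int main(void)
    {
            unsigned long vm_start = 0x700000000000UL; /* vma base address */
            unsigned long vm_pgoff = 1024; /* vma starts 4MB into the object */
            /* page_pgoff() of a 2MB hugepage with page->index == 3:
             * 3 << 9 == 1536 PAGE_SIZE units into the object. */
            unsigned long pgoff = 1536;

            unsigned long addr =
                    vm_start + ((pgoff - vm_pgoff) << PAGE_SHIFT);
            printf("page is mapped at %#lx\n", addr); /* vm_start + 2MB */
            return 0;
    }

If the raw page->index (3) were used instead of the scaled pgoff, the
subtraction (3 - 1024) would wrap around and the interval-tree lookup would
land on an unrelated vma, which is the VM_BUG_ON scenario from the changelog.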
* Re: [PATCH v2] mm, hugetlbfs: fix rmapping for anonymous hugepages with page_pgoff()
  From: Andrew Morton @ 2014-02-28 23:14 UTC
  To: Naoya Horiguchi; +Cc: sasha.levin, linux-mm, linux-kernel, riel

On Fri, 28 Feb 2014 14:59:02 -0500 Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> wrote:

> page->index stores the pagecache index when the page is mapped into a file
> mapping region, and the index is in pagecache-size units, so it depends on
> the page size. Some users of reverse mapping obviously assume that
> page->index is in PAGE_CACHE_SHIFT units, so they don't work for anonymous
> hugepages.
>
> For example, consider that we have a 3-hugepage vma and try to mbind the
> 2nd hugepage to migrate it to another node. Then the vma is split and
> migrate_page() is called for the 2nd hugepage (belonging to the middle vma.)
> In the migrate operation, rmap_walk_anon() tries to find the relevant vma to
> which the target hugepage belongs, but here we miscalculate pgoff, so
> anon_vma_interval_tree_foreach() grabs an invalid vma, which fires VM_BUG_ON.
>
> This patch introduces a new API that is usable both for normal pages and
> hugepages to get a PAGE_SIZE offset from page->index. Users should clearly
> distinguish page_index for the pagecache index and page_pgoff for the page
> offset.
>
> ..
>
> --- a/include/linux/pagemap.h
> +++ b/include/linux/pagemap.h
> @@ -307,6 +307,22 @@ static inline loff_t page_file_offset(struct page *page)
>  	return ((loff_t)page_file_index(page)) << PAGE_CACHE_SHIFT;
>  }
>
> +static inline unsigned int page_size_order(struct page *page)
> +{
> +	return unlikely(PageHuge(page)) ?
> +		huge_page_size_order(page) :
> +		(PAGE_CACHE_SHIFT - PAGE_SHIFT);
> +}

Could use some nice documentation, please.  Why it exists, what it
does.  Particularly: what sort of pages it can and can't operate on,
and why.

The presence of PAGE_CACHE_SIZE is unfortunate - it at least implies
that the page is a pagecache page.  I dunno, maybe just use "0"?
* [PATCH v3] mm, hugetlbfs: fix rmapping for anonymous hugepages with page_pgoff()
  From: Naoya Horiguchi @ 2014-03-01 3:35 UTC
  To: akpm; +Cc: sasha.levin, linux-mm, linux-kernel, riel

On Fri, Feb 28, 2014 at 03:14:27PM -0800, Andrew Morton wrote:
> On Fri, 28 Feb 2014 14:59:02 -0500 Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> wrote:
>
> > page->index stores the pagecache index when the page is mapped into a file
> > mapping region, and the index is in pagecache-size units, so it depends on
> > the page size. Some users of reverse mapping obviously assume that
> > page->index is in PAGE_CACHE_SHIFT units, so they don't work for anonymous
> > hugepages.
> >
> > For example, consider that we have a 3-hugepage vma and try to mbind the
> > 2nd hugepage to migrate it to another node. Then the vma is split and
> > migrate_page() is called for the 2nd hugepage (belonging to the middle vma.)
> > In the migrate operation, rmap_walk_anon() tries to find the relevant vma to
> > which the target hugepage belongs, but here we miscalculate pgoff, so
> > anon_vma_interval_tree_foreach() grabs an invalid vma, which fires VM_BUG_ON.
> >
> > This patch introduces a new API that is usable both for normal pages and
> > hugepages to get a PAGE_SIZE offset from page->index. Users should clearly
> > distinguish page_index for the pagecache index and page_pgoff for the page
> > offset.
> >
> > ..
> >
> > --- a/include/linux/pagemap.h
> > +++ b/include/linux/pagemap.h
> > @@ -307,6 +307,22 @@ static inline loff_t page_file_offset(struct page *page)
> >  	return ((loff_t)page_file_index(page)) << PAGE_CACHE_SHIFT;
> >  }
> >
> > +static inline unsigned int page_size_order(struct page *page)
> > +{
> > +	return unlikely(PageHuge(page)) ?
> > +		huge_page_size_order(page) :

I found that we have compound_order(page) for the same purpose, so we don't
have to define this new function.

> > +		(PAGE_CACHE_SHIFT - PAGE_SHIFT);
> > +}
>
> Could use some nice documentation, please.  Why it exists, what it
> does.  Particularly: what sort of pages it can and can't operate on,
> and why.

OK.

> The presence of PAGE_CACHE_SIZE is unfortunate - it at least implies
> that the page is a pagecache page.  I dunno, maybe just use "0"?

Yes, PAGE_CACHE_SHIFT makes the code messy if PAGE_CACHE_SHIFT is always
PAGE_SHIFT. But I guess people have recently started thinking about
changing the size of the pagecache (in the discussion around >4kB sector
devices.) And from a readability perspective, "pagecache size" and "page
size" are different things, so keeping it is better in the long run.

Anyway, I revised the patch again; could you take a look?

Thanks,
Naoya
---
From: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Date: Fri, 28 Feb 2014 21:56:24 -0500
Subject: [PATCH] mm, hugetlbfs: fix rmapping for anonymous hugepages with
 page_pgoff()

page->index stores the pagecache index when the page is mapped into a file
mapping region, and the index is in pagecache-size units, so it depends on
the page size. Some users of reverse mapping obviously assume that
page->index is in PAGE_CACHE_SHIFT units, so they don't work for anonymous
hugepages.

For example, consider that we have a 3-hugepage vma and try to mbind the
2nd hugepage to migrate it to another node. Then the vma is split and
migrate_page() is called for the 2nd hugepage (belonging to the middle vma.)
In the migrate operation, rmap_walk_anon() tries to find the relevant vma to
which the target hugepage belongs, but here we miscalculate pgoff, so
anon_vma_interval_tree_foreach() grabs an invalid vma, which fires VM_BUG_ON.

This patch introduces a new API that is usable both for normal pages and
hugepages to get a PAGE_SIZE offset from page->index. Users should clearly
distinguish page_index for the pagecache index and page_pgoff for the page
offset.

ChangeLog v3:
- add a comment on page_size_order()
- use compound_order(compound_head(page)) instead of huge_page_size_order()
- use page_pgoff() in rmap_walk_file() too
- use page_size_order() in kill_proc()
- fix space indent

ChangeLog v2:
- fix wrong shift direction
- introduce page_size_order() and huge_page_size_order()
- move the declaration of PageHuge() to include/linux/hugetlb_inline.h
  to avoid a macro definition.

Reported-by: Sasha Levin <sasha.levin@oracle.com> # if the reported problem is fixed
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: stable@vger.kernel.org # 3.12+
---
 include/linux/hugetlb.h        |  7 -------
 include/linux/hugetlb_inline.h |  7 +++++++
 include/linux/pagemap.h        | 28 ++++++++++++++++++++++++++++
 mm/huge_memory.c               |  2 +-
 mm/hugetlb.c                   |  1 +
 mm/memory-failure.c            |  6 +++---
 mm/rmap.c                      | 10 +++-------
 7 files changed, 43 insertions(+), 18 deletions(-)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 8c43cc469d78..91fffa4fbc57 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -31,8 +31,6 @@ extern int hugetlb_max_hstate __read_mostly;
 struct hugepage_subpool *hugepage_new_subpool(long nr_blocks);
 void hugepage_put_subpool(struct hugepage_subpool *spool);
 
-int PageHuge(struct page *page);
-
 void reset_vma_resv_huge_pages(struct vm_area_struct *vma);
 int hugetlb_sysctl_handler(struct ctl_table *, int, void __user *, size_t *, loff_t *);
 int hugetlb_overcommit_handler(struct ctl_table *, int, void __user *, size_t *, loff_t *);
@@ -99,11 +97,6 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
 
 #else /* !CONFIG_HUGETLB_PAGE */
 
-static inline int PageHuge(struct page *page)
-{
-	return 0;
-}
-
 static inline void reset_vma_resv_huge_pages(struct vm_area_struct *vma)
 {
 }
diff --git a/include/linux/hugetlb_inline.h b/include/linux/hugetlb_inline.h
index 2bb681fbeb35..4d60c82e9fda 100644
--- a/include/linux/hugetlb_inline.h
+++ b/include/linux/hugetlb_inline.h
@@ -10,6 +10,8 @@ static inline int is_vm_hugetlb_page(struct vm_area_struct *vma)
 	return !!(vma->vm_flags & VM_HUGETLB);
 }
 
+int PageHuge(struct page *page);
+
 #else
 
 static inline int is_vm_hugetlb_page(struct vm_area_struct *vma)
@@ -17,6 +19,11 @@ static inline int is_vm_hugetlb_page(struct vm_area_struct *vma)
 	return 0;
 }
 
+static inline int PageHuge(struct page *page)
+{
+	return 0;
+}
+
 #endif
 
 #endif
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index f0ef8826acf1..715962f7ea7a 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -307,6 +307,34 @@ static inline loff_t page_file_offset(struct page *page)
 	return ((loff_t)page_file_index(page)) << PAGE_CACHE_SHIFT;
 }
 
+/*
+ * Get the page order of a given page in the context of the pagecache
+ * that the page belongs to.
+ *
+ * Pagecache unit size is not a fixed value (hugetlbfs is an example), but
+ * vma_interval_tree and anon_vma_interval_tree APIs assume that their indices
+ * are in PAGE_SIZE units. So this routine helps us to get normalized indices.
+ *
+ * This function should be called only for pagecache pages/hugepages and
+ * anonymous pages/hugepages, because pagecache unit size is irrelevant
+ * except for those pages.
+ */
+static inline unsigned int page_size_order(struct page *page)
+{
+	return unlikely(PageHuge(page)) ?
+		compound_order(compound_head(page)) :
+		(PAGE_CACHE_SHIFT - PAGE_SHIFT);
+}
+
+/*
+ * page->index stores pagecache index whose unit is not always PAGE_SIZE.
+ * This function converts it into PAGE_SIZE offset.
+ */
+static inline pgoff_t page_pgoff(struct page *page)
+{
+	return page->index << page_size_order(page);
+}
+
 extern pgoff_t linear_hugepage_index(struct vm_area_struct *vma,
 				     unsigned long address);
 
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 82166bf974e1..12faa32f4d6d 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1877,7 +1877,7 @@ static void __split_huge_page(struct page *page,
 			      struct list_head *list)
 {
 	int mapcount, mapcount2;
-	pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
+	pgoff_t pgoff = page_pgoff(page);
 	struct anon_vma_chain *avc;
 
 	BUG_ON(!PageHead(page));
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index c01cb9fedb18..7222247a590b 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -29,6 +29,7 @@
 #include <linux/io.h>
 
 #include <linux/hugetlb.h>
+#include <linux/hugetlb_inline.h>
 #include <linux/hugetlb_cgroup.h>
 #include <linux/node.h>
 #include "internal.h"
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 47e500c2f258..516dd4202248 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -207,7 +207,7 @@ static int kill_proc(struct task_struct *t, unsigned long addr, int trapno,
 #ifdef __ARCH_SI_TRAPNO
 	si.si_trapno = trapno;
 #endif
-	si.si_addr_lsb = compound_order(compound_head(page)) + PAGE_SHIFT;
+	si.si_addr_lsb = page_size_order(page) + PAGE_SHIFT;
 
 	if ((flags & MF_ACTION_REQUIRED) && t == current) {
 		si.si_code = BUS_MCEERR_AR;
@@ -409,7 +409,7 @@ static void collect_procs_anon(struct page *page, struct list_head *to_kill,
 	if (av == NULL)	/* Not actually mapped anymore */
 		return;
 
-	pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
+	pgoff = page_pgoff(page);
 	read_lock(&tasklist_lock);
 	for_each_process (tsk) {
 		struct anon_vma_chain *vmac;
@@ -442,7 +442,7 @@ static void collect_procs_file(struct page *page, struct list_head *to_kill,
 	mutex_lock(&mapping->i_mmap_mutex);
 	read_lock(&tasklist_lock);
 	for_each_process(tsk) {
-		pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
+		pgoff_t pgoff = page_pgoff(page);
 
 		if (!task_early_kill(tsk))
 			continue;
diff --git a/mm/rmap.c b/mm/rmap.c
index d9d42316a99a..e344012cc20a 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -515,11 +515,7 @@ void page_unlock_anon_vma_read(struct anon_vma *anon_vma)
 static inline unsigned long
 __vma_address(struct page *page, struct vm_area_struct *vma)
 {
-	pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
-
-	if (unlikely(is_vm_hugetlb_page(vma)))
-		pgoff = page->index << huge_page_order(page_hstate(page));
-
+	pgoff_t pgoff = page_pgoff(page);
 	return vma->vm_start + ((pgoff - vma->vm_pgoff) << PAGE_SHIFT);
 }
 
@@ -1588,7 +1584,7 @@ static struct anon_vma *rmap_walk_anon_lock(struct page *page,
 static int rmap_walk_anon(struct page *page, struct rmap_walk_control *rwc)
 {
 	struct anon_vma *anon_vma;
-	pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
+	pgoff_t pgoff = page_pgoff(page);
 	struct anon_vma_chain *avc;
 	int ret = SWAP_AGAIN;
 
@@ -1629,7 +1625,7 @@ static int rmap_walk_anon(struct page *page, struct rmap_walk_control *rwc)
 static int rmap_walk_file(struct page *page, struct rmap_walk_control *rwc)
 {
 	struct address_space *mapping = page->mapping;
-	pgoff_t pgoff = page->index << compound_order(page);
+	pgoff_t pgoff = page_pgoff(page);
 	struct vm_area_struct *vma;
 	int ret = SWAP_AGAIN;
 
-- 
1.8.5.3
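A userspace model of the v3 helpers may help readers follow the final shape:
a stand-in struct replaces struct page, with fields playing the roles of
PageHuge() and compound_order(compound_head(page)). This is a sketch of the
logic under those assumptions, not kernel code.

    /* Userspace model of v3's page_size_order()/page_pgoff().
     * "fake_page" and its fields are invented stand-ins for struct page,
     * PageHuge() and compound_order(compound_head(page)). */
    #include <stdio.h>

    #define PAGE_SHIFT          12
    #define PAGE_CACHE_SHIFT    12

    struct fake_page {
            int is_huge;            /* stands in for PageHuge() */
            unsigned int order;     /* stands in for compound_order(head) */
            unsigned long index;
    };

    static unsigned int page_size_order(const struct fake_page *page)
    {
            return page->is_huge ? page->order
                                 : (PAGE_CACHE_SHIFT - PAGE_SHIFT);
    }

    static unsigned long page_pgoff(const struct fake_page *page)
    {
            return page->index << page_size_order(page);
    }

    int main(void)
    {
            struct fake_page normal = { 0, 0, 5 };
            struct fake_page huge   = { 1, 9, 5 };  /* 2MB hugepage */

            printf("normal: %lu\n", page_pgoff(&normal));   /* 5 */
            printf("huge:   %lu\n", page_pgoff(&huge));     /* 2560 */
            return 0;
    }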
* Re: [PATCH v3] mm, hugetlbfs: fix rmapping for anonymous hugepages with page_pgoff()
  From: Sasha Levin @ 2014-03-01 23:08 UTC
  To: Naoya Horiguchi, akpm; +Cc: linux-mm, linux-kernel, riel

On 02/28/2014 10:35 PM, Naoya Horiguchi wrote:
> On Fri, Feb 28, 2014 at 03:14:27PM -0800, Andrew Morton wrote:
>> On Fri, 28 Feb 2014 14:59:02 -0500 Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> wrote:
>>
>>> page->index stores the pagecache index when the page is mapped into a file
>>> mapping region, and the index is in pagecache-size units, so it depends on
>>> the page size. Some users of reverse mapping obviously assume that
>>> page->index is in PAGE_CACHE_SHIFT units, so they don't work for anonymous
>>> hugepages.
>>>
>>> For example, consider that we have a 3-hugepage vma and try to mbind the
>>> 2nd hugepage to migrate it to another node. Then the vma is split and
>>> migrate_page() is called for the 2nd hugepage (belonging to the middle vma.)
>>> In the migrate operation, rmap_walk_anon() tries to find the relevant vma to
>>> which the target hugepage belongs, but here we miscalculate pgoff, so
>>> anon_vma_interval_tree_foreach() grabs an invalid vma, which fires VM_BUG_ON.
>>>
>>> This patch introduces a new API that is usable both for normal pages and
>>> hugepages to get a PAGE_SIZE offset from page->index. Users should clearly
>>> distinguish page_index for the pagecache index and page_pgoff for the page
>>> offset.
>>>
>>> ..
>>>
>>> --- a/include/linux/pagemap.h
>>> +++ b/include/linux/pagemap.h
>>> @@ -307,6 +307,22 @@ static inline loff_t page_file_offset(struct page *page)
>>>  	return ((loff_t)page_file_index(page)) << PAGE_CACHE_SHIFT;
>>>  }
>>>
>>> +static inline unsigned int page_size_order(struct page *page)
>>> +{
>>> +	return unlikely(PageHuge(page)) ?
>>> +	       huge_page_size_order(page) :
>
> I found that we have compound_order(page) for the same purpose, so we don't
> have to define this new function.
>
>>> +	       (PAGE_CACHE_SHIFT - PAGE_SHIFT);
>>> +}
>>
>> Could use some nice documentation, please.  Why it exists, what it
>> does.  Particularly: what sort of pages it can and can't operate on,
>> and why.
>
> OK.
>
>> The presence of PAGE_CACHE_SIZE is unfortunate - it at least implies
>> that the page is a pagecache page.  I dunno, maybe just use "0"?
>
> Yes, PAGE_CACHE_SHIFT makes the code messy if PAGE_CACHE_SHIFT is always
> PAGE_SHIFT. But I guess people have recently started thinking about
> changing the size of the pagecache (in the discussion around >4kB sector
> devices.) And from a readability perspective, "pagecache size" and "page
> size" are different things, so keeping it is better in the long run.
>
> Anyway, I revised the patch again; could you take a look?
>
> Thanks,
> Naoya
> ---
> From: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
> Date: Fri, 28 Feb 2014 21:56:24 -0500
> Subject: [PATCH] mm, hugetlbfs: fix rmapping for anonymous hugepages with
>  page_pgoff()
>
> page->index stores the pagecache index when the page is mapped into a file
> mapping region, and the index is in pagecache-size units, so it depends on
> the page size. Some users of reverse mapping obviously assume that
> page->index is in PAGE_CACHE_SHIFT units, so they don't work for anonymous
> hugepages.
>
> For example, consider that we have a 3-hugepage vma and try to mbind the
> 2nd hugepage to migrate it to another node. Then the vma is split and
> migrate_page() is called for the 2nd hugepage (belonging to the middle vma.)
> In the migrate operation, rmap_walk_anon() tries to find the relevant vma to
> which the target hugepage belongs, but here we miscalculate pgoff, so
> anon_vma_interval_tree_foreach() grabs an invalid vma, which fires VM_BUG_ON.
>
> This patch introduces a new API that is usable both for normal pages and
> hugepages to get a PAGE_SIZE offset from page->index. Users should clearly
> distinguish page_index for the pagecache index and page_pgoff for the page
> offset.
>
> ChangeLog v3:
> - add a comment on page_size_order()
> - use compound_order(compound_head(page)) instead of huge_page_size_order()
> - use page_pgoff() in rmap_walk_file() too
> - use page_size_order() in kill_proc()
> - fix space indent
>
> ChangeLog v2:
> - fix wrong shift direction
> - introduce page_size_order() and huge_page_size_order()
> - move the declaration of PageHuge() to include/linux/hugetlb_inline.h
>   to avoid a macro definition.
>
> Reported-by: Sasha Levin <sasha.levin@oracle.com> # if the reported problem is fixed
> Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
> Cc: stable@vger.kernel.org # 3.12+

I can confirm that with this patch the lockdep issue is gone. However, the
NULL deref in walk_pte_range() and the BUG at mm/huge_memory.c:3580 still
appear.

Thanks,
Sasha
* [PATCH] mm: add pte_present() check on existing hugetlb_entry callbacks
  From: Naoya Horiguchi @ 2014-03-03 5:02 UTC
  To: Sasha Levin, Andrew Morton; +Cc: linux-mm, linux-kernel, Rik van Riel

Hi Sasha,

> I can confirm that with this patch the lockdep issue is gone. However,
> the NULL deref in walk_pte_range() and the BUG at mm/huge_memory.c:3580
> still appear.

I spotted the cause of this problem.
Could you try testing whether this patch fixes it?

Thanks,
Naoya
---
The page table walker doesn't check for non-present hugetlb entries in the
common path, so hugetlb_entry() callbacks must check for them. The reason
for this behavior is that some callers want to handle non-present entries
in their own way. However, some callers don't check them now, which causes
unpredictable results, for example when we have a race between migrating a
hugepage and reading /proc/pid/numa_maps. This patch fixes it by adding
pte_present checks to the buggy callbacks.

This bug has existed for a long time and became visible with the
introduction of hugepage migration.

Reported-by: Sasha Levin <sasha.levin@oracle.com>
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: stable@vger.kernel.org # 3.12+
---
 fs/proc/task_mmu.c | 3 +++
 mm/mempolicy.c     | 6 +++++-
 2 files changed, 8 insertions(+), 1 deletion(-)

diff --git next-20140228.orig/fs/proc/task_mmu.c next-20140228/fs/proc/task_mmu.c
index 3746d89c768b..a4cecadce867 100644
--- next-20140228.orig/fs/proc/task_mmu.c
+++ next-20140228/fs/proc/task_mmu.c
@@ -1299,6 +1299,9 @@ static int gather_hugetlb_stats(pte_t *pte, unsigned long addr,
 	if (pte_none(*pte))
 		return 0;
 
+	if (!pte_present(*pte))
+		return 0;
+
 	page = pte_page(*pte);
 	if (!page)
 		return 0;
diff --git next-20140228.orig/mm/mempolicy.c next-20140228/mm/mempolicy.c
index c0d1cbd68790..1e171186ee6d 100644
--- next-20140228.orig/mm/mempolicy.c
+++ next-20140228/mm/mempolicy.c
@@ -524,8 +524,12 @@ static int queue_pages_hugetlb(pte_t *pte, unsigned long addr,
 	unsigned long flags = qp->flags;
 	int nid;
 	struct page *page;
+	pte_t entry;
 
-	page = pte_page(huge_ptep_get(pte));
+	entry = huge_ptep_get(pte);
+	if (!pte_present(entry))
+		return 0;
+	page = pte_page(entry);
 	nid = page_to_nid(page);
 	if (node_isset(nid, *qp->nmask) == !!(flags & MPOL_MF_INVERT))
 		return 0;
-- 
1.8.5.3
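The guard pattern both callbacks need can be modeled in standalone C: a
hugepage under migration temporarily has a non-present (swap-like) entry, and
calling pte_page() on such an entry yields a bogus page pointer. The PTE
encoding below is invented purely for the demo; the real huge_ptep_get() and
pte_present() are architecture-specific.

    /* Models the guard a hugetlb_entry callback needs before touching
     * the page behind a PTE. The bit layout is invented for this demo. */
    #include <stdio.h>

    #define PTE_PRESENT 0x1UL

    typedef unsigned long pte_t;

    static int pte_present(pte_t pte) { return (pte & PTE_PRESENT) != 0; }

    static int hugetlb_entry_cb(pte_t pte)
    {
            if (!pte_present(pte))  /* e.g. a migration entry: skip it */
                    return 0;
            /* Only now is it safe to derive the page from the entry;
             * the >> 1 below is a fake frame-number extraction. */
            printf("handling page frame %#lx\n", pte >> 1);
            return 0;
    }

    int main(void)
    {
            hugetlb_entry_cb(0x1001);  /* present entry: handled */
            hugetlb_entry_cb(0x2000);  /* non-present entry: skipped safely */
            return 0;
    }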
* Re: [PATCH] mm: add pte_present() check on existing hugetlb_entry callbacks
  From: Sasha Levin @ 2014-03-03 20:06 UTC
  To: Naoya Horiguchi, Andrew Morton; +Cc: linux-mm, linux-kernel, Rik van Riel

On 03/03/2014 12:02 AM, Naoya Horiguchi wrote:
> Hi Sasha,
>
>> I can confirm that with this patch the lockdep issue is gone. However, the
>> NULL deref in walk_pte_range() and the BUG at mm/hugetlb.c:3580 still appear.
>
> I spotted the cause of this problem.
> Could you try testing if this patch fixes it?

I'm seeing a different failure with this patch:

[ 1860.669114] BUG: unable to handle kernel NULL pointer dereference at 0000000000000050
[ 1860.670498] IP: [<ffffffff8129c0bf>] vm_normal_page+0x3f/0x90
[ 1860.672795] PGD 6c1c84067 PUD 6e0a3d067 PMD 0
[ 1860.672795] Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
[ 1860.672795] Dumping ftrace buffer:
[ 1860.672795]    (ftrace buffer empty)
[ 1860.672795] Modules linked in:
[ 1860.672795] CPU: 4 PID: 34914 Comm: trinity-c184 Tainted: G        W    3.14.0-rc4-ne
[ 1860.672795] task: ffff880717d90000 ti: ffff88070b3da000 task.ti: ffff88070b3da000
[ 1860.672795] RIP: 0010:[<ffffffff8129c0bf>]  [<ffffffff8129c0bf>] vm_normal_page+0x3f/0x90
[ 1860.672795] RSP: 0018:ffff88070b3dbba8  EFLAGS: 00010202
[ 1860.672795] RAX: 000000000000767f RBX: ffff88070b3dbdd8 RCX: ffff88070b3dbd78
[ 1860.672795] RDX: 800000000767f225 RSI: 0100000000699000 RDI: 800000000767f225
[ 1860.672795] RBP: ffff88070b3dbba8 R08: 0000000000000000 R09: 0000000000000000
[ 1860.672795] R10: 0000000000000001 R11: 0000000000000000 R12: ffff880717df24c8
[ 1860.672795] R13: 0000000000000020 R14: 0100000000699000 R15: 0100000000800000
[ 1860.672795] FS:  00007f20a3584700(0000) GS:ffff88052b800000(0000) knlGS:0000000000000
[ 1860.672795] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1860.672795] CR2: 0000000000000050 CR3: 00000006d73cf000 CR4: 00000000000006e0
[ 1860.672795] Stack:
[ 1860.672795]  ffff88070b3dbbd8 ffffffff812c2f3d ffffffff812b2dc0 010000000069a000
[ 1860.672795]  ffff880717df24c8 ffff88070b3dbd78 ffff88070b3dbc28 ffffffff812b2e00
[ 1860.672795]  0000000000000000 ffff88072956bcf0 ffff88070b3dbc28 ffff8806e0a3d018
[ 1860.672795] Call Trace:
[ 1860.672795]  [<ffffffff812c2f3d>] queue_pages_pte+0x3d/0xd0
[ 1860.672795]  [<ffffffff812b2dc0>] ? walk_pte_range+0xc0/0x180
[ 1860.672795]  [<ffffffff812b2e00>] walk_pte_range+0x100/0x180
[ 1860.672795]  [<ffffffff812b3091>] walk_pmd_range+0x211/0x240
[ 1860.672795]  [<ffffffff812b31eb>] walk_pud_range+0x12b/0x160
[ 1860.672795]  [<ffffffff812d2ee4>] ? __slab_free+0x384/0x5e0
[ 1860.672795]  [<ffffffff812b3329>] walk_pgd_range+0x109/0x140
[ 1860.672795]  [<ffffffff812b3395>] __walk_page_range+0x35/0x40
[ 1860.672795]  [<ffffffff812b3552>] walk_page_range+0xf2/0x130
[ 1860.672795]  [<ffffffff812c2e41>] queue_pages_range+0x71/0x90
[ 1860.672795]  [<ffffffff812c2f00>] ? queue_pages_hugetlb+0xa0/0xa0
[ 1860.672795]  [<ffffffff812c2e60>] ? queue_pages_range+0x90/0x90
[ 1860.672795]  [<ffffffff812c30d0>] ? change_prot_numa+0x30/0x30
[ 1860.672795]  [<ffffffff812c6c61>] do_mbind+0x321/0x340
[ 1860.672795]  [<ffffffff8129af2f>] ? might_fault+0x9f/0xb0
[ 1860.672795]  [<ffffffff8129aee6>] ? might_fault+0x56/0xb0
[ 1860.672795]  [<ffffffff812c6d09>] SYSC_mbind+0x89/0xb0
[ 1860.672795]  [<ffffffff81268db5>] ? context_tracking_user_exit+0x195/0x1d0
[ 1860.672795]  [<ffffffff812c6d3e>] SyS_mbind+0xe/0x10
[ 1860.672795]  [<ffffffff8447d890>] tracesys+0xdd/0xe2

Thanks,
Sasha
* Re: [PATCH] mm: add pte_present() check on existing hugetlb_entry callbacks
  From: Sasha Levin @ 2014-03-03 21:38 UTC
  To: Naoya Horiguchi, Andrew Morton; +Cc: linux-mm, linux-kernel, Rik van Riel

On 03/03/2014 03:06 PM, Sasha Levin wrote:
> On 03/03/2014 12:02 AM, Naoya Horiguchi wrote:
>> Hi Sasha,
>>
>>> I can confirm that with this patch the lockdep issue is gone. However, the
>>> NULL deref in walk_pte_range() and the BUG at mm/hugetlb.c:3580 still appear.
>>
>> I spotted the cause of this problem.
>> Could you try testing if this patch fixes it?
>
> I'm seeing a different failure with this patch:

And the NULL deref still happens.

Thanks,
Sasha
* Re: [PATCH] mm: add pte_present() check on existing hugetlb_entry callbacks
  From: Naoya Horiguchi @ 2014-03-04 21:32 UTC
  To: sasha.levin; +Cc: akpm, linux-mm, linux-kernel, riel

# sorry if duplicate message

On Mon, Mar 03, 2014 at 04:38:41PM -0500, Sasha Levin wrote:
> On 03/03/2014 03:06 PM, Sasha Levin wrote:
> > On 03/03/2014 12:02 AM, Naoya Horiguchi wrote:
> > > Hi Sasha,
> > >
> > > > > I can confirm that with this patch the lockdep issue is gone.
> > > > > However, the NULL deref in walk_pte_range() and the BUG at
> > > > > mm/hugetlb.c:3580 still appear.
> > > I spotted the cause of this problem.
> > > Could you try testing if this patch fixes it?
> >
> > I'm seeing a different failure with this patch:
>
> And the NULL deref still happens.

I haven't yet found the root cause of this issue.
So I tried to run trinity myself, but the problem didn't reproduce.
(I simply ran something like "./trinity --group vm --dangerous" for a few hours.)
Could you share more details or tips about how the problem occurs?

Thanks,
Naoya
* Re: [PATCH] mm: add pte_present() check on existing hugetlb_entry callbacks
  From: Sasha Levin @ 2014-03-04 22:46 UTC
  To: Naoya Horiguchi; +Cc: akpm, linux-mm, linux-kernel, riel

On 03/04/2014 04:32 PM, Naoya Horiguchi wrote:
> # sorry if duplicate message
>
> On Mon, Mar 03, 2014 at 04:38:41PM -0500, Sasha Levin wrote:
>> On 03/03/2014 03:06 PM, Sasha Levin wrote:
>>> On 03/03/2014 12:02 AM, Naoya Horiguchi wrote:
>>>> Hi Sasha,
>>>>
>>>>>> I can confirm that with this patch the lockdep issue is gone. However,
>>>>>> the NULL deref in walk_pte_range() and the BUG at mm/hugetlb.c:3580
>>>>>> still appear.
>>>> I spotted the cause of this problem.
>>>> Could you try testing if this patch fixes it?
>>>
>>> I'm seeing a different failure with this patch:
>>
>> And the NULL deref still happens.
>
> I haven't yet found the root cause of this issue.
> So I tried to run trinity myself, but the problem didn't reproduce.
> (I simply ran something like "./trinity --group vm --dangerous" for a few hours.)
> Could you share more details or tips about how the problem occurs?

I run it as root in a disposable VM; that may be the difference here.

Thanks,
Sasha
* Re: [PATCH] mm: add pte_present() check on existing hugetlb_entry callbacks
  From: Naoya Horiguchi @ 2014-03-04 23:49 UTC
  To: sasha.levin; +Cc: akpm, linux-mm, linux-kernel, riel

On Tue, Mar 04, 2014 at 05:46:52PM -0500, Sasha Levin wrote:
> On 03/04/2014 04:32 PM, Naoya Horiguchi wrote:
> > # sorry if duplicate message
> >
> > On Mon, Mar 03, 2014 at 04:38:41PM -0500, Sasha Levin wrote:
> > > On 03/03/2014 03:06 PM, Sasha Levin wrote:
> > > > On 03/03/2014 12:02 AM, Naoya Horiguchi wrote:
> > > > > Hi Sasha,
> > > > >
> > > > > > > I can confirm that with this patch the lockdep issue is gone.
> > > > > > > However, the NULL deref in walk_pte_range() and the BUG at
> > > > > > > mm/hugetlb.c:3580 still appear.
> > > > > I spotted the cause of this problem.
> > > > > Could you try testing if this patch fixes it?
> > > >
> > > > I'm seeing a different failure with this patch:
> > >
> > > And the NULL deref still happens.
> >
> > I haven't yet found the root cause of this issue.
> > So I tried to run trinity myself, but the problem didn't reproduce.
> > (I simply ran something like "./trinity --group vm --dangerous" for a few hours.)
> > Could you share more details or tips about how the problem occurs?
>
> I run it as root in a disposable VM; that may be the difference here.

Sorry, I didn't mention it, but I also run it as root on a VM, so the
conditions are the same. It might depend on the kernel config, so I'm now
trying the config you previously gave me, but it doesn't boot correctly in
my environment (it panics during initialization). I may need some time to
get over this.

Thanks,
Naoya
* Re: [PATCH] mm: add pte_present() check on existing hugetlb_entry callbacks
  From: Sasha Levin @ 2014-03-06 4:31 UTC
  To: Naoya Horiguchi; +Cc: akpm, linux-mm, linux-kernel, riel

On 03/04/2014 06:49 PM, Naoya Horiguchi wrote:
> On Tue, Mar 04, 2014 at 05:46:52PM -0500, Sasha Levin wrote:
>> On 03/04/2014 04:32 PM, Naoya Horiguchi wrote:
>>> # sorry if duplicate message
>>>
>>> On Mon, Mar 03, 2014 at 04:38:41PM -0500, Sasha Levin wrote:
>>>> On 03/03/2014 03:06 PM, Sasha Levin wrote:
>>>>> On 03/03/2014 12:02 AM, Naoya Horiguchi wrote:
>>>>>> Hi Sasha,
>>>>>>
>>>>>>>> I can confirm that with this patch the lockdep issue is gone.
>>>>>>>> However, the NULL deref in walk_pte_range() and the BUG at
>>>>>>>> mm/hugetlb.c:3580 still appear.
>>>>>> I spotted the cause of this problem.
>>>>>> Could you try testing if this patch fixes it?
>>>>>
>>>>> I'm seeing a different failure with this patch:
>>>>
>>>> And the NULL deref still happens.
>>>
>>> I haven't yet found the root cause of this issue.
>>> So I tried to run trinity myself, but the problem didn't reproduce.
>>> (I simply ran something like "./trinity --group vm --dangerous" for a few hours.)
>>> Could you share more details or tips about how the problem occurs?
>>
>> I run it as root in a disposable VM; that may be the difference here.
>
> Sorry, I didn't mention it, but I also run it as root on a VM, so the
> conditions are the same. It might depend on the kernel config, so I'm now
> trying the config you previously gave me, but it doesn't boot correctly in
> my environment (it panics during initialization). I may need some time to
> get over this.

I'd be happy to help with anything off-list; it shouldn't be too difficult
to get that kernel to boot :)

I've also reverted the page walker series for now; it makes it impossible
to test anything else, since hitting one of the issues seems quite easy.

Thanks,
Sasha
* Re: [PATCH] mm: add pte_present() check on existing hugetlb_entry callbacks
  From: Naoya Horiguchi @ 2014-03-06 16:08 UTC
  To: sasha.levin; +Cc: akpm, linux-mm, linux-kernel, riel

On Wed, Mar 05, 2014 at 11:31:55PM -0500, Sasha Levin wrote:
...
> > Sorry, I didn't mention it, but I also run it as root on a VM, so the
> > conditions are the same. It might depend on the kernel config, so I'm now
> > trying the config you previously gave me, but it doesn't boot correctly in
> > my environment (it panics during initialization). I may need some time to
> > get over this.
>
> I'd be happy to help with anything off-list; it shouldn't be too difficult
> to get that kernel to boot :)

Thanks. I did reproduce this on my kernel, although only once, and I needed
many trials due to hitting other bugs.
And I found my patch was totally wrong, because it should check
!pte_present(), not pte_present().
I'm testing a fixed one (see below), and the problem does not seem to
reproduce in my environment, at least for now. But I'm not 100% sure, so I
need you to double-check.

> I've also reverted the page walker series for now; it makes it impossible
> to test anything else, since hitting one of the issues seems quite easy.

OK. Sorry for the bother.

Thanks,
Naoya
---
From: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Date: Thu, 6 Mar 2014 07:08:24 -0500
Subject: [PATCH] mm: add pte_present() check on existing hugetlb_entry callbacks

Page table walker doesn't check non-present hugetlb entries in the common
path, so hugetlb_entry() callbacks must check them. The reason for this
behavior is that some callers want to handle them in their own way.

However, some callers don't check them now, which causes unpredictable
results, for example when we have a race between migrating a hugepage and
reading /proc/pid/numa_maps. This patch fixes it by adding pte_present
checks on the buggy callbacks.

This bug has existed for a long time and became visible with the
introduction of hugepage migration.

ChangeLog v2:
- fix if condition (check !pte_present() instead of pte_present())

Reported-by: Sasha Levin <sasha.levin@oracle.com>
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: stable@vger.kernel.org # 3.12+
---
 fs/proc/task_mmu.c | 3 +++
 mm/mempolicy.c     | 6 +++++-
 2 files changed, 8 insertions(+), 1 deletion(-)

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index f819d0d4a0e8..762026098381 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -1301,6 +1301,9 @@ static int gather_hugetlb_stats(pte_t *pte, unsigned long addr,
 	if (pte_none(*pte))
 		return 0;
 
+	if (!pte_present(*pte))
+		return 0;
+
 	page = pte_page(*pte);
 	if (!page)
 		return 0;
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index b2155b8adbae..494f401bbf6c 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -524,8 +524,12 @@ static int queue_pages_hugetlb(pte_t *pte, unsigned long addr,
 	unsigned long flags = qp->flags;
 	int nid;
 	struct page *page;
+	pte_t entry;
 
-	page = pte_page(huge_ptep_get(pte));
+	entry = huge_ptep_get(pte);
+	if (!pte_present(entry))
+		return 0;
+	page = pte_page(entry);
 	nid = page_to_nid(page);
 	if (node_isset(nid, *qp->nmask) == !!(flags & MPOL_MF_INVERT))
 		return 0;
-- 
1.8.5.3
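To summarize the rule this patch enforces: a hugetlb_entry() callback has to
treat a non-present entry as unusable before decoding it. Here is a sketch of
the pattern with a hypothetical callback name and abbreviated arguments (real
callbacks take additional parameters, such as the covered range and
struct mm_walk):

static int example_hugetlb_entry(pte_t *pte, unsigned long addr,
				 struct mm_walk *walk)
{
	pte_t entry = huge_ptep_get(pte);

	/*
	 * During hugepage migration the pte holds a non-present
	 * migration entry, so decoding it with pte_page() would
	 * yield a garbage page pointer. Skip it instead.
	 */
	if (!pte_present(entry))
		return 0;

	/* pte_page(entry) is safe to use from here on */
	return 0;
}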
* Re: [PATCH] mm: add pte_present() check on existing hugetlb_entry callbacks
  From: Sasha Levin @ 2014-03-06 21:16 UTC
  To: Naoya Horiguchi; +Cc: akpm, linux-mm, linux-kernel, riel

On 03/06/2014 11:08 AM, Naoya Horiguchi wrote:
> And I found my patch was totally wrong, because it should check
> !pte_present(), not pte_present().
> I'm testing a fixed one (see below), and the problem does not seem to
> reproduce in my environment, at least for now. But I'm not 100% sure, so I
> need you to double-check.

Nope, I still see the problem. Same NULL deref and trace as before.

Thanks,
Sasha
* Re: [PATCH] mm: add pte_present() check on existing hugetlb_entry callbacks
  From: Naoya Horiguchi @ 2014-03-07 6:35 UTC
  To: sasha.levin; +Cc: akpm, linux-mm, linux-kernel, riel

On Thu, Mar 06, 2014 at 04:16:29PM -0500, Sasha Levin wrote:
> On 03/06/2014 11:08 AM, Naoya Horiguchi wrote:
> > And I found my patch was totally wrong, because it should check
> > !pte_present(), not pte_present().
> > I'm testing a fixed one (see below), and the problem does not seem to
> > reproduce in my environment, at least for now. But I'm not 100% sure, so I
> > need you to double-check.
>
> Nope, I still see the problem. Same NULL deref and trace as before.

Hmm, that's unfortunate.
I tried to find out how this reproduces and what the root cause is, but no
luck. So I suggest adding a !PageHeadHuge check before entering
isolate_huge_page(), which certainly gets around this problem.

I think "[PATCH] mm: add pte_present() check on existing hugetlb_entry
callbacks" is correct in itself, although it didn't fix this race.

Thanks,
Naoya
---
From: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Date: Fri, 7 Mar 2014 00:59:41 -0500
Subject: [PATCH] mm/mempolicy.c: add comment in queue_pages_hugetlb()

We have a race where we try to migrate an invalid page, resulting in
hitting the VM_BUG_ON_PAGE in isolate_huge_page().
queue_pages_hugetlb() is OK to fail, so let's check !PageHeadHuge before
queuing the page, with a comment as a todo reminder.

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
---
 mm/mempolicy.c | 11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 494f401bbf6c..175353eb7396 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -530,6 +530,17 @@ static int queue_pages_hugetlb(pte_t *pte, unsigned long addr,
 	if (!pte_present(entry))
 		return 0;
 	page = pte_page(entry);
+
+	/*
+	 * TODO: Trinity found that the page can be a non-hugepage. This is
+	 * unexpected behavior, and it's not clear how this problem happens.
+	 * So let's simply skip such a corner case. Page migration can often
+	 * fail for various reasons, so it's OK to just skip an address that
+	 * is unsuitable for hugepage migration.
+	 */
+	if (!PageHeadHuge(page))
+		return 0;
+
 	nid = page_to_nid(page);
 	if (node_isset(nid, *qp->nmask) == !!(flags & MPOL_MF_INVERT))
 		return 0;
-- 
1.8.5.3
* Re: [PATCH] mm: add pte_present() check on existing hugetlb_entry callbacks
  From: Naoya Horiguchi @ 2014-03-15 6:45 UTC
  To: akpm; +Cc: sasha.levin, linux-mm, linux-kernel, riel

On Fri, Mar 07, 2014 at 01:35:02AM -0500, Naoya Horiguchi wrote:
> On Thu, Mar 06, 2014 at 04:16:29PM -0500, Sasha Levin wrote:
> > On 03/06/2014 11:08 AM, Naoya Horiguchi wrote:
> > > And I found my patch was totally wrong, because it should check
> > > !pte_present(), not pte_present().
> > > I'm testing a fixed one (see below), and the problem does not seem to
> > > reproduce in my environment, at least for now. But I'm not 100% sure,
> > > so I need you to double-check.
> >
> > Nope, I still see the problem. Same NULL deref and trace as before.
>
> Hmm, that's unfortunate.
> I tried to find out how this reproduces and what the root cause is, but no
> luck. So I suggest adding a !PageHeadHuge check before entering
> isolate_huge_page(), which certainly gets around this problem.
>
> I think "[PATCH] mm: add pte_present() check on existing hugetlb_entry
> callbacks" is correct in itself, although it didn't fix this race.

Andrew, could you consider picking up this patch (below) and "[PATCH] mm:
add pte_present() check on existing hugetlb_entry callbacks" (previously
posted in this thread) into linux-mm?
This patch is to be folded into "mempolicy: apply page table walker on
queue_pages_range()," and the other one into "pagewalk: update page table
walker core." Or do I need to repost them?

Thanks,
Naoya

> Thanks,
> Naoya
> ---
> From: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
> Date: Fri, 7 Mar 2014 00:59:41 -0500
> Subject: [PATCH] mm/mempolicy.c: add comment in queue_pages_hugetlb()
>
> We have a race where we try to migrate an invalid page, resulting in
> hitting the VM_BUG_ON_PAGE in isolate_huge_page().
> queue_pages_hugetlb() is OK to fail, so let's check !PageHeadHuge before
> queuing the page, with a comment as a todo reminder.
>
> Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
> ---
>  mm/mempolicy.c | 11 +++++++++++
>  1 file changed, 11 insertions(+)
>
> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> index 494f401bbf6c..175353eb7396 100644
> --- a/mm/mempolicy.c
> +++ b/mm/mempolicy.c
> @@ -530,6 +530,17 @@ static int queue_pages_hugetlb(pte_t *pte, unsigned long addr,
>  	if (!pte_present(entry))
>  		return 0;
>  	page = pte_page(entry);
> +
> +	/*
> +	 * TODO: Trinity found that the page can be a non-hugepage. This is
> +	 * unexpected behavior, and it's not clear how this problem happens.
> +	 * So let's simply skip such a corner case. Page migration can often
> +	 * fail for various reasons, so it's OK to just skip an address that
> +	 * is unsuitable for hugepage migration.
> +	 */
> +	if (!PageHeadHuge(page))
> +		return 0;
> +
>  	nid = page_to_nid(page);
>  	if (node_isset(nid, *qp->nmask) == !!(flags & MPOL_MF_INVERT))
>  		return 0;
> -- 
> 1.8.5.3
* [PATCH 3/3] mm: call vma_adjust_trans_huge() only for thp-enabled vma
  From: Naoya Horiguchi @ 2014-02-27 4:39 UTC
  To: Sasha Levin; +Cc: Andrew Morton, linux-mm, linux-kernel

vma_adjust() is also called for vma(VM_HUGETLB), and it could happen that
we try to split a hugetlbfs hugepage. So exclude the possibility.

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
---
 mm/mmap.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git next-20140220.orig/mm/mmap.c next-20140220/mm/mmap.c
index f53397806d7f..45a9c0d51e3f 100644
--- next-20140220.orig/mm/mmap.c
+++ next-20140220/mm/mmap.c
@@ -772,7 +772,8 @@ again:			remove_next = 1 + (end > next->vm_end);
 		}
 	}
 
-	vma_adjust_trans_huge(vma, start, end, adjust_next);
+	if (transparent_hugepage_enabled(vma))
+		vma_adjust_trans_huge(vma, start, end, adjust_next);
 
 	anon_vma = vma->anon_vma;
 	if (!anon_vma && adjust_next)
-- 
1.8.5.3
* Re: [PATCH 3/3] mm: call vma_adjust_trans_huge() only for thp-enabled vma
  From: Andrew Morton @ 2014-02-27 21:23 UTC
  To: Naoya Horiguchi; +Cc: Sasha Levin, linux-mm, linux-kernel

On Wed, 26 Feb 2014 23:39:37 -0500 Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> wrote:

> vma_adjust() is also called for vma(VM_HUGETLB), and it could happen that
> we try to split a hugetlbfs hugepage. So exclude the possibility.

It would be nice to have a more complete changelog here, please. Under
what circumstances does this cause problems, and what are the
user-observable effects?

> --- next-20140220.orig/mm/mmap.c
> +++ next-20140220/mm/mmap.c
> @@ -772,7 +772,8 @@ again:			remove_next = 1 + (end > next->vm_end);
>  		}
>  	}
> 
> -	vma_adjust_trans_huge(vma, start, end, adjust_next);
> +	if (transparent_hugepage_enabled(vma))
> +		vma_adjust_trans_huge(vma, start, end, adjust_next);
> 
>  	anon_vma = vma->anon_vma;
>  	if (!anon_vma && adjust_next)
> -- 
> 1.8.5.3
* Re: [PATCH 3/3] mm: call vma_adjust_trans_huge() only for thp-enabled vma
  From: Naoya Horiguchi @ 2014-02-27 22:08 UTC
  To: akpm; +Cc: sasha.levin, linux-mm, linux-kernel

On Thu, Feb 27, 2014 at 01:23:50PM -0800, Andrew Morton wrote:
> On Wed, 26 Feb 2014 23:39:37 -0500 Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> wrote:
>
> > vma_adjust() is also called for vma(VM_HUGETLB), and it could happen that
> > we try to split a hugetlbfs hugepage. So exclude the possibility.
>
> It would be nice to have a more complete changelog here, please. Under
> what circumstances does this cause problems, and what are the
> user-observable effects?

I have seen no user-visible problem. This is a kind of precaution to ensure
that transparent hugepage code is not called for hugetlbfs workloads.

... but I think this patch doesn't work as I intended, because
transparent_hugepage_enabled() can return true for vma(VM_HUGETLB).
So I'm separating this patch out of the series for now.

Thanks,
Naoya Horiguchi
* Re: [PATCH 3/3] mm: call vma_adjust_trans_huge() only for thp-enabled vma
  From: Kirill A. Shutemov @ 2014-02-27 22:56 UTC
  To: Naoya Horiguchi; +Cc: Sasha Levin, Andrew Morton, linux-mm, linux-kernel

On Wed, Feb 26, 2014 at 11:39:37PM -0500, Naoya Horiguchi wrote:
> vma_adjust() is also called for vma(VM_HUGETLB), and it could happen that
> we try to split a hugetlbfs hugepage. So exclude the possibility.

NAK. It can't happen: vma_adjust_trans_huge() checks for vma->vm_ops, and
hugetlb VMAs always have it set, unlike THP.

-- 
 Kirill A. Shutemov
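For context on why the proposed guard is redundant: the check Kirill refers
to sits in the inline wrapper for vma_adjust_trans_huge(). A sketch of its
shape, reconstructed from the include/linux/huge_mm.h wrapper of roughly this
era rather than quoted verbatim from the tree:

static inline void vma_adjust_trans_huge(struct vm_area_struct *vma,
					 unsigned long start,
					 unsigned long end,
					 long adjust_next)
{
	/*
	 * Only anonymous vmas can hold transparent hugepages, and anonymous
	 * vmas have vm_ops == NULL. A hugetlbfs vma always has vm_ops set
	 * (hugetlb_vm_ops), so it bails out here before any huge pmd
	 * splitting is attempted.
	 */
	if (!vma->anon_vma || vma->vm_ops)
		return;
	__vma_adjust_trans_huge(vma, start, end, adjust_next);
}

Given a wrapper of this shape, an extra transparent_hugepage_enabled() test
in vma_adjust() would not change behavior for hugetlbfs vmas.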