linux-mm.kvack.org archive mirror
* [PATCH RFC 0/1] numa: fix /proc/<pid>/numa_maps for THP
@ 2016-02-05 15:33 Gerald Schaefer
  2016-02-05 15:34 ` [PATCH RFC 1/1] " Gerald Schaefer
  0 siblings, 1 reply; 5+ messages in thread
From: Gerald Schaefer @ 2016-02-05 15:33 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, linux-kernel, Naoya Horiguchi, Kirill A. Shutemov,
	Konstantin Khlebnikov, Michal Hocko, Vlastimil Babka,
	Jerome Marchand, Johannes Weiner, Dave Hansen, Mel Gorman,
	Dan Williams, Ross Zwisler, Benjamin Herrenschmidt,
	Paul Mackerras, Martin Schwidefsky, Heiko Carstens,
	Michael Holzheu

In gather_pte_stats() a THP pmd is cast to a pte, which is wrong because the
two layouts may differ depending on the architecture. On s390 this leads to
inaccurate numa_maps accounting in /proc, because the pte_present() and
pte_dirty() checks are misguided when applied to the fake pte.

On other architectures, pte_present() and pte_dirty() may happen to work, but
there is still an issue with direct-access (dax) mappings without underlying
struct pages when HAVE_PTE_SPECIAL is set and THP is available. In
vm_normal_page() the fake pte is checked with pte_special(), and because a pmd
has no "special" bit this always returns false, so the VM_PFNMAP | VM_MIXEDMAP
checks are skipped. For dax mappings without struct pages, an invalid struct
page pointer is then returned, which can crash the kernel.

So far this crash may be a theoretical issue: the RAM block device driver
seems safe, as struct pages should be present there. I am not sure about
axonram or nvdimm (putting the maintainers on cc), but dcssblk on s390 is safe
until z/VM gains large page support.

This patch fixes the numa_maps THP handling by introducing new "_pmd" variants
of the can_gather_numa_stats() and vm_normal_page() functions.

Any thoughts?

Gerald Schaefer (1):
  numa: fix /proc/<pid>/numa_maps for THP

 fs/proc/task_mmu.c | 29 ++++++++++++++++++++++++++---
 include/linux/mm.h |  2 ++
 mm/memory.c        | 38 ++++++++++++++++++++++++++++++++++++++
 3 files changed, 66 insertions(+), 3 deletions(-)

-- 
2.3.9

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <dont@kvack.org>

^ permalink raw reply	[flat|nested] 5+ messages in thread

* [PATCH RFC 1/1] numa: fix /proc/<pid>/numa_maps for THP
  2016-02-05 15:33 [PATCH RFC 0/1] numa: fix /proc/<pid>/numa_maps for THP Gerald Schaefer
@ 2016-02-05 15:34 ` Gerald Schaefer
  2016-02-05 15:38   ` Dave Hansen
  0 siblings, 1 reply; 5+ messages in thread
From: Gerald Schaefer @ 2016-02-05 15:34 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, linux-kernel, Naoya Horiguchi, Kirill A. Shutemov,
	Konstantin Khlebnikov, Michal Hocko, Vlastimil Babka,
	Jerome Marchand, Johannes Weiner, Dave Hansen, Mel Gorman,
	Dan Williams, Ross Zwisler, Benjamin Herrenschmidt,
	Paul Mackerras, Martin Schwidefsky, Heiko Carstens,
	Michael Holzheu

In gather_pte_stats() a THP pmd is cast to a pte, which is wrong because the
two layouts may differ depending on the architecture. On s390 this leads to
inaccurate numa_maps accounting in /proc, because the pte_present() and
pte_dirty() checks are misguided when applied to the fake pte.

On other architectures, pte_present() and pte_dirty() may happen to work, but
there is still an issue with direct-access (dax) mappings without underlying
struct pages when HAVE_PTE_SPECIAL is set and THP is available. In
vm_normal_page() the fake pte is checked with pte_special(), and because a pmd
has no "special" bit this always returns false, so the VM_PFNMAP | VM_MIXEDMAP
checks are skipped. For dax mappings without struct pages, an invalid struct
page pointer is then returned, which can crash the kernel.

This patch fixes the numa_maps THP handling by introducing new "_pmd" variants
of the can_gather_numa_stats() and vm_normal_page() functions.

Signed-off-by: Gerald Schaefer <gerald.schaefer@de.ibm.com>
Cc: <stable@vger.kernel.org>        [4.3+]
---
 fs/proc/task_mmu.c | 29 ++++++++++++++++++++++++++---
 include/linux/mm.h |  2 ++
 mm/memory.c        | 38 ++++++++++++++++++++++++++++++++++++++
 3 files changed, 66 insertions(+), 3 deletions(-)

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index fa95ab2..0cd0eed 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -1504,6 +1504,30 @@ static struct page *can_gather_numa_stats(pte_t pte, struct vm_area_struct *vma,
 	return page;
 }
 
+static struct page *can_gather_numa_stats_pmd(pmd_t pmd,
+					      struct vm_area_struct *vma,
+					      unsigned long addr)
+{
+	struct page *page;
+	int nid;
+
+	if (!pmd_present(pmd))
+		return NULL;
+
+	page = vm_normal_page_pmd(vma, addr, pmd);
+	if (!page)
+		return NULL;
+
+	if (PageReserved(page))
+		return NULL;
+
+	nid = page_to_nid(page);
+	if (!node_isset(nid, node_states[N_MEMORY]))
+		return NULL;
+
+	return page;
+}
+
 static int gather_pte_stats(pmd_t *pmd, unsigned long addr,
 		unsigned long end, struct mm_walk *walk)
 {
@@ -1515,12 +1539,11 @@ static int gather_pte_stats(pmd_t *pmd, unsigned long addr,
 
 	ptl = pmd_trans_huge_lock(pmd, vma);
 	if (ptl) {
-		pte_t huge_pte = *(pte_t *)pmd;
 		struct page *page;
 
-		page = can_gather_numa_stats(huge_pte, vma, addr);
+		page = can_gather_numa_stats_pmd(*pmd, vma, addr);
 		if (page)
-			gather_stats(page, md, pte_dirty(huge_pte),
+			gather_stats(page, md, pmd_dirty(*pmd),
 				     HPAGE_PMD_SIZE/PAGE_SIZE);
 		spin_unlock(ptl);
 		return 0;
diff --git a/include/linux/mm.h b/include/linux/mm.h
index a0ad7af..7809b4f 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1111,6 +1111,8 @@ struct zap_details {
 
 struct page *vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
 		pte_t pte);
+struct page *vm_normal_page_pmd(struct vm_area_struct *vma, unsigned long addr,
+				pmd_t pmd);
 
 int zap_vma_ptes(struct vm_area_struct *vma, unsigned long address,
 		unsigned long size);
diff --git a/mm/memory.c b/mm/memory.c
index d279350..ce0a386 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -792,6 +792,44 @@ out:
 	return pfn_to_page(pfn);
 }
 
+struct page *vm_normal_page_pmd(struct vm_area_struct *vma, unsigned long addr,
+				pmd_t pmd)
+{
+	unsigned long pfn = pmd_pfn(pmd);
+
+	/*
+	 * There is no pmd_special() but there may be special pmds, e.g.
+	 * in a direct-access (dax) mapping, so let's just replicate the
+	 * !HAVE_PTE_SPECIAL case from vm_normal_page() here.
+	 */
+	if (unlikely(vma->vm_flags & (VM_PFNMAP|VM_MIXEDMAP))) {
+		if (vma->vm_flags & VM_MIXEDMAP) {
+			if (!pfn_valid(pfn))
+				return NULL;
+			goto out;
+		} else {
+			unsigned long off;
+			off = (addr - vma->vm_start) >> PAGE_SHIFT;
+			if (pfn == vma->vm_pgoff + off)
+				return NULL;
+			if (!is_cow_mapping(vma->vm_flags))
+				return NULL;
+		}
+	}
+
+	if (is_zero_pfn(pfn))
+		return NULL;
+	if (unlikely(pfn > highest_memmap_pfn))
+		return NULL;
+
+	/*
+	 * NOTE! We still have PageReserved() pages in the page tables.
+	 * eg. VDSO mappings can cause them to exist.
+	 */
+out:
+	return pfn_to_page(pfn);
+}
+
 /*
  * copy one vm_area from one task to the other. Assumes the page tables
  * already present in the new task to be cleared in the whole range
-- 
2.3.9



* Re: [PATCH RFC 1/1] numa: fix /proc/<pid>/numa_maps for THP
  2016-02-05 15:34 ` [PATCH RFC 1/1] " Gerald Schaefer
@ 2016-02-05 15:38   ` Dave Hansen
  2016-02-05 16:03     ` Gerald Schaefer
  0 siblings, 1 reply; 5+ messages in thread
From: Dave Hansen @ 2016-02-05 15:38 UTC (permalink / raw)
  To: Gerald Schaefer, Andrew Morton
  Cc: linux-mm, linux-kernel, Naoya Horiguchi, Kirill A. Shutemov,
	Konstantin Khlebnikov, Michal Hocko, Vlastimil Babka,
	Jerome Marchand, Johannes Weiner, Mel Gorman, Dan Williams,
	Ross Zwisler, Benjamin Herrenschmidt, Paul Mackerras,
	Martin Schwidefsky, Heiko Carstens, Michael Holzheu

On 02/05/2016 07:34 AM, Gerald Schaefer wrote:
> +static struct page *can_gather_numa_stats_pmd(pmd_t pmd,
> +					      struct vm_area_struct *vma,
> +					      unsigned long addr)
> +{

Is there a way to do this without making a copy of most of
can_gather_numa_stats()?  Seems like the kind of thing where the pmd
version will bitrot.



* Re: [PATCH RFC 1/1] numa: fix /proc/<pid>/numa_maps for THP
  2016-02-05 15:38   ` Dave Hansen
@ 2016-02-05 16:03     ` Gerald Schaefer
  2016-02-08 19:05       ` Gerald Schaefer
  0 siblings, 1 reply; 5+ messages in thread
From: Gerald Schaefer @ 2016-02-05 16:03 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Andrew Morton, linux-mm, linux-kernel, Naoya Horiguchi,
	Kirill A. Shutemov, Konstantin Khlebnikov, Michal Hocko,
	Vlastimil Babka, Jerome Marchand, Johannes Weiner, Mel Gorman,
	Dan Williams, Ross Zwisler, Benjamin Herrenschmidt,
	Paul Mackerras, Martin Schwidefsky, Heiko Carstens,
	Michael Holzheu

On Fri, 5 Feb 2016 07:38:09 -0800
Dave Hansen <dave.hansen@intel.com> wrote:

> On 02/05/2016 07:34 AM, Gerald Schaefer wrote:
> > +static struct page *can_gather_numa_stats_pmd(pmd_t pmd,
> > +					      struct vm_area_struct *vma,
> > +					      unsigned long addr)
> > +{
> 
> Is there a way to do this without making a copy of most of
> can_gather_numa_stats()?  Seems like the kind of thing where the pmd
> version will bitrot.
> 

Yes, that also gave me a little headache, even more so with the vm_normal_page()
code duplication, but I didn't see a much better way. Separate _pte/_pmd
functions that largely do the same thing don't seem so uncommon to me.

The best I could think of would be splitting the !HAVE_PTE_SPECIAL path
in vm_normal_page() into a separate function, but I don't see much room for
improvement in can_gather_numa_stats(), other than maybe not having
a _pmd version at all and doing all the work inside gather_pte_stats(),
though that would probably just relocate the code duplication.



* Re: [PATCH RFC 1/1] numa: fix /proc/<pid>/numa_maps for THP
  2016-02-05 16:03     ` Gerald Schaefer
@ 2016-02-08 19:05       ` Gerald Schaefer
  0 siblings, 0 replies; 5+ messages in thread
From: Gerald Schaefer @ 2016-02-08 19:05 UTC (permalink / raw)
  To: Gerald Schaefer
  Cc: Dave Hansen, Andrew Morton, linux-mm, linux-kernel,
	Naoya Horiguchi, Kirill A. Shutemov, Konstantin Khlebnikov,
	Michal Hocko, Vlastimil Babka, Jerome Marchand, Johannes Weiner,
	Mel Gorman, Dan Williams, Ross Zwisler, Benjamin Herrenschmidt,
	Paul Mackerras, Martin Schwidefsky, Heiko Carstens,
	Michael Holzheu

On Fri, 5 Feb 2016 17:03:17 +0100
Gerald Schaefer <gerald.schaefer@de.ibm.com> wrote:

> On Fri, 5 Feb 2016 07:38:09 -0800
> Dave Hansen <dave.hansen@intel.com> wrote:
> 
> > On 02/05/2016 07:34 AM, Gerald Schaefer wrote:
> > > +static struct page *can_gather_numa_stats_pmd(pmd_t pmd,
> > > +					      struct vm_area_struct *vma,
> > > +					      unsigned long addr)
> > > +{
> > 
> > Is there a way to do this without making a copy of most of
> > can_gather_numa_stats()?  Seems like the kind of thing where the pmd
> > version will bitrot.
> > 
> 
> Yes, that also gave me a little headache, even more so with the vm_normal_page()
> code duplication, but I didn't see a much better way. Separate _pte/_pmd
> functions that largely do the same thing don't seem so uncommon to me.
> 
> The best I could think of would be splitting the !HAVE_PTE_SPECIAL path
> in vm_normal_page() into a separate function, but I don't see much room for
> improvement in can_gather_numa_stats(), other than maybe not having
> a _pmd version at all and doing all the work inside gather_pte_stats(),
> though that would probably just relocate the code duplication.

Nope, I can't see any sane way to avoid the (trivial) code duplication in
can_gather_numa_stats_pmd(). Adding a "common" function for the _pte and _pmd
handling (using void *) would be very ugly, and given that the duplicated
code is just trivial sanity checks, it also seems very disproportionate.
BTW, we also have (or should have) no "pmd version bitrot" in the countless
other places where we have separate _pte/_pmd versions.

So, any ideas or feedback on the vm_normal_page(_pmd) part?


