* [RFC PATCH 0/5] Use an alternative to _PAGE_PROTNONE for _PAGE_NUMA v2
@ 2014-04-08 13:09 Mel Gorman
2014-04-08 13:09 ` [PATCH 1/5] x86: Require x86-64 for automatic NUMA balancing Mel Gorman
` (5 more replies)
0 siblings, 6 replies; 26+ messages in thread
From: Mel Gorman @ 2014-04-08 13:09 UTC (permalink / raw)
To: Linux-X86
Cc: Linus Torvalds, Cyrill Gorcunov, Mel Gorman, Peter Anvin,
Ingo Molnar, Steven Noonan, Rik van Riel, David Vrabel,
Andrew Morton, Peter Zijlstra, Andrea Arcangeli, Dave Hansen,
Srikar Dronamraju, Linux-MM, LKML
Using unused physical bits is something that will break eventually.
Changelog since V1
o Reuse software-bits
o Use paravirt ops when modifying PTEs in the NUMA helpers
Aliasing _PAGE_NUMA and _PAGE_PROTNONE had some convenient properties but
it ultimately gave Xen a headache and pisses off almost everybody who
looks closely at it. Two discussions on "why this makes sense" is one
discussion too many so, rather than having a third, here is this series.
This series reuses the PTE bits that are available to the programmer.
This adds some constraints on how and when automatic NUMA balancing can be
enabled but they should go away again when Xen stops using _PAGE_IOMAP.
The series also converts the NUMA helpers to use paravirt-friendly operations
but it needs a Tested-by from the Xen and powerpc people.
arch/x86/Kconfig | 2 +-
arch/x86/include/asm/pgtable.h | 5 +++
arch/x86/include/asm/pgtable_types.h | 66 ++++++++++++++++++++----------------
include/asm-generic/pgtable.h | 31 ++++++++++++-----
mm/memory.c | 12 -------
5 files changed, 66 insertions(+), 50 deletions(-)
--
1.8.4.5
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
* [PATCH 1/5] x86: Require x86-64 for automatic NUMA balancing
2014-04-08 13:09 [RFC PATCH 0/5] Use an alternative to _PAGE_PROTNONE for _PAGE_NUMA v2 Mel Gorman
@ 2014-04-08 13:09 ` Mel Gorman
2014-04-08 13:09 ` [PATCH 2/5] x86: Define _PAGE_NUMA by reusing software bits on the PMD and PTE levels Mel Gorman
` (4 subsequent siblings)
5 siblings, 0 replies; 26+ messages in thread
From: Mel Gorman @ 2014-04-08 13:09 UTC (permalink / raw)
To: Linux-X86
Cc: Linus Torvalds, Cyrill Gorcunov, Mel Gorman, Peter Anvin,
Ingo Molnar, Steven Noonan, Rik van Riel, David Vrabel,
Andrew Morton, Peter Zijlstra, Andrea Arcangeli, Dave Hansen,
Srikar Dronamraju, Linux-MM, LKML
32-bit support for NUMA is an oddity on its own but with automatic NUMA
balancing on top there is a reasonable risk that the CPUPID information
cannot be stored in the page flags. This patch removes support for
automatic NUMA balancing on 32-bit x86.
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
arch/x86/Kconfig | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 0af5250..084b1c1 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -26,7 +26,7 @@ config X86
select ARCH_MIGHT_HAVE_PC_SERIO
select HAVE_AOUT if X86_32
select HAVE_UNSTABLE_SCHED_CLOCK
- select ARCH_SUPPORTS_NUMA_BALANCING
+ select ARCH_SUPPORTS_NUMA_BALANCING if X86_64
select ARCH_SUPPORTS_INT128 if X86_64
select ARCH_WANTS_PROT_NUMA_PROT_NONE
select HAVE_IDE
--
1.8.4.5
* [PATCH 2/5] x86: Define _PAGE_NUMA by reusing software bits on the PMD and PTE levels
2014-04-08 13:09 [RFC PATCH 0/5] Use an alternative to _PAGE_PROTNONE for _PAGE_NUMA v2 Mel Gorman
2014-04-08 13:09 ` [PATCH 1/5] x86: Require x86-64 for automatic NUMA balancing Mel Gorman
@ 2014-04-08 13:09 ` Mel Gorman
2014-04-08 13:09 ` [PATCH 3/5] mm: Allow FOLL_NUMA on FOLL_FORCE Mel Gorman
` (3 subsequent siblings)
5 siblings, 0 replies; 26+ messages in thread
From: Mel Gorman @ 2014-04-08 13:09 UTC (permalink / raw)
To: Linux-X86
Cc: Linus Torvalds, Cyrill Gorcunov, Mel Gorman, Peter Anvin,
Ingo Molnar, Steven Noonan, Rik van Riel, David Vrabel,
Andrew Morton, Peter Zijlstra, Andrea Arcangeli, Dave Hansen,
Srikar Dronamraju, Linux-MM, LKML
_PAGE_NUMA is currently an alias of _PAGE_PROTNONE to trap NUMA hinting
faults. Care is taken such that _PAGE_NUMA is used only in situations where
the VMA flags distinguish between NUMA hinting faults and prot_none faults.
Conceptually this is difficult and it has caused problems.
Fundamentally, we only need the _PAGE_NUMA bit to tell the difference between
an entry that is really unmapped and a page that is protected for NUMA
hinting faults, because a fault will be trapped whenever the PTE is not present.
Currently one of the software bits is used for identifying IO mappings and
by Xen to track if it's a Xen PTE or a machine PFN. This patch reuses the
software bit for IOMAP for NUMA hinting faults with the expectation that
the bit is not used for userspace addresses. Xen and NUMA balancing are
now mutually exclusive in Kconfig.
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
arch/x86/Kconfig | 2 +-
arch/x86/include/asm/pgtable.h | 5 ++++
arch/x86/include/asm/pgtable_types.h | 54 +++++++++++++++++-------------------
3 files changed, 31 insertions(+), 30 deletions(-)
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 084b1c1..4fab25a 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -26,7 +26,7 @@ config X86
select ARCH_MIGHT_HAVE_PC_SERIO
select HAVE_AOUT if X86_32
select HAVE_UNSTABLE_SCHED_CLOCK
- select ARCH_SUPPORTS_NUMA_BALANCING if X86_64
+ select ARCH_SUPPORTS_NUMA_BALANCING if X86_64 && !XEN
select ARCH_SUPPORTS_INT128 if X86_64
select ARCH_WANTS_PROT_NUMA_PROT_NONE
select HAVE_IDE
diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index bbc8b12..076daff 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -447,6 +447,8 @@ static inline int pte_same(pte_t a, pte_t b)
 
 static inline int pte_present(pte_t a)
 {
+	VM_BUG_ON((pte_flags(a) & (_PAGE_NUMA | _PAGE_GLOBAL)) ==
+			(_PAGE_NUMA | _PAGE_GLOBAL));
 	return pte_flags(a) & (_PAGE_PRESENT | _PAGE_PROTNONE |
 			       _PAGE_NUMA);
 }
@@ -471,6 +473,9 @@ static inline int pte_hidden(pte_t pte)
 
 static inline int pmd_present(pmd_t pmd)
 {
+	VM_BUG_ON((pmd_flags(pmd) & (_PAGE_NUMA | _PAGE_GLOBAL)) ==
+			(_PAGE_NUMA | _PAGE_GLOBAL));
+
 	/*
 	 * Checking for _PAGE_PSE is needed too because
 	 * split_huge_page will temporarily clear the present bit (but
diff --git a/arch/x86/include/asm/pgtable_types.h b/arch/x86/include/asm/pgtable_types.h
index 1aa9ccd..49b3e15 100644
--- a/arch/x86/include/asm/pgtable_types.h
+++ b/arch/x86/include/asm/pgtable_types.h
@@ -16,13 +16,17 @@
#define _PAGE_BIT_PSE 7 /* 4 MB (or 2MB) page */
#define _PAGE_BIT_PAT 7 /* on 4KB pages */
#define _PAGE_BIT_GLOBAL 8 /* Global TLB entry PPro+ */
-#define _PAGE_BIT_UNUSED1 9 /* available for programmer */
-#define _PAGE_BIT_IOMAP 10 /* flag used to indicate IO mapping */
-#define _PAGE_BIT_HIDDEN 11 /* hidden by kmemcheck */
+#define _PAGE_BIT_SOFTW1 9 /* available for programmer */
+#define _PAGE_BIT_SOFTW2 10 /* " */
+#define _PAGE_BIT_SOFTW3 11 /* " */
#define _PAGE_BIT_PAT_LARGE 12 /* On 2MB or 1GB pages */
-#define _PAGE_BIT_SPECIAL _PAGE_BIT_UNUSED1
-#define _PAGE_BIT_CPA_TEST _PAGE_BIT_UNUSED1
-#define _PAGE_BIT_SPLITTING _PAGE_BIT_UNUSED1 /* only valid on a PSE pmd */
+#define _PAGE_BIT_SPECIAL _PAGE_BIT_SOFTW1
+#define _PAGE_BIT_CPA_TEST _PAGE_BIT_SOFTW1
+#define _PAGE_BIT_SPLITTING _PAGE_BIT_SOFTW1 /* only valid on a PSE pmd */
+#define _PAGE_BIT_IOMAP _PAGE_BIT_SOFTW2 /* flag used to indicate IO mapping */
+#define _PAGE_BIT_NUMA _PAGE_BIT_SOFTW2 /* for NUMA balancing hinting */
+#define _PAGE_BIT_HIDDEN _PAGE_BIT_SOFTW3 /* hidden by kmemcheck */
+#define _PAGE_BIT_SOFT_DIRTY _PAGE_BIT_SOFTW3 /* software dirty tracking */
#define _PAGE_BIT_NX 63 /* No execute: only valid after cpuid check */
/* If _PAGE_BIT_PRESENT is clear, we use these: */
@@ -40,7 +44,6 @@
#define _PAGE_DIRTY (_AT(pteval_t, 1) << _PAGE_BIT_DIRTY)
#define _PAGE_PSE (_AT(pteval_t, 1) << _PAGE_BIT_PSE)
#define _PAGE_GLOBAL (_AT(pteval_t, 1) << _PAGE_BIT_GLOBAL)
-#define _PAGE_UNUSED1 (_AT(pteval_t, 1) << _PAGE_BIT_UNUSED1)
#define _PAGE_IOMAP (_AT(pteval_t, 1) << _PAGE_BIT_IOMAP)
#define _PAGE_PAT (_AT(pteval_t, 1) << _PAGE_BIT_PAT)
#define _PAGE_PAT_LARGE (_AT(pteval_t, 1) << _PAGE_BIT_PAT_LARGE)
@@ -61,8 +64,6 @@
* they do not conflict with each other.
*/
-#define _PAGE_BIT_SOFT_DIRTY _PAGE_BIT_HIDDEN
-
#ifdef CONFIG_MEM_SOFT_DIRTY
#define _PAGE_SOFT_DIRTY (_AT(pteval_t, 1) << _PAGE_BIT_SOFT_DIRTY)
#else
@@ -70,6 +71,21 @@
#endif
/*
+ * _PAGE_NUMA distinguishes between a numa hinting minor fault and a page
+ * that is not present. The hinting fault gathers numa placement statistics
+ * (see pte_numa()). The bit is always zero when the PTE is not present.
+ *
+ * The bit picked must be always zero when the pmd is present and not
+ * present, so that we don't lose information when we set it while
+ * atomically clearing the present bit.
+ */
+#ifdef CONFIG_NUMA_BALANCING
+#define _PAGE_NUMA (_AT(pteval_t, 1) << _PAGE_BIT_NUMA)
+#else
+#define _PAGE_NUMA (_AT(pteval_t, 0))
+#endif
+
+/*
* Tracking soft dirty bit when a page goes to a swap is tricky.
* We need a bit which can be stored in pte _and_ not conflict
* with swap entry format. On x86 bits 6 and 7 are *not* involved
@@ -94,26 +110,6 @@
#define _PAGE_FILE (_AT(pteval_t, 1) << _PAGE_BIT_FILE)
#define _PAGE_PROTNONE (_AT(pteval_t, 1) << _PAGE_BIT_PROTNONE)
-/*
- * _PAGE_NUMA indicates that this page will trigger a numa hinting
- * minor page fault to gather numa placement statistics (see
- * pte_numa()). The bit picked (8) is within the range between
- * _PAGE_FILE (6) and _PAGE_PROTNONE (8) bits. Therefore, it doesn't
- * require changes to the swp entry format because that bit is always
- * zero when the pte is not present.
- *
- * The bit picked must be always zero when the pmd is present and not
- * present, so that we don't lose information when we set it while
- * atomically clearing the present bit.
- *
- * Because we shared the same bit (8) with _PAGE_PROTNONE this can be
- * interpreted as _PAGE_NUMA only in places that _PAGE_PROTNONE
- * couldn't reach, like handle_mm_fault() (see access_error in
- * arch/x86/mm/fault.c, the vma protection must not be PROT_NONE for
- * handle_mm_fault() to be invoked).
- */
-#define _PAGE_NUMA _PAGE_PROTNONE
-
#define _PAGE_TABLE (_PAGE_PRESENT | _PAGE_RW | _PAGE_USER | \
_PAGE_ACCESSED | _PAGE_DIRTY)
#define _KERNPG_TABLE (_PAGE_PRESENT | _PAGE_RW | _PAGE_ACCESSED | \
--
1.8.4.5
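For readers following along, the bit layout that patch 2/5 ends up with can be
sketched in userspace C. This is only an illustration under stated
assumptions: pteval_t is a plain 64-bit integer, the PTE_* names and the
sketch_* helpers are hypothetical stand-ins for the kernel's _PAGE_* macros
and pte_*() helpers, and assert() stands in for VM_BUG_ON. The bit numbers
(0, 8, 10) are the ones shown in the diff above for the non-Xen case.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t pteval_t;	/* toy stand-in for the kernel type */

#define PTE_PRESENT	((pteval_t)1 << 0)	/* _PAGE_PRESENT */
#define PTE_GLOBAL	((pteval_t)1 << 8)	/* _PAGE_GLOBAL */
#define PTE_PROTNONE	PTE_GLOBAL		/* _PAGE_PROTNONE reuses bit 8 */
#define PTE_NUMA	((pteval_t)1 << 10)	/* _PAGE_NUMA on _PAGE_BIT_SOFTW2 */

/* Mirrors the patched pte_present(): any of the three bits counts. */
static inline bool sketch_pte_present(pteval_t flags)
{
	/* Stand-in for the VM_BUG_ON: NUMA and GLOBAL must not both be set. */
	assert((flags & (PTE_NUMA | PTE_GLOBAL)) != (PTE_NUMA | PTE_GLOBAL));
	return (flags & (PTE_PRESENT | PTE_PROTNONE | PTE_NUMA)) != 0;
}

/* NUMA hinting entry: hardware present bit clear, NUMA bit set. */
static inline bool sketch_pte_numa(pteval_t flags)
{
	return (flags & (PTE_NUMA | PTE_PRESENT)) == PTE_NUMA;
}

/* PROT_NONE entry: hardware present bit clear, bit 8 set. */
static inline bool sketch_pte_protnone(pteval_t flags)
{
	return (flags & (PTE_PROTNONE | PTE_PRESENT)) == PTE_PROTNONE;
}
```

With separate bits, a NUMA hinting entry and a PROT_NONE entry can be told
apart from the flags alone, which is the property the aliased definition
could not provide without consulting the VMA.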
* [PATCH 3/5] mm: Allow FOLL_NUMA on FOLL_FORCE
2014-04-08 13:09 [RFC PATCH 0/5] Use an alternative to _PAGE_PROTNONE for _PAGE_NUMA v2 Mel Gorman
2014-04-08 13:09 ` [PATCH 1/5] x86: Require x86-64 for automatic NUMA balancing Mel Gorman
2014-04-08 13:09 ` [PATCH 2/5] x86: Define _PAGE_NUMA by reusing software bits on the PMD and PTE levels Mel Gorman
@ 2014-04-08 13:09 ` Mel Gorman
2014-04-08 13:09 ` [PATCH 4/5] mm: use paravirt friendly ops for NUMA hinting ptes Mel Gorman
` (2 subsequent siblings)
5 siblings, 0 replies; 26+ messages in thread
From: Mel Gorman @ 2014-04-08 13:09 UTC (permalink / raw)
To: Linux-X86
Cc: Linus Torvalds, Cyrill Gorcunov, Mel Gorman, Peter Anvin,
Ingo Molnar, Steven Noonan, Rik van Riel, David Vrabel,
Andrew Morton, Peter Zijlstra, Andrea Arcangeli, Dave Hansen,
Srikar Dronamraju, Linux-MM, LKML
As _PAGE_NUMA is no longer aliased to _PAGE_PROTNONE there should be no
confusion between them. It should be possible to kick away the special
casing in __get_user_pages.
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
mm/memory.c | 12 ------------
1 file changed, 12 deletions(-)
diff --git a/mm/memory.c b/mm/memory.c
index 22dfa61..b9c35a7 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1714,18 +1714,6 @@ long __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
 	vm_flags &= (gup_flags & FOLL_FORCE) ?
 			(VM_MAYREAD | VM_MAYWRITE) : (VM_READ | VM_WRITE);
 
-	/*
-	 * If FOLL_FORCE and FOLL_NUMA are both set, handle_mm_fault
-	 * would be called on PROT_NONE ranges. We must never invoke
-	 * handle_mm_fault on PROT_NONE ranges or the NUMA hinting
-	 * page faults would unprotect the PROT_NONE ranges if
-	 * _PAGE_NUMA and _PAGE_PROTNONE are sharing the same pte/pmd
-	 * bitflag. So to avoid that, don't set FOLL_NUMA if
-	 * FOLL_FORCE is set.
-	 */
-	if (!(gup_flags & FOLL_FORCE))
-		gup_flags |= FOLL_NUMA;
-
 	i = 0;
 
 	do {
--
1.8.4.5
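The special case that patch 3/5 talks about can be sketched as two small
functions. This is an illustration only: the SK_FOLL_* values are
assumptions chosen for the example, and both function names are
hypothetical. The first function mirrors the lines the diff deletes; the
second is one reading of the subject line ("Allow FOLL_NUMA on
FOLL_FORCE"). Note that the diff as posted deletes the assignment together
with the check, so the sketch does not claim to show what the patched
function does.

```c
#include <assert.h>

/* Illustrative flag values, assumed for this sketch only. */
#define SK_FOLL_FORCE	0x10u
#define SK_FOLL_NUMA	0x200u

/* Mirrors the removed lines: FOLL_NUMA is added only without FOLL_FORCE. */
static inline unsigned int sk_gup_flags_before(unsigned int gup_flags)
{
	if (!(gup_flags & SK_FOLL_FORCE))
		gup_flags |= SK_FOLL_NUMA;
	return gup_flags;
}

/* One reading of the subject line: FOLL_NUMA regardless of FOLL_FORCE. */
static inline unsigned int sk_gup_flags_unconditional(unsigned int gup_flags)
{
	return gup_flags | SK_FOLL_NUMA;
}
```

The two functions only differ for callers that pass FOLL_FORCE, which is the
case the removed comment was guarding against while the two bits were
aliased.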
* [PATCH 4/5] mm: use paravirt friendly ops for NUMA hinting ptes
2014-04-08 13:09 [RFC PATCH 0/5] Use an alternative to _PAGE_PROTNONE for _PAGE_NUMA v2 Mel Gorman
` (2 preceding siblings ...)
2014-04-08 13:09 ` [PATCH 3/5] mm: Allow FOLL_NUMA on FOLL_FORCE Mel Gorman
@ 2014-04-08 13:09 ` Mel Gorman
2014-04-08 17:21 ` David Vrabel
2014-04-15 10:27 ` David Vrabel
2014-04-08 13:09 ` [PATCH 5/5] x86: Allow Xen to enable NUMA_BALANCING Mel Gorman
2014-04-08 14:40 ` [RFC PATCH 0/5] Use an alternative to _PAGE_PROTNONE for _PAGE_NUMA v2 H. Peter Anvin
5 siblings, 2 replies; 26+ messages in thread
From: Mel Gorman @ 2014-04-08 13:09 UTC (permalink / raw)
To: Linux-X86
Cc: Linus Torvalds, Cyrill Gorcunov, Mel Gorman, Peter Anvin,
Ingo Molnar, Steven Noonan, Rik van Riel, David Vrabel,
Andrew Morton, Peter Zijlstra, Andrea Arcangeli, Dave Hansen,
Srikar Dronamraju, Linux-MM, LKML
David Vrabel identified a regression when using automatic NUMA balancing
under Xen whereby page table entries were getting corrupted due to the
use of native PTE operations. Quoting him
Xen PV guest page tables require that their entries use machine
addresses if the present bit (_PAGE_PRESENT) is set, and (for
successful migration) non-present PTEs must use pseudo-physical
addresses. This is because on migration MFNs in present PTEs are
translated to PFNs (canonicalised) so they may be translated back
to the new MFN in the destination domain (uncanonicalised).
pte_mknonnuma(), pmd_mknonnuma(), pte_mknuma() and pmd_mknuma()
set and clear the _PAGE_PRESENT bit using pte_set_flags(),
pte_clear_flags(), etc.
In a Xen PV guest, these functions must translate MFNs to PFNs
when clearing _PAGE_PRESENT and translate PFNs to MFNs when setting
_PAGE_PRESENT.
His suggested fix converted p[te|md]_[set|clear]_flags to using
paravirt-friendly ops but this is overkill. He suggested an alternative of
using p[te|md]_modify in the NUMA page table operations but this does
more work than necessary and would require looking up a VMA for protections.
This patch modifies the NUMA page table operations to use paravirt friendly
operations to set/clear the flags of interest. Unfortunately this will take
a performance hit when updating the PTEs on CONFIG_PARAVIRT but I do not
see a way around it that does not break Xen.
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
include/asm-generic/pgtable.h | 31 +++++++++++++++++++++++--------
1 file changed, 23 insertions(+), 8 deletions(-)
diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
index 34c7bdc..38a7437 100644
--- a/include/asm-generic/pgtable.h
+++ b/include/asm-generic/pgtable.h
@@ -680,24 +680,35 @@ static inline int pmd_numa(pmd_t pmd)
 #ifndef pte_mknonnuma
 static inline pte_t pte_mknonnuma(pte_t pte)
 {
-	pte = pte_clear_flags(pte, _PAGE_NUMA);
-	return pte_set_flags(pte, _PAGE_PRESENT|_PAGE_ACCESSED);
+	pteval_t val = pte_val(pte);
+
+	val &= ~_PAGE_NUMA;
+	val |= (_PAGE_PRESENT|_PAGE_ACCESSED);
+	return __pte(val);
 }
 #endif
 
 #ifndef pmd_mknonnuma
 static inline pmd_t pmd_mknonnuma(pmd_t pmd)
 {
-	pmd = pmd_clear_flags(pmd, _PAGE_NUMA);
-	return pmd_set_flags(pmd, _PAGE_PRESENT|_PAGE_ACCESSED);
+	pmdval_t val = pmd_val(pmd);
+
+	val &= ~_PAGE_NUMA;
+	val |= (_PAGE_PRESENT|_PAGE_ACCESSED);
+
+	return __pmd(val);
 }
 #endif
 
 #ifndef pte_mknuma
 static inline pte_t pte_mknuma(pte_t pte)
 {
-	pte = pte_set_flags(pte, _PAGE_NUMA);
-	return pte_clear_flags(pte, _PAGE_PRESENT);
+	pteval_t val = pte_val(pte);
+
+	val &= ~_PAGE_PRESENT;
+	val |= _PAGE_NUMA;
+
+	return __pte(val);
 }
 #endif
 
@@ -716,8 +727,12 @@ static inline void ptep_set_numa(struct mm_struct *mm, unsigned long addr,
 #ifndef pmd_mknuma
 static inline pmd_t pmd_mknuma(pmd_t pmd)
 {
-	pmd = pmd_set_flags(pmd, _PAGE_NUMA);
-	return pmd_clear_flags(pmd, _PAGE_PRESENT);
+	pmdval_t val = pmd_val(pmd);
+
+	val &= ~_PAGE_PRESENT;
+	val |= _PAGE_NUMA;
+
+	return __pmd(val);
 }
 #endif
 
--
1.8.4.5
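The shape of the helpers in patch 4/5 can be tried out in userspace with a toy
PTE type. This is a sketch under assumptions, not the kernel code: sk_pte_t,
sk_pte_val() and sk_mk_pte() are hypothetical stand-ins for pte_t, pte_val()
and __pte(), and here they are identity conversions. In a paravirt kernel
those two conversions are where a Xen PV guest would get the chance to
translate between MFNs and PFNs, which is why the patch routes the flag
update through them. Bit positions follow the diff in patch 2/5 for the
non-Xen case, with bit 5 assumed for the accessed flag.

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t pteval_t;			/* toy stand-in */
typedef struct { pteval_t pte; } sk_pte_t;	/* toy stand-in for pte_t */

#define PTE_PRESENT	((pteval_t)1 << 0)	/* _PAGE_PRESENT */
#define PTE_ACCESSED	((pteval_t)1 << 5)	/* _PAGE_ACCESSED */
#define PTE_NUMA	((pteval_t)1 << 10)	/* _PAGE_NUMA (non-Xen case) */

/* Identity here; the kernel equivalents are the paravirt-aware accessors. */
static inline pteval_t sk_pte_val(sk_pte_t pte) { return pte.pte; }
static inline sk_pte_t sk_mk_pte(pteval_t val) { sk_pte_t p = { val }; return p; }

/* Mark an entry for a NUMA hinting fault: clear present, set NUMA. */
static inline sk_pte_t sk_pte_mknuma(sk_pte_t pte)
{
	pteval_t val = sk_pte_val(pte);

	val &= ~PTE_PRESENT;
	val |= PTE_NUMA;
	return sk_mk_pte(val);
}

/* Undo the marking: clear NUMA, set present and accessed. */
static inline sk_pte_t sk_pte_mknonnuma(sk_pte_t pte)
{
	pteval_t val = sk_pte_val(pte);

	val &= ~PTE_NUMA;
	val |= PTE_PRESENT | PTE_ACCESSED;
	return sk_mk_pte(val);
}
```

Marking an entry and then unmarking it leaves the address bits untouched and
ends with present and accessed set, which matches what the patched
pte_mknuma() and pte_mknonnuma() do to the raw value.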
* [PATCH 5/5] x86: Allow Xen to enable NUMA_BALANCING
2014-04-08 13:09 [RFC PATCH 0/5] Use an alternative to _PAGE_PROTNONE for _PAGE_NUMA v2 Mel Gorman
` (3 preceding siblings ...)
2014-04-08 13:09 ` [PATCH 4/5] mm: use paravirt friendly ops for NUMA hinting ptes Mel Gorman
@ 2014-04-08 13:09 ` Mel Gorman
2014-04-08 14:40 ` [RFC PATCH 0/5] Use an alternative to _PAGE_PROTNONE for _PAGE_NUMA v2 H. Peter Anvin
5 siblings, 0 replies; 26+ messages in thread
From: Mel Gorman @ 2014-04-08 13:09 UTC (permalink / raw)
To: Linux-X86
Cc: Linus Torvalds, Cyrill Gorcunov, Mel Gorman, Peter Anvin,
Ingo Molnar, Steven Noonan, Rik van Riel, David Vrabel,
Andrew Morton, Peter Zijlstra, Andrea Arcangeli, Dave Hansen,
Srikar Dronamraju, Linux-MM, LKML
Xen cannot use automatic NUMA balancing as both depend on the same PTE
bit. There is another software bit that is currently used by software dirty
tracking of pages. This patch allows Xen to use that bit for automatic NUMA
balancing if MEM_SOFT_DIRTY is not enabled. If KMEMCHECK is enabled then
the bit is only set on global page tables so there should be no collision
with NUMA_BALANCING. This shuffling can be disabled if/when Xen moves away
from using _PAGE_BIT_IOMAP.
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
arch/x86/Kconfig | 2 +-
arch/x86/include/asm/pgtable_types.h | 14 +++++++++++++-
2 files changed, 14 insertions(+), 2 deletions(-)
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 4fab25a..3c4ba81 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -26,7 +26,7 @@ config X86
select ARCH_MIGHT_HAVE_PC_SERIO
select HAVE_AOUT if X86_32
select HAVE_UNSTABLE_SCHED_CLOCK
- select ARCH_SUPPORTS_NUMA_BALANCING if X86_64 && !XEN
+ select ARCH_SUPPORTS_NUMA_BALANCING if X86_64 && (!XEN || !MEM_SOFT_DIRTY)
select ARCH_SUPPORTS_INT128 if X86_64
select ARCH_WANTS_PROT_NUMA_PROT_NONE
select HAVE_IDE
diff --git a/arch/x86/include/asm/pgtable_types.h b/arch/x86/include/asm/pgtable_types.h
index 49b3e15..fa84d1f 100644
--- a/arch/x86/include/asm/pgtable_types.h
+++ b/arch/x86/include/asm/pgtable_types.h
@@ -24,11 +24,23 @@
#define _PAGE_BIT_CPA_TEST _PAGE_BIT_SOFTW1
#define _PAGE_BIT_SPLITTING _PAGE_BIT_SOFTW1 /* only valid on a PSE pmd */
#define _PAGE_BIT_IOMAP _PAGE_BIT_SOFTW2 /* flag used to indicate IO mapping */
-#define _PAGE_BIT_NUMA _PAGE_BIT_SOFTW2 /* for NUMA balancing hinting */
#define _PAGE_BIT_HIDDEN _PAGE_BIT_SOFTW3 /* hidden by kmemcheck */
#define _PAGE_BIT_SOFT_DIRTY _PAGE_BIT_SOFTW3 /* software dirty tracking */
#define _PAGE_BIT_NX 63 /* No execute: only valid after cpuid check */
+/*
+ * Automatic NUMA balancing uses _PAGE_BIT_SOFTW2 if available as generally it
+ * is only used on the kernel page tables and is easily shared. Unfortunately,
+ * Xen also uses this bit so on those configurations it is necessary to use
+ * _PAGE_BIT_SOFTW3 but then MEM_SOFT_DIRTY cannot be enabled at the same time
+ * as it also requires that bit. Constraint is enforced by Kconfig.
+ */
+#ifndef CONFIG_XEN
+#define _PAGE_BIT_NUMA _PAGE_BIT_SOFTW2
+#else
+#define _PAGE_BIT_NUMA _PAGE_BIT_SOFTW3
+#endif
+
/* If _PAGE_BIT_PRESENT is clear, we use these: */
/* - if the user mapped it with PROT_NONE; pte_present gives true */
#define _PAGE_BIT_PROTNONE _PAGE_BIT_GLOBAL
--
1.8.4.5
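The combined effect of the Kconfig condition and the #ifndef in patch 5/5 can
be written down as a small decision function. This is a sketch that assumes
an x86-64 build; sk_numa_bit_for_config() is a hypothetical name, and its
two parameters stand in for CONFIG_XEN and CONFIG_MEM_SOFT_DIRTY. The values
10 and 11 are _PAGE_BIT_SOFTW2 and _PAGE_BIT_SOFTW3 from the diff.

```c
#include <assert.h>

#define SK_BIT_SOFTW2	10	/* shared with _PAGE_BIT_IOMAP */
#define SK_BIT_SOFTW3	11	/* shared with soft-dirty and kmemcheck */

/*
 * Returns the bit used for _PAGE_NUMA, or -1 when the Kconfig condition
 * "X86_64 && (!XEN || !MEM_SOFT_DIRTY)" does not allow NUMA balancing.
 */
static inline int sk_numa_bit_for_config(int xen, int mem_soft_dirty)
{
	if (!xen)
		return SK_BIT_SOFTW2;	/* bit 10 is free of Xen's use */
	if (!mem_soft_dirty)
		return SK_BIT_SOFTW3;	/* Xen owns bit 10, fall back to 11 */
	return -1;			/* Xen and soft-dirty: no bit left */
}
```

Only the Xen plus MEM_SOFT_DIRTY combination is refused, which is the
constraint the patch description says Kconfig enforces.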
* Re: [RFC PATCH 0/5] Use an alternative to _PAGE_PROTNONE for _PAGE_NUMA v2
2014-04-08 13:09 [RFC PATCH 0/5] Use an alternative to _PAGE_PROTNONE for _PAGE_NUMA v2 Mel Gorman
` (4 preceding siblings ...)
2014-04-08 13:09 ` [PATCH 5/5] x86: Allow Xen to enable NUMA_BALANCING Mel Gorman
@ 2014-04-08 14:40 ` H. Peter Anvin
2014-04-08 15:22 ` Linus Torvalds
5 siblings, 1 reply; 26+ messages in thread
From: H. Peter Anvin @ 2014-04-08 14:40 UTC (permalink / raw)
To: Mel Gorman, Linux-X86
Cc: Linus Torvalds, Cyrill Gorcunov, Ingo Molnar, Steven Noonan,
Rik van Riel, David Vrabel, Andrew Morton, Peter Zijlstra,
Andrea Arcangeli, Dave Hansen, Srikar Dronamraju, Linux-MM, LKML
On 04/08/2014 06:09 AM, Mel Gorman wrote:
> Using unused physical bits is something that will break eventually.
>
> Changelog since V1
> o Reuse software-bits
> o Use paravirt ops when modifying PTEs in the NUMA helpers
>
> Aliasing _PAGE_NUMA and _PAGE_PROTNONE had some convenient properties but
> it ultimately gave Xen a headache and pisses off almost everybody who
> looks closely at it. Two discussions on "why this makes sense" is one
> discussion too many so, rather than having a third, here is this series.
> This series reuses the PTE bits that are available to the programmer.
> This adds some constraints on how and when automatic NUMA balancing can be
> enabled but they should go away again when Xen stops using _PAGE_IOMAP.
>
> The series also converts the NUMA helpers to use paravirt-friendly operations
> but it needs a Tested-by from the Xen and powerpc people.
>
It is probably simpler to just base this patchset on top of David
Vrabel's which actually *does* remove _PAGE_IOMAP.
David, is your patchset going to be pushed in this merge window as expected?
That being said, these bits are precious, and if this ends up being a
case where "only Xen needs another bit" once again then Xen should
expect to get kicked to the curb at a moment's notice.
-hpa
* Re: [RFC PATCH 0/5] Use an alternative to _PAGE_PROTNONE for _PAGE_NUMA v2
2014-04-08 14:40 ` [RFC PATCH 0/5] Use an alternative to _PAGE_PROTNONE for _PAGE_NUMA v2 H. Peter Anvin
@ 2014-04-08 15:22 ` Linus Torvalds
2014-04-08 16:04 ` H. Peter Anvin
` (2 more replies)
0 siblings, 3 replies; 26+ messages in thread
From: Linus Torvalds @ 2014-04-08 15:22 UTC (permalink / raw)
To: H. Peter Anvin
Cc: Mel Gorman, Linux-X86, Cyrill Gorcunov, Ingo Molnar,
Steven Noonan, Rik van Riel, David Vrabel, Andrew Morton,
Peter Zijlstra, Andrea Arcangeli, Dave Hansen, Srikar Dronamraju,
Linux-MM, LKML
On Tue, Apr 8, 2014 at 7:40 AM, H. Peter Anvin <hpa@zytor.com> wrote:
>
> David, is your patchset going to be pushed in this merge window as expected?
Apparently aiming for 3.16 right now.
> That being said, these bits are precious, and if this ends up being a
> case where "only Xen needs another bit" once again then Xen should
> expect to get kicked to the curb at a moment's notice.
Quite frankly, I don't think it's a Xen-only issue. The code was hard
to figure out even without the Xen issues. For example, nobody ever
explained to me why it
(a) could be the same as PROTNONE on x86
(b) could not be the same as PROTNONE in general
I think the best explanation for it so far was from the little voices
in my head that sang "It's a kind of Magic", and that isn't even
remotely the best song by Queen.
Linus
* Re: [RFC PATCH 0/5] Use an alternative to _PAGE_PROTNONE for _PAGE_NUMA v2
2014-04-08 15:22 ` Linus Torvalds
@ 2014-04-08 16:04 ` H. Peter Anvin
2014-04-08 16:12 ` Peter Zijlstra
2014-04-08 16:46 ` Mel Gorman
2 siblings, 0 replies; 26+ messages in thread
From: H. Peter Anvin @ 2014-04-08 16:04 UTC (permalink / raw)
To: Linus Torvalds
Cc: Mel Gorman, Linux-X86, Cyrill Gorcunov, Ingo Molnar,
Steven Noonan, Rik van Riel, David Vrabel, Andrew Morton,
Peter Zijlstra, Andrea Arcangeli, Dave Hansen, Srikar Dronamraju,
Linux-MM, LKML
On 04/08/2014 08:22 AM, Linus Torvalds wrote:
> On Tue, Apr 8, 2014 at 7:40 AM, H. Peter Anvin <hpa@zytor.com> wrote:
>>
>> David, is your patchset going to be pushed in this merge window as expected?
>
> Apparently aiming for 3.16 right now.
>
>> That being said, these bits are precious, and if this ends up being a
>> case where "only Xen needs another bit" once again then Xen should
>> expect to get kicked to the curb at a moment's notice.
>
> Quite frankly, I don't think it's a Xen-only issue. The code was hard
> to figure out even without the Xen issues. For example, nobody ever
> explained to me why it
>
> (a) could be the same as PROTNONE on x86
> (b) could not be the same as PROTNONE in general
>
> I think the best explanation for it so far was from the little voices
> in my head that sang "It's a kind of Magic", and that isn't even
> remotely the best song by Queen.
>
Yes, I was hoping that the timing would work out so we could evict bit
10 (which *is* a Xen-only issue) and then reuse it. I don't think the
NUMA bit is Xen-only.
-hpa
* Re: [RFC PATCH 0/5] Use an alternative to _PAGE_PROTNONE for _PAGE_NUMA v2
2014-04-08 15:22 ` Linus Torvalds
2014-04-08 16:04 ` H. Peter Anvin
@ 2014-04-08 16:12 ` Peter Zijlstra
2014-04-08 16:46 ` Mel Gorman
2 siblings, 0 replies; 26+ messages in thread
From: Peter Zijlstra @ 2014-04-08 16:12 UTC (permalink / raw)
To: Linus Torvalds
Cc: H. Peter Anvin, Mel Gorman, Linux-X86, Cyrill Gorcunov,
Ingo Molnar, Steven Noonan, Rik van Riel, David Vrabel,
Andrew Morton, Andrea Arcangeli, Dave Hansen, Srikar Dronamraju,
Linux-MM, LKML
On Tue, Apr 08, 2014 at 08:22:15AM -0700, Linus Torvalds wrote:
> On Tue, Apr 8, 2014 at 7:40 AM, H. Peter Anvin <hpa@zytor.com> wrote:
> >
> > David, is your patchset going to be pushed in this merge window as expected?
>
> Apparently aiming for 3.16 right now.
>
> > That being said, these bits are precious, and if this ends up being a
> > case where "only Xen needs another bit" once again then Xen should
> > expect to get kicked to the curb at a moment's notice.
>
> Quite frankly, I don't think it's a Xen-only issue. The code was hard
> to figure out even without the Xen issues. For example, nobody ever
> explained to me why it
>
> (a) could be the same as PROTNONE on x86
> (b) could not be the same as PROTNONE in general
>
> I think the best explanation for it so far was from the little voices
> in my head that sang "It's a kind of Magic", and that isn't even
> remotely the best song by Queen.
Right; so initially when I started doing the numa scanning thing I
implemented b. I've never quite understood why that wasn't chosen; but
since Mel already got the PAGE_NUMA bits merged by the time I
re-surfaced, I didn't want to argue too much about it.
* Re: [RFC PATCH 0/5] Use an alternative to _PAGE_PROTNONE for _PAGE_NUMA v2
2014-04-08 15:22 ` Linus Torvalds
2014-04-08 16:04 ` H. Peter Anvin
2014-04-08 16:12 ` Peter Zijlstra
@ 2014-04-08 16:46 ` Mel Gorman
2014-04-08 17:01 ` Linus Torvalds
` (2 more replies)
2 siblings, 3 replies; 26+ messages in thread
From: Mel Gorman @ 2014-04-08 16:46 UTC (permalink / raw)
To: Linus Torvalds
Cc: H. Peter Anvin, Linux-X86, Cyrill Gorcunov, Ingo Molnar,
Steven Noonan, Rik van Riel, David Vrabel, Andrew Morton,
Peter Zijlstra, Andrea Arcangeli, Dave Hansen, Srikar Dronamraju,
Linux-MM, LKML
On Tue, Apr 08, 2014 at 08:22:15AM -0700, Linus Torvalds wrote:
> On Tue, Apr 8, 2014 at 7:40 AM, H. Peter Anvin <hpa@zytor.com> wrote:
> >
> > David, is your patchset going to be pushed in this merge window as expected?
>
> Apparently aiming for 3.16 right now.
>
> > That being said, these bits are precious, and if this ends up being a
> > case where "only Xen needs another bit" once again then Xen should
> > expect to get kicked to the curb at a moment's notice.
>
> Quite frankly, I don't think it's a Xen-only issue. The code was hard
> to figure out even without the Xen issues. For example, nobody ever
> explained to me why it
>
> (a) could be the same as PROTNONE on x86
> (b) could not be the same as PROTNONE in general
This series exists in response to your comment
I fundamentally think that it was a horrible horrible disaster to
make _PAGE_NUMA alias onto _PAGE_PROTNONE.
As long as _PAGE_NUMA aliases to _PAGE_PROTNONE on x86 then the core has to
play games to take that into account and the code will be "hard to figure
out even without the Xen issues". FWIW, ppc64 already uses a different
bit to identify a NUMA pte so it's already the case that _PAGE_NUMA is
not always _PAGE_PROTNONE. The series is an alternative approach but it
needs to use a different bit.
If you are ok with leaving _PAGE_NUMA as _PAGE_PROTNONE on x86 then most of
this series goes away and we're left with patch 1 (as NUMA_BALANCING on 32-bit
is pointless) and "[PATCH 4/5] mm: use paravirt friendly ops for NUMA hinting
ptes" which is an (untested on Xen) alternative to David Vrabel's patch
"x86: use pv-ops in {pte,pmd}_{set,clear}_flags()". The alternative patch
modifies the NUMA PTE helpers instead of the main set/clear helpers to
limit the performance hit when PARAVIRT is enabled.
Someone will ask why automatic NUMA balancing hints do not use "real"
PROT_NONE. Doing that on all architectures would need VMA information,
which means VMA fixups would be required when marking PTEs for NUMA
hinting faults, and that would be expensive.
--
Mel Gorman
SUSE Labs
* Re: [RFC PATCH 0/5] Use an alternative to _PAGE_PROTNONE for _PAGE_NUMA v2
2014-04-08 16:46 ` Mel Gorman
@ 2014-04-08 17:01 ` Linus Torvalds
2014-04-08 18:51 ` Mel Gorman
2014-04-08 17:03 ` Mel Gorman
2014-04-08 17:30 ` Peter Zijlstra
2 siblings, 1 reply; 26+ messages in thread
From: Linus Torvalds @ 2014-04-08 17:01 UTC (permalink / raw)
To: Mel Gorman
Cc: H. Peter Anvin, Linux-X86, Cyrill Gorcunov, Ingo Molnar,
Steven Noonan, Rik van Riel, David Vrabel, Andrew Morton,
Peter Zijlstra, Andrea Arcangeli, Dave Hansen, Srikar Dronamraju,
Linux-MM, LKML
On Tue, Apr 8, 2014 at 9:46 AM, Mel Gorman <mgorman@suse.de> wrote:
>
> If you are ok with leaving _PAGE_NUMA as _PAGE_PROTNONE
NO I AM NOT!
Dammit, this feature is f*cking brain-damaged.
My complaint has been (and continues to be):
- either it is 100% the same as PROTNONE, in which case that
_PAGE_NUMA bit had better go away, and you just use the protnone
helpers!
- if it's not the same as PROTNONE, then it damn well needs a different bit.
You can't have it both ways. You guys tried. The Xen case shows that
trying to distinguish the two DOES NOT WORK. But even apart from the
Xen case, it was just a confusing hell.
Like Yoda said: "Either they are the same or they are not. There is no 'try'".
So pick one solution. Don't try to pick the mixed-up half-way case
that is a disaster and makes no sense.
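To make the either/or concrete, here is a minimal user-space sketch. It assumes a NUMA test of the form "NUMA bit set and present bit clear"; the bit positions and the names `M_PAGE_*`, `model_pte_numa()` and `model_pte_protnone()` are invented for illustration and are not the kernel's definitions. When the NUMA bit aliases the PROTNONE bit, a genuine PROT_NONE entry and a hinting entry cannot be told apart from the flags alone; with a separate bit they can.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical bit positions for a user-space model; not the x86 values. */
#define M_PAGE_PRESENT   (1u << 0)
#define M_PAGE_PROTNONE  (1u << 8)
#define M_PAGE_NUMA_SEP  (1u << 9)   /* a distinct NUMA bit, as in the series */

/* "NUMA hinting entry": the NUMA bit is set and the present bit is clear. */
static bool model_pte_numa(uint32_t flags, uint32_t numa_bit)
{
	return (flags & (numa_bit | M_PAGE_PRESENT)) == numa_bit;
}

/* "PROT_NONE entry": the PROTNONE bit is set and the present bit is clear. */
static bool model_pte_protnone(uint32_t flags)
{
	return (flags & (M_PAGE_PROTNONE | M_PAGE_PRESENT)) == M_PAGE_PROTNONE;
}
```

With `numa_bit == M_PAGE_PROTNONE` both predicates are true for the same flags value, which is the mixed-up half-way case. With `M_PAGE_NUMA_SEP` each entry satisfies exactly one of them.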
Linus
* Re: [RFC PATCH 0/5] Use an alternative to _PAGE_PROTNONE for _PAGE_NUMA v2
2014-04-08 16:46 ` Mel Gorman
2014-04-08 17:01 ` Linus Torvalds
@ 2014-04-08 17:03 ` Mel Gorman
2014-04-08 17:30 ` Peter Zijlstra
2 siblings, 0 replies; 26+ messages in thread
From: Mel Gorman @ 2014-04-08 17:03 UTC (permalink / raw)
To: Linus Torvalds
Cc: H. Peter Anvin, Linux-X86, Cyrill Gorcunov, Ingo Molnar,
Steven Noonan, Rik van Riel, David Vrabel, Andrew Morton,
Peter Zijlstra, Andrea Arcangeli, Dave Hansen, Srikar Dronamraju,
Linux-MM, LKML
On Tue, Apr 08, 2014 at 05:46:52PM +0100, Mel Gorman wrote:
> On Tue, Apr 08, 2014 at 08:22:15AM -0700, Linus Torvalds wrote:
> > On Tue, Apr 8, 2014 at 7:40 AM, H. Peter Anvin <hpa@zytor.com> wrote:
> > >
> > > David, is your patchset going to be pushed in this merge window as expected?
> >
> > Apparently aiming for 3.16 right now.
> >
>
> > > That being said, these bits are precious, and if this ends up being a
> > > case where "only Xen needs another bit" once again then Xen should
> > > expect to get kicked to the curb at a moment's notice.
> >
> > Quite frankly, I don't think it's a Xen-only issue. The code was hard
> > to figure out even without the Xen issues. For example, nobody ever
> > explained to me why it
> >
> > (a) could be the same as PROTNONE on x86
> > (b) could not be the same as PROTNONE in general
>
> This series exists in response to your comment
>
> I fundamentally think that it was a horrible horrible disaster to
> make _PAGE_NUMA alias onto _PAGE_PROTNONE.
>
> As long as _PAGE_NUMA aliases to _PAGE_PROTNONE on x86 then the core has to
> play games to take that into account and the code will be "hard to figure
> out even without the Xen issues".
Do you want _PAGE_NUMA to disappear from arch/x86, with _PAGE_PROTNONE
used instead (plus comments explaining why), and the core left as it
is?
--
Mel Gorman
SUSE Labs
* Re: [PATCH 4/5] mm: use paravirt friendly ops for NUMA hinting ptes
2014-04-08 13:09 ` [PATCH 4/5] mm: use paravirt friendly ops for NUMA hinting ptes Mel Gorman
@ 2014-04-08 17:21 ` David Vrabel
2014-04-15 10:27 ` David Vrabel
1 sibling, 0 replies; 26+ messages in thread
From: David Vrabel @ 2014-04-08 17:21 UTC (permalink / raw)
To: Mel Gorman
Cc: Linux-X86, Linus Torvalds, Cyrill Gorcunov, Peter Anvin,
Ingo Molnar, Steven Noonan, Rik van Riel, Andrew Morton,
Peter Zijlstra, Andrea Arcangeli, Dave Hansen, Srikar Dronamraju,
Linux-MM, LKML
On 08/04/14 14:09, Mel Gorman wrote:
> David Vrabel identified a regression when using automatic NUMA balancing
> under Xen whereby page table entries were getting corrupted due to the
> use of native PTE operations. Quoting him
>
> Xen PV guest page tables require that their entries use machine
> addresses if the present bit (_PAGE_PRESENT) is set, and (for
> successful migration) non-present PTEs must use pseudo-physical
> addresses. This is because on migration MFNs in present PTEs are
> translated to PFNs (canonicalised) so they may be translated back
> to the new MFN in the destination domain (uncanonicalised).
>
> pte_mknonnuma(), pmd_mknonnuma(), pte_mknuma() and pmd_mknuma()
> set and clear the _PAGE_PRESENT bit using pte_set_flags(),
> pte_clear_flags(), etc.
>
> In a Xen PV guest, these functions must translate MFNs to PFNs
> when clearing _PAGE_PRESENT and translate PFNs to MFNs when setting
> _PAGE_PRESENT.
>
> His suggested fix converted p[te|md]_[set|clear]_flags to using
> paravirt-friendly ops but this is overkill. He suggested an alternative of
> using p[te|md]_modify in the NUMA page table operations but this does
> more work than necessary and would require looking up a VMA for protections.
>
> This patch modifies the NUMA page table operations to use paravirt friendly
> operations to set/clear the flags of interest. Unfortunately this will take
> a performance hit when updating the PTEs on CONFIG_PARAVIRT but I do not
> see a way around it that does not break Xen.
Acked-by: David Vrabel <david.vrabel@citrix.com>
It passed my mprotect() PROT_NONE -> PROT_READ test case so
Tested-by: David Vrabel <david.vrabel@citrix.com>
I'll leave it up to the x86 maintainers to decide which fix to take.
This one or the more generic "x86: use pv-ops in
{pte,pmd}_{set,clear}_flags()"
David
* Re: [RFC PATCH 0/5] Use an alternative to _PAGE_PROTNONE for _PAGE_NUMA v2
2014-04-08 16:46 ` Mel Gorman
2014-04-08 17:01 ` Linus Torvalds
2014-04-08 17:03 ` Mel Gorman
@ 2014-04-08 17:30 ` Peter Zijlstra
2014-04-08 17:41 ` Linus Torvalds
` (2 more replies)
2 siblings, 3 replies; 26+ messages in thread
From: Peter Zijlstra @ 2014-04-08 17:30 UTC (permalink / raw)
To: Mel Gorman
Cc: Linus Torvalds, H. Peter Anvin, Linux-X86, Cyrill Gorcunov,
Ingo Molnar, Steven Noonan, Rik van Riel, David Vrabel,
Andrew Morton, Andrea Arcangeli, Dave Hansen, Srikar Dronamraju,
Linux-MM, LKML
On Tue, Apr 08, 2014 at 05:46:52PM +0100, Mel Gorman wrote:
> Someone will ask why automatic NUMA balancing hints do not use "real"
> PROT_NONE but as it would need VMA information to do that on all
> architectures it would mean that VMA-fixups would be required when marking
> PTEs for NUMA hinting faults so would be expensive.
Like this:
https://lkml.org/lkml/2012/11/13/431
That used the generic PROT_NONE infrastructure and compared, on fault,
the page protection bits against the vma->vm_page_prot bits?
So the objection to that approach was the vma-> dereference in
pte_numa() ?
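Roughly, the comparison described above could be modelled as follows. This is a user-space sketch under stated assumptions, not the code from the linked posting: `struct vma_model`, `struct pte_model`, the `MODEL_PROT_*` values and both helpers are hypothetical names made up for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical protection encodings for the model. */
#define MODEL_PROT_NONE   0x0u
#define MODEL_PROT_READ   0x1u
#define MODEL_PROT_WRITE  0x2u

struct vma_model { uint32_t vm_page_prot; };  /* what the mapping allows */
struct pte_model { uint32_t prot; };          /* what the pte currently allows */

/* Mark a pte for a hinting fault by giving it plain PROT_NONE protection. */
static struct pte_model pte_mknuma_model(const struct vma_model *vma,
					 struct pte_model pte)
{
	/* Marking a real PROT_NONE mapping would make the hint undetectable. */
	assert(vma->vm_page_prot != MODEL_PROT_NONE);
	pte.prot = MODEL_PROT_NONE;
	return pte;
}

/*
 * On fault: the pte is a NUMA hint only if it is PROT_NONE while the VMA
 * itself permits access. This is the vma dereference in question.
 */
static bool pte_numa_model(const struct vma_model *vma, struct pte_model pte)
{
	if (vma->vm_page_prot == MODEL_PROT_NONE)
		return false;
	return pte.prot == MODEL_PROT_NONE;
}
```

In this model no dedicated pte bit is needed, at the cost of needing the VMA whenever a pte has to be classified.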
* Re: [RFC PATCH 0/5] Use an alternative to _PAGE_PROTNONE for _PAGE_NUMA v2
2014-04-08 17:30 ` Peter Zijlstra
@ 2014-04-08 17:41 ` Linus Torvalds
2014-04-08 18:16 ` Cyrill Gorcunov
2014-04-09 6:21 ` Ingo Molnar
2 siblings, 0 replies; 26+ messages in thread
From: Linus Torvalds @ 2014-04-08 17:41 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Mel Gorman, H. Peter Anvin, Linux-X86, Cyrill Gorcunov,
Ingo Molnar, Steven Noonan, Rik van Riel, David Vrabel,
Andrew Morton, Andrea Arcangeli, Dave Hansen, Srikar Dronamraju,
Linux-MM, LKML
On Tue, Apr 8, 2014 at 10:30 AM, Peter Zijlstra <peterz@infradead.org> wrote:
>
> Like this:
>
> https://lkml.org/lkml/2012/11/13/431
>
> That used the generic PROT_NONE infrastructure and compared, on fault,
> the page protection bits against the vma->vm_page_prot bits?
>
> So the objection to that approach was the vma-> dereference in
> pte_numa() ?
So the important thing is that it works exactly like PROT_NONE as far
as hardware is concerned (and that includes paravirtualized setups
too!). As long as it does, I guess we should be ok.
But that "pte_numa()" does make me go "Hmm.. but does it?". If virtual
environments have to look at the vma in order to look at page tables,
that's not possible. They have to be able to work with the page tables
on their own, _without_ any special rules that are private to the
guest.
So I'm not seeing any *use* of pte_numa() in places that would make me
worry, though. So maybe it works.
Linus
* Re: [RFC PATCH 0/5] Use an alternative to _PAGE_PROTNONE for _PAGE_NUMA v2
2014-04-08 17:30 ` Peter Zijlstra
2014-04-08 17:41 ` Linus Torvalds
@ 2014-04-08 18:16 ` Cyrill Gorcunov
2014-04-09 6:21 ` Ingo Molnar
2 siblings, 0 replies; 26+ messages in thread
From: Cyrill Gorcunov @ 2014-04-08 18:16 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Mel Gorman, Linus Torvalds, H. Peter Anvin, Linux-X86,
Ingo Molnar, Steven Noonan, Rik van Riel, David Vrabel,
Andrew Morton, Andrea Arcangeli, Dave Hansen, Srikar Dronamraju,
Linux-MM, LKML, Pavel Emelyanov
On Tue, Apr 08, 2014 at 07:30:31PM +0200, Peter Zijlstra wrote:
> On Tue, Apr 08, 2014 at 05:46:52PM +0100, Mel Gorman wrote:
> > Someone will ask why automatic NUMA balancing hints do not use "real"
> > PROT_NONE but as it would need VMA information to do that on all
> > architectures it would mean that VMA-fixups would be required when marking
> > PTEs for NUMA hinting faults so would be expensive.
>
> Like this:
>
> https://lkml.org/lkml/2012/11/13/431
>
> That used the generic PROT_NONE infrastructure and compared, on fault,
> the page protection bits against the vma->vm_page_prot bits?
>
> So the objection to that approach was the vma-> dereference in
> pte_numa() ?
Peter, I'm somehow missing something: with this patch, would it be
possible to get rid of the ugly macros in 2-level pages that we have
now? (I've dropped softdirty support for non-x86-64 now [patches are
flying around], but there are still a few remnants which make
Linus unhappy.)
static __always_inline pgoff_t pte_to_pgoff(pte_t pte)
{
	return (pgoff_t)
		(pte_bitop(pte.pte_low, PTE_FILE_SHIFT1, PTE_FILE_MASK1, 0) +
		 pte_bitop(pte.pte_low, PTE_FILE_SHIFT2, PTE_FILE_MASK2, PTE_FILE_LSHIFT2) +
		 pte_bitop(pte.pte_low, PTE_FILE_SHIFT3, -1UL, PTE_FILE_LSHIFT3));
}
* Re: [RFC PATCH 0/5] Use an alternative to _PAGE_PROTNONE for _PAGE_NUMA v2
2014-04-08 17:01 ` Linus Torvalds
@ 2014-04-08 18:51 ` Mel Gorman
2014-04-08 18:55 ` Linus Torvalds
0 siblings, 1 reply; 26+ messages in thread
From: Mel Gorman @ 2014-04-08 18:51 UTC (permalink / raw)
To: Linus Torvalds
Cc: H. Peter Anvin, Linux-X86, Cyrill Gorcunov, Ingo Molnar,
Steven Noonan, Rik van Riel, David Vrabel, Andrew Morton,
Peter Zijlstra, Andrea Arcangeli, Dave Hansen, Srikar Dronamraju,
Linux-MM, LKML
On Tue, Apr 08, 2014 at 10:01:39AM -0700, Linus Torvalds wrote:
> On Tue, Apr 8, 2014 at 9:46 AM, Mel Gorman <mgorman@suse.de> wrote:
> >
> > If you are ok with leaving _PAGE_NUMA as _PAGE_PROTNONE
>
> NO I AM NOT!
>
> Dammit, this feature is f*cking brain-damaged.
>
> My complaint has been (and continues to be):
>
> - either it is 100% the same as PROTNONE, in which case that
> _PAGE_NUMA bit had better go away, and you just use the protnone
> helpers!
>
In which case we'd still use VMAs to distinguish between PROTNONE faults
and NUMA hinting faults. We may still need some special casing. It's plan
b but not my preferred solution at this time.
> - if it's not the same as PROTNONE, then it damn well needs a different bit.
>
With this series applied _PAGE_NUMA != _PAGE_PROTNONE.
> You can't have it both ways. You guys tried. The Xen case shows that
> trying to distinguish the two DOES NOT WORK. But even apart from the
> Xen case, it was just a confusing hell.
>
Which is why I responded with a series that used a different bit instead
of more discussions that would reach the same conclusion.
> Like Yoda said: "Either they are the same or they are not. There is no 'try'".
>
> So pick one solution. Don't try to pick the mixed-up half-way case
> that is a disaster and makes no sense.
>
I picked a solution. The posted series uses a different bit.
--
Mel Gorman
SUSE Labs
* Re: [RFC PATCH 0/5] Use an alternative to _PAGE_PROTNONE for _PAGE_NUMA v2
2014-04-08 18:51 ` Mel Gorman
@ 2014-04-08 18:55 ` Linus Torvalds
2014-04-08 19:06 ` Mel Gorman
2014-04-08 19:08 ` Rik van Riel
0 siblings, 2 replies; 26+ messages in thread
From: Linus Torvalds @ 2014-04-08 18:55 UTC (permalink / raw)
To: Mel Gorman
Cc: H. Peter Anvin, Linux-X86, Cyrill Gorcunov, Ingo Molnar,
Steven Noonan, Rik van Riel, David Vrabel, Andrew Morton,
Peter Zijlstra, Andrea Arcangeli, Dave Hansen, Srikar Dronamraju,
Linux-MM, LKML
On Tue, Apr 8, 2014 at 11:51 AM, Mel Gorman <mgorman@suse.de> wrote:
>
> I picked a solution. The posted series uses a different bit.
Yes, and I actually like that. I have nothing against your patch
series. I'm ranting and raving because you then seemed to say "maybe
we shouldn't pick a solution after all" when you said:
> > If you are ok with leaving _PAGE_NUMA as _PAGE_PROTNONE
which was what I reacted to.
Linus
* Re: [RFC PATCH 0/5] Use an alternative to _PAGE_PROTNONE for _PAGE_NUMA v2
2014-04-08 18:55 ` Linus Torvalds
@ 2014-04-08 19:06 ` Mel Gorman
2014-04-08 19:08 ` Rik van Riel
1 sibling, 0 replies; 26+ messages in thread
From: Mel Gorman @ 2014-04-08 19:06 UTC (permalink / raw)
To: Linus Torvalds
Cc: H. Peter Anvin, Linux-X86, Cyrill Gorcunov, Ingo Molnar,
Steven Noonan, Rik van Riel, David Vrabel, Andrew Morton,
Peter Zijlstra, Andrea Arcangeli, Dave Hansen, Srikar Dronamraju,
Linux-MM, LKML
On Tue, Apr 08, 2014 at 11:55:22AM -0700, Linus Torvalds wrote:
> On Tue, Apr 8, 2014 at 11:51 AM, Mel Gorman <mgorman@suse.de> wrote:
> >
> > I picked a solution. The posted series uses a different bit.
>
> Yes, and I actually like that. I have nothing against your patch
> series. I'm ranting and raving because you then seemed to say "maybe
> we shouldn't pick a solution after all" when you said:
>
> > > If you are ok with leaving _PAGE_NUMA as _PAGE_PROTNONE
>
> which was what I reacted to.
>
Ok, my bad. To be absolutely clear, I want to move away from aliasing the
_PAGE_PROTNONE bit. As David reports the series works for him, I'll wait
a bit to see if there are objections or an alternative patch series from
another direction. If not, I'll remove the RFC and repost it through the
x86 maintainers.
--
Mel Gorman
SUSE Labs
* Re: [RFC PATCH 0/5] Use an alternative to _PAGE_PROTNONE for _PAGE_NUMA v2
2014-04-08 18:55 ` Linus Torvalds
2014-04-08 19:06 ` Mel Gorman
@ 2014-04-08 19:08 ` Rik van Riel
1 sibling, 0 replies; 26+ messages in thread
From: Rik van Riel @ 2014-04-08 19:08 UTC (permalink / raw)
To: Linus Torvalds, Mel Gorman
Cc: H. Peter Anvin, Linux-X86, Cyrill Gorcunov, Ingo Molnar,
Steven Noonan, David Vrabel, Andrew Morton, Peter Zijlstra,
Andrea Arcangeli, Dave Hansen, Srikar Dronamraju, Linux-MM, LKML
On 04/08/2014 02:55 PM, Linus Torvalds wrote:
> On Tue, Apr 8, 2014 at 11:51 AM, Mel Gorman <mgorman@suse.de> wrote:
>>
>> I picked a solution. The posted series uses a different bit.
>
> Yes, and I actually like that. I have nothing against your patch
> series. I'm ranting and raving because you then seemed to say "maybe
> we shouldn't pick a solution after all" when you said:
FWIW, Mel's patches look good to me.
--
All rights reversed
* Re: [RFC PATCH 0/5] Use an alternative to _PAGE_PROTNONE for _PAGE_NUMA v2
2014-04-08 17:30 ` Peter Zijlstra
2014-04-08 17:41 ` Linus Torvalds
2014-04-08 18:16 ` Cyrill Gorcunov
@ 2014-04-09 6:21 ` Ingo Molnar
2014-04-09 23:34 ` H. Peter Anvin
2 siblings, 1 reply; 26+ messages in thread
From: Ingo Molnar @ 2014-04-09 6:21 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Mel Gorman, Linus Torvalds, H. Peter Anvin, Linux-X86,
Cyrill Gorcunov, Steven Noonan, Rik van Riel, David Vrabel,
Andrew Morton, Andrea Arcangeli, Dave Hansen, Srikar Dronamraju,
Linux-MM, LKML
* Peter Zijlstra <peterz@infradead.org> wrote:
> On Tue, Apr 08, 2014 at 05:46:52PM +0100, Mel Gorman wrote:
> > Someone will ask why automatic NUMA balancing hints do not use "real"
> > PROT_NONE but as it would need VMA information to do that on all
> > architectures it would mean that VMA-fixups would be required when marking
> > PTEs for NUMA hinting faults so would be expensive.
>
> Like this:
>
> https://lkml.org/lkml/2012/11/13/431
>
> That used the generic PROT_NONE infrastructure and compared, on fault,
> the page protection bits against the vma->vm_page_prot bits?
>
> So the objection to that approach was the vma-> dereference in
> pte_numa() ?
I think the real underlying objection was that PTE_NUMA was the last
leftover from AutoNUMA, and removing it would have made it not a
'compromise' patch set between 'AutoNUMA' and 'sched/numa', but would
have made the sched/numa approach 'win' by and large.
The whole 'losing face' annoyance that plagues all of us (me
included).
I didn't feel it was important to the general logic of adding access
pattern aware NUMA placement logic to the scheduler, and I obviously
could not ignore the NAKs from various mm folks insisting on PTE_NUMA,
so I conceded that point and Mel built on that approach as well.
It's nice that it's being cleaned up, and I'm pretty happy about how
NUMA balancing ended up looking.
Thanks,
Ingo
* Re: [RFC PATCH 0/5] Use an alternative to _PAGE_PROTNONE for _PAGE_NUMA v2
2014-04-09 6:21 ` Ingo Molnar
@ 2014-04-09 23:34 ` H. Peter Anvin
2014-04-10 0:12 ` Linus Torvalds
0 siblings, 1 reply; 26+ messages in thread
From: H. Peter Anvin @ 2014-04-09 23:34 UTC (permalink / raw)
To: Ingo Molnar, Peter Zijlstra
Cc: Mel Gorman, Linus Torvalds, Linux-X86, Cyrill Gorcunov,
Steven Noonan, Rik van Riel, David Vrabel, Andrew Morton,
Andrea Arcangeli, Dave Hansen, Srikar Dronamraju, Linux-MM, LKML
On 04/08/2014 11:21 PM, Ingo Molnar wrote:
>
> I think the real underlying objection was that PTE_NUMA was the last
> leftover from AutoNUMA, and removing it would have made it not a
> 'compromise' patch set between 'AutoNUMA' and 'sched/numa', but would
> have made the sched/numa approach 'win' by and large.
>
> The whole 'losing face' annoyance that plagues all of us (me
> included).
>
> I didn't feel it was important to the general logic of adding access
> pattern aware NUMA placement logic to the scheduler, and I obviously
> could not ignore the NAKs from various mm folks insisting on PTE_NUMA,
> so I conceded that point and Mel built on that approach as well.
>
> It's nice that it's being cleaned up, and I'm pretty happy about how
> NUMA balancing ended up looking.
>
How painful would it be to get rid of _PAGE_NUMA entirely? Page bits
are a highly precious commodity and saving one would be valuable.
-hpa
* Re: [RFC PATCH 0/5] Use an alternative to _PAGE_PROTNONE for _PAGE_NUMA v2
2014-04-09 23:34 ` H. Peter Anvin
@ 2014-04-10 0:12 ` Linus Torvalds
0 siblings, 0 replies; 26+ messages in thread
From: Linus Torvalds @ 2014-04-10 0:12 UTC (permalink / raw)
To: H. Peter Anvin
Cc: Ingo Molnar, Peter Zijlstra, Mel Gorman, Linux-X86,
Cyrill Gorcunov, Steven Noonan, Rik van Riel, David Vrabel,
Andrew Morton, Andrea Arcangeli, Dave Hansen, Srikar Dronamraju,
Linux-MM, LKML
On Wed, Apr 9, 2014 at 4:34 PM, H. Peter Anvin <hpa@zytor.com> wrote:
>
> How painful would it be to get rid of _PAGE_NUMA entirely? Page bits
> are a highly precious commodity and saving one would be valuable.
I don't think _PAGE_NUMA is a problem. It's only set when the page is
not present, so we have tons of bits then.
Now, that's still inconvenient for the 32-bit pte case, because we do
*not* have tons of bits for non-present cases since we need them for
the swap indexes.
This is different from _PAGE_SOFT_DIRTY, which we do need for both
present and swapped-out entries.
Or am I missing something?
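As a back-of-the-envelope illustration of that bit budget, assuming a made-up non-present pte layout (one present bit, some software flag bits that must survive in non-present entries, a swap type field, and the rest as swap offset), the arithmetic looks like this. `swap_offset_bits()` and the field widths used with it are hypothetical, not the real x86 encoding.

```c
#include <assert.h>

/*
 * Hypothetical layout of a non-present pte holding a swap entry:
 *   1 bit           the (clear) present bit
 *   reserved_bits   software flags kept in non-present entries
 *   type_bits       which swap area
 *   remainder       offset within that swap area
 */
static int swap_offset_bits(int pte_bits, int reserved_bits, int type_bits)
{
	int used = 1 /* present bit */ + reserved_bits + type_bits;

	assert(used <= pte_bits);
	return pte_bits - used;
}
```

Under these assumed widths, each extra flag that must live in a non-present 32-bit pte halves the addressable swap offset range, while a 64-bit pte has plenty of room to spare.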
Linus
* Re: [PATCH 4/5] mm: use paravirt friendly ops for NUMA hinting ptes
2014-04-08 13:09 ` [PATCH 4/5] mm: use paravirt friendly ops for NUMA hinting ptes Mel Gorman
2014-04-08 17:21 ` David Vrabel
@ 2014-04-15 10:27 ` David Vrabel
2014-04-15 14:44 ` Mel Gorman
1 sibling, 1 reply; 26+ messages in thread
From: David Vrabel @ 2014-04-15 10:27 UTC (permalink / raw)
To: Mel Gorman
Cc: Linux-X86, Linus Torvalds, Cyrill Gorcunov, Peter Anvin,
Ingo Molnar, Steven Noonan, Rik van Riel, Andrew Morton,
Peter Zijlstra, Andrea Arcangeli, Dave Hansen, Srikar Dronamraju,
Linux-MM, LKML
On 08/04/14 14:09, Mel Gorman wrote:
> David Vrabel identified a regression when using automatic NUMA balancing
> under Xen whereby page table entries were getting corrupted due to the
> use of native PTE operations. Quoting him
>
> Xen PV guest page tables require that their entries use machine
> addresses if the present bit (_PAGE_PRESENT) is set, and (for
> successful migration) non-present PTEs must use pseudo-physical
> addresses. This is because on migration MFNs in present PTEs are
> translated to PFNs (canonicalised) so they may be translated back
> to the new MFN in the destination domain (uncanonicalised).
>
> pte_mknonnuma(), pmd_mknonnuma(), pte_mknuma() and pmd_mknuma()
> set and clear the _PAGE_PRESENT bit using pte_set_flags(),
> pte_clear_flags(), etc.
>
> In a Xen PV guest, these functions must translate MFNs to PFNs
> when clearing _PAGE_PRESENT and translate PFNs to MFNs when setting
> _PAGE_PRESENT.
>
> His suggested fix converted p[te|md]_[set|clear]_flags to using
> paravirt-friendly ops but this is overkill. He suggested an alternative of
> using p[te|md]_modify in the NUMA page table operations but this does
> more work than necessary and would require looking up a VMA for protections.
>
> This patch modifies the NUMA page table operations to use paravirt friendly
> operations to set/clear the flags of interest. Unfortunately this will take
> a performance hit when updating the PTEs on CONFIG_PARAVIRT but I do not
> see a way around it that does not break Xen.
We're getting more reports of users hitting this regression with distro
provided kernels. Irrespective of the rest of this series, can we get
at least this applied and tagged for stable, please?
http://lists.xenproject.org/archives/html/xen-devel/2014-04/msg01905.html
David
>
> Signed-off-by: Mel Gorman <mgorman@suse.de>
> ---
> include/asm-generic/pgtable.h | 31 +++++++++++++++++++++++--------
> 1 file changed, 23 insertions(+), 8 deletions(-)
>
> diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
> index 34c7bdc..38a7437 100644
> --- a/include/asm-generic/pgtable.h
> +++ b/include/asm-generic/pgtable.h
> @@ -680,24 +680,35 @@ static inline int pmd_numa(pmd_t pmd)
> #ifndef pte_mknonnuma
> static inline pte_t pte_mknonnuma(pte_t pte)
> {
> - pte = pte_clear_flags(pte, _PAGE_NUMA);
> - return pte_set_flags(pte, _PAGE_PRESENT|_PAGE_ACCESSED);
> + pteval_t val = pte_val(pte);
> +
> + val &= ~_PAGE_NUMA;
> + val |= (_PAGE_PRESENT|_PAGE_ACCESSED);
> + return __pte(val);
> }
> #endif
>
> #ifndef pmd_mknonnuma
> static inline pmd_t pmd_mknonnuma(pmd_t pmd)
> {
> - pmd = pmd_clear_flags(pmd, _PAGE_NUMA);
> - return pmd_set_flags(pmd, _PAGE_PRESENT|_PAGE_ACCESSED);
> + pmdval_t val = pmd_val(pmd);
> +
> + val &= ~_PAGE_NUMA;
> + val |= (_PAGE_PRESENT|_PAGE_ACCESSED);
> +
> + return __pmd(val);
> }
> #endif
>
> #ifndef pte_mknuma
> static inline pte_t pte_mknuma(pte_t pte)
> {
> - pte = pte_set_flags(pte, _PAGE_NUMA);
> - return pte_clear_flags(pte, _PAGE_PRESENT);
> + pteval_t val = pte_val(pte);
> +
> + val &= ~_PAGE_PRESENT;
> + val |= _PAGE_NUMA;
> +
> + return __pte(val);
> }
> #endif
>
> @@ -716,8 +727,12 @@ static inline void ptep_set_numa(struct mm_struct *mm, unsigned long addr,
> #ifndef pmd_mknuma
> static inline pmd_t pmd_mknuma(pmd_t pmd)
> {
> - pmd = pmd_set_flags(pmd, _PAGE_NUMA);
> - return pmd_clear_flags(pmd, _PAGE_PRESENT);
> + pmdval_t val = pmd_val(pmd);
> +
> + val &= ~_PAGE_PRESENT;
> + val |= _PAGE_NUMA;
> +
> + return __pmd(val);
> }
> #endif
>
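For readers unfamiliar with why going through pte_val()/__pte() matters, the following toy model may help. It assumes a PV guest whose accessors translate the frame number whenever the present bit is set, as the quoted description says. Everything here is hypothetical: the `M_*` bit values, the fixed-offset `toy_pfn_to_mfn()` mapping and the `pv_*`/`native_*` helpers are stand-ins for illustration, not Xen's or the kernel's code.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical flag layout for the model. */
#define M_PRESENT  0x1ull
#define M_NUMA     0x2ull
#define M_FLAGS    0xfffull
#define M_SHIFT    12

typedef struct { uint64_t raw; } pte_model_t;

/* Toy pseudo-physical <-> machine frame mapping (fixed offset). */
static uint64_t toy_pfn_to_mfn(uint64_t pfn) { return pfn + 0x1000; }
static uint64_t toy_mfn_to_pfn(uint64_t mfn) { return mfn - 0x1000; }

/* Model of a PV-aware read: present entries are shown with a PFN. */
static uint64_t pv_pte_val(pte_model_t pte)
{
	uint64_t v = pte.raw;

	if (v & M_PRESENT)
		v = (toy_mfn_to_pfn(v >> M_SHIFT) << M_SHIFT) | (v & M_FLAGS);
	return v;
}

/* Model of a PV-aware constructor: present entries are stored with an MFN. */
static pte_model_t pv_make_pte(uint64_t v)
{
	pte_model_t pte;

	if (v & M_PRESENT)
		v = (toy_pfn_to_mfn(v >> M_SHIFT) << M_SHIFT) | (v & M_FLAGS);
	pte.raw = v;
	return pte;
}

/* Native-style flag update: edits the raw value, frame left untouched. */
static pte_model_t native_mknuma(pte_model_t pte)
{
	pte.raw = (pte.raw & ~M_PRESENT) | M_NUMA;
	return pte;
}

/* PV-friendly update: read, change the flags, rebuild through the accessors. */
static pte_model_t pv_mknuma(pte_model_t pte)
{
	uint64_t val = pv_pte_val(pte);

	val &= ~M_PRESENT;
	val |= M_NUMA;
	return pv_make_pte(val);
}
```

In this model the native-style helper leaves a machine frame number in a non-present entry, which is the inconsistency described in the quoted text, while the accessor-based helper ends up with the pseudo-physical frame.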
* Re: [PATCH 4/5] mm: use paravirt friendly ops for NUMA hinting ptes
2014-04-15 10:27 ` David Vrabel
@ 2014-04-15 14:44 ` Mel Gorman
0 siblings, 0 replies; 26+ messages in thread
From: Mel Gorman @ 2014-04-15 14:44 UTC (permalink / raw)
To: David Vrabel
Cc: Linux-X86, Linus Torvalds, Cyrill Gorcunov, Peter Anvin,
Ingo Molnar, Steven Noonan, Rik van Riel, Andrew Morton,
Peter Zijlstra, Andrea Arcangeli, Dave Hansen, Srikar Dronamraju,
Linux-MM, LKML
On Tue, Apr 15, 2014 at 11:27:56AM +0100, David Vrabel wrote:
> On 08/04/14 14:09, Mel Gorman wrote:
> > David Vrabel identified a regression when using automatic NUMA balancing
> > under Xen whereby page table entries were getting corrupted due to the
> > use of native PTE operations. Quoting him
> >
> > Xen PV guest page tables require that their entries use machine
> > addresses if the present bit (_PAGE_PRESENT) is set, and (for
> > successful migration) non-present PTEs must use pseudo-physical
> > addresses. This is because on migration MFNs in present PTEs are
> > translated to PFNs (canonicalised) so they may be translated back
> > to the new MFN in the destination domain (uncanonicalised).
> >
> > pte_mknonnuma(), pmd_mknonnuma(), pte_mknuma() and pmd_mknuma()
> > set and clear the _PAGE_PRESENT bit using pte_set_flags(),
> > pte_clear_flags(), etc.
> >
> > In a Xen PV guest, these functions must translate MFNs to PFNs
> > when clearing _PAGE_PRESENT and translate PFNs to MFNs when setting
> > _PAGE_PRESENT.
> >
> > His suggested fix converted p[te|md]_[set|clear]_flags to using
> > paravirt-friendly ops but this is overkill. He suggested an alternative of
> > using p[te|md]_modify in the NUMA page table operations but this does
> > more work than necessary and would require looking up a VMA for protections.
> >
> > This patch modifies the NUMA page table operations to use paravirt friendly
> > operations to set/clear the flags of interest. Unfortunately this will take
> > a performance hit when updating the PTEs on CONFIG_PARAVIRT but I do not
> > see a way around it that does not break Xen.
>
> We're getting more reports of users hitting this regression with distro
> provided kernels. Irrespective of the rest of this series, can we get
> at least this applied and tagged for stable, please?
>
> http://lists.xenproject.org/archives/html/xen-devel/2014-04/msg01905.html
>
The resending of the series got delayed until today. Fengguang Wu hit
problems testing the series and I ran into a number of similarly shaped
problems that took time to resolve. I sent out a v4 of the series with this
patch at the front and a note on the leader saying it should be picked up
for stable regardless of what happens with patches 2 and 3.
--
Mel Gorman
SUSE Labs
end of thread (newest message: 2014-04-15 14:44 UTC)
Thread overview: 26+ messages
2014-04-08 13:09 [RFC PATCH 0/5] Use an alternative to _PAGE_PROTNONE for _PAGE_NUMA v2 Mel Gorman
2014-04-08 13:09 ` [PATCH 1/5] x86: Require x86-64 for automatic NUMA balancing Mel Gorman
2014-04-08 13:09 ` [PATCH 2/5] x86: Define _PAGE_NUMA by reusing software bits on the PMD and PTE levels Mel Gorman
2014-04-08 13:09 ` [PATCH 3/5] mm: Allow FOLL_NUMA on FOLL_FORCE Mel Gorman
2014-04-08 13:09 ` [PATCH 4/5] mm: use paravirt friendly ops for NUMA hinting ptes Mel Gorman
2014-04-08 17:21 ` David Vrabel
2014-04-15 10:27 ` David Vrabel
2014-04-15 14:44 ` Mel Gorman
2014-04-08 13:09 ` [PATCH 5/5] x86: Allow Xen to enable NUMA_BALANCING Mel Gorman
2014-04-08 14:40 ` [RFC PATCH 0/5] Use an alternative to _PAGE_PROTNONE for _PAGE_NUMA v2 H. Peter Anvin
2014-04-08 15:22 ` Linus Torvalds
2014-04-08 16:04 ` H. Peter Anvin
2014-04-08 16:12 ` Peter Zijlstra
2014-04-08 16:46 ` Mel Gorman
2014-04-08 17:01 ` Linus Torvalds
2014-04-08 18:51 ` Mel Gorman
2014-04-08 18:55 ` Linus Torvalds
2014-04-08 19:06 ` Mel Gorman
2014-04-08 19:08 ` Rik van Riel
2014-04-08 17:03 ` Mel Gorman
2014-04-08 17:30 ` Peter Zijlstra
2014-04-08 17:41 ` Linus Torvalds
2014-04-08 18:16 ` Cyrill Gorcunov
2014-04-09 6:21 ` Ingo Molnar
2014-04-09 23:34 ` H. Peter Anvin
2014-04-10 0:12 ` Linus Torvalds