linux-mm.kvack.org archive mirror
* [PATCH 00/17] NUMA balancing segmentation fault fixes and misc followups v4
@ 2013-12-10 15:51 Mel Gorman
  2013-12-10 15:51 ` [PATCH 01/18] mm: numa: Serialise parallel get_user_page against THP migration Mel Gorman
                   ` (18 more replies)
  0 siblings, 19 replies; 33+ messages in thread
From: Mel Gorman @ 2013-12-10 15:51 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Alex Thorlton, Rik van Riel, Linux-MM, LKML, Mel Gorman

Changelog since V3
o Dropped a tracing patch
o Rebased to 3.13-rc3
o Removed unnecessary ptl acquisition

Alex Thorlton reported segmentation faults when NUMA balancing is enabled
on large machines. There is no obvious explanation from the console of what
the problem is, but similar problems have been observed by Rik van Riel and myself
if migration was aggressive enough. Alex, this series is against 3.13-rc3;
a verification that the fix addresses your problem would be appreciated.

This series starts with a range of patches aimed at addressing the
segmentation fault problem while offsetting some of the cost to avoid badly
regressing performance in -stable. Those that are cc'd to stable (patches
1-12) should be merged ASAP. The rest of the series is relatively minor
material that fell out during development; it is ok to wait for the next
merge window but should help with the continued development of NUMA
balancing.

 arch/sparc/include/asm/pgtable_64.h |   4 +-
 arch/x86/include/asm/pgtable.h      |  11 +++-
 arch/x86/mm/gup.c                   |  13 +++++
 include/asm-generic/pgtable.h       |   2 +-
 include/linux/migrate.h             |   9 ++++
 include/linux/mm_types.h            |  44 +++++++++++++++
 include/linux/mmzone.h              |   5 +-
 include/trace/events/migrate.h      |  26 +++++++++
 include/trace/events/sched.h        |  87 ++++++++++++++++++++++++++++++
 kernel/fork.c                       |   1 +
 kernel/sched/core.c                 |   2 +
 kernel/sched/fair.c                 |  24 +++++----
 mm/huge_memory.c                    |  45 ++++++++++++----
 mm/mempolicy.c                      |   6 +--
 mm/migrate.c                        | 103 ++++++++++++++++++++++++++++--------
 mm/mprotect.c                       |  15 ++++--
 mm/pgtable-generic.c                |  10 +++-
 17 files changed, 348 insertions(+), 59 deletions(-)

-- 
1.8.4


* [PATCH 01/18] mm: numa: Serialise parallel get_user_page against THP migration
  2013-12-10 15:51 [PATCH 00/17] NUMA balancing segmentation fault fixes and misc followups v4 Mel Gorman
@ 2013-12-10 15:51 ` Mel Gorman
  2013-12-10 15:51 ` [PATCH 02/18] mm: numa: Call MMU notifiers on " Mel Gorman
                   ` (17 subsequent siblings)
  18 siblings, 0 replies; 33+ messages in thread
From: Mel Gorman @ 2013-12-10 15:51 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Alex Thorlton, Rik van Riel, Linux-MM, LKML, Mel Gorman

Base pages are unmapped and flushed from cache and TLB during normal page
migration and replaced with a migration entry that causes any parallel or
gup to block until migration completes. THP does not unmap pages due to
a lack of support for migration entries at a PMD level. This allows races
with get_user_pages and get_user_pages_fast which commit 3f926ab94 ("mm:
Close races between THP migration and PMD numa clearing") made worse by
introducing a pmd_clear_flush().

This patch forces get_user_page (fast and normal) on a pmd_numa page to
go through the slow get_user_page path where it will serialise against THP
migration and properly account for the NUMA hinting fault. On the migration
side the page table lock is taken for each PTE update.
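
As a side note for readers, the following is a minimal, self-contained
userspace sketch (not the kernel code; the actual change is in the diff
below and all names here are invented) of the convention the fix relies on:
the lockless fast path returns 0 for NUMA-hinted entries so that the caller
retries via the slow path, which is allowed to block and can therefore
serialise against THP migration.

#include <stdio.h>
#include <stdbool.h>

/* Hypothetical stand-in for a page table entry in this model */
struct entry {
	bool present;
	bool numa_hint;		/* pte_numa()/pmd_numa() in the real code */
};

/*
 * Fast path: runs without taking locks and must not block, so it simply
 * refuses (returns 0) whenever the entry needs NUMA hinting handling.
 */
static int gup_fast_model(const struct entry *e)
{
	if (!e->present || e->numa_hint)
		return 0;	/* caller falls back to the slow path */
	return 1;		/* pinned without blocking */
}

/*
 * Slow path: may block, so it can wait for THP migration to finish and
 * account the NUMA hinting fault before pinning the page.
 */
static int gup_slow_model(const struct entry *e)
{
	(void)e;		/* fault handling would happen here */
	return 1;
}

static int get_user_page_model(const struct entry *e)
{
	return gup_fast_model(e) ? 1 : gup_slow_model(e);
}

int main(void)
{
	struct entry numa_marked = { .present = true, .numa_hint = true };

	printf("pinned: %d (via slow path)\n", get_user_page_model(&numa_marked));
	return 0;
}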

Cc: stable@vger.kernel.org
Reviewed-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 arch/x86/mm/gup.c | 13 +++++++++++++
 mm/huge_memory.c  | 24 ++++++++++++++++--------
 mm/migrate.c      | 38 +++++++++++++++++++++++++++++++-------
 3 files changed, 60 insertions(+), 15 deletions(-)

diff --git a/arch/x86/mm/gup.c b/arch/x86/mm/gup.c
index dd74e46..0596e8e 100644
--- a/arch/x86/mm/gup.c
+++ b/arch/x86/mm/gup.c
@@ -83,6 +83,12 @@ static noinline int gup_pte_range(pmd_t pmd, unsigned long addr,
 		pte_t pte = gup_get_pte(ptep);
 		struct page *page;
 
+		/* Similar to the PMD case, NUMA hinting must take slow path */
+		if (pte_numa(pte)) {
+			pte_unmap(ptep);
+			return 0;
+		}
+
 		if ((pte_flags(pte) & (mask | _PAGE_SPECIAL)) != mask) {
 			pte_unmap(ptep);
 			return 0;
@@ -167,6 +173,13 @@ static int gup_pmd_range(pud_t pud, unsigned long addr, unsigned long end,
 		if (pmd_none(pmd) || pmd_trans_splitting(pmd))
 			return 0;
 		if (unlikely(pmd_large(pmd))) {
+			/*
+			 * NUMA hinting faults need to be handled in the GUP
+			 * slowpath for accounting purposes and so that they
+			 * can be serialised against THP migration.
+			 */
+			if (pmd_numa(pmd))
+				return 0;
 			if (!gup_huge_pmd(pmd, addr, next, write, pages, nr))
 				return 0;
 		} else {
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index bccd5a6..deae592 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1243,6 +1243,10 @@ struct page *follow_trans_huge_pmd(struct vm_area_struct *vma,
 	if ((flags & FOLL_DUMP) && is_huge_zero_pmd(*pmd))
 		return ERR_PTR(-EFAULT);
 
+	/* Full NUMA hinting faults to serialise migration in fault paths */
+	if ((flags & FOLL_NUMA) && pmd_numa(*pmd))
+		goto out;
+
 	page = pmd_page(*pmd);
 	VM_BUG_ON(!PageHead(page));
 	if (flags & FOLL_TOUCH) {
@@ -1323,23 +1327,27 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		/* If the page was locked, there are no parallel migrations */
 		if (page_locked)
 			goto clear_pmdnuma;
+	}
 
-		/*
-		 * Otherwise wait for potential migrations and retry. We do
-		 * relock and check_same as the page may no longer be mapped.
-		 * As the fault is being retried, do not account for it.
-		 */
+	/*
+	 * If there are potential migrations, wait for completion and retry. We
+	 * do not relock and check_same as the page may no longer be mapped.
+	 * Furthermore, even if the page is currently misplaced, there is no
+	 * guarantee it is still misplaced after the migration completes.
+	 */
+	if (!page_locked) {
 		spin_unlock(ptl);
 		wait_on_page_locked(page);
 		page_nid = -1;
 		goto out;
 	}
 
-	/* Page is misplaced, serialise migrations and parallel THP splits */
+	/*
+	 * Page is misplaced. Page lock serialises migrations. Acquire anon_vma
+	 * to serialise splits
+	 */
 	get_page(page);
 	spin_unlock(ptl);
-	if (!page_locked)
-		lock_page(page);
 	anon_vma = page_lock_anon_vma_read(page);
 
 	/* Confirm the PMD did not change while page_table_lock was released */
diff --git a/mm/migrate.c b/mm/migrate.c
index bb94004..2cabbd5 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1722,6 +1722,7 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
 	struct page *new_page = NULL;
 	struct mem_cgroup *memcg = NULL;
 	int page_lru = page_is_file_cache(page);
+	pmd_t orig_entry;
 
 	/*
 	 * Rate-limit the amount of data that is being migrated to a node.
@@ -1756,7 +1757,8 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
 
 	/* Recheck the target PMD */
 	ptl = pmd_lock(mm, pmd);
-	if (unlikely(!pmd_same(*pmd, entry))) {
+	if (unlikely(!pmd_same(*pmd, entry) || page_count(page) != 2)) {
+fail_putback:
 		spin_unlock(ptl);
 
 		/* Reverse changes made by migrate_page_copy() */
@@ -1786,16 +1788,34 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
 	 */
 	mem_cgroup_prepare_migration(page, new_page, &memcg);
 
+	orig_entry = *pmd;
 	entry = mk_pmd(new_page, vma->vm_page_prot);
-	entry = pmd_mknonnuma(entry);
-	entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
 	entry = pmd_mkhuge(entry);
+	entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
 
+	/*
+	 * Clear the old entry under pagetable lock and establish the new PTE.
+	 * Any parallel GUP will either observe the old page blocking on the
+	 * page lock, block on the page table lock or observe the new page.
+	 * The SetPageUptodate on the new page and page_add_new_anon_rmap
+	 * guarantee the copy is visible before the pagetable update.
+	 */
+	flush_cache_range(vma, haddr, haddr + HPAGE_PMD_SIZE);
+	page_add_new_anon_rmap(new_page, vma, haddr);
 	pmdp_clear_flush(vma, haddr, pmd);
 	set_pmd_at(mm, haddr, pmd, entry);
-	page_add_new_anon_rmap(new_page, vma, haddr);
 	update_mmu_cache_pmd(vma, address, &entry);
+
+	if (page_count(page) != 2) {
+		set_pmd_at(mm, haddr, pmd, orig_entry);
+		flush_tlb_range(vma, haddr, haddr + HPAGE_PMD_SIZE);
+		update_mmu_cache_pmd(vma, address, &entry);
+		page_remove_rmap(new_page);
+		goto fail_putback;
+	}
+
 	page_remove_rmap(page);
+
 	/*
 	 * Finish the charge transaction under the page table lock to
 	 * prevent split_huge_page() from dividing up the charge
@@ -1820,9 +1840,13 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
 out_fail:
 	count_vm_events(PGMIGRATE_FAIL, HPAGE_PMD_NR);
 out_dropref:
-	entry = pmd_mknonnuma(entry);
-	set_pmd_at(mm, haddr, pmd, entry);
-	update_mmu_cache_pmd(vma, address, &entry);
+	ptl = pmd_lock(mm, pmd);
+	if (pmd_same(*pmd, entry)) {
+		entry = pmd_mknonnuma(entry);
+		set_pmd_at(mm, haddr, pmd, entry);
+		update_mmu_cache_pmd(vma, address, &entry);
+	}
+	spin_unlock(ptl);
 
 	unlock_page(page);
 	put_page(page);
-- 
1.8.4


* [PATCH 02/18] mm: numa: Call MMU notifiers on THP migration
  2013-12-10 15:51 [PATCH 00/17] NUMA balancing segmentation fault fixes and misc followups v4 Mel Gorman
  2013-12-10 15:51 ` [PATCH 01/18] mm: numa: Serialise parallel get_user_page against THP migration Mel Gorman
@ 2013-12-10 15:51 ` Mel Gorman
  2013-12-10 15:51 ` [PATCH 03/18] mm: Clear pmd_numa before invalidating Mel Gorman
                   ` (16 subsequent siblings)
  18 siblings, 0 replies; 33+ messages in thread
From: Mel Gorman @ 2013-12-10 15:51 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Alex Thorlton, Rik van Riel, Linux-MM, LKML, Mel Gorman

MMU notifiers must be called on THP page migration or secondary MMUs will
get very confused.

Cc: stable@vger.kernel.org
Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
---
 mm/migrate.c | 22 ++++++++++++++--------
 1 file changed, 14 insertions(+), 8 deletions(-)

diff --git a/mm/migrate.c b/mm/migrate.c
index 2cabbd5..be787d5 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -36,6 +36,7 @@
 #include <linux/hugetlb_cgroup.h>
 #include <linux/gfp.h>
 #include <linux/balloon_compaction.h>
+#include <linux/mmu_notifier.h>
 
 #include <asm/tlbflush.h>
 
@@ -1716,12 +1717,13 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
 				struct page *page, int node)
 {
 	spinlock_t *ptl;
-	unsigned long haddr = address & HPAGE_PMD_MASK;
 	pg_data_t *pgdat = NODE_DATA(node);
 	int isolated = 0;
 	struct page *new_page = NULL;
 	struct mem_cgroup *memcg = NULL;
 	int page_lru = page_is_file_cache(page);
+	unsigned long mmun_start = address & HPAGE_PMD_MASK;
+	unsigned long mmun_end = mmun_start + HPAGE_PMD_SIZE;
 	pmd_t orig_entry;
 
 	/*
@@ -1756,10 +1758,12 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
 	WARN_ON(PageLRU(new_page));
 
 	/* Recheck the target PMD */
+	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
 	ptl = pmd_lock(mm, pmd);
 	if (unlikely(!pmd_same(*pmd, entry) || page_count(page) != 2)) {
 fail_putback:
 		spin_unlock(ptl);
+		mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
 
 		/* Reverse changes made by migrate_page_copy() */
 		if (TestClearPageActive(new_page))
@@ -1800,15 +1804,16 @@ fail_putback:
 	 * The SetPageUptodate on the new page and page_add_new_anon_rmap
 	 * guarantee the copy is visible before the pagetable update.
 	 */
-	flush_cache_range(vma, haddr, haddr + HPAGE_PMD_SIZE);
-	page_add_new_anon_rmap(new_page, vma, haddr);
-	pmdp_clear_flush(vma, haddr, pmd);
-	set_pmd_at(mm, haddr, pmd, entry);
+	flush_cache_range(vma, mmun_start, mmun_end);
+	page_add_new_anon_rmap(new_page, vma, mmun_start);
+	pmdp_clear_flush(vma, mmun_start, pmd);
+	set_pmd_at(mm, mmun_start, pmd, entry);
+	flush_tlb_range(vma, mmun_start, mmun_end);
 	update_mmu_cache_pmd(vma, address, &entry);
 
 	if (page_count(page) != 2) {
-		set_pmd_at(mm, haddr, pmd, orig_entry);
-		flush_tlb_range(vma, haddr, haddr + HPAGE_PMD_SIZE);
+		set_pmd_at(mm, mmun_start, pmd, orig_entry);
+		flush_tlb_range(vma, mmun_start, mmun_end);
 		update_mmu_cache_pmd(vma, address, &entry);
 		page_remove_rmap(new_page);
 		goto fail_putback;
@@ -1823,6 +1828,7 @@ fail_putback:
 	 */
 	mem_cgroup_end_migration(memcg, page, new_page, true);
 	spin_unlock(ptl);
+	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
 
 	unlock_page(new_page);
 	unlock_page(page);
@@ -1843,7 +1849,7 @@ out_dropref:
 	ptl = pmd_lock(mm, pmd);
 	if (pmd_same(*pmd, entry)) {
 		entry = pmd_mknonnuma(entry);
-		set_pmd_at(mm, haddr, pmd, entry);
+		set_pmd_at(mm, mmun_start, pmd, entry);
 		update_mmu_cache_pmd(vma, address, &entry);
 	}
 	spin_unlock(ptl);
-- 
1.8.4


* [PATCH 03/18] mm: Clear pmd_numa before invalidating
  2013-12-10 15:51 [PATCH 00/17] NUMA balancing segmentation fault fixes and misc followups v4 Mel Gorman
  2013-12-10 15:51 ` [PATCH 01/18] mm: numa: Serialise parallel get_user_page against THP migration Mel Gorman
  2013-12-10 15:51 ` [PATCH 02/18] mm: numa: Call MMU notifiers on " Mel Gorman
@ 2013-12-10 15:51 ` Mel Gorman
  2013-12-10 15:51 ` [PATCH 04/18] mm: numa: Do not clear PMD during PTE update scan Mel Gorman
                   ` (15 subsequent siblings)
  18 siblings, 0 replies; 33+ messages in thread
From: Mel Gorman @ 2013-12-10 15:51 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Alex Thorlton, Rik van Riel, Linux-MM, LKML, Mel Gorman

pmdp_invalidate clears the present bit without taking into account that the
entry might be a _PAGE_NUMA (NUMA hinting) entry, which leaves the PMD in an
unexpected state. Clear pmd_numa before invalidating.

Cc: stable@vger.kernel.org
Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
---
 mm/pgtable-generic.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
index cbb3854..e84cad2 100644
--- a/mm/pgtable-generic.c
+++ b/mm/pgtable-generic.c
@@ -191,6 +191,9 @@ pgtable_t pgtable_trans_huge_withdraw(struct mm_struct *mm, pmd_t *pmdp)
 void pmdp_invalidate(struct vm_area_struct *vma, unsigned long address,
 		     pmd_t *pmdp)
 {
-	set_pmd_at(vma->vm_mm, address, pmdp, pmd_mknotpresent(*pmdp));
+	pmd_t entry = *pmdp;
+	if (pmd_numa(entry))
+		entry = pmd_mknonnuma(entry);
+	set_pmd_at(vma->vm_mm, address, pmdp, pmd_mknotpresent(entry));
 	flush_tlb_range(vma, address, address + HPAGE_PMD_SIZE);
 }
-- 
1.8.4


* [PATCH 04/18] mm: numa: Do not clear PMD during PTE update scan
  2013-12-10 15:51 [PATCH 00/17] NUMA balancing segmentation fault fixes and misc followups v4 Mel Gorman
                   ` (2 preceding siblings ...)
  2013-12-10 15:51 ` [PATCH 03/18] mm: Clear pmd_numa before invalidating Mel Gorman
@ 2013-12-10 15:51 ` Mel Gorman
  2013-12-10 15:51 ` [PATCH 05/18] mm: numa: Do not clear PTE for pte_numa update Mel Gorman
                   ` (14 subsequent siblings)
  18 siblings, 0 replies; 33+ messages in thread
From: Mel Gorman @ 2013-12-10 15:51 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Alex Thorlton, Rik van Riel, Linux-MM, LKML, Mel Gorman

If the PMD is flushed then a parallel fault in handle_mm_fault() will enter
the pmd_none and do_huge_pmd_anonymous_page() path, where it will attempt
to insert a huge zero page. This is wasteful, so the patch avoids clearing
the PMD when setting pmd_numa.

Cc: stable@vger.kernel.org
Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
---
 mm/huge_memory.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index deae592..5a5da50 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1529,7 +1529,7 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
 			 */
 			if (!is_huge_zero_page(page) &&
 			    !pmd_numa(*pmd)) {
-				entry = pmdp_get_and_clear(mm, addr, pmd);
+				entry = *pmd;
 				entry = pmd_mknuma(entry);
 				ret = HPAGE_PMD_NR;
 			}
-- 
1.8.4


* [PATCH 05/18] mm: numa: Do not clear PTE for pte_numa update
  2013-12-10 15:51 [PATCH 00/17] NUMA balancing segmentation fault fixes and misc followups v4 Mel Gorman
                   ` (3 preceding siblings ...)
  2013-12-10 15:51 ` [PATCH 04/18] mm: numa: Do not clear PMD during PTE update scan Mel Gorman
@ 2013-12-10 15:51 ` Mel Gorman
  2013-12-16 23:15   ` [PATCH 19/18] mm,numa: write pte_numa pte back to the page tables Rik van Riel
  2013-12-10 15:51 ` [PATCH 06/18] mm: numa: Ensure anon_vma is locked to prevent parallel THP splits Mel Gorman
                   ` (13 subsequent siblings)
  18 siblings, 1 reply; 33+ messages in thread
From: Mel Gorman @ 2013-12-10 15:51 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Alex Thorlton, Rik van Riel, Linux-MM, LKML, Mel Gorman

The TLB must be flushed if the PTE is updated, but change_pte_range is
clearing the PTE while marking PTEs pte_numa without necessarily flushing
the TLB if it reinserts the same entry. Without the flush, it is conceivable
that two processors would have different TLB entries for the same virtual
address, and at the very least it would generate spurious faults. This patch
only clears the PTE (via ptep_modify_prot_start) in change_pte_range for a
full protection change.

Cc: stable@vger.kernel.org
Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
---
 mm/mprotect.c | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/mm/mprotect.c b/mm/mprotect.c
index 2666797..0a07e2d 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -52,13 +52,14 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
 			pte_t ptent;
 			bool updated = false;
 
-			ptent = ptep_modify_prot_start(mm, addr, pte);
 			if (!prot_numa) {
+				ptent = ptep_modify_prot_start(mm, addr, pte);
 				ptent = pte_modify(ptent, newprot);
 				updated = true;
 			} else {
 				struct page *page;
 
+				ptent = *pte;
 				page = vm_normal_page(vma, addr, oldpte);
 				if (page) {
 					if (!pte_numa(oldpte)) {
@@ -79,7 +80,10 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
 
 			if (updated)
 				pages++;
-			ptep_modify_prot_commit(mm, addr, pte, ptent);
+
+			/* Only !prot_numa always clears the pte */
+			if (!prot_numa)
+				ptep_modify_prot_commit(mm, addr, pte, ptent);
 		} else if (IS_ENABLED(CONFIG_MIGRATION) && !pte_file(oldpte)) {
 			swp_entry_t entry = pte_to_swp_entry(oldpte);
 
-- 
1.8.4


* [PATCH 06/18] mm: numa: Ensure anon_vma is locked to prevent parallel THP splits
  2013-12-10 15:51 [PATCH 00/17] NUMA balancing segmentation fault fixes and misc followups v4 Mel Gorman
                   ` (4 preceding siblings ...)
  2013-12-10 15:51 ` [PATCH 05/18] mm: numa: Do not clear PTE for pte_numa update Mel Gorman
@ 2013-12-10 15:51 ` Mel Gorman
  2013-12-10 15:51 ` [PATCH 07/18] mm: numa: Avoid unnecessary work on the failure path Mel Gorman
                   ` (12 subsequent siblings)
  18 siblings, 0 replies; 33+ messages in thread
From: Mel Gorman @ 2013-12-10 15:51 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Alex Thorlton, Rik van Riel, Linux-MM, LKML, Mel Gorman

The anon_vma lock prevents parallel THP splits and any associated complexity
that arises when handling splits during THP migration. This patch checks
if the lock was successfully acquired and bails from THP migration if it
failed for any reason.

Cc: stable@vger.kernel.org
Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
---
 mm/huge_memory.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 5a5da50..0f00b96 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1359,6 +1359,13 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		goto out_unlock;
 	}
 
+	/* Bail if we fail to protect against THP splits for any reason */
+	if (unlikely(!anon_vma)) {
+		put_page(page);
+		page_nid = -1;
+		goto clear_pmdnuma;
+	}
+
 	/*
 	 * Migrate the THP to the requested node, returns with page unlocked
 	 * and pmd_numa cleared.
-- 
1.8.4


* [PATCH 07/18] mm: numa: Avoid unnecessary work on the failure path
  2013-12-10 15:51 [PATCH 00/17] NUMA balancing segmentation fault fixes and misc followups v4 Mel Gorman
                   ` (5 preceding siblings ...)
  2013-12-10 15:51 ` [PATCH 06/18] mm: numa: Ensure anon_vma is locked to prevent parallel THP splits Mel Gorman
@ 2013-12-10 15:51 ` Mel Gorman
  2013-12-10 15:51 ` [PATCH 08/18] sched: numa: Skip inaccessible VMAs Mel Gorman
                   ` (11 subsequent siblings)
  18 siblings, 0 replies; 33+ messages in thread
From: Mel Gorman @ 2013-12-10 15:51 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Alex Thorlton, Rik van Riel, Linux-MM, LKML, Mel Gorman

If the PMD changes during THP migration then the migration aborts, but the
failure path does more work than is necessary.

Cc: stable@vger.kernel.org
Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
---
 mm/migrate.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/mm/migrate.c b/mm/migrate.c
index be787d5..a987525 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1780,7 +1780,8 @@ fail_putback:
 		putback_lru_page(page);
 		mod_zone_page_state(page_zone(page),
 			 NR_ISOLATED_ANON + page_lru, -HPAGE_PMD_NR);
-		goto out_fail;
+
+		goto out_unlock;
 	}
 
 	/*
@@ -1854,6 +1855,7 @@ out_dropref:
 	}
 	spin_unlock(ptl);
 
+out_unlock:
 	unlock_page(page);
 	put_page(page);
 	return 0;
-- 
1.8.4


* [PATCH 08/18] sched: numa: Skip inaccessible VMAs
  2013-12-10 15:51 [PATCH 00/17] NUMA balancing segmentation fault fixes and misc followups v4 Mel Gorman
                   ` (6 preceding siblings ...)
  2013-12-10 15:51 ` [PATCH 07/18] mm: numa: Avoid unnecessary work on the failure path Mel Gorman
@ 2013-12-10 15:51 ` Mel Gorman
  2013-12-10 15:51 ` [PATCH 09/18] mm: numa: Clear numa hinting information on mprotect Mel Gorman
                   ` (10 subsequent siblings)
  18 siblings, 0 replies; 33+ messages in thread
From: Mel Gorman @ 2013-12-10 15:51 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Alex Thorlton, Rik van Riel, Linux-MM, LKML, Mel Gorman

Inaccessible VMAs should not be trapping NUMA hinting faults. Skip them.

Cc: stable@vger.kernel.org
Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
---
 kernel/sched/fair.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index fd773ad..18bf84e 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1752,6 +1752,13 @@ void task_numa_work(struct callback_head *work)
 		    (vma->vm_file && (vma->vm_flags & (VM_READ|VM_WRITE)) == (VM_READ)))
 			continue;
 
+		/*
+		 * Skip inaccessible VMAs to avoid any confusion between
+		 * PROT_NONE and NUMA hinting ptes
+		 */
+		if (!(vma->vm_flags & (VM_READ | VM_EXEC | VM_WRITE)))
+			continue;
+
 		do {
 			start = max(start, vma->vm_start);
 			end = ALIGN(start + (pages << PAGE_SHIFT), HPAGE_SIZE);
-- 
1.8.4


* [PATCH 09/18] mm: numa: Clear numa hinting information on mprotect
  2013-12-10 15:51 [PATCH 00/17] NUMA balancing segmentation fault fixes and misc followups v4 Mel Gorman
                   ` (7 preceding siblings ...)
  2013-12-10 15:51 ` [PATCH 08/18] sched: numa: Skip inaccessible VMAs Mel Gorman
@ 2013-12-10 15:51 ` Mel Gorman
  2013-12-10 15:51 ` [PATCH 10/18] mm: numa: Avoid unnecessary disruption of NUMA hinting during migration Mel Gorman
                   ` (9 subsequent siblings)
  18 siblings, 0 replies; 33+ messages in thread
From: Mel Gorman @ 2013-12-10 15:51 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Alex Thorlton, Rik van Riel, Linux-MM, LKML, Mel Gorman

On a protection change it is no longer clear if the page should still be
accessible. This patch clears the NUMA hinting fault bits on a protection
change.

Cc: stable@vger.kernel.org
Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
---
 mm/huge_memory.c | 2 ++
 mm/mprotect.c    | 2 ++
 2 files changed, 4 insertions(+)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 0f00b96..0ecaba2 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1522,6 +1522,8 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
 		ret = 1;
 		if (!prot_numa) {
 			entry = pmdp_get_and_clear(mm, addr, pmd);
+			if (pmd_numa(entry))
+				entry = pmd_mknonnuma(entry);
 			entry = pmd_modify(entry, newprot);
 			ret = HPAGE_PMD_NR;
 			BUG_ON(pmd_write(entry));
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 0a07e2d..eb2f349 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -54,6 +54,8 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
 
 			if (!prot_numa) {
 				ptent = ptep_modify_prot_start(mm, addr, pte);
+				if (pte_numa(ptent))
+					ptent = pte_mknonnuma(ptent);
 				ptent = pte_modify(ptent, newprot);
 				updated = true;
 			} else {
-- 
1.8.4


* [PATCH 10/18] mm: numa: Avoid unnecessary disruption of NUMA hinting during migration
  2013-12-10 15:51 [PATCH 00/17] NUMA balancing segmentation fault fixes and misc followups v4 Mel Gorman
                   ` (8 preceding siblings ...)
  2013-12-10 15:51 ` [PATCH 09/18] mm: numa: Clear numa hinting information on mprotect Mel Gorman
@ 2013-12-10 15:51 ` Mel Gorman
  2013-12-17 22:53   ` Sasha Levin
  2013-12-10 15:51 ` [PATCH 11/18] mm: fix TLB flush race between migration, and change_protection_range Mel Gorman
                   ` (8 subsequent siblings)
  18 siblings, 1 reply; 33+ messages in thread
From: Mel Gorman @ 2013-12-10 15:51 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Alex Thorlton, Rik van Riel, Linux-MM, LKML, Mel Gorman

do_huge_pmd_numa_page() handles the case where there is a parallel THP
migration. However, by the time the check is made the NUMA hinting
information has already been disrupted. This patch adds an earlier check
with some helpers.

Cc: stable@vger.kernel.org
Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
---
 include/linux/migrate.h |  9 +++++++++
 mm/huge_memory.c        | 22 ++++++++++++++++------
 mm/migrate.c            | 12 ++++++++++++
 3 files changed, 37 insertions(+), 6 deletions(-)

diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index f5096b5..b7717d7 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -90,10 +90,19 @@ static inline int migrate_huge_page_move_mapping(struct address_space *mapping,
 #endif /* CONFIG_MIGRATION */
 
 #ifdef CONFIG_NUMA_BALANCING
+extern bool pmd_trans_migrating(pmd_t pmd);
+extern void wait_migrate_huge_page(struct anon_vma *anon_vma, pmd_t *pmd);
 extern int migrate_misplaced_page(struct page *page,
 				  struct vm_area_struct *vma, int node);
 extern bool migrate_ratelimited(int node);
 #else
+static inline bool pmd_trans_migrating(pmd_t pmd)
+{
+	return false;
+}
+static inline void wait_migrate_huge_page(struct anon_vma *anon_vma, pmd_t *pmd)
+{
+}
 static inline int migrate_misplaced_page(struct page *page,
 					 struct vm_area_struct *vma, int node)
 {
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 0ecaba2..e3b6a75 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -882,6 +882,10 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 		ret = 0;
 		goto out_unlock;
 	}
+
+	/* mmap_sem prevents this happening but warn if that changes */
+	WARN_ON(pmd_trans_migrating(pmd));
+
 	if (unlikely(pmd_trans_splitting(pmd))) {
 		/* split huge page running from under us */
 		spin_unlock(src_ptl);
@@ -1299,6 +1303,17 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	if (unlikely(!pmd_same(pmd, *pmdp)))
 		goto out_unlock;
 
+	/*
+	 * If there are potential migrations, wait for completion and retry
+	 * without disrupting NUMA hinting information. Do not relock and
+	 * check_same as the page may no longer be mapped.
+	 */
+	if (unlikely(pmd_trans_migrating(*pmdp))) {
+		spin_unlock(ptl);
+		wait_migrate_huge_page(vma->anon_vma, pmdp);
+		goto out;
+	}
+
 	page = pmd_page(pmd);
 	BUG_ON(is_huge_zero_page(page));
 	page_nid = page_to_nid(page);
@@ -1329,12 +1344,7 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 			goto clear_pmdnuma;
 	}
 
-	/*
-	 * If there are potential migrations, wait for completion and retry. We
-	 * do not relock and check_same as the page may no longer be mapped.
-	 * Furthermore, even if the page is currently misplaced, there is no
-	 * guarantee it is still misplaced after the migration completes.
-	 */
+	/* Migration could have started since the pmd_trans_migrating check */
 	if (!page_locked) {
 		spin_unlock(ptl);
 		wait_on_page_locked(page);
diff --git a/mm/migrate.c b/mm/migrate.c
index a987525..cfb4190 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1655,6 +1655,18 @@ int numamigrate_isolate_page(pg_data_t *pgdat, struct page *page)
 	return 1;
 }
 
+bool pmd_trans_migrating(pmd_t pmd)
+{
+	struct page *page = pmd_page(pmd);
+	return PageLocked(page);
+}
+
+void wait_migrate_huge_page(struct anon_vma *anon_vma, pmd_t *pmd)
+{
+	struct page *page = pmd_page(*pmd);
+	wait_on_page_locked(page);
+}
+
 /*
  * Attempt to migrate a misplaced page to the specified destination
  * node. Caller is expected to have an elevated reference count on
-- 
1.8.4


* [PATCH 11/18] mm: fix TLB flush race between migration, and change_protection_range
  2013-12-10 15:51 [PATCH 00/17] NUMA balancing segmentation fault fixes and misc followups v4 Mel Gorman
                   ` (9 preceding siblings ...)
  2013-12-10 15:51 ` [PATCH 10/18] mm: numa: Avoid unnecessary disruption of NUMA hinting during migration Mel Gorman
@ 2013-12-10 15:51 ` Mel Gorman
  2013-12-11 19:12   ` [PATCH] mm: fix TLB flush race between migration, and change_protection_range -fix Mel Gorman
  2013-12-10 15:51 ` [PATCH 12/18] mm: numa: Defer TLB flush for THP migration as long as possible Mel Gorman
                   ` (7 subsequent siblings)
  18 siblings, 1 reply; 33+ messages in thread
From: Mel Gorman @ 2013-12-10 15:51 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Alex Thorlton, Rik van Riel, Linux-MM, LKML, Mel Gorman

From: Rik van Riel <riel@redhat.com>

There are a few subtle races between change_protection_range (used by
mprotect and change_prot_numa) on one side, and NUMA page migration and
compaction on the other side.

The basic race is that there is a time window between when the PTE gets
made non-present (PROT_NONE or NUMA), and the TLB is flushed.

During that time, a CPU may continue writing to the page.

This is fine most of the time, however compaction or the NUMA migration
code may come in, and migrate the page away.

When that happens, the CPU may continue writing, through the cached
translation, to what is no longer the current memory location of the process.

This only affects x86, which has a somewhat optimistic pte_accessible. All
other architectures appear to be safe, and will either always flush,
or flush whenever there is a valid mapping, even with no permissions (SPARC).

The basic race looks like this:

CPU A			CPU B			CPU C

						load TLB entry
make entry PTE/PMD_NUMA
			fault on entry
						read/write old page
			start migrating page
			change PTE/PMD to new page
						read/write old page [*]
flush TLB
						reload TLB from new entry
						read/write new page
						lose data

[*] the old page may belong to a new user at this point!

The obvious fix is to flush remote TLB entries, by making pte_accessible
aware of the fact that PROT_NONE and PROT_NUMA memory may still be
accessible if there is a TLB flush pending for the mm.

This should fix both NUMA migration and compaction.
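
For illustration only, here is a condensed userspace model of the flag
protocol this patch introduces; it is not kernel code, the names merely
mirror the patch, and the TLB itself is represented only by comments. It
shows why a non-present PROT_NONE/PROT_NUMA entry must still be treated as
accessible, and hence flushed, while a batched protection change has its
flush pending.

#include <stdbool.h>
#include <stdio.h>

/* Simplified stand-in for struct mm_struct with the new field */
struct mm_model { bool tlb_flush_pending; };

/* Model of change_protection_range(): set the flag before any PTE is
 * made non-present, clear it only after the batched TLB flush. */
static void change_protection_range_model(struct mm_model *mm)
{
	mm->tlb_flush_pending = true;	/* set_tlb_flush_pending() */
	/* ... PTEs become PROT_NONE/_PAGE_NUMA here, no flush yet ... */
	/* ... flush_tlb_range() runs here ... */
	mm->tlb_flush_pending = false;	/* clear_tlb_flush_pending() */
}

/* Model of the x86 pte_accessible() change: a non-present entry is
 * still "accessible" if another CPU may hold a stale translation. */
static bool pte_accessible_model(const struct mm_model *mm,
				 bool present, bool protnone_or_numa)
{
	if (present)
		return true;
	return protnone_or_numa && mm->tlb_flush_pending;
}

int main(void)
{
	struct mm_model mm = { .tlb_flush_pending = true };

	/* Migration sees a _PAGE_NUMA pte mid-way through the batch:
	 * it must flush, otherwise CPU B keeps writing to the old page. */
	printf("must flush: %d\n", pte_accessible_model(&mm, false, true));

	change_protection_range_model(&mm);
	printf("must flush: %d\n", pte_accessible_model(&mm, false, true));
	return 0;
}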

Cc: stable@vger.kernel.org
Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 arch/sparc/include/asm/pgtable_64.h |  4 ++--
 arch/x86/include/asm/pgtable.h      | 11 ++++++++--
 include/asm-generic/pgtable.h       |  2 +-
 include/linux/mm_types.h            | 44 +++++++++++++++++++++++++++++++++++++
 kernel/fork.c                       |  1 +
 mm/huge_memory.c                    |  7 ++++++
 mm/mprotect.c                       |  2 ++
 mm/pgtable-generic.c                |  5 +++--
 8 files changed, 69 insertions(+), 7 deletions(-)

diff --git a/arch/sparc/include/asm/pgtable_64.h b/arch/sparc/include/asm/pgtable_64.h
index 8358dc1..0f9e945 100644
--- a/arch/sparc/include/asm/pgtable_64.h
+++ b/arch/sparc/include/asm/pgtable_64.h
@@ -619,7 +619,7 @@ static inline unsigned long pte_present(pte_t pte)
 }
 
 #define pte_accessible pte_accessible
-static inline unsigned long pte_accessible(pte_t a)
+static inline unsigned long pte_accessible(struct mm_struct *mm, pte_t a)
 {
 	return pte_val(a) & _PAGE_VALID;
 }
@@ -847,7 +847,7 @@ static inline void __set_pte_at(struct mm_struct *mm, unsigned long addr,
 	 * SUN4V NOTE: _PAGE_VALID is the same value in both the SUN4U
 	 *             and SUN4V pte layout, so this inline test is fine.
 	 */
-	if (likely(mm != &init_mm) && pte_accessible(orig))
+	if (likely(mm != &init_mm) && pte_accessible(mm, orig))
 		tlb_batch_add(mm, addr, ptep, orig, fullmm);
 }
 
diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 3d19994..48cab4c 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -452,9 +452,16 @@ static inline int pte_present(pte_t a)
 }
 
 #define pte_accessible pte_accessible
-static inline int pte_accessible(pte_t a)
+static inline bool pte_accessible(struct mm_struct *mm, pte_t a)
 {
-	return pte_flags(a) & _PAGE_PRESENT;
+	if (pte_flags(a) & _PAGE_PRESENT)
+		return true;
+
+	if ((pte_flags(a) & (_PAGE_PROTNONE | _PAGE_NUMA)) &&
+			tlb_flush_pending(mm))
+		return true;
+
+	return false;
 }
 
 static inline int pte_hidden(pte_t pte)
diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
index f330d28..b12079a 100644
--- a/include/asm-generic/pgtable.h
+++ b/include/asm-generic/pgtable.h
@@ -217,7 +217,7 @@ static inline int pmd_same(pmd_t pmd_a, pmd_t pmd_b)
 #endif
 
 #ifndef pte_accessible
-# define pte_accessible(pte)		((void)(pte),1)
+# define pte_accessible(mm, pte)	((void)(pte), 1)
 #endif
 
 #ifndef flush_tlb_fix_spurious_fault
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index bd29941..c122bb1 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -443,6 +443,14 @@ struct mm_struct {
 	/* numa_scan_seq prevents two threads setting pte_numa */
 	int numa_scan_seq;
 #endif
+#if defined(CONFIG_NUMA_BALANCING) || defined(CONFIG_COMPACTION)
+	/*
+	 * An operation with batched TLB flushing is going on. Anything that
+	 * can move process memory needs to flush the TLB when moving a
+	 * PROT_NONE or PROT_NUMA mapped page.
+	 */
+	bool tlb_flush_pending;
+#endif
 	struct uprobes_state uprobes_state;
 };
 
@@ -459,4 +467,40 @@ static inline cpumask_t *mm_cpumask(struct mm_struct *mm)
 	return mm->cpu_vm_mask_var;
 }
 
+#if defined(CONFIG_NUMA_BALANCING) || defined(CONFIG_COMPACTION)
+/*
+ * Memory barriers to keep this state in sync are graciously provided by
+ * the page table locks, outside of which no page table modifications happen.
+ * The barriers below prevent the compiler from re-ordering the instructions
+ * around the memory barriers that are already present in the code.
+ */
+static inline bool tlb_flush_pending(struct mm_struct *mm)
+{
+	barrier();
+	return mm->tlb_flush_pending;
+}
+static inline void set_tlb_flush_pending(struct mm_struct *mm)
+{
+	mm->tlb_flush_pending = true;
+	barrier();
+}
+/* Clearing is done after a TLB flush, which also provides a barrier. */
+static inline void clear_tlb_flush_pending(struct mm_struct *mm)
+{
+	barrier();
+	mm->tlb_flush_pending = false;
+}
+#else
+static inline bool tlb_flush_pending(struct mm_struct *mm)
+{
+	return false;
+}
+static inline void set_tlb_flush_pending(struct mm_struct *mm)
+{
+}
+static inline void clear_tlb_flush_pending(struct mm_struct *mm)
+{
+}
+#endif
+
 #endif /* _LINUX_MM_TYPES_H */
diff --git a/kernel/fork.c b/kernel/fork.c
index 728d5be..5721f0e 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -537,6 +537,7 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p)
 	spin_lock_init(&mm->page_table_lock);
 	mm_init_aio(mm);
 	mm_init_owner(mm, p);
+	clear_tlb_flush_pending(mm);
 
 	if (likely(!mm_alloc_pgd(mm))) {
 		mm->def_flags = 0;
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index e3b6a75..e3a5ee2 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1377,6 +1377,13 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	}
 
 	/*
+	 * The page_table_lock above provides a memory barrier
+	 * with change_protection_range.
+	 */
+	if (tlb_flush_pending(mm))
+		flush_tlb_range(vma, haddr, haddr + HPAGE_PMD_SIZE);
+
+	/*
 	 * Migrate the THP to the requested node, returns with page unlocked
 	 * and pmd_numa cleared.
 	 */
diff --git a/mm/mprotect.c b/mm/mprotect.c
index eb2f349..9b1be30 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -187,6 +187,7 @@ static unsigned long change_protection_range(struct vm_area_struct *vma,
 	BUG_ON(addr >= end);
 	pgd = pgd_offset(mm, addr);
 	flush_cache_range(vma, addr, end);
+	set_tlb_flush_pending(mm);
 	do {
 		next = pgd_addr_end(addr, end);
 		if (pgd_none_or_clear_bad(pgd))
@@ -198,6 +199,7 @@ static unsigned long change_protection_range(struct vm_area_struct *vma,
 	/* Only flush the TLB if we actually modified any entries: */
 	if (pages)
 		flush_tlb_range(vma, start, end);
+	clear_tlb_flush_pending(mm);
 
 	return pages;
 }
diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
index e84cad2..a8b9199 100644
--- a/mm/pgtable-generic.c
+++ b/mm/pgtable-generic.c
@@ -110,9 +110,10 @@ int pmdp_clear_flush_young(struct vm_area_struct *vma,
 pte_t ptep_clear_flush(struct vm_area_struct *vma, unsigned long address,
 		       pte_t *ptep)
 {
+	struct mm_struct *mm = (vma)->vm_mm;
 	pte_t pte;
-	pte = ptep_get_and_clear((vma)->vm_mm, address, ptep);
-	if (pte_accessible(pte))
+	pte = ptep_get_and_clear(mm, address, ptep);
+	if (pte_accessible(mm, pte))
 		flush_tlb_page(vma, address);
 	return pte;
 }
-- 
1.8.4


* [PATCH 12/18] mm: numa: Defer TLB flush for THP migration as long as possible
  2013-12-10 15:51 [PATCH 00/17] NUMA balancing segmentation fault fixes and misc followups v4 Mel Gorman
                   ` (10 preceding siblings ...)
  2013-12-10 15:51 ` [PATCH 11/18] mm: fix TLB flush race between migration, and change_protection_range Mel Gorman
@ 2013-12-10 15:51 ` Mel Gorman
  2013-12-10 16:56   ` Rik van Riel
  2013-12-10 15:51 ` [PATCH 13/18] mm: numa: Make NUMA-migrate related functions static Mel Gorman
                   ` (6 subsequent siblings)
  18 siblings, 1 reply; 33+ messages in thread
From: Mel Gorman @ 2013-12-10 15:51 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Alex Thorlton, Rik van Riel, Linux-MM, LKML, Mel Gorman

THP migration can fail for a variety of reasons. Avoid flushing the TLB
to deal with THP migration races until the copy is ready to start.

Cc: stable@vger.kernel.org
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 mm/huge_memory.c | 7 -------
 mm/migrate.c     | 3 +++
 2 files changed, 3 insertions(+), 7 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index e3a5ee2..e3b6a75 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1377,13 +1377,6 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	}
 
 	/*
-	 * The page_table_lock above provides a memory barrier
-	 * with change_protection_range.
-	 */
-	if (tlb_flush_pending(mm))
-		flush_tlb_range(vma, haddr, haddr + HPAGE_PMD_SIZE);
-
-	/*
 	 * Migrate the THP to the requested node, returns with page unlocked
 	 * and pmd_numa cleared.
 	 */
diff --git a/mm/migrate.c b/mm/migrate.c
index cfb4190..0c4fbf6 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1759,6 +1759,9 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
 		goto out_fail;
 	}
 
+	if (tlb_flush_pending(mm))
+		flush_tlb_range(vma, mmun_start, mmun_end);
+
 	/* Prepare a page as a migration target */
 	__set_page_locked(new_page);
 	SetPageSwapBacked(new_page);
-- 
1.8.4


* [PATCH 13/18] mm: numa: Make NUMA-migrate related functions static
  2013-12-10 15:51 [PATCH 00/17] NUMA balancing segmentation fault fixes and misc followups v4 Mel Gorman
                   ` (11 preceding siblings ...)
  2013-12-10 15:51 ` [PATCH 12/18] mm: numa: Defer TLB flush for THP migration as long as possible Mel Gorman
@ 2013-12-10 15:51 ` Mel Gorman
  2013-12-10 15:51 ` [PATCH 14/18] mm: numa: Limit scope of lock for NUMA migrate rate limiting Mel Gorman
                   ` (5 subsequent siblings)
  18 siblings, 0 replies; 33+ messages in thread
From: Mel Gorman @ 2013-12-10 15:51 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Alex Thorlton, Rik van Riel, Linux-MM, LKML, Mel Gorman

numamigrate_update_ratelimit and numamigrate_isolate_page only have callers
in mm/migrate.c. This patch makes them static.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
---
 mm/migrate.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/mm/migrate.c b/mm/migrate.c
index 0c4fbf6..b6eef65 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1593,7 +1593,8 @@ bool migrate_ratelimited(int node)
 }
 
 /* Returns true if the node is migrate rate-limited after the update */
-bool numamigrate_update_ratelimit(pg_data_t *pgdat, unsigned long nr_pages)
+static bool numamigrate_update_ratelimit(pg_data_t *pgdat,
+					unsigned long nr_pages)
 {
 	bool rate_limited = false;
 
@@ -1617,7 +1618,7 @@ bool numamigrate_update_ratelimit(pg_data_t *pgdat, unsigned long nr_pages)
 	return rate_limited;
 }
 
-int numamigrate_isolate_page(pg_data_t *pgdat, struct page *page)
+static int numamigrate_isolate_page(pg_data_t *pgdat, struct page *page)
 {
 	int page_lru;
 
-- 
1.8.4


* [PATCH 14/18] mm: numa: Limit scope of lock for NUMA migrate rate limiting
  2013-12-10 15:51 [PATCH 00/17] NUMA balancing segmentation fault fixes and misc followups v4 Mel Gorman
                   ` (12 preceding siblings ...)
  2013-12-10 15:51 ` [PATCH 13/18] mm: numa: Make NUMA-migrate related functions static Mel Gorman
@ 2013-12-10 15:51 ` Mel Gorman
  2013-12-10 15:51 ` [PATCH 15/18] mm: numa: Trace tasks that fail migration due to " Mel Gorman
                   ` (4 subsequent siblings)
  18 siblings, 0 replies; 33+ messages in thread
From: Mel Gorman @ 2013-12-10 15:51 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Alex Thorlton, Rik van Riel, Linux-MM, LKML, Mel Gorman

NUMA migrate rate limiting protects a migration counter and window using
a lock, but in some cases this lock can be contended. It is not critical
that the number of pages be perfect; lost updates are acceptable. Reduce
the importance of this lock.
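
A rough userspace sketch of the resulting pattern follows (illustration
only; the field names mirror the patch, the lock is reduced to a plain
pthread mutex and the window arithmetic to one second): only the rare
window reset is serialised, while the page counter is bumped without the
lock because a lost update merely skews the limit slightly.

#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>
#include <time.h>

struct node_model {
	pthread_mutex_t migrate_lock;	/* numabalancing_migrate_lock */
	unsigned long migrate_nr_pages;	/* pages migrated this window */
	time_t migrate_next_window;	/* when the window resets */
};

static bool update_ratelimit_model(struct node_model *n,
				   unsigned long nr_pages,
				   unsigned long ratelimit_pages)
{
	time_t now = time(NULL);

	/* Only the window reset takes the lock */
	if (now >= n->migrate_next_window) {
		pthread_mutex_lock(&n->migrate_lock);
		n->migrate_nr_pages = 0;
		n->migrate_next_window = now + 1;	/* 1s window in the model */
		pthread_mutex_unlock(&n->migrate_lock);
	}

	if (n->migrate_nr_pages > ratelimit_pages)
		return true;			/* rate limited */

	/* Unlocked, non-atomic update: racing callers may lose an update,
	 * which only skews the limit slightly and is accepted. */
	n->migrate_nr_pages += nr_pages;
	return false;
}

int main(void)
{
	struct node_model node = {
		.migrate_lock = PTHREAD_MUTEX_INITIALIZER,
	};

	printf("limited: %d\n", update_ratelimit_model(&node, 512, 1024));
	printf("limited: %d\n", update_ratelimit_model(&node, 4096, 1024));
	printf("limited: %d\n", update_ratelimit_model(&node, 512, 1024));
	return 0;
}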

Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
---
 include/linux/mmzone.h |  5 +----
 mm/migrate.c           | 21 ++++++++++++---------
 2 files changed, 13 insertions(+), 13 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index bd791e4..b835d3f 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -758,10 +758,7 @@ typedef struct pglist_data {
 	int kswapd_max_order;
 	enum zone_type classzone_idx;
 #ifdef CONFIG_NUMA_BALANCING
-	/*
-	 * Lock serializing the per destination node AutoNUMA memory
-	 * migration rate limiting data.
-	 */
+	/* Lock serializing the migrate rate limiting window */
 	spinlock_t numabalancing_migrate_lock;
 
 	/* Rate limiting time interval */
diff --git a/mm/migrate.c b/mm/migrate.c
index b6eef65..564d5c9 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1596,26 +1596,29 @@ bool migrate_ratelimited(int node)
 static bool numamigrate_update_ratelimit(pg_data_t *pgdat,
 					unsigned long nr_pages)
 {
-	bool rate_limited = false;
-
 	/*
 	 * Rate-limit the amount of data that is being migrated to a node.
 	 * Optimal placement is no good if the memory bus is saturated and
 	 * all the time is being spent migrating!
 	 */
-	spin_lock(&pgdat->numabalancing_migrate_lock);
 	if (time_after(jiffies, pgdat->numabalancing_migrate_next_window)) {
+		spin_lock(&pgdat->numabalancing_migrate_lock);
 		pgdat->numabalancing_migrate_nr_pages = 0;
 		pgdat->numabalancing_migrate_next_window = jiffies +
 			msecs_to_jiffies(migrate_interval_millisecs);
+		spin_unlock(&pgdat->numabalancing_migrate_lock);
 	}
 	if (pgdat->numabalancing_migrate_nr_pages > ratelimit_pages)
-		rate_limited = true;
-	else
-		pgdat->numabalancing_migrate_nr_pages += nr_pages;
-	spin_unlock(&pgdat->numabalancing_migrate_lock);
-	
-	return rate_limited;
+		return true;
+
+	/*
+	 * This is an unlocked non-atomic update so errors are possible.
+	 * The consequences are failing to migrate when we potentially should
+	 * have, which is not severe enough to warrant locking. If it is ever
+	 * a problem, it can be converted to a per-cpu counter.
+	 */
+	pgdat->numabalancing_migrate_nr_pages += nr_pages;
+	return false;
 }
 
 static int numamigrate_isolate_page(pg_data_t *pgdat, struct page *page)
-- 
1.8.4


* [PATCH 15/18] mm: numa: Trace tasks that fail migration due to rate limiting
  2013-12-10 15:51 [PATCH 00/17] NUMA balancing segmentation fault fixes and misc followups v4 Mel Gorman
                   ` (13 preceding siblings ...)
  2013-12-10 15:51 ` [PATCH 14/18] mm: numa: Limit scope of lock for NUMA migrate rate limiting Mel Gorman
@ 2013-12-10 15:51 ` Mel Gorman
  2013-12-10 15:51 ` [PATCH 16/18] mm: numa: Do not automatically migrate KSM pages Mel Gorman
                   ` (3 subsequent siblings)
  18 siblings, 0 replies; 33+ messages in thread
From: Mel Gorman @ 2013-12-10 15:51 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Alex Thorlton, Rik van Riel, Linux-MM, LKML, Mel Gorman

A low local/remote NUMA hinting fault ratio is potentially explained by
failed migrations. This patch adds a tracepoint that fires when a migration
fails due to migration rate limiting.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
---
 include/trace/events/migrate.h | 26 ++++++++++++++++++++++++++
 mm/migrate.c                   |  5 ++++-
 2 files changed, 30 insertions(+), 1 deletion(-)

diff --git a/include/trace/events/migrate.h b/include/trace/events/migrate.h
index ec2a6cc..3075ffb 100644
--- a/include/trace/events/migrate.h
+++ b/include/trace/events/migrate.h
@@ -45,6 +45,32 @@ TRACE_EVENT(mm_migrate_pages,
 		__print_symbolic(__entry->reason, MIGRATE_REASON))
 );
 
+TRACE_EVENT(mm_numa_migrate_ratelimit,
+
+	TP_PROTO(struct task_struct *p, int dst_nid, unsigned long nr_pages),
+
+	TP_ARGS(p, dst_nid, nr_pages),
+
+	TP_STRUCT__entry(
+		__array(	char,		comm,	TASK_COMM_LEN)
+		__field(	pid_t,		pid)
+		__field(	int,		dst_nid)
+		__field(	unsigned long,	nr_pages)
+	),
+
+	TP_fast_assign(
+		memcpy(__entry->comm, p->comm, TASK_COMM_LEN);
+		__entry->pid		= p->pid;
+		__entry->dst_nid	= dst_nid;
+		__entry->nr_pages	= nr_pages;
+	),
+
+	TP_printk("comm=%s pid=%d dst_nid=%d nr_pages=%lu",
+		__entry->comm,
+		__entry->pid,
+		__entry->dst_nid,
+		__entry->nr_pages)
+);
 #endif /* _TRACE_MIGRATE_H */
 
 /* This part must be outside protection */
diff --git a/mm/migrate.c b/mm/migrate.c
index 564d5c9..8dc277d 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1608,8 +1608,11 @@ static bool numamigrate_update_ratelimit(pg_data_t *pgdat,
 			msecs_to_jiffies(migrate_interval_millisecs);
 		spin_unlock(&pgdat->numabalancing_migrate_lock);
 	}
-	if (pgdat->numabalancing_migrate_nr_pages > ratelimit_pages)
+	if (pgdat->numabalancing_migrate_nr_pages > ratelimit_pages) {
+		trace_mm_numa_migrate_ratelimit(current, pgdat->node_id,
+								nr_pages);
 		return true;
+	}
 
 	/*
 	 * This is an unlocked non-atomic update so errors are possible.
-- 
1.8.4


* [PATCH 16/18] mm: numa: Do not automatically migrate KSM pages
  2013-12-10 15:51 [PATCH 00/17] NUMA balancing segmentation fault fixes and misc followups v4 Mel Gorman
                   ` (14 preceding siblings ...)
  2013-12-10 15:51 ` [PATCH 15/18] mm: numa: Trace tasks that fail migration due to " Mel Gorman
@ 2013-12-10 15:51 ` Mel Gorman
  2013-12-10 15:51 ` [PATCH 17/18] sched: Add tracepoints related to NUMA task migration Mel Gorman
                   ` (2 subsequent siblings)
  18 siblings, 0 replies; 33+ messages in thread
From: Mel Gorman @ 2013-12-10 15:51 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Alex Thorlton, Rik van Riel, Linux-MM, LKML, Mel Gorman

KSM pages can be shared between tasks that are not necessarily related
to each other from a NUMA perspective. This patch causes those pages to
be ignored by automatic NUMA balancing so they do not migrate and do not
cause unrelated tasks to be grouped together.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
---
 mm/mprotect.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/mm/mprotect.c b/mm/mprotect.c
index 9b1be30..c258137 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -23,6 +23,7 @@
 #include <linux/mmu_notifier.h>
 #include <linux/migrate.h>
 #include <linux/perf_event.h>
+#include <linux/ksm.h>
 #include <asm/uaccess.h>
 #include <asm/pgtable.h>
 #include <asm/cacheflush.h>
@@ -63,7 +64,7 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
 
 				ptent = *pte;
 				page = vm_normal_page(vma, addr, oldpte);
-				if (page) {
+				if (page && !PageKsm(page)) {
 					if (!pte_numa(oldpte)) {
 						ptent = pte_mknuma(ptent);
 						updated = true;
-- 
1.8.4


* [PATCH 17/18] sched: Add tracepoints related to NUMA task migration
  2013-12-10 15:51 [PATCH 00/17] NUMA balancing segmentation fault fixes and misc followups v4 Mel Gorman
                   ` (15 preceding siblings ...)
  2013-12-10 15:51 ` [PATCH 16/18] mm: numa: Do not automatically migrate KSM pages Mel Gorman
@ 2013-12-10 15:51 ` Mel Gorman
  2013-12-10 22:22   ` Andrew Morton
  2013-12-10 15:56 ` [PATCH 00/17] NUMA balancing segmentation fault fixes and misc followups v4 Mel Gorman
  2013-12-11 13:21 ` [PATCH] mm: numa: Guarantee that tlb_flush_pending updates are visible before page table updates Mel Gorman
  18 siblings, 1 reply; 33+ messages in thread
From: Mel Gorman @ 2013-12-10 15:51 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Alex Thorlton, Rik van Riel, Linux-MM, LKML, Mel Gorman

This patch adds three tracepoints
 o trace_sched_move_numa	when a task is moved to a node
 o trace_sched_swap_numa	when a task is swapped with another task
 o trace_sched_stick_numa	when a numa-related migration fails

The tracepoints allow the NUMA scheduler activity to be monitored and the
following high-level metrics can be calculated

 o NUMA migrated stuck	 nr trace_sched_stick_numa
 o NUMA migrated idle	 nr trace_sched_move_numa
 o NUMA migrated swapped nr trace_sched_swap_numa
 o NUMA local swapped	 trace_sched_swap_numa src_nid == dst_nid (should never happen)
 o NUMA remote swapped	 trace_sched_swap_numa src_nid != dst_nid (should == NUMA migrated swapped)
 o NUMA group swapped	 trace_sched_swap_numa src_ngid == dst_ngid
			 Maybe a small number of these are acceptable
			 but a high number would be a major surprise.
			 It would be even worse if bounces are frequent.
 o NUMA avg task migs.	 Average number of migrations for tasks
 o NUMA stddev task mig	 Self-explanatory
 o NUMA max task migs.	 Maximum number of migrations for a single task

In general the intent of the tracepoints is to help diagnose problems
where automatic NUMA balancing appears to be doing an excessive amount of
useless work.
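
For illustration only (nothing below is part of the patch and the helper
names are made up): the swap-related buckets above reduce to comparisons of
the nid/ngid fields that sched_swap_numa emits, roughly

	#include <stdio.h>

	struct numa_swap_stats {
		unsigned long swapped;		/* NUMA migrated swapped */
		unsigned long local_swapped;	/* src_nid == dst_nid, should never happen */
		unsigned long remote_swapped;	/* src_nid != dst_nid, should equal swapped */
		unsigned long group_swapped;	/* src_ngid == dst_ngid, suspicious if common */
	};

	static void account_swap(struct numa_swap_stats *s, int src_nid,
				 int dst_nid, int src_ngid, int dst_ngid)
	{
		s->swapped++;
		if (src_nid == dst_nid)
			s->local_swapped++;
		else
			s->remote_swapped++;
		if (src_ngid == dst_ngid)
			s->group_swapped++;
	}

	int main(void)
	{
		struct numa_swap_stats s = { 0 };

		/* e.g. one swap between node 0 and node 1, different numa groups */
		account_swap(&s, 0, 1, 42, 57);
		printf("swapped=%lu local=%lu remote=%lu group=%lu\n",
		       s.swapped, s.local_swapped, s.remote_swapped,
		       s.group_swapped);
		return 0;
	}

The move and stick buckets are plain event counts of trace_sched_move_numa
and trace_sched_stick_numa and need no field comparison.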

Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
---
 include/trace/events/sched.h | 87 ++++++++++++++++++++++++++++++++++++++++++++
 kernel/sched/core.c          |  2 +
 kernel/sched/fair.c          |  6 ++-
 3 files changed, 93 insertions(+), 2 deletions(-)

diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h
index 04c3084..67e1bbf 100644
--- a/include/trace/events/sched.h
+++ b/include/trace/events/sched.h
@@ -443,6 +443,93 @@ TRACE_EVENT(sched_process_hang,
 );
 #endif /* CONFIG_DETECT_HUNG_TASK */
 
+DECLARE_EVENT_CLASS(sched_move_task_template,
+
+	TP_PROTO(struct task_struct *tsk, int src_cpu, int dst_cpu),
+
+	TP_ARGS(tsk, src_cpu, dst_cpu),
+
+	TP_STRUCT__entry(
+		__field( pid_t,	pid			)
+		__field( pid_t,	tgid			)
+		__field( pid_t,	ngid			)
+		__field( int,	src_cpu			)
+		__field( int,	src_nid			)
+		__field( int,	dst_cpu			)
+		__field( int,	dst_nid			)
+	),
+
+	TP_fast_assign(
+		__entry->pid		= task_pid_nr(tsk);
+		__entry->tgid		= task_tgid_nr(tsk);
+		__entry->ngid		= task_numa_group_id(tsk);
+		__entry->src_cpu	= src_cpu;
+		__entry->src_nid	= cpu_to_node(src_cpu);
+		__entry->dst_cpu	= dst_cpu;
+		__entry->dst_nid	= cpu_to_node(dst_cpu);
+	),
+
+	TP_printk("pid=%d tgid=%d ngid=%d src_cpu=%d src_nid=%d dst_cpu=%d dst_nid=%d",
+			__entry->pid, __entry->tgid, __entry->ngid,
+			__entry->src_cpu, __entry->src_nid,
+			__entry->dst_cpu, __entry->dst_nid)
+);
+
+/*
+ * Tracks migration of tasks from one runqueue to another. Can be used to
+ * detect if automatic NUMA balancing is bouncing between nodes
+ */
+DEFINE_EVENT(sched_move_task_template, sched_move_numa,
+	TP_PROTO(struct task_struct *tsk, int src_cpu, int dst_cpu),
+
+	TP_ARGS(tsk, src_cpu, dst_cpu)
+);
+
+DEFINE_EVENT(sched_move_task_template, sched_stick_numa,
+	TP_PROTO(struct task_struct *tsk, int src_cpu, int dst_cpu),
+
+	TP_ARGS(tsk, src_cpu, dst_cpu)
+);
+
+TRACE_EVENT(sched_swap_numa,
+
+	TP_PROTO(struct task_struct *src_tsk, int src_cpu,
+		 struct task_struct *dst_tsk, int dst_cpu),
+
+	TP_ARGS(src_tsk, src_cpu, dst_tsk, dst_cpu),
+
+	TP_STRUCT__entry(
+		__field( pid_t,	src_pid			)
+		__field( pid_t,	src_tgid		)
+		__field( pid_t,	src_ngid		)
+		__field( int,	src_cpu			)
+		__field( int,	src_nid			)
+		__field( pid_t,	dst_pid			)
+		__field( pid_t,	dst_tgid		)
+		__field( pid_t,	dst_ngid		)
+		__field( int,	dst_cpu			)
+		__field( int,	dst_nid			)
+	),
+
+	TP_fast_assign(
+		__entry->src_pid	= task_pid_nr(src_tsk);
+		__entry->src_tgid	= task_tgid_nr(src_tsk);
+		__entry->src_ngid	= task_numa_group_id(src_tsk);
+		__entry->src_cpu	= src_cpu;
+		__entry->src_nid	= cpu_to_node(src_cpu);
+		__entry->dst_pid	= task_pid_nr(dst_tsk);
+		__entry->dst_tgid	= task_tgid_nr(dst_tsk);
+		__entry->dst_ngid	= task_numa_group_id(dst_tsk);
+		__entry->dst_cpu	= dst_cpu;
+		__entry->dst_nid	= cpu_to_node(dst_cpu);
+	),
+
+	TP_printk("src_pid=%d src_tgid=%d src_ngid=%d src_cpu=%d src_nid=%d dst_pid=%d dst_tgid=%d dst_ngid=%d dst_cpu=%d dst_nid=%d",
+			__entry->src_pid, __entry->src_tgid, __entry->src_ngid,
+			__entry->src_cpu, __entry->src_nid,
+			__entry->dst_pid, __entry->dst_tgid, __entry->dst_ngid,
+			__entry->dst_cpu, __entry->dst_nid)
+);
 #endif /* _TRACE_SCHED_H */
 
 /* This part must be outside protection */
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index e85cda2..e485d2b 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1108,6 +1108,7 @@ int migrate_swap(struct task_struct *cur, struct task_struct *p)
 	if (!cpumask_test_cpu(arg.src_cpu, tsk_cpus_allowed(arg.dst_task)))
 		goto out;
 
+	trace_sched_swap_numa(cur, arg.src_cpu, p, arg.dst_cpu);
 	ret = stop_two_cpus(arg.dst_cpu, arg.src_cpu, migrate_swap_stop, &arg);
 
 out:
@@ -4090,6 +4091,7 @@ int migrate_task_to(struct task_struct *p, int target_cpu)
 
 	/* TODO: This is not properly updating schedstats */
 
+	trace_sched_move_numa(p, curr_cpu, target_cpu);
 	return stop_one_cpu(curr_cpu, migration_cpu_stop, &arg);
 }
 
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 18bf84e..26fe588 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1272,11 +1272,13 @@ static int task_numa_migrate(struct task_struct *p)
 	p->numa_scan_period = task_scan_min(p);
 
 	if (env.best_task == NULL) {
-		int ret = migrate_task_to(p, env.best_cpu);
+		if ((ret = migrate_task_to(p, env.best_cpu)) != 0)
+			trace_sched_stick_numa(p, env.src_cpu, env.best_cpu);
 		return ret;
 	}
 
-	ret = migrate_swap(p, env.best_task);
+	if ((ret = migrate_swap(p, env.best_task)) != 0);
+		trace_sched_stick_numa(p, env.src_cpu, task_cpu(env.best_task));
 	put_task_struct(env.best_task);
 	return ret;
 }
-- 
1.8.4


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* Re: [PATCH 00/17] NUMA balancing segmentation fault fixes and misc followups v4
  2013-12-10 15:51 [PATCH 00/17] NUMA balancing segmentation fault fixes and misc followups v4 Mel Gorman
                   ` (16 preceding siblings ...)
  2013-12-10 15:51 ` [PATCH 17/18] sched: Add tracepoints related to NUMA task migration Mel Gorman
@ 2013-12-10 15:56 ` Mel Gorman
  2013-12-11 13:21 ` [PATCH] mm: numa: Guarantee that tlb_flush_pending updates are visible before page table updates Mel Gorman
  18 siblings, 0 replies; 33+ messages in thread
From: Mel Gorman @ 2013-12-10 15:56 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Alex Thorlton, Rik van Riel, Linux-MM, LKML

On Tue, Dec 10, 2013 at 03:51:18PM +0000, Mel Gorman wrote:
> Changelog since V3
> o Dropped a tracing patch
> o Rebased to 3.13-rc3
> o Removed unnecessary ptl acquisition
> 

*sigh*

There really are only 17 patches in the series. 18/18 does not exist.

-- 
Mel Gorman
SUSE Labs


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 12/18] mm: numa: Defer TLB flush for THP migration as long as possible
  2013-12-10 15:51 ` [PATCH 12/18] mm: numa: Defer TLB flush for THP migration as long as possible Mel Gorman
@ 2013-12-10 16:56   ` Rik van Riel
  0 siblings, 0 replies; 33+ messages in thread
From: Rik van Riel @ 2013-12-10 16:56 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Andrew Morton, Alex Thorlton, Linux-MM, LKML

On 12/10/2013 10:51 AM, Mel Gorman wrote:
> THP migration can fail for a variety of reasons. Avoid flushing the TLB
> to deal with THP migration races until the copy is ready to start.
> 
> Cc: stable@vger.kernel.org
> Signed-off-by: Mel Gorman <mgorman@suse.de>

Reviewed-by: Rik van Riel <riel@redhat.com>


-- 
All rights reversed


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 17/18] sched: Add tracepoints related to NUMA task migration
  2013-12-10 15:51 ` [PATCH 17/18] sched: Add tracepoints related to NUMA task migration Mel Gorman
@ 2013-12-10 22:22   ` Andrew Morton
  2013-12-11  8:37     ` Mel Gorman
  0 siblings, 1 reply; 33+ messages in thread
From: Andrew Morton @ 2013-12-10 22:22 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Alex Thorlton, Rik van Riel, Linux-MM, LKML

On Tue, 10 Dec 2013 15:51:35 +0000 Mel Gorman <mgorman@suse.de> wrote:

> This patch adds three tracepoints
>  o trace_sched_move_numa	when a task is moved to a node
>  o trace_sched_swap_numa	when a task is swapped with another task
>  o trace_sched_stick_numa	when a numa-related migration fails
> 
> The tracepoints allow the NUMA scheduler activity to be monitored and the
> following high-level metrics can be calculated
> 
>  o NUMA migrated stuck	 nr trace_sched_stick_numa
>  o NUMA migrated idle	 nr trace_sched_move_numa
>  o NUMA migrated swapped nr trace_sched_swap_numa
>  o NUMA local swapped	 trace_sched_swap_numa src_nid == dst_nid (should never happen)
>  o NUMA remote swapped	 trace_sched_swap_numa src_nid != dst_nid (should == NUMA migrated swapped)
>  o NUMA group swapped	 trace_sched_swap_numa src_ngid == dst_ngid
> 			 Maybe a small number of these are acceptable
> 			 but a high number would be a major surprise.
> 			 It would be even worse if bounces are frequent.
>  o NUMA avg task migs.	 Average number of migrations for tasks
>  o NUMA stddev task mig	 Self-explanatory
>  o NUMA max task migs.	 Maximum number of migrations for a single task
> 
> In general the intent of the tracepoints is to help diagnose problems
> where automatic NUMA balancing appears to be doing an excessive amount of
> useless work.
> 
> ...
>
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -1272,11 +1272,13 @@ static int task_numa_migrate(struct task_struct *p)
>  	p->numa_scan_period = task_scan_min(p);
>  
>  	if (env.best_task == NULL) {
> -		int ret = migrate_task_to(p, env.best_cpu);
> +		if ((ret = migrate_task_to(p, env.best_cpu)) != 0)
> +			trace_sched_stick_numa(p, env.src_cpu, env.best_cpu);
>  		return ret;
>  	}
>  
> -	ret = migrate_swap(p, env.best_task);
> +	if ((ret = migrate_swap(p, env.best_task)) != 0);

I'll zap that semicolon...

> +		trace_sched_stick_numa(p, env.src_cpu, task_cpu(env.best_task));
>  	put_task_struct(env.best_task);
>  	return ret;
>  }


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 17/18] sched: Add tracepoints related to NUMA task migration
  2013-12-10 22:22   ` Andrew Morton
@ 2013-12-11  8:37     ` Mel Gorman
  0 siblings, 0 replies; 33+ messages in thread
From: Mel Gorman @ 2013-12-11  8:37 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Alex Thorlton, Rik van Riel, Linux-MM, LKML

On Tue, Dec 10, 2013 at 02:22:11PM -0800, Andrew Morton wrote:
> On Tue, 10 Dec 2013 15:51:35 +0000 Mel Gorman <mgorman@suse.de> wrote:
> 
> > This patch adds three tracepoints
> >  o trace_sched_move_numa	when a task is moved to a node
> >  o trace_sched_swap_numa	when a task is swapped with another task
> >  o trace_sched_stick_numa	when a numa-related migration fails
> > 
> > The tracepoints allow the NUMA scheduler activity to be monitored and the
> > following high-level metrics can be calculated
> > 
> >  o NUMA migrated stuck	 nr trace_sched_stick_numa
> >  o NUMA migrated idle	 nr trace_sched_move_numa
> >  o NUMA migrated swapped nr trace_sched_swap_numa
> >  o NUMA local swapped	 trace_sched_swap_numa src_nid == dst_nid (should never happen)
> >  o NUMA remote swapped	 trace_sched_swap_numa src_nid != dst_nid (should == NUMA migrated swapped)
> >  o NUMA group swapped	 trace_sched_swap_numa src_ngid == dst_ngid
> > 			 Maybe a small number of these are acceptable
> > 			 but a high number would be a major surprise.
> > 			 It would be even worse if bounces are frequent.
> >  o NUMA avg task migs.	 Average number of migrations for tasks
> >  o NUMA stddev task mig	 Self-explanatory
> >  o NUMA max task migs.	 Maximum number of migrations for a single task
> > 
> > In general the intent of the tracepoints is to help diagnose problems
> > where automatic NUMA balancing appears to be doing an excessive amount of
> > useless work.
> > 
> > ...
> >
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -1272,11 +1272,13 @@ static int task_numa_migrate(struct task_struct *p)
> >  	p->numa_scan_period = task_scan_min(p);
> >  
> >  	if (env.best_task == NULL) {
> > -		int ret = migrate_task_to(p, env.best_cpu);
> > +		if ((ret = migrate_task_to(p, env.best_cpu)) != 0)
> > +			trace_sched_stick_numa(p, env.src_cpu, env.best_cpu);
> >  		return ret;
> >  	}
> >  
> > -	ret = migrate_swap(p, env.best_task);
> > +	if ((ret = migrate_swap(p, env.best_task)) != 0);
> 
> I'll zap that semicolon...
> 

Thanks

-- 
Mel Gorman
SUSE Labs


^ permalink raw reply	[flat|nested] 33+ messages in thread

* [PATCH] mm: numa: Guarantee that tlb_flush_pending updates are visible before page table updates
  2013-12-10 15:51 [PATCH 00/17] NUMA balancing segmentation fault fixes and misc followups v4 Mel Gorman
                   ` (17 preceding siblings ...)
  2013-12-10 15:56 ` [PATCH 00/17] NUMA balancing segmentation fault fixes and misc followups v4 Mel Gorman
@ 2013-12-11 13:21 ` Mel Gorman
  2013-12-11 14:44   ` Paul E. McKenney
  2013-12-11 15:21   ` Rik van Riel
  18 siblings, 2 replies; 33+ messages in thread
From: Mel Gorman @ 2013-12-11 13:21 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Paul E. McKenney, Peter Zijlstra, Alex Thorlton, Rik van Riel,
	Linux-MM, LKML

According to documentation on barriers, stores issued before a LOCK can
complete after the lock implying that it's possible tlb_flush_pending can
be visible after a page table update. As per revised documentation, this patch
adds a smp_mb__before_spinlock to guarantee the correct ordering.

Cc: stable@vger.kernel.org
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 include/linux/mm_types.h | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index c122bb1..a12f2ab 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -482,7 +482,12 @@ static inline bool tlb_flush_pending(struct mm_struct *mm)
 static inline void set_tlb_flush_pending(struct mm_struct *mm)
 {
 	mm->tlb_flush_pending = true;
-	barrier();
+
+	/*
+	 * Guarantee that the tlb_flush_pending store does not leak into the
+	 * critical section updating the page tables
+	 */
+	smp_mb__before_spinlock();
 }
 /* Clearing is done after a TLB flush, which also provides a barrier. */
 static inline void clear_tlb_flush_pending(struct mm_struct *mm)
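
To make the required ordering concrete, here is a userspace analogue using
C11 atomics (illustration only, not kernel code: the writer-side fence plays
the role of smp_mb__before_spinlock(), the two variables merely stand in for
mm->tlb_flush_pending and a pte being changed in the critical section, and
the reader-side fence is only there to make the C11 model well defined)

	#include <pthread.h>
	#include <stdatomic.h>
	#include <stdio.h>

	static atomic_int flush_pending;	/* stands in for mm->tlb_flush_pending */
	static atomic_int pte_is_numa;		/* stands in for the pte update        */

	static void *protection_changer(void *arg)
	{
		/* set_tlb_flush_pending(): the flag is set first ... */
		atomic_store_explicit(&flush_pending, 1, memory_order_relaxed);
		/* ... and the full barrier keeps that store ahead of ... */
		atomic_thread_fence(memory_order_seq_cst);
		/* ... the page table update done in the critical section */
		atomic_store_explicit(&pte_is_numa, 1, memory_order_relaxed);
		return NULL;
	}

	static void *pte_reader(void *arg)
	{
		if (atomic_load_explicit(&pte_is_numa, memory_order_relaxed)) {
			atomic_thread_fence(memory_order_seq_cst);
			/* with both fences in place this can never fire */
			if (!atomic_load_explicit(&flush_pending, memory_order_relaxed))
				puts("saw the pte update but not tlb_flush_pending");
		}
		return NULL;
	}

	int main(void)
	{
		pthread_t a, b;

		pthread_create(&a, NULL, protection_changer, NULL);
		pthread_create(&b, NULL, pte_reader, NULL);
		pthread_join(a, NULL);
		pthread_join(b, NULL);
		return 0;
	}

Drop the writer-side fence, which is roughly what relying on the LOCK
operation alone amounts to since it only has acquire semantics, and nothing
forbids a reader from observing the pte update while still seeing the flag
clear. That is the window this patch closes.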


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* Re: [PATCH] mm: numa: Guarantee that tlb_flush_pending updates are visible before page table updates
  2013-12-11 13:21 ` [PATCH] mm: numa: Guarantee that tlb_flush_pending updates are visible before page table updates Mel Gorman
@ 2013-12-11 14:44   ` Paul E. McKenney
  2013-12-11 16:40     ` Mel Gorman
  2013-12-11 15:21   ` Rik van Riel
  1 sibling, 1 reply; 33+ messages in thread
From: Paul E. McKenney @ 2013-12-11 14:44 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Peter Zijlstra, Alex Thorlton, Rik van Riel,
	Linux-MM, LKML

On Wed, Dec 11, 2013 at 01:21:09PM +0000, Mel Gorman wrote:
> According to documentation on barriers, stores issued before a LOCK can
> complete after the lock implying that it's possible tlb_flush_pending can
> be visible after a page table update. As per revised documentation, this patch
> adds a smp_mb__before_spinlock to guarantee the correct ordering.
> 
> Cc: stable@vger.kernel.org
> Signed-off-by: Mel Gorman <mgorman@suse.de>

Assuming that there is a lock acquisition after calls to
set_tlb_flush_pending():

Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>

(I don't see set_tlb_flush_pending() in mainline.)

> ---
>  include/linux/mm_types.h | 7 ++++++-
>  1 file changed, 6 insertions(+), 1 deletion(-)
> 
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index c122bb1..a12f2ab 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -482,7 +482,12 @@ static inline bool tlb_flush_pending(struct mm_struct *mm)
>  static inline void set_tlb_flush_pending(struct mm_struct *mm)
>  {
>  	mm->tlb_flush_pending = true;
> -	barrier();
> +
> +	/*
> +	 * Guarantee that the tlb_flush_pending store does not leak into the
> +	 * critical section updating the page tables
> +	 */
> +	smp_mb__before_spinlock();
>  }
>  /* Clearing is done after a TLB flush, which also provides a barrier. */
>  static inline void clear_tlb_flush_pending(struct mm_struct *mm)
> 


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH] mm: numa: Guarantee that tlb_flush_pending updates are visible before page table updates
  2013-12-11 13:21 ` [PATCH] mm: numa: Guarantee that tlb_flush_pending updates are visible before page table updates Mel Gorman
  2013-12-11 14:44   ` Paul E. McKenney
@ 2013-12-11 15:21   ` Rik van Riel
  1 sibling, 0 replies; 33+ messages in thread
From: Rik van Riel @ 2013-12-11 15:21 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Paul E. McKenney, Peter Zijlstra, Alex Thorlton,
	Linux-MM, LKML

On Wed, 11 Dec 2013 13:21:09 +0000
Mel Gorman <mgorman@suse.de> wrote:

> According to documentation on barriers, stores issued before a LOCK can
> complete after the lock implying that it's possible tlb_flush_pending can
> be visible after a page table update. As per revised documentation, this patch
> adds a smp_mb__before_spinlock to guarantee the correct ordering.

And now you have 18 patches :)
 
> Cc: stable@vger.kernel.org
> Signed-off-by: Mel Gorman <mgorman@suse.de>

Reviewed-by: Rik van Riel <riel@redhat.com>
 
-- 
All rights reversed.


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH] mm: numa: Guarantee that tlb_flush_pending updates are visible before page table updates
  2013-12-11 14:44   ` Paul E. McKenney
@ 2013-12-11 16:40     ` Mel Gorman
  2013-12-11 16:56       ` Paul E. McKenney
  0 siblings, 1 reply; 33+ messages in thread
From: Mel Gorman @ 2013-12-11 16:40 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Andrew Morton, Peter Zijlstra, Alex Thorlton, Rik van Riel,
	Linux-MM, LKML

On Wed, Dec 11, 2013 at 06:44:47AM -0800, Paul E. McKenney wrote:
> On Wed, Dec 11, 2013 at 01:21:09PM +0000, Mel Gorman wrote:
> > According to documentation on barriers, stores issued before a LOCK can
> > complete after the lock implying that it's possible tlb_flush_pending can
> > be visible after a page table update. As per revised documentation, this patch
> > adds a smp_mb__before_spinlock to guarantee the correct ordering.
> > 
> > Cc: stable@vger.kernel.org
> > Signed-off-by: Mel Gorman <mgorman@suse.de>
> 
> Assuming that there is a lock acquisition after calls to
> set_tlb_flush_pending():
> 
> Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> 
> (I don't see set_tlb_flush_pending() in mainline.)
> 

It's introduced by a patch flight that is currently sitting in Andrew's
tree. In the case where we care about the value of tlb_flush_pending, a
spinlock will be taken. PMD or PTE split spinlocks or the mm->page_table_lock
depending on whether it is 3.13 or 3.12-stable and earlier kernels. I
pushed the relevant patches to this tree and branch

git://git.kernel.org/pub/scm/linux/kernel/git/mel/linux-balancenuma.git numab-instrument-serialise-v5r1

There is no guarantee the lock will be taken if there are no pages populated
in the region but we also do not care about flushing the TLB in that case
either. Does it matter that there is no guarantee a lock will be taken
after smp_mb__before_spinlock, just very likely that it will be?

-- 
Mel Gorman
SUSE Labs


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH] mm: numa: Guarantee that tlb_flush_pending updates are visible before page table updates
  2013-12-11 16:40     ` Mel Gorman
@ 2013-12-11 16:56       ` Paul E. McKenney
  0 siblings, 0 replies; 33+ messages in thread
From: Paul E. McKenney @ 2013-12-11 16:56 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Peter Zijlstra, Alex Thorlton, Rik van Riel,
	Linux-MM, LKML

On Wed, Dec 11, 2013 at 04:40:52PM +0000, Mel Gorman wrote:
> On Wed, Dec 11, 2013 at 06:44:47AM -0800, Paul E. McKenney wrote:
> > On Wed, Dec 11, 2013 at 01:21:09PM +0000, Mel Gorman wrote:
> > > According to documentation on barriers, stores issued before a LOCK can
> > > complete after the lock implying that it's possible tlb_flush_pending can
> > > be visible after a page table update. As per revised documentation, this patch
> > > adds a smp_mb__before_spinlock to guarantee the correct ordering.
> > > 
> > > Cc: stable@vger.kernel.org
> > > Signed-off-by: Mel Gorman <mgorman@suse.de>
> > 
> > Assuming that there is a lock acquisition after calls to
> > set_tlb_flush_pending():
> > 
> > Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> > 
> > (I don't see set_tlb_flush_pending() in mainline.)
> > 
> 
> It's introduced by a patch flight that is currently sitting in Andrew's
> tree. In the case where we care about the value of tlb_flush_pending, a
> spinlock will be taken. PMD or PTE split spinlocks or the mm->page_table_lock
> depending on whether it is 3.13 or 3.12-stable and earlier kernels. I
> pushed the relevant patches to this tree and branch
> 
> git://git.kernel.org/pub/scm/linux/kernel/git/mel/linux-balancenuma.git numab-instrument-serialise-v5r1
> 
> There is no guarantee the lock will be taken if there are no pages populated
> in the region but we also do not care about flushing the TLB in that case
> either. Does it matter that there is no guarantee a lock will be taken
> after smp_mb__before_spinlock, just very likely that it will be?

If you do smp_mb__before_spinlock() without a lock acquisition, no harm
will be done, other than possibly a bit of performance loss.  So you
should be OK.

							Thanx, Paul


^ permalink raw reply	[flat|nested] 33+ messages in thread

* [PATCH] mm: fix TLB flush race between migration, and change_protection_range -fix
  2013-12-10 15:51 ` [PATCH 11/18] mm: fix TLB flush race between migration, and change_protection_range Mel Gorman
@ 2013-12-11 19:12   ` Mel Gorman
  0 siblings, 0 replies; 33+ messages in thread
From: Mel Gorman @ 2013-12-11 19:12 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Alex Thorlton, Rik van Riel, Linux-MM, LKML

The following build error was reported by the 0-day build checker.

>> arch/arm/mm/context.c:51:18: error: 'tlb_flush_pending' redeclared as different kind of symbol
   include/linux/mm_types.h:477:91: note: previous definition of 'tlb_flush_pending' was here

This patch renames tlb_flush_pending to
mm_tlb_flush_pending. This is a fix for the -mm patch
mm-fix-tlb-flush-race-between-migration-and-change_protection_range.patch

Note that when slotted into place it will cause a conflict with
mm-numa-defer-tlb-flush-for-thp-migration-as-long-as-possible.patch . The
resolution is to delete the call from huge_memory.c and make sure the
tlb_flush_pending call in mm/migrate.c is renamed appropriately.
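
The clash itself is the usual C conflict between a function and an object
sharing one name in a translation unit. A deliberately non-building
two-declaration illustration (the types are stand-ins, not the actual arm
code):

	#include <stdbool.h>

	/* what the mm_types.h helper contributes to the unit ... */
	static inline bool tlb_flush_pending(void) { return false; }

	/* ... and a file-scope variable of the same name, as in
	 * arch/arm/mm/context.c, which gcc rejects with
	 * "redeclared as different kind of symbol" */
	static int tlb_flush_pending;

Renaming the helper to mm_tlb_flush_pending() avoids the collision without
touching the arm code.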

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 arch/x86/include/asm/pgtable.h | 2 +-
 include/linux/mm_types.h       | 4 ++--
 mm/huge_memory.c               | 2 +-
 3 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 48cab4c..bbc8b12 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -458,7 +458,7 @@ static inline bool pte_accessible(struct mm_struct *mm, pte_t a)
 		return true;
 
 	if ((pte_flags(a) & (_PAGE_PROTNONE | _PAGE_NUMA)) &&
-			tlb_flush_pending(mm))
+			mm_tlb_flush_pending(mm))
 		return true;
 
 	return false;
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index c122bb1..e5c49c3 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -474,7 +474,7 @@ static inline cpumask_t *mm_cpumask(struct mm_struct *mm)
  * The barriers below prevent the compiler from re-ordering the instructions
  * around the memory barriers that are already present in the code.
  */
-static inline bool tlb_flush_pending(struct mm_struct *mm)
+static inline bool mm_tlb_flush_pending(struct mm_struct *mm)
 {
 	barrier();
 	return mm->tlb_flush_pending;
@@ -491,7 +491,7 @@ static inline void clear_tlb_flush_pending(struct mm_struct *mm)
 	mm->tlb_flush_pending = false;
 }
 #else
-static inline bool tlb_flush_pending(struct mm_struct *mm)
+static inline bool mm_tlb_flush_pending(struct mm_struct *mm)
 {
 	return false;
 }
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index e3a5ee2..317a8ff 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1380,7 +1380,7 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	 * The page_table_lock above provides a memory barrier
 	 * with change_protection_range.
 	 */
-	if (tlb_flush_pending(mm))
+	if (mm_tlb_flush_pending(mm))
 		flush_tlb_range(vma, haddr, haddr + HPAGE_PMD_SIZE);
 
 	/*


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [PATCH 19/18] mm,numa: write pte_numa pte back to the page tables
  2013-12-10 15:51 ` [PATCH 05/18] mm: numa: Do not clear PTE for pte_numa update Mel Gorman
@ 2013-12-16 23:15   ` Rik van Riel
  0 siblings, 0 replies; 33+ messages in thread
From: Rik van Riel @ 2013-12-16 23:15 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Andrew Morton, Alex Thorlton, Linux-MM, LKML, chegu_vinod

On Tue, 10 Dec 2013 15:51:23 +0000
Mel Gorman <mgorman@suse.de> wrote:

> The TLB must be flushed if the PTE is updated but change_pte_range is clearing
> the PTE while marking PTEs pte_numa without necessarily flushing the TLB if it
> reinserts the same entry. Without the flush, it's conceivable that two processors
> have different TLBs for the same virtual address and at the very least it would
> generate spurious faults. This patch only unmaps the pages in change_pte_range for
> a full protection change.

Turns out the patch optimized out not one, but both
pte writes. Oops.

We'll need this one too, Andrew :)

---8<---

Subject: mm,numa: write pte_numa pte back to the page tables

The patch "mm: numa: Do not clear PTE for pte_numa update" cleverly
optimizes out an extraneous PTE write when changing the protection
of pages to pte_numa.

It also optimizes out actually writing the new pte_numa entry back
to the page tables. Oops.

Signed-off-by: Rik van Riel <riel@redhat.com>
Reported-by: Chegu Vinod <chegu_vinod@hp.com>
---
 mm/mprotect.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/mm/mprotect.c b/mm/mprotect.c
index edc4e22..4114acf 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -67,6 +67,7 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
 				if (page && !PageKsm(page)) {
 					if (!pte_numa(oldpte)) {
 						ptent = pte_mknuma(ptent);
+						set_pte_at(mm, addr, pte, ptent);
 						updated = true;
 					}
 				}


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* Re: [PATCH 10/18] mm: numa: Avoid unnecessary disruption of NUMA hinting during migration
  2013-12-10 15:51 ` [PATCH 10/18] mm: numa: Avoid unnecessary disruption of NUMA hinting during migration Mel Gorman
@ 2013-12-17 22:53   ` Sasha Levin
  2013-12-19 11:59     ` Mel Gorman
  2013-12-19 12:00     ` [PATCH] mm: Remove bogus warning in copy_huge_pmd Mel Gorman
  0 siblings, 2 replies; 33+ messages in thread
From: Sasha Levin @ 2013-12-17 22:53 UTC (permalink / raw)
  To: Mel Gorman, Andrew Morton; +Cc: Alex Thorlton, Rik van Riel, Linux-MM, LKML

Hi Mel,

On 12/10/2013 10:51 AM, Mel Gorman wrote:
> +
> +	/* mmap_sem prevents this happening but warn if that changes */
> +	WARN_ON(pmd_trans_migrating(pmd));
> +

I seem to be hitting this warning with latest -next kernel:

[ 1704.594807] WARNING: CPU: 28 PID: 35287 at mm/huge_memory.c:887 copy_huge_pmd+0x145/
0x3a0()
[ 1704.597258] Modules linked in:
[ 1704.597844] CPU: 28 PID: 35287 Comm: trinity-main Tainted: G        W    3.13.0-rc4-
next-20131217-sasha-00013-ga878504-dirty #4149
[ 1704.599924]  0000000000000377e delta! pid slot 27 [36258]: old:2 now:537927697 diff:
537927695 ffff8803593ddb90 ffffffff8439501c ffffffff854722c1
[ 1704.604846]  0000000000000000 ffff8803593ddbd0 ffffffff8112f8ac ffff8803593ddbe0
[ 1704.606391]  ffff88034bc137f0 ffff880e41677000 8000000b47c009e4 ffff88034a638000
[ 1704.608008] Call Trace:
[ 1704.608511]  [<ffffffff8439501c>] dump_stack+0x52/0x7f
[ 1704.609699]  [<ffffffff8112f8ac>] warn_slowpath_common+0x8c/0xc0
[ 1704.612617]  [<ffffffff8112f8fa>] warn_slowpath_null+0x1a/0x20
[ 1704.614043]  [<ffffffff812b91c5>] copy_huge_pmd+0x145/0x3a0
[ 1704.615587]  [<ffffffff8127e032>] copy_page_range+0x3f2/0x560
[ 1704.616869]  [<ffffffff81199ef1>] ? rwsem_wake+0x51/0x70
[ 1704.617942]  [<ffffffff8112cf59>] dup_mmap+0x2c9/0x3d0
[ 1704.619146]  [<ffffffff8112d54d>] dup_mm+0xad/0x150
[ 1704.620051]  [<ffffffff8112e178>] copy_process+0xa68/0x12e0
[ 1704.622976]  [<ffffffff81194eda>] ? __lock_release+0x1da/0x1f0
[ 1704.624234]  [<ffffffff8112eee6>] do_fork+0x96/0x270
[ 1704.624975]  [<ffffffff81249465>] ? context_tracking_user_exit+0x195/0x1d0
[ 1704.626427]  [<ffffffff811930ed>] ? trace_hardirqs_on+0xd/0x10
[ 1704.627681]  [<ffffffff8112f0d6>] SyS_clone+0x16/0x20
[ 1704.628833]  [<ffffffff843a6309>] stub_clone+0x69/0x90
[ 1704.629672]  [<ffffffff843a6150>] ? tracesys+0xdd/0xe2


Thanks,
Sasha


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 10/18] mm: numa: Avoid unnecessary disruption of NUMA hinting during migration
  2013-12-17 22:53   ` Sasha Levin
@ 2013-12-19 11:59     ` Mel Gorman
  2013-12-19 12:00     ` [PATCH] mm: Remove bogus warning in copy_huge_pmd Mel Gorman
  1 sibling, 0 replies; 33+ messages in thread
From: Mel Gorman @ 2013-12-19 11:59 UTC (permalink / raw)
  To: Sasha Levin; +Cc: Andrew Morton, Alex Thorlton, Rik van Riel, Linux-MM, LKML

On Tue, Dec 17, 2013 at 05:53:45PM -0500, Sasha Levin wrote:
> Hi Mel,
> 
> On 12/10/2013 10:51 AM, Mel Gorman wrote:
> >+
> >+	/* mmap_sem prevents this happening but warn if that changes */
> >+	WARN_ON(pmd_trans_migrating(pmd));
> >+
> 
> I seem to be hitting this warning with latest -next kernel:
> 

Patch will follow shortly. I appreciate these trinity bug reports but in
the future is there any chance you could include the trinity command line
and the config file you used? Details on the machine would also be nice. In
this case, knowing if the machine was NUMA or not would have been helpful.

Thanks!

-- 
Mel Gorman
SUSE Labs


^ permalink raw reply	[flat|nested] 33+ messages in thread

* [PATCH] mm: Remove bogus warning in copy_huge_pmd
  2013-12-17 22:53   ` Sasha Levin
  2013-12-19 11:59     ` Mel Gorman
@ 2013-12-19 12:00     ` Mel Gorman
  2013-12-19 18:36       ` Rik van Riel
  1 sibling, 1 reply; 33+ messages in thread
From: Mel Gorman @ 2013-12-19 12:00 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Alex Thorlton, Rik van Riel, Sasha Levin, Linux-MM, LKML

Sasha Levin reported the following warning being triggered

[ 1704.594807] WARNING: CPU: 28 PID: 35287 at mm/huge_memory.c:887 copy_huge_pmd+0x145/ 0x3a0()
[ 1704.597258] Modules linked in:
[ 1704.597844] CPU: 28 PID: 35287 Comm: trinity-main Tainted: G        W    3.13.0-rc4-next-20131217-sasha-00013-ga878504-dirty #4149
[ 1704.599924]  0000000000000377e delta! pid slot 27 [36258]: old:2 now:537927697 diff: 537927695 ffff8803593ddb90 ffffffff8439501c ffffffff854722c1
[ 1704.604846]  0000000000000000 ffff8803593ddbd0 ffffffff8112f8ac ffff8803593ddbe0
[ 1704.606391]  ffff88034bc137f0 ffff880e41677000 8000000b47c009e4 ffff88034a638000
[ 1704.608008] Call Trace:
[ 1704.608511]  [<ffffffff8439501c>] dump_stack+0x52/0x7f
[ 1704.609699]  [<ffffffff8112f8ac>] warn_slowpath_common+0x8c/0xc0
[ 1704.612617]  [<ffffffff8112f8fa>] warn_slowpath_null+0x1a/0x20
[ 1704.614043]  [<ffffffff812b91c5>] copy_huge_pmd+0x145/0x3a0
[ 1704.615587]  [<ffffffff8127e032>] copy_page_range+0x3f2/0x560
[ 1704.616869]  [<ffffffff81199ef1>] ? rwsem_wake+0x51/0x70
[ 1704.617942]  [<ffffffff8112cf59>] dup_mmap+0x2c9/0x3d0
[ 1704.619146]  [<ffffffff8112d54d>] dup_mm+0xad/0x150
[ 1704.620051]  [<ffffffff8112e178>] copy_process+0xa68/0x12e0
[ 1704.622976]  [<ffffffff81194eda>] ? __lock_release+0x1da/0x1f0
[ 1704.624234]  [<ffffffff8112eee6>] do_fork+0x96/0x270
[ 1704.624975]  [<ffffffff81249465>] ? context_tracking_user_exit+0x195/0x1d0
[ 1704.626427]  [<ffffffff811930ed>] ? trace_hardirqs_on+0xd/0x10
[ 1704.627681]  [<ffffffff8112f0d6>] SyS_clone+0x16/0x20
[ 1704.628833]  [<ffffffff843a6309>] stub_clone+0x69/0x90
[ 1704.629672]  [<ffffffff843a6150>] ? tracesys+0xdd/0xe2

This warning was introduced by "mm: numa: Avoid unnecessary disruption
of NUMA hinting during migration" for paranoia reasons but the warning
is bogus. I was thinking of parallel races between NUMA hinting faults
and forks but this warning would also be triggered by a parallel reclaim
splitting a THP during a fork. Remove the bogus warning.

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 mm/huge_memory.c | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index e3b6a75..468bd3a 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -883,9 +883,6 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 		goto out_unlock;
 	}
 
-	/* mmap_sem prevents this happening but warn if that changes */
-	WARN_ON(pmd_trans_migrating(pmd));
-
 	if (unlikely(pmd_trans_splitting(pmd))) {
 		/* split huge page running from under us */
 		spin_unlock(src_ptl);


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* Re: [PATCH] mm: Remove bogus warning in copy_huge_pmd
  2013-12-19 12:00     ` [PATCH] mm: Remove bogus warning in copy_huge_pmd Mel Gorman
@ 2013-12-19 18:36       ` Rik van Riel
  0 siblings, 0 replies; 33+ messages in thread
From: Rik van Riel @ 2013-12-19 18:36 UTC (permalink / raw)
  To: Mel Gorman, Andrew Morton; +Cc: Alex Thorlton, Sasha Levin, Linux-MM, LKML

On 12/19/2013 07:00 AM, Mel Gorman wrote:
> Sasha Levin reported the following warning being triggered
>
> [ 1704.594807] WARNING: CPU: 28 PID: 35287 at mm/huge_memory.c:887 copy_huge_pmd+0x145/ 0x3a0()
> [ 1704.597258] Modules linked in:
> [ 1704.597844] CPU: 28 PID: 35287 Comm: trinity-main Tainted: G        W    3.13.0-rc4-next-20131217-sasha-00013-ga878504-dirty #4149
> [ 1704.599924]  0000000000000377e delta! pid slot 27 [36258]: old:2 now:537927697 diff: 537927695 ffff8803593ddb90 ffffffff8439501c ffffffff854722c1
> [ 1704.604846]  0000000000000000 ffff8803593ddbd0 ffffffff8112f8ac ffff8803593ddbe0
> [ 1704.606391]  ffff88034bc137f0 ffff880e41677000 8000000b47c009e4 ffff88034a638000
> [ 1704.608008] Call Trace:
> [ 1704.608511]  [<ffffffff8439501c>] dump_stack+0x52/0x7f
> [ 1704.609699]  [<ffffffff8112f8ac>] warn_slowpath_common+0x8c/0xc0
> [ 1704.612617]  [<ffffffff8112f8fa>] warn_slowpath_null+0x1a/0x20
> [ 1704.614043]  [<ffffffff812b91c5>] copy_huge_pmd+0x145/0x3a0
> [ 1704.615587]  [<ffffffff8127e032>] copy_page_range+0x3f2/0x560
> [ 1704.616869]  [<ffffffff81199ef1>] ? rwsem_wake+0x51/0x70
> [ 1704.617942]  [<ffffffff8112cf59>] dup_mmap+0x2c9/0x3d0
> [ 1704.619146]  [<ffffffff8112d54d>] dup_mm+0xad/0x150
> [ 1704.620051]  [<ffffffff8112e178>] copy_process+0xa68/0x12e0
> [ 1704.622976]  [<ffffffff81194eda>] ? __lock_release+0x1da/0x1f0
> [ 1704.624234]  [<ffffffff8112eee6>] do_fork+0x96/0x270
> [ 1704.624975]  [<ffffffff81249465>] ? context_tracking_user_exit+0x195/0x1d0
> [ 1704.626427]  [<ffffffff811930ed>] ? trace_hardirqs_on+0xd/0x10
> [ 1704.627681]  [<ffffffff8112f0d6>] SyS_clone+0x16/0x20
> [ 1704.628833]  [<ffffffff843a6309>] stub_clone+0x69/0x90
> [ 1704.629672]  [<ffffffff843a6150>] ? tracesys+0xdd/0xe2
>
> This warning was introduced by "mm: numa: Avoid unnecessary disruption
> of NUMA hinting during migration" for paranoia reasons but the warning
> is bogus. I was thinking of parallel races between NUMA hinting faults
> and forks but this warning would also be triggered by a parallel reclaim
> splitting a THP during a fork. Remove the bogus warning.
>
> Signed-off-by: Mel Gorman <mgorman@suse.de>

Acked-by: Rik van Riel <riel@redhat.com>


^ permalink raw reply	[flat|nested] 33+ messages in thread

end of thread, other threads:[~2013-12-19 18:36 UTC | newest]

Thread overview: 33+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-12-10 15:51 [PATCH 00/17] NUMA balancing segmentation fault fixes and misc followups v4 Mel Gorman
2013-12-10 15:51 ` [PATCH 01/18] mm: numa: Serialise parallel get_user_page against THP migration Mel Gorman
2013-12-10 15:51 ` [PATCH 02/18] mm: numa: Call MMU notifiers on " Mel Gorman
2013-12-10 15:51 ` [PATCH 03/18] mm: Clear pmd_numa before invalidating Mel Gorman
2013-12-10 15:51 ` [PATCH 04/18] mm: numa: Do not clear PMD during PTE update scan Mel Gorman
2013-12-10 15:51 ` [PATCH 05/18] mm: numa: Do not clear PTE for pte_numa update Mel Gorman
2013-12-16 23:15   ` [PATCH 19/18] mm,numa: write pte_numa pte back to the page tables Rik van Riel
2013-12-10 15:51 ` [PATCH 06/18] mm: numa: Ensure anon_vma is locked to prevent parallel THP splits Mel Gorman
2013-12-10 15:51 ` [PATCH 07/18] mm: numa: Avoid unnecessary work on the failure path Mel Gorman
2013-12-10 15:51 ` [PATCH 08/18] sched: numa: Skip inaccessible VMAs Mel Gorman
2013-12-10 15:51 ` [PATCH 09/18] mm: numa: Clear numa hinting information on mprotect Mel Gorman
2013-12-10 15:51 ` [PATCH 10/18] mm: numa: Avoid unnecessary disruption of NUMA hinting during migration Mel Gorman
2013-12-17 22:53   ` Sasha Levin
2013-12-19 11:59     ` Mel Gorman
2013-12-19 12:00     ` [PATCH] mm: Remove bogus warning in copy_huge_pmd Mel Gorman
2013-12-19 18:36       ` Rik van Riel
2013-12-10 15:51 ` [PATCH 11/18] mm: fix TLB flush race between migration, and change_protection_range Mel Gorman
2013-12-11 19:12   ` [PATCH] mm: fix TLB flush race between migration, and change_protection_range -fix Mel Gorman
2013-12-10 15:51 ` [PATCH 12/18] mm: numa: Defer TLB flush for THP migration as long as possible Mel Gorman
2013-12-10 16:56   ` Rik van Riel
2013-12-10 15:51 ` [PATCH 13/18] mm: numa: Make NUMA-migrate related functions static Mel Gorman
2013-12-10 15:51 ` [PATCH 14/18] mm: numa: Limit scope of lock for NUMA migrate rate limiting Mel Gorman
2013-12-10 15:51 ` [PATCH 15/18] mm: numa: Trace tasks that fail migration due to " Mel Gorman
2013-12-10 15:51 ` [PATCH 16/18] mm: numa: Do not automatically migrate KSM pages Mel Gorman
2013-12-10 15:51 ` [PATCH 17/18] sched: Add tracepoints related to NUMA task migration Mel Gorman
2013-12-10 22:22   ` Andrew Morton
2013-12-11  8:37     ` Mel Gorman
2013-12-10 15:56 ` [PATCH 00/17] NUMA balancing segmentation fault fixes and misc followups v4 Mel Gorman
2013-12-11 13:21 ` [PATCH] mm: numa: Guarantee that tlb_flush_pending updates are visible before page table updates Mel Gorman
2013-12-11 14:44   ` Paul E. McKenney
2013-12-11 16:40     ` Mel Gorman
2013-12-11 16:56       ` Paul E. McKenney
2013-12-11 15:21   ` Rik van Riel
