From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-ea0-f173.google.com (mail-ea0-f173.google.com [209.85.215.173]) by kanga.kvack.org (Postfix) with ESMTP id 7A37C6B0035 for ; Tue, 10 Dec 2013 10:51:38 -0500 (EST) Received: by mail-ea0-f173.google.com with SMTP id o10so2335022eaj.32 for ; Tue, 10 Dec 2013 07:51:37 -0800 (PST) Received: from mx2.suse.de (cantor2.suse.de. [195.135.220.15]) by mx.google.com with ESMTP id m49si14894117eeg.73.2013.12.10.07.51.37 for ; Tue, 10 Dec 2013 07:51:37 -0800 (PST) From: Mel Gorman Subject: [PATCH 00/17] NUMA balancing segmentation fault fixes and misc followups v4 Date: Tue, 10 Dec 2013 15:51:18 +0000 Message-Id: <1386690695-27380-1-git-send-email-mgorman@suse.de> Sender: owner-linux-mm@kvack.org List-ID: To: Andrew Morton Cc: Alex Thorlton , Rik van Riel , Linux-MM , LKML , Mel Gorman

Changelog since V3
o Dropped a tracing patch
o Rebased to 3.13-rc3
o Removed unnecessary ptl acquisition

Alex Thorlton reported segmentation faults when NUMA balancing is enabled on large machines. There is no obvious explanation from the console what the problem is, but similar problems have been observed by Rik van Riel and myself if migration was aggressive enough.

Alex, this series is against 3.13-rc2, a verification that the fix addresses your problem would be appreciated.

This series starts with a range of patches aimed at addressing the segmentation fault problem while offsetting some of the cost to avoid badly regressing performance in -stable. Those that are cc'd to stable (patches 1-12) should be merged ASAP. The rest of the series is relatively minor stuff that fell out during the course of development that is ok to wait for the next merge window but should help with the continued development of NUMA balancing.

arch/sparc/include/asm/pgtable_64.h | 4 +- arch/x86/include/asm/pgtable.h | 11 +++- arch/x86/mm/gup.c | 13 +++++ include/asm-generic/pgtable.h | 2 +- include/linux/migrate.h | 9 ++++ include/linux/mm_types.h | 44 +++++++++++++++ include/linux/mmzone.h | 5 +- include/trace/events/migrate.h | 26 +++++++++ include/trace/events/sched.h | 87 ++++++++++++++++++++++++++++++ kernel/fork.c | 1 + kernel/sched/core.c | 2 + kernel/sched/fair.c | 24 +++++---- mm/huge_memory.c | 45 ++++++++++++---- mm/mempolicy.c | 6 +-- mm/migrate.c | 103 ++++++++++++++++++++++++++++-------- mm/mprotect.c | 15 ++++-- mm/pgtable-generic.c | 8 ++- 17 files changed, 347 insertions(+), 58 deletions(-) -- 1.8.4 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-wi0-f170.google.com (mail-wi0-f170.google.com [209.85.212.170]) by kanga.kvack.org (Postfix) with ESMTP id C63B76B0035 for ; Tue, 10 Dec 2013 10:51:38 -0500 (EST) Received: by mail-wi0-f170.google.com with SMTP id hq4so5529333wib.3 for ; Tue, 10 Dec 2013 07:51:38 -0800 (PST) Received: from mx2.suse.de (cantor2.suse.de. 
[195.135.220.15]) by mx.google.com with ESMTP id v6si4218320eel.133.2013.12.10.07.51.37 for ; Tue, 10 Dec 2013 07:51:38 -0800 (PST) From: Mel Gorman Subject: [PATCH 01/18] mm: numa: Serialise parallel get_user_page against THP migration Date: Tue, 10 Dec 2013 15:51:19 +0000 Message-Id: <1386690695-27380-2-git-send-email-mgorman@suse.de> In-Reply-To: <1386690695-27380-1-git-send-email-mgorman@suse.de> References: <1386690695-27380-1-git-send-email-mgorman@suse.de> Sender: owner-linux-mm@kvack.org List-ID: To: Andrew Morton Cc: Alex Thorlton , Rik van Riel , Linux-MM , LKML , Mel Gorman

Base pages are unmapped and flushed from cache and TLB during normal page migration and replaced with a migration entry that causes any parallel fault or gup to block until migration completes. THP does not unmap pages due to a lack of support for migration entries at a PMD level. This allows races with get_user_pages and get_user_pages_fast which commit 3f926ab94 ("mm: Close races between THP migration and PMD numa clearing") made worse by introducing a pmdp_clear_flush().

This patch forces get_user_page (fast and normal) on a pmd_numa page to go through the slow get_user_page path where it will serialise against THP migration and properly account for the NUMA hinting fault. On the migration side the page table lock is taken for each PTE update.

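The fast-path bail-out described above can be illustrated with a small userspace model. This is only a sketch: the `toy_*` names, the flag bits and the fault counter are hypothetical stand-ins invented for illustration, not kernel definitions, and the slow path here merely counts the hinting fault where the kernel would take locks and wait for any migration in progress.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical flag bits for a toy page table entry; not the kernel layout. */
#define TOY_PRESENT (1u << 0)
#define TOY_NUMA    (1u << 1)

struct toy_entry {
	uint32_t flags;
	int page;	/* stand-in for a struct page pointer */
};

/* Number of NUMA hinting faults accounted by the slow path. */
static int toy_numa_faults;

/*
 * Lockless fast path: succeeds only for a present entry that is not
 * marked for NUMA hinting. A marked entry makes it give up, which is
 * the role of the "return 0" added to the x86 fast GUP walkers.
 */
static bool toy_gup_fast(const struct toy_entry *e, int *page)
{
	if (e->flags & TOY_NUMA)
		return false;
	if (!(e->flags & TOY_PRESENT))
		return false;
	*page = e->page;
	return true;
}

/*
 * Slow path: accounts the hinting fault and makes the entry present
 * again. Serialisation against migration is not modelled here.
 */
static bool toy_gup_slow(struct toy_entry *e, int *page)
{
	if (e->flags & TOY_NUMA) {
		toy_numa_faults++;
		e->flags &= ~TOY_NUMA;
		e->flags |= TOY_PRESENT;
	}
	if (!(e->flags & TOY_PRESENT))
		return false;
	*page = e->page;
	return true;
}

/* Try the fast path first and fall back to the slow path. */
static bool toy_gup(struct toy_entry *e, int *page)
{
	return toy_gup_fast(e, page) || toy_gup_slow(e, page);
}
```

In this model a lookup on a NUMA-marked entry always reaches `toy_gup_slow()` once, so the hinting fault is counted exactly once before the fast path can succeed again.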
Cc: stable@vger.kernel.org Reviewed-by: Rik van Riel Signed-off-by: Mel Gorman --- arch/x86/mm/gup.c | 13 +++++++++++++ mm/huge_memory.c | 24 ++++++++++++++++-------- mm/migrate.c | 38 +++++++++++++++++++++++++++++++------- 3 files changed, 60 insertions(+), 15 deletions(-) diff --git a/arch/x86/mm/gup.c b/arch/x86/mm/gup.c index dd74e46..0596e8e 100644 --- a/arch/x86/mm/gup.c +++ b/arch/x86/mm/gup.c @@ -83,6 +83,12 @@ static noinline int gup_pte_range(pmd_t pmd, unsigned long addr, pte_t pte = gup_get_pte(ptep); struct page *page; + /* Similar to the PMD case, NUMA hinting must take slow path */ + if (pte_numa(pte)) { + pte_unmap(ptep); + return 0; + } + if ((pte_flags(pte) & (mask | _PAGE_SPECIAL)) != mask) { pte_unmap(ptep); return 0; @@ -167,6 +173,13 @@ static int gup_pmd_range(pud_t pud, unsigned long addr, unsigned long end, if (pmd_none(pmd) || pmd_trans_splitting(pmd)) return 0; if (unlikely(pmd_large(pmd))) { + /* + * NUMA hinting faults need to be handled in the GUP + * slowpath for accounting purposes and so that they + * can be serialised against THP migration. + */ + if (pmd_numa(pmd)) + return 0; if (!gup_huge_pmd(pmd, addr, next, write, pages, nr)) return 0; } else { diff --git a/mm/huge_memory.c b/mm/huge_memory.c index bccd5a6..deae592 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -1243,6 +1243,10 @@ struct page *follow_trans_huge_pmd(struct vm_area_struct *vma, if ((flags & FOLL_DUMP) && is_huge_zero_pmd(*pmd)) return ERR_PTR(-EFAULT); + /* Full NUMA hinting faults to serialise migration in fault paths */ + if ((flags & FOLL_NUMA) && pmd_numa(*pmd)) + goto out; + page = pmd_page(*pmd); VM_BUG_ON(!PageHead(page)); if (flags & FOLL_TOUCH) { @@ -1323,23 +1327,27 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma, /* If the page was locked, there are no parallel migrations */ if (page_locked) goto clear_pmdnuma; + } - /* - * Otherwise wait for potential migrations and retry. 
We do - * relock and check_same as the page may no longer be mapped. - * As the fault is being retried, do not account for it. - */ + /* + * If there are potential migrations, wait for completion and retry. We + * do not relock and check_same as the page may no longer be mapped. + * Furtermore, even if the page is currently misplaced, there is no + * guarantee it is still misplaced after the migration completes. + */ + if (!page_locked) { spin_unlock(ptl); wait_on_page_locked(page); page_nid = -1; goto out; } - /* Page is misplaced, serialise migrations and parallel THP splits */ + /* + * Page is misplaced. Page lock serialises migrations. Acquire anon_vma + * to serialises splits + */ get_page(page); spin_unlock(ptl); - if (!page_locked) - lock_page(page); anon_vma = page_lock_anon_vma_read(page); /* Confirm the PMD did not change while page_table_lock was released */ diff --git a/mm/migrate.c b/mm/migrate.c index bb94004..2cabbd5 100644 --- a/mm/migrate.c +++ b/mm/migrate.c @@ -1722,6 +1722,7 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm, struct page *new_page = NULL; struct mem_cgroup *memcg = NULL; int page_lru = page_is_file_cache(page); + pmd_t orig_entry; /* * Rate-limit the amount of data that is being migrated to a node. 
@@ -1756,7 +1757,8 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm, /* Recheck the target PMD */ ptl = pmd_lock(mm, pmd); - if (unlikely(!pmd_same(*pmd, entry))) { + if (unlikely(!pmd_same(*pmd, entry) || page_count(page) != 2)) { +fail_putback: spin_unlock(ptl); /* Reverse changes made by migrate_page_copy() */ @@ -1786,16 +1788,34 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm, */ mem_cgroup_prepare_migration(page, new_page, &memcg); + orig_entry = *pmd; entry = mk_pmd(new_page, vma->vm_page_prot); - entry = pmd_mknonnuma(entry); - entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma); entry = pmd_mkhuge(entry); + entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma); + /* + * Clear the old entry under pagetable lock and establish the new PTE. + * Any parallel GUP will either observe the old page blocking on the + * page lock, block on the page table lock or observe the new page. + * The SetPageUptodate on the new page and page_add_new_anon_rmap + * guarantee the copy is visible before the pagetable update. 
+ */ + flush_cache_range(vma, haddr, haddr + HPAGE_PMD_SIZE); + page_add_new_anon_rmap(new_page, vma, haddr); pmdp_clear_flush(vma, haddr, pmd); set_pmd_at(mm, haddr, pmd, entry); - page_add_new_anon_rmap(new_page, vma, haddr); update_mmu_cache_pmd(vma, address, &entry); + + if (page_count(page) != 2) { + set_pmd_at(mm, haddr, pmd, orig_entry); + flush_tlb_range(vma, haddr, haddr + HPAGE_PMD_SIZE); + update_mmu_cache_pmd(vma, address, &entry); + page_remove_rmap(new_page); + goto fail_putback; + } + page_remove_rmap(page); + /* * Finish the charge transaction under the page table lock to * prevent split_huge_page() from dividing up the charge @@ -1820,9 +1840,13 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm, out_fail: count_vm_events(PGMIGRATE_FAIL, HPAGE_PMD_NR); out_dropref: - entry = pmd_mknonnuma(entry); - set_pmd_at(mm, haddr, pmd, entry); - update_mmu_cache_pmd(vma, address, &entry); + ptl = pmd_lock(mm, pmd); + if (pmd_same(*pmd, entry)) { + entry = pmd_mknonnuma(entry); + set_pmd_at(mm, haddr, pmd, entry); + update_mmu_cache_pmd(vma, address, &entry); + } + spin_unlock(ptl); unlock_page(page); put_page(page); -- 1.8.4 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-ee0-f43.google.com (mail-ee0-f43.google.com [74.125.83.43]) by kanga.kvack.org (Postfix) with ESMTP id 6069F6B0038 for ; Tue, 10 Dec 2013 10:51:39 -0500 (EST) Received: by mail-ee0-f43.google.com with SMTP id c13so2323583eek.2 for ; Tue, 10 Dec 2013 07:51:38 -0800 (PST) Received: from mx2.suse.de (cantor2.suse.de. 
[195.135.220.15]) by mx.google.com with ESMTP id j47si14853087eeo.200.2013.12.10.07.51.38 for ; Tue, 10 Dec 2013 07:51:38 -0800 (PST) From: Mel Gorman Subject: [PATCH 02/18] mm: numa: Call MMU notifiers on THP migration Date: Tue, 10 Dec 2013 15:51:20 +0000 Message-Id: <1386690695-27380-3-git-send-email-mgorman@suse.de> In-Reply-To: <1386690695-27380-1-git-send-email-mgorman@suse.de> References: <1386690695-27380-1-git-send-email-mgorman@suse.de> Sender: owner-linux-mm@kvack.org List-ID: To: Andrew Morton Cc: Alex Thorlton , Rik van Riel , Linux-MM , LKML , Mel Gorman MMU notifiers must be called on THP page migration or secondary MMUs will get very confused. Cc: stable@vger.kernel.org Signed-off-by: Mel Gorman Reviewed-by: Rik van Riel --- mm/migrate.c | 22 ++++++++++++++-------- 1 file changed, 14 insertions(+), 8 deletions(-) diff --git a/mm/migrate.c b/mm/migrate.c index 2cabbd5..be787d5 100644 --- a/mm/migrate.c +++ b/mm/migrate.c @@ -36,6 +36,7 @@ #include #include #include +#include #include @@ -1716,12 +1717,13 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm, struct page *page, int node) { spinlock_t *ptl; - unsigned long haddr = address & HPAGE_PMD_MASK; pg_data_t *pgdat = NODE_DATA(node); int isolated = 0; struct page *new_page = NULL; struct mem_cgroup *memcg = NULL; int page_lru = page_is_file_cache(page); + unsigned long mmun_start = address & HPAGE_PMD_MASK; + unsigned long mmun_end = mmun_start + HPAGE_PMD_SIZE; pmd_t orig_entry; /* @@ -1756,10 +1758,12 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm, WARN_ON(PageLRU(new_page)); /* Recheck the target PMD */ + mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end); ptl = pmd_lock(mm, pmd); if (unlikely(!pmd_same(*pmd, entry) || page_count(page) != 2)) { fail_putback: spin_unlock(ptl); + mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end); /* Reverse changes made by migrate_page_copy() */ if (TestClearPageActive(new_page)) @@ -1800,15 +1804,16 @@ fail_putback: * 
The SetPageUptodate on the new page and page_add_new_anon_rmap * guarantee the copy is visible before the pagetable update. */ - flush_cache_range(vma, haddr, haddr + HPAGE_PMD_SIZE); - page_add_new_anon_rmap(new_page, vma, haddr); - pmdp_clear_flush(vma, haddr, pmd); - set_pmd_at(mm, haddr, pmd, entry); + flush_cache_range(vma, mmun_start, mmun_end); + page_add_new_anon_rmap(new_page, vma, mmun_start); + pmdp_clear_flush(vma, mmun_start, pmd); + set_pmd_at(mm, mmun_start, pmd, entry); + flush_tlb_range(vma, mmun_start, mmun_end); update_mmu_cache_pmd(vma, address, &entry); if (page_count(page) != 2) { - set_pmd_at(mm, haddr, pmd, orig_entry); - flush_tlb_range(vma, haddr, haddr + HPAGE_PMD_SIZE); + set_pmd_at(mm, mmun_start, pmd, orig_entry); + flush_tlb_range(vma, mmun_start, mmun_end); update_mmu_cache_pmd(vma, address, &entry); page_remove_rmap(new_page); goto fail_putback; @@ -1823,6 +1828,7 @@ fail_putback: */ mem_cgroup_end_migration(memcg, page, new_page, true); spin_unlock(ptl); + mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end); unlock_page(new_page); unlock_page(page); @@ -1843,7 +1849,7 @@ out_dropref: ptl = pmd_lock(mm, pmd); if (pmd_same(*pmd, entry)) { entry = pmd_mknonnuma(entry); - set_pmd_at(mm, haddr, pmd, entry); + set_pmd_at(mm, mmun_start, pmd, entry); update_mmu_cache_pmd(vma, address, &entry); } spin_unlock(ptl); -- 1.8.4 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-ea0-f180.google.com (mail-ea0-f180.google.com [209.85.215.180]) by kanga.kvack.org (Postfix) with ESMTP id 2344D6B0038 for ; Tue, 10 Dec 2013 10:51:40 -0500 (EST) Received: by mail-ea0-f180.google.com with SMTP id f15so2343738eak.11 for ; Tue, 10 Dec 2013 07:51:39 -0800 (PST) Received: from mx2.suse.de (cantor2.suse.de. 
[195.135.220.15]) by mx.google.com with ESMTP id t6si14871039eeh.129.2013.12.10.07.51.39 for ; Tue, 10 Dec 2013 07:51:39 -0800 (PST) From: Mel Gorman Subject: [PATCH 03/18] mm: Clear pmd_numa before invalidating Date: Tue, 10 Dec 2013 15:51:21 +0000 Message-Id: <1386690695-27380-4-git-send-email-mgorman@suse.de> In-Reply-To: <1386690695-27380-1-git-send-email-mgorman@suse.de> References: <1386690695-27380-1-git-send-email-mgorman@suse.de> Sender: owner-linux-mm@kvack.org List-ID: To: Andrew Morton Cc: Alex Thorlton , Rik van Riel , Linux-MM , LKML , Mel Gorman pmdp_invalidate clears the present bit without taking into account that it might be in the _PAGE_NUMA bit leaving the PMD in an unexpected state. Clear pmd_numa before invalidating. Cc: stable@vger.kernel.org Signed-off-by: Mel Gorman Reviewed-by: Rik van Riel --- mm/pgtable-generic.c | 3 +++ 1 file changed, 3 insertions(+) diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c index cbb3854..e84cad2 100644 --- a/mm/pgtable-generic.c +++ b/mm/pgtable-generic.c @@ -191,6 +191,9 @@ pgtable_t pgtable_trans_huge_withdraw(struct mm_struct *mm, pmd_t *pmdp) void pmdp_invalidate(struct vm_area_struct *vma, unsigned long address, pmd_t *pmdp) { + pmd_t entry = *pmdp; + if (pmd_numa(entry)) + entry = pmd_mknonnuma(entry); set_pmd_at(vma->vm_mm, address, pmdp, pmd_mknotpresent(*pmdp)); flush_tlb_range(vma, address, address + HPAGE_PMD_SIZE); } -- 1.8.4 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-ea0-f173.google.com (mail-ea0-f173.google.com [209.85.215.173]) by kanga.kvack.org (Postfix) with ESMTP id A81496B003A for ; Tue, 10 Dec 2013 10:51:40 -0500 (EST) Received: by mail-ea0-f173.google.com with SMTP id o10so2299740eaj.18 for ; Tue, 10 Dec 2013 07:51:40 -0800 (PST) Received: from mx2.suse.de (cantor2.suse.de. [195.135.220.15]) by mx.google.com with ESMTP id e48si14851646eeh.197.2013.12.10.07.51.39 for ; Tue, 10 Dec 2013 07:51:40 -0800 (PST) From: Mel Gorman Subject: [PATCH 04/18] mm: numa: Do not clear PMD during PTE update scan Date: Tue, 10 Dec 2013 15:51:22 +0000 Message-Id: <1386690695-27380-5-git-send-email-mgorman@suse.de> In-Reply-To: <1386690695-27380-1-git-send-email-mgorman@suse.de> References: <1386690695-27380-1-git-send-email-mgorman@suse.de> Sender: owner-linux-mm@kvack.org List-ID: To: Andrew Morton Cc: Alex Thorlton , Rik van Riel , Linux-MM , LKML , Mel Gorman If the PMD is flushed then a parallel fault in handle_mm_fault() will enter the pmd_none and do_huge_pmd_anonymous_page() path where it'll attempt to insert a huge zero page. This is wasteful so the patch avoids clearing the PMD when setting pmd_numa. Cc: stable@vger.kernel.org Signed-off-by: Mel Gorman Reviewed-by: Rik van Riel --- mm/huge_memory.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/mm/huge_memory.c b/mm/huge_memory.c index deae592..5a5da50 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -1529,7 +1529,7 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd, */ if (!is_huge_zero_page(page) && !pmd_numa(*pmd)) { - entry = pmdp_get_and_clear(mm, addr, pmd); + entry = *pmd; entry = pmd_mknuma(entry); ret = HPAGE_PMD_NR; } -- 1.8.4 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-ee0-f44.google.com (mail-ee0-f44.google.com [74.125.83.44]) by kanga.kvack.org (Postfix) with ESMTP id 6D1896B003B for ; Tue, 10 Dec 2013 10:51:41 -0500 (EST) Received: by mail-ee0-f44.google.com with SMTP id b57so2322195eek.17 for ; Tue, 10 Dec 2013 07:51:40 -0800 (PST) Received: from mx2.suse.de (cantor2.suse.de. [195.135.220.15]) by mx.google.com with ESMTP id 5si14864214eei.165.2013.12.10.07.51.40 for ; Tue, 10 Dec 2013 07:51:40 -0800 (PST) From: Mel Gorman Subject: [PATCH 05/18] mm: numa: Do not clear PTE for pte_numa update Date: Tue, 10 Dec 2013 15:51:23 +0000 Message-Id: <1386690695-27380-6-git-send-email-mgorman@suse.de> In-Reply-To: <1386690695-27380-1-git-send-email-mgorman@suse.de> References: <1386690695-27380-1-git-send-email-mgorman@suse.de> Sender: owner-linux-mm@kvack.org List-ID: To: Andrew Morton Cc: Alex Thorlton , Rik van Riel , Linux-MM , LKML , Mel Gorman The TLB must be flushed if the PTE is updated but change_pte_range is clearing the PTE while marking PTEs pte_numa without necessarily flushing the TLB if it reinserts the same entry. Without the flush, it's conceivable that two processors have different TLBs for the same virtual address and at the very least it would generate spurious faults. This patch only unmaps the pages in change_pte_range for a full protection change. 
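As a rough userspace illustration of the split described above, the sketch below clears the entry only for a full protection change, while the NUMA hinting case reads the entry in place and writes it back only when it actually changes. All names and flag bits are hypothetical; real page table updates and TLB flushing are not modelled.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical flag bits for a toy PTE; not the kernel layout. */
#define TOY_PRESENT (1u << 0)
#define TOY_WRITE   (1u << 1)
#define TOY_NUMA    (1u << 2)

struct toy_result {
	bool cleared;	/* the entry was transiently cleared */
	bool updated;	/* the entry value changed, so a TLB flush is needed */
};

static struct toy_result toy_change_pte(uint32_t *pte, uint32_t newprot,
					bool prot_numa)
{
	struct toy_result r = { false, false };
	uint32_t ent;

	if (!prot_numa) {
		/* Full protection change: clear, modify, then write back. */
		ent = *pte;
		*pte = 0;
		r.cleared = true;
		ent = (ent & ~TOY_WRITE) | (newprot & TOY_WRITE);
		*pte = ent;
		r.updated = true;
	} else {
		/* NUMA hinting: read in place, never clear the entry. */
		ent = *pte;
		if (!(ent & TOY_NUMA)) {
			ent = (ent & ~TOY_PRESENT) | TOY_NUMA;
			*pte = ent;
			r.updated = true;
		}
	}
	return r;
}
```

With this shape, an entry that is already marked for hinting is left untouched, so there is no window in which the same value is removed and reinserted without a corresponding flush.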
Cc: stable@vger.kernel.org Signed-off-by: Mel Gorman Reviewed-by: Rik van Riel --- mm/mprotect.c | 8 ++++++-- 1 file changed, 6 insertions(+), 2 deletions(-) diff --git a/mm/mprotect.c b/mm/mprotect.c index 2666797..0a07e2d 100644 --- a/mm/mprotect.c +++ b/mm/mprotect.c @@ -52,13 +52,14 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd, pte_t ptent; bool updated = false; - ptent = ptep_modify_prot_start(mm, addr, pte); if (!prot_numa) { + ptent = ptep_modify_prot_start(mm, addr, pte); ptent = pte_modify(ptent, newprot); updated = true; } else { struct page *page; + ptent = *pte; page = vm_normal_page(vma, addr, oldpte); if (page) { if (!pte_numa(oldpte)) { @@ -79,7 +80,10 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd, if (updated) pages++; - ptep_modify_prot_commit(mm, addr, pte, ptent); + + /* Only !prot_numa always clears the pte */ + if (!prot_numa) + ptep_modify_prot_commit(mm, addr, pte, ptent); } else if (IS_ENABLED(CONFIG_MIGRATION) && !pte_file(oldpte)) { swp_entry_t entry = pte_to_swp_entry(oldpte); -- 1.8.4 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-ee0-f46.google.com (mail-ee0-f46.google.com [74.125.83.46]) by kanga.kvack.org (Postfix) with ESMTP id 17DEF6B003C for ; Tue, 10 Dec 2013 10:51:41 -0500 (EST) Received: by mail-ee0-f46.google.com with SMTP id d49so2312117eek.19 for ; Tue, 10 Dec 2013 07:51:41 -0800 (PST) Received: from mx2.suse.de (cantor2.suse.de. 
[195.135.220.15]) by mx.google.com with ESMTP id a9si14897424eew.54.2013.12.10.07.51.41 for ; Tue, 10 Dec 2013 07:51:41 -0800 (PST) From: Mel Gorman Subject: [PATCH 06/18] mm: numa: Ensure anon_vma is locked to prevent parallel THP splits Date: Tue, 10 Dec 2013 15:51:24 +0000 Message-Id: <1386690695-27380-7-git-send-email-mgorman@suse.de> In-Reply-To: <1386690695-27380-1-git-send-email-mgorman@suse.de> References: <1386690695-27380-1-git-send-email-mgorman@suse.de> Sender: owner-linux-mm@kvack.org List-ID: To: Andrew Morton Cc: Alex Thorlton , Rik van Riel , Linux-MM , LKML , Mel Gorman The anon_vma lock prevents parallel THP splits and any associated complexity that arises when handling splits during THP migration. This patch checks if the lock was successfully acquired and bails from THP migration if it failed for any reason. Cc: stable@vger.kernel.org Signed-off-by: Mel Gorman Reviewed-by: Rik van Riel --- mm/huge_memory.c | 7 +++++++ 1 file changed, 7 insertions(+) diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 5a5da50..0f00b96 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -1359,6 +1359,13 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma, goto out_unlock; } + /* Bail if we fail to protect against THP splits for any reason */ + if (unlikely(!anon_vma)) { + put_page(page); + page_nid = -1; + goto clear_pmdnuma; + } + /* * Migrate the THP to the requested node, returns with page unlocked * and pmd_numa cleared. -- 1.8.4 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-ee0-f51.google.com (mail-ee0-f51.google.com [74.125.83.51]) by kanga.kvack.org (Postfix) with ESMTP id BE3316B003D for ; Tue, 10 Dec 2013 10:51:42 -0500 (EST) Received: by mail-ee0-f51.google.com with SMTP id b15so2327991eek.24 for ; Tue, 10 Dec 2013 07:51:42 -0800 (PST) Received: from mx2.suse.de (cantor2.suse.de. [195.135.220.15]) by mx.google.com with ESMTP id m44si14854346eeo.205.2013.12.10.07.51.41 for ; Tue, 10 Dec 2013 07:51:42 -0800 (PST) From: Mel Gorman Subject: [PATCH 07/18] mm: numa: Avoid unnecessary work on the failure path Date: Tue, 10 Dec 2013 15:51:25 +0000 Message-Id: <1386690695-27380-8-git-send-email-mgorman@suse.de> In-Reply-To: <1386690695-27380-1-git-send-email-mgorman@suse.de> References: <1386690695-27380-1-git-send-email-mgorman@suse.de> Sender: owner-linux-mm@kvack.org List-ID: To: Andrew Morton Cc: Alex Thorlton , Rik van Riel , Linux-MM , LKML , Mel Gorman If a PMD changes during a THP migration then migration aborts but the failure path is doing more work than is necessary. Cc: stable@vger.kernel.org Signed-off-by: Mel Gorman Reviewed-by: Rik van Riel --- mm/migrate.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/mm/migrate.c b/mm/migrate.c index be787d5..a987525 100644 --- a/mm/migrate.c +++ b/mm/migrate.c @@ -1780,7 +1780,8 @@ fail_putback: putback_lru_page(page); mod_zone_page_state(page_zone(page), NR_ISOLATED_ANON + page_lru, -HPAGE_PMD_NR); - goto out_fail; + + goto out_unlock; } /* @@ -1854,6 +1855,7 @@ out_dropref: } spin_unlock(ptl); +out_unlock: unlock_page(page); put_page(page); return 0; -- 1.8.4 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-ee0-f47.google.com (mail-ee0-f47.google.com [74.125.83.47]) by kanga.kvack.org (Postfix) with ESMTP id 62AE56B0044 for ; Tue, 10 Dec 2013 10:51:43 -0500 (EST) Received: by mail-ee0-f47.google.com with SMTP id e51so2248616eek.20 for ; Tue, 10 Dec 2013 07:51:42 -0800 (PST) Received: from mx2.suse.de (cantor2.suse.de. [195.135.220.15]) by mx.google.com with ESMTP id t6si14852305eeh.213.2013.12.10.07.51.42 for ; Tue, 10 Dec 2013 07:51:42 -0800 (PST) From: Mel Gorman Subject: [PATCH 08/18] sched: numa: Skip inaccessible VMAs Date: Tue, 10 Dec 2013 15:51:26 +0000 Message-Id: <1386690695-27380-9-git-send-email-mgorman@suse.de> In-Reply-To: <1386690695-27380-1-git-send-email-mgorman@suse.de> References: <1386690695-27380-1-git-send-email-mgorman@suse.de> Sender: owner-linux-mm@kvack.org List-ID: To: Andrew Morton Cc: Alex Thorlton , Rik van Riel , Linux-MM , LKML , Mel Gorman Inaccessible VMA should not be trapping NUMA hint faults. Skip them. Cc: stable@vger.kernel.org Signed-off-by: Mel Gorman Reviewed-by: Rik van Riel --- kernel/sched/fair.c | 7 +++++++ 1 file changed, 7 insertions(+) diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index fd773ad..18bf84e 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -1752,6 +1752,13 @@ void task_numa_work(struct callback_head *work) (vma->vm_file && (vma->vm_flags & (VM_READ|VM_WRITE)) == (VM_READ))) continue; + /* + * Skip inaccessible VMAs to avoid any confusion between + * PROT_NONE and NUMA hinting ptes + */ + if (!(vma->vm_flags & (VM_READ | VM_EXEC | VM_WRITE))) + continue; + do { start = max(start, vma->vm_start); end = ALIGN(start + (pages << PAGE_SHIFT), HPAGE_SIZE); -- 1.8.4 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-ea0-f180.google.com (mail-ea0-f180.google.com [209.85.215.180]) by kanga.kvack.org (Postfix) with ESMTP id 0BE166B0044 for ; Tue, 10 Dec 2013 10:51:43 -0500 (EST) Received: by mail-ea0-f180.google.com with SMTP id f15so2321309eak.25 for ; Tue, 10 Dec 2013 07:51:43 -0800 (PST) Received: from mx2.suse.de (cantor2.suse.de. [195.135.220.15]) by mx.google.com with ESMTP id r9si14884139eeo.86.2013.12.10.07.51.43 for ; Tue, 10 Dec 2013 07:51:43 -0800 (PST) From: Mel Gorman Subject: [PATCH 09/18] mm: numa: Clear numa hinting information on mprotect Date: Tue, 10 Dec 2013 15:51:27 +0000 Message-Id: <1386690695-27380-10-git-send-email-mgorman@suse.de> In-Reply-To: <1386690695-27380-1-git-send-email-mgorman@suse.de> References: <1386690695-27380-1-git-send-email-mgorman@suse.de> Sender: owner-linux-mm@kvack.org List-ID: To: Andrew Morton Cc: Alex Thorlton , Rik van Riel , Linux-MM , LKML , Mel Gorman On a protection change it is no longer clear if the page should be still accessible. This patch clears the NUMA hinting fault bits on a protection change. 
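A minimal userspace sketch of that ordering follows: the hinting marker is dropped first and only then are the new protection bits applied. The `toy_*` helpers and flag bits are assumptions made up for this example and only loosely mirror the roles of pte_mknonnuma() and pte_modify().

```c
#include <stdint.h>

/* Hypothetical flag bits for a toy PTE; not the kernel layout. */
#define TOY_PRESENT   (1u << 0)
#define TOY_WRITE     (1u << 1)
#define TOY_NUMA      (1u << 2)
#define TOY_PROT_MASK (TOY_PRESENT | TOY_WRITE)

/* Drop the hinting marker and make the entry present again. */
static uint32_t toy_mknonnuma(uint32_t ent)
{
	return (ent & ~TOY_NUMA) | TOY_PRESENT;
}

/*
 * Apply a protection change. Any hinting marker is cleared first so the
 * result reflects only the requested protection, never a stale marker.
 */
static uint32_t toy_protect(uint32_t ent, uint32_t newprot)
{
	if (ent & TOY_NUMA)
		ent = toy_mknonnuma(ent);
	return (ent & ~TOY_PROT_MASK) | (newprot & TOY_PROT_MASK);
}
```

Bits outside `TOY_PROT_MASK` and `TOY_NUMA`, such as a frame number, pass through unchanged in this model.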
Cc: stable@vger.kernel.org Signed-off-by: Mel Gorman Reviewed-by: Rik van Riel --- mm/huge_memory.c | 2 ++ mm/mprotect.c | 2 ++ 2 files changed, 4 insertions(+) diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 0f00b96..0ecaba2 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -1522,6 +1522,8 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd, ret = 1; if (!prot_numa) { entry = pmdp_get_and_clear(mm, addr, pmd); + if (pmd_numa(entry)) + entry = pmd_mknonnuma(entry); entry = pmd_modify(entry, newprot); ret = HPAGE_PMD_NR; BUG_ON(pmd_write(entry)); diff --git a/mm/mprotect.c b/mm/mprotect.c index 0a07e2d..eb2f349 100644 --- a/mm/mprotect.c +++ b/mm/mprotect.c @@ -54,6 +54,8 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd, if (!prot_numa) { ptent = ptep_modify_prot_start(mm, addr, pte); + if (pte_numa(ptent)) + ptent = pte_mknonnuma(ptent); ptent = pte_modify(ptent, newprot); updated = true; } else { -- 1.8.4 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-ea0-f179.google.com (mail-ea0-f179.google.com [209.85.215.179]) by kanga.kvack.org (Postfix) with ESMTP id DE7C66B0055 for ; Tue, 10 Dec 2013 10:51:44 -0500 (EST) Received: by mail-ea0-f179.google.com with SMTP id r15so2336800ead.24 for ; Tue, 10 Dec 2013 07:51:44 -0800 (PST) Received: from mx2.suse.de (cantor2.suse.de. 
[195.135.220.15])
	by mx.google.com with ESMTP id l2si14859892een.167.2013.12.10.07.51.44
	for ; Tue, 10 Dec 2013 07:51:44 -0800 (PST)
From: Mel Gorman
Subject: [PATCH 10/18] mm: numa: Avoid unnecessary disruption of NUMA hinting during migration
Date: Tue, 10 Dec 2013 15:51:28 +0000
Message-Id: <1386690695-27380-11-git-send-email-mgorman@suse.de>
In-Reply-To: <1386690695-27380-1-git-send-email-mgorman@suse.de>
References: <1386690695-27380-1-git-send-email-mgorman@suse.de>
Sender: owner-linux-mm@kvack.org
List-ID:
To: Andrew Morton
Cc: Alex Thorlton , Rik van Riel , Linux-MM , LKML , Mel Gorman

do_huge_pmd_numa_page() handles the case where there is parallel THP
migration. However, by the time it is checked the NUMA hinting information
has already been disrupted. This patch adds an earlier check with some
helpers.

Cc: stable@vger.kernel.org
Signed-off-by: Mel Gorman
Reviewed-by: Rik van Riel
---
 include/linux/migrate.h |  9 +++++++++
 mm/huge_memory.c        | 22 ++++++++++++++++------
 mm/migrate.c            | 12 ++++++++++++
 3 files changed, 37 insertions(+), 6 deletions(-)

diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index f5096b5..b7717d7 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -90,10 +90,19 @@ static inline int migrate_huge_page_move_mapping(struct address_space *mapping,
 #endif /* CONFIG_MIGRATION */
 
 #ifdef CONFIG_NUMA_BALANCING
+extern bool pmd_trans_migrating(pmd_t pmd);
+extern void wait_migrate_huge_page(struct anon_vma *anon_vma, pmd_t *pmd);
 extern int migrate_misplaced_page(struct page *page,
 				  struct vm_area_struct *vma, int node);
 extern bool migrate_ratelimited(int node);
 #else
+static inline bool pmd_trans_migrating(pmd_t pmd)
+{
+	return false;
+}
+static inline void wait_migrate_huge_page(struct anon_vma *anon_vma, pmd_t *pmd)
+{
+}
 static inline int migrate_misplaced_page(struct page *page,
 					 struct vm_area_struct *vma, int node)
 {
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 0ecaba2..e3b6a75 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -882,6 +882,10 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 		ret = 0;
 		goto out_unlock;
 	}
+
+	/* mmap_sem prevents this happening but warn if that changes */
+	WARN_ON(pmd_trans_migrating(pmd));
+
 	if (unlikely(pmd_trans_splitting(pmd))) {
 		/* split huge page running from under us */
 		spin_unlock(src_ptl);
@@ -1299,6 +1303,17 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	if (unlikely(!pmd_same(pmd, *pmdp)))
 		goto out_unlock;
 
+	/*
+	 * If there are potential migrations, wait for completion and retry
+	 * without disrupting NUMA hinting information. Do not relock and
+	 * check_same as the page may no longer be mapped.
+	 */
+	if (unlikely(pmd_trans_migrating(*pmdp))) {
+		spin_unlock(ptl);
+		wait_migrate_huge_page(vma->anon_vma, pmdp);
+		goto out;
+	}
+
 	page = pmd_page(pmd);
 	BUG_ON(is_huge_zero_page(page));
 	page_nid = page_to_nid(page);
@@ -1329,12 +1344,7 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 			goto clear_pmdnuma;
 	}
 
-	/*
-	 * If there are potential migrations, wait for completion and retry. We
-	 * do not relock and check_same as the page may no longer be mapped.
-	 * Furtermore, even if the page is currently misplaced, there is no
-	 * guarantee it is still misplaced after the migration completes.
-	 */
+	/* Migration could have started since the pmd_trans_migrating check */
 	if (!page_locked) {
 		spin_unlock(ptl);
 		wait_on_page_locked(page);
diff --git a/mm/migrate.c b/mm/migrate.c
index a987525..cfb4190 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1655,6 +1655,18 @@ int numamigrate_isolate_page(pg_data_t *pgdat, struct page *page)
 	return 1;
 }
 
+bool pmd_trans_migrating(pmd_t pmd)
+{
+	struct page *page = pmd_page(pmd);
+	return PageLocked(page);
+}
+
+void wait_migrate_huge_page(struct anon_vma *anon_vma, pmd_t *pmd)
+{
+	struct page *page = pmd_page(*pmd);
+	wait_on_page_locked(page);
+}
+
 /*
  * Attempt to migrate a misplaced page to the specified destination
  * node. Caller is expected to have an elevated reference count on
--
1.8.4

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: email@kvack.org

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from mail-ea0-f177.google.com (mail-ea0-f177.google.com [209.85.215.177])
	by kanga.kvack.org (Postfix) with ESMTP id 978656B005A
	for ; Tue, 10 Dec 2013 10:51:45 -0500 (EST)
Received: by mail-ea0-f177.google.com with SMTP id n15so2328889ead.22
	for ; Tue, 10 Dec 2013 07:51:45 -0800 (PST)
Received: from mx2.suse.de (cantor2.suse.de.
[195.135.220.15])
	by mx.google.com with ESMTP id u49si14903127eep.43.2013.12.10.07.51.44
	for ; Tue, 10 Dec 2013 07:51:44 -0800 (PST)
From: Mel Gorman
Subject: [PATCH 11/18] mm: fix TLB flush race between migration, and change_protection_range
Date: Tue, 10 Dec 2013 15:51:29 +0000
Message-Id: <1386690695-27380-12-git-send-email-mgorman@suse.de>
In-Reply-To: <1386690695-27380-1-git-send-email-mgorman@suse.de>
References: <1386690695-27380-1-git-send-email-mgorman@suse.de>
Sender: owner-linux-mm@kvack.org
List-ID:
To: Andrew Morton
Cc: Alex Thorlton , Rik van Riel , Linux-MM , LKML , Mel Gorman

From: Rik van Riel

There are a few subtle races, between change_protection_range (used by
mprotect and change_prot_numa) on one side, and NUMA page migration and
compaction on the other side.

The basic race is that there is a time window between when the PTE gets
made non-present (PROT_NONE or NUMA), and the TLB is flushed.

During that time, a CPU may continue writing to the page.

This is fine most of the time, however compaction or the NUMA migration
code may come in, and migrate the page away.

When that happens, the CPU may continue writing, through the cached
translation, to what is no longer the current memory location of the
process.

This only affects x86, which has a somewhat optimistic pte_accessible.
All other architectures appear to be safe, and will either always flush,
or flush whenever there is a valid mapping, even with no permissions
(SPARC).

The basic race looks like this:

CPU A                   CPU B                   CPU C

                                                load TLB entry
make entry PTE/PMD_NUMA
                        fault on entry
                                                read/write old page
                        start migrating page
                        change PTE/PMD to new page
                                                read/write old page [*]
flush TLB
                                                reload TLB from new entry
                                                read/write new page
                                                lose data

[*] the old page may belong to a new user at this point!

The obvious fix is to flush remote TLB entries, by making sure that
pte_accessible is aware of the fact that PROT_NONE and PROT_NUMA memory
may still be accessible if there is a TLB flush pending for the mm.

This should fix both NUMA migration and compaction.

Cc: stable@vger.kernel.org
Signed-off-by: Rik van Riel
Signed-off-by: Mel Gorman
---
 arch/sparc/include/asm/pgtable_64.h |  4 ++--
 arch/x86/include/asm/pgtable.h      | 11 ++++++++--
 include/asm-generic/pgtable.h       |  2 +-
 include/linux/mm_types.h            | 44 +++++++++++++++++++++++++++++++++++++
 kernel/fork.c                       |  1 +
 mm/huge_memory.c                    |  7 ++++++
 mm/mprotect.c                       |  2 ++
 mm/pgtable-generic.c                |  5 +++--
 8 files changed, 69 insertions(+), 7 deletions(-)

diff --git a/arch/sparc/include/asm/pgtable_64.h b/arch/sparc/include/asm/pgtable_64.h
index 8358dc1..0f9e945 100644
--- a/arch/sparc/include/asm/pgtable_64.h
+++ b/arch/sparc/include/asm/pgtable_64.h
@@ -619,7 +619,7 @@ static inline unsigned long pte_present(pte_t pte)
 }
 
 #define pte_accessible pte_accessible
-static inline unsigned long pte_accessible(pte_t a)
+static inline unsigned long pte_accessible(struct mm_struct *mm, pte_t a)
 {
 	return pte_val(a) & _PAGE_VALID;
 }
@@ -847,7 +847,7 @@ static inline void __set_pte_at(struct mm_struct *mm, unsigned long addr,
 	 * SUN4V NOTE: _PAGE_VALID is the same value in both the SUN4U
 	 *             and SUN4V pte layout, so this inline test is fine.
 	 */
-	if (likely(mm != &init_mm) && pte_accessible(orig))
+	if (likely(mm != &init_mm) && pte_accessible(mm, orig))
 		tlb_batch_add(mm, addr, ptep, orig, fullmm);
 }
 
diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 3d19994..48cab4c 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -452,9 +452,16 @@ static inline int pte_present(pte_t a)
 }
 
 #define pte_accessible pte_accessible
-static inline int pte_accessible(pte_t a)
+static inline bool pte_accessible(struct mm_struct *mm, pte_t a)
 {
-	return pte_flags(a) & _PAGE_PRESENT;
+	if (pte_flags(a) & _PAGE_PRESENT)
+		return true;
+
+	if ((pte_flags(a) & (_PAGE_PROTNONE | _PAGE_NUMA)) &&
+			tlb_flush_pending(mm))
+		return true;
+
+	return false;
 }
 
 static inline int pte_hidden(pte_t pte)
diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
index f330d28..b12079a 100644
--- a/include/asm-generic/pgtable.h
+++ b/include/asm-generic/pgtable.h
@@ -217,7 +217,7 @@ static inline int pmd_same(pmd_t pmd_a, pmd_t pmd_b)
 #endif
 
 #ifndef pte_accessible
-# define pte_accessible(pte)		((void)(pte),1)
+# define pte_accessible(mm, pte)	((void)(pte), 1)
 #endif
 
 #ifndef flush_tlb_fix_spurious_fault
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index bd29941..c122bb1 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -443,6 +443,14 @@ struct mm_struct {
 	/* numa_scan_seq prevents two threads setting pte_numa */
 	int numa_scan_seq;
 #endif
+#if defined(CONFIG_NUMA_BALANCING) || defined(CONFIG_COMPACTION)
+	/*
+	 * An operation with batched TLB flushing is going on. Anything that
+	 * can move process memory needs to flush the TLB when moving a
+	 * PROT_NONE or PROT_NUMA mapped page.
+	 */
+	bool tlb_flush_pending;
+#endif
 	struct uprobes_state uprobes_state;
 };
 
@@ -459,4 +467,40 @@ static inline cpumask_t *mm_cpumask(struct mm_struct *mm)
 	return mm->cpu_vm_mask_var;
 }
 
+#if defined(CONFIG_NUMA_BALANCING) || defined(CONFIG_COMPACTION)
+/*
+ * Memory barriers to keep this state in sync are graciously provided by
+ * the page table locks, outside of which no page table modifications happen.
+ * The barriers below prevent the compiler from re-ordering the instructions
+ * around the memory barriers that are already present in the code.
+ */
+static inline bool tlb_flush_pending(struct mm_struct *mm)
+{
+	barrier();
+	return mm->tlb_flush_pending;
+}
+static inline void set_tlb_flush_pending(struct mm_struct *mm)
+{
+	mm->tlb_flush_pending = true;
+	barrier();
+}
+/* Clearing is done after a TLB flush, which also provides a barrier. */
+static inline void clear_tlb_flush_pending(struct mm_struct *mm)
+{
+	barrier();
+	mm->tlb_flush_pending = false;
+}
+#else
+static inline bool tlb_flush_pending(struct mm_struct *mm)
+{
+	return false;
+}
+static inline void set_tlb_flush_pending(struct mm_struct *mm)
+{
+}
+static inline void clear_tlb_flush_pending(struct mm_struct *mm)
+{
+}
+#endif
+
 #endif /* _LINUX_MM_TYPES_H */
diff --git a/kernel/fork.c b/kernel/fork.c
index 728d5be..5721f0e 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -537,6 +537,7 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p)
 	spin_lock_init(&mm->page_table_lock);
 	mm_init_aio(mm);
 	mm_init_owner(mm, p);
+	clear_tlb_flush_pending(mm);
 
 	if (likely(!mm_alloc_pgd(mm))) {
 		mm->def_flags = 0;
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index e3b6a75..e3a5ee2 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1377,6 +1377,13 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	}
 
 	/*
+	 * The page_table_lock above provides a memory barrier
+	 * with change_protection_range.
+	 */
+	if (tlb_flush_pending(mm))
+		flush_tlb_range(vma, haddr, haddr + HPAGE_PMD_SIZE);
+
+	/*
 	 * Migrate the THP to the requested node, returns with page unlocked
 	 * and pmd_numa cleared.
 	 */
diff --git a/mm/mprotect.c b/mm/mprotect.c
index eb2f349..9b1be30 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -187,6 +187,7 @@ static unsigned long change_protection_range(struct vm_area_struct *vma,
 	BUG_ON(addr >= end);
 	pgd = pgd_offset(mm, addr);
 	flush_cache_range(vma, addr, end);
+	set_tlb_flush_pending(mm);
 	do {
 		next = pgd_addr_end(addr, end);
 		if (pgd_none_or_clear_bad(pgd))
@@ -198,6 +199,7 @@ static unsigned long change_protection_range(struct vm_area_struct *vma,
 	/* Only flush the TLB if we actually modified any entries: */
 	if (pages)
 		flush_tlb_range(vma, start, end);
+	clear_tlb_flush_pending(mm);
 
 	return pages;
 }
diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
index e84cad2..a8b9199 100644
--- a/mm/pgtable-generic.c
+++ b/mm/pgtable-generic.c
@@ -110,9 +110,10 @@ int pmdp_clear_flush_young(struct vm_area_struct *vma,
 pte_t ptep_clear_flush(struct vm_area_struct *vma, unsigned long address,
 		       pte_t *ptep)
 {
+	struct mm_struct *mm = (vma)->vm_mm;
 	pte_t pte;
-	pte = ptep_get_and_clear((vma)->vm_mm, address, ptep);
-	if (pte_accessible(pte))
+	pte = ptep_get_and_clear(mm, address, ptep);
+	if (pte_accessible(mm, pte))
 		flush_tlb_page(vma, address);
 	return pte;
 }
--
1.8.4

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: email@kvack.org

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from mail-ee0-f51.google.com (mail-ee0-f51.google.com [74.125.83.51])
	by kanga.kvack.org (Postfix) with ESMTP id 3850F6B005C
	for ; Tue, 10 Dec 2013 10:51:46 -0500 (EST)
Received: by mail-ee0-f51.google.com with SMTP id b15so2328019eek.24
	for ; Tue, 10 Dec 2013 07:51:45 -0800 (PST)
Received: from mx2.suse.de (cantor2.suse.de.
[195.135.220.15])
	by mx.google.com with ESMTP id j47si14890872eeo.74.2013.12.10.07.51.45
	for ; Tue, 10 Dec 2013 07:51:45 -0800 (PST)
From: Mel Gorman
Subject: [PATCH 12/18] mm: numa: Defer TLB flush for THP migration as long as possible
Date: Tue, 10 Dec 2013 15:51:30 +0000
Message-Id: <1386690695-27380-13-git-send-email-mgorman@suse.de>
In-Reply-To: <1386690695-27380-1-git-send-email-mgorman@suse.de>
References: <1386690695-27380-1-git-send-email-mgorman@suse.de>
Sender: owner-linux-mm@kvack.org
List-ID:
To: Andrew Morton
Cc: Alex Thorlton , Rik van Riel , Linux-MM , LKML , Mel Gorman

THP migration can fail for a variety of reasons. Avoid flushing the TLB
to deal with THP migration races until the copy is ready to start.

Cc: stable@vger.kernel.org
Signed-off-by: Mel Gorman
---
 mm/huge_memory.c | 7 -------
 mm/migrate.c     | 3 +++
 2 files changed, 3 insertions(+), 7 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index e3a5ee2..e3b6a75 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1377,13 +1377,6 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	}
 
 	/*
-	 * The page_table_lock above provides a memory barrier
-	 * with change_protection_range.
-	 */
-	if (tlb_flush_pending(mm))
-		flush_tlb_range(vma, haddr, haddr + HPAGE_PMD_SIZE);
-
-	/*
 	 * Migrate the THP to the requested node, returns with page unlocked
 	 * and pmd_numa cleared.
 	 */
diff --git a/mm/migrate.c b/mm/migrate.c
index cfb4190..0c4fbf6 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1759,6 +1759,9 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
 		goto out_fail;
 	}
 
+	if (tlb_flush_pending(mm))
+		flush_tlb_range(vma, mmun_start, mmun_end);
+
 	/* Prepare a page as a migration target */
 	__set_page_locked(new_page);
 	SetPageSwapBacked(new_page);
--
1.8.4

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: email@kvack.org

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from mail-wg0-f43.google.com (mail-wg0-f43.google.com [74.125.82.43])
	by kanga.kvack.org (Postfix) with ESMTP id E3C0B6B0062
	for ; Tue, 10 Dec 2013 10:51:46 -0500 (EST)
Received: by mail-wg0-f43.google.com with SMTP id k14so5153638wgh.10
	for ; Tue, 10 Dec 2013 07:51:46 -0800 (PST)
Received: from mx2.suse.de (cantor2.suse.de. [195.135.220.15])
	by mx.google.com with ESMTP id s42si14873220eew.140.2013.12.10.07.51.46
	for ; Tue, 10 Dec 2013 07:51:46 -0800 (PST)
From: Mel Gorman
Subject: [PATCH 13/18] mm: numa: Make NUMA-migrate related functions static
Date: Tue, 10 Dec 2013 15:51:31 +0000
Message-Id: <1386690695-27380-14-git-send-email-mgorman@suse.de>
In-Reply-To: <1386690695-27380-1-git-send-email-mgorman@suse.de>
References: <1386690695-27380-1-git-send-email-mgorman@suse.de>
Sender: owner-linux-mm@kvack.org
List-ID:
To: Andrew Morton
Cc: Alex Thorlton , Rik van Riel , Linux-MM , LKML , Mel Gorman

numamigrate_update_ratelimit and numamigrate_isolate_page only have callers
in mm/migrate.c. This patch makes them static.

Signed-off-by: Mel Gorman
Reviewed-by: Rik van Riel
---
 mm/migrate.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/mm/migrate.c b/mm/migrate.c
index 0c4fbf6..b6eef65 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1593,7 +1593,8 @@ bool migrate_ratelimited(int node)
 }
 
 /* Returns true if the node is migrate rate-limited after the update */
-bool numamigrate_update_ratelimit(pg_data_t *pgdat, unsigned long nr_pages)
+static bool numamigrate_update_ratelimit(pg_data_t *pgdat,
+					unsigned long nr_pages)
 {
 	bool rate_limited = false;
 
@@ -1617,7 +1618,7 @@ bool numamigrate_update_ratelimit(pg_data_t *pgdat, unsigned long nr_pages)
 	return rate_limited;
 }
 
-int numamigrate_isolate_page(pg_data_t *pgdat, struct page *page)
+static int numamigrate_isolate_page(pg_data_t *pgdat, struct page *page)
 {
 	int page_lru;
 
--
1.8.4

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: email@kvack.org

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from mail-we0-f171.google.com (mail-we0-f171.google.com [74.125.82.171])
	by kanga.kvack.org (Postfix) with ESMTP id 95CD66B0069
	for ; Tue, 10 Dec 2013 10:51:47 -0500 (EST)
Received: by mail-we0-f171.google.com with SMTP id q58so5231354wes.16
	for ; Tue, 10 Dec 2013 07:51:47 -0800 (PST)
Received: from mx2.suse.de (cantor2.suse.de.
[195.135.220.15])
	by mx.google.com with ESMTP id p46si14849865eem.210.2013.12.10.07.51.46
	for ; Tue, 10 Dec 2013 07:51:46 -0800 (PST)
From: Mel Gorman
Subject: [PATCH 14/18] mm: numa: Limit scope of lock for NUMA migrate rate limiting
Date: Tue, 10 Dec 2013 15:51:32 +0000
Message-Id: <1386690695-27380-15-git-send-email-mgorman@suse.de>
In-Reply-To: <1386690695-27380-1-git-send-email-mgorman@suse.de>
References: <1386690695-27380-1-git-send-email-mgorman@suse.de>
Sender: owner-linux-mm@kvack.org
List-ID:
To: Andrew Morton
Cc: Alex Thorlton , Rik van Riel , Linux-MM , LKML , Mel Gorman

NUMA migrate rate limiting protects a migration counter and window using
a lock but in some cases this can be a contended lock. It is not
critical that the number of pages be perfect, lost updates are
acceptable. Reduce the importance of this lock.

Signed-off-by: Mel Gorman
Reviewed-by: Rik van Riel
---
 include/linux/mmzone.h |  5 +----
 mm/migrate.c           | 21 ++++++++++++---------
 2 files changed, 13 insertions(+), 13 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index bd791e4..b835d3f 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -758,10 +758,7 @@ typedef struct pglist_data {
 	int kswapd_max_order;
 	enum zone_type classzone_idx;
 #ifdef CONFIG_NUMA_BALANCING
-	/*
-	 * Lock serializing the per destination node AutoNUMA memory
-	 * migration rate limiting data.
-	 */
+	/* Lock serializing the migrate rate limiting window */
 	spinlock_t numabalancing_migrate_lock;
 
 	/* Rate limiting time interval */
diff --git a/mm/migrate.c b/mm/migrate.c
index b6eef65..564d5c9 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1596,26 +1596,29 @@ bool migrate_ratelimited(int node)
 static bool numamigrate_update_ratelimit(pg_data_t *pgdat,
 					unsigned long nr_pages)
 {
-	bool rate_limited = false;
-
 	/*
 	 * Rate-limit the amount of data that is being migrated to a node.
 	 * Optimal placement is no good if the memory bus is saturated and
 	 * all the time is being spent migrating!
 	 */
-	spin_lock(&pgdat->numabalancing_migrate_lock);
 	if (time_after(jiffies, pgdat->numabalancing_migrate_next_window)) {
+		spin_lock(&pgdat->numabalancing_migrate_lock);
 		pgdat->numabalancing_migrate_nr_pages = 0;
 		pgdat->numabalancing_migrate_next_window = jiffies +
 			msecs_to_jiffies(migrate_interval_millisecs);
+		spin_unlock(&pgdat->numabalancing_migrate_lock);
 	}
 	if (pgdat->numabalancing_migrate_nr_pages > ratelimit_pages)
-		rate_limited = true;
-	else
-		pgdat->numabalancing_migrate_nr_pages += nr_pages;
-	spin_unlock(&pgdat->numabalancing_migrate_lock);
-
-	return rate_limited;
+		return true;
+
+	/*
+	 * This is an unlocked non-atomic update so errors are possible.
+	 * The consequences are failing to migrate when we potentially should
+	 * have which is not severe enough to warrant locking. If it is ever
+	 * a problem, it can be converted to a per-cpu counter.
+	 */
+	pgdat->numabalancing_migrate_nr_pages += nr_pages;
+	return false;
 }
 
 static int numamigrate_isolate_page(pg_data_t *pgdat, struct page *page)
--
1.8.4

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: email@kvack.org

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from mail-ee0-f49.google.com (mail-ee0-f49.google.com [74.125.83.49])
	by kanga.kvack.org (Postfix) with ESMTP id 607126B0069
	for ; Tue, 10 Dec 2013 10:51:48 -0500 (EST)
Received: by mail-ee0-f49.google.com with SMTP id c41so2324222eek.8
	for ; Tue, 10 Dec 2013 07:51:47 -0800 (PST)
Received: from mx2.suse.de (cantor2.suse.de.
[195.135.220.15])
	by mx.google.com with ESMTP id a9si14867618eew.159.2013.12.10.07.51.47
	for ; Tue, 10 Dec 2013 07:51:47 -0800 (PST)
From: Mel Gorman
Subject: [PATCH 15/18] mm: numa: Trace tasks that fail migration due to rate limiting
Date: Tue, 10 Dec 2013 15:51:33 +0000
Message-Id: <1386690695-27380-16-git-send-email-mgorman@suse.de>
In-Reply-To: <1386690695-27380-1-git-send-email-mgorman@suse.de>
References: <1386690695-27380-1-git-send-email-mgorman@suse.de>
Sender: owner-linux-mm@kvack.org
List-ID:
To: Andrew Morton
Cc: Alex Thorlton , Rik van Riel , Linux-MM , LKML , Mel Gorman

A low local/remote numa hinting fault ratio is potentially explained by
failed migrations. This patch adds a tracepoint that fires when migration
fails due to migration rate limitation.

Signed-off-by: Mel Gorman
Reviewed-by: Rik van Riel
---
 include/trace/events/migrate.h | 26 ++++++++++++++++++++++++++
 mm/migrate.c                   |  5 ++++-
 2 files changed, 30 insertions(+), 1 deletion(-)

diff --git a/include/trace/events/migrate.h b/include/trace/events/migrate.h
index ec2a6cc..3075ffb 100644
--- a/include/trace/events/migrate.h
+++ b/include/trace/events/migrate.h
@@ -45,6 +45,32 @@ TRACE_EVENT(mm_migrate_pages,
 		__print_symbolic(__entry->reason, MIGRATE_REASON))
 );
 
+TRACE_EVENT(mm_numa_migrate_ratelimit,
+
+	TP_PROTO(struct task_struct *p, int dst_nid, unsigned long nr_pages),
+
+	TP_ARGS(p, dst_nid, nr_pages),
+
+	TP_STRUCT__entry(
+		__array(	char,		comm,	TASK_COMM_LEN)
+		__field(	pid_t,		pid)
+		__field(	int,		dst_nid)
+		__field(	unsigned long,	nr_pages)
+	),
+
+	TP_fast_assign(
+		memcpy(__entry->comm, p->comm, TASK_COMM_LEN);
+		__entry->pid		= p->pid;
+		__entry->dst_nid	= dst_nid;
+		__entry->nr_pages	= nr_pages;
+	),
+
+	TP_printk("comm=%s pid=%d dst_nid=%d nr_pages=%lu",
+		__entry->comm,
+		__entry->pid,
+		__entry->dst_nid,
+		__entry->nr_pages)
+);
 #endif /* _TRACE_MIGRATE_H */
 
 /* This part must be outside protection */
diff --git a/mm/migrate.c b/mm/migrate.c
index 564d5c9..8dc277d 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1608,8 +1608,11 @@ static bool numamigrate_update_ratelimit(pg_data_t *pgdat,
 			msecs_to_jiffies(migrate_interval_millisecs);
 		spin_unlock(&pgdat->numabalancing_migrate_lock);
 	}
-	if (pgdat->numabalancing_migrate_nr_pages > ratelimit_pages)
+	if (pgdat->numabalancing_migrate_nr_pages > ratelimit_pages) {
+		trace_mm_numa_migrate_ratelimit(current, pgdat->node_id,
+								nr_pages);
 		return true;
+	}
 
 	/*
 	 * This is an unlocked non-atomic update so errors are possible.
--
1.8.4

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: email@kvack.org

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from mail-wi0-f171.google.com (mail-wi0-f171.google.com [209.85.212.171])
	by kanga.kvack.org (Postfix) with ESMTP id 1A0E16B006E
	for ; Tue, 10 Dec 2013 10:51:49 -0500 (EST)
Received: by mail-wi0-f171.google.com with SMTP id bz8so5494491wib.16
	for ; Tue, 10 Dec 2013 07:51:48 -0800 (PST)
Received: from mx2.suse.de (cantor2.suse.de. [195.135.220.15])
	by mx.google.com with ESMTP id u49si14846482eep.211.2013.12.10.07.51.48
	for ; Tue, 10 Dec 2013 07:51:48 -0800 (PST)
From: Mel Gorman
Subject: [PATCH 16/18] mm: numa: Do not automatically migrate KSM pages
Date: Tue, 10 Dec 2013 15:51:34 +0000
Message-Id: <1386690695-27380-17-git-send-email-mgorman@suse.de>
In-Reply-To: <1386690695-27380-1-git-send-email-mgorman@suse.de>
References: <1386690695-27380-1-git-send-email-mgorman@suse.de>
Sender: owner-linux-mm@kvack.org
List-ID:
To: Andrew Morton
Cc: Alex Thorlton , Rik van Riel , Linux-MM , LKML , Mel Gorman

KSM pages can be shared between tasks that are not necessarily related
to each other from a NUMA perspective. This patch causes those pages to
be ignored by automatic NUMA balancing so they do not migrate and do not
cause unrelated tasks to be grouped together.

Signed-off-by: Mel Gorman
Reviewed-by: Rik van Riel
---
 mm/mprotect.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/mm/mprotect.c b/mm/mprotect.c
index 9b1be30..c258137 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -23,6 +23,7 @@
 #include
 #include
 #include
+#include
 #include
 #include
 #include
@@ -63,7 +64,7 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
 
 				ptent = *pte;
 				page = vm_normal_page(vma, addr, oldpte);
-				if (page) {
+				if (page && !PageKsm(page)) {
 					if (!pte_numa(oldpte)) {
 						ptent = pte_mknuma(ptent);
 						updated = true;
--
1.8.4

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: email@kvack.org

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from mail-ee0-f50.google.com (mail-ee0-f50.google.com [74.125.83.50])
	by kanga.kvack.org (Postfix) with ESMTP id C83456B006E
	for ; Tue, 10 Dec 2013 10:51:49 -0500 (EST)
Received: by mail-ee0-f50.google.com with SMTP id c41so2382757eek.9
	for ; Tue, 10 Dec 2013 07:51:49 -0800 (PST)
Received: from mx2.suse.de (cantor2.suse.de.
[195.135.220.15])
	by mx.google.com with ESMTP id j47si14879081eeo.95.2013.12.10.07.51.48
	for ; Tue, 10 Dec 2013 07:51:49 -0800 (PST)
From: Mel Gorman
Subject: [PATCH 17/18] sched: Add tracepoints related to NUMA task migration
Date: Tue, 10 Dec 2013 15:51:35 +0000
Message-Id: <1386690695-27380-18-git-send-email-mgorman@suse.de>
In-Reply-To: <1386690695-27380-1-git-send-email-mgorman@suse.de>
References: <1386690695-27380-1-git-send-email-mgorman@suse.de>
Sender: owner-linux-mm@kvack.org
List-ID:
To: Andrew Morton
Cc: Alex Thorlton , Rik van Riel , Linux-MM , LKML , Mel Gorman

This patch adds three tracepoints
 o trace_sched_move_numa   when a task is moved to a node
 o trace_sched_swap_numa   when a task is swapped with another task
 o trace_sched_stick_numa  when a numa-related migration fails

The tracepoints allow the NUMA scheduler activity to be monitored and the
following high-level metrics can be calculated

 o NUMA migrated stuck    nr trace_sched_stick_numa
 o NUMA migrated idle     nr trace_sched_move_numa
 o NUMA migrated swapped  nr trace_sched_swap_numa
 o NUMA local swapped     trace_sched_swap_numa src_nid == dst_nid (should never happen)
 o NUMA remote swapped    trace_sched_swap_numa src_nid != dst_nid (should == NUMA migrated swapped)
 o NUMA group swapped     trace_sched_swap_numa src_ngid == dst_ngid
                          Maybe a small number of these are acceptable
                          but a high number would be a major surprise.
                          It would be even worse if bounces are frequent.
 o NUMA avg task migs.    Average number of migrations for tasks
 o NUMA stddev task mig   Self-explanatory
 o NUMA max task migs.    Maximum number of migrations for a single task

In general the intent of the tracepoints is to help diagnose problems
where automatic NUMA balancing appears to be doing an excessive amount of
useless work.

Signed-off-by: Mel Gorman
Reviewed-by: Rik van Riel
---
 include/trace/events/sched.h | 87 ++++++++++++++++++++++++++++++++++++++++++++
 kernel/sched/core.c          |  2 +
 kernel/sched/fair.c          |  6 ++-
 3 files changed, 93 insertions(+), 2 deletions(-)

diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h
index 04c3084..67e1bbf 100644
--- a/include/trace/events/sched.h
+++ b/include/trace/events/sched.h
@@ -443,6 +443,93 @@ TRACE_EVENT(sched_process_hang,
 );
 #endif /* CONFIG_DETECT_HUNG_TASK */
 
+DECLARE_EVENT_CLASS(sched_move_task_template,
+
+	TP_PROTO(struct task_struct *tsk, int src_cpu, int dst_cpu),
+
+	TP_ARGS(tsk, src_cpu, dst_cpu),
+
+	TP_STRUCT__entry(
+		__field( pid_t,	pid			)
+		__field( pid_t,	tgid			)
+		__field( pid_t,	ngid			)
+		__field( int,	src_cpu			)
+		__field( int,	src_nid			)
+		__field( int,	dst_cpu			)
+		__field( int,	dst_nid			)
+	),
+
+	TP_fast_assign(
+		__entry->pid		= task_pid_nr(tsk);
+		__entry->tgid		= task_tgid_nr(tsk);
+		__entry->ngid		= task_numa_group_id(tsk);
+		__entry->src_cpu	= src_cpu;
+		__entry->src_nid	= cpu_to_node(src_cpu);
+		__entry->dst_cpu	= dst_cpu;
+		__entry->dst_nid	= cpu_to_node(dst_cpu);
+	),
+
+	TP_printk("pid=%d tgid=%d ngid=%d src_cpu=%d src_nid=%d dst_cpu=%d dst_nid=%d",
+			__entry->pid, __entry->tgid, __entry->ngid,
+			__entry->src_cpu, __entry->src_nid,
+			__entry->dst_cpu, __entry->dst_nid)
+);
+
+/*
+ * Tracks migration of tasks from one runqueue to another. Can be used to
+ * detect if automatic NUMA balancing is bouncing between nodes
+ */
+DEFINE_EVENT(sched_move_task_template, sched_move_numa,
+	TP_PROTO(struct task_struct *tsk, int src_cpu, int dst_cpu),
+
+	TP_ARGS(tsk, src_cpu, dst_cpu)
+);
+
+DEFINE_EVENT(sched_move_task_template, sched_stick_numa,
+	TP_PROTO(struct task_struct *tsk, int src_cpu, int dst_cpu),
+
+	TP_ARGS(tsk, src_cpu, dst_cpu)
+);
+
+TRACE_EVENT(sched_swap_numa,
+
+	TP_PROTO(struct task_struct *src_tsk, int src_cpu,
+		 struct task_struct *dst_tsk, int dst_cpu),
+
+	TP_ARGS(src_tsk, src_cpu, dst_tsk, dst_cpu),
+
+	TP_STRUCT__entry(
+		__field( pid_t,	src_pid			)
+		__field( pid_t,	src_tgid		)
+		__field( pid_t,	src_ngid		)
+		__field( int,	src_cpu			)
+		__field( int,	src_nid			)
+		__field( pid_t,	dst_pid			)
+		__field( pid_t,	dst_tgid		)
+		__field( pid_t,	dst_ngid		)
+		__field( int,	dst_cpu			)
+		__field( int,	dst_nid			)
+	),
+
+	TP_fast_assign(
+		__entry->src_pid	= task_pid_nr(src_tsk);
+		__entry->src_tgid	= task_tgid_nr(src_tsk);
+		__entry->src_ngid	= task_numa_group_id(src_tsk);
+		__entry->src_cpu	= src_cpu;
+		__entry->src_nid	= cpu_to_node(src_cpu);
+		__entry->dst_pid	= task_pid_nr(dst_tsk);
+		__entry->dst_tgid	= task_tgid_nr(dst_tsk);
+		__entry->dst_ngid	= task_numa_group_id(dst_tsk);
+		__entry->dst_cpu	= dst_cpu;
+		__entry->dst_nid	= cpu_to_node(dst_cpu);
+	),
+
+	TP_printk("src_pid=%d src_tgid=%d src_ngid=%d src_cpu=%d src_nid=%d dst_pid=%d dst_tgid=%d dst_ngid=%d dst_cpu=%d dst_nid=%d",
+			__entry->src_pid, __entry->src_tgid, __entry->src_ngid,
+			__entry->src_cpu, __entry->src_nid,
+			__entry->dst_pid, __entry->dst_tgid, __entry->dst_ngid,
+			__entry->dst_cpu, __entry->dst_nid)
+);
 #endif /* _TRACE_SCHED_H */
 
 /* This part must be outside protection */
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index e85cda2..e485d2b 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1108,6 +1108,7 @@ int migrate_swap(struct task_struct *cur, struct task_struct *p)
 	if (!cpumask_test_cpu(arg.src_cpu, tsk_cpus_allowed(arg.dst_task)))
 		goto out;
 
+	trace_sched_swap_numa(cur, arg.src_cpu, p, arg.dst_cpu);
 	ret = stop_two_cpus(arg.dst_cpu, arg.src_cpu, migrate_swap_stop, &arg);
 
 out:
@@ -4090,6 +4091,7 @@ int migrate_task_to(struct task_struct *p, int target_cpu)
 
 	/* TODO: This is not properly updating schedstats */
 
+	trace_sched_move_numa(p, curr_cpu, target_cpu);
 	return stop_one_cpu(curr_cpu, migration_cpu_stop, &arg);
 }
 
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 18bf84e..26fe588 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1272,11 +1272,13 @@ static int task_numa_migrate(struct task_struct *p)
 	p->numa_scan_period = task_scan_min(p);
 
 	if (env.best_task == NULL) {
-		int ret = migrate_task_to(p, env.best_cpu);
+		if ((ret = migrate_task_to(p, env.best_cpu)) != 0)
+			trace_sched_stick_numa(p, env.src_cpu, env.best_cpu);
 		return ret;
 	}
 
-	ret = migrate_swap(p, env.best_task);
+	if ((ret = migrate_swap(p, env.best_task)) != 0);
+		trace_sched_stick_numa(p, env.src_cpu, task_cpu(env.best_task));
 	put_task_struct(env.best_task);
 	return ret;
 }
--
1.8.4

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: email@kvack.org

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from mail-wg0-f51.google.com (mail-wg0-f51.google.com [74.125.82.51])
	by kanga.kvack.org (Postfix) with ESMTP id D19EF6B0035
	for ; Tue, 10 Dec 2013 10:56:36 -0500 (EST)
Received: by mail-wg0-f51.google.com with SMTP id b13so5168934wgh.30
	for ; Tue, 10 Dec 2013 07:56:36 -0800 (PST)
Received: from mx2.suse.de (cantor2.suse.de.
[195.135.220.15])
	by mx.google.com with ESMTP id p46si14877231eem.168.2013.12.10.07.56.36
	for ; Tue, 10 Dec 2013 07:56:36 -0800 (PST)
Date: Tue, 10 Dec 2013 15:56:33 +0000
From: Mel Gorman
Subject: Re: [PATCH 00/17] NUMA balancing segmentation fault fixes and misc followups v4
Message-ID: <20131210155633.GL11295@suse.de>
References: <1386690695-27380-1-git-send-email-mgorman@suse.de>
MIME-Version: 1.0
Content-Type: text/plain; charset=iso-8859-15
Content-Disposition: inline
In-Reply-To: <1386690695-27380-1-git-send-email-mgorman@suse.de>
Sender: owner-linux-mm@kvack.org
List-ID:
To: Andrew Morton
Cc: Alex Thorlton , Rik van Riel , Linux-MM , LKML

On Tue, Dec 10, 2013 at 03:51:18PM +0000, Mel Gorman wrote:
> Changelog since V3
> o Dropped a tracing patch
> o Rebased to 3.13-rc3
> o Removed unnecessary ptl acquisition
>

*sigh*

There really are only 17 patches in the series. 18/18 does not exist.

--
Mel Gorman
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: email@kvack.org

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from mail-ea0-f174.google.com (mail-ea0-f174.google.com [209.85.215.174])
	by kanga.kvack.org (Postfix) with ESMTP id 0DAC06B0036
	for ; Tue, 10 Dec 2013 11:56:13 -0500 (EST)
Received: by mail-ea0-f174.google.com with SMTP id b10so2374914eae.33
	for ; Tue, 10 Dec 2013 08:56:13 -0800 (PST)
Received: from mx1.redhat.com (mx1.redhat.com.
[209.132.183.28])
	by mx.google.com with ESMTP id i1si15124229eev.152.2013.12.10.08.56.12
	for ; Tue, 10 Dec 2013 08:56:13 -0800 (PST)
Message-ID: <52A747A8.9030307@redhat.com>
Date: Tue, 10 Dec 2013 11:56:08 -0500
From: Rik van Riel
MIME-Version: 1.0
Subject: Re: [PATCH 12/18] mm: numa: Defer TLB flush for THP migration as long as possible
References: <1386690695-27380-1-git-send-email-mgorman@suse.de> <1386690695-27380-13-git-send-email-mgorman@suse.de>
In-Reply-To: <1386690695-27380-13-git-send-email-mgorman@suse.de>
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 7bit
Sender: owner-linux-mm@kvack.org
List-ID:
To: Mel Gorman
Cc: Andrew Morton , Alex Thorlton , Linux-MM , LKML

On 12/10/2013 10:51 AM, Mel Gorman wrote:
> THP migration can fail for a variety of reasons. Avoid flushing the TLB
> to deal with THP migration races until the copy is ready to start.
>
> Cc: stable@vger.kernel.org
> Signed-off-by: Mel Gorman

Reviewed-by: Rik van Riel

--
All rights reversed

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: email@kvack.org

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from mail-pb0-f42.google.com (mail-pb0-f42.google.com [209.85.160.42])
	by kanga.kvack.org (Postfix) with ESMTP id C57756B0035
	for ; Tue, 10 Dec 2013 17:22:14 -0500 (EST)
Received: by mail-pb0-f42.google.com with SMTP id uo5so8678314pbc.1
	for ; Tue, 10 Dec 2013 14:22:14 -0800 (PST)
Received: from mail.linuxfoundation.org (mail.linuxfoundation.org.
[140.211.169.12]) by mx.google.com with ESMTP id sa6si11634755pbb.23.2013.12.10.14.22.12 for ; Tue, 10 Dec 2013 14:22:13 -0800 (PST) Date: Tue, 10 Dec 2013 14:22:11 -0800 From: Andrew Morton Subject: Re: [PATCH 17/18] sched: Add tracepoints related to NUMA task migration Message-Id: <20131210142211.099fe782c361707ab3c04742@linux-foundation.org> In-Reply-To: <1386690695-27380-18-git-send-email-mgorman@suse.de> References: <1386690695-27380-1-git-send-email-mgorman@suse.de> <1386690695-27380-18-git-send-email-mgorman@suse.de> Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Mel Gorman Cc: Alex Thorlton , Rik van Riel , Linux-MM , LKML On Tue, 10 Dec 2013 15:51:35 +0000 Mel Gorman wrote: > This patch adds three tracepoints > o trace_sched_move_numa when a task is moved to a node > o trace_sched_swap_numa when a task is swapped with another task > o trace_sched_stick_numa when a numa-related migration fails > > The tracepoints allow the NUMA scheduler activity to be monitored and the > following high-level metrics can be calculated > > o NUMA migrated stuck nr trace_sched_stick_numa > o NUMA migrated idle nr trace_sched_move_numa > o NUMA migrated swapped nr trace_sched_swap_numa > o NUMA local swapped trace_sched_swap_numa src_nid == dst_nid (should never happen) > o NUMA remote swapped trace_sched_swap_numa src_nid != dst_nid (should == NUMA migrated swapped) > o NUMA group swapped trace_sched_swap_numa src_ngid == dst_ngid > Maybe a small number of these are acceptable > but a high number would be a major surprise. > It would be even worse if bounces are frequent. > o NUMA avg task migs. Average number of migrations for tasks > o NUMA stddev task mig Self-explanatory > o NUMA max task migs. 
Maximum number of migrations for a single task > > In general the intent of the tracepoints is to help diagnose problems > where automatic NUMA balancing appears to be doing an excessive amount of > useless work. > > ... > > --- a/kernel/sched/fair.c > +++ b/kernel/sched/fair.c > @@ -1272,11 +1272,13 @@ static int task_numa_migrate(struct task_struct *p) > p->numa_scan_period = task_scan_min(p); > > if (env.best_task == NULL) { > - int ret = migrate_task_to(p, env.best_cpu); > + if ((ret = migrate_task_to(p, env.best_cpu)) != 0) > + trace_sched_stick_numa(p, env.src_cpu, env.best_cpu); > return ret; > } > > - ret = migrate_swap(p, env.best_task); > + if ((ret = migrate_swap(p, env.best_task)) != 0); I'll zap that semicolon... > + trace_sched_stick_numa(p, env.src_cpu, task_cpu(env.best_task)); > put_task_struct(env.best_task); > return ret; > } -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-ee0-f48.google.com (mail-ee0-f48.google.com [74.125.83.48]) by kanga.kvack.org (Postfix) with ESMTP id C78216B0035 for ; Wed, 11 Dec 2013 03:37:48 -0500 (EST) Received: by mail-ee0-f48.google.com with SMTP id e49so2675596eek.21 for ; Wed, 11 Dec 2013 00:37:48 -0800 (PST) Received: from mx2.suse.de (cantor2.suse.de. 
[195.135.220.15]) by mx.google.com with ESMTP id e48si17973762eeh.155.2013.12.11.00.37.47 for ; Wed, 11 Dec 2013 00:37:47 -0800 (PST) Date: Wed, 11 Dec 2013 08:37:45 +0000 From: Mel Gorman Subject: Re: [PATCH 17/18] sched: Add tracepoints related to NUMA task migration Message-ID: <20131211083744.GP11295@suse.de> References: <1386690695-27380-1-git-send-email-mgorman@suse.de> <1386690695-27380-18-git-send-email-mgorman@suse.de> <20131210142211.099fe782c361707ab3c04742@linux-foundation.org> MIME-Version: 1.0 Content-Type: text/plain; charset=iso-8859-15 Content-Disposition: inline In-Reply-To: <20131210142211.099fe782c361707ab3c04742@linux-foundation.org> Sender: owner-linux-mm@kvack.org List-ID: To: Andrew Morton Cc: Alex Thorlton , Rik van Riel , Linux-MM , LKML On Tue, Dec 10, 2013 at 02:22:11PM -0800, Andrew Morton wrote: > On Tue, 10 Dec 2013 15:51:35 +0000 Mel Gorman wrote: > > > This patch adds three tracepoints > > o trace_sched_move_numa when a task is moved to a node > > o trace_sched_swap_numa when a task is swapped with another task > > o trace_sched_stick_numa when a numa-related migration fails > > > > The tracepoints allow the NUMA scheduler activity to be monitored and the > > following high-level metrics can be calculated > > > > o NUMA migrated stuck nr trace_sched_stick_numa > > o NUMA migrated idle nr trace_sched_move_numa > > o NUMA migrated swapped nr trace_sched_swap_numa > > o NUMA local swapped trace_sched_swap_numa src_nid == dst_nid (should never happen) > > o NUMA remote swapped trace_sched_swap_numa src_nid != dst_nid (should == NUMA migrated swapped) > > o NUMA group swapped trace_sched_swap_numa src_ngid == dst_ngid > > Maybe a small number of these are acceptable > > but a high number would be a major surprise. > > It would be even worse if bounces are frequent. > > o NUMA avg task migs. Average number of migrations for tasks > > o NUMA stddev task mig Self-explanatory > > o NUMA max task migs. 
Maximum number of migrations for a single task > > > > In general the intent of the tracepoints is to help diagnose problems > > where automatic NUMA balancing appears to be doing an excessive amount of > > useless work. > > > > ... > > > > --- a/kernel/sched/fair.c > > +++ b/kernel/sched/fair.c > > @@ -1272,11 +1272,13 @@ static int task_numa_migrate(struct task_struct *p) > > p->numa_scan_period = task_scan_min(p); > > > > if (env.best_task == NULL) { > > - int ret = migrate_task_to(p, env.best_cpu); > > + if ((ret = migrate_task_to(p, env.best_cpu)) != 0) > > + trace_sched_stick_numa(p, env.src_cpu, env.best_cpu); > > return ret; > > } > > > > - ret = migrate_swap(p, env.best_task); > > + if ((ret = migrate_swap(p, env.best_task)) != 0); > > I'll zap that semicolon... > Thanks -- Mel Gorman SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-ee0-f42.google.com (mail-ee0-f42.google.com [74.125.83.42]) by kanga.kvack.org (Postfix) with ESMTP id D4A426B0035 for ; Wed, 11 Dec 2013 08:21:13 -0500 (EST) Received: by mail-ee0-f42.google.com with SMTP id e53so2865143eek.15 for ; Wed, 11 Dec 2013 05:21:13 -0800 (PST) Received: from mx2.suse.de (cantor2.suse.de. 
[195.135.220.15]) by mx.google.com with ESMTP id l2si19016368een.125.2013.12.11.05.21.12 for ; Wed, 11 Dec 2013 05:21:12 -0800 (PST) Date: Wed, 11 Dec 2013 13:21:09 +0000 From: Mel Gorman Subject: [PATCH] mm: numa: Guarantee that tlb_flush_pending updates are visible before page table updates Message-ID: <20131211132109.GB24125@suse.de> References: <1386690695-27380-1-git-send-email-mgorman@suse.de> MIME-Version: 1.0 Content-Type: text/plain; charset=iso-8859-15 Content-Disposition: inline In-Reply-To: <1386690695-27380-1-git-send-email-mgorman@suse.de> Sender: owner-linux-mm@kvack.org List-ID: To: Andrew Morton Cc: "Paul E. McKenney" , Peter Zijlstra , Alex Thorlton , Rik van Riel , Linux-MM , LKML According to documentation on barriers, stores issued before a LOCK can complete after the lock implying that it's possible tlb_flush_pending can be visible after a page table update. As per revised documentation, this patch adds a smp_mb__before_spinlock to guarantee the correct ordering. Cc: stable@vger.kernel.org Signed-off-by: Mel Gorman --- include/linux/mm_types.h | 7 ++++++- 1 file changed, 6 insertions(+), 1 deletion(-) diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h index c122bb1..a12f2ab 100644 --- a/include/linux/mm_types.h +++ b/include/linux/mm_types.h @@ -482,7 +482,12 @@ static inline bool tlb_flush_pending(struct mm_struct *mm) static inline void set_tlb_flush_pending(struct mm_struct *mm) { mm->tlb_flush_pending = true; - barrier(); + + /* + * Guarantee that the tlb_flush_pending store does not leak into the + * critical section updating the page tables + */ + smp_mb__before_spinlock(); } /* Clearing is done after a TLB flush, which also provides a barrier. */ static inline void clear_tlb_flush_pending(struct mm_struct *mm) -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-ob0-f176.google.com (mail-ob0-f176.google.com [209.85.214.176]) by kanga.kvack.org (Postfix) with ESMTP id 5E1E36B0035 for ; Wed, 11 Dec 2013 09:44:52 -0500 (EST) Received: by mail-ob0-f176.google.com with SMTP id vb8so1502836obc.21 for ; Wed, 11 Dec 2013 06:44:52 -0800 (PST) Received: from e32.co.us.ibm.com (e32.co.us.ibm.com. [32.97.110.150]) by mx.google.com with ESMTPS id jb8si13640258obb.1.2013.12.11.06.44.51 for (version=TLSv1 cipher=RC4-SHA bits=128/128); Wed, 11 Dec 2013 06:44:51 -0800 (PST) Received: from /spool/local by e32.co.us.ibm.com with IBM ESMTP SMTP Gateway: Authorized Use Only! Violators will be prosecuted for from ; Wed, 11 Dec 2013 07:44:50 -0700 Received: from b03cxnp07027.gho.boulder.ibm.com (b03cxnp07027.gho.boulder.ibm.com [9.17.130.14]) by d03dlp03.boulder.ibm.com (Postfix) with ESMTP id B8B9D19D8048 for ; Wed, 11 Dec 2013 07:44:41 -0700 (MST) Received: from d03av06.boulder.ibm.com (d03av06.boulder.ibm.com [9.17.195.245]) by b03cxnp07027.gho.boulder.ibm.com (8.13.8/8.13.8/NCO v10.0) with ESMTP id rBBCgecX6226248 for ; Wed, 11 Dec 2013 13:42:40 +0100 Received: from d03av06.boulder.ibm.com (loopback [127.0.0.1]) by d03av06.boulder.ibm.com (8.14.4/8.13.1/NCO v10.0 AVout) with ESMTP id rBBElo8S000944 for ; Wed, 11 Dec 2013 07:47:50 -0700 Date: Wed, 11 Dec 2013 06:44:47 -0800 From: "Paul E. 
McKenney" Subject: Re: [PATCH] mm: numa: Guarantee that tlb_flush_pending updates are visible before page table updates Message-ID: <20131211144446.GP4208@linux.vnet.ibm.com> Reply-To: paulmck@linux.vnet.ibm.com References: <1386690695-27380-1-git-send-email-mgorman@suse.de> <20131211132109.GB24125@suse.de> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20131211132109.GB24125@suse.de> Sender: owner-linux-mm@kvack.org List-ID: To: Mel Gorman Cc: Andrew Morton , Peter Zijlstra , Alex Thorlton , Rik van Riel , Linux-MM , LKML On Wed, Dec 11, 2013 at 01:21:09PM +0000, Mel Gorman wrote: > According to documentation on barriers, stores issued before a LOCK can > complete after the lock implying that it's possible tlb_flush_pending can > be visible after a page table update. As per revised documentation, this patch > adds a smp_mb__before_spinlock to guarantee the correct ordering. > > Cc: stable@vger.kernel.org > Signed-off-by: Mel Gorman Assuming that there is a lock acquisition after calls to set_tlb_flush_pending(): Acked-by: Paul E. McKenney (I don't see set_tlb_flush_pending() in mainline.) > --- > include/linux/mm_types.h | 7 ++++++- > 1 file changed, 6 insertions(+), 1 deletion(-) > > diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h > index c122bb1..a12f2ab 100644 > --- a/include/linux/mm_types.h > +++ b/include/linux/mm_types.h > @@ -482,7 +482,12 @@ static inline bool tlb_flush_pending(struct mm_struct *mm) > static inline void set_tlb_flush_pending(struct mm_struct *mm) > { > mm->tlb_flush_pending = true; > - barrier(); > + > + /* > + * Guarantee that the tlb_flush_pending store does not leak into the > + * critical section updating the page tables > + */ > + smp_mb__before_spinlock(); > } > /* Clearing is done after a TLB flush, which also provides a barrier. 
*/ > static inline void clear_tlb_flush_pending(struct mm_struct *mm) > -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-ea0-f170.google.com (mail-ea0-f170.google.com [209.85.215.170]) by kanga.kvack.org (Postfix) with ESMTP id 53B666B0031 for ; Wed, 11 Dec 2013 10:21:58 -0500 (EST) Received: by mail-ea0-f170.google.com with SMTP id k10so3010645eaj.1 for ; Wed, 11 Dec 2013 07:21:57 -0800 (PST) Received: from mx1.redhat.com (mx1.redhat.com. [209.132.183.28]) by mx.google.com with ESMTP id a9si19530905eew.75.2013.12.11.07.21.56 for ; Wed, 11 Dec 2013 07:21:57 -0800 (PST) Date: Wed, 11 Dec 2013 10:21:26 -0500 From: Rik van Riel Subject: Re: [PATCH] mm: numa: Guarantee that tlb_flush_pending updates are visible before page table updates Message-ID: <20131211102126.532c763d@annuminas.surriel.com> In-Reply-To: <20131211132109.GB24125@suse.de> References: <1386690695-27380-1-git-send-email-mgorman@suse.de> <20131211132109.GB24125@suse.de> Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Mel Gorman Cc: Andrew Morton , "Paul E. McKenney" , Peter Zijlstra , Alex Thorlton , Linux-MM , LKML On Wed, 11 Dec 2013 13:21:09 +0000 Mel Gorman wrote: > According to documentation on barriers, stores issued before a LOCK can > complete after the lock implying that it's possible tlb_flush_pending can > be visible after a page table update. As per revised documentation, this patch > adds a smp_mb__before_spinlock to guarantee the correct ordering. And now you have 18 patches :) > Cc: stable@vger.kernel.org > Signed-off-by: Mel Gorman Reviewed-by: Rik van Riel -- All rights reversed. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. 
For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-ee0-f42.google.com (mail-ee0-f42.google.com [74.125.83.42]) by kanga.kvack.org (Postfix) with ESMTP id 5D03A6B0031 for ; Wed, 11 Dec 2013 11:40:55 -0500 (EST) Received: by mail-ee0-f42.google.com with SMTP id e53so2991116eek.15 for ; Wed, 11 Dec 2013 08:40:54 -0800 (PST) Received: from mx2.suse.de (cantor2.suse.de. [195.135.220.15]) by mx.google.com with ESMTP id r9si19886175eeo.65.2013.12.11.08.40.54 for ; Wed, 11 Dec 2013 08:40:54 -0800 (PST) Date: Wed, 11 Dec 2013 16:40:52 +0000 From: Mel Gorman Subject: Re: [PATCH] mm: numa: Guarantee that tlb_flush_pending updates are visible before page table updates Message-ID: <20131211164052.GB11295@suse.de> References: <1386690695-27380-1-git-send-email-mgorman@suse.de> <20131211132109.GB24125@suse.de> <20131211144446.GP4208@linux.vnet.ibm.com> MIME-Version: 1.0 Content-Type: text/plain; charset=iso-8859-15 Content-Disposition: inline In-Reply-To: <20131211144446.GP4208@linux.vnet.ibm.com> Sender: owner-linux-mm@kvack.org List-ID: To: "Paul E. McKenney" Cc: Andrew Morton , Peter Zijlstra , Alex Thorlton , Rik van Riel , Linux-MM , LKML On Wed, Dec 11, 2013 at 06:44:47AM -0800, Paul E. McKenney wrote: > On Wed, Dec 11, 2013 at 01:21:09PM +0000, Mel Gorman wrote: > > According to documentation on barriers, stores issued before a LOCK can > > complete after the lock implying that it's possible tlb_flush_pending can > > be visible after a page table update. As per revised documentation, this patch > > adds a smp_mb__before_spinlock to guarantee the correct ordering. > > > > Cc: stable@vger.kernel.org > > Signed-off-by: Mel Gorman > > Assuming that there is a lock acquisition after calls to > set_tlb_flush_pending(): > > Acked-by: Paul E. McKenney > > (I don't see set_tlb_flush_pending() in mainline.) 
> It's introduced by a patch flight that is currently sitting in Andrew's tree. In the case where we care about the value of tlb_flush_pending, a spinlock will be taken. PMD or PTE split spinlocks or the mm->page_table_lock depending on whether it is 3.13 or 3.12-stable and earlier kernels. I pushed the relevant patches to this tree and branch git://git.kernel.org/pub/scm/linux/kernel/git/mel/linux-balancenuma.git numab-instrument-serialise-v5r1 There is no guarantee the lock will be taken if there are no pages populated in the region but we also do not care about flushing the TLB in that case either. Does it matter that there is no guarantee a lock will be taken after smp_mb__before_spinlock, just very likely that it will be? -- Mel Gorman SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-ob0-f172.google.com (mail-ob0-f172.google.com [209.85.214.172]) by kanga.kvack.org (Postfix) with ESMTP id 217CA6B0031 for ; Wed, 11 Dec 2013 11:56:26 -0500 (EST) Received: by mail-ob0-f172.google.com with SMTP id gq1so7230539obb.17 for ; Wed, 11 Dec 2013 08:56:25 -0800 (PST) Received: from e39.co.us.ibm.com (e39.co.us.ibm.com. [32.97.110.160]) by mx.google.com with ESMTPS id ns8si13950926obc.35.2013.12.11.08.56.24 for (version=TLSv1 cipher=RC4-SHA bits=128/128); Wed, 11 Dec 2013 08:56:25 -0800 (PST) Received: from /spool/local by e39.co.us.ibm.com with IBM ESMTP SMTP Gateway: Authorized Use Only! 
Violators will be prosecuted for from ; Wed, 11 Dec 2013 09:56:24 -0700 Received: from b03cxnp07027.gho.boulder.ibm.com (b03cxnp07027.gho.boulder.ibm.com [9.17.130.14]) by d03dlp02.boulder.ibm.com (Postfix) with ESMTP id BA6B83E4003F for ; Wed, 11 Dec 2013 09:56:21 -0700 (MST) Received: from d03av06.boulder.ibm.com (d03av06.boulder.ibm.com [9.17.195.245]) by b03cxnp07027.gho.boulder.ibm.com (8.13.8/8.13.8/NCO v10.0) with ESMTP id rBBEsDRi8257934 for ; Wed, 11 Dec 2013 15:54:13 +0100 Received: from d03av06.boulder.ibm.com (loopback [127.0.0.1]) by d03av06.boulder.ibm.com (8.14.4/8.13.1/NCO v10.0 AVout) with ESMTP id rBBGxNUi006792 for ; Wed, 11 Dec 2013 09:59:23 -0700 Date: Wed, 11 Dec 2013 08:56:20 -0800 From: "Paul E. McKenney" Subject: Re: [PATCH] mm: numa: Guarantee that tlb_flush_pending updates are visible before page table updates Message-ID: <20131211165620.GU4208@linux.vnet.ibm.com> Reply-To: paulmck@linux.vnet.ibm.com References: <1386690695-27380-1-git-send-email-mgorman@suse.de> <20131211132109.GB24125@suse.de> <20131211144446.GP4208@linux.vnet.ibm.com> <20131211164052.GB11295@suse.de> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20131211164052.GB11295@suse.de> Sender: owner-linux-mm@kvack.org List-ID: To: Mel Gorman Cc: Andrew Morton , Peter Zijlstra , Alex Thorlton , Rik van Riel , Linux-MM , LKML On Wed, Dec 11, 2013 at 04:40:52PM +0000, Mel Gorman wrote: > On Wed, Dec 11, 2013 at 06:44:47AM -0800, Paul E. McKenney wrote: > > On Wed, Dec 11, 2013 at 01:21:09PM +0000, Mel Gorman wrote: > > > According to documentation on barriers, stores issued before a LOCK can > > > complete after the lock implying that it's possible tlb_flush_pending can > > > be visible after a page table update. As per revised documentation, this patch > > > adds a smp_mb__before_spinlock to guarantee the correct ordering. 
> > > > > > Cc: stable@vger.kernel.org > > > Signed-off-by: Mel Gorman > > > > Assuming that there is a lock acquisition after calls to > > set_tlb_flush_pending(): > > > > Acked-by: Paul E. McKenney > > > > (I don't see set_tlb_flush_pending() in mainline.) > > > > It's introduced by a patch flight that is currently sitting in Andrew's > tree. In the case where we care about the value of tlb_flush_pending, a > spinlock will be taken. PMD or PTE split spinlocks or the mm->page_table_lock > depending on whether it is 3.13 or 3.12-stable and earlier kernels. I > pushed the relevant patches to this tree and branch > > git://git.kernel.org/pub/scm/linux/kernel/git/mel/linux-balancenuma.git numab-instrument-serialise-v5r1 > > There is no guarantee the lock will be taken if there are no pages populated > in the region but we also do not care about flushing the TLB in that case > either. Does it matter that there is no guarantee a lock will be taken > after smp_mb__before_spinlock, just very likely that it will be? If you do smp_mb__before_spinlock() without a lock acquisition, no harm will be done, other than possibly a bit of performance loss. So you should be OK. Thanx, Paul -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-ee0-f54.google.com (mail-ee0-f54.google.com [74.125.83.54]) by kanga.kvack.org (Postfix) with ESMTP id B6B5A6B0031 for ; Wed, 11 Dec 2013 14:12:54 -0500 (EST) Received: by mail-ee0-f54.google.com with SMTP id e51so2930966eek.13 for ; Wed, 11 Dec 2013 11:12:53 -0800 (PST) Received: from mx2.suse.de (cantor2.suse.de. 
[195.135.220.15]) by mx.google.com with ESMTP id h45si20487834eeo.151.2013.12.11.11.12.53 for ; Wed, 11 Dec 2013 11:12:53 -0800 (PST) Date: Wed, 11 Dec 2013 19:12:50 +0000 From: Mel Gorman Subject: [PATCH] mm: fix TLB flush race between migration, and change_protection_range -fix Message-ID: <20131211191250.GD11295@suse.de> References: <1386690695-27380-1-git-send-email-mgorman@suse.de> <1386690695-27380-12-git-send-email-mgorman@suse.de> MIME-Version: 1.0 Content-Type: text/plain; charset=iso-8859-15 Content-Disposition: inline In-Reply-To: <1386690695-27380-12-git-send-email-mgorman@suse.de> Sender: owner-linux-mm@kvack.org List-ID: To: Andrew Morton Cc: Alex Thorlton , Rik van Riel , Linux-MM , LKML The following build error was reported by the 0-day build checker. >> arch/arm/mm/context.c:51:18: error: 'tlb_flush_pending' redeclared as different kind of symbol include/linux/mm_types.h:477:91: note: previous definition of 'tlb_flush_pending' was here This patch renames tlb_flush_pending to mm_tlb_flush_pending. This is a fix for the -mm patch mm-fix-tlb-flush-race-between-migration-and-change_protection_range.patch Note that when slotted into place that it will cause a conflict with mm-numa-defer-tlb-flush-for-thp-migration-as-long-as-possible.patch . The resolution is to delete the call from huge_memory.c and make sure the tlb_flush_pending call in mm/migrate.c is renamed appropriately. 
Signed-off-by: Mel Gorman --- arch/x86/include/asm/pgtable.h | 2 +- include/linux/mm_types.h | 4 ++-- mm/huge_memory.c | 2 +- 3 files changed, 4 insertions(+), 4 deletions(-) diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h index 48cab4c..bbc8b12 100644 --- a/arch/x86/include/asm/pgtable.h +++ b/arch/x86/include/asm/pgtable.h @@ -458,7 +458,7 @@ static inline bool pte_accessible(struct mm_struct *mm, pte_t a) return true; if ((pte_flags(a) & (_PAGE_PROTNONE | _PAGE_NUMA)) && - tlb_flush_pending(mm)) + mm_tlb_flush_pending(mm)) return true; return false; diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h index c122bb1..e5c49c3 100644 --- a/include/linux/mm_types.h +++ b/include/linux/mm_types.h @@ -474,7 +474,7 @@ static inline cpumask_t *mm_cpumask(struct mm_struct *mm) * The barriers below prevent the compiler from re-ordering the instructions * around the memory barriers that are already present in the code. */ -static inline bool tlb_flush_pending(struct mm_struct *mm) +static inline bool mm_tlb_flush_pending(struct mm_struct *mm) { barrier(); return mm->tlb_flush_pending; @@ -491,7 +491,7 @@ static inline void clear_tlb_flush_pending(struct mm_struct *mm) mm->tlb_flush_pending = false; } #else -static inline bool tlb_flush_pending(struct mm_struct *mm) +static inline bool mm_tlb_flush_pending(struct mm_struct *mm) { return false; } diff --git a/mm/huge_memory.c b/mm/huge_memory.c index e3a5ee2..317a8ff 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -1380,7 +1380,7 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma, * The page_table_lock above provides a memory barrier * with change_protection_range. */ - if (tlb_flush_pending(mm)) + if (mm_tlb_flush_pending(mm)) flush_tlb_range(vma, haddr, haddr + HPAGE_PMD_SIZE); /* -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-wi0-f179.google.com (mail-wi0-f179.google.com [209.85.212.179]) by kanga.kvack.org (Postfix) with ESMTP id 4BB216B0035 for ; Mon, 16 Dec 2013 18:15:21 -0500 (EST) Received: by mail-wi0-f179.google.com with SMTP id z2so2903471wiv.0 for ; Mon, 16 Dec 2013 15:15:20 -0800 (PST) Received: from mx1.redhat.com (mx1.redhat.com. [209.132.183.28]) by mx.google.com with ESMTP id z6si4247751wja.32.2013.12.16.15.15.19 for ; Mon, 16 Dec 2013 15:15:20 -0800 (PST) Date: Mon, 16 Dec 2013 18:15:13 -0500 From: Rik van Riel Subject: [PATCH 19/18] mm,numa: write pte_numa pte back to the page tables Message-ID: <20131216181513.14eda80d@annuminas.surriel.com> In-Reply-To: <1386690695-27380-6-git-send-email-mgorman@suse.de> References: <1386690695-27380-1-git-send-email-mgorman@suse.de> <1386690695-27380-6-git-send-email-mgorman@suse.de> Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Mel Gorman Cc: Andrew Morton , Alex Thorlton , Linux-MM , LKML , chegu_vinod@hp.com On Tue, 10 Dec 2013 15:51:23 +0000 Mel Gorman wrote: > The TLB must be flushed if the PTE is updated but change_pte_range is clearing > the PTE while marking PTEs pte_numa without necessarily flushing the TLB if it > reinserts the same entry. Without the flush, it's conceivable that two processors > have different TLBs for the same virtual address and at the very least it would > generate spurious faults. This patch only unmaps the pages in change_pte_range for > a full protection change. Turns out the patch optimized out not one, but both pte writes. Oops. We'll need this one too, Andrew :) ---8<--- Subject: mm,numa: write pte_numa pte back to the page tables The patch "mm: numa: Do not clear PTE for pte_numa update" cleverly optimizes out an extraneous PTE write when changing the protection of pages to pte_numa. 
It also optimizes out actually writing the new pte_numa entry back to the page tables. Oops. Signed-off-by: Rik van Riel Reported-by: Chegu Vinod --- mm/mprotect.c | 1 + 1 file changed, 1 insertion(+) diff --git a/mm/mprotect.c b/mm/mprotect.c index edc4e22..4114acf 100644 --- a/mm/mprotect.c +++ b/mm/mprotect.c @@ -67,6 +67,7 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd, if (page && !PageKsm(page)) { if (!pte_numa(oldpte)) { ptent = pte_mknuma(ptent); + set_pte_at(mm, addr, pte, ptent); updated = true; } } -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-qe0-f50.google.com (mail-qe0-f50.google.com [209.85.128.50]) by kanga.kvack.org (Postfix) with ESMTP id 61E236B0035 for ; Tue, 17 Dec 2013 17:53:54 -0500 (EST) Received: by mail-qe0-f50.google.com with SMTP id 1so5793844qec.23 for ; Tue, 17 Dec 2013 14:53:54 -0800 (PST) Received: from aserp1040.oracle.com (aserp1040.oracle.com. 
[141.146.126.69]) by mx.google.com with ESMTPS id l3si15797414qac.30.2013.12.17.14.53.53 for (version=TLSv1 cipher=RC4-SHA bits=128/128); Tue, 17 Dec 2013 14:53:53 -0800 (PST) Message-ID: <52B0D5F9.5030208@oracle.com> Date: Tue, 17 Dec 2013 17:53:45 -0500 From: Sasha Levin MIME-Version: 1.0 Subject: Re: [PATCH 10/18] mm: numa: Avoid unnecessary disruption of NUMA hinting during migration References: <1386690695-27380-1-git-send-email-mgorman@suse.de> <1386690695-27380-11-git-send-email-mgorman@suse.de> In-Reply-To: <1386690695-27380-11-git-send-email-mgorman@suse.de> Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Mel Gorman , Andrew Morton Cc: Alex Thorlton , Rik van Riel , Linux-MM , LKML Hi Mel, On 12/10/2013 10:51 AM, Mel Gorman wrote: > + > + /* mmap_sem prevents this happening but warn if that changes */ > + WARN_ON(pmd_trans_migrating(pmd)); > + I seem to be hitting this warning with latest -next kernel: [ 1704.594807] WARNING: CPU: 28 PID: 35287 at mm/huge_memory.c:887 copy_huge_pmd+0x145/ 0x3a0() [ 1704.597258] Modules linked in: [ 1704.597844] CPU: 28 PID: 35287 Comm: trinity-main Tainted: G W 3.13.0-rc4- next-20131217-sasha-00013-ga878504-dirty #4149 [ 1704.599924] 0000000000000377e delta! pid slot 27 [36258]: old:2 now:537927697 diff: 537927695 ffff8803593ddb90 ffffffff8439501c ffffffff854722c1 [ 1704.604846] 0000000000000000 ffff8803593ddbd0 ffffffff8112f8ac ffff8803593ddbe0 [ 1704.606391] ffff88034bc137f0 ffff880e41677000 8000000b47c009e4 ffff88034a638000 [ 1704.608008] Call Trace: [ 1704.608511] [] dump_stack+0x52/0x7f [ 1704.609699] [] warn_slowpath_common+0x8c/0xc0 [ 1704.612617] [] warn_slowpath_null+0x1a/0x20 [ 1704.614043] [] copy_huge_pmd+0x145/0x3a0 [ 1704.615587] [] copy_page_range+0x3f2/0x560 [ 1704.616869] [] ? 
rwsem_wake+0x51/0x70 [ 1704.617942] [] dup_mmap+0x2c9/0x3d0 [ 1704.619146] [] dup_mm+0xad/0x150 [ 1704.620051] [] copy_process+0xa68/0x12e0 [ 1704.622976] [] ? __lock_release+0x1da/0x1f0 [ 1704.624234] [] do_fork+0x96/0x270 [ 1704.624975] [] ? context_tracking_user_exit+0x195/0x1d0 [ 1704.626427] [] ? trace_hardirqs_on+0xd/0x10 [ 1704.627681] [] SyS_clone+0x16/0x20 [ 1704.628833] [] stub_clone+0x69/0x90 [ 1704.629672] [] ? tracesys+0xdd/0xe2 Thanks, Sasha -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-ee0-f52.google.com (mail-ee0-f52.google.com [74.125.83.52]) by kanga.kvack.org (Postfix) with ESMTP id ABEFD6B0037 for ; Thu, 19 Dec 2013 06:59:08 -0500 (EST) Received: by mail-ee0-f52.google.com with SMTP id d17so418821eek.25 for ; Thu, 19 Dec 2013 03:59:08 -0800 (PST) Received: from mx2.suse.de (cantor2.suse.de. 
[195.135.220.15]) by mx.google.com with ESMTPS id v6si4043810eel.7.2013.12.19.03.59.07 for (version=TLSv1 cipher=RC4-SHA bits=128/128); Thu, 19 Dec 2013 03:59:07 -0800 (PST) Date: Thu, 19 Dec 2013 11:59:05 +0000 From: Mel Gorman Subject: Re: [PATCH 10/18] mm: numa: Avoid unnecessary disruption of NUMA hinting during migration Message-ID: <20131219115905.GI11295@suse.de> References: <1386690695-27380-1-git-send-email-mgorman@suse.de> <1386690695-27380-11-git-send-email-mgorman@suse.de> <52B0D5F9.5030208@oracle.com> MIME-Version: 1.0 Content-Type: text/plain; charset=iso-8859-15 Content-Disposition: inline In-Reply-To: <52B0D5F9.5030208@oracle.com> Sender: owner-linux-mm@kvack.org List-ID: To: Sasha Levin Cc: Andrew Morton , Alex Thorlton , Rik van Riel , Linux-MM , LKML On Tue, Dec 17, 2013 at 05:53:45PM -0500, Sasha Levin wrote: > Hi Mel, > > On 12/10/2013 10:51 AM, Mel Gorman wrote: > >+ > >+ /* mmap_sem prevents this happening but warn if that changes */ > >+ WARN_ON(pmd_trans_migrating(pmd)); > >+ > > I seem to be hitting this warning with latest -next kernel: > Patch will follow shortly. I appreciate these trinity bug reports but in the future is there any chance you could include the trinity command line and the config file you used? Details on the machine would also be nice. In this case, knowing if the machine was NUMA or not would have been helpful. Thanks! -- Mel Gorman SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-ea0-f180.google.com (mail-ea0-f180.google.com [209.85.215.180]) by kanga.kvack.org (Postfix) with ESMTP id 65DF86B0038 for ; Thu, 19 Dec 2013 07:00:10 -0500 (EST) Received: by mail-ea0-f180.google.com with SMTP id f15so417630eak.25 for ; Thu, 19 Dec 2013 04:00:09 -0800 (PST) Received: from mx2.suse.de (cantor2.suse.de. 
[195.135.220.15]) by mx.google.com with ESMTPS id 5si3958047eei.207.2013.12.19.04.00.09 for (version=TLSv1 cipher=RC4-SHA bits=128/128); Thu, 19 Dec 2013 04:00:09 -0800 (PST) Date: Thu, 19 Dec 2013 12:00:07 +0000 From: Mel Gorman Subject: [PATCH] mm: Remove bogus warning in copy_huge_pmd Message-ID: <20131219120007.GJ11295@suse.de> References: <1386690695-27380-1-git-send-email-mgorman@suse.de> <1386690695-27380-11-git-send-email-mgorman@suse.de> <52B0D5F9.5030208@oracle.com> MIME-Version: 1.0 Content-Type: text/plain; charset=iso-8859-15 Content-Disposition: inline In-Reply-To: <52B0D5F9.5030208@oracle.com> Sender: owner-linux-mm@kvack.org List-ID: To: Andrew Morton Cc: Alex Thorlton , Rik van Riel , Sasha Levin , Linux-MM , LKML Sasha Levin reported the following warning being triggered [ 1704.594807] WARNING: CPU: 28 PID: 35287 at mm/huge_memory.c:887 copy_huge_pmd+0x145/ 0x3a0() [ 1704.597258] Modules linked in: [ 1704.597844] CPU: 28 PID: 35287 Comm: trinity-main Tainted: G W 3.13.0-rc4-next-20131217-sasha-00013-ga878504-dirty #4149 [ 1704.599924] 0000000000000377e delta! pid slot 27 [36258]: old:2 now:537927697 diff: 537927695 ffff8803593ddb90 ffffffff8439501c ffffffff854722c1 [ 1704.604846] 0000000000000000 ffff8803593ddbd0 ffffffff8112f8ac ffff8803593ddbe0 [ 1704.606391] ffff88034bc137f0 ffff880e41677000 8000000b47c009e4 ffff88034a638000 [ 1704.608008] Call Trace: [ 1704.608511] [] dump_stack+0x52/0x7f [ 1704.609699] [] warn_slowpath_common+0x8c/0xc0 [ 1704.612617] [] warn_slowpath_null+0x1a/0x20 [ 1704.614043] [] copy_huge_pmd+0x145/0x3a0 [ 1704.615587] [] copy_page_range+0x3f2/0x560 [ 1704.616869] [] ? rwsem_wake+0x51/0x70 [ 1704.617942] [] dup_mmap+0x2c9/0x3d0 [ 1704.619146] [] dup_mm+0xad/0x150 [ 1704.620051] [] copy_process+0xa68/0x12e0 [ 1704.622976] [] ? __lock_release+0x1da/0x1f0 [ 1704.624234] [] do_fork+0x96/0x270 [ 1704.624975] [] ? context_tracking_user_exit+0x195/0x1d0 [ 1704.626427] [] ? 
trace_hardirqs_on+0xd/0x10 [ 1704.627681] [] SyS_clone+0x16/0x20 [ 1704.628833] [] stub_clone+0x69/0x90 [ 1704.629672] [] ? tracesys+0xdd/0xe2 This warning was introduced by "mm: numa: Avoid unnecessary disruption of NUMA hinting during migration" for paranoia reasons but the warning is bogus. I was thinking of parallel races between NUMA hinting faults and forks but this warning would also be triggered by a parallel reclaim splitting a THP during a fork. Remove the bogus warning. Signed-off-by: Mel Gorman --- mm/huge_memory.c | 3 --- 1 file changed, 3 deletions(-) diff --git a/mm/huge_memory.c b/mm/huge_memory.c index e3b6a75..468bd3a 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -883,9 +883,6 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm, goto out_unlock; } - /* mmap_sem prevents this happening but warn if that changes */ - WARN_ON(pmd_trans_migrating(pmd)); - if (unlikely(pmd_trans_splitting(pmd))) { /* split huge page running from under us */ spin_unlock(src_ptl); From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-qe0-f52.google.com (mail-qe0-f52.google.com [209.85.128.52]) by kanga.kvack.org (Postfix) with ESMTP id F1F9B6B0031 for ; Thu, 19 Dec 2013 13:36:09 -0500 (EST) Received: by mail-qe0-f52.google.com with SMTP id ne12so1366934qeb.25 for ; Thu, 19 Dec 2013 10:36:09 -0800 (PST) Received: from mx1.redhat.com (mx1.redhat.com. 
[209.132.183.28]) by mx.google.com with ESMTP id t13si3621562qef.73.2013.12.19.10.36.08 for ; Thu, 19 Dec 2013 10:36:09 -0800 (PST) Message-ID: <52B33C93.2020006@redhat.com> Date: Thu, 19 Dec 2013 13:36:03 -0500 From: Rik van Riel MIME-Version: 1.0 Subject: Re: [PATCH] mm: Remove bogus warning in copy_huge_pmd References: <1386690695-27380-1-git-send-email-mgorman@suse.de> <1386690695-27380-11-git-send-email-mgorman@suse.de> <52B0D5F9.5030208@oracle.com> <20131219120007.GJ11295@suse.de> In-Reply-To: <20131219120007.GJ11295@suse.de> Content-Type: text/plain; charset=ISO-8859-15; format=flowed Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Mel Gorman , Andrew Morton Cc: Alex Thorlton , Sasha Levin , Linux-MM , LKML On 12/19/2013 07:00 AM, Mel Gorman wrote: > Sasha Levin reported the following warning being triggered > > [ 1704.594807] WARNING: CPU: 28 PID: 35287 at mm/huge_memory.c:887 copy_huge_pmd+0x145/ 0x3a0() > [ 1704.597258] Modules linked in: > [ 1704.597844] CPU: 28 PID: 35287 Comm: trinity-main Tainted: G W 3.13.0-rc4-next-20131217-sasha-00013-ga878504-dirty #4149 > [ 1704.599924] 0000000000000377e delta! pid slot 27 [36258]: old:2 now:537927697 diff: 537927695 ffff8803593ddb90 ffffffff8439501c ffffffff854722c1 > [ 1704.604846] 0000000000000000 ffff8803593ddbd0 ffffffff8112f8ac ffff8803593ddbe0 > [ 1704.606391] ffff88034bc137f0 ffff880e41677000 8000000b47c009e4 ffff88034a638000 > [ 1704.608008] Call Trace: > [ 1704.608511] [] dump_stack+0x52/0x7f > [ 1704.609699] [] warn_slowpath_common+0x8c/0xc0 > [ 1704.612617] [] warn_slowpath_null+0x1a/0x20 > [ 1704.614043] [] copy_huge_pmd+0x145/0x3a0 > [ 1704.615587] [] copy_page_range+0x3f2/0x560 > [ 1704.616869] [] ? rwsem_wake+0x51/0x70 > [ 1704.617942] [] dup_mmap+0x2c9/0x3d0 > [ 1704.619146] [] dup_mm+0xad/0x150 > [ 1704.620051] [] copy_process+0xa68/0x12e0 > [ 1704.622976] [] ? __lock_release+0x1da/0x1f0 > [ 1704.624234] [] do_fork+0x96/0x270 > [ 1704.624975] [] ? 
context_tracking_user_exit+0x195/0x1d0 > [ 1704.626427] [] ? trace_hardirqs_on+0xd/0x10 > [ 1704.627681] [] SyS_clone+0x16/0x20 > [ 1704.628833] [] stub_clone+0x69/0x90 > [ 1704.629672] [] ? tracesys+0xdd/0xe2 > > This warning was introduced by "mm: numa: Avoid unnecessary disruption > of NUMA hinting during migration" for paranoia reasons but the warning > is bogus. I was thinking of parallel races between NUMA hinting faults > and forks but this warning would also be triggered by a parallel reclaim > splitting a THP during a fork. Remove the bogus warning. > > Signed-off-by: Mel Gorman Acked-by: Rik van Riel From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1754550Ab3LJPvk (ORCPT ); Tue, 10 Dec 2013 10:51:40 -0500 Received: from cantor2.suse.de ([195.135.220.15]:44379 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751687Ab3LJPvi (ORCPT ); Tue, 10 Dec 2013 10:51:38 -0500 From: Mel Gorman To: Andrew Morton Cc: Alex Thorlton , Rik van Riel , Linux-MM , LKML , Mel Gorman Subject: [PATCH 00/17] NUMA balancing segmentation fault fixes and misc followups v4 Date: Tue, 10 Dec 2013 15:51:18 +0000 Message-Id: <1386690695-27380-1-git-send-email-mgorman@suse.de> X-Mailer: git-send-email 1.8.4 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Changelog since V3 o Dropped a tracing patch o Rebased to 3.13-rc3 o Removed unnecessary ptl acquisition Alex Thorlton reported segmentation faults when NUMA balancing is enabled on large machines. There is no obvious explanation from the console what the problem is but similar problems have been observed by Rik van Riel and myself if migration was aggressive enough. 
Alex, this series is against 3.13-rc2, a verification that the fix addresses your problem would be appreciated. This series starts with a range of patches aimed at addressing the segmentation fault problem while offsetting some of the cost to avoid badly regressing performance in -stable. Those that are cc'd to stable (patches 1-12) should be merged ASAP. The rest of the series is relatively minor stuff that fell out during the course of development that is ok to wait for the next merge window but should help with the continued development of NUMA balancing. arch/sparc/include/asm/pgtable_64.h | 4 +- arch/x86/include/asm/pgtable.h | 11 +++- arch/x86/mm/gup.c | 13 +++++ include/asm-generic/pgtable.h | 2 +- include/linux/migrate.h | 9 ++++ include/linux/mm_types.h | 44 +++++++++++++++ include/linux/mmzone.h | 5 +- include/trace/events/migrate.h | 26 +++++++++ include/trace/events/sched.h | 87 ++++++++++++++++++++++++++++++ kernel/fork.c | 1 + kernel/sched/core.c | 2 + kernel/sched/fair.c | 24 +++++---- mm/huge_memory.c | 45 ++++++++++++---- mm/mempolicy.c | 6 +-- mm/migrate.c | 103 ++++++++++++++++++++++++++++-------- mm/mprotect.c | 15 ++++-- mm/pgtable-generic.c | 8 ++- 17 files changed, 347 insertions(+), 58 deletions(-) -- 1.8.4 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1755059Ab3LJPvz (ORCPT ); Tue, 10 Dec 2013 10:51:55 -0500 Received: from cantor2.suse.de ([195.135.220.15]:44445 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1754910Ab3LJPvq (ORCPT ); Tue, 10 Dec 2013 10:51:46 -0500 From: Mel Gorman To: Andrew Morton Cc: Alex Thorlton , Rik van Riel , Linux-MM , LKML , Mel Gorman Subject: [PATCH 12/18] mm: numa: Defer TLB flush for THP migration as long as possible Date: Tue, 10 Dec 2013 15:51:30 +0000 Message-Id: <1386690695-27380-13-git-send-email-mgorman@suse.de> X-Mailer: git-send-email 1.8.4 In-Reply-To: 
<1386690695-27380-1-git-send-email-mgorman@suse.de> References: <1386690695-27380-1-git-send-email-mgorman@suse.de> Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org THP migration can fail for a variety of reasons. Avoid flushing the TLB to deal with THP migration races until the copy is ready to start. Cc: stable@vger.kernel.org Signed-off-by: Mel Gorman --- mm/huge_memory.c | 7 ------- mm/migrate.c | 3 +++ 2 files changed, 3 insertions(+), 7 deletions(-) diff --git a/mm/huge_memory.c b/mm/huge_memory.c index e3a5ee2..e3b6a75 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -1377,13 +1377,6 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma, } /* - * The page_table_lock above provides a memory barrier - * with change_protection_range. - */ - if (tlb_flush_pending(mm)) - flush_tlb_range(vma, haddr, haddr + HPAGE_PMD_SIZE); - - /* * Migrate the THP to the requested node, returns with page unlocked * and pmd_numa cleared. 
*/ diff --git a/mm/migrate.c b/mm/migrate.c index cfb4190..0c4fbf6 100644 --- a/mm/migrate.c +++ b/mm/migrate.c @@ -1759,6 +1759,9 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm, goto out_fail; } + if (tlb_flush_pending(mm)) + flush_tlb_range(vma, mmun_start, mmun_end); + /* Prepare a page as a migration target */ __set_page_locked(new_page); SetPageSwapBacked(new_page); -- 1.8.4 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1755100Ab3LJPv5 (ORCPT ); Tue, 10 Dec 2013 10:51:57 -0500 Received: from cantor2.suse.de ([195.135.220.15]:44459 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1754949Ab3LJPvs (ORCPT ); Tue, 10 Dec 2013 10:51:48 -0500 From: Mel Gorman To: Andrew Morton Cc: Alex Thorlton , Rik van Riel , Linux-MM , LKML , Mel Gorman Subject: [PATCH 15/18] mm: numa: Trace tasks that fail migration due to rate limiting Date: Tue, 10 Dec 2013 15:51:33 +0000 Message-Id: <1386690695-27380-16-git-send-email-mgorman@suse.de> X-Mailer: git-send-email 1.8.4 In-Reply-To: <1386690695-27380-1-git-send-email-mgorman@suse.de> References: <1386690695-27380-1-git-send-email-mgorman@suse.de> Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org A low local/remote numa hinting fault ratio is potentially explained by failed migrations. This patch adds a tracepoint that fires when migration fails due to migration rate limitation. 
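For readers following along outside the kernel tree, the point at which the new tracepoint fires can be illustrated with a small userspace sketch. This is not the kernel code: the struct, the function name and the plain counter standing in for the tracepoint are all hypothetical, time is a simple integer rather than jiffies (so there is no wraparound handling as time_after() would give), and there is no locking.

```c
#include <stdbool.h>

/* Hypothetical userspace stand-in for the per-node rate-limit state. */
struct node_ratelimit {
	unsigned long next_window;	/* time at which the current window expires */
	unsigned long nr_pages;		/* pages accounted in the current window */
	unsigned long ratelimited;	/* stands in for the tracepoint firing */
};

/*
 * Returns true if the migration should be refused because the node has
 * already exceeded "limit" pages in the current window.
 */
static bool update_ratelimit(struct node_ratelimit *rl, unsigned long now,
			     unsigned long interval, unsigned long limit,
			     unsigned long nr_pages)
{
	/* Start a new window once the previous one has expired. */
	if (now > rl->next_window) {
		rl->nr_pages = 0;
		rl->next_window = now + interval;
	}

	/* Over the limit: this is where the tracepoint would fire. */
	if (rl->nr_pages > limit) {
		rl->ratelimited++;
		return true;
	}

	/* Otherwise account the pages and allow the migration. */
	rl->nr_pages += nr_pages;
	return false;
}
```

In this sketch a large ratelimited count relative to the number of attempted migrations is the condition the tracepoint is meant to expose.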
Signed-off-by: Mel Gorman Reviewed-by: Rik van Riel --- include/trace/events/migrate.h | 26 ++++++++++++++++++++++++++ mm/migrate.c | 5 ++++- 2 files changed, 30 insertions(+), 1 deletion(-) diff --git a/include/trace/events/migrate.h b/include/trace/events/migrate.h index ec2a6cc..3075ffb 100644 --- a/include/trace/events/migrate.h +++ b/include/trace/events/migrate.h @@ -45,6 +45,32 @@ TRACE_EVENT(mm_migrate_pages, __print_symbolic(__entry->reason, MIGRATE_REASON)) ); +TRACE_EVENT(mm_numa_migrate_ratelimit, + + TP_PROTO(struct task_struct *p, int dst_nid, unsigned long nr_pages), + + TP_ARGS(p, dst_nid, nr_pages), + + TP_STRUCT__entry( + __array( char, comm, TASK_COMM_LEN) + __field( pid_t, pid) + __field( int, dst_nid) + __field( unsigned long, nr_pages) + ), + + TP_fast_assign( + memcpy(__entry->comm, p->comm, TASK_COMM_LEN); + __entry->pid = p->pid; + __entry->dst_nid = dst_nid; + __entry->nr_pages = nr_pages; + ), + + TP_printk("comm=%s pid=%d dst_nid=%d nr_pages=%lu", + __entry->comm, + __entry->pid, + __entry->dst_nid, + __entry->nr_pages) +); #endif /* _TRACE_MIGRATE_H */ /* This part must be outside protection */ diff --git a/mm/migrate.c b/mm/migrate.c index 564d5c9..8dc277d 100644 --- a/mm/migrate.c +++ b/mm/migrate.c @@ -1608,8 +1608,11 @@ static bool numamigrate_update_ratelimit(pg_data_t *pgdat, msecs_to_jiffies(migrate_interval_millisecs); spin_unlock(&pgdat->numabalancing_migrate_lock); } - if (pgdat->numabalancing_migrate_nr_pages > ratelimit_pages) + if (pgdat->numabalancing_migrate_nr_pages > ratelimit_pages) { + trace_mm_numa_migrate_ratelimit(current, pgdat->node_id, + nr_pages); return true; + } /* * This is an unlocked non-atomic update so errors are possible. 
-- 1.8.4 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1755017Ab3LJPvx (ORCPT ); Tue, 10 Dec 2013 10:51:53 -0500 Received: from cantor2.suse.de ([195.135.220.15]:44432 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1754535Ab3LJPvo (ORCPT ); Tue, 10 Dec 2013 10:51:44 -0500 From: Mel Gorman To: Andrew Morton Cc: Alex Thorlton , Rik van Riel , Linux-MM , LKML , Mel Gorman Subject: [PATCH 10/18] mm: numa: Avoid unnecessary disruption of NUMA hinting during migration Date: Tue, 10 Dec 2013 15:51:28 +0000 Message-Id: <1386690695-27380-11-git-send-email-mgorman@suse.de> X-Mailer: git-send-email 1.8.4 In-Reply-To: <1386690695-27380-1-git-send-email-mgorman@suse.de> References: <1386690695-27380-1-git-send-email-mgorman@suse.de> Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org do_huge_pmd_numa_page() handles the case where there is parallel THP migration. However, by the time it is checked the NUMA hinting information has already been disrupted. This patch adds an earlier check with some helpers. 
Cc: stable@vger.kernel.org Signed-off-by: Mel Gorman Reviewed-by: Rik van Riel --- include/linux/migrate.h | 9 +++++++++ mm/huge_memory.c | 22 ++++++++++++++++------ mm/migrate.c | 12 ++++++++++++ 3 files changed, 37 insertions(+), 6 deletions(-) diff --git a/include/linux/migrate.h b/include/linux/migrate.h index f5096b5..b7717d7 100644 --- a/include/linux/migrate.h +++ b/include/linux/migrate.h @@ -90,10 +90,19 @@ static inline int migrate_huge_page_move_mapping(struct address_space *mapping, #endif /* CONFIG_MIGRATION */ #ifdef CONFIG_NUMA_BALANCING +extern bool pmd_trans_migrating(pmd_t pmd); +extern void wait_migrate_huge_page(struct anon_vma *anon_vma, pmd_t *pmd); extern int migrate_misplaced_page(struct page *page, struct vm_area_struct *vma, int node); extern bool migrate_ratelimited(int node); #else +static inline bool pmd_trans_migrating(pmd_t pmd) +{ + return false; +} +static inline void wait_migrate_huge_page(struct anon_vma *anon_vma, pmd_t *pmd) +{ +} static inline int migrate_misplaced_page(struct page *page, struct vm_area_struct *vma, int node) { diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 0ecaba2..e3b6a75 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -882,6 +882,10 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm, ret = 0; goto out_unlock; } + + /* mmap_sem prevents this happening but warn if that changes */ + WARN_ON(pmd_trans_migrating(pmd)); + if (unlikely(pmd_trans_splitting(pmd))) { /* split huge page running from under us */ spin_unlock(src_ptl); @@ -1299,6 +1303,17 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma, if (unlikely(!pmd_same(pmd, *pmdp))) goto out_unlock; + /* + * If there are potential migrations, wait for completion and retry + * without disrupting NUMA hinting information. Do not relock and + * check_same as the page may no longer be mapped. 
+ */ + if (unlikely(pmd_trans_migrating(*pmdp))) { + spin_unlock(ptl); + wait_migrate_huge_page(vma->anon_vma, pmdp); + goto out; + } + page = pmd_page(pmd); BUG_ON(is_huge_zero_page(page)); page_nid = page_to_nid(page); @@ -1329,12 +1344,7 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma, goto clear_pmdnuma; } - /* - * If there are potential migrations, wait for completion and retry. We - * do not relock and check_same as the page may no longer be mapped. - * Furtermore, even if the page is currently misplaced, there is no - * guarantee it is still misplaced after the migration completes. - */ + /* Migration could have started since the pmd_trans_migrating check */ if (!page_locked) { spin_unlock(ptl); wait_on_page_locked(page); diff --git a/mm/migrate.c b/mm/migrate.c index a987525..cfb4190 100644 --- a/mm/migrate.c +++ b/mm/migrate.c @@ -1655,6 +1655,18 @@ int numamigrate_isolate_page(pg_data_t *pgdat, struct page *page) return 1; } +bool pmd_trans_migrating(pmd_t pmd) +{ + struct page *page = pmd_page(pmd); + return PageLocked(page); +} + +void wait_migrate_huge_page(struct anon_vma *anon_vma, pmd_t *pmd) +{ + struct page *page = pmd_page(*pmd); + wait_on_page_locked(page); +} + /* * Attempt to migrate a misplaced page to the specified destination * node. 
Caller is expected to have an elevated reference count on -- 1.8.4 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1754990Ab3LJPvt (ORCPT ); Tue, 10 Dec 2013 10:51:49 -0500 Received: from cantor2.suse.de ([195.135.220.15]:44390 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1754029Ab3LJPvj (ORCPT ); Tue, 10 Dec 2013 10:51:39 -0500 From: Mel Gorman To: Andrew Morton Cc: Alex Thorlton , Rik van Riel , Linux-MM , LKML , Mel Gorman Subject: [PATCH 02/18] mm: numa: Call MMU notifiers on THP migration Date: Tue, 10 Dec 2013 15:51:20 +0000 Message-Id: <1386690695-27380-3-git-send-email-mgorman@suse.de> X-Mailer: git-send-email 1.8.4 In-Reply-To: <1386690695-27380-1-git-send-email-mgorman@suse.de> References: <1386690695-27380-1-git-send-email-mgorman@suse.de> Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org MMU notifiers must be called on THP page migration or secondary MMUs will get very confused. 
Cc: stable@vger.kernel.org Signed-off-by: Mel Gorman Reviewed-by: Rik van Riel --- mm/migrate.c | 22 ++++++++++++++-------- 1 file changed, 14 insertions(+), 8 deletions(-) diff --git a/mm/migrate.c b/mm/migrate.c index 2cabbd5..be787d5 100644 --- a/mm/migrate.c +++ b/mm/migrate.c @@ -36,6 +36,7 @@ #include #include #include +#include #include @@ -1716,12 +1717,13 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm, struct page *page, int node) { spinlock_t *ptl; - unsigned long haddr = address & HPAGE_PMD_MASK; pg_data_t *pgdat = NODE_DATA(node); int isolated = 0; struct page *new_page = NULL; struct mem_cgroup *memcg = NULL; int page_lru = page_is_file_cache(page); + unsigned long mmun_start = address & HPAGE_PMD_MASK; + unsigned long mmun_end = mmun_start + HPAGE_PMD_SIZE; pmd_t orig_entry; /* @@ -1756,10 +1758,12 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm, WARN_ON(PageLRU(new_page)); /* Recheck the target PMD */ + mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end); ptl = pmd_lock(mm, pmd); if (unlikely(!pmd_same(*pmd, entry) || page_count(page) != 2)) { fail_putback: spin_unlock(ptl); + mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end); /* Reverse changes made by migrate_page_copy() */ if (TestClearPageActive(new_page)) @@ -1800,15 +1804,16 @@ fail_putback: * The SetPageUptodate on the new page and page_add_new_anon_rmap * guarantee the copy is visible before the pagetable update. 
*/ - flush_cache_range(vma, haddr, haddr + HPAGE_PMD_SIZE); - page_add_new_anon_rmap(new_page, vma, haddr); - pmdp_clear_flush(vma, haddr, pmd); - set_pmd_at(mm, haddr, pmd, entry); + flush_cache_range(vma, mmun_start, mmun_end); + page_add_new_anon_rmap(new_page, vma, mmun_start); + pmdp_clear_flush(vma, mmun_start, pmd); + set_pmd_at(mm, mmun_start, pmd, entry); + flush_tlb_range(vma, mmun_start, mmun_end); update_mmu_cache_pmd(vma, address, &entry); if (page_count(page) != 2) { - set_pmd_at(mm, haddr, pmd, orig_entry); - flush_tlb_range(vma, haddr, haddr + HPAGE_PMD_SIZE); + set_pmd_at(mm, mmun_start, pmd, orig_entry); + flush_tlb_range(vma, mmun_start, mmun_end); update_mmu_cache_pmd(vma, address, &entry); page_remove_rmap(new_page); goto fail_putback; @@ -1823,6 +1828,7 @@ fail_putback: */ mem_cgroup_end_migration(memcg, page, new_page, true); spin_unlock(ptl); + mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end); unlock_page(new_page); unlock_page(page); @@ -1843,7 +1849,7 @@ out_dropref: ptl = pmd_lock(mm, pmd); if (pmd_same(*pmd, entry)) { entry = pmd_mknonnuma(entry); - set_pmd_at(mm, haddr, pmd, entry); + set_pmd_at(mm, mmun_start, pmd, entry); update_mmu_cache_pmd(vma, address, &entry); } spin_unlock(ptl); -- 1.8.4 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1755225Ab3LJPwk (ORCPT ); Tue, 10 Dec 2013 10:52:40 -0500 Received: from cantor2.suse.de ([195.135.220.15]:44467 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1754982Ab3LJPvt (ORCPT ); Tue, 10 Dec 2013 10:51:49 -0500 From: Mel Gorman To: Andrew Morton Cc: Alex Thorlton , Rik van Riel , Linux-MM , LKML , Mel Gorman Subject: [PATCH 16/18] mm: numa: Do not automatically migrate KSM pages Date: Tue, 10 Dec 2013 15:51:34 +0000 Message-Id: <1386690695-27380-17-git-send-email-mgorman@suse.de> X-Mailer: git-send-email 1.8.4 In-Reply-To: 
<1386690695-27380-1-git-send-email-mgorman@suse.de> References: <1386690695-27380-1-git-send-email-mgorman@suse.de> Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org KSM pages can be shared between tasks that are not necessarily related to each other from a NUMA perspective. This patch causes those pages to be ignored by automatic NUMA balancing so they do not migrate and do not cause unrelated tasks to be grouped together. Signed-off-by: Mel Gorman Reviewed-by: Rik van Riel --- mm/mprotect.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/mm/mprotect.c b/mm/mprotect.c index 9b1be30..c258137 100644 --- a/mm/mprotect.c +++ b/mm/mprotect.c @@ -23,6 +23,7 @@ #include #include #include +#include #include #include #include @@ -63,7 +64,7 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd, ptent = *pte; page = vm_normal_page(vma, addr, oldpte); - if (page) { + if (page && !PageKsm(page)) { if (!pte_numa(oldpte)) { ptent = pte_mknuma(ptent); updated = true; -- 1.8.4 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1755196Ab3LJPwi (ORCPT ); Tue, 10 Dec 2013 10:52:38 -0500 Received: from cantor2.suse.de ([195.135.220.15]:44471 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1754986Ab3LJPvu (ORCPT ); Tue, 10 Dec 2013 10:51:50 -0500 From: Mel Gorman To: Andrew Morton Cc: Alex Thorlton , Rik van Riel , Linux-MM , LKML , Mel Gorman Subject: [PATCH 17/18] sched: Add tracepoints related to NUMA task migration Date: Tue, 10 Dec 2013 15:51:35 +0000 Message-Id: <1386690695-27380-18-git-send-email-mgorman@suse.de> X-Mailer: git-send-email 1.8.4 In-Reply-To: <1386690695-27380-1-git-send-email-mgorman@suse.de> References: <1386690695-27380-1-git-send-email-mgorman@suse.de> Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org This patch adds 
three tracepoints o trace_sched_move_numa when a task is moved to a node o trace_sched_swap_numa when a task is swapped with another task o trace_sched_stick_numa when a numa-related migration fails The tracepoints allow the NUMA scheduler activity to be monitored and the following high-level metrics can be calculated o NUMA migrated stuck nr trace_sched_stick_numa o NUMA migrated idle nr trace_sched_move_numa o NUMA migrated swapped nr trace_sched_swap_numa o NUMA local swapped trace_sched_swap_numa src_nid == dst_nid (should never happen) o NUMA remote swapped trace_sched_swap_numa src_nid != dst_nid (should == NUMA migrated swapped) o NUMA group swapped trace_sched_swap_numa src_ngid == dst_ngid Maybe a small number of these are acceptable but a high number would be a major surprise. It would be even worse if bounces are frequent. o NUMA avg task migs. Average number of migrations for tasks o NUMA stddev task mig Self-explanatory o NUMA max task migs. Maximum number of migrations for a single task In general the intent of the tracepoints is to help diagnose problems where automatic NUMA balancing appears to be doing an excessive amount of useless work. 
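As a rough illustration of how the swap-related metrics above could be derived in post-processing, here is a minimal userspace sketch. It assumes the trace output has already been parsed into records; the struct and function names are hypothetical and are not part of this patch. Only the field names (src_nid, dst_nid, src_ngid, dst_ngid) come from the sched_swap_numa tracepoint.

```c
#include <stddef.h>

/* Hypothetical record holding fields parsed from one sched_swap_numa event. */
struct swap_event {
	int src_ngid, dst_ngid;
	int src_nid, dst_nid;
};

/* Counters matching the swap metrics listed in the changelog. */
struct swap_stats {
	unsigned long swapped;	/* all sched_swap_numa events */
	unsigned long local;	/* src_nid == dst_nid, should never happen */
	unsigned long remote;	/* src_nid != dst_nid */
	unsigned long group;	/* src_ngid == dst_ngid */
};

static struct swap_stats summarize_swaps(const struct swap_event *ev, size_t n)
{
	struct swap_stats s = {0, 0, 0, 0};

	for (size_t i = 0; i < n; i++) {
		s.swapped++;
		if (ev[i].src_nid == ev[i].dst_nid)
			s.local++;
		else
			s.remote++;
		if (ev[i].src_ngid == ev[i].dst_ngid)
			s.group++;
	}
	return s;
}
```

With counters like these, a non-zero local count or a remote count that differs from swapped would be the surprising results described above.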
Signed-off-by: Mel Gorman Reviewed-by: Rik van Riel --- include/trace/events/sched.h | 87 ++++++++++++++++++++++++++++++++++++++++++++ kernel/sched/core.c | 2 + kernel/sched/fair.c | 6 ++- 3 files changed, 93 insertions(+), 2 deletions(-) diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h index 04c3084..67e1bbf 100644 --- a/include/trace/events/sched.h +++ b/include/trace/events/sched.h @@ -443,6 +443,93 @@ TRACE_EVENT(sched_process_hang, ); #endif /* CONFIG_DETECT_HUNG_TASK */ +DECLARE_EVENT_CLASS(sched_move_task_template, + + TP_PROTO(struct task_struct *tsk, int src_cpu, int dst_cpu), + + TP_ARGS(tsk, src_cpu, dst_cpu), + + TP_STRUCT__entry( + __field( pid_t, pid ) + __field( pid_t, tgid ) + __field( pid_t, ngid ) + __field( int, src_cpu ) + __field( int, src_nid ) + __field( int, dst_cpu ) + __field( int, dst_nid ) + ), + + TP_fast_assign( + __entry->pid = task_pid_nr(tsk); + __entry->tgid = task_tgid_nr(tsk); + __entry->ngid = task_numa_group_id(tsk); + __entry->src_cpu = src_cpu; + __entry->src_nid = cpu_to_node(src_cpu); + __entry->dst_cpu = dst_cpu; + __entry->dst_nid = cpu_to_node(dst_cpu); + ), + + TP_printk("pid=%d tgid=%d ngid=%d src_cpu=%d src_nid=%d dst_cpu=%d dst_nid=%d", + __entry->pid, __entry->tgid, __entry->ngid, + __entry->src_cpu, __entry->src_nid, + __entry->dst_cpu, __entry->dst_nid) +); + +/* + * Tracks migration of tasks from one runqueue to another. 
Can be used to + * detect if automatic NUMA balancing is bouncing between nodes + */ +DEFINE_EVENT(sched_move_task_template, sched_move_numa, + TP_PROTO(struct task_struct *tsk, int src_cpu, int dst_cpu), + + TP_ARGS(tsk, src_cpu, dst_cpu) +); + +DEFINE_EVENT(sched_move_task_template, sched_stick_numa, + TP_PROTO(struct task_struct *tsk, int src_cpu, int dst_cpu), + + TP_ARGS(tsk, src_cpu, dst_cpu) +); + +TRACE_EVENT(sched_swap_numa, + + TP_PROTO(struct task_struct *src_tsk, int src_cpu, + struct task_struct *dst_tsk, int dst_cpu), + + TP_ARGS(src_tsk, src_cpu, dst_tsk, dst_cpu), + + TP_STRUCT__entry( + __field( pid_t, src_pid ) + __field( pid_t, src_tgid ) + __field( pid_t, src_ngid ) + __field( int, src_cpu ) + __field( int, src_nid ) + __field( pid_t, dst_pid ) + __field( pid_t, dst_tgid ) + __field( pid_t, dst_ngid ) + __field( int, dst_cpu ) + __field( int, dst_nid ) + ), + + TP_fast_assign( + __entry->src_pid = task_pid_nr(src_tsk); + __entry->src_tgid = task_tgid_nr(src_tsk); + __entry->src_ngid = task_numa_group_id(src_tsk); + __entry->src_cpu = src_cpu; + __entry->src_nid = cpu_to_node(src_cpu); + __entry->dst_pid = task_pid_nr(dst_tsk); + __entry->dst_tgid = task_tgid_nr(dst_tsk); + __entry->dst_ngid = task_numa_group_id(dst_tsk); + __entry->dst_cpu = dst_cpu; + __entry->dst_nid = cpu_to_node(dst_cpu); + ), + + TP_printk("src_pid=%d src_tgid=%d src_ngid=%d src_cpu=%d src_nid=%d dst_pid=%d dst_tgid=%d dst_ngid=%d dst_cpu=%d dst_nid=%d", + __entry->src_pid, __entry->src_tgid, __entry->src_ngid, + __entry->src_cpu, __entry->src_nid, + __entry->dst_pid, __entry->dst_tgid, __entry->dst_ngid, + __entry->dst_cpu, __entry->dst_nid) +); #endif /* _TRACE_SCHED_H */ /* This part must be outside protection */ diff --git a/kernel/sched/core.c b/kernel/sched/core.c index e85cda2..e485d2b 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -1108,6 +1108,7 @@ int migrate_swap(struct task_struct *cur, struct task_struct *p) if (!cpumask_test_cpu(arg.src_cpu, 
tsk_cpus_allowed(arg.dst_task))) goto out; + trace_sched_swap_numa(cur, arg.src_cpu, p, arg.dst_cpu); ret = stop_two_cpus(arg.dst_cpu, arg.src_cpu, migrate_swap_stop, &arg); out: @@ -4090,6 +4091,7 @@ int migrate_task_to(struct task_struct *p, int target_cpu) /* TODO: This is not properly updating schedstats */ + trace_sched_move_numa(p, curr_cpu, target_cpu); return stop_one_cpu(curr_cpu, migration_cpu_stop, &arg); } diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 18bf84e..26fe588 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -1272,11 +1272,13 @@ static int task_numa_migrate(struct task_struct *p) p->numa_scan_period = task_scan_min(p); if (env.best_task == NULL) { - int ret = migrate_task_to(p, env.best_cpu); + if ((ret = migrate_task_to(p, env.best_cpu)) != 0) + trace_sched_stick_numa(p, env.src_cpu, env.best_cpu); return ret; } - ret = migrate_swap(p, env.best_task); + if ((ret = migrate_swap(p, env.best_task)) != 0) + trace_sched_stick_numa(p, env.src_cpu, task_cpu(env.best_task)); put_task_struct(env.best_task); return ret; } -- 1.8.4 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1755267Ab3LJPx1 (ORCPT ); Tue, 10 Dec 2013 10:53:27 -0500 Received: from cantor2.suse.de ([195.135.220.15]:44455 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1754934Ab3LJPvr (ORCPT ); Tue, 10 Dec 2013 10:51:47 -0500 From: Mel Gorman To: Andrew Morton Cc: Alex Thorlton , Rik van Riel , Linux-MM , LKML , Mel Gorman Subject: [PATCH 14/18] mm: numa: Limit scope of lock for NUMA migrate rate limiting Date: Tue, 10 Dec 2013 15:51:32 +0000 Message-Id: <1386690695-27380-15-git-send-email-mgorman@suse.de> X-Mailer: git-send-email 1.8.4 In-Reply-To: <1386690695-27380-1-git-send-email-mgorman@suse.de> References: <1386690695-27380-1-git-send-email-mgorman@suse.de> Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: 
linux-kernel@vger.kernel.org NUMA migrate rate limiting protects a migration counter and window using a lock but in some cases this can be a contended lock. It is not critical that the number of pages be perfect, lost updates are acceptable. Reduce the importance of this lock. Signed-off-by: Mel Gorman Reviewed-by: Rik van Riel --- include/linux/mmzone.h | 5 +---- mm/migrate.c | 21 ++++++++++++--------- 2 files changed, 13 insertions(+), 13 deletions(-) diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h index bd791e4..b835d3f 100644 --- a/include/linux/mmzone.h +++ b/include/linux/mmzone.h @@ -758,10 +758,7 @@ typedef struct pglist_data { int kswapd_max_order; enum zone_type classzone_idx; #ifdef CONFIG_NUMA_BALANCING - /* - * Lock serializing the per destination node AutoNUMA memory - * migration rate limiting data. - */ + /* Lock serializing the migrate rate limiting window */ spinlock_t numabalancing_migrate_lock; /* Rate limiting time interval */ diff --git a/mm/migrate.c b/mm/migrate.c index b6eef65..564d5c9 100644 --- a/mm/migrate.c +++ b/mm/migrate.c @@ -1596,26 +1596,29 @@ bool migrate_ratelimited(int node) static bool numamigrate_update_ratelimit(pg_data_t *pgdat, unsigned long nr_pages) { - bool rate_limited = false; - /* * Rate-limit the amount of data that is being migrated to a node. * Optimal placement is no good if the memory bus is saturated and * all the time is being spent migrating! 
*/ - spin_lock(&pgdat->numabalancing_migrate_lock); if (time_after(jiffies, pgdat->numabalancing_migrate_next_window)) { + spin_lock(&pgdat->numabalancing_migrate_lock); pgdat->numabalancing_migrate_nr_pages = 0; pgdat->numabalancing_migrate_next_window = jiffies + msecs_to_jiffies(migrate_interval_millisecs); + spin_unlock(&pgdat->numabalancing_migrate_lock); } if (pgdat->numabalancing_migrate_nr_pages > ratelimit_pages) - rate_limited = true; - else - pgdat->numabalancing_migrate_nr_pages += nr_pages; - spin_unlock(&pgdat->numabalancing_migrate_lock); - - return rate_limited; + return true; + + /* + * This is an unlocked non-atomic update so errors are possible. + * The consequences are failing to migrate when we potentially should + * have which is not severe enough to warrant locking. If it is ever + * a problem, it can be converted to a per-cpu counter. + */ + pgdat->numabalancing_migrate_nr_pages += nr_pages; + return false; } static int numamigrate_isolate_page(pg_data_t *pgdat, struct page *page) -- 1.8.4 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1754919Ab3LJPvq (ORCPT ); Tue, 10 Dec 2013 10:51:46 -0500 Received: from cantor2.suse.de ([195.135.220.15]:44401 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751313Ab3LJPvk (ORCPT ); Tue, 10 Dec 2013 10:51:40 -0500 From: Mel Gorman To: Andrew Morton Cc: Alex Thorlton , Rik van Riel , Linux-MM , LKML , Mel Gorman Subject: [PATCH 04/18] mm: numa: Do not clear PMD during PTE update scan Date: Tue, 10 Dec 2013 15:51:22 +0000 Message-Id: <1386690695-27380-5-git-send-email-mgorman@suse.de> X-Mailer: git-send-email 1.8.4 In-Reply-To: <1386690695-27380-1-git-send-email-mgorman@suse.de> References: <1386690695-27380-1-git-send-email-mgorman@suse.de> Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org If the PMD is flushed then a parallel fault in handle_mm_fault() 
will enter the pmd_none and do_huge_pmd_anonymous_page() path where it'll attempt to insert a huge zero page. This is wasteful so the patch avoids clearing the PMD when setting pmd_numa. Cc: stable@vger.kernel.org Signed-off-by: Mel Gorman Reviewed-by: Rik van Riel --- mm/huge_memory.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/mm/huge_memory.c b/mm/huge_memory.c index deae592..5a5da50 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -1529,7 +1529,7 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd, */ if (!is_huge_zero_page(page) && !pmd_numa(*pmd)) { - entry = pmdp_get_and_clear(mm, addr, pmd); + entry = *pmd; entry = pmd_mknuma(entry); ret = HPAGE_PMD_NR; } -- 1.8.4 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1755293Ab3LJPxt (ORCPT ); Tue, 10 Dec 2013 10:53:49 -0500 Received: from cantor2.suse.de ([195.135.220.15]:44449 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1754923Ab3LJPvq (ORCPT ); Tue, 10 Dec 2013 10:51:46 -0500 From: Mel Gorman To: Andrew Morton Cc: Alex Thorlton , Rik van Riel , Linux-MM , LKML , Mel Gorman Subject: [PATCH 13/18] mm: numa: Make NUMA-migrate related functions static Date: Tue, 10 Dec 2013 15:51:31 +0000 Message-Id: <1386690695-27380-14-git-send-email-mgorman@suse.de> X-Mailer: git-send-email 1.8.4 In-Reply-To: <1386690695-27380-1-git-send-email-mgorman@suse.de> References: <1386690695-27380-1-git-send-email-mgorman@suse.de> Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org numamigrate_update_ratelimit and numamigrate_isolate_page only have callers in mm/migrate.c. This patch makes them static. 
Signed-off-by: Mel Gorman Reviewed-by: Rik van Riel --- mm/migrate.c | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/mm/migrate.c b/mm/migrate.c index 0c4fbf6..b6eef65 100644 --- a/mm/migrate.c +++ b/mm/migrate.c @@ -1593,7 +1593,8 @@ bool migrate_ratelimited(int node) } /* Returns true if the node is migrate rate-limited after the update */ -bool numamigrate_update_ratelimit(pg_data_t *pgdat, unsigned long nr_pages) +static bool numamigrate_update_ratelimit(pg_data_t *pgdat, + unsigned long nr_pages) { bool rate_limited = false; @@ -1617,7 +1618,7 @@ bool numamigrate_update_ratelimit(pg_data_t *pgdat, unsigned long nr_pages) return rate_limited; } -int numamigrate_isolate_page(pg_data_t *pgdat, struct page *page) +static int numamigrate_isolate_page(pg_data_t *pgdat, struct page *page) { int page_lru; -- 1.8.4 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1755328Ab3LJPyZ (ORCPT ); Tue, 10 Dec 2013 10:54:25 -0500 Received: from cantor2.suse.de ([195.135.220.15]:44437 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1754897Ab3LJPvp (ORCPT ); Tue, 10 Dec 2013 10:51:45 -0500 From: Mel Gorman To: Andrew Morton Cc: Alex Thorlton , Rik van Riel , Linux-MM , LKML , Mel Gorman Subject: [PATCH 11/18] mm: fix TLB flush race between migration, and change_protection_range Date: Tue, 10 Dec 2013 15:51:29 +0000 Message-Id: <1386690695-27380-12-git-send-email-mgorman@suse.de> X-Mailer: git-send-email 1.8.4 In-Reply-To: <1386690695-27380-1-git-send-email-mgorman@suse.de> References: <1386690695-27380-1-git-send-email-mgorman@suse.de> Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org From: Rik van Riel There are a few subtle races, between change_protection_range (used by mprotect and change_prot_numa) on one side, and NUMA page migration and compaction on the other side. 
The basic race is that there is a time window between when the PTE gets made non-present (PROT_NONE or NUMA), and the TLB is flushed.

During that time, a CPU may continue writing to the page.

This is fine most of the time; however, compaction or the NUMA migration code may come in and migrate the page away.

When that happens, the CPU may continue writing, through the cached translation, to what is no longer the current memory location of the process.

This only affects x86, which has a somewhat optimistic pte_accessible. All other architectures appear to be safe, and will either always flush, or flush whenever there is a valid mapping, even with no permissions (SPARC).

The basic race looks like this:

CPU A                   CPU B                   CPU C

                                                load TLB entry
make entry PTE/PMD_NUMA
                        fault on entry
                                                read/write old page
                        start migrating page
                        change PTE/PMD to new page
                                                read/write old page [*]
flush TLB
                                                reload TLB from new entry
                                                read/write new page
                                                lose data

[*] the old page may belong to a new user at this point!

The obvious fix is to flush remote TLB entries, by making sure that pte_accessible is aware of the fact that PROT_NONE and PROT_NUMA memory may still be accessible if there is a TLB flush pending for the mm.

This should fix both NUMA migration and compaction.
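
A minimal userspace sketch of the check described above may help when reading the x86 hunk. It assumes made-up flag values and a stand-in struct; every sk_ and SK_ name is hypothetical and is not a kernel identifier. It only models the decision "could a remote CPU still reach this page through a stale TLB entry?", not the flush itself.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical flag bits for this sketch; NOT the real x86 PTE layout. */
#define SK_PAGE_PRESENT  (1u << 0)
#define SK_PAGE_PROTNONE (1u << 1)
#define SK_PAGE_NUMA     (1u << 2)

/* Stand-in for the single mm_struct field the patch introduces. */
struct sk_mm {
	bool tlb_flush_pending;
};

/*
 * A PTE is treated as accessible if it is present, or if it is marked
 * PROT_NONE/NUMA while a batched TLB flush is still pending for the mm,
 * because another CPU may still hold a stale translation for it.
 */
static bool sk_pte_accessible(const struct sk_mm *mm, uint32_t pte_flags)
{
	if (pte_flags & SK_PAGE_PRESENT)
		return true;

	if ((pte_flags & (SK_PAGE_PROTNONE | SK_PAGE_NUMA)) &&
	    mm->tlb_flush_pending)
		return true;

	return false;
}
```

With the pending flag clear, a NUMA or PROT_NONE entry is reported as inaccessible and the caller may skip the flush; with the flag set, the same entry forces a flush before the page can be moved.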
Cc: stable@vger.kernel.org Signed-off-by: Rik van Riel Signed-off-by: Mel Gorman --- arch/sparc/include/asm/pgtable_64.h | 4 ++-- arch/x86/include/asm/pgtable.h | 11 ++++++++-- include/asm-generic/pgtable.h | 2 +- include/linux/mm_types.h | 44 +++++++++++++++++++++++++++++++++++++ kernel/fork.c | 1 + mm/huge_memory.c | 7 ++++++ mm/mprotect.c | 2 ++ mm/pgtable-generic.c | 5 +++-- 8 files changed, 69 insertions(+), 7 deletions(-) diff --git a/arch/sparc/include/asm/pgtable_64.h b/arch/sparc/include/asm/pgtable_64.h index 8358dc1..0f9e945 100644 --- a/arch/sparc/include/asm/pgtable_64.h +++ b/arch/sparc/include/asm/pgtable_64.h @@ -619,7 +619,7 @@ static inline unsigned long pte_present(pte_t pte) } #define pte_accessible pte_accessible -static inline unsigned long pte_accessible(pte_t a) +static inline unsigned long pte_accessible(struct mm_struct *mm, pte_t a) { return pte_val(a) & _PAGE_VALID; } @@ -847,7 +847,7 @@ static inline void __set_pte_at(struct mm_struct *mm, unsigned long addr, * SUN4V NOTE: _PAGE_VALID is the same value in both the SUN4U * and SUN4V pte layout, so this inline test is fine. 
*/ - if (likely(mm != &init_mm) && pte_accessible(orig)) + if (likely(mm != &init_mm) && pte_accessible(mm, orig)) tlb_batch_add(mm, addr, ptep, orig, fullmm); } diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h index 3d19994..48cab4c 100644 --- a/arch/x86/include/asm/pgtable.h +++ b/arch/x86/include/asm/pgtable.h @@ -452,9 +452,16 @@ static inline int pte_present(pte_t a) } #define pte_accessible pte_accessible -static inline int pte_accessible(pte_t a) +static inline bool pte_accessible(struct mm_struct *mm, pte_t a) { - return pte_flags(a) & _PAGE_PRESENT; + if (pte_flags(a) & _PAGE_PRESENT) + return true; + + if ((pte_flags(a) & (_PAGE_PROTNONE | _PAGE_NUMA)) && + tlb_flush_pending(mm)) + return true; + + return false; } static inline int pte_hidden(pte_t pte) diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h index f330d28..b12079a 100644 --- a/include/asm-generic/pgtable.h +++ b/include/asm-generic/pgtable.h @@ -217,7 +217,7 @@ static inline int pmd_same(pmd_t pmd_a, pmd_t pmd_b) #endif #ifndef pte_accessible -# define pte_accessible(pte) ((void)(pte),1) +# define pte_accessible(mm, pte) ((void)(pte), 1) #endif #ifndef flush_tlb_fix_spurious_fault diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h index bd29941..c122bb1 100644 --- a/include/linux/mm_types.h +++ b/include/linux/mm_types.h @@ -443,6 +443,14 @@ struct mm_struct { /* numa_scan_seq prevents two threads setting pte_numa */ int numa_scan_seq; #endif +#if defined(CONFIG_NUMA_BALANCING) || defined(CONFIG_COMPACTION) + /* + * An operation with batched TLB flushing is going on. Anything that + * can move process memory needs to flush the TLB when moving a + * PROT_NONE or PROT_NUMA mapped page. 
+ */ + bool tlb_flush_pending; +#endif struct uprobes_state uprobes_state; }; @@ -459,4 +467,40 @@ static inline cpumask_t *mm_cpumask(struct mm_struct *mm) return mm->cpu_vm_mask_var; } +#if defined(CONFIG_NUMA_BALANCING) || defined(CONFIG_COMPACTION) +/* + * Memory barriers to keep this state in sync are graciously provided by + * the page table locks, outside of which no page table modifications happen. + * The barriers below prevent the compiler from re-ordering the instructions + * around the memory barriers that are already present in the code. + */ +static inline bool tlb_flush_pending(struct mm_struct *mm) +{ + barrier(); + return mm->tlb_flush_pending; +} +static inline void set_tlb_flush_pending(struct mm_struct *mm) +{ + mm->tlb_flush_pending = true; + barrier(); +} +/* Clearing is done after a TLB flush, which also provides a barrier. */ +static inline void clear_tlb_flush_pending(struct mm_struct *mm) +{ + barrier(); + mm->tlb_flush_pending = false; +} +#else +static inline bool tlb_flush_pending(struct mm_struct *mm) +{ + return false; +} +static inline void set_tlb_flush_pending(struct mm_struct *mm) +{ +} +static inline void clear_tlb_flush_pending(struct mm_struct *mm) +{ +} +#endif + #endif /* _LINUX_MM_TYPES_H */ diff --git a/kernel/fork.c b/kernel/fork.c index 728d5be..5721f0e 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -537,6 +537,7 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p) spin_lock_init(&mm->page_table_lock); mm_init_aio(mm); mm_init_owner(mm, p); + clear_tlb_flush_pending(mm); if (likely(!mm_alloc_pgd(mm))) { mm->def_flags = 0; diff --git a/mm/huge_memory.c b/mm/huge_memory.c index e3b6a75..e3a5ee2 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -1377,6 +1377,13 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma, } /* + * The page_table_lock above provides a memory barrier + * with change_protection_range. 
+ */ + if (tlb_flush_pending(mm)) + flush_tlb_range(vma, haddr, haddr + HPAGE_PMD_SIZE); + + /* * Migrate the THP to the requested node, returns with page unlocked * and pmd_numa cleared. */ diff --git a/mm/mprotect.c b/mm/mprotect.c index eb2f349..9b1be30 100644 --- a/mm/mprotect.c +++ b/mm/mprotect.c @@ -187,6 +187,7 @@ static unsigned long change_protection_range(struct vm_area_struct *vma, BUG_ON(addr >= end); pgd = pgd_offset(mm, addr); flush_cache_range(vma, addr, end); + set_tlb_flush_pending(mm); do { next = pgd_addr_end(addr, end); if (pgd_none_or_clear_bad(pgd)) @@ -198,6 +199,7 @@ static unsigned long change_protection_range(struct vm_area_struct *vma, /* Only flush the TLB if we actually modified any entries: */ if (pages) flush_tlb_range(vma, start, end); + clear_tlb_flush_pending(mm); return pages; } diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c index e84cad2..a8b9199 100644 --- a/mm/pgtable-generic.c +++ b/mm/pgtable-generic.c @@ -110,9 +110,10 @@ int pmdp_clear_flush_young(struct vm_area_struct *vma, pte_t ptep_clear_flush(struct vm_area_struct *vma, unsigned long address, pte_t *ptep) { + struct mm_struct *mm = (vma)->vm_mm; pte_t pte; - pte = ptep_get_and_clear((vma)->vm_mm, address, ptep); - if (pte_accessible(pte)) + pte = ptep_get_and_clear(mm, address, ptep); + if (pte_accessible(mm, pte)) flush_tlb_page(vma, address); return pte; } -- 1.8.4 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1754894Ab3LJPvo (ORCPT ); Tue, 10 Dec 2013 10:51:44 -0500 Received: from cantor2.suse.de ([195.135.220.15]:44395 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1754535Ab3LJPvj (ORCPT ); Tue, 10 Dec 2013 10:51:39 -0500 From: Mel Gorman To: Andrew Morton Cc: Alex Thorlton , Rik van Riel , Linux-MM , LKML , Mel Gorman Subject: [PATCH 03/18] mm: Clear pmd_numa before invalidating Date: Tue, 10 Dec 2013 15:51:21 +0000 Message-Id: 
<1386690695-27380-4-git-send-email-mgorman@suse.de> X-Mailer: git-send-email 1.8.4 In-Reply-To: <1386690695-27380-1-git-send-email-mgorman@suse.de> References: <1386690695-27380-1-git-send-email-mgorman@suse.de> Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org pmdp_invalidate clears the present bit without taking into account that it might be in the _PAGE_NUMA bit leaving the PMD in an unexpected state. Clear pmd_numa before invalidating. Cc: stable@vger.kernel.org Signed-off-by: Mel Gorman Reviewed-by: Rik van Riel --- mm/pgtable-generic.c | 3 +++ 1 file changed, 3 insertions(+) diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c index cbb3854..e84cad2 100644 --- a/mm/pgtable-generic.c +++ b/mm/pgtable-generic.c @@ -191,6 +191,9 @@ pgtable_t pgtable_trans_huge_withdraw(struct mm_struct *mm, pmd_t *pmdp) void pmdp_invalidate(struct vm_area_struct *vma, unsigned long address, pmd_t *pmdp) { + pmd_t entry = *pmdp; + if (pmd_numa(entry)) + entry = pmd_mknonnuma(entry); set_pmd_at(vma->vm_mm, address, pmdp, pmd_mknotpresent(*pmdp)); flush_tlb_range(vma, address, address + HPAGE_PMD_SIZE); } -- 1.8.4 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1755349Ab3LJPyv (ORCPT ); Tue, 10 Dec 2013 10:54:51 -0500 Received: from cantor2.suse.de ([195.135.220.15]:44427 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1754866Ab3LJPvo (ORCPT ); Tue, 10 Dec 2013 10:51:44 -0500 From: Mel Gorman To: Andrew Morton Cc: Alex Thorlton , Rik van Riel , Linux-MM , LKML , Mel Gorman Subject: [PATCH 09/18] mm: numa: Clear numa hinting information on mprotect Date: Tue, 10 Dec 2013 15:51:27 +0000 Message-Id: <1386690695-27380-10-git-send-email-mgorman@suse.de> X-Mailer: git-send-email 1.8.4 In-Reply-To: <1386690695-27380-1-git-send-email-mgorman@suse.de> References: <1386690695-27380-1-git-send-email-mgorman@suse.de> Sender: 
linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On a protection change it is no longer clear if the page should be still accessible. This patch clears the NUMA hinting fault bits on a protection change. Cc: stable@vger.kernel.org Signed-off-by: Mel Gorman Reviewed-by: Rik van Riel --- mm/huge_memory.c | 2 ++ mm/mprotect.c | 2 ++ 2 files changed, 4 insertions(+) diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 0f00b96..0ecaba2 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -1522,6 +1522,8 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd, ret = 1; if (!prot_numa) { entry = pmdp_get_and_clear(mm, addr, pmd); + if (pmd_numa(entry)) + entry = pmd_mknonnuma(entry); entry = pmd_modify(entry, newprot); ret = HPAGE_PMD_NR; BUG_ON(pmd_write(entry)); diff --git a/mm/mprotect.c b/mm/mprotect.c index 0a07e2d..eb2f349 100644 --- a/mm/mprotect.c +++ b/mm/mprotect.c @@ -54,6 +54,8 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd, if (!prot_numa) { ptent = ptep_modify_prot_start(mm, addr, pte); + if (pte_numa(ptent)) + ptent = pte_mknonnuma(ptent); ptent = pte_modify(ptent, newprot); updated = true; } else { -- 1.8.4 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1755374Ab3LJPzH (ORCPT ); Tue, 10 Dec 2013 10:55:07 -0500 Received: from cantor2.suse.de ([195.135.220.15]:44422 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1754608Ab3LJPvn (ORCPT ); Tue, 10 Dec 2013 10:51:43 -0500 From: Mel Gorman To: Andrew Morton Cc: Alex Thorlton , Rik van Riel , Linux-MM , LKML , Mel Gorman Subject: [PATCH 08/18] sched: numa: Skip inaccessible VMAs Date: Tue, 10 Dec 2013 15:51:26 +0000 Message-Id: <1386690695-27380-9-git-send-email-mgorman@suse.de> X-Mailer: git-send-email 1.8.4 In-Reply-To: <1386690695-27380-1-git-send-email-mgorman@suse.de> References: 
<1386690695-27380-1-git-send-email-mgorman@suse.de> Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Inaccessible VMA should not be trapping NUMA hint faults. Skip them. Cc: stable@vger.kernel.org Signed-off-by: Mel Gorman Reviewed-by: Rik van Riel --- kernel/sched/fair.c | 7 +++++++ 1 file changed, 7 insertions(+) diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index fd773ad..18bf84e 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -1752,6 +1752,13 @@ void task_numa_work(struct callback_head *work) (vma->vm_file && (vma->vm_flags & (VM_READ|VM_WRITE)) == (VM_READ))) continue; + /* + * Skip inaccessible VMAs to avoid any confusion between + * PROT_NONE and NUMA hinting ptes + */ + if (!(vma->vm_flags & (VM_READ | VM_EXEC | VM_WRITE))) + continue; + do { start = max(start, vma->vm_start); end = ALIGN(start + (pages << PAGE_SHIFT), HPAGE_SIZE); -- 1.8.4 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1754800Ab3LJP4b (ORCPT ); Tue, 10 Dec 2013 10:56:31 -0500 Received: from cantor2.suse.de ([195.135.220.15]:44417 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1754583Ab3LJPvm (ORCPT ); Tue, 10 Dec 2013 10:51:42 -0500 From: Mel Gorman To: Andrew Morton Cc: Alex Thorlton , Rik van Riel , Linux-MM , LKML , Mel Gorman Subject: [PATCH 07/18] mm: numa: Avoid unnecessary work on the failure path Date: Tue, 10 Dec 2013 15:51:25 +0000 Message-Id: <1386690695-27380-8-git-send-email-mgorman@suse.de> X-Mailer: git-send-email 1.8.4 In-Reply-To: <1386690695-27380-1-git-send-email-mgorman@suse.de> References: <1386690695-27380-1-git-send-email-mgorman@suse.de> Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org If a PMD changes during a THP migration then migration aborts but the failure path is doing more work than is necessary. 
Cc: stable@vger.kernel.org Signed-off-by: Mel Gorman Reviewed-by: Rik van Riel --- mm/migrate.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/mm/migrate.c b/mm/migrate.c index be787d5..a987525 100644 --- a/mm/migrate.c +++ b/mm/migrate.c @@ -1780,7 +1780,8 @@ fail_putback: putback_lru_page(page); mod_zone_page_state(page_zone(page), NR_ISOLATED_ANON + page_lru, -HPAGE_PMD_NR); - goto out_fail; + + goto out_unlock; } /* @@ -1854,6 +1855,7 @@ out_dropref: } spin_unlock(ptl); +out_unlock: unlock_page(page); put_page(page); return 0; -- 1.8.4 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1755402Ab3LJP4k (ORCPT ); Tue, 10 Dec 2013 10:56:40 -0500 Received: from cantor2.suse.de ([195.135.220.15]:44649 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753817Ab3LJP4g (ORCPT ); Tue, 10 Dec 2013 10:56:36 -0500 Date: Tue, 10 Dec 2013 15:56:33 +0000 From: Mel Gorman To: Andrew Morton Cc: Alex Thorlton , Rik van Riel , Linux-MM , LKML Subject: Re: [PATCH 00/17] NUMA balancing segmentation fault fixes and misc followups v4 Message-ID: <20131210155633.GL11295@suse.de> References: <1386690695-27380-1-git-send-email-mgorman@suse.de> MIME-Version: 1.0 Content-Type: text/plain; charset=iso-8859-15 Content-Disposition: inline In-Reply-To: <1386690695-27380-1-git-send-email-mgorman@suse.de> User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Tue, Dec 10, 2013 at 03:51:18PM +0000, Mel Gorman wrote: > Changelog since V3 > o Dropped a tracing patch > o Rebased to 3.13-rc3 > o Removed unnecessary ptl acquisition > *sigh* There really are only 17 patches in the series. 18/18 does not exist. 
-- Mel Gorman SUSE Labs From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1755488Ab3LJP46 (ORCPT ); Tue, 10 Dec 2013 10:56:58 -0500 Received: from cantor2.suse.de ([195.135.220.15]:44405 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751687Ab3LJPvl (ORCPT ); Tue, 10 Dec 2013 10:51:41 -0500 From: Mel Gorman To: Andrew Morton Cc: Alex Thorlton , Rik van Riel , Linux-MM , LKML , Mel Gorman Subject: [PATCH 05/18] mm: numa: Do not clear PTE for pte_numa update Date: Tue, 10 Dec 2013 15:51:23 +0000 Message-Id: <1386690695-27380-6-git-send-email-mgorman@suse.de> X-Mailer: git-send-email 1.8.4 In-Reply-To: <1386690695-27380-1-git-send-email-mgorman@suse.de> References: <1386690695-27380-1-git-send-email-mgorman@suse.de> Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org The TLB must be flushed if the PTE is updated but change_pte_range is clearing the PTE while marking PTEs pte_numa without necessarily flushing the TLB if it reinserts the same entry. Without the flush, it's conceivable that two processors have different TLBs for the same virtual address and at the very least it would generate spurious faults. This patch only unmaps the pages in change_pte_range for a full protection change. 
Cc: stable@vger.kernel.org Signed-off-by: Mel Gorman Reviewed-by: Rik van Riel --- mm/mprotect.c | 8 ++++++-- 1 file changed, 6 insertions(+), 2 deletions(-) diff --git a/mm/mprotect.c b/mm/mprotect.c index 2666797..0a07e2d 100644 --- a/mm/mprotect.c +++ b/mm/mprotect.c @@ -52,13 +52,14 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd, pte_t ptent; bool updated = false; - ptent = ptep_modify_prot_start(mm, addr, pte); if (!prot_numa) { + ptent = ptep_modify_prot_start(mm, addr, pte); ptent = pte_modify(ptent, newprot); updated = true; } else { struct page *page; + ptent = *pte; page = vm_normal_page(vma, addr, oldpte); if (page) { if (!pte_numa(oldpte)) { @@ -79,7 +80,10 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd, if (updated) pages++; - ptep_modify_prot_commit(mm, addr, pte, ptent); + + /* Only !prot_numa always clears the pte */ + if (!prot_numa) + ptep_modify_prot_commit(mm, addr, pte, ptent); } else if (IS_ENABLED(CONFIG_MIGRATION) && !pte_file(oldpte)) { swp_entry_t entry = pte_to_swp_entry(oldpte); -- 1.8.4 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1755465Ab3LJP44 (ORCPT ); Tue, 10 Dec 2013 10:56:56 -0500 Received: from cantor2.suse.de ([195.135.220.15]:44411 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753064Ab3LJPvl (ORCPT ); Tue, 10 Dec 2013 10:51:41 -0500 From: Mel Gorman To: Andrew Morton Cc: Alex Thorlton , Rik van Riel , Linux-MM , LKML , Mel Gorman Subject: [PATCH 06/18] mm: numa: Ensure anon_vma is locked to prevent parallel THP splits Date: Tue, 10 Dec 2013 15:51:24 +0000 Message-Id: <1386690695-27380-7-git-send-email-mgorman@suse.de> X-Mailer: git-send-email 1.8.4 In-Reply-To: <1386690695-27380-1-git-send-email-mgorman@suse.de> References: <1386690695-27380-1-git-send-email-mgorman@suse.de> Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: 
linux-kernel@vger.kernel.org The anon_vma lock prevents parallel THP splits and any associated complexity that arises when handling splits during THP migration. This patch checks if the lock was successfully acquired and bails from THP migration if it failed for any reason. Cc: stable@vger.kernel.org Signed-off-by: Mel Gorman Reviewed-by: Rik van Riel --- mm/huge_memory.c | 7 +++++++ 1 file changed, 7 insertions(+) diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 5a5da50..0f00b96 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -1359,6 +1359,13 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma, goto out_unlock; } + /* Bail if we fail to protect against THP splits for any reason */ + if (unlikely(!anon_vma)) { + put_page(page); + page_nid = -1; + goto clear_pmdnuma; + } + /* * Migrate the THP to the requested node, returns with page unlocked * and pmd_numa cleared. -- 1.8.4 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1755127Ab3LJP60 (ORCPT ); Tue, 10 Dec 2013 10:58:26 -0500 Received: from cantor2.suse.de ([195.135.220.15]:44385 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753047Ab3LJPvi (ORCPT ); Tue, 10 Dec 2013 10:51:38 -0500 From: Mel Gorman To: Andrew Morton Cc: Alex Thorlton , Rik van Riel , Linux-MM , LKML , Mel Gorman Subject: [PATCH 01/18] mm: numa: Serialise parallel get_user_page against THP migration Date: Tue, 10 Dec 2013 15:51:19 +0000 Message-Id: <1386690695-27380-2-git-send-email-mgorman@suse.de> X-Mailer: git-send-email 1.8.4 In-Reply-To: <1386690695-27380-1-git-send-email-mgorman@suse.de> References: <1386690695-27380-1-git-send-email-mgorman@suse.de> Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Base pages are unmapped and flushed from cache and TLB during normal page migration and replaced with a migration entry that causes any parallel or 
gup to block until migration completes. THP does not unmap pages due to a lack of support for migration entries at a PMD level. This allows races with get_user_pages and get_user_pages_fast which commit 3f926ab94 ("mm: Close races between THP migration and PMD numa clearing") made worse by introducing a pmd_clear_flush(). This patch forces get_user_page (fast and normal) on a pmd_numa page to go through the slow get_user_page path where it will serialise against THP migration and properly account for the NUMA hinting fault. On the migration side the page table lock is taken for each PTE update. Cc: stable@vger.kernel.org Reviewed-by: Rik van Riel Signed-off-by: Mel Gorman --- arch/x86/mm/gup.c | 13 +++++++++++++ mm/huge_memory.c | 24 ++++++++++++++++-------- mm/migrate.c | 38 +++++++++++++++++++++++++++++++------- 3 files changed, 60 insertions(+), 15 deletions(-) diff --git a/arch/x86/mm/gup.c b/arch/x86/mm/gup.c index dd74e46..0596e8e 100644 --- a/arch/x86/mm/gup.c +++ b/arch/x86/mm/gup.c @@ -83,6 +83,12 @@ static noinline int gup_pte_range(pmd_t pmd, unsigned long addr, pte_t pte = gup_get_pte(ptep); struct page *page; + /* Similar to the PMD case, NUMA hinting must take slow path */ + if (pte_numa(pte)) { + pte_unmap(ptep); + return 0; + } + if ((pte_flags(pte) & (mask | _PAGE_SPECIAL)) != mask) { pte_unmap(ptep); return 0; @@ -167,6 +173,13 @@ static int gup_pmd_range(pud_t pud, unsigned long addr, unsigned long end, if (pmd_none(pmd) || pmd_trans_splitting(pmd)) return 0; if (unlikely(pmd_large(pmd))) { + /* + * NUMA hinting faults need to be handled in the GUP + * slowpath for accounting purposes and so that they + * can be serialised against THP migration. 
+ */ + if (pmd_numa(pmd)) + return 0; if (!gup_huge_pmd(pmd, addr, next, write, pages, nr)) return 0; } else { diff --git a/mm/huge_memory.c b/mm/huge_memory.c index bccd5a6..deae592 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -1243,6 +1243,10 @@ struct page *follow_trans_huge_pmd(struct vm_area_struct *vma, if ((flags & FOLL_DUMP) && is_huge_zero_pmd(*pmd)) return ERR_PTR(-EFAULT); + /* Full NUMA hinting faults to serialise migration in fault paths */ + if ((flags & FOLL_NUMA) && pmd_numa(*pmd)) + goto out; + page = pmd_page(*pmd); VM_BUG_ON(!PageHead(page)); if (flags & FOLL_TOUCH) { @@ -1323,23 +1327,27 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma, /* If the page was locked, there are no parallel migrations */ if (page_locked) goto clear_pmdnuma; + } - /* - * Otherwise wait for potential migrations and retry. We do - * relock and check_same as the page may no longer be mapped. - * As the fault is being retried, do not account for it. - */ + /* + * If there are potential migrations, wait for completion and retry. We + * do not relock and check_same as the page may no longer be mapped. + * Furthermore, even if the page is currently misplaced, there is no + * guarantee it is still misplaced after the migration completes. + */ + if (!page_locked) { spin_unlock(ptl); wait_on_page_locked(page); page_nid = -1; goto out; } - /* Page is misplaced, serialise migrations and parallel THP splits */ + /* + * Page is misplaced. Page lock serialises migrations. 
Acquire anon_vma + * to serialise splits + */ get_page(page); spin_unlock(ptl); - if (!page_locked) - lock_page(page); anon_vma = page_lock_anon_vma_read(page); /* Confirm the PMD did not change while page_table_lock was released */ diff --git a/mm/migrate.c b/mm/migrate.c index bb94004..2cabbd5 100644 --- a/mm/migrate.c +++ b/mm/migrate.c @@ -1722,6 +1722,7 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm, struct page *new_page = NULL; struct mem_cgroup *memcg = NULL; int page_lru = page_is_file_cache(page); + pmd_t orig_entry; /* * Rate-limit the amount of data that is being migrated to a node. @@ -1756,7 +1757,8 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm, /* Recheck the target PMD */ ptl = pmd_lock(mm, pmd); - if (unlikely(!pmd_same(*pmd, entry))) { + if (unlikely(!pmd_same(*pmd, entry) || page_count(page) != 2)) { +fail_putback: spin_unlock(ptl); /* Reverse changes made by migrate_page_copy() */ @@ -1786,16 +1788,34 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm, */ mem_cgroup_prepare_migration(page, new_page, &memcg); + orig_entry = *pmd; entry = mk_pmd(new_page, vma->vm_page_prot); - entry = pmd_mknonnuma(entry); - entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma); entry = pmd_mkhuge(entry); + entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma); + /* + * Clear the old entry under pagetable lock and establish the new PTE. + * Any parallel GUP will either observe the old page blocking on the + * page lock, block on the page table lock or observe the new page. + * The SetPageUptodate on the new page and page_add_new_anon_rmap + * guarantee the copy is visible before the pagetable update. 
+ */ + flush_cache_range(vma, haddr, haddr + HPAGE_PMD_SIZE); + page_add_new_anon_rmap(new_page, vma, haddr); pmdp_clear_flush(vma, haddr, pmd); set_pmd_at(mm, haddr, pmd, entry); - page_add_new_anon_rmap(new_page, vma, haddr); update_mmu_cache_pmd(vma, address, &entry); + + if (page_count(page) != 2) { + set_pmd_at(mm, haddr, pmd, orig_entry); + flush_tlb_range(vma, haddr, haddr + HPAGE_PMD_SIZE); + update_mmu_cache_pmd(vma, address, &entry); + page_remove_rmap(new_page); + goto fail_putback; + } + page_remove_rmap(page); + /* * Finish the charge transaction under the page table lock to * prevent split_huge_page() from dividing up the charge @@ -1820,9 +1840,13 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm, out_fail: count_vm_events(PGMIGRATE_FAIL, HPAGE_PMD_NR); out_dropref: - entry = pmd_mknonnuma(entry); - set_pmd_at(mm, haddr, pmd, entry); - update_mmu_cache_pmd(vma, address, &entry); + ptl = pmd_lock(mm, pmd); + if (pmd_same(*pmd, entry)) { + entry = pmd_mknonnuma(entry); + set_pmd_at(mm, haddr, pmd, entry); + update_mmu_cache_pmd(vma, address, &entry); + } + spin_unlock(ptl); unlock_page(page); put_page(page); -- 1.8.4 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1754241Ab3LJQ4T (ORCPT ); Tue, 10 Dec 2013 11:56:19 -0500 Received: from mx1.redhat.com ([209.132.183.28]:9461 "EHLO mx1.redhat.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751791Ab3LJQ4Q (ORCPT ); Tue, 10 Dec 2013 11:56:16 -0500 Message-ID: <52A747A8.9030307@redhat.com> Date: Tue, 10 Dec 2013 11:56:08 -0500 From: Rik van Riel User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:17.0) Gecko/20130625 Thunderbird/17.0.7 MIME-Version: 1.0 To: Mel Gorman CC: Andrew Morton , Alex Thorlton , Linux-MM , LKML Subject: Re: [PATCH 12/18] mm: numa: Defer TLB flush for THP migration as long as possible References: <1386690695-27380-1-git-send-email-mgorman@suse.de> 
<1386690695-27380-13-git-send-email-mgorman@suse.de> In-Reply-To: <1386690695-27380-13-git-send-email-mgorman@suse.de> Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On 12/10/2013 10:51 AM, Mel Gorman wrote: > THP migration can fail for a variety of reasons. Avoid flushing the TLB > to deal with THP migration races until the copy is ready to start. > > Cc: stable@vger.kernel.org > Signed-off-by: Mel Gorman Reviewed-by: Rik van Riel -- All rights reversed From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1751873Ab3LJWWP (ORCPT ); Tue, 10 Dec 2013 17:22:15 -0500 Received: from mail.linuxfoundation.org ([140.211.169.12]:40791 "EHLO mail.linuxfoundation.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751656Ab3LJWWN (ORCPT ); Tue, 10 Dec 2013 17:22:13 -0500 Date: Tue, 10 Dec 2013 14:22:11 -0800 From: Andrew Morton To: Mel Gorman Cc: Alex Thorlton , Rik van Riel , Linux-MM , LKML Subject: Re: [PATCH 17/18] sched: Add tracepoints related to NUMA task migration Message-Id: <20131210142211.099fe782c361707ab3c04742@linux-foundation.org> In-Reply-To: <1386690695-27380-18-git-send-email-mgorman@suse.de> References: <1386690695-27380-1-git-send-email-mgorman@suse.de> <1386690695-27380-18-git-send-email-mgorman@suse.de> X-Mailer: Sylpheed 3.2.0beta5 (GTK+ 2.24.10; x86_64-pc-linux-gnu) Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Tue, 10 Dec 2013 15:51:35 +0000 Mel Gorman wrote: > This patch adds three tracepoints > o trace_sched_move_numa when a task is moved to a node > o trace_sched_swap_numa when a task is swapped with another task > o trace_sched_stick_numa when a numa-related migration fails > > The tracepoints allow 
the NUMA scheduler activity to be monitored and the > following high-level metrics can be calculated > > o NUMA migrated stuck nr trace_sched_stick_numa > o NUMA migrated idle nr trace_sched_move_numa > o NUMA migrated swapped nr trace_sched_swap_numa > o NUMA local swapped trace_sched_swap_numa src_nid == dst_nid (should never happen) > o NUMA remote swapped trace_sched_swap_numa src_nid != dst_nid (should == NUMA migrated swapped) > o NUMA group swapped trace_sched_swap_numa src_ngid == dst_ngid > Maybe a small number of these are acceptable > but a high number would be a major surprise. > It would be even worse if bounces are frequent. > o NUMA avg task migs. Average number of migrations for tasks > o NUMA stddev task mig Self-explanatory > o NUMA max task migs. Maximum number of migrations for a single task > > In general the intent of the tracepoints is to help diagnose problems > where automatic NUMA balancing appears to be doing an excessive amount of > useless work. > > ... > > --- a/kernel/sched/fair.c > +++ b/kernel/sched/fair.c > @@ -1272,11 +1272,13 @@ static int task_numa_migrate(struct task_struct *p) > p->numa_scan_period = task_scan_min(p); > > if (env.best_task == NULL) { > - int ret = migrate_task_to(p, env.best_cpu); > + if ((ret = migrate_task_to(p, env.best_cpu)) != 0) > + trace_sched_stick_numa(p, env.src_cpu, env.best_cpu); > return ret; > } > > - ret = migrate_swap(p, env.best_task); > + if ((ret = migrate_swap(p, env.best_task)) != 0); I'll zap that semicolon... 
> + trace_sched_stick_numa(p, env.src_cpu, task_cpu(env.best_task)); > put_task_struct(env.best_task); > return ret; > } From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1751490Ab3LKIht (ORCPT ); Wed, 11 Dec 2013 03:37:49 -0500 Received: from cantor2.suse.de ([195.135.220.15]:39048 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751178Ab3LKIhs (ORCPT ); Wed, 11 Dec 2013 03:37:48 -0500 Date: Wed, 11 Dec 2013 08:37:45 +0000 From: Mel Gorman To: Andrew Morton Cc: Alex Thorlton , Rik van Riel , Linux-MM , LKML Subject: Re: [PATCH 17/18] sched: Add tracepoints related to NUMA task migration Message-ID: <20131211083744.GP11295@suse.de> References: <1386690695-27380-1-git-send-email-mgorman@suse.de> <1386690695-27380-18-git-send-email-mgorman@suse.de> <20131210142211.099fe782c361707ab3c04742@linux-foundation.org> MIME-Version: 1.0 Content-Type: text/plain; charset=iso-8859-15 Content-Disposition: inline In-Reply-To: <20131210142211.099fe782c361707ab3c04742@linux-foundation.org> User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Tue, Dec 10, 2013 at 02:22:11PM -0800, Andrew Morton wrote: > On Tue, 10 Dec 2013 15:51:35 +0000 Mel Gorman wrote: > > > This patch adds three tracepoints > > o trace_sched_move_numa when a task is moved to a node > > o trace_sched_swap_numa when a task is swapped with another task > > o trace_sched_stick_numa when a numa-related migration fails > > > > The tracepoints allow the NUMA scheduler activity to be monitored and the > > following high-level metrics can be calculated > > > > o NUMA migrated stuck nr trace_sched_stick_numa > > o NUMA migrated idle nr trace_sched_move_numa > > o NUMA migrated swapped nr trace_sched_swap_numa > > o NUMA local swapped trace_sched_swap_numa src_nid == dst_nid (should never happen) > > o NUMA remote swapped 
trace_sched_swap_numa src_nid != dst_nid (should == NUMA migrated swapped) > > o NUMA group swapped trace_sched_swap_numa src_ngid == dst_ngid > > Maybe a small number of these are acceptable > > but a high number would be a major surprise. > > It would be even worse if bounces are frequent. > > o NUMA avg task migs. Average number of migrations for tasks > > o NUMA stddev task mig Self-explanatory > > o NUMA max task migs. Maximum number of migrations for a single task > > > > In general the intent of the tracepoints is to help diagnose problems > > where automatic NUMA balancing appears to be doing an excessive amount of > > useless work. > > > > ... > > > > --- a/kernel/sched/fair.c > > +++ b/kernel/sched/fair.c > > @@ -1272,11 +1272,13 @@ static int task_numa_migrate(struct task_struct *p) > > p->numa_scan_period = task_scan_min(p); > > > > if (env.best_task == NULL) { > > - int ret = migrate_task_to(p, env.best_cpu); > > + if ((ret = migrate_task_to(p, env.best_cpu)) != 0) > > + trace_sched_stick_numa(p, env.src_cpu, env.best_cpu); > > return ret; > > } > > > > - ret = migrate_swap(p, env.best_task); > > + if ((ret = migrate_swap(p, env.best_task)) != 0); > > I'll zap that semicolon... > Thanks -- Mel Gorman SUSE Labs From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1751468Ab3LKNVO (ORCPT ); Wed, 11 Dec 2013 08:21:14 -0500 Received: from cantor2.suse.de ([195.135.220.15]:49300 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751345Ab3LKNVN (ORCPT ); Wed, 11 Dec 2013 08:21:13 -0500 Date: Wed, 11 Dec 2013 13:21:09 +0000 From: Mel Gorman To: Andrew Morton Cc: "Paul E. 
McKenney" , Peter Zijlstra , Alex Thorlton , Rik van Riel , Linux-MM , LKML Subject: [PATCH] mm: numa: Guarantee that tlb_flush_pending updates are visible before page table updates Message-ID: <20131211132109.GB24125@suse.de> References: <1386690695-27380-1-git-send-email-mgorman@suse.de> MIME-Version: 1.0 Content-Type: text/plain; charset=iso-8859-15 Content-Disposition: inline In-Reply-To: <1386690695-27380-1-git-send-email-mgorman@suse.de> User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org According to documentation on barriers, stores issued before a LOCK can complete after the lock implying that it's possible tlb_flush_pending can be visible after a page table update. As per revised documentation, this patch adds a smp_mb__before_spinlock to guarantee the correct ordering. Cc: stable@vger.kernel.org Signed-off-by: Mel Gorman --- include/linux/mm_types.h | 7 ++++++- 1 file changed, 6 insertions(+), 1 deletion(-) diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h index c122bb1..a12f2ab 100644 --- a/include/linux/mm_types.h +++ b/include/linux/mm_types.h @@ -482,7 +482,12 @@ static inline bool tlb_flush_pending(struct mm_struct *mm) static inline void set_tlb_flush_pending(struct mm_struct *mm) { mm->tlb_flush_pending = true; - barrier(); + + /* + * Guarantee that the tlb_flush_pending store does not leak into the + * critical section updating the page tables + */ + smp_mb__before_spinlock(); } /* Clearing is done after a TLB flush, which also provides a barrier. 
*/ static inline void clear_tlb_flush_pending(struct mm_struct *mm) From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1751887Ab3LKOoy (ORCPT ); Wed, 11 Dec 2013 09:44:54 -0500 Received: from e32.co.us.ibm.com ([32.97.110.150]:39415 "EHLO e32.co.us.ibm.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1750707Ab3LKOow (ORCPT ); Wed, 11 Dec 2013 09:44:52 -0500 Date: Wed, 11 Dec 2013 06:44:47 -0800 From: "Paul E. McKenney" To: Mel Gorman Cc: Andrew Morton , Peter Zijlstra , Alex Thorlton , Rik van Riel , Linux-MM , LKML Subject: Re: [PATCH] mm: numa: Guarantee that tlb_flush_pending updates are visible before page table updates Message-ID: <20131211144446.GP4208@linux.vnet.ibm.com> Reply-To: paulmck@linux.vnet.ibm.com References: <1386690695-27380-1-git-send-email-mgorman@suse.de> <20131211132109.GB24125@suse.de> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20131211132109.GB24125@suse.de> User-Agent: Mutt/1.5.21 (2010-09-15) X-TM-AS-MML: disable X-Content-Scanned: Fidelis XPS MAILER x-cbid: 13121114-0928-0000-0000-0000048D628F Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Wed, Dec 11, 2013 at 01:21:09PM +0000, Mel Gorman wrote: > According to documentation on barriers, stores issued before a LOCK can > complete after the lock implying that it's possible tlb_flush_pending can > be visible after a page table update. As per revised documentation, this patch > adds a smp_mb__before_spinlock to guarantee the correct ordering. > > Cc: stable@vger.kernel.org > Signed-off-by: Mel Gorman Assuming that there is a lock acquisition after calls to set_tlb_flush_pending(): Acked-by: Paul E. McKenney (I don't see set_tlb_flush_pending() in mainline.) 
> --- > include/linux/mm_types.h | 7 ++++++- > 1 file changed, 6 insertions(+), 1 deletion(-) > > diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h > index c122bb1..a12f2ab 100644 > --- a/include/linux/mm_types.h > +++ b/include/linux/mm_types.h > @@ -482,7 +482,12 @@ static inline bool tlb_flush_pending(struct mm_struct *mm) > static inline void set_tlb_flush_pending(struct mm_struct *mm) > { > mm->tlb_flush_pending = true; > - barrier(); > + > + /* > + * Guarantee that the tlb_flush_pending store does not leak into the > + * critical section updating the page tables > + */ > + smp_mb__before_spinlock(); > } > /* Clearing is done after a TLB flush, which also provides a barrier. */ > static inline void clear_tlb_flush_pending(struct mm_struct *mm) > From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1751519Ab3LKPWF (ORCPT ); Wed, 11 Dec 2013 10:22:05 -0500 Received: from mx1.redhat.com ([209.132.183.28]:63548 "EHLO mx1.redhat.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1750945Ab3LKPWD (ORCPT ); Wed, 11 Dec 2013 10:22:03 -0500 Date: Wed, 11 Dec 2013 10:21:26 -0500 From: Rik van Riel To: Mel Gorman Cc: Andrew Morton , "Paul E. McKenney" , Peter Zijlstra , Alex Thorlton , Linux-MM , LKML Subject: Re: [PATCH] mm: numa: Guarantee that tlb_flush_pending updates are visible before page table updates Message-ID: <20131211102126.532c763d@annuminas.surriel.com> In-Reply-To: <20131211132109.GB24125@suse.de> References: <1386690695-27380-1-git-send-email-mgorman@suse.de> <20131211132109.GB24125@suse.de> Organization: Red Hat, Inc. 
Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Wed, 11 Dec 2013 13:21:09 +0000 Mel Gorman wrote: > According to documentation on barriers, stores issued before a LOCK can > complete after the lock implying that it's possible tlb_flush_pending can > be visible after a page table update. As per revised documentation, this patch > adds a smp_mb__before_spinlock to guarantee the correct ordering. And now you have 18 patches :) > Cc: stable@vger.kernel.org > Signed-off-by: Mel Gorman Reviewed-by: Rik van Riel -- All rights reversed. From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1751861Ab3LKQk4 (ORCPT ); Wed, 11 Dec 2013 11:40:56 -0500 Received: from cantor2.suse.de ([195.135.220.15]:59740 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1750968Ab3LKQkz (ORCPT ); Wed, 11 Dec 2013 11:40:55 -0500 Date: Wed, 11 Dec 2013 16:40:52 +0000 From: Mel Gorman To: "Paul E. McKenney" Cc: Andrew Morton , Peter Zijlstra , Alex Thorlton , Rik van Riel , Linux-MM , LKML Subject: Re: [PATCH] mm: numa: Guarantee that tlb_flush_pending updates are visible before page table updates Message-ID: <20131211164052.GB11295@suse.de> References: <1386690695-27380-1-git-send-email-mgorman@suse.de> <20131211132109.GB24125@suse.de> <20131211144446.GP4208@linux.vnet.ibm.com> MIME-Version: 1.0 Content-Type: text/plain; charset=iso-8859-15 Content-Disposition: inline In-Reply-To: <20131211144446.GP4208@linux.vnet.ibm.com> User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Wed, Dec 11, 2013 at 06:44:47AM -0800, Paul E. 
McKenney wrote: > On Wed, Dec 11, 2013 at 01:21:09PM +0000, Mel Gorman wrote: > > According to documentation on barriers, stores issued before a LOCK can > > complete after the lock implying that it's possible tlb_flush_pending can > > be visible after a page table update. As per revised documentation, this patch > > adds a smp_mb__before_spinlock to guarantee the correct ordering. > > > > Cc: stable@vger.kernel.org > > Signed-off-by: Mel Gorman > > Assuming that there is a lock acquisition after calls to > set_tlb_flush_pending(): > > Acked-by: Paul E. McKenney > > (I don't see set_tlb_flush_pending() in mainline.) > It's introduced by a patch flight that is currently sitting in Andrew's tree. In the case where we care about the value of tlb_flush_pending, a spinlock will be taken. PMD or PTE split spinlocks or the mm->page_table_lock depending on whether it is 3.13 or 3.12-stable and earlier kernels. I pushed the relevant patches to this tree and branch git://git.kernel.org/pub/scm/linux/kernel/git/mel/linux-balancenuma.git numab-instrument-serialise-v5r1 There is no guarantee the lock will be taken if there are no pages populated in the region but we also do not care about flushing the TLB in that case either. Does it matter that there is no guarantee a lock will be taken after smp_mb__before_spinlock, just very likely that it will be? -- Mel Gorman SUSE Labs From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1751625Ab3LKQ41 (ORCPT ); Wed, 11 Dec 2013 11:56:27 -0500 Received: from e37.co.us.ibm.com ([32.97.110.158]:51741 "EHLO e37.co.us.ibm.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751202Ab3LKQ4Y (ORCPT ); Wed, 11 Dec 2013 11:56:24 -0500 Date: Wed, 11 Dec 2013 08:56:20 -0800 From: "Paul E. 
McKenney" To: Mel Gorman Cc: Andrew Morton , Peter Zijlstra , Alex Thorlton , Rik van Riel , Linux-MM , LKML Subject: Re: [PATCH] mm: numa: Guarantee that tlb_flush_pending updates are visible before page table updates Message-ID: <20131211165620.GU4208@linux.vnet.ibm.com> Reply-To: paulmck@linux.vnet.ibm.com References: <1386690695-27380-1-git-send-email-mgorman@suse.de> <20131211132109.GB24125@suse.de> <20131211144446.GP4208@linux.vnet.ibm.com> <20131211164052.GB11295@suse.de> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20131211164052.GB11295@suse.de> User-Agent: Mutt/1.5.21 (2010-09-15) X-TM-AS-MML: disable X-Content-Scanned: Fidelis XPS MAILER x-cbid: 13121116-7164-0000-0000-0000042309AC Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Wed, Dec 11, 2013 at 04:40:52PM +0000, Mel Gorman wrote: > On Wed, Dec 11, 2013 at 06:44:47AM -0800, Paul E. McKenney wrote: > > On Wed, Dec 11, 2013 at 01:21:09PM +0000, Mel Gorman wrote: > > > According to documentation on barriers, stores issued before a LOCK can > > > complete after the lock implying that it's possible tlb_flush_pending can > > > be visible after a page table update. As per revised documentation, this patch > > > adds a smp_mb__before_spinlock to guarantee the correct ordering. > > > > > > Cc: stable@vger.kernel.org > > > Signed-off-by: Mel Gorman > > > > Assuming that there is a lock acquisition after calls to > > set_tlb_flush_pending(): > > > > Acked-by: Paul E. McKenney > > > > (I don't see set_tlb_flush_pending() in mainline.) > > > > It's introduced by a patch flight that is currently sitting in Andrew's > tree. In the case where we care about the value of tlb_flush_pending, a > spinlock will be taken. PMD or PTE split spinlocks or the mm->page_table_lock > depending on whether it is 3.13 or 3.12-stable and earlier kernels. 
I > pushed the relevant patches to this tree and branch > > git://git.kernel.org/pub/scm/linux/kernel/git/mel/linux-balancenuma.git numab-instrument-serialise-v5r1 > > There is no guarantee the lock will be taken if there are no pages populated > in the region but we also do not care about flushing the TLB in that case > either. Does it matter that there is no guarantee a lock will be taken > after smp_mb__before_spinlock, just very likely that it will be? If you do smp_mb__before_spinlock() without a lock acquisition, no harm will be done, other than possibly a bit of performance loss. So you should be OK. Thanx, Paul From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752055Ab3LKTN3 (ORCPT ); Wed, 11 Dec 2013 14:13:29 -0500 Received: from cantor2.suse.de ([195.135.220.15]:35388 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751461Ab3LKTMy (ORCPT ); Wed, 11 Dec 2013 14:12:54 -0500 Date: Wed, 11 Dec 2013 19:12:50 +0000 From: Mel Gorman To: Andrew Morton Cc: Alex Thorlton , Rik van Riel , Linux-MM , LKML Subject: [PATCH] mm: fix TLB flush race between migration, and change_protection_range -fix Message-ID: <20131211191250.GD11295@suse.de> References: <1386690695-27380-1-git-send-email-mgorman@suse.de> <1386690695-27380-12-git-send-email-mgorman@suse.de> MIME-Version: 1.0 Content-Type: text/plain; charset=iso-8859-15 Content-Disposition: inline In-Reply-To: <1386690695-27380-12-git-send-email-mgorman@suse.de> User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org The following build error was reported by the 0-day build checker. >> arch/arm/mm/context.c:51:18: error: 'tlb_flush_pending' redeclared as different kind of symbol include/linux/mm_types.h:477:91: note: previous definition of 'tlb_flush_pending' was here This patch renames tlb_flush_pending to mm_tlb_flush_pending. 
This is a fix for the -mm patch mm-fix-tlb-flush-race-between-migration-and-change_protection_range.patch Note that when slotted into place it will cause a conflict with mm-numa-defer-tlb-flush-for-thp-migration-as-long-as-possible.patch. The resolution is to delete the call from huge_memory.c and make sure the tlb_flush_pending call in mm/migrate.c is renamed appropriately. Signed-off-by: Mel Gorman --- arch/x86/include/asm/pgtable.h | 2 +- include/linux/mm_types.h | 4 ++-- mm/huge_memory.c | 2 +- 3 files changed, 4 insertions(+), 4 deletions(-) diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h index 48cab4c..bbc8b12 100644 --- a/arch/x86/include/asm/pgtable.h +++ b/arch/x86/include/asm/pgtable.h @@ -458,7 +458,7 @@ static inline bool pte_accessible(struct mm_struct *mm, pte_t a) return true; if ((pte_flags(a) & (_PAGE_PROTNONE | _PAGE_NUMA)) && - tlb_flush_pending(mm)) + mm_tlb_flush_pending(mm)) return true; return false; diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h index c122bb1..e5c49c3 100644 --- a/include/linux/mm_types.h +++ b/include/linux/mm_types.h @@ -474,7 +474,7 @@ static inline cpumask_t *mm_cpumask(struct mm_struct *mm) * The barriers below prevent the compiler from re-ordering the instructions * around the memory barriers that are already present in the code.
*/ -static inline bool tlb_flush_pending(struct mm_struct *mm) +static inline bool mm_tlb_flush_pending(struct mm_struct *mm) { barrier(); return mm->tlb_flush_pending; @@ -491,7 +491,7 @@ static inline void clear_tlb_flush_pending(struct mm_struct *mm) mm->tlb_flush_pending = false; } #else -static inline bool tlb_flush_pending(struct mm_struct *mm) +static inline bool mm_tlb_flush_pending(struct mm_struct *mm) { return false; } diff --git a/mm/huge_memory.c b/mm/huge_memory.c index e3a5ee2..317a8ff 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -1380,7 +1380,7 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma, * The page_table_lock above provides a memory barrier * with change_protection_range. */ - if (tlb_flush_pending(mm)) + if (mm_tlb_flush_pending(mm)) flush_tlb_range(vma, haddr, haddr + HPAGE_PMD_SIZE); /* From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1751747Ab3LPXPY (ORCPT ); Mon, 16 Dec 2013 18:15:24 -0500 Received: from mx1.redhat.com ([209.132.183.28]:30182 "EHLO mx1.redhat.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1750803Ab3LPXPX (ORCPT ); Mon, 16 Dec 2013 18:15:23 -0500 Date: Mon, 16 Dec 2013 18:15:13 -0500 From: Rik van Riel To: Mel Gorman Cc: Andrew Morton , Alex Thorlton , Linux-MM , LKML , chegu_vinod@hp.com Subject: [PATCH 19/18] mm,numa: write pte_numa pte back to the page tables Message-ID: <20131216181513.14eda80d@annuminas.surriel.com> In-Reply-To: <1386690695-27380-6-git-send-email-mgorman@suse.de> References: <1386690695-27380-1-git-send-email-mgorman@suse.de> <1386690695-27380-6-git-send-email-mgorman@suse.de> Organization: Red Hat, Inc. 
Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Tue, 10 Dec 2013 15:51:23 +0000 Mel Gorman wrote: > The TLB must be flushed if the PTE is updated but change_pte_range is clearing > the PTE while marking PTEs pte_numa without necessarily flushing the TLB if it > reinserts the same entry. Without the flush, it's conceivable that two processors > have different TLBs for the same virtual address and at the very least it would > generate spurious faults. This patch only unmaps the pages in change_pte_range for > a full protection change. Turns out the patch optimized out not one, but both pte writes. Oops. We'll need this one too, Andrew :) ---8<--- Subject: mm,numa: write pte_numa pte back to the page tables The patch "mm: numa: Do not clear PTE for pte_numa update" cleverly optimizes out an extraneous PTE write when changing the protection of pages to pte_numa. It also optimizes out actually writing the new pte_numa entry back to the page tables. Oops. 
Signed-off-by: Rik van Riel Reported-by: Chegu Vinod --- mm/mprotect.c | 1 + 1 file changed, 1 insertion(+) diff --git a/mm/mprotect.c b/mm/mprotect.c index edc4e22..4114acf 100644 --- a/mm/mprotect.c +++ b/mm/mprotect.c @@ -67,6 +67,7 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd, if (page && !PageKsm(page)) { if (!pte_numa(oldpte)) { ptent = pte_mknuma(ptent); + set_pte_at(mm, addr, pte, ptent); updated = true; } } From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1753235Ab3LQWyB (ORCPT ); Tue, 17 Dec 2013 17:54:01 -0500 Received: from aserp1040.oracle.com ([141.146.126.69]:43293 "EHLO aserp1040.oracle.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753161Ab3LQWx6 (ORCPT ); Tue, 17 Dec 2013 17:53:58 -0500 Message-ID: <52B0D5F9.5030208@oracle.com> Date: Tue, 17 Dec 2013 17:53:45 -0500 From: Sasha Levin User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:24.0) Gecko/20100101 Thunderbird/24.1.0 MIME-Version: 1.0 To: Mel Gorman , Andrew Morton CC: Alex Thorlton , Rik van Riel , Linux-MM , LKML Subject: Re: [PATCH 10/18] mm: numa: Avoid unnecessary disruption of NUMA hinting during migration References: <1386690695-27380-1-git-send-email-mgorman@suse.de> <1386690695-27380-11-git-send-email-mgorman@suse.de> In-Reply-To: <1386690695-27380-11-git-send-email-mgorman@suse.de> Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit X-Source-IP: ucsinet21.oracle.com [156.151.31.93] Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Hi Mel, On 12/10/2013 10:51 AM, Mel Gorman wrote: > + > + /* mmap_sem prevents this happening but warn if that changes */ > + WARN_ON(pmd_trans_migrating(pmd)); > + I seem to be hitting this warning with latest -next kernel: [ 1704.594807] WARNING: CPU: 28 PID: 35287 at mm/huge_memory.c:887 copy_huge_pmd+0x145/ 0x3a0() [ 1704.597258] Modules 
linked in: [ 1704.597844] CPU: 28 PID: 35287 Comm: trinity-main Tainted: G W 3.13.0-rc4- next-20131217-sasha-00013-ga878504-dirty #4149 [ 1704.599924] 0000000000000377e delta! pid slot 27 [36258]: old:2 now:537927697 diff: 537927695 ffff8803593ddb90 ffffffff8439501c ffffffff854722c1 [ 1704.604846] 0000000000000000 ffff8803593ddbd0 ffffffff8112f8ac ffff8803593ddbe0 [ 1704.606391] ffff88034bc137f0 ffff880e41677000 8000000b47c009e4 ffff88034a638000 [ 1704.608008] Call Trace: [ 1704.608511] [] dump_stack+0x52/0x7f [ 1704.609699] [] warn_slowpath_common+0x8c/0xc0 [ 1704.612617] [] warn_slowpath_null+0x1a/0x20 [ 1704.614043] [] copy_huge_pmd+0x145/0x3a0 [ 1704.615587] [] copy_page_range+0x3f2/0x560 [ 1704.616869] [] ? rwsem_wake+0x51/0x70 [ 1704.617942] [] dup_mmap+0x2c9/0x3d0 [ 1704.619146] [] dup_mm+0xad/0x150 [ 1704.620051] [] copy_process+0xa68/0x12e0 [ 1704.622976] [] ? __lock_release+0x1da/0x1f0 [ 1704.624234] [] do_fork+0x96/0x270 [ 1704.624975] [] ? context_tracking_user_exit+0x195/0x1d0 [ 1704.626427] [] ? trace_hardirqs_on+0xd/0x10 [ 1704.627681] [] SyS_clone+0x16/0x20 [ 1704.628833] [] stub_clone+0x69/0x90 [ 1704.629672] [] ? 
tracesys+0xdd/0xe2 Thanks, Sasha From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1756393Ab3LSMEB (ORCPT ); Thu, 19 Dec 2013 07:04:01 -0500 Received: from cantor2.suse.de ([195.135.220.15]:50497 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753448Ab3LSMAK (ORCPT ); Thu, 19 Dec 2013 07:00:10 -0500 Date: Thu, 19 Dec 2013 12:00:07 +0000 From: Mel Gorman To: Andrew Morton Cc: Alex Thorlton , Rik van Riel , Sasha Levin , Linux-MM , LKML Subject: [PATCH] mm: Remove bogus warning in copy_huge_pmd Message-ID: <20131219120007.GJ11295@suse.de> References: <1386690695-27380-1-git-send-email-mgorman@suse.de> <1386690695-27380-11-git-send-email-mgorman@suse.de> <52B0D5F9.5030208@oracle.com> MIME-Version: 1.0 Content-Type: text/plain; charset=iso-8859-15 Content-Disposition: inline In-Reply-To: <52B0D5F9.5030208@oracle.com> User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Sasha Levin reported the following warning being triggered [ 1704.594807] WARNING: CPU: 28 PID: 35287 at mm/huge_memory.c:887 copy_huge_pmd+0x145/ 0x3a0() [ 1704.597258] Modules linked in: [ 1704.597844] CPU: 28 PID: 35287 Comm: trinity-main Tainted: G W 3.13.0-rc4-next-20131217-sasha-00013-ga878504-dirty #4149 [ 1704.599924] 0000000000000377e delta! pid slot 27 [36258]: old:2 now:537927697 diff: 537927695 ffff8803593ddb90 ffffffff8439501c ffffffff854722c1 [ 1704.604846] 0000000000000000 ffff8803593ddbd0 ffffffff8112f8ac ffff8803593ddbe0 [ 1704.606391] ffff88034bc137f0 ffff880e41677000 8000000b47c009e4 ffff88034a638000 [ 1704.608008] Call Trace: [ 1704.608511] [] dump_stack+0x52/0x7f [ 1704.609699] [] warn_slowpath_common+0x8c/0xc0 [ 1704.612617] [] warn_slowpath_null+0x1a/0x20 [ 1704.614043] [] copy_huge_pmd+0x145/0x3a0 [ 1704.615587] [] copy_page_range+0x3f2/0x560 [ 1704.616869] [] ? 
rwsem_wake+0x51/0x70 [ 1704.617942] [] dup_mmap+0x2c9/0x3d0 [ 1704.619146] [] dup_mm+0xad/0x150 [ 1704.620051] [] copy_process+0xa68/0x12e0 [ 1704.622976] [] ? __lock_release+0x1da/0x1f0 [ 1704.624234] [] do_fork+0x96/0x270 [ 1704.624975] [] ? context_tracking_user_exit+0x195/0x1d0 [ 1704.626427] [] ? trace_hardirqs_on+0xd/0x10 [ 1704.627681] [] SyS_clone+0x16/0x20 [ 1704.628833] [] stub_clone+0x69/0x90 [ 1704.629672] [] ? tracesys+0xdd/0xe2 This warning was introduced by "mm: numa: Avoid unnecessary disruption of NUMA hinting during migration" for paranoia reasons but the warning is bogus. I was thinking of parallel races between NUMA hinting faults and forks but this warning would also be triggered by a parallel reclaim splitting a THP during a fork. Remove the bogus warning. Signed-off-by: Mel Gorman --- mm/huge_memory.c | 3 --- 1 file changed, 3 deletions(-) diff --git a/mm/huge_memory.c b/mm/huge_memory.c index e3b6a75..468bd3a 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -883,9 +883,6 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm, goto out_unlock; } - /* mmap_sem prevents this happening but warn if that changes */ - WARN_ON(pmd_trans_migrating(pmd)); - if (unlikely(pmd_trans_splitting(pmd))) { /* split huge page running from under us */ spin_unlock(src_ptl); From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S932172Ab3LSMQR (ORCPT ); Thu, 19 Dec 2013 07:16:17 -0500 Received: from cantor2.suse.de ([195.135.220.15]:50455 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1755784Ab3LSL7I (ORCPT ); Thu, 19 Dec 2013 06:59:08 -0500 Date: Thu, 19 Dec 2013 11:59:05 +0000 From: Mel Gorman To: Sasha Levin Cc: Andrew Morton , Alex Thorlton , Rik van Riel , Linux-MM , LKML Subject: Re: [PATCH 10/18] mm: numa: Avoid unnecessary disruption of NUMA hinting during migration Message-ID: <20131219115905.GI11295@suse.de> References:
<1386690695-27380-1-git-send-email-mgorman@suse.de> <1386690695-27380-11-git-send-email-mgorman@suse.de> <52B0D5F9.5030208@oracle.com> MIME-Version: 1.0 Content-Type: text/plain; charset=iso-8859-15 Content-Disposition: inline In-Reply-To: <52B0D5F9.5030208@oracle.com> User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Tue, Dec 17, 2013 at 05:53:45PM -0500, Sasha Levin wrote: > Hi Mel, > > On 12/10/2013 10:51 AM, Mel Gorman wrote: > >+ > >+ /* mmap_sem prevents this happening but warn if that changes */ > >+ WARN_ON(pmd_trans_migrating(pmd)); > >+ > > I seem to be hitting this warning with latest -next kernel: > Patch will follow shortly. I appreciate these trinity bug reports but in the future is there any chance you could include the trinity command line and the config file you used? Details on the machine would also be nice. In this case, knowing if the machine was NUMA or not would have been helpful. Thanks! 
-- Mel Gorman SUSE Labs From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1755556Ab3LSSgN (ORCPT ); Thu, 19 Dec 2013 13:36:13 -0500 Received: from mx1.redhat.com ([209.132.183.28]:20158 "EHLO mx1.redhat.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753698Ab3LSSgM (ORCPT ); Thu, 19 Dec 2013 13:36:12 -0500 Message-ID: <52B33C93.2020006@redhat.com> Date: Thu, 19 Dec 2013 13:36:03 -0500 From: Rik van Riel User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:24.0) Gecko/20100101 Thunderbird/24.1.0 MIME-Version: 1.0 To: Mel Gorman , Andrew Morton CC: Alex Thorlton , Sasha Levin , Linux-MM , LKML Subject: Re: [PATCH] mm: Remove bogus warning in copy_huge_pmd References: <1386690695-27380-1-git-send-email-mgorman@suse.de> <1386690695-27380-11-git-send-email-mgorman@suse.de> <52B0D5F9.5030208@oracle.com> <20131219120007.GJ11295@suse.de> In-Reply-To: <20131219120007.GJ11295@suse.de> Content-Type: text/plain; charset=ISO-8859-15; format=flowed Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On 12/19/2013 07:00 AM, Mel Gorman wrote: > Sasha Levin reported the following warning being triggered > > [ 1704.594807] WARNING: CPU: 28 PID: 35287 at mm/huge_memory.c:887 copy_huge_pmd+0x145/0x3a0() > [ 1704.597258] Modules linked in: > [ 1704.597844] CPU: 28 PID: 35287 Comm: trinity-main Tainted: G W 3.13.0-rc4-next-20131217-sasha-00013-ga878504-dirty #4149 > [ 1704.599924] 0000000000000377e delta! 
pid slot 27 [36258]: old:2 now:537927697 diff: 537927695 ffff8803593ddb90 ffffffff8439501c ffffffff854722c1 > [ 1704.604846] 0000000000000000 ffff8803593ddbd0 ffffffff8112f8ac ffff8803593ddbe0 > [ 1704.606391] ffff88034bc137f0 ffff880e41677000 8000000b47c009e4 ffff88034a638000 > [ 1704.608008] Call Trace: > [ 1704.608511] [] dump_stack+0x52/0x7f > [ 1704.609699] [] warn_slowpath_common+0x8c/0xc0 > [ 1704.612617] [] warn_slowpath_null+0x1a/0x20 > [ 1704.614043] [] copy_huge_pmd+0x145/0x3a0 > [ 1704.615587] [] copy_page_range+0x3f2/0x560 > [ 1704.616869] [] ? rwsem_wake+0x51/0x70 > [ 1704.617942] [] dup_mmap+0x2c9/0x3d0 > [ 1704.619146] [] dup_mm+0xad/0x150 > [ 1704.620051] [] copy_process+0xa68/0x12e0 > [ 1704.622976] [] ? __lock_release+0x1da/0x1f0 > [ 1704.624234] [] do_fork+0x96/0x270 > [ 1704.624975] [] ? context_tracking_user_exit+0x195/0x1d0 > [ 1704.626427] [] ? trace_hardirqs_on+0xd/0x10 > [ 1704.627681] [] SyS_clone+0x16/0x20 > [ 1704.628833] [] stub_clone+0x69/0x90 > [ 1704.629672] [] ? tracesys+0xdd/0xe2 > > This warning was introduced by "mm: numa: Avoid unnecessary disruption > of NUMA hinting during migration" for paranoia reasons but the warning > is bogus. I was thinking of parallel races between NUMA hinting faults > and forks but this warning would also be triggered by a parallel reclaim > splitting a THP during a fork. Remove the bogus warning. > > Signed-off-by: Mel Gorman Acked-by: Rik van Riel