From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-la0-f54.google.com (mail-la0-f54.google.com [209.85.215.54]) by kanga.kvack.org (Postfix) with ESMTP id 2998F6B0032 for ; Thu, 11 Jun 2015 10:14:00 -0400 (EDT) Received: by laew7 with SMTP id w7so5171238lae.1 for ; Thu, 11 Jun 2015 07:13:59 -0700 (PDT) Received: from mail-wi0-x22d.google.com (mail-wi0-x22d.google.com. [2a00:1450:400c:c05::22d]) by mx.google.com with ESMTPS id g7si1437519wjy.213.2015.06.11.07.13.57 for (version=TLSv1.2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Thu, 11 Jun 2015 07:13:58 -0700 (PDT) Received: by wigg3 with SMTP id g3so76521086wig.1 for ; Thu, 11 Jun 2015 07:13:57 -0700 (PDT) Date: Thu, 11 Jun 2015 16:13:53 +0200 From: Ingo Molnar Subject: Re: [RFC PATCH 00/12] x86/mm: Implement lockless pgd_alloc()/pgd_free() Message-ID: <20150611141353.GA9447@gmail.com> References: <1434031637-9091-1-git-send-email-mingo@kernel.org> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <1434031637-9091-1-git-send-email-mingo@kernel.org> Sender: owner-linux-mm@kvack.org List-ID: To: linux-kernel@vger.kernel.org, linux-mm@kvack.org Cc: Andy Lutomirski , Andrew Morton , Denys Vlasenko , Brian Gerst , Peter Zijlstra , Borislav Petkov , "H. Peter Anvin" , Linus Torvalds , Oleg Nesterov , Thomas Gleixner , Waiman Long [ I fat-fingered the linux-mm Cc:, so every reply will bounce on that, sorry about that :-/ Fixed it in this mail's Cc: list. ] * Ingo Molnar wrote: > Waiman Long reported 'pgd_lock' contention on high CPU count systems and > proposed moving pgd_lock on a separate cacheline to eliminate false sharing and > to reduce some of the lock bouncing overhead. So 'pgd_lock' is a global lock, used for every new task creation: arch/x86/mm/fault.c:DEFINE_SPINLOCK(pgd_lock); which with a sufficiently high CPU count starts to hurt. Thanks, Ingo -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-wi0-f174.google.com (mail-wi0-f174.google.com [209.85.212.174]) by kanga.kvack.org (Postfix) with ESMTP id B399C6B006E for ; Sat, 13 Jun 2015 05:49:40 -0400 (EDT) Received: by wibut5 with SMTP id ut5so35641945wib.1 for ; Sat, 13 Jun 2015 02:49:40 -0700 (PDT) Received: from mail-wg0-x232.google.com (mail-wg0-x232.google.com. [2a00:1450:400c:c00::232]) by mx.google.com with ESMTPS id qo2si11594985wjc.150.2015.06.13.02.49.38 for (version=TLSv1.2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Sat, 13 Jun 2015 02:49:39 -0700 (PDT) Received: by wgbhy7 with SMTP id hy7so4474275wgb.2 for ; Sat, 13 Jun 2015 02:49:38 -0700 (PDT) From: Ingo Molnar Subject: [PATCH 04/12] x86/mm/hotplug: Simplify sync_global_pgds() Date: Sat, 13 Jun 2015 11:49:07 +0200 Message-Id: <1434188955-31397-5-git-send-email-mingo@kernel.org> In-Reply-To: <1434188955-31397-1-git-send-email-mingo@kernel.org> References: <1434188955-31397-1-git-send-email-mingo@kernel.org> Sender: owner-linux-mm@kvack.org List-ID: To: linux-kernel@vger.kernel.org Cc: linux-mm@kvack.org, Andy Lutomirski , Andrew Morton , Denys Vlasenko , Brian Gerst , Peter Zijlstra , Borislav Petkov , "H. Peter Anvin" , Linus Torvalds , Oleg Nesterov , Thomas Gleixner , Waiman Long Now that the memory hotplug code does not remove PGD entries anymore, the only users of sync_global_pgds() use it after extending the PGD. So remove the 'removed' parameter and simplify the call sites. Cc: Andrew Morton Cc: Andy Lutomirski Cc: Borislav Petkov Cc: Brian Gerst Cc: Denys Vlasenko Cc: H. Peter Anvin Cc: Linus Torvalds Cc: Oleg Nesterov Cc: Peter Zijlstra Cc: Rik van Riel Cc: Thomas Gleixner Cc: Waiman Long Cc: linux-mm@kvack.org Signed-off-by: Ingo Molnar --- arch/x86/include/asm/pgtable_64.h | 3 +-- arch/x86/mm/fault.c | 2 +- arch/x86/mm/init_64.c | 27 ++++++++------------------- 3 files changed, 10 insertions(+), 22 deletions(-) diff --git a/arch/x86/include/asm/pgtable_64.h b/arch/x86/include/asm/pgtable_64.h index 2ee781114d34..f405fc3bb719 100644 --- a/arch/x86/include/asm/pgtable_64.h +++ b/arch/x86/include/asm/pgtable_64.h @@ -116,8 +116,7 @@ static inline void native_pgd_clear(pgd_t *pgd) native_set_pgd(pgd, native_make_pgd(0)); } -extern void sync_global_pgds(unsigned long start, unsigned long end, - int removed); +extern void sync_global_pgds(unsigned long start, unsigned long end); /* * Conversion functions: convert a page and protection to a page entry, diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c index 181c53bac3a7..50342825f221 100644 --- a/arch/x86/mm/fault.c +++ b/arch/x86/mm/fault.c @@ -349,7 +349,7 @@ static void dump_pagetable(unsigned long address) void vmalloc_sync_all(void) { - sync_global_pgds(VMALLOC_START & PGDIR_MASK, VMALLOC_END, 0); + sync_global_pgds(VMALLOC_START & PGDIR_MASK, VMALLOC_END); } /* diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c index 7a988dbad240..dcb2f45caf0e 100644 --- a/arch/x86/mm/init_64.c +++ b/arch/x86/mm/init_64.c @@ -160,10 +160,10 @@ static int __init nonx32_setup(char *str) __setup("noexec32=", nonx32_setup); /* - * When memory was added/removed make sure all the process MMs have + * When memory was added make sure all the process MMs have * matching PGD entries in the local PGD level page as well. */ -void sync_global_pgds(unsigned long start, unsigned long end, int removed) +void sync_global_pgds(unsigned long start, unsigned long end) { unsigned long address; @@ -171,14 +171,8 @@ void sync_global_pgds(unsigned long start, unsigned long end, int removed) const pgd_t *pgd_ref = pgd_offset_k(address); struct task_struct *g, *p; - /* - * When this function is called after memory hot remove, - * pgd_none() already returns true, but only the reference - * kernel PGD has been cleared, not the process PGDs. - * - * So clear the affected entries in every process PGD as well: - */ - if (pgd_none(*pgd_ref) && !removed) + /* Only sync (potentially) newly added PGD entries: */ + if (pgd_none(*pgd_ref)) continue; spin_lock(&pgd_lock); /* Implies rcu_read_lock() for the task list iteration: */ @@ -204,13 +198,8 @@ void sync_global_pgds(unsigned long start, unsigned long end, int removed) if (!pgd_none(*pgd_ref) && !pgd_none(*pgd)) BUG_ON(pgd_page_vaddr(*pgd) != pgd_page_vaddr(*pgd_ref)); - if (removed) { - if (pgd_none(*pgd_ref) && !pgd_none(*pgd)) - pgd_clear(pgd); - } else { - if (pgd_none(*pgd)) - set_pgd(pgd, *pgd_ref); - } + if (pgd_none(*pgd)) + set_pgd(pgd, *pgd_ref); spin_unlock(pgt_lock); task_unlock(p); @@ -644,7 +633,7 @@ kernel_physical_mapping_init(unsigned long start, } if (pgd_changed) - sync_global_pgds(addr, end - 1, 0); + sync_global_pgds(addr, end - 1); __flush_tlb_all(); @@ -1284,7 +1273,7 @@ int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node) else err = vmemmap_populate_basepages(start, end, node); if (!err) - sync_global_pgds(start, end - 1, 0); + sync_global_pgds(start, end - 1); return err; } -- 2.1.4 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S932216AbbFKOHh (ORCPT ); Thu, 11 Jun 2015 10:07:37 -0400 Received: from mail-wi0-f181.google.com ([209.85.212.181]:36793 "EHLO mail-wi0-f181.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753315AbbFKOHd (ORCPT ); Thu, 11 Jun 2015 10:07:33 -0400 From: Ingo Molnar To: linux-kernel@vger.kernel.org Cc: linux-mml@vger.kernel.org, Andy Lutomirski , Andrew Morton , Denys Vlasenko , Brian Gerst , Peter Zijlstra , Borislav Petkov , "H. Peter Anvin" , Linus Torvalds , Oleg Nesterov , Thomas Gleixner , Waiman Long Subject: [RFC PATCH 00/12] x86/mm: Implement lockless pgd_alloc()/pgd_free() Date: Thu, 11 Jun 2015 16:07:05 +0200 Message-Id: <1434031637-9091-1-git-send-email-mingo@kernel.org> X-Mailer: git-send-email 2.1.4 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Waiman Long reported 'pgd_lock' contention on high CPU count systems and proposed moving pgd_lock on a separate cacheline to eliminate false sharing and to reduce some of the lock bouncing overhead. I think we can do much better: this series eliminates the pgd_list and makes pgd_alloc()/pgd_free() lockless. Now the lockless initialization of the PGD has a few preconditions, which the initial part of the series implements: - no PGD clearing is allowed, only additions. This makes sense as a single PGD entry covers 512 GB of RAM so the 4K overhead per 0.5TB of RAM mapped is miniscule. The patches after that convert existing pgd_list users to walk the task list. PGD locking is kept intact: coherency guarantees between the CPA, vmalloc, hotplug, etc. code are unchanged. The final patches eliminate the pgd_list and thus make pgd_alloc()/pgd_free() lockless. The patches have been boot tested on 64-bit and 32-bit x86 systems. Architectures not making use of the new facility are unaffected. Thanks, Ingo ===== Ingo Molnar (12): x86/mm/pat: Don't free PGD entries on memory unmap x86/mm/hotplug: Remove pgd_list use from the memory hotplug code x86/mm/hotplug: Don't remove PGD entries in remove_pagetable() x86/mm/hotplug: Simplify sync_global_pgds() mm: Introduce arch_pgd_init_late() x86/mm: Enable and use the arch_pgd_init_late() method x86/virt/guest/xen: Remove use of pgd_list from the Xen guest code x86/mm: Remove pgd_list use from vmalloc_sync_all() x86/mm/pat/32: Remove pgd_list use from the PAT code x86/mm: Make pgd_alloc()/pgd_free() lockless x86/mm: Remove pgd_list leftovers x86/mm: Simplify pgd_alloc() arch/Kconfig | 9 ++++ arch/x86/Kconfig | 1 + arch/x86/include/asm/pgtable.h | 3 -- arch/x86/include/asm/pgtable_64.h | 3 +- arch/x86/mm/fault.c | 24 +++++++---- arch/x86/mm/init_64.c | 73 +++++++++++---------------------- arch/x86/mm/pageattr.c | 34 ++++++++-------- arch/x86/mm/pgtable.c | 129 +++++++++++++++++++++++++++++----------------------------- arch/x86/xen/mmu.c | 34 +++++++++++++--- fs/exec.c | 3 ++ include/linux/mm.h | 6 +++ kernel/fork.c | 16 ++++++++ 12 files changed, 183 insertions(+), 152 deletions(-) -- 2.1.4 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1754886AbbFKOLN (ORCPT ); Thu, 11 Jun 2015 10:11:13 -0400 Received: from mail-wi0-f173.google.com ([209.85.212.173]:37547 "EHLO mail-wi0-f173.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1754388AbbFKOHf (ORCPT ); Thu, 11 Jun 2015 10:07:35 -0400 From: Ingo Molnar To: linux-kernel@vger.kernel.org Cc: linux-mml@vger.kernel.org, Andy Lutomirski , Andrew Morton , Denys Vlasenko , Brian Gerst , Peter Zijlstra , Borislav Petkov , "H. Peter Anvin" , Linus Torvalds , Oleg Nesterov , Thomas Gleixner , Waiman Long Subject: [PATCH 01/12] x86/mm/pat: Don't free PGD entries on memory unmap Date: Thu, 11 Jun 2015 16:07:06 +0200 Message-Id: <1434031637-9091-2-git-send-email-mingo@kernel.org> X-Mailer: git-send-email 2.1.4 In-Reply-To: <1434031637-9091-1-git-send-email-mingo@kernel.org> References: <1434031637-9091-1-git-send-email-mingo@kernel.org> Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org So when we unmap kernel memory range then we also check whether a PUD is completely clear, and free it and clear the PGD entry. This complicates PGD management, so don't do this. We can keep the PGD mapped and the PUD table all clear - it's only a single 4K page per 512 GB of memory mapped. Cc: Andrew Morton Cc: Andy Lutomirski Cc: Borislav Petkov Cc: Brian Gerst Cc: Denys Vlasenko Cc: H. Peter Anvin Cc: Linus Torvalds Cc: Peter Zijlstra Cc: Rik van Riel Cc: Thomas Gleixner Cc: Waiman Long Cc: linux-mm@vger.kernel.org Signed-off-by: Ingo Molnar --- arch/x86/mm/pageattr.c | 15 --------------- 1 file changed, 15 deletions(-) diff --git a/arch/x86/mm/pageattr.c b/arch/x86/mm/pageattr.c index 70d221fe2eb4..997fc97e9072 100644 --- a/arch/x86/mm/pageattr.c +++ b/arch/x86/mm/pageattr.c @@ -717,18 +717,6 @@ static bool try_to_free_pmd_page(pmd_t *pmd) return true; } -static bool try_to_free_pud_page(pud_t *pud) -{ - int i; - - for (i = 0; i < PTRS_PER_PUD; i++) - if (!pud_none(pud[i])) - return false; - - free_page((unsigned long)pud); - return true; -} - static bool unmap_pte_range(pmd_t *pmd, unsigned long start, unsigned long end) { pte_t *pte = pte_offset_kernel(pmd, start); @@ -847,9 +835,6 @@ static void unmap_pgd_range(pgd_t *root, unsigned long addr, unsigned long end) pgd_t *pgd_entry = root + pgd_index(addr); unmap_pud_range(pgd_entry, addr, end); - - if (try_to_free_pud_page((pud_t *)pgd_page_vaddr(*pgd_entry))) - pgd_clear(pgd_entry); } static int alloc_pte_page(pmd_t *pmd) -- 2.1.4 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S932922AbbFKOHq (ORCPT ); Thu, 11 Jun 2015 10:07:46 -0400 Received: from mail-wg0-f51.google.com ([74.125.82.51]:33653 "EHLO mail-wg0-f51.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S932088AbbFKOHh (ORCPT ); Thu, 11 Jun 2015 10:07:37 -0400 From: Ingo Molnar To: linux-kernel@vger.kernel.org Cc: linux-mml@vger.kernel.org, Andy Lutomirski , Andrew Morton , Denys Vlasenko , Brian Gerst , Peter Zijlstra , Borislav Petkov , "H. Peter Anvin" , Linus Torvalds , Oleg Nesterov , Thomas Gleixner , Waiman Long Subject: [PATCH 02/12] x86/mm/hotplug: Remove pgd_list use from the memory hotplug code Date: Thu, 11 Jun 2015 16:07:07 +0200 Message-Id: <1434031637-9091-3-git-send-email-mingo@kernel.org> X-Mailer: git-send-email 2.1.4 In-Reply-To: <1434031637-9091-1-git-send-email-mingo@kernel.org> References: <1434031637-9091-1-git-send-email-mingo@kernel.org> Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org The memory hotplug code uses sync_global_pgds() to synchronize updates to the global (&init_mm) kernel PGD and the task PGDs. It does this by iterating over the pgd_list - which list closely tracks task creation/destruction via fork/clone. But we want to remove this list, so that it does not have to be maintained from fork()/exit(), so convert the memory hotplug code to use the task list to iterate over all pgds in the system. Also improve the comments a bit, to make this function easier to understand. Only lightly tested, as I don't have a memory hotplug setup. Cc: Andrew Morton Cc: Andy Lutomirski Cc: Borislav Petkov Cc: Brian Gerst Cc: Denys Vlasenko Cc: H. Peter Anvin Cc: Linus Torvalds Cc: Peter Zijlstra Cc: Thomas Gleixner Cc: Waiman Long Signed-off-by: Ingo Molnar --- arch/x86/mm/init_64.c | 30 +++++++++++++++++------------- 1 file changed, 17 insertions(+), 13 deletions(-) diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c index 3fba623e3ba5..1921acbd49fd 100644 --- a/arch/x86/mm/init_64.c +++ b/arch/x86/mm/init_64.c @@ -160,8 +160,8 @@ static int __init nonx32_setup(char *str) __setup("noexec32=", nonx32_setup); /* - * When memory was added/removed make sure all the processes MM have - * suitable PGD entries in the local PGD level page. + * When memory was added/removed make sure all the process MMs have + * matching PGD entries in the local PGD level page as well. */ void sync_global_pgds(unsigned long start, unsigned long end, int removed) { @@ -169,29 +169,33 @@ void sync_global_pgds(unsigned long start, unsigned long end, int removed) for (address = start; address <= end; address += PGDIR_SIZE) { const pgd_t *pgd_ref = pgd_offset_k(address); - struct page *page; + struct task_struct *g, *p; /* - * When it is called after memory hot remove, pgd_none() - * returns true. In this case (removed == 1), we must clear - * the PGD entries in the local PGD level page. + * When this function is called after memory hot remove, + * pgd_none() already returns true, but only the reference + * kernel PGD has been cleared, not the process PGDs. + * + * So clear the affected entries in every process PGD as well: */ if (pgd_none(*pgd_ref) && !removed) continue; spin_lock(&pgd_lock); - list_for_each_entry(page, &pgd_list, lru) { - pgd_t *pgd; + + for_each_process_thread(g, p) { + pgd_t *pgd = p->mm->pgd; spinlock_t *pgt_lock; - pgd = (pgd_t *)page_address(page) + pgd_index(address); - /* the pgt_lock only for Xen */ - pgt_lock = &pgd_page_get_mm(page)->page_table_lock; + if (!p->mm) + continue; + + /* The pgt_lock is only used by Xen: */ + pgt_lock = &p->mm->page_table_lock; spin_lock(pgt_lock); if (!pgd_none(*pgd_ref) && !pgd_none(*pgd)) - BUG_ON(pgd_page_vaddr(*pgd) - != pgd_page_vaddr(*pgd_ref)); + BUG_ON(pgd_page_vaddr(*pgd) != pgd_page_vaddr(*pgd_ref)); if (removed) { if (pgd_none(*pgd_ref) && !pgd_none(*pgd)) -- 2.1.4 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S932556AbbFKOHn (ORCPT ); Thu, 11 Jun 2015 10:07:43 -0400 Received: from mail-wg0-f46.google.com ([74.125.82.46]:35802 "EHLO mail-wg0-f46.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753315AbbFKOHj (ORCPT ); Thu, 11 Jun 2015 10:07:39 -0400 From: Ingo Molnar To: linux-kernel@vger.kernel.org Cc: linux-mml@vger.kernel.org, Andy Lutomirski , Andrew Morton , Denys Vlasenko , Brian Gerst , Peter Zijlstra , Borislav Petkov , "H. Peter Anvin" , Linus Torvalds , Oleg Nesterov , Thomas Gleixner , Waiman Long Subject: [PATCH 03/12] x86/mm/hotplug: Don't remove PGD entries in remove_pagetable() Date: Thu, 11 Jun 2015 16:07:08 +0200 Message-Id: <1434031637-9091-4-git-send-email-mingo@kernel.org> X-Mailer: git-send-email 2.1.4 In-Reply-To: <1434031637-9091-1-git-send-email-mingo@kernel.org> References: <1434031637-9091-1-git-send-email-mingo@kernel.org> Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org So when memory hotplug removes a piece of physical memory from pagetable mappings, it also frees the underlying PGD entry. This complicates PGD management, so don't do this. We can keep the PGD mapped and the PUD table all clear - it's only a single 4K page per 512 GB of memory hotplugged. Cc: Andrew Morton Cc: Andy Lutomirski Cc: Borislav Petkov Cc: Brian Gerst Cc: Denys Vlasenko Cc: H. Peter Anvin Cc: Linus Torvalds Cc: Peter Zijlstra Cc: Rik van Riel Cc: Thomas Gleixner Cc: Waiman Long Cc: linux-mm@vger.kernel.org Signed-off-by: Ingo Molnar --- arch/x86/mm/init_64.c | 27 --------------------------- 1 file changed, 27 deletions(-) diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c index 1921acbd49fd..2d5931d343cd 100644 --- a/arch/x86/mm/init_64.c +++ b/arch/x86/mm/init_64.c @@ -770,27 +770,6 @@ static void __meminit free_pmd_table(pmd_t *pmd_start, pud_t *pud) spin_unlock(&init_mm.page_table_lock); } -/* Return true if pgd is changed, otherwise return false. */ -static bool __meminit free_pud_table(pud_t *pud_start, pgd_t *pgd) -{ - pud_t *pud; - int i; - - for (i = 0; i < PTRS_PER_PUD; i++) { - pud = pud_start + i; - if (pud_val(*pud)) - return false; - } - - /* free a pud table */ - free_pagetable(pgd_page(*pgd), 0); - spin_lock(&init_mm.page_table_lock); - pgd_clear(pgd); - spin_unlock(&init_mm.page_table_lock); - - return true; -} - static void __meminit remove_pte_table(pte_t *pte_start, unsigned long addr, unsigned long end, bool direct) @@ -982,7 +961,6 @@ remove_pagetable(unsigned long start, unsigned long end, bool direct) unsigned long addr; pgd_t *pgd; pud_t *pud; - bool pgd_changed = false; for (addr = start; addr < end; addr = next) { next = pgd_addr_end(addr, end); @@ -993,13 +971,8 @@ remove_pagetable(unsigned long start, unsigned long end, bool direct) pud = (pud_t *)pgd_page_vaddr(*pgd); remove_pud_table(pud, addr, next, direct); - if (free_pud_table(pud, pgd)) - pgd_changed = true; } - if (pgd_changed) - sync_global_pgds(start, end - 1, 1); - flush_tlb_all(); } -- 2.1.4 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1754879AbbFKOKv (ORCPT ); Thu, 11 Jun 2015 10:10:51 -0400 Received: from mail-wi0-f171.google.com ([209.85.212.171]:35560 "EHLO mail-wi0-f171.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1754687AbbFKOHl (ORCPT ); Thu, 11 Jun 2015 10:07:41 -0400 From: Ingo Molnar To: linux-kernel@vger.kernel.org Cc: linux-mml@vger.kernel.org, Andy Lutomirski , Andrew Morton , Denys Vlasenko , Brian Gerst , Peter Zijlstra , Borislav Petkov , "H. Peter Anvin" , Linus Torvalds , Oleg Nesterov , Thomas Gleixner , Waiman Long Subject: [PATCH 04/12] x86/mm/hotplug: Simplify sync_global_pgds() Date: Thu, 11 Jun 2015 16:07:09 +0200 Message-Id: <1434031637-9091-5-git-send-email-mingo@kernel.org> X-Mailer: git-send-email 2.1.4 In-Reply-To: <1434031637-9091-1-git-send-email-mingo@kernel.org> References: <1434031637-9091-1-git-send-email-mingo@kernel.org> Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Now that the memory hotplug code does not remove PGD entries anymore, the only users of sync_global_pgds() use it after extending the PGD. So remove the 'removed' parameter and simplify the call sites. Cc: Andrew Morton Cc: Andy Lutomirski Cc: Borislav Petkov Cc: Brian Gerst Cc: Denys Vlasenko Cc: H. Peter Anvin Cc: Linus Torvalds Cc: Peter Zijlstra Cc: Rik van Riel Cc: Thomas Gleixner Cc: Waiman Long Cc: linux-mm@vger.kernel.org Signed-off-by: Ingo Molnar --- arch/x86/include/asm/pgtable_64.h | 3 +-- arch/x86/mm/fault.c | 2 +- arch/x86/mm/init_64.c | 19 +++++++------------ 3 files changed, 9 insertions(+), 15 deletions(-) diff --git a/arch/x86/include/asm/pgtable_64.h b/arch/x86/include/asm/pgtable_64.h index 2ee781114d34..f405fc3bb719 100644 --- a/arch/x86/include/asm/pgtable_64.h +++ b/arch/x86/include/asm/pgtable_64.h @@ -116,8 +116,7 @@ static inline void native_pgd_clear(pgd_t *pgd) native_set_pgd(pgd, native_make_pgd(0)); } -extern void sync_global_pgds(unsigned long start, unsigned long end, - int removed); +extern void sync_global_pgds(unsigned long start, unsigned long end); /* * Conversion functions: convert a page and protection to a page entry, diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c index 181c53bac3a7..50342825f221 100644 --- a/arch/x86/mm/fault.c +++ b/arch/x86/mm/fault.c @@ -349,7 +349,7 @@ static void dump_pagetable(unsigned long address) void vmalloc_sync_all(void) { - sync_global_pgds(VMALLOC_START & PGDIR_MASK, VMALLOC_END, 0); + sync_global_pgds(VMALLOC_START & PGDIR_MASK, VMALLOC_END); } /* diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c index 2d5931d343cd..beee532b76a7 100644 --- a/arch/x86/mm/init_64.c +++ b/arch/x86/mm/init_64.c @@ -160,10 +160,10 @@ static int __init nonx32_setup(char *str) __setup("noexec32=", nonx32_setup); /* - * When memory was added/removed make sure all the process MMs have + * When memory was added make sure all the process MMs have * matching PGD entries in the local PGD level page as well. */ -void sync_global_pgds(unsigned long start, unsigned long end, int removed) +void sync_global_pgds(unsigned long start, unsigned long end) { unsigned long address; @@ -178,7 +178,7 @@ void sync_global_pgds(unsigned long start, unsigned long end, int removed) * * So clear the affected entries in every process PGD as well: */ - if (pgd_none(*pgd_ref) && !removed) + if (pgd_none(*pgd_ref)) continue; spin_lock(&pgd_lock); @@ -197,13 +197,8 @@ void sync_global_pgds(unsigned long start, unsigned long end, int removed) if (!pgd_none(*pgd_ref) && !pgd_none(*pgd)) BUG_ON(pgd_page_vaddr(*pgd) != pgd_page_vaddr(*pgd_ref)); - if (removed) { - if (pgd_none(*pgd_ref) && !pgd_none(*pgd)) - pgd_clear(pgd); - } else { - if (pgd_none(*pgd)) - set_pgd(pgd, *pgd_ref); - } + if (pgd_none(*pgd)) + set_pgd(pgd, *pgd_ref); spin_unlock(pgt_lock); } @@ -636,7 +631,7 @@ kernel_physical_mapping_init(unsigned long start, } if (pgd_changed) - sync_global_pgds(addr, end - 1, 0); + sync_global_pgds(addr, end - 1); __flush_tlb_all(); @@ -1276,7 +1271,7 @@ int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node) else err = vmemmap_populate_basepages(start, end, node); if (!err) - sync_global_pgds(start, end - 1, 0); + sync_global_pgds(start, end - 1); return err; } -- 2.1.4 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S964840AbbFKOKM (ORCPT ); Thu, 11 Jun 2015 10:10:12 -0400 Received: from mail-wi0-f174.google.com ([209.85.212.174]:35918 "EHLO mail-wi0-f174.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S932297AbbFKOHm (ORCPT ); Thu, 11 Jun 2015 10:07:42 -0400 From: Ingo Molnar To: linux-kernel@vger.kernel.org Cc: linux-mml@vger.kernel.org, Andy Lutomirski , Andrew Morton , Denys Vlasenko , Brian Gerst , Peter Zijlstra , Borislav Petkov , "H. Peter Anvin" , Linus Torvalds , Oleg Nesterov , Thomas Gleixner , Waiman Long Subject: [PATCH 05/12] mm: Introduce arch_pgd_init_late() Date: Thu, 11 Jun 2015 16:07:10 +0200 Message-Id: <1434031637-9091-6-git-send-email-mingo@kernel.org> X-Mailer: git-send-email 2.1.4 In-Reply-To: <1434031637-9091-1-git-send-email-mingo@kernel.org> References: <1434031637-9091-1-git-send-email-mingo@kernel.org> Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Add a late PGD init callback to places that allocate a new MM with a new PGD: copy_process() and exec(). The purpose of this callback is to allow architectures to implement lockless initialization of task PGDs, to remove the scalability limit of pgd_list/pgd_lock. Architectures can opt in to this callback via the ARCH_HAS_PGD_INIT_LATE Kconfig flag. There's zero overhead on architectures that are not using it. Cc: Andrew Morton Cc: Andy Lutomirski Cc: Borislav Petkov Cc: Brian Gerst Cc: Denys Vlasenko Cc: H. Peter Anvin Cc: Linus Torvalds Cc: Peter Zijlstra Cc: Rik van Riel Cc: Thomas Gleixner Cc: Waiman Long Cc: linux-arch@vger.kernel.org Cc: linux-mm@vger.kernel.org Signed-off-by: Ingo Molnar --- arch/Kconfig | 9 +++++++++ fs/exec.c | 3 +++ include/linux/mm.h | 6 ++++++ kernel/fork.c | 16 ++++++++++++++++ 4 files changed, 34 insertions(+) diff --git a/arch/Kconfig b/arch/Kconfig index a65eafb24997..a8e866cd4247 100644 --- a/arch/Kconfig +++ b/arch/Kconfig @@ -491,6 +491,15 @@ config PGTABLE_LEVELS int default 2 +config ARCH_HAS_PGD_INIT_LATE + bool + help + Architectures that want a late PGD initialization can define + the arch_pgd_init_late() callback and it will be called + by the generic new task (fork()) code after a new task has + been made visible on the task list, but before it has been + first scheduled. + config ARCH_HAS_ELF_RANDOMIZE bool help diff --git a/fs/exec.c b/fs/exec.c index 1977c2a553ac..c1d213c64fda 100644 --- a/fs/exec.c +++ b/fs/exec.c @@ -860,7 +860,10 @@ static int exec_mmap(struct mm_struct *mm) } task_lock(tsk); active_mm = tsk->active_mm; + tsk->mm = mm; + arch_pgd_init_late(mm, mm->pgd); + tsk->active_mm = mm; activate_mm(active_mm, mm); tsk->mm->vmacache_seqnum = 0; diff --git a/include/linux/mm.h b/include/linux/mm.h index 0755b9fd03a7..35d887d2b038 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -1134,6 +1134,12 @@ int follow_phys(struct vm_area_struct *vma, unsigned long address, int generic_access_phys(struct vm_area_struct *vma, unsigned long addr, void *buf, int len, int write); +#ifdef CONFIG_ARCH_HAS_PGD_INIT_LATE +void arch_pgd_init_late(struct mm_struct *mm, pgd_t *pgd); +#else +static inline void arch_pgd_init_late(struct mm_struct *mm, pgd_t *pgd) { } +#endif + static inline void unmap_shared_mapping_range(struct address_space *mapping, loff_t const holebegin, loff_t const holelen) { diff --git a/kernel/fork.c b/kernel/fork.c index 03c1eaaa6ef5..1f83ceca6c6c 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -1592,6 +1592,22 @@ static struct task_struct *copy_process(unsigned long clone_flags, syscall_tracepoint_update(p); write_unlock_irq(&tasklist_lock); + /* + * If we have a new PGD then initialize it: + * + * This method is called after a task has been made visible + * on the task list already. + * + * Architectures that manage per task kernel pagetables + * might use this callback to initialize them after they + * are already visible to new updates. + * + * NOTE: any user-space parts of the PGD are already initialized + * and must not be clobbered. + */ + if (p->mm != current->mm) + arch_pgd_init_late(p->mm, p->mm->pgd); + proc_fork_connector(p); cgroup_post_fork(p); if (clone_flags & CLONE_THREAD) -- 2.1.4 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1754836AbbFKOJr (ORCPT ); Thu, 11 Jun 2015 10:09:47 -0400 Received: from mail-wg0-f49.google.com ([74.125.82.49]:33739 "EHLO mail-wg0-f49.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S932344AbbFKOHp (ORCPT ); Thu, 11 Jun 2015 10:07:45 -0400 From: Ingo Molnar To: linux-kernel@vger.kernel.org Cc: linux-mml@vger.kernel.org, Andy Lutomirski , Andrew Morton , Denys Vlasenko , Brian Gerst , Peter Zijlstra , Borislav Petkov , "H. Peter Anvin" , Linus Torvalds , Oleg Nesterov , Thomas Gleixner , Waiman Long Subject: [PATCH 06/12] x86/mm: Enable and use the arch_pgd_init_late() method Date: Thu, 11 Jun 2015 16:07:11 +0200 Message-Id: <1434031637-9091-7-git-send-email-mingo@kernel.org> X-Mailer: git-send-email 2.1.4 In-Reply-To: <1434031637-9091-1-git-send-email-mingo@kernel.org> References: <1434031637-9091-1-git-send-email-mingo@kernel.org> Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Prepare for lockless PGD init: enable the arch_pgd_init_late() callback and add a 'careful' implementation of PGD init to it: only copy over non-zero entries. Since PGD entries only ever get added, this method catches any updates to swapper_pg_dir[] that might have occured between early PGD init and late PGD init. Note that this only matters for code that does not use the pgd_list but the task list to find all PGDs in the system. Subsequent patches will convert pgd_list users to task-list iterations. [ This adds extra overhead in that we do the PGD initialization for a second time - a later patch will simplify this, once we don't have old pgd_list users. ] Cc: Andrew Morton Cc: Andy Lutomirski Cc: Borislav Petkov Cc: Brian Gerst Cc: Denys Vlasenko Cc: H. Peter Anvin Cc: Linus Torvalds Cc: Peter Zijlstra Cc: Thomas Gleixner Signed-off-by: Ingo Molnar --- arch/x86/Kconfig | 1 + arch/x86/mm/pgtable.c | 57 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++ 2 files changed, 58 insertions(+) diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig index 7e39f9b22705..15c19ce149f0 100644 --- a/arch/x86/Kconfig +++ b/arch/x86/Kconfig @@ -27,6 +27,7 @@ config X86 select ARCH_HAS_ELF_RANDOMIZE select ARCH_HAS_FAST_MULTIPLIER select ARCH_HAS_GCOV_PROFILE_ALL + select ARCH_HAS_PGD_INIT_LATE select ARCH_HAS_SG_CHAIN select ARCH_HAVE_NMI_SAFE_CMPXCHG select ARCH_MIGHT_HAVE_ACPI_PDC if ACPI diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c index fb0a9dd1d6e4..e0bf90470d70 100644 --- a/arch/x86/mm/pgtable.c +++ b/arch/x86/mm/pgtable.c @@ -391,6 +391,63 @@ pgd_t *pgd_alloc(struct mm_struct *mm) return NULL; } +/* + * Initialize the kernel portion of the PGD. + * + * This is done separately, because pgd_alloc() happens when + * the task is not on the task list yet - and PGD updates + * happen by walking the task list. + * + * No locking is needed here, as we just copy over the reference + * PGD. The reference PGD (pgtable_init) is only ever expanded + * at the highest, PGD level. Thus any other task extending it + * will first update the reference PGD, then modify the task PGDs. + */ +void arch_pgd_init_late(struct mm_struct *mm, pgd_t *pgd) +{ + /* + * This is called after a new MM has been made visible + * in fork() or exec(). + * + * This barrier makes sure the MM is visible to new RCU + * walkers before we initialize it, so that we don't miss + * updates: + */ + smp_wmb(); + + /* + * If the pgd points to a shared pagetable level (either the + * ptes in non-PAE, or shared PMD in PAE), then just copy the + * references from swapper_pg_dir: + */ + if (CONFIG_PGTABLE_LEVELS == 2 || + (CONFIG_PGTABLE_LEVELS == 3 && SHARED_KERNEL_PMD) || + CONFIG_PGTABLE_LEVELS == 4) { + + pgd_t *pgd_src = swapper_pg_dir + KERNEL_PGD_BOUNDARY; + pgd_t *pgd_dst = pgd + KERNEL_PGD_BOUNDARY; + int i; + + for (i = 0; i < KERNEL_PGD_PTRS; i++, pgd_src++, pgd_dst++) { + /* + * This is lock-less, so it can race with PGD updates + * coming from vmalloc() or CPA methods, but it's safe, + * because: + * + * 1) this PGD is not in use yet, we have still not + * scheduled this task. + * 2) we only ever extend PGD entries + * + * So if we observe a non-zero PGD entry we can copy it, + * it won't change from under us. Parallel updates (new + * allocations) will modify our (already visible) PGD: + */ + if (pgd_val(*pgd_src)) + WRITE_ONCE(*pgd_dst, *pgd_src); + } + } +} + void pgd_free(struct mm_struct *mm, pgd_t *pgd) { pgd_mop_up_pmds(mm, pgd); -- 2.1.4 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S933435AbbFKOHz (ORCPT ); Thu, 11 Jun 2015 10:07:55 -0400 Received: from mail-wg0-f45.google.com ([74.125.82.45]:36422 "EHLO mail-wg0-f45.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S932742AbbFKOHq (ORCPT ); Thu, 11 Jun 2015 10:07:46 -0400 From: Ingo Molnar To: linux-kernel@vger.kernel.org Cc: linux-mml@vger.kernel.org, Andy Lutomirski , Andrew Morton , Denys Vlasenko , Brian Gerst , Peter Zijlstra , Borislav Petkov , "H. Peter Anvin" , Linus Torvalds , Oleg Nesterov , Thomas Gleixner , Waiman Long Subject: [PATCH 07/12] x86/virt/guest/xen: Remove use of pgd_list from the Xen guest code Date: Thu, 11 Jun 2015 16:07:12 +0200 Message-Id: <1434031637-9091-8-git-send-email-mingo@kernel.org> X-Mailer: git-send-email 2.1.4 In-Reply-To: <1434031637-9091-1-git-send-email-mingo@kernel.org> References: <1434031637-9091-1-git-send-email-mingo@kernel.org> Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org xen_mm_pin_all()/unpin_all() are used to implement full guest instance suspend/restore. It's a stop-all method that needs to iterate through all allocated pgds in the system to fix them up for Xen's use. This code uses pgd_list, probably because it was an easy interface. But we want to remove the pgd_list, so convert the code over to walk all tasks in the system. This is an equivalent method. (As I don't use Xen this is was only build tested.) Cc: Andrew Morton Cc: Andy Lutomirski Cc: Borislav Petkov Cc: Brian Gerst Cc: Denys Vlasenko Cc: H. Peter Anvin Cc: Linus Torvalds Cc: Peter Zijlstra Cc: Thomas Gleixner Cc: Waiman Long Signed-off-by: Ingo Molnar --- arch/x86/xen/mmu.c | 34 ++++++++++++++++++++++++++++------ 1 file changed, 28 insertions(+), 6 deletions(-) diff --git a/arch/x86/xen/mmu.c b/arch/x86/xen/mmu.c index dd151b2045b0..87a8354435f8 100644 --- a/arch/x86/xen/mmu.c +++ b/arch/x86/xen/mmu.c @@ -853,18 +853,29 @@ static void xen_pgd_pin(struct mm_struct *mm) */ void xen_mm_pin_all(void) { - struct page *page; + struct task_struct *g, *p; + rcu_read_lock(); spin_lock(&pgd_lock); - list_for_each_entry(page, &pgd_list, lru) { + for_each_process_thread(g, p) { + struct page *page; + pgd_t *pgd; + + if (!p->mm) + continue; + + pgd = p->mm->pgd; + page = virt_to_page(pgd); + if (!PagePinned(page)) { - __xen_pgd_pin(&init_mm, (pgd_t *)page_address(page)); + __xen_pgd_pin(&init_mm, pgd); SetPageSavePinned(page); } } spin_unlock(&pgd_lock); + rcu_read_unlock(); } /* @@ -967,19 +978,30 @@ static void xen_pgd_unpin(struct mm_struct *mm) */ void xen_mm_unpin_all(void) { - struct page *page; + struct task_struct *g, *p; + rcu_read_lock(); spin_lock(&pgd_lock); - list_for_each_entry(page, &pgd_list, lru) { + for_each_process_thread(g, p) { + struct page *page; + pgd_t *pgd; + + if (!p->mm) + continue; + + pgd = p->mm->pgd; + page = virt_to_page(pgd); + if (PageSavePinned(page)) { BUG_ON(!PagePinned(page)); - __xen_pgd_unpin(&init_mm, (pgd_t *)page_address(page)); + __xen_pgd_unpin(&init_mm, pgd); ClearPageSavePinned(page); } } spin_unlock(&pgd_lock); + rcu_read_unlock(); } static void xen_activate_mm(struct mm_struct *prev, struct mm_struct *next) -- 2.1.4 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1754738AbbFKOH6 (ORCPT ); Thu, 11 Jun 2015 10:07:58 -0400 Received: from mail-wi0-f174.google.com ([209.85.212.174]:36006 "EHLO mail-wi0-f174.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S932088AbbFKOHs (ORCPT ); Thu, 11 Jun 2015 10:07:48 -0400 From: Ingo Molnar To: linux-kernel@vger.kernel.org Cc: linux-mml@vger.kernel.org, Andy Lutomirski , Andrew Morton , Denys Vlasenko , Brian Gerst , Peter Zijlstra , Borislav Petkov , "H. Peter Anvin" , Linus Torvalds , Oleg Nesterov , Thomas Gleixner , Waiman Long Subject: [PATCH 08/12] x86/mm: Remove pgd_list use from vmalloc_sync_all() Date: Thu, 11 Jun 2015 16:07:13 +0200 Message-Id: <1434031637-9091-9-git-send-email-mingo@kernel.org> X-Mailer: git-send-email 2.1.4 In-Reply-To: <1434031637-9091-1-git-send-email-mingo@kernel.org> References: <1434031637-9091-1-git-send-email-mingo@kernel.org> Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org The vmalloc() code uses vmalloc_sync_all() to synchronize changes to the global reference kernel PGD to task PGDs. This uses pgd_list currently, but we don't need the global list, as we can walk the task list under RCU. Cc: Andrew Morton Cc: Andy Lutomirski Cc: Borislav Petkov Cc: Brian Gerst Cc: Denys Vlasenko Cc: H. Peter Anvin Cc: Linus Torvalds Cc: Peter Zijlstra Cc: Rik van Riel Cc: Thomas Gleixner Cc: Waiman Long Cc: linux-mm@vger.kernel.org Signed-off-by: Ingo Molnar --- arch/x86/mm/fault.c | 21 ++++++++++++++------- arch/x86/mm/init_64.c | 3 ++- 2 files changed, 16 insertions(+), 8 deletions(-) diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c index 50342825f221..2d587e548b59 100644 --- a/arch/x86/mm/fault.c +++ b/arch/x86/mm/fault.c @@ -235,24 +235,31 @@ void vmalloc_sync_all(void) for (address = VMALLOC_START & PMD_MASK; address >= TASK_SIZE && address < FIXADDR_TOP; address += PMD_SIZE) { - struct page *page; + struct task_struct *g, *p; + + rcu_read_lock(); spin_lock(&pgd_lock); - list_for_each_entry(page, &pgd_list, lru) { + + for_each_process_thread(g, p) { spinlock_t *pgt_lock; - pmd_t *ret; + pmd_t *pmd_ret; - /* the pgt_lock only for Xen */ - pgt_lock = &pgd_page_get_mm(page)->page_table_lock; + if (!p->mm) + continue; + /* The pgt_lock is only used on Xen: */ + pgt_lock = &p->mm->page_table_lock; spin_lock(pgt_lock); - ret = vmalloc_sync_one(page_address(page), address); + pmd_ret = vmalloc_sync_one(p->mm->pgd, address); spin_unlock(pgt_lock); - if (!ret) + if (!pmd_ret) break; } + spin_unlock(&pgd_lock); + rcu_read_unlock(); } } diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c index beee532b76a7..730560c4873e 100644 --- a/arch/x86/mm/init_64.c +++ b/arch/x86/mm/init_64.c @@ -184,11 +184,12 @@ void sync_global_pgds(unsigned long start, unsigned long end) spin_lock(&pgd_lock); for_each_process_thread(g, p) { - pgd_t *pgd = p->mm->pgd; + pgd_t *pgd; spinlock_t *pgt_lock; if (!p->mm) continue; + pgd = p->mm->pgd; /* The pgt_lock is only used by Xen: */ pgt_lock = &p->mm->page_table_lock; -- 2.1.4 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1754817AbbFKOJW (ORCPT ); Thu, 11 Jun 2015 10:09:22 -0400 Received: from mail-wg0-f53.google.com ([74.125.82.53]:32776 "EHLO mail-wg0-f53.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S932974AbbFKOHt (ORCPT ); Thu, 11 Jun 2015 10:07:49 -0400 From: Ingo Molnar To: linux-kernel@vger.kernel.org Cc: linux-mml@vger.kernel.org, Andy Lutomirski , Andrew Morton , Denys Vlasenko , Brian Gerst , Peter Zijlstra , Borislav Petkov , "H. Peter Anvin" , Linus Torvalds , Oleg Nesterov , Thomas Gleixner , Waiman Long Subject: [PATCH 09/12] x86/mm/pat/32: Remove pgd_list use from the PAT code Date: Thu, 11 Jun 2015 16:07:14 +0200 Message-Id: <1434031637-9091-10-git-send-email-mingo@kernel.org> X-Mailer: git-send-email 2.1.4 In-Reply-To: <1434031637-9091-1-git-send-email-mingo@kernel.org> References: <1434031637-9091-1-git-send-email-mingo@kernel.org> Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org The 32-bit x86 PAT code uses __set_pmd_pte() to update pmds. This uses pgd_list currently, but we don't need the global list as we can walk the task list under RCU. (This code already holds the pgd_lock.) Cc: Andrew Morton Cc: Andy Lutomirski Cc: Borislav Petkov Cc: Brian Gerst Cc: Denys Vlasenko Cc: H. Peter Anvin Cc: Linus Torvalds Cc: Peter Zijlstra Cc: Rik van Riel Cc: Thomas Gleixner Cc: Waiman Long Cc: linux-mm@vger.kernel.org Signed-off-by: Ingo Molnar --- arch/x86/mm/pageattr.c | 19 ++++++++++++++++--- 1 file changed, 16 insertions(+), 3 deletions(-) diff --git a/arch/x86/mm/pageattr.c b/arch/x86/mm/pageattr.c index 997fc97e9072..93c134fdb398 100644 --- a/arch/x86/mm/pageattr.c +++ b/arch/x86/mm/pageattr.c @@ -438,18 +438,31 @@ static void __set_pmd_pte(pte_t *kpte, unsigned long address, pte_t pte) set_pte_atomic(kpte, pte); #ifdef CONFIG_X86_32 if (!SHARED_KERNEL_PMD) { - struct page *page; + struct task_struct *g, *p; - list_for_each_entry(page, &pgd_list, lru) { + rcu_read_lock(); + + for_each_process_thread(g, p) { + spinlock_t *pgt_lock; pgd_t *pgd; pud_t *pud; pmd_t *pmd; - pgd = (pgd_t *)page_address(page) + pgd_index(address); + if (!p->mm) + continue; + + pgt_lock = &p->mm->page_table_lock; + spin_lock(pgt_lock); + + pgd = p->mm->pgd + pgd_index(address); pud = pud_offset(pgd, address); pmd = pmd_offset(pud, address); set_pte_atomic((pte_t *)pmd, pte); + + spin_unlock(pgt_lock); } + + rcu_read_unlock(); } #endif } -- 2.1.4 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1754799AbbFKOJF (ORCPT ); Thu, 11 Jun 2015 10:09:05 -0400 Received: from mail-wi0-f172.google.com ([209.85.212.172]:37819 "EHLO mail-wi0-f172.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752875AbbFKOHv (ORCPT ); Thu, 11 Jun 2015 10:07:51 -0400 From: Ingo Molnar To: linux-kernel@vger.kernel.org Cc: linux-mml@vger.kernel.org, Andy Lutomirski , Andrew Morton , Denys Vlasenko , Brian Gerst , Peter Zijlstra , Borislav Petkov , "H. Peter Anvin" , Linus Torvalds , Oleg Nesterov , Thomas Gleixner , Waiman Long Subject: [PATCH 10/12] x86/mm: Make pgd_alloc()/pgd_free() lockless Date: Thu, 11 Jun 2015 16:07:15 +0200 Message-Id: <1434031637-9091-11-git-send-email-mingo@kernel.org> X-Mailer: git-send-email 2.1.4 In-Reply-To: <1434031637-9091-1-git-send-email-mingo@kernel.org> References: <1434031637-9091-1-git-send-email-mingo@kernel.org> Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org The fork()/exit() code uses pgd_alloc()/pgd_free() to allocate/deallocate the PGD, with platform specific code setting up kernel pagetables. The x86 code uses a global pgd_list with an associated lock to update all PGDs of all tasks in the system synchronously. The lock is still kept to synchronize updates to all PGDs in the system, but all users of the list have been migrated to use the task list. So we can remove the pgd_list addition/removal from this code. The new PGD is private while constructed, so it needs no extra locking. Cc: Andrew Morton Cc: Andy Lutomirski Cc: Borislav Petkov Cc: Brian Gerst Cc: Denys Vlasenko Cc: H. Peter Anvin Cc: Linus Torvalds Cc: Peter Zijlstra Cc: Rik van Riel Cc: Thomas Gleixner Cc: Waiman Long Cc: linux-mm@vger.kernel.org Signed-off-by: Ingo Molnar --- arch/x86/mm/pgtable.c | 27 +++------------------------ 1 file changed, 3 insertions(+), 24 deletions(-) diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c index e0bf90470d70..d855010a111f 100644 --- a/arch/x86/mm/pgtable.c +++ b/arch/x86/mm/pgtable.c @@ -125,22 +125,6 @@ static void pgd_ctor(struct mm_struct *mm, pgd_t *pgd) swapper_pg_dir + KERNEL_PGD_BOUNDARY, KERNEL_PGD_PTRS); } - - /* list required to sync kernel mapping updates */ - if (!SHARED_KERNEL_PMD) { - pgd_set_mm(pgd, mm); - pgd_list_add(pgd); - } -} - -static void pgd_dtor(pgd_t *pgd) -{ - if (SHARED_KERNEL_PMD) - return; - - spin_lock(&pgd_lock); - pgd_list_del(pgd); - spin_unlock(&pgd_lock); } /* @@ -370,17 +354,13 @@ pgd_t *pgd_alloc(struct mm_struct *mm) goto out_free_pmds; /* - * Make sure that pre-populating the pmds is atomic with - * respect to anything walking the pgd_list, so that they - * never see a partially populated pgd. + * No locking is needed here, as the PGD is still private, + * so no code walking the task list and looking at mm->pgd + * will be able to see it before it's fully constructed: */ - spin_lock(&pgd_lock); - pgd_ctor(mm, pgd); pgd_prepopulate_pmd(mm, pgd, pmds); - spin_unlock(&pgd_lock); - return pgd; out_free_pmds: @@ -451,7 +431,6 @@ void arch_pgd_init_late(struct mm_struct *mm, pgd_t *pgd) void pgd_free(struct mm_struct *mm, pgd_t *pgd) { pgd_mop_up_pmds(mm, pgd); - pgd_dtor(pgd); paravirt_pgd_free(mm, pgd); _pgd_free(pgd); } -- 2.1.4 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1754784AbbFKOIt (ORCPT ); Thu, 11 Jun 2015 10:08:49 -0400 Received: from mail-wi0-f180.google.com ([209.85.212.180]:36090 "EHLO mail-wi0-f180.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S933255AbbFKOHx (ORCPT ); Thu, 11 Jun 2015 10:07:53 -0400 From: Ingo Molnar To: linux-kernel@vger.kernel.org Cc: linux-mml@vger.kernel.org, Andy Lutomirski , Andrew Morton , Denys Vlasenko , Brian Gerst , Peter Zijlstra , Borislav Petkov , "H. Peter Anvin" , Linus Torvalds , Oleg Nesterov , Thomas Gleixner , Waiman Long Subject: [PATCH 11/12] x86/mm: Remove pgd_list leftovers Date: Thu, 11 Jun 2015 16:07:16 +0200 Message-Id: <1434031637-9091-12-git-send-email-mingo@kernel.org> X-Mailer: git-send-email 2.1.4 In-Reply-To: <1434031637-9091-1-git-send-email-mingo@kernel.org> References: <1434031637-9091-1-git-send-email-mingo@kernel.org> Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Nothing uses the pgd_list anymore - remove the list itself and its helpers. Cc: Andrew Morton Cc: Andy Lutomirski Cc: Borislav Petkov Cc: Brian Gerst Cc: Denys Vlasenko Cc: H. Peter Anvin Cc: Linus Torvalds Cc: Peter Zijlstra Cc: Rik van Riel Cc: Thomas Gleixner Cc: Waiman Long Cc: linux-mm@vger.kernel.org Signed-off-by: Ingo Molnar --- arch/x86/include/asm/pgtable.h | 3 --- arch/x86/mm/fault.c | 1 - arch/x86/mm/pgtable.c | 26 -------------------------- 3 files changed, 30 deletions(-) diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h index fe57e7a98839..7e93438d8b53 100644 --- a/arch/x86/include/asm/pgtable.h +++ b/arch/x86/include/asm/pgtable.h @@ -29,9 +29,6 @@ extern unsigned long empty_zero_page[PAGE_SIZE / sizeof(unsigned long)] #define ZERO_PAGE(vaddr) (virt_to_page(empty_zero_page)) extern spinlock_t pgd_lock; -extern struct list_head pgd_list; - -extern struct mm_struct *pgd_page_get_mm(struct page *page); #ifdef CONFIG_PARAVIRT #include diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c index 2d587e548b59..bebdd97f888b 100644 --- a/arch/x86/mm/fault.c +++ b/arch/x86/mm/fault.c @@ -186,7 +186,6 @@ force_sig_info_fault(int si_signo, int si_code, unsigned long address, } DEFINE_SPINLOCK(pgd_lock); -LIST_HEAD(pgd_list); #ifdef CONFIG_X86_32 static inline pmd_t *vmalloc_sync_one(pgd_t *pgd, unsigned long address) diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c index d855010a111f..0666f7796806 100644 --- a/arch/x86/mm/pgtable.c +++ b/arch/x86/mm/pgtable.c @@ -84,35 +84,9 @@ void ___pud_free_tlb(struct mmu_gather *tlb, pud_t *pud) #endif /* CONFIG_PGTABLE_LEVELS > 3 */ #endif /* CONFIG_PGTABLE_LEVELS > 2 */ -static inline void pgd_list_add(pgd_t *pgd) -{ - struct page *page = virt_to_page(pgd); - - list_add(&page->lru, &pgd_list); -} - -static inline void pgd_list_del(pgd_t *pgd) -{ - struct page *page = virt_to_page(pgd); - - list_del(&page->lru); -} - #define UNSHARED_PTRS_PER_PGD \ (SHARED_KERNEL_PMD ? KERNEL_PGD_BOUNDARY : PTRS_PER_PGD) - -static void pgd_set_mm(pgd_t *pgd, struct mm_struct *mm) -{ - BUILD_BUG_ON(sizeof(virt_to_page(pgd)->index) < sizeof(mm)); - virt_to_page(pgd)->index = (pgoff_t)mm; -} - -struct mm_struct *pgd_page_get_mm(struct page *page) -{ - return (struct mm_struct *)page->index; -} - static void pgd_ctor(struct mm_struct *mm, pgd_t *pgd) { /* If the pgd points to a shared pagetable level (either the -- 2.1.4 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1754766AbbFKOIA (ORCPT ); Thu, 11 Jun 2015 10:08:00 -0400 Received: from mail-wg0-f52.google.com ([74.125.82.52]:34754 "EHLO mail-wg0-f52.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S933394AbbFKOHz (ORCPT ); Thu, 11 Jun 2015 10:07:55 -0400 From: Ingo Molnar To: linux-kernel@vger.kernel.org Cc: linux-mml@vger.kernel.org, Andy Lutomirski , Andrew Morton , Denys Vlasenko , Brian Gerst , Peter Zijlstra , Borislav Petkov , "H. Peter Anvin" , Linus Torvalds , Oleg Nesterov , Thomas Gleixner , Waiman Long Subject: [PATCH 12/12] x86/mm: Simplify pgd_alloc() Date: Thu, 11 Jun 2015 16:07:17 +0200 Message-Id: <1434031637-9091-13-git-send-email-mingo@kernel.org> X-Mailer: git-send-email 2.1.4 In-Reply-To: <1434031637-9091-1-git-send-email-mingo@kernel.org> References: <1434031637-9091-1-git-send-email-mingo@kernel.org> Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Right now pgd_alloc() uses pgd_ctor(), which copies over the current swapper_pg_dir[] to a new task's PGD. This is not necessary, it's enough if we clear it: the PGD will then be properly updated by arch_pgd_init_late(). Cc: Andrew Morton Cc: Andy Lutomirski Cc: Borislav Petkov Cc: Brian Gerst Cc: Denys Vlasenko Cc: H. Peter Anvin Cc: Linus Torvalds Cc: Peter Zijlstra Cc: Rik van Riel Cc: Thomas Gleixner Cc: Waiman Long Cc: linux-mm@vger.kernel.org Signed-off-by: Ingo Molnar --- arch/x86/mm/pgtable.c | 27 +++++++++------------------ 1 file changed, 9 insertions(+), 18 deletions(-) diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c index 0666f7796806..b1bd35f452ef 100644 --- a/arch/x86/mm/pgtable.c +++ b/arch/x86/mm/pgtable.c @@ -87,20 +87,6 @@ void ___pud_free_tlb(struct mmu_gather *tlb, pud_t *pud) #define UNSHARED_PTRS_PER_PGD \ (SHARED_KERNEL_PMD ? KERNEL_PGD_BOUNDARY : PTRS_PER_PGD) -static void pgd_ctor(struct mm_struct *mm, pgd_t *pgd) -{ - /* If the pgd points to a shared pagetable level (either the - ptes in non-PAE, or shared PMD in PAE), then just copy the - references from swapper_pg_dir. */ - if (CONFIG_PGTABLE_LEVELS == 2 || - (CONFIG_PGTABLE_LEVELS == 3 && SHARED_KERNEL_PMD) || - CONFIG_PGTABLE_LEVELS == 4) { - clone_pgd_range(pgd + KERNEL_PGD_BOUNDARY, - swapper_pg_dir + KERNEL_PGD_BOUNDARY, - KERNEL_PGD_PTRS); - } -} - /* * List of all pgd's needed for non-PAE so it can invalidate entries * in both cached and uncached pgd's; not needed for PAE since the @@ -328,11 +314,16 @@ pgd_t *pgd_alloc(struct mm_struct *mm) goto out_free_pmds; /* - * No locking is needed here, as the PGD is still private, - * so no code walking the task list and looking at mm->pgd - * will be able to see it before it's fully constructed: + * Zero out the kernel portion here, we'll set them up in + * arch_pgd_init_late(), when the pgd is globally + * visible already per the task list, so that it cannot + * miss updates. + * + * We need to zero it here, to make sure arch_pgd_init_late() + * can initialize them without locking. */ - pgd_ctor(mm, pgd); + memset(pgd + KERNEL_PGD_BOUNDARY, 0, KERNEL_PGD_PTRS*sizeof(pgd_t)); + pgd_prepopulate_pmd(mm, pgd, pmds); return pgd; -- 2.1.4 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1754920AbbFKOOA (ORCPT ); Thu, 11 Jun 2015 10:14:00 -0400 Received: from mail-wi0-f181.google.com ([209.85.212.181]:38632 "EHLO mail-wi0-f181.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752432AbbFKON6 (ORCPT ); Thu, 11 Jun 2015 10:13:58 -0400 Date: Thu, 11 Jun 2015 16:13:53 +0200 From: Ingo Molnar To: linux-kernel@vger.kernel.org, linux-mm@kvack.org Cc: Andy Lutomirski , Andrew Morton , Denys Vlasenko , Brian Gerst , Peter Zijlstra , Borislav Petkov , "H. Peter Anvin" , Linus Torvalds , Oleg Nesterov , Thomas Gleixner , Waiman Long Subject: Re: [RFC PATCH 00/12] x86/mm: Implement lockless pgd_alloc()/pgd_free() Message-ID: <20150611141353.GA9447@gmail.com> References: <1434031637-9091-1-git-send-email-mingo@kernel.org> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <1434031637-9091-1-git-send-email-mingo@kernel.org> User-Agent: Mutt/1.5.23 (2014-03-12) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org [ I fat-fingered the linux-mm Cc:, so every reply will bounce on that, sorry about that :-/ Fixed it in this mail's Cc: list. ] * Ingo Molnar wrote: > Waiman Long reported 'pgd_lock' contention on high CPU count systems and > proposed moving pgd_lock on a separate cacheline to eliminate false sharing and > to reduce some of the lock bouncing overhead. So 'pgd_lock' is a global lock, used for every new task creation: arch/x86/mm/fault.c:DEFINE_SPINLOCK(pgd_lock); which with a sufficiently high CPU count starts to hurt. Thanks, Ingo From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1755057AbbFKPEs (ORCPT ); Thu, 11 Jun 2015 11:04:48 -0400 Received: from mail-lb0-f179.google.com ([209.85.217.179]:35557 "EHLO mail-lb0-f179.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752697AbbFKPEp (ORCPT ); Thu, 11 Jun 2015 11:04:45 -0400 MIME-Version: 1.0 In-Reply-To: <1434031637-9091-9-git-send-email-mingo@kernel.org> References: <1434031637-9091-1-git-send-email-mingo@kernel.org> <1434031637-9091-9-git-send-email-mingo@kernel.org> From: Andy Lutomirski Date: Thu, 11 Jun 2015 08:04:23 -0700 Message-ID: Subject: Re: [PATCH 08/12] x86/mm: Remove pgd_list use from vmalloc_sync_all() To: Ingo Molnar Cc: "linux-kernel@vger.kernel.org" , linux-mml@vger.kernel.org, Andrew Morton , Denys Vlasenko , Brian Gerst , Peter Zijlstra , Borislav Petkov , "H. Peter Anvin" , Linus Torvalds , Oleg Nesterov , Thomas Gleixner , Waiman Long Content-Type: text/plain; charset=UTF-8 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Thu, Jun 11, 2015 at 7:07 AM, Ingo Molnar wrote: > The vmalloc() code uses vmalloc_sync_all() to synchronize changes to > the global reference kernel PGD to task PGDs. Does it? AFAICS the only caller is register_die_notifier, and it's not really clear to me why that exists. At some point I'd love to remove lazy kernel PGD sync from the kernel entirely (or at least from x86) and just do it when we switch mms. Now that you're removing all code that deletes kernel PGD entries, I think all we'd need to do is to add a per-PGD or per-mm count of the number of kernel entries populated and to fix it up when we switch to an mm with fewer entries populated than init_mm. --Andy From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752997AbbFKPuI (ORCPT ); Thu, 11 Jun 2015 11:50:08 -0400 Received: from mail-wi0-f173.google.com ([209.85.212.173]:38135 "EHLO mail-wi0-f173.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751067AbbFKPuE (ORCPT ); Thu, 11 Jun 2015 11:50:04 -0400 Date: Thu, 11 Jun 2015 17:49:58 +0200 From: Ingo Molnar To: Andy Lutomirski Cc: "linux-kernel@vger.kernel.org" , linux-mml@vger.kernel.org, Andrew Morton , Denys Vlasenko , Brian Gerst , Peter Zijlstra , Borislav Petkov , "H. Peter Anvin" , Linus Torvalds , Oleg Nesterov , Thomas Gleixner , Waiman Long Subject: Re: [PATCH 08/12] x86/mm: Remove pgd_list use from vmalloc_sync_all() Message-ID: <20150611154958.GA16799@gmail.com> References: <1434031637-9091-1-git-send-email-mingo@kernel.org> <1434031637-9091-9-git-send-email-mingo@kernel.org> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: User-Agent: Mutt/1.5.23 (2014-03-12) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org * Andy Lutomirski wrote: > On Thu, Jun 11, 2015 at 7:07 AM, Ingo Molnar wrote: > > The vmalloc() code uses vmalloc_sync_all() to synchronize changes to > > the global reference kernel PGD to task PGDs. > > Does it? AFAICS the only caller is register_die_notifier, and it's > not really clear to me why that exists. Doh, indeed, got confused in that changelog - we are filling it in opportunistically via vmalloc_fault(). > At some point I'd love to remove lazy kernel PGD sync from the kernel entirely > (or at least from x86) and just do it when we switch mms. Now that you're > removing all code that deletes kernel PGD entries, I think all we'd need to do > is to add a per-PGD or per-mm count of the number of kernel entries populated > and to fix it up when we switch to an mm with fewer entries populated than > init_mm. That would add a (cheap but nonzero) runtime check to every context switch. It's a relative slow path, but in comparison vmalloc() is an even slower slowpath, so why not do it there and just do synchronous updates and remove the vmalloc faults altogether? Also, on 64-bit it should not matter much: there the only change is the once in a blue moon case where we allocate a new pgd for a 512 GB block of address space that a single pgd entry covers. I'd hate to add a check to every context switch, no matter how cheap, just for a case that essentially never triggers... So how about this solution instead: - we add a generation counter to sync_global_pgds() so that it can detect when the number of pgds populated in init_mm changes. - we change vmalloc() to call sync_global_pgds(): this will be very cheap in the overwhelming majority of cases. - we eliminate vmalloc_fault(), on 64-bit at least. Yay! :-) Thanks, Ingo From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1754555AbbFKQKo (ORCPT ); Thu, 11 Jun 2015 12:10:44 -0400 Received: from mail-la0-f53.google.com ([209.85.215.53]:34194 "EHLO mail-la0-f53.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1754445AbbFKQKl (ORCPT ); Thu, 11 Jun 2015 12:10:41 -0400 MIME-Version: 1.0 In-Reply-To: <20150611154958.GA16799@gmail.com> References: <1434031637-9091-1-git-send-email-mingo@kernel.org> <1434031637-9091-9-git-send-email-mingo@kernel.org> <20150611154958.GA16799@gmail.com> From: Andy Lutomirski Date: Thu, 11 Jun 2015 09:10:18 -0700 Message-ID: Subject: Re: [PATCH 08/12] x86/mm: Remove pgd_list use from vmalloc_sync_all() To: Ingo Molnar Cc: "linux-kernel@vger.kernel.org" , linux-mml@vger.kernel.org, Andrew Morton , Denys Vlasenko , Brian Gerst , Peter Zijlstra , Borislav Petkov , "H. Peter Anvin" , Linus Torvalds , Oleg Nesterov , Thomas Gleixner , Waiman Long Content-Type: text/plain; charset=UTF-8 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Thu, Jun 11, 2015 at 8:49 AM, Ingo Molnar wrote: > > * Andy Lutomirski wrote: > >> On Thu, Jun 11, 2015 at 7:07 AM, Ingo Molnar wrote: >> > The vmalloc() code uses vmalloc_sync_all() to synchronize changes to >> > the global reference kernel PGD to task PGDs. >> >> Does it? AFAICS the only caller is register_die_notifier, and it's >> not really clear to me why that exists. > > Doh, indeed, got confused in that changelog - we are filling it in > opportunistically via vmalloc_fault(). > >> At some point I'd love to remove lazy kernel PGD sync from the kernel entirely >> (or at least from x86) and just do it when we switch mms. Now that you're >> removing all code that deletes kernel PGD entries, I think all we'd need to do >> is to add a per-PGD or per-mm count of the number of kernel entries populated >> and to fix it up when we switch to an mm with fewer entries populated than >> init_mm. > > That would add a (cheap but nonzero) runtime check to every context switch. It's a > relative slow path, but in comparison vmalloc() is an even slower slowpath, so why > not do it there and just do synchronous updates and remove the vmalloc faults > altogether? > > Also, on 64-bit it should not matter much: there the only change is the once in a > blue moon case where we allocate a new pgd for a 512 GB block of address space > that a single pgd entry covers. > > I'd hate to add a check to every context switch, no matter how cheap, just for a > case that essentially never triggers... > > So how about this solution instead: > > - we add a generation counter to sync_global_pgds() so that it can detect when > the number of pgds populated in init_mm changes. > > - we change vmalloc() to call sync_global_pgds(): this will be very cheap in the > overwhelming majority of cases. > > - we eliminate vmalloc_fault(), on 64-bit at least. Yay! :-) Should be good. The only real down side is that the update becomes a walk over all tasks or over all mms, whereas if we checked on context switches, then the update is just a single PGD entry copy. That being said, the updates should be *really* rare. I think it's completely impossible for it to happen more than 255 times per boot. So I like your solution. Then I can stop wondering what happens when we manage to take a vmalloc fault early in entry_SYSCALL_64 or somewhere in the page_fault prologue. --Andy From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1753096AbbFKQr5 (ORCPT ); Thu, 11 Jun 2015 12:47:57 -0400 Received: from terminus.zytor.com ([198.137.202.10]:53352 "EHLO mail.zytor.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1750958AbbFKQrz (ORCPT ); Thu, 11 Jun 2015 12:47:55 -0400 Message-ID: <5579BB76.1020508@zytor.com> Date: Thu, 11 Jun 2015 09:46:46 -0700 From: "H. Peter Anvin" User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Thunderbird/31.7.0 MIME-Version: 1.0 To: Ingo Molnar , linux-kernel@vger.kernel.org CC: linux-mml@vger.kernel.org, Andy Lutomirski , Andrew Morton , Denys Vlasenko , Brian Gerst , Peter Zijlstra , Borislav Petkov , Linus Torvalds , Oleg Nesterov , Thomas Gleixner , Waiman Long Subject: Re: [RFC PATCH 00/12] x86/mm: Implement lockless pgd_alloc()/pgd_free() References: <1434031637-9091-1-git-send-email-mingo@kernel.org> In-Reply-To: <1434031637-9091-1-git-send-email-mingo@kernel.org> Content-Type: text/plain; charset=windows-1252; format=flowed Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On 06/11/2015 07:07 AM, Ingo Molnar wrote: > Waiman Long reported 'pgd_lock' contention on high CPU count systems and proposed > moving pgd_lock on a separate cacheline to eliminate false sharing and to reduce > some of the lock bouncing overhead. > > I think we can do much better: this series eliminates the pgd_list and makes > pgd_alloc()/pgd_free() lockless. This ties into a slightly different issue, which is how to deal *properly* with "special" PGDs like the 1:1 trampoline and the UEFI page tables. These, too, should be able to incorporate, possibly more than once, page tables further down. -hpa From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1754930AbbFKSXx (ORCPT ); Thu, 11 Jun 2015 14:23:53 -0400 Received: from mail.linuxfoundation.org ([140.211.169.12]:51927 "EHLO mail.linuxfoundation.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753426AbbFKSXv (ORCPT ); Thu, 11 Jun 2015 14:23:51 -0400 Date: Thu, 11 Jun 2015 11:23:49 -0700 From: Andrew Morton To: Ingo Molnar Cc: linux-kernel@vger.kernel.org, linux-mml@vger.kernel.org, Andy Lutomirski , Denys Vlasenko , Brian Gerst , Peter Zijlstra , Borislav Petkov , "H. Peter Anvin" , Linus Torvalds , Oleg Nesterov , Thomas Gleixner , Waiman Long Subject: Re: [PATCH 05/12] mm: Introduce arch_pgd_init_late() Message-Id: <20150611112349.e63bf0d2116c583e51c7e881@linux-foundation.org> In-Reply-To: <1434031637-9091-6-git-send-email-mingo@kernel.org> References: <1434031637-9091-1-git-send-email-mingo@kernel.org> <1434031637-9091-6-git-send-email-mingo@kernel.org> X-Mailer: Sylpheed 3.4.1 (GTK+ 2.24.23; x86_64-pc-linux-gnu) Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Thu, 11 Jun 2015 16:07:10 +0200 Ingo Molnar wrote: > --- a/kernel/fork.c > +++ b/kernel/fork.c > @@ -1592,6 +1592,22 @@ static struct task_struct *copy_process(unsigned long clone_flags, > syscall_tracepoint_update(p); > write_unlock_irq(&tasklist_lock); > > + /* > + * If we have a new PGD then initialize it: > + * > + * This method is called after a task has been made visible > + * on the task list already. > + * > + * Architectures that manage per task kernel pagetables > + * might use this callback to initialize them after they > + * are already visible to new updates. > + * > + * NOTE: any user-space parts of the PGD are already initialized > + * and must not be clobbered. > + */ > + if (p->mm != current->mm) > + arch_pgd_init_late(p->mm, p->mm->pgd); Couldn't this be arch_pgd_init_late(p->mm) or arch_pgd_init_late(p)? From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1755050AbbFKUFZ (ORCPT ); Thu, 11 Jun 2015 16:05:25 -0400 Received: from aserp1040.oracle.com ([141.146.126.69]:25616 "EHLO aserp1040.oracle.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753129AbbFKUFW (ORCPT ); Thu, 11 Jun 2015 16:05:22 -0400 Message-ID: <5579E9B4.7080601@oracle.com> Date: Thu, 11 Jun 2015 16:04:04 -0400 From: Boris Ostrovsky User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Thunderbird/31.7.0 MIME-Version: 1.0 To: Ingo Molnar , linux-kernel@vger.kernel.org CC: linux-mml@vger.kernel.org, Andy Lutomirski , Andrew Morton , Denys Vlasenko , Brian Gerst , Peter Zijlstra , Borislav Petkov , "H. Peter Anvin" , Linus Torvalds , Oleg Nesterov , Thomas Gleixner , Waiman Long , Konrad Rzeszutek Wilk , David Vrabel Subject: Re: [PATCH 06/12] x86/mm: Enable and use the arch_pgd_init_late() method References: <1434031637-9091-1-git-send-email-mingo@kernel.org> <1434031637-9091-7-git-send-email-mingo@kernel.org> In-Reply-To: <1434031637-9091-7-git-send-email-mingo@kernel.org> Content-Type: text/plain; charset=windows-1252; format=flowed Content-Transfer-Encoding: 7bit X-Source-IP: aserv0022.oracle.com [141.146.126.234] Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On 06/11/2015 10:07 AM, Ingo Molnar wrote: > diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c > index fb0a9dd1d6e4..e0bf90470d70 100644 > --- a/arch/x86/mm/pgtable.c > +++ b/arch/x86/mm/pgtable.c > @@ -391,6 +391,63 @@ pgd_t *pgd_alloc(struct mm_struct *mm) > return NULL; > } > > +/* > + * Initialize the kernel portion of the PGD. > + * > + * This is done separately, because pgd_alloc() happens when > + * the task is not on the task list yet - and PGD updates > + * happen by walking the task list. > + * > + * No locking is needed here, as we just copy over the reference > + * PGD. The reference PGD (pgtable_init) is only ever expanded > + * at the highest, PGD level. Thus any other task extending it > + * will first update the reference PGD, then modify the task PGDs. > + */ > +void arch_pgd_init_late(struct mm_struct *mm, pgd_t *pgd) > +{ > + /* > + * This is called after a new MM has been made visible > + * in fork() or exec(). > + * > + * This barrier makes sure the MM is visible to new RCU > + * walkers before we initialize it, so that we don't miss > + * updates: > + */ > + smp_wmb(); > + > + /* > + * If the pgd points to a shared pagetable level (either the > + * ptes in non-PAE, or shared PMD in PAE), then just copy the > + * references from swapper_pg_dir: > + */ > + if (CONFIG_PGTABLE_LEVELS == 2 || > + (CONFIG_PGTABLE_LEVELS == 3 && SHARED_KERNEL_PMD) || > + CONFIG_PGTABLE_LEVELS == 4) { > + > + pgd_t *pgd_src = swapper_pg_dir + KERNEL_PGD_BOUNDARY; > + pgd_t *pgd_dst = pgd + KERNEL_PGD_BOUNDARY; > + int i; > + > + for (i = 0; i < KERNEL_PGD_PTRS; i++, pgd_src++, pgd_dst++) { > + /* > + * This is lock-less, so it can race with PGD updates > + * coming from vmalloc() or CPA methods, but it's safe, > + * because: > + * > + * 1) this PGD is not in use yet, we have still not > + * scheduled this task. > + * 2) we only ever extend PGD entries > + * > + * So if we observe a non-zero PGD entry we can copy it, > + * it won't change from under us. Parallel updates (new > + * allocations) will modify our (already visible) PGD: > + */ > + if (pgd_val(*pgd_src)) > + WRITE_ONCE(*pgd_dst, *pgd_src); This should be set_pgd(pgd_dst, *pgd_src) in order for it to work as a Xen PV guest. I don't know whether anything would need to be done wrt WRITE_ONCE. Perhaps put it into native_set_pgd()? -boris From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1754982AbbFKUh4 (ORCPT ); Thu, 11 Jun 2015 16:37:56 -0400 Received: from mail-ie0-f170.google.com ([209.85.223.170]:35028 "EHLO mail-ie0-f170.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753130AbbFKUhz (ORCPT ); Thu, 11 Jun 2015 16:37:55 -0400 MIME-Version: 1.0 In-Reply-To: <1434031637-9091-8-git-send-email-mingo@kernel.org> References: <1434031637-9091-1-git-send-email-mingo@kernel.org> <1434031637-9091-8-git-send-email-mingo@kernel.org> Date: Thu, 11 Jun 2015 13:37:54 -0700 X-Google-Sender-Auth: qZTR8jdHfGHPiTWdgxOmtIPQuLQ Message-ID: Subject: Re: [PATCH 07/12] x86/virt/guest/xen: Remove use of pgd_list from the Xen guest code From: Linus Torvalds To: Ingo Molnar Cc: Linux Kernel Mailing List , linux-mml@vger.kernel.org, Andy Lutomirski , Andrew Morton , Denys Vlasenko , Brian Gerst , Peter Zijlstra , Borislav Petkov , "H. Peter Anvin" , Oleg Nesterov , Thomas Gleixner , Waiman Long Content-Type: text/plain; charset=UTF-8 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Thu, Jun 11, 2015 at 7:07 AM, Ingo Molnar wrote: > > But we want to remove the pgd_list, so convert the code over to walk > all tasks in the system. This is an equivalent method. This is not at all equivalent, and it looks stupid. Don't use "for_each_process_thread(g, p)". You only care about each mm, and threads all share the same mm, so just do for_each_process(p) instead of iterating over all threads too. No? Linus From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1755252AbbFKUla (ORCPT ); Thu, 11 Jun 2015 16:41:30 -0400 Received: from mail-ig0-f174.google.com ([209.85.213.174]:35568 "EHLO mail-ig0-f174.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751767AbbFKUl2 (ORCPT ); Thu, 11 Jun 2015 16:41:28 -0400 MIME-Version: 1.0 In-Reply-To: References: <1434031637-9091-1-git-send-email-mingo@kernel.org> <1434031637-9091-8-git-send-email-mingo@kernel.org> Date: Thu, 11 Jun 2015 13:41:27 -0700 X-Google-Sender-Auth: Kkfo-d80p5VcEcrKyJxIVbeTBE4 Message-ID: Subject: Re: [PATCH 07/12] x86/virt/guest/xen: Remove use of pgd_list from the Xen guest code From: Linus Torvalds To: Ingo Molnar Cc: Linux Kernel Mailing List , linux-mml@vger.kernel.org, Andy Lutomirski , Andrew Morton , Denys Vlasenko , Brian Gerst , Peter Zijlstra , Borislav Petkov , "H. Peter Anvin" , Oleg Nesterov , Thomas Gleixner , Waiman Long Content-Type: text/plain; charset=UTF-8 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Thu, Jun 11, 2015 at 1:37 PM, Linus Torvalds wrote: > > Don't use "for_each_process_thread(g, p)". You only care about each > mm, and threads all share the same mm, so just do > > for_each_process(p) > > instead of iterating over all threads too. Hmm. I may be wrong. It strikes me that one of the group leaders might have exited but the subthreads live on. We'd see p->mm being NULL, even though the mm was originally in use. Ugh. So maybe the code really does need to iterate over all threads. Linus From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1755079AbbFLHXK (ORCPT ); Fri, 12 Jun 2015 03:23:10 -0400 Received: from mail-wi0-f172.google.com ([209.85.212.172]:34219 "EHLO mail-wi0-f172.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752546AbbFLHXH (ORCPT ); Fri, 12 Jun 2015 03:23:07 -0400 Date: Fri, 12 Jun 2015 09:23:02 +0200 From: Ingo Molnar To: Linus Torvalds Cc: Linux Kernel Mailing List , linux-mml@vger.kernel.org, Andy Lutomirski , Andrew Morton , Denys Vlasenko , Brian Gerst , Peter Zijlstra , Borislav Petkov , "H. Peter Anvin" , Oleg Nesterov , Thomas Gleixner , Waiman Long Subject: Re: [PATCH 07/12] x86/virt/guest/xen: Remove use of pgd_list from the Xen guest code Message-ID: <20150612072302.GA7509@gmail.com> References: <1434031637-9091-1-git-send-email-mingo@kernel.org> <1434031637-9091-8-git-send-email-mingo@kernel.org> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: User-Agent: Mutt/1.5.23 (2014-03-12) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org * Linus Torvalds wrote: > On Thu, Jun 11, 2015 at 1:37 PM, Linus Torvalds > wrote: > > > > Don't use "for_each_process_thread(g, p)". You only care about each mm, and > > threads all share the same mm, so just do > > > > for_each_process(p) > > > > instead of iterating over all threads too. > > Hmm. I may be wrong. It strikes me that one of the group leaders might have > exited but the subthreads live on. We'd see p->mm being NULL, even though the mm > was originally in use. > > Ugh. So maybe the code really does need to iterate over all threads. Yeah, for_each_process() is indeed no guarantee that we iterate over all mm's. We might make it so: but that would mean restricting certain clone_flags variants - not sure that's possible with our current ABI usage? Thanks, Ingo From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S932461AbbFLIEl (ORCPT ); Fri, 12 Jun 2015 04:04:41 -0400 Received: from mail-wg0-f45.google.com ([74.125.82.45]:34380 "EHLO mail-wg0-f45.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753162AbbFLIEb (ORCPT ); Fri, 12 Jun 2015 04:04:31 -0400 Date: Fri, 12 Jun 2015 10:04:25 +0200 From: Ingo Molnar To: Linus Torvalds Cc: Waiman Long , Thomas Gleixner , Denys Vlasenko , Borislav Petkov , Andrew Morton , Oleg Nesterov , Andy Lutomirski , linux-mml@vger.kernel.org, Linux Kernel Mailing List , Brian Gerst , "H. Peter Anvin" , Peter Zijlstra Subject: Re: [PATCH 07/12] x86/virt/guest/xen: Remove use of pgd_list from the Xen guest code Message-ID: <20150612080425.GC8759@gmail.com> References: <1434031637-9091-1-git-send-email-mingo@kernel.org> <1434031637-9091-8-git-send-email-mingo@kernel.org> <20150612072302.GA7509@gmail.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: User-Agent: Mutt/1.5.23 (2014-03-12) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org * Linus Torvalds wrote: > On Jun 12, 2015 00:23, "Ingo Molnar" wrote: > > > > We might make it so: but that would mean restricting certain clone_flags > > variants - not sure that's possible with our current ABI usage? > > We already do that. You can't share signal info unless you share the mm. And a > shared signal state is what defines a thread group. > > So I think the only issue is that ->mm can become NULL when the thread group > leader dies - a non-NULL mm should always be shared among all threads. Indeed, we do that in exit_mm(). So we could add tsk->mm_leader or so, which does not get cleared and which the scheduler does not look at, but I'm not sure it's entirely safe that way: we don't have a refcount, and when the last thread exits it becomes bogus for a small window until the zombie leader is unlinked from the task list. To close that race we'd have __mmdrop() or so clear out tsk->mm_leader - but the task doing the mmdrop() might be a lazy thread totally unrelated to the original thread group so we don't know which tsk->mm_leader to clear out. To solve that we'd have to track the leader owning an MM in mm_struct - which gets interesting for the exec() case where the thread group gets a new leader, so we'd have to re-link the mm's leader pointer there. So unless I missed some simpler solution there a good number of steps where this could go wrong, in small looking race windows - how about we just live with iterating through all tasks instead of just all processes, once per 512 GB of memory mapped? Thanks, Ingo From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1754936AbbFLIQT (ORCPT ); Fri, 12 Jun 2015 04:16:19 -0400 Received: from mail-wi0-f174.google.com ([209.85.212.174]:32814 "EHLO mail-wi0-f174.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751957AbbFLIQI (ORCPT ); Fri, 12 Jun 2015 04:16:08 -0400 Date: Fri, 12 Jun 2015 10:16:03 +0200 From: Ingo Molnar To: Andrew Morton Cc: linux-kernel@vger.kernel.org, linux-mml@vger.kernel.org, Andy Lutomirski , Denys Vlasenko , Brian Gerst , Peter Zijlstra , Borislav Petkov , "H. Peter Anvin" , Linus Torvalds , Oleg Nesterov , Thomas Gleixner , Waiman Long Subject: Re: [PATCH 05/12] mm: Introduce arch_pgd_init_late() Message-ID: <20150612081603.GA13868@gmail.com> References: <1434031637-9091-1-git-send-email-mingo@kernel.org> <1434031637-9091-6-git-send-email-mingo@kernel.org> <20150611112349.e63bf0d2116c583e51c7e881@linux-foundation.org> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20150611112349.e63bf0d2116c583e51c7e881@linux-foundation.org> User-Agent: Mutt/1.5.23 (2014-03-12) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org * Andrew Morton wrote: > On Thu, 11 Jun 2015 16:07:10 +0200 Ingo Molnar wrote: > > > --- a/kernel/fork.c > > +++ b/kernel/fork.c > > @@ -1592,6 +1592,22 @@ static struct task_struct *copy_process(unsigned long clone_flags, > > syscall_tracepoint_update(p); > > write_unlock_irq(&tasklist_lock); > > > > + /* > > + * If we have a new PGD then initialize it: > > + * > > + * This method is called after a task has been made visible > > + * on the task list already. > > + * > > + * Architectures that manage per task kernel pagetables > > + * might use this callback to initialize them after they > > + * are already visible to new updates. > > + * > > + * NOTE: any user-space parts of the PGD are already initialized > > + * and must not be clobbered. > > + */ > > + if (p->mm != current->mm) > > + arch_pgd_init_late(p->mm, p->mm->pgd); > > Couldn't this be arch_pgd_init_late(p->mm) or arch_pgd_init_late(p)? Indeed - fixed. Thanks, Ingo From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1753368AbbFLIUB (ORCPT ); Fri, 12 Jun 2015 04:20:01 -0400 Received: from mail-wg0-f54.google.com ([74.125.82.54]:33004 "EHLO mail-wg0-f54.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751163AbbFLIT4 (ORCPT ); Fri, 12 Jun 2015 04:19:56 -0400 Date: Fri, 12 Jun 2015 10:19:51 +0200 From: Ingo Molnar To: Boris Ostrovsky Cc: linux-kernel@vger.kernel.org, linux-mml@vger.kernel.org, Andy Lutomirski , Andrew Morton , Denys Vlasenko , Brian Gerst , Peter Zijlstra , Borislav Petkov , "H. Peter Anvin" , Linus Torvalds , Oleg Nesterov , Thomas Gleixner , Waiman Long , Konrad Rzeszutek Wilk , David Vrabel Subject: Re: [PATCH 06/12] x86/mm: Enable and use the arch_pgd_init_late() method Message-ID: <20150612081950.GB13868@gmail.com> References: <1434031637-9091-1-git-send-email-mingo@kernel.org> <1434031637-9091-7-git-send-email-mingo@kernel.org> <5579E9B4.7080601@oracle.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <5579E9B4.7080601@oracle.com> User-Agent: Mutt/1.5.23 (2014-03-12) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org * Boris Ostrovsky wrote: > On 06/11/2015 10:07 AM, Ingo Molnar wrote: > > >diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c > >index fb0a9dd1d6e4..e0bf90470d70 100644 > >--- a/arch/x86/mm/pgtable.c > >+++ b/arch/x86/mm/pgtable.c > >@@ -391,6 +391,63 @@ pgd_t *pgd_alloc(struct mm_struct *mm) > > return NULL; > > } > > > >+/* > >+ * Initialize the kernel portion of the PGD. > >+ * > >+ * This is done separately, because pgd_alloc() happens when > >+ * the task is not on the task list yet - and PGD updates > >+ * happen by walking the task list. > >+ * > >+ * No locking is needed here, as we just copy over the reference > >+ * PGD. The reference PGD (pgtable_init) is only ever expanded > >+ * at the highest, PGD level. Thus any other task extending it > >+ * will first update the reference PGD, then modify the task PGDs. > >+ */ > >+void arch_pgd_init_late(struct mm_struct *mm, pgd_t *pgd) > >+{ > >+ /* > >+ * This is called after a new MM has been made visible > >+ * in fork() or exec(). > >+ * > >+ * This barrier makes sure the MM is visible to new RCU > >+ * walkers before we initialize it, so that we don't miss > >+ * updates: > >+ */ > >+ smp_wmb(); > >+ > >+ /* > >+ * If the pgd points to a shared pagetable level (either the > >+ * ptes in non-PAE, or shared PMD in PAE), then just copy the > >+ * references from swapper_pg_dir: > >+ */ > >+ if (CONFIG_PGTABLE_LEVELS == 2 || > >+ (CONFIG_PGTABLE_LEVELS == 3 && SHARED_KERNEL_PMD) || > >+ CONFIG_PGTABLE_LEVELS == 4) { > >+ > >+ pgd_t *pgd_src = swapper_pg_dir + KERNEL_PGD_BOUNDARY; > >+ pgd_t *pgd_dst = pgd + KERNEL_PGD_BOUNDARY; > >+ int i; > >+ > >+ for (i = 0; i < KERNEL_PGD_PTRS; i++, pgd_src++, pgd_dst++) { > >+ /* > >+ * This is lock-less, so it can race with PGD updates > >+ * coming from vmalloc() or CPA methods, but it's safe, > >+ * because: > >+ * > >+ * 1) this PGD is not in use yet, we have still not > >+ * scheduled this task. > >+ * 2) we only ever extend PGD entries > >+ * > >+ * So if we observe a non-zero PGD entry we can copy it, > >+ * it won't change from under us. Parallel updates (new > >+ * allocations) will modify our (already visible) PGD: > >+ */ > >+ if (pgd_val(*pgd_src)) > >+ WRITE_ONCE(*pgd_dst, *pgd_src); > > > This should be set_pgd(pgd_dst, *pgd_src) in order for it to work as a Xen > PV guest. Thanks, fixed. > I don't know whether anything would need to be done wrt WRITE_ONCE. Perhaps put > it into native_set_pgd()? So this was just write-tearing paranoia at the raw pgd value copy I did which is an unusual pattern - but I'll use set_pgd(), as it clearly works and has the paravirt callback as well. Thanks, Ingo From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1753343AbbFLUjk (ORCPT ); Fri, 12 Jun 2015 16:39:40 -0400 Received: from mx1.redhat.com ([209.132.183.28]:38829 "EHLO mx1.redhat.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751896AbbFLUjh (ORCPT ); Fri, 12 Jun 2015 16:39:37 -0400 Date: Fri, 12 Jun 2015 22:38:32 +0200 From: Oleg Nesterov To: Ingo Molnar Cc: Linus Torvalds , Waiman Long , Thomas Gleixner , Denys Vlasenko , Borislav Petkov , Andrew Morton , Andy Lutomirski , linux-mml@vger.kernel.org, Linux Kernel Mailing List , Brian Gerst , "H. Peter Anvin" , Peter Zijlstra Subject: Re: [PATCH 07/12] x86/virt/guest/xen: Remove use of pgd_list from the Xen guest code Message-ID: <20150612203832.GA18966@redhat.com> References: <1434031637-9091-1-git-send-email-mingo@kernel.org> <1434031637-9091-8-git-send-email-mingo@kernel.org> <20150612072302.GA7509@gmail.com> <20150612080425.GC8759@gmail.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20150612080425.GC8759@gmail.com> User-Agent: Mutt/1.5.18 (2008-05-17) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On 06/12, Ingo Molnar wrote: > > * Linus Torvalds wrote: > > > So I think the only issue is that ->mm can become NULL when the thread group > > leader dies - a non-NULL mm should always be shared among all threads. > > Indeed, we do that in exit_mm(). Yes, > So we could add tsk->mm_leader or so, No, no, please do not. Just do something like for_each_process(p) { for_each_thread(p, t) { if (t->mm) { do_something(t->mm); break; } } } But either way I don't understand what protects this ->mm. Perhaps this needs find_lock_task_mm(). Oleg. From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752361AbbFLUyk (ORCPT ); Fri, 12 Jun 2015 16:54:40 -0400 Received: from mx1.redhat.com ([209.132.183.28]:35660 "EHLO mx1.redhat.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1750856AbbFLUyh (ORCPT ); Fri, 12 Jun 2015 16:54:37 -0400 Date: Fri, 12 Jun 2015 22:53:31 +0200 From: Oleg Nesterov To: Ingo Molnar Cc: Linus Torvalds , Waiman Long , Thomas Gleixner , Denys Vlasenko , Borislav Petkov , Andrew Morton , Andy Lutomirski , linux-mml@vger.kernel.org, Linux Kernel Mailing List , Brian Gerst , "H. Peter Anvin" , Peter Zijlstra Subject: Re: [PATCH 07/12] x86/virt/guest/xen: Remove use of pgd_list from the Xen guest code Message-ID: <20150612205331.GB18966@redhat.com> References: <1434031637-9091-1-git-send-email-mingo@kernel.org> <1434031637-9091-8-git-send-email-mingo@kernel.org> <20150612072302.GA7509@gmail.com> <20150612080425.GC8759@gmail.com> <20150612203832.GA18966@redhat.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20150612203832.GA18966@redhat.com> User-Agent: Mutt/1.5.18 (2008-05-17) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On 06/12, Oleg Nesterov wrote: > > On 06/12, Ingo Molnar wrote: > > > > * Linus Torvalds wrote: > > > > > So I think the only issue is that ->mm can become NULL when the thread group > > > leader dies - a non-NULL mm should always be shared among all threads. > > > > Indeed, we do that in exit_mm(). > > Yes, > > > So we could add tsk->mm_leader or so, > > No, no, please do not. Just do something like > > for_each_process(p) { > > for_each_thread(p, t) { > if (t->mm) { > do_something(t->mm); > break; > } > } > } > > But either way I don't understand what protects this ->mm. Perhaps this needs > find_lock_task_mm(). And, I don't understand this code, probably this doesn't matter, but. unpin_all() is probably fine, but xen_mm_pin_all() can race with fork() and miss the new child. Is it OK? Oleg. From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1753571AbbFLVN1 (ORCPT ); Fri, 12 Jun 2015 17:13:27 -0400 Received: from mx1.redhat.com ([209.132.183.28]:47635 "EHLO mx1.redhat.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753377AbbFLVNZ (ORCPT ); Fri, 12 Jun 2015 17:13:25 -0400 Date: Fri, 12 Jun 2015 23:12:20 +0200 From: Oleg Nesterov To: Ingo Molnar Cc: linux-kernel@vger.kernel.org, linux-mml@vger.kernel.org, Andy Lutomirski , Andrew Morton , Denys Vlasenko , Brian Gerst , Peter Zijlstra , Borislav Petkov , "H. Peter Anvin" , Linus Torvalds , Thomas Gleixner , Waiman Long Subject: Re: [PATCH 05/12] mm: Introduce arch_pgd_init_late() Message-ID: <20150612211220.GC18966@redhat.com> References: <1434031637-9091-1-git-send-email-mingo@kernel.org> <1434031637-9091-6-git-send-email-mingo@kernel.org> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <1434031637-9091-6-git-send-email-mingo@kernel.org> User-Agent: Mutt/1.5.18 (2008-05-17) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On 06/11, Ingo Molnar wrote: > > @@ -1592,6 +1592,22 @@ static struct task_struct *copy_process(unsigned long clone_flags, > syscall_tracepoint_update(p); > write_unlock_irq(&tasklist_lock); > > + /* > + * If we have a new PGD then initialize it: > + * > + * This method is called after a task has been made visible > + * on the task list already. > + * > + * Architectures that manage per task kernel pagetables > + * might use this callback to initialize them after they > + * are already visible to new updates. > + * > + * NOTE: any user-space parts of the PGD are already initialized > + * and must not be clobbered. > + */ > + if (p->mm != current->mm) > + arch_pgd_init_late(p->mm, p->mm->pgd); > + Cosmetic, but imo if (!(clone_flags & CLONE_VM)) arch_pgd_init_late(...); will look better and more consistent. Oleg. From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1754007AbbFLWXf (ORCPT ); Fri, 12 Jun 2015 18:23:35 -0400 Received: from mx1.redhat.com ([209.132.183.28]:42243 "EHLO mx1.redhat.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751459AbbFLWXe (ORCPT ); Fri, 12 Jun 2015 18:23:34 -0400 Date: Sat, 13 Jun 2015 00:22:29 +0200 From: Oleg Nesterov To: Ingo Molnar Cc: linux-kernel@vger.kernel.org, linux-mml@vger.kernel.org, Andy Lutomirski , Andrew Morton , Denys Vlasenko , Brian Gerst , Peter Zijlstra , Borislav Petkov , "H. Peter Anvin" , Linus Torvalds , Thomas Gleixner , Waiman Long Subject: Re: [PATCH 02/12] x86/mm/hotplug: Remove pgd_list use from the memory hotplug code Message-ID: <20150612222229.GA23071@redhat.com> References: <1434031637-9091-1-git-send-email-mingo@kernel.org> <1434031637-9091-3-git-send-email-mingo@kernel.org> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <1434031637-9091-3-git-send-email-mingo@kernel.org> User-Agent: Mutt/1.5.18 (2008-05-17) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On 06/11, Ingo Molnar wrote: > > void sync_global_pgds(unsigned long start, unsigned long end, int removed) > { > @@ -169,29 +169,33 @@ void sync_global_pgds(unsigned long start, unsigned long end, int removed) > > for (address = start; address <= end; address += PGDIR_SIZE) { > const pgd_t *pgd_ref = pgd_offset_k(address); > - struct page *page; > + struct task_struct *g, *p; > > /* > - * When it is called after memory hot remove, pgd_none() > - * returns true. In this case (removed == 1), we must clear > - * the PGD entries in the local PGD level page. > + * When this function is called after memory hot remove, > + * pgd_none() already returns true, but only the reference > + * kernel PGD has been cleared, not the process PGDs. > + * > + * So clear the affected entries in every process PGD as well: > */ > if (pgd_none(*pgd_ref) && !removed) > continue; > > spin_lock(&pgd_lock); > - list_for_each_entry(page, &pgd_list, lru) { > - pgd_t *pgd; > + > + for_each_process_thread(g, p) { Well, this looks obvously unsafe without rcu_read_lock() at least. The usage of ->mm doesn't look safe too but this is fixeable, see my previous reply to 7/12. And probably I am totally confused, but it seems that 06/12 should come before this patch? Otherwise, why we can't race with fork() and miss the new process? Oleg. From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1754833AbbFLW3i (ORCPT ); Fri, 12 Jun 2015 18:29:38 -0400 Received: from mx1.redhat.com ([209.132.183.28]:35497 "EHLO mx1.redhat.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1750814AbbFLW3f (ORCPT ); Fri, 12 Jun 2015 18:29:35 -0400 Date: Sat, 13 Jun 2015 00:28:30 +0200 From: Oleg Nesterov To: Ingo Molnar Cc: linux-kernel@vger.kernel.org, linux-mml@vger.kernel.org, Andy Lutomirski , Andrew Morton , Denys Vlasenko , Brian Gerst , Peter Zijlstra , Borislav Petkov , "H. Peter Anvin" , Linus Torvalds , Thomas Gleixner , Waiman Long Subject: Re: [PATCH 04/12] x86/mm/hotplug: Simplify sync_global_pgds() Message-ID: <20150612222830.GA23855@redhat.com> References: <1434031637-9091-1-git-send-email-mingo@kernel.org> <1434031637-9091-5-git-send-email-mingo@kernel.org> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <1434031637-9091-5-git-send-email-mingo@kernel.org> User-Agent: Mutt/1.5.18 (2008-05-17) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Yes I guess I am totally confused ;) On 06/11, Ingo Molnar wrote: > > @@ -178,7 +178,7 @@ void sync_global_pgds(unsigned long start, unsigned long end, int removed) > * > * So clear the affected entries in every process PGD as well: > */ > - if (pgd_none(*pgd_ref) && !removed) > + if (pgd_none(*pgd_ref)) But doesn't this change invalidate the comment above? Oleg. From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1754583AbbFLWbf (ORCPT ); Fri, 12 Jun 2015 18:31:35 -0400 Received: from userp1040.oracle.com ([156.151.31.81]:40774 "EHLO userp1040.oracle.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752019AbbFLWbd (ORCPT ); Fri, 12 Jun 2015 18:31:33 -0400 Message-ID: <557B5D7E.8080607@oracle.com> Date: Fri, 12 Jun 2015 18:30:22 -0400 From: Boris Ostrovsky User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Thunderbird/31.7.0 MIME-Version: 1.0 To: Oleg Nesterov , Ingo Molnar CC: Linus Torvalds , Waiman Long , Thomas Gleixner , Denys Vlasenko , Borislav Petkov , Andrew Morton , Andy Lutomirski , linux-mml@vger.kernel.org, Linux Kernel Mailing List , Brian Gerst , "H. Peter Anvin" , Peter Zijlstra Subject: Re: [PATCH 07/12] x86/virt/guest/xen: Remove use of pgd_list from the Xen guest code References: <1434031637-9091-1-git-send-email-mingo@kernel.org> <1434031637-9091-8-git-send-email-mingo@kernel.org> <20150612072302.GA7509@gmail.com> <20150612080425.GC8759@gmail.com> <20150612203832.GA18966@redhat.com> <20150612205331.GB18966@redhat.com> In-Reply-To: <20150612205331.GB18966@redhat.com> Content-Type: text/plain; charset=windows-1252; format=flowed Content-Transfer-Encoding: 7bit X-Source-IP: userv0022.oracle.com [156.151.31.74] Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On 06/12/2015 04:53 PM, Oleg Nesterov wrote: > On 06/12, Oleg Nesterov wrote: >> >> On 06/12, Ingo Molnar wrote: >>> >>> * Linus Torvalds wrote: >>> >>>> So I think the only issue is that ->mm can become NULL when the thread group >>>> leader dies - a non-NULL mm should always be shared among all threads. >>> >>> Indeed, we do that in exit_mm(). >> >> Yes, >> >>> So we could add tsk->mm_leader or so, >> >> No, no, please do not. Just do something like >> >> for_each_process(p) { >> >> for_each_thread(p, t) { >> if (t->mm) { >> do_something(t->mm); >> break; >> } >> } >> } >> >> But either way I don't understand what protects this ->mm. Perhaps this needs >> find_lock_task_mm(). > > And, I don't understand this code, probably this doesn't matter, but. > > unpin_all() is probably fine, but xen_mm_pin_all() can race with fork() > and miss the new child. Is it OK? Currently xen_mm_pin_all() is only called in the suspend path, out of stop_machine(), so presumably at that time fork is not possible. -boris From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1755362AbbFLWsc (ORCPT ); Fri, 12 Jun 2015 18:48:32 -0400 Received: from g9t5008.houston.hp.com ([15.240.92.66]:54063 "EHLO g9t5008.houston.hp.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753481AbbFLWs3 (ORCPT ); Fri, 12 Jun 2015 18:48:29 -0400 Message-ID: <557B61B7.4080301@hp.com> Date: Fri, 12 Jun 2015 18:48:23 -0400 From: Waiman Long User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:10.0.12) Gecko/20130109 Thunderbird/10.0.12 MIME-Version: 1.0 To: Ingo Molnar CC: linux-kernel@vger.kernel.org, linux-mml@vger.kernel.org, Andy Lutomirski , Andrew Morton , Denys Vlasenko , Brian Gerst , Peter Zijlstra , Borislav Petkov , "H. Peter Anvin" , Linus Torvalds , Oleg Nesterov , Thomas Gleixner Subject: Re: [PATCH 02/12] x86/mm/hotplug: Remove pgd_list use from the memory hotplug code References: <1434031637-9091-1-git-send-email-mingo@kernel.org> <1434031637-9091-3-git-send-email-mingo@kernel.org> In-Reply-To: <1434031637-9091-3-git-send-email-mingo@kernel.org> Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On 06/11/2015 10:07 AM, Ingo Molnar wrote: > The memory hotplug code uses sync_global_pgds() to synchronize updates > to the global (&init_mm) kernel PGD and the task PGDs. It does this > by iterating over the pgd_list - which list closely tracks task > creation/destruction via fork/clone. > > But we want to remove this list, so that it does not have to be > maintained from fork()/exit(), so convert the memory hotplug code > to use the task list to iterate over all pgds in the system. > > Also improve the comments a bit, to make this function easier > to understand. > > Only lightly tested, as I don't have a memory hotplug setup. > > Cc: Andrew Morton > Cc: Andy Lutomirski > Cc: Borislav Petkov > Cc: Brian Gerst > Cc: Denys Vlasenko > Cc: H. Peter Anvin > Cc: Linus Torvalds > Cc: Peter Zijlstra > Cc: Thomas Gleixner > Cc: Waiman Long > Signed-off-by: Ingo Molnar > --- > arch/x86/mm/init_64.c | 30 +++++++++++++++++------------- > 1 file changed, 17 insertions(+), 13 deletions(-) > > diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c > index 3fba623e3ba5..1921acbd49fd 100644 > --- a/arch/x86/mm/init_64.c > +++ b/arch/x86/mm/init_64.c > @@ -160,8 +160,8 @@ static int __init nonx32_setup(char *str) > __setup("noexec32=", nonx32_setup); > > /* > - * When memory was added/removed make sure all the processes MM have > - * suitable PGD entries in the local PGD level page. > + * When memory was added/removed make sure all the process MMs have > + * matching PGD entries in the local PGD level page as well. > */ > void sync_global_pgds(unsigned long start, unsigned long end, int removed) > { > @@ -169,29 +169,33 @@ void sync_global_pgds(unsigned long start, unsigned long end, int removed) > > for (address = start; address<= end; address += PGDIR_SIZE) { > const pgd_t *pgd_ref = pgd_offset_k(address); > - struct page *page; > + struct task_struct *g, *p; > > /* > - * When it is called after memory hot remove, pgd_none() > - * returns true. In this case (removed == 1), we must clear > - * the PGD entries in the local PGD level page. > + * When this function is called after memory hot remove, > + * pgd_none() already returns true, but only the reference > + * kernel PGD has been cleared, not the process PGDs. > + * > + * So clear the affected entries in every process PGD as well: > */ > if (pgd_none(*pgd_ref)&& !removed) > continue; > > spin_lock(&pgd_lock); > - list_for_each_entry(page,&pgd_list, lru) { > - pgd_t *pgd; > + > + for_each_process_thread(g, p) { > + pgd_t *pgd = p->mm->pgd; > spinlock_t *pgt_lock; > > - pgd = (pgd_t *)page_address(page) + pgd_index(address); > - /* the pgt_lock only for Xen */ > - pgt_lock =&pgd_page_get_mm(page)->page_table_lock; > + if (!p->mm) > + continue; pgd was initialized to p->mm->pgd before the "p->mm" check is done. Shouldn't the initialization be moved after that. Cheers, Longman From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1754984AbbFLWvJ (ORCPT ); Fri, 12 Jun 2015 18:51:09 -0400 Received: from mx1.redhat.com ([209.132.183.28]:59977 "EHLO mx1.redhat.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753432AbbFLWvE (ORCPT ); Fri, 12 Jun 2015 18:51:04 -0400 Date: Sat, 13 Jun 2015 00:50:00 +0200 From: Oleg Nesterov To: Ingo Molnar Cc: linux-kernel@vger.kernel.org, linux-mml@vger.kernel.org, Andy Lutomirski , Andrew Morton , Denys Vlasenko , Brian Gerst , Peter Zijlstra , Borislav Petkov , "H. Peter Anvin" , Linus Torvalds , Thomas Gleixner , Waiman Long Subject: Re: [PATCH 06/12] x86/mm: Enable and use the arch_pgd_init_late() method Message-ID: <20150612225000.GA24699@redhat.com> References: <1434031637-9091-1-git-send-email-mingo@kernel.org> <1434031637-9091-7-git-send-email-mingo@kernel.org> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <1434031637-9091-7-git-send-email-mingo@kernel.org> User-Agent: Mutt/1.5.18 (2008-05-17) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On 06/11, Ingo Molnar wrote: > > +void arch_pgd_init_late(struct mm_struct *mm, pgd_t *pgd) > +{ > + /* > + * This is called after a new MM has been made visible > + * in fork() or exec(). > + * > + * This barrier makes sure the MM is visible to new RCU > + * walkers before we initialize it, so that we don't miss > + * updates: > + */ > + smp_wmb(); I can't understand the comment and the barrier... Afaics, we need to ensure that: > + if (pgd_val(*pgd_src)) > + WRITE_ONCE(*pgd_dst, *pgd_src); either we notice the recent update of this PGD, or (say) the subsequent sync_global_pgds() can miss the child. How the write barrier can help? Oleg. From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1754934AbbFLW6R (ORCPT ); Fri, 12 Jun 2015 18:58:17 -0400 Received: from mx1.redhat.com ([209.132.183.28]:47724 "EHLO mx1.redhat.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751344AbbFLW6Q (ORCPT ); Fri, 12 Jun 2015 18:58:16 -0400 Date: Sat, 13 Jun 2015 00:57:11 +0200 From: Oleg Nesterov To: Ingo Molnar Cc: linux-kernel@vger.kernel.org, linux-mml@vger.kernel.org, Andy Lutomirski , Andrew Morton , Denys Vlasenko , Brian Gerst , Peter Zijlstra , Borislav Petkov , "H. Peter Anvin" , Linus Torvalds , Thomas Gleixner , Waiman Long Subject: Re: [PATCH 06/12] x86/mm: Enable and use the arch_pgd_init_late() method Message-ID: <20150612225711.GB24699@redhat.com> References: <1434031637-9091-1-git-send-email-mingo@kernel.org> <1434031637-9091-7-git-send-email-mingo@kernel.org> <20150612225000.GA24699@redhat.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20150612225000.GA24699@redhat.com> User-Agent: Mutt/1.5.18 (2008-05-17) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On 06/13, Oleg Nesterov wrote: > > Afaics, we need to ensure that: > > > + if (pgd_val(*pgd_src)) > > + WRITE_ONCE(*pgd_dst, *pgd_src); > > either we notice the recent update of this PGD, or (say) the subsequent > sync_global_pgds() can miss the child. and perhaps !pgd_none(pgd_src) will look better. Oleg. From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1755272AbbFLXQZ (ORCPT ); Fri, 12 Jun 2015 19:16:25 -0400 Received: from mx1.redhat.com ([209.132.183.28]:51919 "EHLO mx1.redhat.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1755210AbbFLXQX (ORCPT ); Fri, 12 Jun 2015 19:16:23 -0400 Date: Sat, 13 Jun 2015 01:15:18 +0200 From: Oleg Nesterov To: Ingo Molnar Cc: linux-kernel@vger.kernel.org, linux-mml@vger.kernel.org, Andy Lutomirski , Andrew Morton , Denys Vlasenko , Brian Gerst , Peter Zijlstra , Borislav Petkov , "H. Peter Anvin" , Linus Torvalds , Thomas Gleixner , Waiman Long Subject: Re: [PATCH 02/12] x86/mm/hotplug: Remove pgd_list use from the memory hotplug code Message-ID: <20150612231518.GC24699@redhat.com> References: <1434031637-9091-1-git-send-email-mingo@kernel.org> <1434031637-9091-3-git-send-email-mingo@kernel.org> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <1434031637-9091-3-git-send-email-mingo@kernel.org> User-Agent: Mutt/1.5.18 (2008-05-17) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On 06/11, Ingo Molnar wrote: > > void sync_global_pgds(unsigned long start, unsigned long end, int removed) > { > @@ -169,29 +169,33 @@ void sync_global_pgds(unsigned long start, unsigned long end, int removed) > > for (address = start; address <= end; address += PGDIR_SIZE) { > const pgd_t *pgd_ref = pgd_offset_k(address); > - struct page *page; > + struct task_struct *g, *p; > > /* > - * When it is called after memory hot remove, pgd_none() > - * returns true. In this case (removed == 1), we must clear > - * the PGD entries in the local PGD level page. > + * When this function is called after memory hot remove, > + * pgd_none() already returns true, but only the reference > + * kernel PGD has been cleared, not the process PGDs. > + * > + * So clear the affected entries in every process PGD as well: > */ > if (pgd_none(*pgd_ref) && !removed) > continue; > > spin_lock(&pgd_lock); > - list_for_each_entry(page, &pgd_list, lru) { > - pgd_t *pgd; > + > + for_each_process_thread(g, p) { > + pgd_t *pgd = p->mm->pgd; and we use the same pgd later: set_pgd(pgd, *pgd_ref). looks like this should be pgd_offset(p->mm, address) ? And obviously you need to check ->mm != NULL first? And perhaps it makes sense to swap "for (address)" and for_each_process() loops... Oleg. From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1755124AbbFLXh1 (ORCPT ); Fri, 12 Jun 2015 19:37:27 -0400 Received: from mx1.redhat.com ([209.132.183.28]:41807 "EHLO mx1.redhat.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751590AbbFLXh0 (ORCPT ); Fri, 12 Jun 2015 19:37:26 -0400 Date: Sat, 13 Jun 2015 01:36:21 +0200 From: Oleg Nesterov To: Boris Ostrovsky Cc: Ingo Molnar , Linus Torvalds , Waiman Long , Thomas Gleixner , Denys Vlasenko , Borislav Petkov , Andrew Morton , Andy Lutomirski , linux-mml@vger.kernel.org, Linux Kernel Mailing List , Brian Gerst , "H. Peter Anvin" , Peter Zijlstra Subject: Re: [PATCH 07/12] x86/virt/guest/xen: Remove use of pgd_list from the Xen guest code Message-ID: <20150612233620.GA26205@redhat.com> References: <1434031637-9091-1-git-send-email-mingo@kernel.org> <1434031637-9091-8-git-send-email-mingo@kernel.org> <20150612072302.GA7509@gmail.com> <20150612080425.GC8759@gmail.com> <20150612203832.GA18966@redhat.com> <20150612205331.GB18966@redhat.com> <557B5D7E.8080607@oracle.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <557B5D7E.8080607@oracle.com> User-Agent: Mutt/1.5.18 (2008-05-17) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On 06/12, Boris Ostrovsky wrote: > > On 06/12/2015 04:53 PM, Oleg Nesterov wrote: >>> >>> for_each_process(p) { >>> >>> for_each_thread(p, t) { >>> if (t->mm) { >>> do_something(t->mm); >>> break; >>> } >>> } >>> } >>> >>> But either way I don't understand what protects this ->mm. Perhaps this needs >>> find_lock_task_mm(). >> >> And, I don't understand this code, probably this doesn't matter, but. >> >> unpin_all() is probably fine, but xen_mm_pin_all() can race with fork() >> and miss the new child. Is it OK? > > > Currently xen_mm_pin_all() is only called in the suspend path, out of > stop_machine(), so presumably at that time fork is not possible. OK, thanks, this also means that this code shouldn't worry about ->mm, it should be stable. But for_each_process() in sync_global_pgds() should, afaics. Oleg. From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1751446AbbFMGrS (ORCPT ); Sat, 13 Jun 2015 02:47:18 -0400 Received: from mail-wi0-f175.google.com ([209.85.212.175]:34432 "EHLO mail-wi0-f175.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1750765AbbFMGrK (ORCPT ); Sat, 13 Jun 2015 02:47:10 -0400 Date: Sat, 13 Jun 2015 08:47:05 +0200 From: Ingo Molnar To: Oleg Nesterov Cc: linux-kernel@vger.kernel.org, linux-mml@vger.kernel.org, Andy Lutomirski , Andrew Morton , Denys Vlasenko , Brian Gerst , Peter Zijlstra , Borislav Petkov , "H. Peter Anvin" , Linus Torvalds , Thomas Gleixner , Waiman Long Subject: Re: [PATCH 06/12] x86/mm: Enable and use the arch_pgd_init_late() method Message-ID: <20150613064705.GA13835@gmail.com> References: <1434031637-9091-1-git-send-email-mingo@kernel.org> <1434031637-9091-7-git-send-email-mingo@kernel.org> <20150612225000.GA24699@redhat.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20150612225000.GA24699@redhat.com> User-Agent: Mutt/1.5.23 (2014-03-12) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org * Oleg Nesterov wrote: > On 06/11, Ingo Molnar wrote: > > > > +void arch_pgd_init_late(struct mm_struct *mm, pgd_t *pgd) > > +{ > > + /* > > + * This is called after a new MM has been made visible > > + * in fork() or exec(). > > + * > > + * This barrier makes sure the MM is visible to new RCU > > + * walkers before we initialize it, so that we don't miss > > + * updates: > > + */ > > + smp_wmb(); > > I can't understand the comment and the barrier... > > Afaics, we need to ensure that: > > > + if (pgd_val(*pgd_src)) > > + WRITE_ONCE(*pgd_dst, *pgd_src); > > either we notice the recent update of this PGD, or (say) the subsequent > sync_global_pgds() can miss the child. > > How the write barrier can help? So the real thing this pairs with is the earlier: tsk->mm = mm; plus the linking of the new task in the task list. _that_ write must become visible to others before we do the (conditional) copy ourselves. Granted, it happens quite a bit earlier, and the task linking's use of locking is a natural barrier - but since this is lockless I didn't want to leave a silent assumption in. Perhaps remove the barrier and just leave a comment in that describes the assumption on task-linking being a full barrier? Thanks, Ingo From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1755259AbbFMGvm (ORCPT ); Sat, 13 Jun 2015 02:51:42 -0400 Received: from mail-wi0-f174.google.com ([209.85.212.174]:36396 "EHLO mail-wi0-f174.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753466AbbFMGtg (ORCPT ); Sat, 13 Jun 2015 02:49:36 -0400 Date: Sat, 13 Jun 2015 08:49:31 +0200 From: Ingo Molnar To: Oleg Nesterov Cc: linux-kernel@vger.kernel.org, linux-mml@vger.kernel.org, Andy Lutomirski , Andrew Morton , Denys Vlasenko , Brian Gerst , Peter Zijlstra , Borislav Petkov , "H. Peter Anvin" , Linus Torvalds , Thomas Gleixner , Waiman Long Subject: Re: [PATCH 06/12] x86/mm: Enable and use the arch_pgd_init_late() method Message-ID: <20150613064930.GB13835@gmail.com> References: <1434031637-9091-1-git-send-email-mingo@kernel.org> <1434031637-9091-7-git-send-email-mingo@kernel.org> <20150612225000.GA24699@redhat.com> <20150612225711.GB24699@redhat.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20150612225711.GB24699@redhat.com> User-Agent: Mutt/1.5.23 (2014-03-12) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org * Oleg Nesterov wrote: > On 06/13, Oleg Nesterov wrote: > > > > Afaics, we need to ensure that: > > > > > + if (pgd_val(*pgd_src)) > > > + WRITE_ONCE(*pgd_dst, *pgd_src); > > > > either we notice the recent update of this PGD, or (say) the subsequent > > sync_global_pgds() can miss the child. > > and perhaps !pgd_none(pgd_src) will look better. Indeed - fixed. Thanks, Ingo From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1751914AbbFMGxG (ORCPT ); Sat, 13 Jun 2015 02:53:06 -0400 Received: from mail-wg0-f42.google.com ([74.125.82.42]:34499 "EHLO mail-wg0-f42.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751005AbbFMGxA (ORCPT ); Sat, 13 Jun 2015 02:53:00 -0400 Date: Sat, 13 Jun 2015 08:52:55 +0200 From: Ingo Molnar To: Oleg Nesterov Cc: linux-kernel@vger.kernel.org, linux-mml@vger.kernel.org, Andy Lutomirski , Andrew Morton , Denys Vlasenko , Brian Gerst , Peter Zijlstra , Borislav Petkov , "H. Peter Anvin" , Linus Torvalds , Thomas Gleixner , Waiman Long Subject: Re: [PATCH 06/12] x86/mm: Enable and use the arch_pgd_init_late() method Message-ID: <20150613065255.GA16018@gmail.com> References: <1434031637-9091-1-git-send-email-mingo@kernel.org> <1434031637-9091-7-git-send-email-mingo@kernel.org> <20150612225000.GA24699@redhat.com> <20150613064705.GA13835@gmail.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20150613064705.GA13835@gmail.com> User-Agent: Mutt/1.5.23 (2014-03-12) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org * Ingo Molnar wrote: > * Oleg Nesterov wrote: > > > On 06/11, Ingo Molnar wrote: > > > > > > +void arch_pgd_init_late(struct mm_struct *mm, pgd_t *pgd) > > > +{ > > > + /* > > > + * This is called after a new MM has been made visible > > > + * in fork() or exec(). > > > + * > > > + * This barrier makes sure the MM is visible to new RCU > > > + * walkers before we initialize it, so that we don't miss > > > + * updates: > > > + */ > > > + smp_wmb(); > > > > I can't understand the comment and the barrier... > > > > Afaics, we need to ensure that: > > > > > + if (pgd_val(*pgd_src)) > > > + WRITE_ONCE(*pgd_dst, *pgd_src); > > > > either we notice the recent update of this PGD, or (say) the subsequent > > sync_global_pgds() can miss the child. > > > > How the write barrier can help? > > So the real thing this pairs with is the earlier: > > tsk->mm = mm; > > plus the linking of the new task in the task list. > > _that_ write must become visible to others before we do the (conditional) copy > ourselves. > > Granted, it happens quite a bit earlier, and the task linking's use of locking > is a natural barrier - but since this is lockless I didn't want to leave a > silent assumption in. > > Perhaps remove the barrier and just leave a comment in that describes the > assumption on task-linking being a full barrier? Ah, there's another detail I forgot. This might handle the fork case, but in exec() we have: tsk->mm = mm; arch_pgd_init_late(mm); and since the task is already linked, here we need the barrier. So how about I improve the comment to: /* * This function is called after a new MM has been made visible * in fork() or exec() via: * * tsk->mm = mm; * * This barrier makes sure the MM is visible to new RCU * walkers before we initialize the pagetables below, so that * we don't miss updates: */ smp_wmb(); and leave the barrier there? Thanks, Ingo From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752021AbbFMGyh (ORCPT ); Sat, 13 Jun 2015 02:54:37 -0400 Received: from mail-wg0-f41.google.com ([74.125.82.41]:35498 "EHLO mail-wg0-f41.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751285AbbFMGy3 (ORCPT ); Sat, 13 Jun 2015 02:54:29 -0400 Date: Sat, 13 Jun 2015 08:54:24 +0200 From: Ingo Molnar To: Oleg Nesterov Cc: linux-kernel@vger.kernel.org, linux-mml@vger.kernel.org, Andy Lutomirski , Andrew Morton , Denys Vlasenko , Brian Gerst , Peter Zijlstra , Borislav Petkov , "H. Peter Anvin" , Linus Torvalds , Thomas Gleixner , Waiman Long Subject: Re: [PATCH 05/12] mm: Introduce arch_pgd_init_late() Message-ID: <20150613065424.GA16745@gmail.com> References: <1434031637-9091-1-git-send-email-mingo@kernel.org> <1434031637-9091-6-git-send-email-mingo@kernel.org> <20150612211220.GC18966@redhat.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20150612211220.GC18966@redhat.com> User-Agent: Mutt/1.5.23 (2014-03-12) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org * Oleg Nesterov wrote: > On 06/11, Ingo Molnar wrote: > > > > @@ -1592,6 +1592,22 @@ static struct task_struct *copy_process(unsigned long clone_flags, > > syscall_tracepoint_update(p); > > write_unlock_irq(&tasklist_lock); > > > > + /* > > + * If we have a new PGD then initialize it: > > + * > > + * This method is called after a task has been made visible > > + * on the task list already. > > + * > > + * Architectures that manage per task kernel pagetables > > + * might use this callback to initialize them after they > > + * are already visible to new updates. > > + * > > + * NOTE: any user-space parts of the PGD are already initialized > > + * and must not be clobbered. > > + */ > > + if (p->mm != current->mm) > > + arch_pgd_init_late(p->mm, p->mm->pgd); > > + > > Cosmetic, but imo > > if (!(clone_flags & CLONE_VM)) > arch_pgd_init_late(...); > > will look better and more consistent. Indeed - fixed. Thanks, Ingo From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1751937AbbFMH07 (ORCPT ); Sat, 13 Jun 2015 03:26:59 -0400 Received: from mail-wi0-f169.google.com ([209.85.212.169]:37891 "EHLO mail-wi0-f169.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1750963AbbFMH0v (ORCPT ); Sat, 13 Jun 2015 03:26:51 -0400 Date: Sat, 13 Jun 2015 09:26:35 +0200 From: Ingo Molnar To: Oleg Nesterov Cc: Linus Torvalds , Waiman Long , Thomas Gleixner , Denys Vlasenko , Borislav Petkov , Andrew Morton , Andy Lutomirski , linux-mml@vger.kernel.org, Linux Kernel Mailing List , Brian Gerst , "H. Peter Anvin" , Peter Zijlstra Subject: Re: [PATCH 07/12] x86/virt/guest/xen: Remove use of pgd_list from the Xen guest code Message-ID: <20150613072635.GA30388@gmail.com> References: <1434031637-9091-1-git-send-email-mingo@kernel.org> <1434031637-9091-8-git-send-email-mingo@kernel.org> <20150612072302.GA7509@gmail.com> <20150612080425.GC8759@gmail.com> <20150612203832.GA18966@redhat.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20150612203832.GA18966@redhat.com> User-Agent: Mutt/1.5.23 (2014-03-12) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org * Oleg Nesterov wrote: > > So we could add tsk->mm_leader or so, > > No, no, please do not. Just do something like > > for_each_process(p) { > > for_each_thread(p, t) { So far that's what the for_each_process_thread() iterations I added do, right? > if (t->mm) { > do_something(t->mm); > break; > } > } > } > > But either way I don't understand what protects this ->mm. Perhaps this needs > find_lock_task_mm(). That's indeed a bug: I'll add task_lock()/unlock() before looking at ->mm. find_lock_task_mm() is not needed IMHO: we have a stable reference to 't', as a task can only go away via RCU expiry, and all the iterations I added are (supposed to) be RCU safe. Thanks, Ingo From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1751997AbbFMHrJ (ORCPT ); Sat, 13 Jun 2015 03:47:09 -0400 Received: from mail-wg0-f42.google.com ([74.125.82.42]:33411 "EHLO mail-wg0-f42.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751127AbbFMHrB (ORCPT ); Sat, 13 Jun 2015 03:47:01 -0400 Date: Sat, 13 Jun 2015 09:46:55 +0200 From: Ingo Molnar To: Waiman Long Cc: linux-kernel@vger.kernel.org, linux-mml@vger.kernel.org, Andy Lutomirski , Andrew Morton , Denys Vlasenko , Brian Gerst , Peter Zijlstra , Borislav Petkov , "H. Peter Anvin" , Linus Torvalds , Oleg Nesterov , Thomas Gleixner Subject: Re: [PATCH 02/12] x86/mm/hotplug: Remove pgd_list use from the memory hotplug code Message-ID: <20150613074655.GB30388@gmail.com> References: <1434031637-9091-1-git-send-email-mingo@kernel.org> <1434031637-9091-3-git-send-email-mingo@kernel.org> <557B61B7.4080301@hp.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <557B61B7.4080301@hp.com> User-Agent: Mutt/1.5.23 (2014-03-12) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org * Waiman Long wrote: > >@@ -169,29 +169,33 @@ void sync_global_pgds(unsigned long start, unsigned long end, int removed) > > > > for (address = start; address<= end; address += PGDIR_SIZE) { > > const pgd_t *pgd_ref = pgd_offset_k(address); > >- struct page *page; > >+ struct task_struct *g, *p; > > > > /* > >- * When it is called after memory hot remove, pgd_none() > >- * returns true. In this case (removed == 1), we must clear > >- * the PGD entries in the local PGD level page. > >+ * When this function is called after memory hot remove, > >+ * pgd_none() already returns true, but only the reference > >+ * kernel PGD has been cleared, not the process PGDs. > >+ * > >+ * So clear the affected entries in every process PGD as well: > > */ > > if (pgd_none(*pgd_ref)&& !removed) > > continue; > > > > spin_lock(&pgd_lock); > >- list_for_each_entry(page,&pgd_list, lru) { > >- pgd_t *pgd; > >+ > >+ for_each_process_thread(g, p) { > >+ pgd_t *pgd = p->mm->pgd; > > spinlock_t *pgt_lock; > > > >- pgd = (pgd_t *)page_address(page) + pgd_index(address); > >- /* the pgt_lock only for Xen */ > >- pgt_lock =&pgd_page_get_mm(page)->page_table_lock; > >+ if (!p->mm) > >+ continue; > > pgd was initialized to p->mm->pgd before the "p->mm" check is done. > Shouldn't the initialization be moved after that. Yes, already found this bug in testing and fixed it - will send out a new series with all the feedback so far addressed. Thanks, Ingo From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752129AbbFMHs6 (ORCPT ); Sat, 13 Jun 2015 03:48:58 -0400 Received: from mail-wi0-f179.google.com ([209.85.212.179]:34094 "EHLO mail-wi0-f179.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751230AbbFMHsu (ORCPT ); Sat, 13 Jun 2015 03:48:50 -0400 Date: Sat, 13 Jun 2015 09:48:44 +0200 From: Ingo Molnar To: Oleg Nesterov Cc: linux-kernel@vger.kernel.org, linux-mml@vger.kernel.org, Andy Lutomirski , Andrew Morton , Denys Vlasenko , Brian Gerst , Peter Zijlstra , Borislav Petkov , "H. Peter Anvin" , Linus Torvalds , Thomas Gleixner , Waiman Long Subject: Re: [PATCH 02/12] x86/mm/hotplug: Remove pgd_list use from the memory hotplug code Message-ID: <20150613074844.GC30388@gmail.com> References: <1434031637-9091-1-git-send-email-mingo@kernel.org> <1434031637-9091-3-git-send-email-mingo@kernel.org> <20150612231518.GC24699@redhat.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20150612231518.GC24699@redhat.com> User-Agent: Mutt/1.5.23 (2014-03-12) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org * Oleg Nesterov wrote: > On 06/11, Ingo Molnar wrote: > > > > void sync_global_pgds(unsigned long start, unsigned long end, int removed) > > { > > @@ -169,29 +169,33 @@ void sync_global_pgds(unsigned long start, unsigned long end, int removed) > > > > for (address = start; address <= end; address += PGDIR_SIZE) { > > const pgd_t *pgd_ref = pgd_offset_k(address); > > - struct page *page; > > + struct task_struct *g, *p; > > > > /* > > - * When it is called after memory hot remove, pgd_none() > > - * returns true. In this case (removed == 1), we must clear > > - * the PGD entries in the local PGD level page. > > + * When this function is called after memory hot remove, > > + * pgd_none() already returns true, but only the reference > > + * kernel PGD has been cleared, not the process PGDs. > > + * > > + * So clear the affected entries in every process PGD as well: > > */ > > if (pgd_none(*pgd_ref) && !removed) > > continue; > > > > spin_lock(&pgd_lock); > > - list_for_each_entry(page, &pgd_list, lru) { > > - pgd_t *pgd; > > + > > + for_each_process_thread(g, p) { > > + pgd_t *pgd = p->mm->pgd; > > and we use the same pgd later: set_pgd(pgd, *pgd_ref). > looks like this should be pgd_offset(p->mm, address) ? > > And obviously you need to check ->mm != NULL first? Yes. > And perhaps it makes sense to swap "for (address)" and for_each_process() > loops... Yes, but I'd flip around the logic in a later, separate patch. Note that PGDIR_SIZE is 512 GB here, so the outer loop will most likely execute at most twice in practice. Thanks, Ingo From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752240AbbFMHvN (ORCPT ); Sat, 13 Jun 2015 03:51:13 -0400 Received: from mail-wi0-f182.google.com ([209.85.212.182]:36501 "EHLO mail-wi0-f182.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751127AbbFMHvH (ORCPT ); Sat, 13 Jun 2015 03:51:07 -0400 Date: Sat, 13 Jun 2015 09:51:02 +0200 From: Ingo Molnar To: Oleg Nesterov Cc: linux-kernel@vger.kernel.org, linux-mml@vger.kernel.org, Andy Lutomirski , Andrew Morton , Denys Vlasenko , Brian Gerst , Peter Zijlstra , Borislav Petkov , "H. Peter Anvin" , Linus Torvalds , Thomas Gleixner , Waiman Long Subject: Re: [PATCH 04/12] x86/mm/hotplug: Simplify sync_global_pgds() Message-ID: <20150613075102.GD30388@gmail.com> References: <1434031637-9091-1-git-send-email-mingo@kernel.org> <1434031637-9091-5-git-send-email-mingo@kernel.org> <20150612222830.GA23855@redhat.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20150612222830.GA23855@redhat.com> User-Agent: Mutt/1.5.23 (2014-03-12) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org * Oleg Nesterov wrote: > Yes I guess I am totally confused ;) > > On 06/11, Ingo Molnar wrote: > > > > @@ -178,7 +178,7 @@ void sync_global_pgds(unsigned long start, unsigned long end, int removed) > > * > > * So clear the affected entries in every process PGD as well: > > */ > > - if (pgd_none(*pgd_ref) && !removed) > > + if (pgd_none(*pgd_ref)) > > But doesn't this change invalidate the comment above? Indeed - I fixed the comment to now say: /* Only sync (potentially) newly added PGD entries: */ if (pgd_none(*pgd_ref)) continue; Thanks, Ingo From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1753421AbbFMJuL (ORCPT ); Sat, 13 Jun 2015 05:50:11 -0400 Received: from mail-wi0-f172.google.com ([209.85.212.172]:34666 "EHLO mail-wi0-f172.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752002AbbFMJtj (ORCPT ); Sat, 13 Jun 2015 05:49:39 -0400 From: Ingo Molnar To: linux-kernel@vger.kernel.org Cc: linux-mm@kvack.org, Andy Lutomirski , Andrew Morton , Denys Vlasenko , Brian Gerst , Peter Zijlstra , Borislav Petkov , "H. Peter Anvin" , Linus Torvalds , Oleg Nesterov , Thomas Gleixner , Waiman Long Subject: [PATCH 04/12] x86/mm/hotplug: Simplify sync_global_pgds() Date: Sat, 13 Jun 2015 11:49:07 +0200 Message-Id: <1434188955-31397-5-git-send-email-mingo@kernel.org> X-Mailer: git-send-email 2.1.4 In-Reply-To: <1434188955-31397-1-git-send-email-mingo@kernel.org> References: <1434188955-31397-1-git-send-email-mingo@kernel.org> Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Now that the memory hotplug code does not remove PGD entries anymore, the only users of sync_global_pgds() use it after extending the PGD. So remove the 'removed' parameter and simplify the call sites. Cc: Andrew Morton Cc: Andy Lutomirski Cc: Borislav Petkov Cc: Brian Gerst Cc: Denys Vlasenko Cc: H. Peter Anvin Cc: Linus Torvalds Cc: Oleg Nesterov Cc: Peter Zijlstra Cc: Rik van Riel Cc: Thomas Gleixner Cc: Waiman Long Cc: linux-mm@kvack.org Signed-off-by: Ingo Molnar --- arch/x86/include/asm/pgtable_64.h | 3 +-- arch/x86/mm/fault.c | 2 +- arch/x86/mm/init_64.c | 27 ++++++++------------------- 3 files changed, 10 insertions(+), 22 deletions(-) diff --git a/arch/x86/include/asm/pgtable_64.h b/arch/x86/include/asm/pgtable_64.h index 2ee781114d34..f405fc3bb719 100644 --- a/arch/x86/include/asm/pgtable_64.h +++ b/arch/x86/include/asm/pgtable_64.h @@ -116,8 +116,7 @@ static inline void native_pgd_clear(pgd_t *pgd) native_set_pgd(pgd, native_make_pgd(0)); } -extern void sync_global_pgds(unsigned long start, unsigned long end, - int removed); +extern void sync_global_pgds(unsigned long start, unsigned long end); /* * Conversion functions: convert a page and protection to a page entry, diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c index 181c53bac3a7..50342825f221 100644 --- a/arch/x86/mm/fault.c +++ b/arch/x86/mm/fault.c @@ -349,7 +349,7 @@ static void dump_pagetable(unsigned long address) void vmalloc_sync_all(void) { - sync_global_pgds(VMALLOC_START & PGDIR_MASK, VMALLOC_END, 0); + sync_global_pgds(VMALLOC_START & PGDIR_MASK, VMALLOC_END); } /* diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c index 7a988dbad240..dcb2f45caf0e 100644 --- a/arch/x86/mm/init_64.c +++ b/arch/x86/mm/init_64.c @@ -160,10 +160,10 @@ static int __init nonx32_setup(char *str) __setup("noexec32=", nonx32_setup); /* - * When memory was added/removed make sure all the process MMs have + * When memory was added make sure all the process MMs have * matching PGD entries in the local PGD level page as well. */ -void sync_global_pgds(unsigned long start, unsigned long end, int removed) +void sync_global_pgds(unsigned long start, unsigned long end) { unsigned long address; @@ -171,14 +171,8 @@ void sync_global_pgds(unsigned long start, unsigned long end, int removed) const pgd_t *pgd_ref = pgd_offset_k(address); struct task_struct *g, *p; - /* - * When this function is called after memory hot remove, - * pgd_none() already returns true, but only the reference - * kernel PGD has been cleared, not the process PGDs. - * - * So clear the affected entries in every process PGD as well: - */ - if (pgd_none(*pgd_ref) && !removed) + /* Only sync (potentially) newly added PGD entries: */ + if (pgd_none(*pgd_ref)) continue; spin_lock(&pgd_lock); /* Implies rcu_read_lock() for the task list iteration: */ @@ -204,13 +198,8 @@ void sync_global_pgds(unsigned long start, unsigned long end, int removed) if (!pgd_none(*pgd_ref) && !pgd_none(*pgd)) BUG_ON(pgd_page_vaddr(*pgd) != pgd_page_vaddr(*pgd_ref)); - if (removed) { - if (pgd_none(*pgd_ref) && !pgd_none(*pgd)) - pgd_clear(pgd); - } else { - if (pgd_none(*pgd)) - set_pgd(pgd, *pgd_ref); - } + if (pgd_none(*pgd)) + set_pgd(pgd, *pgd_ref); spin_unlock(pgt_lock); task_unlock(p); @@ -644,7 +633,7 @@ kernel_physical_mapping_init(unsigned long start, } if (pgd_changed) - sync_global_pgds(addr, end - 1, 0); + sync_global_pgds(addr, end - 1); __flush_tlb_all(); @@ -1284,7 +1273,7 @@ int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node) else err = vmemmap_populate_basepages(start, end, node); if (!err) - sync_global_pgds(start, end - 1, 0); + sync_global_pgds(start, end - 1); return err; } -- 2.1.4 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1754187AbbFMRqm (ORCPT ); Sat, 13 Jun 2015 13:46:42 -0400 Received: from mx1.redhat.com ([209.132.183.28]:51082 "EHLO mx1.redhat.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753937AbbFMRqd (ORCPT ); Sat, 13 Jun 2015 13:46:33 -0400 Date: Sat, 13 Jun 2015 19:45:27 +0200 From: Oleg Nesterov To: Ingo Molnar Cc: linux-kernel@vger.kernel.org, linux-mml@vger.kernel.org, Andy Lutomirski , Andrew Morton , Denys Vlasenko , Brian Gerst , Peter Zijlstra , Borislav Petkov , "H. Peter Anvin" , Linus Torvalds , Thomas Gleixner , Waiman Long Subject: Re: [PATCH 06/12] x86/mm: Enable and use the arch_pgd_init_late() method Message-ID: <20150613174527.GA29379@redhat.com> References: <1434031637-9091-1-git-send-email-mingo@kernel.org> <1434031637-9091-7-git-send-email-mingo@kernel.org> <20150612225000.GA24699@redhat.com> <20150613064705.GA13835@gmail.com> <20150613065255.GA16018@gmail.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20150613065255.GA16018@gmail.com> User-Agent: Mutt/1.5.18 (2008-05-17) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On 06/13, Ingo Molnar wrote: > > * Ingo Molnar wrote: > > > * Oleg Nesterov wrote: > > > > > > > > Afaics, we need to ensure that: > > > > > > > + if (pgd_val(*pgd_src)) > > > > + WRITE_ONCE(*pgd_dst, *pgd_src); > > > > > > either we notice the recent update of this PGD, or (say) the subsequent > > > sync_global_pgds() can miss the child. > > > > > > How the write barrier can help? > > > > So the real thing this pairs with is the earlier: > > > > tsk->mm = mm; > > > > plus the linking of the new task in the task list. > > > > _that_ write must become visible to others before we do the (conditional) copy > > ourselves. Hmm. that write must be visible before we start to _read_ *pgd_src, afaics. > > Granted, it happens quite a bit earlier, and the task linking's use of locking > > is a natural barrier - but since this is lockless I didn't want to leave a > > silent assumption in. I agree, > Ah, there's another detail I forgot. This might handle the fork case, but in > exec() we have: > > tsk->mm = mm; > arch_pgd_init_late(mm); Yes, this too. But wmb() can't help. At least we need the full mb() to serialize the STORE above (or list addition in copy_process) with the LOAD which reads *pgd_src. Plus we need another mb() in sync_global_pgds(), say, before the main for_each_process() loop. it would be nice to make this more simple/clear somehow... Oleg. From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752602AbbFMSCI (ORCPT ); Sat, 13 Jun 2015 14:02:08 -0400 Received: from mx1.redhat.com ([209.132.183.28]:40843 "EHLO mx1.redhat.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1750749AbbFMSCD (ORCPT ); Sat, 13 Jun 2015 14:02:03 -0400 Date: Sat, 13 Jun 2015 20:00:56 +0200 From: Oleg Nesterov To: Ingo Molnar Cc: Linus Torvalds , Waiman Long , Thomas Gleixner , Denys Vlasenko , Borislav Petkov , Andrew Morton , Andy Lutomirski , linux-mml@vger.kernel.org, Linux Kernel Mailing List , Brian Gerst , "H. Peter Anvin" , Peter Zijlstra Subject: Re: [PATCH 07/12] x86/virt/guest/xen: Remove use of pgd_list from the Xen guest code Message-ID: <20150613180056.GB29379@redhat.com> References: <1434031637-9091-1-git-send-email-mingo@kernel.org> <1434031637-9091-8-git-send-email-mingo@kernel.org> <20150612072302.GA7509@gmail.com> <20150612080425.GC8759@gmail.com> <20150612203832.GA18966@redhat.com> <20150613072635.GA30388@gmail.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20150613072635.GA30388@gmail.com> User-Agent: Mutt/1.5.18 (2008-05-17) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On 06/13, Ingo Molnar wrote: > > * Oleg Nesterov wrote: > > > > So we could add tsk->mm_leader or so, > > > > No, no, please do not. Just do something like > > > > for_each_process(p) { > > > > for_each_thread(p, t) { > > So far that's what the for_each_process_thread() iterations I added do, right? Not really, > > if (t->mm) { > > do_something(t->mm); > > break; ^^^^^ Note this "break". We stop the inner loop right after we find a thread "t" with ->mm != NULL. In the likely case t == p (group leader) so the inner loop stops on the 1st iteration. > > But either way I don't understand what protects this ->mm. Perhaps this needs > > find_lock_task_mm(). > > That's indeed a bug: I'll add task_lock()/unlock() before looking at ->mm. Well, in this particular case we are safe. As Boris explained this is called from stop_machine(). But sync_global_pgds() is not safe. > find_lock_task_mm() is not needed IMHO: we have a stable reference to 't', as > task can only go away via RCU expiry, and all the iterations I added are (supposed > to) be RCU safe. Sure, you can do lock/unlock by hand. But find_lock_task_mm() can simplify the code because it checks subthreads if group_leader->mm == NULL. You can simply do rcu_read_lock(); for_each_process(p) { t = find_lock_task_mm(p); if (!t) continue; do_something(t->mm); task_unlock(t); } rcu_read_unlock(); Oleg. From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752161AbbFNIOK (ORCPT ); Sun, 14 Jun 2015 04:14:10 -0400 Received: from mail-wg0-f41.google.com ([74.125.82.41]:36685 "EHLO mail-wg0-f41.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751549AbbFNIN6 (ORCPT ); Sun, 14 Jun 2015 04:13:58 -0400 Date: Sun, 14 Jun 2015 10:13:52 +0200 From: Ingo Molnar To: Oleg Nesterov Cc: linux-kernel@vger.kernel.org, linux-mml@vger.kernel.org, Andy Lutomirski , Andrew Morton , Denys Vlasenko , Brian Gerst , Peter Zijlstra , Borislav Petkov , "H. Peter Anvin" , Linus Torvalds , Thomas Gleixner , Waiman Long Subject: Re: [PATCH 06/12] x86/mm: Enable and use the arch_pgd_init_late() method Message-ID: <20150614081352.GA3446@gmail.com> References: <1434031637-9091-1-git-send-email-mingo@kernel.org> <1434031637-9091-7-git-send-email-mingo@kernel.org> <20150612225000.GA24699@redhat.com> <20150613064705.GA13835@gmail.com> <20150613065255.GA16018@gmail.com> <20150613174527.GA29379@redhat.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20150613174527.GA29379@redhat.com> User-Agent: Mutt/1.5.23 (2014-03-12) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org * Oleg Nesterov wrote: > On 06/13, Ingo Molnar wrote: > > > > * Ingo Molnar wrote: > > > > > * Oleg Nesterov wrote: > > > > > > > > > > > Afaics, we need to ensure that: > > > > > > > > > + if (pgd_val(*pgd_src)) > > > > > + WRITE_ONCE(*pgd_dst, *pgd_src); > > > > > > > > either we notice the recent update of this PGD, or (say) the subsequent > > > > sync_global_pgds() can miss the child. > > > > > > > > How the write barrier can help? > > > > > > So the real thing this pairs with is the earlier: > > > > > > tsk->mm = mm; > > > > > > plus the linking of the new task in the task list. > > > > > > _that_ write must become visible to others before we do the (conditional) copy > > > ourselves. > > Hmm. that write must be visible before we start to _read_ *pgd_src, > afaics. > > > > Granted, it happens quite a bit earlier, and the task linking's use of locking > > > is a natural barrier - but since this is lockless I didn't want to leave a > > > silent assumption in. > > I agree, > > > Ah, there's another detail I forgot. This might handle the fork case, but in > > exec() we have: > > > > tsk->mm = mm; > > arch_pgd_init_late(mm); > > Yes, this too. > > But wmb() can't help. At least we need the full mb() to serialize the > STORE above (or list addition in copy_process) with the LOAD which > reads *pgd_src. True. > Plus we need another mb() in sync_global_pgds(), say, before the main > for_each_process() loop. True. So since we have a spin_lock() there already, I tried to find the right primitive to turn it into a full memory barrier but gave up. Is there a simple primitive for that? Also, since this is x86 specific code we could rely on the fact that spinlock-acquire is a full memory barrier? > > it would be nice to make this more simple/clear somehow... I'm afraid lockless is rarely simple, but I'm open to suggestions ... Thanks, Ingo From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752412AbbFNUz0 (ORCPT ); Sun, 14 Jun 2015 16:55:26 -0400 Received: from mx1.redhat.com ([209.132.183.28]:49689 "EHLO mx1.redhat.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751752AbbFNUzU (ORCPT ); Sun, 14 Jun 2015 16:55:20 -0400 Date: Sun, 14 Jun 2015 22:54:12 +0200 From: Oleg Nesterov To: Ingo Molnar Cc: linux-kernel@vger.kernel.org, linux-mml@vger.kernel.org, Andy Lutomirski , Andrew Morton , Denys Vlasenko , Brian Gerst , Peter Zijlstra , Borislav Petkov , "H. Peter Anvin" , Linus Torvalds , Thomas Gleixner , Waiman Long Subject: Re: [PATCH 06/12] x86/mm: Enable and use the arch_pgd_init_late() method Message-ID: <20150614205412.GC19582@redhat.com> References: <1434031637-9091-1-git-send-email-mingo@kernel.org> <1434031637-9091-7-git-send-email-mingo@kernel.org> <20150612225000.GA24699@redhat.com> <20150613064705.GA13835@gmail.com> <20150613065255.GA16018@gmail.com> <20150613174527.GA29379@redhat.com> <20150614081352.GA3446@gmail.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20150614081352.GA3446@gmail.com> User-Agent: Mutt/1.5.18 (2008-05-17) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On 06/14, Ingo Molnar wrote: > > So since we have a spin_lock() there already, Yeeeees, I thought about task_lock() or pgd_lock too. > Also, since this is x86 specific code we could rely on the fact that > spinlock-acquire is a full memory barrier? we do not really need the full barrier if we rely on spinlock_t, we can rely on acquire+release semantics. Lets forget about exec_mmap(). If we add, say, // or unlock_wait() + barriers task_lock(current->group_leader); task_unlock(current->group_leader); at the start of arch_pgd_init_late() we will fix the problems with fork() even if pgd_none() below can leak into the critical section. We rely on the fact that find_lock_task_mm() does lock/unlock too and always starts with the group leader. If sync_global_pgds() takes this lock first, we must see the change in *PGD after task_unlock(). Actually right after task_lock(). Otherwise, sync_global_pgds() should see the result of list addition if it takes this (the same) ->group_leader->lock_alloc after us. But this is not nice, and exec_mmap() calls arch_pgd_init_late() under task_lock(). So, unless you are going to remove pgd_lock altogether perhaps we can rely on it the same way mb(); spin_unlock_wait(&pgd_lock); rmb(); Avoids the barriers (and comments) on another side, but I can't say I really like this... So I won't argue with 2 mb's on both sides. Oleg. From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1753089AbbFNWMG (ORCPT ); Sun, 14 Jun 2015 18:12:06 -0400 Received: from mx1.redhat.com ([209.132.183.28]:37117 "EHLO mx1.redhat.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751797AbbFNWMC (ORCPT ); Sun, 14 Jun 2015 18:12:02 -0400 Date: Mon, 15 Jun 2015 00:10:55 +0200 From: Oleg Nesterov To: Ingo Molnar Cc: linux-kernel@vger.kernel.org, linux-mml@vger.kernel.org, Andy Lutomirski , Andrew Morton , Denys Vlasenko , Brian Gerst , Peter Zijlstra , Borislav Petkov , "H. Peter Anvin" , Linus Torvalds , Thomas Gleixner , Waiman Long Subject: Re: [PATCH 06/12] x86/mm: Enable and use the arch_pgd_init_late() method Message-ID: <20150614221055.GA32290@redhat.com> References: <1434031637-9091-1-git-send-email-mingo@kernel.org> <1434031637-9091-7-git-send-email-mingo@kernel.org> <20150612225000.GA24699@redhat.com> <20150613064705.GA13835@gmail.com> <20150613065255.GA16018@gmail.com> <20150613174527.GA29379@redhat.com> <20150614081352.GA3446@gmail.com> <20150614205412.GC19582@redhat.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20150614205412.GC19582@redhat.com> User-Agent: Mutt/1.5.18 (2008-05-17) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On 06/14, Oleg Nesterov wrote: > > So, unless you are going to remove pgd_lock altogether perhaps we can > rely on it the same way > > mb(); > spin_unlock_wait(&pgd_lock); > rmb(); > > > Avoids the barriers (and comments) on another side, but I can't say > I really like this... > > > So I won't argue with 2 mb's on both sides. Or we can add // A new child created before can miss the PGD updates, // but we must see that child on the process list read_lock(tasklist_lock); read_unlock(tasklist_lock); // We can miss a new child forked after read_unlock(), // but then its parent must see all PGD updates right // after it does write_unlock(tasklist); for_each_process(p) { before main for_each_process() loop in sync_global_pgds(). As for exec_mmap() we can rely on task_lock(), sync_global_pgds() takes it too. The corner case is when exec changes the leader, so exec_mmap/sync_global_pgds can take different locks. But in this case we can rely on de_thread() (which takes tasklist for write) by the same reason: either sync_global_pgds() will see the new leader, or the new leader must see the updates. Oleg.