All of lore.kernel.org
 help / color / mirror / Atom feed
From: Leonardo Bras <leonardo@linux.ibm.com>
To: linuxppc-dev@lists.ozlabs.org, linux-kernel@vger.kernel.org,
	kvm-ppc@vger.kernel.org, linux-arch@vger.kernel.org,
	linux-mm@kvack.org
Cc: Song Liu <songliubraving@fb.com>, Michal Hocko <mhocko@suse.com>,
	"Peter Zijlstra (Intel)" <peterz@infradead.org>,
	"Dmitry V. Levin" <ldv@altlinux.org>,
	Keith Busch <keith.busch@intel.com>,
	Paul Mackerras <paulus@samba.org>,
	Christoph Lameter <cl@linux.com>, Ira Weiny <ira.weiny@intel.com>,
	Thomas Gleixner <tglx@linutronix.de>,
	Elena Reshetova <elena.reshetova@intel.com>,
	Andrea Arcangeli <aarcange@redhat.com>,
	Santosh Sivaraj <santosh@fossix.org>,
	Davidlohr Bueso <dave@stgolabs.net>,
	"Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>,
	Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>,
	Mike Rapoport <rppt@linux.ibm.com>,
	Jason Gunthorpe <jgg@ziepe.ca>,
	Allison Randal <allison@lohutok.net>,
	Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>,
	Leonardo Bras <leonardo@linux.ibm.com>,
	Alexey Dobriyan <adobriyan@gmail.com>,
	Ingo Molnar <mingo@kernel.org>, Ralph Campbell <rcampbe>
Subject: [PATCH v5 02/11] powerpc/mm: Adds counting method to monitor lockless pgtable walks
Date: Wed,  2 Oct 2019 22:33:16 -0300	[thread overview]
Message-ID: <20191003013325.2614-3-leonardo@linux.ibm.com> (raw)
In-Reply-To: <20191003013325.2614-1-leonardo@linux.ibm.com>

It's necessary to monitor lockless pagetable walks, in order to avoid doing
THP splitting/collapsing during them.

On powerpc, we need to do some lockless pagetable walks from functions
that already have disabled interrupts, specially from real mode with
MSR[EE=0].

In these contexts, disabling/enabling interrupts can be very troubling.

So, this arch-specific implementation features functions with an extra
argument that allows interrupt enable/disable to be skipped:
__begin_lockless_pgtbl_walk() and __end_lockless_pgtbl_walk().

Functions similar to the generic ones are also exported, by calling
the above functions with parameter *able_irq = false.

While there is no config option, the method is disabled and these functions
are only doing what was already needed to lockless pagetable walks
(disabling interrupt). A memory barrier was also added just to make sure
there is no speculative read outside the interrupt disabled area.

Signed-off-by: Leonardo Bras <leonardo@linux.ibm.com>
---
 arch/powerpc/include/asm/book3s/64/pgtable.h |   9 ++
 arch/powerpc/mm/book3s64/pgtable.c           | 117 +++++++++++++++++++
 2 files changed, 126 insertions(+)

diff --git a/arch/powerpc/include/asm/book3s/64/pgtable.h b/arch/powerpc/include/asm/book3s/64/pgtable.h
index b01624e5c467..8330b35cd28d 100644
--- a/arch/powerpc/include/asm/book3s/64/pgtable.h
+++ b/arch/powerpc/include/asm/book3s/64/pgtable.h
@@ -1372,5 +1372,14 @@ static inline bool pgd_is_leaf(pgd_t pgd)
 	return !!(pgd_raw(pgd) & cpu_to_be64(_PAGE_PTE));
 }
 
+#define __HAVE_ARCH_LOCKLESS_PGTBL_WALK_CONTROL
+unsigned long begin_lockless_pgtbl_walk(struct mm_struct *mm);
+unsigned long __begin_lockless_pgtbl_walk(struct mm_struct *mm,
+					  bool disable_irq);
+void end_lockless_pgtbl_walk(struct mm_struct *mm, unsigned long irq_mask);
+void __end_lockless_pgtbl_walk(struct mm_struct *mm, unsigned long irq_mask,
+			       bool enable_irq);
+int running_lockless_pgtbl_walk(struct mm_struct *mm);
+
 #endif /* __ASSEMBLY__ */
 #endif /* _ASM_POWERPC_BOOK3S_64_PGTABLE_H_ */
diff --git a/arch/powerpc/mm/book3s64/pgtable.c b/arch/powerpc/mm/book3s64/pgtable.c
index 75483b40fcb1..ae557fdce9a3 100644
--- a/arch/powerpc/mm/book3s64/pgtable.c
+++ b/arch/powerpc/mm/book3s64/pgtable.c
@@ -98,6 +98,123 @@ void serialize_against_pte_lookup(struct mm_struct *mm)
 	smp_call_function_many(mm_cpumask(mm), do_nothing, NULL, 1);
 }
 
+/*
+ * Counting method to monitor lockless pagetable walks:
+ * Uses begin_lockless_pgtbl_walk and end_lockless_pgtbl_walk to track the
+ * number of lockless pgtable walks happening, and
+ * running_lockless_pgtbl_walk to return this value.
+ */
+
+/* begin_lockless_pgtbl_walk: Must be inserted before a function call that does
+ *   lockless pagetable walks, such as __find_linux_pte().
+ * This version allows setting disable_irq=false, so irqs are not touched, which
+ *   is quite useful for running when ints are already disabled (like real-mode)
+ */
+
+inline unsigned long __begin_lockless_pgtbl_walk(struct mm_struct *mm,
+						 bool disable_irq)
+{
+	unsigned long irq_mask = 0;
+
+	if (IS_ENABLED(CONFIG_LOCKLESS_PAGE_TABLE_WALK_TRACKING))
+		atomic_inc(&mm->lockless_pgtbl_walkers);
+
+	/*
+	 * Interrupts must be disabled during the lockless page table walk.
+	 * That's because the deleting or splitting involves flushing TLBs,
+	 * which in turn issues interrupts, that will block when disabled.
+	 *
+	 * When this function is called from realmode with MSR[EE=0],
+	 * it's not needed to touch irq, since it's already disabled.
+	 */
+	if (disable_irq)
+		local_irq_save(irq_mask);
+
+	/*
+	 * This memory barrier pairs with any code that is either trying to
+	 * delete page tables, or split huge pages. Without this barrier,
+	 * the page tables could be read speculatively outside of interrupt
+	 * disabling or reference counting.
+	 */
+	smp_mb();
+
+	return irq_mask;
+}
+EXPORT_SYMBOL(__begin_lockless_pgtbl_walk);
+
+/* begin_lockless_pgtbl_walk: Must be inserted before a function call that does
+ *   lockless pagetable walks, such as __find_linux_pte().
+ * This version is used by generic code, and always assume irqs being disabled
+ */
+unsigned long begin_lockless_pgtbl_walk(struct mm_struct *mm)
+{
+	return __begin_lockless_pgtbl_walk(mm, true);
+}
+EXPORT_SYMBOL(begin_lockless_pgtbl_walk);
+
+/*
+ * __end_lockless_pgtbl_walk: Must be inserted after the last use of a pointer
+ *   returned by a lockless pagetable walk, such as __find_linux_pte()
+ * This version allows setting enable_irq=false, so irqs are not touched, which
+ *   is quite useful for running when ints are already disabled (like real-mode)
+ */
+inline void __end_lockless_pgtbl_walk(struct mm_struct *mm,
+				      unsigned long irq_mask, bool enable_irq)
+{
+	/*
+	 * This memory barrier pairs with any code that is either trying to
+	 * delete page tables, or split huge pages. Without this barrier,
+	 * the page tables could be read speculatively outside of interrupt
+	 * disabling or reference counting.
+	 */
+	smp_mb();
+
+	/*
+	 * Interrupts must be disabled during the lockless page table walk.
+	 * That's because the deleting or splitting involves flushing TLBs,
+	 * which in turn issues interrupts, that will block when disabled.
+	 *
+	 * When this function is called from realmode with MSR[EE=0],
+	 * it's not needed to touch irq, since it's already disabled.
+	 */
+	if (enable_irq)
+		local_irq_restore(irq_mask);
+
+	if (IS_ENABLED(CONFIG_LOCKLESS_PAGE_TABLE_WALK_TRACKING))
+		atomic_dec(&mm->lockless_pgtbl_walkers);
+}
+EXPORT_SYMBOL(__end_lockless_pgtbl_walk);
+
+/*
+ * end_lockless_pgtbl_walk: Must be inserted after the last use of a pointer
+ *   returned by a lockless pagetable walk, such as __find_linux_pte()
+ * This version is used by generic code, and always assume irqs being enabled
+ */
+
+void end_lockless_pgtbl_walk(struct mm_struct *mm, unsigned long irq_mask)
+{
+	__end_lockless_pgtbl_walk(mm, irq_mask, true);
+}
+EXPORT_SYMBOL(end_lockless_pgtbl_walk);
+
+/*
+ * running_lockless_pgtbl_walk: Returns the number of lockless pagetable walks
+ *   currently running. If it returns 0, there is no running pagetable walk, and
+ *   THP split/collapse can be safely done. This can be used to avoid more
+ *   expensive approaches like serialize_against_pte_lookup()
+ */
+int running_lockless_pgtbl_walk(struct mm_struct *mm)
+{
+	if (IS_ENABLED(CONFIG_LOCKLESS_PAGE_TABLE_WALK_TRACKING))
+		return atomic_read(&mm->lockless_pgtbl_walkers);
+
+	/* If disabled, must return > 0, so it fallback to sync method
+	 * (serialize_against_pte_lookup)
+	 */
+	return 1;
+}
+EXPORT_SYMBOL(running_lockless_pgtbl_walk);
+
 /*
  * We use this to invalidate a pmdp entry before switching from a
  * hugepte to regular pmd entry.
-- 
2.20.1

WARNING: multiple messages have this Message-ID (diff)
From: Leonardo Bras <leonardo@linux.ibm.com>
To: linuxppc-dev@lists.ozlabs.org, linux-kernel@vger.kernel.org,
	kvm-ppc@vger.kernel.org, linux-arch@vger.kernel.org,
	linux-mm@kvack.org
Cc: "Song Liu" <songliubraving@fb.com>,
	"Michal Hocko" <mhocko@suse.com>,
	"Peter Zijlstra (Intel)" <peterz@infradead.org>,
	"Dmitry V. Levin" <ldv@altlinux.org>,
	"Keith Busch" <keith.busch@intel.com>,
	"Paul Mackerras" <paulus@samba.org>,
	"Christoph Lameter" <cl@linux.com>,
	"Ira Weiny" <ira.weiny@intel.com>,
	"Thomas Gleixner" <tglx@linutronix.de>,
	"Elena Reshetova" <elena.reshetova@intel.com>,
	"Andrea Arcangeli" <aarcange@redhat.com>,
	"Santosh Sivaraj" <santosh@fossix.org>,
	"Davidlohr Bueso" <dave@stgolabs.net>,
	"Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>,
	"Bartlomiej Zolnierkiewicz" <b.zolnierkie@samsung.com>,
	"Mike Rapoport" <rppt@linux.ibm.com>,
	"Jason Gunthorpe" <jgg@ziepe.ca>,
	"Allison Randal" <allison@lohutok.net>,
	"Mahesh Salgaonkar" <mahesh@linux.vnet.ibm.com>,
	"Leonardo Bras" <leonardo@linux.ibm.com>,
	"Alexey Dobriyan" <adobriyan@gmail.com>,
	"Ingo Molnar" <mingo@kernel.org>,
	"Ralph Campbell" <rcampbell@nvidia.com>,
	"Arnd Bergmann" <arnd@arndb.de>, "Jann Horn" <jannh@google.com>,
	"John Hubbard" <jhubbard@nvidia.com>,
	"Jesper Dangaard Brouer" <brouer@redhat.com>,
	"Nicholas Piggin" <npiggin@gmail.com>,
	"Jérôme Glisse" <jglisse@redhat.com>,
	"Mathieu Desnoyers" <mathieu.desnoyers@efficios.com>,
	"Al Viro" <viro@zeniv.linux.org.uk>,
	"Andrey Ryabinin" <aryabinin@virtuozzo.com>,
	"Dan Williams" <dan.j.williams@intel.com>,
	"Reza Arbab" <arbab@linux.ibm.com>,
	"Vlastimil Babka" <vbabka@suse.cz>,
	"Christian Brauner" <christian.brauner@ubuntu.com>,
	"Greg Kroah-Hartman" <gregkh@linuxfoundation.org>,
	"Souptick Joarder" <jrdr.linux@gmail.com>,
	"Andrew Morton" <akpm@linux-foundation.org>,
	"Logan Gunthorpe" <logang@deltatee.com>,
	"Roman Gushchin" <guro@fb.com>,
	"Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Subject: [PATCH v5 02/11] powerpc/mm: Adds counting method to monitor lockless pgtable walks
Date: Wed,  2 Oct 2019 22:33:16 -0300	[thread overview]
Message-ID: <20191003013325.2614-3-leonardo@linux.ibm.com> (raw)
In-Reply-To: <20191003013325.2614-1-leonardo@linux.ibm.com>

It's necessary to monitor lockless pagetable walks, in order to avoid doing
THP splitting/collapsing during them.

On powerpc, we need to do some lockless pagetable walks from functions
that already have disabled interrupts, specially from real mode with
MSR[EE=0].

In these contexts, disabling/enabling interrupts can be very troubling.

So, this arch-specific implementation features functions with an extra
argument that allows interrupt enable/disable to be skipped:
__begin_lockless_pgtbl_walk() and __end_lockless_pgtbl_walk().

Functions similar to the generic ones are also exported, by calling
the above functions with parameter *able_irq = false.

While there is no config option, the method is disabled and these functions
are only doing what was already needed to lockless pagetable walks
(disabling interrupt). A memory barrier was also added just to make sure
there is no speculative read outside the interrupt disabled area.

Signed-off-by: Leonardo Bras <leonardo@linux.ibm.com>
---
 arch/powerpc/include/asm/book3s/64/pgtable.h |   9 ++
 arch/powerpc/mm/book3s64/pgtable.c           | 117 +++++++++++++++++++
 2 files changed, 126 insertions(+)

diff --git a/arch/powerpc/include/asm/book3s/64/pgtable.h b/arch/powerpc/include/asm/book3s/64/pgtable.h
index b01624e5c467..8330b35cd28d 100644
--- a/arch/powerpc/include/asm/book3s/64/pgtable.h
+++ b/arch/powerpc/include/asm/book3s/64/pgtable.h
@@ -1372,5 +1372,14 @@ static inline bool pgd_is_leaf(pgd_t pgd)
 	return !!(pgd_raw(pgd) & cpu_to_be64(_PAGE_PTE));
 }
 
+#define __HAVE_ARCH_LOCKLESS_PGTBL_WALK_CONTROL
+unsigned long begin_lockless_pgtbl_walk(struct mm_struct *mm);
+unsigned long __begin_lockless_pgtbl_walk(struct mm_struct *mm,
+					  bool disable_irq);
+void end_lockless_pgtbl_walk(struct mm_struct *mm, unsigned long irq_mask);
+void __end_lockless_pgtbl_walk(struct mm_struct *mm, unsigned long irq_mask,
+			       bool enable_irq);
+int running_lockless_pgtbl_walk(struct mm_struct *mm);
+
 #endif /* __ASSEMBLY__ */
 #endif /* _ASM_POWERPC_BOOK3S_64_PGTABLE_H_ */
diff --git a/arch/powerpc/mm/book3s64/pgtable.c b/arch/powerpc/mm/book3s64/pgtable.c
index 75483b40fcb1..ae557fdce9a3 100644
--- a/arch/powerpc/mm/book3s64/pgtable.c
+++ b/arch/powerpc/mm/book3s64/pgtable.c
@@ -98,6 +98,123 @@ void serialize_against_pte_lookup(struct mm_struct *mm)
 	smp_call_function_many(mm_cpumask(mm), do_nothing, NULL, 1);
 }
 
+/*
+ * Counting method to monitor lockless pagetable walks:
+ * Uses begin_lockless_pgtbl_walk and end_lockless_pgtbl_walk to track the
+ * number of lockless pgtable walks happening, and
+ * running_lockless_pgtbl_walk to return this value.
+ */
+
+/* begin_lockless_pgtbl_walk: Must be inserted before a function call that does
+ *   lockless pagetable walks, such as __find_linux_pte().
+ * This version allows setting disable_irq=false, so irqs are not touched, which
+ *   is quite useful for running when ints are already disabled (like real-mode)
+ */
+
+inline unsigned long __begin_lockless_pgtbl_walk(struct mm_struct *mm,
+						 bool disable_irq)
+{
+	unsigned long irq_mask = 0;
+
+	if (IS_ENABLED(CONFIG_LOCKLESS_PAGE_TABLE_WALK_TRACKING))
+		atomic_inc(&mm->lockless_pgtbl_walkers);
+
+	/*
+	 * Interrupts must be disabled during the lockless page table walk.
+	 * That's because the deleting or splitting involves flushing TLBs,
+	 * which in turn issues interrupts, that will block when disabled.
+	 *
+	 * When this function is called from realmode with MSR[EE=0],
+	 * it's not needed to touch irq, since it's already disabled.
+	 */
+	if (disable_irq)
+		local_irq_save(irq_mask);
+
+	/*
+	 * This memory barrier pairs with any code that is either trying to
+	 * delete page tables, or split huge pages. Without this barrier,
+	 * the page tables could be read speculatively outside of interrupt
+	 * disabling or reference counting.
+	 */
+	smp_mb();
+
+	return irq_mask;
+}
+EXPORT_SYMBOL(__begin_lockless_pgtbl_walk);
+
+/* begin_lockless_pgtbl_walk: Must be inserted before a function call that does
+ *   lockless pagetable walks, such as __find_linux_pte().
+ * This version is used by generic code, and always assume irqs being disabled
+ */
+unsigned long begin_lockless_pgtbl_walk(struct mm_struct *mm)
+{
+	return __begin_lockless_pgtbl_walk(mm, true);
+}
+EXPORT_SYMBOL(begin_lockless_pgtbl_walk);
+
+/*
+ * __end_lockless_pgtbl_walk: Must be inserted after the last use of a pointer
+ *   returned by a lockless pagetable walk, such as __find_linux_pte()
+ * This version allows setting enable_irq=false, so irqs are not touched, which
+ *   is quite useful for running when ints are already disabled (like real-mode)
+ */
+inline void __end_lockless_pgtbl_walk(struct mm_struct *mm,
+				      unsigned long irq_mask, bool enable_irq)
+{
+	/*
+	 * This memory barrier pairs with any code that is either trying to
+	 * delete page tables, or split huge pages. Without this barrier,
+	 * the page tables could be read speculatively outside of interrupt
+	 * disabling or reference counting.
+	 */
+	smp_mb();
+
+	/*
+	 * Interrupts must be disabled during the lockless page table walk.
+	 * That's because the deleting or splitting involves flushing TLBs,
+	 * which in turn issues interrupts, that will block when disabled.
+	 *
+	 * When this function is called from realmode with MSR[EE=0],
+	 * it's not needed to touch irq, since it's already disabled.
+	 */
+	if (enable_irq)
+		local_irq_restore(irq_mask);
+
+	if (IS_ENABLED(CONFIG_LOCKLESS_PAGE_TABLE_WALK_TRACKING))
+		atomic_dec(&mm->lockless_pgtbl_walkers);
+}
+EXPORT_SYMBOL(__end_lockless_pgtbl_walk);
+
+/*
+ * end_lockless_pgtbl_walk: Must be inserted after the last use of a pointer
+ *   returned by a lockless pagetable walk, such as __find_linux_pte()
+ * This version is used by generic code, and always assume irqs being enabled
+ */
+
+void end_lockless_pgtbl_walk(struct mm_struct *mm, unsigned long irq_mask)
+{
+	__end_lockless_pgtbl_walk(mm, irq_mask, true);
+}
+EXPORT_SYMBOL(end_lockless_pgtbl_walk);
+
+/*
+ * running_lockless_pgtbl_walk: Returns the number of lockless pagetable walks
+ *   currently running. If it returns 0, there is no running pagetable walk, and
+ *   THP split/collapse can be safely done. This can be used to avoid more
+ *   expensive approaches like serialize_against_pte_lookup()
+ */
+int running_lockless_pgtbl_walk(struct mm_struct *mm)
+{
+	if (IS_ENABLED(CONFIG_LOCKLESS_PAGE_TABLE_WALK_TRACKING))
+		return atomic_read(&mm->lockless_pgtbl_walkers);
+
+	/* If disabled, must return > 0, so it fallback to sync method
+	 * (serialize_against_pte_lookup)
+	 */
+	return 1;
+}
+EXPORT_SYMBOL(running_lockless_pgtbl_walk);
+
 /*
  * We use this to invalidate a pmdp entry before switching from a
  * hugepte to regular pmd entry.
-- 
2.20.1


WARNING: multiple messages have this Message-ID (diff)
From: Leonardo Bras <leonardo@linux.ibm.com>
To: linuxppc-dev@lists.ozlabs.org, linux-kernel@vger.kernel.org,
	kvm-ppc@vger.kernel.org, linux-arch@vger.kernel.org,
	linux-mm@kvack.org
Cc: "Leonardo Bras" <leonardo@linux.ibm.com>,
	"Benjamin Herrenschmidt" <benh@kernel.crashing.org>,
	"Paul Mackerras" <paulus@samba.org>,
	"Michael Ellerman" <mpe@ellerman.id.au>,
	"Arnd Bergmann" <arnd@arndb.de>,
	"Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>,
	"Christophe Leroy" <christophe.leroy@c-s.fr>,
	"Nicholas Piggin" <npiggin@gmail.com>,
	"Andrew Morton" <akpm@linux-foundation.org>,
	"Mahesh Salgaonkar" <mahesh@linux.vnet.ibm.com>,
	"Reza Arbab" <arbab@linux.ibm.com>,
	"Santosh Sivaraj" <santosh@fossix.org>,
	"Balbir Singh" <bsingharora@gmail.com>,
	"Thomas Gleixner" <tglx@linutronix.de>,
	"Greg Kroah-Hartman" <gregkh@linuxfoundation.org>,
	"Mike Rapoport" <rppt@linux.ibm.com>,
	"Allison Randal" <allison@lohutok.net>,
	"Jason Gunthorpe" <jgg@ziepe.ca>,
	"Dan Williams" <dan.j.williams@intel.com>,
	"Vlastimil Babka" <vbabka@suse.cz>,
	"Christoph Lameter" <cl@linux.com>,
	"Logan Gunthorpe" <logang@deltatee.com>,
	"Andrey Ryabinin" <aryabinin@virtuozzo.com>,
	"Alexey Dobriyan" <adobriyan@gmail.com>,
	"Souptick Joarder" <jrdr.linux@gmail.com>,
	"Mathieu Desnoyers" <mathieu.desnoyers@efficios.com>,
	"Ralph Campbell" <rcampbell@nvidia.com>,
	"Jesper Dangaard Brouer" <brouer@redhat.com>,
	"Jann Horn" <jannh@google.com>,
	"Davidlohr Bueso" <dave@stgolabs.net>,
	"Peter Zijlstra (Intel)" <peterz@infradead.org>,
	"Ingo Molnar" <mingo@kernel.org>,
	"Christian Brauner" <christian.brauner@ubuntu.com>,
	"Michal Hocko" <mhocko@suse.com>,
	"Elena Reshetova" <elena.reshetova@intel.com>,
	"Roman Gushchin" <guro@fb.com>,
	"Andrea Arcangeli" <aarcange@redhat.com>,
	"Al Viro" <viro@zeniv.linux.org.uk>,
	"Dmitry V. Levin" <ldv@altlinux.org>,
	"Jérôme Glisse" <jglisse@redhat.com>,
	"Song Liu" <songliubraving@fb.com>,
	"Bartlomiej Zolnierkiewicz" <b.zolnierkie@samsung.com>,
	"Ira Weiny" <ira.weiny@intel.com>,
	"Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>,
	"John Hubbard" <jhubbard@nvidia.com>,
	"Keith Busch" <keith.busch@intel.com>
Subject: [PATCH v5 02/11] powerpc/mm: Adds counting method to monitor lockless pgtable walks
Date: Wed,  2 Oct 2019 22:33:16 -0300	[thread overview]
Message-ID: <20191003013325.2614-3-leonardo@linux.ibm.com> (raw)
In-Reply-To: <20191003013325.2614-1-leonardo@linux.ibm.com>

It's necessary to monitor lockless pagetable walks, in order to avoid doing
THP splitting/collapsing during them.

On powerpc, we need to do some lockless pagetable walks from functions
that already have disabled interrupts, specially from real mode with
MSR[EE=0].

In these contexts, disabling/enabling interrupts can be very troubling.

So, this arch-specific implementation features functions with an extra
argument that allows interrupt enable/disable to be skipped:
__begin_lockless_pgtbl_walk() and __end_lockless_pgtbl_walk().

Functions similar to the generic ones are also exported, by calling
the above functions with parameter *able_irq = false.

While there is no config option, the method is disabled and these functions
are only doing what was already needed to lockless pagetable walks
(disabling interrupt). A memory barrier was also added just to make sure
there is no speculative read outside the interrupt disabled area.

Signed-off-by: Leonardo Bras <leonardo@linux.ibm.com>
---
 arch/powerpc/include/asm/book3s/64/pgtable.h |   9 ++
 arch/powerpc/mm/book3s64/pgtable.c           | 117 +++++++++++++++++++
 2 files changed, 126 insertions(+)

diff --git a/arch/powerpc/include/asm/book3s/64/pgtable.h b/arch/powerpc/include/asm/book3s/64/pgtable.h
index b01624e5c467..8330b35cd28d 100644
--- a/arch/powerpc/include/asm/book3s/64/pgtable.h
+++ b/arch/powerpc/include/asm/book3s/64/pgtable.h
@@ -1372,5 +1372,14 @@ static inline bool pgd_is_leaf(pgd_t pgd)
 	return !!(pgd_raw(pgd) & cpu_to_be64(_PAGE_PTE));
 }
 
+#define __HAVE_ARCH_LOCKLESS_PGTBL_WALK_CONTROL
+unsigned long begin_lockless_pgtbl_walk(struct mm_struct *mm);
+unsigned long __begin_lockless_pgtbl_walk(struct mm_struct *mm,
+					  bool disable_irq);
+void end_lockless_pgtbl_walk(struct mm_struct *mm, unsigned long irq_mask);
+void __end_lockless_pgtbl_walk(struct mm_struct *mm, unsigned long irq_mask,
+			       bool enable_irq);
+int running_lockless_pgtbl_walk(struct mm_struct *mm);
+
 #endif /* __ASSEMBLY__ */
 #endif /* _ASM_POWERPC_BOOK3S_64_PGTABLE_H_ */
diff --git a/arch/powerpc/mm/book3s64/pgtable.c b/arch/powerpc/mm/book3s64/pgtable.c
index 75483b40fcb1..ae557fdce9a3 100644
--- a/arch/powerpc/mm/book3s64/pgtable.c
+++ b/arch/powerpc/mm/book3s64/pgtable.c
@@ -98,6 +98,123 @@ void serialize_against_pte_lookup(struct mm_struct *mm)
 	smp_call_function_many(mm_cpumask(mm), do_nothing, NULL, 1);
 }
 
+/*
+ * Counting method to monitor lockless pagetable walks:
+ * Uses begin_lockless_pgtbl_walk and end_lockless_pgtbl_walk to track the
+ * number of lockless pgtable walks happening, and
+ * running_lockless_pgtbl_walk to return this value.
+ */
+
+/* begin_lockless_pgtbl_walk: Must be inserted before a function call that does
+ *   lockless pagetable walks, such as __find_linux_pte().
+ * This version allows setting disable_irq=false, so irqs are not touched, which
+ *   is quite useful for running when ints are already disabled (like real-mode)
+ */
+
+inline unsigned long __begin_lockless_pgtbl_walk(struct mm_struct *mm,
+						 bool disable_irq)
+{
+	unsigned long irq_mask = 0;
+
+	if (IS_ENABLED(CONFIG_LOCKLESS_PAGE_TABLE_WALK_TRACKING))
+		atomic_inc(&mm->lockless_pgtbl_walkers);
+
+	/*
+	 * Interrupts must be disabled during the lockless page table walk.
+	 * That's because the deleting or splitting involves flushing TLBs,
+	 * which in turn issues interrupts, that will block when disabled.
+	 *
+	 * When this function is called from realmode with MSR[EE=0],
+	 * it's not needed to touch irq, since it's already disabled.
+	 */
+	if (disable_irq)
+		local_irq_save(irq_mask);
+
+	/*
+	 * This memory barrier pairs with any code that is either trying to
+	 * delete page tables, or split huge pages. Without this barrier,
+	 * the page tables could be read speculatively outside of interrupt
+	 * disabling or reference counting.
+	 */
+	smp_mb();
+
+	return irq_mask;
+}
+EXPORT_SYMBOL(__begin_lockless_pgtbl_walk);
+
+/* begin_lockless_pgtbl_walk: Must be inserted before a function call that does
+ *   lockless pagetable walks, such as __find_linux_pte().
+ * This version is used by generic code, and always assume irqs being disabled
+ */
+unsigned long begin_lockless_pgtbl_walk(struct mm_struct *mm)
+{
+	return __begin_lockless_pgtbl_walk(mm, true);
+}
+EXPORT_SYMBOL(begin_lockless_pgtbl_walk);
+
+/*
+ * __end_lockless_pgtbl_walk: Must be inserted after the last use of a pointer
+ *   returned by a lockless pagetable walk, such as __find_linux_pte()
+ * This version allows setting enable_irq=false, so irqs are not touched, which
+ *   is quite useful for running when ints are already disabled (like real-mode)
+ */
+inline void __end_lockless_pgtbl_walk(struct mm_struct *mm,
+				      unsigned long irq_mask, bool enable_irq)
+{
+	/*
+	 * This memory barrier pairs with any code that is either trying to
+	 * delete page tables, or split huge pages. Without this barrier,
+	 * the page tables could be read speculatively outside of interrupt
+	 * disabling or reference counting.
+	 */
+	smp_mb();
+
+	/*
+	 * Interrupts must be disabled during the lockless page table walk.
+	 * That's because the deleting or splitting involves flushing TLBs,
+	 * which in turn issues interrupts, that will block when disabled.
+	 *
+	 * When this function is called from realmode with MSR[EE=0],
+	 * it's not needed to touch irq, since it's already disabled.
+	 */
+	if (enable_irq)
+		local_irq_restore(irq_mask);
+
+	if (IS_ENABLED(CONFIG_LOCKLESS_PAGE_TABLE_WALK_TRACKING))
+		atomic_dec(&mm->lockless_pgtbl_walkers);
+}
+EXPORT_SYMBOL(__end_lockless_pgtbl_walk);
+
+/*
+ * end_lockless_pgtbl_walk: Must be inserted after the last use of a pointer
+ *   returned by a lockless pagetable walk, such as __find_linux_pte()
+ * This version is used by generic code, and always assume irqs being enabled
+ */
+
+void end_lockless_pgtbl_walk(struct mm_struct *mm, unsigned long irq_mask)
+{
+	__end_lockless_pgtbl_walk(mm, irq_mask, true);
+}
+EXPORT_SYMBOL(end_lockless_pgtbl_walk);
+
+/*
+ * running_lockless_pgtbl_walk: Returns the number of lockless pagetable walks
+ *   currently running. If it returns 0, there is no running pagetable walk, and
+ *   THP split/collapse can be safely done. This can be used to avoid more
+ *   expensive approaches like serialize_against_pte_lookup()
+ */
+int running_lockless_pgtbl_walk(struct mm_struct *mm)
+{
+	if (IS_ENABLED(CONFIG_LOCKLESS_PAGE_TABLE_WALK_TRACKING))
+		return atomic_read(&mm->lockless_pgtbl_walkers);
+
+	/* If disabled, must return > 0, so it fallback to sync method
+	 * (serialize_against_pte_lookup)
+	 */
+	return 1;
+}
+EXPORT_SYMBOL(running_lockless_pgtbl_walk);
+
 /*
  * We use this to invalidate a pmdp entry before switching from a
  * hugepte to regular pmd entry.
-- 
2.20.1



  parent reply	other threads:[~2019-10-03  1:33 UTC|newest]

Thread overview: 119+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2019-10-03  1:33 [PATCH v5 00/11] Introduces new count-based method for tracking lockless pagetable walks Leonardo Bras
2019-10-03  1:33 ` Leonardo Bras
2019-10-03  1:33 ` Leonardo Bras
2019-10-03  1:33 ` [PATCH v5 01/11] asm-generic/pgtable: Adds generic functions to monitor lockless pgtable walks Leonardo Bras
2019-10-03  1:33   ` Leonardo Bras
2019-10-03  1:33   ` Leonardo Bras
2019-10-03  7:11   ` Peter Zijlstra
2019-10-03  7:11     ` Peter Zijlstra
2019-10-03  7:11     ` Peter Zijlstra
2019-10-03 11:51     ` Peter Zijlstra
2019-10-03 11:51       ` Peter Zijlstra
2019-10-03 11:51       ` Peter Zijlstra
2019-10-03 20:40       ` John Hubbard
2019-10-03 20:40         ` John Hubbard
2019-10-03 20:40         ` John Hubbard
2019-10-04 11:24         ` Peter Zijlstra
2019-10-04 11:24           ` Peter Zijlstra
2019-10-04 11:24           ` Peter Zijlstra
2019-10-03 21:24       ` Leonardo Bras
2019-10-03 21:24         ` Leonardo Bras
2019-10-03 21:24         ` Leonardo Bras
2019-10-03 21:24         ` Leonardo Bras
2019-10-04 11:28         ` Peter Zijlstra
2019-10-04 11:28           ` Peter Zijlstra
2019-10-04 11:28           ` Peter Zijlstra
2019-10-04 11:28           ` Peter Zijlstra
2019-10-09 18:09           ` Leonardo Bras
2019-10-09 18:09             ` Leonardo Bras
2019-10-09 18:09             ` Leonardo Bras
2019-10-09 18:09             ` Leonardo Bras
2019-10-05  8:35       ` Aneesh Kumar K.V
2019-10-05  8:35         ` Aneesh Kumar K.V
2019-10-05  8:35         ` Aneesh Kumar K.V
2019-10-08 14:47         ` Kirill A. Shutemov
2019-10-08 14:47           ` Kirill A. Shutemov
2019-10-08 14:47           ` Kirill A. Shutemov
2019-10-03  1:33 ` Leonardo Bras [this message]
2019-10-03  1:33   ` [PATCH v5 02/11] powerpc/mm: Adds counting method " Leonardo Bras
2019-10-03  1:33   ` Leonardo Bras
2019-10-08 15:11   ` Christopher Lameter
2019-10-08 15:11     ` Christopher Lameter
2019-10-08 15:11     ` Christopher Lameter
2019-10-08 17:13     ` Leonardo Bras
2019-10-08 17:13       ` Leonardo Bras
2019-10-08 17:13       ` Leonardo Bras
2019-10-08 17:43       ` Christopher Lameter
2019-10-08 17:43         ` Christopher Lameter
2019-10-08 17:43         ` Christopher Lameter
2019-10-08 18:02         ` Leonardo Bras
2019-10-08 18:02           ` Leonardo Bras
2019-10-08 18:02           ` Leonardo Bras
2019-10-08 18:27           ` Christopher Lameter
2019-10-08 18:27             ` Christopher Lameter
2019-10-08 18:27             ` Christopher Lameter
2019-10-03  1:33 ` [PATCH v5 03/11] mm/gup: Applies counting method to monitor gup_pgd_range Leonardo Bras
2019-10-03  1:33   ` Leonardo Bras
2019-10-03  1:33   ` Leonardo Bras
2019-10-03  1:33 ` [PATCH v5 04/11] powerpc/mce_power: Applies counting method to monitor lockless pgtbl walks Leonardo Bras
2019-10-03  1:33   ` Leonardo Bras
2019-10-03  1:33   ` Leonardo Bras
2019-10-03  1:33 ` [PATCH v5 05/11] powerpc/perf: " Leonardo Bras
2019-10-03  1:33   ` Leonardo Bras
2019-10-03  1:33   ` Leonardo Bras
2019-10-03  1:33 ` [PATCH v5 06/11] powerpc/mm/book3s64/hash: " Leonardo Bras
2019-10-03  1:33   ` Leonardo Bras
2019-10-03  1:33   ` Leonardo Bras
2019-10-03  1:33 ` [PATCH v5 07/11] powerpc/kvm/e500: " Leonardo Bras
2019-10-03  1:33   ` Leonardo Bras
2019-10-03  1:33   ` Leonardo Bras
2019-10-03  1:33 ` [PATCH v5 08/11] powerpc/kvm/book3s_hv: " Leonardo Bras
2019-10-03  1:33   ` Leonardo Bras
2019-10-03  1:33   ` Leonardo Bras
2019-10-03  1:33 ` [PATCH v5 09/11] powerpc/kvm/book3s_64: " Leonardo Bras
2019-10-03  1:33   ` Leonardo Bras
2019-10-03  1:33   ` Leonardo Bras
2019-10-03  1:33 ` [PATCH v5 10/11] mm/Kconfig: Adds config option to track lockless pagetable walks Leonardo Bras
2019-10-03  1:33   ` Leonardo Bras
2019-10-03  1:33   ` Leonardo Bras
2019-10-03  2:08   ` Qian Cai
2019-10-03  2:08     ` Qian Cai
2019-10-03  2:08     ` Qian Cai
2019-10-03 19:04     ` Leonardo Bras
2019-10-03 19:04       ` Leonardo Bras
2019-10-03 19:04       ` Leonardo Bras
2019-10-03 19:08       ` Leonardo Bras
2019-10-03 19:08         ` Leonardo Bras
2019-10-03 19:08         ` Leonardo Bras
2019-10-03  7:44   ` Peter Zijlstra
2019-10-03  7:44     ` Peter Zijlstra
2019-10-03  7:44     ` Peter Zijlstra
2019-10-03 20:40     ` Leonardo Bras
2019-10-03 20:40       ` Leonardo Bras
2019-10-03 20:40       ` Leonardo Bras
2019-10-03 20:40       ` Leonardo Bras
2019-10-03  1:33 ` [PATCH v5 11/11] powerpc/mm/book3s64/pgtable: Uses counting method to skip serializing Leonardo Bras
2019-10-03  1:33   ` Leonardo Bras
2019-10-03  1:33   ` Leonardo Bras
2019-10-03  7:29 ` [PATCH v5 00/11] Introduces new count-based method for tracking lockless pagetable walks Peter Zijlstra
2019-10-03  7:29   ` Peter Zijlstra
2019-10-03  7:29   ` Peter Zijlstra
2019-10-03 20:36   ` Leonardo Bras
2019-10-03 20:36     ` Leonardo Bras
2019-10-03 20:36     ` Leonardo Bras
2019-10-03 20:36     ` Leonardo Bras
2019-10-03 20:49     ` John Hubbard
2019-10-03 20:49       ` John Hubbard
2019-10-03 20:49       ` John Hubbard
2019-10-03 21:38       ` Leonardo Bras
2019-10-03 21:38         ` Leonardo Bras
2019-10-03 21:38         ` Leonardo Bras
2019-10-03 21:38         ` Leonardo Bras
2019-10-04 11:42     ` Peter Zijlstra
2019-10-04 11:42       ` Peter Zijlstra
2019-10-04 11:42       ` Peter Zijlstra
2019-10-04 11:42       ` Peter Zijlstra
2019-10-04 12:57       ` Peter Zijlstra
2019-10-04 12:57         ` Peter Zijlstra
2019-10-04 12:57         ` Peter Zijlstra
2019-10-04 12:57         ` Peter Zijlstra

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20191003013325.2614-3-leonardo@linux.ibm.com \
    --to=leonardo@linux.ibm.com \
    --cc=aarcange@redhat.com \
    --cc=adobriyan@gmail.com \
    --cc=allison@lohutok.net \
    --cc=aneesh.kumar@linux.ibm.com \
    --cc=b.zolnierkie@samsung.com \
    --cc=cl@linux.com \
    --cc=dave@stgolabs.net \
    --cc=elena.reshetova@intel.com \
    --cc=ira.weiny@intel.com \
    --cc=jgg@ziepe.ca \
    --cc=keith.busch@intel.com \
    --cc=kvm-ppc@vger.kernel.org \
    --cc=ldv@altlinux.org \
    --cc=linux-arch@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=linuxppc-dev@lists.ozlabs.org \
    --cc=mahesh@linux.vnet.ibm.com \
    --cc=mhocko@suse.com \
    --cc=mingo@kernel.org \
    --cc=paulus@samba.org \
    --cc=peterz@infradead.org \
    --cc=rppt@linux.ibm.com \
    --cc=santosh@fossix.org \
    --cc=songliubraving@fb.com \
    --cc=tglx@linutronix.de \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.