public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed
* [PATCH] x86/mm/tlb: Revert retpoline avoidance approach
@ 2022-03-18 16:33 Dave Hansen
  2022-03-19  7:20 ` Nadav Amit
  0 siblings, 1 reply; 3+ messages in thread
From: Dave Hansen @ 2022-03-18 16:33 UTC (permalink / raw)
  To: linux-kernel
  Cc: Dave Hansen, Dave Hansen, kernel test robot, Nadav Amit,
	Ingo Molnar, Andy Lutomirski, Peter Zijlstra, x86

0day reported a regression on a microbenchmark which is intended to
stress the TLB flushing path:

	https://lore.kernel.org/all/20220317090415.GE735@xsang-OptiPlex-9020/

It pointed at a commit from Nadav which intended to remove retpoline
overhead in the TLB flushing path by taking the 'cond'-ition in
on_each_cpu_cond_mask(), pre-calculating it, and incorporating it into
'cpumask'.  That allowed the code to use a bunch of earlier direct
calls instead of later indirect calls that need a retpoline.
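
Roughly, the two shapes look like this (a simplified sketch distilled
from the diff below, not the exact kernel code; note that
on_each_cpu_cond_mask() takes an smp_cond_func_t, which is why
tlb_is_not_lazy() regains a 'void *data' argument in the revert):

	/*
	 * Reverted approach: evaluate the condition early with direct
	 * calls, then flush only the pre-computed mask:
	 */
	cpumask_clear(cond_cpumask);
	for_each_cpu(cpu, cpumask)
		if (tlb_is_not_lazy(cpu))		/* direct call */
			__cpumask_set_cpu(cpu, cond_cpumask);
	on_each_cpu_mask(cond_cpumask, flush_tlb_func, info, true);

	/*
	 * Restored approach: pass the condition to the generic helper,
	 * which evaluates it via an indirect (retpolined) call per CPU:
	 */
	on_each_cpu_cond_mask(tlb_is_not_lazy, flush_tlb_func, info, true, cpumask);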

But, in practice, threads can go idle (and into lazy TLB mode where
they don't need to flush their TLB) between the early and late calls,
so the pre-computed mask can send IPIs to CPUs that no longer need
them.  The race runs in this direction (active -> lazy) and not in
the other because TLB-flushing threads tend to hold mmap_lock for
write.  Contention on that lock causes threads to _go_ idle right in
this early/late window.
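
Concretely, the interleaving looks something like this (again just a
sketch; the lazy-mode transition happens in the scheduler and is not
visible in the diff below):

	/* Flushing thread, holding mmap_lock for write: */
	for_each_cpu(cpu, cpumask)
		if (tlb_is_not_lazy(cpu))	/* CPU still looks active */
			__cpumask_set_cpu(cpu, cond_cpumask);

	/*
	 * Meanwhile, a thread on one of those CPUs blocks on mmap_lock
	 * and the CPU enters lazy TLB mode, so it no longer needs the
	 * flush...
	 */

	/* ...but it is already in cond_cpumask, so it gets an IPI anyway: */
	on_each_cpu_mask(cond_cpumask, flush_tlb_func, info, true);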

The original commit did not include any performance data specific
to the retpoline overhead.  I did a few tests on a system with
retpolines:

	https://lore.kernel.org/all/dd8be93c-ded6-b962-50d4-96b1c3afb2b7@intel.com/

which showed a possible small win.  But, that small win pales in
comparison with the bigger loss induced on non-retpoline systems.

Revert the patch that removed the retpolines.  This was not a
clean revert, but it was self-contained enough not to be too painful.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.cm>
Reported-by: kernel test robot <oliver.sang@intel.com>
Fixes: 6035152d8eeb ("x86/mm/tlb: Open-code on_each_cpu_cond_mask() for tlb_is_not_lazy()")
Cc: Nadav Amit <namit@vmware.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: x86@kernel.org
---
 arch/x86/mm/tlb.c | 37 +++++--------------------------------
 1 file changed, 5 insertions(+), 32 deletions(-)

diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 1e6513f95133..161984b44331 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -854,13 +854,11 @@ static void flush_tlb_func(void *info)
 			nr_invalidate);
 }
 
-static bool tlb_is_not_lazy(int cpu)
+static bool tlb_is_not_lazy(int cpu, void *data)
 {
 	return !per_cpu(cpu_tlbstate_shared.is_lazy, cpu);
 }
 
-static DEFINE_PER_CPU(cpumask_t, flush_tlb_mask);
-
 DEFINE_PER_CPU_SHARED_ALIGNED(struct tlb_state_shared, cpu_tlbstate_shared);
 EXPORT_PER_CPU_SYMBOL(cpu_tlbstate_shared);
 
@@ -889,36 +887,11 @@ STATIC_NOPV void native_flush_tlb_multi(const struct cpumask *cpumask,
 	 * up on the new contents of what used to be page tables, while
 	 * doing a speculative memory access.
 	 */
-	if (info->freed_tables) {
+	if (info->freed_tables)
 		on_each_cpu_mask(cpumask, flush_tlb_func, (void *)info, true);
-	} else {
-		/*
-		 * Although we could have used on_each_cpu_cond_mask(),
-		 * open-coding it has performance advantages, as it eliminates
-		 * the need for indirect calls or retpolines. In addition, it
-		 * allows to use a designated cpumask for evaluating the
-		 * condition, instead of allocating one.
-		 *
-		 * This code works under the assumption that there are no nested
-		 * TLB flushes, an assumption that is already made in
-		 * flush_tlb_mm_range().
-		 *
-		 * cond_cpumask is logically a stack-local variable, but it is
-		 * more efficient to have it off the stack and not to allocate
-		 * it on demand. Preemption is disabled and this code is
-		 * non-reentrant.
-		 */
-		struct cpumask *cond_cpumask = this_cpu_ptr(&flush_tlb_mask);
-		int cpu;
-
-		cpumask_clear(cond_cpumask);
-
-		for_each_cpu(cpu, cpumask) {
-			if (tlb_is_not_lazy(cpu))
-				__cpumask_set_cpu(cpu, cond_cpumask);
-		}
-		on_each_cpu_mask(cond_cpumask, flush_tlb_func, (void *)info, true);
-	}
+	else
+		on_each_cpu_cond_mask(tlb_is_not_lazy, flush_tlb_func,
+				(void *)info, 1, cpumask);
 }
 
 void flush_tlb_multi(const struct cpumask *cpumask,
-- 
2.34.0


^ permalink raw reply related	[flat|nested] 3+ messages in thread

* Re: [PATCH] x86/mm/tlb: Revert retpoline avoidance approach
  2022-03-18 16:33 [PATCH] x86/mm/tlb: Revert retpoline avoidance approach Dave Hansen
@ 2022-03-19  7:20 ` Nadav Amit
  2022-03-19  7:28   ` Nadav Amit
  0 siblings, 1 reply; 3+ messages in thread
From: Nadav Amit @ 2022-03-19  7:20 UTC (permalink / raw)
  To: Dave Hansen
  Cc: LKML, Dave Hansen, kernel test robot, Ingo Molnar,
	Andy Lutomirski, Peter Zijlstra, x86@kernel.org


> On Mar 18, 2022, at 9:33 AM, Dave Hansen <dave.hansen@linux.intel.com> wrote:
> 
> 0day reported a regression on a microbenchmark which is intended to
> stress the TLB flushing path:
> 
>        https://lore.kernel.org/all/20220317090415.GE735@xsang-OptiPlex-9020/
> 
> It pointed at a commit from Nadav which intended to remove retpoline
> overhead in the TLB flushing path by taking the 'cond'-ition in
> on_each_cpu_cond_mask(), pre-calculating it, and incorporating it into
> 'cpumask'.  That allowed the code to use a bunch of earlier direct
> calls instead of later indirect calls that need a retpoline.
> 
> But, in practice, threads can go idle (and into lazy TLB mode where
> they don't need to flush their TLB) between the early and late calls,
> so the pre-computed mask can send IPIs to CPUs that no longer need
> them.  The race runs in this direction (active -> lazy) and not in
> the other because TLB-flushing threads tend to hold mmap_lock for
> write.  Contention on that lock causes threads to _go_ idle right in
> this early/late window.
> 

Acked-by: Nadav Amit <namit@vmware.com>


^ permalink raw reply	[flat|nested] 3+ messages in thread

* Re: [PATCH] x86/mm/tlb: Revert retpoline avoidance approach
  2022-03-19  7:20 ` Nadav Amit
@ 2022-03-19  7:28   ` Nadav Amit
  0 siblings, 0 replies; 3+ messages in thread
From: Nadav Amit @ 2022-03-19  7:28 UTC (permalink / raw)
  To: Dave Hansen
  Cc: LKML, kernel test robot, Ingo Molnar, Andy Lutomirski,
	Peter Zijlstra, X86 ML



> On Mar 19, 2022, at 12:20 AM, Nadav Amit <namit@vmware.com> wrote:
> 
> 
>> On Mar 18, 2022, at 9:33 AM, Dave Hansen <dave.hansen@linux.intel.com> wrote:
>> 
>> 0day reported a regression on a microbenchmark which is intended to
>> stress the TLB flushing path:
>> 
>>       https://lore.kernel.org/all/20220317090415.GE735@xsang-OptiPlex-9020/
>> 
>> It pointed at a commit from Nadav which intended to remove retpoline
>> overhead in the TLB flushing path by taking the 'cond'-ition in
>> on_each_cpu_cond_mask(), pre-calculating it, and incorporating it into
>> 'cpumask'.  That allowed the code to use a bunch of earlier direct
>> calls instead of later indirect calls that need a retpoline.
>> 
>> But, in practice, threads can go idle (and into lazy TLB mode where
>> they don't need to flush their TLB) between the early and late calls,
>> so the pre-computed mask can send IPIs to CPUs that no longer need
>> them.  The race runs in this direction (active -> lazy) and not in
>> the other because TLB-flushing threads tend to hold mmap_lock for
>> write.  Contention on that lock causes threads to _go_ idle right in
>> this early/late window.
>> 
> 
> Acked-by: Nadav Amit <namit@vmware.com>
> 

Dave,

Note that your SOB has a wrong email address (intel.cm).

I only noticed because my mail server complained about a wrong email
address in the recipient list…



^ permalink raw reply	[flat|nested] 3+ messages in thread

end of thread, other threads:[~2022-03-19  7:29 UTC | newest]

Thread overview: 3+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-03-18 16:33 [PATCH] x86/mm/tlb: Revert retpoline avoidance approach Dave Hansen
2022-03-19  7:20 ` Nadav Amit
2022-03-19  7:28   ` Nadav Amit

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox