public inbox for linux-kernel@vger.kernel.org
* [REF PATCH] x86/tlb: just do tlb flush on one of siblings of SMT
@ 2016-04-06  3:14 Alex Shi
  2016-04-06  4:47 ` Andy Lutomirski
  0 siblings, 1 reply; 3+ messages in thread
From: Alex Shi @ 2016-04-06  3:14 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, H. Peter Anvin,
	maintainer:X86 ARCHITECTURE (32-BIT AND 64-BIT),
	open list:X86 ARCHITECTURE (32-BIT AND 64-BIT)
  Cc: Alex Shi, Andrew Morton, Andy Lutomirski, Rik van Riel

It seems Intel SMT siblings still share the TLB pool, so flushing the TLB
on both threads of a core just causes an extra, useless IPI and an extra
flush. That extra flush also evicts TLB entries which the sibling thread
has just loaded. That's a double waste.

A microbenchmark shows memory access saving about 25% of its time on my
Haswell i7 desktop.
The munmap test's source code is here: https://lkml.org/lkml/2012/5/17/59

test result on Kernel v4.5.0:
$/home/alexs/bin/perf stat -e dTLB-load-misses,dTLB-loads,dTLB-store-misses,dTLB-stores,iTLB-load-misses,iTLB-loads -e tlb:tlb_flush munmap -n 64 -t 16
munmap use 57ms 14072ns/time, memory access uses 48356 times/thread/ms, cost 20ns/time

 Performance counter stats for '/home/alexs/backups/exec-laptop/tlb/munmap -n 64 -t 16':

        18,739,808      dTLB-load-misses          #    2.47% of all dTLB cache hits   (43.05%)
       757,380,911      dTLB-loads                                                    (34.34%)
         2,125,275      dTLB-store-misses                                             (32.23%)
       318,307,759      dTLB-stores                                                   (46.32%)
            32,765      iTLB-load-misses          #    2.03% of all iTLB cache hits   (56.90%)
         1,616,237      iTLB-loads                                                    (44.47%)
            41,476      tlb:tlb_flush

       1.443484546 seconds time elapsed

/proc/vmstat/nr_tlb_remote_flush increased: 4616
/proc/vmstat/nr_tlb_remote_flush_received increased: 32262

test result on Kernel v4.5.0 + this patch:
$/home/alexs/bin/perf stat -e dTLB-load-misses,dTLB-loads,dTLB-store-misses,dTLB-stores,iTLB-load-misses,iTLB-loads -e tlb:tlb_flush munmap -n 64 -t 16
munmap use 48ms 11933ns/time, memory access uses 59966 times/thread/ms, cost 16ns/time

 Performance counter stats for '/home/alexs/backups/exec-laptop/tlb/munmap -n 64 -t 16':

        15,984,772      dTLB-load-misses          #    1.89% of all dTLB cache hits   (41.72%)
       844,099,241      dTLB-loads                                                    (33.30%)
         1,328,102      dTLB-store-misses                                             (52.13%)
       280,902,875      dTLB-stores                                                   (52.03%)
            27,678      iTLB-load-misses          #    1.67% of all iTLB cache hits   (35.35%)
         1,659,550      iTLB-loads                                                    (38.38%)
            25,137      tlb:tlb_flush

       1.428880301 seconds time elapsed

/proc/vmstat/nr_tlb_remote_flush increased: 4616
/proc/vmstat/nr_tlb_remote_flush_received increased: 15912

BTW, the TLB sharing between siblings that this change relies on isn't
architecturally guaranteed.

Signed-off-by: Alex Shi <alex.shi@linaro.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
To: linux-kernel@vger.kernel.org
To: Mel Gorman <mgorman@suse.de>
To: x86@kernel.org
To: "H. Peter Anvin" <hpa@zytor.com>
To: Thomas Gleixner <tglx@linutronix.de>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Alex Shi <alex.shi@linaro.org>
---
 arch/x86/mm/tlb.c | 21 ++++++++++++++++++++-
 1 file changed, 20 insertions(+), 1 deletion(-)

diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 8f4cc3d..6510316 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -134,7 +134,10 @@ void native_flush_tlb_others(const struct cpumask *cpumask,
 				 struct mm_struct *mm, unsigned long start,
 				 unsigned long end)
 {
+	int cpu;
 	struct flush_tlb_info info;
+	cpumask_t flush_mask, *sblmask;
+
 	info.flush_mm = mm;
 	info.flush_start = start;
 	info.flush_end = end;
@@ -151,7 +154,23 @@ void native_flush_tlb_others(const struct cpumask *cpumask,
 								&info, 1);
 		return;
 	}
-	smp_call_function_many(cpumask, flush_tlb_func, &info, 1);
+
+	if (unlikely(smp_num_siblings <= 1)) {
+		smp_call_function_many(cpumask, flush_tlb_func, &info, 1);
+		return;
+	}
+
+	/* Only one flush is needed per pair of SMT siblings */
+	cpumask_copy(&flush_mask, cpumask);
+	for_each_cpu(cpu, &flush_mask) {
+		sblmask = topology_sibling_cpumask(cpu);
+		if (!cpumask_subset(sblmask, &flush_mask))
+			continue;
+
+		cpumask_clear_cpu(cpumask_next(cpu, sblmask), &flush_mask);
+	}
+
+	smp_call_function_many(&flush_mask, flush_tlb_func, &info, 1);
 }
 
 void flush_tlb_current_task(void)
-- 
2.7.2.333.g70bd996
