All of lore.kernel.org
 help / color / mirror / Atom feed
* [merged mm-stable] mm-tlbbatch-introduce-arch_tlbbatch_should_defer.patch removed from -mm tree
@ 2023-08-11 23:01 Andrew Morton
  0 siblings, 0 replies; only message in thread
From: Andrew Morton @ 2023-08-11 23:01 UTC (permalink / raw)
  To: mm-commits, yangyicong, xhao, will, wangkefeng.wang, v-songbaohua,
	ryan.roberts, realmz6, punit.agrawal, prime.zeng, peterz, namit,
	mgorman, mark.rutland, lipeifeng, Jonathan.Cameron, darren,
	corbet, catalin.marinas, baohua, arnd, anshuman.khandual,
	khandual, akpm


The quilt patch titled
     Subject: mm/tlbbatch: introduce arch_tlbbatch_should_defer()
has been removed from the -mm tree.  Its filename was
     mm-tlbbatch-introduce-arch_tlbbatch_should_defer.patch

This patch was dropped because it was merged into the mm-stable branch
of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

------------------------------------------------------
From: Anshuman Khandual <khandual@linux.vnet.ibm.com>
Subject: mm/tlbbatch: introduce arch_tlbbatch_should_defer()
Date: Mon, 17 Jul 2023 21:10:01 +0800

Patch series "arm64: support batched/deferred tlb shootdown during page
reclamation/migration", v11.

Though ARM64 has the hardware to do tlb shootdown, the hardware
broadcasting is not free.  A simplest micro benchmark shows even on
snapdragon 888 with only 8 cores, the overhead for ptep_clear_flush is
huge even for paging out one page mapped by only one process: 5.36% a.out
[kernel.kallsyms] [k] ptep_clear_flush

While pages are mapped by multiple processes or HW has more CPUs, the cost
should become even higher due to the bad scalability of tlb shootdown. 
The same benchmark can result in 16.99% CPU consumption on ARM64 server
with around 100 cores according to the test on patch 4/4.

This patchset leverages the existing BATCHED_UNMAP_TLB_FLUSH by
1. only send tlbi instructions in the first stage -
	arch_tlbbatch_add_mm()
2. wait for the completion of tlbi by dsb while doing tlbbatch
	sync in arch_tlbbatch_flush()

Testing on snapdragon shows the overhead of ptep_clear_flush is removed by
the patchset.  The micro benchmark becomes 5% faster even for one page
mapped by single process on snapdragon 888.

Since BATCHED_UNMAP_TLB_FLUSH is implemented only on x86, the patchset
does some renaming/extension for the current implementation first (Patch
1-3), then add the support on arm64 (Patch 4).
		

This patch (of 4):

The entire scheme of deferred TLB flush in reclaim path rests on the fact
that the cost to refill TLB entries is less than flushing out individual
entries by sending IPI to remote CPUs.  But architecture can have
different ways to evaluate that.  Hence apart from checking
TTU_BATCH_FLUSH in the TTU flags, rest of the decision should be
architecture specific.

[yangyicong@hisilicon.com: rebase and fix incorrect return value type]
Link: https://lkml.kernel.org/r/20230717131004.12662-1-yangyicong@huawei.com
Link: https://lkml.kernel.org/r/20230717131004.12662-2-yangyicong@huawei.com
Signed-off-by: Anshuman Khandual <khandual@linux.vnet.ibm.com>
[https://lore.kernel.org/linuxppc-dev/20171101101735.2318-2-khandual@linux.vnet.ibm.com/]
Signed-off-by: Yicong Yang <yangyicong@hisilicon.com>
Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
Reviewed-by: Barry Song <baohua@kernel.org>
Reviewed-by: Xin Hao <xhao@linux.alibaba.com>
Tested-by: Punit Agrawal <punit.agrawal@bytedance.com>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Darren Hart <darren@os.amperecomputing.com>
Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: lipeifeng <lipeifeng@oppo.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Steven Miao <realmz6@gmail.com>
Cc: Will Deacon <will@kernel.org>
Cc: Zeng Tao <prime.zeng@hisilicon.com>
Cc: Barry Song <v-songbaohua@oppo.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Nadav Amit <namit@vmware.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 arch/x86/include/asm/tlbflush.h |   12 ++++++++++++
 mm/rmap.c                       |    9 +--------
 2 files changed, 13 insertions(+), 8 deletions(-)

--- a/arch/x86/include/asm/tlbflush.h~mm-tlbbatch-introduce-arch_tlbbatch_should_defer
+++ a/arch/x86/include/asm/tlbflush.h
@@ -253,6 +253,18 @@ static inline void flush_tlb_page(struct
 	flush_tlb_mm_range(vma->vm_mm, a, a + PAGE_SIZE, PAGE_SHIFT, false);
 }
 
+static inline bool arch_tlbbatch_should_defer(struct mm_struct *mm)
+{
+	bool should_defer = false;
+
+	/* If remote CPUs need to be flushed then defer batch the flush */
+	if (cpumask_any_but(mm_cpumask(mm), get_cpu()) < nr_cpu_ids)
+		should_defer = true;
+	put_cpu();
+
+	return should_defer;
+}
+
 static inline u64 inc_mm_tlb_gen(struct mm_struct *mm)
 {
 	/*
--- a/mm/rmap.c~mm-tlbbatch-introduce-arch_tlbbatch_should_defer
+++ a/mm/rmap.c
@@ -688,17 +688,10 @@ retry:
  */
 static bool should_defer_flush(struct mm_struct *mm, enum ttu_flags flags)
 {
-	bool should_defer = false;
-
 	if (!(flags & TTU_BATCH_FLUSH))
 		return false;
 
-	/* If remote CPUs need to be flushed then defer batch the flush */
-	if (cpumask_any_but(mm_cpumask(mm), get_cpu()) < nr_cpu_ids)
-		should_defer = true;
-	put_cpu();
-
-	return should_defer;
+	return arch_tlbbatch_should_defer(mm);
 }
 
 /*
_

Patches currently in -mm which might be from khandual@linux.vnet.ibm.com are



^ permalink raw reply	[flat|nested] only message in thread

only message in thread, other threads:[~2023-08-11 23:05 UTC | newest]

Thread overview: (only message) (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2023-08-11 23:01 [merged mm-stable] mm-tlbbatch-introduce-arch_tlbbatch_should_defer.patch removed from -mm tree Andrew Morton

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.