[RFC PATCH 0/4] possibilities for improving invalidations

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

* [RFC PATCH 0/4] possibilities for improving invalidations
@ 2018-07-25 15:52 Nicholas Piggin
  2018-07-25 15:52 ` [RFC PATCH 1/4] mm: munmap optimise single threaded page freeing Nicholas Piggin
                   ` (3 more replies)
  0 siblings, 4 replies; 5+ messages in thread
From: Nicholas Piggin @ 2018-07-25 15:52 UTC (permalink / raw)
  To: linux-mm; +Cc: Nicholas Piggin, linux-arch

I wonder if we could make some improvements to zapping pages to
reduce TLB flushes under PTL, and to single threaded pte updates
to reduce atomic operations.

This might require some changes to arch code, particularly the
last patch. I'd just like to see if I've missed something
fundamental with the mm or with pte/tlb behaviour.

Thanks,
Nick

Nicholas Piggin (4):
  mm: munmap optimise single threaded page freeing
  mm: zap_pte_range only flush under ptl if a dirty shared page was
    unmapped
  mm: zap_pte_range optimise fullmm handling for dirty shared pages
  mm: optimise flushing and pte manipulation for single threaded access

 include/asm-generic/tlb.h |  3 +++
 mm/huge_memory.c          |  4 ++--
 mm/madvise.c              |  4 ++--
 mm/memory.c               | 40 ++++++++++++++++++++++++++++++++-------
 4 files changed, 40 insertions(+), 11 deletions(-)

-- 
2.17.0

^ permalink raw reply	[flat|nested] 5+ messages in thread

* [RFC PATCH 1/4] mm: munmap optimise single threaded page freeing
  2018-07-25 15:52 [RFC PATCH 0/4] possibilities for improving invalidations Nicholas Piggin
@ 2018-07-25 15:52 ` Nicholas Piggin
  2018-07-25 15:52 ` [RFC PATCH 2/4] mm: zap_pte_range only flush under ptl if a dirty shared page was unmapped Nicholas Piggin
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 5+ messages in thread
From: Nicholas Piggin @ 2018-07-25 15:52 UTC (permalink / raw)
  To: linux-mm; +Cc: Nicholas Piggin, linux-arch

In case a single threaded process is zapping its own mappings, there
should be no concurrent memory accesses through the TLBs, and so it
is safe to free pages immediately rather than batch them up.
---
 mm/memory.c | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/mm/memory.c b/mm/memory.c
index 135d18b31e44..773d588b371d 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -296,6 +296,15 @@ bool __tlb_remove_page_size(struct mmu_gather *tlb, struct page *page, int page_
 	VM_BUG_ON(!tlb->end);
 	VM_WARN_ON(tlb->page_size != page_size);
 
+	/*
+	 * When this is our mm and there are no other users, there can not be
+	 * a concurrent memory access.
+	 */
+	if (current->mm == tlb->mm && atomic_read(&tlb->mm->mm_users) < 2) {
+		free_page_and_swap_cache(page);
+		return false;
+	}
+
 	batch = tlb->active;
 	/*
 	 * Add the page and check if we are full. If so
-- 
2.17.0

^ permalink raw reply related	[flat|nested] 5+ messages in thread

* [RFC PATCH 2/4] mm: zap_pte_range only flush under ptl if a dirty shared page was unmapped
  2018-07-25 15:52 [RFC PATCH 0/4] possibilities for improving invalidations Nicholas Piggin
  2018-07-25 15:52 ` [RFC PATCH 1/4] mm: munmap optimise single threaded page freeing Nicholas Piggin
@ 2018-07-25 15:52 ` Nicholas Piggin
  2018-07-25 15:52 ` [RFC PATCH 3/4] mm: zap_pte_range optimise fullmm handling for dirty shared pages Nicholas Piggin
  2018-07-25 15:52 ` [RFC PATCH 4/4] mm: optimise flushing and pte manipulation for single threaded access Nicholas Piggin
  3 siblings, 0 replies; 5+ messages in thread
From: Nicholas Piggin @ 2018-07-25 15:52 UTC (permalink / raw)
  To: linux-mm; +Cc: Nicholas Piggin, linux-arch

The force_flush is used for two cases, a tlb batch full, and a shared
dirty page unmapped. Only the latter is required to flush the TLB
under the page table lock, because the problem is page_mkclean returning
when there are still writable TLB entries the page can be modified with.

We are encountering cases of soft lockups due to high TLB flush latency
with very large guests. There is probably some contetion in hypervisor
and interconnect tuning to be done, and it's actually a hash MMU guest
which has a whole other set of issues, but I'm looking for general ways
to reduce TLB fushing under locks.
---
 mm/memory.c | 10 ++++++++--
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 773d588b371d..1161ed3f1d0b 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1281,6 +1281,7 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
 {
 	struct mm_struct *mm = tlb->mm;
 	int force_flush = 0;
+	int locked_flush = 0;
 	int rss[NR_MM_COUNTERS];
 	spinlock_t *ptl;
 	pte_t *start_pte;
@@ -1322,6 +1323,7 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
 			if (!PageAnon(page)) {
 				if (pte_dirty(ptent)) {
 					force_flush = 1;
+					locked_flush = 1;
 					set_page_dirty(page);
 				}
 				if (pte_young(ptent) &&
@@ -1384,7 +1386,7 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
 	arch_leave_lazy_mmu_mode();
 
 	/* Do the actual TLB flush before dropping ptl */
-	if (force_flush)
+	if (locked_flush)
 		tlb_flush_mmu_tlbonly(tlb);
 	pte_unmap_unlock(start_pte, ptl);
 
@@ -1395,8 +1397,12 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
 	 * memory too. Restart if we didn't do everything.
 	 */
 	if (force_flush) {
-		force_flush = 0;
+		if (!locked_flush)
+			tlb_flush_mmu_tlbonly(tlb);
 		tlb_flush_mmu_free(tlb);
+
+		force_flush = 0;
+		locked_flush = 0;
 		if (addr != end)
 			goto again;
 	}
-- 
2.17.0

^ permalink raw reply related	[flat|nested] 5+ messages in thread

* [RFC PATCH 3/4] mm: zap_pte_range optimise fullmm handling for dirty shared pages
  2018-07-25 15:52 [RFC PATCH 0/4] possibilities for improving invalidations Nicholas Piggin
  2018-07-25 15:52 ` [RFC PATCH 1/4] mm: munmap optimise single threaded page freeing Nicholas Piggin
  2018-07-25 15:52 ` [RFC PATCH 2/4] mm: zap_pte_range only flush under ptl if a dirty shared page was unmapped Nicholas Piggin
@ 2018-07-25 15:52 ` Nicholas Piggin
  2018-07-25 15:52 ` [RFC PATCH 4/4] mm: optimise flushing and pte manipulation for single threaded access Nicholas Piggin
  3 siblings, 0 replies; 5+ messages in thread
From: Nicholas Piggin @ 2018-07-25 15:52 UTC (permalink / raw)
  To: linux-mm; +Cc: Nicholas Piggin, linux-arch

Shared dirty pages do not need to be flushed under page table lock
for the fullmm case, because there will be no subsequent access
through the TLBs.
---
 mm/memory.c | 14 ++++++++++++--
 1 file changed, 12 insertions(+), 2 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 1161ed3f1d0b..490689909186 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1322,8 +1322,18 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
 
 			if (!PageAnon(page)) {
 				if (pte_dirty(ptent)) {
-					force_flush = 1;
-					locked_flush = 1;
+					/*
+					 * Page must be flushed from TLBs
+					 * before releasing PTL to synchronize
+					 * with page_mkclean and avoid another
+					 * thread writing to the page through
+					 * the old TLB after it was marked
+					 * clean.
+					 */
+					if (!tlb->fullmm) {
+						force_flush = 1;
+						locked_flush = 1;
+					}
 					set_page_dirty(page);
 				}
 				if (pte_young(ptent) &&
-- 
2.17.0

^ permalink raw reply related	[flat|nested] 5+ messages in thread

* [RFC PATCH 4/4] mm: optimise flushing and pte manipulation for single threaded access
  2018-07-25 15:52 [RFC PATCH 0/4] possibilities for improving invalidations Nicholas Piggin
                   ` (2 preceding siblings ...)
  2018-07-25 15:52 ` [RFC PATCH 3/4] mm: zap_pte_range optimise fullmm handling for dirty shared pages Nicholas Piggin
@ 2018-07-25 15:52 ` Nicholas Piggin
  3 siblings, 0 replies; 5+ messages in thread
From: Nicholas Piggin @ 2018-07-25 15:52 UTC (permalink / raw)
  To: linux-mm; +Cc: Nicholas Piggin, linux-arch

I think many tlb->fullmm users actually want to know if there could
be concurrent memory accesses via the mappings being invalidated. If
there is no risk of the page being modified after the pte is changed
and if the pte dirty/young bits won't be modified by the hardware
concurrently, there is no reason to defer page freeing or atomically
update ptes.

I haven't gone carefully through all archs, maybe there is some
exception. But for core mm code I haven't been able to see why the
single threaded case should be different than the fullmm case in
terms of the pte updates and tlb flushing.
---
 include/asm-generic/tlb.h |  3 +++
 mm/huge_memory.c          |  4 ++--
 mm/madvise.c              |  4 ++--
 mm/memory.c               | 13 +++++++------
 4 files changed, 14 insertions(+), 10 deletions(-)

diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h
index a55ef1425f0d..601c9fefda8e 100644
--- a/include/asm-generic/tlb.h
+++ b/include/asm-generic/tlb.h
@@ -100,6 +100,9 @@ struct mmu_gather {
 	/* we are in the middle of an operation to clear
 	 * a full mm and can make some optimizations */
 	unsigned int		fullmm : 1,
+	/* we have a single thread, current, which is active in the user
+	 * address space */
+				singlethread: 1,
 	/* we have performed an operation which
 	 * requires a complete flush of the tlb */
 				need_flush_all : 1;
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 1cd7c1a57a14..8e3958a811f0 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1703,7 +1703,7 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
 	 * operations.
 	 */
 	orig_pmd = pmdp_huge_get_and_clear_full(tlb->mm, addr, pmd,
-			tlb->fullmm);
+			tlb->singlethread);
 	tlb_remove_pmd_tlb_entry(tlb, pmd, addr);
 	if (vma_is_dax(vma)) {
 		if (arch_needs_pgtable_deposit())
@@ -1971,7 +1971,7 @@ int zap_huge_pud(struct mmu_gather *tlb, struct vm_area_struct *vma,
 	 * operations.
 	 */
 	orig_pud = pudp_huge_get_and_clear_full(tlb->mm, addr, pud,
-			tlb->fullmm);
+			tlb->singlethread);
 	tlb_remove_pud_tlb_entry(tlb, pud, addr);
 	if (vma_is_dax(vma)) {
 		spin_unlock(ptl);
diff --git a/mm/madvise.c b/mm/madvise.c
index 4d3c922ea1a1..d9e1f3ac8067 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -350,7 +350,7 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
 				continue;
 			nr_swap--;
 			free_swap_and_cache(entry);
-			pte_clear_not_present_full(mm, addr, pte, tlb->fullmm);
+			pte_clear_not_present_full(mm, addr, pte, tlb->singlethread);
 			continue;
 		}
 
@@ -417,7 +417,7 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
 			 * after pte clearing.
 			 */
 			ptent = ptep_get_and_clear_full(mm, addr, pte,
-							tlb->fullmm);
+							tlb->singlethread);
 
 			ptent = pte_mkold(ptent);
 			ptent = pte_mkclean(ptent);
diff --git a/mm/memory.c b/mm/memory.c
index 490689909186..471ba07bc7e3 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -221,8 +221,9 @@ void arch_tlb_gather_mmu(struct mmu_gather *tlb, struct mm_struct *mm,
 {
 	tlb->mm = mm;
 
-	/* Is it from 0 to ~0? */
 	tlb->fullmm     = !(start | (end+1));
+	tlb->singlethread = (mm == current->mm) &&
+				(atomic_read(&mm->mm_users) <= 1);
 	tlb->need_flush_all = 0;
 	tlb->local.next = NULL;
 	tlb->local.nr   = 0;
@@ -300,7 +301,7 @@ bool __tlb_remove_page_size(struct mmu_gather *tlb, struct page *page, int page_
 	 * When this is our mm and there are no other users, there can not be
 	 * a concurrent memory access.
 	 */
-	if (current->mm == tlb->mm && atomic_read(&tlb->mm->mm_users) < 2) {
+	if (tlb->singlethread) {
 		free_page_and_swap_cache(page);
 		return false;
 	}
@@ -1315,7 +1316,7 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
 					continue;
 			}
 			ptent = ptep_get_and_clear_full(mm, addr, pte,
-							tlb->fullmm);
+							tlb->singlethread);
 			tlb_remove_tlb_entry(tlb, pte, addr);
 			if (unlikely(!page))
 				continue;
@@ -1330,7 +1331,7 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
 					 * the old TLB after it was marked
 					 * clean.
 					 */
-					if (!tlb->fullmm) {
+					if (!tlb->singlethread) {
 						force_flush = 1;
 						locked_flush = 1;
 					}
@@ -1367,7 +1368,7 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
 					continue;
 			}
 
-			pte_clear_not_present_full(mm, addr, pte, tlb->fullmm);
+			pte_clear_not_present_full(mm, addr, pte, tlb->singlethread);
 			rss[mm_counter(page)]--;
 			page_remove_rmap(page, false);
 			put_page(page);
@@ -1389,7 +1390,7 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
 		}
 		if (unlikely(!free_swap_and_cache(entry)))
 			print_bad_pte(vma, addr, ptent, NULL);
-		pte_clear_not_present_full(mm, addr, pte, tlb->fullmm);
+		pte_clear_not_present_full(mm, addr, pte, tlb->singlethread);
 	} while (pte++, addr += PAGE_SIZE, addr != end);
 
 	add_mm_rss_vec(mm, rss);
-- 
2.17.0

^ permalink raw reply related	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2018-07-25 15:53 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2018-07-25 15:52 [RFC PATCH 0/4] possibilities for improving invalidations Nicholas Piggin
2018-07-25 15:52 ` [RFC PATCH 1/4] mm: munmap optimise single threaded page freeing Nicholas Piggin
2018-07-25 15:52 ` [RFC PATCH 2/4] mm: zap_pte_range only flush under ptl if a dirty shared page was unmapped Nicholas Piggin
2018-07-25 15:52 ` [RFC PATCH 3/4] mm: zap_pte_range optimise fullmm handling for dirty shared pages Nicholas Piggin
2018-07-25 15:52 ` [RFC PATCH 4/4] mm: optimise flushing and pte manipulation for single threaded access Nicholas Piggin

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).