Re: Dirty/Access bits vs. page content

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

* Re: Dirty/Access bits vs. page content
       [not found]             ` <53559F48.8040808@intel.com>
@ 2014-04-22  0:31               ` Linus Torvalds
  2014-04-22  0:44                 ` Linus Torvalds
  0 siblings, 1 reply; 49+ messages in thread
From: Linus Torvalds @ 2014-04-22  0:31 UTC (permalink / raw)
  To: Dave Hansen
  Cc: H. Peter Anvin, Benjamin Herrenschmidt, Peter Zijlstra,
	linux-arch@vger.kernel.org, linux-mm

[-- Attachment #1: Type: text/plain, Size: 1854 bytes --]

On Mon, Apr 21, 2014 at 3:44 PM, Dave Hansen <dave.hansen@intel.com> wrote:
>
> I came up with something pretty similar to what you've got.  I used some
> local variables for the dirty state rather than using the pte, but
> otherwise looks pretty similar.  It actually boots, runs, and
> superficially looks to be doing the right thing.

.. except your version doesn't seem to have a chance of even compiling
on anything that doesn't use asm-generic/tlb.h and thus
HAVE_GENERIC_MMU_GATHER.

Now, I don't know that mine works either, but at least I tried. I've
love to hear if somebody who has a cross-compile environment set up
for the non-generic architectures. I tried 'um', but we have at least
arm, ia64, s390 and sh that don't use the generic mmu gather logic.

I'm not entirely sure why ARM doesn't do the generic one, but I think
s390 is TLB-coherent at the ptep_get_and_clear() point, so there just
doing the set_page_dirty() is fine (assuming it compiles - there could
be some header file ordering issue).

> I fixed free_pages_and_swap_cache() but just making a first pass through
> the array and clearing the bits.

Yeah. I have to say, I think it's a bit ugly.

I am personally starting to think that we could just make
release_pages() ignore the low bit of the "struct page" pointer in the
array it is passed in, and then free_pages_and_swap_cache() could
easily just do the "set_page_dirty()" in the loop it already does.

Now, I agree that that is certainly *also* a bit ugly, but it ends up
simplifying everything else, so it's a preferable kind of ugly to me.

So here's a suggested *incremental* patch (on top of my previous patch
that did the interface change) that does that.

Does this work for people? It *looks* sane. It compiles for me (tested
on x86 that uses generic mmu gather, and on UM that does not).

                Linus

[-- Attachment #2: patch.diff --]
[-- Type: text/plain, Size: 2397 bytes --]

 mm/memory.c     |  5 +----
 mm/swap.c       |  8 +++++++-
 mm/swap_state.c | 14 ++++++++++++--
 3 files changed, 20 insertions(+), 7 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 62fdcd1995f4..174542ab2b90 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -283,11 +283,8 @@ int __tlb_remove_page(struct mmu_gather *tlb, struct page *page, bool dirty)
 
 	VM_BUG_ON(!tlb->need_flush);
 
-	/* FIXME! This needs to be batched too */
-	if (dirty)
-		set_page_dirty(page);
 	batch = tlb->active;
-	batch->pages[batch->nr++] = page;
+	batch->pages[batch->nr++] = (void *) (dirty + (unsigned long)page);
 	if (batch->nr == batch->max) {
 		if (!tlb_next_batch(tlb))
 			return 0;
diff --git a/mm/swap.c b/mm/swap.c
index 9ce43ba4498b..1a58c58c7f41 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -821,8 +821,14 @@ void release_pages(struct page **pages, int nr, int cold)
 	struct lruvec *lruvec;
 	unsigned long uninitialized_var(flags);
 
+	/*
+	 * NOTE! The low bit of the struct page pointer in
+	 * the "pages[]" array is used as a dirty bit, so
+	 * we ignore it
+	 */
 	for (i = 0; i < nr; i++) {
-		struct page *page = pages[i];
+		unsigned long pageval = (unsigned long)pages[i];
+		struct page *page = (void *)(~1ul & pageval);
 
 		if (unlikely(PageCompound(page))) {
 			if (zone) {
diff --git a/mm/swap_state.c b/mm/swap_state.c
index e76ace30d436..bb0b2d675a82 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -258,6 +258,11 @@ void free_page_and_swap_cache(struct page *page)
 /*
  * Passed an array of pages, drop them all from swapcache and then release
  * them.  They are removed from the LRU and freed if this is their last use.
+ *
+ * NOTE! The low bit of the "struct page" pointers passed in is a dirty
+ * indicator, saying that the page needs to be marked dirty before freeing.
+ *
+ * release_pages() itself ignores that bit.
  */
 void free_pages_and_swap_cache(struct page **pages, int nr)
 {
@@ -268,8 +273,13 @@ void free_pages_and_swap_cache(struct page **pages, int nr)
 		int todo = min(nr, PAGEVEC_SIZE);
 		int i;
 
-		for (i = 0; i < todo; i++)
-			free_swap_cache(pagep[i]);
+		for (i = 0; i < todo; i++) {
+			unsigned long pageval = (unsigned long) pagep[i];
+			struct page *page = (void *)(~1ul & pageval);
+			if (pageval & 1)
+				set_page_dirty(page);
+			free_swap_cache(page);
+		}
 		release_pages(pagep, todo, 0);
 		pagep += todo;
 		nr -= todo;

^ permalink raw reply related	[flat|nested] 49+ messages in thread

* Re: Dirty/Access bits vs. page content
  2014-04-22  0:31               ` Dirty/Access bits vs. page content Linus Torvalds
@ 2014-04-22  0:44                 ` Linus Torvalds
  2014-04-22  5:15                   ` Tony Luck
                                     ` (2 more replies)
  0 siblings, 3 replies; 49+ messages in thread
From: Linus Torvalds @ 2014-04-22  0:44 UTC (permalink / raw)
  To: Dave Hansen
  Cc: H. Peter Anvin, Benjamin Herrenschmidt, Peter Zijlstra,
	linux-arch@vger.kernel.org, linux-mm, Russell King - ARM Linux,
	Tony Luck

[-- Attachment #1: Type: text/plain, Size: 904 bytes --]

On Mon, Apr 21, 2014 at 5:31 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> Does this work for people? It *looks* sane. It compiles for me (tested
> on x86 that uses generic mmu gather, and on UM that does not).

Just to make it easier for people to test, here's both of them with
commit logs. I've committed them in my tree, but not pushed out, so I
can still edit them or add acks and tested-by's.

I think Russell and Tony are both on linux-arch, so they probably saw
at least part of this discussion flow past already, but just in case
I'm adding them explicitly to the cc, because both ia64 and arm seem
to implement their own version of TLB batching rather than use the
generic code. I *thought* the generic code was supposed to be generic
enough for arm and ia64 too, but if not, they'd need to do the proper
dirty bit batching in those private implementations.

             Linus

[-- Attachment #2: 0001-mm-move-page-table-dirty-state-into-TLB-gather-opera.patch --]
[-- Type: text/x-patch, Size: 9364 bytes --]

From 21819f790e3d206ad77cd20d6e7cae86311fc87d Mon Sep 17 00:00:00 2001
From: Linus Torvalds <torvalds@linux-foundation.org>
Date: Mon, 21 Apr 2014 15:29:49 -0700
Subject: [PATCH 1/2] mm: move page table dirty state into TLB gather operation

When tearing down a memory mapping, we have long delayed the actual
freeing of the pages until after the (batched) TLB flush, since only
after the TLB entries have been flushed from all CPU's do we know that
none of the pages will be accessed any more.

HOWEVER.

Ben Herrenschmidt points out that we need to do the same thing for
marking a shared mapped page dirty.  Because if we mark the underlying
page dirty before we have flushed the TLB's, other CPU's may happily
continue to write to the page (using their stale TLB contents) after
we've marked the page dirty, and they can thus race with any cleaning
operation.

Now, in practice, any page cleaning operations will take much longer to
start the IO on the page than it will have taken us to get to the TLB
flush, so this is going to be hard to trigger in real life.  In fact, so
far nobody has even come up with a reasonable test-case for this to show
it happening.

But what we do now (set_page_dirty() before flushing the TLB) really is
wrong.  And this commit does not fix it, but by moving the dirty
handling into the TLB gather operation at least the internal interfaces
now support the notion of those TLB gather interfaces doing the rigth
thing.

Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Peter Anvin <hpa@zytor.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: linux-arch@vger.kernel.org
Cc: linux-mm@kvack.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
---
 arch/arm/include/asm/tlb.h  |  6 ++++--
 arch/ia64/include/asm/tlb.h |  6 ++++--
 arch/s390/include/asm/tlb.h |  4 +++-
 arch/sh/include/asm/tlb.h   |  6 ++++--
 arch/um/include/asm/tlb.h   |  6 ++++--
 include/asm-generic/tlb.h   |  4 ++--
 mm/hugetlb.c                |  4 +---
 mm/memory.c                 | 15 +++++++++------
 8 files changed, 31 insertions(+), 20 deletions(-)

diff --git a/arch/arm/include/asm/tlb.h b/arch/arm/include/asm/tlb.h
index 0baf7f0d9394..ac9c16af8e63 100644
--- a/arch/arm/include/asm/tlb.h
+++ b/arch/arm/include/asm/tlb.h
@@ -165,8 +165,10 @@ tlb_end_vma(struct mmu_gather *tlb, struct vm_area_struct *vma)
 		tlb_flush(tlb);
 }
 
-static inline int __tlb_remove_page(struct mmu_gather *tlb, struct page *page)
+static inline int __tlb_remove_page(struct mmu_gather *tlb, struct page *page, bool dirty)
 {
+	if (dirty)
+		set_page_dirty(page);
 	tlb->pages[tlb->nr++] = page;
 	VM_BUG_ON(tlb->nr > tlb->max);
 	return tlb->max - tlb->nr;
@@ -174,7 +176,7 @@ static inline int __tlb_remove_page(struct mmu_gather *tlb, struct page *page)
 
 static inline void tlb_remove_page(struct mmu_gather *tlb, struct page *page)
 {
-	if (!__tlb_remove_page(tlb, page))
+	if (!__tlb_remove_page(tlb, page, 0))
 		tlb_flush_mmu(tlb);
 }
 
diff --git a/arch/ia64/include/asm/tlb.h b/arch/ia64/include/asm/tlb.h
index bc5efc7c3f3f..9049a7d6427d 100644
--- a/arch/ia64/include/asm/tlb.h
+++ b/arch/ia64/include/asm/tlb.h
@@ -193,8 +193,10 @@ tlb_finish_mmu(struct mmu_gather *tlb, unsigned long start, unsigned long end)
  * must be delayed until after the TLB has been flushed (see comments at the beginning of
  * this file).
  */
-static inline int __tlb_remove_page(struct mmu_gather *tlb, struct page *page)
+static inline int __tlb_remove_page(struct mmu_gather *tlb, struct page *page, bool dirty)
 {
+	if (dirty)
+		set_page_dirty(page);
 	tlb->need_flush = 1;
 
 	if (!tlb->nr && tlb->pages == tlb->local)
@@ -213,7 +215,7 @@ static inline void tlb_flush_mmu(struct mmu_gather *tlb)
 
 static inline void tlb_remove_page(struct mmu_gather *tlb, struct page *page)
 {
-	if (!__tlb_remove_page(tlb, page))
+	if (!__tlb_remove_page(tlb, page, 0))
 		tlb_flush_mmu(tlb);
 }
 
diff --git a/arch/s390/include/asm/tlb.h b/arch/s390/include/asm/tlb.h
index c544b6f05d95..8107a8c53985 100644
--- a/arch/s390/include/asm/tlb.h
+++ b/arch/s390/include/asm/tlb.h
@@ -76,8 +76,10 @@ static inline void tlb_finish_mmu(struct mmu_gather *tlb,
  * tlb_ptep_clear_flush. In both flush modes the tlb for a page cache page
  * has already been freed, so just do free_page_and_swap_cache.
  */
-static inline int __tlb_remove_page(struct mmu_gather *tlb, struct page *page)
+static inline int __tlb_remove_page(struct mmu_gather *tlb, struct page *page, bool dirty)
 {
+	if (dirty)
+		set_page_dirty(page);
 	free_page_and_swap_cache(page);
 	return 1; /* avoid calling tlb_flush_mmu */
 }
diff --git a/arch/sh/include/asm/tlb.h b/arch/sh/include/asm/tlb.h
index 362192ed12fe..428de7858d27 100644
--- a/arch/sh/include/asm/tlb.h
+++ b/arch/sh/include/asm/tlb.h
@@ -90,15 +90,17 @@ static inline void tlb_flush_mmu(struct mmu_gather *tlb)
 {
 }
 
-static inline int __tlb_remove_page(struct mmu_gather *tlb, struct page *page)
+static inline int __tlb_remove_page(struct mmu_gather *tlb, struct page *page, int dirty)
 {
+	if (dirty)
+		set_page_dirty(page);
 	free_page_and_swap_cache(page);
 	return 1; /* avoid calling tlb_flush_mmu */
 }
 
 static inline void tlb_remove_page(struct mmu_gather *tlb, struct page *page)
 {
-	__tlb_remove_page(tlb, page);
+	__tlb_remove_page(tlb, page, 0);
 }
 
 #define pte_free_tlb(tlb, ptep, addr)	pte_free((tlb)->mm, ptep)
diff --git a/arch/um/include/asm/tlb.h b/arch/um/include/asm/tlb.h
index 29b0301c18aa..df557f987819 100644
--- a/arch/um/include/asm/tlb.h
+++ b/arch/um/include/asm/tlb.h
@@ -86,8 +86,10 @@ tlb_finish_mmu(struct mmu_gather *tlb, unsigned long start, unsigned long end)
  *	while handling the additional races in SMP caused by other CPUs
  *	caching valid mappings in their TLBs.
  */
-static inline int __tlb_remove_page(struct mmu_gather *tlb, struct page *page)
+static inline int __tlb_remove_page(struct mmu_gather *tlb, struct page *page, bool dirty)
 {
+	if (dirty)
+		set_page_dirty(page);
 	tlb->need_flush = 1;
 	free_page_and_swap_cache(page);
 	return 1; /* avoid calling tlb_flush_mmu */
@@ -95,7 +97,7 @@ static inline int __tlb_remove_page(struct mmu_gather *tlb, struct page *page)
 
 static inline void tlb_remove_page(struct mmu_gather *tlb, struct page *page)
 {
-	__tlb_remove_page(tlb, page);
+	__tlb_remove_page(tlb, page, 0);
 }
 
 /**
diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h
index 5672d7ea1fa0..541c563f0ff9 100644
--- a/include/asm-generic/tlb.h
+++ b/include/asm-generic/tlb.h
@@ -116,7 +116,7 @@ void tlb_gather_mmu(struct mmu_gather *tlb, struct mm_struct *mm, unsigned long
 void tlb_flush_mmu(struct mmu_gather *tlb);
 void tlb_finish_mmu(struct mmu_gather *tlb, unsigned long start,
 							unsigned long end);
-int __tlb_remove_page(struct mmu_gather *tlb, struct page *page);
+int __tlb_remove_page(struct mmu_gather *tlb, struct page *page, bool dirty);
 
 /* tlb_remove_page
  *	Similar to __tlb_remove_page but will call tlb_flush_mmu() itself when
@@ -124,7 +124,7 @@ int __tlb_remove_page(struct mmu_gather *tlb, struct page *page);
  */
 static inline void tlb_remove_page(struct mmu_gather *tlb, struct page *page)
 {
-	if (!__tlb_remove_page(tlb, page))
+	if (!__tlb_remove_page(tlb, page, 0))
 		tlb_flush_mmu(tlb);
 }
 
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 246192929a2d..dfbe23c0c200 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -2516,11 +2516,9 @@ again:
 
 		pte = huge_ptep_get_and_clear(mm, address, ptep);
 		tlb_remove_tlb_entry(tlb, ptep, address);
-		if (huge_pte_dirty(pte))
-			set_page_dirty(page);
 
 		page_remove_rmap(page);
-		force_flush = !__tlb_remove_page(tlb, page);
+		force_flush = !__tlb_remove_page(tlb, page, huge_pte_dirty(pte));
 		if (force_flush) {
 			spin_unlock(ptl);
 			break;
diff --git a/mm/memory.c b/mm/memory.c
index d0f0bef3be48..62fdcd1995f4 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -277,12 +277,15 @@ void tlb_finish_mmu(struct mmu_gather *tlb, unsigned long start, unsigned long e
  *	mappings in their TLBs. Returns the number of free page slots left.
  *	When out of page slots we must call tlb_flush_mmu().
  */
-int __tlb_remove_page(struct mmu_gather *tlb, struct page *page)
+int __tlb_remove_page(struct mmu_gather *tlb, struct page *page, bool dirty)
 {
 	struct mmu_gather_batch *batch;
 
 	VM_BUG_ON(!tlb->need_flush);
 
+	/* FIXME! This needs to be batched too */
+	if (dirty)
+		set_page_dirty(page);
 	batch = tlb->active;
 	batch->pages[batch->nr++] = page;
 	if (batch->nr == batch->max) {
@@ -1124,11 +1127,11 @@ again:
 					pte_file_mksoft_dirty(ptfile);
 				set_pte_at(mm, addr, pte, ptfile);
 			}
-			if (PageAnon(page))
+			if (PageAnon(page)) {
+				/* We don't bother saving dirty state for anonymous pages */
+				ptent = pte_mkclean(ptent);
 				rss[MM_ANONPAGES]--;
-			else {
-				if (pte_dirty(ptent))
-					set_page_dirty(page);
+			} else {
 				if (pte_young(ptent) &&
 				    likely(!(vma->vm_flags & VM_SEQ_READ)))
 					mark_page_accessed(page);
@@ -1137,7 +1140,7 @@ again:
 			page_remove_rmap(page);
 			if (unlikely(page_mapcount(page) < 0))
 				print_bad_pte(vma, addr, ptent, page);
-			force_flush = !__tlb_remove_page(tlb, page);
+			force_flush = !__tlb_remove_page(tlb, page, pte_dirty(ptent));
 			if (force_flush)
 				break;
 			continue;
-- 
1.9.1.377.g96e67c8.dirty


[-- Attachment #3: 0002-mm-make-the-generic-TLB-flush-batching-correctly-dir.patch --]
[-- Type: text/x-patch, Size: 3264 bytes --]

From d26515fe19d5850aa69881ee6ae193e068f22ba1 Mon Sep 17 00:00:00 2001
From: Linus Torvalds <torvalds@linux-foundation.org>
Date: Mon, 21 Apr 2014 17:35:35 -0700
Subject: [PATCH 2/2] mm: make the generic TLB flush batching correctly dirty
 the page at the end

When unmapping dirty shared mappings, the page should be dirtied after
doing the TLB flush.  This does that by hiding the dirty bit in the low
bit of the "struct page" pointer in the TLB gather batching array, and
teaching free_pages_and_swap_cache() to mark the pages dirty at the end.

Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Peter Anvin <hpa@zytor.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: linux-arch@vger.kernel.org
Cc: linux-mm@kvack.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
---
 mm/memory.c     |  5 +----
 mm/swap.c       |  8 +++++++-
 mm/swap_state.c | 14 ++++++++++++--
 3 files changed, 20 insertions(+), 7 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 62fdcd1995f4..174542ab2b90 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -283,11 +283,8 @@ int __tlb_remove_page(struct mmu_gather *tlb, struct page *page, bool dirty)
 
 	VM_BUG_ON(!tlb->need_flush);
 
-	/* FIXME! This needs to be batched too */
-	if (dirty)
-		set_page_dirty(page);
 	batch = tlb->active;
-	batch->pages[batch->nr++] = page;
+	batch->pages[batch->nr++] = (void *) (dirty + (unsigned long)page);
 	if (batch->nr == batch->max) {
 		if (!tlb_next_batch(tlb))
 			return 0;
diff --git a/mm/swap.c b/mm/swap.c
index 9ce43ba4498b..1a58c58c7f41 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -821,8 +821,14 @@ void release_pages(struct page **pages, int nr, int cold)
 	struct lruvec *lruvec;
 	unsigned long uninitialized_var(flags);
 
+	/*
+	 * NOTE! The low bit of the struct page pointer in
+	 * the "pages[]" array is used as a dirty bit, so
+	 * we ignore it
+	 */
 	for (i = 0; i < nr; i++) {
-		struct page *page = pages[i];
+		unsigned long pageval = (unsigned long)pages[i];
+		struct page *page = (void *)(~1ul & pageval);
 
 		if (unlikely(PageCompound(page))) {
 			if (zone) {
diff --git a/mm/swap_state.c b/mm/swap_state.c
index e76ace30d436..bb0b2d675a82 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -258,6 +258,11 @@ void free_page_and_swap_cache(struct page *page)
 /*
  * Passed an array of pages, drop them all from swapcache and then release
  * them.  They are removed from the LRU and freed if this is their last use.
+ *
+ * NOTE! The low bit of the "struct page" pointers passed in is a dirty
+ * indicator, saying that the page needs to be marked dirty before freeing.
+ *
+ * release_pages() itself ignores that bit.
  */
 void free_pages_and_swap_cache(struct page **pages, int nr)
 {
@@ -268,8 +273,13 @@ void free_pages_and_swap_cache(struct page **pages, int nr)
 		int todo = min(nr, PAGEVEC_SIZE);
 		int i;
 
-		for (i = 0; i < todo; i++)
-			free_swap_cache(pagep[i]);
+		for (i = 0; i < todo; i++) {
+			unsigned long pageval = (unsigned long) pagep[i];
+			struct page *page = (void *)(~1ul & pageval);
+			if (pageval & 1)
+				set_page_dirty(page);
+			free_swap_cache(page);
+		}
 		release_pages(pagep, todo, 0);
 		pagep += todo;
 		nr -= todo;
-- 
1.9.1.377.g96e67c8.dirty


^ permalink raw reply related	[flat|nested] 49+ messages in thread

* Re: Dirty/Access bits vs. page content
  2014-04-22  0:44                 ` Linus Torvalds
@ 2014-04-22  5:15                   ` Tony Luck
  2014-04-22 14:55                     ` Linus Torvalds
  2014-04-22  7:34                   ` Peter Zijlstra
  2014-04-22  7:54                   ` Peter Zijlstra
  2 siblings, 1 reply; 49+ messages in thread
From: Tony Luck @ 2014-04-22  5:15 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Dave Hansen, H. Peter Anvin, Benjamin Herrenschmidt,
	Peter Zijlstra, linux-arch@vger.kernel.org, linux-mm,
	Russell King - ARM Linux

On Mon, Apr 21, 2014 at 5:44 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> I think Russell and Tony are both on linux-arch, so they probably saw
> at least part of this discussion flow past already, but just in case
> I'm adding them explicitly to the cc, because both ia64 and arm seem
> to implement their own version of TLB batching rather than use the
> generic code.

It builds and boots on ia64 with no new warnings.  I haven't done
anything more stressful than booting though - so unsure whether
there are any corners cases that might show up under load.

-Tony

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: Dirty/Access bits vs. page content
  2014-04-22  5:15                   ` Tony Luck
@ 2014-04-22 14:55                     ` Linus Torvalds
  0 siblings, 0 replies; 49+ messages in thread
From: Linus Torvalds @ 2014-04-22 14:55 UTC (permalink / raw)
  To: Tony Luck
  Cc: Dave Hansen, H. Peter Anvin, Benjamin Herrenschmidt,
	Peter Zijlstra, linux-arch@vger.kernel.org, linux-mm,
	Russell King - ARM Linux

On Mon, Apr 21, 2014 at 10:15 PM, Tony Luck <tony.luck@gmail.com> wrote:
>
> It builds and boots on ia64 with no new warnings.  I haven't done
> anything more stressful than booting though - so unsure whether
> there are any corners cases that might show up under load.

Thanks. It shouldn't actually change any behavior on ia64 (you'd have
to do any dirty bit batching yourself), I was mainly worried about
compile warnings due to not having the set_page_dirty() declaration
due to some odd header file issues.

               Linus

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: Dirty/Access bits vs. page content
  2014-04-22  0:44                 ` Linus Torvalds
  2014-04-22  5:15                   ` Tony Luck
@ 2014-04-22  7:34                   ` Peter Zijlstra
  2014-04-22  7:54                   ` Peter Zijlstra
  2 siblings, 0 replies; 49+ messages in thread
From: Peter Zijlstra @ 2014-04-22  7:34 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Dave Hansen, H. Peter Anvin, Benjamin Herrenschmidt,
	linux-arch@vger.kernel.org, linux-mm, Russell King - ARM Linux,
	Tony Luck, kirill.shutemov

> From 21819f790e3d206ad77cd20d6e7cae86311fc87d Mon Sep 17 00:00:00 2001
> From: Linus Torvalds <torvalds@linux-foundation.org>
> Date: Mon, 21 Apr 2014 15:29:49 -0700
> Subject: [PATCH 1/2] mm: move page table dirty state into TLB gather operation
> 
> When tearing down a memory mapping, we have long delayed the actual
> freeing of the pages until after the (batched) TLB flush, since only
> after the TLB entries have been flushed from all CPU's do we know that
> none of the pages will be accessed any more.
> 
> HOWEVER.
> 
> Ben Herrenschmidt points out that we need to do the same thing for
> marking a shared mapped page dirty.  Because if we mark the underlying
> page dirty before we have flushed the TLB's, other CPU's may happily
> continue to write to the page (using their stale TLB contents) after
> we've marked the page dirty, and they can thus race with any cleaning
> operation.
> 
> Now, in practice, any page cleaning operations will take much longer to
> start the IO on the page than it will have taken us to get to the TLB
> flush, so this is going to be hard to trigger in real life.  In fact, so
> far nobody has even come up with a reasonable test-case for this to show
> it happening.
> 
> But what we do now (set_page_dirty() before flushing the TLB) really is
> wrong.  And this commit does not fix it, but by moving the dirty
> handling into the TLB gather operation at least the internal interfaces
> now support the notion of those TLB gather interfaces doing the rigth
> thing.
> 
> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
> Cc: Peter Anvin <hpa@zytor.com>

Acked-by: Peter Zijlstra <peterz@infradead.org>

> Cc: Dave Hansen <dave.hansen@intel.com>
> Cc: linux-arch@vger.kernel.org
> Cc: linux-mm@kvack.org
> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
> ---
>  arch/arm/include/asm/tlb.h  |  6 ++++--
>  arch/ia64/include/asm/tlb.h |  6 ++++--
>  arch/s390/include/asm/tlb.h |  4 +++-
>  arch/sh/include/asm/tlb.h   |  6 ++++--
>  arch/um/include/asm/tlb.h   |  6 ++++--
>  include/asm-generic/tlb.h   |  4 ++--
>  mm/hugetlb.c                |  4 +---
>  mm/memory.c                 | 15 +++++++++------
>  8 files changed, 31 insertions(+), 20 deletions(-)
> 
> diff --git a/arch/arm/include/asm/tlb.h b/arch/arm/include/asm/tlb.h
> index 0baf7f0d9394..ac9c16af8e63 100644
> --- a/arch/arm/include/asm/tlb.h
> +++ b/arch/arm/include/asm/tlb.h
> @@ -165,8 +165,10 @@ tlb_end_vma(struct mmu_gather *tlb, struct vm_area_struct *vma)
>  		tlb_flush(tlb);
>  }
>  
> -static inline int __tlb_remove_page(struct mmu_gather *tlb, struct page *page)
> +static inline int __tlb_remove_page(struct mmu_gather *tlb, struct page *page, bool dirty)
>  {
> +	if (dirty)
> +		set_page_dirty(page);
>  	tlb->pages[tlb->nr++] = page;
>  	VM_BUG_ON(tlb->nr > tlb->max);
>  	return tlb->max - tlb->nr;
> @@ -174,7 +176,7 @@ static inline int __tlb_remove_page(struct mmu_gather *tlb, struct page *page)
>  
>  static inline void tlb_remove_page(struct mmu_gather *tlb, struct page *page)
>  {
> -	if (!__tlb_remove_page(tlb, page))
> +	if (!__tlb_remove_page(tlb, page, 0))
>  		tlb_flush_mmu(tlb);
>  }

So I checked this, and currently the only users of tlb_remove_page() are
the archs for freeing the page table pages and THP. The latter is OK
because it is strictly Anon (for now).

Anybody (/me looks at Kiryl) thinking of making THP work for shared
pages should also cure this.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: Dirty/Access bits vs. page content
  2014-04-22  0:44                 ` Linus Torvalds
  2014-04-22  5:15                   ` Tony Luck
  2014-04-22  7:34                   ` Peter Zijlstra
@ 2014-04-22  7:54                   ` Peter Zijlstra
  2014-04-22 21:36                     ` Linus Torvalds
  2 siblings, 1 reply; 49+ messages in thread
From: Peter Zijlstra @ 2014-04-22  7:54 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Dave Hansen, H. Peter Anvin, Benjamin Herrenschmidt,
	linux-arch@vger.kernel.org, linux-mm, Russell King - ARM Linux,
	Tony Luck

On Mon, Apr 21, 2014 at 05:44:45PM -0700, Linus Torvalds wrote:
> From d26515fe19d5850aa69881ee6ae193e068f22ba1 Mon Sep 17 00:00:00 2001
> From: Linus Torvalds <torvalds@linux-foundation.org>
> Date: Mon, 21 Apr 2014 17:35:35 -0700
> Subject: [PATCH 2/2] mm: make the generic TLB flush batching correctly dirty
>  the page at the end
> 
> When unmapping dirty shared mappings, the page should be dirtied after
> doing the TLB flush.  This does that by hiding the dirty bit in the low
> bit of the "struct page" pointer in the TLB gather batching array, and
> teaching free_pages_and_swap_cache() to mark the pages dirty at the end.
> 
> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
> Cc: Peter Anvin <hpa@zytor.com>

Acked-by: Peter Zijlstra <peterz@infradead.org>

> Cc: Dave Hansen <dave.hansen@intel.com>
> Cc: linux-arch@vger.kernel.org
> Cc: linux-mm@kvack.org
> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
> ---
>  mm/memory.c     |  5 +----
>  mm/swap.c       |  8 +++++++-
>  mm/swap_state.c | 14 ++++++++++++--
>  3 files changed, 20 insertions(+), 7 deletions(-)
> 
> diff --git a/mm/memory.c b/mm/memory.c
> index 62fdcd1995f4..174542ab2b90 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -283,11 +283,8 @@ int __tlb_remove_page(struct mmu_gather *tlb, struct page *page, bool dirty)
>  
>  	VM_BUG_ON(!tlb->need_flush);
>  
> -	/* FIXME! This needs to be batched too */
> -	if (dirty)
> -		set_page_dirty(page);
>  	batch = tlb->active;
> -	batch->pages[batch->nr++] = page;
> +	batch->pages[batch->nr++] = (void *) (dirty + (unsigned long)page);

Space between cast and expression.

>  	if (batch->nr == batch->max) {
>  		if (!tlb_next_batch(tlb))
>  			return 0;
> diff --git a/mm/swap.c b/mm/swap.c
> index 9ce43ba4498b..1a58c58c7f41 100644
> --- a/mm/swap.c
> +++ b/mm/swap.c
> @@ -821,8 +821,14 @@ void release_pages(struct page **pages, int nr, int cold)
>  	struct lruvec *lruvec;
>  	unsigned long uninitialized_var(flags);
>  
> +	/*
> +	 * NOTE! The low bit of the struct page pointer in
> +	 * the "pages[]" array is used as a dirty bit, so
> +	 * we ignore it
> +	 */
>  	for (i = 0; i < nr; i++) {
> -		struct page *page = pages[i];
> +		unsigned long pageval = (unsigned long)pages[i];
> +		struct page *page = (void *)(~1ul & pageval);

No space between cast and expression.

Should we create some pointer bitops helpers? We do this casting all
over the place, maybe its time to make it pretty?

static inline void *ptr_or(void *ptr, unsigned long val)
{
	WARN_ON(val & ~0x03); /* too bad __alignof__ is 'broken' */
	return (void *)((unsigned long)ptr | val);
}

static inline void *ptr_mask(void *ptr)
{
	return (void *)((unsigned long)ptr & ~0x03);
}

static inline unsigned long ptr_and(void *ptr, unsigned long val)
{
	WARN_ON(val & ~0x03);
	return (unsigned long)ptr & val;
}

>  		if (unlikely(PageCompound(page))) {
>  			if (zone) {
> diff --git a/mm/swap_state.c b/mm/swap_state.c
> index e76ace30d436..bb0b2d675a82 100644
> --- a/mm/swap_state.c
> +++ b/mm/swap_state.c
> @@ -258,6 +258,11 @@ void free_page_and_swap_cache(struct page *page)
>  /*
>   * Passed an array of pages, drop them all from swapcache and then release
>   * them.  They are removed from the LRU and freed if this is their last use.
> + *
> + * NOTE! The low bit of the "struct page" pointers passed in is a dirty
> + * indicator, saying that the page needs to be marked dirty before freeing.
> + *
> + * release_pages() itself ignores that bit.
>   */
>  void free_pages_and_swap_cache(struct page **pages, int nr)
>  {
> @@ -268,8 +273,13 @@ void free_pages_and_swap_cache(struct page **pages, int nr)
>  		int todo = min(nr, PAGEVEC_SIZE);
>  		int i;
>  
> -		for (i = 0; i < todo; i++)
> -			free_swap_cache(pagep[i]);
> +		for (i = 0; i < todo; i++) {
> +			unsigned long pageval = (unsigned long) pagep[i];
> +			struct page *page = (void *)(~1ul & pageval);
> +			if (pageval & 1)
> +				set_page_dirty(page);
> +			free_swap_cache(page);
> +		}
>  		release_pages(pagep, todo, 0);
>  		pagep += todo;
>  		nr -= todo;

So PAGE_FLAGS_CHECK_AT_FREE doesn't include PG_dirty, so while we now
properly mark the page dirty, we could continue and simply free the
thing?

I suppose the pagecache has a ref on and there's no window where we
could drop that before doing this free (didn't check).

But my main point was; should we check for the dirty bit when freeing
the page?

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: Dirty/Access bits vs. page content
  2014-04-22  7:54                   ` Peter Zijlstra
@ 2014-04-22 21:36                     ` Linus Torvalds
  2014-04-22 21:46                       ` Dave Hansen
  2014-04-23  3:08                       ` Hugh Dickins
  0 siblings, 2 replies; 49+ messages in thread
From: Linus Torvalds @ 2014-04-22 21:36 UTC (permalink / raw)
  To: Peter Zijlstra, Hugh Dickins
  Cc: Dave Hansen, H. Peter Anvin, Benjamin Herrenschmidt,
	linux-arch@vger.kernel.org, linux-mm, Russell King - ARM Linux,
	Tony Luck

On Tue, Apr 22, 2014 at 12:54 AM, Peter Zijlstra <peterz@infradead.org> wrote:
>
> So PAGE_FLAGS_CHECK_AT_FREE doesn't include PG_dirty, so while we now
> properly mark the page dirty, we could continue and simply free the
> thing?

Yes. But being free'd while dirty should be fairly normal for
anonymous pages, no? And while I did a "pte_mkclean()" the the
PageAnon() case (so that we won't waste time on "set_page_dirty()" on
pages we don't care about, a concurrent truncate() could have turned
what *used* to be a file-backed page into just a dirty page with no
mapping any more.

So I don't think we would necessarily want to check for PG_dirty at
page freeing time, because freeing dirty pages isn't necessarily
wrong. For example, tmpfs/shmfs pages are generally always dirty, and
would get freed when the inode is deleted.

That said, Dave Hansen did report a BUG_ON() in
mpage_prepare_extent_to_map(). His line number was odd, but I assume
it's this one:

        BUG_ON(PageWriteback(page));

which may be indicative of some oddity here wrt the dirty bit.

So I'm a bit worried.  I'm starting to think that we need to do
"set_page_dirty_lock()". It *used* to be the case that because we held
the page table lock and the page used to be mapped (even if we then
unmapped it), page_mapping() could not go away from under us because
truncate would see it in the rmap list and then get serialized on that
page table lock. But moving the set_page_dirty() later - and to
outside the page table lock - means that we probably need to use that
version that takes the page lock.

Which might kind of suck from a performance angle. But it might
explain DaveH's BUG_ON() when testing those patches?

I wonder if we could hold on to the page mapping some other way than
taking that page lock, because taking that page lock sounds
potentially damn expensive.

Who is the master of the lock_page() semantics? Hugh Dickins again?
I'm bringing him in for this issue too, since whenever there is some
vm locking issue, he is always on top of it.

Hugh - I'm assuming you are on linux-mm. If not, holler, and I'll send
you the two patches I wrote for the TLB dirty shootdown (short
explanation: dirty bit setting needs to be delayed until after tlb
flushing, since other CPU's may be busily writing to the page until
that time).

                    Linus

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: Dirty/Access bits vs. page content
  2014-04-22 21:36                     ` Linus Torvalds
@ 2014-04-22 21:46                       ` Dave Hansen
  2014-04-22 22:08                         ` Linus Torvalds
  2014-04-23  2:44                         ` Linus Torvalds
  2014-04-23  3:08                       ` Hugh Dickins
  1 sibling, 2 replies; 49+ messages in thread
From: Dave Hansen @ 2014-04-22 21:46 UTC (permalink / raw)
  To: Linus Torvalds, Peter Zijlstra, Hugh Dickins
  Cc: H. Peter Anvin, Benjamin Herrenschmidt,
	linux-arch@vger.kernel.org, linux-mm, Russell King - ARM Linux,
	Tony Luck

On 04/22/2014 02:36 PM, Linus Torvalds wrote:
> That said, Dave Hansen did report a BUG_ON() in
> mpage_prepare_extent_to_map(). His line number was odd, but I assume
> it's this one:
> 
>         BUG_ON(PageWriteback(page));
> 
> which may be indicative of some oddity here wrt the dirty bit.

I just triggered it a second time.  It only happens with my debugging
config[1] *and* those two fix patches.  It doesn't happen on the vanilla
kernel with lost dirty bit.

The line numbers point to the

	head = page_buffers(page);

which is:

#define page_buffers(page)                                      \
        ({                                                      \
                BUG_ON(!PagePrivate(page));                     \
                ((struct buffer_head *)page_private(page));     \
        })

1.
http://sr71.net/~dave/intel/20140422-lostdirtybit/config-ext4-bugon.20140422
Config that doesn't trigger it:
http://sr71.net/~dave/intel/20140422-lostdirtybit/config-ext4-nobug.20140422





--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: Dirty/Access bits vs. page content
  2014-04-22 21:46                       ` Dave Hansen
@ 2014-04-22 22:08                         ` Linus Torvalds
  2014-04-22 22:41                           ` Dave Hansen
  2014-04-23  2:44                         ` Linus Torvalds
  1 sibling, 1 reply; 49+ messages in thread
From: Linus Torvalds @ 2014-04-22 22:08 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Peter Zijlstra, Hugh Dickins, H. Peter Anvin,
	Benjamin Herrenschmidt, linux-arch@vger.kernel.org, linux-mm,
	Russell King - ARM Linux, Tony Luck

On Tue, Apr 22, 2014 at 2:46 PM, Dave Hansen <dave.hansen@intel.com> wrote:
>
> I just triggered it a second time.  It only happens with my debugging
> config[1] *and* those two fix patches.  It doesn't happen on the vanilla
> kernel with lost dirty bit.

Ok. So looking at it some more, I'm becoming more and more convinced
that we do need to make that set_page_dirty() call in
free_pages_and_swap_cache() be a set_page_dirty_lock() instead.

Does that make things work for you?

             Linus

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: Dirty/Access bits vs. page content
  2014-04-22 22:08                         ` Linus Torvalds
@ 2014-04-22 22:41                           ` Dave Hansen
  0 siblings, 0 replies; 49+ messages in thread
From: Dave Hansen @ 2014-04-22 22:41 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Peter Zijlstra, Hugh Dickins, H. Peter Anvin,
	Benjamin Herrenschmidt, linux-arch@vger.kernel.org, linux-mm,
	Russell King - ARM Linux, Tony Luck

On 04/22/2014 03:08 PM, Linus Torvalds wrote:
> On Tue, Apr 22, 2014 at 2:46 PM, Dave Hansen <dave.hansen@intel.com> wrote:
>>
>> I just triggered it a second time.  It only happens with my debugging
>> config[1] *and* those two fix patches.  It doesn't happen on the vanilla
>> kernel with lost dirty bit.
> 
> Ok. So looking at it some more, I'm becoming more and more convinced
> that we do need to make that set_page_dirty() call in
> free_pages_and_swap_cache() be a set_page_dirty_lock() instead.
> 
> Does that make things work for you?

Nope, didn't appear to do anything different.

> [  160.607904] EXT4-fs (loop0): mounted filesystem with ordered data mode. Opts: (null)
> [  173.093239] ------------[ cut here ]------------
> [  173.097884] kernel BUG at /home/davehans/linux.git/fs/ext4/inode.c:2377!
> [  173.104695] invalid opcode: 0000 [#1] SMP 
> [  173.108849] CPU: 65 PID: 4286 Comm: racewrite_threa Not tainted 3.15.0-rc2+ #392
> [  173.116383] Hardware name: FUJITSU-SV PRIMEQUEST 1800E2/SB, BIOS PRIMEQUEST 1000 Series BIOS Version 1.24 09/14/2011
> [  173.127174] task: ffff887ff1909140 ti: ffff887ff2048000 task.ti: ffff887ff2048000
> [  173.134799] RIP: 0010:[<ffffffff81204810>]  [<ffffffff81204810>] mpage_prepare_extent_to_map+0x320/0x330
> [  173.144432] RSP: 0018:ffff887ff2049c68  EFLAGS: 00010246
> [  173.149819] RAX: 0000000000000041 RBX: ffff887ff2049dc8 RCX: ffff88bff2e52b48
> [  173.157080] RDX: 6bfffc000002003d RSI: 0000000000000167 RDI: ffffffff819c1ce8
> [  173.164341] RBP: ffff887ff2049d48 R08: ffffea037fbdb980 R09: 0000000000000001
> [  173.171603] R10: 0000000000000000 R11: 0000000000000220 R12: 7fffffffffffffff
> [  173.178864] R13: 0000000000000041 R14: ffff887ff2049ca8 R15: ffff887ff2049ca8
> [  173.186125] FS:  00007fe361e37700(0000) GS:ffff88dfffaa0000(0000) knlGS:0000000000000000
> [  173.194381] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> [  173.200214] CR2: 00007fe3aa70a000 CR3: 0000005fefdd9000 CR4: 00000000000007e0
> [  173.207475] Stack:
> [  173.209506]  ffffea037fbdb980 ffff887ff2049ca8 ffff88bff2e52b48 0000000000000000
> [  173.217017]  0000000000000000 0000000000000001 0000000000000000 ffffea037fbdb980
> [  173.224527]  0000000000000003 0000000000000286 ffff887ff1909140 ffff883ff2730800
> [  173.232038] Call Trace:
> [  173.234527]  [<ffffffff81233fcc>] ? __ext4_journal_start_sb+0x7c/0x120
> [  173.241159]  [<ffffffff8120af6b>] ? ext4_writepages+0x44b/0xce0
> [  173.247166]  [<ffffffff8120afa4>] ext4_writepages+0x484/0xce0
> [  173.253001]  [<ffffffff81123933>] do_writepages+0x23/0x40
> [  173.258479]  [<ffffffff811165f9>] __filemap_fdatawrite_range+0x59/0x60
> [  173.265112]  [<ffffffff8111b3f5>] SyS_fadvise64_64+0x265/0x270
> [  173.271031]  [<ffffffff8111b40e>] SyS_fadvise64+0xe/0x10
> [  173.276429]  [<ffffffff816d8a29>] system_call_fastpath+0x16/0x1b
> [  173.282513] Code: ff e9 6d ff ff ff 48 8d bd 48 ff ff ff e8 b9 20 f2 ff e9 31 ff ff ff 48 8b 4b 08 8b 49 20 85 c9 0f 85 ee fd ff ff 31 c0 eb b3 90 <0f> 0b 0f 0b 66 66 66 2e 0f 1f 84 00 00 00 00 00 66 66 66 66 90 
> [  173.301685] RIP  [<ffffffff81204810>] mpage_prepare_extent_to_map+0x320/0x330
> [  173.308933]  RSP <ffff887ff2049c68>
> [  173.312576] ---[ end trace b53fdf1d352b727a ]---


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: Dirty/Access bits vs. page content
  2014-04-22 21:46                       ` Dave Hansen
  2014-04-22 22:08                         ` Linus Torvalds
@ 2014-04-23  2:44                         ` Linus Torvalds
  1 sibling, 0 replies; 49+ messages in thread
From: Linus Torvalds @ 2014-04-23  2:44 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Peter Zijlstra, Hugh Dickins, H. Peter Anvin,
	Benjamin Herrenschmidt, linux-arch@vger.kernel.org, linux-mm,
	Russell King - ARM Linux, Tony Luck

On Tue, Apr 22, 2014 at 2:46 PM, Dave Hansen <dave.hansen@intel.com> wrote:
>
> The line numbers point to the
>
>         head = page_buffers(page);

Ahh. My tree has an ext4 update, so for me that wasn't the case, and
your line numbers were off.

Hmm. I've stared at it, and I can make neither heads nor tails of it.
I really don't see why moving the set_page_dirty() would matter for
that case. Very odd. I'll mull on it and maybe it comes to me
overnight.

               Linus

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: Dirty/Access bits vs. page content
  2014-04-22 21:36                     ` Linus Torvalds
  2014-04-22 21:46                       ` Dave Hansen
@ 2014-04-23  3:08                       ` Hugh Dickins
  2014-04-23  4:23                         ` Linus Torvalds
  2014-04-23 18:41                         ` Jan Kara
  1 sibling, 2 replies; 49+ messages in thread
From: Hugh Dickins @ 2014-04-23  3:08 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Jan Kara, Peter Zijlstra, Dave Hansen, H. Peter Anvin,
	Benjamin Herrenschmidt, linux-arch@vger.kernel.org, linux-mm,
	Russell King - ARM Linux, Tony Luck

On Tue, 22 Apr 2014, Linus Torvalds wrote:
> On Tue, Apr 22, 2014 at 12:54 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> That said, Dave Hansen did report a BUG_ON() in
> mpage_prepare_extent_to_map(). His line number was odd, but I assume
> it's this one:
> 
>         BUG_ON(PageWriteback(page));
> 
> which may be indicative of some oddity here wrt the dirty bit.

Whereas later mail from Dave showed it to be the
	BUG_ON(!PagePrivate(page));
in page_buffers() from fs/ext4/inode.c mpage_prepare_extent_to_map().
But still presumably some kind of fallout from your patches.

Once upon a time there was a page_has_buffers() check in there,
but Honza decided that's nowadays unnecessary in f8bec37037ac
"ext4: dirty page has always buffers attached".  Cc'ed,
he may very well have some good ideas.

Reading that commit reminded me of how we actually don't expect that
set_page_dirty() in zap_pte_range() to do anything at all on the usual
mapping_cap_account_dirty()/page_mkwrite() filesystems, do we?  Or do we?

> So I'm a bit worried.  I'm starting to think that we need to do
> "set_page_dirty_lock()". It *used* to be the case that because we held
> the page table lock and the page used to be mapped (even if we then
> unmapped it), page_mapping() could not go away from under us because
> truncate would see it in the rmap list and then get serialized on that
> page table lock. But moving the set_page_dirty() later - and to
> outside the page table lock - means that we probably need to use that
> version that takes the page lock.
> 
> Which might kind of suck from a performance angle. But it might
> explain DaveH's BUG_ON() when testing those patches?
> 
> I wonder if we could hold on to the page mapping some other way than
> taking that page lock, because taking that page lock sounds
> potentially damn expensive.

At first I thought you were right, and set_page_dirty_lock() needed;
but now I think not.  We only(?) need set_page_dirty_lock() when
there's a danger that the struct address_space itself might be
evicted beneath us.

But here (even without relying on Al's delayed_fput) the fput()
is done by remove_vma() called _after_ the tlb_finish_mmu().
So I see no danger of struct address_space being freed:
set_page_dirty_lock() is for, say, people who did get_user_pages(),
and cannot be sure that the original range is still mapped when
they come to do the final set_page_dirtys and put_pages.

page->mapping might be truncated to NULL at any moment without page
lock and without mapping->tree_lock (and without page table lock
helping to serialize page_mapped against unmap_mapping_range); but
__set_page_dirty_nobuffers() (admittedly not the only set_page_dirty)
is careful about that, and it's just not the problem Dave is seeing.

However... Virginia and I get rather giddy when it comes to the
clear_page_dirty_for_io() page_mkclean() dance: we trust you and
Peter and Jan on that, and page lock there has some part to play.

My suspicion, not backed up at all, is that by leaving the
set_page_dirty() until after the page has been unmapped (and page
table lock dropped), that dance has been disturbed in such a way
that ext4 can be tricked into freeing page buffers before it will
need them again for a final dirty writeout.

Kind words deleted :) Your 2 patches below for easier reference.

Hugh

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: Dirty/Access bits vs. page content
  2014-04-23  3:08                       ` Hugh Dickins
@ 2014-04-23  4:23                         ` Linus Torvalds
  2014-04-23  6:14                           ` Benjamin Herrenschmidt
  2014-04-23 18:41                         ` Jan Kara
  1 sibling, 1 reply; 49+ messages in thread
From: Linus Torvalds @ 2014-04-23  4:23 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Jan Kara, Peter Zijlstra, Dave Hansen, H. Peter Anvin,
	Benjamin Herrenschmidt, linux-arch@vger.kernel.org, linux-mm,
	Russell King - ARM Linux, Tony Luck

On Tue, Apr 22, 2014 at 8:08 PM, Hugh Dickins <hughd@google.com> wrote:
>
> At first I thought you were right, and set_page_dirty_lock() needed;
> but now I think not.  We only(?) need set_page_dirty_lock() when
> there's a danger that the struct address_space itself might be
> evicted beneath us.
>
> But here (even without relying on Al's delayed_fput) the fput()
> is done by remove_vma() called _after_ the tlb_finish_mmu().
> So I see no danger of struct address_space being freed:

Yeah, that's what the comments say. I'm a bit worried about
"page->mapping" races, though. So I don't think the address_space gets
freed, but I worry about truncate NULL'ing out page->mapping under us.
There, the page being either mapped in a page table (with the page
table lock held), or holding the page lock should protect us against
concurrent truncate doing that.

So I'm still a bit worried.

> page->mapping might be truncated to NULL at any moment without page
> lock and without mapping->tree_lock (and without page table lock
> helping to serialize page_mapped against unmap_mapping_range); but
> __set_page_dirty_nobuffers() (admittedly not the only set_page_dirty)
> is careful about that, and it's just not the problem Dave is seeing.

Yeah, I guess you're right. And no, page->mapping becoming NULL
doesn't actually explain Dave's issue anyway.

> However... Virginia and I get rather giddy when it comes to the
> clear_page_dirty_for_io() page_mkclean() dance: we trust you and
> Peter and Jan on that, and page lock there has some part to play.
>
> My suspicion, not backed up at all, is that by leaving the
> set_page_dirty() until after the page has been unmapped (and page
> table lock dropped), that dance has been disturbed in such a way
> that ext4 can be tricked into freeing page buffers before it will
> need them again for a final dirty writeout.

So I don't think it's new, but I agree that it opens a *much* wider
window where the dirty bit is not visible in either the page tables or
(yet) in the page dirty bit.

The same does happen in try_to_unmap_one() and in
clear_page_dirty_for_io(), but in both of those cases we hold the page
lock for the entire duration of the sequence, so this whole "page is
not visibly dirty and page lock is not held" is new. And yes, I could
imagine that that makes some code go "Ahh, we don't need buffers for
that page any more then".

In short, I think your suspicion is on the right track. I will have to
think about this.

But I'm starting to consider this whole thing to be a 3.16 issue by
now. It wasn't as simple as it looked, and while our old location of
set_page_dirty() is clearly wrong, and DaveH even got a test-case for
it (which I initially doubted would even be possible), I still
seriously doubt that anybody sane who cares about data consistency
will do concurrent unmaps (or MADV_DONTNEED) while another writer is
actively using that mapping.

Pretty much by definition, if you care about data consistency, you'd
never do insane things like that. You'd make damn sure that every
writer is all done before you unmap the area they are writing to.

So this is a "oops, we're clearly doing something wrong by marking the
page dirty too early early, but anybody who really hits it has it
coming to them" kind of situation. We want to fix it, but it doesn't
seem to be a high priority.

             Linus

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: Dirty/Access bits vs. page content
  2014-04-23  4:23                         ` Linus Torvalds
@ 2014-04-23  6:14                           ` Benjamin Herrenschmidt
  0 siblings, 0 replies; 49+ messages in thread
From: Benjamin Herrenschmidt @ 2014-04-23  6:14 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Hugh Dickins, Jan Kara, Peter Zijlstra, Dave Hansen,
	H. Peter Anvin, linux-arch@vger.kernel.org, linux-mm,
	Russell King - ARM Linux, Tony Luck

On Tue, 2014-04-22 at 21:23 -0700, Linus Torvalds wrote:
> 
> But I'm starting to consider this whole thing to be a 3.16 issue by
> now. It wasn't as simple as it looked, and while our old location of
> set_page_dirty() is clearly wrong, and DaveH even got a test-case for
> it (which I initially doubted would even be possible), I still
> seriously doubt that anybody sane who cares about data consistency
> will do concurrent unmaps (or MADV_DONTNEED) while another writer is
> actively using that mapping.

I'm more worried about users of unmap_mapping_ranges() than concurrent
munmap(). Things like the DRM playing tricks like swapping a mapping
from memory to frame buffer and vice-versa.

In any case, I agree with delaying that for 3.16, it's still very
unlikely that we hit this in any case that actually matters.

Cheers,
Ben.


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: Dirty/Access bits vs. page content
  2014-04-23  3:08                       ` Hugh Dickins
  2014-04-23  4:23                         ` Linus Torvalds
@ 2014-04-23 18:41                         ` Jan Kara
  2014-04-23 19:33                           ` Linus Torvalds
  2014-04-23 20:11                           ` Hugh Dickins
  1 sibling, 2 replies; 49+ messages in thread
From: Jan Kara @ 2014-04-23 18:41 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Linus Torvalds, Jan Kara, Peter Zijlstra, Dave Hansen,
	H. Peter Anvin, Benjamin Herrenschmidt,
	linux-arch@vger.kernel.org, linux-mm, Russell King - ARM Linux,
	Tony Luck

On Tue 22-04-14 20:08:59, Hugh Dickins wrote:
> On Tue, 22 Apr 2014, Linus Torvalds wrote:
> > On Tue, Apr 22, 2014 at 12:54 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> > That said, Dave Hansen did report a BUG_ON() in
> > mpage_prepare_extent_to_map(). His line number was odd, but I assume
> > it's this one:
> > 
> >         BUG_ON(PageWriteback(page));
> > 
> > which may be indicative of some oddity here wrt the dirty bit.
> 
> Whereas later mail from Dave showed it to be the
> 	BUG_ON(!PagePrivate(page));
> in page_buffers() from fs/ext4/inode.c mpage_prepare_extent_to_map().
> But still presumably some kind of fallout from your patches.
> 
> Once upon a time there was a page_has_buffers() check in there,
> but Honza decided that's nowadays unnecessary in f8bec37037ac
> "ext4: dirty page has always buffers attached".  Cc'ed,
> he may very well have some good ideas.
> 
> Reading that commit reminded me of how we actually don't expect that
> set_page_dirty() in zap_pte_range() to do anything at all on the usual
> mapping_cap_account_dirty()/page_mkwrite() filesystems, do we?  Or do we?
  Yes, for shared file mappings we (as in filesystems implementing
page_mkwrite() handler) expect a page is writeably mmapped iff the page is
dirty. So in particular we don't expect set_page_dirty() in zap_pte_range()
to do anything because if the pte has dirty bit set, we are tearing down a
writeable mapping of the page and thus the page should be already dirty.

Now the devil is in synchronization of different places where transitions
from/to writeably-mapped state happen. In the fault path (do_wp_page())
where transition to writeably-mapped happens we hold page lock while
calling set_page_dirty(). In the writeout path (clear_page_dirty_for_io())
where we transition from writeably-mapped we hold the page lock as well
while calling page_mkclean() and possibly set_page_dirty(). So these two
places are nicely serialized. However zap_pte_range() doesn't hold page
lock so it can race with the previous two places. Before Linus' patches we
called set_page_dirty() under pte lock in zap_pte_range() and also before
decrementing page->mapcount. So if zap_pte_range() raced with
clear_page_dirty_for_io() we were guaranteed that by the time
clear_page_dirty_for_io() returns, pte dirty bit is cleared and
set_page_dirty() was called (either from clear_page_dirty_for_io() or from
zap_pte_range()).

However with Linus' patches set_page_dirty() from zap_pte_range() gets
called after decremeting page->mapcount so page_mkclean() won't event try
to walk rmap. And even if page_mkclean() did walk the rmap, zap_pte_range()
calls set_page_dirty() after dropping pte lock so it can get called long
after page_mkclean() (and clear_page_dirty_for_io()) has returned.

Now I'm not sure how to fix Linus' patches. For all I care we could just
rip out pte dirty bit handling for file mappings. However last time I
suggested this you corrected me that tmpfs & ramfs need this. I assume this
is still the case - however, given we unconditionally mark the page dirty
for write faults, where exactly do we need this?

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: Dirty/Access bits vs. page content
  2014-04-23 18:41                         ` Jan Kara
@ 2014-04-23 19:33                           ` Linus Torvalds
  2014-04-24  6:51                             ` Peter Zijlstra
  2014-04-23 20:11                           ` Hugh Dickins
  1 sibling, 1 reply; 49+ messages in thread
From: Linus Torvalds @ 2014-04-23 19:33 UTC (permalink / raw)
  To: Jan Kara
  Cc: Hugh Dickins, Peter Zijlstra, Dave Hansen, H. Peter Anvin,
	Benjamin Herrenschmidt, linux-arch@vger.kernel.org, linux-mm,
	Russell King - ARM Linux, Tony Luck

[-- Attachment #1: Type: text/plain, Size: 1826 bytes --]

On Wed, Apr 23, 2014 at 11:41 AM, Jan Kara <jack@suse.cz> wrote:
>
> Now I'm not sure how to fix Linus' patches. For all I care we could just
> rip out pte dirty bit handling for file mappings. However last time I
> suggested this you corrected me that tmpfs & ramfs need this. I assume this
> is still the case - however, given we unconditionally mark the page dirty
> for write faults, where exactly do we need this?

Honza, you're missing the important part: it does not matter one whit
that we unconditionally mark the page dirty, when we do it *early*,
and it can be then be marked clean before it's actually clean!

The problem is that page cleaning can clean the page when there are
still writers dirtying the page. Page table tear-down removes the
entry from the page tables, but it's still there in the TLB on other
CPU's. So other CPU's are possibly writing to the page, when
clear_page_dirty_for_io() has marked it clean (because it didn't see
the page table entries that got torn down, and it hasn't seen the
dirty bit in the page yet).

I'm including Dave Hansen's "racewrite.c" with his commentary:

 "This is a will-it-scale test-case which handles all the thread creation
  and CPU binding for me: https://github.com/antonblanchard/will-it-scale
  .  Just stick the test case in tests/.  I also loopback-mounted a ramfs
  file as an ext4 filesystem on /mnt to make sure the writeback could
  happen fast.

  This reproduces the bug pretty darn quickly and with as few as 4 threads
  running like this:  ./racewrite_threads -t 4 -s 999"

and

 "It reproduces in about 5 seconds on my 4770 on an unpatched kernel.  It
  also reproduces on a _normal_ filesystem and doesn't apparently need the
  loopback-mounted ext4 ramfs file that I was trying before."

so this can actually be triggered.

              Linus

[-- Attachment #2: racewrite.c --]
[-- Type: text/x-csrc, Size: 1584 bytes --]

#define _GNU_SOURCE
#define _XOPEN_SOURCE 500
#include <sched.h>
#include <sys/mman.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <string.h>
#include <assert.h>

#define BUFLEN 4096

static char wistmpfile[] = "/mnt/willitscale.XXXXXX";

char *testcase_description = "Same file pwrite";

char *buf;
#define FILE_SIZE (4096*1024)
void testcase_prepare(void)
{
	int fd = mkstemp(wistmpfile);

	assert(fd >= 0);
	assert(pwrite(fd, "X", 1, FILE_SIZE-1) == 1);
	buf = mmap(NULL, FILE_SIZE, PROT_READ|PROT_WRITE,
			       MAP_SHARED, fd, 0);
	assert(buf != (void *)-1);
	close(fd);
}

void testcase(unsigned long long *iterations)
{
	int cpu = sched_getcpu();
	int fd = open(wistmpfile, O_RDWR);
	off_t offset = sched_getcpu() * BUFLEN;
	long counter = 0;
	long counterread = 0;
	long *counterbuf = (void *)&buf[offset];

	printf("offset: %ld\n", offset);
	printf("       buf: %p\n", buf);
	printf("counterbuf: %p\n", counterbuf);
	assert(fd >= 0);

	while (1) {
		int ret;
		if (cpu == 1) {
			ret = madvise(buf, FILE_SIZE, MADV_DONTNEED);
			continue;
		}
	
		*counterbuf = counter;
		posix_fadvise(fd, offset, BUFLEN, POSIX_FADV_DONTNEED);
		ret = pread(fd, &counterread, sizeof(counterread), offset);
		assert(ret == sizeof(counterread));

		if (counterread != counter) {
			printf("cpu: %d\n", cpu);
			printf("    counter %ld\n", counter);
			printf("counterread %ld\n", counterread);
			printf("*counterbuf %ld\n", *counterbuf);
			while(1);
		}
		counter++;	
		(*iterations)++;
	}
}

void testcase_cleanup(void)
{
	unlink(wistmpfile);
}

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: Dirty/Access bits vs. page content
  2014-04-23 19:33                           ` Linus Torvalds
@ 2014-04-24  6:51                             ` Peter Zijlstra
  2014-04-24 18:40                               ` Hugh Dickins
  0 siblings, 1 reply; 49+ messages in thread
From: Peter Zijlstra @ 2014-04-24  6:51 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Jan Kara, Hugh Dickins, Dave Hansen, H. Peter Anvin,
	Benjamin Herrenschmidt, linux-arch@vger.kernel.org, linux-mm,
	Russell King - ARM Linux, Tony Luck

On Wed, Apr 23, 2014 at 12:33:15PM -0700, Linus Torvalds wrote:
> On Wed, Apr 23, 2014 at 11:41 AM, Jan Kara <jack@suse.cz> wrote:
> >
> > Now I'm not sure how to fix Linus' patches. For all I care we could just
> > rip out pte dirty bit handling for file mappings. However last time I
> > suggested this you corrected me that tmpfs & ramfs need this. I assume this
> > is still the case - however, given we unconditionally mark the page dirty
> > for write faults, where exactly do we need this?
> 
> Honza, you're missing the important part: it does not matter one whit
> that we unconditionally mark the page dirty, when we do it *early*,
> and it can be then be marked clean before it's actually clean!
> 
> The problem is that page cleaning can clean the page when there are
> still writers dirtying the page. Page table tear-down removes the
> entry from the page tables, but it's still there in the TLB on other
> CPU's. 

> So other CPU's are possibly writing to the page, when
> clear_page_dirty_for_io() has marked it clean (because it didn't see
> the page table entries that got torn down, and it hasn't seen the
> dirty bit in the page yet).

So page_mkclean() does an rmap walk to mark the page RO, it does mm wide
TLB invalidations while doing so.

zap_pte_range() only removes the rmap entry after it does the
ptep_get_and_clear_full(), which doesn't do any TLB invalidates.

So as Linus says the page_mkclean() can actually miss a page, because
it's already removed from rmap() but due to zap_pte_range() can still
have active TLB entries.

So in order to fix this we'd have to delay the page_remove_rmap() as
well, but that's tricky because rmap walkers cannot deal with
in-progress teardown. Will need to think on this.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: Dirty/Access bits vs. page content
  2014-04-24  6:51                             ` Peter Zijlstra
@ 2014-04-24 18:40                               ` Hugh Dickins
  2014-04-24 19:45                                 ` Linus Torvalds
  0 siblings, 1 reply; 49+ messages in thread
From: Hugh Dickins @ 2014-04-24 18:40 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Linus Torvalds, Jan Kara, Dave Hansen, H. Peter Anvin,
	Benjamin Herrenschmidt, linux-arch@vger.kernel.org, linux-mm,
	Russell King - ARM Linux, Tony Luck

On Thu, 24 Apr 2014, Peter Zijlstra wrote:
> On Wed, Apr 23, 2014 at 12:33:15PM -0700, Linus Torvalds wrote:
> > On Wed, Apr 23, 2014 at 11:41 AM, Jan Kara <jack@suse.cz> wrote:
> > >
> > > Now I'm not sure how to fix Linus' patches. For all I care we could just
> > > rip out pte dirty bit handling for file mappings. However last time I
> > > suggested this you corrected me that tmpfs & ramfs need this. I assume this
> > > is still the case - however, given we unconditionally mark the page dirty
> > > for write faults, where exactly do we need this?
> > 
> > Honza, you're missing the important part: it does not matter one whit
> > that we unconditionally mark the page dirty, when we do it *early*,
> > and it can be then be marked clean before it's actually clean!
> > 
> > The problem is that page cleaning can clean the page when there are
> > still writers dirtying the page. Page table tear-down removes the
> > entry from the page tables, but it's still there in the TLB on other
> > CPU's. 
> 
> > So other CPU's are possibly writing to the page, when
> > clear_page_dirty_for_io() has marked it clean (because it didn't see
> > the page table entries that got torn down, and it hasn't seen the
> > dirty bit in the page yet).
> 
> So page_mkclean() does an rmap walk to mark the page RO, it does mm wide
> TLB invalidations while doing so.
> 
> zap_pte_range() only removes the rmap entry after it does the
> ptep_get_and_clear_full(), which doesn't do any TLB invalidates.
> 
> So as Linus says the page_mkclean() can actually miss a page, because
> it's already removed from rmap() but due to zap_pte_range() can still
> have active TLB entries.
> 
> So in order to fix this we'd have to delay the page_remove_rmap() as
> well, but that's tricky because rmap walkers cannot deal with
> in-progress teardown. Will need to think on this.

I've grown sceptical that Linus's approach can be made to work
safely with page_mkclean(), as it stands at present anyway.

I think that (in the exceptional case when a shared file pte_dirty has
been encountered, and this mm is active on other cpus) zap_pte_range()
needs to flush TLB on other cpus of this mm, just before its
pte_unmap_unlock(): then it respects the usual page_mkclean() protocol.

Or has that already been rejected earlier in the thread,
as too costly for some common case?

Hugh

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: Dirty/Access bits vs. page content
  2014-04-24 18:40                               ` Hugh Dickins
@ 2014-04-24 19:45                                 ` Linus Torvalds
  2014-04-24 20:02                                   ` Hugh Dickins
  0 siblings, 1 reply; 49+ messages in thread
From: Linus Torvalds @ 2014-04-24 19:45 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Peter Zijlstra, Jan Kara, Dave Hansen, H. Peter Anvin,
	Benjamin Herrenschmidt, linux-arch@vger.kernel.org, linux-mm,
	Russell King - ARM Linux, Tony Luck

On Thu, Apr 24, 2014 at 11:40 AM, Hugh Dickins <hughd@google.com> wrote:
> safely with page_mkclean(), as it stands at present anyway.
>
> I think that (in the exceptional case when a shared file pte_dirty has
> been encountered, and this mm is active on other cpus) zap_pte_range()
> needs to flush TLB on other cpus of this mm, just before its
> pte_unmap_unlock(): then it respects the usual page_mkclean() protocol.
>
> Or has that already been rejected earlier in the thread,
> as too costly for some common case?

Hmm. The problem is that right now we actually try very hard to batch
as much as possible in order to avoid extra TLB flushes (we limit it
to around 10k pages per batch, but that's still a *lot* of pages). The
TLB flush IPI calls are noticeable under some loads.

And it's certainly much too much to free 10k pages under a spinlock.
The latencies would be horrendous.

We could add some special logic that only triggers for the dirty pages
case, but it would still have to handle the case of "we batched up
9000 clean pages, and then we hit a dirty page", so it would get
rather costly quickly.

Or we could have a separate array for dirty pages, and limit those to
a much smaller number, and do just the dirty pages under the lock, and
then the rest after releasing the lock. Again, a fair amount of new
complexity.

I would almost prefer to have some special (per-mapping?) lock or
something, and make page_mkclean() be serialize with the unmapping
case.

              Linus

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: Dirty/Access bits vs. page content
  2014-04-24 19:45                                 ` Linus Torvalds
@ 2014-04-24 20:02                                   ` Hugh Dickins
  2014-04-24 23:46                                     ` Linus Torvalds
  0 siblings, 1 reply; 49+ messages in thread
From: Hugh Dickins @ 2014-04-24 20:02 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Peter Zijlstra, Jan Kara, Dave Hansen, H. Peter Anvin,
	Benjamin Herrenschmidt, linux-arch@vger.kernel.org, linux-mm,
	Russell King - ARM Linux, Tony Luck

On Thu, 24 Apr 2014, Linus Torvalds wrote:
> On Thu, Apr 24, 2014 at 11:40 AM, Hugh Dickins <hughd@google.com> wrote:
> > safely with page_mkclean(), as it stands at present anyway.
> >
> > I think that (in the exceptional case when a shared file pte_dirty has
> > been encountered, and this mm is active on other cpus) zap_pte_range()
> > needs to flush TLB on other cpus of this mm, just before its
> > pte_unmap_unlock(): then it respects the usual page_mkclean() protocol.
> >
> > Or has that already been rejected earlier in the thread,
> > as too costly for some common case?
> 
> Hmm. The problem is that right now we actually try very hard to batch
> as much as possible in order to avoid extra TLB flushes (we limit it
> to around 10k pages per batch, but that's still a *lot* of pages). The
> TLB flush IPI calls are noticeable under some loads.
> 
> And it's certainly much too much to free 10k pages under a spinlock.
> The latencies would be horrendous.

There is no need to free all the pages immediately after doing the
TLB flush: that's merely how it's structured at present; page freeing
can be left until the end as now, or when out from under the spinlock.

What's sadder, I think, is that we would have to flush TLB for each
page table spanned by the mapping (if other cpus are really active);
but that's still much better batching than what page_mkclean() itself
does (none).

> 
> We could add some special logic that only triggers for the dirty pages
> case, but it would still have to handle the case of "we batched up
> 9000 clean pages, and then we hit a dirty page", so it would get
> rather costly quickly.
> 
> Or we could have a separate array for dirty pages, and limit those to
> a much smaller number, and do just the dirty pages under the lock, and
> then the rest after releasing the lock. Again, a fair amount of new
> complexity.
> 
> I would almost prefer to have some special (per-mapping?) lock or
> something, and make page_mkclean() be serialize with the unmapping
> case.

Yes, that might be a possibility.

Hugh

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: Dirty/Access bits vs. page content
  2014-04-24 20:02                                   ` Hugh Dickins
@ 2014-04-24 23:46                                     ` Linus Torvalds
  2014-04-25  1:37                                       ` Benjamin Herrenschmidt
  2014-04-25 16:30                                       ` Dave Hansen
  0 siblings, 2 replies; 49+ messages in thread
From: Linus Torvalds @ 2014-04-24 23:46 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Peter Zijlstra, Jan Kara, Dave Hansen, H. Peter Anvin,
	Benjamin Herrenschmidt, linux-arch@vger.kernel.org, linux-mm,
	Russell King - ARM Linux, Tony Luck

[-- Attachment #1: Type: text/plain, Size: 2160 bytes --]

On Thu, Apr 24, 2014 at 1:02 PM, Hugh Dickins <hughd@google.com> wrote:
>
> There is no need to free all the pages immediately after doing the
> TLB flush: that's merely how it's structured at present; page freeing
> can be left until the end as now, or when out from under the spinlock.

Hmm. In fact, if we to the actual TLB flush still under the ptl lock,
the current code will "just work". We can just keep the
set_page_dirty() at the scanning part, because there's no race with
mkclean() as long as we hold the lock.

So all that requires would be to split our current "tlb_flush_mmu()"
into the actual tlb flushing part, and the free_pages_and_swap_cache()
part. And then we do the TLB flushing inside the ptl, to make sure
that we flush tlb's before anybody can do mkclean().

And then we make the case of doing "set_page_dirty()" force a TLB
flush (but *not* force breaking out of the loop).

This gives us the best of all worlds:

 - maximum batching for the common case (no shared dirty pte entries)

 - if we find any dirty page table entries, we will batch as much as
we can within the ptl lock

 - we do the TLB shootdown holding the page table lock (but that's not
new - ptep_get_and_flush does the same

 - but we do the batched freeing of pages outside the lock

 - and the patch is pretty simple too (no need for the "one dirty bit
in the 'struct page *' pointer" games.

IOW, how about the attached patch that entirely replaces my previous
two patches. DaveH - does this fix your test-case, while _not_
introducing any new BUG_ON() triggers?

I didn't test the patch, maybe I did something stupid. It compiles for
me, but it only works for the HAVE_GENERIC_MMU_GATHER case, but
introducing tlb_flush_mmu_tlbonly() and tlb_flush_mmu_free() into the
non-generic cases should be trivial, since they really are just that
old "tlb_flush_mmu()" function split up (the tlb_flush_mmu() function
remains available for other non-forced flush users)

So assuming this does work for DaveH, then the arm/ia64/um/whatever
people would need to do those trivial transforms too, but it really
shouldn't be too painful.

Comments? DaveH?

               Linus

[-- Attachment #2: patch.diff --]
[-- Type: text/plain, Size: 2970 bytes --]

 mm/memory.c | 53 ++++++++++++++++++++++++++++++++++-------------------
 1 file changed, 34 insertions(+), 19 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 93e332d5ed77..037b812a9531 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -232,17 +232,18 @@ void tlb_gather_mmu(struct mmu_gather *tlb, struct mm_struct *mm, unsigned long
 #endif
 }
 
-void tlb_flush_mmu(struct mmu_gather *tlb)
+static void tlb_flush_mmu_tlbonly(struct mmu_gather *tlb)
 {
-	struct mmu_gather_batch *batch;
-
-	if (!tlb->need_flush)
-		return;
 	tlb->need_flush = 0;
 	tlb_flush(tlb);
 #ifdef CONFIG_HAVE_RCU_TABLE_FREE
 	tlb_table_flush(tlb);
 #endif
+}
+
+static void tlb_flush_mmu_free(struct mmu_gather *tlb)
+{
+	struct mmu_gather_batch *batch;
 
 	for (batch = &tlb->local; batch; batch = batch->next) {
 		free_pages_and_swap_cache(batch->pages, batch->nr);
@@ -251,6 +252,14 @@ void tlb_flush_mmu(struct mmu_gather *tlb)
 	tlb->active = &tlb->local;
 }
 
+void tlb_flush_mmu(struct mmu_gather *tlb)
+{
+	if (!tlb->need_flush)
+		return;
+	tlb_flush_mmu_tlbonly(tlb);
+	tlb_flush_mmu_free(tlb);
+}
+
 /* tlb_finish_mmu
  *	Called at the end of the shootdown operation to free up any resources
  *	that were required.
@@ -1127,8 +1136,10 @@ again:
 			if (PageAnon(page))
 				rss[MM_ANONPAGES]--;
 			else {
-				if (pte_dirty(ptent))
+				if (pte_dirty(ptent)) {
+					force_flush = 1;
 					set_page_dirty(page);
+				}
 				if (pte_young(ptent) &&
 				    likely(!(vma->vm_flags & VM_SEQ_READ)))
 					mark_page_accessed(page);
@@ -1137,9 +1148,10 @@ again:
 			page_remove_rmap(page);
 			if (unlikely(page_mapcount(page) < 0))
 				print_bad_pte(vma, addr, ptent, page);
-			force_flush = !__tlb_remove_page(tlb, page);
-			if (force_flush)
+			if (unlikely(!__tlb_remove_page(tlb, page))) {
+				force_flush = 1;
 				break;
+			}
 			continue;
 		}
 		/*
@@ -1174,18 +1186,11 @@ again:
 
 	add_mm_rss_vec(mm, rss);
 	arch_leave_lazy_mmu_mode();
-	pte_unmap_unlock(start_pte, ptl);
 
-	/*
-	 * mmu_gather ran out of room to batch pages, we break out of
-	 * the PTE lock to avoid doing the potential expensive TLB invalidate
-	 * and page-free while holding it.
-	 */
+	/* Do the actual TLB flush before dropping ptl */
 	if (force_flush) {
 		unsigned long old_end;
 
-		force_flush = 0;
-
 		/*
 		 * Flush the TLB just for the previous segment,
 		 * then update the range to be the remaining
@@ -1193,11 +1198,21 @@ again:
 		 */
 		old_end = tlb->end;
 		tlb->end = addr;
-
-		tlb_flush_mmu(tlb);
-
+		tlb_flush_mmu_tlbonly(tlb);
 		tlb->start = addr;
 		tlb->end = old_end;
+	}
+	pte_unmap_unlock(start_pte, ptl);
+
+	/*
+	 * If we forced a TLB flush (either due to running out of
+	 * batch buffers or because we needed to flush dirty TLB
+	 * entries before releasing the ptl), free the batched
+	 * memory too. Restart if we didn't do everything.
+	 */
+	if (force_flush) {
+		force_flush = 0;
+		tlb_flush_mmu_free(tlb);
 
 		if (addr != end)
 			goto again;

^ permalink raw reply related	[flat|nested] 49+ messages in thread

* Re: Dirty/Access bits vs. page content
  2014-04-24 23:46                                     ` Linus Torvalds
@ 2014-04-25  1:37                                       ` Benjamin Herrenschmidt
  2014-04-25  2:41                                         ` Benjamin Herrenschmidt
  2014-04-25 16:30                                       ` Dave Hansen
  1 sibling, 1 reply; 49+ messages in thread
From: Benjamin Herrenschmidt @ 2014-04-25  1:37 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Hugh Dickins, Peter Zijlstra, Jan Kara, Dave Hansen,
	H. Peter Anvin, linux-arch@vger.kernel.org, linux-mm,
	Russell King - ARM Linux, Tony Luck

On Thu, 2014-04-24 at 16:46 -0700, Linus Torvalds wrote:
> - we do the TLB shootdown holding the page table lock (but that's not
> new - ptep_get_and_flush does the same
> 

The flip side is that we do a lot more IPIs for large invalidates,
since we drop the PTL on every page table page.

Ben.


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: Dirty/Access bits vs. page content
  2014-04-25  1:37                                       ` Benjamin Herrenschmidt
@ 2014-04-25  2:41                                         ` Benjamin Herrenschmidt
  2014-04-25  2:46                                           ` Linus Torvalds
  0 siblings, 1 reply; 49+ messages in thread
From: Benjamin Herrenschmidt @ 2014-04-25  2:41 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Hugh Dickins, Peter Zijlstra, Jan Kara, Dave Hansen,
	H. Peter Anvin, linux-arch@vger.kernel.org, linux-mm,
	Russell King - ARM Linux, Tony Luck

On Fri, 2014-04-25 at 11:37 +1000, Benjamin Herrenschmidt wrote:
> On Thu, 2014-04-24 at 16:46 -0700, Linus Torvalds wrote:
> > - we do the TLB shootdown holding the page table lock (but that's not
> > new - ptep_get_and_flush does the same
> > 
> 
> The flip side is that we do a lot more IPIs for large invalidates,
> since we drop the PTL on every page table page.

Oh I missed that your patch was smart enough to only do that in the
presence of non-anonymous dirty pages. That should take care of the
common case of short lived programs, those should still fit in a
single big batch.

Cheers,
Ben.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: Dirty/Access bits vs. page content
  2014-04-25  2:41                                         ` Benjamin Herrenschmidt
@ 2014-04-25  2:46                                           ` Linus Torvalds
  2014-04-25  2:50                                             ` H. Peter Anvin
  0 siblings, 1 reply; 49+ messages in thread
From: Linus Torvalds @ 2014-04-25  2:46 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Hugh Dickins, Peter Zijlstra, Jan Kara, Dave Hansen,
	H. Peter Anvin, linux-arch@vger.kernel.org, linux-mm,
	Russell King - ARM Linux, Tony Luck

On Thu, Apr 24, 2014 at 7:41 PM, Benjamin Herrenschmidt
<benh@kernel.crashing.org> wrote:
> On Fri, 2014-04-25 at 11:37 +1000, Benjamin Herrenschmidt wrote:
>>
>> The flip side is that we do a lot more IPIs for large invalidates,
>> since we drop the PTL on every page table page.
>
> Oh I missed that your patch was smart enough to only do that in the
> presence of non-anonymous dirty pages. That should take care of the
> common case of short lived programs, those should still fit in a
> single big batch.

Right. It only causes extra TLB shootdowns for dirty shared mappings.

Which, let's face it, don't actually ever happen. Using mmap to write
to files is actually very rare, because it really sucks in just about
all possible ways. There are almost no loads where it's not better to
just use a "write()" system call instead.

So the dirty shared mapping case _exists_, but it's pretty darn
unusual. The case where you have lots of mmap/munmap activity is even
less unusual. I suspect the most common case is for stress-testing the
VM, because nobody sane does it for an actual real load.

              Linus

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: Dirty/Access bits vs. page content
  2014-04-25  2:46                                           ` Linus Torvalds
@ 2014-04-25  2:50                                             ` H. Peter Anvin
  2014-04-25  3:03                                               ` Linus Torvalds
  0 siblings, 1 reply; 49+ messages in thread
From: H. Peter Anvin @ 2014-04-25  2:50 UTC (permalink / raw)
  To: Linus Torvalds, Benjamin Herrenschmidt
  Cc: Hugh Dickins, Peter Zijlstra, Jan Kara, Dave Hansen,
	linux-arch@vger.kernel.org, linux-mm, Russell King - ARM Linux,
	Tony Luck

On 04/24/2014 07:46 PM, Linus Torvalds wrote:
> On Thu, Apr 24, 2014 at 7:41 PM, Benjamin Herrenschmidt
> <benh@kernel.crashing.org> wrote:
>> On Fri, 2014-04-25 at 11:37 +1000, Benjamin Herrenschmidt wrote:
>>>
>>> The flip side is that we do a lot more IPIs for large invalidates,
>>> since we drop the PTL on every page table page.
>>
>> Oh I missed that your patch was smart enough to only do that in the
>> presence of non-anonymous dirty pages. That should take care of the
>> common case of short lived programs, those should still fit in a
>> single big batch.
> 
> Right. It only causes extra TLB shootdowns for dirty shared mappings.
> 
> Which, let's face it, don't actually ever happen. Using mmap to write
> to files is actually very rare, because it really sucks in just about
> all possible ways. There are almost no loads where it's not better to
> just use a "write()" system call instead.
> 
> So the dirty shared mapping case _exists_, but it's pretty darn
> unusual. The case where you have lots of mmap/munmap activity is even
> less unusual. I suspect the most common case is for stress-testing the
> VM, because nobody sane does it for an actual real load.
> 

The cases where they occur the mappings tend to be highly stable, i.e.
map once *specifically* to be able to do a whole bunch of things without
system calls, and then unmap when done.

	-hpa


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: Dirty/Access bits vs. page content
  2014-04-25  2:50                                             ` H. Peter Anvin
@ 2014-04-25  3:03                                               ` Linus Torvalds
  2014-04-25 12:01                                                 ` Hugh Dickins
  0 siblings, 1 reply; 49+ messages in thread
From: Linus Torvalds @ 2014-04-25  3:03 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Benjamin Herrenschmidt, Hugh Dickins, Peter Zijlstra, Jan Kara,
	Dave Hansen, linux-arch@vger.kernel.org, linux-mm,
	Russell King - ARM Linux, Tony Luck

On Thu, Apr 24, 2014 at 7:50 PM, H. Peter Anvin <hpa@zytor.com> wrote:
>
> The cases where they occur the mappings tend to be highly stable, i.e.
> map once *specifically* to be able to do a whole bunch of things without
> system calls, and then unmap when done.

Yes. But even that tends to be unusual. mmap() really is bad at
writing, since you inevitably get read-modify-write patterns etc. So
it's only useful for fixing up things after-the-fact, which in itself
is a horrible pattern.

Don't get me wrong - it exists, but it's really quite rare because it
has so many problems. Even people who do "fixup" kind of stuff tend to
map things privately, change things, and then write out the end
result. That way you can get atomicity by then doing a single
"rename()" at the end, for example.

The traditional case for it used to be the nntp index, and these days
I know some imap indexer (dovecot?) uses it. Every other example of it
I have ever seen has been a VM stress tester..

              Linus

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: Dirty/Access bits vs. page content
  2014-04-25  3:03                                               ` Linus Torvalds
@ 2014-04-25 12:01                                                 ` Hugh Dickins
  2014-04-25 13:51                                                   ` Peter Zijlstra
                                                                     ` (2 more replies)
  0 siblings, 3 replies; 49+ messages in thread
From: Hugh Dickins @ 2014-04-25 12:01 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: H. Peter Anvin, Benjamin Herrenschmidt, Peter Zijlstra, Jan Kara,
	Dave Hansen, linux-arch@vger.kernel.org, linux-mm,
	Russell King - ARM Linux, Tony Luck

On Thu, 24 Apr 2014, Linus Torvalds wrote:
> On Thu, Apr 24, 2014 at 7:50 PM, H. Peter Anvin <hpa@zytor.com> wrote:
> >
> > The cases where they occur the mappings tend to be highly stable, i.e.
> > map once *specifically* to be able to do a whole bunch of things without
> > system calls, and then unmap when done.
> 
> Yes. But even that tends to be unusual. mmap() really is bad at
> writing, since you inevitably get read-modify-write patterns etc. So
> it's only useful for fixing up things after-the-fact, which in itself
> is a horrible pattern.
> 
> Don't get me wrong - it exists, but it's really quite rare because it
> has so many problems. Even people who do "fixup" kind of stuff tend to
> map things privately, change things, and then write out the end
> result. That way you can get atomicity by then doing a single
> "rename()" at the end, for example.
> 
> The traditional case for it used to be the nntp index, and these days
> I know some imap indexer (dovecot?) uses it. Every other example of it
> I have ever seen has been a VM stress tester..

Your patch looks good to me (nice use of force_flush), and runs fine
here in normal usage; but I've not actually tried Dave's racewrite.c.

However, I have had a couple of contrarian half-thoughts, that
ordinarily I'd prefer to mull over more before blurting out,
but in the circumstances better say sooner than later.

One, regarding dirty shared mappings: you're thinking above of
mmap()'ing proper filesystem files, but this case also includes
shared memory - I expect there are uses of giant amounts of shared
memory, for which we really would prefer not to slow the teardown.

And confusingly, those are not subject to the special page_mkclean()
constraints, but still need to be handled in a correct manner: your
patch is fine, but might be overkill for them - I'm not yet sure.

Two, Ben said earlier that he's more worried about users of
unmap_mapping_range() than concurrent munmap(); and you said
earlier that you would almost prefer to have some special lock
to serialize with page_mkclean().

Er, i_mmap_mutex.

That's what unmap_mapping_range(), and page_mkclean()'s rmap_walk,
take to iterate over the file vmas.  So perhaps there's no race at all
in the unmap_mapping_range() case.  And easy (I imagine) to fix the
race in Dave's racewrite.c use of MADV_DONTNEED: untested patch below.

But exit and munmap() don't take i_mmap_mutex: perhaps they should
when encountering a VM_SHARED vma (I believe VM_SHARED should be
peculiar to having vm_file set, but test both below because I don't
want to oops in some odd corner where a special vma is set up).

Hugh

--- 3.15-rc2/mm/madvise.c	2013-11-03 15:41:51.000000000 -0800
+++ linux/mm/madvise.c	2014-04-25 04:10:40.124514427 -0700
@@ -274,10 +274,16 @@ static long madvise_dontneed(struct vm_a
 			     struct vm_area_struct **prev,
 			     unsigned long start, unsigned long end)
 {
+	struct address_space *mapping = NULL;
+
 	*prev = vma;
 	if (vma->vm_flags & (VM_LOCKED|VM_HUGETLB|VM_PFNMAP))
 		return -EINVAL;

+	if (vma->vm_file && (vma->vm_flags & VM_SHARED)) {
+		mapping = vma->vm_file->f_mapping;
+		mutex_lock(&mapping->i_mmap_mutex);
+	}
 	if (unlikely(vma->vm_flags & VM_NONLINEAR)) {
 		struct zap_details details = {
 			.nonlinear_vma = vma,
@@ -286,6 +292,8 @@ static long madvise_dontneed(struct vm_a
 		zap_page_range(vma, start, end - start, &details);
 	} else
 		zap_page_range(vma, start, end - start, NULL);
+	if (mapping)
+		mutex_unlock(&mapping->i_mmap_mutex);
 	return 0;
 }

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: Dirty/Access bits vs. page content
  2014-04-25 12:01                                                 ` Hugh Dickins
@ 2014-04-25 13:51                                                   ` Peter Zijlstra
  2014-04-25 19:41                                                     ` Hugh Dickins
  2014-04-25 16:54                                                   ` Dave Hansen
  2014-04-25 17:56                                                   ` Linus Torvalds
  2 siblings, 1 reply; 49+ messages in thread
From: Peter Zijlstra @ 2014-04-25 13:51 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Linus Torvalds, H. Peter Anvin, Benjamin Herrenschmidt, Jan Kara,
	Dave Hansen, linux-arch@vger.kernel.org, linux-mm,
	Russell King - ARM Linux, Tony Luck

On Fri, Apr 25, 2014 at 05:01:23AM -0700, Hugh Dickins wrote:
> One, regarding dirty shared mappings: you're thinking above of
> mmap()'ing proper filesystem files, but this case also includes
> shared memory - I expect there are uses of giant amounts of shared
> memory, for which we really would prefer not to slow the teardown.
> 
> And confusingly, those are not subject to the special page_mkclean()
> constraints, but still need to be handled in a correct manner: your
> patch is fine, but might be overkill for them - I'm not yet sure.

I think we could look at mapping_cap_account_dirty(page->mapping) while
holding the ptelock, the mapping can't go away while we hold that lock.

And afaict that's the exact differentiator between these two cases.

> Two, Ben said earlier that he's more worried about users of
> unmap_mapping_range() than concurrent munmap(); and you said
> earlier that you would almost prefer to have some special lock
> to serialize with page_mkclean().
> 
> Er, i_mmap_mutex.
> 
> That's what unmap_mapping_range(), and page_mkclean()'s rmap_walk,
> take to iterate over the file vmas.  So perhaps there's no race at all
> in the unmap_mapping_range() case.  And easy (I imagine) to fix the
> race in Dave's racewrite.c use of MADV_DONTNEED: untested patch below.

Ooh shiney.. yes that might work! 

> But exit and munmap() don't take i_mmap_mutex: perhaps they should
> when encountering a VM_SHARED vma 

Well, they will of course take it in order to detach the vma from the
rmap address_space::i_mmap tree.

> (I believe VM_SHARED should be
> peculiar to having vm_file seta, but test both below because I don't
> want to oops in some odd corner where a special vma is set up).

I think you might be on to something there...

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: Dirty/Access bits vs. page content
  2014-04-25 13:51                                                   ` Peter Zijlstra
@ 2014-04-25 19:41                                                     ` Hugh Dickins
  2014-04-26 18:07                                                       ` Peter Zijlstra
  0 siblings, 1 reply; 49+ messages in thread
From: Hugh Dickins @ 2014-04-25 19:41 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Linus Torvalds, H. Peter Anvin, Benjamin Herrenschmidt, Jan Kara,
	Dave Hansen, linux-arch@vger.kernel.org, linux-mm,
	Russell King - ARM Linux, Tony Luck

On Fri, 25 Apr 2014, Peter Zijlstra wrote:
> On Fri, Apr 25, 2014 at 05:01:23AM -0700, Hugh Dickins wrote:
> > One, regarding dirty shared mappings: you're thinking above of
> > mmap()'ing proper filesystem files, but this case also includes
> > shared memory - I expect there are uses of giant amounts of shared
> > memory, for which we really would prefer not to slow the teardown.
> > 
> > And confusingly, those are not subject to the special page_mkclean()
> > constraints, but still need to be handled in a correct manner: your
> > patch is fine, but might be overkill for them - I'm not yet sure.
> 
> I think we could look at mapping_cap_account_dirty(page->mapping) while
> holding the ptelock, the mapping can't go away while we hold that lock.
> 
> And afaict that's the exact differentiator between these two cases.

Yes, that's easily done, but I wasn't sure whether it was correct to
skip on shmem or not - just because shmem doesn't participate in the
page_mkclean() protocol, doesn't imply it's free from similar bugs.

I haven't seen a precise description of the bug we're anxious to fix:
Dave's MADV_DONTNEED should be easily fixable, that's not a concern;
Linus's first patch wrote of writing racing with cleaning, but didn't
give a concrete example.

How about this: a process with one thread repeatedly (but not very often)
writing timestamp into a shared file mapping, but another thread munmaps
it at some point; and another process repeatedly (but not very often)
reads the timestamp file (or peeks at it through mmap); with memory
pressure forcing page reclaim.

In the page_mkclean() shared file case, the second process might see
the timestamp move backwards: because a write from the timestamping
thread went into the pagecache after it had been written, but the
page not re-marked dirty; so when reclaimed and later read back
from disk, the older timestamp is seen.

But I think you can remove "page_mkclean() " from that paragraph:
the same can happen with shmem written out to swap.

It could not happen with ramfs, but I don't think we're desperate
to special case ramfs these days.

Hugh

> 
> > Two, Ben said earlier that he's more worried about users of
> > unmap_mapping_range() than concurrent munmap(); and you said
> > earlier that you would almost prefer to have some special lock
> > to serialize with page_mkclean().
> > 
> > Er, i_mmap_mutex.
> > 
> > That's what unmap_mapping_range(), and page_mkclean()'s rmap_walk,
> > take to iterate over the file vmas.  So perhaps there's no race at all
> > in the unmap_mapping_range() case.  And easy (I imagine) to fix the
> > race in Dave's racewrite.c use of MADV_DONTNEED: untested patch below.
> 
> Ooh shiney.. yes that might work! 
> 
> > But exit and munmap() don't take i_mmap_mutex: perhaps they should
> > when encountering a VM_SHARED vma 
> 
> Well, they will of course take it in order to detach the vma from the
> rmap address_space::i_mmap tree.
> 
> > (I believe VM_SHARED should be
> > peculiar to having vm_file seta, but test both below because I don't
> > want to oops in some odd corner where a special vma is set up).
> 
> I think you might be on to something there...

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: Dirty/Access bits vs. page content
  2014-04-25 19:41                                                     ` Hugh Dickins
@ 2014-04-26 18:07                                                       ` Peter Zijlstra
  2014-04-27  7:20                                                         ` Peter Zijlstra
  0 siblings, 1 reply; 49+ messages in thread
From: Peter Zijlstra @ 2014-04-26 18:07 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Linus Torvalds, H. Peter Anvin, Benjamin Herrenschmidt, Jan Kara,
	Dave Hansen, linux-arch@vger.kernel.org, linux-mm,
	Russell King - ARM Linux, Tony Luck

> > I think we could look at mapping_cap_account_dirty(page->mapping) while
> > holding the ptelock, the mapping can't go away while we hold that lock.
> > 
> > And afaict that's the exact differentiator between these two cases.
> 
> Yes, that's easily done, but I wasn't sure whether it was correct to
> skip on shmem or not - just because shmem doesn't participate in the
> page_mkclean() protocol, doesn't imply it's free from similar bugs.
> 
> I haven't seen a precise description of the bug we're anxious to fix:
> Dave's MADV_DONTNEED should be easily fixable, that's not a concern;
> Linus's first patch wrote of writing racing with cleaning, but didn't
> give a concrete example.

The way I understand it is that we observe the PTE dirty and set PAGE
dirty before we make the PTE globally unavailable (through a TLB flush),
and thereby we can mistakenly loose updates; by thinking a page is in
fact clean even though we can still get updates.

But I suspect you got that far..

> How about this: a process with one thread repeatedly (but not very often)
> writing timestamp into a shared file mapping, but another thread munmaps
> it at some point; and another process repeatedly (but not very often)
> reads the timestamp file (or peeks at it through mmap); with memory
> pressure forcing page reclaim.
> 
> In the page_mkclean() shared file case, the second process might see
> the timestamp move backwards: because a write from the timestamping
> thread went into the pagecache after it had been written, but the
> page not re-marked dirty; so when reclaimed and later read back
> from disk, the older timestamp is seen.
> 
> But I think you can remove "page_mkclean() " from that paragraph:
> the same can happen with shmem written out to swap.

I'm not entirely seeing how this could happen for shmem; the swap path
goes through try_to_unmap_one(), which removes the PTE and does a full
TLB flush. Anything after that will get a fault (which will hit either
the swapcache or actual swap).

The rmap walk of try_to_unmap() is the same as the one for
page_mkclean().

So if we're good on the one, we're good on the other.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: Dirty/Access bits vs. page content
  2014-04-26 18:07                                                       ` Peter Zijlstra
@ 2014-04-27  7:20                                                         ` Peter Zijlstra
  2014-04-27 12:20                                                           ` Hugh Dickins
  2014-04-27 16:21                                                           ` Linus Torvalds
  0 siblings, 2 replies; 49+ messages in thread
From: Peter Zijlstra @ 2014-04-27  7:20 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Linus Torvalds, H. Peter Anvin, Benjamin Herrenschmidt, Jan Kara,
	Dave Hansen, linux-arch@vger.kernel.org, linux-mm,
	Russell King - ARM Linux, Tony Luck

On Sat, Apr 26, 2014 at 08:07:11PM +0200, Peter Zijlstra wrote:
> > > I think we could look at mapping_cap_account_dirty(page->mapping) while
> > > holding the ptelock, the mapping can't go away while we hold that lock.
> > > 
> > > And afaict that's the exact differentiator between these two cases.
> > 
> > Yes, that's easily done, but I wasn't sure whether it was correct to
> > skip on shmem or not - just because shmem doesn't participate in the
> > page_mkclean() protocol, doesn't imply it's free from similar bugs.
> > 
> > I haven't seen a precise description of the bug we're anxious to fix:
> > Dave's MADV_DONTNEED should be easily fixable, that's not a concern;
> > Linus's first patch wrote of writing racing with cleaning, but didn't
> > give a concrete example.
> 
> The way I understand it is that we observe the PTE dirty and set PAGE
> dirty before we make the PTE globally unavailable (through a TLB flush),
> and thereby we can mistakenly loose updates; by thinking a page is in
> fact clean even though we can still get updates.
> 
> But I suspect you got that far..

OK, so I've been thinking and figured I either mis-understand how the
hardware works or don't understand how Linus' patch will actually fully
fix the issue.

So what both try_to_unmap_one() and zap_pte_range() end up doing is
clearing the PTE entry and then flushing the TLBs.

However, that still leaves a window where there are remote TLB entries.
What if any of those remote entries cause a write (or have a dirty bit
cached) while we've already removed the PTE entry.

This means that the remote CPU cannot update the PTE anymore (its not
there after all).

Will the hardware fault when it does a translation and needs to update
the dirty/access bits while the PTE entry is !present?

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: Dirty/Access bits vs. page content
  2014-04-27  7:20                                                         ` Peter Zijlstra
@ 2014-04-27 12:20                                                           ` Hugh Dickins
  2014-04-27 19:33                                                             ` Peter Zijlstra
  2014-04-27 20:09                                                             ` Hugh Dickins
  2014-04-27 16:21                                                           ` Linus Torvalds
  1 sibling, 2 replies; 49+ messages in thread
From: Hugh Dickins @ 2014-04-27 12:20 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Linus Torvalds, H. Peter Anvin, Benjamin Herrenschmidt, Jan Kara,
	Dave Hansen, linux-arch@vger.kernel.org, linux-mm,
	Russell King - ARM Linux, Tony Luck

On Sun, 27 Apr 2014, Peter Zijlstra wrote:
> On Sat, Apr 26, 2014 at 08:07:11PM +0200, Peter Zijlstra wrote:
> > > > I think we could look at mapping_cap_account_dirty(page->mapping) while
> > > > holding the ptelock, the mapping can't go away while we hold that lock.
> > > > 
> > > > And afaict that's the exact differentiator between these two cases.
> > > 
> > > Yes, that's easily done, but I wasn't sure whether it was correct to
> > > skip on shmem or not - just because shmem doesn't participate in the
> > > page_mkclean() protocol, doesn't imply it's free from similar bugs.
> > > 
> > > I haven't seen a precise description of the bug we're anxious to fix:
> > > Dave's MADV_DONTNEED should be easily fixable, that's not a concern;
> > > Linus's first patch wrote of writing racing with cleaning, but didn't
> > > give a concrete example.
> > 
> > The way I understand it is that we observe the PTE dirty and set PAGE
> > dirty before we make the PTE globally unavailable (through a TLB flush),
> > and thereby we can mistakenly loose updates; by thinking a page is in
> > fact clean even though we can still get updates.
> > 
> > But I suspect you got that far..
> 
> OK, so I've been thinking and figured I either mis-understand how the
> hardware works or don't understand how Linus' patch will actually fully
> fix the issue.
> 
> So what both try_to_unmap_one() and zap_pte_range() end up doing is
> clearing the PTE entry and then flushing the TLBs.
> 
> However, that still leaves a window where there are remote TLB entries.
> What if any of those remote entries cause a write (or have a dirty bit
> cached) while we've already removed the PTE entry.
> 
> This means that the remote CPU cannot update the PTE anymore (its not
> there after all).
> 
> Will the hardware fault when it does a translation and needs to update
> the dirty/access bits while the PTE entry is !present?

Yes - but I'm sure you know that, just not while you wrote the mail ;)

But it will not fault while it still has the entry in its TLB,
with dirty (and access) bits set in that entry in its TLB.

The problem is with those entries, which already have dirty set
in the TLB, although it's now cleared in the page table itself.

I'm answering this mail because it only seems to need "Yes";
but well aware that I've not yet answered your yesterday's mail.
Sorry, my yesterday had to be spent on... other stuff.

I'm sleeping at present (well, not quite) and preparing a reply in
the interstices of my sleep - if I don't change my mind before
answering, I still think shmem needs Linus's (or my) patch.

But woke with a panic attack that we have overlooked the question
of how page reclaim's page_mapped() checks are serialized.
Perhaps this concern will evaporate with the morning dew,
perhaps it will not...

Hugh

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: Dirty/Access bits vs. page content
  2014-04-27 12:20                                                           ` Hugh Dickins
@ 2014-04-27 19:33                                                             ` Peter Zijlstra
  2014-04-27 19:47                                                               ` Linus Torvalds
  2014-04-27 20:09                                                             ` Hugh Dickins
  1 sibling, 1 reply; 49+ messages in thread
From: Peter Zijlstra @ 2014-04-27 19:33 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Linus Torvalds, H. Peter Anvin, Benjamin Herrenschmidt, Jan Kara,
	Dave Hansen, linux-arch@vger.kernel.org, linux-mm,
	Russell King - ARM Linux, Tony Luck

On Sun, Apr 27, 2014 at 05:20:25AM -0700, Hugh Dickins wrote:
> On Sun, 27 Apr 2014, Peter Zijlstra wrote:
> > Will the hardware fault when it does a translation and needs to update
> > the dirty/access bits while the PTE entry is !present?
> 
> Yes - but I'm sure you know that, just not while you wrote the mail ;)

Possibly, but all of a sudden I was fearing hardware was 'creative' and
we'd need something along the lines of what Linus proposed:

    entry = atomic_xchg(pte, 0);
    flush_tlb();
    entry |= *pte;

Or possible event:

   *ptep = pte_wrprotect(*ptep);
   flush_tlb();
   entry = atomic_xchg(ptep, 0);
   flush_tlb();

The horrible things one fears.. :-) The latter option would be required
if TLBs would have a write-back dirty bit cache, *shudder*.

> But it will not fault while it still has the entry in its TLB,
> with dirty (and access) bits set in that entry in its TLB.
> 
> The problem is with those entries, which already have dirty set
> in the TLB, although it's now cleared in the page table itself.
> 
> I'm answering this mail because it only seems to need "Yes";
> but well aware that I've not yet answered your yesterday's mail.
> Sorry, my yesterday had to be spent on... other stuff.

Yeah, not worries, I spend most of the weekend preparing and then having
a birthday party for my (now 3 year old) daughter :-)

> I'm sleeping at present (well, not quite) and preparing a reply in
> the interstices of my sleep - if I don't change my mind before
> answering, I still think shmem needs Linus's (or my) patch.

Oh, absolutely. I wasn't arguing it didn't need it. I was merely
pointing out that if one was to add to Linus' patch such that we'd only
do the force_flush for mapping_cap_account_dirty() we wouldn't need
extra things to deal with shmem.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: Dirty/Access bits vs. page content
  2014-04-27 19:33                                                             ` Peter Zijlstra
@ 2014-04-27 19:47                                                               ` Linus Torvalds
  0 siblings, 0 replies; 49+ messages in thread
From: Linus Torvalds @ 2014-04-27 19:47 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Hugh Dickins, H. Peter Anvin, Benjamin Herrenschmidt, Jan Kara,
	Dave Hansen, linux-arch@vger.kernel.org, linux-mm,
	Russell King - ARM Linux, Tony Luck

On Sun, Apr 27, 2014 at 12:33 PM, Peter Zijlstra <peterz@infradead.org> wrote:
>
> Oh, absolutely. I wasn't arguing it didn't need it. I was merely
> pointing out that if one was to add to Linus' patch such that we'd only
> do the force_flush for mapping_cap_account_dirty() we wouldn't need
> extra things to deal with shmem.

I think we can certainly add that check if we find out that it is
indeed a performance problem. I *could* imagine loads where people
mmap/munmap shmem regions at a high rate, but don't actually know of
any (remember: for this to matter they also have to dirty the pages).

In the absence of such knowledge, I'd rather not make things more
complex than they already are.

              Linus

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: Dirty/Access bits vs. page content
  2014-04-27 12:20                                                           ` Hugh Dickins
  2014-04-27 19:33                                                             ` Peter Zijlstra
@ 2014-04-27 20:09                                                             ` Hugh Dickins
  2014-04-28  9:25                                                               ` Peter Zijlstra
  1 sibling, 1 reply; 49+ messages in thread
From: Hugh Dickins @ 2014-04-27 20:09 UTC (permalink / raw)
  To: Linus Torvalds, Peter Zijlstra
  Cc: H. Peter Anvin, Benjamin Herrenschmidt, Jan Kara, Dave Hansen,
	linux-arch@vger.kernel.org, linux-mm, Russell King - ARM Linux,
	Tony Luck

On Sun, 27 Apr 2014, Hugh Dickins wrote:
> 
> But woke with a panic attack that we have overlooked the question
> of how page reclaim's page_mapped() checks are serialized.
> Perhaps this concern will evaporate with the morning dew,
> perhaps it will not...

It was a real concern, but we happen to be rescued by the innocuous-
looking is_page_cache_freeable() check at the beginning of pageout():
which will deserve its own comment, but that can follow later.

My concern was with page reclaim's shrink_page_list() racing against
munmap's or exit's (or madvise's) zap_pte_range() unmapping the page.

Once zap_pte_range() has cleared the pte from a vma, neither
try_to_unmap() nor page_mkclean() will see that vma as containing
the page, so neither will do its own flush TLB of the cpus involved,
before proceeding to writepage.

Linus's patch (serialializing with ptlock) or my patch (serializing
with i_mmap_mutex) both almost fix that, but it seemed not entirely:
because try_to_unmap() is only called when page_mapped(), and
page_mkclean() quits early without taking locks when !page_mapped().

So in the interval when zap_pte_range() has brought page_mapcount()
down to 0, but not yet flushed TLB on all mapping cpus, it looked as
if we still had a problem - neither try_to_unmap() nor page_mkclean()
would take the lock either of us rely upon for serialization.

But pageout()'s preliminary is_page_cache_freeable() check makes
it safe in the end: although page_mapcount() has gone down to 0,
page_count() remains raised until the free_pages_and_swap_cache()
after the TLB flush.

So I now believe we're safe after all with either patch, and happy
for Linus to go ahead with his.

Peter, returning at last to your question of whether we could exempt
shmem from the added overhead of either patch.  Until just now I
thought not, because of the possibility that the shmem_writepage()
could occur while one of the mm's cpus remote from zap_pte_range()
cpu was still modifying the page.  But now that I see the role
played by is_page_cache_freeable(), and of course the zapping end
has never dropped its reference on the page before the TLB flush,
however late that occurred, hmmm, maybe yes, shmem can be exempted.

But I'd prefer to dwell on that a bit longer: we can add that as
an optimization later if it holds up to scrutiny.

Hugh

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: Dirty/Access bits vs. page content
  2014-04-27 20:09                                                             ` Hugh Dickins
@ 2014-04-28  9:25                                                               ` Peter Zijlstra
  2014-04-28 10:14                                                                 ` Peter Zijlstra
  0 siblings, 1 reply; 49+ messages in thread
From: Peter Zijlstra @ 2014-04-28  9:25 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Linus Torvalds, H. Peter Anvin, Benjamin Herrenschmidt, Jan Kara,
	Dave Hansen, linux-arch@vger.kernel.org, linux-mm,
	Russell King - ARM Linux, Tony Luck

On Sun, Apr 27, 2014 at 01:09:54PM -0700, Hugh Dickins wrote:
> On Sun, 27 Apr 2014, Hugh Dickins wrote:
> > 
> > But woke with a panic attack that we have overlooked the question
> > of how page reclaim's page_mapped() checks are serialized.
> > Perhaps this concern will evaporate with the morning dew,
> > perhaps it will not...
> 
> It was a real concern, but we happen to be rescued by the innocuous-
> looking is_page_cache_freeable() check at the beginning of pageout():
> which will deserve its own comment, but that can follow later.
> 
> My concern was with page reclaim's shrink_page_list() racing against
> munmap's or exit's (or madvise's) zap_pte_range() unmapping the page.
> 
> Once zap_pte_range() has cleared the pte from a vma, neither
> try_to_unmap() nor page_mkclean() will see that vma as containing
> the page, so neither will do its own flush TLB of the cpus involved,
> before proceeding to writepage.
> 
> Linus's patch (serialializing with ptlock) or my patch (serializing
> with i_mmap_mutex) both almost fix that, but it seemed not entirely:
> because try_to_unmap() is only called when page_mapped(), and
> page_mkclean() quits early without taking locks when !page_mapped().

Argh!! very good spotting that.

> So in the interval when zap_pte_range() has brought page_mapcount()
> down to 0, but not yet flushed TLB on all mapping cpus, it looked as
> if we still had a problem - neither try_to_unmap() nor page_mkclean()
> would take the lock either of us rely upon for serialization.
> 
> But pageout()'s preliminary is_page_cache_freeable() check makes
> it safe in the end: although page_mapcount() has gone down to 0,
> page_count() remains raised until the free_pages_and_swap_cache()
> after the TLB flush.
> 
> So I now believe we're safe after all with either patch, and happy
> for Linus to go ahead with his.

OK, so I'm just not seeing that atm. Will have another peek later,
hopefully when more fully awake.

> Peter, returning at last to your question of whether we could exempt
> shmem from the added overhead of either patch.  Until just now I
> thought not, because of the possibility that the shmem_writepage()
> could occur while one of the mm's cpus remote from zap_pte_range()
> cpu was still modifying the page.  But now that I see the role
> played by is_page_cache_freeable(), and of course the zapping end
> has never dropped its reference on the page before the TLB flush,
> however late that occurred, hmmm, maybe yes, shmem can be exempted.
> 
> But I'd prefer to dwell on that a bit longer: we can add that as
> an optimization later if it holds up to scrutiny.

For sure.. No need to rush that. And if a (performance) regression shows
up in the meantime, we immediately have a good test case too :-)

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: Dirty/Access bits vs. page content
  2014-04-28  9:25                                                               ` Peter Zijlstra
@ 2014-04-28 10:14                                                                 ` Peter Zijlstra
  0 siblings, 0 replies; 49+ messages in thread
From: Peter Zijlstra @ 2014-04-28 10:14 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Linus Torvalds, H. Peter Anvin, Benjamin Herrenschmidt, Jan Kara,
	Dave Hansen, linux-arch@vger.kernel.org, linux-mm,
	Russell King - ARM Linux, Tony Luck

On Mon, Apr 28, 2014 at 11:25:40AM +0200, Peter Zijlstra wrote:
> > So in the interval when zap_pte_range() has brought page_mapcount()
> > down to 0, but not yet flushed TLB on all mapping cpus, it looked as
> > if we still had a problem - neither try_to_unmap() nor page_mkclean()
> > would take the lock either of us rely upon for serialization.
> > 
> > But pageout()'s preliminary is_page_cache_freeable() check makes
> > it safe in the end: although page_mapcount() has gone down to 0,
> > page_count() remains raised until the free_pages_and_swap_cache()
> > after the TLB flush.
> > 
> > So I now believe we're safe after all with either patch, and happy
> > for Linus to go ahead with his.
> 
> OK, so I'm just not seeing that atm. Will have another peek later,
> hopefully when more fully awake.

Sigh.. I suppose I should do more mm/ stuff, I'm getting real rusty.

So it looks like we also have a page-ref per map, every time we install
a page (->fault) we grab an extra ref.

So yes, we'll have >2 refs until the final free_page_and_swap_cache()


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: Dirty/Access bits vs. page content
  2014-04-27  7:20                                                         ` Peter Zijlstra
  2014-04-27 12:20                                                           ` Hugh Dickins
@ 2014-04-27 16:21                                                           ` Linus Torvalds
  2014-04-27 23:13                                                             ` Benjamin Herrenschmidt
  1 sibling, 1 reply; 49+ messages in thread
From: Linus Torvalds @ 2014-04-27 16:21 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Hugh Dickins, H. Peter Anvin, Benjamin Herrenschmidt, Jan Kara,
	Dave Hansen, linux-arch@vger.kernel.org, linux-mm,
	Russell King - ARM Linux, Tony Luck

On Sun, Apr 27, 2014 at 12:20 AM, Peter Zijlstra <peterz@infradead.org> wrote:
>
> OK, so I've been thinking and figured I either mis-understand how the
> hardware works or don't understand how Linus' patch will actually fully
> fix the issue.
>
> So what both try_to_unmap_one() and zap_pte_range() end up doing is
> clearing the PTE entry and then flushing the TLBs.

Right. And we can't do it in the other order, since if we flush the
TLBs first, another cpu (or even this cpu, doing speculative memory
accesses) could re-fill them between the flush and the clearing of the
page table.

So that's race #1, easily fixed by ordering the TLB flush after
clearing the page table entry. We've never done this particular race
wrong, afaik. It's _so_ obvious that it's the only one we've always
gotten right.

On to the ones we've historically gotten wrong:

> However, that still leaves a window where there are remote TLB entries.
> What if any of those remote entries cause a write (or have a dirty bit
> cached) while we've already removed the PTE entry.

So this is race #2: the race between clearing the entry and a TLB miss
loading it and marking it accessed or dirty.

That race is handled by the fact that the CPU does the accessed/dirty
bit update as an atomic read-modify-write operation, and actually
re-checks the PTE entry as it does so.

So in theory a CPU could just remember what address it loaded the TLB
entry from, and do a blind "set the dirty bit" with just an atomic
"or" operation. In fact, for a while I thought that CPU's could do
that, and the TLB flushing sequence would be:

    entry = atomic_xchg(pte, 0);
    flush_tlb();
    entry |= *pte;

so that we'd catch any races with the A/D bit getting set.

It turns out no CPU actually does that, and I'm not sure we ever had
that code sequence in the kernel (but some code archaeologist might go
look).

What CPU's actually do is simpler both for them and for us: instead of
remembering where they loaded the TLB entry from, they re-walk the
page table when they mark the TLB dirty, and if the page table entry
is no longer present (or if it's non-writable), the store will fault
instead of marking the TLB entry dirty.

So race #2 doesn't need the above complex sequence, but it still
*does* need that TLB entry to be loaded with an atomic exchange with
zero (or at least with something that clears the present bit, zero
obviously being the simplest such value). So a simple

    entry = atomic_xchg(pte, 0);

is sufficient for this race (except we call it "ptep_get_and_clear()" ;)

Of course, *If* a CPU were to remember the address it loaded the TLB
entry from, then such a CPU might as well make the TLB be part of the
cache-coherency domain, and then we wouldn't need to do any TLB
flushing at all. I wish.

> Will the hardware fault when it does a translation and needs to update
> the dirty/access bits while the PTE entry is !present?

Yes indeed, see above (but see how broken hardware _could_ work, which
would be really painful for us).

What we are fighting is race #3: the TLB happily exists on this or
other CPU's, an dis _not_ getting updated (so no re-walk), but _is_
getting used.

And we've actually had a fix for race #3 for a long time: the whole
"don't free the pages until after the flush" is very much this issue.
So it's not a new condition by any means (as far as I can tell, the
mmu_gather infrastructure was introduced in 2.4.9.13, so 2001 - the
exact commit predates even BK history).

But this new issue is related to race #3, but purely in software: when
we do the "set_page_dirty()" before doing the TLB flush, we need to
protect against our cleaning that bit until after the flush.

And we've now had three different ways to fix that race, one
introducing a new race (my original two-patch series that delayed the
set_page_dirty() the same way we delay the page freeing), one using a
new lock entirely (Hugh latest patch - mapping->i_mmap_mutex isn't a
new lock, but in this context it is), and one that extends the rules
we already had in place for the single-PTE cases (do the
set_page_dirty() and TLB flushing atomically wrt the page table lock,
which makes it atomic wrt mkclean_one).

And the reason I think I'll merge my patch rather than Hugh's (despite
Hugh's being smaller) is exactly the fact that it doesn't really
introduce any new locking rules - it just fixes the fact that we
didn't really realize how important it was, and didn't follow the same
rules as the single-pte cases did.

                   Linus

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: Dirty/Access bits vs. page content
  2014-04-27 16:21                                                           ` Linus Torvalds
@ 2014-04-27 23:13                                                             ` Benjamin Herrenschmidt
  0 siblings, 0 replies; 49+ messages in thread
From: Benjamin Herrenschmidt @ 2014-04-27 23:13 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Peter Zijlstra, Hugh Dickins, H. Peter Anvin, Jan Kara,
	Dave Hansen, linux-arch@vger.kernel.org, linux-mm,
	Russell King - ARM Linux, Tony Luck

On Sun, 2014-04-27 at 09:21 -0700, Linus Torvalds wrote:

> So in theory a CPU could just remember what address it loaded the TLB
> entry from, and do a blind "set the dirty bit" with just an atomic
> "or" operation. In fact, for a while I thought that CPU's could do
> that, and the TLB flushing sequence would be:
> 
>     entry = atomic_xchg(pte, 0);
>     flush_tlb();
>     entry |= *pte;
> 
> so that we'd catch any races with the A/D bit getting set.
>
> It turns out no CPU actually does that, and I'm not sure we ever had
> that code sequence in the kernel (but some code archaeologist might go
> look).

Today hash based powerpc's do the update in the hash table using a byte
store, not an atomic compare. That's one of the reasons we don't
currently exploit the HW facility for dirty/accessed. (There are others,
such as pages being evicted from the hash, we would need a path to
transfer dirty back to the struct page, etc...)

 .../...

> Of course, *If* a CPU were to remember the address it loaded the TLB
> entry from, then such a CPU might as well make the TLB be part of the
> cache-coherency domain, and then we wouldn't need to do any TLB
> flushing at all. I wish.

Hrm... Remembering the address as part of the data is one thing, having
it in the tag for snoops is another :) I can see CPU designers wanting
to do the first and not the second.... Though most CPUs I've seen are 4
or 8 ways set-associative so it's not as bad as adding a big CAM
thankfully.

> > Will the hardware fault when it does a translation and needs to update
> > the dirty/access bits while the PTE entry is !present?
> 
> Yes indeed, see above (but see how broken hardware _could_ work, which
> would be really painful for us).
> 
> What we are fighting is race #3: the TLB happily exists on this or
> other CPU's, an dis _not_ getting updated (so no re-walk), but _is_
> getting used.

Right, and it's little brother which is that the update and the access
that caused it aren't atomic with each other, thus the access can be
seen some time after the R/C update. (This was my original concern until
I realized that it was in fact the same race as the dirty TLB entry
still in the other CPUs).

Cheers,
Ben.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: Dirty/Access bits vs. page content
  2014-04-25 12:01                                                 ` Hugh Dickins
  2014-04-25 13:51                                                   ` Peter Zijlstra
@ 2014-04-25 16:54                                                   ` Dave Hansen
  2014-04-25 18:41                                                     ` Hugh Dickins
  2014-04-25 17:56                                                   ` Linus Torvalds
  2 siblings, 1 reply; 49+ messages in thread
From: Dave Hansen @ 2014-04-25 16:54 UTC (permalink / raw)
  To: Hugh Dickins, Linus Torvalds
  Cc: H. Peter Anvin, Benjamin Herrenschmidt, Peter Zijlstra, Jan Kara,
	linux-arch@vger.kernel.org, linux-mm, Russell King - ARM Linux,
	Tony Luck

On 04/25/2014 05:01 AM, Hugh Dickins wrote:
> Er, i_mmap_mutex.
> 
> That's what unmap_mapping_range(), and page_mkclean()'s rmap_walk,
> take to iterate over the file vmas.  So perhaps there's no race at all
> in the unmap_mapping_range() case.  And easy (I imagine) to fix the
> race in Dave's racewrite.c use of MADV_DONTNEED: untested patch below.
> 
> But exit and munmap() don't take i_mmap_mutex: perhaps they should
> when encountering a VM_SHARED vma (I believe VM_SHARED should be
> peculiar to having vm_file set, but test both below because I don't
> want to oops in some odd corner where a special vma is set up).

Hey Hugh,

Do you want some testing on this?

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: Dirty/Access bits vs. page content
  2014-04-25 16:54                                                   ` Dave Hansen
@ 2014-04-25 18:41                                                     ` Hugh Dickins
  2014-04-25 22:00                                                       ` Dave Hansen
  0 siblings, 1 reply; 49+ messages in thread
From: Hugh Dickins @ 2014-04-25 18:41 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Linus Torvalds, H. Peter Anvin, Benjamin Herrenschmidt,
	Peter Zijlstra, Jan Kara, linux-arch@vger.kernel.org, linux-mm,
	Russell King - ARM Linux, Tony Luck

On Fri, 25 Apr 2014, Dave Hansen wrote:
> On 04/25/2014 05:01 AM, Hugh Dickins wrote:
> > Er, i_mmap_mutex.
> > 
> > That's what unmap_mapping_range(), and page_mkclean()'s rmap_walk,
> > take to iterate over the file vmas.  So perhaps there's no race at all
> > in the unmap_mapping_range() case.  And easy (I imagine) to fix the
> > race in Dave's racewrite.c use of MADV_DONTNEED: untested patch below.
> > 
> > But exit and munmap() don't take i_mmap_mutex: perhaps they should
> > when encountering a VM_SHARED vma (I believe VM_SHARED should be
> > peculiar to having vm_file set, but test both below because I don't
> > want to oops in some odd corner where a special vma is set up).
> 
> Hey Hugh,
> 
> Do you want some testing on this?

Yes, please do: I just haven't gotten around to cloning the git
tree and trying it.  It's quite likely that we shall go Linus's
way rather than this, but still useful to have the information
as to whether this way really is viable.

Thanks,
Hugh

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: Dirty/Access bits vs. page content
  2014-04-25 18:41                                                     ` Hugh Dickins
@ 2014-04-25 22:00                                                       ` Dave Hansen
  2014-04-26  3:11                                                         ` Hugh Dickins
  0 siblings, 1 reply; 49+ messages in thread
From: Dave Hansen @ 2014-04-25 22:00 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Linus Torvalds, H. Peter Anvin, Benjamin Herrenschmidt,
	Peter Zijlstra, Jan Kara, linux-arch@vger.kernel.org, linux-mm,
	Russell King - ARM Linux, Tony Luck

On 04/25/2014 11:41 AM, Hugh Dickins wrote:
> On Fri, 25 Apr 2014, Dave Hansen wrote:
>> On 04/25/2014 05:01 AM, Hugh Dickins wrote:
>>> Er, i_mmap_mutex.
>>>
>>> That's what unmap_mapping_range(), and page_mkclean()'s rmap_walk,
>>> take to iterate over the file vmas.  So perhaps there's no race at all
>>> in the unmap_mapping_range() case.  And easy (I imagine) to fix the
>>> race in Dave's racewrite.c use of MADV_DONTNEED: untested patch below.
>>>
>>> But exit and munmap() don't take i_mmap_mutex: perhaps they should
>>> when encountering a VM_SHARED vma (I believe VM_SHARED should be
>>> peculiar to having vm_file set, but test both below because I don't
>>> want to oops in some odd corner where a special vma is set up).
>>
>> Hey Hugh,
>>
>> Do you want some testing on this?
> 
> Yes, please do: I just haven't gotten around to cloning the git
> tree and trying it.  It's quite likely that we shall go Linus's
> way rather than this, but still useful to have the information
> as to whether this way really is viable.

Your patch works fine for the madvise() case.  The effect appears the
same as Linus's to my test case at least.  I didn't test any unmaps or
other creative uses of unmap_mapping_range().

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: Dirty/Access bits vs. page content
  2014-04-25 22:00                                                       ` Dave Hansen
@ 2014-04-26  3:11                                                         ` Hugh Dickins
  2014-04-26  3:48                                                           ` Linus Torvalds
  0 siblings, 1 reply; 49+ messages in thread
From: Hugh Dickins @ 2014-04-26  3:11 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Linus Torvalds, H. Peter Anvin, Benjamin Herrenschmidt,
	Peter Zijlstra, Jan Kara, linux-arch@vger.kernel.org, linux-mm,
	Russell King - ARM Linux, Tony Luck

On Fri, 25 Apr 2014, Dave Hansen wrote:
> On 04/25/2014 11:41 AM, Hugh Dickins wrote:
> > On Fri, 25 Apr 2014, Dave Hansen wrote:
> >> On 04/25/2014 05:01 AM, Hugh Dickins wrote:
> >>> Er, i_mmap_mutex.
> >>>
> >>> That's what unmap_mapping_range(), and page_mkclean()'s rmap_walk,
> >>> take to iterate over the file vmas.  So perhaps there's no race at all
> >>> in the unmap_mapping_range() case.  And easy (I imagine) to fix the
> >>> race in Dave's racewrite.c use of MADV_DONTNEED: untested patch below.
> >>
> >> Do you want some testing on this?
> > 
> > Yes, please do: I just haven't gotten around to cloning the git
> > tree and trying it.  It's quite likely that we shall go Linus's
> > way rather than this, but still useful to have the information
> > as to whether this way really is viable.
> 
> Your patch works fine for the madvise() case.  The effect appears the
> same as Linus's to my test case at least.  I didn't test any unmaps or
> other creative uses of unmap_mapping_range().

Thanks a lot for checking that, Dave, I'm glad to hear it worked.

Right, that patch only addressed the MADV_DONTNEED case: I've now
extended it, reverting the change in madvise.c, and doing it in
unmap_single_vma() instead, to cover all the cases.

So here is my alternative to Linus's "split 'tlb_flush_mmu()'" patch.
I don't really have a preference between the two approaches, and it
looks like Linus is now happy with his, so I don't expect this one to
go anywhere; unless someone else can see a significant advantage to it.
Not very thoroughly tested, I should add.

[PATCH] mm: unmap_single_vma take i_mmap_mutex

unmap_single_vma() take i_mmap_mutex on VM_SHARED mapping, and do the
tlb_flush_mmu() before releasing it, so that other cpus cannot modify
pages while they might be written as clean; but unmap_mapping_range()
already has i_mmap_mutex, so exclude that by a note in zap_details.

Signed-off-by: Hugh Dickins <hughd@google.com>
---

 include/linux/mm.h |    1 +
 mm/memory.c        |   38 ++++++++++++++++++++++++++++++++------
 2 files changed, 33 insertions(+), 6 deletions(-)

--- 3.15-rc2/include/linux/mm.h	2014-04-13 17:24:36.120507176 -0700
+++ linux/include/linux/mm.h	2014-04-25 17:17:01.740484354 -0700
@@ -1073,6 +1073,7 @@ struct zap_details {
 	struct address_space *check_mapping;	/* Check page->mapping if set */
 	pgoff_t	first_index;			/* Lowest page->index to unmap */
 	pgoff_t last_index;			/* Highest page->index to unmap */
+	bool mutex_is_held;	/* unmap_mapping_range() holds i_mmap_mutex */
 };
 
 struct page *vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
--- 3.15-rc2/mm/memory.c	2014-04-13 17:24:36.656507188 -0700
+++ linux/mm/memory.c	2014-04-25 19:01:42.564633627 -0700
@@ -1294,12 +1294,12 @@ static void unmap_page_range(struct mmu_
 	mem_cgroup_uncharge_end();
 }
 
-
 static void unmap_single_vma(struct mmu_gather *tlb,
 		struct vm_area_struct *vma, unsigned long start_addr,
 		unsigned long end_addr,
 		struct zap_details *details)
 {
+	struct mutex *mutex;
 	unsigned long start = max(vma->vm_start, start_addr);
 	unsigned long end;
 
@@ -1329,12 +1329,38 @@ static void unmap_single_vma(struct mmu_
 			 * safe to do nothing in this case.
 			 */
 			if (vma->vm_file) {
-				mutex_lock(&vma->vm_file->f_mapping->i_mmap_mutex);
-				__unmap_hugepage_range_final(tlb, vma, start, end, NULL);
-				mutex_unlock(&vma->vm_file->f_mapping->i_mmap_mutex);
+				mutex = &vma->vm_file->f_mapping->i_mmap_mutex;
+				mutex_lock(mutex);
+				__unmap_hugepage_range_final(tlb, vma, start,
+								end, NULL);
+				mutex_unlock(mutex);
+			}
+		} else {
+			/*
+			 * When unmapping a shared writable mapping, we must
+			 * take care that TLB is flushed on other cpus running
+			 * this mm, before page_mkclean() or page reclaim loses
+			 * this vma from its rmap walk: otherwise another cpu
+			 * could modify page while it's being written as clean.
+			 * unmap_mapping_range() already holds i_mmap_mutex
+			 * preventing that, we must take it for other cases.
+			 */
+			mutex = NULL;
+			if (vma->vm_file && (vma->vm_flags & VM_SHARED) &&
+			    (!details || !details->mutex_is_held)) {
+				mutex = &vma->vm_file->f_mapping->i_mmap_mutex;
+				mutex_lock(mutex);
 			}
-		} else
 			unmap_page_range(tlb, vma, start, end, details);
+			if (mutex) {
+				unsigned long old_end = tlb->end;
+				tlb->end = end;
+				tlb_flush_mmu(tlb);
+				tlb->start = end;
+				tlb->end = old_end;
+				mutex_unlock(mutex);
+			}
+		}
 	}
 }
 
@@ -3009,7 +3035,7 @@ void unmap_mapping_range(struct address_
 	details.last_index = hba + hlen - 1;
 	if (details.last_index < details.first_index)
 		details.last_index = ULONG_MAX;
-
+	details.mutex_is_held = true;
 
 	mutex_lock(&mapping->i_mmap_mutex);
 	if (unlikely(!RB_EMPTY_ROOT(&mapping->i_mmap)))

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: Dirty/Access bits vs. page content
  2014-04-26  3:11                                                         ` Hugh Dickins
@ 2014-04-26  3:48                                                           ` Linus Torvalds
  0 siblings, 0 replies; 49+ messages in thread
From: Linus Torvalds @ 2014-04-26  3:48 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Dave Hansen, H. Peter Anvin, Benjamin Herrenschmidt,
	Peter Zijlstra, Jan Kara, linux-arch@vger.kernel.org, linux-mm,
	Russell King - ARM Linux, Tony Luck

On Fri, Apr 25, 2014 at 8:11 PM, Hugh Dickins <hughd@google.com> wrote:
>
> So here is my alternative to Linus's "split 'tlb_flush_mmu()'" patch.
> I don't really have a preference between the two approaches, and it
> looks like Linus is now happy with his, so I don't expect this one to
> go anywhere; unless someone else can see a significant advantage to it.

Hmm. I like that it's smaller and doesn't need any arch support.

I really dislike that 'details.mutex_is_held' flag, though. I dislike
pretty much *all* of details, but that one just bugs me extra much,
because it's basically static call chain information, and it really
smells like it should be possible to just do this all in the (few)
callers instead of having a flag about the one caller that did it.

In general, I hate conditional locking.  And here the conditionals are
getting pretty odd and complex.

             Linus

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: Dirty/Access bits vs. page content
  2014-04-25 12:01                                                 ` Hugh Dickins
  2014-04-25 13:51                                                   ` Peter Zijlstra
  2014-04-25 16:54                                                   ` Dave Hansen
@ 2014-04-25 17:56                                                   ` Linus Torvalds
  2014-04-25 19:13                                                     ` Hugh Dickins
  2 siblings, 1 reply; 49+ messages in thread
From: Linus Torvalds @ 2014-04-25 17:56 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: H. Peter Anvin, Benjamin Herrenschmidt, Peter Zijlstra, Jan Kara,
	Dave Hansen, linux-arch@vger.kernel.org, linux-mm,
	Russell King - ARM Linux, Tony Luck

On Fri, Apr 25, 2014 at 5:01 AM, Hugh Dickins <hughd@google.com> wrote:
>
> Two, Ben said earlier that he's more worried about users of
> unmap_mapping_range() than concurrent munmap(); and you said
> earlier that you would almost prefer to have some special lock
> to serialize with page_mkclean().
>
> Er, i_mmap_mutex.
>
> That's what unmap_mapping_range(), and page_mkclean()'s rmap_walk,
> take to iterate over the file vmas.  So perhaps there's no race at all
> in the unmap_mapping_range() case.  And easy (I imagine) to fix the
> race in Dave's racewrite.c use of MADV_DONTNEED: untested patch below.

Hmm. unmap_mapping_range() is just abotu the only thing that _does_
take i_mmap_mutex. unmap_single_vma() does it for
is_vm_hugetlb_page(), which is a bit confusing. And normally we only
take it for the actual final vma link/unlink, not for the actual
traversal. So we'd have to change that all quite radically (or we'd
have to drop and re-take it).

So I'm not quite convinced. Your simple patch looks simple and should
certainly fix DaveH's test-case, but then leaves munmap/exit as a
separate thing to fix. And I don't see how to do that cleanly (it
really looks like "we'll just have to take that semaphore again
separately).

i_mmap_mutex is likely not contended, but we *do* take it for private
mappings too (and for read-only ones), so this lock is actually much
more common than the dirty shared mapping.

So I think I prefer my patch, even if that may be partly due to just
it being mine ;)

          Linus

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: Dirty/Access bits vs. page content
  2014-04-25 17:56                                                   ` Linus Torvalds
@ 2014-04-25 19:13                                                     ` Hugh Dickins
  0 siblings, 0 replies; 49+ messages in thread
From: Hugh Dickins @ 2014-04-25 19:13 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: H. Peter Anvin, Benjamin Herrenschmidt, Peter Zijlstra, Jan Kara,
	Dave Hansen, linux-arch@vger.kernel.org, linux-mm,
	Russell King - ARM Linux, Tony Luck

On Fri, 25 Apr 2014, Linus Torvalds wrote:
> On Fri, Apr 25, 2014 at 5:01 AM, Hugh Dickins <hughd@google.com> wrote:
> >
> > Two, Ben said earlier that he's more worried about users of
> > unmap_mapping_range() than concurrent munmap(); and you said
> > earlier that you would almost prefer to have some special lock
> > to serialize with page_mkclean().
> >
> > Er, i_mmap_mutex.
> >
> > That's what unmap_mapping_range(), and page_mkclean()'s rmap_walk,
> > take to iterate over the file vmas.  So perhaps there's no race at all
> > in the unmap_mapping_range() case.  And easy (I imagine) to fix the
> > race in Dave's racewrite.c use of MADV_DONTNEED: untested patch below.
> 
> Hmm. unmap_mapping_range() is just abotu the only thing that _does_
> take i_mmap_mutex. unmap_single_vma() does it for
> is_vm_hugetlb_page(), which is a bit confusing. And normally we only
> take it for the actual final vma link/unlink, not for the actual
> traversal. So we'd have to change that all quite radically (or we'd
> have to drop and re-take it).
> 
> So I'm not quite convinced. Your simple patch looks simple and should
> certainly fix DaveH's test-case, but then leaves munmap/exit as a
> separate thing to fix. And I don't see how to do that cleanly (it
> really looks like "we'll just have to take that semaphore again
> separately).

Yes, mine is quite nice for the MADV_DONTNEED case, but needs more
complication to handle munmap/exit.  I don't want to drop and retake,
I'm hoping we can decide what to do via the zap_details.

It would still be messy that sometimes we come in with the mutex
and sometimes we take it inside; but then it's already messy that
sometimes we have it and sometimes we don't.

I'll try extending the patch to munmap/exit in a little bit, and
send out the result for comparison later today. 

> 
> i_mmap_mutex is likely not contended, but we *do* take it for private
> mappings too (and for read-only ones), so this lock is actually much
> more common than the dirty shared mapping.

We only need to take it in the (presumed rare beyond shared memory)
VM_SHARED case.  I'm not very keen on adding a mutex into the exit
path, but at least this one used to be a spinlock, and there should
be nowhere that tries to allocate memory while holding it (I'm wary
of adding OOM-kill deadlocks) - aside from tlb_next_batch()'s
GFP_NOWAIT|__GFP_NOWARN attempts.

> 
> So I think I prefer my patch, even if that may be partly due to just
> it being mine ;)

Yes, fair enough, I'm not against it: I just felt rather ashamed of
going on about page_mkclean() protocol and ptlock, while forgetting
all about i_mmap_mutex.  Let's see how beautiful mine turns out ;)

And it may be worth admitting that here we avoid CONFIG_PREEMPT,
and have been bitten in the past by free_pages_and_swap_cache()
latencies, so drastically reduced MAX_GATHER_BATCH_COUNT: so we
wouldn't expect to see much hit from your more frequent TLB flushes.

I must answer Peter now...

Hugh

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: Dirty/Access bits vs. page content
  2014-04-24 23:46                                     ` Linus Torvalds
  2014-04-25  1:37                                       ` Benjamin Herrenschmidt
@ 2014-04-25 16:30                                       ` Dave Hansen
  1 sibling, 0 replies; 49+ messages in thread
From: Dave Hansen @ 2014-04-25 16:30 UTC (permalink / raw)
  To: Linus Torvalds, Hugh Dickins
  Cc: Peter Zijlstra, Jan Kara, H. Peter Anvin, Benjamin Herrenschmidt,
	linux-arch@vger.kernel.org, linux-mm, Russell King - ARM Linux,
	Tony Luck

On 04/24/2014 04:46 PM, Linus Torvalds wrote:
> IOW, how about the attached patch that entirely replaces my previous
> two patches. DaveH - does this fix your test-case, while _not_
> introducing any new BUG_ON() triggers?
> 
> I didn't test the patch, maybe I did something stupid. It compiles for
> me, but it only works for the HAVE_GENERIC_MMU_GATHER case, but
> introducing tlb_flush_mmu_tlbonly() and tlb_flush_mmu_free() into the
> non-generic cases should be trivial, since they really are just that
> old "tlb_flush_mmu()" function split up (the tlb_flush_mmu() function
> remains available for other non-forced flush users)
> 
> So assuming this does work for DaveH, then the arm/ia64/um/whatever
> people would need to do those trivial transforms too, but it really
> shouldn't be too painful.

It looks happy on both my debugging kernel (which was triggering it
before) and the one without lockdep and all the things that normally
slow it down and change timing.


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: Dirty/Access bits vs. page content
  2014-04-23 18:41                         ` Jan Kara
  2014-04-23 19:33                           ` Linus Torvalds
@ 2014-04-23 20:11                           ` Hugh Dickins
  2014-04-24  8:49                             ` Jan Kara
  1 sibling, 1 reply; 49+ messages in thread
From: Hugh Dickins @ 2014-04-23 20:11 UTC (permalink / raw)
  To: Jan Kara
  Cc: Linus Torvalds, Peter Zijlstra, Dave Hansen, H. Peter Anvin,
	Benjamin Herrenschmidt, linux-arch@vger.kernel.org, linux-mm,
	Russell King - ARM Linux, Tony Luck

On Wed, 23 Apr 2014, Jan Kara wrote:
> On Tue 22-04-14 20:08:59, Hugh Dickins wrote:
> > On Tue, 22 Apr 2014, Linus Torvalds wrote:
> > > On Tue, Apr 22, 2014 at 12:54 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> > > That said, Dave Hansen did report a BUG_ON() in
> > > mpage_prepare_extent_to_map(). His line number was odd, but I assume
> > > it's this one:
> > > 
> > >         BUG_ON(PageWriteback(page));
> > > 
> > > which may be indicative of some oddity here wrt the dirty bit.
> > 
> > Whereas later mail from Dave showed it to be the
> > 	BUG_ON(!PagePrivate(page));
> > in page_buffers() from fs/ext4/inode.c mpage_prepare_extent_to_map().
> > But still presumably some kind of fallout from your patches.
> > 
> > Once upon a time there was a page_has_buffers() check in there,
> > but Honza decided that's nowadays unnecessary in f8bec37037ac
> > "ext4: dirty page has always buffers attached".  Cc'ed,
> > he may very well have some good ideas.
> > 
> > Reading that commit reminded me of how we actually don't expect that
> > set_page_dirty() in zap_pte_range() to do anything at all on the usual
> > mapping_cap_account_dirty()/page_mkwrite() filesystems, do we?  Or do we?
>   Yes, for shared file mappings we (as in filesystems implementing
> page_mkwrite() handler) expect a page is writeably mmapped iff the page is
> dirty. So in particular we don't expect set_page_dirty() in zap_pte_range()
> to do anything because if the pte has dirty bit set, we are tearing down a
> writeable mapping of the page and thus the page should be already dirty.
> 
> Now the devil is in synchronization of different places where transitions
> from/to writeably-mapped state happen. In the fault path (do_wp_page())
> where transition to writeably-mapped happens we hold page lock while
> calling set_page_dirty(). In the writeout path (clear_page_dirty_for_io())
> where we transition from writeably-mapped we hold the page lock as well
> while calling page_mkclean() and possibly set_page_dirty(). So these two
> places are nicely serialized. However zap_pte_range() doesn't hold page
> lock so it can race with the previous two places. Before Linus' patches we
> called set_page_dirty() under pte lock in zap_pte_range() and also before
> decrementing page->mapcount. So if zap_pte_range() raced with
> clear_page_dirty_for_io() we were guaranteed that by the time
> clear_page_dirty_for_io() returns, pte dirty bit is cleared and
> set_page_dirty() was called (either from clear_page_dirty_for_io() or from
> zap_pte_range()).
> 
> However with Linus' patches set_page_dirty() from zap_pte_range() gets
> called after decremeting page->mapcount so page_mkclean() won't event try
> to walk rmap. And even if page_mkclean() did walk the rmap, zap_pte_range()
> calls set_page_dirty() after dropping pte lock so it can get called long
> after page_mkclean() (and clear_page_dirty_for_io()) has returned.

Right, thanks a lot for fleshing that out.

> 
> Now I'm not sure how to fix Linus' patches. For all I care we could just
> rip out pte dirty bit handling for file mappings. However last time I
> suggested this you corrected me that tmpfs & ramfs need this. I assume this
> is still the case - however, given we unconditionally mark the page dirty
> for write faults, where exactly do we need this?

Good, Linus has already replied to you on this this: you appear to be
suggesting that there would be no issue, and Linus's patches would not
be needed at all, if only tmpfs and ramfs played by the others' rules.

But (sadly) I don't think that's so: just because zap_pte_range()'s
current "if (pte_dirty) set_page_dirty" does nothing on most filesystems,
does not imply that nothing needs to be done on most filesystems, now
that we're alert to the delayed TLB flushing issue.

Just to answer your (interesting but irrelevant!) question about tmpfs
and ramfs: their issue is with read faults which bring in a zeroed page,
with page and pte not marked dirty.  If userspace modifies that page, the
pte_dirty needs to be propagated through to PageDirty, to prevent page
reclaim from simply freeing the apparently clean page.

(To be honest, I haven't checked the ramfs case recently: perhaps its
pages are marked unevictable, and never even reach the code which might
free them.)

Hugh

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: Dirty/Access bits vs. page content
  2014-04-23 20:11                           ` Hugh Dickins
@ 2014-04-24  8:49                             ` Jan Kara
  0 siblings, 0 replies; 49+ messages in thread
From: Jan Kara @ 2014-04-24  8:49 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Jan Kara, Linus Torvalds, Peter Zijlstra, Dave Hansen,
	H. Peter Anvin, Benjamin Herrenschmidt,
	linux-arch@vger.kernel.org, linux-mm, Russell King - ARM Linux,
	Tony Luck

On Wed 23-04-14 13:11:20, Hugh Dickins wrote:
> On Wed, 23 Apr 2014, Jan Kara wrote:
> > Now I'm not sure how to fix Linus' patches. For all I care we could just
> > rip out pte dirty bit handling for file mappings. However last time I
> > suggested this you corrected me that tmpfs & ramfs need this. I assume this
> > is still the case - however, given we unconditionally mark the page dirty
> > for write faults, where exactly do we need this?
> 
> Good, Linus has already replied to you on this this: you appear to be
> suggesting that there would be no issue, and Linus's patches would not
> be needed at all, if only tmpfs and ramfs played by the others' rules.
  No, that's not what I wanted to say. I wanted to say - keep Linus'
patches and additionally rip out pte dirty bit handling for "normal"
filesystems to not confuse filesystems with dirty pages where they don't
expect them. But after reading replies and thinking about it even that is
not enough because then we would again miss writes done by other cpus after
we tore down rmap but before we flushed TLBs.

> But (sadly) I don't think that's so: just because zap_pte_range()'s
> current "if (pte_dirty) set_page_dirty" does nothing on most filesystems,
> does not imply that nothing needs to be done on most filesystems, now
> that we're alert to the delayed TLB flushing issue.
> 
> Just to answer your (interesting but irrelevant!) question about tmpfs
> and ramfs: their issue is with read faults which bring in a zeroed page,
> with page and pte not marked dirty.  If userspace modifies that page, the
> pte_dirty needs to be propagated through to PageDirty, to prevent page
> reclaim from simply freeing the apparently clean page.
  Ah, I keep forgetting about that vma_wants_writenotify() thing in mmap
which changes whether a shared page is mapped RW or RO on read fault.

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 49+ messages in thread

end of thread, other threads:[~2014-04-28 10:14 UTC | newest]

Thread overview: 49+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
     [not found] <1398032742.19682.11.camel@pasglop>
     [not found] ` <CA+55aFz1sK+PF96LYYZY7OB7PBpxZu-uNLWLvPiRz-tJsBqX3w@mail.gmail.com>
     [not found]   ` <1398054064.19682.32.camel@pasglop>
     [not found]     ` <1398057630.19682.38.camel@pasglop>
     [not found]       ` <CA+55aFwWHBtihC3w9E4+j4pz+6w7iTnYhTf4N3ie15BM9thxLQ@mail.gmail.com>
     [not found]         ` <53558507.9050703@zytor.com>
     [not found]           ` <CA+55aFxGm6J6N=4L7exLUFMr1_siNGHpK=wApd9GPCH1=63PPA@mail.gmail.com>
     [not found]             ` <53559F48.8040808@intel.com>
2014-04-22  0:31               ` Dirty/Access bits vs. page content Linus Torvalds
2014-04-22  0:44                 ` Linus Torvalds
2014-04-22  5:15                   ` Tony Luck
2014-04-22 14:55                     ` Linus Torvalds
2014-04-22  7:34                   ` Peter Zijlstra
2014-04-22  7:54                   ` Peter Zijlstra
2014-04-22 21:36                     ` Linus Torvalds
2014-04-22 21:46                       ` Dave Hansen
2014-04-22 22:08                         ` Linus Torvalds
2014-04-22 22:41                           ` Dave Hansen
2014-04-23  2:44                         ` Linus Torvalds
2014-04-23  3:08                       ` Hugh Dickins
2014-04-23  4:23                         ` Linus Torvalds
2014-04-23  6:14                           ` Benjamin Herrenschmidt
2014-04-23 18:41                         ` Jan Kara
2014-04-23 19:33                           ` Linus Torvalds
2014-04-24  6:51                             ` Peter Zijlstra
2014-04-24 18:40                               ` Hugh Dickins
2014-04-24 19:45                                 ` Linus Torvalds
2014-04-24 20:02                                   ` Hugh Dickins
2014-04-24 23:46                                     ` Linus Torvalds
2014-04-25  1:37                                       ` Benjamin Herrenschmidt
2014-04-25  2:41                                         ` Benjamin Herrenschmidt
2014-04-25  2:46                                           ` Linus Torvalds
2014-04-25  2:50                                             ` H. Peter Anvin
2014-04-25  3:03                                               ` Linus Torvalds
2014-04-25 12:01                                                 ` Hugh Dickins
2014-04-25 13:51                                                   ` Peter Zijlstra
2014-04-25 19:41                                                     ` Hugh Dickins
2014-04-26 18:07                                                       ` Peter Zijlstra
2014-04-27  7:20                                                         ` Peter Zijlstra
2014-04-27 12:20                                                           ` Hugh Dickins
2014-04-27 19:33                                                             ` Peter Zijlstra
2014-04-27 19:47                                                               ` Linus Torvalds
2014-04-27 20:09                                                             ` Hugh Dickins
2014-04-28  9:25                                                               ` Peter Zijlstra
2014-04-28 10:14                                                                 ` Peter Zijlstra
2014-04-27 16:21                                                           ` Linus Torvalds
2014-04-27 23:13                                                             ` Benjamin Herrenschmidt
2014-04-25 16:54                                                   ` Dave Hansen
2014-04-25 18:41                                                     ` Hugh Dickins
2014-04-25 22:00                                                       ` Dave Hansen
2014-04-26  3:11                                                         ` Hugh Dickins
2014-04-26  3:48                                                           ` Linus Torvalds
2014-04-25 17:56                                                   ` Linus Torvalds
2014-04-25 19:13                                                     ` Hugh Dickins
2014-04-25 16:30                                       ` Dave Hansen
2014-04-23 20:11                           ` Hugh Dickins
2014-04-24  8:49                             ` Jan Kara

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).