linux-mm.kvack.org archive mirror
* [RFC 0/6] MADV_FREE: respect pte_dirty, not PG_dirty.
@ 2015-06-03  6:15 Minchan Kim
  2015-06-03  6:15 ` [RFC 1/6] mm: keep dirty bit on KSM page Minchan Kim
                   ` (5 more replies)
  0 siblings, 6 replies; 12+ messages in thread
From: Minchan Kim @ 2015-06-03  6:15 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Hugh Dickins, Rik van Riel, Mel Gorman, Michal Hocko,
	Johannes Weiner, linux-mm, linux-kernel, Minchan Kim

MADV_FREE relies on the dirty bit in the page table entry to decide
whether the VM may discard the page or not.
IOW, if the page table entry has the dirty bit set, the VM must not
discard the page.

However, if, for example, a page is swapped in by a read fault, its
page table entry does not have the dirty bit set, so MADV_FREE might
discard the page wrongly.

To avoid the problem, MADV_FREE did extra checks with PageDirty
and PageSwapCache. That worked because a swapped-in page lives in
the swap cache, and once it has been evicted from the swap cache
the page carries the PG_dirty flag. So both page flag checks
effectively prevent wrong discarding by MADV_FREE.

A problem with the above logic is that a swapped-in page keeps
PG_dirty once it has been removed from the swap cache, so the VM
cannot consider such pages freeable any more even if madvise_free
is called later. Look at the example below for details.

ptr = malloc();
memset(ptr);
..
..
.. heavy memory pressure, so all of the pages are swapped out
..
..
var = *ptr; -> a page is swapped in and removed from the swapcache.
               The page table entry does not have the dirty bit set,
               but the page descriptor carries PG_dirty.
..
..
madvise_free(ptr);
..
..
..
.. heavy memory pressure again.
.. At this point, the VM cannot discard the page because it
.. has *PG_dirty*

Rather than relying on PG_dirty in the page descriptor to prevent
discarding a page, using the dirty bit in the page table is more
straightforward and simpler.

So, this patchset keeps the page table entry's dirty bit set so that
MADV_FREE does not need to take care of PG_dirty.
For that, it fixes several cases (e.g. KSM, migration, swap-in,
swapoff) and then, finally, simplifies MADV_FREE.

With this, the complicated logic goes away and the freeable-page
check done for madvise_free becomes simple (i.e. +90/-108).
Of course, it also solves the PG_dirty problem mentioned above.

I tested this patchset (memcg, tmpfs, swapon/off, THP, KSM) and
found no problems, but it still needs careful review.
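
For reference, below is a minimal userspace sketch (not part of this
patchset) of how MADV_FREE is meant to be used. It assumes the running
kernel and libc headers export MADV_FREE; the fallback value and the
sizes used here are only illustrative.

#include <sys/mman.h>
#include <string.h>
#include <unistd.h>

#ifndef MADV_FREE
#define MADV_FREE 8	/* illustrative fallback, not necessarily the final ABI value */
#endif

int main(void)
{
	size_t len = 64 * sysconf(_SC_PAGESIZE);
	char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (buf == MAP_FAILED)
		return 1;

	memset(buf, 0xaa, len);		/* writes dirty the ptes */

	/*
	 * Hint that the contents are disposable. Under memory pressure the
	 * kernel may discard these pages; a later read of a discarded page
	 * returns zero-filled memory, while a write re-dirties the pte and
	 * cancels the hint for that page.
	 */
	madvise(buf, len, MADV_FREE);

	buf[0] = 1;	/* this page is dirty again, so it won't be discarded */

	munmap(buf, len);
	return 0;
}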

Minchan Kim (6):
  mm: keep dirty bit on KSM page
  mm: keep dirty bit on anonymous page migration
  mm: mark dirty bit on swapped-in page
  mm: mark dirty bit on unuse_pte
  mm: decouple PG_dirty from MADV_FREE
  mm: MADV_FREE refactoring

 include/linux/rmap.h |  9 ++----
 mm/ksm.c             | 19 ++++++++++---
 mm/madvise.c         | 13 ---------
 mm/memory.c          |  6 ++--
 mm/migrate.c         |  4 +++
 mm/rmap.c            | 78 +++++++++++++++++++++++++---------------------------
 mm/swapfile.c        |  6 +++-
 mm/vmscan.c          | 63 ++++++++++++++----------------------------
 8 files changed, 90 insertions(+), 108 deletions(-)

-- 
1.9.1


^ permalink raw reply	[flat|nested] 12+ messages in thread

* [RFC 1/6] mm: keep dirty bit on KSM page
  2015-06-03  6:15 [RFC 0/6] MADV_FREE: respect pte_dirty, not PG_dirty Minchan Kim
@ 2015-06-03  6:15 ` Minchan Kim
  2015-06-03  6:15 ` [RFC 2/6] mm: keep dirty bit on anonymous page migration Minchan Kim
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 12+ messages in thread
From: Minchan Kim @ 2015-06-03  6:15 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Hugh Dickins, Rik van Riel, Mel Gorman, Michal Hocko,
	Johannes Weiner, linux-mm, linux-kernel, Minchan Kim

I encountered a segfault in a test program while testing MADV_FREE
with KSM. Investigation showed the following:

1. A KSM page is mapped into the page tables of processes A and B with
   !pte_dirty (but write_protect_page marked the page PG_dirty when
   pte_dirty was set).

2. MADV_FREE in process A can remove the page from the swap cache
   if it was there, and then clears *PG_dirty* to indicate that we
   could discard it instead of swapping it out.

3. So, the KSM page's status is !pte_dirty in processes A and B &&
   !PageDirty.

4. The VM judges it a freeable page and discards it.

5. Process B hits a segfault even though B never called MADV_FREE.

Clearing PG_dirty after an anonymous page is removed from the swap
cache was not a problem from an integrity point of view for a private
page (i.e. a normal anon page, not KSM). The worst case it caused was
an unnecessary writeout, which we could have avoided if the same data
was already on swap.

However, with the introduction of MADV_FREE it can cause the problem
above, so this patch fixes it by keeping the dirty bit of the page
table entry when the page is replaced with a KSM page.

Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Minchan Kim <minchan@kernel.org>
---
 mm/ksm.c | 19 +++++++++++++++----
 1 file changed, 15 insertions(+), 4 deletions(-)

diff --git a/mm/ksm.c b/mm/ksm.c
index bc7be0ee2080..9c07346e57f2 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -901,9 +901,8 @@ static int write_protect_page(struct vm_area_struct *vma, struct page *page,
 			set_pte_at(mm, addr, ptep, entry);
 			goto out_unlock;
 		}
-		if (pte_dirty(entry))
-			set_page_dirty(page);
-		entry = pte_mkclean(pte_wrprotect(entry));
+
+		entry = pte_wrprotect(entry);
 		set_pte_at_notify(mm, addr, ptep, entry);
 	}
 	*orig_pte = *ptep;
@@ -932,11 +931,13 @@ static int replace_page(struct vm_area_struct *vma, struct page *page,
 	struct mm_struct *mm = vma->vm_mm;
 	pmd_t *pmd;
 	pte_t *ptep;
+	pte_t entry;
 	spinlock_t *ptl;
 	unsigned long addr;
 	int err = -EFAULT;
 	unsigned long mmun_start;	/* For mmu_notifiers */
 	unsigned long mmun_end;		/* For mmu_notifiers */
+	bool dirty;
 
 	addr = page_address_in_vma(page, vma);
 	if (addr == -EFAULT)
@@ -956,12 +957,22 @@ static int replace_page(struct vm_area_struct *vma, struct page *page,
 		goto out_mn;
 	}
 
+	dirty = pte_dirty(*ptep);
 	get_page(kpage);
 	page_add_anon_rmap(kpage, vma, addr);
 
 	flush_cache_page(vma, addr, pte_pfn(*ptep));
 	ptep_clear_flush_notify(vma, addr, ptep);
-	set_pte_at_notify(mm, addr, ptep, mk_pte(kpage, vma->vm_page_prot));
+
+	entry = mk_pte(kpage, vma->vm_page_prot);
+	/*
+	 * Keep a dirty bit to prevent a KSM page sudden freeing
+	 * by MADV_FREE.
+	 */
+	if (dirty)
+		entry = pte_mkdirty(entry);
+
+	set_pte_at_notify(mm, addr, ptep, entry);
 
 	page_remove_rmap(page);
 	if (!page_mapped(page))
-- 
1.9.1


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [RFC 2/6] mm: keep dirty bit on anonymous page migration
  2015-06-03  6:15 [RFC 0/6] MADV_FREE: respect pte_dirty, not PG_dirty Minchan Kim
  2015-06-03  6:15 ` [RFC 1/6] mm: keep dirty bit on KSM page Minchan Kim
@ 2015-06-03  6:15 ` Minchan Kim
  2015-06-03  6:15 ` [RFC 3/6] mm: mark dirty bit on swapped-in page Minchan Kim
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 12+ messages in thread
From: Minchan Kim @ 2015-06-03  6:15 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Hugh Dickins, Rik van Riel, Mel Gorman, Michal Hocko,
	Johannes Weiner, linux-mm, linux-kernel, Minchan Kim

Currently, if the VM migrates an anonymous page, we lose the dirty
bit of the page table entry; instead, the VM translates the page
table dirty bit into PG_dirty in the page flags. That was okay
because the page table dirty bit of an anonymous page did not matter
for swap-out; the VM took care of PG_dirty instead.

However, with the introduction of MADV_FREE, it is important to keep
the page table's dirty bit because that makes the MADV_FREE handling
logic more straightforward, without having to take care of PG_dirty.

This patch prepares for removing the PG_dirty check for MADV_FREE.

Signed-off-by: Minchan Kim <minchan@kernel.org>
---
 mm/migrate.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/mm/migrate.c b/mm/migrate.c
index 236ee25e79d9..add30c3aaaa9 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -151,6 +151,10 @@ static int remove_migration_pte(struct page *new, struct vm_area_struct *vma,
 	if (is_write_migration_entry(entry))
 		pte = maybe_mkwrite(pte, vma);
 
+	/* MADV_FREE relies on pte_dirty. */
+	if (PageAnon(new))
+		pte = pte_mkdirty(pte);
+
 #ifdef CONFIG_HUGETLB_PAGE
 	if (PageHuge(new)) {
 		pte = pte_mkhuge(pte);
-- 
1.9.1


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [RFC 3/6] mm: mark dirty bit on swapped-in page
  2015-06-03  6:15 [RFC 0/6] MADV_FREE: respect pte_dirty, not PG_dirty Minchan Kim
  2015-06-03  6:15 ` [RFC 1/6] mm: keep dirty bit on KSM page Minchan Kim
  2015-06-03  6:15 ` [RFC 2/6] mm: keep dirty bit on anonymous page migration Minchan Kim
@ 2015-06-03  6:15 ` Minchan Kim
  2015-06-09 19:07   ` Cyrill Gorcunov
  2015-06-03  6:15 ` [RFC 4/6] mm: mark dirty bit on unuse_pte Minchan Kim
                   ` (2 subsequent siblings)
  5 siblings, 1 reply; 12+ messages in thread
From: Minchan Kim @ 2015-06-03  6:15 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Hugh Dickins, Rik van Riel, Mel Gorman, Michal Hocko,
	Johannes Weiner, linux-mm, linux-kernel, Minchan Kim,
	Cyrill Gorcunov, Pavel Emelyanov, Yalin Wang

Basically, MADV_FREE relies on the dirty bit in the page table entry
to decide whether the VM may discard the page or not.
IOW, if the page table entry has the dirty bit set, the VM must not
discard the page.

However, if a page is swapped in by a read fault, its page table
entry does not have the dirty bit set, so MADV_FREE might discard
the page wrongly.

To fix the problem, this patch marks the page table entry of a page
being swapped in as dirty so that the VM does not discard the page
suddenly under us.

From the MADV_FREE point of view, marking the pte dirty
unconditionally is no problem because we drop the swapped page in the
MADV_FREE syscall context (i.e. see madvise_free_pte_range), so no
page being swapped in is an MADV_FREE-hinted page.

Cc: Hugh Dickins <hughd@google.com>
Cc: Cyrill Gorcunov <gorcunov@gmail.com>
Cc: Pavel Emelyanov <xemul@parallels.com>
Reported-by: Yalin Wang <yalin.wang@sonymobile.com>
Signed-off-by: Minchan Kim <minchan@kernel.org>
---
 mm/memory.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 8a2fc9945b46..d1709f763152 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2557,9 +2557,11 @@ static int do_swap_page(struct mm_struct *mm, struct vm_area_struct *vma,
 
 	inc_mm_counter_fast(mm, MM_ANONPAGES);
 	dec_mm_counter_fast(mm, MM_SWAPENTS);
-	pte = mk_pte(page, vma->vm_page_prot);
+
+	/* Mark dirty bit of page table because MADV_FREE relies on it */
+	pte = pte_mkdirty(mk_pte(page, vma->vm_page_prot));
 	if ((flags & FAULT_FLAG_WRITE) && reuse_swap_page(page)) {
-		pte = maybe_mkwrite(pte_mkdirty(pte), vma);
+		pte = maybe_mkwrite(pte, vma);
 		flags &= ~FAULT_FLAG_WRITE;
 		ret |= VM_FAULT_WRITE;
 		exclusive = 1;
-- 
1.9.1


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [RFC 4/6] mm: mark dirty bit on unuse_pte
  2015-06-03  6:15 [RFC 0/6] MADV_FREE: respect pte_dirty, not PG_dirty Minchan Kim
                   ` (2 preceding siblings ...)
  2015-06-03  6:15 ` [RFC 3/6] mm: mark dirty bit on swapped-in page Minchan Kim
@ 2015-06-03  6:15 ` Minchan Kim
  2015-06-03  6:15 ` [RFC 5/6] mm: decouple PG_dirty from MADV_FREE Minchan Kim
  2015-06-03  6:15 ` [RFC 6/6] mm: MADV_FREE refactoring Minchan Kim
  5 siblings, 0 replies; 12+ messages in thread
From: Minchan Kim @ 2015-06-03  6:15 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Hugh Dickins, Rik van Riel, Mel Gorman, Michal Hocko,
	Johannes Weiner, linux-mm, linux-kernel, Minchan Kim

Basically, MADV_FREE relies on the dirty bit in the page table entry
to decide whether the VM may discard the page or not.
IOW, if the page table entry has the dirty bit set, the VM must not
discard the page.

However, if swapoff happens, the page table entry does not have the
dirty bit set, so MADV_FREE might discard the page wrongly.

To fix the problem, this patch marks the page table entry of the page
as dirty when swapoff happens, so that the VM does not discard the
page suddenly under us.

From the MADV_FREE point of view, marking the pte dirty
unconditionally is no problem because we drop the swapped page in the
MADV_FREE syscall context (i.e. see madvise_free_pte_range), so no
page being swapped in is an MADV_FREE-hinted page.

Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Minchan Kim <minchan@kernel.org>
---
 mm/swapfile.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/mm/swapfile.c b/mm/swapfile.c
index a7e72103f23b..cc8b79ab2190 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -1118,8 +1118,12 @@ static int unuse_pte(struct vm_area_struct *vma, pmd_t *pmd,
 	dec_mm_counter(vma->vm_mm, MM_SWAPENTS);
 	inc_mm_counter(vma->vm_mm, MM_ANONPAGES);
 	get_page(page);
+	/*
+	 * For preventing sudden freeing by MADV_FREE, pte must have a
+	 * dirty flag.
+	 */
 	set_pte_at(vma->vm_mm, addr, pte,
-		   pte_mkold(mk_pte(page, vma->vm_page_prot)));
+		   pte_mkdirty(pte_mkold(mk_pte(page, vma->vm_page_prot))));
 	if (page == swapcache) {
 		page_add_anon_rmap(page, vma, addr);
 		mem_cgroup_commit_charge(page, memcg, true);
-- 
1.9.1


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [RFC 5/6] mm: decouple PG_dirty from MADV_FREE
  2015-06-03  6:15 [RFC 0/6] MADV_FREE: respect pte_dirty, not PG_dirty Minchan Kim
                   ` (3 preceding siblings ...)
  2015-06-03  6:15 ` [RFC 4/6] mm: mark dirty bit on unuse_pte Minchan Kim
@ 2015-06-03  6:15 ` Minchan Kim
  2015-06-03  6:15 ` [RFC 6/6] mm: MADV_FREE refactoring Minchan Kim
  5 siblings, 0 replies; 12+ messages in thread
From: Minchan Kim @ 2015-06-03  6:15 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Hugh Dickins, Rik van Riel, Mel Gorman, Michal Hocko,
	Johannes Weiner, linux-mm, linux-kernel, Minchan Kim

Basically, MADV_FREE relies on the dirty bit in the page table entry
to decide whether the VM may discard the page or not.
IOW, if the page table entry has the dirty bit set, the VM must not
discard the page.

However, if, for example, a page is swapped in by a read fault, its
page table entry does not have the dirty bit set, so MADV_FREE could
discard the page wrongly.

To avoid the problem, MADV_FREE did extra checks with PageDirty
and PageSwapCache. That worked because a swapped-in page lives in
the swap cache, and once it has been evicted from the swap cache the
page carries the PG_dirty flag. So both page flag checks effectively
prevent wrong discarding by MADV_FREE.

However, a problem with the above logic is that a swapped-in page
keeps PG_dirty once it has been removed from the swap cache, so the
VM cannot consider such pages freeable any more even if madvise_free
is called later. Look at the example below for details.

ptr = malloc();
memset(ptr);
..
..
.. heavy memory pressure, so all of the pages are swapped out
..
..
var = *ptr; -> a page is swapped in and removed from the swapcache.
               The page table entry does not have the dirty bit set,
               but the page descriptor carries PG_dirty.
..
..
madvise_free(ptr);
..
..
..
.. heavy memory pressure again.
.. At this point, the VM cannot discard the page because it
.. has *PG_dirty*

So, rather than relying on PG_dirty in the page descriptor to prevent
discarding a page, using the dirty bit in the page table is more
straightforward and simpler.

Now, every anonymous page handling path (e.g. anon/swap/cow fault
handling, KSM, THP, migration) takes care of keeping the pte dirty
bit, so we do not need to check PG_dirty to identify MADV_FREE-hinted
pages; this patch therefore removes the PageDirty check.

With this, the complicated logic goes away and the freeable-page
check becomes simple, as well as solving the problem mentioned above.

Signed-off-by: Minchan Kim <minchan@kernel.org>
---
 mm/rmap.c   | 2 +-
 mm/vmscan.c | 3 +--
 2 files changed, 2 insertions(+), 3 deletions(-)

diff --git a/mm/rmap.c b/mm/rmap.c
index 9c045940ed10..a2e4f64c392e 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1280,7 +1280,7 @@ static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 
 		if (flags & TTU_FREE) {
 			VM_BUG_ON_PAGE(PageSwapCache(page), page);
-			if (!dirty && !PageDirty(page)) {
+			if (!dirty) {
 				/* It's a freeable page by MADV_FREE */
 				dec_mm_counter(mm, MM_ANONPAGES);
 				goto discard;
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 37e90db1520b..c5fbb7c64deb 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -805,8 +805,7 @@ static enum page_references page_check_references(struct page *page,
 		return PAGEREF_KEEP;
 	}
 
-	if (PageAnon(page) && !pte_dirty && !PageSwapCache(page) &&
-			!PageDirty(page))
+	if (PageAnon(page) && !pte_dirty && !PageSwapCache(page))
 		*freeable = true;
 
 	/* Reclaim if clean, defer dirty pages to writeback */
-- 
1.9.1


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [RFC 6/6] mm: MADV_FREE refactoring
  2015-06-03  6:15 [RFC 0/6] MADV_FREE: respect pte_dirty, not PG_dirty Minchan Kim
                   ` (4 preceding siblings ...)
  2015-06-03  6:15 ` [RFC 5/6] mm: decouple PG_dirty from MADV_FREE Minchan Kim
@ 2015-06-03  6:15 ` Minchan Kim
  5 siblings, 0 replies; 12+ messages in thread
From: Minchan Kim @ 2015-06-03  6:15 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Hugh Dickins, Rik van Riel, Mel Gorman, Michal Hocko,
	Johannes Weiner, linux-mm, linux-kernel, Minchan Kim

This patch does some cleanup.

Firstly, it removes the now-unnecessary PageSwapCache and
ClearPageDirty handling in madvise_free_pte_range. The reason it was
there was to prevent wrong discarding of anonymous pages whose page
table entries were not marked dirty.
However, earlier patches in this patchset removed the dependency
between MADV_FREE and PG_dirty and took care of the anonymous page's
dirty bit in the page table, so there is no reason to keep such logic
in the fast path now.

Secondly, this patch removes the freeable check in page_referenced.
It was pointless because we can decide freeability in try_to_unmap_one
with just pte_dirty. If the page is judged freeable (i.e. the page
table entry does not have the dirty bit), we simply do not store the
swap location in the pte (i.e. we nuke the page table entry).
Otherwise, we store the swap location in the pte so that the page
will be swapped out.

This patch introduces SWAP_DISCARD, which indicates that the passed
page should be discarded instead of being swapped out. It is returned
if the page is no longer mapped in any page table and there is no
swap slot for the page. In that case, shrink_page_list does
ClearPageDirty to skip pageout, and then the VM reclaims the page.

Signed-off-by: Minchan Kim <minchan@kernel.org>
---
 include/linux/rmap.h |  9 ++----
 mm/madvise.c         | 13 ---------
 mm/rmap.c            | 78 +++++++++++++++++++++++++---------------------------
 mm/vmscan.c          | 62 ++++++++++++++---------------------------
 4 files changed, 62 insertions(+), 100 deletions(-)

diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index bf36b6e644c4..352c72074e69 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -184,8 +184,7 @@ static inline void page_dup_rmap(struct page *page)
  * Called from mm/vmscan.c to handle paging out
  */
 int page_referenced(struct page *, int is_locked,
-			struct mem_cgroup *memcg, unsigned long *vm_flags,
-			int *is_pte_dirty);
+			struct mem_cgroup *memcg, unsigned long *vm_flags);
 
 #define TTU_ACTION(x) ((x) & TTU_ACTION_MASK)
 
@@ -262,12 +261,9 @@ int rmap_walk(struct page *page, struct rmap_walk_control *rwc);
 
 static inline int page_referenced(struct page *page, int is_locked,
 				  struct mem_cgroup *memcg,
-				  unsigned long *vm_flags,
-				  int *is_pte_dirty)
+				  unsigned long *vm_flags)
 {
 	*vm_flags = 0;
-	if (is_pte_dirty)
-		*is_pte_dirty = 0;
 	return 0;
 }
 
@@ -288,5 +284,6 @@ static inline int page_mkclean(struct page *page)
 #define SWAP_AGAIN	1
 #define SWAP_FAIL	2
 #define SWAP_MLOCK	3
+#define SWAP_DISCARD	4
 
 #endif	/* _LINUX_RMAP_H */
diff --git a/mm/madvise.c b/mm/madvise.c
index d215ea949630..68446e8ea6f6 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -317,19 +317,6 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
 		if (!page)
 			continue;
 
-		if (PageSwapCache(page)) {
-			if (!trylock_page(page))
-				continue;
-
-			if (!try_to_free_swap(page)) {
-				unlock_page(page);
-				continue;
-			}
-
-			ClearPageDirty(page);
-			unlock_page(page);
-		}
-
 		/*
 		 * Some of architecture(ex, PPC) don't update TLB
 		 * with set_pte_at and tlb_remove_tlb_entry so for
diff --git a/mm/rmap.c b/mm/rmap.c
index a2e4f64c392e..aa4a5174bc69 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -712,7 +712,6 @@ int page_mapped_in_vma(struct page *page, struct vm_area_struct *vma)
 }
 
 struct page_referenced_arg {
-	int dirtied;
 	int mapcount;
 	int referenced;
 	unsigned long vm_flags;
@@ -727,7 +726,6 @@ static int page_referenced_one(struct page *page, struct vm_area_struct *vma,
 	struct mm_struct *mm = vma->vm_mm;
 	spinlock_t *ptl;
 	int referenced = 0;
-	int dirty = 0;
 	struct page_referenced_arg *pra = arg;
 
 	if (unlikely(PageTransHuge(page))) {
@@ -752,14 +750,6 @@ static int page_referenced_one(struct page *page, struct vm_area_struct *vma,
 		if (pmdp_clear_flush_young_notify(vma, address, pmd))
 			referenced++;
 
-		/*
-		 * Use pmd_freeable instead of raw pmd_dirty because in some
-		 * of architecture, pmd_dirty is not defined unless
-		 * CONFIG_TRANSPARENT_HUGEPAGE is enabled
-		 */
-		if (!pmd_freeable(*pmd))
-			dirty++;
-
 		spin_unlock(ptl);
 	} else {
 		pte_t *pte;
@@ -790,9 +780,6 @@ static int page_referenced_one(struct page *page, struct vm_area_struct *vma,
 				referenced++;
 		}
 
-		if (pte_dirty(*pte))
-			dirty++;
-
 		pte_unmap_unlock(pte, ptl);
 	}
 
@@ -801,9 +788,6 @@ static int page_referenced_one(struct page *page, struct vm_area_struct *vma,
 		pra->vm_flags |= vma->vm_flags;
 	}
 
-	if (dirty)
-		pra->dirtied++;
-
 	pra->mapcount--;
 	if (!pra->mapcount)
 		return SWAP_SUCCESS; /* To break the loop */
@@ -828,7 +812,6 @@ static bool invalid_page_referenced_vma(struct vm_area_struct *vma, void *arg)
  * @is_locked: caller holds lock on the page
  * @memcg: target memory cgroup
  * @vm_flags: collect encountered vma->vm_flags who actually referenced the page
- * @is_pte_dirty: ptes which have marked dirty bit - used for lazyfree page
  *
  * Quick test_and_clear_referenced for all mappings to a page,
  * returns the number of ptes which referenced the page.
@@ -836,8 +819,7 @@ static bool invalid_page_referenced_vma(struct vm_area_struct *vma, void *arg)
 int page_referenced(struct page *page,
 		    int is_locked,
 		    struct mem_cgroup *memcg,
-		    unsigned long *vm_flags,
-		    int *is_pte_dirty)
+		    unsigned long *vm_flags)
 {
 	int ret;
 	int we_locked = 0;
@@ -852,8 +834,6 @@ int page_referenced(struct page *page,
 	};
 
 	*vm_flags = 0;
-	if (is_pte_dirty)
-		*is_pte_dirty = 0;
 
 	if (!page_mapped(page))
 		return 0;
@@ -882,9 +862,6 @@ int page_referenced(struct page *page,
 	if (we_locked)
 		unlock_page(page);
 
-	if (is_pte_dirty)
-		*is_pte_dirty = pra.dirtied;
-
 	return pra.referenced;
 }
 
@@ -1206,8 +1183,13 @@ void page_remove_rmap(struct page *page)
 	 */
 }
 
+struct rmap_arg {
+	enum ttu_flags flags;
+	int discard;
+};
+
 /*
- * @arg: enum ttu_flags will be passed to this argument
+ * @arg: struct rmap_arg will be passed to this argument
  */
 static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 		     unsigned long address, void *arg)
@@ -1217,7 +1199,8 @@ static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 	pte_t pteval;
 	spinlock_t *ptl;
 	int ret = SWAP_AGAIN;
-	enum ttu_flags flags = (enum ttu_flags)arg;
+	struct rmap_arg *rmap_arg = (struct rmap_arg *)arg;
+	enum ttu_flags flags = rmap_arg->flags;
 	int dirty = 0;
 
 	pte = page_check_address(page, mm, address, &ptl, 0);
@@ -1278,17 +1261,11 @@ static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 		swp_entry_t entry = { .val = page_private(page) };
 		pte_t swp_pte;
 
-		if (flags & TTU_FREE) {
-			VM_BUG_ON_PAGE(PageSwapCache(page), page);
-			if (!dirty) {
-				/* It's a freeable page by MADV_FREE */
-				dec_mm_counter(mm, MM_ANONPAGES);
-				goto discard;
-			} else {
-				set_pte_at(mm, address, pte, pteval);
-				ret = SWAP_FAIL;
-				goto out_unmap;
-			}
+		if (flags & TTU_FREE && !dirty) {
+			/* It's a freeable page by MADV_FREE */
+			dec_mm_counter(mm, MM_ANONPAGES);
+			rmap_arg->discard++;
+			goto discard;
 		}
 
 		if (PageSwapCache(page)) {
@@ -1405,15 +1382,23 @@ static int page_not_mapped(struct page *page)
 int try_to_unmap(struct page *page, enum ttu_flags flags)
 {
 	int ret;
+	int mapcount;
+	struct rmap_arg rmap_arg = {
+		.flags = flags,
+	};
+
 	struct rmap_walk_control rwc = {
 		.rmap_one = try_to_unmap_one,
-		.arg = (void *)flags,
+		.arg = (void *)&rmap_arg,
 		.done = page_not_mapped,
 		.anon_lock = page_lock_anon_vma_read,
 	};
 
 	VM_BUG_ON_PAGE(!PageHuge(page) && PageTransHuge(page), page);
 
+	if (flags & TTU_FREE)
+		mapcount = page_mapcount(page);
+
 	/*
 	 * During exec, a temporary VMA is setup and later moved.
 	 * The VMA is moved under the anon_vma lock but not the
@@ -1427,8 +1412,21 @@ int try_to_unmap(struct page *page, enum ttu_flags flags)
 
 	ret = rmap_walk(page, &rwc);
 
-	if (ret != SWAP_MLOCK && !page_mapped(page))
-		ret = SWAP_SUCCESS;
+	/*
+	 * The mapcount would be zero if some process frees @page in parallel.
+	 * In this case, we shouldn't count it as SWAP_DISCARD. Otherwise,
+	 * people can see increased pglazyfreed via vmstat even though there
+	 * is no process called MADV_FREE.
+	 */
+	if (ret != SWAP_MLOCK && !page_mapped(page)) {
+		if ((flags & TTU_FREE) && mapcount &&
+			mapcount == rmap_arg.discard &&
+				!page_swapcount(page))
+			ret = SWAP_DISCARD;
+		else
+			ret = SWAP_SUCCESS;
+	}
+
 	return ret;
 }
 
diff --git a/mm/vmscan.c b/mm/vmscan.c
index c5fbb7c64deb..4ab599810c8d 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -754,17 +754,15 @@ enum page_references {
 };
 
 static enum page_references page_check_references(struct page *page,
-						  struct scan_control *sc,
-						  bool *freeable)
+						  struct scan_control *sc)
 {
 	int referenced_ptes, referenced_page;
 	unsigned long vm_flags;
-	int pte_dirty;
 
 	VM_BUG_ON_PAGE(!PageLocked(page), page);
 
 	referenced_ptes = page_referenced(page, 1, sc->target_mem_cgroup,
-					  &vm_flags, &pte_dirty);
+					  &vm_flags);
 	referenced_page = TestClearPageReferenced(page);
 
 	/*
@@ -805,9 +803,6 @@ static enum page_references page_check_references(struct page *page,
 		return PAGEREF_KEEP;
 	}
 
-	if (PageAnon(page) && !pte_dirty && !PageSwapCache(page))
-		*freeable = true;
-
 	/* Reclaim if clean, defer dirty pages to writeback */
 	if (referenced_page && !PageSwapBacked(page))
 		return PAGEREF_RECLAIM_CLEAN;
@@ -876,7 +871,8 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 		int may_enter_fs;
 		enum page_references references = PAGEREF_RECLAIM_CLEAN;
 		bool dirty, writeback;
-		bool freeable = false;
+		int unmap_ret = SWAP_AGAIN;
+		enum ttu_flags tmp_ttu_flags = ttu_flags;
 
 		cond_resched();
 
@@ -1000,8 +996,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 		}
 
 		if (!force_reclaim)
-			references = page_check_references(page, sc,
-							&freeable);
+			references = page_check_references(page, sc);
 
 		switch (references) {
 		case PAGEREF_ACTIVATE:
@@ -1013,12 +1008,13 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 			; /* try to reclaim the page below */
 		}
 
-		/*
-		 * Anonymous process memory has backing store?
-		 * Try to allocate it some swap space here.
-		 */
-		if (PageAnon(page) && !PageSwapCache(page)) {
-			if (!freeable) {
+		if (PageAnon(page)) {
+			tmp_ttu_flags |= TTU_FREE;
+			/*
+			 * Anonymous process memory has backing store?
+			 * Try to allocate it some swap space here.
+			 */
+			if (!PageSwapCache(page)) {
 				if (!(sc->gfp_mask & __GFP_IO))
 					goto keep_locked;
 				if (!add_to_swap(page, page_list))
@@ -1026,44 +1022,26 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 				may_enter_fs = 1;
 				/* Adding to swap updated mapping */
 				mapping = page_mapping(page);
-			} else {
-				if (likely(!PageTransHuge(page)))
-					goto unmap;
-				/* try_to_unmap isn't aware of THP page */
-				if (unlikely(split_huge_page_to_list(page,
-								page_list)))
-					goto keep_locked;
 			}
 		}
-unmap:
+
 		/*
 		 * The page is mapped into the page tables of one or more
 		 * processes. Try to unmap it here.
 		 */
-		if (page_mapped(page) && (mapping || freeable)) {
-			switch (try_to_unmap(page,
-				freeable ? TTU_FREE : ttu_flags)) {
+		if (page_mapped(page) && mapping) {
+			switch (unmap_ret = try_to_unmap(page, tmp_ttu_flags)) {
 			case SWAP_FAIL:
 				goto activate_locked;
 			case SWAP_AGAIN:
 				goto keep_locked;
 			case SWAP_MLOCK:
 				goto cull_mlocked;
+			case SWAP_DISCARD:
+				ClearPageDirty(page);
 			case SWAP_SUCCESS:
+				break;
 				/* try to free the page below */
-				if (!freeable)
-					break;
-				/*
-				 * Freeable anon page doesn't have mapping
-				 * due to skipping of swapcache so we free
-				 * page in here rather than __remove_mapping.
-				 */
-				VM_BUG_ON_PAGE(PageSwapCache(page), page);
-				if (!page_freeze_refs(page, 1))
-					goto keep_locked;
-				__ClearPageLocked(page);
-				count_vm_event(PGLAZYFREED);
-				goto free_it;
 			}
 		}
 
@@ -1175,6 +1153,8 @@ unmap:
 		 */
 		__ClearPageLocked(page);
 free_it:
+		if (unmap_ret == SWAP_DISCARD)
+			count_vm_event(PGLAZYFREED);
 		nr_reclaimed++;
 
 		/*
@@ -1820,7 +1800,7 @@ static void shrink_active_list(unsigned long nr_to_scan,
 		}
 
 		if (page_referenced(page, 0, sc->target_mem_cgroup,
-				    &vm_flags, NULL)) {
+				    &vm_flags)) {
 			nr_rotated += hpage_nr_pages(page);
 			/*
 			 * Identify referenced, file-backed active pages and
-- 
1.9.1


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: [RFC 3/6] mm: mark dirty bit on swapped-in page
  2015-06-03  6:15 ` [RFC 3/6] mm: mark dirty bit on swapped-in page Minchan Kim
@ 2015-06-09 19:07   ` Cyrill Gorcunov
  2015-06-09 23:52     ` Minchan Kim
  0 siblings, 1 reply; 12+ messages in thread
From: Cyrill Gorcunov @ 2015-06-09 19:07 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Andrew Morton, Hugh Dickins, Rik van Riel, Mel Gorman,
	Michal Hocko, Johannes Weiner, linux-mm, linux-kernel,
	Pavel Emelyanov, Yalin Wang

On Wed, Jun 03, 2015 at 03:15:42PM +0900, Minchan Kim wrote:
> Basically, MADV_FREE relys on the dirty bit in page table entry
> to decide whether VM allows to discard the page or not.
> IOW, if page table entry includes marked dirty bit, VM shouldn't
> discard the page.
> 
> However, if swap-in by read fault happens, page table entry
> point out the page doesn't have marked dirty bit so MADV_FREE
> might discard the page wrongly.
> 
> To fix the problem, this patch marks page table entry of page
> swapping-in as dirty so VM shouldn't discard the page suddenly
> under us.
> 
> With MADV_FREE point of view, marking dirty unconditionally is
> no problem because we dropped swapped page in MADV_FREE sycall
> context(ie, Look at madvise_free_pte_range) so every swapping-in
> pages are no MADV_FREE hinted pages.
> 
> Cc: Hugh Dickins <hughd@google.com>
> Cc: Cyrill Gorcunov <gorcunov@gmail.com>
> Cc: Pavel Emelyanov <xemul@parallels.com>
> Reported-by: Yalin Wang <yalin.wang@sonymobile.com>
> Signed-off-by: Minchan Kim <minchan@kernel.org>
> ---
>  mm/memory.c | 6 ++++--
>  1 file changed, 4 insertions(+), 2 deletions(-)
> 
> diff --git a/mm/memory.c b/mm/memory.c
> index 8a2fc9945b46..d1709f763152 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -2557,9 +2557,11 @@ static int do_swap_page(struct mm_struct *mm, struct vm_area_struct *vma,
>  
>  	inc_mm_counter_fast(mm, MM_ANONPAGES);
>  	dec_mm_counter_fast(mm, MM_SWAPENTS);
> -	pte = mk_pte(page, vma->vm_page_prot);
> +
> +	/* Mark dirty bit of page table because MADV_FREE relies on it */
> +	pte = pte_mkdirty(mk_pte(page, vma->vm_page_prot));
>  	if ((flags & FAULT_FLAG_WRITE) && reuse_swap_page(page)) {
> -		pte = maybe_mkwrite(pte_mkdirty(pte), vma);
> +		pte = maybe_mkwrite(pte, vma);
>  		flags &= ~FAULT_FLAG_WRITE;
>  		ret |= VM_FAULT_WRITE;
>  		exclusive = 1;

Hi Minchan! Really sorry for the delay in replying. There is one
point I don't understand: if the page is faulted in by a read, then
before this patch the PTE would not carry the dirty flag, but now we
set it unconditionally. That looks somewhat strange to me, at the
least because it also sets the soft-dirty bit on pages which were not
modified but only swapped out. Am I missing something obvious?


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [RFC 3/6] mm: mark dirty bit on swapped-in page
  2015-06-09 19:07   ` Cyrill Gorcunov
@ 2015-06-09 23:52     ` Minchan Kim
  2015-06-10  7:23       ` Cyrill Gorcunov
  0 siblings, 1 reply; 12+ messages in thread
From: Minchan Kim @ 2015-06-09 23:52 UTC (permalink / raw)
  To: Cyrill Gorcunov
  Cc: Andrew Morton, Hugh Dickins, Rik van Riel, Mel Gorman,
	Michal Hocko, Johannes Weiner, linux-mm, linux-kernel,
	Pavel Emelyanov, Yalin Wang

Hello Cyrill,

On Tue, Jun 09, 2015 at 10:07:37PM +0300, Cyrill Gorcunov wrote:
> On Wed, Jun 03, 2015 at 03:15:42PM +0900, Minchan Kim wrote:
> > Basically, MADV_FREE relys on the dirty bit in page table entry
> > to decide whether VM allows to discard the page or not.
> > IOW, if page table entry includes marked dirty bit, VM shouldn't
> > discard the page.
> > 
> > However, if swap-in by read fault happens, page table entry
> > point out the page doesn't have marked dirty bit so MADV_FREE
> > might discard the page wrongly.
> > 
> > To fix the problem, this patch marks page table entry of page
> > swapping-in as dirty so VM shouldn't discard the page suddenly
> > under us.
> > 
> > With MADV_FREE point of view, marking dirty unconditionally is
> > no problem because we dropped swapped page in MADV_FREE sycall
> > context(ie, Look at madvise_free_pte_range) so every swapping-in
> > pages are no MADV_FREE hinted pages.
> > 
> > Cc: Hugh Dickins <hughd@google.com>
> > Cc: Cyrill Gorcunov <gorcunov@gmail.com>
> > Cc: Pavel Emelyanov <xemul@parallels.com>
> > Reported-by: Yalin Wang <yalin.wang@sonymobile.com>
> > Signed-off-by: Minchan Kim <minchan@kernel.org>
> > ---
> >  mm/memory.c | 6 ++++--
> >  1 file changed, 4 insertions(+), 2 deletions(-)
> > 
> > diff --git a/mm/memory.c b/mm/memory.c
> > index 8a2fc9945b46..d1709f763152 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -2557,9 +2557,11 @@ static int do_swap_page(struct mm_struct *mm, struct vm_area_struct *vma,
> >  
> >  	inc_mm_counter_fast(mm, MM_ANONPAGES);
> >  	dec_mm_counter_fast(mm, MM_SWAPENTS);
> > -	pte = mk_pte(page, vma->vm_page_prot);
> > +
> > +	/* Mark dirty bit of page table because MADV_FREE relies on it */
> > +	pte = pte_mkdirty(mk_pte(page, vma->vm_page_prot));
> >  	if ((flags & FAULT_FLAG_WRITE) && reuse_swap_page(page)) {
> > -		pte = maybe_mkwrite(pte_mkdirty(pte), vma);
> > +		pte = maybe_mkwrite(pte, vma);
> >  		flags &= ~FAULT_FLAG_WRITE;
> >  		ret |= VM_FAULT_WRITE;
> >  		exclusive = 1;
> 
> Hi Minchan! Really sorry for delay in reply. Look, I don't understand
> the moment -- if page has fault on read then before the patch the
> PTE won't carry the dirty flag but now we do set it up unconditionally
> and to me it looks somehow strange at least because this as well
> sets soft-dirty bit on pages which were not modified but only swapped
> out. Am I missing something obvious?

It's the same one I sent a while ago, and you said it was okay at that
time. ;-)
Okay, it might lack some description compared to the one I sent a long
time ago because I moved part of the description to another patch and
didn't Cc you. Sorry. I hope the link below will remind you.

https://www.mail-archive.com/linux-kernel%40vger.kernel.org/msg857827.html

In summary, the problem is that, from the MADV_FREE point of view, an
anonymous page whose page table entry is clean (i.e. not dirty) can be
discarded suddenly under us by the reclaimer. On the other hand, the
VM cannot discard MADV_FREE-hinted pages because of the PageDirty flag
of the page descriptor.

This patchset aims to solve the problem.
Please feel free to ask if you still have questions after reading the
URL above; no need to waste your time puzzling over it.

Thanks for looking!


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [RFC 3/6] mm: mark dirty bit on swapped-in page
  2015-06-09 23:52     ` Minchan Kim
@ 2015-06-10  7:23       ` Cyrill Gorcunov
  2015-06-10  8:00         ` Minchan Kim
  0 siblings, 1 reply; 12+ messages in thread
From: Cyrill Gorcunov @ 2015-06-10  7:23 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Andrew Morton, Hugh Dickins, Rik van Riel, Mel Gorman,
	Michal Hocko, Johannes Weiner, linux-mm, linux-kernel,
	Pavel Emelyanov, Yalin Wang

On Wed, Jun 10, 2015 at 08:52:06AM +0900, Minchan Kim wrote:
> > > +++ b/mm/memory.c
> > > @@ -2557,9 +2557,11 @@ static int do_swap_page(struct mm_struct *mm, struct vm_area_struct *vma,
> > >  
> > >  	inc_mm_counter_fast(mm, MM_ANONPAGES);
> > >  	dec_mm_counter_fast(mm, MM_SWAPENTS);
> > > -	pte = mk_pte(page, vma->vm_page_prot);
> > > +
> > > +	/* Mark dirty bit of page table because MADV_FREE relies on it */
> > > +	pte = pte_mkdirty(mk_pte(page, vma->vm_page_prot));
> > >  	if ((flags & FAULT_FLAG_WRITE) && reuse_swap_page(page)) {
> > > -		pte = maybe_mkwrite(pte_mkdirty(pte), vma);
> > > +		pte = maybe_mkwrite(pte, vma);
> > >  		flags &= ~FAULT_FLAG_WRITE;
> > >  		ret |= VM_FAULT_WRITE;
> > >  		exclusive = 1;
> > 
> > Hi Minchan! Really sorry for delay in reply. Look, I don't understand
> > the moment -- if page has fault on read then before the patch the
> > PTE won't carry the dirty flag but now we do set it up unconditionally
> > and to me it looks somehow strange at least because this as well
> > sets soft-dirty bit on pages which were not modified but only swapped
> > out. Am I missing something obvious?
> 
> It's same one I sent a while ago and you said it's okay at that time. ;-)

Ah, I recall. If there is no way to avoid dirtying the page in the pte
itself, maybe we should at least not make it soft-dirty on read faults?

> Okay, It might be lack of description compared to one I sent long time ago
> because I moved some part of description to another patch and I didn't Cc
> you. Sorry. I hope below will remind you.
> 
> https://www.mail-archive.com/linux-kernel%40vger.kernel.org/msg857827.html
> 
> In summary, the problem is that in MADV_FREE point of view,
> clean anonymous page(ie, no dirty) in  page table entry has a problem
> about sudden discarding under us by reclaimer. Otherwise, VM cannot
> discard MADV_FREE hinted pages by PageDirty flag of page descriptor.
> 
> This patchset aims for solving the problem.
> Please feel free to ask if you have questions without wasting your time
> unless you can remind after reading above URL
> 
> Thanks for looking!


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [RFC 3/6] mm: mark dirty bit on swapped-in page
  2015-06-10  7:23       ` Cyrill Gorcunov
@ 2015-06-10  8:00         ` Minchan Kim
  2015-06-10  8:05           ` Cyrill Gorcunov
  0 siblings, 1 reply; 12+ messages in thread
From: Minchan Kim @ 2015-06-10  8:00 UTC (permalink / raw)
  To: Cyrill Gorcunov
  Cc: Andrew Morton, Hugh Dickins, Rik van Riel, Mel Gorman,
	Michal Hocko, Johannes Weiner, linux-mm, linux-kernel,
	Pavel Emelyanov, Yalin Wang

On Wed, Jun 10, 2015 at 10:23:05AM +0300, Cyrill Gorcunov wrote:
> On Wed, Jun 10, 2015 at 08:52:06AM +0900, Minchan Kim wrote:
> > > > +++ b/mm/memory.c
> > > > @@ -2557,9 +2557,11 @@ static int do_swap_page(struct mm_struct *mm, struct vm_area_struct *vma,
> > > >  
> > > >  	inc_mm_counter_fast(mm, MM_ANONPAGES);
> > > >  	dec_mm_counter_fast(mm, MM_SWAPENTS);
> > > > -	pte = mk_pte(page, vma->vm_page_prot);
> > > > +
> > > > +	/* Mark dirty bit of page table because MADV_FREE relies on it */
> > > > +	pte = pte_mkdirty(mk_pte(page, vma->vm_page_prot));
> > > >  	if ((flags & FAULT_FLAG_WRITE) && reuse_swap_page(page)) {
> > > > -		pte = maybe_mkwrite(pte_mkdirty(pte), vma);
> > > > +		pte = maybe_mkwrite(pte, vma);
> > > >  		flags &= ~FAULT_FLAG_WRITE;
> > > >  		ret |= VM_FAULT_WRITE;
> > > >  		exclusive = 1;
> > > 
> > > Hi Minchan! Really sorry for delay in reply. Look, I don't understand
> > > the moment -- if page has fault on read then before the patch the
> > > PTE won't carry the dirty flag but now we do set it up unconditionally
> > > and to me it looks somehow strange at least because this as well
> > > sets soft-dirty bit on pages which were not modified but only swapped
> > > out. Am I missing something obvious?
> > 
> > It's same one I sent a while ago and you said it's okay at that time. ;-)
> 
> Ah, I recall. If there is no way to escape dirtifying the page in pte itself
> maybe we should at least not make it softdirty on read faults?

You mean this? 

diff --git a/mm/memory.c b/mm/memory.c
index e1c45d0..c95340d 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2557,9 +2557,14 @@ static int do_swap_page(struct mm_struct *mm, struct vm_area_struct *vma,
 
        inc_mm_counter_fast(mm, MM_ANONPAGES);
        dec_mm_counter_fast(mm, MM_SWAPENTS);
-       pte = mk_pte(page, vma->vm_page_prot);
+
+       /* Mark dirty bit of page table because MADV_FREE relies on it */
+       pte = pte_mkdirty(mk_pte(page, vma->vm_page_prot));
+       if (!(flags & FAULT_FLAG_WRITE))
+               pte = pte_clear_flags(pte, _PAGE_SOFT_DIRTY);
+
        if ((flags & FAULT_FLAG_WRITE) && reuse_swap_page(page)) {
-               pte = maybe_mkwrite(pte_mkdirty(pte), vma);
+               pte = maybe_mkwrite(pte, vma);
                flags &= ~FAULT_FLAG_WRITE;
                ret |= VM_FAULT_WRITE;
                exclusive = 1;

It could be doable if no one has a strong objection to this
patchset.

I will wait for more review.
Thanks.



> 
> > Okay, It might be lack of description compared to one I sent long time ago
> > because I moved some part of description to another patch and I didn't Cc
> > you. Sorry. I hope below will remind you.
> > 
> > https://www.mail-archive.com/linux-kernel%40vger.kernel.org/msg857827.html
> > 
> > In summary, the problem is that in MADV_FREE point of view,
> > clean anonymous page(ie, no dirty) in  page table entry has a problem
> > about sudden discarding under us by reclaimer. Otherwise, VM cannot
> > discard MADV_FREE hinted pages by PageDirty flag of page descriptor.
> > 
> > This patchset aims for solving the problem.
> > Please feel free to ask if you have questions without wasting your time
> > unless you can remind after reading above URL
> > 
> > Thanks for looking!

-- 
Kind regards,
Minchan Kim


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: [RFC 3/6] mm: mark dirty bit on swapped-in page
  2015-06-10  8:00         ` Minchan Kim
@ 2015-06-10  8:05           ` Cyrill Gorcunov
  0 siblings, 0 replies; 12+ messages in thread
From: Cyrill Gorcunov @ 2015-06-10  8:05 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Andrew Morton, Hugh Dickins, Rik van Riel, Mel Gorman,
	Michal Hocko, Johannes Weiner, linux-mm, linux-kernel,
	Pavel Emelyanov, Yalin Wang

On Wed, Jun 10, 2015 at 05:00:35PM +0900, Minchan Kim wrote:
> > 
> > Ah, I recall. If there is no way to escape dirtifying the page in pte itself
> > maybe we should at least not make it softdirty on read faults?
> 
> You mean this? 
> 
> diff --git a/mm/memory.c b/mm/memory.c
> index e1c45d0..c95340d 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -2557,9 +2557,14 @@ static int do_swap_page(struct mm_struct *mm, struct vm_area_struct *vma,
>  
>         inc_mm_counter_fast(mm, MM_ANONPAGES);
>         dec_mm_counter_fast(mm, MM_SWAPENTS);
> -       pte = mk_pte(page, vma->vm_page_prot);
> +
> +       /* Mark dirty bit of page table because MADV_FREE relies on it */
> +       pte = pte_mkdirty(mk_pte(page, vma->vm_page_prot));
> +       if (!(flags & FAULT_FLAG_WRITE))
> +               pte = pte_clear_flags(pte, _PAGE_SOFT_DIRTY);
> +
>         if ((flags & FAULT_FLAG_WRITE) && reuse_swap_page(page)) {
> -               pte = maybe_mkwrite(pte_mkdirty(pte), vma);
> +               pte = maybe_mkwrite(pte, vma);
>                 flags &= ~FAULT_FLAG_WRITE;
>                 ret |= VM_FAULT_WRITE;
>                 exclusive = 1;
> 
> It could be doable if everyone doesn't have strong objection
> on this patchset.
> 
> I will wait more review.

Yeah, something like this. Let's wait for opinions, thanks!


^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread (newest message: 2015-06-10  8:05 UTC)

Thread overview: 12+ messages
2015-06-03  6:15 [RFC 0/6] MADV_FREE: respect pte_dirty, not PG_dirty Minchan Kim
2015-06-03  6:15 ` [RFC 1/6] mm: keep dirty bit on KSM page Minchan Kim
2015-06-03  6:15 ` [RFC 2/6] mm: keep dirty bit on anonymous page migration Minchan Kim
2015-06-03  6:15 ` [RFC 3/6] mm: mark dirty bit on swapped-in page Minchan Kim
2015-06-09 19:07   ` Cyrill Gorcunov
2015-06-09 23:52     ` Minchan Kim
2015-06-10  7:23       ` Cyrill Gorcunov
2015-06-10  8:00         ` Minchan Kim
2015-06-10  8:05           ` Cyrill Gorcunov
2015-06-03  6:15 ` [RFC 4/6] mm: mark dirty bit on unuse_pte Minchan Kim
2015-06-03  6:15 ` [RFC 5/6] mm: decouple PG_dirty from MADV_FREE Minchan Kim
2015-06-03  6:15 ` [RFC 6/6] mm: MADV_FREE refactoring Minchan Kim
