linux-mm.kvack.org archive mirror
* MADV_FREE is broken
@ 2017-09-20  9:01 Artem Savkov
  2017-09-20 22:37 ` Shaohua Li
  0 siblings, 1 reply; 3+ messages in thread
From: Artem Savkov @ 2017-09-20  9:01 UTC (permalink / raw)
  To: Shaohua Li, Minchan Kim
  Cc: Michal Hocko, Johannes Weiner, Hillf Danton, Jan Stancek,
	linux-mm, linux-kernel

Hi All,

We recently started noticing the madvise09[1] test from LTP failing strangely. The
test does the following: it maps 32 pages, sets MADV_FREE on the range it got,
dirties 2 of the pages, creates memory pressure, and checks that the non-dirty
pages are freed. The test hung while accessing the last 4 pages (29-32) of the
madvised range at line 121 [2]. Any other process (gdb/cat) accessing those pages
would also hang, as would rebooting the machine. It doesn't trigger any debug
warnings or KASAN reports.

The issue bisects to commit 802a3a92ad7a ("mm: reclaim MADV_FREE pages"), so
4.12 and later are affected.

I did some poking around and found that the "bad" pages had the SwapBacked flag
set in shrink_page_list(), which confused it. It looks like
mark_page_lazyfree() only calls lru_lazyfree_fn() when the pagevec is full
(that is, in batches of 14) and never drains the remainder (the last four pages
in the madvise09 case).

The patch below greatly reduces the failure rate, but doesn't fix the problem
completely: it still shows up with the same symptoms (hanging while trying to
access the last 4 pages) after a number of retries.

[1] https://github.com/linux-test-project/ltp/blob/master/testcases/kernel/syscalls/madvise/madvise09.c
[2] https://github.com/linux-test-project/ltp/blob/master/testcases/kernel/syscalls/madvise/madvise09.c#L121

diff --git a/mm/madvise.c b/mm/madvise.c
index 21261ff0466f..a0b868e8b7d2 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -453,6 +453,7 @@ static void madvise_free_page_range(struct mmu_gather *tlb,
 
 	tlb_start_vma(tlb, vma);
 	walk_page_range(addr, end, &free_walk);
+	lru_add_drain();
 	tlb_end_vma(tlb, vma);
 }
 

-- 
Regards,
  Artem

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>


* Re: MADV_FREE is broken
  2017-09-20  9:01 MADV_FREE is broken Artem Savkov
@ 2017-09-20 22:37 ` Shaohua Li
  2017-09-21  9:41   ` Artem Savkov
  0 siblings, 1 reply; 3+ messages in thread
From: Shaohua Li @ 2017-09-20 22:37 UTC (permalink / raw)
  To: Artem Savkov
  Cc: Shaohua Li, Minchan Kim, Michal Hocko, Johannes Weiner,
	Hillf Danton, Jan Stancek, linux-mm, linux-kernel

On Wed, Sep 20, 2017 at 11:01:47AM +0200, Artem Savkov wrote:
> Hi All,
> 
> We recently started noticing the madvise09[1] test from LTP failing strangely. The
> test does the following: it maps 32 pages, sets MADV_FREE on the range it got,
> dirties 2 of the pages, creates memory pressure, and checks that the non-dirty
> pages are freed. The test hung while accessing the last 4 pages (29-32) of the
> madvised range at line 121 [2]. Any other process (gdb/cat) accessing those
> pages would also hang, as would rebooting the machine. It doesn't trigger any
> debug warnings or KASAN reports.
> 
> The issue bisects to commit 802a3a92ad7a ("mm: reclaim MADV_FREE pages"), so
> 4.12 and later are affected.
> 
> I did some poking around and found that the "bad" pages had the SwapBacked
> flag set in shrink_page_list(), which confused it. It looks like
> mark_page_lazyfree() only calls lru_lazyfree_fn() when the pagevec is full
> (that is, in batches of 14) and never drains the remainder (the last four
> pages in the madvise09 case).
> 
> The patch below greatly reduces the failure rate, but doesn't fix the problem
> completely: it still shows up with the same symptoms (hanging while trying to
> access the last 4 pages) after a number of retries.
> 
> [1] https://github.com/linux-test-project/ltp/blob/master/testcases/kernel/syscalls/madvise/madvise09.c
> [2] https://github.com/linux-test-project/ltp/blob/master/testcases/kernel/syscalls/madvise/madvise09.c#L121
> 
> diff --git a/mm/madvise.c b/mm/madvise.c
> index 21261ff0466f..a0b868e8b7d2 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -453,6 +453,7 @@ static void madvise_free_page_range(struct mmu_gather *tlb,
>  
>  	tlb_start_vma(tlb, vma);
>  	walk_page_range(addr, end, &free_walk);
> +	lru_add_drain();
>  	tlb_end_vma(tlb, vma);
>  }

Looks like there is a race between clearing the pte dirty bit and clearing the
SwapBacked bit. Draining the pagevec helps a little, but isn't sufficient. If
SwapBacked is set, we could add the page to the swap cache, but since the page
isn't dirty, we don't write the page out. This could cause data corruption.
There is another place where we wrongly clear the SwapBacked bit. Below is a
test patch which seems to fix the issue; please give it a try.


diff --git a/mm/swap.c b/mm/swap.c
index 62d96b8..5c58257 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -575,7 +575,7 @@ static void lru_lazyfree_fn(struct page *page, struct lruvec *lruvec,
 			    void *arg)
 {
 	if (PageLRU(page) && PageAnon(page) && PageSwapBacked(page) &&
-	    !PageUnevictable(page)) {
+	    !PageSwapCache(page) && !PageUnevictable(page)) {
 		bool active = PageActive(page);
 
 		del_page_from_lru_list(page, lruvec,
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 13d711d..be1c98e 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -980,6 +980,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 		int may_enter_fs;
 		enum page_references references = PAGEREF_RECLAIM_CLEAN;
 		bool dirty, writeback;
+		bool new_added_swapcache = false;
 
 		cond_resched();
 
@@ -1165,6 +1166,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 
 				/* Adding to swap updated mapping */
 				mapping = page_mapping(page);
+				new_added_swapcache = true;
 			}
 		} else if (unlikely(PageTransHuge(page))) {
 			/* Split file THP */
@@ -1185,6 +1187,10 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 				nr_unmap_fail++;
 				goto activate_locked;
 			}
+			/* race with MADV_FREE */
+			if (PageAnon(page) && !PageDirty(page) &&
+			    PageSwapBacked(page) && new_added_swapcache)
+				set_page_dirty(page);
 		}
 
 		if (PageDirty(page)) {



* Re: MADV_FREE is broken
  2017-09-20 22:37 ` Shaohua Li
@ 2017-09-21  9:41   ` Artem Savkov
  0 siblings, 0 replies; 3+ messages in thread
From: Artem Savkov @ 2017-09-21  9:41 UTC (permalink / raw)
  To: Shaohua Li
  Cc: Shaohua Li, Minchan Kim, Michal Hocko, Johannes Weiner,
	Jan Stancek, linux-mm, linux-kernel

On Wed, Sep 20, 2017 at 03:37:33PM -0700, Shaohua Li wrote:
> On Wed, Sep 20, 2017 at 11:01:47AM +0200, Artem Savkov wrote:
> > Hi All,
> > 
> > We recently started noticing the madvise09[1] test from LTP failing strangely. The
> > test does the following: it maps 32 pages, sets MADV_FREE on the range it got,
> > dirties 2 of the pages, creates memory pressure, and checks that the non-dirty
> > pages are freed. The test hung while accessing the last 4 pages (29-32) of the
> > madvised range at line 121 [2]. Any other process (gdb/cat) accessing those
> > pages would also hang, as would rebooting the machine. It doesn't trigger any
> > debug warnings or KASAN reports.
> > 
> > The issue bisects to commit 802a3a92ad7a ("mm: reclaim MADV_FREE pages"), so
> > 4.12 and later are affected.
> > 
> > I did some poking around and found that the "bad" pages had the SwapBacked
> > flag set in shrink_page_list(), which confused it. It looks like
> > mark_page_lazyfree() only calls lru_lazyfree_fn() when the pagevec is full
> > (that is, in batches of 14) and never drains the remainder (the last four
> > pages in the madvise09 case).
> > 
> > The patch below greatly reduces the failure rate, but doesn't fix the problem
> > completely: it still shows up with the same symptoms (hanging while trying to
> > access the last 4 pages) after a number of retries.
> > 
> > [1] https://github.com/linux-test-project/ltp/blob/master/testcases/kernel/syscalls/madvise/madvise09.c
> > [2] https://github.com/linux-test-project/ltp/blob/master/testcases/kernel/syscalls/madvise/madvise09.c#L121
> > 
> > diff --git a/mm/madvise.c b/mm/madvise.c
> > index 21261ff0466f..a0b868e8b7d2 100644
> > --- a/mm/madvise.c
> > +++ b/mm/madvise.c
> > @@ -453,6 +453,7 @@ static void madvise_free_page_range(struct mmu_gather *tlb,
> >  
> >  	tlb_start_vma(tlb, vma);
> >  	walk_page_range(addr, end, &free_walk);
> > +	lru_add_drain();
> >  	tlb_end_vma(tlb, vma);
> >  }
> 
> Looks like there is a race between clearing the pte dirty bit and clearing
> the SwapBacked bit. Draining the pagevec helps a little, but isn't
> sufficient. If SwapBacked is set, we could add the page to the swap cache,
> but since the page isn't dirty, we don't write the page out. This could
> cause data corruption. There is another place where we wrongly clear the
> SwapBacked bit. Below is a test patch which seems to fix the issue; please
> give it a try.

I ran it for quite some time and it does seem to fix the issue.

> diff --git a/mm/swap.c b/mm/swap.c
> index 62d96b8..5c58257 100644
> --- a/mm/swap.c
> +++ b/mm/swap.c
> @@ -575,7 +575,7 @@ static void lru_lazyfree_fn(struct page *page, struct lruvec *lruvec,
>  			    void *arg)
>  {
>  	if (PageLRU(page) && PageAnon(page) && PageSwapBacked(page) &&
> -	    !PageUnevictable(page)) {
> +	    !PageSwapCache(page) && !PageUnevictable(page)) {
>  		bool active = PageActive(page);
>  
>  		del_page_from_lru_list(page, lruvec,

Shouldn't the same check be added to mark_page_lazyfree()?

> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 13d711d..be1c98e 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -980,6 +980,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
>  		int may_enter_fs;
>  		enum page_references references = PAGEREF_RECLAIM_CLEAN;
>  		bool dirty, writeback;
> +		bool new_added_swapcache = false;
>  
>  		cond_resched();
>  
> @@ -1165,6 +1166,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
>  
>  				/* Adding to swap updated mapping */
>  				mapping = page_mapping(page);
> +				new_added_swapcache = true;
>  			}
>  		} else if (unlikely(PageTransHuge(page))) {
>  			/* Split file THP */
> @@ -1185,6 +1187,10 @@ static unsigned long shrink_page_list(struct list_head *page_list,
>  				nr_unmap_fail++;
>  				goto activate_locked;
>  			}
> +			/* race with MADV_FREE */
> +			if (PageAnon(page) && !PageDirty(page) &&
> +			    PageSwapBacked(page) && new_added_swapcache)
> +				set_page_dirty(page);
>  		}
>  
>  		if (PageDirty(page)) {

-- 
Regards,
  Artem



