From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pf0-f200.google.com (mail-pf0-f200.google.com [209.85.192.200]) by kanga.kvack.org (Postfix) with ESMTP id 5F0506B0033 for ; Thu, 21 Sep 2017 16:49:38 -0400 (EDT) Received: by mail-pf0-f200.google.com with SMTP id y29so11847015pff.6 for ; Thu, 21 Sep 2017 13:49:38 -0700 (PDT) Received: from mail.kernel.org (mail.kernel.org. [198.145.29.99]) by mx.google.com with ESMTPS id u3si1706234plm.546.2017.09.21.13.49.36 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Thu, 21 Sep 2017 13:49:37 -0700 (PDT) From: Shaohua Li Subject: [PATCH 0/2] mm: fix race condition in MADV_FREE Date: Thu, 21 Sep 2017 13:27:09 -0700 Message-Id: Sender: owner-linux-mm@kvack.org List-ID: To: linux-mm@kvack.org Cc: Artem Savkov , Kernel-team@fb.com, Shaohua Li From: Shaohua Li Artem Savkov reported a race condition[1] in MADV_FREE. MADV_FREE clear pte dirty bit and then mark the page lazyfree. There is no lock to prevent the page is added to swap cache between these two steps by page reclaim. There are two problems: - page in swapcache is marked lazyfree (clear SwapBacked). This confuses some code pathes, like page fault handling. - The page is added into swapcache, and freed but the page isn't swapout because pte isn't dity. This will cause data corruption. The patches will fix the issues. Thanks, Shaohua [1] https://marc.info/?l=linux-mm&m=150589811300667&w=2 Shaohua Li (2): mm: avoid marking swap cached page as lazyfree mm: fix data corruption caused by lazyfree page mm/swap.c | 4 ++-- mm/vmscan.c | 12 ++++++++++++ 2 files changed, 14 insertions(+), 2 deletions(-) -- 2.9.5 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pf0-f200.google.com (mail-pf0-f200.google.com [209.85.192.200]) by kanga.kvack.org (Postfix) with ESMTP id B58596B0253 for ; Thu, 21 Sep 2017 16:49:38 -0400 (EDT) Received: by mail-pf0-f200.google.com with SMTP id f84so11904128pfj.0 for ; Thu, 21 Sep 2017 13:49:38 -0700 (PDT) Received: from mail.kernel.org (mail.kernel.org. [198.145.29.99]) by mx.google.com with ESMTPS id g11si1673900pgf.438.2017.09.21.13.49.37 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Thu, 21 Sep 2017 13:49:37 -0700 (PDT) From: Shaohua Li Subject: [PATCH 1/2] mm: avoid marking swap cached page as lazyfree Date: Thu, 21 Sep 2017 13:27:10 -0700 Message-Id: In-Reply-To: References: In-Reply-To: References: Sender: owner-linux-mm@kvack.org List-ID: To: linux-mm@kvack.org Cc: Artem Savkov , Kernel-team@fb.com, Shaohua Li , stable@vger.kernel.org, Johannes Weiner , Michal Hocko , Hillf Danton , Minchan Kim , Hugh Dickins , Rik van Riel , Mel Gorman , Andrew Morton From: Shaohua Li MADV_FREE clears pte dirty bit and then marks the page lazyfree (clear SwapBacked). There is no lock to prevent the page is added to swap cache between these two steps by page reclaim. If the page is added to swap cache, marking the page lazyfree will confuse page fault if the page is reclaimed and refault. Reported-and-tested-y: Artem Savkov Fix: 802a3a92ad7a(mm: reclaim MADV_FREE pages) Signed-off-by: Shaohua Li Cc: stable@vger.kernel.org Cc: Johannes Weiner Cc: Michal Hocko Cc: Hillf Danton Cc: Minchan Kim Cc: Hugh Dickins Cc: Rik van Riel Cc: Mel Gorman Cc: Andrew Morton --- mm/swap.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/mm/swap.c b/mm/swap.c index 9295ae9..a77d68f 100644 --- a/mm/swap.c +++ b/mm/swap.c @@ -575,7 +575,7 @@ static void lru_lazyfree_fn(struct page *page, struct lruvec *lruvec, void *arg) { if (PageLRU(page) && PageAnon(page) && PageSwapBacked(page) && - !PageUnevictable(page)) { + !PageSwapCache(page) && !PageUnevictable(page)) { bool active = PageActive(page); del_page_from_lru_list(page, lruvec, @@ -665,7 +665,7 @@ void deactivate_file_page(struct page *page) void mark_page_lazyfree(struct page *page) { if (PageLRU(page) && PageAnon(page) && PageSwapBacked(page) && - !PageUnevictable(page)) { + !PageSwapCache(page) && !PageUnevictable(page)) { struct pagevec *pvec = &get_cpu_var(lru_lazyfree_pvecs); get_page(page); -- 2.9.5 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pg0-f72.google.com (mail-pg0-f72.google.com [74.125.83.72]) by kanga.kvack.org (Postfix) with ESMTP id 61FAF6B0253 for ; Thu, 21 Sep 2017 16:49:39 -0400 (EDT) Received: by mail-pg0-f72.google.com with SMTP id 11so13525010pge.4 for ; Thu, 21 Sep 2017 13:49:39 -0700 (PDT) Received: from mail.kernel.org (mail.kernel.org. [198.145.29.99]) by mx.google.com with ESMTPS id z190si1510103pgd.390.2017.09.21.13.49.38 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Thu, 21 Sep 2017 13:49:38 -0700 (PDT) From: Shaohua Li Subject: [PATCH 2/2] mm: fix data corruption caused by lazyfree page Date: Thu, 21 Sep 2017 13:27:11 -0700 Message-Id: In-Reply-To: References: In-Reply-To: References: Sender: owner-linux-mm@kvack.org List-ID: To: linux-mm@kvack.org Cc: Artem Savkov , Kernel-team@fb.com, Shaohua Li , Johannes Weiner , Michal Hocko , Hillf Danton , Minchan Kim , Hugh Dickins , Rik van Riel , Mel Gorman , Andrew Morton From: Shaohua Li MADV_FREE clears pte dirty bit and then marks the page lazyfree (clear SwapBacked). There is no lock to prevent the page is added to swap cache between these two steps by page reclaim. If page reclaim finds such page, it will simply add the page to swap cache without pageout the page to swap because the page is marked as clean. Next time, page fault will read data from the swap slot which doesn't have the original data, so we have a data corruption. To fix issue, we mark the page dirty and pageout the page. However, we shouldn't dirty all pages which is clean and in swap cache. swapin page is swap cache and clean too. So we only dirty page which is added into swap cache in page reclaim, which shouldn't be swapin page. Normal anonymous pages should be dirty already. Reported-and-tested-y: Artem Savkov Fix: 802a3a92ad7a(mm: reclaim MADV_FREE pages) Signed-off-by: Shaohua Li Cc: Johannes Weiner Cc: Michal Hocko Cc: Hillf Danton Cc: Minchan Kim Cc: Hugh Dickins Cc: Rik van Riel Cc: Mel Gorman Cc: Andrew Morton --- mm/vmscan.c | 12 ++++++++++++ 1 file changed, 12 insertions(+) diff --git a/mm/vmscan.c b/mm/vmscan.c index d811c81..820ee8d 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -980,6 +980,7 @@ static unsigned long shrink_page_list(struct list_head *page_list, int may_enter_fs; enum page_references references = PAGEREF_RECLAIM_CLEAN; bool dirty, writeback; + bool new_swap_page = false; cond_resched(); @@ -1165,6 +1166,7 @@ static unsigned long shrink_page_list(struct list_head *page_list, /* Adding to swap updated mapping */ mapping = page_mapping(page); + new_swap_page = true; } } else if (unlikely(PageTransHuge(page))) { /* Split file THP */ @@ -1185,6 +1187,16 @@ static unsigned long shrink_page_list(struct list_head *page_list, nr_unmap_fail++; goto activate_locked; } + + /* + * MADV_FREE clear pte dirty bit, but not yet clear + * SwapBacked for a page. We can't directly free the + * page because we already set swap entry in pte. The + * check guarantees this is such page and not a clean + * swapin page + */ + if (!PageDirty(page) && new_swap_page) + set_page_dirty(page); } if (PageDirty(page)) { -- 2.9.5 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-it0-f70.google.com (mail-it0-f70.google.com [209.85.214.70]) by kanga.kvack.org (Postfix) with ESMTP id 60CBC6B0033 for ; Thu, 21 Sep 2017 21:34:26 -0400 (EDT) Received: by mail-it0-f70.google.com with SMTP id c195so13115992itb.5 for ; Thu, 21 Sep 2017 18:34:26 -0700 (PDT) Received: from mx1.redhat.com (mx1.redhat.com. [209.132.183.28]) by mx.google.com with ESMTPS id l191si1879739oig.433.2017.09.21.18.34.25 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Thu, 21 Sep 2017 18:34:25 -0700 (PDT) Message-ID: <1506044061.21121.70.camel@redhat.com> Subject: Re: [PATCH 1/2] mm: avoid marking swap cached page as lazyfree From: Rik van Riel Date: Thu, 21 Sep 2017 21:34:21 -0400 In-Reply-To: References: Content-Type: text/plain; charset="UTF-8" Mime-Version: 1.0 Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Shaohua Li , linux-mm@kvack.org Cc: Artem Savkov , Kernel-team@fb.com, Shaohua Li , stable@vger.kernel.org, Johannes Weiner , Michal Hocko , Hillf Danton , Minchan Kim , Hugh Dickins , Mel Gorman , Andrew Morton On Thu, 2017-09-21 at 13:27 -0700, Shaohua Li wrote: > From: Shaohua Li > > MADV_FREE clears pte dirty bit and then marks the page lazyfree > (clear > SwapBacked). There is no lock to prevent the page is added to swap > cache > between these two steps by page reclaim. If the page is added to swap > cache, marking the page lazyfree will confuse page fault if the page > is > reclaimed and refault. > > Reported-and-tested-y: Artem Savkov > Fix: 802a3a92ad7a(mm: reclaim MADV_FREE pages) > Signed-off-by: Shaohua Li > Cc: stable@vger.kernel.org > Cc: Johannes Weiner > Cc: Michal Hocko > Cc: Hillf Danton > Cc: Minchan Kim > Cc: Hugh Dickins > Cc: Rik van Riel > Cc: Mel Gorman > Cc: Andrew Morton > Reviewed-by: Rik van Riel -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pg0-f71.google.com (mail-pg0-f71.google.com [74.125.83.71]) by kanga.kvack.org (Postfix) with ESMTP id A39056B0033 for ; Fri, 22 Sep 2017 02:01:44 -0400 (EDT) Received: by mail-pg0-f71.google.com with SMTP id p5so348538pgn.7 for ; Thu, 21 Sep 2017 23:01:44 -0700 (PDT) Received: from lgeamrelo11.lge.com (LGEAMRELO11.lge.com. [156.147.23.51]) by mx.google.com with ESMTP id c2si2249024pli.812.2017.09.21.23.01.42 for ; Thu, 21 Sep 2017 23:01:43 -0700 (PDT) Date: Fri, 22 Sep 2017 15:01:41 +0900 From: Minchan Kim Subject: Re: [PATCH 2/2] mm: fix data corruption caused by lazyfree page Message-ID: <20170922060141.GA18314@bbox> References: MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: Sender: owner-linux-mm@kvack.org List-ID: To: Shaohua Li Cc: linux-mm@kvack.org, Artem Savkov , Kernel-team@fb.com, Shaohua Li , Johannes Weiner , Michal Hocko , Hillf Danton , Hugh Dickins , Rik van Riel , Mel Gorman , Andrew Morton On Thu, Sep 21, 2017 at 01:27:11PM -0700, Shaohua Li wrote: > From: Shaohua Li > > MADV_FREE clears pte dirty bit and then marks the page lazyfree (clear > SwapBacked). There is no lock to prevent the page is added to swap cache > between these two steps by page reclaim. If page reclaim finds such > page, it will simply add the page to swap cache without pageout the page > to swap because the page is marked as clean. Next time, page fault will > read data from the swap slot which doesn't have the original data, so we > have a data corruption. To fix issue, we mark the page dirty and pageout > the page. > > However, we shouldn't dirty all pages which is clean and in swap cache. > swapin page is swap cache and clean too. So we only dirty page which is > added into swap cache in page reclaim, which shouldn't be swapin page. > Normal anonymous pages should be dirty already. > > Reported-and-tested-y: Artem Savkov > Fix: 802a3a92ad7a(mm: reclaim MADV_FREE pages) > Signed-off-by: Shaohua Li > Cc: Johannes Weiner > Cc: Michal Hocko > Cc: Hillf Danton > Cc: Minchan Kim > Cc: Hugh Dickins > Cc: Rik van Riel > Cc: Mel Gorman > Cc: Andrew Morton > --- > mm/vmscan.c | 12 ++++++++++++ > 1 file changed, 12 insertions(+) > > diff --git a/mm/vmscan.c b/mm/vmscan.c > index d811c81..820ee8d 100644 > --- a/mm/vmscan.c > +++ b/mm/vmscan.c > @@ -980,6 +980,7 @@ static unsigned long shrink_page_list(struct list_head *page_list, > int may_enter_fs; > enum page_references references = PAGEREF_RECLAIM_CLEAN; > bool dirty, writeback; > + bool new_swap_page = false; > > cond_resched(); > > @@ -1165,6 +1166,7 @@ static unsigned long shrink_page_list(struct list_head *page_list, > > /* Adding to swap updated mapping */ > mapping = page_mapping(page); > + new_swap_page = true; > } > } else if (unlikely(PageTransHuge(page))) { > /* Split file THP */ > @@ -1185,6 +1187,16 @@ static unsigned long shrink_page_list(struct list_head *page_list, > nr_unmap_fail++; > goto activate_locked; > } > + > + /* > + * MADV_FREE clear pte dirty bit, but not yet clear > + * SwapBacked for a page. We can't directly free the > + * page because we already set swap entry in pte. The > + * check guarantees this is such page and not a clean > + * swapin page > + */ > + if (!PageDirty(page) && new_swap_page) > + set_page_dirty(page); > } > > if (PageDirty(page)) { > -- > 2.9.5 > Couldn't we simple roll back to the logic before MADV_FREE's birth? diff --git a/mm/swap_state.c b/mm/swap_state.c index 71ce2d1ccbf7..548c19b5f78e 100644 --- a/mm/swap_state.c +++ b/mm/swap_state.c @@ -231,7 +231,7 @@ int add_to_swap(struct page *page) * deadlock in the swap out path. */ /* - * Add it to the swap cache. + * Add it to the swap cache and mark it dirty */ err = add_to_swap_cache(page, entry, __GFP_HIGH|__GFP_NOMEMALLOC|__GFP_NOWARN); @@ -243,6 +243,7 @@ int add_to_swap(struct page *page) */ goto fail; + SetPageDirty(page); return 1; fail: To me, it would be more simple/readable rather than introducing a new branch in complicated shrink_page_list. And I don't see why we cannot merge [1/2] and [2/2]. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pg0-f72.google.com (mail-pg0-f72.google.com [74.125.83.72]) by kanga.kvack.org (Postfix) with ESMTP id CCA2E6B0038 for ; Fri, 22 Sep 2017 14:45:24 -0400 (EDT) Received: by mail-pg0-f72.google.com with SMTP id 6so3630569pgh.0 for ; Fri, 22 Sep 2017 11:45:24 -0700 (PDT) Received: from mail.kernel.org (mail.kernel.org. [198.145.29.99]) by mx.google.com with ESMTPS id g7si230507pgq.827.2017.09.22.11.45.23 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Fri, 22 Sep 2017 11:45:23 -0700 (PDT) Date: Fri, 22 Sep 2017 11:45:20 -0700 From: Shaohua Li Subject: Re: [PATCH 2/2] mm: fix data corruption caused by lazyfree page Message-ID: <20170922184520.5gya5hsacudkr2z7@kernel.org> References: <20170922060141.GA18314@bbox> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20170922060141.GA18314@bbox> Sender: owner-linux-mm@kvack.org List-ID: To: Minchan Kim Cc: linux-mm@kvack.org, Artem Savkov , Kernel-team@fb.com, Shaohua Li , Johannes Weiner , Michal Hocko , Hillf Danton , Hugh Dickins , Rik van Riel , Mel Gorman , Andrew Morton On Fri, Sep 22, 2017 at 03:01:41PM +0900, Minchan Kim wrote: > On Thu, Sep 21, 2017 at 01:27:11PM -0700, Shaohua Li wrote: > > From: Shaohua Li > > > > MADV_FREE clears pte dirty bit and then marks the page lazyfree (clear > > SwapBacked). There is no lock to prevent the page is added to swap cache > > between these two steps by page reclaim. If page reclaim finds such > > page, it will simply add the page to swap cache without pageout the page > > to swap because the page is marked as clean. Next time, page fault will > > read data from the swap slot which doesn't have the original data, so we > > have a data corruption. To fix issue, we mark the page dirty and pageout > > the page. > > > > However, we shouldn't dirty all pages which is clean and in swap cache. > > swapin page is swap cache and clean too. So we only dirty page which is > > added into swap cache in page reclaim, which shouldn't be swapin page. > > Normal anonymous pages should be dirty already. > > > > Reported-and-tested-y: Artem Savkov > > Fix: 802a3a92ad7a(mm: reclaim MADV_FREE pages) > > Signed-off-by: Shaohua Li > > Cc: Johannes Weiner > > Cc: Michal Hocko > > Cc: Hillf Danton > > Cc: Minchan Kim > > Cc: Hugh Dickins > > Cc: Rik van Riel > > Cc: Mel Gorman > > Cc: Andrew Morton > > --- > > mm/vmscan.c | 12 ++++++++++++ > > 1 file changed, 12 insertions(+) > > > > diff --git a/mm/vmscan.c b/mm/vmscan.c > > index d811c81..820ee8d 100644 > > --- a/mm/vmscan.c > > +++ b/mm/vmscan.c > > @@ -980,6 +980,7 @@ static unsigned long shrink_page_list(struct list_head *page_list, > > int may_enter_fs; > > enum page_references references = PAGEREF_RECLAIM_CLEAN; > > bool dirty, writeback; > > + bool new_swap_page = false; > > > > cond_resched(); > > > > @@ -1165,6 +1166,7 @@ static unsigned long shrink_page_list(struct list_head *page_list, > > > > /* Adding to swap updated mapping */ > > mapping = page_mapping(page); > > + new_swap_page = true; > > } > > } else if (unlikely(PageTransHuge(page))) { > > /* Split file THP */ > > @@ -1185,6 +1187,16 @@ static unsigned long shrink_page_list(struct list_head *page_list, > > nr_unmap_fail++; > > goto activate_locked; > > } > > + > > + /* > > + * MADV_FREE clear pte dirty bit, but not yet clear > > + * SwapBacked for a page. We can't directly free the > > + * page because we already set swap entry in pte. The > > + * check guarantees this is such page and not a clean > > + * swapin page > > + */ > > + if (!PageDirty(page) && new_swap_page) > > + set_page_dirty(page); > > } > > > > if (PageDirty(page)) { > > -- > > 2.9.5 > > > > Couldn't we simple roll back to the logic before MADV_FREE's birth? > > diff --git a/mm/swap_state.c b/mm/swap_state.c > index 71ce2d1ccbf7..548c19b5f78e 100644 > --- a/mm/swap_state.c > +++ b/mm/swap_state.c > @@ -231,7 +231,7 @@ int add_to_swap(struct page *page) > * deadlock in the swap out path. > */ > /* > - * Add it to the swap cache. > + * Add it to the swap cache and mark it dirty > */ > err = add_to_swap_cache(page, entry, > __GFP_HIGH|__GFP_NOMEMALLOC|__GFP_NOWARN); > @@ -243,6 +243,7 @@ int add_to_swap(struct page *page) > */ > goto fail; > > + SetPageDirty(page); > return 1; > > fail: > > To me, it would be more simple/readable rather than introducing > a new branch in complicated shrink_page_list. This is neat, thanks for the suggestion! I'll use set_page_dirty, becuase swapcache set_page_dirty not just set the page dirty. > And I don't see why we cannot merge [1/2] and [2/2]. I feel two separate patches are more clear, but I'll let Andrew decide. Thanks, Shaohua -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org