Re: [PATCH] thp: use is_zero_pfn after pte_present check

From: Minchan Kim <minchan@kernel.org>
To: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Andrew Morton <akpm@linux-foundation.org>,
	Mel Gorman <mgorman@suse.de>, Vlastimil Babka <vbabka@suse.cz>,
	Andrea Arcangeli <aarcange@redhat.com>,
	Hugh Dickins <hughd@google.com>, Rik van Riel <riel@redhat.com>,
	linux-kernel@vger.kernel.org, linux-mm@kvack.org
Subject: Re: [PATCH] thp: use is_zero_pfn after pte_present check
Date: Mon, 12 Oct 2015 23:57:46 +0900
Message-ID: <20151012145746.GA11396@bbox>
In-Reply-To: <20151012101320.GB2544@node>

Hello,

On Mon, Oct 12, 2015 at 01:13:20PM +0300, Kirill A. Shutemov wrote:
> On Mon, Oct 12, 2015 at 10:54:16AM +0900, Minchan Kim wrote:
> > Use is_zero_pfn on pteval only after a pte_present check on pteval
> > (it might be a better idea to introduce is_zero_pte, which checks
> > pte_present first). Otherwise, it could operate on a swap or
> > migration entry, and if pte_pfn's result equals zero_pfn
> > by chance, we lose the user's data in __collapse_huge_page_copy.
> > So if you're lucky, the application segfaults and finally you
> > can see the message below when the application exits.
> > 
> > BUG: Bad rss-counter state mm:ffff88007f099300 idx:2 val:3
> 
> Did you actually step on the bug?
> If yes, it's a subject for stable@, I think.

Yes, I did, with my test program, which does heavy swap-in/out and
swapoff with MADV_DONTNEED in a memcg.
Actually, I had marked this patch for -stable but removed the tag right
before sending because my test program is artificial and I haven't seen
any report of bad rss counting with MM_SWAPENTS on linux-mm (of course,
I might have missed it).
In addition, I have sometimes seen people insist that "it's not stable
material if it's not a bug with a real workload". I didn't want to get
involved in such non-technical arguments, so I waited for someone to
nudge me to mark it for -stable and, finally, you did. ;-)
If the other reviewers have no objection, I will Cc -stable in the
next spin.

> 
> > Signed-off-by: Minchan Kim <minchan@kernel.org>
> > ---
> > 
> > I found this bug with a heavy MADV_FREE test. Sometimes I saw the
> > "Bad rss-counter" message with MM_SWAPENTS, but it's really
> > rare: once a day if I was lucky, or once in five days if I was
> > unlucky. So I am still testing and only a few days have passed, but
> > I hope it will fix the issue.
> > 
> >  mm/huge_memory.c | 12 +++++++++++-
> >  1 file changed, 11 insertions(+), 1 deletion(-)
> > 
> > diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> > index 4b06b8db9df2..349590aa4533 100644
> > --- a/mm/huge_memory.c
> > +++ b/mm/huge_memory.c
> > @@ -2665,15 +2665,25 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
> >  	for (_address = address, _pte = pte; _pte < pte+HPAGE_PMD_NR;
> >  	     _pte++, _address += PAGE_SIZE) {
> >  		pte_t pteval = *_pte;
> > -		if (pte_none(pteval) || is_zero_pfn(pte_pfn(pteval))) {
> > +		if (pte_none(pteval)) {
> 
> In -mm tree we have is_swap_pte() check before this point in
> khugepaged_scan_pmd()

Actually, I tested this patch on a v4.2 kernel, which doesn't have
the check.
Now I have looked through the "optimistic check for swapin readahead"
patch in current mmotm.
It seems the check can't prevent this problem because it releases
the pte lock and the anon_vma lock before the page is isolated in
__collapse_huge_page_isolate, so the page could be swapped out again.

> 
> Also, what about similar pattern in __collapse_huge_page_isolate() and
> __collapse_huge_page_copy()? Shouldn't they be fixed as well?

I see what's wrong here.
/me slaps self.
The line I meant to change was in __collapse_huge_page_isolate,
but I changed khugepaged_scan_pmd by mistake in my last modification
since that part is almost the same. :(
Fortunately, my test kernel is running the right version.
Here it goes.

From 2a2e4b247e132d823af30655dbc0b57738e9d6ee Mon Sep 17 00:00:00 2001
From: Minchan Kim <minchan@kernel.org>
Date: Mon, 12 Oct 2015 09:52:46 +0900
Subject: [PATCH] thp: use is_zero_pfn only after pte_present check

Use is_zero_pfn on pteval only after a pte_present check on pteval
(it might be a better idea to introduce is_zero_pte, which checks
pte_present first). Otherwise, it could operate on a swap or
migration entry, and if pte_pfn's result equals zero_pfn
by chance, we lose the user's data in __collapse_huge_page_copy.
So if you're lucky, the application segfaults and finally you
can see the message below when the application exits.

BUG: Bad rss-counter state mm:ffff88007f099300 idx:2 val:3

Signed-off-by: Minchan Kim <minchan@kernel.org>
---
 mm/huge_memory.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 4b06b8db9df2..bbac913f96bc 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2206,7 +2206,8 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
 	for (_pte = pte; _pte < pte+HPAGE_PMD_NR;
 	     _pte++, address += PAGE_SIZE) {
 		pte_t pteval = *_pte;
-		if (pte_none(pteval) || is_zero_pfn(pte_pfn(pteval))) {
+		if (pte_none(pteval) || (pte_present(pteval) &&
+				is_zero_pfn(pte_pfn(pteval)))) {
 			if (!userfaultfd_armed(vma) &&
 			    ++none_or_zero <= khugepaged_max_ptes_none)
 				continue;
-- 
1.9.1

In khugepaged_scan_pmd, although there is no is_swap_pte check in
v4.2, we don't need a pte_present check right before is_zero_pfn
because that part is just a scanning operation; even if something goes
wrong on rare occasions, it will be filtered out in
__collapse_huge_page_isolate with this patch.

In __collapse_huge_page_copy, we don't need the check either,
because every pte in the vma's 2M area points to either an isolated
LRU page or the zero page, so none of those pages can be swapped out.
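
To illustrate the is_zero_pte idea mentioned in the changelog, here is a
minimal userspace sketch. It is only a toy model: the pte layout
(bit 0 = present, payload from bit 12 upward), the TOY_* constants, and
the make_present_pte/make_swap_pte helpers are assumptions made up for
illustration and do not match any real architecture. is_zero_pte itself
is hypothetical as well; it is not an existing kernel helper in v4.2.

```c
#include <stdbool.h>
#include <stdint.h>

/* Toy pte: NOT the kernel's layout. Bit 0 = present, bits 12+ = payload. */
typedef struct { uint64_t val; } pte_t;

#define TOY_PTE_PRESENT 0x1ULL
#define TOY_PFN_SHIFT   12
#define TOY_ZERO_PFN    0x1234ULL /* hypothetical pfn of the zero page */

static inline bool pte_none(pte_t pte)    { return pte.val == 0; }
static inline bool pte_present(pte_t pte) { return (pte.val & TOY_PTE_PRESENT) != 0; }
static inline uint64_t pte_pfn(pte_t pte) { return pte.val >> TOY_PFN_SHIFT; }
static inline bool is_zero_pfn(uint64_t pfn) { return pfn == TOY_ZERO_PFN; }

/* Hypothetical helper: only trust the pfn bits of a present pte. */
static inline bool is_zero_pte(pte_t pte)
{
	return pte_present(pte) && is_zero_pfn(pte_pfn(pte));
}

/* Build a present pte that maps the given pfn. */
static inline pte_t make_present_pte(uint64_t pfn)
{
	pte_t pte = { (pfn << TOY_PFN_SHIFT) | TOY_PTE_PRESENT };
	return pte;
}

/* Build a non-present (swap-like) pte whose payload reuses the pfn bits. */
static inline pte_t make_swap_pte(uint64_t payload)
{
	pte_t pte = { payload << TOY_PFN_SHIFT };
	return pte;
}

/* Check as written before the patch: pfn bits are read unconditionally. */
static inline bool unsafe_none_or_zero(pte_t pte)
{
	return pte_none(pte) || is_zero_pfn(pte_pfn(pte));
}

/* Check in the spirit of the patch: pte_present is tested first. */
static inline bool safe_none_or_zero(pte_t pte)
{
	return pte_none(pte) || is_zero_pte(pte);
}
```

In this model, make_swap_pte(TOY_ZERO_PFN) is not present, yet its
payload decodes to the zero pfn, so unsafe_none_or_zero() accepts it
while safe_none_or_zero() rejects it. A present pte built with
make_present_pte(TOY_ZERO_PFN) is accepted by both.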

Thanks for the review.
