From: "Kirill A. Shutemov" <kirill@shutemov.name>
To: Minchan Kim <minchan@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>,
	Mel Gorman <mgorman@suse.de>, Vlastimil Babka <vbabka@suse.cz>,
	Andrea Arcangeli <aarcange@redhat.com>,
	Hugh Dickins <hughd@google.com>, Rik van Riel <riel@redhat.com>,
	linux-kernel@vger.kernel.org, linux-mm@kvack.org
Subject: Re: [PATCH] thp: use is_zero_pfn after pte_present check
Date: Mon, 12 Oct 2015 18:27:33 +0300
Message-ID: <20151012152733.GA6447@node>
In-Reply-To: <20151012145746.GA11396@bbox>

On Mon, Oct 12, 2015 at 11:57:46PM +0900, Minchan Kim wrote:
> Hello,
> 
> On Mon, Oct 12, 2015 at 01:13:20PM +0300, Kirill A. Shutemov wrote:
> > On Mon, Oct 12, 2015 at 10:54:16AM +0900, Minchan Kim wrote:
> > > Use is_zero_pfn on pteval only after a pte_present check on pteval
> > > (it might be a better idea to introduce is_zero_pte, which checks
> > > pte_present first). Otherwise, it could operate on a swap or
> > > migration entry, and if pte_pfn's result happens to equal zero_pfn
> > > by chance, we lose the user's data in __collapse_huge_page_copy.
> > > So if you're lucky, the application segfaults and you finally
> > > see the message below when the application exits.
> > > 
> > > BUG: Bad rss-counter state mm:ffff88007f099300 idx:2 val:3
> > 
> > Did you actually step on the bug?
> > If yes, it's a candidate for stable@, I think.
> 
> Yes, I did, with my test program, which does heavy swap-in/out and
> swapoff with MADV_DONTNEED in a memcg.
> Actually, I marked this patch as -stable but removed the tag right
> before sending because my test program is artificial and I didn't see
> any report about bad rss counting with MM_SWAPENTS on linux-mm (of
> course, I might have missed it).
> In addition, I have sometimes seen someone insist that "It's not
> stable material if it's not a bug with a real workload". I didn't want
> to get involved in such non-technical stuff, so I waited for someone to
> nudge me to mark it as -stable and, finally, you did. ;-)
> If other reviewers don't object, I will Cc -stable in the next spin.
> 
> > 
> > > Signed-off-by: Minchan Kim <minchan@kernel.org>
> > > ---
> > > 
> > > I found this bug with a heavy MADV_FREE test. Sometimes I saw the
> > > "Bad rss-counter" message with MM_SWAPENTS, but it's really
> > > rare: once a day if I was lucky, or once in five days if I was
> > > unlucky. So I am still testing, and only a few days have passed,
> > > but I hope this will fix the issue.
> > > 
> > >  mm/huge_memory.c | 12 +++++++++++-
> > >  1 file changed, 11 insertions(+), 1 deletion(-)
> > > 
> > > diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> > > index 4b06b8db9df2..349590aa4533 100644
> > > --- a/mm/huge_memory.c
> > > +++ b/mm/huge_memory.c
> > > @@ -2665,15 +2665,25 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
> > >  	for (_address = address, _pte = pte; _pte < pte+HPAGE_PMD_NR;
> > >  	     _pte++, _address += PAGE_SIZE) {
> > >  		pte_t pteval = *_pte;
> > > -		if (pte_none(pteval) || is_zero_pfn(pte_pfn(pteval))) {
> > > +		if (pte_none(pteval)) {
> > 
> > In the -mm tree we have an is_swap_pte() check before this point in
> > khugepaged_scan_pmd()
> 
> Actually, I tested this patch with the v4.2 kernel, which doesn't have
> the check.
> Now I have looked through the "optimistic check for swapin readahead"
> patch in current mmotm.
> It seems the check couldn't prevent this problem because it releases
> the pte lock and anon_vma lock before the page is isolated in
> __collapse_huge_page_isolate, so the page could be swapped out again.
> 
> > 
> > Also, what about the similar pattern in __collapse_huge_page_isolate()
> > and __collapse_huge_page_copy()? Shouldn't they be fixed as well?
> 
> I see what's wrong here.
> /me slaps self.
> The line I meant to change was in __collapse_huge_page_isolate,
> but I changed khugepaged_scan_pmd by mistake in the last modification
> since that part is almost the same. :(
> Fortunately, my test kernel is running the right version.
> Here it goes.
> 
> From 2a2e4b247e132d823af30655dbc0b57738e9d6ee Mon Sep 17 00:00:00 2001
> From: Minchan Kim <minchan@kernel.org>
> Date: Mon, 12 Oct 2015 09:52:46 +0900
> Subject: [PATCH] thp: use is_zero_pfn only after pte_present check
> 
> Use is_zero_pfn on pteval only after a pte_present check on pteval
> (it might be a better idea to introduce is_zero_pte, which checks
> pte_present first). Otherwise, it could operate on a swap or
> migration entry, and if pte_pfn's result happens to equal zero_pfn
> by chance, we lose the user's data in __collapse_huge_page_copy.
> So if you're lucky, the application segfaults and you finally
> see the message below when the application exits.
> 
> BUG: Bad rss-counter state mm:ffff88007f099300 idx:2 val:3
> 
> Signed-off-by: Minchan Kim <minchan@kernel.org>

Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>

> ---
>  mm/huge_memory.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 4b06b8db9df2..bbac913f96bc 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -2206,7 +2206,8 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
>  	for (_pte = pte; _pte < pte+HPAGE_PMD_NR;
>  	     _pte++, address += PAGE_SIZE) {
>  		pte_t pteval = *_pte;
> -		if (pte_none(pteval) || is_zero_pfn(pte_pfn(pteval))) {
> +		if (pte_none(pteval) || (pte_present(pteval) &&
> +				is_zero_pfn(pte_pfn(pteval)))) {
>  			if (!userfaultfd_armed(vma) &&
>  			    ++none_or_zero <= khugepaged_max_ptes_none)
>  				continue;
> -- 
> 1.9.1
> 
> 
> In khugepaged_scan_pmd, although there is no is_swap_pte check in
> v4.2, we don't need a pte_present check right before is_zero_pfn
> because that part is just a scanning operation, so even if something
> goes wrong on rare occasions, it should be filtered out in
> __collapse_huge_page_isolate with this patch.
> 
> In __collapse_huge_page_copy, we don't need the check either,
> because every pte in the vma's 2M area points to an isolated LRU page
> or the zero page, so none of the pages can be swapped out.
> 
> Thanks for the review.

-- 
 Kirill A. Shutemov
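
The aliasing problem discussed above can be illustrated with a small userspace sketch. This is a toy model under stated assumptions, not kernel code: the `toy_` types, the bit layout, and the `TOY_ZERO_PFN` value are all hypothetical and chosen only so that a non-present swap entry can carry bits that decode to the zero page's PFN. The name `toy_is_zero_pte` follows the is_zero_pte helper suggested in the quoted changelog.

```c
#include <stdbool.h>
#include <stdint.h>

/* Toy PTE, NOT the kernel's pte_t: bit 0 is "present"; bits 12 and up
 * hold a PFN when present, or swap-entry data when not present. */
typedef struct { uint64_t val; } toy_pte_t;

#define TOY_PRESENT   0x1ULL
#define TOY_SWAP_MARK 0x2ULL     /* hypothetical: keeps a swap entry non-zero */
#define TOY_SHIFT     12
#define TOY_ZERO_PFN  0x1234ULL  /* hypothetical PFN of the shared zero page */

static bool toy_pte_none(toy_pte_t p)     { return p.val == 0; }
static bool toy_pte_present(toy_pte_t p)  { return (p.val & TOY_PRESENT) != 0; }
static uint64_t toy_pte_pfn(toy_pte_t p)  { return p.val >> TOY_SHIFT; }
static bool toy_is_zero_pfn(uint64_t pfn) { return pfn == TOY_ZERO_PFN; }

/* Build a present PTE that maps the given PFN. */
static toy_pte_t toy_mk_present(uint64_t pfn)
{
	toy_pte_t p = { (pfn << TOY_SHIFT) | TOY_PRESENT };
	return p;
}

/* Build a non-present swap entry; the offset shares bits with the PFN field. */
static toy_pte_t toy_mk_swap(uint64_t offset)
{
	toy_pte_t p = { (offset << TOY_SHIFT) | TOY_SWAP_MARK };
	return p;
}

/* Shape of the pre-patch test: decodes a "PFN" even from a swap entry. */
static bool none_or_zero_unguarded(toy_pte_t p)
{
	return toy_pte_none(p) || toy_is_zero_pfn(toy_pte_pfn(p));
}

/* Helper in the spirit of the suggested is_zero_pte: present check first. */
static bool toy_is_zero_pte(toy_pte_t p)
{
	return toy_pte_present(p) && toy_is_zero_pfn(toy_pte_pfn(p));
}

/* Shape of the patched test in __collapse_huge_page_isolate. */
static bool none_or_zero_guarded(toy_pte_t p)
{
	return toy_pte_none(p) || toy_is_zero_pte(p);
}
```

In this model, `none_or_zero_unguarded(toy_mk_swap(TOY_ZERO_PFN))` is true even though the entry is a swap entry, while `none_or_zero_guarded()` is false for the same entry. Both versions agree on empty entries, on a present zero-page mapping, and on an ordinary present page.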

