Re: [PATCH] thp: use is_zero_pfn after pte_present check

From: Vlastimil Babka <vbabka@suse.cz>
To: Minchan Kim <minchan@kernel.org>,
	"Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Andrew Morton <akpm@linux-foundation.org>,
	Mel Gorman <mgorman@suse.de>,
	Andrea Arcangeli <aarcange@redhat.com>,
	Hugh Dickins <hughd@google.com>, Rik van Riel <riel@redhat.com>,
	linux-kernel@vger.kernel.org, linux-mm@kvack.org
Subject: Re: [PATCH] thp: use is_zero_pfn after pte_present check
Date: Mon, 12 Oct 2015 17:15:06 +0200
Message-ID: <561BCE7A.1080403@suse.cz>
In-Reply-To: <20151012145746.GA11396@bbox>

On 10/12/2015 04:57 PM, Minchan Kim wrote:
> Hello,
>
> On Mon, Oct 12, 2015 at 01:13:20PM +0300, Kirill A. Shutemov wrote:
>> On Mon, Oct 12, 2015 at 10:54:16AM +0900, Minchan Kim wrote:
>>> Use is_zero_pfn on pteval only after a pte_present check on pteval
>>> (it might be a better idea to introduce is_zero_pte, which checks
>>> pte_present first). Otherwise, it could operate on a swap or
>>> migration entry, and if pte_pfn's result happens to equal zero_pfn,
>>> we lose the user's data in __collapse_huge_page_copy.
>>> So if you're lucky, the application segfaults and you eventually
>>> see the message below when the application exits.
>>>
>>> BUG: Bad rss-counter state mm:ffff88007f099300 idx:2 val:3
>>
>> Did you actually hit the bug?
>> If so, it's a candidate for stable@, I think.
>
> Yes, I did, with my test program, which does heavy swap-in/out and
> swapoff with MADV_DONTNEED in a memcg.
> Actually, I had marked this patch for -stable but removed the tag
> right before sending, because my test program is artificial and I
> haven't seen any report of bad rss counting with MM_SWAPENTS on
> linux-mm (of course, I might have missed one).
> In addition, I have sometimes seen people insist that "it's not
> stable material if it's not a bug with a real workload". I didn't
> want to get into such non-technical arguments, so I waited for
> someone to nudge me to mark it for -stable, and you finally did. ;-)

I also think this should go to -stable, and I haven't heard the "real
workload" argument before.

> If other reviewers don't object, I will Cc -stable in the next spin.
>
>>
>>> Signed-off-by: Minchan Kim <minchan@kernel.org>
>>> ---
>>>
>>> I found this bug with a heavy MADV_FREE test. Sometimes I saw the
>>> "Bad rss-counter" message with MM_SWAPENTS, but it's really rare:
>>> once a day if I was lucky, or once in five days if I was unlucky.
>>> I am still testing and only a few days have passed, but I hope
>>> this will fix the issue.
>>>
>>>   mm/huge_memory.c | 12 +++++++++++-
>>>   1 file changed, 11 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>>> index 4b06b8db9df2..349590aa4533 100644
>>> --- a/mm/huge_memory.c
>>> +++ b/mm/huge_memory.c
>>> @@ -2665,15 +2665,25 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
>>>   	for (_address = address, _pte = pte; _pte < pte+HPAGE_PMD_NR;
>>>   	     _pte++, _address += PAGE_SIZE) {
>>>   		pte_t pteval = *_pte;
>>> -		if (pte_none(pteval) || is_zero_pfn(pte_pfn(pteval))) {
>>> +		if (pte_none(pteval)) {
>>
>> In the -mm tree we have an is_swap_pte() check before this point in
>> khugepaged_scan_pmd().
>
> Actually, I tested this patch on a v4.2 kernel, which doesn't have
> that check.
> I have now looked through the "optimistic check for swapin readahead"
> patch in current mmotm.
> It seems the check can't prevent this problem, because the pte lock
> and anon_vma lock are released before the page is isolated in
> __collapse_huge_page_isolate, so the page could be swapped out again.
>
>>
>> Also, what about the similar pattern in
>> __collapse_huge_page_isolate() and __collapse_huge_page_copy()?
>> Shouldn't they be fixed as well?
>
> I see what's wrong here.
> /me slaps self.
> The line I meant to change was in __collapse_huge_page_isolate,
> but I changed khugepaged_scan_pmd by mistake in my last modification,
> since that part is almost the same. :(
> Fortunately, my test kernel is running the right version.
> Here it goes.
>
>  From 2a2e4b247e132d823af30655dbc0b57738e9d6ee Mon Sep 17 00:00:00 2001
> From: Minchan Kim <minchan@kernel.org>
> Date: Mon, 12 Oct 2015 09:52:46 +0900
> Subject: [PATCH] thp: use is_zero_pfn only after pte_present check
>
> Use is_zero_pfn on pteval only after a pte_present check on pteval
> (it might be a better idea to introduce is_zero_pte, which checks
> pte_present first). Otherwise, it could operate on a swap or
> migration entry, and if pte_pfn's result happens to equal zero_pfn,
> we lose the user's data in __collapse_huge_page_copy.
> So if you're lucky, the application segfaults and you eventually
> see the message below when the application exits.
>
> BUG: Bad rss-counter state mm:ffff88007f099300 idx:2 val:3
>
> Signed-off-by: Minchan Kim <minchan@kernel.org>

So this patch should be tagged for stable 4.1+. Does it apply to both
-next and 4.3-rcX?

> ---
>   mm/huge_memory.c | 3 ++-
>   1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 4b06b8db9df2..bbac913f96bc 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -2206,7 +2206,8 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
>   	for (_pte = pte; _pte < pte+HPAGE_PMD_NR;
>   	     _pte++, address += PAGE_SIZE) {
>   		pte_t pteval = *_pte;
> -		if (pte_none(pteval) || is_zero_pfn(pte_pfn(pteval))) {
> +		if (pte_none(pteval) || (pte_present(pteval) &&
> +				is_zero_pfn(pte_pfn(pteval)))) {
>   			if (!userfaultfd_armed(vma) &&
>   			    ++none_or_zero <= khugepaged_max_ptes_none)
>   				continue;
>
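
To make the changelog's point concrete, here is a minimal userspace
sketch of the is_zero_pte idea it mentions. Everything in it is an
assumption for illustration: the toy_* names, the PTE layout (bit 0 =
present, payload from bit 12 up) and the zero-page PFN value are made up
and do not match any real architecture or the kernel's actual helpers.
It only shows why decoding a PFN from a non-present entry can falsely
match zero_pfn, and how checking pte_present first avoids that.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy PTE, NOT a real layout: bit 0 = present, bits 12+ = payload. */
typedef struct { uint64_t val; } toy_pte_t;

#define TOY_PRESENT   0x1ULL
#define TOY_PFN_SHIFT 12

/* Hypothetical PFN of the shared zero page in this toy model. */
static const uint64_t toy_zero_pfn = 0x1234;

/* Build a present PTE that maps the given PFN. */
static inline toy_pte_t toy_mk_present(uint64_t pfn)
{
	assert(pfn < (1ULL << 52));
	return (toy_pte_t){ (pfn << TOY_PFN_SHIFT) | TOY_PRESENT };
}

/*
 * Build a non-present (swap-like) entry. The payload stands in for the
 * swap type/offset bits and occupies the same bits a PFN would.
 */
static inline toy_pte_t toy_mk_swap(uint64_t payload)
{
	assert(payload != 0 && payload < (1ULL << 52));
	return (toy_pte_t){ payload << TOY_PFN_SHIFT };
}

static inline bool toy_pte_none(toy_pte_t pte)
{
	return pte.val == 0;
}

static inline bool toy_pte_present(toy_pte_t pte)
{
	return (pte.val & TOY_PRESENT) != 0;
}

/* Decodes the payload bits blindly, whether or not the PTE is present. */
static inline uint64_t toy_pte_pfn(toy_pte_t pte)
{
	return pte.val >> TOY_PFN_SHIFT;
}

static inline bool toy_is_zero_pfn(uint64_t pfn)
{
	return pfn == toy_zero_pfn;
}

/* Shape of the old test: trusts the decoded PFN of any entry. */
static inline bool none_or_zero_unchecked(toy_pte_t pte)
{
	return toy_pte_none(pte) || toy_is_zero_pfn(toy_pte_pfn(pte));
}

/* is_zero_pte-style helper: only present PTEs can map the zero page. */
static inline bool toy_is_zero_pte(toy_pte_t pte)
{
	return toy_pte_present(pte) && toy_is_zero_pfn(toy_pte_pfn(pte));
}

/* Shape of the fixed test from the patch above. */
static inline bool none_or_zero_checked(toy_pte_t pte)
{
	return toy_pte_none(pte) || toy_is_zero_pte(pte);
}
```

With these definitions, a swap entry whose payload happens to equal
toy_zero_pfn is counted as none-or-zero by none_or_zero_unchecked() but
not by none_or_zero_checked(), while a present zero-page PTE and an
empty PTE are accepted by both.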
