From: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
To: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: benh@kernel.crashing.org, paulus@samba.org, mpe@ellerman.id.au,
	akpm@linux-foundation.org,
	Mel Gorman <mgorman@techsingularity.net>,
	"Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>,
	linuxppc-dev@lists.ozlabs.org, linux-mm@kvack.org,
	linux-kernel@vger.kernel.org
Subject: Re: [PATCH V2] powerpc/mm: Fix Multi hit ERAT cause by recent THP update
Date: Mon, 08 Feb 2016 19:34:32 +0530
Message-ID: <871t8n1eof.fsf@linux.vnet.ibm.com>
In-Reply-To: <20160208075247.GB9075@node.shutemov.name>

"Kirill A. Shutemov" <kirill@shutemov.name> writes:

> On Mon, Feb 08, 2016 at 11:44:22AM +0530, Aneesh Kumar K.V wrote:
>> With ppc64 we use the deposited pgtable_t to store the hash pte slot
>> information. We should not withdraw the deposited pgtable_t without
>> marking the pmd none. This ensures that low-level hash fault handling
>> will skip this huge pte and we will handle it at upper levels.
>>
>> A recent change to pmd splitting changed the above in order to handle the
>> race between pmd split and exit_mmap. The race is explained below.
>>
>> Consider the following race:
>>
>> 		CPU0				CPU1
>> shrink_page_list()
>>   add_to_swap()
>>     split_huge_page_to_list()
>>       __split_huge_pmd_locked()
>>         pmdp_huge_clear_flush_notify()
>> 	// pmd_none() == true
>> 					exit_mmap()
>> 					  unmap_vmas()
>> 					    zap_pmd_range()
>> 					      // no action on pmd since pmd_none() == true
>> 	pmd_populate()
>>
>> As a result the THP will not be freed. The leak is detected by check_mm():
>>
>> 	BUG: Bad rss-counter state mm:ffff880058d2e580 idx:1 val:512
>>
>> The above required us to not mark pmd none during a pmd split.
>>
>> The fix for ppc is to clear _PAGE_USER from the huge pte, so that the
>> low-level fault handling code skips this pte. At the higher level we do
>> take the ptl lock. That should serialize us against the pmd split. Once
>> the lock is acquired we check the pmd again using pmd_same. That should
>> always return false for us and hence we should retry the access.
>
> I guess it is worth mentioning that this serialization against ptl happens
> in huge_pmd_set_accessed(), if I didn't miss anything.

OK, I will update the commit message with the following:

"We do the pmd_same check in all cases after taking the ptl with
THP (do_huge_pmd_wp_page, do_huge_pmd_numa_page and
huge_pmd_set_accessed)"
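
As a rough illustration of that recheck (a userspace sketch, not kernel
code): the fault path samples the pmd, takes the lock, and compares the
current value against the sample with a pmd_same-style check before acting.
The pmd_t layout, the SKETCH_PAGE_USER bit value, and every *_sketch name
below are assumptions made for the example; the lock is a C11 atomic_flag
standing in for the ptl.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the kernel type; not the real definition. */
typedef struct { uint64_t val; } pmd_t;

/* Assumed bit value, for illustration only. */
#define SKETCH_PAGE_USER 0x4ULL

/* A trivial spinlock standing in for the page table lock (ptl). */
static atomic_flag ptl = ATOMIC_FLAG_INIT;

static void ptl_lock(void)
{
    while (atomic_flag_test_and_set_explicit(&ptl, memory_order_acquire))
        ; /* spin until the flag was previously clear */
}

static void ptl_unlock(void)
{
    atomic_flag_clear_explicit(&ptl, memory_order_release);
}

static bool pmd_same_sketch(pmd_t a, pmd_t b)
{
    return a.val == b.val;
}

/* Split side: clear the user bit under the lock, so the entry's value
 * differs from anything a fault path sampled earlier. */
static void split_prepare_sketch(pmd_t *pmdp)
{
    assert(pmdp != NULL);
    ptl_lock();
    pmdp->val &= ~SKETCH_PAGE_USER;
    ptl_unlock();
}

/* Fault side: 'orig' is the value read before taking the lock.
 * Returns true if the entry is unchanged and the fault can proceed,
 * false if the caller must retry the access. */
static bool fault_sketch(pmd_t *pmdp, pmd_t orig)
{
    bool handled;

    assert(pmdp != NULL);
    ptl_lock();
    handled = pmd_same_sketch(*pmdp, orig);
    ptl_unlock();
    return handled;
}
```

With a pmd sampled before the split, fault_sketch() returns true; once
split_prepare_sketch() has run, the same stale sample makes it return
false, which corresponds to the retry case described in the commit message.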
>
>>
>> Also make sure we wait for irq-disabled sections on other cpus to finish
>> before replacing a huge pte entry with a regular pmd entry. Code paths
>> like find_linux_pte_or_hugepte depend on irq disable to get
>> a stable pte_t pointer. A parallel thp split needs to make sure we
>> don't convert a pmd pte to a regular pmd entry without waiting for the
>> irq-disabled section to finish.
>>
>> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>


.....
...


>>  #ifndef __HAVE_ARCH_PTE_SAME
>>  static inline int pte_same(pte_t pte_a, pte_t pte_b)
>>  {
>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>> index 36c070167b71..b52d16a86e91 100644
>> --- a/mm/huge_memory.c
>> +++ b/mm/huge_memory.c
>> @@ -2860,6 +2860,7 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
>>  	young = pmd_young(*pmd);
>>  	dirty = pmd_dirty(*pmd);
>>
>> +	pmdp_huge_splitting_flush(vma, haddr, pmd);
>
> Let's call it pmdp_huge_split_prepare().
>
> The "_flush" part is a ppc-specific implementation detail and generic code
> should not expect the tlb to be flushed there.


OK, done.
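
For context, one common way to give generic code such a hook is a no-op
default that an architecture can override behind a guard macro, the same
shape as the __HAVE_ARCH_PTE_SAME guard visible in the patch context. This
is only a sketch: the guard name DEMO_HAVE_ARCH_SPLIT_PREPARE, the pmd_t
stand-in, and the demo_* functions are hypothetical and not taken from the
patch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the kernel type; not the real definition. */
typedef struct { unsigned long val; } pmd_t;

/* An architecture that needs to do work before a split would define the
 * guard macro and supply its own demo_split_prepare() ahead of this point. */
#ifndef DEMO_HAVE_ARCH_SPLIT_PREPARE
static inline void demo_split_prepare(pmd_t *pmdp)
{
    (void)pmdp; /* generic default: nothing to do */
}
#endif

/* Generic caller: it invokes the hook but assumes nothing about what the
 * hook did (in particular, nothing about TLB state). */
static unsigned long demo_split(pmd_t *pmdp)
{
    assert(pmdp != NULL);
    demo_split_prepare(pmdp);
    return pmdp->val;
}
```

The neutral name keeps the generic caller from implying a TLB flush; what
the hook actually does is left to the architecture's override.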

>
> Otherwise,
>
> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
>
>>  	pgtable = pgtable_trans_huge_withdraw(mm, pmd);
>>  	pmd_populate(mm, &_pmd, pgtable);
>>
>> --
>> 2.5.0
>>


-aneesh