From: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
To: Balbir Singh <bsingharora@gmail.com>,
benh@kernel.crashing.org, paulus@samba.org, mpe@ellerman.id.au,
akpm@linux-foundation.org,
Mel Gorman <mgorman@techsingularity.net>,
"Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: linuxppc-dev@lists.ozlabs.org, linux-mm@kvack.org,
linux-kernel@vger.kernel.org
Subject: Re: [PATCH V3] powerpc/mm: Fix Multi hit ERAT cause by recent THP update
Date: Mon, 15 Feb 2016 10:07:08 +0530 [thread overview]
Message-ID: <87lh6mfv2j.fsf@linux.vnet.ibm.com> (raw)
In-Reply-To: <1455504278.16012.18.camel@gmail.com>
Balbir Singh <bsingharora@gmail.com> writes:
> On Tue, 2016-02-09 at 06:50 +0530, Aneesh Kumar K.V wrote:
>>=C2=A0
>> Also make sure we wait for irq disable section in other cpus to finish
>> before flipping a huge pte entry with a regular pmd entry. Code paths
>> like find_linux_pte_or_hugepte depend on irq disable to get
>> a stable pte_t pointer. A parallel thp split need to make sure we
>> don't convert a pmd pte to a regular pmd entry without waiting for the
>> irq disable section to finish.
>>=20
>> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
>> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
>> ---
>> =C2=A0arch/powerpc/include/asm/book3s/64/pgtable.h |=C2=A0=C2=A04 ++++
>> =C2=A0arch/powerpc/mm/pgtable_64.c=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=
=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0| 35
>> +++++++++++++++++++++++++++-
>> =C2=A0include/asm-generic/pgtable.h=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=
=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0|=C2=A0=C2=A08 =
+++++++
>> =C2=A0mm/huge_memory.c=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=
=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=
=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0|=C2=A0=C2=A01 +
>> =C2=A04 files changed, 47 insertions(+), 1 deletion(-)
>>=20
>> diff --git a/arch/powerpc/include/asm/book3s/64/pgtable.h
>> b/arch/powerpc/include/asm/book3s/64/pgtable.h
>> index 8d1c41d28318..ac07a30a7934 100644
>> --- a/arch/powerpc/include/asm/book3s/64/pgtable.h
>> +++ b/arch/powerpc/include/asm/book3s/64/pgtable.h
>> @@ -281,6 +281,10 @@ extern pgtable_t pgtable_trans_huge_withdraw(struct
>> mm_struct *mm, pmd_t *pmdp);
>> =C2=A0extern void pmdp_invalidate(struct vm_area_struct *vma, unsigned l=
ong
>> address,
>> =C2=A0 =C2=A0=C2=A0=C2=A0=C2=A0pmd_t *pmdp);
>> =C2=A0
>> +#define __HAVE_ARCH_PMDP_HUGE_SPLIT_PREPARE
>> +extern void pmdp_huge_split_prepare(struct vm_area_struct *vma,
>> + =C2=A0=C2=A0=C2=A0=C2=A0unsigned long address, pmd_t *pmdp);
>> +
>> =C2=A0#define pmd_move_must_withdraw pmd_move_must_withdraw
>> =C2=A0struct spinlock;
>> =C2=A0static inline int pmd_move_must_withdraw(struct spinlock *new_pmd_=
ptl,
>> diff --git a/arch/powerpc/mm/pgtable_64.c b/arch/powerpc/mm/pgtable_64.c
>> index 3124a20d0fab..c8a00da39969 100644
>> --- a/arch/powerpc/mm/pgtable_64.c
>> +++ b/arch/powerpc/mm/pgtable_64.c
>> @@ -646,6 +646,30 @@ pgtable_t pgtable_trans_huge_withdraw(struct mm_str=
uct
>> *mm, pmd_t *pmdp)
>> =C2=A0 return pgtable;
>> =C2=A0}
>> =C2=A0
>> +void pmdp_huge_split_prepare(struct vm_area_struct *vma,
>> + =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0unsigned long address, pmd_t *pmdp)
>> +{
>> + VM_BUG_ON(address & ~HPAGE_PMD_MASK);
>> +
>> +#ifdef CONFIG_DEBUG_VM
>> + BUG_ON(REGION_ID(address) !=3D USER_REGION_ID);
>> +#endif
>> + /*
>> + =C2=A0* We can't mark the pmd none here, because that will cause a race
>> + =C2=A0* against exit_mmap. We need to continue mark pmd TRANS HUGE, wh=
ile
>> + =C2=A0* we spilt, but at the same time we wan't rest of the ppc64 code
>> + =C2=A0* not to insert hash pte on this, because we will be modifying
>> + =C2=A0* the deposited pgtable in the caller of this function. Hence
>> + =C2=A0* clear the _PAGE_USER so that we move the fault handling to
>> + =C2=A0* higher level function and that will serialize against ptl.
>> + =C2=A0* We need to flush existing hash pte entries here even though,
>> + =C2=A0* the translation is still valid, because we will withdraw
>> + =C2=A0* pgtable_t after this.
>> + =C2=A0*/
>> + pmd_hugepage_update(vma->vm_mm, address, pmdp, _PAGE_USER, 0);
>
> Can this break any checks for _PAGE_USER? From other paths?
Should not, that is the same condition we use for autonuma.
>
>> +}
>> +
>> +
>> =C2=A0/*
>> =C2=A0 * set a new huge pmd. We should not be called for updating
>> =C2=A0 * an existing pmd entry. That should go via pmd_hugepage_update.
>> @@ -663,10 +687,19 @@ void set_pmd_at(struct mm_struct *mm, unsigned long
>> addr,
>> =C2=A0 return set_pte_at(mm, addr, pmdp_ptep(pmdp), pmd_pte(pmd));
>> =C2=A0}
>> =C2=A0
>> +/*
>> + * We use this to invalidate a pmdp entry before switching from a
>> + * hugepte to regular pmd entry.
>> + */
>> =C2=A0void pmdp_invalidate(struct vm_area_struct *vma, unsigned long add=
ress,
>> =C2=A0 =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0pmd_t *pmdp)
>> =C2=A0{
>> - pmd_hugepage_update(vma->vm_mm, address, pmdp, _PAGE_PRESENT, 0);
>> + pmd_hugepage_update(vma->vm_mm, address, pmdp, ~0UL, 0);
>> + /*
>> + =C2=A0* This ensures that generic code that rely on IRQ disabling
>> + =C2=A0* to prevent a parallel THP split work as expected.
>> + =C2=A0*/
>> + kick_all_cpus_sync();
>
> Seems expensive, anyway I think the right should do something like or a w=
rapper
> for it
>
> on_each_cpu_mask(mm_cpumask(vma->vm_mm), do_nothing, NULL, 1);
>
> do_nothing is not exported, but that can be fixed :)
>
Now we can't depend for mm_cpumask, a parallel find_linux_pte_hugepte
can happen outside that. Now i had a variant for kick_all_cpus_sync that
ignored idle cpus. But then that needs more verification.
http://article.gmane.org/gmane.linux.ports.ppc.embedded/81105
-aneesh
WARNING: multiple messages have this Message-ID (diff)
From: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
To: Balbir Singh <bsingharora@gmail.com>,
benh@kernel.crashing.org, paulus@samba.org, mpe@ellerman.id.au,
akpm@linux-foundation.org,
Mel Gorman <mgorman@techsingularity.net>,
"Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: linuxppc-dev@lists.ozlabs.org, linux-mm@kvack.org,
linux-kernel@vger.kernel.org
Subject: Re: [PATCH V3] powerpc/mm: Fix Multi hit ERAT cause by recent THP update
Date: Mon, 15 Feb 2016 10:07:08 +0530 [thread overview]
Message-ID: <87lh6mfv2j.fsf@linux.vnet.ibm.com> (raw)
In-Reply-To: <1455504278.16012.18.camel@gmail.com>
Balbir Singh <bsingharora@gmail.com> writes:
> On Tue, 2016-02-09 at 06:50 +0530, Aneesh Kumar K.V wrote:
>>
>> Also make sure we wait for irq disable section in other cpus to finish
>> before flipping a huge pte entry with a regular pmd entry. Code paths
>> like find_linux_pte_or_hugepte depend on irq disable to get
>> a stable pte_t pointer. A parallel thp split need to make sure we
>> don't convert a pmd pte to a regular pmd entry without waiting for the
>> irq disable section to finish.
>>
>> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
>> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
>> ---
>> arch/powerpc/include/asm/book3s/64/pgtable.h | 4 ++++
>> arch/powerpc/mm/pgtable_64.c | 35
>> +++++++++++++++++++++++++++-
>> include/asm-generic/pgtable.h | 8 +++++++
>> mm/huge_memory.c | 1 +
>> 4 files changed, 47 insertions(+), 1 deletion(-)
>>
>> diff --git a/arch/powerpc/include/asm/book3s/64/pgtable.h
>> b/arch/powerpc/include/asm/book3s/64/pgtable.h
>> index 8d1c41d28318..ac07a30a7934 100644
>> --- a/arch/powerpc/include/asm/book3s/64/pgtable.h
>> +++ b/arch/powerpc/include/asm/book3s/64/pgtable.h
>> @@ -281,6 +281,10 @@ extern pgtable_t pgtable_trans_huge_withdraw(struct
>> mm_struct *mm, pmd_t *pmdp);
>> extern void pmdp_invalidate(struct vm_area_struct *vma, unsigned long
>> address,
>> pmd_t *pmdp);
>>
>> +#define __HAVE_ARCH_PMDP_HUGE_SPLIT_PREPARE
>> +extern void pmdp_huge_split_prepare(struct vm_area_struct *vma,
>> + unsigned long address, pmd_t *pmdp);
>> +
>> #define pmd_move_must_withdraw pmd_move_must_withdraw
>> struct spinlock;
>> static inline int pmd_move_must_withdraw(struct spinlock *new_pmd_ptl,
>> diff --git a/arch/powerpc/mm/pgtable_64.c b/arch/powerpc/mm/pgtable_64.c
>> index 3124a20d0fab..c8a00da39969 100644
>> --- a/arch/powerpc/mm/pgtable_64.c
>> +++ b/arch/powerpc/mm/pgtable_64.c
>> @@ -646,6 +646,30 @@ pgtable_t pgtable_trans_huge_withdraw(struct mm_struct
>> *mm, pmd_t *pmdp)
>> return pgtable;
>> }
>>
>> +void pmdp_huge_split_prepare(struct vm_area_struct *vma,
>> + unsigned long address, pmd_t *pmdp)
>> +{
>> + VM_BUG_ON(address & ~HPAGE_PMD_MASK);
>> +
>> +#ifdef CONFIG_DEBUG_VM
>> + BUG_ON(REGION_ID(address) != USER_REGION_ID);
>> +#endif
>> + /*
>> + * We can't mark the pmd none here, because that will cause a race
>> + * against exit_mmap. We need to continue mark pmd TRANS HUGE, while
>> + * we spilt, but at the same time we wan't rest of the ppc64 code
>> + * not to insert hash pte on this, because we will be modifying
>> + * the deposited pgtable in the caller of this function. Hence
>> + * clear the _PAGE_USER so that we move the fault handling to
>> + * higher level function and that will serialize against ptl.
>> + * We need to flush existing hash pte entries here even though,
>> + * the translation is still valid, because we will withdraw
>> + * pgtable_t after this.
>> + */
>> + pmd_hugepage_update(vma->vm_mm, address, pmdp, _PAGE_USER, 0);
>
> Can this break any checks for _PAGE_USER? From other paths?
Should not, that is the same condition we use for autonuma.
>
>> +}
>> +
>> +
>> /*
>> * set a new huge pmd. We should not be called for updating
>> * an existing pmd entry. That should go via pmd_hugepage_update.
>> @@ -663,10 +687,19 @@ void set_pmd_at(struct mm_struct *mm, unsigned long
>> addr,
>> return set_pte_at(mm, addr, pmdp_ptep(pmdp), pmd_pte(pmd));
>> }
>>
>> +/*
>> + * We use this to invalidate a pmdp entry before switching from a
>> + * hugepte to regular pmd entry.
>> + */
>> void pmdp_invalidate(struct vm_area_struct *vma, unsigned long address,
>> pmd_t *pmdp)
>> {
>> - pmd_hugepage_update(vma->vm_mm, address, pmdp, _PAGE_PRESENT, 0);
>> + pmd_hugepage_update(vma->vm_mm, address, pmdp, ~0UL, 0);
>> + /*
>> + * This ensures that generic code that rely on IRQ disabling
>> + * to prevent a parallel THP split work as expected.
>> + */
>> + kick_all_cpus_sync();
>
> Seems expensive, anyway I think the right should do something like or a wrapper
> for it
>
> on_each_cpu_mask(mm_cpumask(vma->vm_mm), do_nothing, NULL, 1);
>
> do_nothing is not exported, but that can be fixed :)
>
Now we can't depend for mm_cpumask, a parallel find_linux_pte_hugepte
can happen outside that. Now i had a variant for kick_all_cpus_sync that
ignored idle cpus. But then that needs more verification.
http://article.gmane.org/gmane.linux.ports.ppc.embedded/81105
-aneesh
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
WARNING: multiple messages have this Message-ID (diff)
From: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
To: Balbir Singh <bsingharora@gmail.com>,
benh@kernel.crashing.org, paulus@samba.org, mpe@ellerman.id.au,
akpm@linux-foundation.org,
Mel Gorman <mgorman@techsingularity.net>,
"Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: linuxppc-dev@lists.ozlabs.org, linux-mm@kvack.org,
linux-kernel@vger.kernel.org
Subject: Re: [PATCH V3] powerpc/mm: Fix Multi hit ERAT cause by recent THP update
Date: Mon, 15 Feb 2016 10:07:08 +0530 [thread overview]
Message-ID: <87lh6mfv2j.fsf@linux.vnet.ibm.com> (raw)
In-Reply-To: <1455504278.16012.18.camel@gmail.com>
Balbir Singh <bsingharora@gmail.com> writes:
> On Tue, 2016-02-09 at 06:50 +0530, Aneesh Kumar K.V wrote:
>>
>> Also make sure we wait for irq disable section in other cpus to finish
>> before flipping a huge pte entry with a regular pmd entry. Code paths
>> like find_linux_pte_or_hugepte depend on irq disable to get
>> a stable pte_t pointer. A parallel thp split need to make sure we
>> don't convert a pmd pte to a regular pmd entry without waiting for the
>> irq disable section to finish.
>>
>> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
>> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
>> ---
>> arch/powerpc/include/asm/book3s/64/pgtable.h | 4 ++++
>> arch/powerpc/mm/pgtable_64.c | 35
>> +++++++++++++++++++++++++++-
>> include/asm-generic/pgtable.h | 8 +++++++
>> mm/huge_memory.c | 1 +
>> 4 files changed, 47 insertions(+), 1 deletion(-)
>>
>> diff --git a/arch/powerpc/include/asm/book3s/64/pgtable.h
>> b/arch/powerpc/include/asm/book3s/64/pgtable.h
>> index 8d1c41d28318..ac07a30a7934 100644
>> --- a/arch/powerpc/include/asm/book3s/64/pgtable.h
>> +++ b/arch/powerpc/include/asm/book3s/64/pgtable.h
>> @@ -281,6 +281,10 @@ extern pgtable_t pgtable_trans_huge_withdraw(struct
>> mm_struct *mm, pmd_t *pmdp);
>> extern void pmdp_invalidate(struct vm_area_struct *vma, unsigned long
>> address,
>> pmd_t *pmdp);
>>
>> +#define __HAVE_ARCH_PMDP_HUGE_SPLIT_PREPARE
>> +extern void pmdp_huge_split_prepare(struct vm_area_struct *vma,
>> + unsigned long address, pmd_t *pmdp);
>> +
>> #define pmd_move_must_withdraw pmd_move_must_withdraw
>> struct spinlock;
>> static inline int pmd_move_must_withdraw(struct spinlock *new_pmd_ptl,
>> diff --git a/arch/powerpc/mm/pgtable_64.c b/arch/powerpc/mm/pgtable_64.c
>> index 3124a20d0fab..c8a00da39969 100644
>> --- a/arch/powerpc/mm/pgtable_64.c
>> +++ b/arch/powerpc/mm/pgtable_64.c
>> @@ -646,6 +646,30 @@ pgtable_t pgtable_trans_huge_withdraw(struct mm_struct
>> *mm, pmd_t *pmdp)
>> return pgtable;
>> }
>>
>> +void pmdp_huge_split_prepare(struct vm_area_struct *vma,
>> + unsigned long address, pmd_t *pmdp)
>> +{
>> + VM_BUG_ON(address & ~HPAGE_PMD_MASK);
>> +
>> +#ifdef CONFIG_DEBUG_VM
>> + BUG_ON(REGION_ID(address) != USER_REGION_ID);
>> +#endif
>> + /*
>> + * We can't mark the pmd none here, because that will cause a race
>> + * against exit_mmap. We need to continue mark pmd TRANS HUGE, while
>> + * we spilt, but at the same time we wan't rest of the ppc64 code
>> + * not to insert hash pte on this, because we will be modifying
>> + * the deposited pgtable in the caller of this function. Hence
>> + * clear the _PAGE_USER so that we move the fault handling to
>> + * higher level function and that will serialize against ptl.
>> + * We need to flush existing hash pte entries here even though,
>> + * the translation is still valid, because we will withdraw
>> + * pgtable_t after this.
>> + */
>> + pmd_hugepage_update(vma->vm_mm, address, pmdp, _PAGE_USER, 0);
>
> Can this break any checks for _PAGE_USER? From other paths?
Should not, that is the same condition we use for autonuma.
>
>> +}
>> +
>> +
>> /*
>> * set a new huge pmd. We should not be called for updating
>> * an existing pmd entry. That should go via pmd_hugepage_update.
>> @@ -663,10 +687,19 @@ void set_pmd_at(struct mm_struct *mm, unsigned long
>> addr,
>> return set_pte_at(mm, addr, pmdp_ptep(pmdp), pmd_pte(pmd));
>> }
>>
>> +/*
>> + * We use this to invalidate a pmdp entry before switching from a
>> + * hugepte to regular pmd entry.
>> + */
>> void pmdp_invalidate(struct vm_area_struct *vma, unsigned long address,
>> pmd_t *pmdp)
>> {
>> - pmd_hugepage_update(vma->vm_mm, address, pmdp, _PAGE_PRESENT, 0);
>> + pmd_hugepage_update(vma->vm_mm, address, pmdp, ~0UL, 0);
>> + /*
>> + * This ensures that generic code that rely on IRQ disabling
>> + * to prevent a parallel THP split work as expected.
>> + */
>> + kick_all_cpus_sync();
>
> Seems expensive, anyway I think the right should do something like or a wrapper
> for it
>
> on_each_cpu_mask(mm_cpumask(vma->vm_mm), do_nothing, NULL, 1);
>
> do_nothing is not exported, but that can be fixed :)
>
Now we can't depend for mm_cpumask, a parallel find_linux_pte_hugepte
can happen outside that. Now i had a variant for kick_all_cpus_sync that
ignored idle cpus. But then that needs more verification.
http://article.gmane.org/gmane.linux.ports.ppc.embedded/81105
-aneesh
next prev parent reply other threads:[~2016-02-15 4:37 UTC|newest]
Thread overview: 18+ messages / expand[flat|nested] mbox.gz Atom feed top
2016-02-09 1:20 [PATCH V3] powerpc/mm: Fix Multi hit ERAT cause by recent THP update Aneesh Kumar K.V
2016-02-09 1:20 ` Aneesh Kumar K.V
2016-02-09 12:16 ` [V3] " Michael Ellerman
2016-02-09 12:16 ` Michael Ellerman
2016-02-14 5:32 ` Aneesh Kumar K.V
2016-02-14 5:32 ` Aneesh Kumar K.V
2016-02-15 2:44 ` [PATCH V3] " Balbir Singh
2016-02-15 2:44 ` Balbir Singh
2016-02-15 4:37 ` Aneesh Kumar K.V [this message]
2016-02-15 4:37 ` Aneesh Kumar K.V
2016-02-15 4:37 ` Aneesh Kumar K.V
2016-02-15 5:09 ` Balbir Singh
2016-02-15 5:09 ` Balbir Singh
2016-02-15 11:01 ` Aneesh Kumar K.V
2016-02-15 11:01 ` Aneesh Kumar K.V
2016-02-15 11:01 ` Aneesh Kumar K.V
2016-02-16 5:20 ` Balbir Singh
2016-02-16 5:20 ` Balbir Singh
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=87lh6mfv2j.fsf@linux.vnet.ibm.com \
--to=aneesh.kumar@linux.vnet.ibm.com \
--cc=akpm@linux-foundation.org \
--cc=benh@kernel.crashing.org \
--cc=bsingharora@gmail.com \
--cc=kirill.shutemov@linux.intel.com \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=linuxppc-dev@lists.ozlabs.org \
--cc=mgorman@techsingularity.net \
--cc=mpe@ellerman.id.au \
--cc=paulus@samba.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.