From: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
To: Michael Ellerman <mpe@ellerman.id.au>,
benh@kernel.crashing.org, paulus@samba.org,
akpm@linux-foundation.org,
Mel Gorman <mgorman@techsingularity.net>,
"Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: linux-mm@kvack.org, linuxppc-dev@lists.ozlabs.org,
linux-kernel@vger.kernel.org
Subject: Re: [V3] powerpc/mm: Fix Multi hit ERAT cause by recent THP update
Date: Sun, 14 Feb 2016 11:02:47 +0530 [thread overview]
Message-ID: <877fi7ambk.fsf@linux.vnet.ibm.com> (raw)
In-Reply-To: <20160209121606.EA46C140B97@ozlabs.org>
Michael Ellerman <mpe@ellerman.id.au> writes:
> On Tue, 2016-09-02 at 01:20:31 UTC, "Aneesh Kumar K.V" wrote:
>> With ppc64 we use the deposited pgtable_t to store the hash pte slot
>> information. We should not withdraw the deposited pgtable_t without
>> marking the pmd none. This ensure that low level hash fault handling
>> will skip this huge pte and we will handle them at upper levels.
>>
>> Recent change to pmd splitting changed the above in order to handle the
>> race between pmd split and exit_mmap. The race is explained below.
>>
>> Consider following race:
>>
>> CPU0 CPU1
>> shrink_page_list()
>> add_to_swap()
>> split_huge_page_to_list()
>> __split_huge_pmd_locked()
>> pmdp_huge_clear_flush_notify()
>> // pmd_none() == true
>> exit_mmap()
>> unmap_vmas()
>> zap_pmd_range()
>> // no action on pmd since pmd_none() == true
>> pmd_populate()
>>
>> As result the THP will not be freed. The leak is detected by check_mm():
>>
>> BUG: Bad rss-counter state mm:ffff880058d2e580 idx:1 val:512
>>
>> The above required us to not mark pmd none during a pmd split.
>>
>> The fix for ppc is to clear the huge pte of _PAGE_USER, so that low
>> level fault handling code skip this pte. At higher level we do take ptl
>> lock. That should serialze us against the pmd split. Once the lock is
>> acquired we do check the pmd again using pmd_same. That should always
>> return false for us and hence we should retry the access. We do the
>> pmd_same check in all case after taking plt with
>> THP (do_huge_pmd_wp_page, do_huge_pmd_numa_page and
>> huge_pmd_set_accessed)
>>
>> Also make sure we wait for irq disable section in other cpus to finish
>> before flipping a huge pte entry with a regular pmd entry. Code paths
>> like find_linux_pte_or_hugepte depend on irq disable to get
>> a stable pte_t pointer. A parallel thp split need to make sure we
>> don't convert a pmd pte to a regular pmd entry without waiting for the
>> irq disable section to finish.
>>
>> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
>> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
>
> Applied to powerpc fixes, thanks.
>
> https://git.kernel.org/powerpc/c/9db4cd6c21535a4846b38808f3
>
Can we apply the below hunk ?. The reason for marking pmd none was to
avoid clearing both _PAGE_USER and _PAGE_PRESENT on the pte. At pmd
level that used to mean a hugepd pointer before. We did fix that earlier
by introducing _PAGE_PTE. But then I was thinking it was harmless to
mark pmd none. Now marking it one will still result in the race I
explained above, eventhough the window is much smaller now.
diff --git a/arch/powerpc/mm/pgtable_64.c b/arch/powerpc/mm/pgtable_64.c
index c8a00da39969..03f6e72697d0 100644
--- a/arch/powerpc/mm/pgtable_64.c
+++ b/arch/powerpc/mm/pgtable_64.c
@@ -694,7 +694,7 @@ void set_pmd_at(struct mm_struct *mm, unsigned long addr,
void pmdp_invalidate(struct vm_area_struct *vma, unsigned long address,
pmd_t *pmdp)
{
- pmd_hugepage_update(vma->vm_mm, address, pmdp, ~0UL, 0);
+ pmd_hugepage_update(vma->vm_mm, address, pmdp, _PAGE_PRESENT, 0);
/*
* This ensures that generic code that rely on IRQ disabling
* to prevent a parallel THP split work as expected.
WARNING: multiple messages have this Message-ID (diff)
From: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
To: Michael Ellerman <mpe@ellerman.id.au>,
benh@kernel.crashing.org, paulus@samba.org,
akpm@linux-foundation.org,
Mel Gorman <mgorman@techsingularity.net>,
"Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: linux-mm@kvack.org, linuxppc-dev@lists.ozlabs.org,
linux-kernel@vger.kernel.org
Subject: Re: [V3] powerpc/mm: Fix Multi hit ERAT cause by recent THP update
Date: Sun, 14 Feb 2016 11:02:47 +0530 [thread overview]
Message-ID: <877fi7ambk.fsf@linux.vnet.ibm.com> (raw)
In-Reply-To: <20160209121606.EA46C140B97@ozlabs.org>
Michael Ellerman <mpe@ellerman.id.au> writes:
> On Tue, 2016-09-02 at 01:20:31 UTC, "Aneesh Kumar K.V" wrote:
>> With ppc64 we use the deposited pgtable_t to store the hash pte slot
>> information. We should not withdraw the deposited pgtable_t without
>> marking the pmd none. This ensure that low level hash fault handling
>> will skip this huge pte and we will handle them at upper levels.
>>
>> Recent change to pmd splitting changed the above in order to handle the
>> race between pmd split and exit_mmap. The race is explained below.
>>
>> Consider following race:
>>
>> CPU0 CPU1
>> shrink_page_list()
>> add_to_swap()
>> split_huge_page_to_list()
>> __split_huge_pmd_locked()
>> pmdp_huge_clear_flush_notify()
>> // pmd_none() == true
>> exit_mmap()
>> unmap_vmas()
>> zap_pmd_range()
>> // no action on pmd since pmd_none() == true
>> pmd_populate()
>>
>> As result the THP will not be freed. The leak is detected by check_mm():
>>
>> BUG: Bad rss-counter state mm:ffff880058d2e580 idx:1 val:512
>>
>> The above required us to not mark pmd none during a pmd split.
>>
>> The fix for ppc is to clear the huge pte of _PAGE_USER, so that low
>> level fault handling code skip this pte. At higher level we do take ptl
>> lock. That should serialze us against the pmd split. Once the lock is
>> acquired we do check the pmd again using pmd_same. That should always
>> return false for us and hence we should retry the access. We do the
>> pmd_same check in all case after taking plt with
>> THP (do_huge_pmd_wp_page, do_huge_pmd_numa_page and
>> huge_pmd_set_accessed)
>>
>> Also make sure we wait for irq disable section in other cpus to finish
>> before flipping a huge pte entry with a regular pmd entry. Code paths
>> like find_linux_pte_or_hugepte depend on irq disable to get
>> a stable pte_t pointer. A parallel thp split need to make sure we
>> don't convert a pmd pte to a regular pmd entry without waiting for the
>> irq disable section to finish.
>>
>> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
>> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
>
> Applied to powerpc fixes, thanks.
>
> https://git.kernel.org/powerpc/c/9db4cd6c21535a4846b38808f3
>
Can we apply the below hunk ?. The reason for marking pmd none was to
avoid clearing both _PAGE_USER and _PAGE_PRESENT on the pte. At pmd
level that used to mean a hugepd pointer before. We did fix that earlier
by introducing _PAGE_PTE. But then I was thinking it was harmless to
mark pmd none. Now marking it one will still result in the race I
explained above, eventhough the window is much smaller now.
diff --git a/arch/powerpc/mm/pgtable_64.c b/arch/powerpc/mm/pgtable_64.c
index c8a00da39969..03f6e72697d0 100644
--- a/arch/powerpc/mm/pgtable_64.c
+++ b/arch/powerpc/mm/pgtable_64.c
@@ -694,7 +694,7 @@ void set_pmd_at(struct mm_struct *mm, unsigned long addr,
void pmdp_invalidate(struct vm_area_struct *vma, unsigned long address,
pmd_t *pmdp)
{
- pmd_hugepage_update(vma->vm_mm, address, pmdp, ~0UL, 0);
+ pmd_hugepage_update(vma->vm_mm, address, pmdp, _PAGE_PRESENT, 0);
/*
* This ensures that generic code that rely on IRQ disabling
* to prevent a parallel THP split work as expected.
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
next prev parent reply other threads:[~2016-02-14 5:33 UTC|newest]
Thread overview: 18+ messages / expand[flat|nested] mbox.gz Atom feed top
2016-02-09 1:20 [PATCH V3] powerpc/mm: Fix Multi hit ERAT cause by recent THP update Aneesh Kumar K.V
2016-02-09 1:20 ` Aneesh Kumar K.V
2016-02-09 12:16 ` [V3] " Michael Ellerman
2016-02-09 12:16 ` Michael Ellerman
2016-02-14 5:32 ` Aneesh Kumar K.V [this message]
2016-02-14 5:32 ` Aneesh Kumar K.V
2016-02-15 2:44 ` [PATCH V3] " Balbir Singh
2016-02-15 2:44 ` Balbir Singh
2016-02-15 4:37 ` Aneesh Kumar K.V
2016-02-15 4:37 ` Aneesh Kumar K.V
2016-02-15 4:37 ` Aneesh Kumar K.V
2016-02-15 5:09 ` Balbir Singh
2016-02-15 5:09 ` Balbir Singh
2016-02-15 11:01 ` Aneesh Kumar K.V
2016-02-15 11:01 ` Aneesh Kumar K.V
2016-02-15 11:01 ` Aneesh Kumar K.V
2016-02-16 5:20 ` Balbir Singh
2016-02-16 5:20 ` Balbir Singh
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=877fi7ambk.fsf@linux.vnet.ibm.com \
--to=aneesh.kumar@linux.vnet.ibm.com \
--cc=akpm@linux-foundation.org \
--cc=benh@kernel.crashing.org \
--cc=kirill.shutemov@linux.intel.com \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=linuxppc-dev@lists.ozlabs.org \
--cc=mgorman@techsingularity.net \
--cc=mpe@ellerman.id.au \
--cc=paulus@samba.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.