From: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
To: Oliver <oohall@gmail.com>, Dan Williams <dan.j.williams@intel.com>
Cc: Jan Kara <jack@suse.cz>,
Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
Linux MM <linux-mm@kvack.org>, Ross Zwisler <zwisler@kernel.org>,
Andrew Morton <akpm@linux-foundation.org>,
linuxppc-dev <linuxppc-dev@lists.ozlabs.org>,
"Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
Subject: Re: [PATCH 2/2] mm/dax: Don't enable huge dax mapping by default
Date: Thu, 28 Feb 2019 18:13:28 +0530 [thread overview]
Message-ID: <65e1671d-6896-e2e9-e000-90c5b0484ad2@linux.ibm.com> (raw)
In-Reply-To: <CAOSf1CHjkyX2NTex7dc1AEHXSDcWA_UGYX8NoSyHpb5s_RkwXQ@mail.gmail.com>
On 2/28/19 3:10 PM, Oliver wrote:
> On Thu, Feb 28, 2019 at 7:35 PM Aneesh Kumar K.V
> <aneesh.kumar@linux.ibm.com> wrote:
>>
>> Add a flag to indicate the ability to do huge page dax mapping. On architecture
>> like ppc64, the hypervisor can disable huge page support in the guest. In
>> such a case, we should not enable huge page dax mapping. This patch adds
>> a flag which the architecture code will update to indicate huge page
>> dax mapping support.
>
> *groan*
>
>> Architectures mostly do transparent_hugepage_flag = 0; if they can't
>> do hugepages. That also takes care of disabling dax hugepage mapping
>> with this change.
>>
>> Without this patch we get the below error with kvm on ppc64.
>>
>> [ 118.849975] lpar: Failed hash pte insert with error -4
>>
>> NOTE: The patch also use
>>
>> echo never > /sys/kernel/mm/transparent_hugepage/enabled
>> to disable dax huge page mapping.
>>
>> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
>> ---
>> TODO:
>> * Add Fixes: tag
>>
>> include/linux/huge_mm.h | 4 +++-
>> mm/huge_memory.c | 4 ++++
>> 2 files changed, 7 insertions(+), 1 deletion(-)
>>
>> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
>> index 381e872bfde0..01ad5258545e 100644
>> --- a/include/linux/huge_mm.h
>> +++ b/include/linux/huge_mm.h
>> @@ -53,6 +53,7 @@ vm_fault_t vmf_insert_pfn_pud(struct vm_area_struct *vma, unsigned long addr,
>> pud_t *pud, pfn_t pfn, bool write);
>> enum transparent_hugepage_flag {
>> TRANSPARENT_HUGEPAGE_FLAG,
>> + TRANSPARENT_HUGEPAGE_DAX_FLAG,
>> TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG,
>> TRANSPARENT_HUGEPAGE_DEFRAG_DIRECT_FLAG,
>> TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_FLAG,
>> @@ -111,7 +112,8 @@ static inline bool __transparent_hugepage_enabled(struct vm_area_struct *vma)
>> if (transparent_hugepage_flags & (1 << TRANSPARENT_HUGEPAGE_FLAG))
>> return true;
>>
>> - if (vma_is_dax(vma))
>> + if (vma_is_dax(vma) &&
>> + (transparent_hugepage_flags & (1 << TRANSPARENT_HUGEPAGE_DAX_FLAG)))
>> return true;
>
> Forcing PTE sized faults should be fine for fsdax, but it'll break
> devdax. The devdax driver requires the fault size be >= the namespace
> alignment since devdax tries to guarantee hugepage mappings will be
> used and PMD alignment is the default. We can probably have devdax
> fall back to the largest size the hypervisor has made available, but
> it does run contrary to the design. Ah well, I suppose it's better off
> being degraded rather than unusable.
>
Will fix that. I will make PFN_DEFAULT_ALIGNMENT arch specific.
>> if (transparent_hugepage_flags &
>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>> index faf357eaf0ce..43d742fe0341 100644
>> --- a/mm/huge_memory.c
>> +++ b/mm/huge_memory.c
>> @@ -53,6 +53,7 @@ unsigned long transparent_hugepage_flags __read_mostly =
>> #ifdef CONFIG_TRANSPARENT_HUGEPAGE_MADVISE
>> (1<<TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG)|
>> #endif
>> + (1 << TRANSPARENT_HUGEPAGE_DAX_FLAG) |
>> (1<<TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG)|
>> (1<<TRANSPARENT_HUGEPAGE_DEFRAG_KHUGEPAGED_FLAG)|
>> (1<<TRANSPARENT_HUGEPAGE_USE_ZERO_PAGE_FLAG);
>> @@ -475,6 +476,8 @@ static int __init setup_transparent_hugepage(char *str)
>> &transparent_hugepage_flags);
>> clear_bit(TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG,
>> &transparent_hugepage_flags);
>> + clear_bit(TRANSPARENT_HUGEPAGE_DAX_FLAG,
>> + &transparent_hugepage_flags);
>> ret = 1;
>> }
>> out:
>
>> @@ -753,6 +756,7 @@ static void insert_pfn_pmd(struct vm_area_struct *vma, unsigned long addr,
>> spinlock_t *ptl;
>>
>> ptl = pmd_lock(mm, pmd);
>> + /* should we check for none here again? */
>
> VM_WARN_ON() maybe? If THP is disabled and we're here then something
> has gone wrong.
I was wondering whether we can end up calling insert_pfn_pmd in parallel
and hence end up having a pmd entry here already. Usually we check for
if (!pmd_none(pmd)) after holding pmd_lock. Was not sure whether there
is anything preventing a parallel fault in case of dax.
>
>> entry = pmd_mkhuge(pfn_t_pmd(pfn, prot));
>> if (pfn_t_devmap(pfn))
>> entry = pmd_mkdevmap(entry);
>> --
>> 2.20.1
>>
>
WARNING: multiple messages have this Message-ID (diff)
From: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
To: Oliver <oohall@gmail.com>, Dan Williams <dan.j.williams@intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>,
"Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>,
Jan Kara <jack@suse.cz>, Michael Ellerman <mpe@ellerman.id.au>,
Ross Zwisler <zwisler@kernel.org>, Linux MM <linux-mm@kvack.org>,
Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
linuxppc-dev <linuxppc-dev@lists.ozlabs.org>
Subject: Re: [PATCH 2/2] mm/dax: Don't enable huge dax mapping by default
Date: Thu, 28 Feb 2019 18:13:28 +0530 [thread overview]
Message-ID: <65e1671d-6896-e2e9-e000-90c5b0484ad2@linux.ibm.com> (raw)
In-Reply-To: <CAOSf1CHjkyX2NTex7dc1AEHXSDcWA_UGYX8NoSyHpb5s_RkwXQ@mail.gmail.com>
On 2/28/19 3:10 PM, Oliver wrote:
> On Thu, Feb 28, 2019 at 7:35 PM Aneesh Kumar K.V
> <aneesh.kumar@linux.ibm.com> wrote:
>>
>> Add a flag to indicate the ability to do huge page dax mapping. On architecture
>> like ppc64, the hypervisor can disable huge page support in the guest. In
>> such a case, we should not enable huge page dax mapping. This patch adds
>> a flag which the architecture code will update to indicate huge page
>> dax mapping support.
>
> *groan*
>
>> Architectures mostly do transparent_hugepage_flag = 0; if they can't
>> do hugepages. That also takes care of disabling dax hugepage mapping
>> with this change.
>>
>> Without this patch we get the below error with kvm on ppc64.
>>
>> [ 118.849975] lpar: Failed hash pte insert with error -4
>>
>> NOTE: The patch also use
>>
>> echo never > /sys/kernel/mm/transparent_hugepage/enabled
>> to disable dax huge page mapping.
>>
>> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
>> ---
>> TODO:
>> * Add Fixes: tag
>>
>> include/linux/huge_mm.h | 4 +++-
>> mm/huge_memory.c | 4 ++++
>> 2 files changed, 7 insertions(+), 1 deletion(-)
>>
>> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
>> index 381e872bfde0..01ad5258545e 100644
>> --- a/include/linux/huge_mm.h
>> +++ b/include/linux/huge_mm.h
>> @@ -53,6 +53,7 @@ vm_fault_t vmf_insert_pfn_pud(struct vm_area_struct *vma, unsigned long addr,
>> pud_t *pud, pfn_t pfn, bool write);
>> enum transparent_hugepage_flag {
>> TRANSPARENT_HUGEPAGE_FLAG,
>> + TRANSPARENT_HUGEPAGE_DAX_FLAG,
>> TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG,
>> TRANSPARENT_HUGEPAGE_DEFRAG_DIRECT_FLAG,
>> TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_FLAG,
>> @@ -111,7 +112,8 @@ static inline bool __transparent_hugepage_enabled(struct vm_area_struct *vma)
>> if (transparent_hugepage_flags & (1 << TRANSPARENT_HUGEPAGE_FLAG))
>> return true;
>>
>> - if (vma_is_dax(vma))
>> + if (vma_is_dax(vma) &&
>> + (transparent_hugepage_flags & (1 << TRANSPARENT_HUGEPAGE_DAX_FLAG)))
>> return true;
>
> Forcing PTE sized faults should be fine for fsdax, but it'll break
> devdax. The devdax driver requires the fault size be >= the namespace
> alignment since devdax tries to guarantee hugepage mappings will be
> used and PMD alignment is the default. We can probably have devdax
> fall back to the largest size the hypervisor has made available, but
> it does run contrary to the design. Ah well, I suppose it's better off
> being degraded rather than unusable.
>
Will fix that. I will make PFN_DEFAULT_ALIGNMENT arch specific.
>> if (transparent_hugepage_flags &
>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>> index faf357eaf0ce..43d742fe0341 100644
>> --- a/mm/huge_memory.c
>> +++ b/mm/huge_memory.c
>> @@ -53,6 +53,7 @@ unsigned long transparent_hugepage_flags __read_mostly =
>> #ifdef CONFIG_TRANSPARENT_HUGEPAGE_MADVISE
>> (1<<TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG)|
>> #endif
>> + (1 << TRANSPARENT_HUGEPAGE_DAX_FLAG) |
>> (1<<TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG)|
>> (1<<TRANSPARENT_HUGEPAGE_DEFRAG_KHUGEPAGED_FLAG)|
>> (1<<TRANSPARENT_HUGEPAGE_USE_ZERO_PAGE_FLAG);
>> @@ -475,6 +476,8 @@ static int __init setup_transparent_hugepage(char *str)
>> &transparent_hugepage_flags);
>> clear_bit(TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG,
>> &transparent_hugepage_flags);
>> + clear_bit(TRANSPARENT_HUGEPAGE_DAX_FLAG,
>> + &transparent_hugepage_flags);
>> ret = 1;
>> }
>> out:
>
>> @@ -753,6 +756,7 @@ static void insert_pfn_pmd(struct vm_area_struct *vma, unsigned long addr,
>> spinlock_t *ptl;
>>
>> ptl = pmd_lock(mm, pmd);
>> + /* should we check for none here again? */
>
> VM_WARN_ON() maybe? If THP is disabled and we're here then something
> has gone wrong.
I was wondering whether we can end up calling insert_pfn_pmd in parallel
and hence end up having a pmd entry here already. Usually we check for
if (!pmd_none(pmd)) after holding pmd_lock. Was not sure whether there
is anything preventing a parallel fault in case of dax.
>
>> entry = pmd_mkhuge(pfn_t_pmd(pfn, prot));
>> if (pfn_t_devmap(pfn))
>> entry = pmd_mkdevmap(entry);
>> --
>> 2.20.1
>>
>
next prev parent reply other threads:[~2019-02-28 12:46 UTC|newest]
Thread overview: 64+ messages / expand[flat|nested] mbox.gz Atom feed top
2019-02-28 8:35 [PATCH 1/2] fs/dax: deposit pagetable even when installing zero page Aneesh Kumar K.V
2019-02-28 8:35 ` Aneesh Kumar K.V
2019-02-28 8:35 ` [PATCH 2/2] mm/dax: Don't enable huge dax mapping by default Aneesh Kumar K.V
2019-02-28 8:35 ` Aneesh Kumar K.V
2019-02-28 9:40 ` Jan Kara
2019-02-28 9:40 ` Jan Kara
2019-02-28 12:32 ` Aneesh Kumar K.V
2019-02-28 12:32 ` Aneesh Kumar K.V
2019-02-28 9:40 ` Oliver
2019-02-28 9:40 ` Oliver
2019-02-28 12:43 ` Aneesh Kumar K.V [this message]
2019-02-28 12:43 ` Aneesh Kumar K.V
2019-02-28 16:45 ` Dan Williams
2019-02-28 16:45 ` Dan Williams
2019-03-06 9:17 ` Aneesh Kumar K.V
2019-03-06 9:17 ` Aneesh Kumar K.V
2019-03-06 9:17 ` Aneesh Kumar K.V
2019-03-06 11:44 ` Michal Suchánek
2019-03-06 11:44 ` Michal Suchánek
2019-03-06 11:44 ` Michal Suchánek
2019-03-06 12:45 ` Aneesh Kumar K.V
2019-03-06 12:45 ` Aneesh Kumar K.V
2019-03-06 13:06 ` Kirill A. Shutemov
2019-03-06 13:06 ` Kirill A. Shutemov
2019-03-06 13:06 ` Kirill A. Shutemov
2019-03-13 16:07 ` Dan Williams
2019-03-13 16:07 ` Dan Williams
2019-03-13 16:07 ` Dan Williams
2019-03-19 8:44 ` Kirill A. Shutemov
2019-03-19 8:44 ` Kirill A. Shutemov
2019-03-19 8:44 ` Kirill A. Shutemov
2019-03-19 15:36 ` Dan Williams
2019-03-19 15:36 ` Dan Williams
2019-03-19 15:36 ` Dan Williams
2019-03-13 16:02 ` Dan Williams
2019-03-13 16:02 ` Dan Williams
2019-03-13 16:02 ` Dan Williams
2019-03-14 3:45 ` Aneesh Kumar K.V
2019-03-14 3:45 ` Aneesh Kumar K.V
2019-03-14 3:45 ` Aneesh Kumar K.V
2019-03-14 4:02 ` Dan Williams
2019-03-14 4:02 ` Dan Williams
2019-03-14 4:02 ` Dan Williams
2019-03-20 8:06 ` Aneesh Kumar K.V
2019-03-20 8:06 ` Aneesh Kumar K.V
2019-03-20 8:06 ` Aneesh Kumar K.V
2019-03-20 8:09 ` Aneesh Kumar K.V
2019-03-20 8:09 ` Aneesh Kumar K.V
2019-03-20 15:34 ` Dan Williams
2019-03-20 15:34 ` Dan Williams
2019-03-20 15:34 ` Dan Williams
2019-03-20 20:57 ` Dan Williams
2019-03-20 20:57 ` Dan Williams
2019-03-20 20:57 ` Dan Williams
2019-03-21 3:08 ` Oliver
2019-03-21 3:08 ` Oliver
2019-03-21 3:08 ` Oliver
2019-03-21 3:12 ` Dan Williams
2019-03-21 3:12 ` Dan Williams
2019-03-21 3:12 ` Dan Williams
2019-02-28 9:21 ` [PATCH 1/2] fs/dax: deposit pagetable even when installing zero page Jan Kara
2019-02-28 9:21 ` Jan Kara
2019-02-28 12:34 ` Aneesh Kumar K.V
2019-02-28 12:34 ` Aneesh Kumar K.V
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=65e1671d-6896-e2e9-e000-90c5b0484ad2@linux.ibm.com \
--to=aneesh.kumar@linux.ibm.com \
--cc=akpm@linux-foundation.org \
--cc=dan.j.williams@intel.com \
--cc=jack@suse.cz \
--cc=kirill.shutemov@linux.intel.com \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=linuxppc-dev@lists.ozlabs.org \
--cc=oohall@gmail.com \
--cc=zwisler@kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.