* [PATCH v2 0/2] KVM: arm64: Use folio for THP support
@ 2023-09-28 17:32 Vincent Donnefort
2023-09-28 17:32 ` [PATCH v2 1/2] KVM: arm64: Do not transfer page refcount for THP adjustment Vincent Donnefort
From: Vincent Donnefort @ 2023-09-28 17:32 UTC (permalink / raw)
To: maz, oliver.upton
Cc: kvmarm, linux-arm-kernel, kernel-team, will, willy,
Vincent Donnefort
With the introduction of folios for transparent huge pages [1], we
can rely on this support to identify if a page is backed by a huge one,
saving a page table walk.
v1 -> v2:
* Add git reference to [1] in the commit message
* GUP always acted on the Head page under the hood. Update the commit
message.
[1] https://lkml.kernel.org/r/20220504182857.4013401-3-willy@infradead.org
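The gist of the change, condensed from the diff in patch 2/2 below:

    /* Before: walk the userspace stage-1 page tables for the mapping size. */
    int sz = get_user_mapping_size(kvm, hva);

    /* After: read the backing size straight off the folio. */
    size_t sz = folio_size(pfn_folio(pfn));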
Vincent Donnefort (2):
KVM: arm64: Do not transfer page refcount for THP adjustment
KVM: arm64: Use folio for THP adjustment
arch/arm64/kvm/mmu.c | 79 ++------------------------------------------
1 file changed, 3 insertions(+), 76 deletions(-)
base-commit: ce9ecca0238b140b88f43859b211c9fdfd8e5b70
--
2.42.0.515.g380fc7ccd1-goog
* [PATCH v2 1/2] KVM: arm64: Do not transfer page refcount for THP adjustment
2023-09-28 17:32 [PATCH v2 0/2] KVM: arm64: Use folio for THP support Vincent Donnefort
@ 2023-09-28 17:32 ` Vincent Donnefort
2023-09-29 6:59 ` Gavin Shan
2023-09-28 17:32 ` [PATCH v2 2/2] KVM: arm64: Use folio " Vincent Donnefort
From: Vincent Donnefort @ 2023-09-28 17:32 UTC (permalink / raw)
To: maz, oliver.upton
Cc: kvmarm, linux-arm-kernel, kernel-team, will, willy,
Vincent Donnefort
GUP affects a refcount common to all pages forming the THP. There is
therefore no need to move the refcount from a tail to the head page.
Under the hood it decrements and increments the same counter.
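For reference, get_page() on a tail page already takes the reference on the
folio (i.e. the head page), so the release/get pair cancelled out. Roughly,
from include/linux/mm.h at the time:

    static inline void get_page(struct page *page)
    {
            folio_get(page_folio(page));
    }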
Signed-off-by: Vincent Donnefort <vdonnefort@google.com>
diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index 587a104f66c3..de5e5148ef5d 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -1295,28 +1295,8 @@ transparent_hugepage_adjust(struct kvm *kvm, struct kvm_memory_slot *memslot,
if (sz < PMD_SIZE)
return PAGE_SIZE;
- /*
- * The address we faulted on is backed by a transparent huge
- * page. However, because we map the compound huge page and
- * not the individual tail page, we need to transfer the
- * refcount to the head page. We have to be careful that the
- * THP doesn't start to split while we are adjusting the
- * refcounts.
- *
- * We are sure this doesn't happen, because mmu_invalidate_retry
- * was successful and we are holding the mmu_lock, so if this
- * THP is trying to split, it will be blocked in the mmu
- * notifier before touching any of the pages, specifically
- * before being able to call __split_huge_page_refcount().
- *
- * We can therefore safely transfer the refcount from PG_tail
- * to PG_head and switch the pfn from a tail page to the head
- * page accordingly.
- */
*ipap &= PMD_MASK;
- kvm_release_pfn_clean(pfn);
pfn &= ~(PTRS_PER_PMD - 1);
- get_page(pfn_to_page(pfn));
*pfnp = pfn;
return PMD_SIZE;
--
2.42.0.515.g380fc7ccd1-goog
* [PATCH v2 2/2] KVM: arm64: Use folio for THP adjustment
2023-09-28 17:32 [PATCH v2 0/2] KVM: arm64: Use folio for THP support Vincent Donnefort
2023-09-28 17:32 ` [PATCH v2 1/2] KVM: arm64: Do not transfer page refcount for THP adjustment Vincent Donnefort
@ 2023-09-28 17:32 ` Vincent Donnefort
2023-09-29 7:24 ` Gavin Shan
2023-09-29 9:12 ` [PATCH v2 0/2] KVM: arm64: Use folio for THP support Marc Zyngier
2023-09-30 17:57 ` Oliver Upton
From: Vincent Donnefort @ 2023-09-28 17:32 UTC (permalink / raw)
To: maz, oliver.upton
Cc: kvmarm, linux-arm-kernel, kernel-team, will, willy,
Vincent Donnefort
Since commit cb196ee1ef39 ("mm/huge_memory: convert
do_huge_pmd_anonymous_page() to use vma_alloc_folio()"), transparent
huge pages use folios. This enables us to check efficiently whether a
page is mapped by a block, simply by looking at the folio size, which
saves a page table walk.

It is safe to read the folio in this path: we've just increased its
refcount (GUP from __gfn_to_pfn_memslot()), which prevents attempts to
split the huge page.
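folio_size() itself is a cheap read of the folio order; roughly, from
include/linux/mm.h at the time:

    static inline size_t folio_size(struct folio *folio)
    {
            return PAGE_SIZE << folio_order(folio);
    }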
Signed-off-by: Vincent Donnefort <vdonnefort@google.com>
diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index de5e5148ef5d..69fcbcc7aca5 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -791,51 +791,6 @@ int create_hyp_exec_mappings(phys_addr_t phys_addr, size_t size,
return 0;
}
-static struct kvm_pgtable_mm_ops kvm_user_mm_ops = {
- /* We shouldn't need any other callback to walk the PT */
- .phys_to_virt = kvm_host_va,
-};
-
-static int get_user_mapping_size(struct kvm *kvm, u64 addr)
-{
- struct kvm_pgtable pgt = {
- .pgd = (kvm_pteref_t)kvm->mm->pgd,
- .ia_bits = vabits_actual,
- .start_level = (KVM_PGTABLE_MAX_LEVELS -
- CONFIG_PGTABLE_LEVELS),
- .mm_ops = &kvm_user_mm_ops,
- };
- unsigned long flags;
- kvm_pte_t pte = 0; /* Keep GCC quiet... */
- u32 level = ~0;
- int ret;
-
- /*
- * Disable IRQs so that we hazard against a concurrent
- * teardown of the userspace page tables (which relies on
- * IPI-ing threads).
- */
- local_irq_save(flags);
- ret = kvm_pgtable_get_leaf(&pgt, addr, &pte, &level);
- local_irq_restore(flags);
-
- if (ret)
- return ret;
-
- /*
- * Not seeing an error, but not updating level? Something went
- * deeply wrong...
- */
- if (WARN_ON(level >= KVM_PGTABLE_MAX_LEVELS))
- return -EFAULT;
-
- /* Oops, the userspace PTs are gone... Replay the fault */
- if (!kvm_pte_valid(pte))
- return -EAGAIN;
-
- return BIT(ARM64_HW_PGTABLE_LEVEL_SHIFT(level));
-}
-
static struct kvm_pgtable_mm_ops kvm_s2_mm_ops = {
.zalloc_page = stage2_memcache_zalloc_page,
.zalloc_pages_exact = kvm_s2_zalloc_pages_exact,
@@ -1274,7 +1229,7 @@ static bool fault_supports_stage2_huge_mapping(struct kvm_memory_slot *memslot,
*
* Returns the size of the mapping.
*/
-static long
+static unsigned long
transparent_hugepage_adjust(struct kvm *kvm, struct kvm_memory_slot *memslot,
unsigned long hva, kvm_pfn_t *pfnp,
phys_addr_t *ipap)
@@ -1287,10 +1242,7 @@ transparent_hugepage_adjust(struct kvm *kvm, struct kvm_memory_slot *memslot,
* block map is contained within the memslot.
*/
if (fault_supports_stage2_huge_mapping(memslot, hva, PMD_SIZE)) {
- int sz = get_user_mapping_size(kvm, hva);
-
- if (sz < 0)
- return sz;
+ size_t sz = folio_size(pfn_folio(pfn));
if (sz < PMD_SIZE)
return PAGE_SIZE;
@@ -1385,7 +1337,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
kvm_pfn_t pfn;
bool logging_active = memslot_is_logging(memslot);
unsigned long fault_level = kvm_vcpu_trap_get_fault_level(vcpu);
- long vma_pagesize, fault_granule;
+ unsigned long vma_pagesize, fault_granule;
enum kvm_pgtable_prot prot = KVM_PGTABLE_PROT_R;
struct kvm_pgtable *pgt;
@@ -1530,11 +1482,6 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
vma_pagesize = transparent_hugepage_adjust(kvm, memslot,
hva, &pfn,
&fault_ipa);
-
- if (vma_pagesize < 0) {
- ret = vma_pagesize;
- goto out_unlock;
- }
}
if (fault_status != ESR_ELx_FSC_PERM && !device && kvm_has_mte(kvm)) {
--
2.42.0.515.g380fc7ccd1-goog
* Re: [PATCH v2 1/2] KVM: arm64: Do not transfer page refcount for THP adjustment
2023-09-28 17:32 ` [PATCH v2 1/2] KVM: arm64: Do not transfer page refcount for THP adjustment Vincent Donnefort
@ 2023-09-29 6:59 ` Gavin Shan
2023-09-29 12:47 ` Vincent Donnefort
From: Gavin Shan @ 2023-09-29 6:59 UTC (permalink / raw)
To: Vincent Donnefort, maz, oliver.upton
Cc: kvmarm, linux-arm-kernel, kernel-team, will, willy
On 9/29/23 03:32, Vincent Donnefort wrote:
> GUP affects a refcount common to all pages forming the THP. There is
> therefore no need to move the refcount from a tail to the head page.
> Under the hood it decrements and increments the same counter.
>
> Signed-off-by: Vincent Donnefort <vdonnefort@google.com>
>
Reviewed-by: Gavin Shan <gshan@redhat.com>
> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> index 587a104f66c3..de5e5148ef5d 100644
> --- a/arch/arm64/kvm/mmu.c
> +++ b/arch/arm64/kvm/mmu.c
> @@ -1295,28 +1295,8 @@ transparent_hugepage_adjust(struct kvm *kvm, struct kvm_memory_slot *memslot,
> if (sz < PMD_SIZE)
> return PAGE_SIZE;
>
> - /*
> - * The address we faulted on is backed by a transparent huge
> - * page. However, because we map the compound huge page and
> - * not the individual tail page, we need to transfer the
> - * refcount to the head page. We have to be careful that the
> - * THP doesn't start to split while we are adjusting the
> - * refcounts.
> - *
> - * We are sure this doesn't happen, because mmu_invalidate_retry
> - * was successful and we are holding the mmu_lock, so if this
> - * THP is trying to split, it will be blocked in the mmu
> - * notifier before touching any of the pages, specifically
> - * before being able to call __split_huge_page_refcount().
> - *
> - * We can therefore safely transfer the refcount from PG_tail
> - * to PG_head and switch the pfn from a tail page to the head
> - * page accordingly.
> - */
> *ipap &= PMD_MASK;
> - kvm_release_pfn_clean(pfn);
> pfn &= ~(PTRS_PER_PMD - 1);
> - get_page(pfn_to_page(pfn));
> *pfnp = pfn;
>
> return PMD_SIZE;
The local variable @pfn can be dropped as well:
*pfnp &= ~(PTRS_PER_PMD - 1);
Thanks,
Gavin
* Re: [PATCH v2 2/2] KVM: arm64: Use folio for THP adjustment
2023-09-28 17:32 ` [PATCH v2 2/2] KVM: arm64: Use folio " Vincent Donnefort
@ 2023-09-29 7:24 ` Gavin Shan
2023-09-29 13:07 ` Matthew Wilcox
2023-09-29 23:53 ` Gavin Shan
2023-10-28 9:17 ` Ryan Roberts
From: Gavin Shan @ 2023-09-29 7:24 UTC (permalink / raw)
To: Vincent Donnefort, maz, oliver.upton
Cc: kvmarm, linux-arm-kernel, kernel-team, will, willy
Hi Vincent,
On 9/29/23 03:32, Vincent Donnefort wrote:
> Since commit cb196ee1ef39 ("mm/huge_memory: convert
> do_huge_pmd_anonymous_page() to use vma_alloc_folio()"), transparent
> huge pages use folios. This enables us to check efficiently whether a
> page is mapped by a block, simply by looking at the folio size, which
> saves a page table walk.
>
> It is safe to read the folio in this path: we've just increased its
> refcount (GUP from __gfn_to_pfn_memslot()), which prevents attempts to
> split the huge page.
>
> Signed-off-by: Vincent Donnefort <vdonnefort@google.com>
>
> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> index de5e5148ef5d..69fcbcc7aca5 100644
> --- a/arch/arm64/kvm/mmu.c
> +++ b/arch/arm64/kvm/mmu.c
> @@ -791,51 +791,6 @@ int create_hyp_exec_mappings(phys_addr_t phys_addr, size_t size,
> return 0;
> }
>
> -static struct kvm_pgtable_mm_ops kvm_user_mm_ops = {
> - /* We shouldn't need any other callback to walk the PT */
> - .phys_to_virt = kvm_host_va,
> -};
> -
> -static int get_user_mapping_size(struct kvm *kvm, u64 addr)
> -{
> - struct kvm_pgtable pgt = {
> - .pgd = (kvm_pteref_t)kvm->mm->pgd,
> - .ia_bits = vabits_actual,
> - .start_level = (KVM_PGTABLE_MAX_LEVELS -
> - CONFIG_PGTABLE_LEVELS),
> - .mm_ops = &kvm_user_mm_ops,
> - };
> - unsigned long flags;
> - kvm_pte_t pte = 0; /* Keep GCC quiet... */
> - u32 level = ~0;
> - int ret;
> -
> - /*
> - * Disable IRQs so that we hazard against a concurrent
> - * teardown of the userspace page tables (which relies on
> - * IPI-ing threads).
> - */
> - local_irq_save(flags);
> - ret = kvm_pgtable_get_leaf(&pgt, addr, &pte, &level);
> - local_irq_restore(flags);
> -
> - if (ret)
> - return ret;
> -
> - /*
> - * Not seeing an error, but not updating level? Something went
> - * deeply wrong...
> - */
> - if (WARN_ON(level >= KVM_PGTABLE_MAX_LEVELS))
> - return -EFAULT;
> -
> - /* Oops, the userspace PTs are gone... Replay the fault */
> - if (!kvm_pte_valid(pte))
> - return -EAGAIN;
> -
> - return BIT(ARM64_HW_PGTABLE_LEVEL_SHIFT(level));
> -}
> -
> static struct kvm_pgtable_mm_ops kvm_s2_mm_ops = {
> .zalloc_page = stage2_memcache_zalloc_page,
> .zalloc_pages_exact = kvm_s2_zalloc_pages_exact,
> @@ -1274,7 +1229,7 @@ static bool fault_supports_stage2_huge_mapping(struct kvm_memory_slot *memslot,
> *
> * Returns the size of the mapping.
> */
> -static long
> +static unsigned long
> transparent_hugepage_adjust(struct kvm *kvm, struct kvm_memory_slot *memslot,
> unsigned long hva, kvm_pfn_t *pfnp,
> phys_addr_t *ipap)
> @@ -1287,10 +1242,7 @@ transparent_hugepage_adjust(struct kvm *kvm, struct kvm_memory_slot *memslot,
> * block map is contained within the memslot.
> */
> if (fault_supports_stage2_huge_mapping(memslot, hva, PMD_SIZE)) {
> - int sz = get_user_mapping_size(kvm, hva);
> -
> - if (sz < 0)
> - return sz;
> + size_t sz = folio_size(pfn_folio(pfn));
>
> if (sz < PMD_SIZE)
> return PAGE_SIZE;
Is it possible that a tail page is returned from __gfn_to_pfn_memslot()? folio_size()
doesn't work for a tail page because order 0 is returned for tail pages. It seems it's
possible, looking at the following call path:
user_mem_abort
  __gfn_to_pfn_memslot                  // one page
    hva_to_pfn
      hva_to_pfn_fast
        get_user_page_fast_only
          get_user_pages_fast_only
            internal_get_user_pages_fast
              lockless_pages_from_mm
                gup_pgd_range           // walk page-table for range [start, start + PAGE_SIZE]
                  gup_p4d_range         // start needn't be PMD_SIZE aligned
                    gup_pud_range
                      gup_pmd_range
                        gup_huge_pmd
                          nth_page
                          record_subpages
> @@ -1385,7 +1337,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
> kvm_pfn_t pfn;
> bool logging_active = memslot_is_logging(memslot);
> unsigned long fault_level = kvm_vcpu_trap_get_fault_level(vcpu);
> - long vma_pagesize, fault_granule;
> + unsigned long vma_pagesize, fault_granule;
> enum kvm_pgtable_prot prot = KVM_PGTABLE_PROT_R;
> struct kvm_pgtable *pgt;
>
> @@ -1530,11 +1482,6 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
> vma_pagesize = transparent_hugepage_adjust(kvm, memslot,
> hva, &pfn,
> &fault_ipa);
> -
> - if (vma_pagesize < 0) {
> - ret = vma_pagesize;
> - goto out_unlock;
> - }
> }
>
> if (fault_status != ESR_ELx_FSC_PERM && !device && kvm_has_mte(kvm)) {
Thanks,
Gavin
* Re: [PATCH v2 0/2] KVM: arm64: Use folio for THP support
2023-09-28 17:32 [PATCH v2 0/2] KVM: arm64: Use folio for THP support Vincent Donnefort
2023-09-28 17:32 ` [PATCH v2 1/2] KVM: arm64: Do not transfer page refcount for THP adjustment Vincent Donnefort
2023-09-28 17:32 ` [PATCH v2 2/2] KVM: arm64: Use folio " Vincent Donnefort
@ 2023-09-29 9:12 ` Marc Zyngier
2023-09-30 17:57 ` Oliver Upton
From: Marc Zyngier @ 2023-09-29 9:12 UTC (permalink / raw)
To: Vincent Donnefort
Cc: oliver.upton, kvmarm, linux-arm-kernel, kernel-team, will, willy
On Thu, 28 Sep 2023 18:32:02 +0100,
Vincent Donnefort <vdonnefort@google.com> wrote:
>
> With the introduction of folios for transparent huge pages [1], we
> can rely on this support to identify if a page is backed by a huge one,
> saving a page table walk.
>
> v1 -> v2:
> * Add git reference to [1] in the commit message
> * GUP always acted on the Head page under the hood. Update the commit
> message.
>
> [1] https://lkml.kernel.org/r/20220504182857.4013401-3-willy@infradead.org
>
> Vincent Donnefort (2):
> KVM: arm64: Do not transfer page refcount for THP adjustment
> KVM: arm64: Use folio for THP adjustment
>
> arch/arm64/kvm/mmu.c | 79 ++------------------------------------------
> 1 file changed, 3 insertions(+), 76 deletions(-)
Reviewed-by: Marc Zyngier <maz@kernel.org>
M.
--
Without deviation from the norm, progress is not possible.
* Re: [PATCH v2 1/2] KVM: arm64: Do not transfer page refcount for THP adjustment
2023-09-29 6:59 ` Gavin Shan
@ 2023-09-29 12:47 ` Vincent Donnefort
From: Vincent Donnefort @ 2023-09-29 12:47 UTC (permalink / raw)
To: Gavin Shan
Cc: maz, oliver.upton, kvmarm, linux-arm-kernel, kernel-team, will,
willy
On Fri, Sep 29, 2023 at 04:59:20PM +1000, Gavin Shan wrote:
> On 9/29/23 03:32, Vincent Donnefort wrote:
> > GUP affects a refcount common to all pages forming the THP. There is
> > therefore no need to move the refcount from a tail to the head page.
> > Under the hood it decrements and increments the same counter.
> >
> > Signed-off-by: Vincent Donnefort <vdonnefort@google.com>
> >
>
> Reviewed-by: Gavin Shan <gshan@redhat.com>
>
> > diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> > index 587a104f66c3..de5e5148ef5d 100644
> > --- a/arch/arm64/kvm/mmu.c
> > +++ b/arch/arm64/kvm/mmu.c
> > @@ -1295,28 +1295,8 @@ transparent_hugepage_adjust(struct kvm *kvm, struct kvm_memory_slot *memslot,
> > if (sz < PMD_SIZE)
> > return PAGE_SIZE;
> > - /*
> > - * The address we faulted on is backed by a transparent huge
> > - * page. However, because we map the compound huge page and
> > - * not the individual tail page, we need to transfer the
> > - * refcount to the head page. We have to be careful that the
> > - * THP doesn't start to split while we are adjusting the
> > - * refcounts.
> > - *
> > - * We are sure this doesn't happen, because mmu_invalidate_retry
> > - * was successful and we are holding the mmu_lock, so if this
> > - * THP is trying to split, it will be blocked in the mmu
> > - * notifier before touching any of the pages, specifically
> > - * before being able to call __split_huge_page_refcount().
> > - *
> > - * We can therefore safely transfer the refcount from PG_tail
> > - * to PG_head and switch the pfn from a tail page to the head
> > - * page accordingly.
> > - */
> > *ipap &= PMD_MASK;
> > - kvm_release_pfn_clean(pfn);
> > pfn &= ~(PTRS_PER_PMD - 1);
> > - get_page(pfn_to_page(pfn));
> > *pfnp = pfn;
> > return PMD_SIZE;
>
> The local variable @pfn can be dropped as well:
I would like to keep it for the following patch: pfn_to_folio(pfn);
>
> *pfnp &= ~(PTRS_PER_PMD - 1);
>
> Thanks,
> Gavin
>
* Re: [PATCH v2 2/2] KVM: arm64: Use folio for THP adjustment
2023-09-29 7:24 ` Gavin Shan
@ 2023-09-29 13:07 ` Matthew Wilcox
2023-09-29 23:52 ` Gavin Shan
From: Matthew Wilcox @ 2023-09-29 13:07 UTC (permalink / raw)
To: Gavin Shan
Cc: Vincent Donnefort, maz, oliver.upton, kvmarm, linux-arm-kernel,
kernel-team, will
On Fri, Sep 29, 2023 at 05:24:00PM +1000, Gavin Shan wrote:
> > + size_t sz = folio_size(pfn_folio(pfn));
>
> Is it possible that a tail page is returned from __gfn_to_pfn_memslot()? folio_size()
> doesn't work for a tail page because order 0 is returned for tail pages. It seems it's
pfn_folio() can't return a tail page. That's the point of folios; they
aren't tail pages. They're either a head page or an order-0 page.
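For reference, pfn_folio() is roughly this wrapper from include/linux/mm.h:

    #define pfn_folio(pfn)  page_folio(pfn_to_page(pfn))

and page_folio() resolves through the compound head, so a tail pfn still
lands on its folio.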
* Re: [PATCH v2 2/2] KVM: arm64: Use folio for THP adjustment
2023-09-29 13:07 ` Matthew Wilcox
@ 2023-09-29 23:52 ` Gavin Shan
From: Gavin Shan @ 2023-09-29 23:52 UTC (permalink / raw)
To: Matthew Wilcox
Cc: Vincent Donnefort, maz, oliver.upton, kvmarm, linux-arm-kernel,
kernel-team, will
Hi Matthew,
On 9/29/23 23:07, Matthew Wilcox wrote:
> On Fri, Sep 29, 2023 at 05:24:00PM +1000, Gavin Shan wrote:
>>> + size_t sz = folio_size(pfn_folio(pfn));
>>
>> Is it possible that a tail page is returned from __gfn_to_pfn_memslot()? folio_size()
>> doesn't work for a tail page because order 0 is returned for tail pages. It seems it's
>
> pfn_folio() can't return a tail page. That's the point of folios; they
> aren't tail pages. They're either a head page or an order-0 page.
>
Indeed, page_folio() already returns the head page properly.
Thanks a lot for the confirmation.
Thanks,
Gavin
* Re: [PATCH v2 2/2] KVM: arm64: Use folio for THP adjustment
2023-09-28 17:32 ` [PATCH v2 2/2] KVM: arm64: Use folio " Vincent Donnefort
2023-09-29 7:24 ` Gavin Shan
@ 2023-09-29 23:53 ` Gavin Shan
2023-10-28 9:17 ` Ryan Roberts
From: Gavin Shan @ 2023-09-29 23:53 UTC (permalink / raw)
To: Vincent Donnefort, maz, oliver.upton
Cc: kvmarm, linux-arm-kernel, kernel-team, will, willy
On 9/29/23 03:32, Vincent Donnefort wrote:
> Since commit cb196ee1ef39 ("mm/huge_memory: convert
> do_huge_pmd_anonymous_page() to use vma_alloc_folio()"), transparent
> huge pages use folios. This enables us to check efficiently whether a
> page is mapped by a block, simply by looking at the folio size, which
> saves a page table walk.
>
> It is safe to read the folio in this path: we've just increased its
> refcount (GUP from __gfn_to_pfn_memslot()), which prevents attempts to
> split the huge page.
>
> Signed-off-by: Vincent Donnefort <vdonnefort@google.com>
>
Reviewed-by: Gavin Shan <gshan@redhat.com>
> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> index de5e5148ef5d..69fcbcc7aca5 100644
> --- a/arch/arm64/kvm/mmu.c
> +++ b/arch/arm64/kvm/mmu.c
> @@ -791,51 +791,6 @@ int create_hyp_exec_mappings(phys_addr_t phys_addr, size_t size,
> return 0;
> }
>
> -static struct kvm_pgtable_mm_ops kvm_user_mm_ops = {
> - /* We shouldn't need any other callback to walk the PT */
> - .phys_to_virt = kvm_host_va,
> -};
> -
> -static int get_user_mapping_size(struct kvm *kvm, u64 addr)
> -{
> - struct kvm_pgtable pgt = {
> - .pgd = (kvm_pteref_t)kvm->mm->pgd,
> - .ia_bits = vabits_actual,
> - .start_level = (KVM_PGTABLE_MAX_LEVELS -
> - CONFIG_PGTABLE_LEVELS),
> - .mm_ops = &kvm_user_mm_ops,
> - };
> - unsigned long flags;
> - kvm_pte_t pte = 0; /* Keep GCC quiet... */
> - u32 level = ~0;
> - int ret;
> -
> - /*
> - * Disable IRQs so that we hazard against a concurrent
> - * teardown of the userspace page tables (which relies on
> - * IPI-ing threads).
> - */
> - local_irq_save(flags);
> - ret = kvm_pgtable_get_leaf(&pgt, addr, &pte, &level);
> - local_irq_restore(flags);
> -
> - if (ret)
> - return ret;
> -
> - /*
> - * Not seeing an error, but not updating level? Something went
> - * deeply wrong...
> - */
> - if (WARN_ON(level >= KVM_PGTABLE_MAX_LEVELS))
> - return -EFAULT;
> -
> - /* Oops, the userspace PTs are gone... Replay the fault */
> - if (!kvm_pte_valid(pte))
> - return -EAGAIN;
> -
> - return BIT(ARM64_HW_PGTABLE_LEVEL_SHIFT(level));
> -}
> -
> static struct kvm_pgtable_mm_ops kvm_s2_mm_ops = {
> .zalloc_page = stage2_memcache_zalloc_page,
> .zalloc_pages_exact = kvm_s2_zalloc_pages_exact,
> @@ -1274,7 +1229,7 @@ static bool fault_supports_stage2_huge_mapping(struct kvm_memory_slot *memslot,
> *
> * Returns the size of the mapping.
> */
> -static long
> +static unsigned long
> transparent_hugepage_adjust(struct kvm *kvm, struct kvm_memory_slot *memslot,
> unsigned long hva, kvm_pfn_t *pfnp,
> phys_addr_t *ipap)
> @@ -1287,10 +1242,7 @@ transparent_hugepage_adjust(struct kvm *kvm, struct kvm_memory_slot *memslot,
> * block map is contained within the memslot.
> */
> if (fault_supports_stage2_huge_mapping(memslot, hva, PMD_SIZE)) {
> - int sz = get_user_mapping_size(kvm, hva);
> -
> - if (sz < 0)
> - return sz;
> + size_t sz = folio_size(pfn_folio(pfn));
>
> if (sz < PMD_SIZE)
> return PAGE_SIZE;
> @@ -1385,7 +1337,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
> kvm_pfn_t pfn;
> bool logging_active = memslot_is_logging(memslot);
> unsigned long fault_level = kvm_vcpu_trap_get_fault_level(vcpu);
> - long vma_pagesize, fault_granule;
> + unsigned long vma_pagesize, fault_granule;
> enum kvm_pgtable_prot prot = KVM_PGTABLE_PROT_R;
> struct kvm_pgtable *pgt;
>
> @@ -1530,11 +1482,6 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
> vma_pagesize = transparent_hugepage_adjust(kvm, memslot,
> hva, &pfn,
> &fault_ipa);
> -
> - if (vma_pagesize < 0) {
> - ret = vma_pagesize;
> - goto out_unlock;
> - }
> }
>
> if (fault_status != ESR_ELx_FSC_PERM && !device && kvm_has_mte(kvm)) {
* Re: [PATCH v2 0/2] KVM: arm64: Use folio for THP support
2023-09-28 17:32 [PATCH v2 0/2] KVM: arm64: Use folio for THP support Vincent Donnefort
2023-09-29 9:12 ` [PATCH v2 0/2] KVM: arm64: Use folio for THP support Marc Zyngier
@ 2023-09-30 17:57 ` Oliver Upton
2023-10-30 20:16 ` Oliver Upton
From: Oliver Upton @ 2023-09-30 17:57 UTC (permalink / raw)
To: maz, Vincent Donnefort
Cc: Oliver Upton, kvmarm, kernel-team, willy, linux-arm-kernel, will
On Thu, 28 Sep 2023 18:32:02 +0100, Vincent Donnefort wrote:
> With the introduction of folios for transparent huge pages [1], we
> can rely on this support to identify if a page is backed by a huge one,
> saving a page table walk.
>
> v1 -> v2:
> * Add git reference to [1] in the commit message
> * GUP always acted on the Head page under the hood. Update the commit
> message.
>
> [...]
Applied to kvmarm/next, thanks!
[1/2] KVM: arm64: Do not transfer page refcount for THP adjustment
https://git.kernel.org/kvmarm/kvmarm/c/c04bf723ccd6
[2/2] KVM: arm64: Use folio for THP adjustment
https://git.kernel.org/kvmarm/kvmarm/c/bf92834e6f6e
--
Best,
Oliver
* Re: [PATCH v2 2/2] KVM: arm64: Use folio for THP adjustment
2023-09-28 17:32 ` [PATCH v2 2/2] KVM: arm64: Use folio " Vincent Donnefort
2023-09-29 7:24 ` Gavin Shan
2023-09-29 23:53 ` Gavin Shan
@ 2023-10-28 9:17 ` Ryan Roberts
2023-10-30 10:40 ` Marc Zyngier
2023-10-30 10:47 ` Vincent Donnefort
From: Ryan Roberts @ 2023-10-28 9:17 UTC (permalink / raw)
To: Vincent Donnefort, maz, oliver.upton
Cc: kvmarm, linux-arm-kernel, kernel-team, will, willy
On 28/09/2023 18:32, Vincent Donnefort wrote:
> Since commit cb196ee1ef39 ("mm/huge_memory: convert
> do_huge_pmd_anonymous_page() to use vma_alloc_folio()"), transparent
> huge pages use folios. This enables us to check efficiently whether a
> page is mapped by a block, simply by looking at the folio size, which
> saves a page table walk.
>
> It is safe to read the folio in this path: we've just increased its
> refcount (GUP from __gfn_to_pfn_memslot()), which prevents attempts to
> split the huge page.
>
> Signed-off-by: Vincent Donnefort <vdonnefort@google.com>
>
> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> index de5e5148ef5d..69fcbcc7aca5 100644
> --- a/arch/arm64/kvm/mmu.c
> +++ b/arch/arm64/kvm/mmu.c
> @@ -791,51 +791,6 @@ int create_hyp_exec_mappings(phys_addr_t phys_addr, size_t size,
> return 0;
> }
>
> -static struct kvm_pgtable_mm_ops kvm_user_mm_ops = {
> - /* We shouldn't need any other callback to walk the PT */
> - .phys_to_virt = kvm_host_va,
> -};
> -
> -static int get_user_mapping_size(struct kvm *kvm, u64 addr)
> -{
> - struct kvm_pgtable pgt = {
> - .pgd = (kvm_pteref_t)kvm->mm->pgd,
> - .ia_bits = vabits_actual,
> - .start_level = (KVM_PGTABLE_MAX_LEVELS -
> - CONFIG_PGTABLE_LEVELS),
> - .mm_ops = &kvm_user_mm_ops,
> - };
> - unsigned long flags;
> - kvm_pte_t pte = 0; /* Keep GCC quiet... */
> - u32 level = ~0;
> - int ret;
> -
> - /*
> - * Disable IRQs so that we hazard against a concurrent
> - * teardown of the userspace page tables (which relies on
> - * IPI-ing threads).
> - */
> - local_irq_save(flags);
> - ret = kvm_pgtable_get_leaf(&pgt, addr, &pte, &level);
> - local_irq_restore(flags);
> -
> - if (ret)
> - return ret;
> -
> - /*
> - * Not seeing an error, but not updating level? Something went
> - * deeply wrong...
> - */
> - if (WARN_ON(level >= KVM_PGTABLE_MAX_LEVELS))
> - return -EFAULT;
> -
> - /* Oops, the userspace PTs are gone... Replay the fault */
> - if (!kvm_pte_valid(pte))
> - return -EAGAIN;
> -
> - return BIT(ARM64_HW_PGTABLE_LEVEL_SHIFT(level));
> -}
> -
> static struct kvm_pgtable_mm_ops kvm_s2_mm_ops = {
> .zalloc_page = stage2_memcache_zalloc_page,
> .zalloc_pages_exact = kvm_s2_zalloc_pages_exact,
> @@ -1274,7 +1229,7 @@ static bool fault_supports_stage2_huge_mapping(struct kvm_memory_slot *memslot,
> *
> * Returns the size of the mapping.
> */
> -static long
> +static unsigned long
> transparent_hugepage_adjust(struct kvm *kvm, struct kvm_memory_slot *memslot,
> unsigned long hva, kvm_pfn_t *pfnp,
> phys_addr_t *ipap)
> @@ -1287,10 +1242,7 @@ transparent_hugepage_adjust(struct kvm *kvm, struct kvm_memory_slot *memslot,
> * block map is contained within the memslot.
> */
> if (fault_supports_stage2_huge_mapping(memslot, hva, PMD_SIZE)) {
> - int sz = get_user_mapping_size(kvm, hva);
> -
> - if (sz < 0)
> - return sz;
> + size_t sz = folio_size(pfn_folio(pfn));
Hi,
Sorry this is an extremely late reply - I just noticed this because Marc
mentioned it in another thread.
This doesn't look quite right to me; just because you have a folio of a given
size, that doesn't mean the whole thing is mapped into this particular address
space. For example, you could have a (PMD-sized) THP that gets partially
munmapped - the folio is still PMD-sized but only some of it is mapped and
should be accessible to the process. Or you could have a large file-backed folio
(from a filesystem that supports large folios - e.g. XFS) but the application
only mapped part of the file.
Perhaps I've misunderstood and those edge cases can't happen here for some reason?
Thanks,
Ryan
>
> if (sz < PMD_SIZE)
> return PAGE_SIZE;
> @@ -1385,7 +1337,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
> kvm_pfn_t pfn;
> bool logging_active = memslot_is_logging(memslot);
> unsigned long fault_level = kvm_vcpu_trap_get_fault_level(vcpu);
> - long vma_pagesize, fault_granule;
> + unsigned long vma_pagesize, fault_granule;
> enum kvm_pgtable_prot prot = KVM_PGTABLE_PROT_R;
> struct kvm_pgtable *pgt;
>
> @@ -1530,11 +1482,6 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
> vma_pagesize = transparent_hugepage_adjust(kvm, memslot,
> hva, &pfn,
> &fault_ipa);
> -
> - if (vma_pagesize < 0) {
> - ret = vma_pagesize;
> - goto out_unlock;
> - }
> }
>
> if (fault_status != ESR_ELx_FSC_PERM && !device && kvm_has_mte(kvm)) {
* Re: [PATCH v2 2/2] KVM: arm64: Use folio for THP adjustment
2023-10-28 9:17 ` Ryan Roberts
@ 2023-10-30 10:40 ` Marc Zyngier
2023-10-30 10:57 ` Ryan Roberts
2023-10-30 10:47 ` Vincent Donnefort
From: Marc Zyngier @ 2023-10-30 10:40 UTC (permalink / raw)
To: Ryan Roberts
Cc: Vincent Donnefort, oliver.upton, kvmarm, linux-arm-kernel,
kernel-team, will, willy
On Sat, 28 Oct 2023 10:17:17 +0100,
Ryan Roberts <ryan.roberts@arm.com> wrote:
>
> On 28/09/2023 18:32, Vincent Donnefort wrote:
> > Since commit cb196ee1ef39 ("mm/huge_memory: convert
> > do_huge_pmd_anonymous_page() to use vma_alloc_folio()"), transparent
> > huge pages use folios. This enables us to check efficiently whether a
> > page is mapped by a block, simply by looking at the folio size, which
> > saves a page table walk.
> >
> > It is safe to read the folio in this path: we've just increased its
> > refcount (GUP from __gfn_to_pfn_memslot()), which prevents attempts to
> > split the huge page.
> >
> > Signed-off-by: Vincent Donnefort <vdonnefort@google.com>
> >
> > diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> > index de5e5148ef5d..69fcbcc7aca5 100644
> > --- a/arch/arm64/kvm/mmu.c
> > +++ b/arch/arm64/kvm/mmu.c
> > @@ -791,51 +791,6 @@ int create_hyp_exec_mappings(phys_addr_t phys_addr, size_t size,
> > return 0;
> > }
> >
> > -static struct kvm_pgtable_mm_ops kvm_user_mm_ops = {
> > - /* We shouldn't need any other callback to walk the PT */
> > - .phys_to_virt = kvm_host_va,
> > -};
> > -
> > -static int get_user_mapping_size(struct kvm *kvm, u64 addr)
> > -{
> > - struct kvm_pgtable pgt = {
> > - .pgd = (kvm_pteref_t)kvm->mm->pgd,
> > - .ia_bits = vabits_actual,
> > - .start_level = (KVM_PGTABLE_MAX_LEVELS -
> > - CONFIG_PGTABLE_LEVELS),
> > - .mm_ops = &kvm_user_mm_ops,
> > - };
> > - unsigned long flags;
> > - kvm_pte_t pte = 0; /* Keep GCC quiet... */
> > - u32 level = ~0;
> > - int ret;
> > -
> > - /*
> > - * Disable IRQs so that we hazard against a concurrent
> > - * teardown of the userspace page tables (which relies on
> > - * IPI-ing threads).
> > - */
> > - local_irq_save(flags);
> > - ret = kvm_pgtable_get_leaf(&pgt, addr, &pte, &level);
> > - local_irq_restore(flags);
> > -
> > - if (ret)
> > - return ret;
> > -
> > - /*
> > - * Not seeing an error, but not updating level? Something went
> > - * deeply wrong...
> > - */
> > - if (WARN_ON(level >= KVM_PGTABLE_MAX_LEVELS))
> > - return -EFAULT;
> > -
> > - /* Oops, the userspace PTs are gone... Replay the fault */
> > - if (!kvm_pte_valid(pte))
> > - return -EAGAIN;
> > -
> > - return BIT(ARM64_HW_PGTABLE_LEVEL_SHIFT(level));
> > -}
> > -
> > static struct kvm_pgtable_mm_ops kvm_s2_mm_ops = {
> > .zalloc_page = stage2_memcache_zalloc_page,
> > .zalloc_pages_exact = kvm_s2_zalloc_pages_exact,
> > @@ -1274,7 +1229,7 @@ static bool fault_supports_stage2_huge_mapping(struct kvm_memory_slot *memslot,
> > *
> > * Returns the size of the mapping.
> > */
> > -static long
> > +static unsigned long
> > transparent_hugepage_adjust(struct kvm *kvm, struct kvm_memory_slot *memslot,
> > unsigned long hva, kvm_pfn_t *pfnp,
> > phys_addr_t *ipap)
> > @@ -1287,10 +1242,7 @@ transparent_hugepage_adjust(struct kvm *kvm, struct kvm_memory_slot *memslot,
> > * block map is contained within the memslot.
> > */
> > if (fault_supports_stage2_huge_mapping(memslot, hva, PMD_SIZE)) {
> > - int sz = get_user_mapping_size(kvm, hva);
> > -
> > - if (sz < 0)
> > - return sz;
> > + size_t sz = folio_size(pfn_folio(pfn));
>
> Hi,
>
> Sorry this is an extremely late reply - I just noticed this because Marc
> mentioned it in another thread.
>
> This doesn't look quite right to me; just because you have a folio of a given
> size, that doesn't mean the whole thing is mapped into this particular address
> space. For example, you could have a (PMD-sized) THP that gets partially
> munmapped - the folio is still PMD-sized but only some of it is mapped and
> should be accessible to the process. Or you could have a large file-backed folio
> (from a filesystem that supports large folios - e.g. XFS) but the application
> only mapped part of the file.
>
> Perhaps I've misunderstood and those edge cases can't happen here for some reason?
I went ahead and applied the following hack to the *current* tree,
with this patch:
diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index 482280fe22d7..de365489a62f 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -1291,6 +1291,10 @@ transparent_hugepage_adjust(struct kvm *kvm, struct kvm_memory_slot *memslot,
*/
if (fault_supports_stage2_huge_mapping(memslot, hva, PMD_SIZE)) {
int sz = get_user_mapping_size(kvm, hva);
+ size_t fsz = folio_size(pfn_folio(pfn));
+
+ if (sz != fsz)
+ pr_err("sz = %d fsz = %ld\n", sz, fsz);
if (sz < 0)
return sz;
and sure enough, I see the check firing under *a lot* of memory
pressure:
[84567.458803] sz = 4096 fsz = 2097152
[84620.166018] sz = 4096 fsz = 2097152
So indeed, folio_size() doesn't provide what we need. We absolutely
need to match what is actually mapped in userspace or things may turn
out to be rather ugly should the other pages that are part of the same
folio be allocated somewhere else. Is that even possible?
M.
--
Without deviation from the norm, progress is not possible.
* Re: [PATCH v2 2/2] KVM: arm64: Use folio for THP adjustment
2023-10-28 9:17 ` Ryan Roberts
2023-10-30 10:40 ` Marc Zyngier
@ 2023-10-30 10:47 ` Vincent Donnefort
2023-10-30 11:01 ` Ryan Roberts
From: Vincent Donnefort @ 2023-10-30 10:47 UTC (permalink / raw)
To: Ryan Roberts
Cc: maz, oliver.upton, kvmarm, linux-arm-kernel, kernel-team, will,
willy
[...]
> > static struct kvm_pgtable_mm_ops kvm_s2_mm_ops = {
> > .zalloc_page = stage2_memcache_zalloc_page,
> > .zalloc_pages_exact = kvm_s2_zalloc_pages_exact,
> > @@ -1274,7 +1229,7 @@ static bool fault_supports_stage2_huge_mapping(struct kvm_memory_slot *memslot,
> > *
> > * Returns the size of the mapping.
> > */
> > -static long
> > +static unsigned long
> > transparent_hugepage_adjust(struct kvm *kvm, struct kvm_memory_slot *memslot,
> > unsigned long hva, kvm_pfn_t *pfnp,
> > phys_addr_t *ipap)
> > @@ -1287,10 +1242,7 @@ transparent_hugepage_adjust(struct kvm *kvm, struct kvm_memory_slot *memslot,
> > * block map is contained within the memslot.
> > */
> > if (fault_supports_stage2_huge_mapping(memslot, hva, PMD_SIZE)) {
> > - int sz = get_user_mapping_size(kvm, hva);
> > -
> > - if (sz < 0)
> > - return sz;
> > + size_t sz = folio_size(pfn_folio(pfn));
>
> Hi,
>
> Sorry this is an extremely late reply - I just noticed this because Marc
> mentioned it in another thread.
>
> This doesn't look quite right to me; just because you have a folio of a given
> size, that doesn't mean the whole thing is mapped into this particular address
> space. For example, you could have a (PMD-sized) THP that gets partially
> munmapped - the folio is still PMD-sized but only some of it is mapped and
> should be accessible to the process. Or you could have a large file-backed folio
> (from a filesystem that supports large folios - e.g. XFS) but the application
> only mapped part of the file.
I thought originally this would break the block and the folio with it, but a
quick experiment showed it's not the case.
>
> Perhaps I've misunderstood and those edge cases can't happen here for some reason?
And fault_supports_stage2_huge_mapping() would probably not be enough. We might
end up with a portion unmapped at stage-1 but mapped at stage-2. :-(
>
> Thanks,
> Ryan
>
>
* Re: [PATCH v2 2/2] KVM: arm64: Use folio for THP adjustment
2023-10-30 10:40 ` Marc Zyngier
@ 2023-10-30 10:57 ` Ryan Roberts
From: Ryan Roberts @ 2023-10-30 10:57 UTC (permalink / raw)
To: Marc Zyngier
Cc: Vincent Donnefort, oliver.upton, kvmarm, linux-arm-kernel,
kernel-team, will, willy
On 30/10/2023 10:40, Marc Zyngier wrote:
> On Sat, 28 Oct 2023 10:17:17 +0100,
> Ryan Roberts <ryan.roberts@arm.com> wrote:
>>
>> On 28/09/2023 18:32, Vincent Donnefort wrote:
>>> Since commit cb196ee1ef39 ("mm/huge_memory: convert
>>> do_huge_pmd_anonymous_page() to use vma_alloc_folio()"), transparent
>>> huge pages use folios. This enables us to check efficiently whether a
>>> page is mapped by a block, simply by looking at the folio size, which
>>> saves a page table walk.
>>>
>>> It is safe to read the folio in this path: we've just increased its
>>> refcount (GUP from __gfn_to_pfn_memslot()), which prevents attempts to
>>> split the huge page.
>>>
>>> Signed-off-by: Vincent Donnefort <vdonnefort@google.com>
>>>
>>> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
>>> index de5e5148ef5d..69fcbcc7aca5 100644
>>> --- a/arch/arm64/kvm/mmu.c
>>> +++ b/arch/arm64/kvm/mmu.c
>>> @@ -791,51 +791,6 @@ int create_hyp_exec_mappings(phys_addr_t phys_addr, size_t size,
>>> return 0;
>>> }
>>>
>>> -static struct kvm_pgtable_mm_ops kvm_user_mm_ops = {
>>> - /* We shouldn't need any other callback to walk the PT */
>>> - .phys_to_virt = kvm_host_va,
>>> -};
>>> -
>>> -static int get_user_mapping_size(struct kvm *kvm, u64 addr)
>>> -{
>>> - struct kvm_pgtable pgt = {
>>> - .pgd = (kvm_pteref_t)kvm->mm->pgd,
>>> - .ia_bits = vabits_actual,
>>> - .start_level = (KVM_PGTABLE_MAX_LEVELS -
>>> - CONFIG_PGTABLE_LEVELS),
>>> - .mm_ops = &kvm_user_mm_ops,
>>> - };
>>> - unsigned long flags;
>>> - kvm_pte_t pte = 0; /* Keep GCC quiet... */
>>> - u32 level = ~0;
>>> - int ret;
>>> -
>>> - /*
>>> - * Disable IRQs so that we hazard against a concurrent
>>> - * teardown of the userspace page tables (which relies on
>>> - * IPI-ing threads).
>>> - */
>>> - local_irq_save(flags);
>>> - ret = kvm_pgtable_get_leaf(&pgt, addr, &pte, &level);
>>> - local_irq_restore(flags);
>>> -
>>> - if (ret)
>>> - return ret;
>>> -
>>> - /*
>>> - * Not seeing an error, but not updating level? Something went
>>> - * deeply wrong...
>>> - */
>>> - if (WARN_ON(level >= KVM_PGTABLE_MAX_LEVELS))
>>> - return -EFAULT;
>>> -
>>> - /* Oops, the userspace PTs are gone... Replay the fault */
>>> - if (!kvm_pte_valid(pte))
>>> - return -EAGAIN;
>>> -
>>> - return BIT(ARM64_HW_PGTABLE_LEVEL_SHIFT(level));
>>> -}
>>> -
>>> static struct kvm_pgtable_mm_ops kvm_s2_mm_ops = {
>>> .zalloc_page = stage2_memcache_zalloc_page,
>>> .zalloc_pages_exact = kvm_s2_zalloc_pages_exact,
>>> @@ -1274,7 +1229,7 @@ static bool fault_supports_stage2_huge_mapping(struct kvm_memory_slot *memslot,
>>> *
>>> * Returns the size of the mapping.
>>> */
>>> -static long
>>> +static unsigned long
>>> transparent_hugepage_adjust(struct kvm *kvm, struct kvm_memory_slot *memslot,
>>> unsigned long hva, kvm_pfn_t *pfnp,
>>> phys_addr_t *ipap)
>>> @@ -1287,10 +1242,7 @@ transparent_hugepage_adjust(struct kvm *kvm, struct kvm_memory_slot *memslot,
>>> * block map is contained within the memslot.
>>> */
>>> if (fault_supports_stage2_huge_mapping(memslot, hva, PMD_SIZE)) {
>>> - int sz = get_user_mapping_size(kvm, hva);
>>> -
>>> - if (sz < 0)
>>> - return sz;
>>> + size_t sz = folio_size(pfn_folio(pfn));
>>
>> Hi,
>>
>> Sorry this is an extremely late reply - I just noticed this because Marc
>> mentioned it in another thread.
>>
>> This doesn't look quite right to me; just because you have a folio of a given
>> size, that doesn't mean the whole thing is mapped into this particular address
>> space. For example, you could have a (PMD-sized) THP that gets partially
>> munmapped - the folio is still PMD-sized but only some of it is mapped and
>> should be accessible to the process. Or you could have a large file-backed folio
>> (from a filesystem that supports large folios - e.g. XFS) but the application
>> only mapped part of the file.
>>
>> Perhaps I've misunderstood and those edge cases can't happen here for some reason?
>
> I went ahead and applied the following hack to the *current* tree,
> with this patch:
>
> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> index 482280fe22d7..de365489a62f 100644
> --- a/arch/arm64/kvm/mmu.c
> +++ b/arch/arm64/kvm/mmu.c
> @@ -1291,6 +1291,10 @@ transparent_hugepage_adjust(struct kvm *kvm, struct kvm_memory_slot *memslot,
> */
> if (fault_supports_stage2_huge_mapping(memslot, hva, PMD_SIZE)) {
> int sz = get_user_mapping_size(kvm, hva);
> + size_t fsz = folio_size(pfn_folio(pfn));
> +
> + if (sz != fsz)
> + pr_err("sz = %d fsz = %ld\n", sz, fsz);
>
> if (sz < 0)
> return sz;
>
> and sure enough, I see the check firing under *a lot* of memory
> pressure:
>
> [84567.458803] sz = 4096 fsz = 2097152
> [84620.166018] sz = 4096 fsz = 2097152
>
> So indeed, folio_size() doesn't provide what we need. We absolutely
> need to match what is actually mapped in userspace or things may turn
> out to be rather ugly should the other pages that are part of the same
> folio be allocated somewhere else. Is that even possible?
Yes, I think so.

One such possibility is:

1. user space maps a PMD-sized THP
2. user space does a partial munmap
   - some of the PMD-sized folio is now PTE-mapped
   - the folio is on the deferred split list
3. user space creates a VM covering this memory
   - the new faulting logic incorrectly determines it's a PMD mapping, so
     it PMD-maps the whole folio
4. memory pressure causes folios on the deferred split queue to be split
   - Unless you take an _entire_mapcount ref when you map. But even then, I
     doubt the deferred split queue will check that, because once unmapped
     it should be impossible to remap that piece of the folio.
5. The "unmapped" pages of the folio we just split get given back to the
   buddy
   - I guess if KVM took a ref then the page might not be given back to the
     buddy, which would at least prevent this becoming a security issue
6. Something else allocates that page...
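For concreteness, steps 1 and 2 can be set up with a hypothetical userspace
sequence like this (assuming THP is enabled and the mapping happens to land
2M-aligned):

    #include <string.h>
    #include <sys/mman.h>

    #define SZ_2M (2UL << 20)
    #define SZ_1M (1UL << 20)

    static void make_partially_mapped_thp(void)
    {
            char *p = mmap(NULL, SZ_2M, PROT_READ | PROT_WRITE,
                           MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

            memset(p, 0, SZ_2M);      /* fault in a PMD-sized THP */
            munmap(p + SZ_1M, SZ_1M); /* folio stays 2M-sized, but only half
                                       * remains mapped; the folio goes on
                                       * the deferred split list */
    }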
A less severe scenario might involve mremap, where part of the THP is mremapped
in user space so it's not all contiguous, but with your logic, the VM will see
it as contiguous.
>
> M.
>
* Re: [PATCH v2 2/2] KVM: arm64: Use folio for THP adjustment
2023-10-30 10:47 ` Vincent Donnefort
@ 2023-10-30 11:01 ` Ryan Roberts
From: Ryan Roberts @ 2023-10-30 11:01 UTC (permalink / raw)
To: Vincent Donnefort
Cc: maz, oliver.upton, kvmarm, linux-arm-kernel, kernel-team, will,
willy
On 30/10/2023 10:47, Vincent Donnefort wrote:
> [...]
>
>>> static struct kvm_pgtable_mm_ops kvm_s2_mm_ops = {
>>> .zalloc_page = stage2_memcache_zalloc_page,
>>> .zalloc_pages_exact = kvm_s2_zalloc_pages_exact,
>>> @@ -1274,7 +1229,7 @@ static bool fault_supports_stage2_huge_mapping(struct kvm_memory_slot *memslot,
>>> *
>>> * Returns the size of the mapping.
>>> */
>>> -static long
>>> +static unsigned long
>>> transparent_hugepage_adjust(struct kvm *kvm, struct kvm_memory_slot *memslot,
>>> unsigned long hva, kvm_pfn_t *pfnp,
>>> phys_addr_t *ipap)
>>> @@ -1287,10 +1242,7 @@ transparent_hugepage_adjust(struct kvm *kvm, struct kvm_memory_slot *memslot,
>>> * block map is contained within the memslot.
>>> */
>>> if (fault_supports_stage2_huge_mapping(memslot, hva, PMD_SIZE)) {
>>> - int sz = get_user_mapping_size(kvm, hva);
>>> -
>>> - if (sz < 0)
>>> - return sz;
>>> + size_t sz = folio_size(pfn_folio(pfn));
>>
>> Hi,
>>
>> Sorry this is an extremely late reply - I just noticed this because Marc
>> mentioned it in another thread.
>>
>> This doesn't look quite right to me; just because you have a folio of a given
>> size, that doesn't mean the whole thing is mapped into this particular address
>> space. For example, you could have a (PMD-sized) THP that gets partially
>> munmapped - the folio is still PMD-sized but only some of it is mapped and
>> should be accessible to the process. Or you could have a large file-backed folio
>> (from a filesystem that supports large folios - e.g. XFS) but the application
>> only mapped part of the file.
>
> I thought originally this would break the block and the folio with it, but a
> quick experiment showed it's not the case.
That's not how it works, unfortunately. Even if we wanted to do this in the
common case (which we might in future for anon memory), there are edge cases
where splitting the folio can fail (GUP). And it wouldn't make sense to split
in the file-backed case, since those folios may be mapped into other processes.
>
>>
>> Perhaps I've misunderstood and those edge cases can't happen here for some reason?
>
> And fault_supports_stage2_huge_mapping() would probably not be enough. We might
> end up with a portion unmapped at stage-1 but mapped at stage-2. :-(
Yes. Or even remapped if you consider mremapping a portion of the THP.
>
>>
>> Thanks,
>> Ryan
>>
>>
* Re: [PATCH v2 0/2] KVM: arm64: Use folio for THP support
2023-09-30 17:57 ` Oliver Upton
@ 2023-10-30 20:16 ` Oliver Upton
From: Oliver Upton @ 2023-10-30 20:16 UTC (permalink / raw)
To: maz, Vincent Donnefort
Cc: kvmarm, kernel-team, willy, linux-arm-kernel, will, ryan.roberts
On Sat, Sep 30, 2023 at 05:57:31PM +0000, Oliver Upton wrote:
> On Thu, 28 Sep 2023 18:32:02 +0100, Vincent Donnefort wrote:
> > With the introduction of folios for transparent huge pages [1], we
> > can rely on this support to identify if a page is backed by a huge one,
> > saving a page table walk.
> >
> > v1 -> v2:
> > * Add git reference to [1] in the commit message
> > * GUP always acted on the Head page under the hood. Update the commit
> > message.
> >
> > [...]
>
> Applied to kvmarm/next, thanks!
>
> [1/2] KVM: arm64: Do not transfer page refcount for THP adjustment
> https://git.kernel.org/kvmarm/kvmarm/c/c04bf723ccd6
> [2/2] KVM: arm64: Use folio for THP adjustment
> https://git.kernel.org/kvmarm/kvmarm/c/bf92834e6f6e
Based on the discussion I've gone ahead and dropped the second patch.
I'm keeping the first since it is a nice cleanup.
Thanks Ryan for catching what would've otherwise been a rather nasty
bug.
--
Thanks,
Oliver