From: "Nicholas Piggin" <npiggin@gmail.com>
To: "LEROY Christophe" <christophe.leroy2@cs-soprasteria.com>,
	"Andrew Morton" <akpm@linux-foundation.org>,
	"Jason Gunthorpe" <jgg@nvidia.com>,
	"Peter Xu" <peterx@redhat.com>,
	"Oscar Salvador" <osalvador@suse.de>,
	"Michael Ellerman" <mpe@ellerman.id.au>
Cc: "linux-mm@kvack.org" <linux-mm@kvack.org>,
	"linuxppc-dev@lists.ozlabs.org" <linuxppc-dev@lists.ozlabs.org>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>
Subject: Re: [PATCH v6 21/23] powerpc/64s: Use contiguous PMD/PUD instead of HUGEPD
Date: Wed, 26 Jun 2024 11:23:11 +1000
Message-ID: <D29K0SVIUJDR.2HQTLIJG4R7VG@gmail.com>
In-Reply-To: <a8f76535-2d5a-4f25-83be-31aab1cd38c4@cs-soprasteria.com>

On Tue Jun 25, 2024 at 3:20 PM AEST, LEROY Christophe wrote:
>
>
> On 25/06/2024 at 06:49, Nicholas Piggin wrote:
> > On Tue Jun 25, 2024 at 12:45 AM AEST, Christophe Leroy wrote:
> >> On book3s/64, the only user of hugepd is hash in 4k mode.
> >>
> >> All other setups (hash-64, radix-4, radix-64) use leaf PMD/PUD.
> >>
> >> Rework hash-4k to use contiguous PMD and PUD instead.
> >>
> >> In that setup there are only two huge page sizes: 16M and 16G.
> >>
> >> 16M sits at PMD level and 16G at PUD level.
> >>
> >> pte_update doesn't know page size, let's use the same trick as
> >> hpte_need_flush() to get page size from segment properties. That's
> >> not the most efficient way but let's do that until callers of
> >> pte_update() provide page size instead of just a huge flag.
> >>
> >> Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
> > 
> > [snip]
> > 
> >> +static inline unsigned long hash__pte_update(struct mm_struct *mm,
> >> +					 unsigned long addr,
> >> +					 pte_t *ptep, unsigned long clr,
> >> +					 unsigned long set,
> >> +					 int huge)
> >> +{
> >> +	unsigned long old;
> >> +
> >> +	old = hash__pte_update_one(ptep, clr, set);
> >> +
> >> +	if (IS_ENABLED(CONFIG_PPC_4K_PAGES) && huge) {
> >> +		unsigned int psize = get_slice_psize(mm, addr);
> >> +		int nb, i;
> >> +
> >> +		if (psize == MMU_PAGE_16M)
> >> +			nb = SZ_16M / PMD_SIZE;
> >> +		else if (psize == MMU_PAGE_16G)
> >> +			nb = SZ_16G / PUD_SIZE;
> >> +		else
> >> +			nb = 1;
> >> +
> >> +		WARN_ON_ONCE(nb == 1);	/* Should never happen */
> >> +
> >> +		for (i = 1; i < nb; i++)
> >> +			hash__pte_update_one(ptep + i, clr, set);
> >> +	}
> >>   	/* huge pages use the old page table lock */
> >>   	if (!huge)
> >>   		assert_pte_locked(mm, addr);
> >>   
> >> -	old = be64_to_cpu(old_be);
> >>   	if (old & H_PAGE_HASHPTE)
> >>   		hpte_need_flush(mm, addr, ptep, old, huge);
> >>   
> > 
> > We definitely need a bit more comment and changelog about the atomicity
> > issues here. I think the plan should be all hash-side access just
> > operates on PTE[0], which should avoid that whole race. There could be
> > some cases that don't follow that. Adding some warnings to catch such
> > things could be good too.
>
> That seems to be the case indeed, as we have the following in 
> hash_page_mm():
>
> #ifndef CONFIG_PPC_64K_PAGES
> 	/*
> 	 * If we use 4K pages and our psize is not 4K, then we might
> 	 * be hitting a special driver mapping, and need to align the
> 	 * address before we fetch the PTE.
> 	 *
> 	 * It could also be a hugepage mapping, in which case this is
> 	 * not necessary, but it's not harmful, either.
> 	 */
> 	if (psize != MMU_PAGE_4K)
> 		ea &= ~((1ul << mmu_psize_defs[psize].shift) - 1);
> #endif /* CONFIG_PPC_64K_PAGES */

Yeah, for that one it works (comment needs updating to say that it
*is* necessary). I think that's the main thing but there's other
possible places where it might not hold -- KVM too, not just the
hash refill.
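To illustrate why the masking matters once a 16M page spans several
entries, here is a minimal sketch, not the kernel code. It assumes a 2M
PMD entry size on hash-4k (so 8 contiguous entries per 16M page); the
names align_ea() and pmd_slot() and both shift constants are
hypothetical stand-ins for mmu_psize_defs[psize].shift and PMD_SHIFT.

```c
#include <assert.h>

/* Assumed geometry, for illustration only. */
#define SHIFT_16M         24UL	/* log2 of the 16M huge page size */
#define ASSUMED_PMD_SHIFT 21UL	/* assumption: one PMD entry maps 2M */

/* Round ea down to the start of its huge page (same masking as above). */
static unsigned long align_ea(unsigned long ea, unsigned long shift)
{
	assert(shift < 8 * sizeof(unsigned long));
	return ea & ~((1UL << shift) - 1);
}

/*
 * Position of the PMD entry covering ea inside its contiguous group:
 * 0 means PTE[0], anything else is one of the trailing entries.
 */
static unsigned long pmd_slot(unsigned long ea)
{
	unsigned long entries = 1UL << (SHIFT_16M - ASSUMED_PMD_SHIFT);

	return (ea >> ASSUMED_PMD_SHIFT) & (entries - 1);
}
```

With these assumptions, an unaligned address such as 0x1234567 falls in
slot 1 of its group, while align_ea(0x1234567, SHIFT_16M) gives
0x1000000, which is slot 0. So the masking is what makes the lookup
land on PTE[0].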

> > 
> > I'd been meaning to do more on this sooner, sorry. I've started
> > tinkering with adding a bit of debug code. I'll see if I can help with
> > adding a bit of comments.
>
> Yes, that would be very welcome. I guess you'll send it as a
> followup/fixup patch to the series?

Yeah, the basic approach I think is good, so it wouldn't be a
big rework.

>
> > 
> > [snip]
> > 
> >> diff --git a/arch/powerpc/mm/book3s64/hugetlbpage.c b/arch/powerpc/mm/book3s64/hugetlbpage.c
> >> index 5a2e512e96db..83c3361b358b 100644
> >> --- a/arch/powerpc/mm/book3s64/hugetlbpage.c
> >> +++ b/arch/powerpc/mm/book3s64/hugetlbpage.c
> >> @@ -53,6 +53,16 @@ int __hash_page_huge(unsigned long ea, unsigned long access, unsigned long vsid,
> >>   		/* If PTE permissions don't match, take page fault */
> >>   		if (unlikely(!check_pte_access(access, old_pte)))
> >>   			return 1;
> >> +		/*
> >> +		 * If hash-4k, hugepages use several contiguous PxD entries
> >> +		 * so bail out and let mm make the page young or dirty
> >> +		 */
> >> +		if (IS_ENABLED(CONFIG_PPC_4K_PAGES)) {
> >> +			if (!(old_pte & _PAGE_ACCESSED))
> >> +				return 1;
> >> +			if ((access & _PAGE_WRITE) && !(old_pte & _PAGE_DIRTY))
> >> +				return 1;
> >> +		}
> >>   
> >>   		/*
> >>   		 * Try to lock the PTE, add ACCESSED and DIRTY if it was
> > 
> > I'm hoping we wouldn't have to do this, if we follow the PTE[0] rule.
>
> But we still need all entries to be updated so that page walkers which
> don't know they must use PTE[0] get the right information?

Ah yeah. Maybe for ACCESSED|DIRTY we can slightly adjust that rule
and apply it to all PTEs. If we can do that then it takes care of
a few other cases too.

But what is the consequence of two pte_update racing? Let's say
page_vma_mkclean_one vs setting dirty. Can you end up with some
PTEs dirty and some not?
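
The concern can be made concrete with a small deterministic model. This
is only a sketch: it replays one hand-picked interleaving of two
multi-entry updates, and says nothing about whether kernel locking
actually allows that interleaving. NB, DIRTY and every function name
here are hypothetical; update_one() only stands in for a per-entry
atomic update like hash__pte_update_one().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NB    8		/* assumption: 16M page made of 8 contiguous 2M entries */
#define DIRTY 0x1UL	/* hypothetical flag bit, not the real _PAGE_DIRTY */

/* Models a single-entry update: atomic per entry, not across entries. */
static unsigned long update_one(unsigned long *p, unsigned long clr,
				unsigned long set)
{
	unsigned long old = *p;

	*p = (old & ~clr) | set;
	return old;
}

/* Apply the same clr/set to entries [from, to) of one contiguous group. */
static void update_range(unsigned long *tbl, size_t from, size_t to,
			 unsigned long clr, unsigned long set)
{
	assert(from <= to && to <= NB);
	for (size_t i = from; i < to; i++)
		update_one(&tbl[i], clr, set);
}

/*
 * One possible interleaving: a cleaner (clears DIRTY) is preempted
 * halfway through the group by a writer that sets DIRTY everywhere.
 */
static void racy_interleaving(unsigned long *tbl)
{
	update_range(tbl, 0, NB / 2, DIRTY, 0);		/* cleaner, first half */
	update_range(tbl, 0, NB, 0, DIRTY);		/* writer, all entries */
	update_range(tbl, NB / 2, NB, DIRTY, 0);	/* cleaner, second half */
}

/* True if some, but not all, entries of the group carry DIRTY. */
static bool mixed_dirty(const unsigned long *tbl)
{
	bool any = false, all = true;

	for (size_t i = 0; i < NB; i++) {
		if (tbl[i] & DIRTY)
			any = true;
		else
			all = false;
	}
	return any && !all;
}
```

In this model the group ends with the first half dirty and the second
half clean, so a walker reading PTE[0] and one reading a trailing entry
would disagree. That is the outcome the question above asks about.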

Thanks,
Nick
