public inbox for linux-arm-kernel@lists.infradead.org
From: Dev Jain <dev.jain@arm.com>
To: Ryan Roberts <ryan.roberts@arm.com>,
	catalin.marinas@arm.com, will@kernel.org
Cc: anshuman.khandual@arm.com, quic_zhenhuah@quicinc.com,
	kevin.brodsky@arm.com, yangyicong@hisilicon.com,
	joey.gouly@arm.com, linux-arm-kernel@lists.infradead.org,
	linux-kernel@vger.kernel.org, david@redhat.com
Subject: Re: [PATCH] arm64: Enable vmalloc-huge with ptdump
Date: Fri, 30 May 2025 14:44:47 +0530
Message-ID: <832e84a9-4303-4e21-a88b-94395898fa3e@arm.com>
In-Reply-To: <d2b63b97-232e-4d2e-816b-71fd5b0ffcfa@arm.com>


On 30/05/25 2:10 pm, Ryan Roberts wrote:
> On 30/05/2025 09:20, Dev Jain wrote:
>> arm64 disables vmalloc-huge when kernel page table dumping is enabled,
>> because an intermediate table may be removed, potentially causing the
>> ptdump code to dereference an invalid address. We want to be able to
>> analyze block vs page mappings for kernel mappings with ptdump, so to
>> enable vmalloc-huge with ptdump, synchronize between page table removal in
>> pmd_free_pte_page()/pud_free_pmd_page() and ptdump pagetable walking. We
>> use mmap_read_lock and not write lock because we don't need to synchronize
>> between two different vm_structs; two vmalloc objects running this same
>> code path will point to different page tables, hence there is no race.

My "correction" from "race" to "no problem" was incorrect after all :) There
is indeed no race, since each vm_struct object has exclusive access to whatever
table it is clearing.

>>
>> Signed-off-by: Dev Jain <dev.jain@arm.com>
>> ---
>>   arch/arm64/include/asm/vmalloc.h | 6 ++----
>>   arch/arm64/mm/mmu.c              | 7 +++++++
>>   2 files changed, 9 insertions(+), 4 deletions(-)
>>
>> diff --git a/arch/arm64/include/asm/vmalloc.h b/arch/arm64/include/asm/vmalloc.h
>> index 38fafffe699f..28b7173d8693 100644
>> --- a/arch/arm64/include/asm/vmalloc.h
>> +++ b/arch/arm64/include/asm/vmalloc.h
>> @@ -12,15 +12,13 @@ static inline bool arch_vmap_pud_supported(pgprot_t prot)
>>   	/*
>>   	 * SW table walks can't handle removal of intermediate entries.
>>   	 */
>> -	return pud_sect_supported() &&
>> -	       !IS_ENABLED(CONFIG_PTDUMP_DEBUGFS);
>> +	return pud_sect_supported();
>>   }
>>   
>>   #define arch_vmap_pmd_supported arch_vmap_pmd_supported
>>   static inline bool arch_vmap_pmd_supported(pgprot_t prot)
>>   {
>> -	/* See arch_vmap_pud_supported() */
>> -	return !IS_ENABLED(CONFIG_PTDUMP_DEBUGFS);
>> +	return true;
>>   }
>>   
>>   #endif
>> diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
>> index ea6695d53fb9..798cebd9e147 100644
>> --- a/arch/arm64/mm/mmu.c
>> +++ b/arch/arm64/mm/mmu.c
>> @@ -1261,7 +1261,11 @@ int pmd_free_pte_page(pmd_t *pmdp, unsigned long addr)
>>   	}
>>   
>>   	table = pte_offset_kernel(pmdp, addr);
>> +
>> +	/* Synchronize against ptdump_walk_pgd() */
>> +	mmap_read_lock(&init_mm);
>>   	pmd_clear(pmdp);
>> +	mmap_read_unlock(&init_mm);
> So this works because ptdump_walk_pgd() takes the write_lock (which is mutually
> exclusive with any read_lock holders) for the duration of the table walk, so it
> will either consistently see the pgtables before or after this removal. It will
> never disappear during the walk, correct?
>
> I guess there is a risk of this showing up as contention with other init_mm
> write_lock holders. But I expect that pmd_free_pte_page()/pud_free_pmd_page()
> are called sufficiently rarely that the risk is very small. Let's fix any perf
> problem if/when we see it.

We can avoid all of that with my initial approach: take the lock only when
CONFIG_PTDUMP_DEBUGFS is enabled. I don't have a strong opinion, just putting
it out there.

>
>>   	__flush_tlb_kernel_pgtable(addr);
> And the tlbi doesn't need to be serialized because there is no security issue.
> The walker can be trusted to only dereference memory that it sees as it walks
> the pgtable (obviously).
>
>>   	pte_free_kernel(NULL, table);
>>   	return 1;
>> @@ -1289,7 +1293,10 @@ int pud_free_pmd_page(pud_t *pudp, unsigned long addr)
>>   		pmd_free_pte_page(pmdp, next);
>>   	} while (pmdp++, next += PMD_SIZE, next != end);
>>   
>> +	/* Synchronize against ptdump_walk_pgd() */
>> +	mmap_read_lock(&init_mm);
>>   	pud_clear(pudp);
>> +	mmap_read_unlock(&init_mm);
> Hmm, so pud_free_pmd_page() is now going to cause us to acquire and release the
> lock (up to) 513 times (for a 4K kernel). I wonder if there is an argument for
> clearing the pud first (under the lock), then the pmds can all be cleared
> without a lock, since the walker won't be able to see the pmds once the pud is
> cleared.

Yes, we can isolate the PMD table when the caller of pmd_free_pte_page() is
pud_free_pmd_page(). In that case the vm_struct has exclusive access to the
entire PMD page, so no race can occur. But when vmap_try_huge_pmd() is the
caller, we cannot drop the locking in pmd_free_pte_page(). So we can have
something like

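#ifdef CONFIG_PTDUMP_DEBUGFS
/* Take the init_mm read lock only if the caller asks for it. */
static inline void ptdump_synchronize_lock(bool flag)
{
	if (flag)
		mmap_read_lock(&init_mm);
}
#else
static inline void ptdump_synchronize_lock(bool flag) { }
#endif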

and pass false when the caller is pud_free_pmd_page.
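
To make the lock-count argument concrete, below is a small userspace model,
not kernel code. It assumes 512 entries per PMD table (4K pages) and replaces
the mmap lock with a counter; every model_* name is hypothetical. It compares
the ordering in the posted patch (each PMD cleared under the lock, then the
PUD) with the suggested one (PUD cleared first under the lock, then the PMDs
without it).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_PTRS_PER_PMD 512	/* entries in one PMD table with 4K pages */

static unsigned long lock_acquisitions;	/* how often the read lock was taken */
static int lock_depth;			/* catches unbalanced lock/unlock */

/* Stand-ins for mmap_read_lock(&init_mm) / mmap_read_unlock(&init_mm). */
static void model_read_lock(void)
{
	lock_acquisitions++;
	lock_depth++;
}

static void model_read_unlock(void)
{
	assert(lock_depth > 0);
	lock_depth--;
}

/* Model of pmd_free_pte_page(): clears one entry if it points to a table. */
static void model_pmd_free_pte_page(bool *pmd_is_table, bool lock)
{
	if (!*pmd_is_table)
		return;
	if (lock)
		model_read_lock();
	*pmd_is_table = false;		/* stands in for pmd_clear() */
	if (lock)
		model_read_unlock();
}

/* Ordering in the posted patch: every PMD under the lock, then the PUD. */
static unsigned long model_free_posted(bool *pmds)
{
	lock_acquisitions = 0;
	for (size_t i = 0; i < MODEL_PTRS_PER_PMD; i++)
		model_pmd_free_pte_page(&pmds[i], true);
	model_read_lock();		/* pud_clear() would happen here */
	model_read_unlock();
	return lock_acquisitions;
}

/* Suggested ordering: PUD first under the lock, then PMDs without it. */
static unsigned long model_free_pud_first(bool *pmds)
{
	lock_acquisitions = 0;
	model_read_lock();		/* pud_clear() would happen here */
	model_read_unlock();
	for (size_t i = 0; i < MODEL_PTRS_PER_PMD; i++)
		model_pmd_free_pte_page(&pmds[i], false);
	return lock_acquisitions;
}
```

With all 512 entries pointing to PTE tables, model_free_posted() counts 513
acquisitions and model_free_pud_first() counts 1. Entries that are not tables
are skipped, which is where the "up to" comes from.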

>
> Thanks,
> Ryan
>
>>   	__flush_tlb_kernel_pgtable(addr);
>>   	pmd_free(NULL, table);
>>   	return 1;

