linux-arm-kernel.lists.infradead.org archive mirror
* [PATCH v2 0/2] Fix subsection vmemmap_populate logic
@ 2024-12-09  9:42 Zhenhua Huang
  2024-12-09  9:42 ` [PATCH v2 1/2] arm64: mm: vmemmap populate to page level if not section aligned Zhenhua Huang
                   ` (2 more replies)
  0 siblings, 3 replies; 25+ messages in thread
From: Zhenhua Huang @ 2024-12-09  9:42 UTC (permalink / raw)
  To: catalin.marinas, will, ardb, ryan.roberts, mark.rutland,
	joey.gouly, dave.hansen, akpm, chenfeiyang, chenhuacai
  Cc: linux-mm, linux-arm-kernel, linux-kernel, Zhenhua Huang

To perform memory hotplug operations, the memmap (aka the struct page
array) must be updated. For arm64 with 4K page size, the typical hotplug
granularity is a 128M section, which corresponds to a 2M memmap buffer.
Commit c1cc1552616d ("arm64: MMU initialisation")
optimizes this 2M buffer to be mapped with a single PMD entry. However,
commit ba72b4c8cf60 ("mm/sparsemem: support sub-section hotplug"),
which allows hotplug at 2M subsection granularity, exposes other issues
(refer to the change log of patch #1). Adjust the logic to populate with
huge pages only if the hotplug address/size is section aligned.
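
For reference, the arithmetic behind these numbers, assuming 4K pages and
the usual 64-byte struct page:

  pages per 128M section    = 128M / 4K        = 32768
  memmap per section        = 32768 * 64 bytes = 2M   (one PMD block)
  pages per 2M subsection   = 2M / 4K          = 512
  memmap per subsection     = 512 * 64 bytes   = 32K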

Changes since v1:
  - Reworded the change log to make it clearer, based on Catalin's
    comments.

Zhenhua Huang (2):
  arm64: mm: vmemmap populate to page level if not section aligned
  arm64: mm: implement vmemmap_check_pmd for arm64

 arch/arm64/mm/mmu.c | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

-- 
2.25.1




* [PATCH v2 1/2] arm64: mm: vmemmap populate to page level if not section aligned
  2024-12-09  9:42 [PATCH v2 0/2] Fix subsection vmemmap_populate logic Zhenhua Huang
@ 2024-12-09  9:42 ` Zhenhua Huang
  2024-12-20 18:30   ` Catalin Marinas
  2024-12-09  9:42 ` [PATCH v2 2/2] arm64: mm: implement vmemmap_check_pmd for arm64 Zhenhua Huang
  2024-12-17  1:47 ` [PATCH v2 0/2] Fix subsection vmemmap_populate logic Zhenhua Huang
  2 siblings, 1 reply; 25+ messages in thread
From: Zhenhua Huang @ 2024-12-09  9:42 UTC (permalink / raw)
  To: catalin.marinas, will, ardb, ryan.roberts, mark.rutland,
	joey.gouly, dave.hansen, akpm, chenfeiyang, chenhuacai
  Cc: linux-mm, linux-arm-kernel, linux-kernel, Zhenhua Huang

Commit c1cc1552616d ("arm64: MMU initialisation")
optimizes the vmemmap to be populated at the PMD section level. However,
if start or end is not aligned to a section boundary, such as when a
subsection is hot-added, populating the entire section is wasteful. For
instance, if only one subsection is hot-added, the struct page metadata
for the entire section will still be populated. In such cases, it is more
effective to populate at page granularity.

This change also addresses a mismatch during vmemmap_free(): when
pmd_sect() is true, the entire PMD section is cleared, even if other
subsections in it are still in use. For example, suppose pagemap1 and
pagemap2 are part of a single PMD entry and are hot-added sequentially.
When pagemap1 is then removed, vmemmap_free() will clear the entire PMD
entry, freeing the struct page metadata for the whole section, even
though pagemap2 is still active.
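
A hypothetical sequence illustrating the mismatch (4K pages, so one 2M
vmemmap PMD covers a whole 128M section):

  1. hot-add subsection A (2M)  -> vmemmap_populate() maps the section's
     2M memmap with a single PMD block
  2. hot-add subsection B (2M) in the same section -> the existing PMD
     block already covers B's memmap
  3. hot-remove subsection A    -> unmap_hotplug_pmd_range() sees
     pmd_sect(), clears the PMD and frees the whole 2M block, so B's
     struct pages are unmapped as well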

Fixes: c1cc1552616d ("arm64: MMU initialisation")
Signed-off-by: Zhenhua Huang <quic_zhenhuah@quicinc.com>
---
 arch/arm64/mm/mmu.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index e2739b69e11b..fd59ee44960e 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -1177,7 +1177,9 @@ int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node,
 {
 	WARN_ON((start < VMEMMAP_START) || (end > VMEMMAP_END));
 
-	if (!IS_ENABLED(CONFIG_ARM64_4K_PAGES))
+	if (!IS_ENABLED(CONFIG_ARM64_4K_PAGES) ||
+	!IS_ALIGNED(page_to_pfn((struct page *)start), PAGES_PER_SECTION) ||
+	!IS_ALIGNED(page_to_pfn((struct page *)end), PAGES_PER_SECTION))
 		return vmemmap_populate_basepages(start, end, node, altmap);
 	else
 		return vmemmap_populate_hugepages(start, end, node, altmap);
-- 
2.25.1




* [PATCH v2 2/2] arm64: mm: implement vmemmap_check_pmd for arm64
  2024-12-09  9:42 [PATCH v2 0/2] Fix subsection vmemmap_populate logic Zhenhua Huang
  2024-12-09  9:42 ` [PATCH v2 1/2] arm64: mm: vmemmap populate to page level if not section aligned Zhenhua Huang
@ 2024-12-09  9:42 ` Zhenhua Huang
  2024-12-20 18:35   ` Catalin Marinas
  2024-12-17  1:47 ` [PATCH v2 0/2] Fix subsection vmemmap_populate logic Zhenhua Huang
  2 siblings, 1 reply; 25+ messages in thread
From: Zhenhua Huang @ 2024-12-09  9:42 UTC (permalink / raw)
  To: catalin.marinas, will, ardb, ryan.roberts, mark.rutland,
	joey.gouly, dave.hansen, akpm, chenfeiyang, chenhuacai
  Cc: linux-mm, linux-arm-kernel, linux-kernel, Zhenhua Huang

vmemmap_check_pmd() is used to determine whether a populated PMD is a
section (huge) mapping, or whether the range needs to be populated with
base pages instead. Implement it for arm64.
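
For context, the generic caller in vmemmap_populate_hugepages() looks
roughly like this (paraphrased from mm/sparse-vmemmap.c; details may
differ between kernel versions):

	pmd = pmd_offset(pud, addr);
	if (pmd_none(READ_ONCE(*pmd))) {
		/* try to install a PMD-sized vmemmap block */
		...
	} else if (vmemmap_check_pmd(pmd, node, addr, next))
		/* an existing huge mapping already covers this range */
		continue;
	/* otherwise fall back to base pages for [addr, next) */
	if (vmemmap_populate_basepages(addr, next, node, altmap))
		return -ENOMEM;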

Fixes: 2045a3b8911b ("mm/sparse-vmemmap: generalise vmemmap_populate_hugepages()")
Signed-off-by: Zhenhua Huang <quic_zhenhuah@quicinc.com>
---
 arch/arm64/mm/mmu.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index fd59ee44960e..41c7978a92be 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -1169,7 +1169,8 @@ int __meminit vmemmap_check_pmd(pmd_t *pmdp, int node,
 				unsigned long addr, unsigned long next)
 {
 	vmemmap_verify((pte_t *)pmdp, node, addr, next);
-	return 1;
+
+	return pmd_sect(*pmdp);
 }
 
 int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node,
-- 
2.25.1




* Re: [PATCH v2 0/2] Fix subsection vmemmap_populate logic
  2024-12-09  9:42 [PATCH v2 0/2] Fix subsection vmemmap_populate logic Zhenhua Huang
  2024-12-09  9:42 ` [PATCH v2 1/2] arm64: mm: vmemmap populate to page level if not section aligned Zhenhua Huang
  2024-12-09  9:42 ` [PATCH v2 2/2] arm64: mm: implement vmemmap_check_pmd for arm64 Zhenhua Huang
@ 2024-12-17  1:47 ` Zhenhua Huang
  2 siblings, 0 replies; 25+ messages in thread
From: Zhenhua Huang @ 2024-12-17  1:47 UTC (permalink / raw)
  To: catalin.marinas, will, ardb, ryan.roberts, mark.rutland,
	joey.gouly, dave.hansen, akpm, chenfeiyang, chenhuacai
  Cc: linux-mm, linux-arm-kernel, linux-kernel, Tingwei Zhang



On 2024/12/9 17:42, Zhenhua Huang wrote:
> To perform memory hotplug operations, the memmap (aka struct page) will be
> updated. For arm64 with 4K page size, the typical granularity is 128M,
> which corresponds to a 2M memmap buffer.
> Commit c1cc1552616d ("arm64: MMU initialisation")
> optimizes this 2M buffer to be mapped with one single PMD entry. However,
> commit ba72b4c8cf60 ("mm/sparsemem: support sub-section hotplug")
> which supports 2M subsection hotplug granularity, causes other issues
> (refer to the change log of patch #1). The logic is adjusted to populate
> with huge pages only if the hotplug address/size is section-aligned.
> 

Could any expert help review please?

> Changes since v1:
>    - Modified change log to make it more clear which was based on Catalin's
>      comments.
> 
> Zhenhua Huang (2):
>    arm64: mm: vmemmap populate to page level if not section aligned
>    arm64: mm: implement vmemmap_check_pmd for arm64
> 
>   arch/arm64/mm/mmu.c | 7 +++++--
>   1 file changed, 5 insertions(+), 2 deletions(-)
> 




* Re: [PATCH v2 1/2] arm64: mm: vmemmap populate to page level if not section aligned
  2024-12-09  9:42 ` [PATCH v2 1/2] arm64: mm: vmemmap populate to page level if not section aligned Zhenhua Huang
@ 2024-12-20 18:30   ` Catalin Marinas
  2024-12-24  9:32     ` Zhenhua Huang
  0 siblings, 1 reply; 25+ messages in thread
From: Catalin Marinas @ 2024-12-20 18:30 UTC (permalink / raw)
  To: Zhenhua Huang
  Cc: will, ardb, ryan.roberts, mark.rutland, joey.gouly, dave.hansen,
	akpm, chenfeiyang, chenhuacai, linux-mm, linux-arm-kernel,
	linux-kernel, Anshuman Khandual

On Mon, Dec 09, 2024 at 05:42:26PM +0800, Zhenhua Huang wrote:
> Commit c1cc1552616d ("arm64: MMU initialisation")
> optimizes the vmemmap to populate at the PMD section level. However, if
> start or end is not aligned to a section boundary, such as when a
> subsection is hot added, populating the entire section is wasteful. For
> instance, if only one subsection hot-added, the entire section's struct
> page metadata will still be populated.In such cases, it is more effective
> to populate at page granularity.

OK, so from the vmemmap perspective, we waste up to 2MB memory that has
been allocated even if a 2MB hot-plugged subsection required only 32KB
of struct page. I don't mind this much really. I hope all those
subsections are not scattered around to amplify this waste.

> This change also addresses mismatch issues during vmemmap_free(): When
> pmd_sect() is true, the entire PMD section is cleared, even if there is
> other effective subsection. For example, pagemap1 and pagemap2 are part
> of a single PMD entry and they are hot-added sequentially. Then pagemap1
> is removed, vmemmap_free() will clear the entire PMD entry, freeing the
> struct page metadata for the whole section, even though pagemap2 is still
> active.

I think that's the bigger issue. We can't unplug a subsection only.
Looking at unmap_hotplug_pmd_range(), it frees a 2MB vmemmap section but
that may hold struct page for the equivalent of 128MB of memory. So any
struct page accesses for the other subsections will fault.

> Fixes: c1cc1552616d ("arm64: MMU initialisation")

I wouldn't add a fix for the first commit adding arm64 support, we did
not even have memory hotplug at the time (added later in 5.7 by commit
bbd6ec605c0f ("arm64/mm: Enable memory hot remove")). IIUC, this hasn't
been a problem until commit ba72b4c8cf60 ("mm/sparsemem: support
sub-section hotplug"). That commit broke some arm64 assumptions.

> Signed-off-by: Zhenhua Huang <quic_zhenhuah@quicinc.com>
> ---
>  arch/arm64/mm/mmu.c | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
> index e2739b69e11b..fd59ee44960e 100644
> --- a/arch/arm64/mm/mmu.c
> +++ b/arch/arm64/mm/mmu.c
> @@ -1177,7 +1177,9 @@ int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node,
>  {
>  	WARN_ON((start < VMEMMAP_START) || (end > VMEMMAP_END));
>  
> -	if (!IS_ENABLED(CONFIG_ARM64_4K_PAGES))
> +	if (!IS_ENABLED(CONFIG_ARM64_4K_PAGES) ||
> +	!IS_ALIGNED(page_to_pfn((struct page *)start), PAGES_PER_SECTION) ||
> +	!IS_ALIGNED(page_to_pfn((struct page *)end), PAGES_PER_SECTION))
>  		return vmemmap_populate_basepages(start, end, node, altmap);
>  	else
>  		return vmemmap_populate_hugepages(start, end, node, altmap);

An alternative would be to fix unmap_hotplug_pmd_range() etc. to avoid
nuking the whole vmemmap pmd section if it's not empty. Not sure how
easy that is, whether we have the necessary information (I haven't
looked in detail).

A potential issue - can we hotplug 128MB of RAM and only unplug 2MB? If
that's possible, the problem isn't solved by this patch.

-- 
Catalin



* Re: [PATCH v2 2/2] arm64: mm: implement vmemmap_check_pmd for arm64
  2024-12-09  9:42 ` [PATCH v2 2/2] arm64: mm: implement vmemmap_check_pmd for arm64 Zhenhua Huang
@ 2024-12-20 18:35   ` Catalin Marinas
  2024-12-27  2:57     ` Anshuman Khandual
  0 siblings, 1 reply; 25+ messages in thread
From: Catalin Marinas @ 2024-12-20 18:35 UTC (permalink / raw)
  To: Zhenhua Huang
  Cc: will, ardb, ryan.roberts, mark.rutland, joey.gouly, dave.hansen,
	akpm, chenfeiyang, chenhuacai, linux-mm, linux-arm-kernel,
	linux-kernel

On Mon, Dec 09, 2024 at 05:42:27PM +0800, Zhenhua Huang wrote:
> vmemmap_check_pmd() is used to determine if needs to populate to base
> pages. Implement it for arm64 arch.
> 
> Fixes: 2045a3b8911b ("mm/sparse-vmemmap: generalise vmemmap_populate_hugepages()")
> Signed-off-by: Zhenhua Huang <quic_zhenhuah@quicinc.com>
> ---
>  arch/arm64/mm/mmu.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
> index fd59ee44960e..41c7978a92be 100644
> --- a/arch/arm64/mm/mmu.c
> +++ b/arch/arm64/mm/mmu.c
> @@ -1169,7 +1169,8 @@ int __meminit vmemmap_check_pmd(pmd_t *pmdp, int node,
>  				unsigned long addr, unsigned long next)
>  {
>  	vmemmap_verify((pte_t *)pmdp, node, addr, next);
> -	return 1;
> +
> +	return pmd_sect(*pmdp);
>  }
>  
>  int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node,

Don't we need this patch only if we implement the first one? Please fold
it into the other patch.

-- 
Catalin



* Re: [PATCH v2 1/2] arm64: mm: vmemmap populate to page level if not section aligned
  2024-12-20 18:30   ` Catalin Marinas
@ 2024-12-24  9:32     ` Zhenhua Huang
  2024-12-24 14:09       ` Catalin Marinas
  0 siblings, 1 reply; 25+ messages in thread
From: Zhenhua Huang @ 2024-12-24  9:32 UTC (permalink / raw)
  To: Catalin Marinas
  Cc: will, ardb, ryan.roberts, mark.rutland, joey.gouly, dave.hansen,
	akpm, chenfeiyang, chenhuacai, linux-mm, linux-arm-kernel,
	linux-kernel, Anshuman Khandual

Thanks Catalin for review!
Merry Christmas.

On 2024/12/21 2:30, Catalin Marinas wrote:
> On Mon, Dec 09, 2024 at 05:42:26PM +0800, Zhenhua Huang wrote:
>> Commit c1cc1552616d ("arm64: MMU initialisation")
>> optimizes the vmemmap to populate at the PMD section level. However, if
>> start or end is not aligned to a section boundary, such as when a
>> subsection is hot added, populating the entire section is wasteful. For
>> instance, if only one subsection hot-added, the entire section's struct
>> page metadata will still be populated.In such cases, it is more effective
>> to populate at page granularity.
> 
> OK, so from the vmemmap perspective, we waste up to 2MB memory that has
> been allocated even if a 2MB hot-plugged subsection required only 32KB
> of struct page. I don't mind this much really. I hope all those
> subsections are not scattered around to amplify this waste.
> 
>> This change also addresses mismatch issues during vmemmap_free(): When
>> pmd_sect() is true, the entire PMD section is cleared, even if there is
>> other effective subsection. For example, pagemap1 and pagemap2 are part
>> of a single PMD entry and they are hot-added sequentially. Then pagemap1
>> is removed, vmemmap_free() will clear the entire PMD entry, freeing the
>> struct page metadata for the whole section, even though pagemap2 is still
>> active.
> 
> I think that's the bigger issue. We can't unplug a subsection only.
> Looking at unmap_hotplug_pmd_range(), it frees a 2MB vmemmap section but
> that may hold struct page for the equivalent of 128MB of memory. So any
> struct page accesses for the other subsections will fault.

Exactly! That's what the patch aims to address.

> 
>> Fixes: c1cc1552616d ("arm64: MMU initialisation")
> 
> I wouldn't add a fix for the first commit adding arm64 support, we did
> not even have memory hotplug at the time (added later in 5.7 by commit
> bbd6ec605c0f ("arm64/mm: Enable memory hot remove")). IIUC, this hasn't
> been a problem until commit ba72b4c8cf60 ("mm/sparsemem: support
> sub-section hotplug"). That commit broke some arm64 assumptions.

Shall we reference ba72b4c8cf60 ("mm/sparsemem: support sub-section
hotplug") instead, since it broke the arm64 assumptions?

> 
>> Signed-off-by: Zhenhua Huang <quic_zhenhuah@quicinc.com>
>> ---
>>   arch/arm64/mm/mmu.c | 4 +++-
>>   1 file changed, 3 insertions(+), 1 deletion(-)
>>
>> diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
>> index e2739b69e11b..fd59ee44960e 100644
>> --- a/arch/arm64/mm/mmu.c
>> +++ b/arch/arm64/mm/mmu.c
>> @@ -1177,7 +1177,9 @@ int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node,
>>   {
>>   	WARN_ON((start < VMEMMAP_START) || (end > VMEMMAP_END));
>>   
>> -	if (!IS_ENABLED(CONFIG_ARM64_4K_PAGES))
>> +	if (!IS_ENABLED(CONFIG_ARM64_4K_PAGES) ||
>> +	!IS_ALIGNED(page_to_pfn((struct page *)start), PAGES_PER_SECTION) ||
>> +	!IS_ALIGNED(page_to_pfn((struct page *)end), PAGES_PER_SECTION))
>>   		return vmemmap_populate_basepages(start, end, node, altmap);
>>   	else
>>   		return vmemmap_populate_hugepages(start, end, node, altmap);
> 
> An alternative would be to fix unmap_hotplug_pmd_range() etc. to avoid
> nuking the whole vmemmap pmd section if it's not empty. Not sure how
> easy that is, whether we have the necessary information (I haven't
> looked in detail).
> 
> A potential issue - can we hotplug 128MB of RAM and only unplug 2MB? If
> that's possible, the problem isn't solved by this patch.

Indeed, it seems there is no guarantee that the unplug size equals the
plug size...

I have two ideas:
1. Completely disable this PMD mapping optimization, since there is no
guarantee that hotplugged memory is 128M aligned.

2. If we want to keep this optimization:
I propose adding an argument to vmemmap_free() to indicate whether the
entire section is freed (based on the subsection map). vmemmap_free() is
a common function and might affect other architectures... The process
would be:
vmemmap_free
	unmap_hotplug_range // In unmap_hotplug_pmd_range(), as you mentioned: if
the whole section is freed, proceed as usual. Otherwise, *just clear out
the struct page contents but do not free the mapping*.
	free_empty_tables // called only if the entire section is freed

On the populate side,
else if (vmemmap_check_pmd(pmd, node, addr, next)) // implement this function
	continue;	// Buffer still exists, just abort.

Could you please comment further on whether #2 is feasible?

> 




* Re: [PATCH v2 1/2] arm64: mm: vmemmap populate to page level if not section aligned
  2024-12-24  9:32     ` Zhenhua Huang
@ 2024-12-24 14:09       ` Catalin Marinas
  2024-12-25  9:59         ` Zhenhua Huang
  2024-12-27  7:49         ` Anshuman Khandual
  0 siblings, 2 replies; 25+ messages in thread
From: Catalin Marinas @ 2024-12-24 14:09 UTC (permalink / raw)
  To: Zhenhua Huang
  Cc: will, ardb, ryan.roberts, mark.rutland, joey.gouly, dave.hansen,
	akpm, chenfeiyang, chenhuacai, linux-mm, linux-arm-kernel,
	linux-kernel, Anshuman Khandual

On Tue, Dec 24, 2024 at 05:32:06PM +0800, Zhenhua Huang wrote:
> Thanks Catalin for review!
> Merry Christmas.

Merry Christmas to you too!

> On 2024/12/21 2:30, Catalin Marinas wrote:
> > On Mon, Dec 09, 2024 at 05:42:26PM +0800, Zhenhua Huang wrote:
> > > Fixes: c1cc1552616d ("arm64: MMU initialisation")
> > 
> > I wouldn't add a fix for the first commit adding arm64 support, we did
> > not even have memory hotplug at the time (added later in 5.7 by commit
> > bbd6ec605c0f ("arm64/mm: Enable memory hot remove")). IIUC, this hasn't
> > been a problem until commit ba72b4c8cf60 ("mm/sparsemem: support
> > sub-section hotplug"). That commit broke some arm64 assumptions.
> 
> Shall we add ba72b4c8cf60 ("mm/sparsemem: support sub-section hotplug")
> because it broke arm64 assumptions ?

Yes, I think that would be better. And a cc stable to 5.4 (the above
commit appeared in 5.3).

> > > diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
> > > index e2739b69e11b..fd59ee44960e 100644
> > > --- a/arch/arm64/mm/mmu.c
> > > +++ b/arch/arm64/mm/mmu.c
> > > @@ -1177,7 +1177,9 @@ int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node,
> > >   {
> > >   	WARN_ON((start < VMEMMAP_START) || (end > VMEMMAP_END));
> > > -	if (!IS_ENABLED(CONFIG_ARM64_4K_PAGES))
> > > +	if (!IS_ENABLED(CONFIG_ARM64_4K_PAGES) ||
> > > +	!IS_ALIGNED(page_to_pfn((struct page *)start), PAGES_PER_SECTION) ||
> > > +	!IS_ALIGNED(page_to_pfn((struct page *)end), PAGES_PER_SECTION))
> > >   		return vmemmap_populate_basepages(start, end, node, altmap);
> > >   	else
> > >   		return vmemmap_populate_hugepages(start, end, node, altmap);
> > 
> > An alternative would be to fix unmap_hotplug_pmd_range() etc. to avoid
> > nuking the whole vmemmap pmd section if it's not empty. Not sure how
> > easy that is, whether we have the necessary information (I haven't
> > looked in detail).
> > 
> > A potential issue - can we hotplug 128MB of RAM and only unplug 2MB? If
> > that's possible, the problem isn't solved by this patch.
> 
> Indeed, seems there is no guarantee that plug size must be equal to unplug
> size...
> 
> I have two ideas:
> 1. Completely disable this PMD mapping optimization since there is no
> guarantee we must align 128M memory for hotplug ..

I'd be in favour of this, at least if CONFIG_MEMORY_HOTPLUG is enabled.
I think the only advantage here is that we don't allocate a full 2MB
block for vmemmap when only plugging in a sub-section.

> 2. If we want to take this optimization.
> I propose adding one argument to vmemmap_free to indicate if the entire
> section is freed(based on subsection map). Vmemmap_free is a common function
> and might affect other architectures... The process would be:
> vmemmap_free
> 	unmap_hotplug_range //In unmap_hotplug_pmd_range() as you mentioned:if
> whole section is freed, proceed as usual. Otherwise, *just clear out struct
> page content but do not free*.
> 	free_empty_tables // will be called only if entire section is freed
> 
> On the populate side,
> else if (vmemmap_check_pmd(pmd, node, addr, next)) //implement this function
> 	continue;	//Buffer still exists, just abort..
> 
> Could you please comment further whether #2 is feasible ?

vmemmap_free() already gets start/end, so it could at least check the
alignment and avoid freeing if it's not unplugging a full section. It
does leave a 2MB vmemmap block in place when freeing the last subsection
but it's safer than freeing valid struct page entries. In addition, it
could query the memory hotplug state with something like
find_memory_block() and figure out whether the section is empty.
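
Roughly, an untested sketch of that alignment check, keeping the current
arm64 vmemmap_free() structure (the find_memory_block()-based variant is
left out):

void vmemmap_free(unsigned long start, unsigned long end,
		struct vmem_altmap *altmap)
{
	WARN_ON((start < VMEMMAP_START) || (end > VMEMMAP_END));

	/*
	 * Refuse to tear anything down unless the range spans whole
	 * sections, so live struct pages are never unmapped. This can
	 * leave a stale 2MB vmemmap block behind on subsection unplug.
	 */
	if (!IS_ALIGNED(page_to_pfn((struct page *)start), PAGES_PER_SECTION) ||
	    !IS_ALIGNED(page_to_pfn((struct page *)end), PAGES_PER_SECTION))
		return;

	unmap_hotplug_range(start, end, true, altmap);
	free_empty_tables(start, end, VMEMMAP_START, VMEMMAP_END);
}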

Anyway, I'll be off until the new year, maybe I get other ideas by then.

-- 
Catalin



* Re: [PATCH v2 1/2] arm64: mm: vmemmap populate to page level if not section aligned
  2024-12-24 14:09       ` Catalin Marinas
@ 2024-12-25  9:59         ` Zhenhua Huang
  2024-12-27  7:49         ` Anshuman Khandual
  1 sibling, 0 replies; 25+ messages in thread
From: Zhenhua Huang @ 2024-12-25  9:59 UTC (permalink / raw)
  To: Catalin Marinas
  Cc: will, ardb, ryan.roberts, mark.rutland, joey.gouly, dave.hansen,
	akpm, chenfeiyang, chenhuacai, linux-mm, linux-arm-kernel,
	linux-kernel, Anshuman Khandual, Tingwei Zhang



On 2024/12/24 22:09, Catalin Marinas wrote:
> On Tue, Dec 24, 2024 at 05:32:06PM +0800, Zhenhua Huang wrote:
>> Thanks Catalin for review!
>> Merry Christmas.
> 
> Merry Christmas to you too!
> 
>> On 2024/12/21 2:30, Catalin Marinas wrote:
>>> On Mon, Dec 09, 2024 at 05:42:26PM +0800, Zhenhua Huang wrote:
>>>> Fixes: c1cc1552616d ("arm64: MMU initialisation")
>>>
>>> I wouldn't add a fix for the first commit adding arm64 support, we did
>>> not even have memory hotplug at the time (added later in 5.7 by commit
>>> bbd6ec605c0f ("arm64/mm: Enable memory hot remove")). IIUC, this hasn't
>>> been a problem until commit ba72b4c8cf60 ("mm/sparsemem: support
>>> sub-section hotplug"). That commit broke some arm64 assumptions.
>>
>> Shall we add ba72b4c8cf60 ("mm/sparsemem: support sub-section hotplug")
>> because it broke arm64 assumptions ?
> 
> Yes, I think that would be better. And a cc stable to 5.4 (the above
> commit appeared in 5.3).

Got it, Thanks.

> 
>>>> diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
>>>> index e2739b69e11b..fd59ee44960e 100644
>>>> --- a/arch/arm64/mm/mmu.c
>>>> +++ b/arch/arm64/mm/mmu.c
>>>> @@ -1177,7 +1177,9 @@ int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node,
>>>>    {
>>>>    	WARN_ON((start < VMEMMAP_START) || (end > VMEMMAP_END));
>>>> -	if (!IS_ENABLED(CONFIG_ARM64_4K_PAGES))
>>>> +	if (!IS_ENABLED(CONFIG_ARM64_4K_PAGES) ||
>>>> +	!IS_ALIGNED(page_to_pfn((struct page *)start), PAGES_PER_SECTION) ||
>>>> +	!IS_ALIGNED(page_to_pfn((struct page *)end), PAGES_PER_SECTION))
>>>>    		return vmemmap_populate_basepages(start, end, node, altmap);
>>>>    	else
>>>>    		return vmemmap_populate_hugepages(start, end, node, altmap);
>>>
>>> An alternative would be to fix unmap_hotplug_pmd_range() etc. to avoid
>>> nuking the whole vmemmap pmd section if it's not empty. Not sure how
>>> easy that is, whether we have the necessary information (I haven't
>>> looked in detail).
>>>
>>> A potential issue - can we hotplug 128MB of RAM and only unplug 2MB? If
>>> that's possible, the problem isn't solved by this patch.
>>
>> Indeed, seems there is no guarantee that plug size must be equal to unplug
>> size...
>>
>> I have two ideas:
>> 1. Completely disable this PMD mapping optimization since there is no
>> guarantee we must align 128M memory for hotplug ..
> 
> I'd be in favour of this, at least if CONFIG_MEMORY_HOTPLUG is enabled.
> I think the only advantage here is that we don't allocate a full 2MB
> block for vmemmap when only plugging in a sub-section.

Yeah..
In my opinion, without subsection hotplug support it is definitely
beneficial. However, with subsection hotplug support it can lead to
memory overhead and requires special-case logic in the code, since we
always map a full 2MB block...

> 
>> 2. If we want to take this optimization.
>> I propose adding one argument to vmemmap_free to indicate if the entire
>> section is freed(based on subsection map). Vmemmap_free is a common function
>> and might affect other architectures... The process would be:
>> vmemmap_free
>> 	unmap_hotplug_range //In unmap_hotplug_pmd_range() as you mentioned:if
>> whole section is freed, proceed as usual. Otherwise, *just clear out struct
>> page content but do not free*.
>> 	free_empty_tables // will be called only if entire section is freed
>>
>> On the populate side,
>> else if (vmemmap_check_pmd(pmd, node, addr, next)) //implement this function
>> 	continue;	//Buffer still exists, just abort..
>>
>> Could you please comment further whether #2 is feasible ?
> 
> vmemmap_free() already gets start/end, so it could at least check the
> alignment and avoid freeing if it's not unplugging a full section. It
> does leave a 2MB vmemmap block in place when freeing the last subsection
> but it's safer than freeing valid struct page entries. In addition, it
> could query the memory hotplug state with something like
> find_memory_block() and figure out whether the section is empty.
> 

Do you mean that we need not clear the struct page entries of a
subsection until the entire section is fully unplugged? That seems
feasible.

BTW, you're right. I went through the code again; exporting
is_subsection_map_empty() for the query is another option:
	page_to_pfn() to get the pfn
	__pfn_to_section() to get the mem_section
	finally, is_subsection_map_empty() gives the subsection hotplug
	status for that section
With this approach, we would not need to change vmemmap_free() itself.
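
A minimal sketch of that query path, assuming is_subsection_map_empty()
were exported from mm/sparse.c (it is currently static there); the helper
name is made up:

/*
 * Return true if the section backing this vmemmap range has no
 * populated subsections left.
 */
static bool vmemmap_section_empty(unsigned long start)
{
	unsigned long pfn = page_to_pfn((struct page *)start);
	struct mem_section *ms = __pfn_to_section(pfn);

	return is_subsection_map_empty(ms);
}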

> Anyway, I'll be off until the new year, maybe I get other ideas by then.
> 

Sure, happy holidays! I will prepare both patches and wait for your
further comments.




* Re: [PATCH v2 2/2] arm64: mm: implement vmemmap_check_pmd for arm64
  2024-12-20 18:35   ` Catalin Marinas
@ 2024-12-27  2:57     ` Anshuman Khandual
  2024-12-30  7:48       ` Zhenhua Huang
  2025-01-02 18:12       ` Catalin Marinas
  0 siblings, 2 replies; 25+ messages in thread
From: Anshuman Khandual @ 2024-12-27  2:57 UTC (permalink / raw)
  To: Catalin Marinas, Zhenhua Huang
  Cc: will, ardb, ryan.roberts, mark.rutland, joey.gouly, dave.hansen,
	akpm, chenfeiyang, chenhuacai, linux-mm, linux-arm-kernel,
	linux-kernel



On 12/21/24 00:05, Catalin Marinas wrote:
> On Mon, Dec 09, 2024 at 05:42:27PM +0800, Zhenhua Huang wrote:
>> vmemmap_check_pmd() is used to determine if needs to populate to base
>> pages. Implement it for arm64 arch.
>>
>> Fixes: 2045a3b8911b ("mm/sparse-vmemmap: generalise vmemmap_populate_hugepages()")
>> Signed-off-by: Zhenhua Huang <quic_zhenhuah@quicinc.com>
>> ---
>>  arch/arm64/mm/mmu.c | 3 ++-
>>  1 file changed, 2 insertions(+), 1 deletion(-)
>>
>> diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
>> index fd59ee44960e..41c7978a92be 100644
>> --- a/arch/arm64/mm/mmu.c
>> +++ b/arch/arm64/mm/mmu.c
>> @@ -1169,7 +1169,8 @@ int __meminit vmemmap_check_pmd(pmd_t *pmdp, int node,
>>  				unsigned long addr, unsigned long next)
>>  {
>>  	vmemmap_verify((pte_t *)pmdp, node, addr, next);
>> -	return 1;
>> +
>> +	return pmd_sect(*pmdp);

Please change this as pmd_sect(READ_ONCE(*pmdp)) instead.

>>  }
>>  
>>  int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node,
> 
> Don't we need this patch only if we implement the first one? Please fold
> it into the other patch.

Seems like these patches might not be related.

While creating huge page based vmemmap mappings in
vmemmap_populate_hugepages(), vmemmap_check_pmd() validates whether a
populated (i.e. not pmd_none()) PMD already represents a huge mapping and
can therefore be skipped.

The current arm64 implementation of vmemmap_check_pmd() unconditionally
returns 1, asserting that the given populated PMD entry is indeed a huge
one, which will be the case unless something is wrong. vmemmap_verify()
only checks that the node the pfn was allocated from is local.

int __meminit vmemmap_check_pmd(pmd_t *pmdp, int node,
                                unsigned long addr, unsigned long next)
{
        vmemmap_verify((pte_t *)pmdp, node, addr, next);
        return 1;
}

However, it does not actually check that the entry is a section mapping,
which it should. Returning pmd_sect(READ_ONCE(*pmdp)) is the right thing
to do and should have been the case from the beginning, when
vmemmap_check_pmd() was added. I guess that is because arm64's original
vmemmap_populate() also only called vmemmap_verify(). So this probably
does not need a "Fixes:" tag.
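
Put together, the arm64 helper would then look like this (the patch above
plus the READ_ONCE() suggestion):

int __meminit vmemmap_check_pmd(pmd_t *pmdp, int node,
				unsigned long addr, unsigned long next)
{
	vmemmap_verify((pte_t *)pmdp, node, addr, next);

	/* Only report "already mapped" for a real section mapping. */
	return pmd_sect(READ_ONCE(*pmdp));
}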



* Re: [PATCH v2 1/2] arm64: mm: vmemmap populate to page level if not section aligned
  2024-12-24 14:09       ` Catalin Marinas
  2024-12-25  9:59         ` Zhenhua Huang
@ 2024-12-27  7:49         ` Anshuman Khandual
  2024-12-30  7:48           ` Zhenhua Huang
  2025-01-02 18:58           ` Catalin Marinas
  1 sibling, 2 replies; 25+ messages in thread
From: Anshuman Khandual @ 2024-12-27  7:49 UTC (permalink / raw)
  To: Catalin Marinas, Zhenhua Huang
  Cc: will, ardb, ryan.roberts, mark.rutland, joey.gouly, dave.hansen,
	akpm, chenfeiyang, chenhuacai, linux-mm, linux-arm-kernel,
	linux-kernel

On 12/24/24 19:39, Catalin Marinas wrote:
> On Tue, Dec 24, 2024 at 05:32:06PM +0800, Zhenhua Huang wrote:
>> Thanks Catalin for review!
>> Merry Christmas.
> 
> Merry Christmas to you too!
> 
>> On 2024/12/21 2:30, Catalin Marinas wrote:
>>> On Mon, Dec 09, 2024 at 05:42:26PM +0800, Zhenhua Huang wrote:
>>>> Fixes: c1cc1552616d ("arm64: MMU initialisation")
>>>
>>> I wouldn't add a fix for the first commit adding arm64 support, we did
>>> not even have memory hotplug at the time (added later in 5.7 by commit
>>> bbd6ec605c0f ("arm64/mm: Enable memory hot remove")). IIUC, this hasn't
>>> been a problem until commit ba72b4c8cf60 ("mm/sparsemem: support
>>> sub-section hotplug"). That commit broke some arm64 assumptions.
>>
>> Shall we add ba72b4c8cf60 ("mm/sparsemem: support sub-section hotplug")
>> because it broke arm64 assumptions ?
> 
> Yes, I think that would be better. And a cc stable to 5.4 (the above
> commit appeared in 5.3).

Agreed. This is a problem which needs fixing, but I am not sure the patch
proposed here fixes it.

> 
>>>> diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
>>>> index e2739b69e11b..fd59ee44960e 100644
>>>> --- a/arch/arm64/mm/mmu.c
>>>> +++ b/arch/arm64/mm/mmu.c
>>>> @@ -1177,7 +1177,9 @@ int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node,
>>>>   {
>>>>   	WARN_ON((start < VMEMMAP_START) || (end > VMEMMAP_END));
>>>> -	if (!IS_ENABLED(CONFIG_ARM64_4K_PAGES))
>>>> +	if (!IS_ENABLED(CONFIG_ARM64_4K_PAGES) ||
>>>> +	!IS_ALIGNED(page_to_pfn((struct page *)start), PAGES_PER_SECTION) ||
>>>> +	!IS_ALIGNED(page_to_pfn((struct page *)end), PAGES_PER_SECTION))
>>>>   		return vmemmap_populate_basepages(start, end, node, altmap);
>>>>   	else
>>>>   		return vmemmap_populate_hugepages(start, end, node, altmap);
>>>
>>> An alternative would be to fix unmap_hotplug_pmd_range() etc. to avoid
>>> nuking the whole vmemmap pmd section if it's not empty. Not sure how
>>> easy that is, whether we have the necessary information (I haven't
>>> looked in detail).
>>>
>>> A potential issue - can we hotplug 128MB of RAM and only unplug 2MB? If
>>> that's possible, the problem isn't solved by this patch.

I believe this is possible with sub-section hotplug and hot-remove support.

>>
>> Indeed, seems there is no guarantee that plug size must be equal to unplug
>> size...
>>
>> I have two ideas:
>> 1. Completely disable this PMD mapping optimization since there is no
>> guarantee we must align 128M memory for hotplug ..
> 
> I'd be in favour of this, at least if CONFIG_MEMORY_HOTPLUG is enabled.
> I think the only advantage here is that we don't allocate a full 2MB
> block for vmemmap when only plugging in a sub-section.

Agreed, that would be the right fix for the problem and it can be backported.
We will have to prevent PMD/PUD/CONT mappings for both the linear map and the
vmemmap for all non-boot memory sections, which can be hot-unplugged.

Something like the following ? [untested]

diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index 216519663961..56b9c6891f46 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -1171,9 +1171,15 @@ int __meminit vmemmap_check_pmd(pmd_t *pmdp, int node,
 int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node,
                struct vmem_altmap *altmap)
 {
+       unsigned long start_pfn;
+       struct mem_section *ms;
+
        WARN_ON((start < VMEMMAP_START) || (end > VMEMMAP_END));
 
-       if (!IS_ENABLED(CONFIG_ARM64_4K_PAGES))
+       start_pfn = page_to_pfn((struct page *)start);
+       ms = __pfn_to_section(start_pfn);
+
+       if (!IS_ENABLED(CONFIG_ARM64_4K_PAGES) || !early_section(ms))
                return vmemmap_populate_basepages(start, end, node, altmap);
        else
                return vmemmap_populate_hugepages(start, end, node, altmap);
@@ -1334,10 +1340,15 @@ struct range arch_get_mappable_range(void)
 int arch_add_memory(int nid, u64 start, u64 size,
                    struct mhp_params *params)
 {
+       unsigned long start_pfn = page_to_pfn((struct page *)start);
+       struct mem_section *ms = __pfn_to_section(start_pfn);
        int ret, flags = NO_EXEC_MAPPINGS;
 
        VM_BUG_ON(!mhp_range_allowed(start, size, true));
 
+       if (!early_section(ms))
+               flags |= NO_BLOCK_MAPPINGS | NO_CONT_MAPPINGS;
+
        if (can_set_direct_map())
                flags |= NO_BLOCK_MAPPINGS | NO_CONT_MAPPINGS;
 

> 
>> 2. If we want to take this optimization.
>> I propose adding one argument to vmemmap_free to indicate if the entire
>> section is freed(based on subsection map). Vmemmap_free is a common function
>> and might affect other architectures... The process would be:
>> vmemmap_free
>> 	unmap_hotplug_range //In unmap_hotplug_pmd_range() as you mentioned:if
>> whole section is freed, proceed as usual. Otherwise, *just clear out struct
>> page content but do not free*.
>> 	free_empty_tables // will be called only if entire section is freed
>>
>> On the populate side,
>> else if (vmemmap_check_pmd(pmd, node, addr, next)) //implement this function
>> 	continue;	//Buffer still exists, just abort..
>>
>> Could you please comment further whether #2 is feasible ?
> 
> vmemmap_free() already gets start/end, so it could at least check the
> alignment and avoid freeing if it's not unplugging a full section. It

unmap_hotplug_pmd_range()
{
	do {
		if (pmd_sect(pmd)) {
			pmd_clear(pmdp);
			flush_tlb_kernel_range(addr, addr + PAGE_SIZE);
                        if (free_mapped)
                                free_hotplug_page_range(pmd_page(pmd),
                                                        PMD_SIZE, altmap);
		}
	} while ()
}

Do you mean clearing the PMD entry but not freeing the page mapped for the
vmemmap? In that case, should the hot-unplug fail or not? If we free the pfns
(successful hot-unplug), then leaving behind the entire PMD entry covering the
remaining subsections is going to be problematic, as it still maps the removed
pfns as well!

> does leave a 2MB vmemmap block in place when freeing the last subsection
> but it's safer than freeing valid struct page entries. In addition, it
> could query the memory hotplug state with something like
> find_memory_block() and figure out whether the section is empty.

I guess there are two potential solutions, if unmap_hotplug_pmd_range() were to
handle sub-section removal.

1) Skip pmd_clear() when the entire section is not covered

a. pmd_clear() only if all but the current subsection have been removed earlier,
   via is_subsection_map_empty() or something similar (see the sketch after
   this list).

b. Skip pmd_clear() if the entire section covering that PMD is not being removed,
   but that might be problematic, as it would still map potentially unavailable
   pfns which have now been hot-unplugged.

2) Break PMD into base pages

a. pmd_clear() only if all but the current subsection have been removed earlier
   via is_subsection_map_empty() or something similar.

b. Break the entire PMD into base page mappings and remove the entries
   corresponding to the subsection being removed. The BBM sequence needs to be
   followed while making sure that no other part of the kernel is accessing the
   subsections that are mapped via the erstwhile PMD but are not currently being
   removed.
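
A rough, untested sketch of 1a for the pmd_sect() branch of the
unmap_hotplug_pmd_range() excerpt quoted above, vmemmap path only.
vmemmap_section_empty() is a made-up helper wrapping
is_subsection_map_empty(__pfn_to_section(page_to_pfn((struct page *)addr))),
and the sketch assumes section_deactivate() has already cleared the subsection
being removed from the subsection map:

		if (pmd_sect(pmd)) {
			/* 1a: only tear the 2M block down once the section
			 * has no populated subsections left. */
			if (free_mapped && !vmemmap_section_empty(addr))
				continue;	/* other subsections still live */

			pmd_clear(pmdp);
			flush_tlb_kernel_range(addr, addr + PAGE_SIZE);
			if (free_mapped)
				free_hotplug_page_range(pmd_page(pmd),
							PMD_SIZE, altmap);
			continue;
		}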

> 
> Anyway, I'll be off until the new year, maybe I get other ideas by then.
> 



* Re: [PATCH v2 1/2] arm64: mm: vmemmap populate to page level if not section aligned
  2024-12-27  7:49         ` Anshuman Khandual
@ 2024-12-30  7:48           ` Zhenhua Huang
  2024-12-31  5:52             ` Zhenhua Huang
  2025-01-02  3:51             ` Anshuman Khandual
  2025-01-02 18:58           ` Catalin Marinas
  1 sibling, 2 replies; 25+ messages in thread
From: Zhenhua Huang @ 2024-12-30  7:48 UTC (permalink / raw)
  To: Anshuman Khandual, Catalin Marinas
  Cc: will, ardb, ryan.roberts, mark.rutland, joey.gouly, dave.hansen,
	akpm, chenfeiyang, chenhuacai, linux-mm, linux-arm-kernel,
	linux-kernel, Tingwei Zhang

Hi Anshuman,

On 2024/12/27 15:49, Anshuman Khandual wrote:
> On 12/24/24 19:39, Catalin Marinas wrote:
>> On Tue, Dec 24, 2024 at 05:32:06PM +0800, Zhenhua Huang wrote:
>>> Thanks Catalin for review!
>>> Merry Christmas.
>>
>> Merry Christmas to you too!
>>
>>> On 2024/12/21 2:30, Catalin Marinas wrote:
>>>> On Mon, Dec 09, 2024 at 05:42:26PM +0800, Zhenhua Huang wrote:
>>>>> Fixes: c1cc1552616d ("arm64: MMU initialisation")
>>>>
>>>> I wouldn't add a fix for the first commit adding arm64 support, we did
>>>> not even have memory hotplug at the time (added later in 5.7 by commit
>>>> bbd6ec605c0f ("arm64/mm: Enable memory hot remove")). IIUC, this hasn't
>>>> been a problem until commit ba72b4c8cf60 ("mm/sparsemem: support
>>>> sub-section hotplug"). That commit broke some arm64 assumptions.
>>>
>>> Shall we add ba72b4c8cf60 ("mm/sparsemem: support sub-section hotplug")
>>> because it broke arm64 assumptions ?
>>
>> Yes, I think that would be better. And a cc stable to 5.4 (the above
>> commit appeared in 5.3).
> 
> Agreed. This is a problem which needs fixing but not sure if proposed patch
> here fixes that problem.
> 
>>
>>>>> diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
>>>>> index e2739b69e11b..fd59ee44960e 100644
>>>>> --- a/arch/arm64/mm/mmu.c
>>>>> +++ b/arch/arm64/mm/mmu.c
>>>>> @@ -1177,7 +1177,9 @@ int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node,
>>>>>    {
>>>>>    	WARN_ON((start < VMEMMAP_START) || (end > VMEMMAP_END));
>>>>> -	if (!IS_ENABLED(CONFIG_ARM64_4K_PAGES))
>>>>> +	if (!IS_ENABLED(CONFIG_ARM64_4K_PAGES) ||
>>>>> +	!IS_ALIGNED(page_to_pfn((struct page *)start), PAGES_PER_SECTION) ||
>>>>> +	!IS_ALIGNED(page_to_pfn((struct page *)end), PAGES_PER_SECTION))
>>>>>    		return vmemmap_populate_basepages(start, end, node, altmap);
>>>>>    	else
>>>>>    		return vmemmap_populate_hugepages(start, end, node, altmap);
>>>>
>>>> An alternative would be to fix unmap_hotplug_pmd_range() etc. to avoid
>>>> nuking the whole vmemmap pmd section if it's not empty. Not sure how
>>>> easy that is, whether we have the necessary information (I haven't
>>>> looked in detail).
>>>>
>>>> A potential issue - can we hotplug 128MB of RAM and only unplug 2MB? If
>>>> that's possible, the problem isn't solved by this patch.
> 
> Believe this is possible after sub-section hotplug and hotremove support.
> 
>>>
>>> Indeed, seems there is no guarantee that plug size must be equal to unplug
>>> size...
>>>
>>> I have two ideas:
>>> 1. Completely disable this PMD mapping optimization since there is no
>>> guarantee we must align 128M memory for hotplug ..
>>
>> I'd be in favour of this, at least if CONFIG_MEMORY_HOTPLUG is enabled.
>> I think the only advantage here is that we don't allocate a full 2MB
>> block for vmemmap when only plugging in a sub-section.
> 
> Agreed, that will be the right fix for the problem which can be back ported.
> We will have to prevent PMD/PUD/CONT mappings for both linear and as well as

Thanks Anshuman, yeah.. we must handle the linear mapping as well.

> vmemmap for all non-boot memory sections, that can be hot-unplugged.
> 
> Something like the following ? [untested]
> 
> diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
> index 216519663961..56b9c6891f46 100644
> --- a/arch/arm64/mm/mmu.c
> +++ b/arch/arm64/mm/mmu.c
> @@ -1171,9 +1171,15 @@ int __meminit vmemmap_check_pmd(pmd_t *pmdp, int node,
>   int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node,
>                  struct vmem_altmap *altmap)
>   {
> +       unsigned long start_pfn;
> +       struct mem_section *ms;
> +
>          WARN_ON((start < VMEMMAP_START) || (end > VMEMMAP_END));
>   
> -       if (!IS_ENABLED(CONFIG_ARM64_4K_PAGES))
> +       start_pfn = page_to_pfn((struct page *)start);
> +       ms = __pfn_to_section(start_pfn);
> +
> +       if (!IS_ENABLED(CONFIG_ARM64_4K_PAGES) || !early_section(ms))

LGTM. I will follow your and Catalin's suggestions and prepare further
patches, thanks!

>                  return vmemmap_populate_basepages(start, end, node, altmap);
>          else
>                  return vmemmap_populate_hugepages(start, end, node, altmap);
> @@ -1334,10 +1340,15 @@ struct range arch_get_mappable_range(void)
>   int arch_add_memory(int nid, u64 start, u64 size,
>                      struct mhp_params *params)
>   {
> +       unsigned long start_pfn = page_to_pfn((struct page *)start);
> +       struct mem_section *ms = __pfn_to_section(start_pfn);
>          int ret, flags = NO_EXEC_MAPPINGS;
>   
>          VM_BUG_ON(!mhp_range_allowed(start, size, true));
>   
> +       if (!early_section(ms))
> +               flags |= NO_BLOCK_MAPPINGS | NO_CONT_MAPPINGS;

However, another doubt comes to mind: given that the subsection size is 2M,
shouldn't we still be able to support PMD section mappings with
CONFIG_ARM64_4K_PAGES? That may be an optimization worth keeping.

Should we drop NO_BLOCK_MAPPINGS here and instead add a constraint that only
avoids pud_set_huge() with CONFIG_ARM64_4K_PAGES?

> +
>          if (can_set_direct_map())
>                  flags |= NO_BLOCK_MAPPINGS | NO_CONT_MAPPINGS;
>   
> 
>>
>>> 2. If we want to take this optimization.
>>> I propose adding one argument to vmemmap_free to indicate if the entire
>>> section is freed(based on subsection map). Vmemmap_free is a common function
>>> and might affect other architectures... The process would be:
>>> vmemmap_free
>>> 	unmap_hotplug_range //In unmap_hotplug_pmd_range() as you mentioned:if
>>> whole section is freed, proceed as usual. Otherwise, *just clear out struct
>>> page content but do not free*.
>>> 	free_empty_tables // will be called only if entire section is freed
>>>
>>> On the populate side,
>>> else if (vmemmap_check_pmd(pmd, node, addr, next)) //implement this function
>>> 	continue;	//Buffer still exists, just abort..
>>>
>>> Could you please comment further whether #2 is feasible ?
>>
>> vmemmap_free() already gets start/end, so it could at least check the
>> alignment and avoid freeing if it's not unplugging a full section. It
> 
> unmap_hotplug_pmd_range()
> {
> 	do {
> 		if (pmd_sect(pmd)) {
> 			pmd_clear(pmdp);
> 			flush_tlb_kernel_range(addr, addr + PAGE_SIZE);
>                          if (free_mapped)
>                                  free_hotplug_page_range(pmd_page(pmd),
>                                                          PMD_SIZE, altmap);
> 		}
> 	} while ()
> }
> 
> Do you mean clearing the PMD entry but not freeing the mapped page for vmemmap ?
> In that case should the hot-unplug fail or not ? If we free the pfns (successful
> hot-unplug), then leaving behind entire PMD entry for covering the remaining sub
> sections, is going to be problematic as it still maps the removed pfns as well !

Could you please help me understand in which scenarios this might cause an
issue? I assume we won't touch these struct pages further?

> 
>> does leave a 2MB vmemmap block in place when freeing the last subsection
>> but it's safer than freeing valid struct page entries. In addition, it
>> could query the memory hotplug state with something like
>> find_memory_block() and figure out whether the section is empty.
> 
> I guess there are two potential solutions, if unmap_hotplug_pmd_range() were to
> handle sub-section removal.
> 
> 1) Skip pmd_clear() when entire section is not covered
> 
> a. pmd_clear() only if all but the current subsection have been removed earlier
>     via is_subsection_map_empty() or something similar.
> 
> b. Skip pmd_clear() if the entire section covering that PMD is not being removed
>     but that might be problematic, as it still maps potentially unavailable pfns,
>     which are now hot-unplugged out.
> 
> 2) Break PMD into base pages
> 
> a. pmd_clear() only if all but the current subsection have been removed earlier
>     via is_subsection_map_empty() or something similar.
> 
> b. Break entire PMD into base page mappings and remove entries corresponding to
>     the subsection being removed. Although the BBM sequence needs to be followed
>     while making sure that no other part of the kernel is accessing subsections,
>     that are mapped via the erstwhile PMD but currently not being removed.
> 
>>
>> Anyway, I'll be off until the new year, maybe I get other ideas by then.
>>




* Re: [PATCH v2 2/2] arm64: mm: implement vmemmap_check_pmd for arm64
  2024-12-27  2:57     ` Anshuman Khandual
@ 2024-12-30  7:48       ` Zhenhua Huang
  2024-12-31  6:59         ` Anshuman Khandual
  2025-01-02 18:12       ` Catalin Marinas
  1 sibling, 1 reply; 25+ messages in thread
From: Zhenhua Huang @ 2024-12-30  7:48 UTC (permalink / raw)
  To: Anshuman Khandual, Catalin Marinas
  Cc: will, ardb, ryan.roberts, mark.rutland, joey.gouly, dave.hansen,
	akpm, chenfeiyang, chenhuacai, linux-mm, linux-arm-kernel,
	linux-kernel



On 2024/12/27 10:57, Anshuman Khandual wrote:
> However it does not really check the entry to be a section mapping which it should.
> Returning pmd_sect(READ_ONCE(*pmdp)) is the right thing, which should have been the
> case from the beginning when vmemmap_check_pmd() was added. I guess because arm64's
> original vmemmap_populate() checked only for vmemmap_verify() as well. So probably
> this does not need a "Fixes: " tag.

Hi Anshuman,

I agree, I will remove the "Fixes:" tag in the next patchset.



* Re: [PATCH v2 1/2] arm64: mm: vmemmap populate to page level if not section aligned
  2024-12-30  7:48           ` Zhenhua Huang
@ 2024-12-31  5:52             ` Zhenhua Huang
  2025-01-02  3:16               ` Anshuman Khandual
  2025-01-02  3:51             ` Anshuman Khandual
  1 sibling, 1 reply; 25+ messages in thread
From: Zhenhua Huang @ 2024-12-31  5:52 UTC (permalink / raw)
  To: Anshuman Khandual, Catalin Marinas
  Cc: will, ardb, ryan.roberts, mark.rutland, joey.gouly, dave.hansen,
	akpm, chenfeiyang, chenhuacai, linux-mm, linux-arm-kernel,
	linux-kernel, Tingwei Zhang



On 2024/12/30 15:48, Zhenhua Huang wrote:
> Hi Anshuman,
> 
> On 2024/12/27 15:49, Anshuman Khandual wrote:
>> On 12/24/24 19:39, Catalin Marinas wrote:
>>> On Tue, Dec 24, 2024 at 05:32:06PM +0800, Zhenhua Huang wrote:
>>>> Thanks Catalin for review!
>>>> Merry Christmas.
>>>
>>> Merry Christmas to you too!
>>>
>>>> On 2024/12/21 2:30, Catalin Marinas wrote:
>>>>> On Mon, Dec 09, 2024 at 05:42:26PM +0800, Zhenhua Huang wrote:
>>>>>> Fixes: c1cc1552616d ("arm64: MMU initialisation")
>>>>>
>>>>> I wouldn't add a fix for the first commit adding arm64 support, we did
>>>>> not even have memory hotplug at the time (added later in 5.7 by commit
>>>>> bbd6ec605c0f ("arm64/mm: Enable memory hot remove")). IIUC, this 
>>>>> hasn't
>>>>> been a problem until commit ba72b4c8cf60 ("mm/sparsemem: support
>>>>> sub-section hotplug"). That commit broke some arm64 assumptions.
>>>>
>>>> Shall we add ba72b4c8cf60 ("mm/sparsemem: support sub-section hotplug")
>>>> because it broke arm64 assumptions ?
>>>
>>> Yes, I think that would be better. And a cc stable to 5.4 (the above
>>> commit appeared in 5.3).
>>
>> Agreed. This is a problem which needs fixing but not sure if proposed 
>> patch
>> here fixes that problem.
>>
>>>
>>>>>> diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
>>>>>> index e2739b69e11b..fd59ee44960e 100644
>>>>>> --- a/arch/arm64/mm/mmu.c
>>>>>> +++ b/arch/arm64/mm/mmu.c
>>>>>> @@ -1177,7 +1177,9 @@ int __meminit vmemmap_populate(unsigned long 
>>>>>> start, unsigned long end, int node,
>>>>>>    {
>>>>>>        WARN_ON((start < VMEMMAP_START) || (end > VMEMMAP_END));
>>>>>> -    if (!IS_ENABLED(CONFIG_ARM64_4K_PAGES))
>>>>>> +    if (!IS_ENABLED(CONFIG_ARM64_4K_PAGES) ||
>>>>>> +    !IS_ALIGNED(page_to_pfn((struct page *)start), 
>>>>>> PAGES_PER_SECTION) ||
>>>>>> +    !IS_ALIGNED(page_to_pfn((struct page *)end), PAGES_PER_SECTION))
>>>>>>            return vmemmap_populate_basepages(start, end, node, 
>>>>>> altmap);
>>>>>>        else
>>>>>>            return vmemmap_populate_hugepages(start, end, node, 
>>>>>> altmap);
>>>>>
>>>>> An alternative would be to fix unmap_hotplug_pmd_range() etc. to avoid
>>>>> nuking the whole vmemmap pmd section if it's not empty. Not sure how
>>>>> easy that is, whether we have the necessary information (I haven't
>>>>> looked in detail).
>>>>>
>>>>> A potential issue - can we hotplug 128MB of RAM and only unplug 
>>>>> 2MB? If
>>>>> that's possible, the problem isn't solved by this patch.
>>
>> Believe this is possible after sub-section hotplug and hotremove support.
>>
>>>>
>>>> Indeed, seems there is no guarantee that plug size must be equal to 
>>>> unplug
>>>> size...
>>>>
>>>> I have two ideas:
>>>> 1. Completely disable this PMD mapping optimization since there is no
>>>> guarantee we must align 128M memory for hotplug ..
>>>
>>> I'd be in favour of this, at least if CONFIG_MEMORY_HOTPLUG is enabled.
>>> I think the only advantage here is that we don't allocate a full 2MB
>>> block for vmemmap when only plugging in a sub-section.
>>
>> Agreed, that will be the right fix for the problem which can be back 
>> ported.
>> We will have to prevent PMD/PUD/CONT mappings for both linear and as 
>> well as
> 
> Thanks Anshuman, yeah.. we must handle linear mapping as well.
> 
>> vmemmap for all non-boot memory sections, that can be hot-unplugged.
>>
>> Something like the following ? [untested]
>>
>> diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
>> index 216519663961..56b9c6891f46 100644
>> --- a/arch/arm64/mm/mmu.c
>> +++ b/arch/arm64/mm/mmu.c
>> @@ -1171,9 +1171,15 @@ int __meminit vmemmap_check_pmd(pmd_t *pmdp, 
>> int node,
>>   int __meminit vmemmap_populate(unsigned long start, unsigned long 
>> end, int node,
>>                  struct vmem_altmap *altmap)
>>   {
>> +       unsigned long start_pfn;
>> +       struct mem_section *ms;
>> +
>>          WARN_ON((start < VMEMMAP_START) || (end > VMEMMAP_END));
>> -       if (!IS_ENABLED(CONFIG_ARM64_4K_PAGES))
>> +       start_pfn = page_to_pfn((struct page *)start);
>> +       ms = __pfn_to_section(start_pfn);
>> +
>> +       if (!IS_ENABLED(CONFIG_ARM64_4K_PAGES) || !early_section(ms))
> 
> LGTM. I will follow your and Catalin's suggestion to prepare further 
> patches, Thanks!
> 
>>                  return vmemmap_populate_basepages(start, end, node, 
>> altmap);
>>          else
>>                  return vmemmap_populate_hugepages(start, end, node, 
>> altmap);
>> @@ -1334,10 +1340,15 @@ struct range arch_get_mappable_range(void)
>>   int arch_add_memory(int nid, u64 start, u64 size,
>>                      struct mhp_params *params)
>>   {
>> +       unsigned long start_pfn = page_to_pfn((struct page *)start);
>> +       struct mem_section *ms = __pfn_to_section(start_pfn);
>>          int ret, flags = NO_EXEC_MAPPINGS;
>>          VM_BUG_ON(!mhp_range_allowed(start, size, true));
>> +       if (!early_section(ms))
>> +               flags |= NO_BLOCK_MAPPINGS | NO_CONT_MAPPINGS;
> 
> However, here comes another doubt, given that the subsection size is 2M, 
> shouldn't we have ability to support PMD SECTION MAPPING if 
> CONFIG_ARM64_4K_PAGES? This might be the optimization we want to maintain?
> 
> Should we remove NO_BLOCK_MAPPINGS and add more constraints to avoid 
> pud_set_huge if CONFIG_ARM64_4K_PAGES ?
> 

BTW, shall we remove the !early_section() check, since arch_add_memory() is
only called in the hotplug case? Please correct me if I'm mistaken :)
The idea is something like this (not fully tested):

diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index e2739b69e11b..9afeb35673a3 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -45,6 +45,7 @@
  #define NO_BLOCK_MAPPINGS      BIT(0)
  #define NO_CONT_MAPPINGS       BIT(1)
  #define NO_EXEC_MAPPINGS       BIT(2)  /* assumes FEAT_HPDS is not used */
+#define NO_PUD_BLOCK_MAPPINGS  BIT(3)  /* Hotplug case: do not want block mapping for PUD */

  u64 kimage_voffset __ro_after_init;
  EXPORT_SYMBOL(kimage_voffset);
@@ -356,10 +357,12 @@ static void alloc_init_pud(p4d_t *p4dp, unsigned long addr, unsigned long end,

                 /*
                  * For 4K granule only, attempt to put down a 1GB block
+                * Hotplug case: do not attempt 1GB block
                  */
                 if (pud_sect_supported() &&
                    ((addr | next | phys) & ~PUD_MASK) == 0 &&
-                   (flags & NO_BLOCK_MAPPINGS) == 0) {
+                   (flags & NO_BLOCK_MAPPINGS) == 0 &&
+                   (flags & NO_PUD_BLOCK_MAPPINGS) == 0) {
                         pud_set_huge(pudp, phys, prot);

                         /*
@@ -1175,9 +1178,16 @@ int __meminit vmemmap_check_pmd(pmd_t *pmdp, int node,
  int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node,
                 struct vmem_altmap *altmap)
  {
+       unsigned long start_pfn;
+       struct mem_section *ms;
+
         WARN_ON((start < VMEMMAP_START) || (end > VMEMMAP_END));

-       if (!IS_ENABLED(CONFIG_ARM64_4K_PAGES))
+       start_pfn = page_to_pfn((struct page *)start);
+       ms = __pfn_to_section(start_pfn);
+
+       /* hotplugged sections do not support huge page mappings */
+       if (!IS_ENABLED(CONFIG_ARM64_4K_PAGES) || !early_section(ms))
                return vmemmap_populate_basepages(start, end, node, altmap);
         else
                return vmemmap_populate_hugepages(start, end, node, altmap);
@@ -1342,6 +1352,16 @@ int arch_add_memory(int nid, u64 start, u64 size,

         VM_BUG_ON(!mhp_range_allowed(start, size, true));

+       /*
+        * The subsection granule is 2M, so with 4K pages we can still
+        * allow PMD block mappings for hotplugged memory.
+        * Forbid section mappings in all other cases.
+        */
+       if (IS_ENABLED(CONFIG_ARM64_4K_PAGES))
+               flags |= NO_PUD_BLOCK_MAPPINGS | NO_CONT_MAPPINGS;
+       else
+               flags |= NO_BLOCK_MAPPINGS | NO_CONT_MAPPINGS;
+
         if (can_set_direct_map())
                 flags |= NO_BLOCK_MAPPINGS | NO_CONT_MAPPINGS;



>> +
>>          if (can_set_direct_map())
>>                  flags |= NO_BLOCK_MAPPINGS | NO_CONT_MAPPINGS;
>>
>>>
>>>> 2. If we want to take this optimization.
>>>> I propose adding one argument to vmemmap_free to indicate if the entire
>>>> section is freed(based on subsection map). Vmemmap_free is a common function
>>>> and might affect other architectures... The process would be:
>>>> vmemmap_free
>>>>     unmap_hotplug_range //In unmap_hotplug_pmd_range() as you mentioned:if
>>>> whole section is freed, proceed as usual. Otherwise, *just clear out struct
>>>> page content but do not free*.
>>>>     free_empty_tables // will be called only if entire section is freed
>>>>
>>>> On the populate side,
>>>> else if (vmemmap_check_pmd(pmd, node, addr, next)) //implement this function
>>>>     continue;    //Buffer still exists, just abort..
>>>>
>>>> Could you please comment further whether #2 is feasible ?
>>>
>>> vmemmap_free() already gets start/end, so it could at least check the
>>> alignment and avoid freeing if it's not unplugging a full section. It
>>
>> unmap_hotplug_pmd_range()
>> {
>>     do {
>>         if (pmd_sect(pmd)) {
>>             pmd_clear(pmdp);
>>             flush_tlb_kernel_range(addr, addr + PAGE_SIZE);
>>                          if (free_mapped)
>>                                  free_hotplug_page_range(pmd_page(pmd),
>>                                                          PMD_SIZE, altmap);
>>         }
>>     } while ()
>> }
>>
>> Do you mean clearing the PMD entry but not freeing the mapped page for vmemmap ?
>> In that case should the hot-unplug fail or not ? If we free the pfns (successful
>> hot-unplug), then leaving behind entire PMD entry for covering the remaining sub
>> sections, is going to be problematic as it still maps the removed pfns as well !
> 
> Could you please help me to understand in which scenarios this might 
> cause issue? I assume we won't touch these struct page further?
> 
>>
>>> does leave a 2MB vmemmap block in place when freeing the last subsection
>>> but it's safer than freeing valid struct page entries. In addition, it
>>> could query the memory hotplug state with something like
>>> find_memory_block() and figure out whether the section is empty.
>>
>> I guess there are two potential solutions, if unmap_hotplug_pmd_range() were to
>> handle sub-section removal.
>>
>> 1) Skip pmd_clear() when entire section is not covered
>>
>> a. pmd_clear() only if all but the current subsection have been removed earlier
>>     via is_subsection_map_empty() or something similar.
>>
>> b. Skip pmd_clear() if the entire section covering that PMD is not being removed
>>     but that might be problematic, as it still maps potentially unavailable pfns,
>>     which are now hot-unplugged out.
>>
>> 2) Break PMD into base pages
>>
>> a. pmd_clear() only if all but the current subsection have been removed earlier
>>     via is_subsection_map_empty() or something similar.
>>
>> b. Break entire PMD into base page mappings and remove entries corresponding to
>>     the subsection being removed. Although the BBM sequence needs to be followed
>>     while making sure that no other part of the kernel is accessing subsections,
>>     that are mapped via the erstwhile PMD but currently not being removed.
>>
>>>
>>> Anyway, I'll be off until the new year, maybe I get other ideas by then.
>>>
> 
> 



^ permalink raw reply related	[flat|nested] 25+ messages in thread

* Re: [PATCH v2 2/2] arm64: mm: implement vmemmap_check_pmd for arm64
  2024-12-30  7:48       ` Zhenhua Huang
@ 2024-12-31  6:59         ` Anshuman Khandual
  2024-12-31  7:18           ` Zhenhua Huang
  0 siblings, 1 reply; 25+ messages in thread
From: Anshuman Khandual @ 2024-12-31  6:59 UTC (permalink / raw)
  To: Zhenhua Huang, Catalin Marinas
  Cc: will, ardb, ryan.roberts, mark.rutland, joey.gouly, dave.hansen,
	akpm, chenfeiyang, chenhuacai, linux-mm, linux-arm-kernel,
	linux-kernel

A very small nit regarding the subject line. The callback
vmemmap_check_pmd() is already present on the arm64 platform,
but it is rather incomplete. Something like this might be
better:

arm64/mm: Test for pmd_sect() in vmemmap_check_pmd()

On 12/30/24 13:18, Zhenhua Huang wrote:
> 
> 
> On 2024/12/27 10:57, Anshuman Khandual wrote:
>> However it does not really check the entry to be a section mapping which it should.
>> Returning pmd_sect(READ_ONCE(*pmdp)) is the right thing, which should have been the
>> case from the beginning when vmemmap_check_pmd() was added. I guess because arm64's
>> original vmemmap_populate() checked only for vmemmap_verify() as well. So probably
>> this does not need a "Fixes: " tag.
> 
> Hi Anshuman,
> 
> I agree, will remove "Fixes: " tag in next patchset

Could you please send a V3 of this patch separately instead,
not as part of this series, as they are not really related.

But only after implementing the following changes (a rough
sketch follows the list):

1) Use READ_ONCE() as indicated earlier
2) Drop the "Fixes: " tag
3) Update the commit message explaining why pmd_sect() is
   required here and how the original commit missed that
4) Update the subject line
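
Something like this (untested) would capture (1) and the pmd_sect()
check, just as a reference:

int __meminit vmemmap_check_pmd(pmd_t *pmdp, int node,
				unsigned long addr, unsigned long next)
{
	pmd_t pmd = READ_ONCE(*pmdp);

	vmemmap_verify((pte_t *)pmdp, node, addr, next);

	/* Report the entry as populated only if it is a section mapping */
	return pmd_sect(pmd);
}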


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v2 2/2] arm64: mm: implement vmemmap_check_pmd for arm64
  2024-12-31  6:59         ` Anshuman Khandual
@ 2024-12-31  7:18           ` Zhenhua Huang
  0 siblings, 0 replies; 25+ messages in thread
From: Zhenhua Huang @ 2024-12-31  7:18 UTC (permalink / raw)
  To: Anshuman Khandual, Catalin Marinas
  Cc: will, ardb, ryan.roberts, mark.rutland, joey.gouly, dave.hansen,
	akpm, chenfeiyang, chenhuacai, linux-mm, linux-arm-kernel,
	linux-kernel



On 2024/12/31 14:59, Anshuman Khandual wrote:
> A very small nit regarding the subject line. The callback
> vmemmap_check_pmd() is already present on arm64 platform
> which is rather incomplete. Something like this might be
> better.
> 
> arm64/mm: Test for pmd_sect() in vmemmap_check_pmd()
> 
> On 12/30/24 13:18, Zhenhua Huang wrote:
>>
>>
>> On 2024/12/27 10:57, Anshuman Khandual wrote:
>>> However it does not really check the entry to be a section mapping which it should.
>>> Returning pmd_sect(READ_ONCE(*pmdp)) is the right thing, which should have been the
>>> case from the beginning when vmemmap_check_pmd() was added. I guess because arm64's
>>> original vmemmap_populate() checked only for vmemmap_verify() as well. So probably
>>> this does not need a "Fixes: " tag.
>>
>> Hi Anshuman,
>>
>> I agree, will remove "Fixes: " tag in next patchset
> 
> Could you please send a V3 of this patch separately instead
> and not part of this series as they are not really related.

Sure, thanks Anshuman. I will gather the information and update after
coming back from holiday!

> 
> But after implementing the following changes
> 
> 1) Use READ_ONCE() as indicated earlier
> 2) Drop the "Fixes: " tag
> 3) Update the commit message explaining why pmd_sect() is
>     required here and how the originally commit missed that
> 4) Update the subject line



^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v2 1/2] arm64: mm: vmemmap populate to page level if not section aligned
  2024-12-31  5:52             ` Zhenhua Huang
@ 2025-01-02  3:16               ` Anshuman Khandual
  2025-01-02  9:07                 ` Zhenhua Huang
  0 siblings, 1 reply; 25+ messages in thread
From: Anshuman Khandual @ 2025-01-02  3:16 UTC (permalink / raw)
  To: Zhenhua Huang, Catalin Marinas
  Cc: will, ardb, ryan.roberts, mark.rutland, joey.gouly, dave.hansen,
	akpm, chenfeiyang, chenhuacai, linux-mm, linux-arm-kernel,
	linux-kernel, Tingwei Zhang



On 12/31/24 11:22, Zhenhua Huang wrote:
> 
> 
> On 2024/12/30 15:48, Zhenhua Huang wrote:
>> Hi Anshuman,
>>
>> On 2024/12/27 15:49, Anshuman Khandual wrote:
>>> On 12/24/24 19:39, Catalin Marinas wrote:
>>>> On Tue, Dec 24, 2024 at 05:32:06PM +0800, Zhenhua Huang wrote:
>>>>> Thanks Catalin for review!
>>>>> Merry Christmas.
>>>>
>>>> Merry Christmas to you too!
>>>>
>>>>> On 2024/12/21 2:30, Catalin Marinas wrote:
>>>>>> On Mon, Dec 09, 2024 at 05:42:26PM +0800, Zhenhua Huang wrote:
>>>>>>> Fixes: c1cc1552616d ("arm64: MMU initialisation")
>>>>>>
>>>>>> I wouldn't add a fix for the first commit adding arm64 support, we did
>>>>>> not even have memory hotplug at the time (added later in 5.7 by commit
>>>>>> bbd6ec605c0f ("arm64/mm: Enable memory hot remove")). IIUC, this hasn't
>>>>>> been a problem until commit ba72b4c8cf60 ("mm/sparsemem: support
>>>>>> sub-section hotplug"). That commit broke some arm64 assumptions.
>>>>>
>>>>> Shall we add ba72b4c8cf60 ("mm/sparsemem: support sub-section hotplug")
>>>>> because it broke arm64 assumptions ?
>>>>
>>>> Yes, I think that would be better. And a cc stable to 5.4 (the above
>>>> commit appeared in 5.3).
>>>
>>> Agreed. This is a problem which needs fixing but not sure if proposed patch
>>> here fixes that problem.
>>>
>>>>
>>>>>>> diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
>>>>>>> index e2739b69e11b..fd59ee44960e 100644
>>>>>>> --- a/arch/arm64/mm/mmu.c
>>>>>>> +++ b/arch/arm64/mm/mmu.c
>>>>>>> @@ -1177,7 +1177,9 @@ int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node,
>>>>>>>    {
>>>>>>>        WARN_ON((start < VMEMMAP_START) || (end > VMEMMAP_END));
>>>>>>> -    if (!IS_ENABLED(CONFIG_ARM64_4K_PAGES))
>>>>>>> +    if (!IS_ENABLED(CONFIG_ARM64_4K_PAGES) ||
>>>>>>> +    !IS_ALIGNED(page_to_pfn((struct page *)start), PAGES_PER_SECTION) ||
>>>>>>> +    !IS_ALIGNED(page_to_pfn((struct page *)end), PAGES_PER_SECTION))
>>>>>>>            return vmemmap_populate_basepages(start, end, node, altmap);
>>>>>>>        else
>>>>>>>            return vmemmap_populate_hugepages(start, end, node, altmap);
>>>>>>
>>>>>> An alternative would be to fix unmap_hotplug_pmd_range() etc. to avoid
>>>>>> nuking the whole vmemmap pmd section if it's not empty. Not sure how
>>>>>> easy that is, whether we have the necessary information (I haven't
>>>>>> looked in detail).
>>>>>>
>>>>>> A potential issue - can we hotplug 128MB of RAM and only unplug 2MB? If
>>>>>> that's possible, the problem isn't solved by this patch.
>>>
>>> Believe this is possible after sub-section hotplug and hotremove support.
>>>
>>>>>
>>>>> Indeed, seems there is no guarantee that plug size must be equal to unplug
>>>>> size...
>>>>>
>>>>> I have two ideas:
>>>>> 1. Completely disable this PMD mapping optimization since there is no
>>>>> guarantee we must align 128M memory for hotplug ..
>>>>
>>>> I'd be in favour of this, at least if CONFIG_MEMORY_HOTPLUG is enabled.
>>>> I think the only advantage here is that we don't allocate a full 2MB
>>>> block for vmemmap when only plugging in a sub-section.
>>>
>>> Agreed, that will be the right fix for the problem which can be back ported.
>>> We will have to prevent PMD/PUD/CONT mappings for both linear and as well as
>>
>> Thanks Anshuman, yeah.. we must handle linear mapping as well.
>>
>>> vmemmap for all non-boot memory sections, that can be hot-unplugged.
>>>
>>> Something like the following ? [untested]
>>>
>>> diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
>>> index 216519663961..56b9c6891f46 100644
>>> --- a/arch/arm64/mm/mmu.c
>>> +++ b/arch/arm64/mm/mmu.c
>>> @@ -1171,9 +1171,15 @@ int __meminit vmemmap_check_pmd(pmd_t *pmdp, int node,
>>>   int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node,
>>>                  struct vmem_altmap *altmap)
>>>   {
>>> +       unsigned long start_pfn;
>>> +       struct mem_section *ms;
>>> +
>>>          WARN_ON((start < VMEMMAP_START) || (end > VMEMMAP_END));
>>> -       if (!IS_ENABLED(CONFIG_ARM64_4K_PAGES))
>>> +       start_pfn = page_to_pfn((struct page *)start);
>>> +       ms = __pfn_to_section(start_pfn);
>>> +
>>> +       if (!IS_ENABLED(CONFIG_ARM64_4K_PAGES) || !early_section(ms))
>>
>> LGTM. I will follow your and Catalin's suggestion to prepare further patches, Thanks!
>>
>>>                  return vmemmap_populate_basepages(start, end, node, altmap);
>>>          else
>>>                  return vmemmap_populate_hugepages(start, end, node, altmap);
>>> @@ -1334,10 +1340,15 @@ struct range arch_get_mappable_range(void)
>>>   int arch_add_memory(int nid, u64 start, u64 size,
>>>                      struct mhp_params *params)
>>>   {
>>> +       unsigned long start_pfn = page_to_pfn((struct page *)start);
>>> +       struct mem_section *ms = __pfn_to_section(start_pfn);
>>>          int ret, flags = NO_EXEC_MAPPINGS;
>>>          VM_BUG_ON(!mhp_range_allowed(start, size, true));
>>> +       if (!early_section(ms))
>>> +               flags |= NO_BLOCK_MAPPINGS | NO_CONT_MAPPINGS;
>>
>> However, here comes another doubt, given that the subsection size is 2M, shouldn't we have ability to support PMD SECTION MAPPING if CONFIG_ARM64_4K_PAGES? This might be the optimization we want to maintain?
>>
>> Should we remove NO_BLOCK_MAPPINGS and add more constraints to avoid pud_set_huge if CONFIG_ARM64_4K_PAGES ?
>>
> 
> BTW, shall we remove the check for !early_section since arch_add_memory is only called during hotplugging case? Correct me please if I'm mistaken :)

While this is true, it still might be a good idea to keep the early_section()
check in place just to be extra careful here. Otherwise a WARN_ON() might
be needed.
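
If the check were dropped, an untested sketch of the WARN_ON() alternative
could look like this (reusing the start_pfn/ms locals from the
arch_add_memory() diff earlier in the thread; the early return and error
code are only illustrative):

	/*
	 * arch_add_memory() should only ever see non-early, hotplugged
	 * sections; warn (rather than silently mis-map) if not.
	 */
	if (WARN_ON(early_section(ms)))
		return -EINVAL;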

> The idea is like(not fully tested):
> 
> diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
> index e2739b69e11b..9afeb35673a3 100644
> --- a/arch/arm64/mm/mmu.c
> +++ b/arch/arm64/mm/mmu.c
> @@ -45,6 +45,7 @@
>  #define NO_BLOCK_MAPPINGS      BIT(0)
>  #define NO_CONT_MAPPINGS       BIT(1)
>  #define NO_EXEC_MAPPINGS       BIT(2)  /* assumes FEAT_HPDS is not used */
> +#define NO_PUD_BLOCK_MAPPINGS  BIT(3)  /* Hotplug case: do not want block mapping for PUD */

Since block mappings get created at either the PMD or the PUD level, the
existing flag NO_BLOCK_MAPPINGS should be split into two, i.e.
NO_PMD_BLOCK_MAPPINGS and NO_PUD_BLOCK_MAPPINGS (possibly expanding into
P4D later). All block mappings can still be prevented using the existing
flag, which can then be derived from the new ones:

#define NO_BLOCK_MAPPINGS (NO_PMD_BLOCK_MAPPINGS | NO_PUD_BLOCK_MAPPINGS)
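
Spelled out against the existing flags (untested, bit positions purely
illustrative):

#define NO_PMD_BLOCK_MAPPINGS	BIT(0)
#define NO_CONT_MAPPINGS	BIT(1)
#define NO_EXEC_MAPPINGS	BIT(2)	/* assumes FEAT_HPDS is not used */
#define NO_PUD_BLOCK_MAPPINGS	BIT(3)
#define NO_BLOCK_MAPPINGS	(NO_PMD_BLOCK_MAPPINGS | NO_PUD_BLOCK_MAPPINGS)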

> 
>  u64 kimage_voffset __ro_after_init;
>  EXPORT_SYMBOL(kimage_voffset);
> @@ -356,10 +357,12 @@ static void alloc_init_pud(p4d_t *p4dp, unsigned long addr, unsigned long end,
> 
>                 /*
>                  * For 4K granule only, attempt to put down a 1GB block
> +                * Hotplug case: do not attempt 1GB block
>                  */
>                 if (pud_sect_supported() &&
>                    ((addr | next | phys) & ~PUD_MASK) == 0 &&
> -                   (flags & NO_BLOCK_MAPPINGS) == 0) {
> +                   (flags & NO_BLOCK_MAPPINGS) == 0 &&
> +                   (flags & NO_PUD_BLOCK_MAPPINGS) == 0) {

After the flags are split as suggested above, only the PUD block mapping
flag needs to be checked here, and similarly the PMD block mapping flag
needs to be checked in alloc_init_pmd().
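
Untested, just to illustrate the two gating conditions after the split:

	/* alloc_init_pud(): 1GB blocks now gated by the PUD flag only */
	if (pud_sect_supported() &&
	    ((addr | next | phys) & ~PUD_MASK) == 0 &&
	    (flags & NO_PUD_BLOCK_MAPPINGS) == 0) {
		pud_set_huge(pudp, phys, prot);
		/* ... rest of the block mapping path unchanged ... */
	}

	/* init_pmd(): 2MB blocks now gated by the PMD flag only */
	if (((addr | next | phys) & ~PMD_MASK) == 0 &&
	    (flags & NO_PMD_BLOCK_MAPPINGS) == 0) {
		pmd_set_huge(pmdp, phys, prot);
		/* ... rest of the block mapping path unchanged ... */
	}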

>                         pud_set_huge(pudp, phys, prot);
> 
>                         /*
> @@ -1175,9 +1178,16 @@ int __meminit vmemmap_check_pmd(pmd_t *pmdp, int node,
>  int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node,
>                 struct vmem_altmap *altmap)
>  {
> +       unsigned long start_pfn;
> +       struct mem_section *ms;
> +
>         WARN_ON((start < VMEMMAP_START) || (end > VMEMMAP_END));
> 
> -       if (!IS_ENABLED(CONFIG_ARM64_4K_PAGES))
> +       start_pfn = page_to_pfn((struct page *)start);
> +       ms = __pfn_to_section(start_pfn);
> +
> +       /* hotplugged section not support hugepages */
> +       if (!IS_ENABLED(CONFIG_ARM64_4K_PAGES) || !early_section(ms))
>                 return vmemmap_populate_basepages(start, end, node, altmap);
>         else
>                 return vmemmap_populate_hugepages(start, end, node, altmap);
> @@ -1342,6 +1352,16 @@ int arch_add_memory(int nid, u64 start, u64 size,
> 
>         VM_BUG_ON(!mhp_range_allowed(start, size, true));
> 
> +       if (IS_ENABLED(CONFIG_ARM64_4K_PAGES))
> +       /*
> +        * As per subsection granule is 2M, allow PMD block mapping in
> +        * case 4K PAGES.
> +        * Other cases forbid section mapping.
> +        */
> +               flags |= NO_PUD_BLOCK_MAPPINGS | NO_CONT_MAPPINGS;
> +       else
> +               flags |= NO_BLOCK_MAPPINGS | NO_CONT_MAPPINGS;
> +
>         if (can_set_direct_map())
>                 flags |= NO_BLOCK_MAPPINGS | NO_CONT_MAPPINGS;
> 

Basically vmemmap will not allow PMD or PUD block mappings for non-boot
memory, as a 2MB sized sub-section hot removal involves tearing down a
sub-PMD sized, i.e. (512 * sizeof(struct page)), VA range, which is
currently not supported in unmap_hotplug_range().

The linear mapping, though, could still support PMD block mappings, as
hot removing a 2MB sized sub-section tears down an entire PMD entry there.
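
For reference, with 4K pages and assuming the usual 64-byte struct page:

  one sub-section           = 2MB / 4KB                 = 512 base pages
  vmemmap per sub-section   = 512 * sizeof(struct page) = 512 * 64 = 32KB

i.e. a sub-section removal only covers 32KB of a 2MB vmemmap PMD block,
whereas in the linear map the same 2MB of RAM is exactly one PMD block.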

Fair enough, but it seems like this should be done after the fix patch
which prevents all block mappings for early section memory, as that
will be an easy backport. But I will leave this up to Catalin/Will
to decide.

> 
> 
>>> +
>>>          if (can_set_direct_map())
>>>                  flags |= NO_BLOCK_MAPPINGS | NO_CONT_MAPPINGS;
>>>
>>>>
>>>>> 2. If we want to take this optimization.
>>>>> I propose adding one argument to vmemmap_free to indicate if the entire
>>>>> section is freed(based on subsection map). Vmemmap_free is a common function
>>>>> and might affect other architectures... The process would be:
>>>>> vmemmap_free
>>>>>     unmap_hotplug_range //In unmap_hotplug_pmd_range() as you mentioned:if
>>>>> whole section is freed, proceed as usual. Otherwise, *just clear out struct
>>>>> page content but do not free*.
>>>>>     free_empty_tables // will be called only if entire section is freed
>>>>>
>>>>> On the populate side,
>>>>> else if (vmemmap_check_pmd(pmd, node, addr, next)) //implement this function
>>>>>     continue;    //Buffer still exists, just abort..
>>>>>
>>>>> Could you please comment further whether #2 is feasible ?
>>>>
>>>> vmemmap_free() already gets start/end, so it could at least check the
>>>> alignment and avoid freeing if it's not unplugging a full section. It
>>>
>>> unmap_hotplug_pmd_range()
>>> {
>>>     do {
>>>         if (pmd_sect(pmd)) {
>>>             pmd_clear(pmdp);
>>>             flush_tlb_kernel_range(addr, addr + PAGE_SIZE);
>>>                          if (free_mapped)
>>>                                  free_hotplug_page_range(pmd_page(pmd),
>>>                                                          PMD_SIZE, altmap);
>>>         }
>>>     } while ()
>>> }
>>>
>>> Do you mean clearing the PMD entry but not freeing the mapped page for vmemmap ?
>>> In that case should the hot-unplug fail or not ? If we free the pfns (successful
>>> hot-unplug), then leaving behind entire PMD entry for covering the remaining sub
>>> sections, is going to be problematic as it still maps the removed pfns as well !
>>
>> Could you please help me to understand in which scenarios this might cause issue? I assume we won't touch these struct page further?
>>
>>>
>>>> does leave a 2MB vmemmap block in place when freeing the last subsection
>>>> but it's safer than freeing valid struct page entries. In addition, it
>>>> could query the memory hotplug state with something like
>>>> find_memory_block() and figure out whether the section is empty.
>>>
>>> I guess there are two potential solutions, if unmap_hotplug_pmd_range() were to
>>> handle sub-section removal.
>>>
>>> 1) Skip pmd_clear() when entire section is not covered
>>>
>>> a. pmd_clear() only if all but the current subsection have been removed earlier
>>>     via is_subsection_map_empty() or something similar.
>>>
>>> b. Skip pmd_clear() if the entire section covering that PMD is not being removed
>>>     but that might be problematic, as it still maps potentially unavailable pfns,
>>>     which are now hot-unplugged out.
>>>
>>> 2) Break PMD into base pages
>>>
>>> a. pmd_clear() only if all but the current subsection have been removed earlier
>>>     via is_subsection_map_empty() or something similar.
>>>
>>> b. Break entire PMD into base page mappings and remove entries corresponding to
>>>     the subsection being removed. Although the BBM sequence needs to be followed
>>>     while making sure that no other part of the kernel is accessing subsections,
>>>     that are mapped via the erstwhile PMD but currently not being removed.
>>>
>>>>
>>>> Anyway, I'll be off until the new year, maybe I get other ideas by then.
>>>>
>>
>>
> 


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v2 1/2] arm64: mm: vmemmap populate to page level if not section aligned
  2024-12-30  7:48           ` Zhenhua Huang
  2024-12-31  5:52             ` Zhenhua Huang
@ 2025-01-02  3:51             ` Anshuman Khandual
  2025-01-02  9:13               ` Zhenhua Huang
  1 sibling, 1 reply; 25+ messages in thread
From: Anshuman Khandual @ 2025-01-02  3:51 UTC (permalink / raw)
  To: Zhenhua Huang, Catalin Marinas
  Cc: will, ardb, ryan.roberts, mark.rutland, joey.gouly, dave.hansen,
	akpm, chenfeiyang, chenhuacai, linux-mm, linux-arm-kernel,
	linux-kernel, Tingwei Zhang

On 12/30/24 13:18, Zhenhua Huang wrote:
> Hi Anshuman,
> 
> On 2024/12/27 15:49, Anshuman Khandual wrote:
>> On 12/24/24 19:39, Catalin Marinas wrote:
>>> On Tue, Dec 24, 2024 at 05:32:06PM +0800, Zhenhua Huang wrote:
>>>> Thanks Catalin for review!
>>>> Merry Christmas.
>>>
>>> Merry Christmas to you too!
>>>
>>>> On 2024/12/21 2:30, Catalin Marinas wrote:
>>>>> On Mon, Dec 09, 2024 at 05:42:26PM +0800, Zhenhua Huang wrote:
>>>>>> Fixes: c1cc1552616d ("arm64: MMU initialisation")
>>>>>
>>>>> I wouldn't add a fix for the first commit adding arm64 support, we did
>>>>> not even have memory hotplug at the time (added later in 5.7 by commit
>>>>> bbd6ec605c0f ("arm64/mm: Enable memory hot remove")). IIUC, this hasn't
>>>>> been a problem until commit ba72b4c8cf60 ("mm/sparsemem: support
>>>>> sub-section hotplug"). That commit broke some arm64 assumptions.
>>>>
>>>> Shall we add ba72b4c8cf60 ("mm/sparsemem: support sub-section hotplug")
>>>> because it broke arm64 assumptions ?
>>>
>>> Yes, I think that would be better. And a cc stable to 5.4 (the above
>>> commit appeared in 5.3).
>>
>> Agreed. This is a problem which needs fixing but not sure if proposed patch
>> here fixes that problem.
>>
>>>
>>>>>> diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
>>>>>> index e2739b69e11b..fd59ee44960e 100644
>>>>>> --- a/arch/arm64/mm/mmu.c
>>>>>> +++ b/arch/arm64/mm/mmu.c
>>>>>> @@ -1177,7 +1177,9 @@ int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node,
>>>>>>    {
>>>>>>        WARN_ON((start < VMEMMAP_START) || (end > VMEMMAP_END));
>>>>>> -    if (!IS_ENABLED(CONFIG_ARM64_4K_PAGES))
>>>>>> +    if (!IS_ENABLED(CONFIG_ARM64_4K_PAGES) ||
>>>>>> +    !IS_ALIGNED(page_to_pfn((struct page *)start), PAGES_PER_SECTION) ||
>>>>>> +    !IS_ALIGNED(page_to_pfn((struct page *)end), PAGES_PER_SECTION))
>>>>>>            return vmemmap_populate_basepages(start, end, node, altmap);
>>>>>>        else
>>>>>>            return vmemmap_populate_hugepages(start, end, node, altmap);
>>>>>
>>>>> An alternative would be to fix unmap_hotplug_pmd_range() etc. to avoid
>>>>> nuking the whole vmemmap pmd section if it's not empty. Not sure how
>>>>> easy that is, whether we have the necessary information (I haven't
>>>>> looked in detail).
>>>>>
>>>>> A potential issue - can we hotplug 128MB of RAM and only unplug 2MB? If
>>>>> that's possible, the problem isn't solved by this patch.
>>
>> Believe this is possible after sub-section hotplug and hotremove support.
>>
>>>>
>>>> Indeed, seems there is no guarantee that plug size must be equal to unplug
>>>> size...
>>>>
>>>> I have two ideas:
>>>> 1. Completely disable this PMD mapping optimization since there is no
>>>> guarantee we must align 128M memory for hotplug ..
>>>
>>> I'd be in favour of this, at least if CONFIG_MEMORY_HOTPLUG is enabled.
>>> I think the only advantage here is that we don't allocate a full 2MB
>>> block for vmemmap when only plugging in a sub-section.
>>
>> Agreed, that will be the right fix for the problem which can be back ported.
>> We will have to prevent PMD/PUD/CONT mappings for both linear and as well as
> 
> Thanks Anshuman, yeah.. we must handle linear mapping as well.
> 
>> vmemmap for all non-boot memory sections, that can be hot-unplugged.
>>
>> Something like the following ? [untested]
>>
>> diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
>> index 216519663961..56b9c6891f46 100644
>> --- a/arch/arm64/mm/mmu.c
>> +++ b/arch/arm64/mm/mmu.c
>> @@ -1171,9 +1171,15 @@ int __meminit vmemmap_check_pmd(pmd_t *pmdp, int node,
>>   int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node,
>>                  struct vmem_altmap *altmap)
>>   {
>> +       unsigned long start_pfn;
>> +       struct mem_section *ms;
>> +
>>          WARN_ON((start < VMEMMAP_START) || (end > VMEMMAP_END));
>>   -       if (!IS_ENABLED(CONFIG_ARM64_4K_PAGES))
>> +       start_pfn = page_to_pfn((struct page *)start);
>> +       ms = __pfn_to_section(start_pfn);
>> +
>> +       if (!IS_ENABLED(CONFIG_ARM64_4K_PAGES) || !early_section(ms))
> 
> LGTM. I will follow your and Catalin's suggestion to prepare further patches, Thanks!
> 
>>                  return vmemmap_populate_basepages(start, end, node, altmap);
>>          else
>>                  return vmemmap_populate_hugepages(start, end, node, altmap);
>> @@ -1334,10 +1340,15 @@ struct range arch_get_mappable_range(void)
>>   int arch_add_memory(int nid, u64 start, u64 size,
>>                      struct mhp_params *params)
>>   {
>> +       unsigned long start_pfn = page_to_pfn((struct page *)start);
>> +       struct mem_section *ms = __pfn_to_section(start_pfn);
>>          int ret, flags = NO_EXEC_MAPPINGS;
>>            VM_BUG_ON(!mhp_range_allowed(start, size, true));
>>   +       if (!early_section(ms))
>> +               flags |= NO_BLOCK_MAPPINGS | NO_CONT_MAPPINGS;
> 
> However, here comes another doubt, given that the subsection size is 2M, shouldn't we have ability to support PMD SECTION MAPPING if CONFIG_ARM64_4K_PAGES? This might be the optimization we want to maintain?> 
> Should we remove NO_BLOCK_MAPPINGS and add more constraints to avoid pud_set_huge if CONFIG_ARM64_4K_PAGES ?

I guess this has been covered elsewhere in the thread.

> 
>> +
>>          if (can_set_direct_map())
>>                  flags |= NO_BLOCK_MAPPINGS | NO_CONT_MAPPINGS;
>>  
>>>
>>>> 2. If we want to take this optimization.
>>>> I propose adding one argument to vmemmap_free to indicate if the entire
>>>> section is freed(based on subsection map). Vmemmap_free is a common function
>>>> and might affect other architectures... The process would be:
>>>> vmemmap_free
>>>>     unmap_hotplug_range //In unmap_hotplug_pmd_range() as you mentioned:if
>>>> whole section is freed, proceed as usual. Otherwise, *just clear out struct
>>>> page content but do not free*.
>>>>     free_empty_tables // will be called only if entire section is freed
>>>>
>>>> On the populate side,
>>>> else if (vmemmap_check_pmd(pmd, node, addr, next)) //implement this function
>>>>     continue;    //Buffer still exists, just abort..
>>>>
>>>> Could you please comment further whether #2 is feasible ?
>>>
>>> vmemmap_free() already gets start/end, so it could at least check the
>>> alignment and avoid freeing if it's not unplugging a full section. It
>>
>> unmap_hotplug_pmd_range()
>> {
>>     do {
>>         if (pmd_sect(pmd)) {
>>             pmd_clear(pmdp);
>>             flush_tlb_kernel_range(addr, addr + PAGE_SIZE);
>>                          if (free_mapped)
>>                                  free_hotplug_page_range(pmd_page(pmd),
>>                                                          PMD_SIZE, altmap);
>>         }
>>     } while ()
>> }
>>
>> Do you mean clearing the PMD entry but not freeing the mapped page for vmemmap ?
>> In that case should the hot-unplug fail or not ? If we free the pfns (successful
>> hot-unplug), then leaving behind entire PMD entry for covering the remaining sub
>> sections, is going to be problematic as it still maps the removed pfns as well !
> 
> Could you please help me to understand in which scenarios this might cause issue? I assume we won't touch these struct page further?

Regardless of whether the non-present pfns are accessed from the CPU or not, having
page table mappings covering them might still create corresponding TLB entries?
IIUC, from an arch perspective this seems undesirable and possibly somewhat risky.

> 
>>
>>> does leave a 2MB vmemmap block in place when freeing the last subsection
>>> but it's safer than freeing valid struct page entries. In addition, it
>>> could query the memory hotplug state with something like
>>> find_memory_block() and figure out whether the section is empty.
>>
>> I guess there are two potential solutions, if unmap_hotplug_pmd_range() were to
>> handle sub-section removal.
>>
>> 1) Skip pmd_clear() when entire section is not covered
>>
>> a. pmd_clear() only if all but the current subsection have been removed earlier
>>     via is_subsection_map_empty() or something similar.
>>
>> b. Skip pmd_clear() if the entire section covering that PMD is not being removed
>>     but that might be problematic, as it still maps potentially unavailable pfns,
>>     which are now hot-unplugged out.
>>
>> 2) Break PMD into base pages
>>
>> a. pmd_clear() only if all but the current subsection have been removed earlier
>>     via is_subsection_map_empty() or something similar.
>>
>> b. Break entire PMD into base page mappings and remove entries corresponding to
>>     the subsection being removed. Although the BBM sequence needs to be followed
>>     while making sure that no other part of the kernel is accessing subsections,
>>     that are mapped via the erstwhile PMD but currently not being removed.
>>
>>>
>>> Anyway, I'll be off until the new year, maybe I get other ideas by then.
>>>
> 


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v2 1/2] arm64: mm: vmemmap populate to page level if not section aligned
  2025-01-02  3:16               ` Anshuman Khandual
@ 2025-01-02  9:07                 ` Zhenhua Huang
  0 siblings, 0 replies; 25+ messages in thread
From: Zhenhua Huang @ 2025-01-02  9:07 UTC (permalink / raw)
  To: Anshuman Khandual, Catalin Marinas
  Cc: will, ardb, ryan.roberts, mark.rutland, joey.gouly, dave.hansen,
	akpm, chenfeiyang, chenhuacai, linux-mm, linux-arm-kernel,
	linux-kernel, Tingwei Zhang



On 2025/1/2 11:16, Anshuman Khandual wrote:
> 
> 
> On 12/31/24 11:22, Zhenhua Huang wrote:
>>
>>
>> On 2024/12/30 15:48, Zhenhua Huang wrote:
>>> Hi Anshuman,
>>>
>>> On 2024/12/27 15:49, Anshuman Khandual wrote:
>>>> On 12/24/24 19:39, Catalin Marinas wrote:
>>>>> On Tue, Dec 24, 2024 at 05:32:06PM +0800, Zhenhua Huang wrote:
>>>>>> Thanks Catalin for review!
>>>>>> Merry Christmas.
>>>>>
>>>>> Merry Christmas to you too!
>>>>>
>>>>>> On 2024/12/21 2:30, Catalin Marinas wrote:
>>>>>>> On Mon, Dec 09, 2024 at 05:42:26PM +0800, Zhenhua Huang wrote:
>>>>>>>> Fixes: c1cc1552616d ("arm64: MMU initialisation")
>>>>>>>
>>>>>>> I wouldn't add a fix for the first commit adding arm64 support, we did
>>>>>>> not even have memory hotplug at the time (added later in 5.7 by commit
>>>>>>> bbd6ec605c0f ("arm64/mm: Enable memory hot remove")). IIUC, this hasn't
>>>>>>> been a problem until commit ba72b4c8cf60 ("mm/sparsemem: support
>>>>>>> sub-section hotplug"). That commit broke some arm64 assumptions.
>>>>>>
>>>>>> Shall we add ba72b4c8cf60 ("mm/sparsemem: support sub-section hotplug")
>>>>>> because it broke arm64 assumptions ?
>>>>>
>>>>> Yes, I think that would be better. And a cc stable to 5.4 (the above
>>>>> commit appeared in 5.3).
>>>>
>>>> Agreed. This is a problem which needs fixing but not sure if proposed patch
>>>> here fixes that problem.
>>>>
>>>>>
>>>>>>>> diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
>>>>>>>> index e2739b69e11b..fd59ee44960e 100644
>>>>>>>> --- a/arch/arm64/mm/mmu.c
>>>>>>>> +++ b/arch/arm64/mm/mmu.c
>>>>>>>> @@ -1177,7 +1177,9 @@ int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node,
>>>>>>>>     {
>>>>>>>>         WARN_ON((start < VMEMMAP_START) || (end > VMEMMAP_END));
>>>>>>>> -    if (!IS_ENABLED(CONFIG_ARM64_4K_PAGES))
>>>>>>>> +    if (!IS_ENABLED(CONFIG_ARM64_4K_PAGES) ||
>>>>>>>> +    !IS_ALIGNED(page_to_pfn((struct page *)start), PAGES_PER_SECTION) ||
>>>>>>>> +    !IS_ALIGNED(page_to_pfn((struct page *)end), PAGES_PER_SECTION))
>>>>>>>>             return vmemmap_populate_basepages(start, end, node, altmap);
>>>>>>>>         else
>>>>>>>>             return vmemmap_populate_hugepages(start, end, node, altmap);
>>>>>>>
>>>>>>> An alternative would be to fix unmap_hotplug_pmd_range() etc. to avoid
>>>>>>> nuking the whole vmemmap pmd section if it's not empty. Not sure how
>>>>>>> easy that is, whether we have the necessary information (I haven't
>>>>>>> looked in detail).
>>>>>>>
>>>>>>> A potential issue - can we hotplug 128MB of RAM and only unplug 2MB? If
>>>>>>> that's possible, the problem isn't solved by this patch.
>>>>
>>>> Believe this is possible after sub-section hotplug and hotremove support.
>>>>
>>>>>>
>>>>>> Indeed, seems there is no guarantee that plug size must be equal to unplug
>>>>>> size...
>>>>>>
>>>>>> I have two ideas:
>>>>>> 1. Completely disable this PMD mapping optimization since there is no
>>>>>> guarantee we must align 128M memory for hotplug ..
>>>>>
>>>>> I'd be in favour of this, at least if CONFIG_MEMORY_HOTPLUG is enabled.
>>>>> I think the only advantage here is that we don't allocate a full 2MB
>>>>> block for vmemmap when only plugging in a sub-section.
>>>>
>>>> Agreed, that will be the right fix for the problem which can be back ported.
>>>> We will have to prevent PMD/PUD/CONT mappings for both linear and as well as
>>>
>>> Thanks Anshuman, yeah.. we must handle linear mapping as well.
>>>
>>>> vmemmap for all non-boot memory sections, that can be hot-unplugged.
>>>>
>>>> Something like the following ? [untested]
>>>>
>>>> diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
>>>> index 216519663961..56b9c6891f46 100644
>>>> --- a/arch/arm64/mm/mmu.c
>>>> +++ b/arch/arm64/mm/mmu.c
>>>> @@ -1171,9 +1171,15 @@ int __meminit vmemmap_check_pmd(pmd_t *pmdp, int node,
>>>>    int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node,
>>>>                   struct vmem_altmap *altmap)
>>>>    {
>>>> +       unsigned long start_pfn;
>>>> +       struct mem_section *ms;
>>>> +
>>>>           WARN_ON((start < VMEMMAP_START) || (end > VMEMMAP_END));
>>>> -       if (!IS_ENABLED(CONFIG_ARM64_4K_PAGES))
>>>> +       start_pfn = page_to_pfn((struct page *)start);
>>>> +       ms = __pfn_to_section(start_pfn);
>>>> +
>>>> +       if (!IS_ENABLED(CONFIG_ARM64_4K_PAGES) || !early_section(ms))
>>>
>>> LGTM. I will follow your and Catalin's suggestion to prepare further patches, Thanks!
>>>
>>>>                   return vmemmap_populate_basepages(start, end, node, altmap);
>>>>           else
>>>>                   return vmemmap_populate_hugepages(start, end, node, altmap);
>>>> @@ -1334,10 +1340,15 @@ struct range arch_get_mappable_range(void)
>>>>    int arch_add_memory(int nid, u64 start, u64 size,
>>>>                       struct mhp_params *params)
>>>>    {
>>>> +       unsigned long start_pfn = page_to_pfn((struct page *)start);
>>>> +       struct mem_section *ms = __pfn_to_section(start_pfn);
>>>>           int ret, flags = NO_EXEC_MAPPINGS;
>>>>           VM_BUG_ON(!mhp_range_allowed(start, size, true));
>>>> +       if (!early_section(ms))
>>>> +               flags |= NO_BLOCK_MAPPINGS | NO_CONT_MAPPINGS;
>>>
>>> However, here comes another doubt, given that the subsection size is 2M, shouldn't we have ability to support PMD SECTION MAPPING if CONFIG_ARM64_4K_PAGES? This might be the optimization we want to maintain?
>>>
>>> Should we remove NO_BLOCK_MAPPINGS and add more constraints to avoid pud_set_huge if CONFIG_ARM64_4K_PAGES ?
>>>
>>
>> BTW, shall we remove the check for !early_section since arch_add_memory is only called during hotplugging case? Correct me please if I'm mistaken :)
> 
> While this is true, still might be a good idea to keep the early_section()
> check in place just to be extra careful here. Otherwise an WARN_ON() might
> be needed.

Makes sense. I would like to add a comment and a WARN_ON() if
early_section() is true.

> 
>> The idea is like(not fully tested):
>>
>> diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
>> index e2739b69e11b..9afeb35673a3 100644
>> --- a/arch/arm64/mm/mmu.c
>> +++ b/arch/arm64/mm/mmu.c
>> @@ -45,6 +45,7 @@
>>   #define NO_BLOCK_MAPPINGS      BIT(0)
>>   #define NO_CONT_MAPPINGS       BIT(1)
>>   #define NO_EXEC_MAPPINGS       BIT(2)  /* assumes FEAT_HPDS is not used */
>> +#define NO_PUD_BLOCK_MAPPINGS  BIT(3)  /* Hotplug case: do not want block mapping for PUD */
> 
> Since block mappings get created either at PMD or PUD, the existing flag
> NO_BLOCK_MAPPINGS should be split into two i.e NO_PMD_BLOCK_MAPPINGS and
> NO_PUD_BLOCK_MAPPINGS (possibly expanding into P4D later). Although all
> block mappings can still be prevented using the existing flag which can
> be derived from the new ones.
> 
> #define NO_BLOCK_MAPPINGS (NO_PMD_BLOCK_MAPPINGS | NO_PUD_BLOCK_MAPPINGS)

Thanks, that's clearer.

> 
>>
>>   u64 kimage_voffset __ro_after_init;
>>   EXPORT_SYMBOL(kimage_voffset);
>> @@ -356,10 +357,12 @@ static void alloc_init_pud(p4d_t *p4dp, unsigned long addr, unsigned long end,
>>
>>                  /*
>>                   * For 4K granule only, attempt to put down a 1GB block
>> +                * Hotplug case: do not attempt 1GB block
>>                   */
>>                  if (pud_sect_supported() &&
>>                     ((addr | next | phys) & ~PUD_MASK) == 0 &&
>> -                   (flags & NO_BLOCK_MAPPINGS) == 0) {
>> +                   (flags & NO_BLOCK_MAPPINGS) == 0 &&
>> +                   (flags & NO_PUD_BLOCK_MAPPINGS) == 0) {
> 
> After flags being split as suggested above, only the PUD block mapping
> flag need to be checked here, and similarly the PMU block mapping flag
> needs to be checked in alloc_init_pmd().
> 
>>                          pud_set_huge(pudp, phys, prot);
>>
>>                          /*
>> @@ -1175,9 +1178,16 @@ int __meminit vmemmap_check_pmd(pmd_t *pmdp, int node,
>>   int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node,
>>                  struct vmem_altmap *altmap)
>>   {
>> +       unsigned long start_pfn;
>> +       struct mem_section *ms;
>> +
>>          WARN_ON((start < VMEMMAP_START) || (end > VMEMMAP_END));
>>
>> -       if (!IS_ENABLED(CONFIG_ARM64_4K_PAGES))
>> +       start_pfn = page_to_pfn((struct page *)start);
>> +       ms = __pfn_to_section(start_pfn);
>> +
>> +       /* hotplugged section not support hugepages */
>> +       if (!IS_ENABLED(CONFIG_ARM64_4K_PAGES) || !early_section(ms))
>>                  return vmemmap_populate_basepages(start, end, node, altmap);
>>          else
>>                  return vmemmap_populate_hugepages(start, end, node, altmap);
>> @@ -1342,6 +1352,16 @@ int arch_add_memory(int nid, u64 start, u64 size,
>>
>>          VM_BUG_ON(!mhp_range_allowed(start, size, true));
>>
>> +       if (IS_ENABLED(CONFIG_ARM64_4K_PAGES))
>> +       /*
>> +        * As per subsection granule is 2M, allow PMD block mapping in
>> +        * case 4K PAGES.
>> +        * Other cases forbid section mapping.
>> +        */
>> +               flags |= NO_PUD_BLOCK_MAPPINGS | NO_CONT_MAPPINGS;
>> +       else
>> +               flags |= NO_BLOCK_MAPPINGS | NO_CONT_MAPPINGS;
>> +
>>          if (can_set_direct_map())
>>                  flags |= NO_BLOCK_MAPPINGS | NO_CONT_MAPPINGS;
>>
> 
> Basically vmmemap will not allow PMD or PUD block mapping for non boot
> memory as a 2MB sized sub-section hot removal involves tearing down a
> sub PMD i.e (512 * sizeof(struct page)) VA range, which is currently
> not supported in unmap_hotplug_range().
> 
> Although linear mapping could still support PMD block mapping as hot
> removing a 2MB sized sub-section will tear down an entire PMD entry.
> 
> Fair enough but seems like this should be done after the fix patch
> which prevents all block mappings for early section memory as that

s/early section/non early section ?
Sure, I will wait for Catalin/Will's comments.

> will be an easy back port. But will leave this to upto Catalin/Will
> to decide.
> 
>>
>>
>>>> +
>>>>           if (can_set_direct_map())
>>>>                   flags |= NO_BLOCK_MAPPINGS | NO_CONT_MAPPINGS;
>>>>
>>>>>
>>>>>> 2. If we want to take this optimization.
>>>>>> I propose adding one argument to vmemmap_free to indicate if the entire
>>>>>> section is freed(based on subsection map). Vmemmap_free is a common function
>>>>>> and might affect other architectures... The process would be:
>>>>>> vmemmap_free
>>>>>>      unmap_hotplug_range //In unmap_hotplug_pmd_range() as you mentioned:if
>>>>>> whole section is freed, proceed as usual. Otherwise, *just clear out struct
>>>>>> page content but do not free*.
>>>>>>      free_empty_tables // will be called only if entire section is freed
>>>>>>
>>>>>> On the populate side,
>>>>>> else if (vmemmap_check_pmd(pmd, node, addr, next)) //implement this function
>>>>>>      continue;    //Buffer still exists, just abort..
>>>>>>
>>>>>> Could you please comment further whether #2 is feasible ?
>>>>>
>>>>> vmemmap_free() already gets start/end, so it could at least check the
>>>>> alignment and avoid freeing if it's not unplugging a full section. It
>>>>
>>>> unmap_hotplug_pmd_range()
>>>> {
>>>>      do {
>>>>          if (pmd_sect(pmd)) {
>>>>              pmd_clear(pmdp);
>>>>              flush_tlb_kernel_range(addr, addr + PAGE_SIZE);
>>>>                           if (free_mapped)
>>>>                                   free_hotplug_page_range(pmd_page(pmd),
>>>>                                                           PMD_SIZE, altmap);
>>>>          }
>>>>      } while ()
>>>> }
>>>>
>>>> Do you mean clearing the PMD entry but not freeing the mapped page for vmemmap ?
>>>> In that case should the hot-unplug fail or not ? If we free the pfns (successful
>>>> hot-unplug), then leaving behind entire PMD entry for covering the remaining sub
>>>> sections, is going to be problematic as it still maps the removed pfns as well !
>>>
>>> Could you please help me to understand in which scenarios this might cause issue? I assume we won't touch these struct page further?
>>>
>>>>
>>>>> does leave a 2MB vmemmap block in place when freeing the last subsection
>>>>> but it's safer than freeing valid struct page entries. In addition, it
>>>>> could query the memory hotplug state with something like
>>>>> find_memory_block() and figure out whether the section is empty.
>>>>
>>>> I guess there are two potential solutions, if unmap_hotplug_pmd_range() were to
>>>> handle sub-section removal.
>>>>
>>>> 1) Skip pmd_clear() when entire section is not covered
>>>>
>>>> a. pmd_clear() only if all but the current subsection have been removed earlier
>>>>      via is_subsection_map_empty() or something similar.
>>>>
>>>> b. Skip pmd_clear() if the entire section covering that PMD is not being removed
>>>>      but that might be problematic, as it still maps potentially unavailable pfns,
>>>>      which are now hot-unplugged out.
>>>>
>>>> 2) Break PMD into base pages
>>>>
>>>> a. pmd_clear() only if all but the current subsection have been removed earlier
>>>>      via is_subsection_map_empty() or something similar.
>>>>
>>>> b. Break entire PMD into base page mappings and remove entries corresponding to
>>>>      the subsection being removed. Although the BBM sequence needs to be followed
>>>>      while making sure that no other part of the kernel is accessing subsections,
>>>>      that are mapped via the erstwhile PMD but currently not being removed.
>>>>
>>>>>
>>>>> Anyway, I'll be off until the new year, maybe I get other ideas by then.
>>>>>
>>>
>>>
>>



^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v2 1/2] arm64: mm: vmemmap populate to page level if not section aligned
  2025-01-02  3:51             ` Anshuman Khandual
@ 2025-01-02  9:13               ` Zhenhua Huang
  0 siblings, 0 replies; 25+ messages in thread
From: Zhenhua Huang @ 2025-01-02  9:13 UTC (permalink / raw)
  To: Anshuman Khandual, Catalin Marinas
  Cc: will, ardb, ryan.roberts, mark.rutland, joey.gouly, dave.hansen,
	akpm, chenfeiyang, chenhuacai, linux-mm, linux-arm-kernel,
	linux-kernel, Tingwei Zhang



On 2025/1/2 11:51, Anshuman Khandual wrote:
> On 12/30/24 13:18, Zhenhua Huang wrote:
>> Hi Anshuman,
>>
>> On 2024/12/27 15:49, Anshuman Khandual wrote:
>>> On 12/24/24 19:39, Catalin Marinas wrote:
>>>> On Tue, Dec 24, 2024 at 05:32:06PM +0800, Zhenhua Huang wrote:
>>>>> Thanks Catalin for review!
>>>>> Merry Christmas.
>>>>
>>>> Merry Christmas to you too!
>>>>
>>>>> On 2024/12/21 2:30, Catalin Marinas wrote:
>>>>>> On Mon, Dec 09, 2024 at 05:42:26PM +0800, Zhenhua Huang wrote:
>>>>>>> Fixes: c1cc1552616d ("arm64: MMU initialisation")
>>>>>>
>>>>>> I wouldn't add a fix for the first commit adding arm64 support, we did
>>>>>> not even have memory hotplug at the time (added later in 5.7 by commit
>>>>>> bbd6ec605c0f ("arm64/mm: Enable memory hot remove")). IIUC, this hasn't
>>>>>> been a problem until commit ba72b4c8cf60 ("mm/sparsemem: support
>>>>>> sub-section hotplug"). That commit broke some arm64 assumptions.
>>>>>
>>>>> Shall we add ba72b4c8cf60 ("mm/sparsemem: support sub-section hotplug")
>>>>> because it broke arm64 assumptions ?
>>>>
>>>> Yes, I think that would be better. And a cc stable to 5.4 (the above
>>>> commit appeared in 5.3).
>>>
>>> Agreed. This is a problem which needs fixing but not sure if proposed patch
>>> here fixes that problem.
>>>
>>>>
>>>>>>> diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
>>>>>>> index e2739b69e11b..fd59ee44960e 100644
>>>>>>> --- a/arch/arm64/mm/mmu.c
>>>>>>> +++ b/arch/arm64/mm/mmu.c
>>>>>>> @@ -1177,7 +1177,9 @@ int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node,
>>>>>>>     {
>>>>>>>         WARN_ON((start < VMEMMAP_START) || (end > VMEMMAP_END));
>>>>>>> -    if (!IS_ENABLED(CONFIG_ARM64_4K_PAGES))
>>>>>>> +    if (!IS_ENABLED(CONFIG_ARM64_4K_PAGES) ||
>>>>>>> +    !IS_ALIGNED(page_to_pfn((struct page *)start), PAGES_PER_SECTION) ||
>>>>>>> +    !IS_ALIGNED(page_to_pfn((struct page *)end), PAGES_PER_SECTION))
>>>>>>>             return vmemmap_populate_basepages(start, end, node, altmap);
>>>>>>>         else
>>>>>>>             return vmemmap_populate_hugepages(start, end, node, altmap);
>>>>>>
>>>>>> An alternative would be to fix unmap_hotplug_pmd_range() etc. to avoid
>>>>>> nuking the whole vmemmap pmd section if it's not empty. Not sure how
>>>>>> easy that is, whether we have the necessary information (I haven't
>>>>>> looked in detail).
>>>>>>
>>>>>> A potential issue - can we hotplug 128MB of RAM and only unplug 2MB? If
>>>>>> that's possible, the problem isn't solved by this patch.
>>>
>>> Believe this is possible after sub-section hotplug and hotremove support.
>>>
>>>>>
>>>>> Indeed, seems there is no guarantee that plug size must be equal to unplug
>>>>> size...
>>>>>
>>>>> I have two ideas:
>>>>> 1. Completely disable this PMD mapping optimization since there is no
>>>>> guarantee we must align 128M memory for hotplug ..
>>>>
>>>> I'd be in favour of this, at least if CONFIG_MEMORY_HOTPLUG is enabled.
>>>> I think the only advantage here is that we don't allocate a full 2MB
>>>> block for vmemmap when only plugging in a sub-section.
>>>
>>> Agreed, that will be the right fix for the problem which can be back ported.
>>> We will have to prevent PMD/PUD/CONT mappings for both linear and as well as
>>
>> Thanks Anshuman, yeah.. we must handle linear mapping as well.
>>
>>> vmemmap for all non-boot memory sections, that can be hot-unplugged.
>>>
>>> Something like the following ? [untested]
>>>
>>> diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
>>> index 216519663961..56b9c6891f46 100644
>>> --- a/arch/arm64/mm/mmu.c
>>> +++ b/arch/arm64/mm/mmu.c
>>> @@ -1171,9 +1171,15 @@ int __meminit vmemmap_check_pmd(pmd_t *pmdp, int node,
>>>    int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node,
>>>                   struct vmem_altmap *altmap)
>>>    {
>>> +       unsigned long start_pfn;
>>> +       struct mem_section *ms;
>>> +
>>>           WARN_ON((start < VMEMMAP_START) || (end > VMEMMAP_END));
>>>    -       if (!IS_ENABLED(CONFIG_ARM64_4K_PAGES))
>>> +       start_pfn = page_to_pfn((struct page *)start);
>>> +       ms = __pfn_to_section(start_pfn);
>>> +
>>> +       if (!IS_ENABLED(CONFIG_ARM64_4K_PAGES) || !early_section(ms))
>>
>> LGTM. I will follow your and Catalin's suggestion to prepare further patches, Thanks!
>>
>>>                   return vmemmap_populate_basepages(start, end, node, altmap);
>>>           else
>>>                   return vmemmap_populate_hugepages(start, end, node, altmap);
>>> @@ -1334,10 +1340,15 @@ struct range arch_get_mappable_range(void)
>>>    int arch_add_memory(int nid, u64 start, u64 size,
>>>                       struct mhp_params *params)
>>>    {
>>> +       unsigned long start_pfn = page_to_pfn((struct page *)start);
>>> +       struct mem_section *ms = __pfn_to_section(start_pfn);
>>>           int ret, flags = NO_EXEC_MAPPINGS;
>>>             VM_BUG_ON(!mhp_range_allowed(start, size, true));
>>>    +       if (!early_section(ms))
>>> +               flags |= NO_BLOCK_MAPPINGS | NO_CONT_MAPPINGS;
>>
>> However, here comes another doubt, given that the subsection size is 2M, shouldn't we have ability to support PMD SECTION MAPPING if CONFIG_ARM64_4K_PAGES? This might be the optimization we want to maintain?>
>> Should we remove NO_BLOCK_MAPPINGS and add more constraints to avoid pud_set_huge if CONFIG_ARM64_4K_PAGES ?
> 
> I guess this has been covered on the thread.

Yeah, given the example I shared on the separate thread, we can wrap up
the discussion here :)

> 
>>
>>> +
>>>           if (can_set_direct_map())
>>>                   flags |= NO_BLOCK_MAPPINGS | NO_CONT_MAPPINGS;
>>>   
>>>>
>>>>> 2. If we want to take this optimization.
>>>>> I propose adding one argument to vmemmap_free to indicate if the entire
>>>>> section is freed(based on subsection map). Vmemmap_free is a common function
>>>>> and might affect other architectures... The process would be:
>>>>> vmemmap_free
>>>>>      unmap_hotplug_range //In unmap_hotplug_pmd_range() as you mentioned:if
>>>>> whole section is freed, proceed as usual. Otherwise, *just clear out struct
>>>>> page content but do not free*.
>>>>>      free_empty_tables // will be called only if entire section is freed
>>>>>
>>>>> On the populate side,
>>>>> else if (vmemmap_check_pmd(pmd, node, addr, next)) //implement this function
>>>>>      continue;    //Buffer still exists, just abort..
>>>>>
>>>>> Could you please comment further whether #2 is feasible ?
>>>>
>>>> vmemmap_free() already gets start/end, so it could at least check the
>>>> alignment and avoid freeing if it's not unplugging a full section. It
>>>
>>> unmap_hotplug_pmd_range()
>>> {
>>>      do {
>>>          if (pmd_sect(pmd)) {
>>>              pmd_clear(pmdp);
>>>              flush_tlb_kernel_range(addr, addr + PAGE_SIZE);
>>>                           if (free_mapped)
>>>                                   free_hotplug_page_range(pmd_page(pmd),
>>>                                                           PMD_SIZE, altmap);
>>>          }
>>>      } while ()
>>> }
>>>
>>> Do you mean clearing the PMD entry but not freeing the mapped page for vmemmap ?
>>> In that case should the hot-unplug fail or not ? If we free the pfns (successful
>>> hot-unplug), then leaving behind entire PMD entry for covering the remaining sub
>>> sections, is going to be problematic as it still maps the removed pfns as well !
>>
>> Could you please help me to understand in which scenarios this might cause issue? I assume we won't touch these struct page further?
> 
> Regardless of whether the non-present pfns are accessed or not from the cpu, having
> page table mappings covering them might probably create corresponding TLB entries ?
> IIUC from an arch perspective, this seems undesirable and possibly some what risky.
> 

Got it... I know it's tricky indeed.

>>
>>>
>>>> does leave a 2MB vmemmap block in place when freeing the last subsection
>>>> but it's safer than freeing valid struct page entries. In addition, it
>>>> could query the memory hotplug state with something like
>>>> find_memory_block() and figure out whether the section is empty.
>>>
>>> I guess there are two potential solutions, if unmap_hotplug_pmd_range() were to
>>> handle sub-section removal.
>>>
>>> 1) Skip pmd_clear() when entire section is not covered
>>>
>>> a. pmd_clear() only if all but the current subsection have been removed earlier
>>>      via is_subsection_map_empty() or something similar.
>>>
>>> b. Skip pmd_clear() if the entire section covering that PMD is not being removed
>>>      but that might be problematic, as it still maps potentially unavailable pfns,
>>>      which are now hot-unplugged out.
>>>
>>> 2) Break PMD into base pages
>>>
>>> a. pmd_clear() only if all but the current subsection have been removed earlier
>>>      via is_subsection_map_empty() or something similar.
>>>
>>> b. Break entire PMD into base page mappings and remove entries corresponding to
>>>      the subsection being removed. Although the BBM sequence needs to be followed
>>>      while making sure that no other part of the kernel is accessing subsections,
>>>      that are mapped via the erstwhile PMD but currently not being removed.
>>>
>>>>
>>>> Anyway, I'll be off until the new year, maybe I get other ideas by then.
>>>>
>>



^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v2 2/2] arm64: mm: implement vmemmap_check_pmd for arm64
  2024-12-27  2:57     ` Anshuman Khandual
  2024-12-30  7:48       ` Zhenhua Huang
@ 2025-01-02 18:12       ` Catalin Marinas
  2025-01-03  2:43         ` Zhenhua Huang
  1 sibling, 1 reply; 25+ messages in thread
From: Catalin Marinas @ 2025-01-02 18:12 UTC (permalink / raw)
  To: Anshuman Khandual
  Cc: Zhenhua Huang, will, ardb, ryan.roberts, mark.rutland, joey.gouly,
	dave.hansen, akpm, chenfeiyang, chenhuacai, linux-mm,
	linux-arm-kernel, linux-kernel

On Fri, Dec 27, 2024 at 08:27:18AM +0530, Anshuman Khandual wrote:
> On 12/21/24 00:05, Catalin Marinas wrote:
> > On Mon, Dec 09, 2024 at 05:42:27PM +0800, Zhenhua Huang wrote:
> >> diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
> >> index fd59ee44960e..41c7978a92be 100644
> >> --- a/arch/arm64/mm/mmu.c
> >> +++ b/arch/arm64/mm/mmu.c
> >> @@ -1169,7 +1169,8 @@ int __meminit vmemmap_check_pmd(pmd_t *pmdp, int node,
> >>  				unsigned long addr, unsigned long next)
> >>  {
> >>  	vmemmap_verify((pte_t *)pmdp, node, addr, next);
> >> -	return 1;
> >> +
> >> +	return pmd_sect(*pmdp);
> 
> Please change this as pmd_sect(READ_ONCE(*pmdp)) instead.
> 
> >>  }
> >>  
> >>  int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node,
> > 
> > Don't we need this patch only if we implement the first one? Please fold
> > it into the other patch.
> 
> Seems like these patches might not be related.
> 
> While creating huge page based vmemmap mapping during vmemmap_populate_hugepages(),
> vmemmap_check_pmd() validates if a populated (i.e pmd_none) PMD already represents
> a huge mapping and can be skipped there after.
> 
> Current implementation for vmemmap_check_pmd() on arm64, unconditionally returns 1
> thus asserting that the given populated PMD entry is a huge one indeed, which will
> be the case unless something is wrong. vmemmap_verify() only ensures that the node
> where the pfn is allocated from is local.
> 
> int __meminit vmemmap_check_pmd(pmd_t *pmdp, int node,
>                                 unsigned long addr, unsigned long next)
> {
>         vmemmap_verify((pte_t *)pmdp, node, addr, next);
>         return 1;
> }
> 
> However, it does not actually check that the entry is a section mapping, which it
> should. Returning pmd_sect(READ_ONCE(*pmdp)) is the right thing to do and should
> have been the case from the beginning, when vmemmap_check_pmd() was added; I guess
> it was missed because arm64's original vmemmap_populate() also only relied on
> vmemmap_verify(). So this probably does not need a "Fixes: " tag.
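
With the READ_ONCE() suggestion above folded in, the arm64 helper would read
roughly as follows (a sketch of the v2 diff plus that one change, not the
final posted patch):

	int __meminit vmemmap_check_pmd(pmd_t *pmdp, int node,
					unsigned long addr, unsigned long next)
	{
		/* Still verify that the backing memory is node-local, as before. */
		vmemmap_verify((pte_t *)pmdp, node, addr, next);

		/* Only report the PMD as populated if it really is a section mapping. */
		return pmd_sect(READ_ONCE(*pmdp));
	}
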

I did not say the patch is wrong, only that it wouldn't be needed unless
we have the other patch in this series. However, if we do apply the
other patch, we definitely need this change, so keeping them together
would make it easier to backport.

-- 
Catalin



* Re: [PATCH v2 1/2] arm64: mm: vmemmap populate to page level if not section aligned
  2024-12-27  7:49         ` Anshuman Khandual
  2024-12-30  7:48           ` Zhenhua Huang
@ 2025-01-02 18:58           ` Catalin Marinas
  2025-01-03  2:01             ` Zhenhua Huang
  1 sibling, 1 reply; 25+ messages in thread
From: Catalin Marinas @ 2025-01-02 18:58 UTC (permalink / raw)
  To: Anshuman Khandual
  Cc: Zhenhua Huang, will, ardb, ryan.roberts, mark.rutland, joey.gouly,
	dave.hansen, akpm, chenfeiyang, chenhuacai, linux-mm,
	linux-arm-kernel, linux-kernel

On Fri, Dec 27, 2024 at 01:19:30PM +0530, Anshuman Khandual wrote:
> On 12/24/24 19:39, Catalin Marinas wrote:
> > On Tue, Dec 24, 2024 at 05:32:06PM +0800, Zhenhua Huang wrote:
> >> I have two ideas:
> >> 1. Completely disable this PMD mapping optimization, since there is no
> >> guarantee that hotplugged memory is 128M (section) aligned.
> > 
> > I'd be in favour of this, at least if CONFIG_MEMORY_HOTPLUG is enabled.
> > I think the only advantage here is that we don't allocate a full 2MB
> > block for vmemmap when only plugging in a sub-section.
> 
>> Agreed, that would be the right fix for the problem, and one that can be backported.
>> We will have to prevent PMD/PUD/CONT mappings in both the linear map and the
>> vmemmap for all non-boot memory sections, i.e. those that can be hot-unplugged.

I haven't read your diffs below yet but why not allow PMD mappings of
the hotplugged memory (not vmemmap)? SUBSECTION_SHIFT is 21, so we can
have 2MB blocks in a 4KB page configuration.

-- 
Catalin



* Re: [PATCH v2 1/2] arm64: mm: vmemmap populate to page level if not section aligned
  2025-01-02 18:58           ` Catalin Marinas
@ 2025-01-03  2:01             ` Zhenhua Huang
  0 siblings, 0 replies; 25+ messages in thread
From: Zhenhua Huang @ 2025-01-03  2:01 UTC (permalink / raw)
  To: Catalin Marinas, Anshuman Khandual
  Cc: will, ardb, ryan.roberts, mark.rutland, joey.gouly, dave.hansen,
	akpm, chenfeiyang, chenhuacai, linux-mm, linux-arm-kernel,
	linux-kernel, Tingwei Zhang



On 2025/1/3 2:58, Catalin Marinas wrote:
> On Fri, Dec 27, 2024 at 01:19:30PM +0530, Anshuman Khandual wrote:
>> On 12/24/24 19:39, Catalin Marinas wrote:
>>> On Tue, Dec 24, 2024 at 05:32:06PM +0800, Zhenhua Huang wrote:
>>>> I have two ideas:
>>>> 1. Completely disable this PMD mapping optimization, since there is no
>>>> guarantee that hotplugged memory is 128M (section) aligned.
>>>
>>> I'd be in favour of this, at least if CONFIG_MEMORY_HOTPLUG is enabled.
>>> I think the only advantage here is that we don't allocate a full 2MB
>>> block for vmemmap when only plugging in a sub-section.
>>
>> Agreed, that would be the right fix for the problem, and one that can be backported.
>> We will have to prevent PMD/PUD/CONT mappings in both the linear map and the
>> vmemmap for all non-boot memory sections, i.e. those that can be hot-unplugged.
> 
> I haven't read your diffs below yet but why not allow PMD mappings of
> the hotplugged memory (not vmemmap)? SUBSECTION_SHIFT is 21, so we can
> have 2MB blocks in a 4KB page configuration.
> 
Oh, understood. We discussed this topic and I shared the same concern;
Anshuman preferred to wait for your decision.

See discussion:
https://lore.kernel.org/lkml/3aadda40-c973-4703-8ed8-9cf2d3eb70a0@quicinc.com/

I see your preference now and will post v3 based on that.
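
For reference, the direction agreed above would look roughly like the sketch
below: the vmemmap falls back to base pages for hotplugged (non-early)
sections, while the hotplugged linear map can keep 2M block mappings because
SUBSECTION_SHIFT is 21 (2^21 bytes == 2M == PMD_SIZE with 4K pages). The
exact condition and helpers are illustrative only, not the v3 patch itself:

	int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node,
				       struct vmem_altmap *altmap)
	{
		struct mem_section *ms =
			__pfn_to_section(page_to_pfn((struct page *)start));

		/*
		 * Sketch: hotplugged sections may arrive one 2M subsection at a
		 * time, so populate their memmap with base pages and keep the
		 * PMD-level vmemmap optimisation for boot memory only. (The
		 * existing non-4K base-page path is omitted here.)
		 */
		if (!early_section(ms))
			return vmemmap_populate_basepages(start, end, node, altmap);

		return vmemmap_populate_hugepages(start, end, node, altmap);
	}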




* Re: [PATCH v2 2/2] arm64: mm: implement vmemmap_check_pmd for arm64
  2025-01-02 18:12       ` Catalin Marinas
@ 2025-01-03  2:43         ` Zhenhua Huang
  2025-01-03 17:58           ` Catalin Marinas
  0 siblings, 1 reply; 25+ messages in thread
From: Zhenhua Huang @ 2025-01-03  2:43 UTC (permalink / raw)
  To: Catalin Marinas, Anshuman Khandual
  Cc: will, ardb, ryan.roberts, mark.rutland, joey.gouly, dave.hansen,
	akpm, chenfeiyang, chenhuacai, linux-mm, linux-arm-kernel,
	linux-kernel



On 2025/1/3 2:12, Catalin Marinas wrote:
> On Fri, Dec 27, 2024 at 08:27:18AM +0530, Anshuman Khandual wrote:
>> On 12/21/24 00:05, Catalin Marinas wrote:
>>> On Mon, Dec 09, 2024 at 05:42:27PM +0800, Zhenhua Huang wrote:
>>>> diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
>>>> index fd59ee44960e..41c7978a92be 100644
>>>> --- a/arch/arm64/mm/mmu.c
>>>> +++ b/arch/arm64/mm/mmu.c
>>>> @@ -1169,7 +1169,8 @@ int __meminit vmemmap_check_pmd(pmd_t *pmdp, int node,
>>>>   				unsigned long addr, unsigned long next)
>>>>   {
>>>>   	vmemmap_verify((pte_t *)pmdp, node, addr, next);
>>>> -	return 1;
>>>> +
>>>> +	return pmd_sect(*pmdp);
>>
>> Please change this to pmd_sect(READ_ONCE(*pmdp)) instead.
>>
>>>>   }
>>>>   
>>>>   int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node,
>>>
>>> Don't we need this patch only if we implement the first one? Please fold
>>> it into the other patch.
>>
>> Seems like these patches might not be related.
>>
>> While creating huge-page-based vmemmap mappings during vmemmap_populate_hugepages(),
>> vmemmap_check_pmd() validates whether a populated (i.e. !pmd_none()) PMD already
>> represents a huge mapping and can therefore be skipped.
>>
>> The current implementation of vmemmap_check_pmd() on arm64 unconditionally returns 1,
>> thus asserting that the given populated PMD entry is indeed a huge one, which will
>> be the case unless something is wrong. vmemmap_verify() only ensures that the node
>> from which the pfn was allocated is local.
>>
>> int __meminit vmemmap_check_pmd(pmd_t *pmdp, int node,
>>                                  unsigned long addr, unsigned long next)
>> {
>>          vmemmap_verify((pte_t *)pmdp, node, addr, next);
>>          return 1;
>> }
>>
>> However, it does not actually check that the entry is a section mapping, which it
>> should. Returning pmd_sect(READ_ONCE(*pmdp)) is the right thing to do and should
>> have been the case from the beginning, when vmemmap_check_pmd() was added; I guess
>> it was missed because arm64's original vmemmap_populate() also only relied on
>> vmemmap_verify(). So this probably does not need a "Fixes: " tag.
> 
> I did not say the patch is wrong, only that it wouldn't be needed unless
> we have the other patch in this series. However, if we do apply the
> other patch, we definitely need this change, so keeping them together
> would make it easier to backport.

Hi Catalin,

Based on our current discussion on patch #1, we will prohibit huge pages 
(PMD-level vmemmap mappings) for all hotplugged sections. The flow:
vmemmap_populate
	vmemmap_populate_hugepages
		vmemmap_check_pmd

will *only* be called for non-early sections. Therefore, with patch #1, I 
don't see this patch as essential. Would it be acceptable if we do not 
backport it? Anshuman's suggestion seems reasonable to me, and I separated 
the patch out:
https://lore.kernel.org/lkml/20250102074047.674156-1-quic_zhenhuah@quicinc.com/
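
To make the flow above concrete, the generic helper only consults
vmemmap_check_pmd() when it finds a PMD that is already populated. A heavily
abridged paraphrase of the loop in mm/sparse-vmemmap.c (not the verbatim
source):

	for (addr = start; addr < end; addr = next) {
		next = pmd_addr_end(addr, end);
		/* (page table walk/allocation down to the PMD omitted) */

		if (pmd_none(READ_ONCE(*pmd))) {
			/* Empty PMD: try to map the memmap with one PMD_SIZE block. */
			void *p = vmemmap_alloc_block_buf(PMD_SIZE, node, altmap);

			if (p) {
				vmemmap_set_pmd(pmd, p, node, addr, next);
				continue;
			}
		} else if (vmemmap_check_pmd(pmd, node, addr, next)) {
			/* Already covered by a section mapping: nothing to do. */
			continue;
		}
		/* Otherwise fall back to base pages for this range. */
		if (vmemmap_populate_basepages(addr, next, node, altmap))
			return -ENOMEM;
	}

The already-populated branch is, in practice, only reached when a later
hotplug lands in a section whose memmap was partly populated before, and
patch #1 now routes hotplugged sections to base pages before this loop is
ever entered.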

Please share your comments and correct me if I'm mistaken :)


> 




* Re: [PATCH v2 2/2] arm64: mm: implement vmemmap_check_pmd for arm64
  2025-01-03  2:43         ` Zhenhua Huang
@ 2025-01-03 17:58           ` Catalin Marinas
  0 siblings, 0 replies; 25+ messages in thread
From: Catalin Marinas @ 2025-01-03 17:58 UTC (permalink / raw)
  To: Zhenhua Huang
  Cc: Anshuman Khandual, will, ardb, ryan.roberts, mark.rutland,
	joey.gouly, dave.hansen, akpm, chenfeiyang, chenhuacai, linux-mm,
	linux-arm-kernel, linux-kernel

On Fri, Jan 03, 2025 at 10:43:51AM +0800, Zhenhua Huang wrote:
> On 2025/1/3 2:12, Catalin Marinas wrote:
> > On Fri, Dec 27, 2024 at 08:27:18AM +0530, Anshuman Khandual wrote:
> > > On 12/21/24 00:05, Catalin Marinas wrote:
> > > > On Mon, Dec 09, 2024 at 05:42:27PM +0800, Zhenhua Huang wrote:
> > > > > diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
> > > > > index fd59ee44960e..41c7978a92be 100644
> > > > > --- a/arch/arm64/mm/mmu.c
> > > > > +++ b/arch/arm64/mm/mmu.c
> > > > > @@ -1169,7 +1169,8 @@ int __meminit vmemmap_check_pmd(pmd_t *pmdp, int node,
> > > > >   				unsigned long addr, unsigned long next)
> > > > >   {
> > > > >   	vmemmap_verify((pte_t *)pmdp, node, addr, next);
> > > > > -	return 1;
> > > > > +
> > > > > +	return pmd_sect(*pmdp);
> > > 
> > > Please change this to pmd_sect(READ_ONCE(*pmdp)) instead.
> > > 
> > > > >   }
> > > > >   int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node,
> > > > 
> > > > Don't we need this patch only if we implement the first one? Please fold
> > > > it into the other patch.
> > > 
> > > Seems like these patches might not be related.
> > > 
> > > While creating huge-page-based vmemmap mappings during vmemmap_populate_hugepages(),
> > > vmemmap_check_pmd() validates whether a populated (i.e. !pmd_none()) PMD already
> > > represents a huge mapping and can therefore be skipped.
> > > 
> > > The current implementation of vmemmap_check_pmd() on arm64 unconditionally returns 1,
> > > thus asserting that the given populated PMD entry is indeed a huge one, which will
> > > be the case unless something is wrong. vmemmap_verify() only ensures that the node
> > > from which the pfn was allocated is local.
> > > 
> > > int __meminit vmemmap_check_pmd(pmd_t *pmdp, int node,
> > >                                  unsigned long addr, unsigned long next)
> > > {
> > >          vmemmap_verify((pte_t *)pmdp, node, addr, next);
> > >          return 1;
> > > }
> > > 
> > > However, it does not actually check that the entry is a section mapping, which it
> > > should. Returning pmd_sect(READ_ONCE(*pmdp)) is the right thing to do and should
> > > have been the case from the beginning, when vmemmap_check_pmd() was added; I guess
> > > it was missed because arm64's original vmemmap_populate() also only relied on
> > > vmemmap_verify(). So this probably does not need a "Fixes: " tag.
> > 
> > I did not say the patch is wrong, only that it wouldn't be needed unless
> > we have the other patch in this series. However, if we do apply the
> > other patch, we definitely need this change, so keeping them together
> > would make it easier to backport.
> 
> Hi Catalin,
> 
> Based on our current discussion on patchset #1, we will prohibit
> hugepages(vmemmap mapping) for all hotplugging sections...The flow:
> vmemmap_populate
> 	vmemmap_populate_hugepages
> 		vmemmap_check_pmd
> 
> will *only* be called for non-early sections. Therefore, with patchset #1, I
> don't see the patch as essential.. Would it be acceptable if we do not
> backport this patch?  Anshuman's suggestion seems reasonable to me and I
> separated the patch out:
> https://lore.kernel.org/lkml/20250102074047.674156-1-quic_zhenhuah@quicinc.com/

Ah, ok, so if you only call vmemmap_populate_basepages() for hotplugged
memory, the vmemmap_check_pmd() won't even be called. So yeah, in this
case there won't be any dependency on this change. If we somehow end up
with a mix of vmemmap basepages and hugepages for hotplugged memory, we
probably need to update vmemmap_check_pmd() as well (and backport
together).

-- 
Catalin




Thread overview: 25+ messages
2024-12-09  9:42 [PATCH v2 0/2] Fix subsection vmemmap_populate logic Zhenhua Huang
2024-12-09  9:42 ` [PATCH v2 1/2] arm64: mm: vmemmap populate to page level if not section aligned Zhenhua Huang
2024-12-20 18:30   ` Catalin Marinas
2024-12-24  9:32     ` Zhenhua Huang
2024-12-24 14:09       ` Catalin Marinas
2024-12-25  9:59         ` Zhenhua Huang
2024-12-27  7:49         ` Anshuman Khandual
2024-12-30  7:48           ` Zhenhua Huang
2024-12-31  5:52             ` Zhenhua Huang
2025-01-02  3:16               ` Anshuman Khandual
2025-01-02  9:07                 ` Zhenhua Huang
2025-01-02  3:51             ` Anshuman Khandual
2025-01-02  9:13               ` Zhenhua Huang
2025-01-02 18:58           ` Catalin Marinas
2025-01-03  2:01             ` Zhenhua Huang
2024-12-09  9:42 ` [PATCH v2 2/2] arm64: mm: implement vmemmap_check_pmd for arm64 Zhenhua Huang
2024-12-20 18:35   ` Catalin Marinas
2024-12-27  2:57     ` Anshuman Khandual
2024-12-30  7:48       ` Zhenhua Huang
2024-12-31  6:59         ` Anshuman Khandual
2024-12-31  7:18           ` Zhenhua Huang
2025-01-02 18:12       ` Catalin Marinas
2025-01-03  2:43         ` Zhenhua Huang
2025-01-03 17:58           ` Catalin Marinas
2024-12-17  1:47 ` [PATCH v2 0/2] Fix subsection vmemmap_populate logic Zhenhua Huang
