* [PATCH v5 1/6] mm: memory: extend finish_fault() to support large folio
2024-06-11 10:11 [PATCH v5 0/6] add mTHP support for anonymous shmem Baolin Wang
@ 2024-06-11 10:11 ` Baolin Wang
2024-06-11 14:38 ` Zi Yan
2024-06-12 13:40 ` Kefeng Wang
2024-06-11 10:11 ` [PATCH v5 2/6] mm: shmem: add THP validation for PMD-mapped THP related statistics Baolin Wang
` (5 subsequent siblings)
6 siblings, 2 replies; 45+ messages in thread
From: Baolin Wang @ 2024-06-11 10:11 UTC (permalink / raw)
To: akpm, hughd
Cc: willy, david, wangkefeng.wang, ying.huang, 21cnbao, ryan.roberts,
shy828301, ziy, ioworker0, da.gomez, p.raghav, baolin.wang,
linux-mm, linux-kernel
Add large folio mapping establishment support for finish_fault() as a
preparation, to support multi-size THP allocation of anonymous shmem pages
in the following patches.
Keep the same behavior (per-page fault) for non-anon shmem to avoid inflating
the RSS unintentionally, and we can discuss what size of mapping to build
when extending mTHP to control non-anon shmem in the future.
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
---
mm/memory.c | 57 +++++++++++++++++++++++++++++++++++++++++++----------
1 file changed, 47 insertions(+), 10 deletions(-)
diff --git a/mm/memory.c b/mm/memory.c
index eef4e482c0c2..72775ee99ff3 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4831,9 +4831,12 @@ vm_fault_t finish_fault(struct vm_fault *vmf)
{
struct vm_area_struct *vma = vmf->vma;
struct page *page;
+ struct folio *folio;
vm_fault_t ret;
bool is_cow = (vmf->flags & FAULT_FLAG_WRITE) &&
!(vma->vm_flags & VM_SHARED);
+ int type, nr_pages;
+ unsigned long addr = vmf->address;
/* Did we COW the page? */
if (is_cow)
@@ -4864,24 +4867,58 @@ vm_fault_t finish_fault(struct vm_fault *vmf)
return VM_FAULT_OOM;
}
+ folio = page_folio(page);
+ nr_pages = folio_nr_pages(folio);
+
+ /*
+ * Using per-page fault to maintain the uffd semantics, and same
+ * approach also applies to non-anonymous-shmem faults to avoid
+ * inflating the RSS of the process.
+ */
+ if (!vma_is_anon_shmem(vma) || unlikely(userfaultfd_armed(vma))) {
+ nr_pages = 1;
+ } else if (nr_pages > 1) {
+ pgoff_t idx = folio_page_idx(folio, page);
+ /* The page offset of vmf->address within the VMA. */
+ pgoff_t vma_off = vmf->pgoff - vmf->vma->vm_pgoff;
+
+ /*
+ * Fallback to per-page fault in case the folio size in page
+ * cache beyond the VMA limits.
+ */
+ if (unlikely(vma_off < idx ||
+ vma_off + (nr_pages - idx) > vma_pages(vma))) {
+ nr_pages = 1;
+ } else {
+ /* Now we can set mappings for the whole large folio. */
+ addr = vmf->address - idx * PAGE_SIZE;
+ page = &folio->page;
+ }
+ }
+
vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd,
- vmf->address, &vmf->ptl);
+ addr, &vmf->ptl);
if (!vmf->pte)
return VM_FAULT_NOPAGE;
/* Re-check under ptl */
- if (likely(!vmf_pte_changed(vmf))) {
- struct folio *folio = page_folio(page);
- int type = is_cow ? MM_ANONPAGES : mm_counter_file(folio);
-
- set_pte_range(vmf, folio, page, 1, vmf->address);
- add_mm_counter(vma->vm_mm, type, 1);
- ret = 0;
- } else {
- update_mmu_tlb(vma, vmf->address, vmf->pte);
+ if (nr_pages == 1 && unlikely(vmf_pte_changed(vmf))) {
+ update_mmu_tlb(vma, addr, vmf->pte);
+ ret = VM_FAULT_NOPAGE;
+ goto unlock;
+ } else if (nr_pages > 1 && !pte_range_none(vmf->pte, nr_pages)) {
+ update_mmu_tlb_range(vma, addr, vmf->pte, nr_pages);
ret = VM_FAULT_NOPAGE;
+ goto unlock;
}
+ folio_ref_add(folio, nr_pages - 1);
+ set_pte_range(vmf, folio, page, nr_pages, addr);
+ type = is_cow ? MM_ANONPAGES : mm_counter_file(folio);
+ add_mm_counter(vma->vm_mm, type, nr_pages);
+ ret = 0;
+
+unlock:
pte_unmap_unlock(vmf->pte, vmf->ptl);
return ret;
}
--
2.39.3
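
To make the new mapping-range arithmetic easier to follow: the whole folio is
mapped only when it fits entirely inside the VMA; otherwise the code falls back
to mapping a single page. A minimal, runnable userspace sketch of that bounds
check (variable names mirror the kernel code, but everything here is purely
illustrative and assumes a 4K page size):

#include <stdbool.h>
#include <stdio.h>

#define PAGE_SIZE 4096UL

/*
 * Model of the check added to finish_fault(): map the whole folio only if
 * pages [vma_off - idx, vma_off - idx + nr_pages) lie inside the VMA.
 * vma_off is the page offset of the faulting address within the VMA and
 * idx is the offset of the faulting page within the folio.
 */
static bool folio_fits_vma(unsigned long vma_off, unsigned long idx,
                           unsigned long nr_pages, unsigned long vma_pages)
{
        return vma_off >= idx && vma_off + (nr_pages - idx) <= vma_pages;
}

int main(void)
{
        /* 16-page (64K) folio, fault on page 3 of the folio, 256-page VMA. */
        unsigned long nr_pages = 16, idx = 3, vma_pages = 256;
        unsigned long vma_off = 10;     /* faulting page 10 of the VMA */
        unsigned long fault_addr = 0x400000UL + vma_off * PAGE_SIZE;

        if (folio_fits_vma(vma_off, idx, nr_pages, vma_pages)) {
                /* Mirrors addr = vmf->address - idx * PAGE_SIZE above. */
                unsigned long addr = fault_addr - idx * PAGE_SIZE;
                printf("map %lu pages starting at %#lx\n", nr_pages, addr);
        } else {
                printf("fall back to mapping a single page\n");
        }
        return 0;
}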
^ permalink raw reply related [flat|nested] 45+ messages in thread
* Re: [PATCH v5 1/6] mm: memory: extend finish_fault() to support large folio
2024-06-11 10:11 ` [PATCH v5 1/6] mm: memory: extend finish_fault() to support large folio Baolin Wang
@ 2024-06-11 14:38 ` Zi Yan
2024-06-12 9:28 ` Baolin Wang
2024-06-12 13:40 ` Kefeng Wang
1 sibling, 1 reply; 45+ messages in thread
From: Zi Yan @ 2024-06-11 14:38 UTC (permalink / raw)
To: Baolin Wang
Cc: akpm, hughd, willy, david, wangkefeng.wang, ying.huang, 21cnbao,
ryan.roberts, shy828301, ioworker0, da.gomez, p.raghav, linux-mm,
linux-kernel
On 11 Jun 2024, at 6:11, Baolin Wang wrote:
> Add large folio mapping establishment support for finish_fault() as a
> preparation, to support multi-size THP allocation of anonymous shmem pages
> in the following patches.
>
> Keep the same behavior (per-page fault) for non-anon shmem to avoid inflating
> the RSS unintentionally, and we can discuss what size of mapping to build
> when extending mTHP to control non-anon shmem in the future.
>
> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
> ---
> mm/memory.c | 57 +++++++++++++++++++++++++++++++++++++++++++----------
> 1 file changed, 47 insertions(+), 10 deletions(-)
>
LGTM. Reviewed-by: Zi Yan <ziy@nvidia.com>
--
Best Regards,
Yan, Zi
^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH v5 1/6] mm: memory: extend finish_fault() to support large folio
2024-06-11 14:38 ` Zi Yan
@ 2024-06-12 9:28 ` Baolin Wang
0 siblings, 0 replies; 45+ messages in thread
From: Baolin Wang @ 2024-06-12 9:28 UTC (permalink / raw)
To: Zi Yan
Cc: akpm, hughd, willy, david, wangkefeng.wang, ying.huang, 21cnbao,
ryan.roberts, shy828301, ioworker0, da.gomez, p.raghav, linux-mm,
linux-kernel
On 2024/6/11 22:38, Zi Yan wrote:
> On 11 Jun 2024, at 6:11, Baolin Wang wrote:
>
>> Add large folio mapping establishment support for finish_fault() as a
>> preparation, to support multi-size THP allocation of anonymous shmem pages
>> in the following patches.
>>
>> Keep the same behavior (per-page fault) for non-anon shmem to avoid inflating
>> the RSS unintentionally, and we can discuss what size of mapping to build
>> when extending mTHP to control non-anon shmem in the future.
>>
>> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
>> ---
>> mm/memory.c | 57 +++++++++++++++++++++++++++++++++++++++++++----------
>> 1 file changed, 47 insertions(+), 10 deletions(-)
>>
> LGTM. Reviewed-by: Zi Yan <ziy@nvidia.com>
Thanks Zi for reviewing.
^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH v5 1/6] mm: memory: extend finish_fault() to support large folio
2024-06-11 10:11 ` [PATCH v5 1/6] mm: memory: extend finish_fault() to support large folio Baolin Wang
2024-06-11 14:38 ` Zi Yan
@ 2024-06-12 13:40 ` Kefeng Wang
2024-06-13 0:51 ` Baolin Wang
1 sibling, 1 reply; 45+ messages in thread
From: Kefeng Wang @ 2024-06-12 13:40 UTC (permalink / raw)
To: Baolin Wang, akpm, hughd
Cc: willy, david, ying.huang, 21cnbao, ryan.roberts, shy828301, ziy,
ioworker0, da.gomez, p.raghav, linux-mm, linux-kernel
On 2024/6/11 18:11, Baolin Wang wrote:
> Add large folio mapping establishment support for finish_fault() as a
> preparation, to support multi-size THP allocation of anonymous shmem pages
> in the following patches.
>
> Keep the same behavior (per-page fault) for non-anon shmem to avoid inflating
> the RSS unintentionally, and we can discuss what size of mapping to build
> when extending mTHP to control non-anon shmem in the future.
>
> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
> ---
> mm/memory.c | 57 +++++++++++++++++++++++++++++++++++++++++++----------
> 1 file changed, 47 insertions(+), 10 deletions(-)
>
> diff --git a/mm/memory.c b/mm/memory.c
> index eef4e482c0c2..72775ee99ff3 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -4831,9 +4831,12 @@ vm_fault_t finish_fault(struct vm_fault *vmf)
> {
> struct vm_area_struct *vma = vmf->vma;
> struct page *page;
> + struct folio *folio;
> vm_fault_t ret;
> bool is_cow = (vmf->flags & FAULT_FLAG_WRITE) &&
> !(vma->vm_flags & VM_SHARED);
> + int type, nr_pages;
> + unsigned long addr = vmf->address;
>
> /* Did we COW the page? */
> if (is_cow)
> @@ -4864,24 +4867,58 @@ vm_fault_t finish_fault(struct vm_fault *vmf)
> return VM_FAULT_OOM;
> }
>
> + folio = page_folio(page);
> + nr_pages = folio_nr_pages(folio);
> +
> + /*
> + * Using per-page fault to maintain the uffd semantics, and same
> + * approach also applies to non-anonymous-shmem faults to avoid
> + * inflating the RSS of the process.
> + */
> + if (!vma_is_anon_shmem(vma) || unlikely(userfaultfd_armed(vma))) {
> + nr_pages = 1;
> + } else if (nr_pages > 1) {
> + pgoff_t idx = folio_page_idx(folio, page);
> + /* The page offset of vmf->address within the VMA. */
> + pgoff_t vma_off = vmf->pgoff - vmf->vma->vm_pgoff;
> +
vma->vm_pgoff
> + /*
> + * Fallback to per-page fault in case the folio size in page
> + * cache beyond the VMA limits.
> + */
> + if (unlikely(vma_off < idx ||
> + vma_off + (nr_pages - idx) > vma_pages(vma))) {
> + nr_pages = 1;
> + } else {
> + /* Now we can set mappings for the whole large folio. */
> + addr = vmf->address - idx * PAGE_SIZE;
addr -= idx * PAGE_SIZE;
> + page = &folio->page;
> + }
> + }
> +
> vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd,
> - vmf->address, &vmf->ptl);
> + addr, &vmf->ptl);
no newline now,
> if (!vmf->pte)
> return VM_FAULT_NOPAGE;
>
> /* Re-check under ptl */
> - if (likely(!vmf_pte_changed(vmf))) {
> - struct folio *folio = page_folio(page);
> - int type = is_cow ? MM_ANONPAGES : mm_counter_file(folio);
> -
> - set_pte_range(vmf, folio, page, 1, vmf->address);
> - add_mm_counter(vma->vm_mm, type, 1);
> - ret = 0;
> - } else {
> - update_mmu_tlb(vma, vmf->address, vmf->pte);
> + if (nr_pages == 1 && unlikely(vmf_pte_changed(vmf))) {
> + update_mmu_tlb(vma, addr, vmf->pte);
> + ret = VM_FAULT_NOPAGE;
> + goto unlock;
> + } else if (nr_pages > 1 && !pte_range_none(vmf->pte, nr_pages)) {
> + update_mmu_tlb_range(vma, addr, vmf->pte, nr_pages);
> ret = VM_FAULT_NOPAGE;
> + goto unlock;
> }
We may add a vmf_pte_range_changed() helper, but that can be a separate patch.
Some very small nits, up to you,
Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
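
For reference, one possible shape for such a helper, simply folding together the
two re-checks used above (an untested sketch of the suggestion only, not code
from this series):

static bool vmf_pte_range_changed(struct vm_fault *vmf, int nr_pages)
{
        /* Single-page faults keep the existing orig_pte comparison. */
        if (nr_pages == 1)
                return vmf_pte_changed(vmf);

        /* For a large folio, any non-none PTE in the range means we raced. */
        return !pte_range_none(vmf->pte, nr_pages);
}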
>
> + folio_ref_add(folio, nr_pages - 1);
> + set_pte_range(vmf, folio, page, nr_pages, addr);
> + type = is_cow ? MM_ANONPAGES : mm_counter_file(folio);
> + add_mm_counter(vma->vm_mm, type, nr_pages);
> + ret = 0;
> +
> +unlock:
> pte_unmap_unlock(vmf->pte, vmf->ptl);
> return ret;
> }
^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH v5 1/6] mm: memory: extend finish_fault() to support large folio
2024-06-12 13:40 ` Kefeng Wang
@ 2024-06-13 0:51 ` Baolin Wang
0 siblings, 0 replies; 45+ messages in thread
From: Baolin Wang @ 2024-06-13 0:51 UTC (permalink / raw)
To: Kefeng Wang, akpm, hughd
Cc: willy, david, ying.huang, 21cnbao, ryan.roberts, shy828301, ziy,
ioworker0, da.gomez, p.raghav, linux-mm, linux-kernel
On 2024/6/12 21:40, Kefeng Wang wrote:
>
>
> On 2024/6/11 18:11, Baolin Wang wrote:
>> Add large folio mapping establishment support for finish_fault() as a
>> preparation, to support multi-size THP allocation of anonymous shmem
>> pages
>> in the following patches.
>>
>> Keep the same behavior (per-page fault) for non-anon shmem to avoid
>> inflating
>> the RSS unintentionally, and we can discuss what size of mapping to build
>> when extending mTHP to control non-anon shmem in the future.
>>
>> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
>> ---
>> mm/memory.c | 57 +++++++++++++++++++++++++++++++++++++++++++----------
>> 1 file changed, 47 insertions(+), 10 deletions(-)
>>
>> diff --git a/mm/memory.c b/mm/memory.c
>> index eef4e482c0c2..72775ee99ff3 100644
>> --- a/mm/memory.c
>> +++ b/mm/memory.c
>> @@ -4831,9 +4831,12 @@ vm_fault_t finish_fault(struct vm_fault *vmf)
>> {
>> struct vm_area_struct *vma = vmf->vma;
>> struct page *page;
>> + struct folio *folio;
>> vm_fault_t ret;
>> bool is_cow = (vmf->flags & FAULT_FLAG_WRITE) &&
>> !(vma->vm_flags & VM_SHARED);
>> + int type, nr_pages;
>> + unsigned long addr = vmf->address;
>> /* Did we COW the page? */
>> if (is_cow)
>> @@ -4864,24 +4867,58 @@ vm_fault_t finish_fault(struct vm_fault *vmf)
>> return VM_FAULT_OOM;
>> }
>> + folio = page_folio(page);
>> + nr_pages = folio_nr_pages(folio);
>> +
>> + /*
>> + * Using per-page fault to maintain the uffd semantics, and same
>> + * approach also applies to non-anonymous-shmem faults to avoid
>> + * inflating the RSS of the process.
>> + */
>> + if (!vma_is_anon_shmem(vma) || unlikely(userfaultfd_armed(vma))) {
>> + nr_pages = 1;
>> + } else if (nr_pages > 1) {
>> + pgoff_t idx = folio_page_idx(folio, page);
>> + /* The page offset of vmf->address within the VMA. */
>> + pgoff_t vma_off = vmf->pgoff - vmf->vma->vm_pgoff;
>> +
> vma->vm_pgoff
>
>> + /*
>> + * Fallback to per-page fault in case the folio size in page
>> + * cache beyond the VMA limits.
>> + */
>> + if (unlikely(vma_off < idx ||
>> + vma_off + (nr_pages - idx) > vma_pages(vma))) {
>> + nr_pages = 1;
>> + } else {
>> + /* Now we can set mappings for the whole large folio. */
>> + addr = vmf->address - idx * PAGE_SIZE;
>
> addr -= idx * PAGE_SIZE;
>
>> + page = &folio->page;
>> + }
>> + }
>> +
>> vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd,
>> - vmf->address, &vmf->ptl);
>> + addr, &vmf->ptl);
>
> no newline now,
>
>> if (!vmf->pte)
>> return VM_FAULT_NOPAGE;
>> /* Re-check under ptl */
>> - if (likely(!vmf_pte_changed(vmf))) {
>> - struct folio *folio = page_folio(page);
>> - int type = is_cow ? MM_ANONPAGES : mm_counter_file(folio);
>> -
>> - set_pte_range(vmf, folio, page, 1, vmf->address);
>> - add_mm_counter(vma->vm_mm, type, 1);
>> - ret = 0;
>> - } else {
>> - update_mmu_tlb(vma, vmf->address, vmf->pte);
>> + if (nr_pages == 1 && unlikely(vmf_pte_changed(vmf))) {
>> + update_mmu_tlb(vma, addr, vmf->pte);
>> + ret = VM_FAULT_NOPAGE;
>> + goto unlock;
>> + } else if (nr_pages > 1 && !pte_range_none(vmf->pte, nr_pages)) {
>> + update_mmu_tlb_range(vma, addr, vmf->pte, nr_pages);
>> ret = VM_FAULT_NOPAGE;
>> + goto unlock;
>> }
>
> We may add a vmf_pte_range_changed(), but separate it.
>
> Some very small nits, up to you,
>
> Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Thanks for reviewing. If a new version is needed, then I will clean up
these coding style issues.
^ permalink raw reply [flat|nested] 45+ messages in thread
* [PATCH v5 2/6] mm: shmem: add THP validation for PMD-mapped THP related statistics
2024-06-11 10:11 [PATCH v5 0/6] add mTHP support for anonymous shmem Baolin Wang
2024-06-11 10:11 ` [PATCH v5 1/6] mm: memory: extend finish_fault() to support large folio Baolin Wang
@ 2024-06-11 10:11 ` Baolin Wang
2024-06-11 10:11 ` [PATCH v5 3/6] mm: shmem: add multi-size THP sysfs interface for anonymous shmem Baolin Wang
` (4 subsequent siblings)
6 siblings, 0 replies; 45+ messages in thread
From: Baolin Wang @ 2024-06-11 10:11 UTC (permalink / raw)
To: akpm, hughd
Cc: willy, david, wangkefeng.wang, ying.huang, 21cnbao, ryan.roberts,
shy828301, ziy, ioworker0, da.gomez, p.raghav, baolin.wang,
linux-mm, linux-kernel
In order to extend support for mTHP, add THP validation for PMD-mapped THP
related statistics to avoid statistical confusion.
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Reviewed-by: Barry Song <v-songbaohua@oppo.com>
---
mm/shmem.c | 7 ++++---
1 file changed, 4 insertions(+), 3 deletions(-)
diff --git a/mm/shmem.c b/mm/shmem.c
index 6868c0af3a69..ae358efc397a 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1647,7 +1647,7 @@ static struct folio *shmem_alloc_and_add_folio(gfp_t gfp,
return ERR_PTR(-E2BIG);
folio = shmem_alloc_folio(gfp, HPAGE_PMD_ORDER, info, index);
- if (!folio)
+ if (!folio && pages == HPAGE_PMD_NR)
count_vm_event(THP_FILE_FALLBACK);
} else {
pages = 1;
@@ -1665,7 +1665,7 @@ static struct folio *shmem_alloc_and_add_folio(gfp_t gfp,
if (xa_find(&mapping->i_pages, &index,
index + pages - 1, XA_PRESENT)) {
error = -EEXIST;
- } else if (huge) {
+ } else if (pages == HPAGE_PMD_NR) {
count_vm_event(THP_FILE_FALLBACK);
count_vm_event(THP_FILE_FALLBACK_CHARGE);
}
@@ -2031,7 +2031,8 @@ static int shmem_get_folio_gfp(struct inode *inode, pgoff_t index,
folio = shmem_alloc_and_add_folio(huge_gfp,
inode, index, fault_mm, true);
if (!IS_ERR(folio)) {
- count_vm_event(THP_FILE_ALLOC);
+ if (folio_test_pmd_mappable(folio))
+ count_vm_event(THP_FILE_ALLOC);
goto alloced;
}
if (PTR_ERR(folio) == -EEXIST)
--
2.39.3
^ permalink raw reply related [flat|nested] 45+ messages in thread
* [PATCH v5 3/6] mm: shmem: add multi-size THP sysfs interface for anonymous shmem
2024-06-11 10:11 [PATCH v5 0/6] add mTHP support for anonymous shmem Baolin Wang
2024-06-11 10:11 ` [PATCH v5 1/6] mm: memory: extend finish_fault() to support large folio Baolin Wang
2024-06-11 10:11 ` [PATCH v5 2/6] mm: shmem: add THP validation for PMD-mapped THP related statistics Baolin Wang
@ 2024-06-11 10:11 ` Baolin Wang
2024-06-11 10:11 ` [PATCH v5 4/6] mm: shmem: add mTHP support " Baolin Wang
` (3 subsequent siblings)
6 siblings, 0 replies; 45+ messages in thread
From: Baolin Wang @ 2024-06-11 10:11 UTC (permalink / raw)
To: akpm, hughd
Cc: willy, david, wangkefeng.wang, ying.huang, 21cnbao, ryan.roberts,
shy828301, ziy, ioworker0, da.gomez, p.raghav, baolin.wang,
linux-mm, linux-kernel
To support the use of mTHP with anonymous shmem, add a new sysfs interface
'shmem_enabled' in the '/sys/kernel/mm/transparent_hugepage/hugepages-kB/'
directory for each mTHP to control whether shmem is enabled for that mTHP,
with a value similar to the top level 'shmem_enabled', which can be set to:
"always", "inherit (to inherit the top level setting)", "within_size", "advise",
"never". An 'inherit' option is added to ensure compatibility with these
global settings, and the options 'force' and 'deny' are dropped, which are
rather testing artifacts from the old ages.
By default, PMD-sized hugepages have enabled="inherit" and all other hugepage
sizes have enabled="never" for '/sys/kernel/mm/transparent_hugepage/hugepages-xxkB/shmem_enabled'.
In addition, if the top-level value is 'force', then only PMD-sized hugepages
can have enabled="inherit"; otherwise the configuration will fail, and vice versa.
That means we now avoid letting non-PMD-sized THP override the global
huge allocation.
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
---
Documentation/admin-guide/mm/transhuge.rst | 23 ++++++
include/linux/huge_mm.h | 10 +++
mm/huge_memory.c | 11 +--
mm/shmem.c | 96 ++++++++++++++++++++++
4 files changed, 132 insertions(+), 8 deletions(-)
diff --git a/Documentation/admin-guide/mm/transhuge.rst b/Documentation/admin-guide/mm/transhuge.rst
index d414d3f5592a..b76d15e408b3 100644
--- a/Documentation/admin-guide/mm/transhuge.rst
+++ b/Documentation/admin-guide/mm/transhuge.rst
@@ -332,6 +332,29 @@ deny
force
Force the huge option on for all - very useful for testing;
+Shmem can also use "multi-size THP" (mTHP) by adding a new sysfs knob to control
+mTHP allocation: '/sys/kernel/mm/transparent_hugepage/hugepages-<size>kB/shmem_enabled',
+and its value for each mTHP is essentially consistent with the global setting.
+An 'inherit' option is added to ensure compatibility with these global settings.
+Conversely, the options 'force' and 'deny' are dropped, which are rather testing
+artifacts from the old ages.
+always
+ Attempt to allocate <size> huge pages every time we need a new page;
+
+inherit
+ Inherit the top-level "shmem_enabled" value. By default, PMD-sized hugepages
+ have enabled="inherit" and all other hugepage sizes have enabled="never";
+
+never
+ Do not allocate <size> huge pages;
+
+within_size
+ Only allocate <size> huge page if it will be fully within i_size.
+ Also respect fadvise()/madvise() hints;
+
+advise
+ Only allocate <size> huge pages if requested with fadvise()/madvise();
+
Need of application restart
===========================
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 020e2344eb86..fac21548c5de 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -6,6 +6,7 @@
#include <linux/mm_types.h>
#include <linux/fs.h> /* only for vma_is_dax() */
+#include <linux/kobject.h>
vm_fault_t do_huge_pmd_anonymous_page(struct vm_fault *vmf);
int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
@@ -63,6 +64,7 @@ ssize_t single_hugepage_flag_show(struct kobject *kobj,
struct kobj_attribute *attr, char *buf,
enum transparent_hugepage_flag flag);
extern struct kobj_attribute shmem_enabled_attr;
+extern struct kobj_attribute thpsize_shmem_enabled_attr;
/*
* Mask of all large folio orders supported for anonymous THP; all orders up to
@@ -265,6 +267,14 @@ unsigned long thp_vma_allowable_orders(struct vm_area_struct *vma,
return __thp_vma_allowable_orders(vma, vm_flags, tva_flags, orders);
}
+struct thpsize {
+ struct kobject kobj;
+ struct list_head node;
+ int order;
+};
+
+#define to_thpsize(kobj) container_of(kobj, struct thpsize, kobj)
+
enum mthp_stat_item {
MTHP_STAT_ANON_FAULT_ALLOC,
MTHP_STAT_ANON_FAULT_FALLBACK,
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 8e49f402d7c7..1360a1903b66 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -449,14 +449,6 @@ static void thpsize_release(struct kobject *kobj);
static DEFINE_SPINLOCK(huge_anon_orders_lock);
static LIST_HEAD(thpsize_list);
-struct thpsize {
- struct kobject kobj;
- struct list_head node;
- int order;
-};
-
-#define to_thpsize(kobj) container_of(kobj, struct thpsize, kobj)
-
static ssize_t thpsize_enabled_show(struct kobject *kobj,
struct kobj_attribute *attr, char *buf)
{
@@ -517,6 +509,9 @@ static struct kobj_attribute thpsize_enabled_attr =
static struct attribute *thpsize_attrs[] = {
&thpsize_enabled_attr.attr,
+#ifdef CONFIG_SHMEM
+ &thpsize_shmem_enabled_attr.attr,
+#endif
NULL,
};
diff --git a/mm/shmem.c b/mm/shmem.c
index ae358efc397a..bb19ea2770e3 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -131,6 +131,14 @@ struct shmem_options {
#define SHMEM_SEEN_QUOTA 32
};
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+static unsigned long huge_shmem_orders_always __read_mostly;
+static unsigned long huge_shmem_orders_madvise __read_mostly;
+static unsigned long huge_shmem_orders_inherit __read_mostly;
+static unsigned long huge_shmem_orders_within_size __read_mostly;
+static DEFINE_SPINLOCK(huge_shmem_orders_lock);
+#endif
+
#ifdef CONFIG_TMPFS
static unsigned long shmem_default_max_blocks(void)
{
@@ -4672,6 +4680,12 @@ void __init shmem_init(void)
SHMEM_SB(shm_mnt->mnt_sb)->huge = shmem_huge;
else
shmem_huge = SHMEM_HUGE_NEVER; /* just in case it was patched */
+
+ /*
+ * Default to setting PMD-sized THP to inherit the global setting and
+ * disable all other multi-size THPs.
+ */
+ huge_shmem_orders_inherit = BIT(HPAGE_PMD_ORDER);
#endif
return;
@@ -4731,6 +4745,11 @@ static ssize_t shmem_enabled_store(struct kobject *kobj,
huge != SHMEM_HUGE_NEVER && huge != SHMEM_HUGE_DENY)
return -EINVAL;
+ /* Do not override huge allocation policy with non-PMD sized mTHP */
+ if (huge == SHMEM_HUGE_FORCE &&
+ huge_shmem_orders_inherit != BIT(HPAGE_PMD_ORDER))
+ return -EINVAL;
+
shmem_huge = huge;
if (shmem_huge > SHMEM_HUGE_DENY)
SHMEM_SB(shm_mnt->mnt_sb)->huge = shmem_huge;
@@ -4738,6 +4757,83 @@ static ssize_t shmem_enabled_store(struct kobject *kobj,
}
struct kobj_attribute shmem_enabled_attr = __ATTR_RW(shmem_enabled);
+
+static ssize_t thpsize_shmem_enabled_show(struct kobject *kobj,
+ struct kobj_attribute *attr, char *buf)
+{
+ int order = to_thpsize(kobj)->order;
+ const char *output;
+
+ if (test_bit(order, &huge_shmem_orders_always))
+ output = "[always] inherit within_size advise never";
+ else if (test_bit(order, &huge_shmem_orders_inherit))
+ output = "always [inherit] within_size advise never";
+ else if (test_bit(order, &huge_shmem_orders_within_size))
+ output = "always inherit [within_size] advise never";
+ else if (test_bit(order, &huge_shmem_orders_madvise))
+ output = "always inherit within_size [advise] never";
+ else
+ output = "always inherit within_size advise [never]";
+
+ return sysfs_emit(buf, "%s\n", output);
+}
+
+static ssize_t thpsize_shmem_enabled_store(struct kobject *kobj,
+ struct kobj_attribute *attr,
+ const char *buf, size_t count)
+{
+ int order = to_thpsize(kobj)->order;
+ ssize_t ret = count;
+
+ if (sysfs_streq(buf, "always")) {
+ spin_lock(&huge_shmem_orders_lock);
+ clear_bit(order, &huge_shmem_orders_inherit);
+ clear_bit(order, &huge_shmem_orders_madvise);
+ clear_bit(order, &huge_shmem_orders_within_size);
+ set_bit(order, &huge_shmem_orders_always);
+ spin_unlock(&huge_shmem_orders_lock);
+ } else if (sysfs_streq(buf, "inherit")) {
+ /* Do not override huge allocation policy with non-PMD sized mTHP */
+ if (shmem_huge == SHMEM_HUGE_FORCE &&
+ order != HPAGE_PMD_ORDER)
+ return -EINVAL;
+
+ spin_lock(&huge_shmem_orders_lock);
+ clear_bit(order, &huge_shmem_orders_always);
+ clear_bit(order, &huge_shmem_orders_madvise);
+ clear_bit(order, &huge_shmem_orders_within_size);
+ set_bit(order, &huge_shmem_orders_inherit);
+ spin_unlock(&huge_shmem_orders_lock);
+ } else if (sysfs_streq(buf, "within_size")) {
+ spin_lock(&huge_shmem_orders_lock);
+ clear_bit(order, &huge_shmem_orders_always);
+ clear_bit(order, &huge_shmem_orders_inherit);
+ clear_bit(order, &huge_shmem_orders_madvise);
+ set_bit(order, &huge_shmem_orders_within_size);
+ spin_unlock(&huge_shmem_orders_lock);
+ } else if (sysfs_streq(buf, "madvise")) {
+ spin_lock(&huge_shmem_orders_lock);
+ clear_bit(order, &huge_shmem_orders_always);
+ clear_bit(order, &huge_shmem_orders_inherit);
+ clear_bit(order, &huge_shmem_orders_within_size);
+ set_bit(order, &huge_shmem_orders_madvise);
+ spin_unlock(&huge_shmem_orders_lock);
+ } else if (sysfs_streq(buf, "never")) {
+ spin_lock(&huge_shmem_orders_lock);
+ clear_bit(order, &huge_shmem_orders_always);
+ clear_bit(order, &huge_shmem_orders_inherit);
+ clear_bit(order, &huge_shmem_orders_within_size);
+ clear_bit(order, &huge_shmem_orders_madvise);
+ spin_unlock(&huge_shmem_orders_lock);
+ } else {
+ ret = -EINVAL;
+ }
+
+ return ret;
+}
+
+struct kobj_attribute thpsize_shmem_enabled_attr =
+ __ATTR(shmem_enabled, 0644, thpsize_shmem_enabled_show, thpsize_shmem_enabled_store);
#endif /* CONFIG_TRANSPARENT_HUGEPAGE && CONFIG_SYSFS */
#else /* !CONFIG_SHMEM */
--
2.39.3
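
One detail worth calling out in the store handler above: each order lives in at
most one of the four policy bitmaps, and "never" simply means it is set in none
of them. A small, runnable userspace model of that invariant (plain C that
ignores the spinlock and the 'force'/'inherit' restriction; it is not the
kernel code):

#include <stdio.h>
#include <string.h>

/* Userspace stand-ins for the per-order policy bitmaps in the patch. */
static unsigned long orders_always, orders_inherit;
static unsigned long orders_within_size, orders_madvise;

/* Move @order into exactly one policy mask; "never" leaves it in none. */
static int set_shmem_policy(int order, const char *val)
{
        unsigned long bit = 1UL << order;

        orders_always &= ~bit;
        orders_inherit &= ~bit;
        orders_within_size &= ~bit;
        orders_madvise &= ~bit;

        if (!strcmp(val, "always"))
                orders_always |= bit;
        else if (!strcmp(val, "inherit"))
                orders_inherit |= bit;
        else if (!strcmp(val, "within_size"))
                orders_within_size |= bit;
        else if (!strcmp(val, "advise"))
                orders_madvise |= bit;
        else if (strcmp(val, "never"))
                return -1;      /* unknown keyword */
        return 0;
}

int main(void)
{
        set_shmem_policy(9, "inherit");         /* PMD order on 4K x86-64 */
        set_shmem_policy(4, "within_size");
        printf("always=%#lx inherit=%#lx within_size=%#lx madvise=%#lx\n",
               orders_always, orders_inherit, orders_within_size, orders_madvise);
        return 0;
}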
^ permalink raw reply related [flat|nested] 45+ messages in thread
* [PATCH v5 4/6] mm: shmem: add mTHP support for anonymous shmem
2024-06-11 10:11 [PATCH v5 0/6] add mTHP support for anonymous shmem Baolin Wang
` (2 preceding siblings ...)
2024-06-11 10:11 ` [PATCH v5 3/6] mm: shmem: add multi-size THP sysfs interface for anonymous shmem Baolin Wang
@ 2024-06-11 10:11 ` Baolin Wang
2024-07-03 17:25 ` Ryan Roberts
2024-06-11 10:11 ` [PATCH v5 5/6] mm: shmem: add mTHP size alignment in shmem_get_unmapped_area Baolin Wang
` (2 subsequent siblings)
6 siblings, 1 reply; 45+ messages in thread
From: Baolin Wang @ 2024-06-11 10:11 UTC (permalink / raw)
To: akpm, hughd
Cc: willy, david, wangkefeng.wang, ying.huang, 21cnbao, ryan.roberts,
shy828301, ziy, ioworker0, da.gomez, p.raghav, baolin.wang,
linux-mm, linux-kernel
Commit 19eaf44954df adds multi-size THP (mTHP) for anonymous pages, that
can allow THP to be configured through the sysfs interface located at
'/sys/kernel/mm/transparent_hugepage/hugepage-XXkb/enabled'.
However, the anonymous shmem will ignore the anonymous mTHP rule
configured through the sysfs interface, and can only use the PMD-mapped
THP, that is not reasonable. Users expect to apply the mTHP rule for
all anonymous pages, including the anonymous shmem, in order to enjoy
the benefits of mTHP. For example, lower latency than PMD-mapped THP,
smaller memory bloat than PMD-mapped THP, contiguous PTEs on ARM architecture
to reduce TLB miss etc. In addition, the mTHP interfaces can be extended
to support all shmem/tmpfs scenarios in the future, especially for the
shmem mmap() case.
The primary strategy is similar to supporting anonymous mTHP. Introduce
a new interface '/mm/transparent_hugepage/hugepage-XXkb/shmem_enabled',
which can have almost the same values as the top-level
'/sys/kernel/mm/transparent_hugepage/shmem_enabled', with adding a new
additional "inherit" option and dropping the testing options 'force' and
'deny'. By default all sizes will be set to "never" except PMD size,
which is set to "inherit". This ensures backward compatibility with the
anonymous shmem enabled of the top level, meanwhile also allows independent
control of anonymous shmem enabled for each mTHP.
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
---
include/linux/huge_mm.h | 10 +++
mm/shmem.c | 187 +++++++++++++++++++++++++++++++++-------
2 files changed, 167 insertions(+), 30 deletions(-)
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index fac21548c5de..909cfc67521d 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -575,6 +575,16 @@ static inline bool thp_migration_supported(void)
{
return false;
}
+
+static inline int highest_order(unsigned long orders)
+{
+ return 0;
+}
+
+static inline int next_order(unsigned long *orders, int prev)
+{
+ return 0;
+}
#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
static inline int split_folio_to_list_to_order(struct folio *folio,
diff --git a/mm/shmem.c b/mm/shmem.c
index bb19ea2770e3..e849c88452b2 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1611,6 +1611,107 @@ static gfp_t limit_gfp_mask(gfp_t huge_gfp, gfp_t limit_gfp)
return result;
}
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+static unsigned long shmem_allowable_huge_orders(struct inode *inode,
+ struct vm_area_struct *vma, pgoff_t index,
+ bool global_huge)
+{
+ unsigned long mask = READ_ONCE(huge_shmem_orders_always);
+ unsigned long within_size_orders = READ_ONCE(huge_shmem_orders_within_size);
+ unsigned long vm_flags = vma->vm_flags;
+ /*
+ * Check all the (large) orders below HPAGE_PMD_ORDER + 1 that
+ * are enabled for this vma.
+ */
+ unsigned long orders = BIT(PMD_ORDER + 1) - 1;
+ loff_t i_size;
+ int order;
+
+ if ((vm_flags & VM_NOHUGEPAGE) ||
+ test_bit(MMF_DISABLE_THP, &vma->vm_mm->flags))
+ return 0;
+
+ /* If the hardware/firmware marked hugepage support disabled. */
+ if (transparent_hugepage_flags & (1 << TRANSPARENT_HUGEPAGE_UNSUPPORTED))
+ return 0;
+
+ /*
+ * Following the 'deny' semantics of the top level, force the huge
+ * option off from all mounts.
+ */
+ if (shmem_huge == SHMEM_HUGE_DENY)
+ return 0;
+
+ /*
+ * Only allow inherit orders if the top-level value is 'force', which
+ * means non-PMD sized THP can not override 'huge' mount option now.
+ */
+ if (shmem_huge == SHMEM_HUGE_FORCE)
+ return READ_ONCE(huge_shmem_orders_inherit);
+
+ /* Allow mTHP that will be fully within i_size. */
+ order = highest_order(within_size_orders);
+ while (within_size_orders) {
+ index = round_up(index + 1, order);
+ i_size = round_up(i_size_read(inode), PAGE_SIZE);
+ if (i_size >> PAGE_SHIFT >= index) {
+ mask |= within_size_orders;
+ break;
+ }
+
+ order = next_order(&within_size_orders, order);
+ }
+
+ if (vm_flags & VM_HUGEPAGE)
+ mask |= READ_ONCE(huge_shmem_orders_madvise);
+
+ if (global_huge)
+ mask |= READ_ONCE(huge_shmem_orders_inherit);
+
+ return orders & mask;
+}
+
+static unsigned long shmem_suitable_orders(struct inode *inode, struct vm_fault *vmf,
+ struct address_space *mapping, pgoff_t index,
+ unsigned long orders)
+{
+ struct vm_area_struct *vma = vmf->vma;
+ unsigned long pages;
+ int order;
+
+ orders = thp_vma_suitable_orders(vma, vmf->address, orders);
+ if (!orders)
+ return 0;
+
+ /* Find the highest order that can add into the page cache */
+ order = highest_order(orders);
+ while (orders) {
+ pages = 1UL << order;
+ index = round_down(index, pages);
+ if (!xa_find(&mapping->i_pages, &index,
+ index + pages - 1, XA_PRESENT))
+ break;
+ order = next_order(&orders, order);
+ }
+
+ return orders;
+}
+#else
+static unsigned long shmem_allowable_huge_orders(struct inode *inode,
+ struct vm_area_struct *vma, pgoff_t index,
+ bool global_huge)
+{
+ return 0;
+}
+
+static unsigned long shmem_suitable_orders(struct inode *inode, struct vm_fault *vmf,
+ struct address_space *mapping, pgoff_t index,
+ unsigned long orders)
+{
+ return 0;
+}
+#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
+
static struct folio *shmem_alloc_folio(gfp_t gfp, int order,
struct shmem_inode_info *info, pgoff_t index)
{
@@ -1625,38 +1726,55 @@ static struct folio *shmem_alloc_folio(gfp_t gfp, int order,
return folio;
}
-static struct folio *shmem_alloc_and_add_folio(gfp_t gfp,
- struct inode *inode, pgoff_t index,
- struct mm_struct *fault_mm, bool huge)
+static struct folio *shmem_alloc_and_add_folio(struct vm_fault *vmf,
+ gfp_t gfp, struct inode *inode, pgoff_t index,
+ struct mm_struct *fault_mm, unsigned long orders)
{
struct address_space *mapping = inode->i_mapping;
struct shmem_inode_info *info = SHMEM_I(inode);
- struct folio *folio;
+ struct vm_area_struct *vma = vmf ? vmf->vma : NULL;
+ unsigned long suitable_orders = 0;
+ struct folio *folio = NULL;
long pages;
- int error;
+ int error, order;
if (!IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE))
- huge = false;
+ orders = 0;
- if (huge) {
- pages = HPAGE_PMD_NR;
- index = round_down(index, HPAGE_PMD_NR);
+ if (orders > 0) {
+ if (vma && vma_is_anon_shmem(vma)) {
+ suitable_orders = shmem_suitable_orders(inode, vmf,
+ mapping, index, orders);
+ } else if (orders & BIT(HPAGE_PMD_ORDER)) {
+ pages = HPAGE_PMD_NR;
+ suitable_orders = BIT(HPAGE_PMD_ORDER);
+ index = round_down(index, HPAGE_PMD_NR);
- /*
- * Check for conflict before waiting on a huge allocation.
- * Conflict might be that a huge page has just been allocated
- * and added to page cache by a racing thread, or that there
- * is already at least one small page in the huge extent.
- * Be careful to retry when appropriate, but not forever!
- * Elsewhere -EEXIST would be the right code, but not here.
- */
- if (xa_find(&mapping->i_pages, &index,
- index + HPAGE_PMD_NR - 1, XA_PRESENT))
- return ERR_PTR(-E2BIG);
+ /*
+ * Check for conflict before waiting on a huge allocation.
+ * Conflict might be that a huge page has just been allocated
+ * and added to page cache by a racing thread, or that there
+ * is already at least one small page in the huge extent.
+ * Be careful to retry when appropriate, but not forever!
+ * Elsewhere -EEXIST would be the right code, but not here.
+ */
+ if (xa_find(&mapping->i_pages, &index,
+ index + HPAGE_PMD_NR - 1, XA_PRESENT))
+ return ERR_PTR(-E2BIG);
+ }
- folio = shmem_alloc_folio(gfp, HPAGE_PMD_ORDER, info, index);
- if (!folio && pages == HPAGE_PMD_NR)
- count_vm_event(THP_FILE_FALLBACK);
+ order = highest_order(suitable_orders);
+ while (suitable_orders) {
+ pages = 1UL << order;
+ index = round_down(index, pages);
+ folio = shmem_alloc_folio(gfp, order, info, index);
+ if (folio)
+ goto allocated;
+
+ if (pages == HPAGE_PMD_NR)
+ count_vm_event(THP_FILE_FALLBACK);
+ order = next_order(&suitable_orders, order);
+ }
} else {
pages = 1;
folio = shmem_alloc_folio(gfp, 0, info, index);
@@ -1664,6 +1782,7 @@ static struct folio *shmem_alloc_and_add_folio(gfp_t gfp,
if (!folio)
return ERR_PTR(-ENOMEM);
+allocated:
__folio_set_locked(folio);
__folio_set_swapbacked(folio);
@@ -1958,7 +2077,8 @@ static int shmem_get_folio_gfp(struct inode *inode, pgoff_t index,
struct mm_struct *fault_mm;
struct folio *folio;
int error;
- bool alloced;
+ bool alloced, huge;
+ unsigned long orders = 0;
if (WARN_ON_ONCE(!shmem_mapping(inode->i_mapping)))
return -EINVAL;
@@ -2030,14 +2150,21 @@ static int shmem_get_folio_gfp(struct inode *inode, pgoff_t index,
return 0;
}
- if (shmem_is_huge(inode, index, false, fault_mm,
- vma ? vma->vm_flags : 0)) {
+ huge = shmem_is_huge(inode, index, false, fault_mm,
+ vma ? vma->vm_flags : 0);
+ /* Find hugepage orders that are allowed for anonymous shmem. */
+ if (vma && vma_is_anon_shmem(vma))
+ orders = shmem_allowable_huge_orders(inode, vma, index, huge);
+ else if (huge)
+ orders = BIT(HPAGE_PMD_ORDER);
+
+ if (orders > 0) {
gfp_t huge_gfp;
huge_gfp = vma_thp_gfp_mask(vma);
huge_gfp = limit_gfp_mask(huge_gfp, gfp);
- folio = shmem_alloc_and_add_folio(huge_gfp,
- inode, index, fault_mm, true);
+ folio = shmem_alloc_and_add_folio(vmf, huge_gfp,
+ inode, index, fault_mm, orders);
if (!IS_ERR(folio)) {
if (folio_test_pmd_mappable(folio))
count_vm_event(THP_FILE_ALLOC);
@@ -2047,7 +2174,7 @@ static int shmem_get_folio_gfp(struct inode *inode, pgoff_t index,
goto repeat;
}
- folio = shmem_alloc_and_add_folio(gfp, inode, index, fault_mm, false);
+ folio = shmem_alloc_and_add_folio(vmf, gfp, inode, index, fault_mm, 0);
if (IS_ERR(folio)) {
error = PTR_ERR(folio);
if (error == -EEXIST)
@@ -2058,7 +2185,7 @@ static int shmem_get_folio_gfp(struct inode *inode, pgoff_t index,
alloced:
alloced = true;
- if (folio_test_pmd_mappable(folio) &&
+ if (folio_test_large(folio) &&
DIV_ROUND_UP(i_size_read(inode), PAGE_SIZE) <
folio_next_index(folio) - 1) {
struct shmem_sb_info *sbinfo = SHMEM_SB(inode->i_sb);
--
2.39.3
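
The allocation and policy loops above walk a bitmask of enabled orders from the
highest order down. A small, runnable userspace model of that walk
(highest_order()/next_order() are re-implemented here with plain bit operations
purely for illustration; the kernel has its own definitions in huge_mm.h):

#include <stdio.h>

/* Highest set bit in a non-empty orders mask (models highest_order()). */
static int highest_order(unsigned long orders)
{
        return (int)(sizeof(unsigned long) * 8 - 1) - __builtin_clzl(orders);
}

/* Clear the previous order, return the next lower enabled one (models next_order()). */
static int next_order(unsigned long *orders, int prev)
{
        *orders &= ~(1UL << prev);
        return *orders ? highest_order(*orders) : 0;
}

int main(void)
{
        /* e.g. orders 9 (PMD order with 4K pages), 4 and 2 enabled. */
        unsigned long orders = (1UL << 9) | (1UL << 4) | (1UL << 2);
        int order = highest_order(orders);

        /* Same shape as the fallback loop in shmem_alloc_and_add_folio(). */
        while (orders) {
                printf("try order %d (%lu pages)\n", order, 1UL << order);
                order = next_order(&orders, order);
        }
        return 0;
}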
^ permalink raw reply related [flat|nested] 45+ messages in thread
* Re: [PATCH v5 4/6] mm: shmem: add mTHP support for anonymous shmem
2024-06-11 10:11 ` [PATCH v5 4/6] mm: shmem: add mTHP support " Baolin Wang
@ 2024-07-03 17:25 ` Ryan Roberts
2024-07-04 11:15 ` Baolin Wang
0 siblings, 1 reply; 45+ messages in thread
From: Ryan Roberts @ 2024-07-03 17:25 UTC (permalink / raw)
To: Baolin Wang, akpm, hughd
Cc: willy, david, wangkefeng.wang, ying.huang, 21cnbao, shy828301,
ziy, ioworker0, da.gomez, p.raghav, linux-mm, linux-kernel
On 11/06/2024 11:11, Baolin Wang wrote:
> Commit 19eaf44954df adds multi-size THP (mTHP) for anonymous pages, that
> can allow THP to be configured through the sysfs interface located at
> '/sys/kernel/mm/transparent_hugepage/hugepage-XXkb/enabled'.
>
> However, the anonymous shmem will ignore the anonymous mTHP rule
> configured through the sysfs interface, and can only use the PMD-mapped
> THP, that is not reasonable. Users expect to apply the mTHP rule for
> all anonymous pages, including the anonymous shmem, in order to enjoy
> the benefits of mTHP. For example, lower latency than PMD-mapped THP,
> smaller memory bloat than PMD-mapped THP, contiguous PTEs on ARM architecture
> to reduce TLB miss etc. In addition, the mTHP interfaces can be extended
> to support all shmem/tmpfs scenarios in the future, especially for the
> shmem mmap() case.
>
> The primary strategy is similar to supporting anonymous mTHP. Introduce
> a new interface '/mm/transparent_hugepage/hugepage-XXkb/shmem_enabled',
> which can have almost the same values as the top-level
> '/sys/kernel/mm/transparent_hugepage/shmem_enabled', with adding a new
> additional "inherit" option and dropping the testing options 'force' and
> 'deny'. By default all sizes will be set to "never" except PMD size,
> which is set to "inherit". This ensures backward compatibility with the
> anonymous shmem enabled of the top level, meanwhile also allows independent
> control of anonymous shmem enabled for each mTHP.
>
> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
[...]
Hi Baolin,
I'm currently trying to fix a bug where khugepaged is not started if only shmem
is enabled for THP. See discussion at [1]. It's been broken like this forever.
Assuming anon and shmem THP are initially both disabled:
# echo never > /sys/kernel/mm/transparent_hugepage/shmem_enabled
# echo never > /sys/kernel/mm/transparent_hugepage/enabled
<khugepaged is stopped here>
Then shemem THP is enabled:
# echo always > /sys/kernel/mm/transparent_hugepage/shmem_enabled
<khugepaged is not started, this is a bug>
As part of investigating the fix, I stumbled upon this patch, which I remember
reviewing an early version of but I've been out for a while and missed the
latter versions. See below for comments and questions; the answers to which will
help me figure out how to fix the above...
[1] https://lore.kernel.org/linux-mm/20240702144617.2291480-1-ryan.roberts@arm.com/
>
> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> +static unsigned long shmem_allowable_huge_orders(struct inode *inode,
> + struct vm_area_struct *vma, pgoff_t index,
> + bool global_huge)
> +{
> + unsigned long mask = READ_ONCE(huge_shmem_orders_always);
> + unsigned long within_size_orders = READ_ONCE(huge_shmem_orders_within_size);
> + unsigned long vm_flags = vma->vm_flags;
> + /*
> + * Check all the (large) orders below HPAGE_PMD_ORDER + 1 that
> + * are enabled for this vma.
> + */
> + unsigned long orders = BIT(PMD_ORDER + 1) - 1;
> + loff_t i_size;
> + int order;
> +
> + if ((vm_flags & VM_NOHUGEPAGE) ||
> + test_bit(MMF_DISABLE_THP, &vma->vm_mm->flags))
> + return 0;
> +
> + /* If the hardware/firmware marked hugepage support disabled. */
> + if (transparent_hugepage_flags & (1 << TRANSPARENT_HUGEPAGE_UNSUPPORTED))
> + return 0;
> +
> + /*
> + * Following the 'deny' semantics of the top level, force the huge
> + * option off from all mounts.
> + */
> + if (shmem_huge == SHMEM_HUGE_DENY)
> + return 0;
I don't quite get this; I don't think it's the desirable behaviour. This is
saying that if the top-level shmem_enabled control is set to 'deny', then all
mTHP sizes, regardless of their control's setting, are disabled?
The anon controls don't work like that; you can set the top-level control to
anything you like, but it's only consumed if the per-size controls are inheriting it.
So for the deny case, wouldn't it be better to allow that as an option on all
the per-size controls (and implicitly let it be inherited from the top-level)?
> +
> + /*
> + * Only allow inherit orders if the top-level value is 'force', which
> + * means non-PMD sized THP can not override 'huge' mount option now.
> + */
> + if (shmem_huge == SHMEM_HUGE_FORCE)
> + return READ_ONCE(huge_shmem_orders_inherit);
I vaguely recall that we originally discussed that trying to set 'force' on the
top level control while any per-size controls were set to 'inherit' would be an
error, and trying to set 'force' on any per-size control except the PMD-size
would be an error?
I don't really understand this logic. Shouldn't we just be looking at the
per-size control settings (or the top-level control as a proxy for every
per-size control that has 'inherit' set)?
Then for tmpfs, which doesn't support non-PMD-sizes yet, we just always use the
PMD-size control for decisions.
I'm also really struggling with the concept of shmem_is_huge() existing
alongside shmem_allowable_huge_orders(). Surely this all needs to be refactored into
shmem_allowable_huge_orders()?
> +
> + /* Allow mTHP that will be fully within i_size. */
> + order = highest_order(within_size_orders);
> + while (within_size_orders) {
> + index = round_up(index + 1, order);
> + i_size = round_up(i_size_read(inode), PAGE_SIZE);
> + if (i_size >> PAGE_SHIFT >= index) {
> + mask |= within_size_orders;
> + break;
> + }
> +
> + order = next_order(&within_size_orders, order);
> + }
> +
> + if (vm_flags & VM_HUGEPAGE)
> + mask |= READ_ONCE(huge_shmem_orders_madvise);
> +
> + if (global_huge)
Perhaps I've misunderstood global_huge, but I think it's just the return value
from shmem_is_huge()? But you're also using shmem_huge directly in this
function. I find it all rather confusing.
Sorry if all this was discussed and agreed in my absence! But I think this needs
to be sorted in order to do the bug fix for khugepaged properly.
Thanks,
Ryan
> + mask |= READ_ONCE(huge_shmem_orders_inherit);
> +
> + return orders & mask;
> +}
> +
> +static unsigned long shmem_suitable_orders(struct inode *inode, struct vm_fault *vmf,
> + struct address_space *mapping, pgoff_t index,
> + unsigned long orders)
> +{
> + struct vm_area_struct *vma = vmf->vma;
> + unsigned long pages;
> + int order;
> +
> + orders = thp_vma_suitable_orders(vma, vmf->address, orders);
> + if (!orders)
> + return 0;
> +
> + /* Find the highest order that can add into the page cache */
> + order = highest_order(orders);
> + while (orders) {
> + pages = 1UL << order;
> + index = round_down(index, pages);
> + if (!xa_find(&mapping->i_pages, &index,
> + index + pages - 1, XA_PRESENT))
> + break;
> + order = next_order(&orders, order);
> + }
> +
> + return orders;
> +}
> +#else
> +static unsigned long shmem_allowable_huge_orders(struct inode *inode,
> + struct vm_area_struct *vma, pgoff_t index,
> + bool global_huge)
> +{
> + return 0;
> +}
> +
> +static unsigned long shmem_suitable_orders(struct inode *inode, struct vm_fault *vmf,
> + struct address_space *mapping, pgoff_t index,
> + unsigned long orders)
> +{
> + return 0;
> +}
> +#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
> +
> static struct folio *shmem_alloc_folio(gfp_t gfp, int order,
> struct shmem_inode_info *info, pgoff_t index)
> {
> @@ -1625,38 +1726,55 @@ static struct folio *shmem_alloc_folio(gfp_t gfp, int order,
> return folio;
> }
>
> -static struct folio *shmem_alloc_and_add_folio(gfp_t gfp,
> - struct inode *inode, pgoff_t index,
> - struct mm_struct *fault_mm, bool huge)
> +static struct folio *shmem_alloc_and_add_folio(struct vm_fault *vmf,
> + gfp_t gfp, struct inode *inode, pgoff_t index,
> + struct mm_struct *fault_mm, unsigned long orders)
> {
> struct address_space *mapping = inode->i_mapping;
> struct shmem_inode_info *info = SHMEM_I(inode);
> - struct folio *folio;
> + struct vm_area_struct *vma = vmf ? vmf->vma : NULL;
> + unsigned long suitable_orders = 0;
> + struct folio *folio = NULL;
> long pages;
> - int error;
> + int error, order;
>
> if (!IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE))
> - huge = false;
> + orders = 0;
>
> - if (huge) {
> - pages = HPAGE_PMD_NR;
> - index = round_down(index, HPAGE_PMD_NR);
> + if (orders > 0) {
> + if (vma && vma_is_anon_shmem(vma)) {
> + suitable_orders = shmem_suitable_orders(inode, vmf,
> + mapping, index, orders);
> + } else if (orders & BIT(HPAGE_PMD_ORDER)) {
> + pages = HPAGE_PMD_NR;
> + suitable_orders = BIT(HPAGE_PMD_ORDER);
> + index = round_down(index, HPAGE_PMD_NR);
>
> - /*
> - * Check for conflict before waiting on a huge allocation.
> - * Conflict might be that a huge page has just been allocated
> - * and added to page cache by a racing thread, or that there
> - * is already at least one small page in the huge extent.
> - * Be careful to retry when appropriate, but not forever!
> - * Elsewhere -EEXIST would be the right code, but not here.
> - */
> - if (xa_find(&mapping->i_pages, &index,
> - index + HPAGE_PMD_NR - 1, XA_PRESENT))
> - return ERR_PTR(-E2BIG);
> + /*
> + * Check for conflict before waiting on a huge allocation.
> + * Conflict might be that a huge page has just been allocated
> + * and added to page cache by a racing thread, or that there
> + * is already at least one small page in the huge extent.
> + * Be careful to retry when appropriate, but not forever!
> + * Elsewhere -EEXIST would be the right code, but not here.
> + */
> + if (xa_find(&mapping->i_pages, &index,
> + index + HPAGE_PMD_NR - 1, XA_PRESENT))
> + return ERR_PTR(-E2BIG);
> + }
>
> - folio = shmem_alloc_folio(gfp, HPAGE_PMD_ORDER, info, index);
> - if (!folio && pages == HPAGE_PMD_NR)
> - count_vm_event(THP_FILE_FALLBACK);
> + order = highest_order(suitable_orders);
> + while (suitable_orders) {
> + pages = 1UL << order;
> + index = round_down(index, pages);
> + folio = shmem_alloc_folio(gfp, order, info, index);
> + if (folio)
> + goto allocated;
> +
> + if (pages == HPAGE_PMD_NR)
> + count_vm_event(THP_FILE_FALLBACK);
> + order = next_order(&suitable_orders, order);
> + }
> } else {
> pages = 1;
> folio = shmem_alloc_folio(gfp, 0, info, index);
> @@ -1664,6 +1782,7 @@ static struct folio *shmem_alloc_and_add_folio(gfp_t gfp,
> if (!folio)
> return ERR_PTR(-ENOMEM);
>
> +allocated:
> __folio_set_locked(folio);
> __folio_set_swapbacked(folio);
>
> @@ -1958,7 +2077,8 @@ static int shmem_get_folio_gfp(struct inode *inode, pgoff_t index,
> struct mm_struct *fault_mm;
> struct folio *folio;
> int error;
> - bool alloced;
> + bool alloced, huge;
> + unsigned long orders = 0;
>
> if (WARN_ON_ONCE(!shmem_mapping(inode->i_mapping)))
> return -EINVAL;
> @@ -2030,14 +2150,21 @@ static int shmem_get_folio_gfp(struct inode *inode, pgoff_t index,
> return 0;
> }
>
> - if (shmem_is_huge(inode, index, false, fault_mm,
> - vma ? vma->vm_flags : 0)) {
> + huge = shmem_is_huge(inode, index, false, fault_mm,
> + vma ? vma->vm_flags : 0);
> + /* Find hugepage orders that are allowed for anonymous shmem. */
> + if (vma && vma_is_anon_shmem(vma))
> + orders = shmem_allowable_huge_orders(inode, vma, index, huge);
> + else if (huge)
> + orders = BIT(HPAGE_PMD_ORDER);
> +
> + if (orders > 0) {
> gfp_t huge_gfp;
>
> huge_gfp = vma_thp_gfp_mask(vma);
> huge_gfp = limit_gfp_mask(huge_gfp, gfp);
> - folio = shmem_alloc_and_add_folio(huge_gfp,
> - inode, index, fault_mm, true);
> + folio = shmem_alloc_and_add_folio(vmf, huge_gfp,
> + inode, index, fault_mm, orders);
> if (!IS_ERR(folio)) {
> if (folio_test_pmd_mappable(folio))
> count_vm_event(THP_FILE_ALLOC);
> @@ -2047,7 +2174,7 @@ static int shmem_get_folio_gfp(struct inode *inode, pgoff_t index,
> goto repeat;
> }
>
> - folio = shmem_alloc_and_add_folio(gfp, inode, index, fault_mm, false);
> + folio = shmem_alloc_and_add_folio(vmf, gfp, inode, index, fault_mm, 0);
> if (IS_ERR(folio)) {
> error = PTR_ERR(folio);
> if (error == -EEXIST)
> @@ -2058,7 +2185,7 @@ static int shmem_get_folio_gfp(struct inode *inode, pgoff_t index,
>
> alloced:
> alloced = true;
> - if (folio_test_pmd_mappable(folio) &&
> + if (folio_test_large(folio) &&
> DIV_ROUND_UP(i_size_read(inode), PAGE_SIZE) <
> folio_next_index(folio) - 1) {
> struct shmem_sb_info *sbinfo = SHMEM_SB(inode->i_sb);
^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH v5 4/6] mm: shmem: add mTHP support for anonymous shmem
2024-07-03 17:25 ` Ryan Roberts
@ 2024-07-04 11:15 ` Baolin Wang
2024-07-04 13:58 ` Ryan Roberts
2024-07-04 14:46 ` Bang Li
0 siblings, 2 replies; 45+ messages in thread
From: Baolin Wang @ 2024-07-04 11:15 UTC (permalink / raw)
To: Ryan Roberts, akpm, hughd
Cc: willy, david, wangkefeng.wang, ying.huang, 21cnbao, shy828301,
ziy, ioworker0, da.gomez, p.raghav, linux-mm, linux-kernel
On 2024/7/4 01:25, Ryan Roberts wrote:
> On 11/06/2024 11:11, Baolin Wang wrote:
>> Commit 19eaf44954df adds multi-size THP (mTHP) for anonymous pages, that
>> can allow THP to be configured through the sysfs interface located at
>> '/sys/kernel/mm/transparent_hugepage/hugepage-XXkb/enabled'.
>>
>> However, the anonymous shmem will ignore the anonymous mTHP rule
>> configured through the sysfs interface, and can only use the PMD-mapped
>> THP, that is not reasonable. Users expect to apply the mTHP rule for
>> all anonymous pages, including the anonymous shmem, in order to enjoy
>> the benefits of mTHP. For example, lower latency than PMD-mapped THP,
>> smaller memory bloat than PMD-mapped THP, contiguous PTEs on ARM architecture
>> to reduce TLB miss etc. In addition, the mTHP interfaces can be extended
>> to support all shmem/tmpfs scenarios in the future, especially for the
>> shmem mmap() case.
>>
>> The primary strategy is similar to supporting anonymous mTHP. Introduce
>> a new interface '/mm/transparent_hugepage/hugepage-XXkb/shmem_enabled',
>> which can have almost the same values as the top-level
>> '/sys/kernel/mm/transparent_hugepage/shmem_enabled', with adding a new
>> additional "inherit" option and dropping the testing options 'force' and
>> 'deny'. By default all sizes will be set to "never" except PMD size,
>> which is set to "inherit". This ensures backward compatibility with the
>> anonymous shmem enabled of the top level, meanwhile also allows independent
>> control of anonymous shmem enabled for each mTHP.
>>
>> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
>
> [...]
>
> Hi Baolin,
>
> I'm currently trying to fix a bug where khugepaged is not started if only shmem
> is enabled for THP. See discussion at [1]. It's been broken like this forever.
>
> Assuming anon and shmem THP are initially both disabled:
>
> # echo never > /sys/kernel/mm/transparent_hugepage/shmem_enabled
> # echo never > /sys/kernel/mm/transparent_hugepage/enabled
> <khugepaged is stopped here>
>
> Then shemem THP is enabled:
>
> # echo always > /sys/kernel/mm/transparent_hugepage/shmem_enabled
> <khugepaged is not started, this is a bug>
Thanks Ryan. Yes, this is a real problem.
> As part of investigating the fix, I stumbled upon this patch, which I remember
> reviewing an early version of but I've been out for a while and missed the
> latter versions. See below for comments and questions; the answers to which will
> help me figure out how to fix the above...
>
> [1] https://lore.kernel.org/linux-mm/20240702144617.2291480-1-ryan.roberts@arm.com/
>
>
>>
>> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
>> +static unsigned long shmem_allowable_huge_orders(struct inode *inode,
>> + struct vm_area_struct *vma, pgoff_t index,
>> + bool global_huge)
>> +{
>> + unsigned long mask = READ_ONCE(huge_shmem_orders_always);
>> + unsigned long within_size_orders = READ_ONCE(huge_shmem_orders_within_size);
>> + unsigned long vm_flags = vma->vm_flags;
>> + /*
>> + * Check all the (large) orders below HPAGE_PMD_ORDER + 1 that
>> + * are enabled for this vma.
>> + */
>> + unsigned long orders = BIT(PMD_ORDER + 1) - 1;
>> + loff_t i_size;
>> + int order;
>> +
>> + if ((vm_flags & VM_NOHUGEPAGE) ||
>> + test_bit(MMF_DISABLE_THP, &vma->vm_mm->flags))
>> + return 0;
>> +
>> + /* If the hardware/firmware marked hugepage support disabled. */
>> + if (transparent_hugepage_flags & (1 << TRANSPARENT_HUGEPAGE_UNSUPPORTED))
>> + return 0;
>> +
>> + /*
>> + * Following the 'deny' semantics of the top level, force the huge
>> + * option off from all mounts.
>> + */
>> + if (shmem_huge == SHMEM_HUGE_DENY)
>> + return 0;
>
> I don't quite get this; I don't think it's the desirable behaviour. This is
> saying that if the top-level shmem_enabled control is set to 'deny', then all
> mTHP sizes, regardless of their control's setting, are disabled?
>
> The anon controls don't work like that; you can set the top-level control to
> anything you like, but it's only consumed if the per-size controls are inheriting it.
IMO, the 'deny' option is not the same as the 'never' option.
>
> So for the deny case, wouldn't it be better to allow that as an option on all
> the per-size controls (and implicitly let it be inherited from the top-level)?
Going by the 'deny' option's semantics:
"disables huge on shm_mnt and all mounts, for emergency use;"
it has been used since the old days for testing, to shut down all huge
pages. Moreover, the mTHP interfaces will be used as a huge-order filter to
support tmpfs, so I think it will make life easier to disable all huge
orders for testing or emergency use, as well as maintaining the original
semantics.
>> +
>> + /*
>> + * Only allow inherit orders if the top-level value is 'force', which
>> + * means non-PMD sized THP can not override 'huge' mount option now.
>> + */
>> + if (shmem_huge == SHMEM_HUGE_FORCE)
>> + return READ_ONCE(huge_shmem_orders_inherit);
>
> I vaguely recall that we originally discussed that trying to set 'force' on the
> top level control while any per-size controls were set to 'inherit' would be an
> error, and trying to set 'force' on any per-size control except the PMD-size
> would be an error?
Right.
> I don't really understand this logic. Shouldn't we just be looking at the
> per-size control settings (or the top-level control as a proxy for every
> per-size control that has 'inherit' set)?
'force' will apply the huge orders for anon shmem and tmpfs, so for now we
only allow PMD-mapped THP to be forced. We should not look at the per-size
control settings for tmpfs yet (mTHP for tmpfs will be discussed in the future).
>
> Then for tmpfs, which doesn't support non-PMD-sizes yet, we just always use the
> PMD-size control for decisions.
>
> I'm also really struggling with the concept of shmem_is_huge() existing
> alongside shmem_allowable_huge_orders(). Surely this all needs to be refactored into
> shmem_allowable_huge_orders()?
Understood. But for now they serve different purposes: shmem_is_huge()
is used to check the top-level huge setting, for *tmpfs* and
anon shmem; whereas shmem_allowable_huge_orders() is only used to
check the per-size huge orders for anon shmem (excluding tmpfs for now).
However, as I plan to add mTHP support for tmpfs, I think we can do
some cleanups then.
>> + /* Allow mTHP that will be fully within i_size. */
>> + order = highest_order(within_size_orders);
>> + while (within_size_orders) {
>> + index = round_up(index + 1, order);
>> + i_size = round_up(i_size_read(inode), PAGE_SIZE);
>> + if (i_size >> PAGE_SHIFT >= index) {
>> + mask |= within_size_orders;
>> + break;
>> + }
>> +
>> + order = next_order(&within_size_orders, order);
>> + }
>> +
>> + if (vm_flags & VM_HUGEPAGE)
>> + mask |= READ_ONCE(huge_shmem_orders_madvise);
>> +
>> + if (global_huge)
>
> Perhaps I've misunderstood global_huge, but I think it's just the return value
> from shmem_is_huge()? But you're also using shmem_huge directly in this
Yes.
> function. I find it all rather confusing.
I think I have explained above why these logics are needed. Since mTHP
support for shmem has just started (tmpfs is still in progress), I will
make it clearer in the following patches.
> Sorry if all this was discussed and agreed in my absence! But I think this needs
> to be sorted in order to do the bug fix for khugepaged properly.
No worries. Thanks for your input.
^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH v5 4/6] mm: shmem: add mTHP support for anonymous shmem
2024-07-04 11:15 ` Baolin Wang
@ 2024-07-04 13:58 ` Ryan Roberts
2024-07-04 14:01 ` David Hildenbrand
` (2 more replies)
2024-07-04 14:46 ` Bang Li
1 sibling, 3 replies; 45+ messages in thread
From: Ryan Roberts @ 2024-07-04 13:58 UTC (permalink / raw)
To: Baolin Wang, akpm, hughd
Cc: willy, david, wangkefeng.wang, ying.huang, 21cnbao, shy828301,
ziy, ioworker0, da.gomez, p.raghav, linux-mm, linux-kernel
On 04/07/2024 12:15, Baolin Wang wrote:
>
>
> On 2024/7/4 01:25, Ryan Roberts wrote:
>> On 11/06/2024 11:11, Baolin Wang wrote:
>>> Commit 19eaf44954df adds multi-size THP (mTHP) for anonymous pages, that
>>> can allow THP to be configured through the sysfs interface located at
>>> '/sys/kernel/mm/transparent_hugepage/hugepage-XXkb/enabled'.
>>>
>>> However, the anonymous shmem will ignore the anonymous mTHP rule
>>> configured through the sysfs interface, and can only use the PMD-mapped
>>> THP, that is not reasonable. Users expect to apply the mTHP rule for
>>> all anonymous pages, including the anonymous shmem, in order to enjoy
>>> the benefits of mTHP. For example, lower latency than PMD-mapped THP,
>>> smaller memory bloat than PMD-mapped THP, contiguous PTEs on ARM architecture
>>> to reduce TLB miss etc. In addition, the mTHP interfaces can be extended
>>> to support all shmem/tmpfs scenarios in the future, especially for the
>>> shmem mmap() case.
>>>
>>> The primary strategy is similar to supporting anonymous mTHP. Introduce
>>> a new interface '/mm/transparent_hugepage/hugepage-XXkb/shmem_enabled',
>>> which can have almost the same values as the top-level
>>> '/sys/kernel/mm/transparent_hugepage/shmem_enabled', with adding a new
>>> additional "inherit" option and dropping the testing options 'force' and
>>> 'deny'. By default all sizes will be set to "never" except PMD size,
>>> which is set to "inherit". This ensures backward compatibility with the
>>> anonymous shmem enabled of the top level, meanwhile also allows independent
>>> control of anonymous shmem enabled for each mTHP.
>>>
>>> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
>>
>> [...]
>>
>> Hi Baolin,
>>
>> I'm currently trying to fix a bug where khugepaged is not started if only shmem
>> is enabled for THP. See discussion at [1]. It's been broken like this forever.
>>
>> Assuming anon and shmem THP are initially both disabled:
>>
>> # echo never > /sys/kernel/mm/transparent_hugepage/shmem_enabled
>> # echo never > /sys/kernel/mm/transparent_hugepage/enabled
>> <khugepaged is stopped here>
>>
>> Then shemem THP is enabled:
>>
>> # echo always > /sys/kernel/mm/transparent_hugepage/shmem_enabled
>> <khugepaged is not started, this is a bug>
>
> Thanks Ryan. Yes, this is a real problem.
>
>> As part of investigating the fix, I stumbled upon this patch, which I remember
>> reviewing an early version of but I've been out for a while and missed the
>> latter versions. See below for comments and questions; the answers to which will
>> help me figure out how to fix the above...
>>
>> [1]
>> https://lore.kernel.org/linux-mm/20240702144617.2291480-1-ryan.roberts@arm.com/
>>
>>
>>> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
>>> +static unsigned long shmem_allowable_huge_orders(struct inode *inode,
>>> + struct vm_area_struct *vma, pgoff_t index,
>>> + bool global_huge)
>>> +{
>>> + unsigned long mask = READ_ONCE(huge_shmem_orders_always);
>>> + unsigned long within_size_orders =
>>> READ_ONCE(huge_shmem_orders_within_size);
>>> + unsigned long vm_flags = vma->vm_flags;
>>> + /*
>>> + * Check all the (large) orders below HPAGE_PMD_ORDER + 1 that
>>> + * are enabled for this vma.
>>> + */
>>> + unsigned long orders = BIT(PMD_ORDER + 1) - 1;
>>> + loff_t i_size;
>>> + int order;
>>> +
>>> + if ((vm_flags & VM_NOHUGEPAGE) ||
>>> + test_bit(MMF_DISABLE_THP, &vma->vm_mm->flags))
>>> + return 0;
>>> +
>>> + /* If the hardware/firmware marked hugepage support disabled. */
>>> + if (transparent_hugepage_flags & (1 << TRANSPARENT_HUGEPAGE_UNSUPPORTED))
>>> + return 0;
>>> +
>>> + /*
>>> + * Following the 'deny' semantics of the top level, force the huge
>>> + * option off from all mounts.
>>> + */
>>> + if (shmem_huge == SHMEM_HUGE_DENY)
>>> + return 0;
>>
>> I don't quite get this, I don't think its the desirable behaviour. This is
>> saying that if the top-level shemem_enabled control is set to 'deny', then all
>> mTHP sizes, regardless of their control's setting are disabled?
>>
>> The anon controls don't work like that; you can set the top-level control to
>> anything you like, but its only consumed if the per-size controls are
>> inheriting it.
>
> IMO, 'deny' option is not similar like 'never' option.
>
>>
>> So for the deny case, wouldn't it be better to allow that as an option on all
>> the per-size controls (and implicitly let it be inherrited from the top-level)?
>
> From 'deny' option's semantics:
> "disables huge on shm_mnt and all mounts, for emergency use;"
Right. Today, tmpfs only supports PMD-sized THP. So for all per-size controls
except the PMD-size control, 'deny' and 'never' would be the same practically
speaking. For the PMD-size control, 'deny' would disable THP for both anon shmem
and all tmpfs mounts, whereas 'never' would only disable THP for anon shmem.
When tmpfs gains mTHP support, 'deny' in the other per-size controls would also
disable that size for the tmpfs mounts.
I disagree with the current implementation, where a top-level 'deny'
overrides whatever is in the per-size controls, simply because it's
different to the model implemented for anon THP. That's my 2 cents. If
nobody else agrees then that's ok - I'll fix the above bug according to
the current model.
>
> It is usually used for testing to shut down all huge pages from the old ages.
> Moreover, mTHP interfaces will be used as a huge order filter to support tmpfs,
> so I think it will make life easier to disable all huge orders for testing or
> emergency use, as well as to maintain the original semantics.>
>>> +
>>> + /*
>>> + * Only allow inherit orders if the top-level value is 'force', which
>>> + * means non-PMD sized THP can not override 'huge' mount option now.
>>> + */
>>> + if (shmem_huge == SHMEM_HUGE_FORCE)
>>> + return READ_ONCE(huge_shmem_orders_inherit);
>>
>> I vaguely recall that we originally discussed that trying to set 'force' on the
>> top level control while any per-size controls were set to 'inherit' would be an
>> error, and trying to set 'force' on any per-size control except the PMD-size
>> would be an error?
>
> Right.
>
>> I don't really understand this logic. Shouldn't we just be looking at the
>> per-size control settings (or the top-level control as a proxy for every
>> per-size control that has 'inherit' set)?
>
> ‘force’ will apply the huge orders for anon shmem and tmpfs, so now we only
> allow pmd-mapped THP to be forced. We should not look at per-size control
> settings for tmpfs now (mTHP for tmpfs will be discussed in future).
But why not just maintain per-size controls and refactor tmpfs to just look at
the PMD-size THP control for now. It can ignore the other sizes. That's much
simpler and cleaner IMHO.
>
>>
>> Then for tmpfs, which doesn't support non-PMD-sizes yet, we just always use the
>> PMD-size control for decisions.
>>
>> I'm also really struggling with the concept of shmem_is_huge() existing along
>> side shmem_allowable_huge_orders(). Surely this needs to all be refactored into
>> shmem_allowable_huge_orders()?
>
> I understood. But now they serve different purposes: shmem_is_huge() will be
> used to check the huge orders for the top level, for *tmpfs* and anon shmem;
> whereas shmem_allowable_huge_orders() will only be used to check the per-size
> huge orders for anon shmem (excluding tmpfs now). However, as I plan to add mTHP
> support for tmpfs, I think we can perform some cleanups.
>
>>> + /* Allow mTHP that will be fully within i_size. */
>>> + order = highest_order(within_size_orders);
>>> + while (within_size_orders) {
>>> + index = round_up(index + 1, order);
>>> + i_size = round_up(i_size_read(inode), PAGE_SIZE);
>>> + if (i_size >> PAGE_SHIFT >= index) {
>>> + mask |= within_size_orders;
>>> + break;
>>> + }
>>> +
>>> + order = next_order(&within_size_orders, order);
>>> + }
>>> +
>>> + if (vm_flags & VM_HUGEPAGE)
>>> + mask |= READ_ONCE(huge_shmem_orders_madvise);
>>> +
>>> + if (global_huge)
>>
>> Perhaps I've misunderstood global_huge, but I think its just the return value
>> from shmem_is_huge()? But you're also using shmem_huge directly in this
>
> Yes.
>
>> function. I find it all rather confusing.
>
> I think I have explained why need these logics as above. Since mTHP support for
> shmem has just started (tmpfs is still in progress). I will make it more clear
> in the following patches.
OK as long as you have a plan for the clean up, that's good enough for me.
>
>> Sorry if all this was discussed and agreed in my absence! But I think this needs
>> to be sorted in order to do the bug fix for khugepaged properly.
>
> No worries. Thanks for your input.
^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH v5 4/6] mm: shmem: add mTHP support for anonymous shmem
2024-07-04 13:58 ` Ryan Roberts
@ 2024-07-04 14:01 ` David Hildenbrand
2024-07-04 15:05 ` Bang Li
2024-07-05 2:56 ` Baolin Wang
2 siblings, 0 replies; 45+ messages in thread
From: David Hildenbrand @ 2024-07-04 14:01 UTC (permalink / raw)
To: Ryan Roberts, Baolin Wang, akpm, hughd
Cc: willy, wangkefeng.wang, ying.huang, 21cnbao, shy828301, ziy,
ioworker0, da.gomez, p.raghav, linux-mm, linux-kernel
On 04.07.24 15:58, Ryan Roberts wrote:
> On 04/07/2024 12:15, Baolin Wang wrote:
>>
>>
>> On 2024/7/4 01:25, Ryan Roberts wrote:
>>> On 11/06/2024 11:11, Baolin Wang wrote:
>>>> Commit 19eaf44954df adds multi-size THP (mTHP) for anonymous pages, that
>>>> can allow THP to be configured through the sysfs interface located at
>>>> '/sys/kernel/mm/transparent_hugepage/hugepage-XXkb/enabled'.
>>>>
>>>> However, the anonymous shmem will ignore the anonymous mTHP rule
>>>> configured through the sysfs interface, and can only use the PMD-mapped
>>>> THP, that is not reasonable. Users expect to apply the mTHP rule for
>>>> all anonymous pages, including the anonymous shmem, in order to enjoy
>>>> the benefits of mTHP. For example, lower latency than PMD-mapped THP,
>>>> smaller memory bloat than PMD-mapped THP, contiguous PTEs on ARM architecture
>>>> to reduce TLB miss etc. In addition, the mTHP interfaces can be extended
>>>> to support all shmem/tmpfs scenarios in the future, especially for the
>>>> shmem mmap() case.
>>>>
>>>> The primary strategy is similar to supporting anonymous mTHP. Introduce
>>>> a new interface '/mm/transparent_hugepage/hugepage-XXkb/shmem_enabled',
>>>> which can have almost the same values as the top-level
>>>> '/sys/kernel/mm/transparent_hugepage/shmem_enabled', with adding a new
>>>> additional "inherit" option and dropping the testing options 'force' and
>>>> 'deny'. By default all sizes will be set to "never" except PMD size,
>>>> which is set to "inherit". This ensures backward compatibility with the
>>>> anonymous shmem enabled of the top level, meanwhile also allows independent
>>>> control of anonymous shmem enabled for each mTHP.
>>>>
>>>> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
>>>
>>> [...]
>>>
>>> Hi Baolin,
>>>
>>> I'm currently trying to fix a bug where khugepaged is not started if only shmem
>>> is enabled for THP. See discussion at [1]. It's been broken like this forever.
>>>
>>> Assuming anon and shmem THP are initially both disabled:
>>>
>>> # echo never > /sys/kernel/mm/transparent_hugepage/shmem_enabled
>>> # echo never > /sys/kernel/mm/transparent_hugepage/enabled
>>> <khugepaged is stopped here>
>>>
>>> Then shemem THP is enabled:
>>>
>>> # echo always > /sys/kernel/mm/transparent_hugepage/shmem_enabled
>>> <khugepaged is not started, this is a bug>
>>
>> Thanks Ryan. Yes, this is a real problem.
>>
>>> As part of investigating the fix, I stumbled upon this patch, which I remember
>>> reviewing an early version of but I've been out for a while and missed the
>>> latter versions. See below for comments and questions; the answers to which will
>>> help me figure out how to fix the above...
>>>
>>> [1]
>>> https://lore.kernel.org/linux-mm/20240702144617.2291480-1-ryan.roberts@arm.com/
>>>
>>>
>>>> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
>>>> +static unsigned long shmem_allowable_huge_orders(struct inode *inode,
>>>> + struct vm_area_struct *vma, pgoff_t index,
>>>> + bool global_huge)
>>>> +{
>>>> + unsigned long mask = READ_ONCE(huge_shmem_orders_always);
>>>> + unsigned long within_size_orders =
>>>> READ_ONCE(huge_shmem_orders_within_size);
>>>> + unsigned long vm_flags = vma->vm_flags;
>>>> + /*
>>>> + * Check all the (large) orders below HPAGE_PMD_ORDER + 1 that
>>>> + * are enabled for this vma.
>>>> + */
>>>> + unsigned long orders = BIT(PMD_ORDER + 1) - 1;
>>>> + loff_t i_size;
>>>> + int order;
>>>> +
>>>> + if ((vm_flags & VM_NOHUGEPAGE) ||
>>>> + test_bit(MMF_DISABLE_THP, &vma->vm_mm->flags))
>>>> + return 0;
>>>> +
>>>> + /* If the hardware/firmware marked hugepage support disabled. */
>>>> + if (transparent_hugepage_flags & (1 << TRANSPARENT_HUGEPAGE_UNSUPPORTED))
>>>> + return 0;
>>>> +
>>>> + /*
>>>> + * Following the 'deny' semantics of the top level, force the huge
>>>> + * option off from all mounts.
>>>> + */
>>>> + if (shmem_huge == SHMEM_HUGE_DENY)
>>>> + return 0;
>>>
>>> I don't quite get this, I don't think its the desirable behaviour. This is
>>> saying that if the top-level shemem_enabled control is set to 'deny', then all
>>> mTHP sizes, regardless of their control's setting are disabled?
>>>
>>> The anon controls don't work like that; you can set the top-level control to
>>> anything you like, but its only consumed if the per-size controls are
>>> inheriting it.
>>
>> IMO, 'deny' option is not similar like 'never' option.
>>
>>>
>>> So for the deny case, wouldn't it be better to allow that as an option on all
>>> the per-size controls (and implicitly let it be inherrited from the top-level)?
>>
>> From 'deny' option's semantics:
>> "disables huge on shm_mnt and all mounts, for emergency use;"
>
> Right. Today, tmpfs only supports PMD-sized THP. So for all per-size controls
> except the PMD-size control, 'deny' and 'never' would be the same practically
> speaking. For the PMD-size control, 'deny' would disable THP for both anon shmem
> and all tmpfs mounts, whereas 'never' would only disable THP for anon shmem.
> When tmpfs gains mTHP support, 'deny' in the other per-size controls would also
> disable that size for the tmpfs mounts.
>
> I disagree with the current implementation where setting it up so that a
> top-level 'deny' overrides whatever is in the per-size controls simply because
> it's different to the model implemented for anon THP. That's my 2 cents. If
> nobody else agrees then that ok - I'll fix the above bug according to the
> current model.
IIRC, Hugh said that deny+force are rather legacy artifacts that we
don't want on sub-controls (and likely we cannot remove them). I agree
they shouldn't overwrite other toggles. The models we use should match
better.
--
Cheers,
David / dhildenb
^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH v5 4/6] mm: shmem: add mTHP support for anonymous shmem
2024-07-04 13:58 ` Ryan Roberts
2024-07-04 14:01 ` David Hildenbrand
@ 2024-07-04 15:05 ` Bang Li
2024-07-04 15:54 ` Ryan Roberts
2024-07-05 2:56 ` Baolin Wang
2 siblings, 1 reply; 45+ messages in thread
From: Bang Li @ 2024-07-04 15:05 UTC (permalink / raw)
To: Ryan Roberts, Baolin Wang, akpm, hughd
Cc: willy, david, wangkefeng.wang, ying.huang, 21cnbao, shy828301,
ziy, ioworker0, da.gomez, p.raghav, linux-mm, linux-kernel
Hey Ryan,
On 2024/7/4 21:58, Ryan Roberts wrote:
>>> Then for tmpfs, which doesn't support non-PMD-sizes yet, we just always use the
>>> PMD-size control for decisions.
>>>
>>> I'm also really struggling with the concept of shmem_is_huge() existing along
>>> side shmem_allowable_huge_orders(). Surely this needs to all be refactored into
>>> shmem_allowable_huge_orders()?
>> I understood. But now they serve different purposes: shmem_is_huge() will be
>> used to check the huge orders for the top level, for*tmpfs* and anon shmem;
>> whereas shmem_allowable_huge_orders() will only be used to check the per-size
>> huge orders for anon shmem (excluding tmpfs now). However, as I plan to add mTHP
>> support for tmpfs, I think we can perform some cleanups.
>>
>>>> + /* Allow mTHP that will be fully within i_size. */
>>>> + order = highest_order(within_size_orders);
>>>> + while (within_size_orders) {
>>>> + index = round_up(index + 1, order);
>>>> + i_size = round_up(i_size_read(inode), PAGE_SIZE);
>>>> + if (i_size >> PAGE_SHIFT >= index) {
>>>> + mask |= within_size_orders;
>>>> + break;
>>>> + }
>>>> +
>>>> + order = next_order(&within_size_orders, order);
>>>> + }
>>>> +
>>>> + if (vm_flags & VM_HUGEPAGE)
>>>> + mask |= READ_ONCE(huge_shmem_orders_madvise);
>>>> +
>>>> + if (global_huge)
>>> Perhaps I've misunderstood global_huge, but I think its just the return value
>>> from shmem_is_huge()? But you're also using shmem_huge directly in this
>> Yes.
>>
>>> function. I find it all rather confusing.
>> I think I have explained why need these logics as above. Since mTHP support for
>> shmem has just started (tmpfs is still in progress). I will make it more clear
>> in the following patches.
> OK as long as you have a plan for the clean up, that's good enough for me.
Can I continue to push the following patch [1]? When other types of
shmem mTHP are supported, we will perform cleanups uniformly.
[1]
https://lore.kernel.org/linux-mm/20240702023401.41553-1-libang.li@antgroup.com/
Thanks,
Bang
^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH v5 4/6] mm: shmem: add mTHP support for anonymous shmem
2024-07-04 15:05 ` Bang Li
@ 2024-07-04 15:54 ` Ryan Roberts
0 siblings, 0 replies; 45+ messages in thread
From: Ryan Roberts @ 2024-07-04 15:54 UTC (permalink / raw)
To: Bang Li, Baolin Wang, akpm, hughd
Cc: willy, david, wangkefeng.wang, ying.huang, 21cnbao, shy828301,
ziy, ioworker0, da.gomez, p.raghav, linux-mm, linux-kernel
On 04/07/2024 16:05, Bang Li wrote:
> Hey Ryan,
>
> On 2024/7/4 21:58, Ryan Roberts wrote:
>>>> Then for tmpfs, which doesn't support non-PMD-sizes yet, we just always use the
>>>> PMD-size control for decisions.
>>>>
>>>> I'm also really struggling with the concept of shmem_is_huge() existing along
>>>> side shmem_allowable_huge_orders(). Surely this needs to all be refactored into
>>>> shmem_allowable_huge_orders()?
>>> I understood. But now they serve different purposes: shmem_is_huge() will be
>>> used to check the huge orders for the top level, for*tmpfs* and anon shmem;
>>> whereas shmem_allowable_huge_orders() will only be used to check the per-size
>>> huge orders for anon shmem (excluding tmpfs now). However, as I plan to add mTHP
>>> support for tmpfs, I think we can perform some cleanups.
>>>
>>>>> + /* Allow mTHP that will be fully within i_size. */
>>>>> + order = highest_order(within_size_orders);
>>>>> + while (within_size_orders) {
>>>>> + index = round_up(index + 1, order);
>>>>> + i_size = round_up(i_size_read(inode), PAGE_SIZE);
>>>>> + if (i_size >> PAGE_SHIFT >= index) {
>>>>> + mask |= within_size_orders;
>>>>> + break;
>>>>> + }
>>>>> +
>>>>> + order = next_order(&within_size_orders, order);
>>>>> + }
>>>>> +
>>>>> + if (vm_flags & VM_HUGEPAGE)
>>>>> + mask |= READ_ONCE(huge_shmem_orders_madvise);
>>>>> +
>>>>> + if (global_huge)
>>>> Perhaps I've misunderstood global_huge, but I think its just the return value
>>>> from shmem_is_huge()? But you're also using shmem_huge directly in this
>>> Yes.
>>>
>>>> function. I find it all rather confusing.
>>> I think I have explained why need these logics as above. Since mTHP support for
>>> shmem has just started (tmpfs is still in progress). I will make it more clear
>>> in the following patches.
>> OK as long as you have a plan for the clean up, that's good enough for me.
>
> Can I continue to push the following patch [1]? When other types of shmem mTHP
> are supported, we will perform cleanups uniformly.
I guess that makes sense.
>
> [1] https://lore.kernel.org/linux-mm/20240702023401.41553-1-libang.li@antgroup.com/
>
> Thanks,
> Bang
^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH v5 4/6] mm: shmem: add mTHP support for anonymous shmem
2024-07-04 13:58 ` Ryan Roberts
2024-07-04 14:01 ` David Hildenbrand
2024-07-04 15:05 ` Bang Li
@ 2024-07-05 2:56 ` Baolin Wang
2024-07-05 8:55 ` Ryan Roberts
2 siblings, 1 reply; 45+ messages in thread
From: Baolin Wang @ 2024-07-05 2:56 UTC (permalink / raw)
To: Ryan Roberts, akpm, hughd
Cc: willy, david, wangkefeng.wang, ying.huang, 21cnbao, shy828301,
ziy, ioworker0, da.gomez, p.raghav, linux-mm, linux-kernel
On 2024/7/4 21:58, Ryan Roberts wrote:
> On 04/07/2024 12:15, Baolin Wang wrote:
>>
>>
>> On 2024/7/4 01:25, Ryan Roberts wrote:
>>> On 11/06/2024 11:11, Baolin Wang wrote:
>>>> Commit 19eaf44954df adds multi-size THP (mTHP) for anonymous pages, that
>>>> can allow THP to be configured through the sysfs interface located at
>>>> '/sys/kernel/mm/transparent_hugepage/hugepage-XXkb/enabled'.
>>>>
>>>> However, the anonymous shmem will ignore the anonymous mTHP rule
>>>> configured through the sysfs interface, and can only use the PMD-mapped
>>>> THP, that is not reasonable. Users expect to apply the mTHP rule for
>>>> all anonymous pages, including the anonymous shmem, in order to enjoy
>>>> the benefits of mTHP. For example, lower latency than PMD-mapped THP,
>>>> smaller memory bloat than PMD-mapped THP, contiguous PTEs on ARM architecture
>>>> to reduce TLB miss etc. In addition, the mTHP interfaces can be extended
>>>> to support all shmem/tmpfs scenarios in the future, especially for the
>>>> shmem mmap() case.
>>>>
>>>> The primary strategy is similar to supporting anonymous mTHP. Introduce
>>>> a new interface '/mm/transparent_hugepage/hugepage-XXkb/shmem_enabled',
>>>> which can have almost the same values as the top-level
>>>> '/sys/kernel/mm/transparent_hugepage/shmem_enabled', with adding a new
>>>> additional "inherit" option and dropping the testing options 'force' and
>>>> 'deny'. By default all sizes will be set to "never" except PMD size,
>>>> which is set to "inherit". This ensures backward compatibility with the
>>>> anonymous shmem enabled of the top level, meanwhile also allows independent
>>>> control of anonymous shmem enabled for each mTHP.
>>>>
>>>> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
>>>
>>> [...]
>>>
>>> Hi Baolin,
>>>
>>> I'm currently trying to fix a bug where khugepaged is not started if only shmem
>>> is enabled for THP. See discussion at [1]. It's been broken like this forever.
>>>
>>> Assuming anon and shmem THP are initially both disabled:
>>>
>>> # echo never > /sys/kernel/mm/transparent_hugepage/shmem_enabled
>>> # echo never > /sys/kernel/mm/transparent_hugepage/enabled
>>> <khugepaged is stopped here>
>>>
>>> Then shemem THP is enabled:
>>>
>>> # echo always > /sys/kernel/mm/transparent_hugepage/shmem_enabled
>>> <khugepaged is not started, this is a bug>
>>
>> Thanks Ryan. Yes, this is a real problem.
>>
>>> As part of investigating the fix, I stumbled upon this patch, which I remember
>>> reviewing an early version of but I've been out for a while and missed the
>>> latter versions. See below for comments and questions; the answers to which will
>>> help me figure out how to fix the above...
>>>
>>> [1]
>>> https://lore.kernel.org/linux-mm/20240702144617.2291480-1-ryan.roberts@arm.com/
>>>
>>>
>>>> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
>>>> +static unsigned long shmem_allowable_huge_orders(struct inode *inode,
>>>> + struct vm_area_struct *vma, pgoff_t index,
>>>> + bool global_huge)
>>>> +{
>>>> + unsigned long mask = READ_ONCE(huge_shmem_orders_always);
>>>> + unsigned long within_size_orders =
>>>> READ_ONCE(huge_shmem_orders_within_size);
>>>> + unsigned long vm_flags = vma->vm_flags;
>>>> + /*
>>>> + * Check all the (large) orders below HPAGE_PMD_ORDER + 1 that
>>>> + * are enabled for this vma.
>>>> + */
>>>> + unsigned long orders = BIT(PMD_ORDER + 1) - 1;
>>>> + loff_t i_size;
>>>> + int order;
>>>> +
>>>> + if ((vm_flags & VM_NOHUGEPAGE) ||
>>>> + test_bit(MMF_DISABLE_THP, &vma->vm_mm->flags))
>>>> + return 0;
>>>> +
>>>> + /* If the hardware/firmware marked hugepage support disabled. */
>>>> + if (transparent_hugepage_flags & (1 << TRANSPARENT_HUGEPAGE_UNSUPPORTED))
>>>> + return 0;
>>>> +
>>>> + /*
>>>> + * Following the 'deny' semantics of the top level, force the huge
>>>> + * option off from all mounts.
>>>> + */
>>>> + if (shmem_huge == SHMEM_HUGE_DENY)
>>>> + return 0;
>>>
>>> I don't quite get this, I don't think its the desirable behaviour. This is
>>> saying that if the top-level shemem_enabled control is set to 'deny', then all
>>> mTHP sizes, regardless of their control's setting are disabled?
>>>
>>> The anon controls don't work like that; you can set the top-level control to
>>> anything you like, but its only consumed if the per-size controls are
>>> inheriting it.
>>
>> IMO, 'deny' option is not similar like 'never' option.
>>
>>>
>>> So for the deny case, wouldn't it be better to allow that as an option on all
>>> the per-size controls (and implicitly let it be inherrited from the top-level)?
>>
>> From 'deny' option's semantics:
>> "disables huge on shm_mnt and all mounts, for emergency use;"
>
> Right. Today, tmpfs only supports PMD-sized THP. So for all per-size controls
> except the PMD-size control, 'deny' and 'never' would be the same practically
> speaking. For the PMD-size control, 'deny' would disable THP for both anon shmem
> and all tmpfs mounts, whereas 'never' would only disable THP for anon shmem.
> When tmpfs gains mTHP support, 'deny' in the other per-size controls would also
> disable that size for the tmpfs mounts.
Not really. We will not add the 'deny' and 'force' testing options for
each per-size mTHP control, as suggested by Hugh in the previous MM
meeting[1].
[1]
https://lore.kernel.org/all/f1783ff0-65bd-4b2b-8952-52b6822a0835@redhat.com/T/#u
> I disagree with the current implementation where setting it up so that a
> top-level 'deny' overrides whatever is in the per-size controls simply because
> it's different to the model implemented for anon THP. That's my 2 cents. If
> nobody else agrees then that ok - I'll fix the above bug according to the
> current model.
I remember we have customers who use the 'deny' option to forcibly turn
off the huge page option for all mounts (including shm_mnt). But if
shmem/tmpfs uses mTHP and users set 'deny', they cannot force all huge
orders off; they must also do 'echo never >
/sys/kernel/mm/transparent_hugepage/hugepages-XXXkB/shmem_enabled' to
disable each huge order, which breaks the previous user habits, IMHO.
In addition, I do not think this creates a difference with the anon mTHP
model, as anon mTHP does not have these shmem-special 'deny' and 'force'
options for testing purposes. In my view, the current 'deny' option
fulfills the semantic definition of 'For use in emergencies, to force
the *huge* option off from all mounts'.
>> It is usually used for testing to shut down all huge pages from the old ages.
>> Moreover, mTHP interfaces will be used as a huge order filter to support tmpfs,
>> so I think it will make life easier to disable all huge orders for testing or
>> emergency use, as well as to maintain the original semantics.>
>>>> +
>>>> + /*
>>>> + * Only allow inherit orders if the top-level value is 'force', which
>>>> + * means non-PMD sized THP can not override 'huge' mount option now.
>>>> + */
>>>> + if (shmem_huge == SHMEM_HUGE_FORCE)
>>>> + return READ_ONCE(huge_shmem_orders_inherit);
>>>
>>> I vaguely recall that we originally discussed that trying to set 'force' on the
>>> top level control while any per-size controls were set to 'inherit' would be an
>>> error, and trying to set 'force' on any per-size control except the PMD-size
>>> would be an error?
>>
>> Right.
>>
>>> I don't really understand this logic. Shouldn't we just be looking at the
>>> per-size control settings (or the top-level control as a proxy for every
>>> per-size control that has 'inherit' set)?
>>
>> ‘force’ will apply the huge orders for anon shmem and tmpfs, so now we only
>> allow pmd-mapped THP to be forced. We should not look at per-size control
>> settings for tmpfs now (mTHP for tmpfs will be discussed in future).
>
> But why not just maintain per-size controls and refactor tmpfs to just look at
> the PMD-size THP control for now. It can ignore the other sizes. That's much
> simpler and cleaner IMHO.
As I said above, 'force' and 'deny' option will not be added for
per-size controls.
^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH v5 4/6] mm: shmem: add mTHP support for anonymous shmem
2024-07-05 2:56 ` Baolin Wang
@ 2024-07-05 8:55 ` Ryan Roberts
0 siblings, 0 replies; 45+ messages in thread
From: Ryan Roberts @ 2024-07-05 8:55 UTC (permalink / raw)
To: Baolin Wang, akpm, hughd
Cc: willy, david, wangkefeng.wang, ying.huang, 21cnbao, shy828301,
ziy, ioworker0, da.gomez, p.raghav, linux-mm, linux-kernel
On 05/07/2024 03:56, Baolin Wang wrote:
>
>
> On 2024/7/4 21:58, Ryan Roberts wrote:
>> On 04/07/2024 12:15, Baolin Wang wrote:
>>>
>>>
>>> On 2024/7/4 01:25, Ryan Roberts wrote:
>>>> On 11/06/2024 11:11, Baolin Wang wrote:
>>>>> Commit 19eaf44954df adds multi-size THP (mTHP) for anonymous pages, that
>>>>> can allow THP to be configured through the sysfs interface located at
>>>>> '/sys/kernel/mm/transparent_hugepage/hugepage-XXkb/enabled'.
>>>>>
>>>>> However, the anonymous shmem will ignore the anonymous mTHP rule
>>>>> configured through the sysfs interface, and can only use the PMD-mapped
>>>>> THP, that is not reasonable. Users expect to apply the mTHP rule for
>>>>> all anonymous pages, including the anonymous shmem, in order to enjoy
>>>>> the benefits of mTHP. For example, lower latency than PMD-mapped THP,
>>>>> smaller memory bloat than PMD-mapped THP, contiguous PTEs on ARM architecture
>>>>> to reduce TLB miss etc. In addition, the mTHP interfaces can be extended
>>>>> to support all shmem/tmpfs scenarios in the future, especially for the
>>>>> shmem mmap() case.
>>>>>
>>>>> The primary strategy is similar to supporting anonymous mTHP. Introduce
>>>>> a new interface '/mm/transparent_hugepage/hugepage-XXkb/shmem_enabled',
>>>>> which can have almost the same values as the top-level
>>>>> '/sys/kernel/mm/transparent_hugepage/shmem_enabled', with adding a new
>>>>> additional "inherit" option and dropping the testing options 'force' and
>>>>> 'deny'. By default all sizes will be set to "never" except PMD size,
>>>>> which is set to "inherit". This ensures backward compatibility with the
>>>>> anonymous shmem enabled of the top level, meanwhile also allows independent
>>>>> control of anonymous shmem enabled for each mTHP.
>>>>>
>>>>> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
>>>>
>>>> [...]
>>>>
>>>> Hi Baolin,
>>>>
>>>> I'm currently trying to fix a bug where khugepaged is not started if only shmem
>>>> is enabled for THP. See discussion at [1]. It's been broken like this forever.
>>>>
>>>> Assuming anon and shmem THP are initially both disabled:
>>>>
>>>> # echo never > /sys/kernel/mm/transparent_hugepage/shmem_enabled
>>>> # echo never > /sys/kernel/mm/transparent_hugepage/enabled
>>>> <khugepaged is stopped here>
>>>>
>>>> Then shemem THP is enabled:
>>>>
>>>> # echo always > /sys/kernel/mm/transparent_hugepage/shmem_enabled
>>>> <khugepaged is not started, this is a bug>
>>>
>>> Thanks Ryan. Yes, this is a real problem.
>>>
>>>> As part of investigating the fix, I stumbled upon this patch, which I remember
>>>> reviewing an early version of but I've been out for a while and missed the
>>>> latter versions. See below for comments and questions; the answers to which
>>>> will
>>>> help me figure out how to fix the above...
>>>>
>>>> [1]
>>>> https://lore.kernel.org/linux-mm/20240702144617.2291480-1-ryan.roberts@arm.com/
>>>>
>>>>
>>>>> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
>>>>> +static unsigned long shmem_allowable_huge_orders(struct inode *inode,
>>>>> + struct vm_area_struct *vma, pgoff_t index,
>>>>> + bool global_huge)
>>>>> +{
>>>>> + unsigned long mask = READ_ONCE(huge_shmem_orders_always);
>>>>> + unsigned long within_size_orders =
>>>>> READ_ONCE(huge_shmem_orders_within_size);
>>>>> + unsigned long vm_flags = vma->vm_flags;
>>>>> + /*
>>>>> + * Check all the (large) orders below HPAGE_PMD_ORDER + 1 that
>>>>> + * are enabled for this vma.
>>>>> + */
>>>>> + unsigned long orders = BIT(PMD_ORDER + 1) - 1;
>>>>> + loff_t i_size;
>>>>> + int order;
>>>>> +
>>>>> + if ((vm_flags & VM_NOHUGEPAGE) ||
>>>>> + test_bit(MMF_DISABLE_THP, &vma->vm_mm->flags))
>>>>> + return 0;
>>>>> +
>>>>> + /* If the hardware/firmware marked hugepage support disabled. */
>>>>> + if (transparent_hugepage_flags & (1 << TRANSPARENT_HUGEPAGE_UNSUPPORTED))
>>>>> + return 0;
>>>>> +
>>>>> + /*
>>>>> + * Following the 'deny' semantics of the top level, force the huge
>>>>> + * option off from all mounts.
>>>>> + */
>>>>> + if (shmem_huge == SHMEM_HUGE_DENY)
>>>>> + return 0;
>>>>
>>>> I don't quite get this, I don't think its the desirable behaviour. This is
>>>> saying that if the top-level shemem_enabled control is set to 'deny', then all
>>>> mTHP sizes, regardless of their control's setting are disabled?
>>>>
>>>> The anon controls don't work like that; you can set the top-level control to
>>>> anything you like, but its only consumed if the per-size controls are
>>>> inheriting it.
>>>
>>> IMO, 'deny' option is not similar like 'never' option.
>>>
>>>>
>>>> So for the deny case, wouldn't it be better to allow that as an option on all
>>>> the per-size controls (and implicitly let it be inherrited from the top-level)?
>>>
>>> From 'deny' option's semantics:
>>> "disables huge on shm_mnt and all mounts, for emergency use;"
>>
>> Right. Today, tmpfs only supports PMD-sized THP. So for all per-size controls
>> except the PMD-size control, 'deny' and 'never' would be the same practically
>> speaking. For the PMD-size control, 'deny' would disable THP for both anon shmem
>> and all tmpfs mounts, whereas 'never' would only disable THP for anon shmem.
>> When tmpfs gains mTHP support, 'deny' in the other per-size controls would also
>> disable that size for the tmpfs mounts.
>
> Not really. We will not add 'deny' and 'force' testing option for each per-size
> mTHP control as suggested by Hugh in the previous MM meeting[1].
>
> [1]
> https://lore.kernel.org/all/f1783ff0-65bd-4b2b-8952-52b6822a0835@redhat.com/T/#u
>
>> I disagree with the current implementation where setting it up so that a
>> top-level 'deny' overrides whatever is in the per-size controls simply because
>> it's different to the model implemented for anon THP. That's my 2 cents. If
>> nobody else agrees then that ok - I'll fix the above bug according to the
>> current model.
>
> I remember we have customers who use the 'deny' option to forcibly turn off the
> huge page option for all mounts (including shm_mnt). But if shmem/tmpfs uses
> mTHP and users set 'deny', they cannot force all huge orders off and they must
> also do 'echo never >
> /sys/kernel/mm/transparent_hugepage/hugepages-XXXkB/shmem_enabled' to disable
> all huge orders option, which breaks the previous user habits, IMHO.
But if they have enabled mTHP for shmem in the first place, they must be mTHP
aware. So it's not unreasonable to expect them to call:
# echo deny > /sys/kernel/mm/transparent_hugepage/hugepages-XXXkB/shmem_enabled
That's the expectation for anon mTHP anyway. If the user enables mTHP then does:
# echo never > /sys/kernel/mm/transparent_hugepage/enabled
That *does not* disable all mTHP sizes. It only disables the sizes that have
been set to 'inherit'. The model is that a size is only influenced by the
top-level control if it has explicitly requested 'inherit'. But here you have
this special (and weird in my opinion) case where some values of the top-level
control apply unconditionally to the per-size control and some don't.
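For comparison, a minimal sketch of the anon model (again assuming a
hugepages-64kB size exists on the system):

# echo always > /sys/kernel/mm/transparent_hugepage/hugepages-64kB/enabled
# echo never > /sys/kernel/mm/transparent_hugepage/enabled
<64kB anon mTHP stays enabled; only sizes set to 'inherit' follow the top level>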
>
> In additon, I do not think this creates a difference with the anon mTHP model,
> as anon mTHP does not have these shmem special 'deny' and 'force' options for
> testing purposes. In my view, the current 'deny' option fulfills the semantic
> definition of 'For use in emergencies, to force the *huge* option off from all
> mounts'.
OK. My opinion is logged :) But I'm not hearing anyone joining in support of
that opinion. I have a better understanding of the intent of your model as a
result of this discussion so thanks for that. As long as the documentation is
clear on this behaviour, let's leave it as is.
>>> It is usually used for testing to shut down all huge pages from the old ages.
>>> Moreover, mTHP interfaces will be used as a huge order filter to support tmpfs,
>>> so I think it will make life easier to disable all huge orders for testing or
>>> emergency use, as well as to maintain the original semantics.>
>>>>> +
>>>>> + /*
>>>>> + * Only allow inherit orders if the top-level value is 'force', which
>>>>> + * means non-PMD sized THP can not override 'huge' mount option now.
>>>>> + */
>>>>> + if (shmem_huge == SHMEM_HUGE_FORCE)
>>>>> + return READ_ONCE(huge_shmem_orders_inherit);
>>>>
>>>> I vaguely recall that we originally discussed that trying to set 'force' on the
>>>> top level control while any per-size controls were set to 'inherit' would be an
>>>> error, and trying to set 'force' on any per-size control except the PMD-size
>>>> would be an error?
>>>
>>> Right.
>>>
>>>> I don't really understand this logic. Shouldn't we just be looking at the
>>>> per-size control settings (or the top-level control as a proxy for every
>>>> per-size control that has 'inherit' set)?
>>>
>>> ‘force’ will apply the huge orders for anon shmem and tmpfs, so now we only
>>> allow pmd-mapped THP to be forced. We should not look at per-size control
>>> settings for tmpfs now (mTHP for tmpfs will be discussed in future).
>>
>> But why not just maintain per-size controls and refactor tmpfs to just look at
>> the PMD-size THP control for now. It can ignore the other sizes. That's much
>> simpler and cleaner IMHO.
>
> As I said above, 'force' and 'deny' option will not be added for per-size controls.
Understood.
Thanks,
Ryan
^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH v5 4/6] mm: shmem: add mTHP support for anonymous shmem
2024-07-04 11:15 ` Baolin Wang
2024-07-04 13:58 ` Ryan Roberts
@ 2024-07-04 14:46 ` Bang Li
2024-07-05 3:01 ` Baolin Wang
1 sibling, 1 reply; 45+ messages in thread
From: Bang Li @ 2024-07-04 14:46 UTC (permalink / raw)
To: Baolin Wang, Ryan Roberts, akpm, hughd
Cc: willy, david, wangkefeng.wang, ying.huang, 21cnbao, shy828301,
ziy, ioworker0, da.gomez, p.raghav, linux-mm, linux-kernel
Hi Baolin,
On 2024/7/4 19:15, Baolin Wang wrote:
>
>>> +
>>> + /*
>>> + * Only allow inherit orders if the top-level value is 'force', which
>>> + * means non-PMD sized THP can not override 'huge' mount option now.
>>> + */
>>> + if (shmem_huge == SHMEM_HUGE_FORCE)
>>> + return READ_ONCE(huge_shmem_orders_inherit);
>>
>> I vaguely recall that we originally discussed that trying to set
>> 'force' on the
>> top level control while any per-size controls were set to 'inherit'
>> would be an
>> error, and trying to set 'force' on any per-size control except the
>> PMD-size
>> would be an error?
>
> Right.
>
>> I don't really understand this logic. Shouldn't we just be looking at
>> the
>> per-size control settings (or the top-level control as a proxy for every
>> per-size control that has 'inherit' set)?
>
> ‘force’ will apply the huge orders for anon shmem and tmpfs, so now we
> only allow pmd-mapped THP to be forced. We should not look at per-size
> control settings for tmpfs now (mTHP for tmpfs will be discussed in
> future).
>
>>
>> Then for tmpfs, which doesn't support non-PMD-sizes yet, we just
>> always use the
>> PMD-size control for decisions.
>>
>> I'm also really struggling with the concept of shmem_is_huge()
>> existing along
>> side shmem_allowable_huge_orders(). Surely this needs to all be
>> refactored into
>> shmem_allowable_huge_orders()?
>
> I understood. But now they serve different purposes: shmem_is_huge()
> will be used to check the huge orders for the top level, for *tmpfs*
> and anon shmem; whereas shmem_allowable_huge_orders() will only be
> used to check the per-size huge orders for anon shmem (excluding tmpfs
> now). However, as I plan to add mTHP support for tmpfs, I think we can
> perform some cleanups.
Please count me in, I'd be happy to contribute to the cleanup and
enhancement process if I can.
Thanks,
Bang
^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH v5 4/6] mm: shmem: add mTHP support for anonymous shmem
2024-07-04 14:46 ` Bang Li
@ 2024-07-05 3:01 ` Baolin Wang
2024-07-05 8:42 ` Ryan Roberts
0 siblings, 1 reply; 45+ messages in thread
From: Baolin Wang @ 2024-07-05 3:01 UTC (permalink / raw)
To: Bang Li, Ryan Roberts, akpm, hughd
Cc: willy, david, wangkefeng.wang, ying.huang, 21cnbao, shy828301,
ziy, ioworker0, da.gomez, p.raghav, linux-mm, linux-kernel
On 2024/7/4 22:46, Bang Li wrote:
> Hi Bao lin,
>
> On 2024/7/4 19:15, Baolin Wang wrote:
>>
>>>> +
>>>> + /*
>>>> + * Only allow inherit orders if the top-level value is 'force', which
>>>> + * means non-PMD sized THP can not override 'huge' mount option now.
>>>> + */
>>>> + if (shmem_huge == SHMEM_HUGE_FORCE)
>>>> + return READ_ONCE(huge_shmem_orders_inherit);
>>>
>>> I vaguely recall that we originally discussed that trying to set
>>> 'force' on the
>>> top level control while any per-size controls were set to 'inherit'
>>> would be an
>>> error, and trying to set 'force' on any per-size control except the
>>> PMD-size
>>> would be an error?
>>
>> Right.
>>
>>> I don't really understand this logic. Shouldn't we just be looking at
>>> the
>>> per-size control settings (or the top-level control as a proxy for every
>>> per-size control that has 'inherit' set)?
>>
>> ‘force’ will apply the huge orders for anon shmem and tmpfs, so now we
>> only allow pmd-mapped THP to be forced. We should not look at per-size
>> control settings for tmpfs now (mTHP for tmpfs will be discussed in
>> future).
>>
>>>
>>> Then for tmpfs, which doesn't support non-PMD-sizes yet, we just
>>> always use the
>>> PMD-size control for decisions.
>>>
>>> I'm also really struggling with the concept of shmem_is_huge()
>>> existing along
>>> side shmem_allowable_huge_orders(). Surely this needs to all be
>>> refactored into
>>> shmem_allowable_huge_orders()?
>>
>> I understood. But now they serve different purposes: shmem_is_huge()
>> will be used to check the huge orders for the top level, for *tmpfs*
>> and anon shmem; whereas shmem_allowable_huge_orders() will only be
>> used to check the per-size huge orders for anon shmem (excluding tmpfs
>> now). However, as I plan to add mTHP support for tmpfs, I think we can
>> perform some cleanups.
>
> Please count me in, I'd be happy to contribute to the cleanup and
> enhancement process if I can.
Good. If you have time, I think you can look at the shmem khugepaged
issue from the previous discussion [1], which I don't have time to look
at now.
"
(3) khugepaged
khugepaged needs to handle larger folios properly as well. Until fixed,
using smaller THP sizes as fallback might prohibit collapsing a
PMD-sized THP later. But really, khugepaged needs to be fixed to handle
that.
"
[1]
https://lore.kernel.org/all/f1783ff0-65bd-4b2b-8952-52b6822a0835@redhat.com/T/#u
^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH v5 4/6] mm: shmem: add mTHP support for anonymous shmem
2024-07-05 3:01 ` Baolin Wang
@ 2024-07-05 8:42 ` Ryan Roberts
2024-07-05 8:57 ` Baolin Wang
0 siblings, 1 reply; 45+ messages in thread
From: Ryan Roberts @ 2024-07-05 8:42 UTC (permalink / raw)
To: Baolin Wang, Bang Li, akpm, hughd
Cc: willy, david, wangkefeng.wang, ying.huang, 21cnbao, shy828301,
ziy, ioworker0, da.gomez, p.raghav, linux-mm, linux-kernel
On 05/07/2024 04:01, Baolin Wang wrote:
>
>
> On 2024/7/4 22:46, Bang Li wrote:
>> Hi Bao lin,
>>
>> On 2024/7/4 19:15, Baolin Wang wrote:
>>>
>>>>> +
>>>>> + /*
>>>>> + * Only allow inherit orders if the top-level value is 'force', which
>>>>> + * means non-PMD sized THP can not override 'huge' mount option now.
>>>>> + */
>>>>> + if (shmem_huge == SHMEM_HUGE_FORCE)
>>>>> + return READ_ONCE(huge_shmem_orders_inherit);
>>>>
>>>> I vaguely recall that we originally discussed that trying to set 'force' on the
>>>> top level control while any per-size controls were set to 'inherit' would be an
>>>> error, and trying to set 'force' on any per-size control except the PMD-size
>>>> would be an error?
>>>
>>> Right.
>>>
>>>> I don't really understand this logic. Shouldn't we just be looking at the
>>>> per-size control settings (or the top-level control as a proxy for every
>>>> per-size control that has 'inherit' set)?
>>>
>>> ‘force’ will apply the huge orders for anon shmem and tmpfs, so now we only
>>> allow pmd-mapped THP to be forced. We should not look at per-size control
>>> settings for tmpfs now (mTHP for tmpfs will be discussed in future).
>>>
>>>>
>>>> Then for tmpfs, which doesn't support non-PMD-sizes yet, we just always use the
>>>> PMD-size control for decisions.
>>>>
>>>> I'm also really struggling with the concept of shmem_is_huge() existing along
>>>> side shmem_allowable_huge_orders(). Surely this needs to all be refactored into
>>>> shmem_allowable_huge_orders()?
>>>
>>> I understood. But now they serve different purposes: shmem_is_huge() will be
>>> used to check the huge orders for the top level, for *tmpfs* and anon shmem;
>>> whereas shmem_allowable_huge_orders() will only be used to check the per-size
>>> huge orders for anon shmem (excluding tmpfs now). However, as I plan to add
>>> mTHP support for tmpfs, I think we can perform some cleanups.
>>
>> Please count me in, I'd be happy to contribute to the cleanup and enhancement
>> process if I can.
>
> Good. If you have time, I think you can look at the shmem khugepaged issue from
> the previous discussion [1], which I don't have time to look at now.
>
> "
> (3) khugepaged
>
> khugepaged needs to handle larger folios properly as well. Until fixed,
> using smaller THP sizes as fallback might prohibit collapsing a
> PMD-sized THP later. But really, khugepaged needs to be fixed to handle
> that.
> "
khugepaged can already collapse "folios of any order less than PMD-size" to
PMD-size, for anon memory. In fact I modified the selftest to be able to test
that in commit 9f0704eae8a4 ("selftests/mm/khugepaged: enlighten for multi-size
THP"). I'd be surprised if khugepaged can't already handle the same for shmem?
Although the test will definitely want to be extended to test it.
Thanks,
Ryan
>
> [1]
> https://lore.kernel.org/all/f1783ff0-65bd-4b2b-8952-52b6822a0835@redhat.com/T/#u
^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH v5 4/6] mm: shmem: add mTHP support for anonymous shmem
2024-07-05 8:42 ` Ryan Roberts
@ 2024-07-05 8:57 ` Baolin Wang
2024-07-05 9:05 ` Ryan Roberts
0 siblings, 1 reply; 45+ messages in thread
From: Baolin Wang @ 2024-07-05 8:57 UTC (permalink / raw)
To: Ryan Roberts, Bang Li, akpm, hughd
Cc: willy, david, wangkefeng.wang, ying.huang, 21cnbao, shy828301,
ziy, ioworker0, da.gomez, p.raghav, linux-mm, linux-kernel
On 2024/7/5 16:42, Ryan Roberts wrote:
> On 05/07/2024 04:01, Baolin Wang wrote:
>>
>>
>> On 2024/7/4 22:46, Bang Li wrote:
>>> Hi Bao lin,
>>>
>>> On 2024/7/4 19:15, Baolin Wang wrote:
>>>>
>>>>>> +
>>>>>> + /*
>>>>>> + * Only allow inherit orders if the top-level value is 'force', which
>>>>>> + * means non-PMD sized THP can not override 'huge' mount option now.
>>>>>> + */
>>>>>> + if (shmem_huge == SHMEM_HUGE_FORCE)
>>>>>> + return READ_ONCE(huge_shmem_orders_inherit);
>>>>>
>>>>> I vaguely recall that we originally discussed that trying to set 'force' on the
>>>>> top level control while any per-size controls were set to 'inherit' would be an
>>>>> error, and trying to set 'force' on any per-size control except the PMD-size
>>>>> would be an error?
>>>>
>>>> Right.
>>>>
>>>>> I don't really understand this logic. Shouldn't we just be looking at the
>>>>> per-size control settings (or the top-level control as a proxy for every
>>>>> per-size control that has 'inherit' set)?
>>>>
>>>> ‘force’ will apply the huge orders for anon shmem and tmpfs, so now we only
>>>> allow pmd-mapped THP to be forced. We should not look at per-size control
>>>> settings for tmpfs now (mTHP for tmpfs will be discussed in future).
>>>>
>>>>>
>>>>> Then for tmpfs, which doesn't support non-PMD-sizes yet, we just always use the
>>>>> PMD-size control for decisions.
>>>>>
>>>>> I'm also really struggling with the concept of shmem_is_huge() existing along
>>>>> side shmem_allowable_huge_orders(). Surely this needs to all be refactored into
>>>>> shmem_allowable_huge_orders()?
>>>>
>>>> I understood. But now they serve different purposes: shmem_is_huge() will be
>>>> used to check the huge orders for the top level, for *tmpfs* and anon shmem;
>>>> whereas shmem_allowable_huge_orders() will only be used to check the per-size
>>>> huge orders for anon shmem (excluding tmpfs now). However, as I plan to add
>>>> mTHP support for tmpfs, I think we can perform some cleanups.
>>>
>>> Please count me in, I'd be happy to contribute to the cleanup and enhancement
>>> process if I can.
>>
>> Good. If you have time, I think you can look at the shmem khugepaged issue from
>> the previous discussion [1], which I don't have time to look at now.
>>
>> "
>> (3) khugepaged
>>
>> khugepaged needs to handle larger folios properly as well. Until fixed,
>> using smaller THP sizes as fallback might prohibit collapsing a
>> PMD-sized THP later. But really, khugepaged needs to be fixed to handle
>> that.
>> "
>
> khugepaged can already collapse "folios of any order less then PMD-size" to
> PMD-size, for anon memory. Infact I modified the selftest to be able to test
> that in commit 9f0704eae8a4 ("selftests/mm/khugepaged: enlighten for multi-size
> THP"). I'd be surprised if khugepaged can't alreay handle the same for shmem?
I did not test this, but from the comment in hpage_collapse_scan_file(),
it seems that compacting small mTHP into a single PMD-mapped THP is not
supported yet.
/*
* TODO: khugepaged should compact smaller compound pages
* into a PMD sized page
*/
if (folio_test_large(folio)) {
result = folio_order(folio) == HPAGE_PMD_ORDER &&
folio->index == start
/* Maybe PMD-mapped */
? SCAN_PTE_MAPPED_HUGEPAGE
: SCAN_PAGE_COMPOUND;
/*
* For SCAN_PTE_MAPPED_HUGEPAGE, further processing
* by the caller won't touch the page cache, and so
* it's safe to skip LRU and refcount checks before
* returning.
*/
break;
}
> Although the test will definitely want to be extended to test it.
Right.
^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH v5 4/6] mm: shmem: add mTHP support for anonymous shmem
2024-07-05 8:57 ` Baolin Wang
@ 2024-07-05 9:05 ` Ryan Roberts
0 siblings, 0 replies; 45+ messages in thread
From: Ryan Roberts @ 2024-07-05 9:05 UTC (permalink / raw)
To: Baolin Wang, Bang Li, akpm, hughd
Cc: willy, david, wangkefeng.wang, ying.huang, 21cnbao, shy828301,
ziy, ioworker0, da.gomez, p.raghav, linux-mm, linux-kernel
On 05/07/2024 09:57, Baolin Wang wrote:
>
>
> On 2024/7/5 16:42, Ryan Roberts wrote:
>> On 05/07/2024 04:01, Baolin Wang wrote:
>>>
>>>
>>> On 2024/7/4 22:46, Bang Li wrote:
>>>> Hi Bao lin,
>>>>
>>>> On 2024/7/4 19:15, Baolin Wang wrote:
>>>>>
>>>>>>> +
>>>>>>> + /*
>>>>>>> + * Only allow inherit orders if the top-level value is 'force', which
>>>>>>> + * means non-PMD sized THP can not override 'huge' mount option now.
>>>>>>> + */
>>>>>>> + if (shmem_huge == SHMEM_HUGE_FORCE)
>>>>>>> + return READ_ONCE(huge_shmem_orders_inherit);
>>>>>>
>>>>>> I vaguely recall that we originally discussed that trying to set 'force'
>>>>>> on the
>>>>>> top level control while any per-size controls were set to 'inherit' would
>>>>>> be an
>>>>>> error, and trying to set 'force' on any per-size control except the PMD-size
>>>>>> would be an error?
>>>>>
>>>>> Right.
>>>>>
>>>>>> I don't really understand this logic. Shouldn't we just be looking at the
>>>>>> per-size control settings (or the top-level control as a proxy for every
>>>>>> per-size control that has 'inherit' set)?
>>>>>
>>>>> ‘force’ will apply the huge orders for anon shmem and tmpfs, so now we only
>>>>> allow pmd-mapped THP to be forced. We should not look at per-size control
>>>>> settings for tmpfs now (mTHP for tmpfs will be discussed in future).
>>>>>
>>>>>>
>>>>>> Then for tmpfs, which doesn't support non-PMD-sizes yet, we just always
>>>>>> use the
>>>>>> PMD-size control for decisions.
>>>>>>
>>>>>> I'm also really struggling with the concept of shmem_is_huge() existing along
>>>>>> side shmem_allowable_huge_orders(). Surely this needs to all be refactored
>>>>>> into
>>>>>> shmem_allowable_huge_orders()?
>>>>>
>>>>> I understood. But now they serve different purposes: shmem_is_huge() will be
>>>>> used to check the huge orders for the top level, for *tmpfs* and anon shmem;
>>>>> whereas shmem_allowable_huge_orders() will only be used to check the per-size
>>>>> huge orders for anon shmem (excluding tmpfs now). However, as I plan to add
>>>>> mTHP support for tmpfs, I think we can perform some cleanups.
>>>>
>>>> Please count me in, I'd be happy to contribute to the cleanup and enhancement
>>>> process if I can.
>>>
>>> Good. If you have time, I think you can look at the shmem khugepaged issue from
>>> the previous discussion [1], which I don't have time to look at now.
>>>
>>> "
>>> (3) khugepaged
>>>
>>> khugepaged needs to handle larger folios properly as well. Until fixed,
>>> using smaller THP sizes as fallback might prohibit collapsing a
>>> PMD-sized THP later. But really, khugepaged needs to be fixed to handle
>>> that.
>>> "
>>
>> khugepaged can already collapse "folios of any order less then PMD-size" to
>> PMD-size, for anon memory. Infact I modified the selftest to be able to test
>> that in commit 9f0704eae8a4 ("selftests/mm/khugepaged: enlighten for multi-size
>> THP"). I'd be surprised if khugepaged can't alreay handle the same for shmem?
>
> I did not test this, but from the comment in hpage_collapse_scan_file(), seems
> that compacting small mTHP into a single PMD-mapped THP is not supported yet.
>
> /*
> * TODO: khugepaged should compact smaller compound pages
> * into a PMD sized page
> */
> if (folio_test_large(folio)) {
> result = folio_order(folio) == HPAGE_PMD_ORDER &&
> folio->index == start
> /* Maybe PMD-mapped */
> ? SCAN_PTE_MAPPED_HUGEPAGE
> : SCAN_PAGE_COMPOUND;
> /*
> * For SCAN_PTE_MAPPED_HUGEPAGE, further processing
> * by the caller won't touch the page cache, and so
> * it's safe to skip LRU and refcount checks before
> * returning.
> */
> break;
> }
OK, so the functionality is missing just for shmem/file-backed collapse. Got it.
Sorry for the noise.
>
>> Although the test will definitely want to be extended to test it.
>
> Right.
^ permalink raw reply [flat|nested] 45+ messages in thread
* [PATCH v5 5/6] mm: shmem: add mTHP size alignment in shmem_get_unmapped_area
2024-06-11 10:11 [PATCH v5 0/6] add mTHP support for anonymous shmem Baolin Wang
` (3 preceding siblings ...)
2024-06-11 10:11 ` [PATCH v5 4/6] mm: shmem: add mTHP support " Baolin Wang
@ 2024-06-11 10:11 ` Baolin Wang
2024-06-11 10:11 ` [PATCH v5 6/6] mm: shmem: add mTHP counters for anonymous shmem Baolin Wang
2024-07-04 18:43 ` [PATCH v5 0/6] add mTHP support " Matthew Wilcox
6 siblings, 0 replies; 45+ messages in thread
From: Baolin Wang @ 2024-06-11 10:11 UTC (permalink / raw)
To: akpm, hughd
Cc: willy, david, wangkefeng.wang, ying.huang, 21cnbao, ryan.roberts,
shy828301, ziy, ioworker0, da.gomez, p.raghav, baolin.wang,
linux-mm, linux-kernel
Although the top-level hugepage allocation can be turned off, anonymous shmem
can still use mTHP by configuring the sysfs interface located at
'/sys/kernel/mm/transparent_hugepage/hugepage-XXkb/shmem_enabled'. Therefore,
add alignment for mTHP size to provide a suitable alignment address in
shmem_get_unmapped_area().
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Tested-by: Lance Yang <ioworker0@gmail.com>
---
mm/shmem.c | 40 +++++++++++++++++++++++++++++++---------
1 file changed, 31 insertions(+), 9 deletions(-)
diff --git a/mm/shmem.c b/mm/shmem.c
index e849c88452b2..f5469c357be6 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -2394,6 +2394,7 @@ unsigned long shmem_get_unmapped_area(struct file *file,
unsigned long inflated_len;
unsigned long inflated_addr;
unsigned long inflated_offset;
+ unsigned long hpage_size;
if (len > TASK_SIZE)
return -ENOMEM;
@@ -2412,8 +2413,6 @@ unsigned long shmem_get_unmapped_area(struct file *file,
if (shmem_huge == SHMEM_HUGE_DENY)
return addr;
- if (len < HPAGE_PMD_SIZE)
- return addr;
if (flags & MAP_FIXED)
return addr;
/*
@@ -2425,8 +2424,11 @@ unsigned long shmem_get_unmapped_area(struct file *file,
if (uaddr == addr)
return addr;
+ hpage_size = HPAGE_PMD_SIZE;
if (shmem_huge != SHMEM_HUGE_FORCE) {
struct super_block *sb;
+ unsigned long __maybe_unused hpage_orders;
+ int order = 0;
if (file) {
VM_BUG_ON(file->f_op != &shmem_file_operations);
@@ -2439,18 +2441,38 @@ unsigned long shmem_get_unmapped_area(struct file *file,
if (IS_ERR(shm_mnt))
return addr;
sb = shm_mnt->mnt_sb;
+
+ /*
+ * Find the highest mTHP order used for anonymous shmem to
+ * provide a suitable alignment address.
+ */
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+ hpage_orders = READ_ONCE(huge_shmem_orders_always);
+ hpage_orders |= READ_ONCE(huge_shmem_orders_within_size);
+ hpage_orders |= READ_ONCE(huge_shmem_orders_madvise);
+ if (SHMEM_SB(sb)->huge != SHMEM_HUGE_NEVER)
+ hpage_orders |= READ_ONCE(huge_shmem_orders_inherit);
+
+ if (hpage_orders > 0) {
+ order = highest_order(hpage_orders);
+ hpage_size = PAGE_SIZE << order;
+ }
+#endif
}
- if (SHMEM_SB(sb)->huge == SHMEM_HUGE_NEVER)
+ if (SHMEM_SB(sb)->huge == SHMEM_HUGE_NEVER && !order)
return addr;
}
- offset = (pgoff << PAGE_SHIFT) & (HPAGE_PMD_SIZE-1);
- if (offset && offset + len < 2 * HPAGE_PMD_SIZE)
+ if (len < hpage_size)
+ return addr;
+
+ offset = (pgoff << PAGE_SHIFT) & (hpage_size - 1);
+ if (offset && offset + len < 2 * hpage_size)
return addr;
- if ((addr & (HPAGE_PMD_SIZE-1)) == offset)
+ if ((addr & (hpage_size - 1)) == offset)
return addr;
- inflated_len = len + HPAGE_PMD_SIZE - PAGE_SIZE;
+ inflated_len = len + hpage_size - PAGE_SIZE;
if (inflated_len > TASK_SIZE)
return addr;
if (inflated_len < len)
@@ -2463,10 +2485,10 @@ unsigned long shmem_get_unmapped_area(struct file *file,
if (inflated_addr & ~PAGE_MASK)
return addr;
- inflated_offset = inflated_addr & (HPAGE_PMD_SIZE-1);
+ inflated_offset = inflated_addr & (hpage_size - 1);
inflated_addr += offset - inflated_offset;
if (inflated_offset > offset)
- inflated_addr += HPAGE_PMD_SIZE;
+ inflated_addr += hpage_size;
if (inflated_addr > TASK_SIZE - len)
return addr;
--
2.39.3
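For context on the hunk above: highest_order() picks the top set bit of an
orders bitmask (roughly "return fls_long(orders) - 1;" in
include/linux/huge_mm.h), so the alignment chosen here is simply the largest
enabled shmem mTHP size. A minimal sketch of the resulting calculation, using
the same bitmasks the patch reads (illustrative only, not a copy of the final
code):

    /* Sketch: derive the alignment size from the enabled shmem mTHP orders. */
    unsigned long orders, hpage_size = HPAGE_PMD_SIZE;

    orders  = READ_ONCE(huge_shmem_orders_always);
    orders |= READ_ONCE(huge_shmem_orders_within_size);
    orders |= READ_ONCE(huge_shmem_orders_madvise);
    orders |= READ_ONCE(huge_shmem_orders_inherit);  /* only if the mount allows huge */
    if (orders)
        hpage_size = PAGE_SIZE << highest_order(orders);

For example, with 4K base pages and only hugepage-64kB/shmem_enabled set,
orders has bit 4 set, so hpage_size becomes 64K and the unmapped-area search
is aligned to 64K instead of to the 2M PMD size.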
^ permalink raw reply related [flat|nested] 45+ messages in thread
* [PATCH v5 6/6] mm: shmem: add mTHP counters for anonymous shmem
2024-06-11 10:11 [PATCH v5 0/6] add mTHP support for anonymous shmem Baolin Wang
` (4 preceding siblings ...)
2024-06-11 10:11 ` [PATCH v5 5/6] mm: shmem: add mTHP size alignment in shmem_get_unmapped_area Baolin Wang
@ 2024-06-11 10:11 ` Baolin Wang
2024-06-12 8:04 ` Lance Yang
` (2 more replies)
2024-07-04 18:43 ` [PATCH v5 0/6] add mTHP support " Matthew Wilcox
6 siblings, 3 replies; 45+ messages in thread
From: Baolin Wang @ 2024-06-11 10:11 UTC (permalink / raw)
To: akpm, hughd
Cc: willy, david, wangkefeng.wang, ying.huang, 21cnbao, ryan.roberts,
shy828301, ziy, ioworker0, da.gomez, p.raghav, baolin.wang,
linux-mm, linux-kernel
Add mTHP counters for anonymous shmem.
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
---
include/linux/huge_mm.h | 3 +++
mm/huge_memory.c | 6 ++++++
mm/shmem.c | 18 +++++++++++++++---
3 files changed, 24 insertions(+), 3 deletions(-)
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 909cfc67521d..212cca384d7e 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -281,6 +281,9 @@ enum mthp_stat_item {
MTHP_STAT_ANON_FAULT_FALLBACK_CHARGE,
MTHP_STAT_SWPOUT,
MTHP_STAT_SWPOUT_FALLBACK,
+ MTHP_STAT_FILE_ALLOC,
+ MTHP_STAT_FILE_FALLBACK,
+ MTHP_STAT_FILE_FALLBACK_CHARGE,
__MTHP_STAT_COUNT
};
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 1360a1903b66..3fbcd77f5957 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -555,6 +555,9 @@ DEFINE_MTHP_STAT_ATTR(anon_fault_fallback, MTHP_STAT_ANON_FAULT_FALLBACK);
DEFINE_MTHP_STAT_ATTR(anon_fault_fallback_charge, MTHP_STAT_ANON_FAULT_FALLBACK_CHARGE);
DEFINE_MTHP_STAT_ATTR(swpout, MTHP_STAT_SWPOUT);
DEFINE_MTHP_STAT_ATTR(swpout_fallback, MTHP_STAT_SWPOUT_FALLBACK);
+DEFINE_MTHP_STAT_ATTR(file_alloc, MTHP_STAT_FILE_ALLOC);
+DEFINE_MTHP_STAT_ATTR(file_fallback, MTHP_STAT_FILE_FALLBACK);
+DEFINE_MTHP_STAT_ATTR(file_fallback_charge, MTHP_STAT_FILE_FALLBACK_CHARGE);
static struct attribute *stats_attrs[] = {
&anon_fault_alloc_attr.attr,
@@ -562,6 +565,9 @@ static struct attribute *stats_attrs[] = {
&anon_fault_fallback_charge_attr.attr,
&swpout_attr.attr,
&swpout_fallback_attr.attr,
+ &file_alloc_attr.attr,
+ &file_fallback_attr.attr,
+ &file_fallback_charge_attr.attr,
NULL,
};
diff --git a/mm/shmem.c b/mm/shmem.c
index f5469c357be6..99bd3c34f0fb 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1773,6 +1773,9 @@ static struct folio *shmem_alloc_and_add_folio(struct vm_fault *vmf,
if (pages == HPAGE_PMD_NR)
count_vm_event(THP_FILE_FALLBACK);
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+ count_mthp_stat(order, MTHP_STAT_FILE_FALLBACK);
+#endif
order = next_order(&suitable_orders, order);
}
} else {
@@ -1792,9 +1795,15 @@ static struct folio *shmem_alloc_and_add_folio(struct vm_fault *vmf,
if (xa_find(&mapping->i_pages, &index,
index + pages - 1, XA_PRESENT)) {
error = -EEXIST;
- } else if (pages == HPAGE_PMD_NR) {
- count_vm_event(THP_FILE_FALLBACK);
- count_vm_event(THP_FILE_FALLBACK_CHARGE);
+ } else if (pages > 1) {
+ if (pages == HPAGE_PMD_NR) {
+ count_vm_event(THP_FILE_FALLBACK);
+ count_vm_event(THP_FILE_FALLBACK_CHARGE);
+ }
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+ count_mthp_stat(folio_order(folio), MTHP_STAT_FILE_FALLBACK);
+ count_mthp_stat(folio_order(folio), MTHP_STAT_FILE_FALLBACK_CHARGE);
+#endif
}
goto unlock;
}
@@ -2168,6 +2177,9 @@ static int shmem_get_folio_gfp(struct inode *inode, pgoff_t index,
if (!IS_ERR(folio)) {
if (folio_test_pmd_mappable(folio))
count_vm_event(THP_FILE_ALLOC);
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+ count_mthp_stat(folio_order(folio), MTHP_STAT_FILE_ALLOC);
+#endif
goto alloced;
}
if (PTR_ERR(folio) == -EEXIST)
--
2.39.3
^ permalink raw reply related [flat|nested] 45+ messages in thread
* Re: [PATCH v5 6/6] mm: shmem: add mTHP counters for anonymous shmem
2024-06-11 10:11 ` [PATCH v5 6/6] mm: shmem: add mTHP counters for anonymous shmem Baolin Wang
@ 2024-06-12 8:04 ` Lance Yang
2024-06-12 9:28 ` Baolin Wang
2024-06-12 13:46 ` Kefeng Wang
2024-06-12 14:18 ` Lance Yang
2 siblings, 1 reply; 45+ messages in thread
From: Lance Yang @ 2024-06-12 8:04 UTC (permalink / raw)
To: Baolin Wang
Cc: akpm, hughd, willy, david, wangkefeng.wang, ying.huang, 21cnbao,
ryan.roberts, shy828301, ziy, da.gomez, p.raghav, linux-mm,
linux-kernel
Hi Baolin,
On Tue, Jun 11, 2024 at 6:11 PM Baolin Wang
<baolin.wang@linux.alibaba.com> wrote:
>
> Add mTHP counters for anonymous shmem.
>
> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
> ---
> include/linux/huge_mm.h | 3 +++
> mm/huge_memory.c | 6 ++++++
> mm/shmem.c | 18 +++++++++++++++---
> 3 files changed, 24 insertions(+), 3 deletions(-)
>
> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> index 909cfc67521d..212cca384d7e 100644
> --- a/include/linux/huge_mm.h
> +++ b/include/linux/huge_mm.h
> @@ -281,6 +281,9 @@ enum mthp_stat_item {
> MTHP_STAT_ANON_FAULT_FALLBACK_CHARGE,
> MTHP_STAT_SWPOUT,
> MTHP_STAT_SWPOUT_FALLBACK,
> + MTHP_STAT_FILE_ALLOC,
> + MTHP_STAT_FILE_FALLBACK,
> + MTHP_STAT_FILE_FALLBACK_CHARGE,
> __MTHP_STAT_COUNT
> };
>
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 1360a1903b66..3fbcd77f5957 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -555,6 +555,9 @@ DEFINE_MTHP_STAT_ATTR(anon_fault_fallback, MTHP_STAT_ANON_FAULT_FALLBACK);
> DEFINE_MTHP_STAT_ATTR(anon_fault_fallback_charge, MTHP_STAT_ANON_FAULT_FALLBACK_CHARGE);
> DEFINE_MTHP_STAT_ATTR(swpout, MTHP_STAT_SWPOUT);
> DEFINE_MTHP_STAT_ATTR(swpout_fallback, MTHP_STAT_SWPOUT_FALLBACK);
> +DEFINE_MTHP_STAT_ATTR(file_alloc, MTHP_STAT_FILE_ALLOC);
> +DEFINE_MTHP_STAT_ATTR(file_fallback, MTHP_STAT_FILE_FALLBACK);
> +DEFINE_MTHP_STAT_ATTR(file_fallback_charge, MTHP_STAT_FILE_FALLBACK_CHARGE);
>
> static struct attribute *stats_attrs[] = {
> &anon_fault_alloc_attr.attr,
> @@ -562,6 +565,9 @@ static struct attribute *stats_attrs[] = {
> &anon_fault_fallback_charge_attr.attr,
> &swpout_attr.attr,
> &swpout_fallback_attr.attr,
> + &file_alloc_attr.attr,
> + &file_fallback_attr.attr,
> + &file_fallback_charge_attr.attr,
> NULL,
> };
>
> diff --git a/mm/shmem.c b/mm/shmem.c
> index f5469c357be6..99bd3c34f0fb 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -1773,6 +1773,9 @@ static struct folio *shmem_alloc_and_add_folio(struct vm_fault *vmf,
>
> if (pages == HPAGE_PMD_NR)
> count_vm_event(THP_FILE_FALLBACK);
> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> + count_mthp_stat(order, MTHP_STAT_FILE_FALLBACK);
> +#endif
Using the conditional compilation directives here is a bit weird :)
Would there be any issues if we were to drop them?
Since THP_FILE_FALLBACK is working as expected, MTHP_STAT_FILE_FALLBACK
should work as well without the conditional compilation directives, IIUC.
Thanks,
Lance
> order = next_order(&suitable_orders, order);
> }
> } else {
> @@ -1792,9 +1795,15 @@ static struct folio *shmem_alloc_and_add_folio(struct vm_fault *vmf,
> if (xa_find(&mapping->i_pages, &index,
> index + pages - 1, XA_PRESENT)) {
> error = -EEXIST;
> - } else if (pages == HPAGE_PMD_NR) {
> - count_vm_event(THP_FILE_FALLBACK);
> - count_vm_event(THP_FILE_FALLBACK_CHARGE);
> + } else if (pages > 1) {
> + if (pages == HPAGE_PMD_NR) {
> + count_vm_event(THP_FILE_FALLBACK);
> + count_vm_event(THP_FILE_FALLBACK_CHARGE);
> + }
> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> + count_mthp_stat(folio_order(folio), MTHP_STAT_FILE_FALLBACK);
> + count_mthp_stat(folio_order(folio), MTHP_STAT_FILE_FALLBACK_CHARGE);
> +#endif
> }
> goto unlock;
> }
> @@ -2168,6 +2177,9 @@ static int shmem_get_folio_gfp(struct inode *inode, pgoff_t index,
> if (!IS_ERR(folio)) {
> if (folio_test_pmd_mappable(folio))
> count_vm_event(THP_FILE_ALLOC);
> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> + count_mthp_stat(folio_order(folio), MTHP_STAT_FILE_ALLOC);
> +#endif
> goto alloced;
> }
> if (PTR_ERR(folio) == -EEXIST)
> --
> 2.39.3
>
^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH v5 6/6] mm: shmem: add mTHP counters for anonymous shmem
2024-06-12 8:04 ` Lance Yang
@ 2024-06-12 9:28 ` Baolin Wang
2024-06-12 14:16 ` Lance Yang
0 siblings, 1 reply; 45+ messages in thread
From: Baolin Wang @ 2024-06-12 9:28 UTC (permalink / raw)
To: Lance Yang
Cc: akpm, hughd, willy, david, wangkefeng.wang, ying.huang, 21cnbao,
ryan.roberts, shy828301, ziy, da.gomez, p.raghav, linux-mm,
linux-kernel
On 2024/6/12 16:04, Lance Yang wrote:
> Hi Baolin,
>
> On Tue, Jun 11, 2024 at 6:11 PM Baolin Wang
> <baolin.wang@linux.alibaba.com> wrote:
>>
>> Add mTHP counters for anonymous shmem.
>>
>> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
>> ---
>> include/linux/huge_mm.h | 3 +++
>> mm/huge_memory.c | 6 ++++++
>> mm/shmem.c | 18 +++++++++++++++---
>> 3 files changed, 24 insertions(+), 3 deletions(-)
>>
>> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
>> index 909cfc67521d..212cca384d7e 100644
>> --- a/include/linux/huge_mm.h
>> +++ b/include/linux/huge_mm.h
>> @@ -281,6 +281,9 @@ enum mthp_stat_item {
>> MTHP_STAT_ANON_FAULT_FALLBACK_CHARGE,
>> MTHP_STAT_SWPOUT,
>> MTHP_STAT_SWPOUT_FALLBACK,
>> + MTHP_STAT_FILE_ALLOC,
>> + MTHP_STAT_FILE_FALLBACK,
>> + MTHP_STAT_FILE_FALLBACK_CHARGE,
>> __MTHP_STAT_COUNT
>> };
>>
>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>> index 1360a1903b66..3fbcd77f5957 100644
>> --- a/mm/huge_memory.c
>> +++ b/mm/huge_memory.c
>> @@ -555,6 +555,9 @@ DEFINE_MTHP_STAT_ATTR(anon_fault_fallback, MTHP_STAT_ANON_FAULT_FALLBACK);
>> DEFINE_MTHP_STAT_ATTR(anon_fault_fallback_charge, MTHP_STAT_ANON_FAULT_FALLBACK_CHARGE);
>> DEFINE_MTHP_STAT_ATTR(swpout, MTHP_STAT_SWPOUT);
>> DEFINE_MTHP_STAT_ATTR(swpout_fallback, MTHP_STAT_SWPOUT_FALLBACK);
>> +DEFINE_MTHP_STAT_ATTR(file_alloc, MTHP_STAT_FILE_ALLOC);
>> +DEFINE_MTHP_STAT_ATTR(file_fallback, MTHP_STAT_FILE_FALLBACK);
>> +DEFINE_MTHP_STAT_ATTR(file_fallback_charge, MTHP_STAT_FILE_FALLBACK_CHARGE);
>>
>> static struct attribute *stats_attrs[] = {
>> &anon_fault_alloc_attr.attr,
>> @@ -562,6 +565,9 @@ static struct attribute *stats_attrs[] = {
>> &anon_fault_fallback_charge_attr.attr,
>> &swpout_attr.attr,
>> &swpout_fallback_attr.attr,
>> + &file_alloc_attr.attr,
>> + &file_fallback_attr.attr,
>> + &file_fallback_charge_attr.attr,
>> NULL,
>> };
>>
>> diff --git a/mm/shmem.c b/mm/shmem.c
>> index f5469c357be6..99bd3c34f0fb 100644
>> --- a/mm/shmem.c
>> +++ b/mm/shmem.c
>> @@ -1773,6 +1773,9 @@ static struct folio *shmem_alloc_and_add_folio(struct vm_fault *vmf,
>>
>> if (pages == HPAGE_PMD_NR)
>> count_vm_event(THP_FILE_FALLBACK);
>> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
>> + count_mthp_stat(order, MTHP_STAT_FILE_FALLBACK);
>> +#endif
>
> Using the conditional compilation directives here is a bit weird :)
> Would there be any issues if we were to drop them?
It will cause build errors if CONFIG_TRANSPARENT_HUGEPAGE is not enabled.
>
> Since THP_FILE_FALLBACK is working as expected, MTHP_STAT_FILE_FALLBACK
> should work as well without the conditional compilation directives, IIUC.
No, you should take a look at how count_mthp_stat() is defined :)
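For reference, count_mthp_stat() is only declared inside the
CONFIG_TRANSPARENT_HUGEPAGE section of include/linux/huge_mm.h and has no
!THP stub, so any call site that is compiled unconditionally needs its own
guard. The definition is roughly:

    #ifdef CONFIG_TRANSPARENT_HUGEPAGE
    static inline void count_mthp_stat(int order, enum mthp_stat_item item)
    {
            if (order <= 0 || order > PMD_ORDER)
                    return;

            this_cpu_inc(mthp_stats.stats[order][item]);
    }
    #endif

With CONFIG_TRANSPARENT_HUGEPAGE=n the symbol simply does not exist, which is
why the explicit #ifdef pairs around the new calls in mm/shmem.c are needed.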
^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH v5 6/6] mm: shmem: add mTHP counters for anonymous shmem
2024-06-12 9:28 ` Baolin Wang
@ 2024-06-12 14:16 ` Lance Yang
0 siblings, 0 replies; 45+ messages in thread
From: Lance Yang @ 2024-06-12 14:16 UTC (permalink / raw)
To: Baolin Wang
Cc: akpm, hughd, willy, david, wangkefeng.wang, ying.huang, 21cnbao,
ryan.roberts, shy828301, ziy, da.gomez, p.raghav, linux-mm,
linux-kernel
On Wed, Jun 12, 2024 at 5:28 PM Baolin Wang
<baolin.wang@linux.alibaba.com> wrote:
>
>
>
> On 2024/6/12 16:04, Lance Yang wrote:
> > Hi Baolin,
> >
> > On Tue, Jun 11, 2024 at 6:11 PM Baolin Wang
> > <baolin.wang@linux.alibaba.com> wrote:
> >>
> >> Add mTHP counters for anonymous shmem.
> >>
> >> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
> >> ---
> >> include/linux/huge_mm.h | 3 +++
> >> mm/huge_memory.c | 6 ++++++
> >> mm/shmem.c | 18 +++++++++++++++---
> >> 3 files changed, 24 insertions(+), 3 deletions(-)
> >>
> >> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> >> index 909cfc67521d..212cca384d7e 100644
> >> --- a/include/linux/huge_mm.h
> >> +++ b/include/linux/huge_mm.h
> >> @@ -281,6 +281,9 @@ enum mthp_stat_item {
> >> MTHP_STAT_ANON_FAULT_FALLBACK_CHARGE,
> >> MTHP_STAT_SWPOUT,
> >> MTHP_STAT_SWPOUT_FALLBACK,
> >> + MTHP_STAT_FILE_ALLOC,
> >> + MTHP_STAT_FILE_FALLBACK,
> >> + MTHP_STAT_FILE_FALLBACK_CHARGE,
> >> __MTHP_STAT_COUNT
> >> };
> >>
> >> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> >> index 1360a1903b66..3fbcd77f5957 100644
> >> --- a/mm/huge_memory.c
> >> +++ b/mm/huge_memory.c
> >> @@ -555,6 +555,9 @@ DEFINE_MTHP_STAT_ATTR(anon_fault_fallback, MTHP_STAT_ANON_FAULT_FALLBACK);
> >> DEFINE_MTHP_STAT_ATTR(anon_fault_fallback_charge, MTHP_STAT_ANON_FAULT_FALLBACK_CHARGE);
> >> DEFINE_MTHP_STAT_ATTR(swpout, MTHP_STAT_SWPOUT);
> >> DEFINE_MTHP_STAT_ATTR(swpout_fallback, MTHP_STAT_SWPOUT_FALLBACK);
> >> +DEFINE_MTHP_STAT_ATTR(file_alloc, MTHP_STAT_FILE_ALLOC);
> >> +DEFINE_MTHP_STAT_ATTR(file_fallback, MTHP_STAT_FILE_FALLBACK);
> >> +DEFINE_MTHP_STAT_ATTR(file_fallback_charge, MTHP_STAT_FILE_FALLBACK_CHARGE);
> >>
> >> static struct attribute *stats_attrs[] = {
> >> &anon_fault_alloc_attr.attr,
> >> @@ -562,6 +565,9 @@ static struct attribute *stats_attrs[] = {
> >> &anon_fault_fallback_charge_attr.attr,
> >> &swpout_attr.attr,
> >> &swpout_fallback_attr.attr,
> >> + &file_alloc_attr.attr,
> >> + &file_fallback_attr.attr,
> >> + &file_fallback_charge_attr.attr,
> >> NULL,
> >> };
> >>
> >> diff --git a/mm/shmem.c b/mm/shmem.c
> >> index f5469c357be6..99bd3c34f0fb 100644
> >> --- a/mm/shmem.c
> >> +++ b/mm/shmem.c
> >> @@ -1773,6 +1773,9 @@ static struct folio *shmem_alloc_and_add_folio(struct vm_fault *vmf,
> >>
> >> if (pages == HPAGE_PMD_NR)
> >> count_vm_event(THP_FILE_FALLBACK);
> >> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> >> + count_mthp_stat(order, MTHP_STAT_FILE_FALLBACK);
> >> +#endif
> >
> > Using the conditional compilation directives here is a bit weird :)
> > Would there be any issues if we were to drop them?
>
> It will cause build errors if CONFIG_TRANSPARENT_HUGEPAGE is not enabled.
Sorry, I got it wrong :p
>
> >
> > Since THP_FILE_FALLBACK is working as expected, MTHP_STAT_FILE_FALLBACK
> > should work as well without the conditional compilation directives, IIUC.
>
> No, you should take a look at how count_mthp_stat() is defined :)
You're correct. count_mthp_stat() does cause a compilation error without them
when CONFIG_TRANSPARENT_HUGEPAGE is not defined.
Thanks,
Lance
^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH v5 6/6] mm: shmem: add mTHP counters for anonymous shmem
2024-06-11 10:11 ` [PATCH v5 6/6] mm: shmem: add mTHP counters for anonymous shmem Baolin Wang
2024-06-12 8:04 ` Lance Yang
@ 2024-06-12 13:46 ` Kefeng Wang
2024-06-13 1:00 ` Baolin Wang
2024-06-12 14:18 ` Lance Yang
2 siblings, 1 reply; 45+ messages in thread
From: Kefeng Wang @ 2024-06-12 13:46 UTC (permalink / raw)
To: Baolin Wang, akpm, hughd
Cc: willy, david, ying.huang, 21cnbao, ryan.roberts, shy828301, ziy,
ioworker0, da.gomez, p.raghav, linux-mm, linux-kernel
On 2024/6/11 18:11, Baolin Wang wrote:
> Add mTHP counters for anonymous shmem.
>
> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
> ---
> include/linux/huge_mm.h | 3 +++
> mm/huge_memory.c | 6 ++++++
> mm/shmem.c | 18 +++++++++++++++---
> 3 files changed, 24 insertions(+), 3 deletions(-)
>
> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> index 909cfc67521d..212cca384d7e 100644
> --- a/include/linux/huge_mm.h
> +++ b/include/linux/huge_mm.h
> @@ -281,6 +281,9 @@ enum mthp_stat_item {
> MTHP_STAT_ANON_FAULT_FALLBACK_CHARGE,
> MTHP_STAT_SWPOUT,
> MTHP_STAT_SWPOUT_FALLBACK,
> + MTHP_STAT_FILE_ALLOC,
> + MTHP_STAT_FILE_FALLBACK,
> + MTHP_STAT_FILE_FALLBACK_CHARGE,
> __MTHP_STAT_COUNT
> };
>
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 1360a1903b66..3fbcd77f5957 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -555,6 +555,9 @@ DEFINE_MTHP_STAT_ATTR(anon_fault_fallback, MTHP_STAT_ANON_FAULT_FALLBACK);
> DEFINE_MTHP_STAT_ATTR(anon_fault_fallback_charge, MTHP_STAT_ANON_FAULT_FALLBACK_CHARGE);
> DEFINE_MTHP_STAT_ATTR(swpout, MTHP_STAT_SWPOUT);
> DEFINE_MTHP_STAT_ATTR(swpout_fallback, MTHP_STAT_SWPOUT_FALLBACK);
> +DEFINE_MTHP_STAT_ATTR(file_alloc, MTHP_STAT_FILE_ALLOC);
> +DEFINE_MTHP_STAT_ATTR(file_fallback, MTHP_STAT_FILE_FALLBACK);
> +DEFINE_MTHP_STAT_ATTR(file_fallback_charge, MTHP_STAT_FILE_FALLBACK_CHARGE);
>
> static struct attribute *stats_attrs[] = {
> &anon_fault_alloc_attr.attr,
> @@ -562,6 +565,9 @@ static struct attribute *stats_attrs[] = {
> &anon_fault_fallback_charge_attr.attr,
> &swpout_attr.attr,
> &swpout_fallback_attr.attr,
> + &file_alloc_attr.attr,
> + &file_fallback_attr.attr,
> + &file_fallback_charge_attr.attr,
> NULL,
> };
>
> diff --git a/mm/shmem.c b/mm/shmem.c
> index f5469c357be6..99bd3c34f0fb 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -1773,6 +1773,9 @@ static struct folio *shmem_alloc_and_add_folio(struct vm_fault *vmf,
>
> if (pages == HPAGE_PMD_NR)
> count_vm_event(THP_FILE_FALLBACK);
> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> + count_mthp_stat(order, MTHP_STAT_FILE_FALLBACK);
> +#endif
> order = next_order(&suitable_orders, order);
> }
> } else {
> @@ -1792,9 +1795,15 @@ static struct folio *shmem_alloc_and_add_folio(struct vm_fault *vmf,
> if (xa_find(&mapping->i_pages, &index,
> index + pages - 1, XA_PRESENT)) {
> error = -EEXIST;
> - } else if (pages == HPAGE_PMD_NR) {
> - count_vm_event(THP_FILE_FALLBACK);
> - count_vm_event(THP_FILE_FALLBACK_CHARGE);
> + } else if (pages > 1) {
> + if (pages == HPAGE_PMD_NR) {
> + count_vm_event(THP_FILE_FALLBACK);
> + count_vm_event(THP_FILE_FALLBACK_CHARGE);
> + }
> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> + count_mthp_stat(folio_order(folio), MTHP_STAT_FILE_FALLBACK);
> + count_mthp_stat(folio_order(folio), MTHP_STAT_FILE_FALLBACK_CHARGE);
For pages > 1 we already have the correct order, so count_mthp_stat(order, MTHP_XXX)?
> +#endif
> }
> goto unlock;
> }
> @@ -2168,6 +2177,9 @@ static int shmem_get_folio_gfp(struct inode *inode, pgoff_t index,
> if (!IS_ERR(folio)) {
> if (folio_test_pmd_mappable(folio))
> count_vm_event(THP_FILE_ALLOC);
> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> + count_mthp_stat(folio_order(folio), MTHP_STAT_FILE_ALLOC);
> +#endif
> goto alloced;
> }
> if (PTR_ERR(folio) == -EEXIST)
^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH v5 6/6] mm: shmem: add mTHP counters for anonymous shmem
2024-06-12 13:46 ` Kefeng Wang
@ 2024-06-13 1:00 ` Baolin Wang
0 siblings, 0 replies; 45+ messages in thread
From: Baolin Wang @ 2024-06-13 1:00 UTC (permalink / raw)
To: Kefeng Wang, akpm, hughd
Cc: willy, david, ying.huang, 21cnbao, ryan.roberts, shy828301, ziy,
ioworker0, da.gomez, p.raghav, linux-mm, linux-kernel
On 2024/6/12 21:46, Kefeng Wang wrote:
>
>
> On 2024/6/11 18:11, Baolin Wang wrote:
>> Add mTHP counters for anonymous shmem.
>>
>> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
>> ---
>> include/linux/huge_mm.h | 3 +++
>> mm/huge_memory.c | 6 ++++++
>> mm/shmem.c | 18 +++++++++++++++---
>> 3 files changed, 24 insertions(+), 3 deletions(-)
>>
>> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
>> index 909cfc67521d..212cca384d7e 100644
>> --- a/include/linux/huge_mm.h
>> +++ b/include/linux/huge_mm.h
>> @@ -281,6 +281,9 @@ enum mthp_stat_item {
>> MTHP_STAT_ANON_FAULT_FALLBACK_CHARGE,
>> MTHP_STAT_SWPOUT,
>> MTHP_STAT_SWPOUT_FALLBACK,
>> + MTHP_STAT_FILE_ALLOC,
>> + MTHP_STAT_FILE_FALLBACK,
>> + MTHP_STAT_FILE_FALLBACK_CHARGE,
>> __MTHP_STAT_COUNT
>> };
>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>> index 1360a1903b66..3fbcd77f5957 100644
>> --- a/mm/huge_memory.c
>> +++ b/mm/huge_memory.c
>> @@ -555,6 +555,9 @@ DEFINE_MTHP_STAT_ATTR(anon_fault_fallback,
>> MTHP_STAT_ANON_FAULT_FALLBACK);
>> DEFINE_MTHP_STAT_ATTR(anon_fault_fallback_charge,
>> MTHP_STAT_ANON_FAULT_FALLBACK_CHARGE);
>> DEFINE_MTHP_STAT_ATTR(swpout, MTHP_STAT_SWPOUT);
>> DEFINE_MTHP_STAT_ATTR(swpout_fallback, MTHP_STAT_SWPOUT_FALLBACK);
>> +DEFINE_MTHP_STAT_ATTR(file_alloc, MTHP_STAT_FILE_ALLOC);
>> +DEFINE_MTHP_STAT_ATTR(file_fallback, MTHP_STAT_FILE_FALLBACK);
>> +DEFINE_MTHP_STAT_ATTR(file_fallback_charge,
>> MTHP_STAT_FILE_FALLBACK_CHARGE);
>> static struct attribute *stats_attrs[] = {
>> &anon_fault_alloc_attr.attr,
>> @@ -562,6 +565,9 @@ static struct attribute *stats_attrs[] = {
>> &anon_fault_fallback_charge_attr.attr,
>> &swpout_attr.attr,
>> &swpout_fallback_attr.attr,
>> + &file_alloc_attr.attr,
>> + &file_fallback_attr.attr,
>> + &file_fallback_charge_attr.attr,
>> NULL,
>> };
>> diff --git a/mm/shmem.c b/mm/shmem.c
>> index f5469c357be6..99bd3c34f0fb 100644
>> --- a/mm/shmem.c
>> +++ b/mm/shmem.c
>> @@ -1773,6 +1773,9 @@ static struct folio
>> *shmem_alloc_and_add_folio(struct vm_fault *vmf,
>> if (pages == HPAGE_PMD_NR)
>> count_vm_event(THP_FILE_FALLBACK);
>> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
>> + count_mthp_stat(order, MTHP_STAT_FILE_FALLBACK);
>> +#endif
>> order = next_order(&suitable_orders, order);
>> }
>> } else {
>> @@ -1792,9 +1795,15 @@ static struct folio
>> *shmem_alloc_and_add_folio(struct vm_fault *vmf,
>> if (xa_find(&mapping->i_pages, &index,
>> index + pages - 1, XA_PRESENT)) {
>> error = -EEXIST;
>> - } else if (pages == HPAGE_PMD_NR) {
>> - count_vm_event(THP_FILE_FALLBACK);
>> - count_vm_event(THP_FILE_FALLBACK_CHARGE);
>> + } else if (pages > 1) {
>> + if (pages == HPAGE_PMD_NR) {
>> + count_vm_event(THP_FILE_FALLBACK);
>> + count_vm_event(THP_FILE_FALLBACK_CHARGE);
>> + }
>> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
>> + count_mthp_stat(folio_order(folio),
>> MTHP_STAT_FILE_FALLBACK);
>> + count_mthp_stat(folio_order(folio),
>> MTHP_STAT_FILE_FALLBACK_CHARGE);
>
> For pages > 1 we already have the correct order, so count_mthp_stat(order, MTHP_XXX)?
Yes, I can use 'order' instead if a new version is needed.
^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH v5 6/6] mm: shmem: add mTHP counters for anonymous shmem
2024-06-11 10:11 ` [PATCH v5 6/6] mm: shmem: add mTHP counters for anonymous shmem Baolin Wang
2024-06-12 8:04 ` Lance Yang
2024-06-12 13:46 ` Kefeng Wang
@ 2024-06-12 14:18 ` Lance Yang
2024-06-13 1:08 ` Baolin Wang
2 siblings, 1 reply; 45+ messages in thread
From: Lance Yang @ 2024-06-12 14:18 UTC (permalink / raw)
To: Baolin Wang
Cc: akpm, hughd, willy, david, wangkefeng.wang, ying.huang, 21cnbao,
ryan.roberts, shy828301, ziy, da.gomez, p.raghav, linux-mm,
linux-kernel
On Tue, Jun 11, 2024 at 6:11 PM Baolin Wang
<baolin.wang@linux.alibaba.com> wrote:
>
> Add mTHP counters for anonymous shmem.
>
> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
LGTM. Feel free to add:
Reviewed-by: Lance Yang <ioworker0@gmail.com>
Just a friendly reminder: We also need to update the documentation
for the counters in transhuge.rst.
Thanks,
Lance
> ---
> include/linux/huge_mm.h | 3 +++
> mm/huge_memory.c | 6 ++++++
> mm/shmem.c | 18 +++++++++++++++---
> 3 files changed, 24 insertions(+), 3 deletions(-)
>
> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> index 909cfc67521d..212cca384d7e 100644
> --- a/include/linux/huge_mm.h
> +++ b/include/linux/huge_mm.h
> @@ -281,6 +281,9 @@ enum mthp_stat_item {
> MTHP_STAT_ANON_FAULT_FALLBACK_CHARGE,
> MTHP_STAT_SWPOUT,
> MTHP_STAT_SWPOUT_FALLBACK,
> + MTHP_STAT_FILE_ALLOC,
> + MTHP_STAT_FILE_FALLBACK,
> + MTHP_STAT_FILE_FALLBACK_CHARGE,
> __MTHP_STAT_COUNT
> };
>
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 1360a1903b66..3fbcd77f5957 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -555,6 +555,9 @@ DEFINE_MTHP_STAT_ATTR(anon_fault_fallback, MTHP_STAT_ANON_FAULT_FALLBACK);
> DEFINE_MTHP_STAT_ATTR(anon_fault_fallback_charge, MTHP_STAT_ANON_FAULT_FALLBACK_CHARGE);
> DEFINE_MTHP_STAT_ATTR(swpout, MTHP_STAT_SWPOUT);
> DEFINE_MTHP_STAT_ATTR(swpout_fallback, MTHP_STAT_SWPOUT_FALLBACK);
> +DEFINE_MTHP_STAT_ATTR(file_alloc, MTHP_STAT_FILE_ALLOC);
> +DEFINE_MTHP_STAT_ATTR(file_fallback, MTHP_STAT_FILE_FALLBACK);
> +DEFINE_MTHP_STAT_ATTR(file_fallback_charge, MTHP_STAT_FILE_FALLBACK_CHARGE);
>
> static struct attribute *stats_attrs[] = {
> &anon_fault_alloc_attr.attr,
> @@ -562,6 +565,9 @@ static struct attribute *stats_attrs[] = {
> &anon_fault_fallback_charge_attr.attr,
> &swpout_attr.attr,
> &swpout_fallback_attr.attr,
> + &file_alloc_attr.attr,
> + &file_fallback_attr.attr,
> + &file_fallback_charge_attr.attr,
> NULL,
> };
>
> diff --git a/mm/shmem.c b/mm/shmem.c
> index f5469c357be6..99bd3c34f0fb 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -1773,6 +1773,9 @@ static struct folio *shmem_alloc_and_add_folio(struct vm_fault *vmf,
>
> if (pages == HPAGE_PMD_NR)
> count_vm_event(THP_FILE_FALLBACK);
> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> + count_mthp_stat(order, MTHP_STAT_FILE_FALLBACK);
> +#endif
> order = next_order(&suitable_orders, order);
> }
> } else {
> @@ -1792,9 +1795,15 @@ static struct folio *shmem_alloc_and_add_folio(struct vm_fault *vmf,
> if (xa_find(&mapping->i_pages, &index,
> index + pages - 1, XA_PRESENT)) {
> error = -EEXIST;
> - } else if (pages == HPAGE_PMD_NR) {
> - count_vm_event(THP_FILE_FALLBACK);
> - count_vm_event(THP_FILE_FALLBACK_CHARGE);
> + } else if (pages > 1) {
> + if (pages == HPAGE_PMD_NR) {
> + count_vm_event(THP_FILE_FALLBACK);
> + count_vm_event(THP_FILE_FALLBACK_CHARGE);
> + }
> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> + count_mthp_stat(folio_order(folio), MTHP_STAT_FILE_FALLBACK);
> + count_mthp_stat(folio_order(folio), MTHP_STAT_FILE_FALLBACK_CHARGE);
> +#endif
> }
> goto unlock;
> }
> @@ -2168,6 +2177,9 @@ static int shmem_get_folio_gfp(struct inode *inode, pgoff_t index,
> if (!IS_ERR(folio)) {
> if (folio_test_pmd_mappable(folio))
> count_vm_event(THP_FILE_ALLOC);
> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> + count_mthp_stat(folio_order(folio), MTHP_STAT_FILE_ALLOC);
> +#endif
> goto alloced;
> }
> if (PTR_ERR(folio) == -EEXIST)
> --
> 2.39.3
>
^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH v5 6/6] mm: shmem: add mTHP counters for anonymous shmem
2024-06-12 14:18 ` Lance Yang
@ 2024-06-13 1:08 ` Baolin Wang
0 siblings, 0 replies; 45+ messages in thread
From: Baolin Wang @ 2024-06-13 1:08 UTC (permalink / raw)
To: Lance Yang
Cc: akpm, hughd, willy, david, wangkefeng.wang, ying.huang, 21cnbao,
ryan.roberts, shy828301, ziy, da.gomez, p.raghav, linux-mm,
linux-kernel
On 2024/6/12 22:18, Lance Yang wrote:
> On Tue, Jun 11, 2024 at 6:11 PM Baolin Wang
> <baolin.wang@linux.alibaba.com> wrote:
>>
>> Add mTHP counters for anonymous shmem.
>>
>> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
>
> LGTM. Feel free to add:
> Reviewed-by: Lance Yang <ioworker0@gmail.com>
Thanks.
>
> Just a friendly reminder: We also need to update the documentation
> for the counters in transhuge.rst.
Indeed.
Andrew, could you help to fold the following changes into this patch? Thanks.
diff --git a/Documentation/admin-guide/mm/transhuge.rst b/Documentation/admin-guide/mm/transhuge.rst
index e7232b46fe14..8f6ffbfc4b16 100644
--- a/Documentation/admin-guide/mm/transhuge.rst
+++ b/Documentation/admin-guide/mm/transhuge.rst
@@ -501,6 +501,19 @@ swpout_fallback
Usually because failed to allocate some continuous swap space
for the huge page.
+file_alloc
+ is incremented every time a file huge page is successfully
+ allocated.
+
+file_fallback
+ is incremented if a file huge page is attempted to be allocated
+ but fails and instead falls back to using small pages.
+
+file_fallback_charge
+ is incremented if a file huge page cannot be charged and instead
+ falls back to using small pages even though the allocation was
+ successful.
+
As the system ages, allocating huge pages may be expensive as the
system uses memory compaction to copy data around memory to free a
huge page for use. There are some counters in ``/proc/vmstat`` to help
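(For reference, these per-order counters then show up under the existing
per-size sysfs directories, e.g.
/sys/kernel/mm/transparent_hugepage/hugepage-64kB/stats/file_alloc next to
anon_fault_alloc and the swpout counters; the 64kB path is just an example
size.)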
^ permalink raw reply related [flat|nested] 45+ messages in thread
* Re: [PATCH v5 0/6] add mTHP support for anonymous shmem
2024-06-11 10:11 [PATCH v5 0/6] add mTHP support for anonymous shmem Baolin Wang
` (5 preceding siblings ...)
2024-06-11 10:11 ` [PATCH v5 6/6] mm: shmem: add mTHP counters for anonymous shmem Baolin Wang
@ 2024-07-04 18:43 ` Matthew Wilcox
2024-07-04 19:03 ` David Hildenbrand
6 siblings, 1 reply; 45+ messages in thread
From: Matthew Wilcox @ 2024-07-04 18:43 UTC (permalink / raw)
To: Baolin Wang
Cc: akpm, hughd, david, wangkefeng.wang, ying.huang, 21cnbao,
ryan.roberts, shy828301, ziy, ioworker0, da.gomez, p.raghav,
linux-mm, linux-kernel
On Tue, Jun 11, 2024 at 06:11:04PM +0800, Baolin Wang wrote:
> Anonymous pages have already been supported for multi-size (mTHP) allocation
> through commit 19eaf44954df, that can allow THP to be configured through the
> sysfs interface located at '/sys/kernel/mm/transparent_hugepage/hugepage-XXkb/enabled'.
>
> However, the anonymous shmem will ignore the anonymous mTHP rule configured
> through the sysfs interface, and can only use the PMD-mapped THP, that is not
> reasonable. Many implement anonymous page sharing through mmap(MAP_SHARED |
> MAP_ANONYMOUS), especially in database usage scenarios, therefore, users expect
> to apply an unified mTHP strategy for anonymous pages, also including the
> anonymous shared pages, in order to enjoy the benefits of mTHP. For example,
> lower latency than PMD-mapped THP, smaller memory bloat than PMD-mapped THP,
> contiguous PTEs on ARM architecture to reduce TLB miss etc.
OK, this makes sense.
> As discussed in the bi-weekly MM meeting[1], the mTHP controls should control
> all of shmem, not only anonymous shmem, but support will be added iteratively.
> Therefore, this patch set starts with support for anonymous shmem.
But then this doesn't. You say first that users want the same controls
to control all anonymous memory, then you introduce a completely
separate set of controls for shared anonymous memory.
shmem has two uses:
- MAP_ANONYMOUS | MAP_SHARED (this patch set)
- tmpfs
For the second use case we don't want controls *at all*, we want the
same heuristics used for all other filesystems to apply to tmpfs.
There's no reason to have separate controls for choosing folio size
in shmem.
> The primary strategy is similar to supporting anonymous mTHP. Introduce
> a new interface '/mm/transparent_hugepage/hugepage-XXkb/shmem_enabled',
> which can have almost the same values as the top-level
> '/sys/kernel/mm/transparent_hugepage/shmem_enabled', with adding a new
> additional "inherit" option and dropping the testing options 'force' and
> 'deny'. By default all sizes will be set to "never" except PMD size, which
> is set to "inherit". This ensures backward compatibility with the anonymous
> shmem enabled of the top level, meanwhile also allows independent control of
> anonymous shmem enabled for each mTHP.
^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH v5 0/6] add mTHP support for anonymous shmem
2024-07-04 18:43 ` [PATCH v5 0/6] add mTHP support " Matthew Wilcox
@ 2024-07-04 19:03 ` David Hildenbrand
2024-07-04 19:19 ` David Hildenbrand
0 siblings, 1 reply; 45+ messages in thread
From: David Hildenbrand @ 2024-07-04 19:03 UTC (permalink / raw)
To: Matthew Wilcox, Baolin Wang
Cc: akpm, hughd, wangkefeng.wang, ying.huang, 21cnbao, ryan.roberts,
shy828301, ziy, ioworker0, da.gomez, p.raghav, linux-mm,
linux-kernel
> shmem has two uses:
>
> - MAP_ANONYMOUS | MAP_SHARED (this patch set)
> - tmpfs
>
> For the second use case we don't want controls *at all*, we want the
> same heuristics used for all other filesystems to apply to tmpfs.
As discussed in the MM meeting, Hugh had a different opinion on that.
--
Cheers,
David / dhildenb
^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH v5 0/6] add mTHP support for anonymous shmem
2024-07-04 19:03 ` David Hildenbrand
@ 2024-07-04 19:19 ` David Hildenbrand
2024-07-04 19:49 ` Matthew Wilcox
0 siblings, 1 reply; 45+ messages in thread
From: David Hildenbrand @ 2024-07-04 19:19 UTC (permalink / raw)
To: Matthew Wilcox, Baolin Wang
Cc: akpm, hughd, wangkefeng.wang, ying.huang, 21cnbao, ryan.roberts,
shy828301, ziy, ioworker0, da.gomez, p.raghav, linux-mm,
linux-kernel
On 04.07.24 21:03, David Hildenbrand wrote:
>> shmem has two uses:
>>
>> - MAP_ANONYMOUS | MAP_SHARED (this patch set)
>> - tmpfs
>>
>> For the second use case we don't want controls *at all*, we want the
>> same heuristics used for all other filesystems to apply to tmpfs.
>
> As discussed in the MM meeting, Hugh had a different opinion on that.
FWIW, I just recalled that I wrote a quick summary:
https://lkml.kernel.org/r/f1783ff0-65bd-4b2b-8952-52b6822a0835@redhat.com
I believe the meetings are recorded as well, but never looked at recordings.
--
Cheers,
David / dhildenb
^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH v5 0/6] add mTHP support for anonymous shmem
2024-07-04 19:19 ` David Hildenbrand
@ 2024-07-04 19:49 ` Matthew Wilcox
2024-07-05 5:47 ` Baolin Wang
0 siblings, 1 reply; 45+ messages in thread
From: Matthew Wilcox @ 2024-07-04 19:49 UTC (permalink / raw)
To: David Hildenbrand
Cc: Baolin Wang, akpm, hughd, wangkefeng.wang, ying.huang, 21cnbao,
ryan.roberts, shy828301, ziy, ioworker0, da.gomez, p.raghav,
linux-mm, linux-kernel
On Thu, Jul 04, 2024 at 09:19:10PM +0200, David Hildenbrand wrote:
> On 04.07.24 21:03, David Hildenbrand wrote:
> > > shmem has two uses:
> > >
> > > - MAP_ANONYMOUS | MAP_SHARED (this patch set)
> > > - tmpfs
> > >
> > > For the second use case we don't want controls *at all*, we want the
> > > same heuristics used for all other filesystems to apply to tmpfs.
> >
> > As discussed in the MM meeting, Hugh had a different opinion on that.
>
> FWIW, I just recalled that I wrote a quick summary:
>
> https://lkml.kernel.org/r/f1783ff0-65bd-4b2b-8952-52b6822a0835@redhat.com
>
> I believe the meetings are recorded as well, but never looked at recordings.
That's not what I understood Hugh to mean. To me, it seemed that Hugh
was expressing an opinion on using shmem as shmem, not as using it as
tmpfs.
If I misunderstood Hugh, well, I still disagree. We should not have
separate controls for this. tmpfs is just not that special.
^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH v5 0/6] add mTHP support for anonymous shmem
2024-07-04 19:49 ` Matthew Wilcox
@ 2024-07-05 5:47 ` Baolin Wang
2024-07-05 8:45 ` Ryan Roberts
0 siblings, 1 reply; 45+ messages in thread
From: Baolin Wang @ 2024-07-05 5:47 UTC (permalink / raw)
To: Matthew Wilcox, David Hildenbrand
Cc: akpm, hughd, wangkefeng.wang, ying.huang, 21cnbao, ryan.roberts,
shy828301, ziy, ioworker0, da.gomez, p.raghav, linux-mm,
linux-kernel
On 2024/7/5 03:49, Matthew Wilcox wrote:
> On Thu, Jul 04, 2024 at 09:19:10PM +0200, David Hildenbrand wrote:
>> On 04.07.24 21:03, David Hildenbrand wrote:
>>>> shmem has two uses:
>>>>
>>>> - MAP_ANONYMOUS | MAP_SHARED (this patch set)
>>>> - tmpfs
>>>>
>>>> For the second use case we don't want controls *at all*, we want the
>>>> same heuristics used for all other filesystems to apply to tmpfs.
>>>
>>> As discussed in the MM meeting, Hugh had a different opinion on that.
>>
>> FWIW, I just recalled that I wrote a quick summary:
>>
>> https://lkml.kernel.org/r/f1783ff0-65bd-4b2b-8952-52b6822a0835@redhat.com
>>
>> I believe the meetings are recorded as well, but never looked at recordings.
>
> That's not what I understood Hugh to mean. To me, it seemed that Hugh
> was expressing an opinion on using shmem as shmem, not as using it as
> tmpfs.
>
> If I misunderstood Hugh, well, I still disagree. We should not have
> separate controls for this. tmpfs is just not that special.
But now we already have a PMD-mapped THP control for tmpfs, and mTHP
simply extends this control to per-size.
IIUC, as David mentioned before, for tmpfs, mTHP should act like a huge
order filter which should be respected by the expected huge orders in
the write() and fallocate() paths. This would also solve the issue of
allocating huge orders in writable mmap() path for tmpfs, as well as
unifying the interface.
Anyway, I will try to provide an RFC to discuss the mTHP for tmpfs approach.
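To sketch the "huge order filter" idea mentioned above (illustrative only,
not necessarily what the RFC will look like, and shmem_filter_orders() is a
made-up name): the per-size controls would simply mask the orders a given
path is allowed to try, reusing the bitmasks this series already maintains:

    /*
     * Sketch only: restrict the orders a write()/fallocate() path may try
     * to whatever the per-size shmem controls currently enable.
     */
    static unsigned long shmem_filter_orders(unsigned long wanted_orders)
    {
            unsigned long enabled;

            enabled  = READ_ONCE(huge_shmem_orders_always);
            enabled |= READ_ONCE(huge_shmem_orders_within_size);
            enabled |= READ_ONCE(huge_shmem_orders_madvise);
            enabled |= READ_ONCE(huge_shmem_orders_inherit);

            return wanted_orders & enabled;
    }

The caller would derive wanted_orders from the write size (or from readahead
information) and then only allocate from whatever orders survive the mask.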
^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH v5 0/6] add mTHP support for anonymous shmem
2024-07-05 5:47 ` Baolin Wang
@ 2024-07-05 8:45 ` Ryan Roberts
2024-07-05 8:59 ` David Hildenbrand
0 siblings, 1 reply; 45+ messages in thread
From: Ryan Roberts @ 2024-07-05 8:45 UTC (permalink / raw)
To: Baolin Wang, Matthew Wilcox, David Hildenbrand
Cc: akpm, hughd, wangkefeng.wang, ying.huang, 21cnbao, shy828301, ziy,
ioworker0, da.gomez, p.raghav, linux-mm, linux-kernel
On 05/07/2024 06:47, Baolin Wang wrote:
>
>
> On 2024/7/5 03:49, Matthew Wilcox wrote:
>> On Thu, Jul 04, 2024 at 09:19:10PM +0200, David Hildenbrand wrote:
>>> On 04.07.24 21:03, David Hildenbrand wrote:
>>>>> shmem has two uses:
>>>>>
>>>>> - MAP_ANONYMOUS | MAP_SHARED (this patch set)
>>>>> - tmpfs
>>>>>
>>>>> For the second use case we don't want controls *at all*, we want the
>>>>> same heuristics used for all other filesystems to apply to tmpfs.
>>>>
>>>> As discussed in the MM meeting, Hugh had a different opinion on that.
>>>
>>> FWIW, I just recalled that I wrote a quick summary:
>>>
>>> https://lkml.kernel.org/r/f1783ff0-65bd-4b2b-8952-52b6822a0835@redhat.com
>>>
>>> I believe the meetings are recorded as well, but never looked at recordings.
>>
>> That's not what I understood Hugh to mean. To me, it seemed that Hugh
>> was expressing an opinion on using shmem as shmem, not as using it as
>> tmpfs.
>>
>> If I misunderstood Hugh, well, I still disagree. We should not have
>> separate controls for this. tmpfs is just not that special.
I wasn't at the meeting that's being referred to, but I thought we previously
agreed that tmpfs *is* special because in some configurations it's not backed by
swap, so it is locked in RAM?
>
> But now we already have a PMD-mapped THP control for tmpfs, and mTHP simply
> extends this control to per-size.
>
> IIUC, as David mentioned before, for tmpfs, mTHP should act like a huge order
> filter which should be respected by the expected huge orders in the write() and
> fallocate() paths. This would also solve the issue of allocating huge orders in
> writable mmap() path for tmpfs, as well as unifying the interface.
>
> Anyway, I will try to provide an RFC to discuss the mTHP for tmpfs approach.
^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH v5 0/6] add mTHP support for anonymous shmem
2024-07-05 8:45 ` Ryan Roberts
@ 2024-07-05 8:59 ` David Hildenbrand
2024-07-05 9:13 ` Ryan Roberts
2024-07-07 16:39 ` Daniel Gomez
0 siblings, 2 replies; 45+ messages in thread
From: David Hildenbrand @ 2024-07-05 8:59 UTC (permalink / raw)
To: Ryan Roberts, Baolin Wang, Matthew Wilcox
Cc: akpm, hughd, wangkefeng.wang, ying.huang, 21cnbao, shy828301, ziy,
ioworker0, da.gomez, p.raghav, linux-mm, linux-kernel
On 05.07.24 10:45, Ryan Roberts wrote:
> On 05/07/2024 06:47, Baolin Wang wrote:
>>
>>
>> On 2024/7/5 03:49, Matthew Wilcox wrote:
>>> On Thu, Jul 04, 2024 at 09:19:10PM +0200, David Hildenbrand wrote:
>>>> On 04.07.24 21:03, David Hildenbrand wrote:
>>>>>> shmem has two uses:
>>>>>>
>>>>>> - MAP_ANONYMOUS | MAP_SHARED (this patch set)
>>>>>> - tmpfs
>>>>>>
>>>>>> For the second use case we don't want controls *at all*, we want the
>>>>>> same heuristics used for all other filesystems to apply to tmpfs.
>>>>>
>>>>> As discussed in the MM meeting, Hugh had a different opinion on that.
>>>>
>>>> FWIW, I just recalled that I wrote a quick summary:
>>>>
>>>> https://lkml.kernel.org/r/f1783ff0-65bd-4b2b-8952-52b6822a0835@redhat.com
>>>>
>>>> I believe the meetings are recorded as well, but never looked at recordings.
>>>
>>> That's not what I understood Hugh to mean. To me, it seemed that Hugh
>>> was expressing an opinion on using shmem as shmem, not as using it as
>>> tmpfs.
>>>
>>> If I misunderstood Hugh, well, I still disagree. We should not have
>>> separate controls for this. tmpfs is just not that special.
>
> I wasn't at the meeting that's being referred to, but I thought we previously
> agreed that tmpfs *is* special because in some configurations it's not backed by
> swap, so it is locked in RAM?
There are multiple things to that, like:
* Machines only having limited/no swap configured
* tmpfs can be configured to never go to swap
* memfd/tmpfs files getting used purely for mmap(): there is no real
difference to MAP_ANON|MAP_SHARE besides the processes we share that
memory with.
Especially when it comes to memory waste concerns and access behavior in
some cases, tmpfs behaved much more like anonymous memory. But there are
for sure other use cases where tmpfs is not that special.
My opinion is that we need to let people configure orders (if you feel
like it, configure all), but *select* the order to allocate based on
readahead information -- in contrast to anonymous memory where we start
at the highest order and don't have readahead information available.
Maybe we need different "order allocation" logic for read/write vs.
fault, not sure.
But I don't maintain that code, so I can only give stupid suggestions
and repeat what I understood from the meeting with Hugh and Kirill :)
--
Cheers,
David / dhildenb
^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH v5 0/6] add mTHP support for anonymous shmem
2024-07-05 8:59 ` David Hildenbrand
@ 2024-07-05 9:13 ` Ryan Roberts
2024-07-05 9:16 ` David Hildenbrand
2024-07-07 16:39 ` Daniel Gomez
1 sibling, 1 reply; 45+ messages in thread
From: Ryan Roberts @ 2024-07-05 9:13 UTC (permalink / raw)
To: David Hildenbrand, Baolin Wang, Matthew Wilcox
Cc: akpm, hughd, wangkefeng.wang, ying.huang, 21cnbao, shy828301, ziy,
ioworker0, da.gomez, p.raghav, linux-mm, linux-kernel
On 05/07/2024 09:59, David Hildenbrand wrote:
> On 05.07.24 10:45, Ryan Roberts wrote:
>> On 05/07/2024 06:47, Baolin Wang wrote:
>>>
>>>
>>> On 2024/7/5 03:49, Matthew Wilcox wrote:
>>>> On Thu, Jul 04, 2024 at 09:19:10PM +0200, David Hildenbrand wrote:
>>>>> On 04.07.24 21:03, David Hildenbrand wrote:
>>>>>>> shmem has two uses:
>>>>>>>
>>>>>>> - MAP_ANONYMOUS | MAP_SHARED (this patch set)
>>>>>>> - tmpfs
>>>>>>>
>>>>>>> For the second use case we don't want controls *at all*, we want the
>>>>>>> same heuristics used for all other filesystems to apply to tmpfs.
>>>>>>
>>>>>> As discussed in the MM meeting, Hugh had a different opinion on that.
>>>>>
>>>>> FWIW, I just recalled that I wrote a quick summary:
>>>>>
>>>>> https://lkml.kernel.org/r/f1783ff0-65bd-4b2b-8952-52b6822a0835@redhat.com
>>>>>
>>>>> I believe the meetings are recorded as well, but never looked at recordings.
>>>>
>>>> That's not what I understood Hugh to mean. To me, it seemed that Hugh
>>>> was expressing an opinion on using shmem as shmem, not as using it as
>>>> tmpfs.
>>>>
>>>> If I misunderstood Hugh, well, I still disagree. We should not have
>>>> separate controls for this. tmpfs is just not that special.
>>
>> I wasn't at the meeting that's being referred to, but I thought we previously
>> agreed that tmpfs *is* special because in some configurations it's not backed by
>> swap, so it is locked in RAM?
>
> There are multiple things to that, like:
>
> * Machines only having limited/no swap configured
> * tmpfs can be configured to never go to swap
> * memfd/tmpfs files getting used purely for mmap(): there is no real
> difference to MAP_ANON|MAP_SHARE besides the processes we share that
> memory with.
>
> Especially when it comes to memory waste concerns and access behavior in some
> cases, tmpfs behaved much more like anonymous memory. But there are for sure
> other use cases where tmpfs is not that special.
>
> My opinion is that we need to let people configure orders (if you feel like it,
> configure all), but *select* the order to allocate based on readahead
> information -- in contrast to anonymous memory where we start at the highest
> order and don't have readahead information available.
That approach is exactly what I proposed to start playing with yesterday [1] for
regular pagecache folio allocations too :)
[1] https://lore.kernel.org/linux-mm/bdde4008-60db-4717-a6b5-53d77ab76bdb@arm.com/
>
> Maybe we need different "order allocation" logic for read/write vs. fault, not
> sure.
>
> But I don't maintain that code, so I can only give stupid suggestions and repeat
> what I understood from the meeting with Hugh and Kirill :)
>
^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH v5 0/6] add mTHP support for anonymous shmem
2024-07-05 9:13 ` Ryan Roberts
@ 2024-07-05 9:16 ` David Hildenbrand
2024-07-05 9:23 ` Ryan Roberts
0 siblings, 1 reply; 45+ messages in thread
From: David Hildenbrand @ 2024-07-05 9:16 UTC (permalink / raw)
To: Ryan Roberts, Baolin Wang, Matthew Wilcox
Cc: akpm, hughd, wangkefeng.wang, ying.huang, 21cnbao, shy828301, ziy,
ioworker0, da.gomez, p.raghav, linux-mm, linux-kernel
On 05.07.24 11:13, Ryan Roberts wrote:
> On 05/07/2024 09:59, David Hildenbrand wrote:
>> On 05.07.24 10:45, Ryan Roberts wrote:
>>> On 05/07/2024 06:47, Baolin Wang wrote:
>>>>
>>>>
>>>> On 2024/7/5 03:49, Matthew Wilcox wrote:
>>>>> On Thu, Jul 04, 2024 at 09:19:10PM +0200, David Hildenbrand wrote:
>>>>>> On 04.07.24 21:03, David Hildenbrand wrote:
>>>>>>>> shmem has two uses:
>>>>>>>>
>>>>>>>> - MAP_ANONYMOUS | MAP_SHARED (this patch set)
>>>>>>>> - tmpfs
>>>>>>>>
>>>>>>>> For the second use case we don't want controls *at all*, we want the
>>>>>>>> same heuristics used for all other filesystems to apply to tmpfs.
>>>>>>>
>>>>>>> As discussed in the MM meeting, Hugh had a different opinion on that.
>>>>>>
>>>>>> FWIW, I just recalled that I wrote a quick summary:
>>>>>>
>>>>>> https://lkml.kernel.org/r/f1783ff0-65bd-4b2b-8952-52b6822a0835@redhat.com
>>>>>>
>>>>>> I believe the meetings are recorded as well, but never looked at recordings.
>>>>>
>>>>> That's not what I understood Hugh to mean. To me, it seemed that Hugh
>>>>> was expressing an opinion on using shmem as shmem, not as using it as
>>>>> tmpfs.
>>>>>
>>>>> If I misunderstood Hugh, well, I still disagree. We should not have
>>>>> separate controls for this. tmpfs is just not that special.
>>>
>>> I wasn't at the meeting that's being referred to, but I thought we previously
>>> agreed that tmpfs *is* special because in some configurations it's not backed by
>>> swap, so it is locked in RAM?
>>
>> There are multiple things to that, like:
>>
>> * Machines only having limited/no swap configured
>> * tmpfs can be configured to never go to swap
>> * memfd/tmpfs files getting used purely for mmap(): there is no real
>> difference to MAP_ANON|MAP_SHARE besides the processes we share that
>> memory with.
>>
>> Especially when it comes to memory waste concerns and access behavior in some
>> cases, tmpfs behaved much more like anonymous memory. But there are for sure
>> other use cases where tmpfs is not that special.
>>
>> My opinion is that we need to let people configure orders (if you feel like it,
>> configure all), but *select* the order to allocate based on readahead
>> information -- in contrast to anonymous memory where we start at the highest
>> order and don't have readahead information available.
>
> That approach is exactly what I proposed to start playing with yesterday [1] for
> regular pagecache folio allocations too :)
In German, there is this saying "zwei Dumme ein Gedanke".
The official English alternative is "great minds think alike".
... well, the direct German->English translation definitely has a
"German touch" to it: "two stupid ones one thought"
--
Cheers,
David / dhildenb
^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH v5 0/6] add mTHP support for anonymous shmem
2024-07-05 9:16 ` David Hildenbrand
@ 2024-07-05 9:23 ` Ryan Roberts
0 siblings, 0 replies; 45+ messages in thread
From: Ryan Roberts @ 2024-07-05 9:23 UTC (permalink / raw)
To: David Hildenbrand, Baolin Wang, Matthew Wilcox
Cc: akpm, hughd, wangkefeng.wang, ying.huang, 21cnbao, shy828301, ziy,
ioworker0, da.gomez, p.raghav, linux-mm, linux-kernel
On 05/07/2024 10:16, David Hildenbrand wrote:
> On 05.07.24 11:13, Ryan Roberts wrote:
>> On 05/07/2024 09:59, David Hildenbrand wrote:
>>> On 05.07.24 10:45, Ryan Roberts wrote:
>>>> On 05/07/2024 06:47, Baolin Wang wrote:
>>>>>
>>>>>
>>>>> On 2024/7/5 03:49, Matthew Wilcox wrote:
>>>>>> On Thu, Jul 04, 2024 at 09:19:10PM +0200, David Hildenbrand wrote:
>>>>>>> On 04.07.24 21:03, David Hildenbrand wrote:
>>>>>>>>> shmem has two uses:
>>>>>>>>>
>>>>>>>>> - MAP_ANONYMOUS | MAP_SHARED (this patch set)
>>>>>>>>> - tmpfs
>>>>>>>>>
>>>>>>>>> For the second use case we don't want controls *at all*, we want the
>>>>>>>>> same heuristics used for all other filesystems to apply to tmpfs.
>>>>>>>>
>>>>>>>> As discussed in the MM meeting, Hugh had a different opinion on that.
>>>>>>>
>>>>>>> FWIW, I just recalled that I wrote a quick summary:
>>>>>>>
>>>>>>> https://lkml.kernel.org/r/f1783ff0-65bd-4b2b-8952-52b6822a0835@redhat.com
>>>>>>>
>>>>>>> I believe the meetings are recorded as well, but never looked at recordings.
>>>>>>
>>>>>> That's not what I understood Hugh to mean. To me, it seemed that Hugh
>>>>>> was expressing an opinion on using shmem as shmem, not as using it as
>>>>>> tmpfs.
>>>>>>
>>>>>> If I misunderstood Hugh, well, I still disagree. We should not have
>>>>>> separate controls for this. tmpfs is just not that special.
>>>>
>>>> I wasn't at the meeting that's being referred to, but I thought we previously
>>>> agreed that tmpfs *is* special because in some configurations it's not backed by
>>>> swap, so it is locked in RAM?
>>>
>>> There are multiple things to that, like:
>>>
>>> * Machines only having limited/no swap configured
>>> * tmpfs can be configured to never go to swap
>>> * memfd/tmpfs files getting used purely for mmap(): there is no real
>>> difference to MAP_ANON|MAP_SHARE besides the processes we share that
>>> memory with.
>>>
>>> Especially when it comes to memory waste concerns and access behavior in some
>>> cases, tmpfs behaved much more like anonymous memory. But there are for sure
>>> other use cases where tmpfs is not that special.
>>>
>>> My opinion is that we need to let people configure orders (if you feel like it,
>>> configure all), but *select* the order to allocate based on readahead
>>> information -- in contrast to anonymous memory where we start at the highest
>>> order and don't have readahead information available.
>>
>> That approach is exactly what I proposed to start playing with yesterday [1] for
>> regular pagecache folio allocations too :)
>
> In German, there is this saying "zwei Dumme ein Gedanke".
>
> The official English alternative is "great minds think alike".
>
> ... well, the direct German->English translation definitely has a "German touch"
> to it: "two stupid ones one thought"
I definitely prefer the direct translation. :)
^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH v5 0/6] add mTHP support for anonymous shmem
2024-07-05 8:59 ` David Hildenbrand
2024-07-05 9:13 ` Ryan Roberts
@ 2024-07-07 16:39 ` Daniel Gomez
2024-07-09 8:28 ` Ryan Roberts
1 sibling, 1 reply; 45+ messages in thread
From: Daniel Gomez @ 2024-07-07 16:39 UTC (permalink / raw)
To: David Hildenbrand
Cc: Ryan Roberts, Baolin Wang, Matthew Wilcox,
akpm@linux-foundation.org, hughd@google.com,
wangkefeng.wang@huawei.com, ying.huang@intel.com,
21cnbao@gmail.com, shy828301@gmail.com, ziy@nvidia.com,
ioworker0@gmail.com, Pankaj Raghav, linux-mm@kvack.org,
linux-kernel@vger.kernel.org
On Fri, Jul 05, 2024 at 10:59:02AM GMT, David Hildenbrand wrote:
> On 05.07.24 10:45, Ryan Roberts wrote:
> > On 05/07/2024 06:47, Baolin Wang wrote:
> > >
> > >
> > > On 2024/7/5 03:49, Matthew Wilcox wrote:
> > > > On Thu, Jul 04, 2024 at 09:19:10PM +0200, David Hildenbrand wrote:
> > > > > On 04.07.24 21:03, David Hildenbrand wrote:
> > > > > > > shmem has two uses:
> > > > > > >
> > > > > > > - MAP_ANONYMOUS | MAP_SHARED (this patch set)
> > > > > > > - tmpfs
> > > > > > >
> > > > > > > For the second use case we don't want controls *at all*, we want the
> > > > > > > same heuristics used for all other filesystems to apply to tmpfs.
> > > > > >
> > > > > > As discussed in the MM meeting, Hugh had a different opinion on that.
> > > > >
> > > > > FWIW, I just recalled that I wrote a quick summary:
> > > > >
> > > > > https://lkml.kernel.org/r/f1783ff0-65bd-4b2b-8952-52b6822a0835@redhat.com
> > > > >
> > > > > I believe the meetings are recorded as well, but never looked at recordings.
> > > >
> > > > That's not what I understood Hugh to mean. To me, it seemed that Hugh
> > > > was expressing an opinion on using shmem as shmem, not as using it as
> > > > tmpfs.
> > > >
> > > > If I misunderstood Hugh, well, I still disagree. We should not have
> > > > separate controls for this. tmpfs is just not that special.
> >
> > I wasn't at the meeting that's being referred to, but I thought we previously
> > agreed that tmpfs *is* special because in some configurations it's not backed by
> > swap, so it is locked in RAM?
>
> There are multiple things to that, like:
>
> * Machines only having limited/no swap configured
> * tmpfs can be configured to never go to swap
> * memfd/tmpfs files getting used purely for mmap(): there is no real
> difference to MAP_ANON|MAP_SHARE besides the processes we share that
> memory with.
>
> Especially when it comes to memory waste concerns and access behavior in
> some cases, tmpfs behaved much more like anonymous memory. But there are for
> sure other use cases where tmpfs is not that special.
Having controls to select the allowable folio orders for tmpfs does not
address any of these issues. The suggested filesystem approach [1]
allocates in larger chunks, but never more, in total, than would be
allocated with order-0 folios. So it's a conservative approach. Using
mTHP knobs in tmpfs would cause:
* Over-allocation when using mTHP and/or THP under the 'always' flag.
* Allocation in bigger chunks in a non-optimal way when not all mTHP
  and THP orders are enabled.
* Operation similar to [1] when all mTHP and THP orders are enabled
  and the 'within_size' flag is used (assuming we use patch 11
  from [1]).
[1] Last 3 patches of this series:
https://lore.kernel.org/all/20240515055719.32577-1-da.gomez@samsung.com/
My understanding of why mTHP was preferred is that it raises awareness in
user space and lets tmpfs mounts used at boot time operate in 'safe' mode
(no large folios). Does it make more sense to have a single large-folios
enable flag to control order allocation, as in [1], instead of a knob for
every single order?
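
To make the comparison concrete, here is a rough sketch of the difference
between the two knob models when picking an order; the helper names and
the order bitmap are purely illustrative, not code from this series or
from [1]:

#include <linux/bitops.h>
#include <linux/types.h>

/*
 * Illustrative only: 'enabled' is a bitmap of folio orders allowed by
 * per-order knobs (bit 4 set = order-4 allowed, and so on).  With a single
 * large-folios flag, the same decision collapses to "preferred or nothing".
 */
static unsigned int clamp_to_enabled_order(unsigned long enabled,
                                           unsigned int preferred)
{
        while (preferred && !test_bit(preferred, &enabled))
                preferred--;            /* fall back to the next enabled order */
        return preferred;               /* order-0 is always permitted */
}

static unsigned int single_flag_order(bool large_folios_enabled,
                                      unsigned int preferred)
{
        return large_folios_enabled ? preferred : 0;
}

With per-order knobs, a preferred order computed from the write size can
land on an order nobody intended (the "bigger chunks in a non-optimal way"
case above); with a single flag it is either honoured or dropped to
order-0.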
>
> My opinion is that we need to let people configure orders (if you feel like
> it, configure all), but *select* the order to allocate based on readahead
> information -- in contrast to anonymous memory where we start at the highest
> order and don't have readahead information available.
>
> Maybe we need different "order allcoation" logic for read/write vs. fault,
> not sure.
I would suggest, as in [1], using the size of the write for the write
and fallocate paths. But when does it make sense to use readahead
information? Maybe when swap is involved?
>
> But I don't maintain that code, so I can only give stupid suggestions and
> repeat what I understood from the meeting with Hugh and Kirill :)
>
> --
> Cheers,
>
> David / dhildenb
>
^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH v5 0/6] add mTHP support for anonymous shmem
2024-07-07 16:39 ` Daniel Gomez
@ 2024-07-09 8:28 ` Ryan Roberts
2024-07-16 13:11 ` Daniel Gomez
0 siblings, 1 reply; 45+ messages in thread
From: Ryan Roberts @ 2024-07-09 8:28 UTC (permalink / raw)
To: Daniel Gomez, David Hildenbrand
Cc: Baolin Wang, Matthew Wilcox, akpm@linux-foundation.org,
hughd@google.com, wangkefeng.wang@huawei.com,
ying.huang@intel.com, 21cnbao@gmail.com, shy828301@gmail.com,
ziy@nvidia.com, ioworker0@gmail.com, Pankaj Raghav,
linux-mm@kvack.org, linux-kernel@vger.kernel.org
On 07/07/2024 17:39, Daniel Gomez wrote:
> On Fri, Jul 05, 2024 at 10:59:02AM GMT, David Hildenbrand wrote:
>> On 05.07.24 10:45, Ryan Roberts wrote:
>>> On 05/07/2024 06:47, Baolin Wang wrote:
>>>>
>>>>
>>>> On 2024/7/5 03:49, Matthew Wilcox wrote:
>>>>> On Thu, Jul 04, 2024 at 09:19:10PM +0200, David Hildenbrand wrote:
>>>>>> On 04.07.24 21:03, David Hildenbrand wrote:
>>>>>>>> shmem has two uses:
>>>>>>>>
>>>>>>>> - MAP_ANONYMOUS | MAP_SHARED (this patch set)
>>>>>>>> - tmpfs
>>>>>>>>
>>>>>>>> For the second use case we don't want controls *at all*, we want the
>>>>>>>> same heiristics used for all other filesystems to apply to tmpfs.
>>>>>>>
>>>>>>> As discussed in the MM meeting, Hugh had a different opinion on that.
>>>>>>
>>>>>> FWIW, I just recalled that I wrote a quick summary:
>>>>>>
>>>>>> https://lkml.kernel.org/r/f1783ff0-65bd-4b2b-8952-52b6822a0835@redhat.com
>>>>>>
>>>>>> I believe the meetings are recorded as well, but never looked at recordings.
>>>>>
>>>>> That's not what I understood Hugh to mean. To me, it seemed that Hugh
>>>>> was expressing an opinion on using shmem as shmem, not as using it as
>>>>> tmpfs.
>>>>>
>>>>> If I misunderstood Hugh, well, I still disagree. We should not have
>>>>> separate controls for this. tmpfs is just not that special.
>>>
>>> I wasn't at the meeting that's being referred to, but I thought we previously
>>> agreed that tmpfs *is* special because in some configurations its not backed by
>>> swap so is locked in ram?
>>
>> There are multiple things to that, like:
>>
>> * Machines only having limited/no swap configured
>> * tmpfs can be configured to never go to swap
>> * memfd/tmpfs files getting used purely for mmap(): there is no real
>> difference to MAP_ANON|MAP_SHARE besides the processes we share that
>> memory with.
>>
>> Especially when it comes to memory waste concerns and access behavior in
>> some cases, tmpfs behaved much more like anonymous memory. But there are for
>> sure other use cases where tmpfs is not that special.
>
> Having controls to select the allowable folio order allocations for
> tmpfs does not address any of these issues. The suggested filesystem
> approach [1] involves allocating orders in larger chunks, but always
> the same size you would allocate when using order-0 folios.
Well you can't know that you will never allocate more. If you allocate a 2M
block, you probably have some good readahead data that tells you you are likely
to keep reading sequentially, but you don't know for sure that the application
won't stop after just 4K.
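To illustrate the readahead-driven selection being discussed, here is only
a sketch of the idea, not shmem code (regular pagecache readahead does
something broadly similar when scaling folio orders):

#include <linux/align.h>
#include <linux/log2.h>
#include <linux/minmax.h>
#include <linux/types.h>

/*
 * Sketch: derive a folio order from the readahead window.  A large window
 * is evidence of sequential access; a tiny or absent window means a big
 * folio could be mostly wasted if the reader stops after 4K.
 */
static unsigned int order_from_readahead(pgoff_t start, unsigned long ra_pages,
                                         unsigned int max_order)
{
        unsigned int order;

        if (ra_pages < 2)
                return 0;               /* no evidence of sequential access */

        order = min_t(unsigned int, ilog2(ra_pages), max_order);

        /* Folios are naturally aligned in the file, so shrink the order
         * until the window start allows it. */
        while (order && !IS_ALIGNED(start, 1UL << order))
                order--;

        return order;
}

Even then the window is only a prediction; the reader may still stop early
and the tail of the folio sits unused until reclaim.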
> So,
> it's a conservative approach. Using mTHP knobs in tmpfs would cause:
> * Over allocation when using mTHP and/ord THP under the 'always' flag.
> * Allocate in bigger chunks in a non optimal way, when
> not all mTHP and THP orders are enabled.
> * Operate in a similar manner as in [1] when all mTHP and THP orders
> are enabled and 'within_size' flag is used (assuming we use patch 11
> from [1]).
Large folios may still be considered scarce resources even if the amount of
memory allocated is still the same. And if shmem isn't backed by swap then once
you have allocated a large folio for shmem, it is stuck in shmem, even if it
would be better used somewhere else.
And it's possible (likely even, in my opinion) that allocating lots of different
folio sizes will exacerbate memory fragmentation, leading to more order-0
fallbacks, which would hurt the overall system performance in the long run, vs
restricting to a couple of folio sizes.
I'm starting some work to actually measure how limiting the folio sizes
allocated for page cache memory can help reduce large folio allocation failure
overall. My hypothesis is that the data will show us that in an environment like
Android, where memory pressure is high, limiting everything to order-0 and
order-4 will significantly improve the allocation success rate of order-4. Let's
see.
>
> [1] Last 3 patches of these series:
> https://lore.kernel.org/all/20240515055719.32577-1-da.gomez@samsung.com/
>
> My understanding of why mTHP was preferred is to raise awareness in
> user space and allow tmpfs mounts used at boot time to operate in
> 'safe' mode (no large folios). Does it make more sense to have a large
> folios enable flag to control order allocation as in [1], instead of
> every single order possible?
My intuition is towards every order possible, as per above. Let's see what the
data tells us.
>
>>
>> My opinion is that we need to let people configure orders (if you feel like
>> it, configure all), but *select* the order to allocate based on readahead
>> information -- in contrast to anonymous memory where we start at the highest
>> order and don't have readahead information available.
>>
>> Maybe we need different "order allcoation" logic for read/write vs. fault,
>> not sure.
>
> I would suggest [1] the file size of the write for the write
> and fallocate paths. But when does make sense to use readahead
> information? Maybe when swap is involved?
>
>>
>> But I don't maintain that code, so I can only give stupid suggestions and
>> repeat what I understood from the meeting with Hugh and Kirill :)
>>
>> --
>> Cheers,
>>
>> David / dhildenb
^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH v5 0/6] add mTHP support for anonymous shmem
2024-07-09 8:28 ` Ryan Roberts
@ 2024-07-16 13:11 ` Daniel Gomez
2024-07-16 13:22 ` David Hildenbrand
0 siblings, 1 reply; 45+ messages in thread
From: Daniel Gomez @ 2024-07-16 13:11 UTC (permalink / raw)
To: Ryan Roberts
Cc: David Hildenbrand, Baolin Wang, Matthew Wilcox,
akpm@linux-foundation.org, hughd@google.com,
wangkefeng.wang@huawei.com, ying.huang@intel.com,
21cnbao@gmail.com, shy828301@gmail.com, ziy@nvidia.com,
ioworker0@gmail.com, Pankaj Raghav, linux-mm@kvack.org,
linux-kernel@vger.kernel.org
On Tue, Jul 09, 2024 at 09:28:48AM GMT, Ryan Roberts wrote:
> On 07/07/2024 17:39, Daniel Gomez wrote:
> > On Fri, Jul 05, 2024 at 10:59:02AM GMT, David Hildenbrand wrote:
> >> On 05.07.24 10:45, Ryan Roberts wrote:
> >>> On 05/07/2024 06:47, Baolin Wang wrote:
> >>>>
> >>>>
> >>>> On 2024/7/5 03:49, Matthew Wilcox wrote:
> >>>>> On Thu, Jul 04, 2024 at 09:19:10PM +0200, David Hildenbrand wrote:
> >>>>>> On 04.07.24 21:03, David Hildenbrand wrote:
> >>>>>>>> shmem has two uses:
> >>>>>>>>
> >>>>>>>> - MAP_ANONYMOUS | MAP_SHARED (this patch set)
> >>>>>>>> - tmpfs
> >>>>>>>>
> >>>>>>>> For the second use case we don't want controls *at all*, we want the
> >>>>>>>> same heiristics used for all other filesystems to apply to tmpfs.
> >>>>>>>
> >>>>>>> As discussed in the MM meeting, Hugh had a different opinion on that.
> >>>>>>
> >>>>>> FWIW, I just recalled that I wrote a quick summary:
> >>>>>>
> >>>>>> https://lkml.kernel.org/r/f1783ff0-65bd-4b2b-8952-52b6822a0835@redhat.com
> >>>>>>
> >>>>>> I believe the meetings are recorded as well, but never looked at recordings.
> >>>>>
> >>>>> That's not what I understood Hugh to mean. To me, it seemed that Hugh
> >>>>> was expressing an opinion on using shmem as shmem, not as using it as
> >>>>> tmpfs.
> >>>>>
> >>>>> If I misunderstood Hugh, well, I still disagree. We should not have
> >>>>> separate controls for this. tmpfs is just not that special.
> >>>
> >>> I wasn't at the meeting that's being referred to, but I thought we previously
> >>> agreed that tmpfs *is* special because in some configurations its not backed by
> >>> swap so is locked in ram?
> >>
> >> There are multiple things to that, like:
> >>
> >> * Machines only having limited/no swap configured
> >> * tmpfs can be configured to never go to swap
> >> * memfd/tmpfs files getting used purely for mmap(): there is no real
> >> difference to MAP_ANON|MAP_SHARE besides the processes we share that
> >> memory with.
> >>
> >> Especially when it comes to memory waste concerns and access behavior in
> >> some cases, tmpfs behaved much more like anonymous memory. But there are for
> >> sure other use cases where tmpfs is not that special.
> >
> > Having controls to select the allowable folio order allocations for
> > tmpfs does not address any of these issues. The suggested filesystem
> > approach [1] involves allocating orders in larger chunks, but always
> > the same size you would allocate when using order-0 folios.
>
> Well you can't know that you will never allocate more. If you allocate a 2M
In the fs large folio approach [1], a 2M (or any other non-order-0)
allocation occurs when the size of the write/fallocate is 2M (and the
index is aligned).
> block, you probably have some good readahead data that tells you you are likely
> to keep reading sequentially, but you don't know for sure that the application
> won't stop after just 4K.
Is shmem_file_read_iter() getting readahead data to perform the read? Or
what do you mean exactly?
In [1], reads are performed in 4K chunks, so I think this does not apply.
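For completeness, a minimal sketch of the write/fallocate-size rule I am
describing (illustrative names; not the actual code in [1]):

#include <linux/align.h>
#include <linux/log2.h>
#include <linux/minmax.h>
#include <linux/types.h>

/*
 * Sketch: pick a folio order from the write length and the file index.
 * A non-order-0 folio is used only when the write itself covers it and
 * the index is naturally aligned, so no more memory is committed than an
 * order-0 allocation for the same write would commit.
 */
static unsigned int order_from_write(pgoff_t index, size_t len,
                                     unsigned int max_order)
{
        unsigned int order;

        if (len < (PAGE_SIZE << 1))
                return 0;               /* a single page is enough */

        order = min_t(unsigned int, ilog2(len >> PAGE_SHIFT), max_order);

        while (order && !IS_ALIGNED(index, 1UL << order))
                order--;                /* respect natural alignment */

        return order;
}

A 2M write at an aligned index with max_order set to the PMD order returns
the PMD order; a 4K write always returns 0.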
>
> > So,
> > it's a conservative approach. Using mTHP knobs in tmpfs would cause:
> > * Over allocation when using mTHP and/ord THP under the 'always' flag.
> > * Allocate in bigger chunks in a non optimal way, when
> > not all mTHP and THP orders are enabled.
> > * Operate in a similar manner as in [1] when all mTHP and THP orders
> > are enabled and 'within_size' flag is used (assuming we use patch 11
> > from [1]).
>
> Large folios may still be considered scarce resources even if the amount of
> memory allocated is still the same. And if shmem isn't backed by swap then once
> you have allocated a large folio for shmem, it is stuck in shmem, even if it
> would be better used somewhere else.
Is that true for tmpfs as well? We have shmem_unused_huge_shrink(), which
reclaims unused large folios (on ENOSPC and via free_cached_objects()).
Can't we reuse that when the system is under memory pressure?
>
> And it's possible (likely even, in my opinion) that allocating lots of different
> folio sizes will exacerbate memory fragmentation, leading to more order-0
> fallbacks, which would hurt the overall system performance in the long run, vs
> restricting to a couple of folio sizes.
Since we are transitioning to large folios in other filesystems, the impact
of restricting the order here will depend only on the extent of tmpfs usage
relative to the rest of the system. Luis discussed the topic of mm
fragmentation and its measurement in a session at LSFMM this year [2].
[2] https://lore.kernel.org/all/ZkUOXQvVjXP1T6Nk@bombadil.infradead.org/
>
> I'm starting some work to actually measure how limiting the folio sizes
> allocated for page cache memory can help reduce large folio allocation failure
It would be great to hear more about that effort.
> overall. My hypothesis is that the data will show us that in an environment like
> Android, where memory pressure is high, limiting everything to order-0 and
> order-4 will significantly improve the allocation success rate of order-4. Let's
> see.
>
> >
> > [1] Last 3 patches of these series:
> > https://lore.kernel.org/all/20240515055719.32577-1-da.gomez@samsung.com/
> >
> > My understanding of why mTHP was preferred is to raise awareness in
> > user space and allow tmpfs mounts used at boot time to operate in
> > 'safe' mode (no large folios). Does it make more sense to have a large
> > folios enable flag to control order allocation as in [1], instead of
> > every single order possible?
>
> My intuition is towards every order possible, as per above. Let's see what the
> data tells us.
>
> >
> >>
> >> My opinion is that we need to let people configure orders (if you feel like
> >> it, configure all), but *select* the order to allocate based on readahead
> >> information -- in contrast to anonymous memory where we start at the highest
> >> order and don't have readahead information available.
> >>
> >> Maybe we need different "order allcoation" logic for read/write vs. fault,
> >> not sure.
> >
> > I would suggest [1] the file size of the write for the write
> > and fallocate paths. But when does make sense to use readahead
> > information? Maybe when swap is involved?
> >
> >>
> >> But I don't maintain that code, so I can only give stupid suggestions and
> >> repeat what I understood from the meeting with Hugh and Kirill :)
> >>
> >> --
> >> Cheers,
> >>
> >> David / dhildenb
>
^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH v5 0/6] add mTHP support for anonymous shmem
2024-07-16 13:11 ` Daniel Gomez
@ 2024-07-16 13:22 ` David Hildenbrand
0 siblings, 0 replies; 45+ messages in thread
From: David Hildenbrand @ 2024-07-16 13:22 UTC (permalink / raw)
To: Daniel Gomez, Ryan Roberts
Cc: Baolin Wang, Matthew Wilcox, akpm@linux-foundation.org,
hughd@google.com, wangkefeng.wang@huawei.com,
ying.huang@intel.com, 21cnbao@gmail.com, shy828301@gmail.com,
ziy@nvidia.com, ioworker0@gmail.com, Pankaj Raghav,
linux-mm@kvack.org, linux-kernel@vger.kernel.org, Hugh Dickins
On 16.07.24 15:11, Daniel Gomez wrote:
> On Tue, Jul 09, 2024 at 09:28:48AM GMT, Ryan Roberts wrote:
>> On 07/07/2024 17:39, Daniel Gomez wrote:
>>> On Fri, Jul 05, 2024 at 10:59:02AM GMT, David Hildenbrand wrote:
>>>> On 05.07.24 10:45, Ryan Roberts wrote:
>>>>> On 05/07/2024 06:47, Baolin Wang wrote:
>>>>>>
>>>>>>
>>>>>> On 2024/7/5 03:49, Matthew Wilcox wrote:
>>>>>>> On Thu, Jul 04, 2024 at 09:19:10PM +0200, David Hildenbrand wrote:
>>>>>>>> On 04.07.24 21:03, David Hildenbrand wrote:
>>>>>>>>>> shmem has two uses:
>>>>>>>>>>
>>>>>>>>>> - MAP_ANONYMOUS | MAP_SHARED (this patch set)
>>>>>>>>>> - tmpfs
>>>>>>>>>>
>>>>>>>>>> For the second use case we don't want controls *at all*, we want the
>>>>>>>>>> same heiristics used for all other filesystems to apply to tmpfs.
>>>>>>>>>
>>>>>>>>> As discussed in the MM meeting, Hugh had a different opinion on that.
>>>>>>>>
>>>>>>>> FWIW, I just recalled that I wrote a quick summary:
>>>>>>>>
>>>>>>>> https://lkml.kernel.org/r/f1783ff0-65bd-4b2b-8952-52b6822a0835@redhat.com
>>>>>>>>
>>>>>>>> I believe the meetings are recorded as well, but never looked at recordings.
>>>>>>>
>>>>>>> That's not what I understood Hugh to mean. To me, it seemed that Hugh
>>>>>>> was expressing an opinion on using shmem as shmem, not as using it as
>>>>>>> tmpfs.
>>>>>>>
>>>>>>> If I misunderstood Hugh, well, I still disagree. We should not have
>>>>>>> separate controls for this. tmpfs is just not that special.
>>>>>
>>>>> I wasn't at the meeting that's being referred to, but I thought we previously
>>>>> agreed that tmpfs *is* special because in some configurations its not backed by
>>>>> swap so is locked in ram?
>>>>
>>>> There are multiple things to that, like:
>>>>
>>>> * Machines only having limited/no swap configured
>>>> * tmpfs can be configured to never go to swap
>>>> * memfd/tmpfs files getting used purely for mmap(): there is no real
>>>> difference to MAP_ANON|MAP_SHARE besides the processes we share that
>>>> memory with.
>>>>
>>>> Especially when it comes to memory waste concerns and access behavior in
>>>> some cases, tmpfs behaved much more like anonymous memory. But there are for
>>>> sure other use cases where tmpfs is not that special.
>>>
>>> Having controls to select the allowable folio order allocations for
>>> tmpfs does not address any of these issues. The suggested filesystem
>>> approach [1] involves allocating orders in larger chunks, but always
>>> the same size you would allocate when using order-0 folios.
>>
>> Well you can't know that you will never allocate more. If you allocate a 2M
>
> In the fs large folio approach implementation [1], the allocation of a 2M (or
> any non order-0) occurs when the size of the write/fallocate is 2M (and index
> is aligned).
I don't have time right now to follow the discussion in detail here (I
thought we had a meeting to discuss that and received guidance from
Hugh?), but I'll point out two things:
(1) We need a reasonable model for handling/allocating of large folios
during page faults. shmem/tmpfs can be used just like anon-shmem if
you simply only mmap that thing (hello VMs!).
(2) Hugh gave (IMHO) clear feedback during the meeting how he thinks we
should approach large folios in shmem.
Maybe I got (2) all wrong and people can point out all the issues in my
summary from the meeting.
Otherwise, if people don't want to accept the result from that meeting,
we need further guidance from Hugh.
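On (1), a rough sketch of what such a fault-path model could look like if
we reuse the anon policy of trying the largest enabled order first. The
order-iteration helpers exist in huge_mm.h, but the policy itself is only
a suggestion for discussion, not what this series does:

/*
 * Sketch only: at fault time there is no readahead information, so one
 * plausible model is the anonymous-memory one -- try the largest enabled
 * order whose naturally aligned range fits the VMA, and fall back towards
 * order-0 on allocation failure.
 */
static struct folio *fault_alloc_sketch(struct vm_fault *vmf,
                                        unsigned long orders)
{
        struct vm_area_struct *vma = vmf->vma;
        int order;

        for (order = highest_order(orders); orders;
             order = next_order(&orders, order)) {
                unsigned long addr = ALIGN_DOWN(vmf->address,
                                                PAGE_SIZE << order);
                struct folio *folio;

                /* Skip orders whose aligned range spills outside the VMA. */
                if (addr < vma->vm_start ||
                    addr + (PAGE_SIZE << order) > vma->vm_end)
                        continue;

                folio = vma_alloc_folio(GFP_HIGHUSER_MOVABLE, order, vma,
                                        addr, true);
                if (folio)
                        return folio;
        }

        /* Caller falls back to a plain order-0 allocation. */
        return NULL;
}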
--
Cheers,
David / dhildenb
^ permalink raw reply [flat|nested] 45+ messages in thread