linux-mm.kvack.org archive mirror
* [PATCH v2 0/2] mm: collect the number of anon mTHP
@ 2024-08-11 22:49 Barry Song
  2024-08-11 22:49 ` [PATCH v2 1/2] mm: collect the number of anon large folios Barry Song
                   ` (2 more replies)
  0 siblings, 3 replies; 19+ messages in thread
From: Barry Song @ 2024-08-11 22:49 UTC (permalink / raw)
  To: akpm, linux-mm
  Cc: baolin.wang, chrisl, david, hanchuanhua, ioworker0, kaleshsingh,
	kasong, linux-kernel, ryan.roberts, v-songbaohua, ziy, yuanshuai

From: Barry Song <v-songbaohua@oppo.com>

Knowing the number of anon mTHPs in the system is crucial for performance
analysis. It helps in understanding the ratio and distribution of
mTHPs versus small folios throughout the system.

Additionally, partial unmapping by userspace can lead to significant waste
of mTHPs over time and increase memory reclamation pressure. We need this
information for comprehensive system tuning.

-v2:
 * don't rely on rmap to implement the "-1" side; decrement on folio
   free, split, etc. instead. Thanks to David for the comment;
 * rename the sysfs counters and refine the doc. Thanks to Ryan for the
   comment;

-v1:
 https://lore.kernel.org/all/20240808010457.228753-1-21cnbao@gmail.com/

Barry Song (2):
  mm: collect the number of anon large folios
  mm: collect the number of anon large folios on split_deferred list

 Documentation/admin-guide/mm/transhuge.rst | 10 ++++++++++
 include/linux/huge_mm.h                    | 16 ++++++++++++++--
 mm/huge_memory.c                           | 19 ++++++++++++++++---
 mm/migrate.c                               |  4 ++++
 mm/page_alloc.c                            |  5 ++++-
 mm/rmap.c                                  |  1 +
 6 files changed, 49 insertions(+), 6 deletions(-)

-- 
2.34.1



^ permalink raw reply	[flat|nested] 19+ messages in thread

* [PATCH v2 1/2] mm: collect the number of anon large folios
  2024-08-11 22:49 [PATCH v2 0/2] mm: collect the number of anon mTHP Barry Song
@ 2024-08-11 22:49 ` Barry Song
  2024-08-21 21:34   ` David Hildenbrand
  2024-08-11 22:49 ` [PATCH v2 2/2] mm: collect the number of anon large folios on split_deferred list Barry Song
  2024-08-18  7:58 ` [PATCH v2 0/2] mm: collect the number of anon mTHP Barry Song
  2 siblings, 1 reply; 19+ messages in thread
From: Barry Song @ 2024-08-11 22:49 UTC (permalink / raw)
  To: akpm, linux-mm
  Cc: baolin.wang, chrisl, david, hanchuanhua, ioworker0, kaleshsingh,
	kasong, linux-kernel, ryan.roberts, v-songbaohua, ziy, yuanshuai

From: Barry Song <v-songbaohua@oppo.com>

Anon large folios come from three places:
1. new large folios allocated in the page fault path, which call
   folio_add_new_anon_rmap() for rmap;
2. a large folio split into multiple lower-order large folios;
3. a large folio migrated to a new large folio.

In all three cases above, we increase nr_anon by 1.

An anon large folio can go away either because it is split or because it
is freed; in both cases, we decrease the count by 1.

Folios that have been added to the swap cache but have not yet received
an anon mapping won't be counted. This is consistent with the AnonPages
statistics in /proc/meminfo.
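The "+1" sources and "-1" sinks above can be sketched as a tiny
userspace model. This is a hypothetical illustration, not kernel code:
only mod_mthp_stat() mirrors the kernel helper's order check; fault_in(),
split(), migrate() and free_folio() are made-up stand-ins for the real
call sites.

```c
/* Hypothetical userspace model of the per-order nr_anon accounting. */
#include <assert.h>

#define PMD_ORDER 9

static long nr_anon[PMD_ORDER + 1];

/* mirrors the kernel helper: order 0 and orders > PMD_ORDER are ignored */
static void mod_mthp_stat(int order, long delta)
{
	if (order <= 0 || order > PMD_ORDER)
		return;
	nr_anon[order] += delta;
}

/* 1. a new anon large folio is faulted in (folio_add_new_anon_rmap) */
static void fault_in(int order)
{
	mod_mthp_stat(order, 1);
}

/* 2. split: the old folio goes away, 2^(order - new_order) folios appear */
static void split(int order, int new_order)
{
	mod_mthp_stat(order, -1);
	mod_mthp_stat(new_order, 1 << (order - new_order));
}

/* 3. migration: the target folio is counted; the source is freed later */
static void migrate(int order)
{
	mod_mthp_stat(order, 1);
}

/* freeing an anon large folio drops the count */
static void free_folio(int order)
{
	mod_mthp_stat(order, -1);
}
```

Running the model through a fault, a split to a lower order, and a
migration shows the counter staying balanced at each step.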

Signed-off-by: Barry Song <v-songbaohua@oppo.com>
---
 Documentation/admin-guide/mm/transhuge.rst |  5 +++++
 include/linux/huge_mm.h                    | 15 +++++++++++++--
 mm/huge_memory.c                           | 13 ++++++++++---
 mm/migrate.c                               |  4 ++++
 mm/page_alloc.c                            |  5 ++++-
 mm/rmap.c                                  |  1 +
 6 files changed, 37 insertions(+), 6 deletions(-)

diff --git a/Documentation/admin-guide/mm/transhuge.rst b/Documentation/admin-guide/mm/transhuge.rst
index 058485daf186..9fdfb46e4560 100644
--- a/Documentation/admin-guide/mm/transhuge.rst
+++ b/Documentation/admin-guide/mm/transhuge.rst
@@ -527,6 +527,11 @@ split_deferred
         it would free up some memory. Pages on split queue are going to
         be split under memory pressure, if splitting is possible.
 
+nr_anon
+       the number of anon huge pages we have in the whole system.
+       These huge pages could be entirely mapped or have partially
+       unmapped/unused subpages.
+
 As the system ages, allocating huge pages may be expensive as the
 system uses memory compaction to copy data around memory to free a
 huge page for use. There are some counters in ``/proc/vmstat`` to help
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 4c32058cacfe..2ee2971e4e10 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -126,6 +126,7 @@ enum mthp_stat_item {
 	MTHP_STAT_SPLIT,
 	MTHP_STAT_SPLIT_FAILED,
 	MTHP_STAT_SPLIT_DEFERRED,
+	MTHP_STAT_NR_ANON,
 	__MTHP_STAT_COUNT
 };
 
@@ -136,14 +137,24 @@ struct mthp_stat {
 
 DECLARE_PER_CPU(struct mthp_stat, mthp_stats);
 
-static inline void count_mthp_stat(int order, enum mthp_stat_item item)
+static inline void mod_mthp_stat(int order, enum mthp_stat_item item, int delta)
 {
 	if (order <= 0 || order > PMD_ORDER)
 		return;
 
-	this_cpu_inc(mthp_stats.stats[order][item]);
+	this_cpu_add(mthp_stats.stats[order][item], delta);
+}
+
+static inline void count_mthp_stat(int order, enum mthp_stat_item item)
+{
+	mod_mthp_stat(order, item, 1);
 }
+
 #else
+static inline void mod_mthp_stat(int order, enum mthp_stat_item item, int delta)
+{
+}
+
 static inline void count_mthp_stat(int order, enum mthp_stat_item item)
 {
 }
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index d90d6e94a800..afb911789df8 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -596,6 +596,7 @@ DEFINE_MTHP_STAT_ATTR(shmem_fallback_charge, MTHP_STAT_SHMEM_FALLBACK_CHARGE);
 DEFINE_MTHP_STAT_ATTR(split, MTHP_STAT_SPLIT);
 DEFINE_MTHP_STAT_ATTR(split_failed, MTHP_STAT_SPLIT_FAILED);
 DEFINE_MTHP_STAT_ATTR(split_deferred, MTHP_STAT_SPLIT_DEFERRED);
+DEFINE_MTHP_STAT_ATTR(nr_anon, MTHP_STAT_NR_ANON);
 
 static struct attribute *anon_stats_attrs[] = {
 	&anon_fault_alloc_attr.attr,
@@ -608,6 +609,7 @@ static struct attribute *anon_stats_attrs[] = {
 	&split_attr.attr,
 	&split_failed_attr.attr,
 	&split_deferred_attr.attr,
+	&nr_anon_attr.attr,
 	NULL,
 };
 
@@ -3216,8 +3218,9 @@ int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
 	struct deferred_split *ds_queue = get_deferred_split_queue(folio);
 	/* reset xarray order to new order after split */
 	XA_STATE_ORDER(xas, &folio->mapping->i_pages, folio->index, new_order);
-	struct anon_vma *anon_vma = NULL;
+	bool is_anon = folio_test_anon(folio);
 	struct address_space *mapping = NULL;
+	struct anon_vma *anon_vma = NULL;
 	int order = folio_order(folio);
 	int extra_pins, ret;
 	pgoff_t end;
@@ -3229,7 +3232,7 @@ int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
 	if (new_order >= folio_order(folio))
 		return -EINVAL;
 
-	if (folio_test_anon(folio)) {
+	if (is_anon) {
 		/* order-1 is not supported for anonymous THP. */
 		if (new_order == 1) {
 			VM_WARN_ONCE(1, "Cannot split to order-1 folio");
@@ -3269,7 +3272,7 @@ int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
 	if (folio_test_writeback(folio))
 		return -EBUSY;
 
-	if (folio_test_anon(folio)) {
+	if (is_anon) {
 		/*
 		 * The caller does not necessarily hold an mmap_lock that would
 		 * prevent the anon_vma disappearing so we first we take a
@@ -3382,6 +3385,10 @@ int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
 			}
 		}
 
+		if (is_anon) {
+			mod_mthp_stat(order, MTHP_STAT_NR_ANON, -1);
+			mod_mthp_stat(new_order, MTHP_STAT_NR_ANON, 1 << (order - new_order));
+		}
 		__split_huge_page(page, list, end, new_order);
 		ret = 0;
 	} else {
diff --git a/mm/migrate.c b/mm/migrate.c
index 7e1267042a56..bde573ec2af8 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -423,6 +423,8 @@ static int __folio_migrate_mapping(struct address_space *mapping,
 		/* No turning back from here */
 		newfolio->index = folio->index;
 		newfolio->mapping = folio->mapping;
+		if (folio_test_anon(folio) && folio_test_large(folio))
+			mod_mthp_stat(folio_order(folio), MTHP_STAT_NR_ANON, 1);
 		if (folio_test_swapbacked(folio))
 			__folio_set_swapbacked(newfolio);
 
@@ -447,6 +449,8 @@ static int __folio_migrate_mapping(struct address_space *mapping,
 	 */
 	newfolio->index = folio->index;
 	newfolio->mapping = folio->mapping;
+	if (folio_test_anon(folio) && folio_test_large(folio))
+		mod_mthp_stat(folio_order(folio), MTHP_STAT_NR_ANON, 1);
 	folio_ref_add(newfolio, nr); /* add cache reference */
 	if (folio_test_swapbacked(folio)) {
 		__folio_set_swapbacked(newfolio);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 84a7154fde93..382c364d3efa 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1084,8 +1084,11 @@ __always_inline bool free_pages_prepare(struct page *page,
 			(page + i)->flags &= ~PAGE_FLAGS_CHECK_AT_PREP;
 		}
 	}
-	if (PageMappingFlags(page))
+	if (PageMappingFlags(page)) {
+		if (PageAnon(page) && compound)
+			mod_mthp_stat(order, MTHP_STAT_NR_ANON, -1);
 		page->mapping = NULL;
+	}
 	if (is_check_pages_enabled()) {
 		if (free_page_is_bad(page))
 			bad++;
diff --git a/mm/rmap.c b/mm/rmap.c
index a6b9cd0b2b18..b7609920704c 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1467,6 +1467,7 @@ void folio_add_new_anon_rmap(struct folio *folio, struct vm_area_struct *vma,
 	}
 
 	__folio_mod_stat(folio, nr, nr_pmdmapped);
+	mod_mthp_stat(folio_order(folio), MTHP_STAT_NR_ANON, 1);
 }
 
 static __always_inline void __folio_add_file_rmap(struct folio *folio,
-- 
2.34.1




* [PATCH v2 2/2] mm: collect the number of anon large folios on split_deferred list
  2024-08-11 22:49 [PATCH v2 0/2] mm: collect the number of anon mTHP Barry Song
  2024-08-11 22:49 ` [PATCH v2 1/2] mm: collect the number of anon large folios Barry Song
@ 2024-08-11 22:49 ` Barry Song
  2024-08-21 21:39   ` David Hildenbrand
  2024-08-18  7:58 ` [PATCH v2 0/2] mm: collect the number of anon mTHP Barry Song
  2 siblings, 1 reply; 19+ messages in thread
From: Barry Song @ 2024-08-11 22:49 UTC (permalink / raw)
  To: akpm, linux-mm
  Cc: baolin.wang, chrisl, david, hanchuanhua, ioworker0, kaleshsingh,
	kasong, linux-kernel, ryan.roberts, v-songbaohua, ziy, yuanshuai

From: Barry Song <v-songbaohua@oppo.com>

When an mTHP is added to the deferred_list, its partial pages
are unused, leading to wasted memory and potentially increasing
memory reclamation pressure.

Detailing the specifics of how unmapping occurs is quite difficult
and not that useful, so we adopt a simple approach: each time an
mTHP enters the deferred_list, we increment the count by 1; whenever
it leaves for any reason, we decrement the count by 1.
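The enter/leave rule above can be sketched as a userspace model. This is
a hypothetical illustration of the counting rule only: deferred_split()
and list_leave() are made-up names standing in for deferred_split_folio()
and the various kernel paths that remove a folio from the split queue.

```c
/* Hypothetical model of nr_split_deferred: the counter simply tracks
 * deferred-list membership, regardless of why a folio leaves the list. */
#include <assert.h>
#include <stdbool.h>

#define PMD_ORDER 9

static long nr_split_deferred[PMD_ORDER + 1];

struct folio {
	int order;
	bool on_deferred_list;
};

static void mod_stat(int order, long delta)
{
	if (order <= 0 || order > PMD_ORDER)
		return;
	nr_split_deferred[order] += delta;
}

/* folio becomes partially unmapped and is queued for deferred split */
static void deferred_split(struct folio *f)
{
	if (!f->on_deferred_list) {
		f->on_deferred_list = true;
		mod_stat(f->order, 1);
	}
}

/* folio leaves the queue: split, freed, or picked up by the shrinker */
static void list_leave(struct folio *f)
{
	if (f->on_deferred_list) {
		f->on_deferred_list = false;
		mod_stat(f->order, -1);
	}
}
```

Queuing the same folio twice does not double-count, and any exit path
decrements exactly once, so the counter never drifts.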

Signed-off-by: Barry Song <v-songbaohua@oppo.com>
---
 Documentation/admin-guide/mm/transhuge.rst | 5 +++++
 include/linux/huge_mm.h                    | 1 +
 mm/huge_memory.c                           | 6 ++++++
 3 files changed, 12 insertions(+)

diff --git a/Documentation/admin-guide/mm/transhuge.rst b/Documentation/admin-guide/mm/transhuge.rst
index 9fdfb46e4560..7072469de8a8 100644
--- a/Documentation/admin-guide/mm/transhuge.rst
+++ b/Documentation/admin-guide/mm/transhuge.rst
@@ -532,6 +532,11 @@ nr_anon
        These huge pages could be entirely mapped or have partially
        unmapped/unused subpages.
 
+nr_split_deferred
+       the number of anon huge pages which have been partially unmapped
+       and put onto split queue. Those unmapped subpages are also unused
+       and temporarily wasting memory.
+
 As the system ages, allocating huge pages may be expensive as the
 system uses memory compaction to copy data around memory to free a
 huge page for use. There are some counters in ``/proc/vmstat`` to help
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 2ee2971e4e10..1e2d5dbe82c5 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -127,6 +127,7 @@ enum mthp_stat_item {
 	MTHP_STAT_SPLIT_FAILED,
 	MTHP_STAT_SPLIT_DEFERRED,
 	MTHP_STAT_NR_ANON,
+	MTHP_STAT_NR_SPLIT_DEFERRED,
 	__MTHP_STAT_COUNT
 };
 
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index afb911789df8..1a12c011e2df 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -597,6 +597,7 @@ DEFINE_MTHP_STAT_ATTR(split, MTHP_STAT_SPLIT);
 DEFINE_MTHP_STAT_ATTR(split_failed, MTHP_STAT_SPLIT_FAILED);
 DEFINE_MTHP_STAT_ATTR(split_deferred, MTHP_STAT_SPLIT_DEFERRED);
 DEFINE_MTHP_STAT_ATTR(nr_anon, MTHP_STAT_NR_ANON);
+DEFINE_MTHP_STAT_ATTR(nr_split_deferred, MTHP_STAT_NR_SPLIT_DEFERRED);
 
 static struct attribute *anon_stats_attrs[] = {
 	&anon_fault_alloc_attr.attr,
@@ -610,6 +611,7 @@ static struct attribute *anon_stats_attrs[] = {
 	&split_failed_attr.attr,
 	&split_deferred_attr.attr,
 	&nr_anon_attr.attr,
+	&nr_split_deferred_attr.attr,
 	NULL,
 };
 
@@ -3359,6 +3361,7 @@ int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
 		if (folio_order(folio) > 1 &&
 		    !list_empty(&folio->_deferred_list)) {
 			ds_queue->split_queue_len--;
+			mod_mthp_stat(folio_order(folio), MTHP_STAT_NR_SPLIT_DEFERRED, -1);
 			/*
 			 * Reinitialize page_deferred_list after removing the
 			 * page from the split_queue, otherwise a subsequent
@@ -3425,6 +3428,7 @@ void __folio_undo_large_rmappable(struct folio *folio)
 	spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
 	if (!list_empty(&folio->_deferred_list)) {
 		ds_queue->split_queue_len--;
+		mod_mthp_stat(folio_order(folio), MTHP_STAT_NR_SPLIT_DEFERRED, -1);
 		list_del_init(&folio->_deferred_list);
 	}
 	spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags);
@@ -3466,6 +3470,7 @@ void deferred_split_folio(struct folio *folio)
 		if (folio_test_pmd_mappable(folio))
 			count_vm_event(THP_DEFERRED_SPLIT_PAGE);
 		count_mthp_stat(folio_order(folio), MTHP_STAT_SPLIT_DEFERRED);
+		mod_mthp_stat(folio_order(folio), MTHP_STAT_NR_SPLIT_DEFERRED, 1);
 		list_add_tail(&folio->_deferred_list, &ds_queue->split_queue);
 		ds_queue->split_queue_len++;
 #ifdef CONFIG_MEMCG
@@ -3513,6 +3518,7 @@ static unsigned long deferred_split_scan(struct shrinker *shrink,
 			list_move(&folio->_deferred_list, &list);
 		} else {
 			/* We lost race with folio_put() */
+			mod_mthp_stat(folio_order(folio), MTHP_STAT_NR_SPLIT_DEFERRED, -1);
 			list_del_init(&folio->_deferred_list);
 			ds_queue->split_queue_len--;
 		}
-- 
2.34.1




* Re: [PATCH v2 0/2] mm: collect the number of anon mTHP
  2024-08-11 22:49 [PATCH v2 0/2] mm: collect the number of anon mTHP Barry Song
  2024-08-11 22:49 ` [PATCH v2 1/2] mm: collect the number of anon large folios Barry Song
  2024-08-11 22:49 ` [PATCH v2 2/2] mm: collect the number of anon large folios on split_deferred list Barry Song
@ 2024-08-18  7:58 ` Barry Song
  2024-08-19  2:44   ` Usama Arif
  2024-08-19  8:28   ` David Hildenbrand
  2 siblings, 2 replies; 19+ messages in thread
From: Barry Song @ 2024-08-18  7:58 UTC (permalink / raw)
  To: akpm, linux-mm, Usama Arif
  Cc: baolin.wang, chrisl, david, hanchuanhua, ioworker0, kaleshsingh,
	kasong, linux-kernel, ryan.roberts, v-songbaohua, ziy, yuanshuai

Hi Andrew, David, Usama,

I'm attempting to rebase this series on top of Usama's
[PATCH v3 0/6] mm: split underutilized THPs[1]

However, I feel it is impossible and we might be tackling things
in the wrong order.

This series should probably come before Usama's one, as the new
partially_mapped should be ok to handle both nr_split_deferred
and split_deferred. That said, the new partially_mapped still has
several unresolved issues blocking my rebase.

[1] https://lore.kernel.org/linux-mm/20240813120328.1275952-1-usamaarif642@gmail.com

What do you think about this?

Thanks
Barry



* Re: [PATCH v2 0/2] mm: collect the number of anon mTHP
  2024-08-18  7:58 ` [PATCH v2 0/2] mm: collect the number of anon mTHP Barry Song
@ 2024-08-19  2:44   ` Usama Arif
  2024-08-19  8:28   ` David Hildenbrand
  1 sibling, 0 replies; 19+ messages in thread
From: Usama Arif @ 2024-08-19  2:44 UTC (permalink / raw)
  To: Barry Song, akpm, linux-mm
  Cc: baolin.wang, chrisl, david, hanchuanhua, ioworker0, kaleshsingh,
	kasong, linux-kernel, ryan.roberts, v-songbaohua, ziy, yuanshuai



On 18/08/2024 08:58, Barry Song wrote:
> Hi Andrew, David, Usama,
> 
> I'm attempting to rebase this series on top of Usama's
> [PATCH v3 0/6] mm: split underutilized THPs[1]
> 
> However, I feel it is impossible and we might be tackling things
> in the wrong order.
> 
> This series should probably come before Usama's one, as the new
> partially_mapped should be ok to handle both nr_split_deferred
> and split_deferred. That said, the new partially_mapped still has
> several unresolved issues blocking my rebase.

Hi,

I have sent a v4 [1] that resolves all outstanding issues/comments raised in v3. It has been running a production workload for 12 hours, showing good performance and 0 crashes. Hopefully it helps to resolve the existing merge conflicts between my series, yours, and Yu Zhao's.

[1] https://lore.kernel.org/all/20240819023145.2415299-1-usamaarif642@gmail.com/

Thanks

> 
> [1] https://lore.kernel.org/linux-mm/20240813120328.1275952-1-usamaarif642@gmail.com
> 
> What do you think about this?
> 
> Thanks
> Barry



* Re: [PATCH v2 0/2] mm: collect the number of anon mTHP
  2024-08-18  7:58 ` [PATCH v2 0/2] mm: collect the number of anon mTHP Barry Song
  2024-08-19  2:44   ` Usama Arif
@ 2024-08-19  8:28   ` David Hildenbrand
  2024-08-19  8:33     ` Barry Song
  1 sibling, 1 reply; 19+ messages in thread
From: David Hildenbrand @ 2024-08-19  8:28 UTC (permalink / raw)
  To: Barry Song, akpm, linux-mm, Usama Arif
  Cc: baolin.wang, chrisl, hanchuanhua, ioworker0, kaleshsingh, kasong,
	linux-kernel, ryan.roberts, v-songbaohua, ziy, yuanshuai

On 18.08.24 09:58, Barry Song wrote:
> Hi Andrew, David, Usama,
> 
> I'm attempting to rebase this series on top of Usama's
> [PATCH v3 0/6] mm: split underutilized THPs[1]
> 
> However, I feel it is impossible and we might be tackling things
> in the wrong order.

Is just the ordering suboptimal (which can/will get resolved one way or 
the other), or is there something fundamental that will make this series 
here "impossible"?

-- 
Cheers,

David / dhildenb




* Re: [PATCH v2 0/2] mm: collect the number of anon mTHP
  2024-08-19  8:28   ` David Hildenbrand
@ 2024-08-19  8:33     ` Barry Song
  2024-08-19  8:52       ` Barry Song
  0 siblings, 1 reply; 19+ messages in thread
From: Barry Song @ 2024-08-19  8:33 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: akpm, linux-mm, Usama Arif, baolin.wang, chrisl, hanchuanhua,
	ioworker0, kaleshsingh, kasong, linux-kernel, ryan.roberts,
	v-songbaohua, ziy, yuanshuai

On Mon, Aug 19, 2024 at 8:28 PM David Hildenbrand <david@redhat.com> wrote:
>
> On 18.08.24 09:58, Barry Song wrote:
> > Hi Andrew, David, Usama,
> >
> > I'm attempting to rebase this series on top of Usama's
> > [PATCH v3 0/6] mm: split underutilized THPs[1]
> >
> > However, I feel it is impossible and we might be tackling things
> > in the wrong order.
>
> Is just the ordering suboptimal (which can/will get resolved one way or
> the other), or is there something fundamental that will make this series
> here "impossible"?

I think it is just that the ordering is suboptimal. Ideally, the mTHP
counters can go in first, then the new partially_mapped feature can be
rebased on top of them.

>
> --
> Cheers,
>
> David / dhildenb
>



* Re: [PATCH v2 0/2] mm: collect the number of anon mTHP
  2024-08-19  8:33     ` Barry Song
@ 2024-08-19  8:52       ` Barry Song
  2024-08-19 14:22         ` Usama Arif
  0 siblings, 1 reply; 19+ messages in thread
From: Barry Song @ 2024-08-19  8:52 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: akpm, linux-mm, Usama Arif, baolin.wang, chrisl, hanchuanhua,
	ioworker0, kaleshsingh, kasong, linux-kernel, ryan.roberts,
	v-songbaohua, ziy, yuanshuai

On Mon, Aug 19, 2024 at 8:33 PM Barry Song <21cnbao@gmail.com> wrote:
>
> On Mon, Aug 19, 2024 at 8:28 PM David Hildenbrand <david@redhat.com> wrote:
> >
> > On 18.08.24 09:58, Barry Song wrote:
> > > Hi Andrew, David, Usama,
> > >
> > > I'm attempting to rebase this series on top of Usama's
> > > [PATCH v3 0/6] mm: split underutilized THPs[1]
> > >
> > > However, I feel it is impossible and we might be tackling things
> > > in the wrong order.
> >
> > Is just the ordering suboptimal (which can/will get resolved one way or
> > the other), or is there something fundamental that will make this series
> > here "impossible"?
>
> i think it is just the ordering suboptimal. Ideally, mTHP counters can go
> first, then the new partially_mapped feature will rebase on top of
> mTHP counters.

Sorry, please allow me to ramble a bit more.

The nr_split_deferred counter is straightforward and simple without the
partially_mapped feature. Each time we enter split_list, we increment by
1, and when we leave, we decrement by 1.

With the new partially_mapped feature, we can enter split_list without
actually being partially_mapped. If the MTHP counter series is processed
first, the partially_mapped series can handle all cases while properly
clearing and setting the partially_mapped flag. These flag operations
need to be handled carefully.
Currently, I notice that Usama's series is clearing the flag unconditionally
in all cases.

In simple terms, mTHP counters are just a counting mechanism that
doesn't introduce new features. However, partially_mapped is a new
feature. A better approach might be to handle the counters first, then
ensure that the new feature doesn't break the counter.

>
> >
> > --
> > Cheers,
> >
> > David / dhildenb

Thanks
Barry



* Re: [PATCH v2 0/2] mm: collect the number of anon mTHP
  2024-08-19  8:52       ` Barry Song
@ 2024-08-19 14:22         ` Usama Arif
  0 siblings, 0 replies; 19+ messages in thread
From: Usama Arif @ 2024-08-19 14:22 UTC (permalink / raw)
  To: Barry Song, David Hildenbrand, akpm
  Cc: linux-mm, baolin.wang, chrisl, hanchuanhua, ioworker0,
	kaleshsingh, kasong, linux-kernel, ryan.roberts, v-songbaohua,
	ziy, yuanshuai



On 19/08/2024 09:52, Barry Song wrote:
> On Mon, Aug 19, 2024 at 8:33 PM Barry Song <21cnbao@gmail.com> wrote:
>>
>> On Mon, Aug 19, 2024 at 8:28 PM David Hildenbrand <david@redhat.com> wrote:
>>>
>>> On 18.08.24 09:58, Barry Song wrote:
>>>> Hi Andrew, David, Usama,
>>>>
>>>> I'm attempting to rebase this series on top of Usama's
>>>> [PATCH v3 0/6] mm: split underutilized THPs[1]
>>>>
>>>> However, I feel it is impossible and we might be tackling things
>>>> in the wrong order.
>>>
>>> Is just the ordering suboptimal (which can/will get resolved one way or
>>> the other), or is there something fundamental that will make this series
>>> here "impossible"?
>>
>> i think it is just the ordering suboptimal. Ideally, mTHP counters can go
>> first, then the new partially_mapped feature will rebase on top of
>> mTHP counters.
> 
> Sorry, please allow me to ramble a bit more.
> 
> The nr_split_deferred counter is straightforward and simple without the
> partially_mapped feature. Each time we enter split_list, we increment by
> 1, and when we leave, we decrement by 1.
> 
> With the new partially_mapped feature, we can enter split_list without
> actually being partially_mapped. If the MTHP counter series is processed
> first, the partially_mapped series can handle all cases while properly
> clearing and setting the partially_mapped flag. These flag operations
> need to be handled carefully.
> Currently, I notice that Usama's series is clearing the flag unconditionally
> in all cases.
> 
> In simple terms, mTHP counters are just a counting mechanism that
> doesn't introduce new features. However, partially_mapped is a new
> feature. A better approach might be to handle the counters first, then
> ensure that the new feature doesn't break the counter.
> 

I am ok with the series being reordered. I think the difficulty is for Andrew to tackle my series, yours, and Yu Zhao's, which all seem to be touching the same code, so I am happy with whatever makes it easier for Andrew.


>>
>>>
>>> --
>>> Cheers,
>>>
>>> David / dhildenb
> 
> Thanks
> Barry



* Re: [PATCH v2 1/2] mm: collect the number of anon large folios
  2024-08-11 22:49 ` [PATCH v2 1/2] mm: collect the number of anon large folios Barry Song
@ 2024-08-21 21:34   ` David Hildenbrand
  2024-08-22  0:52     ` Barry Song
  0 siblings, 1 reply; 19+ messages in thread
From: David Hildenbrand @ 2024-08-21 21:34 UTC (permalink / raw)
  To: Barry Song, akpm, linux-mm
  Cc: baolin.wang, chrisl, hanchuanhua, ioworker0, kaleshsingh, kasong,
	linux-kernel, ryan.roberts, v-songbaohua, ziy, yuanshuai

On 12.08.24 00:49, Barry Song wrote:
> From: Barry Song <v-songbaohua@oppo.com>
> 
> Anon large folios come from three places:
> 1. new large folios allocated in the page fault path, which call
>    folio_add_new_anon_rmap() for rmap;
> 2. a large folio split into multiple lower-order large folios;
> 3. a large folio migrated to a new large folio.
> 
> In all three cases above, we increase nr_anon by 1.
> 
> An anon large folio can go away either because it is split or because
> it is freed; in both cases, we decrease the count by 1.
> 
> Folios that have been added to the swap cache but have not yet received
> an anon mapping won't be counted. This is consistent with the AnonPages
> statistics in /proc/meminfo.

Thinking out loud, I wonder if we want to have something like that for 
any anon folios (including small ones).

Assume we longterm-pinned an anon folio and unmapped/zapped it. It would 
be quite interesting to see that these are actually anon pages still 
consuming memory. Same with memory leaks, when an anon folio doesn't get 
freed for some reason.

The whole "AnonPages" counter thingy is just confusing, it only counts 
what's currently mapped ... so we'd want something different.

But it's okay to start with large folios only, there we have a new 
interface without that legacy stuff :)

> 
> Signed-off-by: Barry Song <v-songbaohua@oppo.com>
> ---
>   Documentation/admin-guide/mm/transhuge.rst |  5 +++++
>   include/linux/huge_mm.h                    | 15 +++++++++++++--
>   mm/huge_memory.c                           | 13 ++++++++++---
>   mm/migrate.c                               |  4 ++++
>   mm/page_alloc.c                            |  5 ++++-
>   mm/rmap.c                                  |  1 +
>   6 files changed, 37 insertions(+), 6 deletions(-)
> 
> diff --git a/Documentation/admin-guide/mm/transhuge.rst b/Documentation/admin-guide/mm/transhuge.rst
> index 058485daf186..9fdfb46e4560 100644
> --- a/Documentation/admin-guide/mm/transhuge.rst
> +++ b/Documentation/admin-guide/mm/transhuge.rst
> @@ -527,6 +527,11 @@ split_deferred
>           it would free up some memory. Pages on split queue are going to
>           be split under memory pressure, if splitting is possible.
>   
> +nr_anon
> +       the number of anon huge pages we have in the whole system.

"transparent ..." otherwise people might confuse it with anon hugetlb 
"huge pages" ... :)

I briefly tried coming up with a better name than "nr_anon" but failed.


[...]

> @@ -447,6 +449,8 @@ static int __folio_migrate_mapping(struct address_space *mapping,
>   	 */
>   	newfolio->index = folio->index;
>   	newfolio->mapping = folio->mapping;
> +	if (folio_test_anon(folio) && folio_test_large(folio))
> +		mod_mthp_stat(folio_order(folio), MTHP_STAT_NR_ANON, 1);
>   	folio_ref_add(newfolio, nr); /* add cache reference */
>   	if (folio_test_swapbacked(folio)) {
>   		__folio_set_swapbacked(newfolio);
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 84a7154fde93..382c364d3efa 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1084,8 +1084,11 @@ __always_inline bool free_pages_prepare(struct page *page,
>   			(page + i)->flags &= ~PAGE_FLAGS_CHECK_AT_PREP;
>   		}
>   	}
> -	if (PageMappingFlags(page))
> +	if (PageMappingFlags(page)) {
> +		if (PageAnon(page) && compound)
> +			mod_mthp_stat(order, MTHP_STAT_NR_ANON, -1);

I wonder if you could even drop the "compound" check. mod_mthp_stat 
would handle order == 0 just fine. Not that I think it makes much 
difference.


Nothing else jumped at me.

Acked-by: David Hildenbrand <david@redhat.com>

-- 
Cheers,

David / dhildenb




* Re: [PATCH v2 2/2] mm: collect the number of anon large folios on split_deferred list
  2024-08-11 22:49 ` [PATCH v2 2/2] mm: collect the number of anon large folios on split_deferred list Barry Song
@ 2024-08-21 21:39   ` David Hildenbrand
  2024-08-21 22:01     ` Barry Song
  0 siblings, 1 reply; 19+ messages in thread
From: David Hildenbrand @ 2024-08-21 21:39 UTC (permalink / raw)
  To: Barry Song, akpm, linux-mm
  Cc: baolin.wang, chrisl, hanchuanhua, ioworker0, kaleshsingh, kasong,
	linux-kernel, ryan.roberts, v-songbaohua, ziy, yuanshuai,
	Usama Arif

On 12.08.24 00:49, Barry Song wrote:
> From: Barry Song <v-songbaohua@oppo.com>
> 
> When an mTHP is added to the deferred_list, its partial pages
> are unused, leading to wasted memory and potentially increasing
> memory reclamation pressure.
> 
> Detailing the specifics of how unmapping occurs is quite difficult
> and not that useful, so we adopt a simple approach: each time an
> mTHP enters the deferred_list, we increment the count by 1; whenever
> it leaves for any reason, we decrement the count by 1.
> 
> Signed-off-by: Barry Song <v-songbaohua@oppo.com>
> ---
>   Documentation/admin-guide/mm/transhuge.rst | 5 +++++
>   include/linux/huge_mm.h                    | 1 +
>   mm/huge_memory.c                           | 6 ++++++
>   3 files changed, 12 insertions(+)
> 
> diff --git a/Documentation/admin-guide/mm/transhuge.rst b/Documentation/admin-guide/mm/transhuge.rst
> index 9fdfb46e4560..7072469de8a8 100644
> --- a/Documentation/admin-guide/mm/transhuge.rst
> +++ b/Documentation/admin-guide/mm/transhuge.rst
> @@ -532,6 +532,11 @@ nr_anon
>          These huge pages could be entirely mapped or have partially
>          unmapped/unused subpages.
>   
> +nr_split_deferred
> +       the number of anon huge pages which have been partially unmapped
> +       and put onto split queue. Those unmapped subpages are also unused
> +       and temporarily wasting memory.

The name suggests something else ... like a counter of how many have 
been deferred split :)

Would "nr_anon_partially_mapped" "nr_anon_split_pending" (or something 
less mouthful) be clearer?

(likely "anon" really should be part of the name in any case)

The name we chose (and the implied semantics) will likely have 
implications on the handling of Usamas series.

-- 
Cheers,

David / dhildenb




* Re: [PATCH v2 2/2] mm: collect the number of anon large folios on split_deferred list
  2024-08-21 21:39   ` David Hildenbrand
@ 2024-08-21 22:01     ` Barry Song
  2024-08-21 22:10       ` David Hildenbrand
  0 siblings, 1 reply; 19+ messages in thread
From: Barry Song @ 2024-08-21 22:01 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: akpm, linux-mm, baolin.wang, chrisl, hanchuanhua, ioworker0,
	kaleshsingh, kasong, linux-kernel, ryan.roberts, v-songbaohua,
	ziy, yuanshuai, Usama Arif

On Thu, Aug 22, 2024 at 5:39 AM David Hildenbrand <david@redhat.com> wrote:
>
> On 12.08.24 00:49, Barry Song wrote:
> > From: Barry Song <v-songbaohua@oppo.com>
> >
> > When an mTHP is added to the deferred_list, its partial pages
> > are unused, leading to wasted memory and potentially increasing
> > memory reclamation pressure.
> >
> > Detailing the specifics of how unmapping occurs is quite difficult
> > and not that useful, so we adopt a simple approach: each time an
> > mTHP enters the deferred_list, we increment the count by 1; whenever
> > it leaves for any reason, we decrement the count by 1.
> >
> > Signed-off-by: Barry Song <v-songbaohua@oppo.com>
> > ---
> >   Documentation/admin-guide/mm/transhuge.rst | 5 +++++
> >   include/linux/huge_mm.h                    | 1 +
> >   mm/huge_memory.c                           | 6 ++++++
> >   3 files changed, 12 insertions(+)
> >
> > diff --git a/Documentation/admin-guide/mm/transhuge.rst b/Documentation/admin-guide/mm/transhuge.rst
> > index 9fdfb46e4560..7072469de8a8 100644
> > --- a/Documentation/admin-guide/mm/transhuge.rst
> > +++ b/Documentation/admin-guide/mm/transhuge.rst
> > @@ -532,6 +532,11 @@ nr_anon
> >          These huge pages could be entirely mapped or have partially
> >          unmapped/unused subpages.
> >
> > +nr_split_deferred
> > +       the number of anon huge pages which have been partially unmapped
> > +       and put onto split queue. Those unmapped subpages are also unused
> > +       and temporarily wasting memory.
>
> The name suggests something else ... like a counter of how many have
> been deferred split :)
>
> Would "nr_anon_partially_mapped" or "nr_anon_split_pending" (or something
> less of a mouthful) be clearer?
>
> (likely "anon" really should be part of the name in any case)
>
> The name we chose (and the implied semantics) will likely have
> implications for the handling of Usama's series.
>

Hi David,

Your point is quite similar to my V1, though not exactly the same. I aimed to
make the name more meaningful for users.
https://lore.kernel.org/all/20240808010457.228753-3-21cnbao@gmail.com/

Ryan felt that the name should be consistent with the existing split_deferred.
https://lore.kernel.org/all/36e8f1be-868d-4bce-8f32-e2d96b8b7af3@arm.com/#t

It seems that the existing split_deferred may now be less optimal with
Usama's series, as entirely_mapped folios might also be on the list.

Ryan is out right now, but I suppose he will be convinced that
"nr_anon_partially_mapped" is probably a better name once
he returns and reviews Usama's series. :-)

> --
> Cheers,
>
> David / dhildenb
>

Thanks
Barry



* Re: [PATCH v2 2/2] mm: collect the number of anon large folios on split_deferred list
  2024-08-21 22:01     ` Barry Song
@ 2024-08-21 22:10       ` David Hildenbrand
  0 siblings, 0 replies; 19+ messages in thread
From: David Hildenbrand @ 2024-08-21 22:10 UTC (permalink / raw)
  To: Barry Song
  Cc: akpm, linux-mm, baolin.wang, chrisl, hanchuanhua, ioworker0,
	kaleshsingh, kasong, linux-kernel, ryan.roberts, v-songbaohua,
	ziy, yuanshuai, Usama Arif

On 22.08.24 00:01, Barry Song wrote:
> On Thu, Aug 22, 2024 at 5:39 AM David Hildenbrand <david@redhat.com> wrote:
>>
>> On 12.08.24 00:49, Barry Song wrote:
>>> From: Barry Song <v-songbaohua@oppo.com>
>>>
>>> When an mTHP is added to the deferred_list, its partial pages
>>> are unused, leading to wasted memory and potentially increasing
>>> memory reclamation pressure.
>>>
>>> Detailing the specifics of how unmapping occurs is quite difficult
>>> and not that useful, so we adopt a simple approach: each time an
>>> mTHP enters the deferred_list, we increment the count by 1; whenever
>>> it leaves for any reason, we decrement the count by 1.
>>>
>>> Signed-off-by: Barry Song <v-songbaohua@oppo.com>
>>> ---
>>>    Documentation/admin-guide/mm/transhuge.rst | 5 +++++
>>>    include/linux/huge_mm.h                    | 1 +
>>>    mm/huge_memory.c                           | 6 ++++++
>>>    3 files changed, 12 insertions(+)
>>>
>>> diff --git a/Documentation/admin-guide/mm/transhuge.rst b/Documentation/admin-guide/mm/transhuge.rst
>>> index 9fdfb46e4560..7072469de8a8 100644
>>> --- a/Documentation/admin-guide/mm/transhuge.rst
>>> +++ b/Documentation/admin-guide/mm/transhuge.rst
>>> @@ -532,6 +532,11 @@ nr_anon
>>>           These huge pages could be entirely mapped or have partially
>>>           unmapped/unused subpages.
>>>
>>> +nr_split_deferred
>>> +       the number of anon huge pages which have been partially unmapped
>>> +       and put onto split queue. Those unmapped subpages are also unused
>>> +       and temporarily wasting memory.
>>
>> The name suggests something else ... like a counter of how many have
>> been deferred split :)
>>
>> Would "nr_anon_partially_mapped" or "nr_anon_split_pending" (or something
>> less of a mouthful) be clearer?
>>
>> (likely "anon" really should be part of the name in any case)
>>
>> The name we chose (and the implied semantics) will likely have
>> implications for the handling of Usama's series.
>>
> 
> Hi David,
> 
> Your point is quite similar to my V1, though not exactly the same. I aimed to
> make the name more meaningful for users.
> https://lore.kernel.org/all/20240808010457.228753-3-21cnbao@gmail.com/
> 
> Ryan felt that the name should be consistent with the existing split_deferred.
> https://lore.kernel.org/all/36e8f1be-868d-4bce-8f32-e2d96b8b7af3@arm.com/#t

Right, it's an increasing counter of how often we added something to the 
deferred split queue. It's more like a low-level primitive, although 
currently only used for one reason (anon + partial mapping).

> 
> It seems that the existing split_deferred may now be less optimal with
> Usama's series, as entirely_mapped folios might also be on the list.

Yes.

> 
> Ryan is out right now, but I suppose he will be convinced that
> "nr_anon_partially_mapped" is probably a better name once
> he returns and reviews Usama's series. :-)

Maybe ;) At least it's clear from the name what we are talking about.

-- 
Cheers,

David / dhildenb




* Re: [PATCH v2 1/2] mm: collect the number of anon large folios
  2024-08-21 21:34   ` David Hildenbrand
@ 2024-08-22  0:52     ` Barry Song
  2024-08-22  8:44       ` Barry Song
  0 siblings, 1 reply; 19+ messages in thread
From: Barry Song @ 2024-08-22  0:52 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: akpm, linux-mm, baolin.wang, chrisl, hanchuanhua, ioworker0,
	kaleshsingh, kasong, linux-kernel, ryan.roberts, v-songbaohua,
	ziy, yuanshuai

On Thu, Aug 22, 2024 at 5:34 AM David Hildenbrand <david@redhat.com> wrote:
>
> On 12.08.24 00:49, Barry Song wrote:
> > From: Barry Song <v-songbaohua@oppo.com>
> >
> > Anon large folios come from three places:
> > 1. newly allocated large folios in page faults, which call
> > folio_add_new_anon_rmap() for rmap;
> > 2. a large folio is split into multiple lower-order large folios;
> > 3. a large folio is migrated to a new large folio.
> >
> > In all three cases above, we increase nr_anon by 1;
> >
> > Anon large folios go away either because they are split or
> > freed; in these cases, we reduce the count by 1.
> >
> > Folios that have been added to the swap cache but have not yet received
> > an anon mapping won't be counted. This is consistent with the AnonPages
> > statistics in /proc/meminfo.
>
> Thinking out loud, I wonder if we want to have something like that for
> any anon folios (including small ones).
>
> Assume we longterm-pinned an anon folio and unmapped/zapped it. It would
> be quite interesting to see that these are actually anon pages still
> consuming memory. Same with memory leaks, when an anon folio doesn't get
> freed for some reason.
>
> The whole "AnonPages" counter thingy is just confusing, it only counts
> what's currently mapped ... so we'd want something different.
>
> But it's okay to start with large folios only, there we have a new
> interface without that legacy stuff :)

We have two options to do this:
1. add a new, separate nr_anon_unmapped interface which
counts unmapped anon memory only;
2. let nr_anon count both mapped and unmapped anon
folios.

I would assume 1 is clearer, as AnonPages has been there
for years; and if we count all mapped and unmapped together,
we are still lacking a way to find the anon memory leak
problem you mentioned.

We are currently comparing nr_anon (which includes mapped
folios only) with AnonPages to get the distribution of
different folio sizes in performance profiling.

nr_anon_unmapped should normally be quite small; otherwise,
something must be wrong.

>
> >
> > Signed-off-by: Barry Song <v-songbaohua@oppo.com>
> > ---
> >   Documentation/admin-guide/mm/transhuge.rst |  5 +++++
> >   include/linux/huge_mm.h                    | 15 +++++++++++++--
> >   mm/huge_memory.c                           | 13 ++++++++++---
> >   mm/migrate.c                               |  4 ++++
> >   mm/page_alloc.c                            |  5 ++++-
> >   mm/rmap.c                                  |  1 +
> >   6 files changed, 37 insertions(+), 6 deletions(-)
> >
> > diff --git a/Documentation/admin-guide/mm/transhuge.rst b/Documentation/admin-guide/mm/transhuge.rst
> > index 058485daf186..9fdfb46e4560 100644
> > --- a/Documentation/admin-guide/mm/transhuge.rst
> > +++ b/Documentation/admin-guide/mm/transhuge.rst
> > @@ -527,6 +527,11 @@ split_deferred
> >           it would free up some memory. Pages on split queue are going to
> >           be split under memory pressure, if splitting is possible.
> >
> > +nr_anon
> > +       the number of anon huge pages we have in the whole system.
>
> "transparent ..." otherwise people might confuse it with anon hugetlb
> "huge pages" ... :)
>
> I briefly tried coming up with a better name than "nr_anon" but failed.
>
>

If we might have an unmapped-anon counter later, maybe rename it to
nr_anon_mapped? Then the new interface we add in the future
could be nr_anon_unmapped?

> [...]
>
> > @@ -447,6 +449,8 @@ static int __folio_migrate_mapping(struct address_space *mapping,
> >        */
> >       newfolio->index = folio->index;
> >       newfolio->mapping = folio->mapping;
> > +     if (folio_test_anon(folio) && folio_test_large(folio))
> > +             mod_mthp_stat(folio_order(folio), MTHP_STAT_NR_ANON, 1);
> >       folio_ref_add(newfolio, nr); /* add cache reference */
> >       if (folio_test_swapbacked(folio)) {
> >               __folio_set_swapbacked(newfolio);
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index 84a7154fde93..382c364d3efa 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -1084,8 +1084,11 @@ __always_inline bool free_pages_prepare(struct page *page,
> >                       (page + i)->flags &= ~PAGE_FLAGS_CHECK_AT_PREP;
> >               }
> >       }
> > -     if (PageMappingFlags(page))
> > +     if (PageMappingFlags(page)) {
> > +             if (PageAnon(page) && compound)
> > +                     mod_mthp_stat(order, MTHP_STAT_NR_ANON, -1);
>
> I wonder if you could even drop the "compound" check. mod_mthp_stat
> would handle order == 0 just fine. Not that I think it makes much
> difference.

I think either is fine, as mod_mthp_stat filters out order == 0
right now.

>
>
> Nothing else jumped at me.
>
> Acked-by: David Hildenbrand <david@redhat.com>
>

Thanks!

> --
> Cheers,
>
> David / dhildenb
>

Barry



* Re: [PATCH v2 1/2] mm: collect the number of anon large folios
  2024-08-22  0:52     ` Barry Song
@ 2024-08-22  8:44       ` Barry Song
  2024-08-22  8:59         ` David Hildenbrand
  0 siblings, 1 reply; 19+ messages in thread
From: Barry Song @ 2024-08-22  8:44 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: akpm, linux-mm, baolin.wang, chrisl, hanchuanhua, ioworker0,
	kaleshsingh, kasong, linux-kernel, ryan.roberts, v-songbaohua,
	ziy, yuanshuai

On Thu, Aug 22, 2024 at 12:52 PM Barry Song <21cnbao@gmail.com> wrote:
>
> On Thu, Aug 22, 2024 at 5:34 AM David Hildenbrand <david@redhat.com> wrote:
> >
> > On 12.08.24 00:49, Barry Song wrote:
> > > From: Barry Song <v-songbaohua@oppo.com>
> > >
> > > Anon large folios come from three places:
> > > 1. newly allocated large folios in page faults, which call
> > > folio_add_new_anon_rmap() for rmap;
> > > 2. a large folio is split into multiple lower-order large folios;
> > > 3. a large folio is migrated to a new large folio.
> > >
> > > In all three cases above, we increase nr_anon by 1;
> > >
> > > Anon large folios go away either because they are split or
> > > freed; in these cases, we reduce the count by 1.
> > >
> > > Folios that have been added to the swap cache but have not yet received
> > > an anon mapping won't be counted. This is consistent with the AnonPages
> > > statistics in /proc/meminfo.
> >
> > Thinking out loud, I wonder if we want to have something like that for
> > any anon folios (including small ones).
> >
> > Assume we longterm-pinned an anon folio and unmapped/zapped it. It would
> > be quite interesting to see that these are actually anon pages still
> > consuming memory. Same with memory leaks, when an anon folio doesn't get
> > freed for some reason.
> >
> > The whole "AnonPages" counter thingy is just confusing, it only counts
> > what's currently mapped ... so we'd want something different.
> >
> > But it's okay to start with large folios only, there we have a new
> > interface without that legacy stuff :)
>
> We have two options to do this:
> 1. add a new, separate nr_anon_unmapped interface which
> counts unmapped anon memory only;
> 2. let nr_anon count both mapped and unmapped anon
> folios.
>
> I would assume 1 is clearer, as AnonPages has been there
> for years; and if we count all mapped and unmapped together,
> we are still lacking a way to find the anon memory leak
> problem you mentioned.
>
> We are currently comparing nr_anon (which includes mapped
> folios only) with AnonPages to get the distribution of
> different folio sizes in performance profiling.
>
> nr_anon_unmapped should normally be quite small; otherwise,
> something must be wrong.
>
> >
> > >
> > > Signed-off-by: Barry Song <v-songbaohua@oppo.com>
> > > ---
> > >   Documentation/admin-guide/mm/transhuge.rst |  5 +++++
> > >   include/linux/huge_mm.h                    | 15 +++++++++++++--
> > >   mm/huge_memory.c                           | 13 ++++++++++---
> > >   mm/migrate.c                               |  4 ++++
> > >   mm/page_alloc.c                            |  5 ++++-
> > >   mm/rmap.c                                  |  1 +
> > >   6 files changed, 37 insertions(+), 6 deletions(-)
> > >
> > > diff --git a/Documentation/admin-guide/mm/transhuge.rst b/Documentation/admin-guide/mm/transhuge.rst
> > > index 058485daf186..9fdfb46e4560 100644
> > > --- a/Documentation/admin-guide/mm/transhuge.rst
> > > +++ b/Documentation/admin-guide/mm/transhuge.rst
> > > @@ -527,6 +527,11 @@ split_deferred
> > >           it would free up some memory. Pages on split queue are going to
> > >           be split under memory pressure, if splitting is possible.
> > >
> > > +nr_anon
> > > +       the number of anon huge pages we have in the whole system.
> >
> > "transparent ..." otherwise people might confuse it with anon hugetlb
> > "huge pages" ... :)
> >
> > I briefly tried coming up with a better name than "nr_anon" but failed.
> >
> >
>
>> If we might have an unmapped-anon counter later, maybe rename it to
>> nr_anon_mapped? Then the new interface we add in the future
>> could be nr_anon_unmapped?

On second thought, this might be incorrect as well. Concepts like 'anon',
'shmem', and 'file' refer to states after mapping. If an 'anon' has been
unmapped but is still pinned and not yet freed, it isn't technically an
'anon' anymore?

On the other hand, implementing nr_anon_unmapped could be extremely
tricky. I have no idea how to implement it as we are losing those mapping
flags.

However, a page that is read ahead but not yet mapped can still become
anon, which seems slightly less tricky to count, though still
difficult: besides anon pages, shmem can also be swap-backed?

>
> > [...]
> >
> > > @@ -447,6 +449,8 @@ static int __folio_migrate_mapping(struct address_space *mapping,
> > >        */
> > >       newfolio->index = folio->index;
> > >       newfolio->mapping = folio->mapping;
> > > +     if (folio_test_anon(folio) && folio_test_large(folio))
> > > +             mod_mthp_stat(folio_order(folio), MTHP_STAT_NR_ANON, 1);
> > >       folio_ref_add(newfolio, nr); /* add cache reference */
> > >       if (folio_test_swapbacked(folio)) {
> > >               __folio_set_swapbacked(newfolio);
> > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > > index 84a7154fde93..382c364d3efa 100644
> > > --- a/mm/page_alloc.c
> > > +++ b/mm/page_alloc.c
> > > @@ -1084,8 +1084,11 @@ __always_inline bool free_pages_prepare(struct page *page,
> > >                       (page + i)->flags &= ~PAGE_FLAGS_CHECK_AT_PREP;
> > >               }
> > >       }
> > > -     if (PageMappingFlags(page))
> > > +     if (PageMappingFlags(page)) {
> > > +             if (PageAnon(page) && compound)
> > > +                     mod_mthp_stat(order, MTHP_STAT_NR_ANON, -1);
> >
> > I wonder if you could even drop the "compound" check. mod_mthp_stat
> > would handle order == 0 just fine. Not that I think it makes much
> > difference.
>
> I think either is fine, as mod_mthp_stat filters out order == 0
> right now.
>
> >
> >
> > Nothing else jumped at me.
> >
> > Acked-by: David Hildenbrand <david@redhat.com>
> >
>
> Thanks!
>
> > --
> > Cheers,
> >
> > David / dhildenb
> >
>
> Barry

Thanks
Barry



* Re: [PATCH v2 1/2] mm: collect the number of anon large folios
  2024-08-22  8:44       ` Barry Song
@ 2024-08-22  8:59         ` David Hildenbrand
  2024-08-22  9:21           ` Barry Song
  0 siblings, 1 reply; 19+ messages in thread
From: David Hildenbrand @ 2024-08-22  8:59 UTC (permalink / raw)
  To: Barry Song
  Cc: akpm, linux-mm, baolin.wang, chrisl, hanchuanhua, ioworker0,
	kaleshsingh, kasong, linux-kernel, ryan.roberts, v-songbaohua,
	ziy, yuanshuai

On 22.08.24 10:44, Barry Song wrote:
> On Thu, Aug 22, 2024 at 12:52 PM Barry Song <21cnbao@gmail.com> wrote:
>>
>> On Thu, Aug 22, 2024 at 5:34 AM David Hildenbrand <david@redhat.com> wrote:
>>>
>>> On 12.08.24 00:49, Barry Song wrote:
>>>> From: Barry Song <v-songbaohua@oppo.com>
>>>>
>>>> Anon large folios come from three places:
>>>> 1. newly allocated large folios in page faults, which call
>>>> folio_add_new_anon_rmap() for rmap;
>>>> 2. a large folio is split into multiple lower-order large folios;
>>>> 3. a large folio is migrated to a new large folio.
>>>>
>>>> In all three cases above, we increase nr_anon by 1;
>>>>
>>>> Anon large folios go away either because they are split or
>>>> freed; in these cases, we reduce the count by 1.
>>>>
>>>> Folios that have been added to the swap cache but have not yet received
>>>> an anon mapping won't be counted. This is consistent with the AnonPages
>>>> statistics in /proc/meminfo.
>>>
>>> Thinking out loud, I wonder if we want to have something like that for
>>> any anon folios (including small ones).
>>>
>>> Assume we longterm-pinned an anon folio and unmapped/zapped it. It would
>>> be quite interesting to see that these are actually anon pages still
>>> consuming memory. Same with memory leaks, when an anon folio doesn't get
>>> freed for some reason.
>>>
>>> The whole "AnonPages" counter thingy is just confusing, it only counts
>>> what's currently mapped ... so we'd want something different.
>>>
>>> But it's okay to start with large folios only, there we have a new
>>> interface without that legacy stuff :)
>>
>> We have two options to do this:
>> 1. add a new, separate nr_anon_unmapped interface which
>> counts unmapped anon memory only;
>> 2. let nr_anon count both mapped and unmapped anon
>> folios.
>>
>> I would assume 1 is clearer, as AnonPages has been there
>> for years; and if we count all mapped and unmapped together,
>> we are still lacking a way to find the anon memory leak
>> problem you mentioned.
>>
>> We are currently comparing nr_anon (which includes mapped
>> folios only) with AnonPages to get the distribution of
>> different folio sizes in performance profiling.
>>
>> nr_anon_unmapped should normally be quite small; otherwise,
>> something must be wrong.
>>
>>>
>>>>
>>>> Signed-off-by: Barry Song <v-songbaohua@oppo.com>
>>>> ---
>>>>    Documentation/admin-guide/mm/transhuge.rst |  5 +++++
>>>>    include/linux/huge_mm.h                    | 15 +++++++++++++--
>>>>    mm/huge_memory.c                           | 13 ++++++++++---
>>>>    mm/migrate.c                               |  4 ++++
>>>>    mm/page_alloc.c                            |  5 ++++-
>>>>    mm/rmap.c                                  |  1 +
>>>>    6 files changed, 37 insertions(+), 6 deletions(-)
>>>>
>>>> diff --git a/Documentation/admin-guide/mm/transhuge.rst b/Documentation/admin-guide/mm/transhuge.rst
>>>> index 058485daf186..9fdfb46e4560 100644
>>>> --- a/Documentation/admin-guide/mm/transhuge.rst
>>>> +++ b/Documentation/admin-guide/mm/transhuge.rst
>>>> @@ -527,6 +527,11 @@ split_deferred
>>>>            it would free up some memory. Pages on split queue are going to
>>>>            be split under memory pressure, if splitting is possible.
>>>>
>>>> +nr_anon
>>>> +       the number of anon huge pages we have in the whole system.
>>>
>>> "transparent ..." otherwise people might confuse it with anon hugetlb
>>> "huge pages" ... :)
>>>
>>> I briefly tried coming up with a better name than "nr_anon" but failed.
>>>
>>>
>>
>> If we might have an unmapped-anon counter later, maybe rename it to
>> nr_anon_mapped? Then the new interface we add in the future
>> could be nr_anon_unmapped?

We really shouldn't be using the mapped/unmapped terminology here ... we 
allocated pages and turned them into anonymous folios. At some point we 
free them. That's the lifecycle.

> 
> On second thought, this might be incorrect as well. Concepts like 'anon',
> 'shmem', and 'file' refer to states after mapping. If an 'anon' has been
> unmapped but is still pinned and not yet freed, it isn't technically an
> 'anon' anymore?

It's just not mapped, and cannot get mapped, anymore. In the memdesc 
world, we'd be freeing the "struct anon" or "struct folio" once the last 
refcount goes to 0, not once (e.g., temporarily during a failed 
migration?) unmapped.

The important part to me would be: this is memory that was allocated for 
anonymous memory, and it's still around for some reason and not getting 
freed. Usually, we would expect anon memory to get freed fairly quickly 
once unmapped. Except when there are long-term pinnings or other types 
of memory leaks.

You could happily continue using these anon pages via vmsplice() or 
similar, even though the original page table mapping was torn down.

> 
> On the other hand, implementing nr_anon_unmapped could be extremely
> tricky. I have no idea how to implement it as we are losing those mapping
> flags.

folio_mapcount() can tell you efficiently whether a folio is mapped or 
not -- and that information will stay for all eternity as long as we 
have any mapcounts :) . It cannot tell "how many" pages of a large folio 
are mapped, but at least "is any page of this large folio mapped".

> 
> However, a page that is read ahead but not yet mapped can still become
> anon, which seems slightly less tricky to count, though still
> difficult: besides anon pages, shmem can also be swap-backed?

Yes. I'm sure there would be ways to achieve it, but I am not sure if 
it's worth the churn. These pages can be reclaimed easily (I would 
assume? They are not even mapped and were never accessible via GUP), and 
can certainly not have any longterm pinnings or similar. They are more
like "cached things that could become an anon folio".

-- 
Cheers,

David / dhildenb




* Re: [PATCH v2 1/2] mm: collect the number of anon large folios
  2024-08-22  8:59         ` David Hildenbrand
@ 2024-08-22  9:21           ` Barry Song
  2024-08-22 10:01             ` David Hildenbrand
  0 siblings, 1 reply; 19+ messages in thread
From: Barry Song @ 2024-08-22  9:21 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: akpm, linux-mm, baolin.wang, chrisl, hanchuanhua, ioworker0,
	kaleshsingh, kasong, linux-kernel, ryan.roberts, v-songbaohua,
	ziy, yuanshuai

On Thu, Aug 22, 2024 at 8:59 PM David Hildenbrand <david@redhat.com> wrote:
>
> On 22.08.24 10:44, Barry Song wrote:
> > On Thu, Aug 22, 2024 at 12:52 PM Barry Song <21cnbao@gmail.com> wrote:
> >>
> >> On Thu, Aug 22, 2024 at 5:34 AM David Hildenbrand <david@redhat.com> wrote:
> >>>
> >>> On 12.08.24 00:49, Barry Song wrote:
> >>>> From: Barry Song <v-songbaohua@oppo.com>
> >>>>
> >>>> Anon large folios come from three places:
> >>>> 1. newly allocated large folios in page faults, which call
> >>>> folio_add_new_anon_rmap() for rmap;
> >>>> 2. a large folio is split into multiple lower-order large folios;
> >>>> 3. a large folio is migrated to a new large folio.
> >>>>
> >>>> In all three cases above, we increase nr_anon by 1;
> >>>>
> >>>> Anon large folios go away either because they are split or
> >>>> freed; in these cases, we reduce the count by 1.
> >>>>
> >>>> Folios that have been added to the swap cache but have not yet received
> >>>> an anon mapping won't be counted. This is consistent with the AnonPages
> >>>> statistics in /proc/meminfo.
> >>>
> >>> Thinking out loud, I wonder if we want to have something like that for
> >>> any anon folios (including small ones).
> >>>
> >>> Assume we longterm-pinned an anon folio and unmapped/zapped it. It would
> >>> be quite interesting to see that these are actually anon pages still
> >>> consuming memory. Same with memory leaks, when an anon folio doesn't get
> >>> freed for some reason.
> >>>
> >>> The whole "AnonPages" counter thingy is just confusing, it only counts
> >>> what's currently mapped ... so we'd want something different.
> >>>
> >>> But it's okay to start with large folios only, there we have a new
> >>> interface without that legacy stuff :)
> >>
> >> We have two options to do this:
> >> 1. add a new, separate nr_anon_unmapped interface which
> >> counts unmapped anon memory only;
> >> 2. let nr_anon count both mapped and unmapped anon
> >> folios.
> >>
> >> I would assume 1 is clearer, as AnonPages has been there
> >> for years; and if we count all mapped and unmapped together,
> >> we are still lacking a way to find the anon memory leak
> >> problem you mentioned.
> >>
> >> We are currently comparing nr_anon (which includes mapped
> >> folios only) with AnonPages to get the distribution of
> >> different folio sizes in performance profiling.
> >>
> >> nr_anon_unmapped should normally be quite small; otherwise,
> >> something must be wrong.
> >>
> >>>
> >>>>
> >>>> Signed-off-by: Barry Song <v-songbaohua@oppo.com>
> >>>> ---
> >>>>    Documentation/admin-guide/mm/transhuge.rst |  5 +++++
> >>>>    include/linux/huge_mm.h                    | 15 +++++++++++++--
> >>>>    mm/huge_memory.c                           | 13 ++++++++++---
> >>>>    mm/migrate.c                               |  4 ++++
> >>>>    mm/page_alloc.c                            |  5 ++++-
> >>>>    mm/rmap.c                                  |  1 +
> >>>>    6 files changed, 37 insertions(+), 6 deletions(-)
> >>>>
> >>>> diff --git a/Documentation/admin-guide/mm/transhuge.rst b/Documentation/admin-guide/mm/transhuge.rst
> >>>> index 058485daf186..9fdfb46e4560 100644
> >>>> --- a/Documentation/admin-guide/mm/transhuge.rst
> >>>> +++ b/Documentation/admin-guide/mm/transhuge.rst
> >>>> @@ -527,6 +527,11 @@ split_deferred
> >>>>            it would free up some memory. Pages on split queue are going to
> >>>>            be split under memory pressure, if splitting is possible.
> >>>>
> >>>> +nr_anon
> >>>> +       the number of anon huge pages we have in the whole system.
> >>>
> >>> "transparent ..." otherwise people might confuse it with anon hugetlb
> >>> "huge pages" ... :)
> >>>
> >>> I briefly tried coming up with a better name than "nr_anon" but failed.
> >>>
> >>>
> >>
> >> If we might have an unmapped-anon counter later, maybe rename it to
> >> nr_anon_mapped? Then the new interface we add in the future
> >> could be nr_anon_unmapped?
>
> We really shouldn't be using the mapped/unmapped terminology here ... we
> allocated pages and turned them into anonymous folios. At some point we
> free them. That's the lifecycle.
>
> >
> > On second thought, this might be incorrect as well. Concepts like 'anon',
> > 'shmem', and 'file' refer to states after mapping. If an 'anon' has been
> > unmapped but is still pinned and not yet freed, it isn't technically an
> > 'anon' anymore?
>
> It's just not mapped, and cannot get mapped, anymore. In the memdesc
> world, we'd be freeing the "struct anon" or "struct folio" once the last
> refcount goes to 0, not once (e.g., temporarily during a failed
> migration?) unmapped.
>
> The important part to me would be: this is memory that was allocated for
> anonymous memory, and it's still around for some reason and not getting
> freed. Usually, we would expect anon memory to get freed fairly quickly
> once unmapped. Except when there are long-term pinnings or other types
> of memory leaks.
>
> You could happily continue using these anon pages via vmsplice() or
> similar, even though the original page table mapping was torn down.
>
> >
> > On the other hand, implementing nr_anon_unmapped could be extremely
> > tricky. I have no idea how to implement it as we are losing those mapping
> > flags.
>
> folio_mapcount() can tell you efficiently whether a folio is mapped or
> not -- and that information will stay for all eternity as long as we
> have any mapcounts :) . It cannot tell "how many" pages of a large folio
> are mapped, but at least "is any page of this large folio mapped".

Exactly. AnonPages decreases by 1 when a page is removed from the rmap,
whereas nr_anon decreases by 1 when an anon folio is freed. So,
I would assume nr_anon includes those pinned and unmapped anon
folios but AnonPages doesn't.

If there's a significant amount of 'leaked' anon, we should consider
having a separate counter for them. For instance, if nr_anon is
100,000 and pinned/unmapped pages account for 50%, then nr_anon
alone doesn’t effectively reflect the system's state.

To implement that, it seems we need to detect both the moment
mapcount reaches 0 and the moment the anon folio is freed:

when mapcount==0 in rmap
      unmapped_pinned_anon++;

when free
      unmapped_pinned_anon--;

Anyway, it seems this is a separate job.

>
> >
> > However, a page that is read ahead but not yet mapped can still become
> > anon, which seems slightly less tricky to count, though still
> > difficult: besides anon pages, shmem can also be swap-backed?
>
> Yes. I'm sure there would be ways to achieve it, but I am not sure if
> it's worth the churn. These pages can be reclaimed easily (I would
> assume? They are not even mapped and were never accessible via GUP), and
> can certainly not have any longterm pinnings or similar. They are more
> like "cached things that could become an anon folio".

Exactly. If no one maps the pages for an extended period, I assume
the LRU will reclaim them as well.

>
> --
> Cheers,
>
> David / dhildenb

Thanks
Barry



* Re: [PATCH v2 1/2] mm: collect the number of anon large folios
  2024-08-22  9:21           ` Barry Song
@ 2024-08-22 10:01             ` David Hildenbrand
  2024-08-22 10:12               ` Barry Song
  0 siblings, 1 reply; 19+ messages in thread
From: David Hildenbrand @ 2024-08-22 10:01 UTC (permalink / raw)
  To: Barry Song
  Cc: akpm, linux-mm, baolin.wang, chrisl, hanchuanhua, ioworker0,
	kaleshsingh, kasong, linux-kernel, ryan.roberts, v-songbaohua,
	ziy, yuanshuai

On 22.08.24 11:21, Barry Song wrote:
> On Thu, Aug 22, 2024 at 8:59 PM David Hildenbrand <david@redhat.com> wrote:
>>
>> On 22.08.24 10:44, Barry Song wrote:
>>> On Thu, Aug 22, 2024 at 12:52 PM Barry Song <21cnbao@gmail.com> wrote:
>>>>
>>>> On Thu, Aug 22, 2024 at 5:34 AM David Hildenbrand <david@redhat.com> wrote:
>>>>>
>>>>> On 12.08.24 00:49, Barry Song wrote:
>>>>>> From: Barry Song <v-songbaohua@oppo.com>
>>>>>>
>>>>>> Anon large folios come from three places:
>>>>>> 1. new allocated large folios in PF, they will call folio_add_new_anon_rmap()
>>>>>> for rmap;
>>>>>> 2. a large folio is split into multiple lower-order large folios;
>>>>>> 3. a large folio is migrated to a new large folio.
>>>>>>
>>>>>> In all three cases above, we increase nr_anon by 1.
>>>>>>
>>>>>> Anon large folios may go away either because they are split or
>>>>>> because they are freed; in these cases, we reduce the count by 1.
>>>>>>
>>>>>> Folios that have been added to the swap cache but have not yet received
>>>>>> an anon mapping won't be counted. This is consistent with the AnonPages
>>>>>> statistics in /proc/meminfo.
>>>>>
>>>>> Thinking out loud, I wonder if we want to have something like that for
>>>>> any anon folios (including small ones).
>>>>>
>>>>> Assume we longterm-pinned an anon folio and unmapped/zapped it. It would
>>>>> be quite interesting to see that these are actually anon pages still
>>>>> consuming memory. Same with memory leaks, when an anon folio doesn't get
>>>>> freed for some reason.
>>>>>
>>>>> The whole "AnonPages" counter thingy is just confusing, it only counts
>>>>> what's currently mapped ... so we'd want something different.
>>>>>
>>>>> But it's okay to start with large folios only, there we have a new
>>>>> interface without that legacy stuff :)
>>>>
>>>> We have two options to do this:
>>>> 1. add a new separate nr_anon_unmapped interface which
>>>> counts unmapped anon memory only
>>>> 2. let the nr_anon count both mapped and unmapped anon
>>>> folios.
>>>>
>>>> I would assume 1 is clearer as right now AnonPages have been
>>>> there for years. and counting all mapped and unmapped together,
>>>> we are still lacking an approach to find out anon memory leak
>>>> problem you mentioned.
>>>>
>>>> We are right now comparing nr_anon(including mapped folios only)
>>>> with AnonPages to get the distribution of different folio sizes in
>>>> performance profiling.
>>>>
>>>> unmapped_nr_anon should be normally always quite small. otherwise,
>>>> something must be wrong.
>>>>
>>>>>
>>>>>>
>>>>>> Signed-off-by: Barry Song <v-songbaohua@oppo.com>
>>>>>> ---
>>>>>>     Documentation/admin-guide/mm/transhuge.rst |  5 +++++
>>>>>>     include/linux/huge_mm.h                    | 15 +++++++++++++--
>>>>>>     mm/huge_memory.c                           | 13 ++++++++++---
>>>>>>     mm/migrate.c                               |  4 ++++
>>>>>>     mm/page_alloc.c                            |  5 ++++-
>>>>>>     mm/rmap.c                                  |  1 +
>>>>>>     6 files changed, 37 insertions(+), 6 deletions(-)
>>>>>>
>>>>>> diff --git a/Documentation/admin-guide/mm/transhuge.rst b/Documentation/admin-guide/mm/transhuge.rst
>>>>>> index 058485daf186..9fdfb46e4560 100644
>>>>>> --- a/Documentation/admin-guide/mm/transhuge.rst
>>>>>> +++ b/Documentation/admin-guide/mm/transhuge.rst
>>>>>> @@ -527,6 +527,11 @@ split_deferred
>>>>>>             it would free up some memory. Pages on split queue are going to
>>>>>>             be split under memory pressure, if splitting is possible.
>>>>>>
>>>>>> +nr_anon
>>>>>> +       the number of anon huge pages we have in the whole system.
>>>>>
>>>>> "transparent ..." otherwise people might confuse it with anon hugetlb
>>>>> "huge pages" ... :)
>>>>>
>>>>> I briefly tried coming up with a better name than "nr_anon" but failed.
>>>>>
>>>>>
>>>>
>>>> if we might have unmapped_anon counter later, maybe rename it to
>>>> nr_anon_mapped? and the new interface we will have in the future
>>>> might be nr_anon_unmapped?
>>
>> We really shouldn't be using the mapped/unmapped terminology here ... we
>> allocated pages and turned them into anonymous folios. At some point we
>> free them. That's the lifecycle.
>>
>>>
>>> On second thought, this might be incorrect as well. Concepts like 'anon',
>>> 'shmem', and 'file' refer to states after mapping. If an 'anon' has been
>>> unmapped but is still pinned and not yet freed, it isn't technically an
>>> 'anon' anymore?
>>
>> It's just not mapped, and cannot get mapped, anymore. In the memdesc
>> world, we'd be freeing the "struct anon" or "struct folio" once the last
>> refcount goes to 0, not once (e.g., temporarily during a failed
>> migration?) unmapped.
>>
>> The important part to me would be: this is memory that was allocated for
>> anonymous memory, and it's still around for some reason and not getting
>> freed. Usually, we would expect anon memory to get freed fairly quickly
>> once unmapped. Except when there are long-term pinnings or other types
>> of memory leaks.
>>
>> You could happily continue using these anon pages via vmsplice() or
>> similar, even though the original page table mapping was torn down.
>>
>>>
>>> On the other hand, implementing nr_anon_unmapped could be extremely
>>> tricky. I have no idea how to implement it as we are losing those mapping
>>> flags.
>>
>> folio_mapcount() can tell you efficiently whether a folio is mapped or
>> not -- and that information will stay for all eternity as long as we
>> have any mapcounts :) . It cannot tell "how many" pages of a large folio
>> are mapped, but at least "is any page of this large folio mapped".
> 
> Exactly. AnonPages decreases by -1 when removed from the rmap,
> whereas nr_anon decreases by -1 when an anon folio is freed. So,
> I would assume nr_anon includes those pinned and unmapped anon
> folios but AnonPages doesn't.

Right, note how internally it is called "NR_ANON_MAPPED", but we ended 
up calling it "AnonPages". But that's rather a legacy interface we 
cannot change (fix) that easily. At least not without a config option.

At some point it might indeed be interesting to have "nr_anon_mapped"
here, but that would correspond to "is any part of this large folio
mapped". For debugging purposes in the future, that might indeed be
useful.

"nr_anon": anon allocations (until freed -> +1)
"nr_anon_mapped": anon allocations that are mapped (any part mapped -> +1)
"nr_anon_partially_mapped": anon allocations that was detected to be 
partially mapped at some point -> +1

If a folio is in the swapcache, I would still want to see that it is an 
anon allocation lurking around in the system. Like we do with pagecache 
pages. *There* we do have the difference between "allocated" and 
"mapped" already.

So likely, calling it "nr_anon" here, and tracking it on an allocation 
level, is good enough for now and future proof.

> 
> If there's a significant amount of 'leaked' anon, we should consider
> having a separate counter for them. For instance, if nr_anon is
> 100,000 and pinned/unmapped pages account for 50%, then nr_anon
> alone doesn’t effectively reflect the system's state.

Right, but if you stare at the system you could tell that a significant 
amount of memory is still getting consumed through existing/previous 
anon mappings. Depends on how valuable that distinction really is.

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v2 1/2] mm: collect the number of anon large folios
  2024-08-22 10:01             ` David Hildenbrand
@ 2024-08-22 10:12               ` Barry Song
  0 siblings, 0 replies; 19+ messages in thread
From: Barry Song @ 2024-08-22 10:12 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: akpm, linux-mm, baolin.wang, chrisl, hanchuanhua, ioworker0,
	kaleshsingh, kasong, linux-kernel, ryan.roberts, v-songbaohua,
	ziy, yuanshuai

On Thu, Aug 22, 2024 at 10:01 PM David Hildenbrand <david@redhat.com> wrote:
>
> On 22.08.24 11:21, Barry Song wrote:
> > On Thu, Aug 22, 2024 at 8:59 PM David Hildenbrand <david@redhat.com> wrote:
> >>
> >> On 22.08.24 10:44, Barry Song wrote:
> >>> On Thu, Aug 22, 2024 at 12:52 PM Barry Song <21cnbao@gmail.com> wrote:
> >>>>
> >>>> On Thu, Aug 22, 2024 at 5:34 AM David Hildenbrand <david@redhat.com> wrote:
> >>>>>
> >>>>> On 12.08.24 00:49, Barry Song wrote:
> >>>>>> From: Barry Song <v-songbaohua@oppo.com>
> >>>>>>
> >>>>>> Anon large folios come from three places:
> >>>>>> 1. new allocated large folios in PF, they will call folio_add_new_anon_rmap()
> >>>>>> for rmap;
> >>>>>> 2. a large folio is split into multiple lower-order large folios;
> >>>>>> 3. a large folio is migrated to a new large folio.
> >>>>>>
> >>>>>> In all three cases above, we increase nr_anon by 1.
> >>>>>>
> >>>>>> Anon large folios may go away either because they are split or
> >>>>>> because they are freed; in these cases, we reduce the count by 1.
> >>>>>>
> >>>>>> Folios that have been added to the swap cache but have not yet received
> >>>>>> an anon mapping won't be counted. This is consistent with the AnonPages
> >>>>>> statistics in /proc/meminfo.
> >>>>>
> >>>>> Thinking out loud, I wonder if we want to have something like that for
> >>>>> any anon folios (including small ones).
> >>>>>
> >>>>> Assume we longterm-pinned an anon folio and unmapped/zapped it. It would
> >>>>> be quite interesting to see that these are actually anon pages still
> >>>>> consuming memory. Same with memory leaks, when an anon folio doesn't get
> >>>>> freed for some reason.
> >>>>>
> >>>>> The whole "AnonPages" counter thingy is just confusing, it only counts
> >>>>> what's currently mapped ... so we'd want something different.
> >>>>>
> >>>>> But it's okay to start with large folios only, there we have a new
> >>>>> interface without that legacy stuff :)
> >>>>
> >>>> We have two options to do this:
> >>>> 1. add a new separate nr_anon_unmapped interface which
> >>>> counts unmapped anon memory only
> >>>> 2. let the nr_anon count both mapped and unmapped anon
> >>>> folios.
> >>>>
> >>>> I would assume 1 is clearer as right now AnonPages have been
> >>>> there for years. and counting all mapped and unmapped together,
> >>>> we are still lacking an approach to find out anon memory leak
> >>>> problem you mentioned.
> >>>>
> >>>> We are right now comparing nr_anon(including mapped folios only)
> >>>> with AnonPages to get the distribution of different folio sizes in
> >>>> performance profiling.
> >>>>
> >>>> unmapped_nr_anon should be normally always quite small. otherwise,
> >>>> something must be wrong.
> >>>>
> >>>>>
> >>>>>>
> >>>>>> Signed-off-by: Barry Song <v-songbaohua@oppo.com>
> >>>>>> ---
> >>>>>>     Documentation/admin-guide/mm/transhuge.rst |  5 +++++
> >>>>>>     include/linux/huge_mm.h                    | 15 +++++++++++++--
> >>>>>>     mm/huge_memory.c                           | 13 ++++++++++---
> >>>>>>     mm/migrate.c                               |  4 ++++
> >>>>>>     mm/page_alloc.c                            |  5 ++++-
> >>>>>>     mm/rmap.c                                  |  1 +
> >>>>>>     6 files changed, 37 insertions(+), 6 deletions(-)
> >>>>>>
> >>>>>> diff --git a/Documentation/admin-guide/mm/transhuge.rst b/Documentation/admin-guide/mm/transhuge.rst
> >>>>>> index 058485daf186..9fdfb46e4560 100644
> >>>>>> --- a/Documentation/admin-guide/mm/transhuge.rst
> >>>>>> +++ b/Documentation/admin-guide/mm/transhuge.rst
> >>>>>> @@ -527,6 +527,11 @@ split_deferred
> >>>>>>             it would free up some memory. Pages on split queue are going to
> >>>>>>             be split under memory pressure, if splitting is possible.
> >>>>>>
> >>>>>> +nr_anon
> >>>>>> +       the number of anon huge pages we have in the whole system.
> >>>>>
> >>>>> "transparent ..." otherwise people might confuse it with anon hugetlb
> >>>>> "huge pages" ... :)
> >>>>>
> >>>>> I briefly tried coming up with a better name than "nr_anon" but failed.
> >>>>>
> >>>>>
> >>>>
> >>>> if we might have unmapped_anon counter later, maybe rename it to
> >>>> nr_anon_mapped? and the new interface we will have in the future
> >>>> might be nr_anon_unmapped?
> >>
> >> We really shouldn't be using the mapped/unmapped terminology here ... we
> >> allocated pages and turned them into anonymous folios. At some point we
> >> free them. That's the lifecycle.
> >>
> >>>
> >>> On second thought, this might be incorrect as well. Concepts like 'anon',
> >>> 'shmem', and 'file' refer to states after mapping. If an 'anon' has been
> >>> unmapped but is still pinned and not yet freed, it isn't technically an
> >>> 'anon' anymore?
> >>
> >> It's just not mapped, and cannot get mapped, anymore. In the memdesc
> >> world, we'd be freeing the "struct anon" or "struct folio" once the last
> >> refcount goes to 0, not once (e.g., temporarily during a failed
> >> migration?) unmapped.
> >>
> >> The important part to me would be: this is memory that was allocated for
> >> anonymous memory, and it's still around for some reason and not getting
> >> freed. Usually, we would expect anon memory to get freed fairly quickly
> >> once unmapped. Except when there are long-term pinnings or other types
> >> of memory leaks.
> >>
> >> You could happily continue using these anon pages via vmsplice() or
> >> similar, even though the original page table mapping was torn down.
> >>
> >>>
> >>> On the other hand, implementing nr_anon_unmapped could be extremely
> >>> tricky. I have no idea how to implement it as we are losing those mapping
> >>> flags.
> >>
> >> folio_mapcount() can tell you efficiently whether a folio is mapped or
> >> not -- and that information will stay for all eternity as long as we
> >> have any mapcounts :) . It cannot tell "how many" pages of a large folio
> >> are mapped, but at least "is any page of this large folio mapped".
> >
> > Exactly. AnonPages decreases by -1 when removed from the rmap,
> > whereas nr_anon decreases by -1 when an anon folio is freed. So,
> > I would assume nr_anon includes those pinned and unmapped anon
> > folios but AnonPages doesn't.
>
> Right, note how internally it is called "NR_ANON_MAPPED", but we ended
> up calling it "AnonPages". But that's rather a legacy interface we
> cannot change (fix) that easily. At least not without a config option.
>
> At some point it might indeed be interesting to have "nr_anon_mapped",
> here, but that would correspond to "is any part of this large folio
> mapped". For debugging purposes in the future, that might be indeed
> interesting.
>
> "nr_anon": anon allocations (until freed -> +1)
> "nr_anon_mapped": anon allocations that are mapped (any part mapped -> +1)
> "nr_anon_partially_mapped": anon allocations that was detected to be
> partially mapped at some point -> +1
>
> If a folio is in the swapcache, I would still want to see that it is an
> anon allocation lurking around in the system. Like we do with pagecache
> pages. *There* we do have the difference between "allocated" and
> "mapped" already.
>
> So likely, calling it "nr_anon" here, and tracking it on an allocation
> level, is good enough for now and future proof.

Right. I plan to send v3 tomorrow to at least unblock Usama's series,
in case he wants to rebase on top of it.

>
> >
> > If there's a significant amount of 'leaked' anon, we should consider
> > having a separate counter for them. For instance, if nr_anon is
> > 100,000 and pinned/unmapped pages account for 50%, then nr_anon
> > alone doesn’t effectively reflect the system's state.
>
> Right, but if you stare at the system you could tell that a significant
> amount of memory is still getting consumed through existing/previous
> anon mappings. Depends on how valuable that distinction really is.
>
> --
> Cheers,
>
> David / dhildenb
>

Thanks
Barry


^ permalink raw reply	[flat|nested] 19+ messages in thread

end of thread, other threads:[~2024-08-22 10:12 UTC | newest]

Thread overview: 19+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-08-11 22:49 [PATCH v2 0/2] mm: collect the number of anon mTHP Barry Song
2024-08-11 22:49 ` [PATCH v2 1/2] mm: collect the number of anon large folios Barry Song
2024-08-21 21:34   ` David Hildenbrand
2024-08-22  0:52     ` Barry Song
2024-08-22  8:44       ` Barry Song
2024-08-22  8:59         ` David Hildenbrand
2024-08-22  9:21           ` Barry Song
2024-08-22 10:01             ` David Hildenbrand
2024-08-22 10:12               ` Barry Song
2024-08-11 22:49 ` [PATCH v2 2/2] mm: collect the number of anon large folios on split_deferred list Barry Song
2024-08-21 21:39   ` David Hildenbrand
2024-08-21 22:01     ` Barry Song
2024-08-21 22:10       ` David Hildenbrand
2024-08-18  7:58 ` [PATCH v2 0/2] mm: collect the number of anon mTHP Barry Song
2024-08-19  2:44   ` Usama Arif
2024-08-19  8:28   ` David Hildenbrand
2024-08-19  8:33     ` Barry Song
2024-08-19  8:52       ` Barry Song
2024-08-19 14:22         ` Usama Arif

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).