All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH v2 1/1] memcg/hugetlb: Adding hugeTLB counters to memcg
@ 2024-10-23 20:34 Joshua Hahn
  2024-10-23 21:15 ` Yosry Ahmed
                   ` (3 more replies)
  0 siblings, 4 replies; 10+ messages in thread
From: Joshua Hahn @ 2024-10-23 20:34 UTC (permalink / raw)
  To: hannes
  Cc: nphamcs, mhocko, roman.gushcin, shakeel.butt, muchun.song, lnyng,
	akpm, cgroups

Changelog
v2:
  * Enables the feature only if memcg accounts for hugeTLB usage
  * Moves the counter from memcg_stat_item to node_stat_item
  * Expands on motivation & justification in commitlog
  * Added Suggested-by: Nhat Pham

This patch introduces a new counter to memory.stat that tracks hugeTLB
usage, only if hugeTLB accounting is done to memory.current. This
feature is enabled the same way hugeTLB accounting is enabled, via
the memory_hugetlb_accounting mount flag for cgroupsv2.

1. Why is this patch necessary?
Currently, memcg hugeTLB accounting is an opt-in feature [1] that adds
hugeTLB usage to memory.current. However, the metric is not reported in
memory.stat. Given that users often interpret memory.stat as a breakdown
of the value reported in memory.current, the disparity between the two
reports can be confusing. This patch solves this problem by including
the metric in memory.stat as well, but only if it is also reported in
memory.current (it would also be confusing if the value was reported in
memory.stat, but not in memory.current)

Aside from the consistentcy between the two files, we also see benefits
in observability. Userspace might be interested in the hugeTLB footprint
of cgroups for many reasons. For instance, system admins might want to
verify that hugeTLB usage is distributed as expected across tasks: i.e.
memory-intensive tasks are using more hugeTLB pages than tasks that
don't consume a lot of memory, or is seen to fault frequently. Note that
this is separate from wanting to inspect the distribution for limiting
purposes (in which case, hugeTLB controller makes more sense).

2. We already have a hugeTLB controller. Why not use that?
It is true that hugeTLB tracks the exact value that we want. In fact, by
enabling the hugeTLB controller, we get all of the observability
benefits that I mentioned above, and users can check the total hugeTLB
usage, verify if it is distributed as expected, etc.

With this said, there are 2 problems:
  (a) They are still not reported in memory.stat, which means the
      disparity between the memcg reports are still there.
  (b) We cannot reasonably expect users to enable the hugeTLB controller
      just for the sake of hugeTLB usage reporting, especially since
      they don't have any use for hugeTLB usage enforcing [2].

[1] https://lore.kernel.org/all/20231006184629.155543-1-nphamcs@gmail.com/
[2] Of course, we can't make a new patch for every feature that can be
    duplicated. However, since the exsting solution of enabling the
    hugeTLB controller is an imperfect solution that still leaves a
    discrepancy between memory.stat and memory.curent, I think that it
    is reasonable to isolate the feature in this case.

Suggested-by: Nhat Pham <nphamcs@gmail.com>
Signed-off-by: Joshua Hahn <joshua.hahnjy@gmail.com>

---
 include/linux/mmzone.h |  3 +++
 mm/hugetlb.c           |  4 ++++
 mm/memcontrol.c        | 11 +++++++++++
 mm/vmstat.c            |  3 +++
 4 files changed, 21 insertions(+)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 17506e4a2835..d3ba49a974b2 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -215,6 +215,9 @@ enum node_stat_item {
 #ifdef CONFIG_NUMA_BALANCING
 	PGPROMOTE_SUCCESS,	/* promote successfully */
 	PGPROMOTE_CANDIDATE,	/* candidate pages to promote */
+#endif
+#ifdef CONFIG_HUGETLB_PAGE
+	HUGETLB_B,
 #endif
 	/* PGDEMOTE_*: pages demoted */
 	PGDEMOTE_KSWAPD,
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 190fa05635f4..055bc91858e4 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1925,6 +1925,8 @@ void free_huge_folio(struct folio *folio)
 				     pages_per_huge_page(h), folio);
 	hugetlb_cgroup_uncharge_folio_rsvd(hstate_index(h),
 					  pages_per_huge_page(h), folio);
+	if (cgrp_dfl_root.flags & CGRP_ROOT_MEMORY_HUGETLB_ACCOUNTING)
+		lruvec_stat_mod_folio(folio, HUGETLB_B, -pages_per_huge_page(h));
 	mem_cgroup_uncharge(folio);
 	if (restore_reserve)
 		h->resv_huge_pages++;
@@ -3094,6 +3096,8 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
 	if (!memcg_charge_ret)
 		mem_cgroup_commit_charge(folio, memcg);
 	mem_cgroup_put(memcg);
+	if (cgrp_dfl_root.flags & CGRP_ROOT_MEMORY_HUGETLB_ACCOUNTING)
+		lruvec_stat_mod_folio(folio, HUGETLB_B, pages_per_huge_page(h));
 
 	return folio;
 
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 7845c64a2c57..de5899eb8203 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -306,6 +306,9 @@ static const unsigned int memcg_node_stat_items[] = {
 #endif
 #ifdef CONFIG_NUMA_BALANCING
 	PGPROMOTE_SUCCESS,
+#endif
+#ifdef CONFIG_HUGETLB_PAGE
+	HUGETLB_B,
 #endif
 	PGDEMOTE_KSWAPD,
 	PGDEMOTE_DIRECT,
@@ -1327,6 +1330,9 @@ static const struct memory_stat memory_stats[] = {
 #ifdef CONFIG_ZSWAP
 	{ "zswap",			MEMCG_ZSWAP_B			},
 	{ "zswapped",			MEMCG_ZSWAPPED			},
+#endif
+#ifdef CONFIG_HUGETLB_PAGE
+	{ "hugeTLB",			HUGETLB_B			},
 #endif
 	{ "file_mapped",		NR_FILE_MAPPED			},
 	{ "file_dirty",			NR_FILE_DIRTY			},
@@ -1441,6 +1447,11 @@ static void memcg_stat_format(struct mem_cgroup *memcg, struct seq_buf *s)
 	for (i = 0; i < ARRAY_SIZE(memory_stats); i++) {
 		u64 size;
 
+#ifdef CONFIG_HUGETLB_PAGE
+		if (unlikely(memory_stats[i].idx == HUGETLB_B) &&
+				!(cgrp_dfl_root.flags & CGRP_ROOT_MEMORY_HUGETLB_ACCOUNTING))
+			continue;
+#endif
 		size = memcg_page_state_output(memcg, memory_stats[i].idx);
 		seq_buf_printf(s, "%s %llu\n", memory_stats[i].name, size);
 
diff --git a/mm/vmstat.c b/mm/vmstat.c
index b5a4cea423e1..466c40cffeb0 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1269,6 +1269,9 @@ const char * const vmstat_text[] = {
 #ifdef CONFIG_NUMA_BALANCING
 	"pgpromote_success",
 	"pgpromote_candidate",
+#endif
+#ifdef CONFIG_HUGETLB_PAGE
+	"hugeTLB",
 #endif
 	"pgdemote_kswapd",
 	"pgdemote_direct",
-- 
2.43.5


^ permalink raw reply related	[flat|nested] 10+ messages in thread

* Re: [PATCH v2 1/1] memcg/hugetlb: Adding hugeTLB counters to memcg
  2024-10-23 20:34 [PATCH v2 1/1] memcg/hugetlb: Adding hugeTLB counters to memcg Joshua Hahn
@ 2024-10-23 21:15 ` Yosry Ahmed
  2024-10-23 21:17   ` Yosry Ahmed
  2024-10-23 21:32   ` Joshua Hahn
  2024-10-23 22:17 ` Shakeel Butt
                   ` (2 subsequent siblings)
  3 siblings, 2 replies; 10+ messages in thread
From: Yosry Ahmed @ 2024-10-23 21:15 UTC (permalink / raw)
  To: Joshua Hahn
  Cc: hannes, nphamcs, mhocko, roman.gushcin, shakeel.butt, muchun.song,
	lnyng, akpm, cgroups

Hi Joshua,

On Wed, Oct 23, 2024 at 1:34 PM Joshua Hahn <joshua.hahnjy@gmail.com> wrote:
>
> Changelog
> v2:
>   * Enables the feature only if memcg accounts for hugeTLB usage
>   * Moves the counter from memcg_stat_item to node_stat_item
>   * Expands on motivation & justification in commitlog
>   * Added Suggested-by: Nhat Pham

Changelogs usually go at the end, after ---, not as part of the commit
log itself.

>
> This patch introduces a new counter to memory.stat that tracks hugeTLB
> usage, only if hugeTLB accounting is done to memory.current. This
> feature is enabled the same way hugeTLB accounting is enabled, via
> the memory_hugetlb_accounting mount flag for cgroupsv2.
>
> 1. Why is this patch necessary?
> Currently, memcg hugeTLB accounting is an opt-in feature [1] that adds
> hugeTLB usage to memory.current. However, the metric is not reported in
> memory.stat. Given that users often interpret memory.stat as a breakdown
> of the value reported in memory.current, the disparity between the two
> reports can be confusing. This patch solves this problem by including
> the metric in memory.stat as well, but only if it is also reported in
> memory.current (it would also be confusing if the value was reported in
> memory.stat, but not in memory.current)
>
> Aside from the consistentcy between the two files, we also see benefits

consistency*

> in observability. Userspace might be interested in the hugeTLB footprint
> of cgroups for many reasons. For instance, system admins might want to
> verify that hugeTLB usage is distributed as expected across tasks: i.e.
> memory-intensive tasks are using more hugeTLB pages than tasks that
> don't consume a lot of memory, or is seen to fault frequently. Note that

are* seen

> this is separate from wanting to inspect the distribution for limiting
> purposes (in which case, hugeTLB controller makes more sense).
>
> 2. We already have a hugeTLB controller. Why not use that?
> It is true that hugeTLB tracks the exact value that we want. In fact, by
> enabling the hugeTLB controller, we get all of the observability
> benefits that I mentioned above, and users can check the total hugeTLB
> usage, verify if it is distributed as expected, etc.
>
> With this said, there are 2 problems:
>   (a) They are still not reported in memory.stat, which means the
>       disparity between the memcg reports are still there.
>   (b) We cannot reasonably expect users to enable the hugeTLB controller
>       just for the sake of hugeTLB usage reporting, especially since
>       they don't have any use for hugeTLB usage enforcing [2].
>
> [1] https://lore.kernel.org/all/20231006184629.155543-1-nphamcs@gmail.com/
> [2] Of course, we can't make a new patch for every feature that can be
>     duplicated. However, since the exsting solution of enabling the

existing*

>     hugeTLB controller is an imperfect solution that still leaves a
>     discrepancy between memory.stat and memory.curent, I think that it
>     is reasonable to isolate the feature in this case.
>
> Suggested-by: Nhat Pham <nphamcs@gmail.com>
> Signed-off-by: Joshua Hahn <joshua.hahnjy@gmail.com>

You should also CC linux-mm on such patches.

>
> ---
>  include/linux/mmzone.h |  3 +++
>  mm/hugetlb.c           |  4 ++++
>  mm/memcontrol.c        | 11 +++++++++++
>  mm/vmstat.c            |  3 +++
>  4 files changed, 21 insertions(+)
>
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index 17506e4a2835..d3ba49a974b2 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -215,6 +215,9 @@ enum node_stat_item {
>  #ifdef CONFIG_NUMA_BALANCING
>         PGPROMOTE_SUCCESS,      /* promote successfully */
>         PGPROMOTE_CANDIDATE,    /* candidate pages to promote */
> +#endif
> +#ifdef CONFIG_HUGETLB_PAGE
> +       HUGETLB_B,

Why '_B'?

>  #endif
>         /* PGDEMOTE_*: pages demoted */
>         PGDEMOTE_KSWAPD,
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 190fa05635f4..055bc91858e4 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -1925,6 +1925,8 @@ void free_huge_folio(struct folio *folio)
>                                      pages_per_huge_page(h), folio);
>         hugetlb_cgroup_uncharge_folio_rsvd(hstate_index(h),
>                                           pages_per_huge_page(h), folio);
> +       if (cgrp_dfl_root.flags & CGRP_ROOT_MEMORY_HUGETLB_ACCOUNTING)

I think we already have a couple of these checks and this patch adds a
few more, perhaps we should add a helper at this point to improve
readability? Maybe something like memcg_accounts_hugetlb()?

> +               lruvec_stat_mod_folio(folio, HUGETLB_B, -pages_per_huge_page(h));
>         mem_cgroup_uncharge(folio);
>         if (restore_reserve)
>                 h->resv_huge_pages++;
> @@ -3094,6 +3096,8 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
>         if (!memcg_charge_ret)
>                 mem_cgroup_commit_charge(folio, memcg);
>         mem_cgroup_put(memcg);
> +       if (cgrp_dfl_root.flags & CGRP_ROOT_MEMORY_HUGETLB_ACCOUNTING)
> +               lruvec_stat_mod_folio(folio, HUGETLB_B, pages_per_huge_page(h));
>
>         return folio;
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 7845c64a2c57..de5899eb8203 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -306,6 +306,9 @@ static const unsigned int memcg_node_stat_items[] = {
>  #endif
>  #ifdef CONFIG_NUMA_BALANCING
>         PGPROMOTE_SUCCESS,
> +#endif
> +#ifdef CONFIG_HUGETLB_PAGE
> +       HUGETLB_B,
>  #endif
>         PGDEMOTE_KSWAPD,
>         PGDEMOTE_DIRECT,
> @@ -1327,6 +1330,9 @@ static const struct memory_stat memory_stats[] = {
>  #ifdef CONFIG_ZSWAP
>         { "zswap",                      MEMCG_ZSWAP_B                   },
>         { "zswapped",                   MEMCG_ZSWAPPED                  },
> +#endif
> +#ifdef CONFIG_HUGETLB_PAGE
> +       { "hugeTLB",                    HUGETLB_B                       },

nit: I think we usually use lowercase letters when naming stats.

>  #endif
>         { "file_mapped",                NR_FILE_MAPPED                  },
>         { "file_dirty",                 NR_FILE_DIRTY                   },
> @@ -1441,6 +1447,11 @@ static void memcg_stat_format(struct mem_cgroup *memcg, struct seq_buf *s)
>         for (i = 0; i < ARRAY_SIZE(memory_stats); i++) {
>                 u64 size;
>
> +#ifdef CONFIG_HUGETLB_PAGE
> +               if (unlikely(memory_stats[i].idx == HUGETLB_B) &&
> +                               !(cgrp_dfl_root.flags & CGRP_ROOT_MEMORY_HUGETLB_ACCOUNTING))
> +                       continue;
> +#endif
>                 size = memcg_page_state_output(memcg, memory_stats[i].idx);
>                 seq_buf_printf(s, "%s %llu\n", memory_stats[i].name, size);
>
> diff --git a/mm/vmstat.c b/mm/vmstat.c
> index b5a4cea423e1..466c40cffeb0 100644
> --- a/mm/vmstat.c
> +++ b/mm/vmstat.c
> @@ -1269,6 +1269,9 @@ const char * const vmstat_text[] = {
>  #ifdef CONFIG_NUMA_BALANCING
>         "pgpromote_success",
>         "pgpromote_candidate",
> +#endif
> +#ifdef CONFIG_HUGETLB_PAGE
> +       "hugeTLB",
>  #endif
>         "pgdemote_kswapd",
>         "pgdemote_direct",
> --
> 2.43.5
>
>

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH v2 1/1] memcg/hugetlb: Adding hugeTLB counters to memcg
  2024-10-23 21:15 ` Yosry Ahmed
@ 2024-10-23 21:17   ` Yosry Ahmed
  2024-10-24 16:57     ` Roman Gushchin
  2024-10-23 21:32   ` Joshua Hahn
  1 sibling, 1 reply; 10+ messages in thread
From: Yosry Ahmed @ 2024-10-23 21:17 UTC (permalink / raw)
  To: Joshua Hahn
  Cc: hannes, nphamcs, mhocko, roman.gushcin, shakeel.butt, muchun.song,
	lnyng, akpm, cgroups, Roman Gushchin

On Wed, Oct 23, 2024 at 2:15 PM Yosry Ahmed <yosryahmed@google.com> wrote:
>
> Hi Joshua,
>
> On Wed, Oct 23, 2024 at 1:34 PM Joshua Hahn <joshua.hahnjy@gmail.com> wrote:
> >
> > Changelog
> > v2:
> >   * Enables the feature only if memcg accounts for hugeTLB usage
> >   * Moves the counter from memcg_stat_item to node_stat_item
> >   * Expands on motivation & justification in commitlog
> >   * Added Suggested-by: Nhat Pham
>
> Changelogs usually go at the end, after ---, not as part of the commit
> log itself.
>
> >
> > This patch introduces a new counter to memory.stat that tracks hugeTLB
> > usage, only if hugeTLB accounting is done to memory.current. This
> > feature is enabled the same way hugeTLB accounting is enabled, via
> > the memory_hugetlb_accounting mount flag for cgroupsv2.
> >
> > 1. Why is this patch necessary?
> > Currently, memcg hugeTLB accounting is an opt-in feature [1] that adds
> > hugeTLB usage to memory.current. However, the metric is not reported in
> > memory.stat. Given that users often interpret memory.stat as a breakdown
> > of the value reported in memory.current, the disparity between the two
> > reports can be confusing. This patch solves this problem by including
> > the metric in memory.stat as well, but only if it is also reported in
> > memory.current (it would also be confusing if the value was reported in
> > memory.stat, but not in memory.current)
> >
> > Aside from the consistentcy between the two files, we also see benefits
>
> consistency*
>
> > in observability. Userspace might be interested in the hugeTLB footprint
> > of cgroups for many reasons. For instance, system admins might want to
> > verify that hugeTLB usage is distributed as expected across tasks: i.e.
> > memory-intensive tasks are using more hugeTLB pages than tasks that
> > don't consume a lot of memory, or is seen to fault frequently. Note that
>
> are* seen
>
> > this is separate from wanting to inspect the distribution for limiting
> > purposes (in which case, hugeTLB controller makes more sense).
> >
> > 2. We already have a hugeTLB controller. Why not use that?
> > It is true that hugeTLB tracks the exact value that we want. In fact, by
> > enabling the hugeTLB controller, we get all of the observability
> > benefits that I mentioned above, and users can check the total hugeTLB
> > usage, verify if it is distributed as expected, etc.
> >
> > With this said, there are 2 problems:
> >   (a) They are still not reported in memory.stat, which means the
> >       disparity between the memcg reports are still there.
> >   (b) We cannot reasonably expect users to enable the hugeTLB controller
> >       just for the sake of hugeTLB usage reporting, especially since
> >       they don't have any use for hugeTLB usage enforcing [2].
> >
> > [1] https://lore.kernel.org/all/20231006184629.155543-1-nphamcs@gmail.com/
> > [2] Of course, we can't make a new patch for every feature that can be
> >     duplicated. However, since the exsting solution of enabling the
>
> existing*
>
> >     hugeTLB controller is an imperfect solution that still leaves a
> >     discrepancy between memory.stat and memory.curent, I think that it
> >     is reasonable to isolate the feature in this case.
> >
> > Suggested-by: Nhat Pham <nphamcs@gmail.com>
> > Signed-off-by: Joshua Hahn <joshua.hahnjy@gmail.com>
>
> You should also CC linux-mm on such patches.

+roman.gushchin@linux.dev

CCing Roman's correct email.

>
> >
> > ---
> >  include/linux/mmzone.h |  3 +++
> >  mm/hugetlb.c           |  4 ++++
> >  mm/memcontrol.c        | 11 +++++++++++
> >  mm/vmstat.c            |  3 +++
> >  4 files changed, 21 insertions(+)
> >
> > diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> > index 17506e4a2835..d3ba49a974b2 100644
> > --- a/include/linux/mmzone.h
> > +++ b/include/linux/mmzone.h
> > @@ -215,6 +215,9 @@ enum node_stat_item {
> >  #ifdef CONFIG_NUMA_BALANCING
> >         PGPROMOTE_SUCCESS,      /* promote successfully */
> >         PGPROMOTE_CANDIDATE,    /* candidate pages to promote */
> > +#endif
> > +#ifdef CONFIG_HUGETLB_PAGE
> > +       HUGETLB_B,
>
> Why '_B'?
>
> >  #endif
> >         /* PGDEMOTE_*: pages demoted */
> >         PGDEMOTE_KSWAPD,
> > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> > index 190fa05635f4..055bc91858e4 100644
> > --- a/mm/hugetlb.c
> > +++ b/mm/hugetlb.c
> > @@ -1925,6 +1925,8 @@ void free_huge_folio(struct folio *folio)
> >                                      pages_per_huge_page(h), folio);
> >         hugetlb_cgroup_uncharge_folio_rsvd(hstate_index(h),
> >                                           pages_per_huge_page(h), folio);
> > +       if (cgrp_dfl_root.flags & CGRP_ROOT_MEMORY_HUGETLB_ACCOUNTING)
>
> I think we already have a couple of these checks and this patch adds a
> few more, perhaps we should add a helper at this point to improve
> readability? Maybe something like memcg_accounts_hugetlb()?
>
> > +               lruvec_stat_mod_folio(folio, HUGETLB_B, -pages_per_huge_page(h));
> >         mem_cgroup_uncharge(folio);
> >         if (restore_reserve)
> >                 h->resv_huge_pages++;
> > @@ -3094,6 +3096,8 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
> >         if (!memcg_charge_ret)
> >                 mem_cgroup_commit_charge(folio, memcg);
> >         mem_cgroup_put(memcg);
> > +       if (cgrp_dfl_root.flags & CGRP_ROOT_MEMORY_HUGETLB_ACCOUNTING)
> > +               lruvec_stat_mod_folio(folio, HUGETLB_B, pages_per_huge_page(h));
> >
> >         return folio;
> >
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index 7845c64a2c57..de5899eb8203 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -306,6 +306,9 @@ static const unsigned int memcg_node_stat_items[] = {
> >  #endif
> >  #ifdef CONFIG_NUMA_BALANCING
> >         PGPROMOTE_SUCCESS,
> > +#endif
> > +#ifdef CONFIG_HUGETLB_PAGE
> > +       HUGETLB_B,
> >  #endif
> >         PGDEMOTE_KSWAPD,
> >         PGDEMOTE_DIRECT,
> > @@ -1327,6 +1330,9 @@ static const struct memory_stat memory_stats[] = {
> >  #ifdef CONFIG_ZSWAP
> >         { "zswap",                      MEMCG_ZSWAP_B                   },
> >         { "zswapped",                   MEMCG_ZSWAPPED                  },
> > +#endif
> > +#ifdef CONFIG_HUGETLB_PAGE
> > +       { "hugeTLB",                    HUGETLB_B                       },
>
> nit: I think we usually use lowercase letters when naming stats.
>
> >  #endif
> >         { "file_mapped",                NR_FILE_MAPPED                  },
> >         { "file_dirty",                 NR_FILE_DIRTY                   },
> > @@ -1441,6 +1447,11 @@ static void memcg_stat_format(struct mem_cgroup *memcg, struct seq_buf *s)
> >         for (i = 0; i < ARRAY_SIZE(memory_stats); i++) {
> >                 u64 size;
> >
> > +#ifdef CONFIG_HUGETLB_PAGE
> > +               if (unlikely(memory_stats[i].idx == HUGETLB_B) &&
> > +                               !(cgrp_dfl_root.flags & CGRP_ROOT_MEMORY_HUGETLB_ACCOUNTING))
> > +                       continue;
> > +#endif
> >                 size = memcg_page_state_output(memcg, memory_stats[i].idx);
> >                 seq_buf_printf(s, "%s %llu\n", memory_stats[i].name, size);
> >
> > diff --git a/mm/vmstat.c b/mm/vmstat.c
> > index b5a4cea423e1..466c40cffeb0 100644
> > --- a/mm/vmstat.c
> > +++ b/mm/vmstat.c
> > @@ -1269,6 +1269,9 @@ const char * const vmstat_text[] = {
> >  #ifdef CONFIG_NUMA_BALANCING
> >         "pgpromote_success",
> >         "pgpromote_candidate",
> > +#endif
> > +#ifdef CONFIG_HUGETLB_PAGE
> > +       "hugeTLB",
> >  #endif
> >         "pgdemote_kswapd",
> >         "pgdemote_direct",
> > --
> > 2.43.5
> >
> >

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH v2 1/1] memcg/hugetlb: Adding hugeTLB counters to memcg
  2024-10-23 21:15 ` Yosry Ahmed
  2024-10-23 21:17   ` Yosry Ahmed
@ 2024-10-23 21:32   ` Joshua Hahn
  2024-10-23 21:37     ` Yosry Ahmed
  1 sibling, 1 reply; 10+ messages in thread
From: Joshua Hahn @ 2024-10-23 21:32 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: hannes, nphamcs, mhocko, roman.gushcin, shakeel.butt, muchun.song,
	lnyng, akpm, cgroups

Hi Yosry, thank you for taking the time to review my patch. It seems
like I made a
lot of silly spelling / grammar / style mistakes in this patch, I'll
be more mindful of
these in the future.

On Wed, Oct 23, 2024 at 5:15 PM Yosry Ahmed <yosryahmed@google.com> wrote:
>
> Hi Joshua,
> [...]
> > diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> > index 17506e4a2835..d3ba49a974b2 100644
> > --- a/include/linux/mmzone.h
> > +++ b/include/linux/mmzone.h
> > @@ -215,6 +215,9 @@ enum node_stat_item {
> >  #ifdef CONFIG_NUMA_BALANCING
> >         PGPROMOTE_SUCCESS,      /* promote successfully */
> >         PGPROMOTE_CANDIDATE,    /* candidate pages to promote */
> > +#endif
> > +#ifdef CONFIG_HUGETLB_PAGE
> > +       HUGETLB_B,
>
> Why '_B'?

I added _B because I am measuring the statistics in bytes (not pages).
IIRC some stat items use _B at the end to denote that the unit is in bytes,
so I put it at the end in the spirit of maintaining consistency. However, if
is not the case, I will remove it in the next version.

> [...]
> >  #endif
> >         /* PGDEMOTE_*: pages demoted */
> >         PGDEMOTE_KSWAPD,
> > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> > index 190fa05635f4..055bc91858e4 100644
> > --- a/mm/hugetlb.c
> > +++ b/mm/hugetlb.c
> > @@ -1925,6 +1925,8 @@ void free_huge_folio(struct folio *folio)
> >                                      pages_per_huge_page(h), folio);
> >         hugetlb_cgroup_uncharge_folio_rsvd(hstate_index(h),
> >                                           pages_per_huge_page(h), folio);
> > +       if (cgrp_dfl_root.flags & CGRP_ROOT_MEMORY_HUGETLB_ACCOUNTING)
>
> I think we already have a couple of these checks and this patch adds a
> few more, perhaps we should add a helper at this point to improve
> readability? Maybe something like memcg_accounts_hugetlb()?
> [...]

As far as I can tell, these checks only come up in mem_cgroup_hugetlb_try_charge
and cgroup_show_options. Shakeel already requested a cleanup of the try charge
function in the v1 thread, so I think the best course of action is to
add the helper
function in this patch (series) and use that helper function in
another patch series to
clean up the remaining functions.

I'll be sure to add the other grammar / style changes that you
mentioned above in
v3 as well. Thank you again for your feedback!

Joshua

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH v2 1/1] memcg/hugetlb: Adding hugeTLB counters to memcg
  2024-10-23 21:32   ` Joshua Hahn
@ 2024-10-23 21:37     ` Yosry Ahmed
  0 siblings, 0 replies; 10+ messages in thread
From: Yosry Ahmed @ 2024-10-23 21:37 UTC (permalink / raw)
  To: Joshua Hahn
  Cc: hannes, nphamcs, mhocko, shakeel.butt, muchun.song, lnyng, akpm,
	cgroups, Linux-MM, Roman Gushchin

On Wed, Oct 23, 2024 at 2:32 PM Joshua Hahn <joshua.hahnjy@gmail.com> wrote:
>
> Hi Yosry, thank you for taking the time to review my patch. It seems
> like I made a
> lot of silly spelling / grammar / style mistakes in this patch, I'll
> be more mindful of
> these in the future.
>
> On Wed, Oct 23, 2024 at 5:15 PM Yosry Ahmed <yosryahmed@google.com> wrote:
> >
> > Hi Joshua,
> > [...]
> > > diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> > > index 17506e4a2835..d3ba49a974b2 100644
> > > --- a/include/linux/mmzone.h
> > > +++ b/include/linux/mmzone.h
> > > @@ -215,6 +215,9 @@ enum node_stat_item {
> > >  #ifdef CONFIG_NUMA_BALANCING
> > >         PGPROMOTE_SUCCESS,      /* promote successfully */
> > >         PGPROMOTE_CANDIDATE,    /* candidate pages to promote */
> > > +#endif
> > > +#ifdef CONFIG_HUGETLB_PAGE
> > > +       HUGETLB_B,
> >
> > Why '_B'?
>
> I added _B because I am measuring the statistics in bytes (not pages).
> IIRC some stat items use _B at the end to denote that the unit is in bytes,
> so I put it at the end in the spirit of maintaining consistency. However, if
> is not the case, I will remove it in the next version.

I think ~all the memcg stats are exported in bytes, the ones that have
_B are also counted in bytes. I think in this case we are counting the
stats in pages.

See memcg_page_state_output_unit() and memcg_page_state_unit() for more details.

>
> > [...]
> > >  #endif
> > >         /* PGDEMOTE_*: pages demoted */
> > >         PGDEMOTE_KSWAPD,
> > > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> > > index 190fa05635f4..055bc91858e4 100644
> > > --- a/mm/hugetlb.c
> > > +++ b/mm/hugetlb.c
> > > @@ -1925,6 +1925,8 @@ void free_huge_folio(struct folio *folio)
> > >                                      pages_per_huge_page(h), folio);
> > >         hugetlb_cgroup_uncharge_folio_rsvd(hstate_index(h),
> > >                                           pages_per_huge_page(h), folio);
> > > +       if (cgrp_dfl_root.flags & CGRP_ROOT_MEMORY_HUGETLB_ACCOUNTING)
> >
> > I think we already have a couple of these checks and this patch adds a
> > few more, perhaps we should add a helper at this point to improve
> > readability? Maybe something like memcg_accounts_hugetlb()?
> > [...]
>
> As far as I can tell, these checks only come up in mem_cgroup_hugetlb_try_charge
> and cgroup_show_options. Shakeel already requested a cleanup of the try charge
> function in the v1 thread, so I think the best course of action is to
> add the helper
> function in this patch (series) and use that helper function in
> another patch series to
> clean up the remaining functions.

Introducing the helper in this patch/series and using it in further
cleanups makes sense to me.

>
> I'll be sure to add the other grammar / style changes that you
> mentioned above in
> v3 as well. Thank you again for your feedback!
>
> Joshua

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH v2 1/1] memcg/hugetlb: Adding hugeTLB counters to memcg
  2024-10-23 20:34 [PATCH v2 1/1] memcg/hugetlb: Adding hugeTLB counters to memcg Joshua Hahn
  2024-10-23 21:15 ` Yosry Ahmed
@ 2024-10-23 22:17 ` Shakeel Butt
  2024-10-24 18:13   ` Joshua Hahn
  2024-10-24 15:52 ` kernel test robot
  2024-10-25  0:11 ` kernel test robot
  3 siblings, 1 reply; 10+ messages in thread
From: Shakeel Butt @ 2024-10-23 22:17 UTC (permalink / raw)
  To: Joshua Hahn
  Cc: hannes, nphamcs, mhocko, roman.gushcin, muchun.song, lnyng, akpm,
	cgroups

On Wed, Oct 23, 2024 at 01:34:33PM GMT, Joshua Hahn wrote:
> Changelog
> v2:
>   * Enables the feature only if memcg accounts for hugeTLB usage
>   * Moves the counter from memcg_stat_item to node_stat_item
>   * Expands on motivation & justification in commitlog
>   * Added Suggested-by: Nhat Pham
> 
> This patch introduces a new counter to memory.stat that tracks hugeTLB
> usage, only if hugeTLB accounting is done to memory.current. This
> feature is enabled the same way hugeTLB accounting is enabled, via
> the memory_hugetlb_accounting mount flag for cgroupsv2.
> 
> 1. Why is this patch necessary?
> Currently, memcg hugeTLB accounting is an opt-in feature [1] that adds
> hugeTLB usage to memory.current. However, the metric is not reported in
> memory.stat. Given that users often interpret memory.stat as a breakdown
> of the value reported in memory.current, the disparity between the two
> reports can be confusing. This patch solves this problem by including
> the metric in memory.stat as well, but only if it is also reported in
> memory.current (it would also be confusing if the value was reported in
> memory.stat, but not in memory.current)
> 
> Aside from the consistentcy between the two files, we also see benefits
> in observability. Userspace might be interested in the hugeTLB footprint
> of cgroups for many reasons. For instance, system admins might want to
> verify that hugeTLB usage is distributed as expected across tasks: i.e.
> memory-intensive tasks are using more hugeTLB pages than tasks that
> don't consume a lot of memory, or is seen to fault frequently. Note that
> this is separate from wanting to inspect the distribution for limiting
> purposes (in which case, hugeTLB controller makes more sense).
> 
> 2. We already have a hugeTLB controller. Why not use that?
> It is true that hugeTLB tracks the exact value that we want. In fact, by
> enabling the hugeTLB controller, we get all of the observability
> benefits that I mentioned above, and users can check the total hugeTLB
> usage, verify if it is distributed as expected, etc.
> 
> With this said, there are 2 problems:
>   (a) They are still not reported in memory.stat, which means the
>       disparity between the memcg reports are still there.
>   (b) We cannot reasonably expect users to enable the hugeTLB controller
>       just for the sake of hugeTLB usage reporting, especially since
>       they don't have any use for hugeTLB usage enforcing [2].
> 
> [1] https://lore.kernel.org/all/20231006184629.155543-1-nphamcs@gmail.com/
> [2] Of course, we can't make a new patch for every feature that can be
>     duplicated. However, since the exsting solution of enabling the
>     hugeTLB controller is an imperfect solution that still leaves a
>     discrepancy between memory.stat and memory.curent, I think that it
>     is reasonable to isolate the feature in this case.
> 
> Suggested-by: Nhat Pham <nphamcs@gmail.com>
> Signed-off-by: Joshua Hahn <joshua.hahnjy@gmail.com>
> 
> ---
>  include/linux/mmzone.h |  3 +++
>  mm/hugetlb.c           |  4 ++++
>  mm/memcontrol.c        | 11 +++++++++++
>  mm/vmstat.c            |  3 +++
>  4 files changed, 21 insertions(+)
> 
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index 17506e4a2835..d3ba49a974b2 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -215,6 +215,9 @@ enum node_stat_item {
>  #ifdef CONFIG_NUMA_BALANCING
>  	PGPROMOTE_SUCCESS,	/* promote successfully */
>  	PGPROMOTE_CANDIDATE,	/* candidate pages to promote */
> +#endif
> +#ifdef CONFIG_HUGETLB_PAGE
> +	HUGETLB_B,

As Yosry pointed out, this is in pages, not bytes. There is already
functionality to display this bin ytes for the readers of the memory
stats.

Also you will need to update Documentation/admin-guide/cgroup-v2.rst to
include the hugetlb stats.

>  #endif
>  	/* PGDEMOTE_*: pages demoted */
>  	PGDEMOTE_KSWAPD,
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 190fa05635f4..055bc91858e4 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -1925,6 +1925,8 @@ void free_huge_folio(struct folio *folio)
>  				     pages_per_huge_page(h), folio);
>  	hugetlb_cgroup_uncharge_folio_rsvd(hstate_index(h),
>  					  pages_per_huge_page(h), folio);
> +	if (cgrp_dfl_root.flags & CGRP_ROOT_MEMORY_HUGETLB_ACCOUNTING)
> +		lruvec_stat_mod_folio(folio, HUGETLB_B, -pages_per_huge_page(h));

Please note that by you are adding this stat not only in memcg but also
in global and per-node vmstat. This check will break those interfaces
when this mount option is not used. You only need the check at the
charging time. The uncharging and stats update functions will do the
right thing as they check memcg_data attached to the folio.

>  	mem_cgroup_uncharge(folio);
>  	if (restore_reserve)
>  		h->resv_huge_pages++;
> @@ -3094,6 +3096,8 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
>  	if (!memcg_charge_ret)
>  		mem_cgroup_commit_charge(folio, memcg);
>  	mem_cgroup_put(memcg);
> +	if (cgrp_dfl_root.flags & CGRP_ROOT_MEMORY_HUGETLB_ACCOUNTING)

Same here.

> +		lruvec_stat_mod_folio(folio, HUGETLB_B, pages_per_huge_page(h));
>  
>  	return folio;
  

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH v2 1/1] memcg/hugetlb: Adding hugeTLB counters to memcg
  2024-10-23 20:34 [PATCH v2 1/1] memcg/hugetlb: Adding hugeTLB counters to memcg Joshua Hahn
  2024-10-23 21:15 ` Yosry Ahmed
  2024-10-23 22:17 ` Shakeel Butt
@ 2024-10-24 15:52 ` kernel test robot
  2024-10-25  0:11 ` kernel test robot
  3 siblings, 0 replies; 10+ messages in thread
From: kernel test robot @ 2024-10-24 15:52 UTC (permalink / raw)
  To: Joshua Hahn, hannes
  Cc: oe-kbuild-all, nphamcs, mhocko, roman.gushcin, shakeel.butt,
	muchun.song, lnyng, akpm, cgroups

Hi Joshua,

kernel test robot noticed the following build errors:

[auto build test ERROR on akpm-mm/mm-everything]
[also build test ERROR on linus/master v6.12-rc4 next-20241024]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Joshua-Hahn/memcg-hugetlb-Adding-hugeTLB-counters-to-memcg/20241024-043559
base:   https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
patch link:    https://lore.kernel.org/r/20241023203433.1568323-1-joshua.hahnjy%40gmail.com
patch subject: [PATCH v2 1/1] memcg/hugetlb: Adding hugeTLB counters to memcg
config: powerpc-ep88xc_defconfig (https://download.01.org/0day-ci/archive/20241024/202410242321.pttwmbpo-lkp@intel.com/config)
compiler: powerpc-linux-gcc (GCC) 14.1.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20241024/202410242321.pttwmbpo-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202410242321.pttwmbpo-lkp@intel.com/

All errors (new ones prefixed by >>):

   mm/hugetlb.c: In function 'free_huge_folio':
>> mm/hugetlb.c:1928:13: error: 'cgrp_dfl_root' undeclared (first use in this function)
    1928 |         if (cgrp_dfl_root.flags & CGRP_ROOT_MEMORY_HUGETLB_ACCOUNTING)
         |             ^~~~~~~~~~~~~
   mm/hugetlb.c:1928:13: note: each undeclared identifier is reported only once for each function it appears in
>> mm/hugetlb.c:1928:35: error: 'CGRP_ROOT_MEMORY_HUGETLB_ACCOUNTING' undeclared (first use in this function)
    1928 |         if (cgrp_dfl_root.flags & CGRP_ROOT_MEMORY_HUGETLB_ACCOUNTING)
         |                                   ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
   mm/hugetlb.c: In function 'alloc_hugetlb_folio':
   mm/hugetlb.c:3099:13: error: 'cgrp_dfl_root' undeclared (first use in this function)
    3099 |         if (cgrp_dfl_root.flags & CGRP_ROOT_MEMORY_HUGETLB_ACCOUNTING)
         |             ^~~~~~~~~~~~~
   mm/hugetlb.c:3099:35: error: 'CGRP_ROOT_MEMORY_HUGETLB_ACCOUNTING' undeclared (first use in this function)
    3099 |         if (cgrp_dfl_root.flags & CGRP_ROOT_MEMORY_HUGETLB_ACCOUNTING)
         |                                   ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~


vim +/cgrp_dfl_root +1928 mm/hugetlb.c

  1880	
  1881	void free_huge_folio(struct folio *folio)
  1882	{
  1883		/*
  1884		 * Can't pass hstate in here because it is called from the
  1885		 * generic mm code.
  1886		 */
  1887		struct hstate *h = folio_hstate(folio);
  1888		int nid = folio_nid(folio);
  1889		struct hugepage_subpool *spool = hugetlb_folio_subpool(folio);
  1890		bool restore_reserve;
  1891		unsigned long flags;
  1892	
  1893		VM_BUG_ON_FOLIO(folio_ref_count(folio), folio);
  1894		VM_BUG_ON_FOLIO(folio_mapcount(folio), folio);
  1895	
  1896		hugetlb_set_folio_subpool(folio, NULL);
  1897		if (folio_test_anon(folio))
  1898			__ClearPageAnonExclusive(&folio->page);
  1899		folio->mapping = NULL;
  1900		restore_reserve = folio_test_hugetlb_restore_reserve(folio);
  1901		folio_clear_hugetlb_restore_reserve(folio);
  1902	
  1903		/*
  1904		 * If HPageRestoreReserve was set on page, page allocation consumed a
  1905		 * reservation.  If the page was associated with a subpool, there
  1906		 * would have been a page reserved in the subpool before allocation
  1907		 * via hugepage_subpool_get_pages().  Since we are 'restoring' the
  1908		 * reservation, do not call hugepage_subpool_put_pages() as this will
  1909		 * remove the reserved page from the subpool.
  1910		 */
  1911		if (!restore_reserve) {
  1912			/*
  1913			 * A return code of zero implies that the subpool will be
  1914			 * under its minimum size if the reservation is not restored
  1915			 * after page is free.  Therefore, force restore_reserve
  1916			 * operation.
  1917			 */
  1918			if (hugepage_subpool_put_pages(spool, 1) == 0)
  1919				restore_reserve = true;
  1920		}
  1921	
  1922		spin_lock_irqsave(&hugetlb_lock, flags);
  1923		folio_clear_hugetlb_migratable(folio);
  1924		hugetlb_cgroup_uncharge_folio(hstate_index(h),
  1925					     pages_per_huge_page(h), folio);
  1926		hugetlb_cgroup_uncharge_folio_rsvd(hstate_index(h),
  1927						  pages_per_huge_page(h), folio);
> 1928		if (cgrp_dfl_root.flags & CGRP_ROOT_MEMORY_HUGETLB_ACCOUNTING)
  1929			lruvec_stat_mod_folio(folio, HUGETLB_B, -pages_per_huge_page(h));
  1930		mem_cgroup_uncharge(folio);
  1931		if (restore_reserve)
  1932			h->resv_huge_pages++;
  1933	
  1934		if (folio_test_hugetlb_temporary(folio)) {
  1935			remove_hugetlb_folio(h, folio, false);
  1936			spin_unlock_irqrestore(&hugetlb_lock, flags);
  1937			update_and_free_hugetlb_folio(h, folio, true);
  1938		} else if (h->surplus_huge_pages_node[nid]) {
  1939			/* remove the page from active list */
  1940			remove_hugetlb_folio(h, folio, true);
  1941			spin_unlock_irqrestore(&hugetlb_lock, flags);
  1942			update_and_free_hugetlb_folio(h, folio, true);
  1943		} else {
  1944			arch_clear_hugetlb_flags(folio);
  1945			enqueue_hugetlb_folio(h, folio);
  1946			spin_unlock_irqrestore(&hugetlb_lock, flags);
  1947		}
  1948	}
  1949	

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH v2 1/1] memcg/hugetlb: Adding hugeTLB counters to memcg
  2024-10-23 21:17   ` Yosry Ahmed
@ 2024-10-24 16:57     ` Roman Gushchin
  0 siblings, 0 replies; 10+ messages in thread
From: Roman Gushchin @ 2024-10-24 16:57 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: Joshua Hahn, hannes, nphamcs, mhocko, roman.gushcin, shakeel.butt,
	muchun.song, lnyng, akpm, cgroups

On Wed, Oct 23, 2024 at 02:17:27PM -0700, Yosry Ahmed wrote:
> On Wed, Oct 23, 2024 at 2:15 PM Yosry Ahmed <yosryahmed@google.com> wrote:
> >
> > Hi Joshua,
> >
> > On Wed, Oct 23, 2024 at 1:34 PM Joshua Hahn <joshua.hahnjy@gmail.com> wrote:
> > >
> > > Changelog
> > > v2:
> > >   * Enables the feature only if memcg accounts for hugeTLB usage
> > >   * Moves the counter from memcg_stat_item to node_stat_item
> > >   * Expands on motivation & justification in commitlog
> > >   * Added Suggested-by: Nhat Pham
> >
> > Changelogs usually go at the end, after ---, not as part of the commit
> > log itself.
> >
> > >
> > > This patch introduces a new counter to memory.stat that tracks hugeTLB
> > > usage, only if hugeTLB accounting is done to memory.current. This
> > > feature is enabled the same way hugeTLB accounting is enabled, via
> > > the memory_hugetlb_accounting mount flag for cgroupsv2.
> > >
> > > 1. Why is this patch necessary?
> > > Currently, memcg hugeTLB accounting is an opt-in feature [1] that adds
> > > hugeTLB usage to memory.current. However, the metric is not reported in
> > > memory.stat. Given that users often interpret memory.stat as a breakdown
> > > of the value reported in memory.current, the disparity between the two
> > > reports can be confusing. This patch solves this problem by including
> > > the metric in memory.stat as well, but only if it is also reported in
> > > memory.current (it would also be confusing if the value was reported in
> > > memory.stat, but not in memory.current)
> > >
> > > Aside from the consistentcy between the two files, we also see benefits
> >
> > consistency*
> >
> > > in observability. Userspace might be interested in the hugeTLB footprint
> > > of cgroups for many reasons. For instance, system admins might want to
> > > verify that hugeTLB usage is distributed as expected across tasks: i.e.
> > > memory-intensive tasks are using more hugeTLB pages than tasks that
> > > don't consume a lot of memory, or is seen to fault frequently. Note that
> >
> > are* seen
> >
> > > this is separate from wanting to inspect the distribution for limiting
> > > purposes (in which case, hugeTLB controller makes more sense).
> > >
> > > 2. We already have a hugeTLB controller. Why not use that?
> > > It is true that hugeTLB tracks the exact value that we want. In fact, by
> > > enabling the hugeTLB controller, we get all of the observability
> > > benefits that I mentioned above, and users can check the total hugeTLB
> > > usage, verify if it is distributed as expected, etc.
> > >
> > > With this said, there are 2 problems:
> > >   (a) They are still not reported in memory.stat, which means the
> > >       disparity between the memcg reports are still there.
> > >   (b) We cannot reasonably expect users to enable the hugeTLB controller
> > >       just for the sake of hugeTLB usage reporting, especially since
> > >       they don't have any use for hugeTLB usage enforcing [2].
> > >
> > > [1] https://lore.kernel.org/all/20231006184629.155543-1-nphamcs@gmail.com/
> > > [2] Of course, we can't make a new patch for every feature that can be
> > >     duplicated. However, since the exsting solution of enabling the
> >
> > existing*
> >
> > >     hugeTLB controller is an imperfect solution that still leaves a
> > >     discrepancy between memory.stat and memory.curent, I think that it
> > >     is reasonable to isolate the feature in this case.
> > >
> > > Suggested-by: Nhat Pham <nphamcs@gmail.com>
> > > Signed-off-by: Joshua Hahn <joshua.hahnjy@gmail.com>
> >
> > You should also CC linux-mm on such patches.
> 
> +roman.gushchin@linux.dev
> 
> CCing Roman's correct email.


Funny enough I suggested a similar functionality back to 2017:
https://lkml.iu.edu/hypermail/linux/kernel/1711.1/04891.html ,
but it was rejected at that time.
So yeah, it still sounds like a good idea to me.

As I understand, there is a new version of this patchset pending,
I'll wait for it for the review.

Thank you for looping me in!

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH v2 1/1] memcg/hugetlb: Adding hugeTLB counters to memcg
  2024-10-23 22:17 ` Shakeel Butt
@ 2024-10-24 18:13   ` Joshua Hahn
  0 siblings, 0 replies; 10+ messages in thread
From: Joshua Hahn @ 2024-10-24 18:13 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: hannes, nphamcs, mhocko, roman.gushcin, muchun.song, lnyng, akpm,
	cgroups

On Wed, Oct 23, 2024 at 6:17 PM Shakeel Butt <shakeel.butt@linux.dev> wrote:
>
> On Wed, Oct 23, 2024 at 01:34:33PM GMT, Joshua Hahn wrote:
> >  include/linux/mmzone.h |  3 +++
> >  mm/hugetlb.c           |  4 ++++
> >  mm/memcontrol.c        | 11 +++++++++++
> >  mm/vmstat.c            |  3 +++
> >  4 files changed, 21 insertions(+)
> >
> > diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> > index 17506e4a2835..d3ba49a974b2 100644
> > --- a/include/linux/mmzone.h
> > +++ b/include/linux/mmzone.h
> > @@ -215,6 +215,9 @@ enum node_stat_item {
> >  #ifdef CONFIG_NUMA_BALANCING
> >       PGPROMOTE_SUCCESS,      /* promote successfully */
> >       PGPROMOTE_CANDIDATE,    /* candidate pages to promote */
> > +#endif
> > +#ifdef CONFIG_HUGETLB_PAGE
> > +     HUGETLB_B,
>
> As Yosry pointed out, this is in pages, not bytes. There is already
> functionality to display this bin ytes for the readers of the memory
> stats.

Ah I see. I misunderstood what the _B meant, I didn't realize that it's
used to denote what the units of accounting were, not for exporting. I just
checked the functions that Yosry mentioned, and it makes a lot more sense.
(I also should have known, the method that I'm using to do the accounting
is called "pages"_per_huge_page)

> Also you will need to update Documentation/admin-guide/cgroup-v2.rst to
> include the hugetlb stats.

Thank you for pointing this out, I'll update the doc to include hugetlb stats.

> >  #endif
> >       /* PGDEMOTE_*: pages demoted */
> >       PGDEMOTE_KSWAPD,
> > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> > index 190fa05635f4..055bc91858e4 100644
> > --- a/mm/hugetlb.c
> > +++ b/mm/hugetlb.c
> > @@ -1925,6 +1925,8 @@ void free_huge_folio(struct folio *folio)
> >                                    pages_per_huge_page(h), folio);
> >       hugetlb_cgroup_uncharge_folio_rsvd(hstate_index(h),
> >                                         pages_per_huge_page(h), folio);
> > +     if (cgrp_dfl_root.flags & CGRP_ROOT_MEMORY_HUGETLB_ACCOUNTING)
> > +             lruvec_stat_mod_folio(folio, HUGETLB_B, -pages_per_huge_page(h));
>
> Please note that by you are adding this stat not only in memcg but also
> in global and per-node vmstat. This check will break those interfaces
> when this mount option is not used. You only need the check at the
> charging time. The uncharging and stats update functions will do the
> right thing as they check memcg_data attached to the folio.

Sounds good. I'll go back to make sure which other parts of the code
I'm touching
with this patch. Thank you for your feedback!

Joshua

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH v2 1/1] memcg/hugetlb: Adding hugeTLB counters to memcg
  2024-10-23 20:34 [PATCH v2 1/1] memcg/hugetlb: Adding hugeTLB counters to memcg Joshua Hahn
                   ` (2 preceding siblings ...)
  2024-10-24 15:52 ` kernel test robot
@ 2024-10-25  0:11 ` kernel test robot
  3 siblings, 0 replies; 10+ messages in thread
From: kernel test robot @ 2024-10-25  0:11 UTC (permalink / raw)
  To: Joshua Hahn, hannes
  Cc: llvm, oe-kbuild-all, nphamcs, mhocko, roman.gushcin, shakeel.butt,
	muchun.song, lnyng, akpm, cgroups

Hi Joshua,

kernel test robot noticed the following build errors:

[auto build test ERROR on akpm-mm/mm-everything]
[also build test ERROR on linus/master v6.12-rc4 next-20241024]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Joshua-Hahn/memcg-hugetlb-Adding-hugeTLB-counters-to-memcg/20241024-043559
base:   https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
patch link:    https://lore.kernel.org/r/20241023203433.1568323-1-joshua.hahnjy%40gmail.com
patch subject: [PATCH v2 1/1] memcg/hugetlb: Adding hugeTLB counters to memcg
config: i386-buildonly-randconfig-003-20241025 (https://download.01.org/0day-ci/archive/20241025/202410250704.0LVzYMzi-lkp@intel.com/config)
compiler: clang version 19.1.2 (https://github.com/llvm/llvm-project 7ba7d8e2f7b6445b60679da826210cdde29eaf8b)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20241025/202410250704.0LVzYMzi-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202410250704.0LVzYMzi-lkp@intel.com/

All errors (new ones prefixed by >>):

   In file included from mm/hugetlb.c:8:
   In file included from include/linux/mm.h:2213:
   include/linux/vmstat.h:518:36: warning: arithmetic between different enumeration types ('enum node_stat_item' and 'enum lru_list') [-Wenum-enum-conversion]
     518 |         return node_stat_name(NR_LRU_BASE + lru) + 3; // skip "nr_"
         |                               ~~~~~~~~~~~ ^ ~~~
   In file included from mm/hugetlb.c:37:
   include/linux/mm_inline.h:47:41: warning: arithmetic between different enumeration types ('enum node_stat_item' and 'enum lru_list') [-Wenum-enum-conversion]
      47 |         __mod_lruvec_state(lruvec, NR_LRU_BASE + lru, nr_pages);
         |                                    ~~~~~~~~~~~ ^ ~~~
   include/linux/mm_inline.h:49:22: warning: arithmetic between different enumeration types ('enum zone_stat_item' and 'enum lru_list') [-Wenum-enum-conversion]
      49 |                                 NR_ZONE_LRU_BASE + lru, nr_pages);
         |                                 ~~~~~~~~~~~~~~~~ ^ ~~~
>> mm/hugetlb.c:1928:6: error: use of undeclared identifier 'cgrp_dfl_root'
    1928 |         if (cgrp_dfl_root.flags & CGRP_ROOT_MEMORY_HUGETLB_ACCOUNTING)
         |             ^
>> mm/hugetlb.c:1928:28: error: use of undeclared identifier 'CGRP_ROOT_MEMORY_HUGETLB_ACCOUNTING'
    1928 |         if (cgrp_dfl_root.flags & CGRP_ROOT_MEMORY_HUGETLB_ACCOUNTING)
         |                                   ^
   mm/hugetlb.c:3099:6: error: use of undeclared identifier 'cgrp_dfl_root'
    3099 |         if (cgrp_dfl_root.flags & CGRP_ROOT_MEMORY_HUGETLB_ACCOUNTING)
         |             ^
   mm/hugetlb.c:3099:28: error: use of undeclared identifier 'CGRP_ROOT_MEMORY_HUGETLB_ACCOUNTING'
    3099 |         if (cgrp_dfl_root.flags & CGRP_ROOT_MEMORY_HUGETLB_ACCOUNTING)
         |                                   ^
   3 warnings and 4 errors generated.


vim +/cgrp_dfl_root +1928 mm/hugetlb.c

  1880	
  1881	void free_huge_folio(struct folio *folio)
  1882	{
  1883		/*
  1884		 * Can't pass hstate in here because it is called from the
  1885		 * generic mm code.
  1886		 */
  1887		struct hstate *h = folio_hstate(folio);
  1888		int nid = folio_nid(folio);
  1889		struct hugepage_subpool *spool = hugetlb_folio_subpool(folio);
  1890		bool restore_reserve;
  1891		unsigned long flags;
  1892	
  1893		VM_BUG_ON_FOLIO(folio_ref_count(folio), folio);
  1894		VM_BUG_ON_FOLIO(folio_mapcount(folio), folio);
  1895	
  1896		hugetlb_set_folio_subpool(folio, NULL);
  1897		if (folio_test_anon(folio))
  1898			__ClearPageAnonExclusive(&folio->page);
  1899		folio->mapping = NULL;
  1900		restore_reserve = folio_test_hugetlb_restore_reserve(folio);
  1901		folio_clear_hugetlb_restore_reserve(folio);
  1902	
  1903		/*
  1904		 * If HPageRestoreReserve was set on page, page allocation consumed a
  1905		 * reservation.  If the page was associated with a subpool, there
  1906		 * would have been a page reserved in the subpool before allocation
  1907		 * via hugepage_subpool_get_pages().  Since we are 'restoring' the
  1908		 * reservation, do not call hugepage_subpool_put_pages() as this will
  1909		 * remove the reserved page from the subpool.
  1910		 */
  1911		if (!restore_reserve) {
  1912			/*
  1913			 * A return code of zero implies that the subpool will be
  1914			 * under its minimum size if the reservation is not restored
  1915			 * after page is free.  Therefore, force restore_reserve
  1916			 * operation.
  1917			 */
  1918			if (hugepage_subpool_put_pages(spool, 1) == 0)
  1919				restore_reserve = true;
  1920		}
  1921	
  1922		spin_lock_irqsave(&hugetlb_lock, flags);
  1923		folio_clear_hugetlb_migratable(folio);
  1924		hugetlb_cgroup_uncharge_folio(hstate_index(h),
  1925					     pages_per_huge_page(h), folio);
  1926		hugetlb_cgroup_uncharge_folio_rsvd(hstate_index(h),
  1927						  pages_per_huge_page(h), folio);
> 1928		if (cgrp_dfl_root.flags & CGRP_ROOT_MEMORY_HUGETLB_ACCOUNTING)
  1929			lruvec_stat_mod_folio(folio, HUGETLB_B, -pages_per_huge_page(h));
  1930		mem_cgroup_uncharge(folio);
  1931		if (restore_reserve)
  1932			h->resv_huge_pages++;
  1933	
  1934		if (folio_test_hugetlb_temporary(folio)) {
  1935			remove_hugetlb_folio(h, folio, false);
  1936			spin_unlock_irqrestore(&hugetlb_lock, flags);
  1937			update_and_free_hugetlb_folio(h, folio, true);
  1938		} else if (h->surplus_huge_pages_node[nid]) {
  1939			/* remove the page from active list */
  1940			remove_hugetlb_folio(h, folio, true);
  1941			spin_unlock_irqrestore(&hugetlb_lock, flags);
  1942			update_and_free_hugetlb_folio(h, folio, true);
  1943		} else {
  1944			arch_clear_hugetlb_flags(folio);
  1945			enqueue_hugetlb_folio(h, folio);
  1946			spin_unlock_irqrestore(&hugetlb_lock, flags);
  1947		}
  1948	}
  1949	

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2024-10-25  0:11 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-10-23 20:34 [PATCH v2 1/1] memcg/hugetlb: Adding hugeTLB counters to memcg Joshua Hahn
2024-10-23 21:15 ` Yosry Ahmed
2024-10-23 21:17   ` Yosry Ahmed
2024-10-24 16:57     ` Roman Gushchin
2024-10-23 21:32   ` Joshua Hahn
2024-10-23 21:37     ` Yosry Ahmed
2024-10-23 22:17 ` Shakeel Butt
2024-10-24 18:13   ` Joshua Hahn
2024-10-24 15:52 ` kernel test robot
2024-10-25  0:11 ` kernel test robot

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.