* [PATCH v3 0/4] mm/page_owner: Extend page_owner to show memcg information
@ 2022-01-31 19:23 Waiman Long
[not found] ` <20220131192308.608837-1-longman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
0 siblings, 1 reply; 39+ messages in thread
From: Waiman Long @ 2022-01-31 19:23 UTC (permalink / raw)
To: Johannes Weiner, Michal Hocko, Vladimir Davydov, Andrew Morton,
Petr Mladek, Steven Rostedt, Sergey Senozhatsky, Andy Shevchenko,
Rasmus Villemoes
Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA,
cgroups-u79uwXL29TY76Z2rM5mHXA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
Ira Weiny, Mike Rapoport, David Rientjes, Roman Gushchin,
Rafael Aquini, Waiman Long
v3:
- Add unlikely() to patch 1 and clarify that -1 will not be returned.
- Use a helper function to print out memcg information in patch 3.
- Add a new patch 4 to store task command name in page_owner
structure.
v2:
- Remove the SNPRINTF() macro as suggested by Ira and use scnprintf()
instead to remove some buffer overrun checks.
- Add a patch to optimize vscnprintf with a size parameter of 0.
While debugging the constant increase in percpu memory consumption on
a system that spawned large number of containers, it was found that a
lot of offlined mem_cgroup structures remained in place without being
freed. Further investigation indicated that those mem_cgroup structures
were pinned by some pages.
In order to find out what those pages are, the existing page_owner
debugging tool is extended to show memory cgroup information and whether
those memcgs are offlined or not. With the enhanced page_owner tool,
the following is a typical page that pinned the mem_cgroup structure
in my test case:
Page allocated via order 0, mask 0x1100cca(GFP_HIGHUSER_MOVABLE), pid 162970 (podman), ts 1097761405537 ns, free_ts 1097760838089 ns
PFN 1925700 type Movable Block 3761 type Movable Flags 0x17ffffc00c001c(uptodate|dirty|lru|reclaim|swapbacked|node=0|zone=2|lastcpupid=0x1fffff)
prep_new_page+0xac/0xe0
get_page_from_freelist+0x1327/0x14d0
__alloc_pages+0x191/0x340
alloc_pages_vma+0x84/0x250
shmem_alloc_page+0x3f/0x90
shmem_alloc_and_acct_page+0x76/0x1c0
shmem_getpage_gfp+0x281/0x940
shmem_write_begin+0x36/0xe0
generic_perform_write+0xed/0x1d0
__generic_file_write_iter+0xdc/0x1b0
generic_file_write_iter+0x5d/0xb0
new_sync_write+0x11f/0x1b0
vfs_write+0x1ba/0x2a0
ksys_write+0x59/0xd0
do_syscall_64+0x37/0x80
entry_SYSCALL_64_after_hwframe+0x44/0xae
Charged to offlined memcg libpod-conmon-15e4f9c758422306b73b2dd99f9d50a5ea53cbb16b4a13a2c2308a4253cc0ec8.
So the page was not freed because it was part of a shmem segment. That
is useful information that can help users to diagnose similar problems.
Waiman Long (4):
lib/vsprintf: Avoid redundant work with 0 size
mm/page_owner: Use scnprintf() to avoid excessive buffer overrun check
mm/page_owner: Print memcg information
mm/page_owner: Record task command name
lib/vsprintf.c | 8 +++---
mm/page_owner.c | 69 ++++++++++++++++++++++++++++++++++++++-----------
2 files changed, 59 insertions(+), 18 deletions(-)
--
2.27.0
^ permalink raw reply [flat|nested] 39+ messages in thread[parent not found: <20220131192308.608837-1-longman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>]
* [PATCH v3 1/4] lib/vsprintf: Avoid redundant work with 0 size [not found] ` <20220131192308.608837-1-longman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> @ 2022-01-31 19:23 ` Waiman Long [not found] ` <20220131192308.608837-2-longman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> 2022-01-31 19:23 ` [PATCH v3 2/4] mm/page_owner: Use scnprintf() to avoid excessive buffer overrun check Waiman Long ` (2 subsequent siblings) 3 siblings, 1 reply; 39+ messages in thread From: Waiman Long @ 2022-01-31 19:23 UTC (permalink / raw) To: Johannes Weiner, Michal Hocko, Vladimir Davydov, Andrew Morton, Petr Mladek, Steven Rostedt, Sergey Senozhatsky, Andy Shevchenko, Rasmus Villemoes Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA, cgroups-u79uwXL29TY76Z2rM5mHXA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Ira Weiny, Mike Rapoport, David Rientjes, Roman Gushchin, Rafael Aquini, Waiman Long For *scnprintf(), vsnprintf() is always called even if the input size is 0. That is a waste of time, so just return 0 in this case. Note that vsnprintf() will never return -1 to indicate an error. So skipping the call to vsnprintf() when size is 0 will have no functional impact at all. Signed-off-by: Waiman Long <longman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> Acked-by: David Rientjes <rientjes-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> Reviewed-by: Sergey Senozhatsky <senozhatsky-F7+t8E8rja9g9hUCZPvPmw@public.gmane.org> Acked-by: Roman Gushchin <guro-b10kYP2dOMg@public.gmane.org> --- lib/vsprintf.c | 8 +++++--- 1 file changed, 5 insertions(+), 3 deletions(-) diff --git a/lib/vsprintf.c b/lib/vsprintf.c index 3b8129dd374c..d419154b47bb 100644 --- a/lib/vsprintf.c +++ b/lib/vsprintf.c @@ -2895,13 +2895,15 @@ int vscnprintf(char *buf, size_t size, const char *fmt, va_list args) { int i; + if (unlikely(!size)) + return 0; + i = vsnprintf(buf, size, fmt, args); if (likely(i < size)) return i; - if (size != 0) - return size - 1; - return 0; + + return size - 1; } EXPORT_SYMBOL(vscnprintf); -- 2.27.0 ^ permalink raw reply related [flat|nested] 39+ messages in thread
[parent not found: <20220131192308.608837-2-longman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>]
* Re: [PATCH v3 1/4] lib/vsprintf: Avoid redundant work with 0 size [not found] ` <20220131192308.608837-2-longman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> @ 2022-01-31 20:42 ` Mike Rapoport 0 siblings, 0 replies; 39+ messages in thread From: Mike Rapoport @ 2022-01-31 20:42 UTC (permalink / raw) To: Waiman Long Cc: Johannes Weiner, Michal Hocko, Vladimir Davydov, Andrew Morton, Petr Mladek, Steven Rostedt, Sergey Senozhatsky, Andy Shevchenko, Rasmus Villemoes, linux-kernel-u79uwXL29TY76Z2rM5mHXA, cgroups-u79uwXL29TY76Z2rM5mHXA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Ira Weiny, David Rientjes, Roman Gushchin, Rafael Aquini On Mon, Jan 31, 2022 at 02:23:05PM -0500, Waiman Long wrote: > For *scnprintf(), vsnprintf() is always called even if the input size is > 0. That is a waste of time, so just return 0 in this case. > > Note that vsnprintf() will never return -1 to indicate an error. So > skipping the call to vsnprintf() when size is 0 will have no functional > impact at all. > > Signed-off-by: Waiman Long <longman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> > Acked-by: David Rientjes <rientjes-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> > Reviewed-by: Sergey Senozhatsky <senozhatsky-F7+t8E8rja9g9hUCZPvPmw@public.gmane.org> > Acked-by: Roman Gushchin <guro-b10kYP2dOMg@public.gmane.org> Acked-by: Mike Rapoport <rppt-tEXmvtCZX7AybS5Ee8rs3A@public.gmane.org> > --- > lib/vsprintf.c | 8 +++++--- > 1 file changed, 5 insertions(+), 3 deletions(-) > > diff --git a/lib/vsprintf.c b/lib/vsprintf.c > index 3b8129dd374c..d419154b47bb 100644 > --- a/lib/vsprintf.c > +++ b/lib/vsprintf.c > @@ -2895,13 +2895,15 @@ int vscnprintf(char *buf, size_t size, const char *fmt, va_list args) > { > int i; > > + if (unlikely(!size)) > + return 0; > + > i = vsnprintf(buf, size, fmt, args); > > if (likely(i < size)) > return i; > - if (size != 0) > - return size - 1; > - return 0; > + > + return size - 1; > } > EXPORT_SYMBOL(vscnprintf); > > -- > 2.27.0 > -- Sincerely yours, Mike. ^ permalink raw reply [flat|nested] 39+ messages in thread
* [PATCH v3 2/4] mm/page_owner: Use scnprintf() to avoid excessive buffer overrun check [not found] ` <20220131192308.608837-1-longman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> 2022-01-31 19:23 ` [PATCH v3 1/4] lib/vsprintf: Avoid redundant work with 0 size Waiman Long @ 2022-01-31 19:23 ` Waiman Long [not found] ` <20220131192308.608837-3-longman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> 2022-01-31 19:23 ` [PATCH v3 3/4] mm/page_owner: Print memcg information Waiman Long 2022-01-31 19:23 ` [PATCH v3 4/4] mm/page_owner: Record task command name Waiman Long 3 siblings, 1 reply; 39+ messages in thread From: Waiman Long @ 2022-01-31 19:23 UTC (permalink / raw) To: Johannes Weiner, Michal Hocko, Vladimir Davydov, Andrew Morton, Petr Mladek, Steven Rostedt, Sergey Senozhatsky, Andy Shevchenko, Rasmus Villemoes Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA, cgroups-u79uwXL29TY76Z2rM5mHXA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Ira Weiny, Mike Rapoport, David Rientjes, Roman Gushchin, Rafael Aquini, Waiman Long The snprintf() function can return a length greater than the given input size. That will require a check for buffer overrun after each invocation of snprintf(). scnprintf(), on the other hand, will never return a greater length. By using scnprintf() in selected places, we can avoid some buffer overrun checks except after stack_depot_snprint() and after the last snprintf(). Signed-off-by: Waiman Long <longman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> Acked-by: David Rientjes <rientjes-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> Reviewed-by: Sergey Senozhatsky <senozhatsky-F7+t8E8rja9g9hUCZPvPmw@public.gmane.org> --- mm/page_owner.c | 14 +++----------- 1 file changed, 3 insertions(+), 11 deletions(-) diff --git a/mm/page_owner.c b/mm/page_owner.c index 99e360df9465..28dac73e0542 100644 --- a/mm/page_owner.c +++ b/mm/page_owner.c @@ -338,19 +338,16 @@ print_page_owner(char __user *buf, size_t count, unsigned long pfn, if (!kbuf) return -ENOMEM; - ret = snprintf(kbuf, count, + ret = scnprintf(kbuf, count, "Page allocated via order %u, mask %#x(%pGg), pid %d, ts %llu ns, free_ts %llu ns\n", page_owner->order, page_owner->gfp_mask, &page_owner->gfp_mask, page_owner->pid, page_owner->ts_nsec, page_owner->free_ts_nsec); - if (ret >= count) - goto err; - /* Print information relevant to grouping pages by mobility */ pageblock_mt = get_pageblock_migratetype(page); page_mt = gfp_migratetype(page_owner->gfp_mask); - ret += snprintf(kbuf + ret, count - ret, + ret += scnprintf(kbuf + ret, count - ret, "PFN %lu type %s Block %lu type %s Flags %pGp\n", pfn, migratetype_names[page_mt], @@ -358,19 +355,14 @@ print_page_owner(char __user *buf, size_t count, unsigned long pfn, migratetype_names[pageblock_mt], &page->flags); - if (ret >= count) - goto err; - ret += stack_depot_snprint(handle, kbuf + ret, count - ret, 0); if (ret >= count) goto err; if (page_owner->last_migrate_reason != -1) { - ret += snprintf(kbuf + ret, count - ret, + ret += scnprintf(kbuf + ret, count - ret, "Page has been migrated, last migrate reason: %s\n", migrate_reason_names[page_owner->last_migrate_reason]); - if (ret >= count) - goto err; } ret += snprintf(kbuf + ret, count - ret, "\n"); -- 2.27.0 ^ permalink raw reply related [flat|nested] 39+ messages in thread
[parent not found: <20220131192308.608837-3-longman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>]
* Re: [PATCH v3 2/4] mm/page_owner: Use scnprintf() to avoid excessive buffer overrun check [not found] ` <20220131192308.608837-3-longman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> @ 2022-01-31 20:38 ` Roman Gushchin 2022-01-31 20:43 ` Mike Rapoport 1 sibling, 0 replies; 39+ messages in thread From: Roman Gushchin @ 2022-01-31 20:38 UTC (permalink / raw) To: Waiman Long Cc: Johannes Weiner, Michal Hocko, Vladimir Davydov, Andrew Morton, Petr Mladek, Steven Rostedt, Sergey Senozhatsky, Andy Shevchenko, Rasmus Villemoes, linux-kernel-u79uwXL29TY76Z2rM5mHXA, cgroups-u79uwXL29TY76Z2rM5mHXA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Ira Weiny, Mike Rapoport, David Rientjes, Rafael Aquini On Mon, Jan 31, 2022 at 02:23:06PM -0500, Waiman Long wrote: > The snprintf() function can return a length greater than the given > input size. That will require a check for buffer overrun after each > invocation of snprintf(). scnprintf(), on the other hand, will never > return a greater length. By using scnprintf() in selected places, we > can avoid some buffer overrun checks except after stack_depot_snprint() > and after the last snprintf(). > > Signed-off-by: Waiman Long <longman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> > Acked-by: David Rientjes <rientjes-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> > Reviewed-by: Sergey Senozhatsky <senozhatsky-F7+t8E8rja9g9hUCZPvPmw@public.gmane.org> Acked-by: Roman Gushchin <guro-b10kYP2dOMg@public.gmane.org> ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCH v3 2/4] mm/page_owner: Use scnprintf() to avoid excessive buffer overrun check [not found] ` <20220131192308.608837-3-longman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> 2022-01-31 20:38 ` Roman Gushchin @ 2022-01-31 20:43 ` Mike Rapoport 1 sibling, 0 replies; 39+ messages in thread From: Mike Rapoport @ 2022-01-31 20:43 UTC (permalink / raw) To: Waiman Long Cc: Johannes Weiner, Michal Hocko, Vladimir Davydov, Andrew Morton, Petr Mladek, Steven Rostedt, Sergey Senozhatsky, Andy Shevchenko, Rasmus Villemoes, linux-kernel-u79uwXL29TY76Z2rM5mHXA, cgroups-u79uwXL29TY76Z2rM5mHXA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Ira Weiny, David Rientjes, Roman Gushchin, Rafael Aquini On Mon, Jan 31, 2022 at 02:23:06PM -0500, Waiman Long wrote: > The snprintf() function can return a length greater than the given > input size. That will require a check for buffer overrun after each > invocation of snprintf(). scnprintf(), on the other hand, will never > return a greater length. By using scnprintf() in selected places, we > can avoid some buffer overrun checks except after stack_depot_snprint() > and after the last snprintf(). > > Signed-off-by: Waiman Long <longman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> > Acked-by: David Rientjes <rientjes-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> > Reviewed-by: Sergey Senozhatsky <senozhatsky-F7+t8E8rja9g9hUCZPvPmw@public.gmane.org> Acked-by: Mike Rapoport <rppt-tEXmvtCZX7AybS5Ee8rs3A@public.gmane.org> > --- > mm/page_owner.c | 14 +++----------- > 1 file changed, 3 insertions(+), 11 deletions(-) > > diff --git a/mm/page_owner.c b/mm/page_owner.c > index 99e360df9465..28dac73e0542 100644 > --- a/mm/page_owner.c > +++ b/mm/page_owner.c > @@ -338,19 +338,16 @@ print_page_owner(char __user *buf, size_t count, unsigned long pfn, > if (!kbuf) > return -ENOMEM; > > - ret = snprintf(kbuf, count, > + ret = scnprintf(kbuf, count, > "Page allocated via order %u, mask %#x(%pGg), pid %d, ts %llu ns, free_ts %llu ns\n", > page_owner->order, page_owner->gfp_mask, > &page_owner->gfp_mask, page_owner->pid, > page_owner->ts_nsec, page_owner->free_ts_nsec); > > - if (ret >= count) > - goto err; > - > /* Print information relevant to grouping pages by mobility */ > pageblock_mt = get_pageblock_migratetype(page); > page_mt = gfp_migratetype(page_owner->gfp_mask); > - ret += snprintf(kbuf + ret, count - ret, > + ret += scnprintf(kbuf + ret, count - ret, > "PFN %lu type %s Block %lu type %s Flags %pGp\n", > pfn, > migratetype_names[page_mt], > @@ -358,19 +355,14 @@ print_page_owner(char __user *buf, size_t count, unsigned long pfn, > migratetype_names[pageblock_mt], > &page->flags); > > - if (ret >= count) > - goto err; > - > ret += stack_depot_snprint(handle, kbuf + ret, count - ret, 0); > if (ret >= count) > goto err; > > if (page_owner->last_migrate_reason != -1) { > - ret += snprintf(kbuf + ret, count - ret, > + ret += scnprintf(kbuf + ret, count - ret, > "Page has been migrated, last migrate reason: %s\n", > migrate_reason_names[page_owner->last_migrate_reason]); > - if (ret >= count) > - goto err; > } > > ret += snprintf(kbuf + ret, count - ret, "\n"); > -- > 2.27.0 > > -- Sincerely yours, Mike. ^ permalink raw reply [flat|nested] 39+ messages in thread
* [PATCH v3 3/4] mm/page_owner: Print memcg information [not found] ` <20220131192308.608837-1-longman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> 2022-01-31 19:23 ` [PATCH v3 1/4] lib/vsprintf: Avoid redundant work with 0 size Waiman Long 2022-01-31 19:23 ` [PATCH v3 2/4] mm/page_owner: Use scnprintf() to avoid excessive buffer overrun check Waiman Long @ 2022-01-31 19:23 ` Waiman Long [not found] ` <20220131192308.608837-4-longman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> 2022-02-01 10:54 ` Michal Hocko 2022-01-31 19:23 ` [PATCH v3 4/4] mm/page_owner: Record task command name Waiman Long 3 siblings, 2 replies; 39+ messages in thread From: Waiman Long @ 2022-01-31 19:23 UTC (permalink / raw) To: Johannes Weiner, Michal Hocko, Vladimir Davydov, Andrew Morton, Petr Mladek, Steven Rostedt, Sergey Senozhatsky, Andy Shevchenko, Rasmus Villemoes Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA, cgroups-u79uwXL29TY76Z2rM5mHXA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Ira Weiny, Mike Rapoport, David Rientjes, Roman Gushchin, Rafael Aquini, Waiman Long It was found that a number of offlined memcgs were not freed because they were pinned by some charged pages that were present. Even "echo 1 > /proc/sys/vm/drop_caches" wasn't able to free those pages. These offlined but not freed memcgs tend to increase in number over time with the side effect that percpu memory consumption as shown in /proc/meminfo also increases over time. In order to find out more information about those pages that pin offlined memcgs, the page_owner feature is extended to print memory cgroup information especially whether the cgroup is offlined or not. Signed-off-by: Waiman Long <longman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> Acked-by: David Rientjes <rientjes-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> --- mm/page_owner.c | 39 +++++++++++++++++++++++++++++++++++++++ 1 file changed, 39 insertions(+) diff --git a/mm/page_owner.c b/mm/page_owner.c index 28dac73e0542..a471c74c7fe0 100644 --- a/mm/page_owner.c +++ b/mm/page_owner.c @@ -10,6 +10,7 @@ #include <linux/migrate.h> #include <linux/stackdepot.h> #include <linux/seq_file.h> +#include <linux/memcontrol.h> #include <linux/sched/clock.h> #include "internal.h" @@ -325,6 +326,42 @@ void pagetypeinfo_showmixedcount_print(struct seq_file *m, seq_putc(m, '\n'); } +#ifdef CONFIG_MEMCG +/* + * Looking for memcg information and print it out + */ +static inline void print_page_owner_memcg(char *kbuf, size_t count, int *pret, + struct page *page) +{ + unsigned long memcg_data = READ_ONCE(page->memcg_data); + struct mem_cgroup *memcg; + bool onlined; + char name[80]; + + if (!memcg_data) + return; + + if (memcg_data & MEMCG_DATA_OBJCGS) + *pret += scnprintf(kbuf + *pret, count - *pret, + "Slab cache page\n"); + + memcg = page_memcg_check(page); + if (!memcg) + return; + + onlined = (memcg->css.flags & CSS_ONLINE); + cgroup_name(memcg->css.cgroup, name, sizeof(name)); + *pret += scnprintf(kbuf + *pret, count - *pret, + "Charged %sto %smemcg %s\n", + PageMemcgKmem(page) ? "(via objcg) " : "", + onlined ? "" : "offlined ", + name); +} +#else /* CONFIG_MEMCG */ +static inline void print_page_owner_memcg(char *kbuf, size_t count, int *pret, + struct page *page) { } +#endif /* CONFIG_MEMCG */ + static ssize_t print_page_owner(char __user *buf, size_t count, unsigned long pfn, struct page *page, struct page_owner *page_owner, @@ -365,6 +402,8 @@ print_page_owner(char __user *buf, size_t count, unsigned long pfn, migrate_reason_names[page_owner->last_migrate_reason]); } + print_page_owner_memcg(kbuf, count, &ret, page); + ret += snprintf(kbuf + ret, count - ret, "\n"); if (ret >= count) goto err; -- 2.27.0 ^ permalink raw reply related [flat|nested] 39+ messages in thread
[parent not found: <20220131192308.608837-4-longman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>]
* Re: [PATCH v3 3/4] mm/page_owner: Print memcg information [not found] ` <20220131192308.608837-4-longman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> @ 2022-01-31 20:51 ` Mike Rapoport [not found] ` <YfhLzI+RLRGgexmr-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> 2022-01-31 20:51 ` Roman Gushchin 1 sibling, 1 reply; 39+ messages in thread From: Mike Rapoport @ 2022-01-31 20:51 UTC (permalink / raw) To: Waiman Long Cc: Johannes Weiner, Michal Hocko, Vladimir Davydov, Andrew Morton, Petr Mladek, Steven Rostedt, Sergey Senozhatsky, Andy Shevchenko, Rasmus Villemoes, linux-kernel-u79uwXL29TY76Z2rM5mHXA, cgroups-u79uwXL29TY76Z2rM5mHXA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Ira Weiny, David Rientjes, Roman Gushchin, Rafael Aquini On Mon, Jan 31, 2022 at 02:23:07PM -0500, Waiman Long wrote: > It was found that a number of offlined memcgs were not freed because > they were pinned by some charged pages that were present. Even "echo > 1 > /proc/sys/vm/drop_caches" wasn't able to free those pages. These > offlined but not freed memcgs tend to increase in number over time with > the side effect that percpu memory consumption as shown in /proc/meminfo > also increases over time. > > In order to find out more information about those pages that pin > offlined memcgs, the page_owner feature is extended to print memory > cgroup information especially whether the cgroup is offlined or not. > > Signed-off-by: Waiman Long <longman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> > Acked-by: David Rientjes <rientjes-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> > --- > mm/page_owner.c | 39 +++++++++++++++++++++++++++++++++++++++ > 1 file changed, 39 insertions(+) > > diff --git a/mm/page_owner.c b/mm/page_owner.c > index 28dac73e0542..a471c74c7fe0 100644 > --- a/mm/page_owner.c > +++ b/mm/page_owner.c > @@ -10,6 +10,7 @@ > #include <linux/migrate.h> > #include <linux/stackdepot.h> > #include <linux/seq_file.h> > +#include <linux/memcontrol.h> > #include <linux/sched/clock.h> > > #include "internal.h" > @@ -325,6 +326,42 @@ void pagetypeinfo_showmixedcount_print(struct seq_file *m, > seq_putc(m, '\n'); > } > > +#ifdef CONFIG_MEMCG > +/* > + * Looking for memcg information and print it out > + */ > +static inline void print_page_owner_memcg(char *kbuf, size_t count, int *pret, > + struct page *page) > +{ > + unsigned long memcg_data = READ_ONCE(page->memcg_data); > + struct mem_cgroup *memcg; > + bool onlined; > + char name[80]; > + > + if (!memcg_data) > + return; > + > + if (memcg_data & MEMCG_DATA_OBJCGS) > + *pret += scnprintf(kbuf + *pret, count - *pret, > + "Slab cache page\n"); Don't we need to check for overflow here? > + > + memcg = page_memcg_check(page); > + if (!memcg) > + return; > + > + onlined = (memcg->css.flags & CSS_ONLINE); > + cgroup_name(memcg->css.cgroup, name, sizeof(name)); > + *pret += scnprintf(kbuf + *pret, count - *pret, > + "Charged %sto %smemcg %s\n", > + PageMemcgKmem(page) ? "(via objcg) " : "", > + onlined ? "" : "offlined ", > + name); Ditto > +} > +#else /* CONFIG_MEMCG */ > +static inline void print_page_owner_memcg(char *kbuf, size_t count, int *pret, > + struct page *page) { } I think #ifdef inside the print_page_owner_memcg() functions will be simpler and clearer. > +#endif /* CONFIG_MEMCG */ > + > static ssize_t > print_page_owner(char __user *buf, size_t count, unsigned long pfn, > struct page *page, struct page_owner *page_owner, > @@ -365,6 +402,8 @@ print_page_owner(char __user *buf, size_t count, unsigned long pfn, > migrate_reason_names[page_owner->last_migrate_reason]); > } > > + print_page_owner_memcg(kbuf, count, &ret, page); > + ret can go over count here. Why not make print_page_owner_memcg() an int so that the call will be consistent with other calls in print_page_owner(): ret += print_page_owner_memcg(kbuf, count, page); if (ret >= count) goto err; > ret += snprintf(kbuf + ret, count - ret, "\n"); > if (ret >= count) > goto err; > -- > 2.27.0 > -- Sincerely yours, Mike. ^ permalink raw reply [flat|nested] 39+ messages in thread
[parent not found: <YfhLzI+RLRGgexmr-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>]
* Re: [PATCH v3 3/4] mm/page_owner: Print memcg information [not found] ` <YfhLzI+RLRGgexmr-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> @ 2022-01-31 21:43 ` Waiman Long [not found] ` <4234fc60-5d65-1089-555a-734218aa6f9c-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> 0 siblings, 1 reply; 39+ messages in thread From: Waiman Long @ 2022-01-31 21:43 UTC (permalink / raw) To: Mike Rapoport Cc: Johannes Weiner, Michal Hocko, Vladimir Davydov, Andrew Morton, Petr Mladek, Steven Rostedt, Sergey Senozhatsky, Andy Shevchenko, Rasmus Villemoes, linux-kernel-u79uwXL29TY76Z2rM5mHXA, cgroups-u79uwXL29TY76Z2rM5mHXA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Ira Weiny, David Rientjes, Roman Gushchin, Rafael Aquini On 1/31/22 15:51, Mike Rapoport wrote: > On Mon, Jan 31, 2022 at 02:23:07PM -0500, Waiman Long wrote: >> It was found that a number of offlined memcgs were not freed because >> they were pinned by some charged pages that were present. Even "echo >> 1 > /proc/sys/vm/drop_caches" wasn't able to free those pages. These >> offlined but not freed memcgs tend to increase in number over time with >> the side effect that percpu memory consumption as shown in /proc/meminfo >> also increases over time. >> >> In order to find out more information about those pages that pin >> offlined memcgs, the page_owner feature is extended to print memory >> cgroup information especially whether the cgroup is offlined or not. >> >> Signed-off-by: Waiman Long <longman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> >> Acked-by: David Rientjes <rientjes-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> >> --- >> mm/page_owner.c | 39 +++++++++++++++++++++++++++++++++++++++ >> 1 file changed, 39 insertions(+) >> >> diff --git a/mm/page_owner.c b/mm/page_owner.c >> index 28dac73e0542..a471c74c7fe0 100644 >> --- a/mm/page_owner.c >> +++ b/mm/page_owner.c >> @@ -10,6 +10,7 @@ >> #include <linux/migrate.h> >> #include <linux/stackdepot.h> >> #include <linux/seq_file.h> >> +#include <linux/memcontrol.h> >> #include <linux/sched/clock.h> >> >> #include "internal.h" >> @@ -325,6 +326,42 @@ void pagetypeinfo_showmixedcount_print(struct seq_file *m, >> seq_putc(m, '\n'); >> } >> >> +#ifdef CONFIG_MEMCG >> +/* >> + * Looking for memcg information and print it out >> + */ >> +static inline void print_page_owner_memcg(char *kbuf, size_t count, int *pret, >> + struct page *page) >> +{ >> + unsigned long memcg_data = READ_ONCE(page->memcg_data); >> + struct mem_cgroup *memcg; >> + bool onlined; >> + char name[80]; >> + >> + if (!memcg_data) >> + return; >> + >> + if (memcg_data & MEMCG_DATA_OBJCGS) >> + *pret += scnprintf(kbuf + *pret, count - *pret, >> + "Slab cache page\n"); > Don't we need to check for overflow here? See my previous patch 2 and the reason I used scnprintf() is that it never return a length that is >= the given size. So overflow won't happen. The final snprintf() in print_page_owner() will detect buffer overflow. > >> + >> + memcg = page_memcg_check(page); >> + if (!memcg) >> + return; >> + >> + onlined = (memcg->css.flags & CSS_ONLINE); >> + cgroup_name(memcg->css.cgroup, name, sizeof(name)); >> + *pret += scnprintf(kbuf + *pret, count - *pret, >> + "Charged %sto %smemcg %s\n", >> + PageMemcgKmem(page) ? "(via objcg) " : "", >> + onlined ? "" : "offlined ", >> + name); > Ditto > >> +} >> +#else /* CONFIG_MEMCG */ >> +static inline void print_page_owner_memcg(char *kbuf, size_t count, int *pret, >> + struct page *page) { } > I think #ifdef inside the print_page_owner_memcg() functions will be > simpler and clearer. Yes, I see both styles used in kernel code though this style is probably more common. I will keep this unless there is a good reason to do otherwise. > >> +#endif /* CONFIG_MEMCG */ >> + >> static ssize_t >> print_page_owner(char __user *buf, size_t count, unsigned long pfn, >> struct page *page, struct page_owner *page_owner, >> @@ -365,6 +402,8 @@ print_page_owner(char __user *buf, size_t count, unsigned long pfn, >> migrate_reason_names[page_owner->last_migrate_reason]); >> } >> >> + print_page_owner_memcg(kbuf, count, &ret, page); >> + > ret can go over count here. > Why not make print_page_owner_memcg() an int so that the call will be > consistent with other calls in print_page_owner(): > > ret += print_page_owner_memcg(kbuf, count, page); > if (ret >= count) > goto err; See my comments above. Cheers, Longman ^ permalink raw reply [flat|nested] 39+ messages in thread
[parent not found: <4234fc60-5d65-1089-555a-734218aa6f9c-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>]
* Re: [PATCH v3 3/4] mm/page_owner: Print memcg information [not found] ` <4234fc60-5d65-1089-555a-734218aa6f9c-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> @ 2022-02-01 6:23 ` Mike Rapoport 0 siblings, 0 replies; 39+ messages in thread From: Mike Rapoport @ 2022-02-01 6:23 UTC (permalink / raw) To: Waiman Long Cc: Johannes Weiner, Michal Hocko, Vladimir Davydov, Andrew Morton, Petr Mladek, Steven Rostedt, Sergey Senozhatsky, Andy Shevchenko, Rasmus Villemoes, linux-kernel-u79uwXL29TY76Z2rM5mHXA, cgroups-u79uwXL29TY76Z2rM5mHXA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Ira Weiny, David Rientjes, Roman Gushchin, Rafael Aquini On Mon, Jan 31, 2022 at 04:43:32PM -0500, Waiman Long wrote: > > On 1/31/22 15:51, Mike Rapoport wrote: > > On Mon, Jan 31, 2022 at 02:23:07PM -0500, Waiman Long wrote: > > > It was found that a number of offlined memcgs were not freed because > > > they were pinned by some charged pages that were present. Even "echo > > > 1 > /proc/sys/vm/drop_caches" wasn't able to free those pages. These > > > offlined but not freed memcgs tend to increase in number over time with > > > the side effect that percpu memory consumption as shown in /proc/meminfo > > > also increases over time. > > > > > > In order to find out more information about those pages that pin > > > offlined memcgs, the page_owner feature is extended to print memory > > > cgroup information especially whether the cgroup is offlined or not. > > > > > > Signed-off-by: Waiman Long <longman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> > > > Acked-by: David Rientjes <rientjes-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> > > > --- > > > mm/page_owner.c | 39 +++++++++++++++++++++++++++++++++++++++ > > > 1 file changed, 39 insertions(+) > > > > > > diff --git a/mm/page_owner.c b/mm/page_owner.c > > > index 28dac73e0542..a471c74c7fe0 100644 > > > --- a/mm/page_owner.c > > > +++ b/mm/page_owner.c > > > @@ -10,6 +10,7 @@ > > > #include <linux/migrate.h> > > > #include <linux/stackdepot.h> > > > #include <linux/seq_file.h> > > > +#include <linux/memcontrol.h> > > > #include <linux/sched/clock.h> > > > #include "internal.h" > > > @@ -325,6 +326,42 @@ void pagetypeinfo_showmixedcount_print(struct seq_file *m, > > > seq_putc(m, '\n'); > > > } > > > +#ifdef CONFIG_MEMCG > > > +/* > > > + * Looking for memcg information and print it out > > > + */ > > > +static inline void print_page_owner_memcg(char *kbuf, size_t count, int *pret, > > > + struct page *page) > > > +{ > > > + unsigned long memcg_data = READ_ONCE(page->memcg_data); > > > + struct mem_cgroup *memcg; > > > + bool onlined; > > > + char name[80]; > > > + > > > + if (!memcg_data) > > > + return; > > > + > > > + if (memcg_data & MEMCG_DATA_OBJCGS) > > > + *pret += scnprintf(kbuf + *pret, count - *pret, > > > + "Slab cache page\n"); > > Don't we need to check for overflow here? > > See my previous patch 2 and the reason I used scnprintf() is that it never > return a length that is >= the given size. So overflow won't happen. The > final snprintf() in print_page_owner() will detect buffer overflow. Right, I've missed that > > > + > > > + memcg = page_memcg_check(page); > > > + if (!memcg) > > > + return; > > > + > > > + onlined = (memcg->css.flags & CSS_ONLINE); > > > + cgroup_name(memcg->css.cgroup, name, sizeof(name)); > > > + *pret += scnprintf(kbuf + *pret, count - *pret, > > > + "Charged %sto %smemcg %s\n", > > > + PageMemcgKmem(page) ? "(via objcg) " : "", > > > + onlined ? "" : "offlined ", > > > + name); > > Ditto > > > > > +} > > > +#else /* CONFIG_MEMCG */ > > > +static inline void print_page_owner_memcg(char *kbuf, size_t count, int *pret, > > > + struct page *page) { } > > I think #ifdef inside the print_page_owner_memcg() functions will be > > simpler and clearer. > > Yes, I see both styles used in kernel code though this style is probably > more common. I will keep this unless there is a good reason to do otherwise. Having #ifdef inside the function is safer wrt future updates. It's often happens that non-default arm of #ifdef is forgotten. Besides, it's several lines less. > > > +#endif /* CONFIG_MEMCG */ > > > + > > > static ssize_t > > > print_page_owner(char __user *buf, size_t count, unsigned long pfn, > > > struct page *page, struct page_owner *page_owner, > > > @@ -365,6 +402,8 @@ print_page_owner(char __user *buf, size_t count, unsigned long pfn, > > > migrate_reason_names[page_owner->last_migrate_reason]); > > > } > > > + print_page_owner_memcg(kbuf, count, &ret, page); > > > + > > ret can go over count here. > > Why not make print_page_owner_memcg() an int so that the call will be > > consistent with other calls in print_page_owner(): > > > > ret += print_page_owner_memcg(kbuf, count, page); > > if (ret >= count) > > goto err; I still think that 'int print_page_owner_memcg()' is clearer and more readable. > See my comments above. > > Cheers, > Longman > -- Sincerely yours, Mike. ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCH v3 3/4] mm/page_owner: Print memcg information [not found] ` <20220131192308.608837-4-longman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> 2022-01-31 20:51 ` Mike Rapoport @ 2022-01-31 20:51 ` Roman Gushchin 1 sibling, 0 replies; 39+ messages in thread From: Roman Gushchin @ 2022-01-31 20:51 UTC (permalink / raw) To: Waiman Long Cc: Johannes Weiner, Michal Hocko, Vladimir Davydov, Andrew Morton, Petr Mladek, Steven Rostedt, Sergey Senozhatsky, Andy Shevchenko, Rasmus Villemoes, linux-kernel-u79uwXL29TY76Z2rM5mHXA, cgroups-u79uwXL29TY76Z2rM5mHXA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Ira Weiny, Mike Rapoport, David Rientjes, Rafael Aquini On Mon, Jan 31, 2022 at 02:23:07PM -0500, Waiman Long wrote: > It was found that a number of offlined memcgs were not freed because > they were pinned by some charged pages that were present. Even "echo > 1 > /proc/sys/vm/drop_caches" wasn't able to free those pages. These > offlined but not freed memcgs tend to increase in number over time with > the side effect that percpu memory consumption as shown in /proc/meminfo > also increases over time. > > In order to find out more information about those pages that pin > offlined memcgs, the page_owner feature is extended to print memory > cgroup information especially whether the cgroup is offlined or not. > > Signed-off-by: Waiman Long <longman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> > Acked-by: David Rientjes <rientjes-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> Acked-by: Roman Gushchin <guro-b10kYP2dOMg@public.gmane.org> Thanks! ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCH v3 3/4] mm/page_owner: Print memcg information 2022-01-31 19:23 ` [PATCH v3 3/4] mm/page_owner: Print memcg information Waiman Long [not found] ` <20220131192308.608837-4-longman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> @ 2022-02-01 10:54 ` Michal Hocko [not found] ` <YfkRS75D3xcqLT85-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org> 1 sibling, 1 reply; 39+ messages in thread From: Michal Hocko @ 2022-02-01 10:54 UTC (permalink / raw) To: Waiman Long Cc: Johannes Weiner, Vladimir Davydov, Andrew Morton, Petr Mladek, Steven Rostedt, Sergey Senozhatsky, Andy Shevchenko, Rasmus Villemoes, linux-kernel, cgroups, linux-mm, Ira Weiny, Mike Rapoport, David Rientjes, Roman Gushchin, Rafael Aquini On Mon 31-01-22 14:23:07, Waiman Long wrote: > It was found that a number of offlined memcgs were not freed because > they were pinned by some charged pages that were present. Even "echo > 1 > /proc/sys/vm/drop_caches" wasn't able to free those pages. These > offlined but not freed memcgs tend to increase in number over time with > the side effect that percpu memory consumption as shown in /proc/meminfo > also increases over time. > > In order to find out more information about those pages that pin > offlined memcgs, the page_owner feature is extended to print memory > cgroup information especially whether the cgroup is offlined or not. > > Signed-off-by: Waiman Long <longman@redhat.com> > Acked-by: David Rientjes <rientjes@google.com> > --- > mm/page_owner.c | 39 +++++++++++++++++++++++++++++++++++++++ > 1 file changed, 39 insertions(+) > > diff --git a/mm/page_owner.c b/mm/page_owner.c > index 28dac73e0542..a471c74c7fe0 100644 > --- a/mm/page_owner.c > +++ b/mm/page_owner.c > @@ -10,6 +10,7 @@ > #include <linux/migrate.h> > #include <linux/stackdepot.h> > #include <linux/seq_file.h> > +#include <linux/memcontrol.h> > #include <linux/sched/clock.h> > > #include "internal.h" > @@ -325,6 +326,42 @@ void pagetypeinfo_showmixedcount_print(struct seq_file *m, > seq_putc(m, '\n'); > } > > +#ifdef CONFIG_MEMCG > +/* > + * Looking for memcg information and print it out > + */ > +static inline void print_page_owner_memcg(char *kbuf, size_t count, int *pret, > + struct page *page) > +{ > + unsigned long memcg_data = READ_ONCE(page->memcg_data); > + struct mem_cgroup *memcg; > + bool onlined; > + char name[80]; > + > + if (!memcg_data) > + return; > + > + if (memcg_data & MEMCG_DATA_OBJCGS) > + *pret += scnprintf(kbuf + *pret, count - *pret, > + "Slab cache page\n"); > + > + memcg = page_memcg_check(page); > + if (!memcg) > + return; > + > + onlined = (memcg->css.flags & CSS_ONLINE); > + cgroup_name(memcg->css.cgroup, name, sizeof(name)); > + *pret += scnprintf(kbuf + *pret, count - *pret, > + "Charged %sto %smemcg %s\n", > + PageMemcgKmem(page) ? "(via objcg) " : "", > + onlined ? "" : "offlined ", > + name); I have asked in the previous version already but what makes the memcg stable (why it cannot go away and be reallocated for something else) while you are trying to get its name? -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 39+ messages in thread
[parent not found: <YfkRS75D3xcqLT85-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>]
* Re: [PATCH v3 3/4] mm/page_owner: Print memcg information [not found] ` <YfkRS75D3xcqLT85-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org> @ 2022-02-01 17:04 ` Waiman Long [not found] ` <33be132c-874d-1061-9003-50942275b221-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> 0 siblings, 1 reply; 39+ messages in thread From: Waiman Long @ 2022-02-01 17:04 UTC (permalink / raw) To: Michal Hocko Cc: Johannes Weiner, Vladimir Davydov, Andrew Morton, Petr Mladek, Steven Rostedt, Sergey Senozhatsky, Andy Shevchenko, Rasmus Villemoes, linux-kernel-u79uwXL29TY76Z2rM5mHXA, cgroups-u79uwXL29TY76Z2rM5mHXA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Ira Weiny, Mike Rapoport, David Rientjes, Roman Gushchin, Rafael Aquini On 2/1/22 05:54, Michal Hocko wrote: > On Mon 31-01-22 14:23:07, Waiman Long wrote: >> It was found that a number of offlined memcgs were not freed because >> they were pinned by some charged pages that were present. Even "echo >> 1 > /proc/sys/vm/drop_caches" wasn't able to free those pages. These >> offlined but not freed memcgs tend to increase in number over time with >> the side effect that percpu memory consumption as shown in /proc/meminfo >> also increases over time. >> >> In order to find out more information about those pages that pin >> offlined memcgs, the page_owner feature is extended to print memory >> cgroup information especially whether the cgroup is offlined or not. >> >> Signed-off-by: Waiman Long <longman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> >> Acked-by: David Rientjes <rientjes-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> >> --- >> mm/page_owner.c | 39 +++++++++++++++++++++++++++++++++++++++ >> 1 file changed, 39 insertions(+) >> >> diff --git a/mm/page_owner.c b/mm/page_owner.c >> index 28dac73e0542..a471c74c7fe0 100644 >> --- a/mm/page_owner.c >> +++ b/mm/page_owner.c >> @@ -10,6 +10,7 @@ >> #include <linux/migrate.h> >> #include <linux/stackdepot.h> >> #include <linux/seq_file.h> >> +#include <linux/memcontrol.h> >> #include <linux/sched/clock.h> >> >> #include "internal.h" >> @@ -325,6 +326,42 @@ void pagetypeinfo_showmixedcount_print(struct seq_file *m, >> seq_putc(m, '\n'); >> } >> >> +#ifdef CONFIG_MEMCG >> +/* >> + * Looking for memcg information and print it out >> + */ >> +static inline void print_page_owner_memcg(char *kbuf, size_t count, int *pret, >> + struct page *page) >> +{ >> + unsigned long memcg_data = READ_ONCE(page->memcg_data); >> + struct mem_cgroup *memcg; >> + bool onlined; >> + char name[80]; >> + >> + if (!memcg_data) >> + return; >> + >> + if (memcg_data & MEMCG_DATA_OBJCGS) >> + *pret += scnprintf(kbuf + *pret, count - *pret, >> + "Slab cache page\n"); >> + >> + memcg = page_memcg_check(page); >> + if (!memcg) >> + return; >> + >> + onlined = (memcg->css.flags & CSS_ONLINE); >> + cgroup_name(memcg->css.cgroup, name, sizeof(name)); >> + *pret += scnprintf(kbuf + *pret, count - *pret, >> + "Charged %sto %smemcg %s\n", >> + PageMemcgKmem(page) ? "(via objcg) " : "", >> + onlined ? "" : "offlined ", >> + name); > I have asked in the previous version already but what makes the memcg > stable (why it cannot go away and be reallocated for something else) > while you are trying to get its name? The memcg is not going away as long as the page isn't freed unless if it is indirectly connected via objcg. Of course, there can be a race between the page is going to be freed while the page_owner information is being displayed. One solution is to add a simple bit lock to each of the page_owner structure and acquire the lock when it is being written to or read from. Anyway a lot of these debugging aids or tools don't eliminate all the race conditions that affect the accuracy of the displayed information. I can add a patch to eliminate this direct memcg race if you think this is necessary. Cheers, Longman ^ permalink raw reply [flat|nested] 39+ messages in thread
[parent not found: <33be132c-874d-1061-9003-50942275b221-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>]
* Re: [PATCH v3 3/4] mm/page_owner: Print memcg information [not found] ` <33be132c-874d-1061-9003-50942275b221-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> @ 2022-02-02 8:49 ` Michal Hocko [not found] ` <YfpFkVLBb0GsDFsi-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org> 0 siblings, 1 reply; 39+ messages in thread From: Michal Hocko @ 2022-02-02 8:49 UTC (permalink / raw) To: Waiman Long Cc: Johannes Weiner, Vladimir Davydov, Andrew Morton, Petr Mladek, Steven Rostedt, Sergey Senozhatsky, Andy Shevchenko, Rasmus Villemoes, linux-kernel-u79uwXL29TY76Z2rM5mHXA, cgroups-u79uwXL29TY76Z2rM5mHXA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Ira Weiny, Mike Rapoport, David Rientjes, Roman Gushchin, Rafael Aquini On Tue 01-02-22 12:04:37, Waiman Long wrote: > On 2/1/22 05:54, Michal Hocko wrote: > > On Mon 31-01-22 14:23:07, Waiman Long wrote: > > > It was found that a number of offlined memcgs were not freed because > > > they were pinned by some charged pages that were present. Even "echo > > > 1 > /proc/sys/vm/drop_caches" wasn't able to free those pages. These > > > offlined but not freed memcgs tend to increase in number over time with > > > the side effect that percpu memory consumption as shown in /proc/meminfo > > > also increases over time. > > > > > > In order to find out more information about those pages that pin > > > offlined memcgs, the page_owner feature is extended to print memory > > > cgroup information especially whether the cgroup is offlined or not. > > > > > > Signed-off-by: Waiman Long <longman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> > > > Acked-by: David Rientjes <rientjes-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> > > > --- > > > mm/page_owner.c | 39 +++++++++++++++++++++++++++++++++++++++ > > > 1 file changed, 39 insertions(+) > > > > > > diff --git a/mm/page_owner.c b/mm/page_owner.c > > > index 28dac73e0542..a471c74c7fe0 100644 > > > --- a/mm/page_owner.c > > > +++ b/mm/page_owner.c > > > @@ -10,6 +10,7 @@ > > > #include <linux/migrate.h> > > > #include <linux/stackdepot.h> > > > #include <linux/seq_file.h> > > > +#include <linux/memcontrol.h> > > > #include <linux/sched/clock.h> > > > #include "internal.h" > > > @@ -325,6 +326,42 @@ void pagetypeinfo_showmixedcount_print(struct seq_file *m, > > > seq_putc(m, '\n'); > > > } > > > +#ifdef CONFIG_MEMCG > > > +/* > > > + * Looking for memcg information and print it out > > > + */ > > > +static inline void print_page_owner_memcg(char *kbuf, size_t count, int *pret, > > > + struct page *page) > > > +{ > > > + unsigned long memcg_data = READ_ONCE(page->memcg_data); > > > + struct mem_cgroup *memcg; > > > + bool onlined; > > > + char name[80]; > > > + > > > + if (!memcg_data) > > > + return; > > > + > > > + if (memcg_data & MEMCG_DATA_OBJCGS) > > > + *pret += scnprintf(kbuf + *pret, count - *pret, > > > + "Slab cache page\n"); > > > + > > > + memcg = page_memcg_check(page); > > > + if (!memcg) > > > + return; > > > + > > > + onlined = (memcg->css.flags & CSS_ONLINE); > > > + cgroup_name(memcg->css.cgroup, name, sizeof(name)); > > > + *pret += scnprintf(kbuf + *pret, count - *pret, > > > + "Charged %sto %smemcg %s\n", > > > + PageMemcgKmem(page) ? "(via objcg) " : "", > > > + onlined ? "" : "offlined ", > > > + name); > > I have asked in the previous version already but what makes the memcg > > stable (why it cannot go away and be reallocated for something else) > > while you are trying to get its name? > > The memcg is not going away as long as the page isn't freed unless if it is > indirectly connected via objcg. Of course, there can be a race between the > page is going to be freed while the page_owner information is being > displayed. Right. And that means that cgtoup_name can go off the rail and wander through memory correct? > One solution is to add a simple bit lock to each of the > page_owner structure and acquire the lock when it is being written to or > read from. I do not really see how a bit lock could prevent memcg from going away. On the other hand I think RCU read lock should be sufficient to keep the memcg from going away completely. > Anyway a lot of these debugging aids or tools don't eliminate all > the race conditions that affect the accuracy of the displayed information. I > can add a patch to eliminate this direct memcg race if you think this is > necessary. I do not mind inaccurate information. That is natural but reading through a freed memory can be really harmfull. So this really need to be sorted out. -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 39+ messages in thread
[parent not found: <YfpFkVLBb0GsDFsi-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>]
* Re: [PATCH v3 3/4] mm/page_owner: Print memcg information [not found] ` <YfpFkVLBb0GsDFsi-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org> @ 2022-02-02 16:12 ` Waiman Long 0 siblings, 0 replies; 39+ messages in thread From: Waiman Long @ 2022-02-02 16:12 UTC (permalink / raw) To: Michal Hocko Cc: Johannes Weiner, Vladimir Davydov, Andrew Morton, Petr Mladek, Steven Rostedt, Sergey Senozhatsky, Andy Shevchenko, Rasmus Villemoes, linux-kernel-u79uwXL29TY76Z2rM5mHXA, cgroups-u79uwXL29TY76Z2rM5mHXA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Ira Weiny, Mike Rapoport, David Rientjes, Roman Gushchin, Rafael Aquini On 2/2/22 03:49, Michal Hocko wrote: > On Tue 01-02-22 12:04:37, Waiman Long wrote: >> On 2/1/22 05:54, Michal Hocko wrote: >>> On Mon 31-01-22 14:23:07, Waiman Long wrote: >>>> It was found that a number of offlined memcgs were not freed because >>>> they were pinned by some charged pages that were present. Even "echo >>>> 1 > /proc/sys/vm/drop_caches" wasn't able to free those pages. These >>>> offlined but not freed memcgs tend to increase in number over time with >>>> the side effect that percpu memory consumption as shown in /proc/meminfo >>>> also increases over time. >>>> >>>> In order to find out more information about those pages that pin >>>> offlined memcgs, the page_owner feature is extended to print memory >>>> cgroup information especially whether the cgroup is offlined or not. >>>> >>>> Signed-off-by: Waiman Long <longman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> >>>> Acked-by: David Rientjes <rientjes-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> >>>> --- >>>> mm/page_owner.c | 39 +++++++++++++++++++++++++++++++++++++++ >>>> 1 file changed, 39 insertions(+) >>>> >>>> diff --git a/mm/page_owner.c b/mm/page_owner.c >>>> index 28dac73e0542..a471c74c7fe0 100644 >>>> --- a/mm/page_owner.c >>>> +++ b/mm/page_owner.c >>>> @@ -10,6 +10,7 @@ >>>> #include <linux/migrate.h> >>>> #include <linux/stackdepot.h> >>>> #include <linux/seq_file.h> >>>> +#include <linux/memcontrol.h> >>>> #include <linux/sched/clock.h> >>>> #include "internal.h" >>>> @@ -325,6 +326,42 @@ void pagetypeinfo_showmixedcount_print(struct seq_file *m, >>>> seq_putc(m, '\n'); >>>> } >>>> +#ifdef CONFIG_MEMCG >>>> +/* >>>> + * Looking for memcg information and print it out >>>> + */ >>>> +static inline void print_page_owner_memcg(char *kbuf, size_t count, int *pret, >>>> + struct page *page) >>>> +{ >>>> + unsigned long memcg_data = READ_ONCE(page->memcg_data); >>>> + struct mem_cgroup *memcg; >>>> + bool onlined; >>>> + char name[80]; >>>> + >>>> + if (!memcg_data) >>>> + return; >>>> + >>>> + if (memcg_data & MEMCG_DATA_OBJCGS) >>>> + *pret += scnprintf(kbuf + *pret, count - *pret, >>>> + "Slab cache page\n"); >>>> + >>>> + memcg = page_memcg_check(page); >>>> + if (!memcg) >>>> + return; >>>> + >>>> + onlined = (memcg->css.flags & CSS_ONLINE); >>>> + cgroup_name(memcg->css.cgroup, name, sizeof(name)); >>>> + *pret += scnprintf(kbuf + *pret, count - *pret, >>>> + "Charged %sto %smemcg %s\n", >>>> + PageMemcgKmem(page) ? "(via objcg) " : "", >>>> + onlined ? "" : "offlined ", >>>> + name); >>> I have asked in the previous version already but what makes the memcg >>> stable (why it cannot go away and be reallocated for something else) >>> while you are trying to get its name? >> The memcg is not going away as long as the page isn't freed unless if it is >> indirectly connected via objcg. Of course, there can be a race between the >> page is going to be freed while the page_owner information is being >> displayed. > Right. And that means that cgtoup_name can go off the rail and wander > through memory correct? > >> One solution is to add a simple bit lock to each of the >> page_owner structure and acquire the lock when it is being written to or >> read from. > I do not really see how a bit lock could prevent memcg from going away. > On the other hand I think RCU read lock should be sufficient to keep the > memcg from going away completely. Using rcu_read_lock() is also what I have been thinking of doing. So I will update the patch to add that for safety. > >> Anyway a lot of these debugging aids or tools don't eliminate all >> the race conditions that affect the accuracy of the displayed information. I >> can add a patch to eliminate this direct memcg race if you think this is >> necessary. > I do not mind inaccurate information. That is natural but reading > through a freed memory can be really harmfull. So this really need to be > sorted out. Thanks for the review. Cheers, Longman ^ permalink raw reply [flat|nested] 39+ messages in thread
* [PATCH v3 4/4] mm/page_owner: Record task command name [not found] ` <20220131192308.608837-1-longman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> ` (2 preceding siblings ...) 2022-01-31 19:23 ` [PATCH v3 3/4] mm/page_owner: Print memcg information Waiman Long @ 2022-01-31 19:23 ` Waiman Long [not found] ` <20220131192308.608837-5-longman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> ` (3 more replies) 3 siblings, 4 replies; 39+ messages in thread From: Waiman Long @ 2022-01-31 19:23 UTC (permalink / raw) To: Johannes Weiner, Michal Hocko, Vladimir Davydov, Andrew Morton, Petr Mladek, Steven Rostedt, Sergey Senozhatsky, Andy Shevchenko, Rasmus Villemoes Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA, cgroups-u79uwXL29TY76Z2rM5mHXA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Ira Weiny, Mike Rapoport, David Rientjes, Roman Gushchin, Rafael Aquini, Waiman Long The page_owner information currently includes the pid of the calling task. That is useful as long as the task is still running. Otherwise, the number is meaningless. To have more information about the allocating tasks that had exited by the time the page_owner information is retrieved, we need to store the command name of the task. Add a new comm field into page_owner structure to store the command name and display it when the page_owner information is retrieved. Only the first 15 characters of the command name will be copied, but that should be enough in most cases. Even for those commands with longer names, it shouldn't be hard to guess what they are. Signed-off-by: Waiman Long <longman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> --- mm/page_owner.c | 16 ++++++++++++---- 1 file changed, 12 insertions(+), 4 deletions(-) diff --git a/mm/page_owner.c b/mm/page_owner.c index a471c74c7fe0..8b2b381fd986 100644 --- a/mm/page_owner.c +++ b/mm/page_owner.c @@ -20,6 +20,7 @@ * to use off stack temporal storage */ #define PAGE_OWNER_STACK_DEPTH (16) +#define PAGE_OWNER_COMM_LEN 16 struct page_owner { unsigned short order; @@ -29,6 +30,7 @@ struct page_owner { depot_stack_handle_t free_handle; u64 ts_nsec; u64 free_ts_nsec; + char comm[PAGE_OWNER_COMM_LEN]; pid_t pid; }; @@ -146,6 +148,7 @@ void __reset_page_owner(struct page *page, unsigned short order) page_owner = get_page_owner(page_ext); page_owner->free_handle = handle; page_owner->free_ts_nsec = free_ts_nsec; + page_owner->comm[0] = '\0'; page_ext = page_ext_next(page_ext); } } @@ -165,6 +168,8 @@ static inline void __set_page_owner_handle(struct page_ext *page_ext, page_owner->last_migrate_reason = -1; page_owner->pid = current->pid; page_owner->ts_nsec = local_clock(); + strlcpy(page_owner->comm, current->comm, + sizeof(page_owner->comm)); __set_bit(PAGE_EXT_OWNER, &page_ext->flags); __set_bit(PAGE_EXT_OWNER_ALLOCATED, &page_ext->flags); @@ -232,6 +237,7 @@ void __folio_copy_owner(struct folio *newfolio, struct folio *old) new_page_owner->pid = old_page_owner->pid; new_page_owner->ts_nsec = old_page_owner->ts_nsec; new_page_owner->free_ts_nsec = old_page_owner->ts_nsec; + strcpy(new_page_owner->comm, old_page_owner->comm); /* * We don't clear the bit on the old folio as it's going to be freed @@ -376,10 +382,11 @@ print_page_owner(char __user *buf, size_t count, unsigned long pfn, return -ENOMEM; ret = scnprintf(kbuf, count, - "Page allocated via order %u, mask %#x(%pGg), pid %d, ts %llu ns, free_ts %llu ns\n", + "Page allocated via order %u, mask %#x(%pGg), pid %d (%s), ts %llu ns, free_ts %llu ns\n", page_owner->order, page_owner->gfp_mask, &page_owner->gfp_mask, page_owner->pid, - page_owner->ts_nsec, page_owner->free_ts_nsec); + page_owner->comm, page_owner->ts_nsec, + page_owner->free_ts_nsec); /* Print information relevant to grouping pages by mobility */ pageblock_mt = get_pageblock_migratetype(page); @@ -446,9 +453,10 @@ void __dump_page_owner(const struct page *page) else pr_alert("page_owner tracks the page as freed\n"); - pr_alert("page last allocated via order %u, migratetype %s, gfp_mask %#x(%pGg), pid %d, ts %llu, free_ts %llu\n", + pr_alert("page last allocated via order %u, migratetype %s, gfp_mask %#x(%pGg), pid %d (%s), ts %llu, free_ts %llu\n", page_owner->order, migratetype_names[mt], gfp_mask, &gfp_mask, - page_owner->pid, page_owner->ts_nsec, page_owner->free_ts_nsec); + page_owner->pid, page_owner->comm, page_owner->ts_nsec, + page_owner->free_ts_nsec); handle = READ_ONCE(page_owner->handle); if (!handle) -- 2.27.0 ^ permalink raw reply related [flat|nested] 39+ messages in thread
[parent not found: <20220131192308.608837-5-longman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>]
* Re: [PATCH v3 4/4] mm/page_owner: Record task command name [not found] ` <20220131192308.608837-5-longman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> @ 2022-01-31 20:54 ` Roman Gushchin [not found] ` <YfhMd5LTzTHu9zMD-cx5fftMpWqeCjSd+JxjunQ2O0Ztt9esIQQ4Iyu8u01E@public.gmane.org> 2022-02-02 20:30 ` [PATCH v4 1/4] lib/vsprintf: Avoid redundant work with 0 size Waiman Long ` (2 subsequent siblings) 3 siblings, 1 reply; 39+ messages in thread From: Roman Gushchin @ 2022-01-31 20:54 UTC (permalink / raw) To: Waiman Long Cc: Johannes Weiner, Michal Hocko, Vladimir Davydov, Andrew Morton, Petr Mladek, Steven Rostedt, Sergey Senozhatsky, Andy Shevchenko, Rasmus Villemoes, linux-kernel-u79uwXL29TY76Z2rM5mHXA, cgroups-u79uwXL29TY76Z2rM5mHXA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Ira Weiny, Mike Rapoport, David Rientjes, Rafael Aquini On Mon, Jan 31, 2022 at 02:23:08PM -0500, Waiman Long wrote: > The page_owner information currently includes the pid of the calling > task. That is useful as long as the task is still running. Otherwise, > the number is meaningless. To have more information about the allocating > tasks that had exited by the time the page_owner information is > retrieved, we need to store the command name of the task. > > Add a new comm field into page_owner structure to store the command name > and display it when the page_owner information is retrieved. Only the > first 15 characters of the command name will be copied, but that should > be enough in most cases. Even for those commands with longer names, > it shouldn't be hard to guess what they are. > > Signed-off-by: Waiman Long <longman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> > --- > mm/page_owner.c | 16 ++++++++++++---- > 1 file changed, 12 insertions(+), 4 deletions(-) > > diff --git a/mm/page_owner.c b/mm/page_owner.c > index a471c74c7fe0..8b2b381fd986 100644 > --- a/mm/page_owner.c > +++ b/mm/page_owner.c > @@ -20,6 +20,7 @@ > * to use off stack temporal storage > */ > #define PAGE_OWNER_STACK_DEPTH (16) > +#define PAGE_OWNER_COMM_LEN 16 Not sure I understand why not simply use TASK_COMM_LEN ? ^ permalink raw reply [flat|nested] 39+ messages in thread
[parent not found: <YfhMd5LTzTHu9zMD-cx5fftMpWqeCjSd+JxjunQ2O0Ztt9esIQQ4Iyu8u01E@public.gmane.org>]
* Re: [PATCH v3 4/4] mm/page_owner: Record task command name [not found] ` <YfhMd5LTzTHu9zMD-cx5fftMpWqeCjSd+JxjunQ2O0Ztt9esIQQ4Iyu8u01E@public.gmane.org> @ 2022-01-31 21:46 ` Waiman Long 0 siblings, 0 replies; 39+ messages in thread From: Waiman Long @ 2022-01-31 21:46 UTC (permalink / raw) To: Roman Gushchin Cc: Johannes Weiner, Michal Hocko, Vladimir Davydov, Andrew Morton, Petr Mladek, Steven Rostedt, Sergey Senozhatsky, Andy Shevchenko, Rasmus Villemoes, linux-kernel-u79uwXL29TY76Z2rM5mHXA, cgroups-u79uwXL29TY76Z2rM5mHXA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Ira Weiny, Mike Rapoport, David Rientjes, Rafael Aquini On 1/31/22 15:54, Roman Gushchin wrote: > On Mon, Jan 31, 2022 at 02:23:08PM -0500, Waiman Long wrote: >> The page_owner information currently includes the pid of the calling >> task. That is useful as long as the task is still running. Otherwise, >> the number is meaningless. To have more information about the allocating >> tasks that had exited by the time the page_owner information is >> retrieved, we need to store the command name of the task. >> >> Add a new comm field into page_owner structure to store the command name >> and display it when the page_owner information is retrieved. Only the >> first 15 characters of the command name will be copied, but that should >> be enough in most cases. Even for those commands with longer names, >> it shouldn't be hard to guess what they are. >> >> Signed-off-by: Waiman Long <longman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> >> --- >> mm/page_owner.c | 16 ++++++++++++---- >> 1 file changed, 12 insertions(+), 4 deletions(-) >> >> diff --git a/mm/page_owner.c b/mm/page_owner.c >> index a471c74c7fe0..8b2b381fd986 100644 >> --- a/mm/page_owner.c >> +++ b/mm/page_owner.c >> @@ -20,6 +20,7 @@ >> * to use off stack temporal storage >> */ >> #define PAGE_OWNER_STACK_DEPTH (16) >> +#define PAGE_OWNER_COMM_LEN 16 > Not sure I understand why not simply use TASK_COMM_LEN ? Yes, you are right. I thought TASK_COMM_LEN is larger than 16 without actually checking it. Will fix that. Cheers, Longman ^ permalink raw reply [flat|nested] 39+ messages in thread
* [PATCH v4 1/4] lib/vsprintf: Avoid redundant work with 0 size [not found] ` <20220131192308.608837-5-longman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> 2022-01-31 20:54 ` Roman Gushchin @ 2022-02-02 20:30 ` Waiman Long 2022-02-08 10:08 ` Petr Mladek 2022-02-02 20:30 ` [PATCH v4 3/4] mm/page_owner: Print memcg information Waiman Long 2022-02-02 20:30 ` [PATCH v4 4/4] mm/page_owner: Record task command name Waiman Long 3 siblings, 1 reply; 39+ messages in thread From: Waiman Long @ 2022-02-02 20:30 UTC (permalink / raw) To: Johannes Weiner, Michal Hocko, Vladimir Davydov, Andrew Morton, Petr Mladek, Steven Rostedt, Sergey Senozhatsky, Andy Shevchenko, Rasmus Villemoes Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA, cgroups-u79uwXL29TY76Z2rM5mHXA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Ira Weiny, Mike Rapoport, David Rientjes, Roman Gushchin, Rafael Aquini, Waiman Long For *scnprintf(), vsnprintf() is always called even if the input size is 0. That is a waste of time, so just return 0 in this case. Note that vsnprintf() will never return -1 to indicate an error. So skipping the call to vsnprintf() when size is 0 will have no functional impact at all. Signed-off-by: Waiman Long <longman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> Acked-by: David Rientjes <rientjes-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> Reviewed-by: Sergey Senozhatsky <senozhatsky-F7+t8E8rja9g9hUCZPvPmw@public.gmane.org> Acked-by: Roman Gushchin <guro-b10kYP2dOMg@public.gmane.org> --- lib/vsprintf.c | 8 +++++--- 1 file changed, 5 insertions(+), 3 deletions(-) diff --git a/lib/vsprintf.c b/lib/vsprintf.c index 3b8129dd374c..d419154b47bb 100644 --- a/lib/vsprintf.c +++ b/lib/vsprintf.c @@ -2895,13 +2895,15 @@ int vscnprintf(char *buf, size_t size, const char *fmt, va_list args) { int i; + if (unlikely(!size)) + return 0; + i = vsnprintf(buf, size, fmt, args); if (likely(i < size)) return i; - if (size != 0) - return size - 1; - return 0; + + return size - 1; } EXPORT_SYMBOL(vscnprintf); -- 2.27.0 ^ permalink raw reply related [flat|nested] 39+ messages in thread
* Re: [PATCH v4 1/4] lib/vsprintf: Avoid redundant work with 0 size 2022-02-02 20:30 ` [PATCH v4 1/4] lib/vsprintf: Avoid redundant work with 0 size Waiman Long @ 2022-02-08 10:08 ` Petr Mladek 0 siblings, 0 replies; 39+ messages in thread From: Petr Mladek @ 2022-02-08 10:08 UTC (permalink / raw) To: Waiman Long Cc: Johannes Weiner, Michal Hocko, Vladimir Davydov, Andrew Morton, Steven Rostedt, Sergey Senozhatsky, Andy Shevchenko, Rasmus Villemoes, linux-kernel, cgroups, linux-mm, Ira Weiny, Mike Rapoport, David Rientjes, Roman Gushchin, Rafael Aquini On Wed 2022-02-02 15:30:33, Waiman Long wrote: > For *scnprintf(), vsnprintf() is always called even if the input size is > 0. That is a waste of time, so just return 0 in this case. > > Note that vsnprintf() will never return -1 to indicate an error. So > skipping the call to vsnprintf() when size is 0 will have no functional > impact at all. > > Signed-off-by: Waiman Long <longman@redhat.com> > Acked-by: David Rientjes <rientjes@google.com> > Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org> > Acked-by: Roman Gushchin <guro@fb.com> Reviewed-by: Petr Mladek <pmladek@suse.com> Best Regards, Petr ^ permalink raw reply [flat|nested] 39+ messages in thread
* [PATCH v4 3/4] mm/page_owner: Print memcg information [not found] ` <20220131192308.608837-5-longman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> 2022-01-31 20:54 ` Roman Gushchin 2022-02-02 20:30 ` [PATCH v4 1/4] lib/vsprintf: Avoid redundant work with 0 size Waiman Long @ 2022-02-02 20:30 ` Waiman Long [not found] ` <20220202203036.744010-4-longman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> 2022-02-02 20:30 ` [PATCH v4 4/4] mm/page_owner: Record task command name Waiman Long 3 siblings, 1 reply; 39+ messages in thread From: Waiman Long @ 2022-02-02 20:30 UTC (permalink / raw) To: Johannes Weiner, Michal Hocko, Vladimir Davydov, Andrew Morton, Petr Mladek, Steven Rostedt, Sergey Senozhatsky, Andy Shevchenko, Rasmus Villemoes Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA, cgroups-u79uwXL29TY76Z2rM5mHXA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Ira Weiny, Mike Rapoport, David Rientjes, Roman Gushchin, Rafael Aquini, Waiman Long It was found that a number of offline memcgs were not freed because they were pinned by some charged pages that were present. Even "echo 1 > /proc/sys/vm/drop_caches" wasn't able to free those pages. These offline but not freed memcgs tend to increase in number over time with the side effect that percpu memory consumption as shown in /proc/meminfo also increases over time. In order to find out more information about those pages that pin offline memcgs, the page_owner feature is extended to print memory cgroup information especially whether the cgroup is offline or not. RCU read lock is taken when memcg is being accessed to make sure that it won't be freed. Signed-off-by: Waiman Long <longman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> Acked-by: David Rientjes <rientjes-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> Acked-by: Roman Gushchin <guro-b10kYP2dOMg@public.gmane.org> --- mm/page_owner.c | 42 ++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 42 insertions(+) diff --git a/mm/page_owner.c b/mm/page_owner.c index 28dac73e0542..f7820357e4d4 100644 --- a/mm/page_owner.c +++ b/mm/page_owner.c @@ -10,6 +10,7 @@ #include <linux/migrate.h> #include <linux/stackdepot.h> #include <linux/seq_file.h> +#include <linux/memcontrol.h> #include <linux/sched/clock.h> #include "internal.h" @@ -325,6 +326,45 @@ void pagetypeinfo_showmixedcount_print(struct seq_file *m, seq_putc(m, '\n'); } +/* + * Looking for memcg information and print it out + */ +static inline int print_page_owner_memcg(char *kbuf, size_t count, int ret, + struct page *page) +{ +#ifdef CONFIG_MEMCG + unsigned long memcg_data; + struct mem_cgroup *memcg; + bool online; + char name[80]; + + rcu_read_lock(); + memcg_data = READ_ONCE(page->memcg_data); + if (!memcg_data) + goto out_unlock; + + if (memcg_data & MEMCG_DATA_OBJCGS) + ret += scnprintf(kbuf + ret, count - ret, + "Slab cache page\n"); + + memcg = page_memcg_check(page); + if (!memcg) + goto out_unlock; + + online = (memcg->css.flags & CSS_ONLINE); + cgroup_name(memcg->css.cgroup, name, sizeof(name)); + ret += scnprintf(kbuf + ret, count - ret, + "Charged %sto %smemcg %s\n", + PageMemcgKmem(page) ? "(via objcg) " : "", + online ? "" : "offline ", + name); +out_unlock: + rcu_read_unlock(); +#endif /* CONFIG_MEMCG */ + + return ret; +} + static ssize_t print_page_owner(char __user *buf, size_t count, unsigned long pfn, struct page *page, struct page_owner *page_owner, @@ -365,6 +405,8 @@ print_page_owner(char __user *buf, size_t count, unsigned long pfn, migrate_reason_names[page_owner->last_migrate_reason]); } + ret = print_page_owner_memcg(kbuf, count, ret, page); + ret += snprintf(kbuf + ret, count - ret, "\n"); if (ret >= count) goto err; -- 2.27.0 ^ permalink raw reply related [flat|nested] 39+ messages in thread
[parent not found: <20220202203036.744010-4-longman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>]
* Re: [PATCH v4 3/4] mm/page_owner: Print memcg information [not found] ` <20220202203036.744010-4-longman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> @ 2022-02-03 6:53 ` Mike Rapoport 2022-02-03 12:46 ` Michal Hocko 1 sibling, 0 replies; 39+ messages in thread From: Mike Rapoport @ 2022-02-03 6:53 UTC (permalink / raw) To: Waiman Long Cc: Johannes Weiner, Michal Hocko, Vladimir Davydov, Andrew Morton, Petr Mladek, Steven Rostedt, Sergey Senozhatsky, Andy Shevchenko, Rasmus Villemoes, linux-kernel-u79uwXL29TY76Z2rM5mHXA, cgroups-u79uwXL29TY76Z2rM5mHXA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Ira Weiny, David Rientjes, Roman Gushchin, Rafael Aquini On Wed, Feb 02, 2022 at 03:30:35PM -0500, Waiman Long wrote: > It was found that a number of offline memcgs were not freed because > they were pinned by some charged pages that were present. Even "echo 1 > > /proc/sys/vm/drop_caches" wasn't able to free those pages. These offline > but not freed memcgs tend to increase in number over time with the side > effect that percpu memory consumption as shown in /proc/meminfo also > increases over time. > > In order to find out more information about those pages that pin > offline memcgs, the page_owner feature is extended to print memory > cgroup information especially whether the cgroup is offline or not. > RCU read lock is taken when memcg is being accessed to make sure > that it won't be freed. > > Signed-off-by: Waiman Long <longman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> > Acked-by: David Rientjes <rientjes-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> > Acked-by: Roman Gushchin <guro-b10kYP2dOMg@public.gmane.org> Acked-by: Mike Rapoport <rppt-tEXmvtCZX7AybS5Ee8rs3A@public.gmane.org> And my akcs for the first two patches are missing somehow in v4... > --- > mm/page_owner.c | 42 ++++++++++++++++++++++++++++++++++++++++++ > 1 file changed, 42 insertions(+) > > diff --git a/mm/page_owner.c b/mm/page_owner.c > index 28dac73e0542..f7820357e4d4 100644 > --- a/mm/page_owner.c > +++ b/mm/page_owner.c > @@ -10,6 +10,7 @@ > #include <linux/migrate.h> > #include <linux/stackdepot.h> > #include <linux/seq_file.h> > +#include <linux/memcontrol.h> > #include <linux/sched/clock.h> > > #include "internal.h" > @@ -325,6 +326,45 @@ void pagetypeinfo_showmixedcount_print(struct seq_file *m, > seq_putc(m, '\n'); > } > > +/* > + * Looking for memcg information and print it out > + */ > +static inline int print_page_owner_memcg(char *kbuf, size_t count, int ret, > + struct page *page) > +{ > +#ifdef CONFIG_MEMCG > + unsigned long memcg_data; > + struct mem_cgroup *memcg; > + bool online; > + char name[80]; > + > + rcu_read_lock(); > + memcg_data = READ_ONCE(page->memcg_data); > + if (!memcg_data) > + goto out_unlock; > + > + if (memcg_data & MEMCG_DATA_OBJCGS) > + ret += scnprintf(kbuf + ret, count - ret, > + "Slab cache page\n"); > + > + memcg = page_memcg_check(page); > + if (!memcg) > + goto out_unlock; > + > + online = (memcg->css.flags & CSS_ONLINE); > + cgroup_name(memcg->css.cgroup, name, sizeof(name)); > + ret += scnprintf(kbuf + ret, count - ret, > + "Charged %sto %smemcg %s\n", > + PageMemcgKmem(page) ? "(via objcg) " : "", > + online ? "" : "offline ", > + name); > +out_unlock: > + rcu_read_unlock(); > +#endif /* CONFIG_MEMCG */ > + > + return ret; > +} > + > static ssize_t > print_page_owner(char __user *buf, size_t count, unsigned long pfn, > struct page *page, struct page_owner *page_owner, > @@ -365,6 +405,8 @@ print_page_owner(char __user *buf, size_t count, unsigned long pfn, > migrate_reason_names[page_owner->last_migrate_reason]); > } > > + ret = print_page_owner_memcg(kbuf, count, ret, page); > + > ret += snprintf(kbuf + ret, count - ret, "\n"); > if (ret >= count) > goto err; > -- > 2.27.0 > > -- Sincerely yours, Mike. ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCH v4 3/4] mm/page_owner: Print memcg information [not found] ` <20220202203036.744010-4-longman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> 2022-02-03 6:53 ` Mike Rapoport @ 2022-02-03 12:46 ` Michal Hocko [not found] ` <YfvOp5VXrxy9IW1w-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org> 1 sibling, 1 reply; 39+ messages in thread From: Michal Hocko @ 2022-02-03 12:46 UTC (permalink / raw) To: Waiman Long Cc: Johannes Weiner, Vladimir Davydov, Andrew Morton, Petr Mladek, Steven Rostedt, Sergey Senozhatsky, Andy Shevchenko, Rasmus Villemoes, linux-kernel-u79uwXL29TY76Z2rM5mHXA, cgroups-u79uwXL29TY76Z2rM5mHXA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Ira Weiny, Mike Rapoport, David Rientjes, Roman Gushchin, Rafael Aquini On Wed 02-02-22 15:30:35, Waiman Long wrote: [...] > +#ifdef CONFIG_MEMCG > + unsigned long memcg_data; > + struct mem_cgroup *memcg; > + bool online; > + char name[80]; > + > + rcu_read_lock(); > + memcg_data = READ_ONCE(page->memcg_data); > + if (!memcg_data) > + goto out_unlock; > + > + if (memcg_data & MEMCG_DATA_OBJCGS) > + ret += scnprintf(kbuf + ret, count - ret, > + "Slab cache page\n"); > + > + memcg = page_memcg_check(page); > + if (!memcg) > + goto out_unlock; > + > + online = (memcg->css.flags & CSS_ONLINE); > + cgroup_name(memcg->css.cgroup, name, sizeof(name)); Is there any specific reason to use another buffer allocated on the stack? Also 80B seems too short to cover NAME_MAX. Nothing else jumped at me. -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 39+ messages in thread
[parent not found: <YfvOp5VXrxy9IW1w-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>]
* Re: [PATCH v4 3/4] mm/page_owner: Print memcg information [not found] ` <YfvOp5VXrxy9IW1w-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org> @ 2022-02-03 19:03 ` Waiman Long 2022-02-07 17:20 ` Michal Hocko 0 siblings, 1 reply; 39+ messages in thread From: Waiman Long @ 2022-02-03 19:03 UTC (permalink / raw) To: Michal Hocko Cc: Johannes Weiner, Vladimir Davydov, Andrew Morton, Petr Mladek, Steven Rostedt, Sergey Senozhatsky, Andy Shevchenko, Rasmus Villemoes, linux-kernel-u79uwXL29TY76Z2rM5mHXA, cgroups-u79uwXL29TY76Z2rM5mHXA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Ira Weiny, Mike Rapoport, David Rientjes, Roman Gushchin, Rafael Aquini On 2/3/22 07:46, Michal Hocko wrote: > On Wed 02-02-22 15:30:35, Waiman Long wrote: > [...] >> +#ifdef CONFIG_MEMCG >> + unsigned long memcg_data; >> + struct mem_cgroup *memcg; >> + bool online; >> + char name[80]; >> + >> + rcu_read_lock(); >> + memcg_data = READ_ONCE(page->memcg_data); >> + if (!memcg_data) >> + goto out_unlock; >> + >> + if (memcg_data & MEMCG_DATA_OBJCGS) >> + ret += scnprintf(kbuf + ret, count - ret, >> + "Slab cache page\n"); >> + >> + memcg = page_memcg_check(page); >> + if (!memcg) >> + goto out_unlock; >> + >> + online = (memcg->css.flags & CSS_ONLINE); >> + cgroup_name(memcg->css.cgroup, name, sizeof(name)); > Is there any specific reason to use another buffer allocated on the > stack? Also 80B seems too short to cover NAME_MAX. > > Nothing else jumped at me. I suppose we can print directly into kbuf with cgroup_name(), but using a separate buffer is easier to read and understand. 79 characters should be enough for most cgroup names. Some auto-generated names with some kind of embedded uuids may be longer than that, but the random sequence of hex digits that may be missing do not convey much information for identification purpose. We can always increase the buffer length later if it turns out to be an issue. Cheers, Longman ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCH v4 3/4] mm/page_owner: Print memcg information 2022-02-03 19:03 ` Waiman Long @ 2022-02-07 17:20 ` Michal Hocko [not found] ` <YgFUxFI5bMbc42j4-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org> 0 siblings, 1 reply; 39+ messages in thread From: Michal Hocko @ 2022-02-07 17:20 UTC (permalink / raw) To: Waiman Long Cc: Johannes Weiner, Vladimir Davydov, Andrew Morton, Petr Mladek, Steven Rostedt, Sergey Senozhatsky, Andy Shevchenko, Rasmus Villemoes, linux-kernel, cgroups, linux-mm, Ira Weiny, Mike Rapoport, David Rientjes, Roman Gushchin, Rafael Aquini On Thu 03-02-22 14:03:58, Waiman Long wrote: > On 2/3/22 07:46, Michal Hocko wrote: > > On Wed 02-02-22 15:30:35, Waiman Long wrote: > > [...] > > > +#ifdef CONFIG_MEMCG > > > + unsigned long memcg_data; > > > + struct mem_cgroup *memcg; > > > + bool online; > > > + char name[80]; > > > + > > > + rcu_read_lock(); > > > + memcg_data = READ_ONCE(page->memcg_data); > > > + if (!memcg_data) > > > + goto out_unlock; > > > + > > > + if (memcg_data & MEMCG_DATA_OBJCGS) > > > + ret += scnprintf(kbuf + ret, count - ret, > > > + "Slab cache page\n"); > > > + > > > + memcg = page_memcg_check(page); > > > + if (!memcg) > > > + goto out_unlock; > > > + > > > + online = (memcg->css.flags & CSS_ONLINE); > > > + cgroup_name(memcg->css.cgroup, name, sizeof(name)); > > Is there any specific reason to use another buffer allocated on the > > stack? Also 80B seems too short to cover NAME_MAX. > > > > Nothing else jumped at me. > > I suppose we can print directly into kbuf with cgroup_name(), but using a > separate buffer is easier to read and understand. 79 characters should be > enough for most cgroup names. Some auto-generated names with some kind of > embedded uuids may be longer than that, but the random sequence of hex > digits that may be missing do not convey much information for identification > purpose. We can always increase the buffer length later if it turns out to > be an issue. Cutting a name short sounds like a source of confusion and there doesn't seem to be any good reason for that. -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 39+ messages in thread
[parent not found: <YgFUxFI5bMbc42j4-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>]
* Re: [PATCH v4 3/4] mm/page_owner: Print memcg information [not found] ` <YgFUxFI5bMbc42j4-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org> @ 2022-02-07 19:09 ` Andrew Morton [not found] ` <20220207110947.f07b58898d91c02090f9aacf-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org> 0 siblings, 1 reply; 39+ messages in thread From: Andrew Morton @ 2022-02-07 19:09 UTC (permalink / raw) To: Michal Hocko Cc: Waiman Long, Johannes Weiner, Vladimir Davydov, Petr Mladek, Steven Rostedt, Sergey Senozhatsky, Andy Shevchenko, Rasmus Villemoes, linux-kernel-u79uwXL29TY76Z2rM5mHXA, cgroups-u79uwXL29TY76Z2rM5mHXA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Ira Weiny, Mike Rapoport, David Rientjes, Roman Gushchin, Rafael Aquini On Mon, 7 Feb 2022 18:20:04 +0100 Michal Hocko <mhocko-IBi9RG/b67k@public.gmane.org> wrote: > On Thu 03-02-22 14:03:58, Waiman Long wrote: > > On 2/3/22 07:46, Michal Hocko wrote: > > > On Wed 02-02-22 15:30:35, Waiman Long wrote: > > > [...] > ... > > > > + online = (memcg->css.flags & CSS_ONLINE); > > > > + cgroup_name(memcg->css.cgroup, name, sizeof(name)); > > > Is there any specific reason to use another buffer allocated on the > > > stack? Also 80B seems too short to cover NAME_MAX. > > > > > > Nothing else jumped at me. > > > > I suppose we can print directly into kbuf with cgroup_name(), but using a > > separate buffer is easier to read and understand. 79 characters should be > > enough for most cgroup names. Some auto-generated names with some kind of > > embedded uuids may be longer than that, but the random sequence of hex > > digits that may be missing do not convey much information for identification > > purpose. We can always increase the buffer length later if it turns out to > > be an issue. > > Cutting a name short sounds like a source of confusion and there doesn't > seem to be any good reason for that. Yes. If we give them 79 characters, someone will go and want 94. If we can prevent this once and for ever, let's please do so. ^ permalink raw reply [flat|nested] 39+ messages in thread
[parent not found: <20220207110947.f07b58898d91c02090f9aacf-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org>]
* Re: [PATCH v4 3/4] mm/page_owner: Print memcg information [not found] ` <20220207110947.f07b58898d91c02090f9aacf-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org> @ 2022-02-07 19:33 ` Waiman Long 0 siblings, 0 replies; 39+ messages in thread From: Waiman Long @ 2022-02-07 19:33 UTC (permalink / raw) To: Andrew Morton, Michal Hocko Cc: Johannes Weiner, Vladimir Davydov, Petr Mladek, Steven Rostedt, Sergey Senozhatsky, Andy Shevchenko, Rasmus Villemoes, linux-kernel-u79uwXL29TY76Z2rM5mHXA, cgroups-u79uwXL29TY76Z2rM5mHXA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Ira Weiny, Mike Rapoport, David Rientjes, Roman Gushchin, Rafael Aquini On 2/7/22 14:09, Andrew Morton wrote: > On Mon, 7 Feb 2022 18:20:04 +0100 Michal Hocko <mhocko-IBi9RG/b67k@public.gmane.org> wrote: > >> On Thu 03-02-22 14:03:58, Waiman Long wrote: >>> On 2/3/22 07:46, Michal Hocko wrote: >>>> On Wed 02-02-22 15:30:35, Waiman Long wrote: >>>> [...] >> ... >>>>> + online = (memcg->css.flags & CSS_ONLINE); >>>>> + cgroup_name(memcg->css.cgroup, name, sizeof(name)); >>>> Is there any specific reason to use another buffer allocated on the >>>> stack? Also 80B seems too short to cover NAME_MAX. >>>> >>>> Nothing else jumped at me. >>> I suppose we can print directly into kbuf with cgroup_name(), but using a >>> separate buffer is easier to read and understand. 79 characters should be >>> enough for most cgroup names. Some auto-generated names with some kind of >>> embedded uuids may be longer than that, but the random sequence of hex >>> digits that may be missing do not convey much information for identification >>> purpose. We can always increase the buffer length later if it turns out to >>> be an issue. >> Cutting a name short sounds like a source of confusion and there doesn't >> seem to be any good reason for that. > Yes. If we give them 79 characters, someone will go and want 94. If > we can prevent this once and for ever, let's please do so. Sure. Will send a version with that change. Cheers, Longman > ^ permalink raw reply [flat|nested] 39+ messages in thread
* [PATCH v4 4/4] mm/page_owner: Record task command name [not found] ` <20220131192308.608837-5-longman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> ` (2 preceding siblings ...) 2022-02-02 20:30 ` [PATCH v4 3/4] mm/page_owner: Print memcg information Waiman Long @ 2022-02-02 20:30 ` Waiman Long 3 siblings, 0 replies; 39+ messages in thread From: Waiman Long @ 2022-02-02 20:30 UTC (permalink / raw) To: Johannes Weiner, Michal Hocko, Vladimir Davydov, Andrew Morton, Petr Mladek, Steven Rostedt, Sergey Senozhatsky, Andy Shevchenko, Rasmus Villemoes Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA, cgroups-u79uwXL29TY76Z2rM5mHXA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Ira Weiny, Mike Rapoport, David Rientjes, Roman Gushchin, Rafael Aquini, Waiman Long The page_owner information currently includes the pid of the calling task. That is useful as long as the task is still running. Otherwise, the number is meaningless. To have more information about the allocating tasks that had exited by the time the page_owner information is retrieved, we need to store the command name of the task. Add a new comm field into page_owner structure to store the command name and display it when the page_owner information is retrieved. Signed-off-by: Waiman Long <longman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> --- mm/page_owner.c | 14 ++++++++++---- 1 file changed, 10 insertions(+), 4 deletions(-) diff --git a/mm/page_owner.c b/mm/page_owner.c index f7820357e4d4..d56afa9c792e 100644 --- a/mm/page_owner.c +++ b/mm/page_owner.c @@ -29,6 +29,7 @@ struct page_owner { depot_stack_handle_t free_handle; u64 ts_nsec; u64 free_ts_nsec; + char comm[TASK_COMM_LEN]; pid_t pid; }; @@ -165,6 +166,8 @@ static inline void __set_page_owner_handle(struct page_ext *page_ext, page_owner->last_migrate_reason = -1; page_owner->pid = current->pid; page_owner->ts_nsec = local_clock(); + strlcpy(page_owner->comm, current->comm, + sizeof(page_owner->comm)); __set_bit(PAGE_EXT_OWNER, &page_ext->flags); __set_bit(PAGE_EXT_OWNER_ALLOCATED, &page_ext->flags); @@ -232,6 +235,7 @@ void __folio_copy_owner(struct folio *newfolio, struct folio *old) new_page_owner->pid = old_page_owner->pid; new_page_owner->ts_nsec = old_page_owner->ts_nsec; new_page_owner->free_ts_nsec = old_page_owner->ts_nsec; + strcpy(new_page_owner->comm, old_page_owner->comm); /* * We don't clear the bit on the old folio as it's going to be freed @@ -379,10 +383,11 @@ print_page_owner(char __user *buf, size_t count, unsigned long pfn, return -ENOMEM; ret = scnprintf(kbuf, count, - "Page allocated via order %u, mask %#x(%pGg), pid %d, ts %llu ns, free_ts %llu ns\n", + "Page allocated via order %u, mask %#x(%pGg), pid %d (%s), ts %llu ns, free_ts %llu ns\n", page_owner->order, page_owner->gfp_mask, &page_owner->gfp_mask, page_owner->pid, - page_owner->ts_nsec, page_owner->free_ts_nsec); + page_owner->comm, page_owner->ts_nsec, + page_owner->free_ts_nsec); /* Print information relevant to grouping pages by mobility */ pageblock_mt = get_pageblock_migratetype(page); @@ -449,9 +454,10 @@ void __dump_page_owner(const struct page *page) else pr_alert("page_owner tracks the page as freed\n"); - pr_alert("page last allocated via order %u, migratetype %s, gfp_mask %#x(%pGg), pid %d, ts %llu, free_ts %llu\n", + pr_alert("page last allocated via order %u, migratetype %s, gfp_mask %#x(%pGg), pid %d (%s), ts %llu, free_ts %llu\n", page_owner->order, migratetype_names[mt], gfp_mask, &gfp_mask, - page_owner->pid, page_owner->ts_nsec, page_owner->free_ts_nsec); + page_owner->pid, page_owner->comm, page_owner->ts_nsec, + page_owner->free_ts_nsec); handle = READ_ONCE(page_owner->handle); if (!handle) -- 2.27.0 ^ permalink raw reply related [flat|nested] 39+ messages in thread
* [PATCH v4 4/4] mm/page_owner: Record task command name 2022-01-31 19:23 ` [PATCH v3 4/4] mm/page_owner: Record task command name Waiman Long [not found] ` <20220131192308.608837-5-longman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> @ 2022-01-31 22:03 ` Waiman Long [not found] ` <20220131220328.622162-1-longman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> 2022-02-02 20:30 ` [PATCH v4 0/4] mm/page_owner: Extend page_owner to show memcg information Waiman Long 2022-02-02 20:30 ` [PATCH v4 2/4] mm/page_owner: Use scnprintf() to avoid excessive buffer overrun check Waiman Long 3 siblings, 1 reply; 39+ messages in thread From: Waiman Long @ 2022-01-31 22:03 UTC (permalink / raw) To: Johannes Weiner, Michal Hocko, Vladimir Davydov, Andrew Morton, Petr Mladek, Steven Rostedt, Sergey Senozhatsky, Andy Shevchenko, Rasmus Villemoes Cc: linux-kernel, cgroups, linux-mm, Ira Weiny, Mike Rapoport, David Rientjes, Roman Gushchin, Rafael Aquini, Waiman Long The page_owner information currently includes the pid of the calling task. That is useful as long as the task is still running. Otherwise, the number is meaningless. To have more information about the allocating tasks that had exited by the time the page_owner information is retrieved, we need to store the command name of the task. Add a new comm field into page_owner structure to store the command name and display it when the page_owner information is retrieved. Signed-off-by: Waiman Long <longman@redhat.com> --- mm/page_owner.c | 15 +++++++++++---- 1 file changed, 11 insertions(+), 4 deletions(-) diff --git a/mm/page_owner.c b/mm/page_owner.c index a471c74c7fe0..485542155483 100644 --- a/mm/page_owner.c +++ b/mm/page_owner.c @@ -29,6 +29,7 @@ struct page_owner { depot_stack_handle_t free_handle; u64 ts_nsec; u64 free_ts_nsec; + char comm[TASK_COMM_LEN]; pid_t pid; }; @@ -146,6 +147,7 @@ void __reset_page_owner(struct page *page, unsigned short order) page_owner = get_page_owner(page_ext); page_owner->free_handle = handle; page_owner->free_ts_nsec = free_ts_nsec; + page_owner->comm[0] = '\0'; page_ext = page_ext_next(page_ext); } } @@ -165,6 +167,8 @@ static inline void __set_page_owner_handle(struct page_ext *page_ext, page_owner->last_migrate_reason = -1; page_owner->pid = current->pid; page_owner->ts_nsec = local_clock(); + strlcpy(page_owner->comm, current->comm, + sizeof(page_owner->comm)); __set_bit(PAGE_EXT_OWNER, &page_ext->flags); __set_bit(PAGE_EXT_OWNER_ALLOCATED, &page_ext->flags); @@ -232,6 +236,7 @@ void __folio_copy_owner(struct folio *newfolio, struct folio *old) new_page_owner->pid = old_page_owner->pid; new_page_owner->ts_nsec = old_page_owner->ts_nsec; new_page_owner->free_ts_nsec = old_page_owner->ts_nsec; + strcpy(new_page_owner->comm, old_page_owner->comm); /* * We don't clear the bit on the old folio as it's going to be freed @@ -376,10 +381,11 @@ print_page_owner(char __user *buf, size_t count, unsigned long pfn, return -ENOMEM; ret = scnprintf(kbuf, count, - "Page allocated via order %u, mask %#x(%pGg), pid %d, ts %llu ns, free_ts %llu ns\n", + "Page allocated via order %u, mask %#x(%pGg), pid %d (%s), ts %llu ns, free_ts %llu ns\n", page_owner->order, page_owner->gfp_mask, &page_owner->gfp_mask, page_owner->pid, - page_owner->ts_nsec, page_owner->free_ts_nsec); + page_owner->comm, page_owner->ts_nsec, + page_owner->free_ts_nsec); /* Print information relevant to grouping pages by mobility */ pageblock_mt = get_pageblock_migratetype(page); @@ -446,9 +452,10 @@ void __dump_page_owner(const struct page *page) else pr_alert("page_owner tracks the page as freed\n"); - pr_alert("page last allocated via order %u, migratetype %s, gfp_mask %#x(%pGg), pid %d, ts %llu, free_ts %llu\n", + pr_alert("page last allocated via order %u, migratetype %s, gfp_mask %#x(%pGg), pid %d (%s), ts %llu, free_ts %llu\n", page_owner->order, migratetype_names[mt], gfp_mask, &gfp_mask, - page_owner->pid, page_owner->ts_nsec, page_owner->free_ts_nsec); + page_owner->pid, page_owner->comm, page_owner->ts_nsec, + page_owner->free_ts_nsec); handle = READ_ONCE(page_owner->handle); if (!handle) -- 2.27.0 ^ permalink raw reply related [flat|nested] 39+ messages in thread
[parent not found: <20220131220328.622162-1-longman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>]
* Re: [PATCH v4 4/4] mm/page_owner: Record task command name [not found] ` <20220131220328.622162-1-longman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> @ 2022-02-01 15:28 ` Michal Hocko [not found] ` <YflRjeoC0jbzArDG-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org> 0 siblings, 1 reply; 39+ messages in thread From: Michal Hocko @ 2022-02-01 15:28 UTC (permalink / raw) To: Waiman Long Cc: Johannes Weiner, Vladimir Davydov, Andrew Morton, Petr Mladek, Steven Rostedt, Sergey Senozhatsky, Andy Shevchenko, Rasmus Villemoes, linux-kernel-u79uwXL29TY76Z2rM5mHXA, cgroups-u79uwXL29TY76Z2rM5mHXA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Ira Weiny, Mike Rapoport, David Rientjes, Roman Gushchin, Rafael Aquini, Vlastimil Babka Cc Vlastimil On Mon 31-01-22 17:03:28, Waiman Long wrote: > The page_owner information currently includes the pid of the calling > task. That is useful as long as the task is still running. Otherwise, > the number is meaningless. To have more information about the allocating > tasks that had exited by the time the page_owner information is > retrieved, we need to store the command name of the task. > > Add a new comm field into page_owner structure to store the command name > and display it when the page_owner information is retrieved. I completely agree that pid is effectivelly useless (if not misleading) but is comm really telling all that much to compensate for the additional storage required for _each_ page in the system? > > Signed-off-by: Waiman Long <longman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> > --- > mm/page_owner.c | 15 +++++++++++---- > 1 file changed, 11 insertions(+), 4 deletions(-) > > diff --git a/mm/page_owner.c b/mm/page_owner.c > index a471c74c7fe0..485542155483 100644 > --- a/mm/page_owner.c > +++ b/mm/page_owner.c > @@ -29,6 +29,7 @@ struct page_owner { > depot_stack_handle_t free_handle; > u64 ts_nsec; > u64 free_ts_nsec; > + char comm[TASK_COMM_LEN]; > pid_t pid; > }; > > @@ -146,6 +147,7 @@ void __reset_page_owner(struct page *page, unsigned short order) > page_owner = get_page_owner(page_ext); > page_owner->free_handle = handle; > page_owner->free_ts_nsec = free_ts_nsec; > + page_owner->comm[0] = '\0'; > page_ext = page_ext_next(page_ext); > } > } > @@ -165,6 +167,8 @@ static inline void __set_page_owner_handle(struct page_ext *page_ext, > page_owner->last_migrate_reason = -1; > page_owner->pid = current->pid; > page_owner->ts_nsec = local_clock(); > + strlcpy(page_owner->comm, current->comm, > + sizeof(page_owner->comm)); > __set_bit(PAGE_EXT_OWNER, &page_ext->flags); > __set_bit(PAGE_EXT_OWNER_ALLOCATED, &page_ext->flags); > > @@ -232,6 +236,7 @@ void __folio_copy_owner(struct folio *newfolio, struct folio *old) > new_page_owner->pid = old_page_owner->pid; > new_page_owner->ts_nsec = old_page_owner->ts_nsec; > new_page_owner->free_ts_nsec = old_page_owner->ts_nsec; > + strcpy(new_page_owner->comm, old_page_owner->comm); > > /* > * We don't clear the bit on the old folio as it's going to be freed > @@ -376,10 +381,11 @@ print_page_owner(char __user *buf, size_t count, unsigned long pfn, > return -ENOMEM; > > ret = scnprintf(kbuf, count, > - "Page allocated via order %u, mask %#x(%pGg), pid %d, ts %llu ns, free_ts %llu ns\n", > + "Page allocated via order %u, mask %#x(%pGg), pid %d (%s), ts %llu ns, free_ts %llu ns\n", > page_owner->order, page_owner->gfp_mask, > &page_owner->gfp_mask, page_owner->pid, > - page_owner->ts_nsec, page_owner->free_ts_nsec); > + page_owner->comm, page_owner->ts_nsec, > + page_owner->free_ts_nsec); > > /* Print information relevant to grouping pages by mobility */ > pageblock_mt = get_pageblock_migratetype(page); > @@ -446,9 +452,10 @@ void __dump_page_owner(const struct page *page) > else > pr_alert("page_owner tracks the page as freed\n"); > > - pr_alert("page last allocated via order %u, migratetype %s, gfp_mask %#x(%pGg), pid %d, ts %llu, free_ts %llu\n", > + pr_alert("page last allocated via order %u, migratetype %s, gfp_mask %#x(%pGg), pid %d (%s), ts %llu, free_ts %llu\n", > page_owner->order, migratetype_names[mt], gfp_mask, &gfp_mask, > - page_owner->pid, page_owner->ts_nsec, page_owner->free_ts_nsec); > + page_owner->pid, page_owner->comm, page_owner->ts_nsec, > + page_owner->free_ts_nsec); > > handle = READ_ONCE(page_owner->handle); > if (!handle) > -- > 2.27.0 -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 39+ messages in thread
[parent not found: <YflRjeoC0jbzArDG-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>]
* Re: [PATCH v4 4/4] mm/page_owner: Record task command name [not found] ` <YflRjeoC0jbzArDG-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org> @ 2022-02-02 16:53 ` Waiman Long [not found] ` <4ba66abe-5c6d-26a7-f11c-c3b8514bfb34-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> 0 siblings, 1 reply; 39+ messages in thread From: Waiman Long @ 2022-02-02 16:53 UTC (permalink / raw) To: Michal Hocko Cc: Johannes Weiner, Vladimir Davydov, Andrew Morton, Petr Mladek, Steven Rostedt, Sergey Senozhatsky, Andy Shevchenko, Rasmus Villemoes, linux-kernel-u79uwXL29TY76Z2rM5mHXA, cgroups-u79uwXL29TY76Z2rM5mHXA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Ira Weiny, Mike Rapoport, David Rientjes, Roman Gushchin, Rafael Aquini, Vlastimil Babka On 2/1/22 10:28, Michal Hocko wrote: > Cc Vlastimil > > On Mon 31-01-22 17:03:28, Waiman Long wrote: >> The page_owner information currently includes the pid of the calling >> task. That is useful as long as the task is still running. Otherwise, >> the number is meaningless. To have more information about the allocating >> tasks that had exited by the time the page_owner information is >> retrieved, we need to store the command name of the task. >> >> Add a new comm field into page_owner structure to store the command name >> and display it when the page_owner information is retrieved. > I completely agree that pid is effectivelly useless (if not misleading) > but is comm really telling all that much to compensate for the > additional storage required for _each_ page in the system? Yes, it does add an extra 16 bytes per page overhead. The command name can be useful if one want to find out which userspace command is responsible for a problematic page allocation. Maybe we can remove pid from page_owner to save 8 bytes as you also agree that this number is not that useful. Cheers, Longman ^ permalink raw reply [flat|nested] 39+ messages in thread
[parent not found: <4ba66abe-5c6d-26a7-f11c-c3b8514bfb34-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>]
* Re: [PATCH v4 4/4] mm/page_owner: Record task command name [not found] ` <4ba66abe-5c6d-26a7-f11c-c3b8514bfb34-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> @ 2022-02-03 12:10 ` Vlastimil Babka 2022-02-03 18:53 ` Waiman Long 0 siblings, 1 reply; 39+ messages in thread From: Vlastimil Babka @ 2022-02-03 12:10 UTC (permalink / raw) To: Waiman Long, Michal Hocko Cc: Johannes Weiner, Vladimir Davydov, Andrew Morton, Petr Mladek, Steven Rostedt, Sergey Senozhatsky, Andy Shevchenko, Rasmus Villemoes, linux-kernel-u79uwXL29TY76Z2rM5mHXA, cgroups-u79uwXL29TY76Z2rM5mHXA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Ira Weiny, Mike Rapoport, David Rientjes, Roman Gushchin, Rafael Aquini On 2/2/22 17:53, Waiman Long wrote: > > On 2/1/22 10:28, Michal Hocko wrote: >> Cc Vlastimil >> >> On Mon 31-01-22 17:03:28, Waiman Long wrote: >>> The page_owner information currently includes the pid of the calling >>> task. That is useful as long as the task is still running. Otherwise, >>> the number is meaningless. To have more information about the allocating >>> tasks that had exited by the time the page_owner information is >>> retrieved, we need to store the command name of the task. >>> >>> Add a new comm field into page_owner structure to store the command name >>> and display it when the page_owner information is retrieved. >> I completely agree that pid is effectivelly useless (if not misleading) >> but is comm really telling all that much to compensate for the >> additional storage required for _each_ page in the system? > > Yes, it does add an extra 16 bytes per page overhead. The command name can > be useful if one want to find out which userspace command is responsible for > a problematic page allocation. Maybe we can remove pid from page_owner to > save 8 bytes as you also agree that this number is not that useful. Pid could be used to correlate command instances (not perfectly if reuse happens), but command name could have a higher chance to be useful. In my experience the most useful were the stacktraces and gfp/order etc. anyway. So I wouldn't be opposed replacing pid with comm. The mild size increase should be acceptable, this is an opt-in feature for debugging sessions with known tradeoff for memory and cpu overhead for the extra info. > Cheers, > Longman > ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCH v4 4/4] mm/page_owner: Record task command name 2022-02-03 12:10 ` Vlastimil Babka @ 2022-02-03 18:53 ` Waiman Long 0 siblings, 0 replies; 39+ messages in thread From: Waiman Long @ 2022-02-03 18:53 UTC (permalink / raw) To: Vlastimil Babka, Michal Hocko Cc: Johannes Weiner, Vladimir Davydov, Andrew Morton, Petr Mladek, Steven Rostedt, Sergey Senozhatsky, Andy Shevchenko, Rasmus Villemoes, linux-kernel, cgroups, linux-mm, Ira Weiny, Mike Rapoport, David Rientjes, Roman Gushchin, Rafael Aquini On 2/3/22 07:10, Vlastimil Babka wrote: > On 2/2/22 17:53, Waiman Long wrote: >> On 2/1/22 10:28, Michal Hocko wrote: >>> Cc Vlastimil >>> >>> On Mon 31-01-22 17:03:28, Waiman Long wrote: >>>> The page_owner information currently includes the pid of the calling >>>> task. That is useful as long as the task is still running. Otherwise, >>>> the number is meaningless. To have more information about the allocating >>>> tasks that had exited by the time the page_owner information is >>>> retrieved, we need to store the command name of the task. >>>> >>>> Add a new comm field into page_owner structure to store the command name >>>> and display it when the page_owner information is retrieved. >>> I completely agree that pid is effectivelly useless (if not misleading) >>> but is comm really telling all that much to compensate for the >>> additional storage required for _each_ page in the system? >> Yes, it does add an extra 16 bytes per page overhead. The command name can >> be useful if one want to find out which userspace command is responsible for >> a problematic page allocation. Maybe we can remove pid from page_owner to >> save 8 bytes as you also agree that this number is not that useful. > Pid could be used to correlate command instances (not perfectly if reuse > happens), but command name could have a higher chance to be useful. In my > experience the most useful were the stacktraces and gfp/order etc. anyway. > So I wouldn't be opposed replacing pid with comm. The mild size increase > should be acceptable, this is an opt-in feature for debugging sessions with > known tradeoff for memory and cpu overhead for the extra info. Thanks for the information. I floated around dropping pid just as a possible way to reduce overall memory overhead. I did not do that in my patch and I am not planning to post any patch unless everybody agree. Cheer, Longman ^ permalink raw reply [flat|nested] 39+ messages in thread
* [PATCH v4 0/4] mm/page_owner: Extend page_owner to show memcg information 2022-01-31 19:23 ` [PATCH v3 4/4] mm/page_owner: Record task command name Waiman Long [not found] ` <20220131192308.608837-5-longman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> 2022-01-31 22:03 ` Waiman Long @ 2022-02-02 20:30 ` Waiman Long 2022-02-02 23:06 ` Rafael Aquini 2022-02-02 20:30 ` [PATCH v4 2/4] mm/page_owner: Use scnprintf() to avoid excessive buffer overrun check Waiman Long 3 siblings, 1 reply; 39+ messages in thread From: Waiman Long @ 2022-02-02 20:30 UTC (permalink / raw) To: Johannes Weiner, Michal Hocko, Vladimir Davydov, Andrew Morton, Petr Mladek, Steven Rostedt, Sergey Senozhatsky, Andy Shevchenko, Rasmus Villemoes Cc: linux-kernel, cgroups, linux-mm, Ira Weiny, Mike Rapoport, David Rientjes, Roman Gushchin, Rafael Aquini, Waiman Long v4: - Take rcu_read_lock() when memcg is being accessed as suggested by Michal. - Make print_page_owner_memcg() return the new offset into the buffer and put CONFIG_MEMCG block inside as suggested by Mike. - Directly use TASK_COMM_LEN as length of name buffer as suggested by Roman. v3: - Add unlikely() to patch 1 and clarify that -1 will not be returned. - Use a helper function to print out memcg information in patch 3. - Add a new patch 4 to store task command name in page_owner structure. v2: - Remove the SNPRINTF() macro as suggested by Ira and use scnprintf() instead to remove some buffer overrun checks. - Add a patch to optimize vscnprintf with a size parameter of 0. While debugging the constant increase in percpu memory consumption on a system that spawned large number of containers, it was found that a lot of offline mem_cgroup structures remained in place without being freed. Further investigation indicated that those mem_cgroup structures were pinned by some pages. In order to find out what those pages are, the existing page_owner debugging tool is extended to show memory cgroup information and whether those memcgs are offline or not. With the enhanced page_owner tool, the following is a typical page that pinned the mem_cgroup structure in my test case: Page allocated via order 0, mask 0x1100cca(GFP_HIGHUSER_MOVABLE), pid 162970 (podman), ts 1097761405537 ns, free_ts 1097760838089 ns PFN 1925700 type Movable Block 3761 type Movable Flags 0x17ffffc00c001c(uptodate|dirty|lru|reclaim|swapbacked|node=0|zone=2|lastcpupid=0x1fffff) prep_new_page+0xac/0xe0 get_page_from_freelist+0x1327/0x14d0 __alloc_pages+0x191/0x340 alloc_pages_vma+0x84/0x250 shmem_alloc_page+0x3f/0x90 shmem_alloc_and_acct_page+0x76/0x1c0 shmem_getpage_gfp+0x281/0x940 shmem_write_begin+0x36/0xe0 generic_perform_write+0xed/0x1d0 __generic_file_write_iter+0xdc/0x1b0 generic_file_write_iter+0x5d/0xb0 new_sync_write+0x11f/0x1b0 vfs_write+0x1ba/0x2a0 ksys_write+0x59/0xd0 do_syscall_64+0x37/0x80 entry_SYSCALL_64_after_hwframe+0x44/0xae Charged to offline memcg libpod-conmon-15e4f9c758422306b73b2dd99f9d50a5ea53cbb16b4a13a2c2308a4253cc0ec8. So the page was not freed because it was part of a shmem segment. That is useful information that can help users to diagnose similar problems. With cgroup v1, /proc/cgroups can be read to find out the total number of memory cgroups (online + offline). With cgroup v2, the cgroup.stat of the root cgroup can be read to find the number of dying cgroups (most likely pinned by dying memcgs). The page_owner feature is not supposed to be enabled for production system due to its memory overhead. However, if it is suspected that dying memcgs are increasing over time, a test environment with page_owner enabled can then be set up with appropriate workload for further analysis on what may be causing the increasing number of dying memcgs. Waiman Long (4): lib/vsprintf: Avoid redundant work with 0 size mm/page_owner: Use scnprintf() to avoid excessive buffer overrun check mm/page_owner: Print memcg information mm/page_owner: Record task command name lib/vsprintf.c | 8 +++--- mm/page_owner.c | 70 ++++++++++++++++++++++++++++++++++++++----------- 2 files changed, 60 insertions(+), 18 deletions(-) -- 2.27.0 ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCH v4 0/4] mm/page_owner: Extend page_owner to show memcg information 2022-02-02 20:30 ` [PATCH v4 0/4] mm/page_owner: Extend page_owner to show memcg information Waiman Long @ 2022-02-02 23:06 ` Rafael Aquini 0 siblings, 0 replies; 39+ messages in thread From: Rafael Aquini @ 2022-02-02 23:06 UTC (permalink / raw) To: Waiman Long Cc: Johannes Weiner, Michal Hocko, Vladimir Davydov, Andrew Morton, Petr Mladek, Steven Rostedt, Sergey Senozhatsky, Andy Shevchenko, Rasmus Villemoes, linux-kernel, cgroups, linux-mm, Ira Weiny, Mike Rapoport, David Rientjes, Roman Gushchin On Wed, Feb 02, 2022 at 03:30:32PM -0500, Waiman Long wrote: > v4: > - Take rcu_read_lock() when memcg is being accessed as suggested by > Michal. > - Make print_page_owner_memcg() return the new offset into the buffer > and put CONFIG_MEMCG block inside as suggested by Mike. > - Directly use TASK_COMM_LEN as length of name buffer as suggested by > Roman. > > v3: > - Add unlikely() to patch 1 and clarify that -1 will not be returned. > - Use a helper function to print out memcg information in patch 3. > - Add a new patch 4 to store task command name in page_owner > structure. > > v2: > - Remove the SNPRINTF() macro as suggested by Ira and use scnprintf() > instead to remove some buffer overrun checks. > - Add a patch to optimize vscnprintf with a size parameter of 0. > > While debugging the constant increase in percpu memory consumption on > a system that spawned large number of containers, it was found that a > lot of offline mem_cgroup structures remained in place without being > freed. Further investigation indicated that those mem_cgroup structures > were pinned by some pages. > > In order to find out what those pages are, the existing page_owner > debugging tool is extended to show memory cgroup information and whether > those memcgs are offline or not. With the enhanced page_owner tool, > the following is a typical page that pinned the mem_cgroup structure > in my test case: > > Page allocated via order 0, mask 0x1100cca(GFP_HIGHUSER_MOVABLE), pid 162970 (podman), ts 1097761405537 ns, free_ts 1097760838089 ns > PFN 1925700 type Movable Block 3761 type Movable Flags 0x17ffffc00c001c(uptodate|dirty|lru|reclaim|swapbacked|node=0|zone=2|lastcpupid=0x1fffff) > prep_new_page+0xac/0xe0 > get_page_from_freelist+0x1327/0x14d0 > __alloc_pages+0x191/0x340 > alloc_pages_vma+0x84/0x250 > shmem_alloc_page+0x3f/0x90 > shmem_alloc_and_acct_page+0x76/0x1c0 > shmem_getpage_gfp+0x281/0x940 > shmem_write_begin+0x36/0xe0 > generic_perform_write+0xed/0x1d0 > __generic_file_write_iter+0xdc/0x1b0 > generic_file_write_iter+0x5d/0xb0 > new_sync_write+0x11f/0x1b0 > vfs_write+0x1ba/0x2a0 > ksys_write+0x59/0xd0 > do_syscall_64+0x37/0x80 > entry_SYSCALL_64_after_hwframe+0x44/0xae > Charged to offline memcg libpod-conmon-15e4f9c758422306b73b2dd99f9d50a5ea53cbb16b4a13a2c2308a4253cc0ec8. > > So the page was not freed because it was part of a shmem segment. That > is useful information that can help users to diagnose similar problems. > > With cgroup v1, /proc/cgroups can be read to find out the total number > of memory cgroups (online + offline). With cgroup v2, the cgroup.stat of > the root cgroup can be read to find the number of dying cgroups (most > likely pinned by dying memcgs). > > The page_owner feature is not supposed to be enabled for production > system due to its memory overhead. However, if it is suspected that > dying memcgs are increasing over time, a test environment with page_owner > enabled can then be set up with appropriate workload for further analysis > on what may be causing the increasing number of dying memcgs. > > Waiman Long (4): > lib/vsprintf: Avoid redundant work with 0 size > mm/page_owner: Use scnprintf() to avoid excessive buffer overrun check > mm/page_owner: Print memcg information > mm/page_owner: Record task command name > > lib/vsprintf.c | 8 +++--- > mm/page_owner.c | 70 ++++++++++++++++++++++++++++++++++++++----------- > 2 files changed, 60 insertions(+), 18 deletions(-) > > -- > 2.27.0 > Thank you, Waiman. Acked-by: Rafael Aquini <aquini@redhat.com> ^ permalink raw reply [flat|nested] 39+ messages in thread
* [PATCH v4 2/4] mm/page_owner: Use scnprintf() to avoid excessive buffer overrun check 2022-01-31 19:23 ` [PATCH v3 4/4] mm/page_owner: Record task command name Waiman Long ` (2 preceding siblings ...) 2022-02-02 20:30 ` [PATCH v4 0/4] mm/page_owner: Extend page_owner to show memcg information Waiman Long @ 2022-02-02 20:30 ` Waiman Long [not found] ` <20220202203036.744010-3-longman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> 3 siblings, 1 reply; 39+ messages in thread From: Waiman Long @ 2022-02-02 20:30 UTC (permalink / raw) To: Johannes Weiner, Michal Hocko, Vladimir Davydov, Andrew Morton, Petr Mladek, Steven Rostedt, Sergey Senozhatsky, Andy Shevchenko, Rasmus Villemoes Cc: linux-kernel, cgroups, linux-mm, Ira Weiny, Mike Rapoport, David Rientjes, Roman Gushchin, Rafael Aquini, Waiman Long The snprintf() function can return a length greater than the given input size. That will require a check for buffer overrun after each invocation of snprintf(). scnprintf(), on the other hand, will never return a greater length. By using scnprintf() in selected places, we can avoid some buffer overrun checks except after stack_depot_snprint() and after the last snprintf(). Signed-off-by: Waiman Long <longman@redhat.com> Acked-by: David Rientjes <rientjes@google.com> Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org> --- mm/page_owner.c | 14 +++----------- 1 file changed, 3 insertions(+), 11 deletions(-) diff --git a/mm/page_owner.c b/mm/page_owner.c index 99e360df9465..28dac73e0542 100644 --- a/mm/page_owner.c +++ b/mm/page_owner.c @@ -338,19 +338,16 @@ print_page_owner(char __user *buf, size_t count, unsigned long pfn, if (!kbuf) return -ENOMEM; - ret = snprintf(kbuf, count, + ret = scnprintf(kbuf, count, "Page allocated via order %u, mask %#x(%pGg), pid %d, ts %llu ns, free_ts %llu ns\n", page_owner->order, page_owner->gfp_mask, &page_owner->gfp_mask, page_owner->pid, page_owner->ts_nsec, page_owner->free_ts_nsec); - if (ret >= count) - goto err; - /* Print information relevant to grouping pages by mobility */ pageblock_mt = get_pageblock_migratetype(page); page_mt = gfp_migratetype(page_owner->gfp_mask); - ret += snprintf(kbuf + ret, count - ret, + ret += scnprintf(kbuf + ret, count - ret, "PFN %lu type %s Block %lu type %s Flags %pGp\n", pfn, migratetype_names[page_mt], @@ -358,19 +355,14 @@ print_page_owner(char __user *buf, size_t count, unsigned long pfn, migratetype_names[pageblock_mt], &page->flags); - if (ret >= count) - goto err; - ret += stack_depot_snprint(handle, kbuf + ret, count - ret, 0); if (ret >= count) goto err; if (page_owner->last_migrate_reason != -1) { - ret += snprintf(kbuf + ret, count - ret, + ret += scnprintf(kbuf + ret, count - ret, "Page has been migrated, last migrate reason: %s\n", migrate_reason_names[page_owner->last_migrate_reason]); - if (ret >= count) - goto err; } ret += snprintf(kbuf + ret, count - ret, "\n"); -- 2.27.0 ^ permalink raw reply related [flat|nested] 39+ messages in thread
[parent not found: <20220202203036.744010-3-longman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>]
* Re: [PATCH v4 2/4] mm/page_owner: Use scnprintf() to avoid excessive buffer overrun check [not found] ` <20220202203036.744010-3-longman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> @ 2022-02-03 15:46 ` Vlastimil Babka [not found] ` <5c03fa31-35a5-4cbc-6b0e-872d5db82a41-AlSwsSmVLrQ@public.gmane.org> 0 siblings, 1 reply; 39+ messages in thread From: Vlastimil Babka @ 2022-02-03 15:46 UTC (permalink / raw) To: Waiman Long, Johannes Weiner, Michal Hocko, Vladimir Davydov, Andrew Morton, Petr Mladek, Steven Rostedt, Sergey Senozhatsky, Andy Shevchenko, Rasmus Villemoes Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA, cgroups-u79uwXL29TY76Z2rM5mHXA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Ira Weiny, Mike Rapoport, David Rientjes, Roman Gushchin, Rafael Aquini On 2/2/22 21:30, Waiman Long wrote: > The snprintf() function can return a length greater than the given > input size. That will require a check for buffer overrun after each > invocation of snprintf(). scnprintf(), on the other hand, will never > return a greater length. By using scnprintf() in selected places, we > can avoid some buffer overrun checks except after stack_depot_snprint() > and after the last snprintf(). > > Signed-off-by: Waiman Long <longman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> > Acked-by: David Rientjes <rientjes-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> > Reviewed-by: Sergey Senozhatsky <senozhatsky-F7+t8E8rja9g9hUCZPvPmw@public.gmane.org> Looks like this will work, but note that if the purpose of patch 1/4 was that after the first scnprintf() that overflows the following calls will be short-cut thanks to passing the size as 0, AFAICS that won't work. Because scnprintf() returns the number without trailing zero, 'ret' will be 'count - 1' after the overflow, so 'count - ret' will be 1, never 0. > --- > mm/page_owner.c | 14 +++----------- > 1 file changed, 3 insertions(+), 11 deletions(-) > > diff --git a/mm/page_owner.c b/mm/page_owner.c > index 99e360df9465..28dac73e0542 100644 > --- a/mm/page_owner.c > +++ b/mm/page_owner.c > @@ -338,19 +338,16 @@ print_page_owner(char __user *buf, size_t count, unsigned long pfn, > if (!kbuf) > return -ENOMEM; > > - ret = snprintf(kbuf, count, > + ret = scnprintf(kbuf, count, > "Page allocated via order %u, mask %#x(%pGg), pid %d, ts %llu ns, free_ts %llu ns\n", > page_owner->order, page_owner->gfp_mask, > &page_owner->gfp_mask, page_owner->pid, > page_owner->ts_nsec, page_owner->free_ts_nsec); > > - if (ret >= count) > - goto err; > - > /* Print information relevant to grouping pages by mobility */ > pageblock_mt = get_pageblock_migratetype(page); > page_mt = gfp_migratetype(page_owner->gfp_mask); > - ret += snprintf(kbuf + ret, count - ret, > + ret += scnprintf(kbuf + ret, count - ret, > "PFN %lu type %s Block %lu type %s Flags %pGp\n", > pfn, > migratetype_names[page_mt], > @@ -358,19 +355,14 @@ print_page_owner(char __user *buf, size_t count, unsigned long pfn, > migratetype_names[pageblock_mt], > &page->flags); > > - if (ret >= count) > - goto err; > - > ret += stack_depot_snprint(handle, kbuf + ret, count - ret, 0); > if (ret >= count) > goto err; > > if (page_owner->last_migrate_reason != -1) { > - ret += snprintf(kbuf + ret, count - ret, > + ret += scnprintf(kbuf + ret, count - ret, > "Page has been migrated, last migrate reason: %s\n", > migrate_reason_names[page_owner->last_migrate_reason]); > - if (ret >= count) > - goto err; > } > > ret += snprintf(kbuf + ret, count - ret, "\n"); ^ permalink raw reply [flat|nested] 39+ messages in thread
[parent not found: <5c03fa31-35a5-4cbc-6b0e-872d5db82a41-AlSwsSmVLrQ@public.gmane.org>]
* Re: [PATCH v4 2/4] mm/page_owner: Use scnprintf() to avoid excessive buffer overrun check [not found] ` <5c03fa31-35a5-4cbc-6b0e-872d5db82a41-AlSwsSmVLrQ@public.gmane.org> @ 2022-02-03 18:49 ` Waiman Long 2022-02-08 10:51 ` Petr Mladek 0 siblings, 1 reply; 39+ messages in thread From: Waiman Long @ 2022-02-03 18:49 UTC (permalink / raw) To: Vlastimil Babka, Johannes Weiner, Michal Hocko, Vladimir Davydov, Andrew Morton, Petr Mladek, Steven Rostedt, Sergey Senozhatsky, Andy Shevchenko, Rasmus Villemoes Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA, cgroups-u79uwXL29TY76Z2rM5mHXA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Ira Weiny, Mike Rapoport, David Rientjes, Roman Gushchin, Rafael Aquini On 2/3/22 10:46, Vlastimil Babka wrote: > On 2/2/22 21:30, Waiman Long wrote: >> The snprintf() function can return a length greater than the given >> input size. That will require a check for buffer overrun after each >> invocation of snprintf(). scnprintf(), on the other hand, will never >> return a greater length. By using scnprintf() in selected places, we >> can avoid some buffer overrun checks except after stack_depot_snprint() >> and after the last snprintf(). >> >> Signed-off-by: Waiman Long <longman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> >> Acked-by: David Rientjes <rientjes-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> >> Reviewed-by: Sergey Senozhatsky <senozhatsky-F7+t8E8rja9g9hUCZPvPmw@public.gmane.org> > Looks like this will work, but note that if the purpose of patch 1/4 was > that after the first scnprintf() that overflows the following calls will be > short-cut thanks to passing the size as 0, AFAICS that won't work. Because > scnprintf() returns the number without trailing zero, 'ret' will be 'count - > 1' after the overflow, so 'count - ret' will be 1, never 0. Yes, I am aware of that. Patch 1 is just a micro-optimization for the very rare case. Cheers, Longman ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCH v4 2/4] mm/page_owner: Use scnprintf() to avoid excessive buffer overrun check 2022-02-03 18:49 ` Waiman Long @ 2022-02-08 10:51 ` Petr Mladek 0 siblings, 0 replies; 39+ messages in thread From: Petr Mladek @ 2022-02-08 10:51 UTC (permalink / raw) To: Waiman Long Cc: Vlastimil Babka, Johannes Weiner, Michal Hocko, Vladimir Davydov, Andrew Morton, Steven Rostedt, Sergey Senozhatsky, Andy Shevchenko, Rasmus Villemoes, linux-kernel, cgroups, linux-mm, Ira Weiny, Mike Rapoport, David Rientjes, Roman Gushchin, Rafael Aquini On Thu 2022-02-03 13:49:02, Waiman Long wrote: > On 2/3/22 10:46, Vlastimil Babka wrote: > > On 2/2/22 21:30, Waiman Long wrote: > > > The snprintf() function can return a length greater than the given > > > input size. That will require a check for buffer overrun after each > > > invocation of snprintf(). scnprintf(), on the other hand, will never > > > return a greater length. By using scnprintf() in selected places, we > > > can avoid some buffer overrun checks except after stack_depot_snprint() > > > and after the last snprintf(). > > > > > > Signed-off-by: Waiman Long <longman@redhat.com> > > > Acked-by: David Rientjes <rientjes@google.com> > > > Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org> > > Looks like this will work, but note that if the purpose of patch 1/4 was > > that after the first scnprintf() that overflows the following calls will be > > short-cut thanks to passing the size as 0, AFAICS that won't work. Because > > scnprintf() returns the number without trailing zero, 'ret' will be 'count - > > 1' after the overflow, so 'count - ret' will be 1, never 0. > > Yes, I am aware of that. Patch 1 is just a micro-optimization for the very > rare case. In theory, we might micro-optimize also the case when "size == 1". Well, I am not sure if it is worth it. After all, the primary use-case is to print the message into a big-enough buffer. Lost information is a bigger problem than the speed ;-) Best Regards, Petr ^ permalink raw reply [flat|nested] 39+ messages in thread
end of thread, other threads:[~2022-02-08 10:51 UTC | newest]
Thread overview: 39+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2022-01-31 19:23 [PATCH v3 0/4] mm/page_owner: Extend page_owner to show memcg information Waiman Long
[not found] ` <20220131192308.608837-1-longman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2022-01-31 19:23 ` [PATCH v3 1/4] lib/vsprintf: Avoid redundant work with 0 size Waiman Long
[not found] ` <20220131192308.608837-2-longman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2022-01-31 20:42 ` Mike Rapoport
2022-01-31 19:23 ` [PATCH v3 2/4] mm/page_owner: Use scnprintf() to avoid excessive buffer overrun check Waiman Long
[not found] ` <20220131192308.608837-3-longman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2022-01-31 20:38 ` Roman Gushchin
2022-01-31 20:43 ` Mike Rapoport
2022-01-31 19:23 ` [PATCH v3 3/4] mm/page_owner: Print memcg information Waiman Long
[not found] ` <20220131192308.608837-4-longman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2022-01-31 20:51 ` Mike Rapoport
[not found] ` <YfhLzI+RLRGgexmr-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
2022-01-31 21:43 ` Waiman Long
[not found] ` <4234fc60-5d65-1089-555a-734218aa6f9c-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2022-02-01 6:23 ` Mike Rapoport
2022-01-31 20:51 ` Roman Gushchin
2022-02-01 10:54 ` Michal Hocko
[not found] ` <YfkRS75D3xcqLT85-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
2022-02-01 17:04 ` Waiman Long
[not found] ` <33be132c-874d-1061-9003-50942275b221-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2022-02-02 8:49 ` Michal Hocko
[not found] ` <YfpFkVLBb0GsDFsi-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
2022-02-02 16:12 ` Waiman Long
2022-01-31 19:23 ` [PATCH v3 4/4] mm/page_owner: Record task command name Waiman Long
[not found] ` <20220131192308.608837-5-longman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2022-01-31 20:54 ` Roman Gushchin
[not found] ` <YfhMd5LTzTHu9zMD-cx5fftMpWqeCjSd+JxjunQ2O0Ztt9esIQQ4Iyu8u01E@public.gmane.org>
2022-01-31 21:46 ` Waiman Long
2022-02-02 20:30 ` [PATCH v4 1/4] lib/vsprintf: Avoid redundant work with 0 size Waiman Long
2022-02-08 10:08 ` Petr Mladek
2022-02-02 20:30 ` [PATCH v4 3/4] mm/page_owner: Print memcg information Waiman Long
[not found] ` <20220202203036.744010-4-longman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2022-02-03 6:53 ` Mike Rapoport
2022-02-03 12:46 ` Michal Hocko
[not found] ` <YfvOp5VXrxy9IW1w-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
2022-02-03 19:03 ` Waiman Long
2022-02-07 17:20 ` Michal Hocko
[not found] ` <YgFUxFI5bMbc42j4-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
2022-02-07 19:09 ` Andrew Morton
[not found] ` <20220207110947.f07b58898d91c02090f9aacf-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org>
2022-02-07 19:33 ` Waiman Long
2022-02-02 20:30 ` [PATCH v4 4/4] mm/page_owner: Record task command name Waiman Long
2022-01-31 22:03 ` Waiman Long
[not found] ` <20220131220328.622162-1-longman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2022-02-01 15:28 ` Michal Hocko
[not found] ` <YflRjeoC0jbzArDG-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
2022-02-02 16:53 ` Waiman Long
[not found] ` <4ba66abe-5c6d-26a7-f11c-c3b8514bfb34-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2022-02-03 12:10 ` Vlastimil Babka
2022-02-03 18:53 ` Waiman Long
2022-02-02 20:30 ` [PATCH v4 0/4] mm/page_owner: Extend page_owner to show memcg information Waiman Long
2022-02-02 23:06 ` Rafael Aquini
2022-02-02 20:30 ` [PATCH v4 2/4] mm/page_owner: Use scnprintf() to avoid excessive buffer overrun check Waiman Long
[not found] ` <20220202203036.744010-3-longman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2022-02-03 15:46 ` Vlastimil Babka
[not found] ` <5c03fa31-35a5-4cbc-6b0e-872d5db82a41-AlSwsSmVLrQ@public.gmane.org>
2022-02-03 18:49 ` Waiman Long
2022-02-08 10:51 ` Petr Mladek
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).