* [PATCH v2 0/3] mm/page_owner: Extend page_owner to show memcg information
@ 2022-01-29 20:53 Waiman Long
From: Waiman Long @ 2022-01-29 20:53 UTC (permalink / raw)
To: Johannes Weiner, Michal Hocko, Vladimir Davydov, Andrew Morton,
Petr Mladek, Steven Rostedt, Sergey Senozhatsky, Andy Shevchenko,
Rasmus Villemoes
Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA,
cgroups-u79uwXL29TY76Z2rM5mHXA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
Ira Weiny, Rafael Aquini, Waiman Long
v2:
- Remove the SNPRINTF() macro as suggested by Ira and use scnprintf()
instead to remove some buffer overrun checks.
- Add a patch to optimize vscnprintf with a size parameter of 0.
While debugging the constant increase in percpu memory consumption on
a system that spawned a large number of containers, it was found that a
lot of offlined mem_cgroup structures remained in place without being
freed. Further investigation indicated that those mem_cgroup structures
were pinned by some pages.
In order to find out what those pages are, the existing page_owner
debugging tool is extended to show memory cgroup information and whether
those memcgs are offlined or not. With the enhanced page_owner tool,
the following is a typical page that pinned the mem_cgroup structure
in my test case:
Page allocated via order 0, mask 0x1100cca(GFP_HIGHUSER_MOVABLE), pid 62760, ts 119274296592 ns, free_ts 118989764823 ns
PFN 1273412 type Movable Block 2487 type Movable Flags 0x17ffffc00c001c(uptodate|dirty|lru|reclaim|swapbacked|node=0|zone=2|lastcpupid=0x1fffff)
prep_new_page+0x8e/0xb0
get_page_from_freelist+0xc4d/0xe50
__alloc_pages+0x172/0x320
alloc_pages_vma+0x84/0x230
shmem_alloc_page+0x3f/0x90
shmem_alloc_and_acct_page+0x76/0x1c0
shmem_getpage_gfp+0x48d/0x890
shmem_write_begin+0x36/0xc0
generic_perform_write+0xed/0x1d0
__generic_file_write_iter+0xdc/0x1b0
generic_file_write_iter+0x5d/0xb0
new_sync_write+0x11f/0x1b0
vfs_write+0x1ba/0x2a0
ksys_write+0x59/0xd0
do_syscall_64+0x37/0x80
entry_SYSCALL_64_after_hwframe+0x44/0xae
Charged to offlined memcg libpod-conmon-e59cc83faf807bacc61223fec6a80c1540ebe8f83c802870c6af4708d58f77ea
So the page was not freed because it was part of a shmem segment. That
is useful information that can help users diagnose similar problems.
Waiman Long (3):
lib/vsprintf: Avoid redundant work with 0 size
mm/page_owner: Use scnprintf() to avoid excessive buffer overrun check
mm/page_owner: Dump memcg information
lib/vsprintf.c | 8 +++++---
mm/page_owner.c | 45 ++++++++++++++++++++++++++++++++++-----------
2 files changed, 39 insertions(+), 14 deletions(-)
--
2.27.0
^ permalink raw reply	[flat|nested] 34+ messages in thread
* [PATCH v2 1/3] lib/vsprintf: Avoid redundant work with 0 size
@ 2022-01-29 20:53 ` Waiman Long
From: Waiman Long @ 2022-01-29 20:53 UTC (permalink / raw)
To: Johannes Weiner, Michal Hocko, Vladimir Davydov, Andrew Morton,
	Petr Mladek, Steven Rostedt, Sergey Senozhatsky, Andy Shevchenko,
	Rasmus Villemoes
Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	cgroups-u79uwXL29TY76Z2rM5mHXA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
	Ira Weiny, Rafael Aquini, Waiman Long

For *scnprintf(), vsnprintf() is always called even if the input size is
0. That is a waste of time, so just return 0 in this case.

Signed-off-by: Waiman Long <longman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
---
 lib/vsprintf.c | 8 +++++---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/lib/vsprintf.c b/lib/vsprintf.c
index 3b8129dd374c..a65df546fb06 100644
--- a/lib/vsprintf.c
+++ b/lib/vsprintf.c
@@ -2895,13 +2895,15 @@ int vscnprintf(char *buf, size_t size, const char *fmt, va_list args)
 {
 	int i;
 
+	if (!size)
+		return 0;
+
 	i = vsnprintf(buf, size, fmt, args);
 
 	if (likely(i < size))
 		return i;
-	if (size != 0)
-		return size - 1;
-	return 0;
+
+	return size - 1;
 }
 EXPORT_SYMBOL(vscnprintf);
-- 
2.27.0
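The return-value semantics this patch preserves can be sketched as a small userspace model. The `my_*()` helper names below are hypothetical (libc has no scnprintf()), and note that the kernel's vsnprintf(), unlike libc's, never returns a negative value:

```c
#include <stdarg.h>
#include <stdio.h>

/*
 * Userspace sketch of the kernel's vscnprintf()/scnprintf() after the
 * patch. Hypothetical my_*() names; this relies on libc vsnprintf()
 * returning the would-be length on truncation (C99 semantics).
 */
static int my_vscnprintf(char *buf, size_t size, const char *fmt, va_list args)
{
	int i;

	if (!size)		/* the patch's early exit: skip all formatting work */
		return 0;

	i = vsnprintf(buf, size, fmt, args);
	if (i < (int)size)	/* output fit, including the trailing NUL */
		return i;

	return (int)size - 1;	/* truncated: characters actually written */
}

static int my_scnprintf(char *buf, size_t size, const char *fmt, ...)
{
	va_list args;
	int i;

	va_start(args, fmt);
	i = my_vscnprintf(buf, size, fmt, args);
	va_end(args);
	return i;
}
```

Unlike snprintf(), which reports the would-be length (forcing callers to check for `ret >= size`), `my_scnprintf()` can never return `size` or more.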
* Re: [PATCH v2 1/3] lib/vsprintf: Avoid redundant work with 0 size
@ 2022-01-30 20:49 ` David Rientjes
From: David Rientjes @ 2022-01-30 20:49 UTC (permalink / raw)
To: Waiman Long
Cc: Johannes Weiner, Michal Hocko, Vladimir Davydov, Andrew Morton,
	Petr Mladek, Steven Rostedt, Sergey Senozhatsky, Andy Shevchenko,
	Rasmus Villemoes, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	cgroups-u79uwXL29TY76Z2rM5mHXA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
	Ira Weiny, Rafael Aquini

On Sat, 29 Jan 2022, Waiman Long wrote:

> For *scnprintf(), vsnprintf() is always called even if the input size is
> 0. That is a waste of time, so just return 0 in this case.
>
> Signed-off-by: Waiman Long <longman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
> ---
>  lib/vsprintf.c | 8 +++++---
>  1 file changed, 5 insertions(+), 3 deletions(-)
>
> diff --git a/lib/vsprintf.c b/lib/vsprintf.c
> index 3b8129dd374c..a65df546fb06 100644
> --- a/lib/vsprintf.c
> +++ b/lib/vsprintf.c
> @@ -2895,13 +2895,15 @@ int vscnprintf(char *buf, size_t size, const char *fmt, va_list args)
>  {
>  	int i;
>
> +	if (!size)
> +		return 0;

Nit: any reason this shouldn't be unlikely()? If the conditional for
i < size is likely(), this seems assumed already?

> +
>  	i = vsnprintf(buf, size, fmt, args);
>
>  	if (likely(i < size))
>  		return i;
> -	if (size != 0)
> -		return size - 1;
> -	return 0;
> +
> +	return size - 1;
>  }
>  EXPORT_SYMBOL(vscnprintf);
>
> --
> 2.27.0
* Re: [PATCH v2 1/3] lib/vsprintf: Avoid redundant work with 0 size
@ 2022-01-30 20:57 ` Waiman Long
From: Waiman Long @ 2022-01-30 20:57 UTC (permalink / raw)
To: David Rientjes
Cc: Johannes Weiner, Michal Hocko, Vladimir Davydov, Andrew Morton,
	Petr Mladek, Steven Rostedt, Sergey Senozhatsky, Andy Shevchenko,
	Rasmus Villemoes, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	cgroups-u79uwXL29TY76Z2rM5mHXA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
	Ira Weiny, Rafael Aquini

On 1/30/22 15:49, David Rientjes wrote:
> On Sat, 29 Jan 2022, Waiman Long wrote:
>
>> For *scnprintf(), vsnprintf() is always called even if the input size is
>> 0. That is a waste of time, so just return 0 in this case.
>>
>> Signed-off-by: Waiman Long <longman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
>> ---
>> lib/vsprintf.c | 8 +++++---
>> 1 file changed, 5 insertions(+), 3 deletions(-)
>>
>> diff --git a/lib/vsprintf.c b/lib/vsprintf.c
>> index 3b8129dd374c..a65df546fb06 100644
>> --- a/lib/vsprintf.c
>> +++ b/lib/vsprintf.c
>> @@ -2895,13 +2895,15 @@ int vscnprintf(char *buf, size_t size, const char *fmt, va_list args)
>> {
>> 	int i;
>>
>> +	if (!size)
>> +		return 0;
> Nit: any reason this shouldn't be unlikely()? If the conditional for
> i < size is likely(), this seems assumed already?

Good suggestion. Will make the change in the next version.

Cheers,
Longman
* Re: [PATCH v2 1/3] lib/vsprintf: Avoid redundant work with 0 size
@ 2022-01-31 10:25 ` Andy Shevchenko
From: Andy Shevchenko @ 2022-01-31 10:25 UTC (permalink / raw)
To: David Rientjes
Cc: Waiman Long, Johannes Weiner, Michal Hocko, Vladimir Davydov,
	Andrew Morton, Petr Mladek, Steven Rostedt, Sergey Senozhatsky,
	Rasmus Villemoes, linux-kernel, cgroups, linux-mm, Ira Weiny,
	Rafael Aquini

On Sun, Jan 30, 2022 at 12:49:37PM -0800, David Rientjes wrote:
> On Sat, 29 Jan 2022, Waiman Long wrote:
>
> > For *scnprintf(), vsnprintf() is always called even if the input size is
> > 0. That is a waste of time, so just return 0 in this case.

Why do you think it's not legit?

-- 
With Best Regards,
Andy Shevchenko
* Re: [PATCH v2 1/3] lib/vsprintf: Avoid redundant work with 0 size
@ 2022-01-31 10:30 ` Andy Shevchenko
From: Andy Shevchenko @ 2022-01-31 10:30 UTC (permalink / raw)
To: David Rientjes
Cc: Waiman Long, Johannes Weiner, Michal Hocko, Vladimir Davydov,
	Andrew Morton, Petr Mladek, Steven Rostedt, Sergey Senozhatsky,
	Rasmus Villemoes, linux-kernel, cgroups, linux-mm, Ira Weiny,
	Rafael Aquini

On Mon, Jan 31, 2022 at 12:25:09PM +0200, Andy Shevchenko wrote:
> On Sun, Jan 30, 2022 at 12:49:37PM -0800, David Rientjes wrote:
> > On Sat, 29 Jan 2022, Waiman Long wrote:
> >
> > > For *scnprintf(), vsnprintf() is always called even if the input size is
> > > 0. That is a waste of time, so just return 0 in this case.
>
> Why do you think it's not legit?

I have to elaborate.

For *nprintf() the size=0 is quite useful to have.
For *cnprintf() the size=0 makes less sense, but, if we read `man snprintf()`:

  The functions snprintf() and vsnprintf() do not write more than size bytes
  (including the terminating null byte ('\0')). If the output was truncated due
  to this limit, then the return value is the number of characters (excluding
  the terminating null byte) which would have been written to the final string
  if enough space had been available. Thus, a return value of size or more
  means that the output was truncated. (See also below under NOTES.)

  If an output error is encountered, a negative value is returned.

Note the last sentence there. You need to answer to it in the commit message
why your change is okay and it will show that you thought through all possible
scenarios.

-- 
With Best Regards,
Andy Shevchenko
* Re: [PATCH v2 1/3] lib/vsprintf: Avoid redundant work with 0 size
@ 2022-01-31 10:34 ` Andy Shevchenko
From: Andy Shevchenko @ 2022-01-31 10:34 UTC (permalink / raw)
To: David Rientjes
Cc: Waiman Long, Johannes Weiner, Michal Hocko, Vladimir Davydov,
	Andrew Morton, Petr Mladek, Steven Rostedt, Sergey Senozhatsky,
	Rasmus Villemoes, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	cgroups-u79uwXL29TY76Z2rM5mHXA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
	Ira Weiny, Rafael Aquini

On Mon, Jan 31, 2022 at 12:30:33PM +0200, Andy Shevchenko wrote:
> On Mon, Jan 31, 2022 at 12:25:09PM +0200, Andy Shevchenko wrote:
> > On Sun, Jan 30, 2022 at 12:49:37PM -0800, David Rientjes wrote:
> > > On Sat, 29 Jan 2022, Waiman Long wrote:
> > > >
> > > > For *scnprintf(), vsnprintf() is always called even if the input size is
> > > > 0. That is a waste of time, so just return 0 in this case.
> >
> > Why do you think it's not legit?
>
> I have to elaborate.
>
> For *nprintf() the size=0 is quite useful to have.
> For *cnprintf() the size=0 makes less sense, but, if we read `man snprintf()`:
>
>   The functions snprintf() and vsnprintf() do not write more than size bytes
>   (including the terminating null byte ('\0')). If the output was truncated due
>   to this limit, then the return value is the number of characters (excluding
>   the terminating null byte) which would have been written to the final string
>   if enough space had been available. Thus, a return value of size or more
>   means that the output was truncated. (See also below under NOTES.)
>
>   If an output error is encountered, a negative value is returned.
>
> Note the last sentence there. You need to answer to it in the commit message
> why your change is okay and it will show that you thought through all possible
> scenarios.

Also it seems currently the kernel documentation is not aligned with the code

"If @size is == 0 the function returns 0."

It should mention the (theoretical?) possibility of getting negative value,
if vsnprintf() returns negative value.

-- 
With Best Regards,
Andy Shevchenko
* Re: [PATCH v2 1/3] lib/vsprintf: Avoid redundant work with 0 size
@ 2022-01-31 11:02 ` Rasmus Villemoes
From: Rasmus Villemoes @ 2022-01-31 11:02 UTC (permalink / raw)
To: Andy Shevchenko, David Rientjes
Cc: Waiman Long, Johannes Weiner, Michal Hocko, Vladimir Davydov,
	Andrew Morton, Petr Mladek, Steven Rostedt, Sergey Senozhatsky,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	cgroups-u79uwXL29TY76Z2rM5mHXA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
	Ira Weiny, Rafael Aquini

On 31/01/2022 11.34, Andy Shevchenko wrote:
> On Mon, Jan 31, 2022 at 12:30:33PM +0200, Andy Shevchenko wrote:
>> On Mon, Jan 31, 2022 at 12:25:09PM +0200, Andy Shevchenko wrote:
>>> On Sun, Jan 30, 2022 at 12:49:37PM -0800, David Rientjes wrote:
>>>> On Sat, 29 Jan 2022, Waiman Long wrote:
>>>>
>>>>> For *scnprintf(), vsnprintf() is always called even if the input size is
>>>>> 0. That is a waste of time, so just return 0 in this case.
>>>
>>> Why do you think it's not legit?
>>
>> I have to elaborate.
>>
>> For *nprintf() the size=0 is quite useful to have.
>> For *cnprintf() the size=0 makes less sense, but, if we read `man snprintf()`:
>>
>>   The functions snprintf() and vsnprintf() do not write more than size bytes
>>   (including the terminating null byte ('\0')). If the output was truncated due
>>   to this limit, then the return value is the number of characters (excluding
>>   the terminating null byte) which would have been written to the final string
>>   if enough space had been available. Thus, a return value of size or more
>>   means that the output was truncated. (See also below under NOTES.)
>>
>>   If an output error is encountered, a negative value is returned.
>>
>> Note the last sentence there. You need to answer to it in the commit message
>> why your change is okay and it will show that you thought through all possible
>> scenarios.
>
> Also it seems currently the kernel documentation is not aligned with the code
>
> "If @size is == 0 the function returns 0."
>
> It should mention the (theoretical?) possibility of getting negative value,
> if vsnprintf() returns negative value.

The kernel's vsnprintf _will never_ return a negative value. There is
way too much code which relies on that. It also has to work from any
context, so we'll never do any memory allocation or anything else that
could possibly force us to error out, and even if we encounter some
impossible situation, we do not return a negative value, but just stop
the output where we are.

So yes, micro-optimizing [v]scnprintf() is completely valid, but I've
never bothered to send the patch because the use case for scnprintf() is
primarily the

	ret += scnprintf(buf + ret, size - ret, ...);

pattern, with ret starting out at 0 and size being some non-zero number.
When given a non-zero size, scnprintf() is guaranteed to return
something _strictly less_ than that value; that invariant guarantees
that the size-ret expression never becomes 0. So if scnprintf() is
properly used, I can't think of any situation where size will be 0,
hence I see that patch as correct-but-mostly-pointless.

Rasmus
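The accumulation pattern Rasmus describes can be sketched in userspace. Here `my_scnprintf()` is a hypothetical stand-in for the kernel's scnprintf(): for a non-zero size it returns strictly less than that size, so `size - ret` can never reach 0 between calls:

```c
#include <stdarg.h>
#include <stdio.h>

/* Hypothetical userspace stand-in for the kernel's scnprintf(). */
static int my_scnprintf(char *buf, size_t size, const char *fmt, ...)
{
	va_list args;
	int i;

	va_start(args, fmt);
	i = vsnprintf(buf, size, fmt, args);
	va_end(args);
	if (i < (int)size)
		return i;
	return size ? (int)size - 1 : 0;
}

/*
 * Concatenate several fields into one bounded buffer with no overrun
 * checks between calls: ret stays strictly below size, and buf stays
 * NUL-terminated even when the output is truncated.
 */
static size_t build_line(char *buf, size_t size)
{
	size_t ret = 0;

	ret += my_scnprintf(buf + ret, size - ret, "pid %d, ", 42);
	ret += my_scnprintf(buf + ret, size - ret, "order %u, ", 3U);
	ret += my_scnprintf(buf + ret, size - ret, "mask %#x", 0x1100ccaU);
	return ret;	/* always < size */
}
```

With snprintf() in place of `my_scnprintf()`, a truncated first call could make `ret` exceed `size`, turning `size - ret` into a huge unsigned value and the later writes into out-of-bounds accesses.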
* Re: [PATCH v2 1/3] lib/vsprintf: Avoid redundant work with 0 size
@ 2022-01-31 11:22 ` Andy Shevchenko
From: Andy Shevchenko @ 2022-01-31 11:22 UTC (permalink / raw)
To: Rasmus Villemoes
Cc: David Rientjes, Waiman Long, Johannes Weiner, Michal Hocko,
	Vladimir Davydov, Andrew Morton, Petr Mladek, Steven Rostedt,
	Sergey Senozhatsky, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	cgroups-u79uwXL29TY76Z2rM5mHXA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
	Ira Weiny, Rafael Aquini

On Mon, Jan 31, 2022 at 12:02:29PM +0100, Rasmus Villemoes wrote:
> On 31/01/2022 11.34, Andy Shevchenko wrote:
> > On Mon, Jan 31, 2022 at 12:30:33PM +0200, Andy Shevchenko wrote:
> >> On Mon, Jan 31, 2022 at 12:25:09PM +0200, Andy Shevchenko wrote:
> >>> On Sun, Jan 30, 2022 at 12:49:37PM -0800, David Rientjes wrote:
> >>>> On Sat, 29 Jan 2022, Waiman Long wrote:
> >>>>
> >>>>> For *scnprintf(), vsnprintf() is always called even if the input size is
> >>>>> 0. That is a waste of time, so just return 0 in this case.
> >>>
> >>> Why do you think it's not legit?
> >>
> >> I have to elaborate.
> >>
> >> For *nprintf() the size=0 is quite useful to have.
> >> For *cnprintf() the size=0 makes less sense, but, if we read `man snprintf()`:
> >>
> >>   The functions snprintf() and vsnprintf() do not write more than size bytes
> >>   (including the terminating null byte ('\0')). If the output was truncated due
> >>   to this limit, then the return value is the number of characters (excluding
> >>   the terminating null byte) which would have been written to the final string
> >>   if enough space had been available. Thus, a return value of size or more
> >>   means that the output was truncated. (See also below under NOTES.)
> >>
> >>   If an output error is encountered, a negative value is returned.
> >>
> >> Note the last sentence there. You need to answer to it in the commit message
> >> why your change is okay and it will show that you thought through all possible
> >> scenarios.
> >
> > Also it seems currently the kernel documentation is not aligned with the code
> >
> > "If @size is == 0 the function returns 0."
> >
> > It should mention the (theoretical?) possibility of getting negative value,
> > if vsnprintf() returns negative value.
>
> The kernel's vsnprintf _will never_ return a negative value. There is
> way too much code which relies on that. It also has to work from any
> context, so we'll never do any memory allocation or anything else that
> could possibly force us to error out, and even if we encounter some
> impossible situation, we do not return a negative value, but just stop
> the output where we are.

Yep, I see the code. My comments more or less are related to the (better)
commit message which may include what you just said.

> So yes, micro-optimizing [v]scnprintf() is completely valid, but I've
> never bothered to send the patch because the use case for scnprintf() is
> primarily the
>
> 	ret += scnprintf(buf + ret, size - ret, ...);
>
> pattern, with ret starting out at 0 and size being some non-zero number.
> When given a non-zero size, scnprintf() is guaranteed to return
> something _strictly less_ than that value; that invariant guarantees
> that the size-ret expression never becomes 0. So if scnprintf() is
> properly used, I can't think of any situation where size will be 0,
> hence I see that patch as correct-but-mostly-pointless.

Good remark and again commit message probably should elaborate this as well.

-- 
With Best Regards,
Andy Shevchenko
* Re: [PATCH v2 1/3] lib/vsprintf: Avoid redundant work with 0 size
@ 2022-01-31 18:48 ` Waiman Long
From: Waiman Long @ 2022-01-31 18:48 UTC (permalink / raw)
To: Andy Shevchenko, David Rientjes
Cc: Johannes Weiner, Michal Hocko, Vladimir Davydov, Andrew Morton,
	Petr Mladek, Steven Rostedt, Sergey Senozhatsky, Rasmus Villemoes,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	cgroups-u79uwXL29TY76Z2rM5mHXA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
	Ira Weiny, Rafael Aquini

On 1/31/22 05:34, Andy Shevchenko wrote:
> On Mon, Jan 31, 2022 at 12:30:33PM +0200, Andy Shevchenko wrote:
>> On Mon, Jan 31, 2022 at 12:25:09PM +0200, Andy Shevchenko wrote:
>>> On Sun, Jan 30, 2022 at 12:49:37PM -0800, David Rientjes wrote:
>>>> On Sat, 29 Jan 2022, Waiman Long wrote:
>>>>
>>>>> For *scnprintf(), vsnprintf() is always called even if the input size is
>>>>> 0. That is a waste of time, so just return 0 in this case.
>>> Why do you think it's not legit?
>> I have to elaborate.
>>
>> For *nprintf() the size=0 is quite useful to have.
>> For *cnprintf() the size=0 makes less sense, but, if we read `man snprintf()`:
>>
>>   The functions snprintf() and vsnprintf() do not write more than size bytes
>>   (including the terminating null byte ('\0')). If the output was truncated due
>>   to this limit, then the return value is the number of characters (excluding
>>   the terminating null byte) which would have been written to the final string
>>   if enough space had been available. Thus, a return value of size or more
>>   means that the output was truncated. (See also below under NOTES.)
>>
>>   If an output error is encountered, a negative value is returned.
>>
>> Note the last sentence there. You need to answer to it in the commit message
>> why your change is okay and it will show that you thought through all possible
>> scenarios.
> Also it seems currently the kernel documentation is not aligned with the code
>
> "If @size is == 0 the function returns 0."
>
> It should mention the (theoretical?) possibility of getting negative value,
> if vsnprintf() returns negative value.

AFAICS, the kernel's vsnprintf() function will not return -1. So in that
sense it is not fully POSIX compliant. Since the vscnprintf() function
always returns 0 when size is 0, there is no point in finding out exactly
how many bytes the buffer needs to hold the formatted text, as this
information will not be returned to the caller anyway.

I will update the commit message to indicate that vsnprintf() does not
return -1.

Thanks,
Longman
* Re: [PATCH v2 1/3] lib/vsprintf: Avoid redundant work with 0 size
@ 2022-02-01  7:12 ` Rasmus Villemoes
From: Rasmus Villemoes @ 2022-02-01 7:12 UTC (permalink / raw)
To: Waiman Long, Andy Shevchenko, David Rientjes
Cc: Johannes Weiner, Michal Hocko, Vladimir Davydov, Andrew Morton,
	Petr Mladek, Steven Rostedt, Sergey Senozhatsky,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	cgroups-u79uwXL29TY76Z2rM5mHXA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
	Ira Weiny, Rafael Aquini

On 31/01/2022 19.48, Waiman Long wrote:
> On 1/31/22 05:34, Andy Shevchenko wrote:
>> Also it seems currently the kernel documentation is not aligned with
>> the code
>>
>> "If @size is == 0 the function returns 0."
>>
>> It should mention the (theoretical?) possibility of getting negative
>> value, if vsnprintf() returns negative value.
>
> AFAICS, the kernel's vsnprintf() function will not return -1.

Even if it did, the "i < size" comparison in vscnprintf() is "int v
size_t", so integer promotion says that even if i were negative, that
comparison would be false, so we wouldn't forward that negative value
anyway.

> So in that sense it is not fully POSIX compliant.

Of course it's not, but not because it doesn't return -1. POSIX just
says to return that in case of an error, and as a matter of QoI, the
kernel's implementation simply can't (and must not) fail. There are
other cases where we don't follow POSIX/C, e.g. in some corner cases
around field length and precision (documented in test_printf.c), and the
non-support of %n (and floating point and handling of wchar_t*), and the
whole %p<> extension etc.

Rasmus
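The int-versus-size_t point Rasmus makes is easy to demonstrate: under the usual arithmetic conversions, the int operand is converted to size_t before the comparison, so a negative value becomes huge and the fast-path test is false. A minimal sketch (hypothetical helper name):

```c
#include <stddef.h>

/*
 * Mirrors the likely(i < size) test in vscnprintf(): i is converted to
 * size_t, so a negative i compares as a huge unsigned value and the
 * function returns 0 ("does not fit") rather than forwarding it.
 */
static int fits(int i, size_t size)
{
	return i < size;	/* int vs size_t: i is converted to size_t */
}
```

So even a hypothetical negative vsnprintf() return could never escape through the `return i;` path.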
* Re: [PATCH v2 1/3] lib/vsprintf: Avoid redundant work with 0 size
@ 2022-02-01 16:01 ` Waiman Long
From: Waiman Long @ 2022-02-01 16:01 UTC (permalink / raw)
To: Rasmus Villemoes, Andy Shevchenko, David Rientjes
Cc: Johannes Weiner, Michal Hocko, Vladimir Davydov, Andrew Morton,
	Petr Mladek, Steven Rostedt, Sergey Senozhatsky,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	cgroups-u79uwXL29TY76Z2rM5mHXA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
	Ira Weiny, Rafael Aquini

On 2/1/22 02:12, Rasmus Villemoes wrote:
> On 31/01/2022 19.48, Waiman Long wrote:
>> On 1/31/22 05:34, Andy Shevchenko wrote:
>>> Also it seems currently the kernel documentation is not aligned with
>>> the code
>>>
>>> "If @size is == 0 the function returns 0."
>>>
>>> It should mention the (theoretical?) possibility of getting negative
>>> value, if vsnprintf() returns negative value.
>> AFAICS, the kernel's vsnprintf() function will not return -1.
> Even if it did, the "i < size" comparison in vscnprintf() is "int v
> size_t", so integer promotion says that even if i were negative, that
> comparison would be false, so we wouldn't forward that negative value
> anyway.
>
>> So in that sense it is not fully POSIX compliant.
> Of course it's not, but not because it doesn't return -1. POSIX just
> says to return that in case of an error, and as a matter of QoI, the
> kernel's implementation simply can't (and must not) fail. There are
> other cases where we don't follow POSIX/C, e.g. in some corner cases
> around field length and precision (documented in test_printf.c), and the
> non-support of %n (and floating point and handling of wchar_t*), and the
> whole %p<> extension etc.
>
> Rasmus

Thanks for the clarification.

Cheers,
Longman
* Re: [PATCH v2 1/3] lib/vsprintf: Avoid redundant work with 0 size
@ 2022-01-31  2:53 ` Sergey Senozhatsky
From: Sergey Senozhatsky @ 2022-01-31 2:53 UTC (permalink / raw)
To: Waiman Long
Cc: Johannes Weiner, Michal Hocko, Vladimir Davydov, Andrew Morton,
	Petr Mladek, Steven Rostedt, Sergey Senozhatsky, Andy Shevchenko,
	Rasmus Villemoes, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	cgroups-u79uwXL29TY76Z2rM5mHXA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
	Ira Weiny, Rafael Aquini

On (22/01/29 15:53), Waiman Long wrote:
> For *scnprintf(), vsnprintf() is always called even if the input size is
> 0. That is a waste of time, so just return 0 in this case.
>
> Signed-off-by: Waiman Long <longman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>

Reviewed-by: Sergey Senozhatsky <senozhatsky-F7+t8E8rja9g9hUCZPvPmw@public.gmane.org>

> +++ b/lib/vsprintf.c
> @@ -2895,13 +2895,15 @@ int vscnprintf(char *buf, size_t size, const char *fmt, va_list args)
> {
> 	int i;
>
> +	if (!size)
> +		return 0;
> +
> 	i = vsnprintf(buf, size, fmt, args);
>
> 	if (likely(i < size))
> 		return i;
> -	if (size != 0)
> -		return size - 1;
> -	return 0;
> +
> +	return size - 1;
> }
> EXPORT_SYMBOL(vscnprintf);
* Re: [PATCH v2 1/3] lib/vsprintf: Avoid redundant work with 0 size
@ 2022-01-31 18:17 ` Roman Gushchin
From: Roman Gushchin @ 2022-01-31 18:17 UTC (permalink / raw)
To: Waiman Long
Cc: Johannes Weiner, Michal Hocko, Vladimir Davydov, Andrew Morton,
	Petr Mladek, Steven Rostedt, Sergey Senozhatsky, Andy Shevchenko,
	Rasmus Villemoes, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	cgroups-u79uwXL29TY76Z2rM5mHXA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
	Ira Weiny, Rafael Aquini

On Sat, Jan 29, 2022 at 03:53:13PM -0500, Waiman Long wrote:
> For *scnprintf(), vsnprintf() is always called even if the input size is
> 0. That is a waste of time, so just return 0 in this case.
>
> Signed-off-by: Waiman Long <longman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
> ---
>  lib/vsprintf.c | 8 +++++---
>  1 file changed, 5 insertions(+), 3 deletions(-)
>
> diff --git a/lib/vsprintf.c b/lib/vsprintf.c
> index 3b8129dd374c..a65df546fb06 100644
> --- a/lib/vsprintf.c
> +++ b/lib/vsprintf.c
> @@ -2895,13 +2895,15 @@ int vscnprintf(char *buf, size_t size, const char *fmt, va_list args)
>  {
>  	int i;
>
> +	if (!size)
> +		return 0;
> +
>  	i = vsnprintf(buf, size, fmt, args);
>
>  	if (likely(i < size))
>  		return i;
> -	if (size != 0)
> -		return size - 1;
> -	return 0;
> +
> +	return size - 1;
>  }
>  EXPORT_SYMBOL(vscnprintf);

Acked-by: Roman Gushchin <guro-b10kYP2dOMg@public.gmane.org>

Thanks!
* [PATCH v2 2/3] mm/page_owner: Use scnprintf() to avoid excessive buffer overrun check
@ 2022-01-29 20:53 ` Waiman Long
From: Waiman Long @ 2022-01-29 20:53 UTC (permalink / raw)
To: Johannes Weiner, Michal Hocko, Vladimir Davydov, Andrew Morton,
	Petr Mladek, Steven Rostedt, Sergey Senozhatsky, Andy Shevchenko,
	Rasmus Villemoes
Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	cgroups-u79uwXL29TY76Z2rM5mHXA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
	Ira Weiny, Rafael Aquini, Waiman Long

The snprintf() function can return a length greater than the given
input size. That will require a check for buffer overrun after each
invocation of snprintf(). scnprintf(), on the other hand, will never
return a greater length. By using scnprintf() in selected places, we
can avoid some buffer overrun checks except after stack_depot_snprint()
and after the last snprintf().

Signed-off-by: Waiman Long <longman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
---
 mm/page_owner.c | 14 +++-----------
 1 file changed, 3 insertions(+), 11 deletions(-)

diff --git a/mm/page_owner.c b/mm/page_owner.c
index 99e360df9465..28dac73e0542 100644
--- a/mm/page_owner.c
+++ b/mm/page_owner.c
@@ -338,19 +338,16 @@ print_page_owner(char __user *buf, size_t count, unsigned long pfn,
 	if (!kbuf)
 		return -ENOMEM;
 
-	ret = snprintf(kbuf, count,
+	ret = scnprintf(kbuf, count,
 			"Page allocated via order %u, mask %#x(%pGg), pid %d, ts %llu ns, free_ts %llu ns\n",
 			page_owner->order, page_owner->gfp_mask,
 			&page_owner->gfp_mask, page_owner->pid,
 			page_owner->ts_nsec, page_owner->free_ts_nsec);
 
-	if (ret >= count)
-		goto err;
-
 	/* Print information relevant to grouping pages by mobility */
 	pageblock_mt = get_pageblock_migratetype(page);
 	page_mt  = gfp_migratetype(page_owner->gfp_mask);
-	ret += snprintf(kbuf + ret, count - ret,
+	ret += scnprintf(kbuf + ret, count - ret,
 			"PFN %lu type %s Block %lu type %s Flags %pGp\n",
 			pfn,
 			migratetype_names[page_mt],
@@ -358,19 +355,14 @@ print_page_owner(char __user *buf, size_t count, unsigned long pfn,
 			migratetype_names[pageblock_mt],
 			&page->flags);
 
-	if (ret >= count)
-		goto err;
-
 	ret += stack_depot_snprint(handle, kbuf + ret, count - ret, 0);
 	if (ret >= count)
 		goto err;
 
 	if (page_owner->last_migrate_reason != -1) {
-		ret += snprintf(kbuf + ret, count - ret,
+		ret += scnprintf(kbuf + ret, count - ret,
 			"Page has been migrated, last migrate reason: %s\n",
 			migrate_reason_names[page_owner->last_migrate_reason]);
-		if (ret >= count)
-			goto err;
 	}
 
 	ret += snprintf(kbuf + ret, count - ret, "\n");
-- 
2.27.0
* Re: [PATCH v2 2/3] mm/page_owner: Use scnprintf() to avoid excessive buffer overrun check
@ 2022-01-31  2:58 Sergey Senozhatsky
From: Sergey Senozhatsky @ 2022-01-31 2:58 UTC (permalink / raw)
To: Waiman Long
Cc: Johannes Weiner, Michal Hocko, Vladimir Davydov, Andrew Morton,
	Petr Mladek, Steven Rostedt, Sergey Senozhatsky, Andy Shevchenko,
	Rasmus Villemoes, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	cgroups-u79uwXL29TY76Z2rM5mHXA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
	Ira Weiny, Rafael Aquini

On (22/01/29 15:53), Waiman Long wrote:
> The snprintf() function can return a length greater than the given
> input size. That will require a check for buffer overrun after each
> invocation of snprintf(). scnprintf(), on the other hand, will never
> return a greater length. By using scnprintf() in selected places, we
> can avoid some buffer overrun checks except after stack_depot_snprint()
> and after the last snprintf().
>
> Signed-off-by: Waiman Long <longman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>

Reviewed-by: Sergey Senozhatsky <senozhatsky-F7+t8E8rja9g9hUCZPvPmw@public.gmane.org>
* [PATCH v2 3/3] mm/page_owner: Dump memcg information
@ 2022-01-29 20:53 Waiman Long
From: Waiman Long @ 2022-01-29 20:53 UTC (permalink / raw)
To: Johannes Weiner, Michal Hocko, Vladimir Davydov, Andrew Morton,
	Petr Mladek, Steven Rostedt, Sergey Senozhatsky, Andy Shevchenko,
	Rasmus Villemoes
Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	cgroups-u79uwXL29TY76Z2rM5mHXA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
	Ira Weiny, Rafael Aquini, Waiman Long

It was found that a number of offlined memcgs were not freed because
they were pinned by some charged pages that were present. Even "echo
1 > /proc/sys/vm/drop_caches" wasn't able to free those pages. These
offlined but not freed memcgs tend to increase in number over time with
the side effect that percpu memory consumption as shown in /proc/meminfo
also increases over time.

In order to find out more information about those pages that pin
offlined memcgs, the page_owner feature is extended to dump memory
cgroup information especially whether the cgroup is offlined or not.

Signed-off-by: Waiman Long <longman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
---
 mm/page_owner.c | 31 +++++++++++++++++++++++++++++++
 1 file changed, 31 insertions(+)

diff --git a/mm/page_owner.c b/mm/page_owner.c
index 28dac73e0542..8dc5cd0fa227 100644
--- a/mm/page_owner.c
+++ b/mm/page_owner.c
@@ -10,6 +10,7 @@
 #include <linux/migrate.h>
 #include <linux/stackdepot.h>
 #include <linux/seq_file.h>
+#include <linux/memcontrol.h>
 #include <linux/sched/clock.h>

 #include "internal.h"
@@ -331,6 +332,7 @@ print_page_owner(char __user *buf, size_t count, unsigned long pfn,
			depot_stack_handle_t handle)
 {
 	int ret, pageblock_mt, page_mt;
+	unsigned long __maybe_unused memcg_data;
 	char *kbuf;

 	count = min_t(size_t, count, PAGE_SIZE);
@@ -365,6 +367,35 @@ print_page_owner(char __user *buf, size_t count, unsigned long pfn,
			migrate_reason_names[page_owner->last_migrate_reason]);
 	}

+#ifdef CONFIG_MEMCG
+	/*
+	 * Look for memcg information and print it out
+	 */
+	memcg_data = READ_ONCE(page->memcg_data);
+	if (memcg_data) {
+		struct mem_cgroup *memcg = page_memcg_check(page);
+		bool onlined;
+		char name[80];
+
+		if (memcg_data & MEMCG_DATA_OBJCGS)
+			ret += scnprintf(kbuf + ret, count - ret,
+					"Slab cache page\n");
+
+		if (!memcg)
+			goto copy_out;
+
+		onlined = (memcg->css.flags & CSS_ONLINE);
+		cgroup_name(memcg->css.cgroup, name, sizeof(name));
+		ret += scnprintf(kbuf + ret, count - ret,
+				"Charged %sto %smemcg %s\n",
+				PageMemcgKmem(page) ? "(via objcg) " : "",
+				onlined ? "" : "offlined ",
+				name);
+	}
+
+copy_out:
+#endif
+
 	ret += snprintf(kbuf + ret, count - ret, "\n");
 	if (ret >= count)
 		goto err;
--
2.27.0
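[Editorial note: the `page->memcg_data` field read above is a tagged pointer — its low bits carry flags such as MEMCG_DATA_OBJCGS for slab pages, and the remaining bits hold the object pointer. That is why the patch tests flag bits on `memcg_data` directly but calls page_memcg_check() to recover the actual `struct mem_cgroup *`. The userspace sketch below models the same encoding trick; the flag names, mask, and stub struct are illustrative, not the kernel's exact definitions.]

```c
#include <stdint.h>

/* Illustrative low-bit flags; pointers are assumed to be at least
 * 4-byte aligned, so bits 0-1 are free to use as tags. */
#define DATA_OBJCGS		(1UL << 0)	/* word points at per-object data (slab page) */
#define DATA_KMEM		(1UL << 1)	/* page charged as kernel memory */
#define DATA_FLAGS_MASK		(DATA_OBJCGS | DATA_KMEM)

/* Stand-in for struct mem_cgroup in this sketch. */
struct memcg_stub {
	const char *name;
	int online;
};

/* Pack a pointer and flag bits into one unsigned long, in the spirit
 * of how the kernel encodes page->memcg_data. */
static unsigned long pack_memcg_data(struct memcg_stub *memcg, unsigned long flags)
{
	return (unsigned long)(uintptr_t)memcg | flags;
}

/* Recover the pointer by masking off the flag bits; the moral
 * equivalent of page_memcg_check() in this model. */
static struct memcg_stub *unpack_memcg_data(unsigned long data)
{
	return (struct memcg_stub *)(uintptr_t)(data & ~DATA_FLAGS_MASK);
}
```

A dump pass can then branch on `data & DATA_OBJCGS` before dereferencing, exactly as the patch does with MEMCG_DATA_OBJCGS before calling into the memcg it recovered.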
* Re: [PATCH v2 3/3] mm/page_owner: Dump memcg information
@ 2022-01-30  6:33 Mike Rapoport
From: Mike Rapoport @ 2022-01-30 6:33 UTC (permalink / raw)
To: Waiman Long
Cc: Johannes Weiner, Michal Hocko, Vladimir Davydov, Andrew Morton,
	Petr Mladek, Steven Rostedt, Sergey Senozhatsky, Andy Shevchenko,
	Rasmus Villemoes, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	cgroups-u79uwXL29TY76Z2rM5mHXA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
	Ira Weiny, Rafael Aquini

On Sat, Jan 29, 2022 at 03:53:15PM -0500, Waiman Long wrote:
> It was found that a number of offlined memcgs were not freed because
> they were pinned by some charged pages that were present. Even "echo
> 1 > /proc/sys/vm/drop_caches" wasn't able to free those pages. These
> offlined but not freed memcgs tend to increase in number over time with
> the side effect that percpu memory consumption as shown in /proc/meminfo
> also increases over time.
>
> In order to find out more information about those pages that pin
> offlined memcgs, the page_owner feature is extended to dump memory
> cgroup information especially whether the cgroup is offlined or not.
>
> Signed-off-by: Waiman Long <longman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
> ---
>  mm/page_owner.c | 31 +++++++++++++++++++++++++++++++
>  1 file changed, 31 insertions(+)
>
> diff --git a/mm/page_owner.c b/mm/page_owner.c
> index 28dac73e0542..8dc5cd0fa227 100644
> --- a/mm/page_owner.c
> +++ b/mm/page_owner.c
> @@ -10,6 +10,7 @@
>  #include <linux/migrate.h>
>  #include <linux/stackdepot.h>
>  #include <linux/seq_file.h>
> +#include <linux/memcontrol.h>
>  #include <linux/sched/clock.h>
>
>  #include "internal.h"
> @@ -331,6 +332,7 @@ print_page_owner(char __user *buf, size_t count, unsigned long pfn,
> 			depot_stack_handle_t handle)
>  {
>  	int ret, pageblock_mt, page_mt;
> +	unsigned long __maybe_unused memcg_data;
>  	char *kbuf;
>
>  	count = min_t(size_t, count, PAGE_SIZE);
> @@ -365,6 +367,35 @@ print_page_owner(char __user *buf, size_t count, unsigned long pfn,
> 			migrate_reason_names[page_owner->last_migrate_reason]);
>  	}
>
> +#ifdef CONFIG_MEMCG

Can we put all this along with the declaration of memcg_data in a helper
function please?

> +	/*
> +	 * Look for memcg information and print it out
> +	 */
> +	memcg_data = READ_ONCE(page->memcg_data);
> +	if (memcg_data) {
> +		struct mem_cgroup *memcg = page_memcg_check(page);
> +		bool onlined;
> +		char name[80];
> +
> +		if (memcg_data & MEMCG_DATA_OBJCGS)
> +			ret += scnprintf(kbuf + ret, count - ret,
> +					"Slab cache page\n");
> +
> +		if (!memcg)
> +			goto copy_out;
> +
> +		onlined = (memcg->css.flags & CSS_ONLINE);
> +		cgroup_name(memcg->css.cgroup, name, sizeof(name));
> +		ret += scnprintf(kbuf + ret, count - ret,
> +				"Charged %sto %smemcg %s\n",
> +				PageMemcgKmem(page) ? "(via objcg) " : "",
> +				onlined ? "" : "offlined ",
> +				name);
> +	}
> +
> +copy_out:
> +#endif
> +
>  	ret += snprintf(kbuf + ret, count - ret, "\n");
>  	if (ret >= count)
>  		goto err;
> --
> 2.27.0

--
Sincerely yours,
Mike.
* Re: [PATCH v2 3/3] mm/page_owner: Dump memcg information
@ 2022-01-30 18:22 Waiman Long
From: Waiman Long @ 2022-01-30 18:22 UTC (permalink / raw)
To: Mike Rapoport
Cc: Johannes Weiner, Michal Hocko, Vladimir Davydov, Andrew Morton,
	Petr Mladek, Steven Rostedt, Sergey Senozhatsky, Andy Shevchenko,
	Rasmus Villemoes, linux-kernel, cgroups, linux-mm, Ira Weiny,
	Rafael Aquini

On 1/30/22 01:33, Mike Rapoport wrote:
> On Sat, Jan 29, 2022 at 03:53:15PM -0500, Waiman Long wrote:
>> It was found that a number of offlined memcgs were not freed because
>> they were pinned by some charged pages that were present. Even "echo
>> 1 > /proc/sys/vm/drop_caches" wasn't able to free those pages. These
>> offlined but not freed memcgs tend to increase in number over time with
>> the side effect that percpu memory consumption as shown in /proc/meminfo
>> also increases over time.
>>
>> In order to find out more information about those pages that pin
>> offlined memcgs, the page_owner feature is extended to dump memory
>> cgroup information especially whether the cgroup is offlined or not.
>>
>> Signed-off-by: Waiman Long <longman@redhat.com>
>> ---
>>  mm/page_owner.c | 31 +++++++++++++++++++++++++++++++
>>  1 file changed, 31 insertions(+)
>>
>> diff --git a/mm/page_owner.c b/mm/page_owner.c
>> index 28dac73e0542..8dc5cd0fa227 100644
>> --- a/mm/page_owner.c
>> +++ b/mm/page_owner.c
>> @@ -10,6 +10,7 @@
>>  #include <linux/migrate.h>
>>  #include <linux/stackdepot.h>
>>  #include <linux/seq_file.h>
>> +#include <linux/memcontrol.h>
>>  #include <linux/sched/clock.h>
>>
>>  #include "internal.h"
>> @@ -331,6 +332,7 @@ print_page_owner(char __user *buf, size_t count, unsigned long pfn,
>> 			depot_stack_handle_t handle)
>>  {
>>  	int ret, pageblock_mt, page_mt;
>> +	unsigned long __maybe_unused memcg_data;
>>  	char *kbuf;
>>
>>  	count = min_t(size_t, count, PAGE_SIZE);
>> @@ -365,6 +367,35 @@ print_page_owner(char __user *buf, size_t count, unsigned long pfn,
>> 			migrate_reason_names[page_owner->last_migrate_reason]);
>>  	}
>>
>> +#ifdef CONFIG_MEMCG
> Can we put all this along with the declaration of memcg_data in a helper
> function please?
>
Sure. Will post another version with that change.

Cheers,
Longman
* Re: [PATCH v2 3/3] mm/page_owner: Dump memcg information
@ 2022-01-30 20:51 David Rientjes
From: David Rientjes @ 2022-01-30 20:51 UTC (permalink / raw)
To: Waiman Long
Cc: Mike Rapoport, Johannes Weiner, Michal Hocko, Vladimir Davydov,
	Andrew Morton, Petr Mladek, Steven Rostedt, Sergey Senozhatsky,
	Andy Shevchenko, Rasmus Villemoes,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	cgroups-u79uwXL29TY76Z2rM5mHXA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
	Ira Weiny, Rafael Aquini

On Sun, 30 Jan 2022, Waiman Long wrote:

> On 1/30/22 01:33, Mike Rapoport wrote:
> > On Sat, Jan 29, 2022 at 03:53:15PM -0500, Waiman Long wrote:
> > > It was found that a number of offlined memcgs were not freed because
> > > they were pinned by some charged pages that were present. Even "echo
> > > 1 > /proc/sys/vm/drop_caches" wasn't able to free those pages. These
> > > offlined but not freed memcgs tend to increase in number over time with
> > > the side effect that percpu memory consumption as shown in /proc/meminfo
> > > also increases over time.
> > >
> > > In order to find out more information about those pages that pin
> > > offlined memcgs, the page_owner feature is extended to dump memory
> > > cgroup information especially whether the cgroup is offlined or not.
> > >
> > > Signed-off-by: Waiman Long <longman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
> > > ---
> > >  mm/page_owner.c | 31 +++++++++++++++++++++++++++++++
> > >  1 file changed, 31 insertions(+)
> > >
> > > diff --git a/mm/page_owner.c b/mm/page_owner.c
> > > index 28dac73e0542..8dc5cd0fa227 100644
> > > --- a/mm/page_owner.c
> > > +++ b/mm/page_owner.c
> > > @@ -10,6 +10,7 @@
> > >  #include <linux/migrate.h>
> > >  #include <linux/stackdepot.h>
> > >  #include <linux/seq_file.h>
> > > +#include <linux/memcontrol.h>
> > >  #include <linux/sched/clock.h>
> > >  #include "internal.h"
> > > @@ -331,6 +332,7 @@ print_page_owner(char __user *buf, size_t count, unsigned long pfn,
> > > 			depot_stack_handle_t handle)
> > >  {
> > >  	int ret, pageblock_mt, page_mt;
> > > +	unsigned long __maybe_unused memcg_data;
> > >  	char *kbuf;
> > >  	count = min_t(size_t, count, PAGE_SIZE);
> > > @@ -365,6 +367,35 @@ print_page_owner(char __user *buf, size_t count, unsigned long pfn,
> > > 			migrate_reason_names[page_owner->last_migrate_reason]);
> > >  	}
> > > +#ifdef CONFIG_MEMCG
> > Can we put all this along with the declaration of memcg_data in a helper
> > function please?
> >
> Sure. Will post another version with that change.
>

That would certainly make it much cleaner.  After that's done (and
perhaps addressing my nit comment in the first patch), feel free to add

	Acked-by: David Rientjes <rientjes-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>

to all three patches.  Thanks!
* Re: [PATCH v2 3/3] mm/page_owner: Dump memcg information
@ 2022-01-31  9:38 Michal Hocko
From: Michal Hocko @ 2022-01-31 9:38 UTC (permalink / raw)
To: Waiman Long
Cc: Johannes Weiner, Vladimir Davydov, Andrew Morton, Petr Mladek,
	Steven Rostedt, Sergey Senozhatsky, Andy Shevchenko,
	Rasmus Villemoes, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	cgroups-u79uwXL29TY76Z2rM5mHXA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
	Ira Weiny, Rafael Aquini

On Sat 29-01-22 15:53:15, Waiman Long wrote:
> It was found that a number of offlined memcgs were not freed because
> they were pinned by some charged pages that were present. Even "echo
> 1 > /proc/sys/vm/drop_caches" wasn't able to free those pages. These
> offlined but not freed memcgs tend to increase in number over time with
> the side effect that percpu memory consumption as shown in /proc/meminfo
> also increases over time.
>
> In order to find out more information about those pages that pin
> offlined memcgs, the page_owner feature is extended to dump memory
> cgroup information especially whether the cgroup is offlined or not.

It is not really clear to me how this is supposed to be used. Are you
really dumping all the pages in the system to find out offline memcgs?
That looks rather clumsy to me. I am not against adding memcg
information to the page owner output. That can be useful in other
contexts.

> Signed-off-by: Waiman Long <longman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
> ---
>  mm/page_owner.c | 31 +++++++++++++++++++++++++++++++
>  1 file changed, 31 insertions(+)
>
> diff --git a/mm/page_owner.c b/mm/page_owner.c
> index 28dac73e0542..8dc5cd0fa227 100644
> --- a/mm/page_owner.c
> +++ b/mm/page_owner.c
> @@ -10,6 +10,7 @@
>  #include <linux/migrate.h>
>  #include <linux/stackdepot.h>
>  #include <linux/seq_file.h>
> +#include <linux/memcontrol.h>
>  #include <linux/sched/clock.h>
>
>  #include "internal.h"
> @@ -331,6 +332,7 @@ print_page_owner(char __user *buf, size_t count, unsigned long pfn,
> 			depot_stack_handle_t handle)
>  {
>  	int ret, pageblock_mt, page_mt;
> +	unsigned long __maybe_unused memcg_data;
>  	char *kbuf;
>
>  	count = min_t(size_t, count, PAGE_SIZE);
> @@ -365,6 +367,35 @@ print_page_owner(char __user *buf, size_t count, unsigned long pfn,
> 			migrate_reason_names[page_owner->last_migrate_reason]);
>  	}
>
> +#ifdef CONFIG_MEMCG

This really begs to be in a dedicated function. page_owner_print_memcg
or something like that.

> +	/*
> +	 * Look for memcg information and print it out
> +	 */
> +	memcg_data = READ_ONCE(page->memcg_data);
> +	if (memcg_data) {
> +		struct mem_cgroup *memcg = page_memcg_check(page);
> +		bool onlined;
> +		char name[80];

What prevents the memcg from going away and being reused for a different
purpose?

> +
> +		if (memcg_data & MEMCG_DATA_OBJCGS)
> +			ret += scnprintf(kbuf + ret, count - ret,
> +					"Slab cache page\n");
> +
> +		if (!memcg)
> +			goto copy_out;
> +
> +		onlined = (memcg->css.flags & CSS_ONLINE);
> +		cgroup_name(memcg->css.cgroup, name, sizeof(name));
> +		ret += scnprintf(kbuf + ret, count - ret,
> +				"Charged %sto %smemcg %s\n",
> +				PageMemcgKmem(page) ? "(via objcg) " : "",
> +				onlined ? "" : "offlined ",
> +				name);
> +	}
> +
> +copy_out:
> +#endif

--
Michal Hocko
SUSE Labs
* Re: [PATCH v2 3/3] mm/page_owner: Dump memcg information
@ 2022-01-31 16:53 Johannes Weiner
From: Johannes Weiner @ 2022-01-31 16:53 UTC (permalink / raw)
To: Michal Hocko
Cc: Waiman Long, Vladimir Davydov, Andrew Morton, Petr Mladek,
	Steven Rostedt, Sergey Senozhatsky, Andy Shevchenko,
	Rasmus Villemoes, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	cgroups-u79uwXL29TY76Z2rM5mHXA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
	Ira Weiny, Rafael Aquini

On Mon, Jan 31, 2022 at 10:38:51AM +0100, Michal Hocko wrote:
> On Sat 29-01-22 15:53:15, Waiman Long wrote:
> > It was found that a number of offlined memcgs were not freed because
> > they were pinned by some charged pages that were present. Even "echo
> > 1 > /proc/sys/vm/drop_caches" wasn't able to free those pages. These
> > offlined but not freed memcgs tend to increase in number over time with
> > the side effect that percpu memory consumption as shown in /proc/meminfo
> > also increases over time.
> >
> > In order to find out more information about those pages that pin
> > offlined memcgs, the page_owner feature is extended to dump memory
> > cgroup information especially whether the cgroup is offlined or not.
>
> It is not really clear to me how this is supposed to be used. Are you
> really dumping all the pages in the system to find out offline memcgs?
> That looks rather clumsy to me. I am not against adding memcg
> information to the page owner output. That can be useful in other
> contexts.

We've sometimes done exactly that in production, but with drgn
scripts. It's not very common, so it doesn't need to be very efficient
either. Typically, we'd encounter a host with an unusual number of
dying cgroups, ssh in and poke around with drgn to figure out what
kind of objects are still pinning the cgroups in question.

This patch would make that process a little easier, I suppose.
* Re: [PATCH v2 3/3] mm/page_owner: Dump memcg information
@ 2022-01-31 18:15 Roman Gushchin
From: Roman Gushchin @ 2022-01-31 18:15 UTC (permalink / raw)
To: Johannes Weiner
Cc: Michal Hocko, Waiman Long, Vladimir Davydov, Andrew Morton,
	Petr Mladek, Steven Rostedt, Sergey Senozhatsky, Andy Shevchenko,
	Rasmus Villemoes, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	cgroups-u79uwXL29TY76Z2rM5mHXA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
	Ira Weiny, Rafael Aquini

On Mon, Jan 31, 2022 at 11:53:19AM -0500, Johannes Weiner wrote:
> On Mon, Jan 31, 2022 at 10:38:51AM +0100, Michal Hocko wrote:
> > On Sat 29-01-22 15:53:15, Waiman Long wrote:
> > > It was found that a number of offlined memcgs were not freed because
> > > they were pinned by some charged pages that were present. Even "echo
> > > 1 > /proc/sys/vm/drop_caches" wasn't able to free those pages. These
> > > offlined but not freed memcgs tend to increase in number over time with
> > > the side effect that percpu memory consumption as shown in /proc/meminfo
> > > also increases over time.
> > >
> > > In order to find out more information about those pages that pin
> > > offlined memcgs, the page_owner feature is extended to dump memory
> > > cgroup information especially whether the cgroup is offlined or not.
> >
> > It is not really clear to me how this is supposed to be used. Are you
> > really dumping all the pages in the system to find out offline memcgs?
> > That looks rather clumsy to me. I am not against adding memcg
> > information to the page owner output. That can be useful in other
> > contexts.
>
> We've sometimes done exactly that in production, but with drgn
> scripts. It's not very common, so it doesn't need to be very efficient
> either. Typically, we'd encounter a host with an unusual number of
> dying cgroups, ssh in and poke around with drgn to figure out what
> kind of objects are still pinning the cgroups in question.
>
> This patch would make that process a little easier, I suppose.

Right. Over the last few years I've spent an enormous amount of time
digging into various aspects of this problem, and in my experience the
combination of drgn for the inspection of the current state and bpf for
following various decisions on the reclaim path was the most useful
combination.

I really appreciate an effort to put useful tools to track memcg
references into the kernel tree, however the page_owner infra has a
limited usefulness as it has to be enabled at boot. But because it
doesn't add any overhead, I also don't think there are any reasons to
not add it.

Thanks!
* Re: [PATCH v2 3/3] mm/page_owner: Dump memcg information
@ 2022-01-31 18:25 Michal Hocko
From: Michal Hocko @ 2022-01-31 18:25 UTC (permalink / raw)
To: Roman Gushchin
Cc: Johannes Weiner, Waiman Long, Vladimir Davydov, Andrew Morton,
	Petr Mladek, Steven Rostedt, Sergey Senozhatsky, Andy Shevchenko,
	Rasmus Villemoes, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	cgroups-u79uwXL29TY76Z2rM5mHXA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
	Ira Weiny, Rafael Aquini

On Mon 31-01-22 10:15:45, Roman Gushchin wrote:
> On Mon, Jan 31, 2022 at 11:53:19AM -0500, Johannes Weiner wrote:
> > On Mon, Jan 31, 2022 at 10:38:51AM +0100, Michal Hocko wrote:
> > > On Sat 29-01-22 15:53:15, Waiman Long wrote:
> > > > It was found that a number of offlined memcgs were not freed because
> > > > they were pinned by some charged pages that were present. Even "echo
> > > > 1 > /proc/sys/vm/drop_caches" wasn't able to free those pages. These
> > > > offlined but not freed memcgs tend to increase in number over time with
> > > > the side effect that percpu memory consumption as shown in /proc/meminfo
> > > > also increases over time.
> > > >
> > > > In order to find out more information about those pages that pin
> > > > offlined memcgs, the page_owner feature is extended to dump memory
> > > > cgroup information especially whether the cgroup is offlined or not.
> > >
> > > It is not really clear to me how this is supposed to be used. Are you
> > > really dumping all the pages in the system to find out offline memcgs?
> > > That looks rather clumsy to me. I am not against adding memcg
> > > information to the page owner output. That can be useful in other
> > > contexts.
> >
> > We've sometimes done exactly that in production, but with drgn
> > scripts. It's not very common, so it doesn't need to be very efficient
> > either. Typically, we'd encounter a host with an unusual number of
> > dying cgroups, ssh in and poke around with drgn to figure out what
> > kind of objects are still pinning the cgroups in question.
> >
> > This patch would make that process a little easier, I suppose.
>
> Right. Over the last few years I've spent an enormous amount of time
> digging into various aspects of this problem, and in my experience the
> combination of drgn for the inspection of the current state and bpf for
> following various decisions on the reclaim path was the most useful
> combination.
>
> I really appreciate an effort to put useful tools to track memcg
> references into the kernel tree, however the page_owner infra has a
> limited usefulness as it has to be enabled at boot. But because it
> doesn't add any overhead, I also don't think there are any reasons to
> not add it.

Would it be feasible to add a debugfs interface to display dead memcg
information?

--
Michal Hocko
SUSE Labs
* Re: [PATCH v2 3/3] mm/page_owner: Dump memcg information
@ 2022-01-31 18:38 Waiman Long
From: Waiman Long @ 2022-01-31 18:38 UTC (permalink / raw)
To: Michal Hocko, Roman Gushchin
Cc: Johannes Weiner, Vladimir Davydov, Andrew Morton, Petr Mladek,
	Steven Rostedt, Sergey Senozhatsky, Andy Shevchenko,
	Rasmus Villemoes, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	cgroups-u79uwXL29TY76Z2rM5mHXA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
	Ira Weiny, Rafael Aquini

On 1/31/22 13:25, Michal Hocko wrote:
> On Mon 31-01-22 10:15:45, Roman Gushchin wrote:
>> On Mon, Jan 31, 2022 at 11:53:19AM -0500, Johannes Weiner wrote:
>>> On Mon, Jan 31, 2022 at 10:38:51AM +0100, Michal Hocko wrote:
>>>> On Sat 29-01-22 15:53:15, Waiman Long wrote:
>>>>> It was found that a number of offlined memcgs were not freed because
>>>>> they were pinned by some charged pages that were present. Even "echo
>>>>> 1 > /proc/sys/vm/drop_caches" wasn't able to free those pages. These
>>>>> offlined but not freed memcgs tend to increase in number over time with
>>>>> the side effect that percpu memory consumption as shown in /proc/meminfo
>>>>> also increases over time.
>>>>>
>>>>> In order to find out more information about those pages that pin
>>>>> offlined memcgs, the page_owner feature is extended to dump memory
>>>>> cgroup information especially whether the cgroup is offlined or not.
>>>> It is not really clear to me how this is supposed to be used. Are you
>>>> really dumping all the pages in the system to find out offline memcgs?
>>>> That looks rather clumsy to me. I am not against adding memcg
>>>> information to the page owner output. That can be useful in other
>>>> contexts.
>>> We've sometimes done exactly that in production, but with drgn
>>> scripts. It's not very common, so it doesn't need to be very efficient
>>> either. Typically, we'd encounter a host with an unusual number of
>>> dying cgroups, ssh in and poke around with drgn to figure out what
>>> kind of objects are still pinning the cgroups in question.
>>>
>>> This patch would make that process a little easier, I suppose.
>> Right. Over the last few years I've spent an enormous amount of time
>> digging into various aspects of this problem, and in my experience the
>> combination of drgn for the inspection of the current state and bpf for
>> following various decisions on the reclaim path was the most useful
>> combination.
>>
>> I really appreciate an effort to put useful tools to track memcg
>> references into the kernel tree, however the page_owner infra has a
>> limited usefulness as it has to be enabled at boot. But because it
>> doesn't add any overhead, I also don't think there are any reasons to
>> not add it.
> Would it be feasible to add a debugfs interface to display dead memcg
> information?

Originally, I added some debug code to keep track of the list of memcgs
that had been offlined but not yet freed. After some more testing, I
figured out that the memcgs were not freed because they were pinned by
references in the page structs. At that point, I realized that the
existing page_owner debugging tool would be useful for tracking this
kind of problem, since it already has all the infrastructure to list
where the pages were allocated as well as various fields in the page
structures.

Of course, it is also possible to have a debugfs interface to list that
dead memcg information, but displaying more information about the pages
that pin the memcgs will be hard without using the page owner tool.
Keeping track of the list of dead memcgs may also have some runtime
overhead.

Cheers,
Longman
* Re: [PATCH v2 3/3] mm/page_owner: Dump memcg information
@ 2022-02-01 10:49 Michal Hocko
From: Michal Hocko @ 2022-02-01 10:49 UTC (permalink / raw)
To: Waiman Long
Cc: Roman Gushchin, Johannes Weiner, Vladimir Davydov, Andrew Morton,
	Petr Mladek, Steven Rostedt, Sergey Senozhatsky, Andy Shevchenko,
	Rasmus Villemoes, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	cgroups-u79uwXL29TY76Z2rM5mHXA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
	Ira Weiny, Rafael Aquini

On Mon 31-01-22 13:38:28, Waiman Long wrote:
[...]
> Of course, it is also possible to have a debugfs interface to list that
> dead memcg information, but displaying more information about the pages
> that pin the memcgs will be hard without using the page owner tool.

Yes, you will need page owner or hook into the kernel by other means
(like the already mentioned drgn). The question is whether scanning all
existing pages to get that information is the best we can offer.

> Keeping track of the list of dead memcgs may also have some runtime
> overhead.

Could you be more specific? Offlined memcgs are still part of the
hierarchy IIRC. So it shouldn't be much more than iterating the whole
cgroup tree and collecting interesting data about dead cgroups.

--
Michal Hocko
SUSE Labs
* Re: [PATCH v2 3/3] mm/page_owner: Dump memcg information
@ 2022-02-01 16:41 Waiman Long
From: Waiman Long @ 2022-02-01 16:41 UTC (permalink / raw)
To: Michal Hocko
Cc: Roman Gushchin, Johannes Weiner, Vladimir Davydov, Andrew Morton,
	Petr Mladek, Steven Rostedt, Sergey Senozhatsky, Andy Shevchenko,
	Rasmus Villemoes, linux-kernel, cgroups, linux-mm, Ira Weiny,
	Rafael Aquini

On 2/1/22 05:49, Michal Hocko wrote:
> On Mon 31-01-22 13:38:28, Waiman Long wrote:
> [...]
>> Of course, it is also possible to have a debugfs interface to list that
>> dead memcg information, but displaying more information about the pages
>> that pin the memcgs will be hard without using the page owner tool.
> Yes, you will need page owner or hook into the kernel by other means
> (like the already mentioned drgn). The question is whether scanning all
> existing pages to get that information is the best we can offer.

The page_owner tool records the page information at allocation time.
There is some slight performance overhead, but it is the memory overhead
that is the major drawback of this approach, as we need one page_owner
structure for each physical page. Page scanning is only done when users
read the page_owner debugfs file. Yes, I agree that scanning all the
pages is not the most efficient way to get the dead memcg information,
but it is what the page_owner tool does. I would argue that this is the
most efficient way, coding-wise, to get this information.

>> Keeping track of the list of dead memcgs may also have some runtime
>> overhead.
> Could you be more specific? Offlined memcgs are still part of the
> hierarchy IIRC. So it shouldn't be much more than iterating the whole
> cgroup tree and collecting interesting data about dead cgroups.

What I mean is that without piggybacking on top of page_owner, we would
have to add a lot more code to collect and display that information,
which may have some overhead of its own.

Cheers,
Longman
* Re: [PATCH v2 3/3] mm/page_owner: Dump memcg information
@ 2022-02-02 8:57 ` Michal Hocko
From: Michal Hocko @ 2022-02-02 8:57 UTC (permalink / raw)
To: Waiman Long
Cc: Roman Gushchin, Johannes Weiner, Vladimir Davydov, Andrew Morton,
    Petr Mladek, Steven Rostedt, Sergey Senozhatsky, Andy Shevchenko,
    Rasmus Villemoes, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
    cgroups-u79uwXL29TY76Z2rM5mHXA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
    Ira Weiny, Rafael Aquini

On Tue 01-02-22 11:41:19, Waiman Long wrote:
>
> On 2/1/22 05:49, Michal Hocko wrote:
[...]
> > Could you be more specific? Offlined memcgs are still part of the
> > hierarchy IIRC, so it shouldn't be much more than iterating the whole
> > cgroup tree and collecting interesting data about the dead cgroups.
>
> What I mean is that without piggybacking on top of page_owner, we will
> need to add a lot more code to collect and display that information,
> which may have some overhead of its own.

Yes, there is nothing like a free lunch. Page owner is certainly a tool
that can be used. My main concern is that this tool doesn't really scale
on large machines with lots of memory. It will provide very detailed
information, but I am not sure this is particularly helpful to most
admins (why should people have to process tons of allocation backtraces
in the first place?). Wouldn't it be sufficient to have per-dead-memcg
stats to see where the memory sits?

Accumulated offline memcgs are something that bothers more people, and I
am really wondering whether we can do more for those people to evaluate
the current state.
--
Michal Hocko
SUSE Labs
* Re: [PATCH v2 3/3] mm/page_owner: Dump memcg information
@ 2022-02-02 15:54 ` Roman Gushchin
From: Roman Gushchin @ 2022-02-02 15:54 UTC (permalink / raw)
To: Michal Hocko
Cc: Waiman Long, Johannes Weiner, Vladimir Davydov, Andrew Morton,
    Petr Mladek, Steven Rostedt, Sergey Senozhatsky, Andy Shevchenko,
    Rasmus Villemoes, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
    cgroups-u79uwXL29TY76Z2rM5mHXA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
    Ira Weiny, Rafael Aquini

On Wed, Feb 02, 2022 at 09:57:18AM +0100, Michal Hocko wrote:
> On Tue 01-02-22 11:41:19, Waiman Long wrote:
> >
> > On 2/1/22 05:49, Michal Hocko wrote:
> [...]
> > > Could you be more specific? Offlined memcgs are still part of the
> > > hierarchy IIRC, so it shouldn't be much more than iterating the
> > > whole cgroup tree and collecting interesting data about the dead
> > > cgroups.
> >
> > What I mean is that without piggybacking on top of page_owner, we
> > will need to add a lot more code to collect and display that
> > information, which may have some overhead of its own.
>
> Yes, there is nothing like a free lunch. Page owner is certainly a tool
> that can be used. My main concern is that this tool doesn't really
> scale on large machines with lots of memory. It will provide very
> detailed information, but I am not sure this is particularly helpful to
> most admins (why should people have to process tons of allocation
> backtraces in the first place?). Wouldn't it be sufficient to have
> per-dead-memcg stats to see where the memory sits?
>
> Accumulated offline memcgs are something that bothers more people, and
> I am really wondering whether we can do more for those people to
> evaluate the current state.

Cgroup v2 has had corresponding counters for years. Or do you mean
something different?
* Re: [PATCH v2 3/3] mm/page_owner: Dump memcg information
@ 2022-02-02 16:38 ` Michal Hocko
From: Michal Hocko @ 2022-02-02 16:38 UTC (permalink / raw)
To: Roman Gushchin
Cc: Waiman Long, Johannes Weiner, Vladimir Davydov, Andrew Morton,
    Petr Mladek, Steven Rostedt, Sergey Senozhatsky, Andy Shevchenko,
    Rasmus Villemoes, linux-kernel, cgroups, linux-mm, Ira Weiny,
    Rafael Aquini

On Wed 02-02-22 07:54:48, Roman Gushchin wrote:
> On Wed, Feb 02, 2022 at 09:57:18AM +0100, Michal Hocko wrote:
> > On Tue 01-02-22 11:41:19, Waiman Long wrote:
> > >
> > > On 2/1/22 05:49, Michal Hocko wrote:
> > [...]
> > > > Could you be more specific? Offlined memcgs are still part of the
> > > > hierarchy IIRC, so it shouldn't be much more than iterating the
> > > > whole cgroup tree and collecting interesting data about the dead
> > > > cgroups.
> > >
> > > What I mean is that without piggybacking on top of page_owner, we
> > > will need to add a lot more code to collect and display that
> > > information, which may have some overhead of its own.
> >
> > Yes, there is nothing like a free lunch. Page owner is certainly a
> > tool that can be used. My main concern is that this tool doesn't
> > really scale on large machines with lots of memory. It will provide
> > very detailed information, but I am not sure this is particularly
> > helpful to most admins (why should people have to process tons of
> > allocation backtraces in the first place?). Wouldn't it be sufficient
> > to have per-dead-memcg stats to see where the memory sits?
> >
> > Accumulated offline memcgs are something that bothers more people,
> > and I am really wondering whether we can do more for those people to
> > evaluate the current state.
>
> Cgroup v2 has had corresponding counters for years. Or do you mean
> something different?

Do we have anything more specific than nr_dying_descendants?

I was thinking about an interface which would provide paths and stats
for dead memcgs, but I have to confess I haven't really spent much time
thinking about how much work that would be. I am by no means against
adding memcg information to the page owner output. I just think there
must be a better way to present resource consumption by dead memcgs.
--
Michal Hocko
SUSE Labs
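[Editor's note: the nr_dying_descendants counter discussed above is
reported in each cgroup v2 directory's cgroup.stat file, normally under
/sys/fs/cgroup. The sketch below walks such a tree to show where offline
cgroups accumulate; it builds a mock tree with made-up names
(system.slice/app) rather than assuming a live cgroup2 mount, so it is
an illustration, not a definitive tool.]

```python
# Sketch: find cgroups whose subtree has a non-zero nr_dying_descendants.
# On a real system ROOT would be Path("/sys/fs/cgroup"); a mock tree is
# built here so the walk itself can be demonstrated.
from pathlib import Path
import tempfile

def dying_descendants(root: Path) -> dict[str, int]:
    """Map each cgroup path to its subtree's nr_dying_descendants."""
    result = {}
    for stat in root.rglob("cgroup.stat"):
        for line in stat.read_text().splitlines():
            key, _, value = line.partition(" ")
            if key == "nr_dying_descendants" and int(value) > 0:
                result[str(stat.parent.relative_to(root))] = int(value)
    return result

# Mock hierarchy: hypothetical names, values chosen for illustration.
root = Path(tempfile.mkdtemp())
(root / "system.slice" / "app").mkdir(parents=True)
(root / "system.slice" / "cgroup.stat").write_text(
    "nr_descendants 3\nnr_dying_descendants 7\n")
(root / "system.slice" / "app" / "cgroup.stat").write_text(
    "nr_descendants 0\nnr_dying_descendants 0\n")
print(dying_descendants(root))  # -> {'system.slice': 7}
```

Because every cgroup.stat counts dying descendants for its whole
subtree, comparing a parent's value with its children's narrows down
where the offline memcgs actually sit.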
* Re: [PATCH v2 3/3] mm/page_owner: Dump memcg information
@ 2022-02-02 17:51 ` Roman Gushchin
From: Roman Gushchin @ 2022-02-02 17:51 UTC (permalink / raw)
To: Michal Hocko
Cc: Waiman Long, Johannes Weiner, Vladimir Davydov, Andrew Morton,
    Petr Mladek, Steven Rostedt, Sergey Senozhatsky, Andy Shevchenko,
    Rasmus Villemoes, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
    cgroups-u79uwXL29TY76Z2rM5mHXA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
    Ira Weiny, Rafael Aquini

On Wed, Feb 02, 2022 at 05:38:07PM +0100, Michal Hocko wrote:
> On Wed 02-02-22 07:54:48, Roman Gushchin wrote:
> > On Wed, Feb 02, 2022 at 09:57:18AM +0100, Michal Hocko wrote:
> > > On Tue 01-02-22 11:41:19, Waiman Long wrote:
> > > >
> > > > On 2/1/22 05:49, Michal Hocko wrote:
> > > [...]
> > > > > Could you be more specific? Offlined memcgs are still part of
> > > > > the hierarchy IIRC, so it shouldn't be much more than iterating
> > > > > the whole cgroup tree and collecting interesting data about the
> > > > > dead cgroups.
> > > >
> > > > What I mean is that without piggybacking on top of page_owner, we
> > > > will need to add a lot more code to collect and display that
> > > > information, which may have some overhead of its own.
> > >
> > > Yes, there is nothing like a free lunch. Page owner is certainly a
> > > tool that can be used. My main concern is that this tool doesn't
> > > really scale on large machines with lots of memory. It will provide
> > > very detailed information, but I am not sure this is particularly
> > > helpful to most admins (why should people have to process tons of
> > > allocation backtraces in the first place?). Wouldn't it be
> > > sufficient to have per-dead-memcg stats to see where the memory
> > > sits?
> > >
> > > Accumulated offline memcgs are something that bothers more people,
> > > and I am really wondering whether we can do more for those people
> > > to evaluate the current state.
> >
> > Cgroup v2 has had corresponding counters for years. Or do you mean
> > something different?
>
> Do we have anything more specific than nr_dying_descendants?

No, just nr_dying_descendants.

> I was thinking about an interface which would provide paths and stats
> for dead memcgs, but I have to confess I haven't really spent much
> time thinking about how much work that would be. I am by no means
> against adding memcg information to the page owner output. I just
> think there must be a better way to present resource consumption by
> dead memcgs.

I'd go with a drgn script. I wrote a bunch of them some time ago and can
probably revive them and post them here (will take a few days).

I agree that the problem still exists and providing some tooling around
it would be useful.

Thanks!
* Re: [PATCH v2 3/3] mm/page_owner: Dump memcg information
@ 2022-02-02 17:56 ` Michal Hocko
From: Michal Hocko @ 2022-02-02 17:56 UTC (permalink / raw)
To: Roman Gushchin
Cc: Waiman Long, Johannes Weiner, Vladimir Davydov, Andrew Morton,
    Petr Mladek, Steven Rostedt, Sergey Senozhatsky, Andy Shevchenko,
    Rasmus Villemoes, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
    cgroups-u79uwXL29TY76Z2rM5mHXA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
    Ira Weiny, Rafael Aquini

On Wed 02-02-22 09:51:32, Roman Gushchin wrote:
> On Wed, Feb 02, 2022 at 05:38:07PM +0100, Michal Hocko wrote:
> > On Wed 02-02-22 07:54:48, Roman Gushchin wrote:
> > > On Wed, Feb 02, 2022 at 09:57:18AM +0100, Michal Hocko wrote:
> > > > On Tue 01-02-22 11:41:19, Waiman Long wrote:
> > > > >
> > > > > On 2/1/22 05:49, Michal Hocko wrote:
> > > > [...]
> > > > > > Could you be more specific? Offlined memcgs are still part of
> > > > > > the hierarchy IIRC, so it shouldn't be much more than
> > > > > > iterating the whole cgroup tree and collecting interesting
> > > > > > data about the dead cgroups.
> > > > >
> > > > > What I mean is that without piggybacking on top of page_owner,
> > > > > we will need to add a lot more code to collect and display that
> > > > > information, which may have some overhead of its own.
> > > >
> > > > Yes, there is nothing like a free lunch. Page owner is certainly
> > > > a tool that can be used. My main concern is that this tool
> > > > doesn't really scale on large machines with lots of memory. It
> > > > will provide very detailed information, but I am not sure this is
> > > > particularly helpful to most admins (why should people have to
> > > > process tons of allocation backtraces in the first place?).
> > > > Wouldn't it be sufficient to have per-dead-memcg stats to see
> > > > where the memory sits?
> > > >
> > > > Accumulated offline memcgs are something that bothers more
> > > > people, and I am really wondering whether we can do more for
> > > > those people to evaluate the current state.
> > >
> > > Cgroup v2 has had corresponding counters for years. Or do you mean
> > > something different?
> >
> > Do we have anything more specific than nr_dying_descendants?
>
> No, just nr_dying_descendants.
>
> > I was thinking about an interface which would provide paths and
> > stats for dead memcgs, but I have to confess I haven't really spent
> > much time thinking about how much work that would be. I am by no
> > means against adding memcg information to the page owner output. I
> > just think there must be a better way to present resource
> > consumption by dead memcgs.
>
> I'd go with a drgn script. I wrote a bunch of them some time ago and
> can probably revive them and post them here (will take a few days).

That would be really awesome! Thanks!
--
Michal Hocko
SUSE Labs
* Re: [PATCH v2 3/3] mm/page_owner: Dump memcg information
@ 2022-02-02 16:29 ` Waiman Long
From: Waiman Long @ 2022-02-02 16:29 UTC (permalink / raw)
To: Michal Hocko
Cc: Roman Gushchin, Johannes Weiner, Vladimir Davydov, Andrew Morton,
    Petr Mladek, Steven Rostedt, Sergey Senozhatsky, Andy Shevchenko,
    Rasmus Villemoes, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
    cgroups-u79uwXL29TY76Z2rM5mHXA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
    Ira Weiny, Rafael Aquini

On 2/2/22 03:57, Michal Hocko wrote:
> On Tue 01-02-22 11:41:19, Waiman Long wrote:
>> On 2/1/22 05:49, Michal Hocko wrote:
> [...]
>>> Could you be more specific? Offlined memcgs are still part of the
>>> hierarchy IIRC, so it shouldn't be much more than iterating the whole
>>> cgroup tree and collecting interesting data about the dead cgroups.
>> What I mean is that without piggybacking on top of page_owner, we will
>> need to add a lot more code to collect and display that information,
>> which may have some overhead of its own.
> Yes, there is nothing like a free lunch. Page owner is certainly a tool
> that can be used. My main concern is that this tool doesn't really
> scale on large machines with lots of memory. It will provide very
> detailed information, but I am not sure this is particularly helpful to
> most admins (why should people have to process tons of allocation
> backtraces in the first place?). Wouldn't it be sufficient to have
> per-dead-memcg stats to see where the memory sits?
>
> Accumulated offline memcgs are something that bothers more people, and
> I am really wondering whether we can do more for those people to
> evaluate the current state.

You won't get the stack backtrace information without page_owner
enabled, and I believe that is a helpful piece of information. I don't
expect page_owner to be enabled by default on production systems because
of its memory overhead.

I believe you can actually see the number of memory cgroups present by
looking at the /proc/cgroups file, though you don't know how many of
them are offline memcgs. So if one suspects that there is a large number
of offline memcgs, one can set up a test environment with page_owner
enabled for further analysis.

Cheers,
Longman
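[Editor's note: once such a page_owner-enabled test environment exists,
the dump lends itself to simple post-processing. The sketch below is a
hypothetical aggregator, not part of the patch: it assumes each
page_owner record is blank-line separated and, with the patch under
discussion applied, carries a line of the form "Charged to offline memcg
<path>" for pages whose memcg has been offlined. The sample paths are
invented; check the exact line format against the applied patch.]

```python
# Hypothetical post-processing sketch: count pages in an extended
# page_owner dump that are charged to offline memcgs, grouped by memcg
# path. Records are assumed to be separated by blank lines.
import re
from collections import Counter

def count_offline_memcg_pages(dump_text: str) -> Counter:
    counts = Counter()
    for record in dump_text.split("\n\n"):
        # Only matches records explicitly marked "offline" by the patch.
        m = re.search(r"Charged to offline memcg (\S+)", record)
        if m:
            counts[m.group(1)] += 1  # one record per allocated page
    return counts

# Minimal fabricated dump with two offline-memcg pages and one live one.
sample = """Page allocated via order 0, mask 0x1100cca(GFP_HIGHUSER_MOVABLE)
Charged to offline memcg /machine.slice/libpod-123

Page allocated via order 0, mask 0x1100cca(GFP_HIGHUSER_MOVABLE)
Charged to memcg /system.slice

Page allocated via order 0, mask 0x1100cca(GFP_HIGHUSER_MOVABLE)
Charged to offline memcg /machine.slice/libpod-123
"""
print(count_offline_memcg_pages(sample))
# -> Counter({'/machine.slice/libpod-123': 2})
```

Sorting the resulting counter by value points straight at the memcgs
holding the most pinned pages, without reading the backtraces by hand.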
* Re: [PATCH v2 3/3] mm/page_owner: Dump memcg information
@ 2022-01-31 19:01 ` Waiman Long
From: Waiman Long @ 2022-01-31 19:01 UTC (permalink / raw)
To: Michal Hocko
Cc: Johannes Weiner, Vladimir Davydov, Andrew Morton, Petr Mladek,
    Steven Rostedt, Sergey Senozhatsky, Andy Shevchenko,
    Rasmus Villemoes, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
    cgroups-u79uwXL29TY76Z2rM5mHXA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
    Ira Weiny, Rafael Aquini

On 1/31/22 04:38, Michal Hocko wrote:
> On Sat 29-01-22 15:53:15, Waiman Long wrote:
>> It was found that a number of offlined memcgs were not freed because
>> they were pinned by some charged pages that were present. Even "echo
>> 1 > /proc/sys/vm/drop_caches" wasn't able to free those pages. These
>> offlined but not freed memcgs tend to increase in number over time with
>> the side effect that percpu memory consumption as shown in /proc/meminfo
>> also increases over time.
>>
>> In order to find out more information about those pages that pin
>> offlined memcgs, the page_owner feature is extended to dump memory
>> cgroup information, especially whether the cgroup is offlined or not.
> It is not really clear to me how this is supposed to be used. Are you
> really dumping all the pages in the system to find out offline memcgs?
> That looks rather clumsy to me. I am not against adding memcg
> information to the page owner output. That can be useful in other
> contexts.

I am just piggybacking on top of the existing page_owner tool to provide
the information I need to find out what pages are pinning the dead
memcgs. page_owner is a debugging tool that is not turned on by default.
We do have to add a kernel parameter and reboot the system to use it,
but that is pretty easy to do once we have a reproducer that can
reproduce the problem.

Cheers,
Longman
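[Editor's note: the enablement steps mentioned here (kernel parameter
plus reboot) follow the kernel's own page_owner documentation; a sketch
of the required configuration, for readers unfamiliar with the tool:]

```
# Build-time: the kernel must be configured with
CONFIG_PAGE_OWNER=y

# Run-time: recording is off by default; enable it on the kernel
# command line and reboot:
page_owner=on

# The dump is then read from debugfs, e.g.:
#   cat /sys/kernel/debug/page_owner > page_owner_dump.txt
```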
Thread overview: 34+ messages
2022-01-29 20:53 [PATCH v2 0/3] mm/page_owner: Extend page_owner to show memcg information Waiman Long
[not found] ` <20220129205315.478628-1-longman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2022-01-29 20:53 ` [PATCH v2 1/3] lib/vsprintf: Avoid redundant work with 0 size Waiman Long
[not found] ` <20220129205315.478628-2-longman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2022-01-30 20:49 ` David Rientjes
[not found] ` <d99b3c4b-7b6e-529-6e4b-b91b65c92d81-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
2022-01-30 20:57 ` Waiman Long
2022-01-31 10:25 ` Andy Shevchenko
2022-01-31 10:30 ` Andy Shevchenko
[not found] ` <Yfe6SfG4CqzWSaMM-XvqNBM/wLWRrdx17CPfAsdBPR1lH4CV8@public.gmane.org>
2022-01-31 10:34 ` Andy Shevchenko
2022-01-31 11:02 ` Rasmus Villemoes
[not found] ` <d44824d4-2dd1-a8ab-d3ee-ac67b749ca6f-qQsb+v5E8BnlAoU/VqSP6n9LOBIZ5rWg@public.gmane.org>
2022-01-31 11:22 ` Andy Shevchenko
2022-01-31 18:48 ` Waiman Long
[not found] ` <c33b6435-1b27-32af-b14c-0f3a0318dcca-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2022-02-01 7:12 ` Rasmus Villemoes
[not found] ` <f3bcf541-e77b-ca93-ef5c-862f4de99366-qQsb+v5E8BnlAoU/VqSP6n9LOBIZ5rWg@public.gmane.org>
2022-02-01 16:01 ` Waiman Long
2022-01-31 2:53 ` Sergey Senozhatsky
2022-01-31 18:17 ` Roman Gushchin
2022-01-29 20:53 ` [PATCH v2 2/3] mm/page_owner: Use scnprintf() to avoid excessive buffer overrun check Waiman Long
[not found] ` <20220129205315.478628-3-longman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2022-01-31 2:58 ` Sergey Senozhatsky
2022-01-29 20:53 ` [PATCH v2 3/3] mm/page_owner: Dump memcg information Waiman Long
[not found] ` <20220129205315.478628-4-longman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2022-01-30 6:33 ` Mike Rapoport
2022-01-30 18:22 ` Waiman Long
[not found] ` <82c99093-e44b-7fac-14ab-9e8392d107ea-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2022-01-30 20:51 ` David Rientjes
2022-01-31 9:38 ` Michal Hocko
[not found] ` <YfeuK5j7cbgM+Oo+-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
2022-01-31 16:53 ` Johannes Weiner
[not found] ` <YfgT/9tEREQNiiAN-druUgvl0LCNAfugRpC6u6w@public.gmane.org>
2022-01-31 18:15 ` Roman Gushchin
[not found] ` <YfgnUZQBRkqhrEIb-cx5fftMpWqeCjSd+JxjunQ2O0Ztt9esIQQ4Iyu8u01E@public.gmane.org>
2022-01-31 18:25 ` Michal Hocko
[not found] ` <Yfgpknwr1tMnPkqh-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
2022-01-31 18:38 ` Waiman Long
[not found] ` <12686956-612d-d89b-5641-470d5e913090-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2022-02-01 10:49 ` Michal Hocko
2022-02-01 16:41 ` Waiman Long
[not found] ` <268a8bdf-4c70-b967-f34c-2293b54186f0-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2022-02-02 8:57 ` Michal Hocko
[not found] ` <YfpHbtffFi2x1L4p-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
2022-02-02 15:54 ` Roman Gushchin
2022-02-02 16:38 ` Michal Hocko
[not found] ` <YfqzbwAPKpshXSLK-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
2022-02-02 17:51 ` Roman Gushchin
[not found] ` <YfrEpOGObnc0mYAW-cx5fftMpWqeCjSd+JxjunQ2O0Ztt9esIQQ4Iyu8u01E@public.gmane.org>
2022-02-02 17:56 ` Michal Hocko
2022-02-02 16:29 ` Waiman Long
2022-01-31 19:01 ` Waiman Long