[RFC PATCH] mm: fix the inaccurate memory statistics issue for users

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

* [RFC PATCH] mm: fix the inaccurate memory statistics issue for users
@ 2025-05-23  3:16 Baolin Wang
  2025-05-23  5:25 ` Donet Tom
                   ` (3 more replies)
  0 siblings, 4 replies; 11+ messages in thread
From: Baolin Wang @ 2025-05-23  3:16 UTC (permalink / raw)
  To: akpm, david, shakeelb
  Cc: lorenzo.stoakes, Liam.Howlett, vbabka, rppt, surenb, mhocko,
	baolin.wang, linux-mm, linux-fsdevel, linux-kernel

On some large machines with a high number of CPUs running a 64K kernel,
we found that the 'RES' field is always 0 displayed by the top command
for some processes, which will cause a lot of confusion for users.

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
 875525 root      20   0   12480      0      0 R   0.3   0.0   0:00.08 top
      1 root      20   0  172800      0      0 S   0.0   0.0   0:04.52 systemd

The main reason is that the batch size of the percpu counter is quite large
on these machines, caching a significant percpu value, since converting mm's
rss stats into percpu_counter by commit f1a7941243c1 ("mm: convert mm's rss
stats into percpu_counter"). Intuitively, the batch number should be optimized,
but on some paths, performance may take precedence over statistical accuracy.
Therefore, introducing a new interface to add the percpu statistical count
and display it to users, which can remove the confusion. In addition, this
change is not expected to be on a performance-critical path, so the modification
should be acceptable.

Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
---
 fs/proc/task_mmu.c | 14 +++++++-------
 include/linux/mm.h |  5 +++++
 2 files changed, 12 insertions(+), 7 deletions(-)

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index b9e4fbbdf6e6..f629e6526935 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -36,9 +36,9 @@ void task_mem(struct seq_file *m, struct mm_struct *mm)
 	unsigned long text, lib, swap, anon, file, shmem;
 	unsigned long hiwater_vm, total_vm, hiwater_rss, total_rss;
 
-	anon = get_mm_counter(mm, MM_ANONPAGES);
-	file = get_mm_counter(mm, MM_FILEPAGES);
-	shmem = get_mm_counter(mm, MM_SHMEMPAGES);
+	anon = get_mm_counter_sum(mm, MM_ANONPAGES);
+	file = get_mm_counter_sum(mm, MM_FILEPAGES);
+	shmem = get_mm_counter_sum(mm, MM_SHMEMPAGES);
 
 	/*
 	 * Note: to minimize their overhead, mm maintains hiwater_vm and
@@ -59,7 +59,7 @@ void task_mem(struct seq_file *m, struct mm_struct *mm)
 	text = min(text, mm->exec_vm << PAGE_SHIFT);
 	lib = (mm->exec_vm << PAGE_SHIFT) - text;
 
-	swap = get_mm_counter(mm, MM_SWAPENTS);
+	swap = get_mm_counter_sum(mm, MM_SWAPENTS);
 	SEQ_PUT_DEC("VmPeak:\t", hiwater_vm);
 	SEQ_PUT_DEC(" kB\nVmSize:\t", total_vm);
 	SEQ_PUT_DEC(" kB\nVmLck:\t", mm->locked_vm);
@@ -92,12 +92,12 @@ unsigned long task_statm(struct mm_struct *mm,
 			 unsigned long *shared, unsigned long *text,
 			 unsigned long *data, unsigned long *resident)
 {
-	*shared = get_mm_counter(mm, MM_FILEPAGES) +
-			get_mm_counter(mm, MM_SHMEMPAGES);
+	*shared = get_mm_counter_sum(mm, MM_FILEPAGES) +
+			get_mm_counter_sum(mm, MM_SHMEMPAGES);
 	*text = (PAGE_ALIGN(mm->end_code) - (mm->start_code & PAGE_MASK))
 								>> PAGE_SHIFT;
 	*data = mm->data_vm + mm->stack_vm;
-	*resident = *shared + get_mm_counter(mm, MM_ANONPAGES);
+	*resident = *shared + get_mm_counter_sum(mm, MM_ANONPAGES);
 	return mm->total_vm;
 }
 
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 185424858f23..15ec5cfe9515 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2568,6 +2568,11 @@ static inline unsigned long get_mm_counter(struct mm_struct *mm, int member)
 	return percpu_counter_read_positive(&mm->rss_stat[member]);
 }
 
+static inline unsigned long get_mm_counter_sum(struct mm_struct *mm, int member)
+{
+	return percpu_counter_sum_positive(&mm->rss_stat[member]);
+}
+
 void mm_trace_rss_stat(struct mm_struct *mm, int member);
 
 static inline void add_mm_counter(struct mm_struct *mm, int member, long value)
-- 
2.43.5



^ permalink raw reply related	[flat|nested] 11+ messages in thread

* Re: [RFC PATCH] mm: fix the inaccurate memory statistics issue for users
  2025-05-23  3:16 [RFC PATCH] mm: fix the inaccurate memory statistics issue for users Baolin Wang
@ 2025-05-23  5:25 ` Donet Tom
  2025-05-23  5:47   ` Baolin Wang
  2025-05-23 10:14 ` Aboorva Devarajan
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 11+ messages in thread
From: Donet Tom @ 2025-05-23  5:25 UTC (permalink / raw)
  To: Baolin Wang, akpm, david, shakeelb
  Cc: lorenzo.stoakes, Liam.Howlett, vbabka, rppt, surenb, mhocko,
	linux-mm, linux-fsdevel, linux-kernel


On 5/23/25 8:46 AM, Baolin Wang wrote:
> On some large machines with a high number of CPUs running a 64K kernel,
> we found that the 'RES' field is always 0 displayed by the top command
> for some processes, which will cause a lot of confusion for users.
>
>      PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
>   875525 root      20   0   12480      0      0 R   0.3   0.0   0:00.08 top
>        1 root      20   0  172800      0      0 S   0.0   0.0   0:04.52 systemd
>
> The main reason is that the batch size of the percpu counter is quite large
> on these machines, caching a significant percpu value, since converting mm's
> rss stats into percpu_counter by commit f1a7941243c1 ("mm: convert mm's rss
> stats into percpu_counter"). Intuitively, the batch number should be optimized,
> but on some paths, performance may take precedence over statistical accuracy.
> Therefore, introducing a new interface to add the percpu statistical count
> and display it to users, which can remove the confusion. In addition, this
> change is not expected to be on a performance-critical path, so the modification
> should be acceptable.
>
> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
> ---
>   fs/proc/task_mmu.c | 14 +++++++-------
>   include/linux/mm.h |  5 +++++
>   2 files changed, 12 insertions(+), 7 deletions(-)
>
> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> index b9e4fbbdf6e6..f629e6526935 100644
> --- a/fs/proc/task_mmu.c
> +++ b/fs/proc/task_mmu.c
> @@ -36,9 +36,9 @@ void task_mem(struct seq_file *m, struct mm_struct *mm)
>   	unsigned long text, lib, swap, anon, file, shmem;
>   	unsigned long hiwater_vm, total_vm, hiwater_rss, total_rss;
>   
> -	anon = get_mm_counter(mm, MM_ANONPAGES);
> -	file = get_mm_counter(mm, MM_FILEPAGES);
> -	shmem = get_mm_counter(mm, MM_SHMEMPAGES);
> +	anon = get_mm_counter_sum(mm, MM_ANONPAGES);


Hi Baolin Wang,

We also observed the same issue where the RSS value in /proc/PID/status
was 0 on machines with a high number of CPUs. With this patch, the issue
got fixedl

Rss value without this patch
----------------------------
  # cat /proc/87406/status
.....
VmRSS:           0 kB
RssAnon:           0 kB
RssFile:           0 k


Rss values with this patch
--------------------------
  # cat /proc/3055/status
VmRSS:        2176 kB
RssAnon:         512 kB
RssFile:        1664 kB
RssShmem:           0 kB

Tested-by Donet Tom <donettom@linux.ibm.com>


> +	file = get_mm_counter_sum(mm, MM_FILEPAGES);
> +	shmem = get_mm_counter_sum(mm, MM_SHMEMPAGES);
>   
>   	/*
>   	 * Note: to minimize their overhead, mm maintains hiwater_vm and
> @@ -59,7 +59,7 @@ void task_mem(struct seq_file *m, struct mm_struct *mm)
>   	text = min(text, mm->exec_vm << PAGE_SHIFT);
>   	lib = (mm->exec_vm << PAGE_SHIFT) - text;
>   
> -	swap = get_mm_counter(mm, MM_SWAPENTS);
> +	swap = get_mm_counter_sum(mm, MM_SWAPENTS);
>   	SEQ_PUT_DEC("VmPeak:\t", hiwater_vm);
>   	SEQ_PUT_DEC(" kB\nVmSize:\t", total_vm);
>   	SEQ_PUT_DEC(" kB\nVmLck:\t", mm->locked_vm);
> @@ -92,12 +92,12 @@ unsigned long task_statm(struct mm_struct *mm,
>   			 unsigned long *shared, unsigned long *text,
>   			 unsigned long *data, unsigned long *resident)
>   {
> -	*shared = get_mm_counter(mm, MM_FILEPAGES) +
> -			get_mm_counter(mm, MM_SHMEMPAGES);
> +	*shared = get_mm_counter_sum(mm, MM_FILEPAGES) +
> +			get_mm_counter_sum(mm, MM_SHMEMPAGES);
>   	*text = (PAGE_ALIGN(mm->end_code) - (mm->start_code & PAGE_MASK))
>   								>> PAGE_SHIFT;
>   	*data = mm->data_vm + mm->stack_vm;
> -	*resident = *shared + get_mm_counter(mm, MM_ANONPAGES);
> +	*resident = *shared + get_mm_counter_sum(mm, MM_ANONPAGES);
>   	return mm->total_vm;
>   }
>   
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 185424858f23..15ec5cfe9515 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -2568,6 +2568,11 @@ static inline unsigned long get_mm_counter(struct mm_struct *mm, int member)
>   	return percpu_counter_read_positive(&mm->rss_stat[member]);
>   }
>   
> +static inline unsigned long get_mm_counter_sum(struct mm_struct *mm, int member)
> +{
> +	return percpu_counter_sum_positive(&mm->rss_stat[member]);
> +}
> +
>   void mm_trace_rss_stat(struct mm_struct *mm, int member);
>   
>   static inline void add_mm_counter(struct mm_struct *mm, int member, long value)


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [RFC PATCH] mm: fix the inaccurate memory statistics issue for users
  2025-05-23  5:25 ` Donet Tom
@ 2025-05-23  5:47   ` Baolin Wang
  0 siblings, 0 replies; 11+ messages in thread
From: Baolin Wang @ 2025-05-23  5:47 UTC (permalink / raw)
  To: Donet Tom, akpm, david, shakeelb
  Cc: lorenzo.stoakes, Liam.Howlett, vbabka, rppt, surenb, mhocko,
	linux-mm, linux-fsdevel, linux-kernel



On 2025/5/23 13:25, Donet Tom wrote:
> 
> On 5/23/25 8:46 AM, Baolin Wang wrote:
>> On some large machines with a high number of CPUs running a 64K kernel,
>> we found that the 'RES' field is always 0 displayed by the top command
>> for some processes, which will cause a lot of confusion for users.
>>
>>      PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     
>> TIME+ COMMAND
>>   875525 root      20   0   12480      0      0 R   0.3   0.0   
>> 0:00.08 top
>>        1 root      20   0  172800      0      0 S   0.0   0.0   
>> 0:04.52 systemd
>>
>> The main reason is that the batch size of the percpu counter is quite 
>> large
>> on these machines, caching a significant percpu value, since 
>> converting mm's
>> rss stats into percpu_counter by commit f1a7941243c1 ("mm: convert 
>> mm's rss
>> stats into percpu_counter"). Intuitively, the batch number should be 
>> optimized,
>> but on some paths, performance may take precedence over statistical 
>> accuracy.
>> Therefore, introducing a new interface to add the percpu statistical 
>> count
>> and display it to users, which can remove the confusion. In addition, 
>> this
>> change is not expected to be on a performance-critical path, so the 
>> modification
>> should be acceptable.
>>
>> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
>> ---
>>   fs/proc/task_mmu.c | 14 +++++++-------
>>   include/linux/mm.h |  5 +++++
>>   2 files changed, 12 insertions(+), 7 deletions(-)
>>
>> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
>> index b9e4fbbdf6e6..f629e6526935 100644
>> --- a/fs/proc/task_mmu.c
>> +++ b/fs/proc/task_mmu.c
>> @@ -36,9 +36,9 @@ void task_mem(struct seq_file *m, struct mm_struct *mm)
>>       unsigned long text, lib, swap, anon, file, shmem;
>>       unsigned long hiwater_vm, total_vm, hiwater_rss, total_rss;
>> -    anon = get_mm_counter(mm, MM_ANONPAGES);
>> -    file = get_mm_counter(mm, MM_FILEPAGES);
>> -    shmem = get_mm_counter(mm, MM_SHMEMPAGES);
>> +    anon = get_mm_counter_sum(mm, MM_ANONPAGES);
> 
> 
> Hi Baolin Wang,
> 
> We also observed the same issue where the RSS value in /proc/PID/status
> was 0 on machines with a high number of CPUs. With this patch, the issue
> got fixedl

Yes, we also observed this issue.

> Rss value without this patch
> ----------------------------
>   # cat /proc/87406/status
> .....
> VmRSS:           0 kB
> RssAnon:           0 kB
> RssFile:           0 k
> 
> 
> Rss values with this patch
> --------------------------
>   # cat /proc/3055/status
> VmRSS:        2176 kB
> RssAnon:         512 kB
> RssFile:        1664 kB
> RssShmem:           0 kB
> 
> Tested-by Donet Tom <donettom@linux.ibm.com>

Thanks for testing.


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [RFC PATCH] mm: fix the inaccurate memory statistics issue for users
  2025-05-23  3:16 [RFC PATCH] mm: fix the inaccurate memory statistics issue for users Baolin Wang
  2025-05-23  5:25 ` Donet Tom
@ 2025-05-23 10:14 ` Aboorva Devarajan
  2025-05-24  1:24   ` Baolin Wang
  2025-05-30 10:23   ` Michal Hocko
  2025-05-23 14:11 ` Shakeel Butt
  2025-05-23 17:20 ` SeongJae Park
  3 siblings, 2 replies; 11+ messages in thread
From: Aboorva Devarajan @ 2025-05-23 10:14 UTC (permalink / raw)
  To: Baolin Wang, akpm, david, shakeelb
  Cc: lorenzo.stoakes, Liam.Howlett, vbabka, rppt, surenb, mhocko,
	linux-mm, linux-fsdevel, linux-kernel, aboorvad

On Fri, 2025-05-23 at 11:16 +0800, Baolin Wang wrote:
> On some large machines with a high number of CPUs running a 64K kernel,
> we found that the 'RES' field is always 0 displayed by the top command
> for some processes, which will cause a lot of confusion for users.
> 
>     PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
>  875525 root      20   0   12480      0      0 R   0.3   0.0   0:00.08 top
>       1 root      20   0  172800      0      0 S   0.0   0.0   0:04.52 systemd
> 
> The main reason is that the batch size of the percpu counter is quite large
> on these machines, caching a significant percpu value, since converting mm's
> rss stats into percpu_counter by commit f1a7941243c1 ("mm: convert mm's rss
> stats into percpu_counter"). Intuitively, the batch number should be optimized,
> but on some paths, performance may take precedence over statistical accuracy.
> Therefore, introducing a new interface to add the percpu statistical count
> and display it to users, which can remove the confusion. In addition, this
> change is not expected to be on a performance-critical path, so the modification
> should be acceptable.
> 
> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
> ---
>  fs/proc/task_mmu.c | 14 +++++++-------
>  include/linux/mm.h |  5 +++++
>  2 files changed, 12 insertions(+), 7 deletions(-)
> 
> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> index b9e4fbbdf6e6..f629e6526935 100644
> --- a/fs/proc/task_mmu.c
> +++ b/fs/proc/task_mmu.c
> @@ -36,9 +36,9 @@ void task_mem(struct seq_file *m, struct mm_struct *mm)
>  	unsigned long text, lib, swap, anon, file, shmem;
>  	unsigned long hiwater_vm, total_vm, hiwater_rss, total_rss;
>  
> -	anon = get_mm_counter(mm, MM_ANONPAGES);
> -	file = get_mm_counter(mm, MM_FILEPAGES);
> -	shmem = get_mm_counter(mm, MM_SHMEMPAGES);
> +	anon = get_mm_counter_sum(mm, MM_ANONPAGES);
> +	file = get_mm_counter_sum(mm, MM_FILEPAGES);
> +	shmem = get_mm_counter_sum(mm, MM_SHMEMPAGES);
>  
>  	/*
>  	 * Note: to minimize their overhead, mm maintains hiwater_vm and
> @@ -59,7 +59,7 @@ void task_mem(struct seq_file *m, struct mm_struct *mm)
>  	text = min(text, mm->exec_vm << PAGE_SHIFT);
>  	lib = (mm->exec_vm << PAGE_SHIFT) - text;
>  
> -	swap = get_mm_counter(mm, MM_SWAPENTS);
> +	swap = get_mm_counter_sum(mm, MM_SWAPENTS);
>  	SEQ_PUT_DEC("VmPeak:\t", hiwater_vm);
>  	SEQ_PUT_DEC(" kB\nVmSize:\t", total_vm);
>  	SEQ_PUT_DEC(" kB\nVmLck:\t", mm->locked_vm);
> @@ -92,12 +92,12 @@ unsigned long task_statm(struct mm_struct *mm,
>  			 unsigned long *shared, unsigned long *text,
>  			 unsigned long *data, unsigned long *resident)
>  {
> -	*shared = get_mm_counter(mm, MM_FILEPAGES) +
> -			get_mm_counter(mm, MM_SHMEMPAGES);
> +	*shared = get_mm_counter_sum(mm, MM_FILEPAGES) +
> +			get_mm_counter_sum(mm, MM_SHMEMPAGES);
>  	*text = (PAGE_ALIGN(mm->end_code) - (mm->start_code & PAGE_MASK))
>  								>> PAGE_SHIFT;
>  	*data = mm->data_vm + mm->stack_vm;
> -	*resident = *shared + get_mm_counter(mm, MM_ANONPAGES);
> +	*resident = *shared + get_mm_counter_sum(mm, MM_ANONPAGES);
>  	return mm->total_vm;
>  }
>  
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 185424858f23..15ec5cfe9515 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -2568,6 +2568,11 @@ static inline unsigned long get_mm_counter(struct mm_struct *mm, int member)
>  	return percpu_counter_read_positive(&mm->rss_stat[member]);
>  }
>  
> +static inline unsigned long get_mm_counter_sum(struct mm_struct *mm, int member)
> +{
> +	return percpu_counter_sum_positive(&mm->rss_stat[member]);
> +}
> +
>  void mm_trace_rss_stat(struct mm_struct *mm, int member);
>  
>  static inline void add_mm_counter(struct mm_struct *mm, int member, long value)

Hi Baolin,

This patch looks good to me. We observed a similar issue where the
generic mm selftest split_huge_page_test failed due to outdated RssAnon
values reported in /proc/[pid]/status.

...

Without Patch:

# ./split_huge_page_test 
TAP version 13
1..34
Bail out! No RssAnon is allocated before split
# Planned tests != run tests (34 != 0)
# Totals: pass:0 fail:0 xfail:0 xpass:0 skip:0 error:0

...

With Patch:

# ./split_huge_page_test
# ./split_huge_page_test 
TAP version 13
1..34
...
# Totals: pass:11 fail:0 xfail:0 xpass:0 skip:23 error:0

...

While this change may introduce some lock contention, it only affects
the task_mem function which is invoked only when reading
/proc/[pid]/status. Since this is not on a performance critical path,
it will be good to have this change in order to get accurate memory
stats.

This fix resolves the issue we've seen with split_huge_page_test.

Thanks!


Reviewed-by: Aboorva Devarajan <aboorvad@linux.ibm.com>
Tested-by: Aboorva Devarajan <aboorvad@linux.ibm.com>




^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [RFC PATCH] mm: fix the inaccurate memory statistics issue for users
  2025-05-23  3:16 [RFC PATCH] mm: fix the inaccurate memory statistics issue for users Baolin Wang
  2025-05-23  5:25 ` Donet Tom
  2025-05-23 10:14 ` Aboorva Devarajan
@ 2025-05-23 14:11 ` Shakeel Butt
  2025-05-24  1:25   ` Baolin Wang
  2025-05-23 17:20 ` SeongJae Park
  3 siblings, 1 reply; 11+ messages in thread
From: Shakeel Butt @ 2025-05-23 14:11 UTC (permalink / raw)
  To: Baolin Wang
  Cc: akpm, david, Liam.Howlett, vbabka, rppt, surenb, mhocko, linux-mm,
	linux-fsdevel, linux-kernel

CC Mathieu

On Fri, May 23, 2025 at 11:16:13AM +0800, Baolin Wang wrote:
> On some large machines with a high number of CPUs running a 64K kernel,
> we found that the 'RES' field is always 0 displayed by the top command
> for some processes, which will cause a lot of confusion for users.
> 
>     PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
>  875525 root      20   0   12480      0      0 R   0.3   0.0   0:00.08 top
>       1 root      20   0  172800      0      0 S   0.0   0.0   0:04.52 systemd
> 
> The main reason is that the batch size of the percpu counter is quite large
> on these machines, caching a significant percpu value, since converting mm's
> rss stats into percpu_counter by commit f1a7941243c1 ("mm: convert mm's rss
> stats into percpu_counter"). Intuitively, the batch number should be optimized,
> but on some paths, performance may take precedence over statistical accuracy.
> Therefore, introducing a new interface to add the percpu statistical count
> and display it to users, which can remove the confusion. In addition, this
> change is not expected to be on a performance-critical path, so the modification
> should be acceptable.
> 
> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>

Hi Baolin, this seems reasonale. For long term Mathieu is planning to
fix this with newer hierarchical percpu counter until then this looks
good.

Acked-by: Shakeel Butt <shakeel.butt@linux.dev>


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [RFC PATCH] mm: fix the inaccurate memory statistics issue for users
  2025-05-23  3:16 [RFC PATCH] mm: fix the inaccurate memory statistics issue for users Baolin Wang
                   ` (2 preceding siblings ...)
  2025-05-23 14:11 ` Shakeel Butt
@ 2025-05-23 17:20 ` SeongJae Park
  2025-05-23 17:23   ` SeongJae Park
  3 siblings, 1 reply; 11+ messages in thread
From: SeongJae Park @ 2025-05-23 17:20 UTC (permalink / raw)
  To: Baolin Wang
  Cc: SeongJae Park, akpm, david, shakeelb, lorenzo.stoakes,
	Liam.Howlett, vbabka, rppt, surenb, mhocko, linux-mm,
	linux-fsdevel, linux-kernel

On Fri, 23 May 2025 11:16:13 +0800 Baolin Wang <baolin.wang@linux.alibaba.com> wrote:

> On some large machines with a high number of CPUs running a 64K kernel,

What does 64K kernel means?

> we found that the 'RES' field is always 0 displayed by the top command
> for some processes, which will cause a lot of confusion for users.
> 
>     PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
>  875525 root      20   0   12480      0      0 R   0.3   0.0   0:00.08 top
>       1 root      20   0  172800      0      0 S   0.0   0.0   0:04.52 systemd
> 
> The main reason is that the batch size of the percpu counter is quite large
> on these machines, caching a significant percpu value, since converting mm's
> rss stats into percpu_counter by commit f1a7941243c1 ("mm: convert mm's rss
> stats into percpu_counter"). Intuitively, the batch number should be optimized,
> but on some paths, performance may take precedence over statistical accuracy.
> Therefore, introducing a new interface to add the percpu statistical count
> and display it to users, which can remove the confusion. In addition, this
> change is not expected to be on a performance-critical path, so the modification
> should be acceptable.
> 
> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>

Acked-by: SeongJae Park <sj@kernel.org>


Thanks,
SJ

[...]


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [RFC PATCH] mm: fix the inaccurate memory statistics issue for users
  2025-05-23 17:20 ` SeongJae Park
@ 2025-05-23 17:23   ` SeongJae Park
  2025-05-24  1:29     ` Baolin Wang
  0 siblings, 1 reply; 11+ messages in thread
From: SeongJae Park @ 2025-05-23 17:23 UTC (permalink / raw)
  To: SeongJae Park
  Cc: Baolin Wang, akpm, david, shakeelb, lorenzo.stoakes, Liam.Howlett,
	vbabka, rppt, surenb, mhocko, linux-mm, linux-fsdevel,
	linux-kernel

On Fri, 23 May 2025 10:20:29 -0700 SeongJae Park <sj@kernel.org> wrote:

> On Fri, 23 May 2025 11:16:13 +0800 Baolin Wang <baolin.wang@linux.alibaba.com> wrote:
> 
> > On some large machines with a high number of CPUs running a 64K kernel,
> 
> What does 64K kernel means?
> 
> > we found that the 'RES' field is always 0 displayed by the top command
> > for some processes, which will cause a lot of confusion for users.
> > 
> >     PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
> >  875525 root      20   0   12480      0      0 R   0.3   0.0   0:00.08 top
> >       1 root      20   0  172800      0      0 S   0.0   0.0   0:04.52 systemd
> > 
> > The main reason is that the batch size of the percpu counter is quite large
> > on these machines, caching a significant percpu value, since converting mm's
> > rss stats into percpu_counter by commit f1a7941243c1 ("mm: convert mm's rss
> > stats into percpu_counter").

Forgot asking this, sorry.  Should we add Fixes: tag and Cc stable@?


Thanks,
SJ

[...]


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [RFC PATCH] mm: fix the inaccurate memory statistics issue for users
  2025-05-23 10:14 ` Aboorva Devarajan
@ 2025-05-24  1:24   ` Baolin Wang
  2025-05-30 10:23   ` Michal Hocko
  1 sibling, 0 replies; 11+ messages in thread
From: Baolin Wang @ 2025-05-24  1:24 UTC (permalink / raw)
  To: Aboorva Devarajan, akpm, david, shakeel.butt
  Cc: lorenzo.stoakes, Liam.Howlett, vbabka, rppt, surenb, mhocko,
	linux-mm, linux-fsdevel, linux-kernel



On 2025/5/23 18:14, Aboorva Devarajan wrote:
> On Fri, 2025-05-23 at 11:16 +0800, Baolin Wang wrote:
>> On some large machines with a high number of CPUs running a 64K kernel,
>> we found that the 'RES' field is always 0 displayed by the top command
>> for some processes, which will cause a lot of confusion for users.
>>
>>      PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
>>   875525 root      20   0   12480      0      0 R   0.3   0.0   0:00.08 top
>>        1 root      20   0  172800      0      0 S   0.0   0.0   0:04.52 systemd
>>
>> The main reason is that the batch size of the percpu counter is quite large
>> on these machines, caching a significant percpu value, since converting mm's
>> rss stats into percpu_counter by commit f1a7941243c1 ("mm: convert mm's rss
>> stats into percpu_counter"). Intuitively, the batch number should be optimized,
>> but on some paths, performance may take precedence over statistical accuracy.
>> Therefore, introducing a new interface to add the percpu statistical count
>> and display it to users, which can remove the confusion. In addition, this
>> change is not expected to be on a performance-critical path, so the modification
>> should be acceptable.
>>
>> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
>> ---
>>   fs/proc/task_mmu.c | 14 +++++++-------
>>   include/linux/mm.h |  5 +++++
>>   2 files changed, 12 insertions(+), 7 deletions(-)
>>
>> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
>> index b9e4fbbdf6e6..f629e6526935 100644
>> --- a/fs/proc/task_mmu.c
>> +++ b/fs/proc/task_mmu.c
>> @@ -36,9 +36,9 @@ void task_mem(struct seq_file *m, struct mm_struct *mm)
>>   	unsigned long text, lib, swap, anon, file, shmem;
>>   	unsigned long hiwater_vm, total_vm, hiwater_rss, total_rss;
>>   
>> -	anon = get_mm_counter(mm, MM_ANONPAGES);
>> -	file = get_mm_counter(mm, MM_FILEPAGES);
>> -	shmem = get_mm_counter(mm, MM_SHMEMPAGES);
>> +	anon = get_mm_counter_sum(mm, MM_ANONPAGES);
>> +	file = get_mm_counter_sum(mm, MM_FILEPAGES);
>> +	shmem = get_mm_counter_sum(mm, MM_SHMEMPAGES);
>>   
>>   	/*
>>   	 * Note: to minimize their overhead, mm maintains hiwater_vm and
>> @@ -59,7 +59,7 @@ void task_mem(struct seq_file *m, struct mm_struct *mm)
>>   	text = min(text, mm->exec_vm << PAGE_SHIFT);
>>   	lib = (mm->exec_vm << PAGE_SHIFT) - text;
>>   
>> -	swap = get_mm_counter(mm, MM_SWAPENTS);
>> +	swap = get_mm_counter_sum(mm, MM_SWAPENTS);
>>   	SEQ_PUT_DEC("VmPeak:\t", hiwater_vm);
>>   	SEQ_PUT_DEC(" kB\nVmSize:\t", total_vm);
>>   	SEQ_PUT_DEC(" kB\nVmLck:\t", mm->locked_vm);
>> @@ -92,12 +92,12 @@ unsigned long task_statm(struct mm_struct *mm,
>>   			 unsigned long *shared, unsigned long *text,
>>   			 unsigned long *data, unsigned long *resident)
>>   {
>> -	*shared = get_mm_counter(mm, MM_FILEPAGES) +
>> -			get_mm_counter(mm, MM_SHMEMPAGES);
>> +	*shared = get_mm_counter_sum(mm, MM_FILEPAGES) +
>> +			get_mm_counter_sum(mm, MM_SHMEMPAGES);
>>   	*text = (PAGE_ALIGN(mm->end_code) - (mm->start_code & PAGE_MASK))
>>   								>> PAGE_SHIFT;
>>   	*data = mm->data_vm + mm->stack_vm;
>> -	*resident = *shared + get_mm_counter(mm, MM_ANONPAGES);
>> +	*resident = *shared + get_mm_counter_sum(mm, MM_ANONPAGES);
>>   	return mm->total_vm;
>>   }
>>   
>> diff --git a/include/linux/mm.h b/include/linux/mm.h
>> index 185424858f23..15ec5cfe9515 100644
>> --- a/include/linux/mm.h
>> +++ b/include/linux/mm.h
>> @@ -2568,6 +2568,11 @@ static inline unsigned long get_mm_counter(struct mm_struct *mm, int member)
>>   	return percpu_counter_read_positive(&mm->rss_stat[member]);
>>   }
>>   
>> +static inline unsigned long get_mm_counter_sum(struct mm_struct *mm, int member)
>> +{
>> +	return percpu_counter_sum_positive(&mm->rss_stat[member]);
>> +}
>> +
>>   void mm_trace_rss_stat(struct mm_struct *mm, int member);
>>   
>>   static inline void add_mm_counter(struct mm_struct *mm, int member, long value)
> 
> Hi Baolin,
> 
> This patch looks good to me. We observed a similar issue where the
> generic mm selftest split_huge_page_test failed due to outdated RssAnon
> values reported in /proc/[pid]/status.
> 
> ...
> 
> Without Patch:
> 
> # ./split_huge_page_test
> TAP version 13
> 1..34
> Bail out! No RssAnon is allocated before split
> # Planned tests != run tests (34 != 0)
> # Totals: pass:0 fail:0 xfail:0 xpass:0 skip:0 error:0
> 
> ...
> 
> With Patch:
> 
> # ./split_huge_page_test
> # ./split_huge_page_test
> TAP version 13
> 1..34
> ...
> # Totals: pass:11 fail:0 xfail:0 xpass:0 skip:23 error:0
> 
> ...
> 
> While this change may introduce some lock contention, it only affects
> the task_mem function which is invoked only when reading
> /proc/[pid]/status. Since this is not on a performance critical path,
> it will be good to have this change in order to get accurate memory
> stats.

Agree.

> 
> This fix resolves the issue we've seen with split_huge_page_test.
> 
> Thanks!
> 
> 
> Reviewed-by: Aboorva Devarajan <aboorvad@linux.ibm.com>
> Tested-by: Aboorva Devarajan <aboorvad@linux.ibm.com>
> 

Thanks for reviewing and testing.


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [RFC PATCH] mm: fix the inaccurate memory statistics issue for users
  2025-05-23 14:11 ` Shakeel Butt
@ 2025-05-24  1:25   ` Baolin Wang
  0 siblings, 0 replies; 11+ messages in thread
From: Baolin Wang @ 2025-05-24  1:25 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: akpm, david, Liam.Howlett, vbabka, rppt, surenb, mhocko, linux-mm,
	linux-fsdevel, linux-kernel



On 2025/5/23 22:11, Shakeel Butt wrote:
> CC Mathieu
> 
> On Fri, May 23, 2025 at 11:16:13AM +0800, Baolin Wang wrote:
>> On some large machines with a high number of CPUs running a 64K kernel,
>> we found that the 'RES' field is always 0 displayed by the top command
>> for some processes, which will cause a lot of confusion for users.
>>
>>      PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
>>   875525 root      20   0   12480      0      0 R   0.3   0.0   0:00.08 top
>>        1 root      20   0  172800      0      0 S   0.0   0.0   0:04.52 systemd
>>
>> The main reason is that the batch size of the percpu counter is quite large
>> on these machines, caching a significant percpu value, since converting mm's
>> rss stats into percpu_counter by commit f1a7941243c1 ("mm: convert mm's rss
>> stats into percpu_counter"). Intuitively, the batch number should be optimized,
>> but on some paths, performance may take precedence over statistical accuracy.
>> Therefore, introducing a new interface to add the percpu statistical count
>> and display it to users, which can remove the confusion. In addition, this
>> change is not expected to be on a performance-critical path, so the modification
>> should be acceptable.
>>
>> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
> 
> Hi Baolin, this seems reasonale. For long term Mathieu is planning to
> fix this with newer hierarchical percpu counter until then this looks
> good.

OK. Good.

> Acked-by: Shakeel Butt <shakeel.butt@linux.dev>

Thanks.


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [RFC PATCH] mm: fix the inaccurate memory statistics issue for users
  2025-05-23 17:23   ` SeongJae Park
@ 2025-05-24  1:29     ` Baolin Wang
  0 siblings, 0 replies; 11+ messages in thread
From: Baolin Wang @ 2025-05-24  1:29 UTC (permalink / raw)
  To: SeongJae Park
  Cc: akpm, david, lorenzo.stoakes, Liam.Howlett, vbabka, rppt, surenb,
	mhocko, linux-mm, linux-fsdevel, linux-kernel, shakeel.butt



On 2025/5/24 01:23, SeongJae Park wrote:
> On Fri, 23 May 2025 10:20:29 -0700 SeongJae Park <sj@kernel.org> wrote:
> 
>> On Fri, 23 May 2025 11:16:13 +0800 Baolin Wang <baolin.wang@linux.alibaba.com> wrote:
>>
>>> On some large machines with a high number of CPUs running a 64K kernel,
>>
>> What does 64K kernel means?

Sorry for not being clear. I mean a 64K pagesize kernel on Arm servers.

>>> we found that the 'RES' field is always 0 displayed by the top command
>>> for some processes, which will cause a lot of confusion for users.
>>>
>>>      PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
>>>   875525 root      20   0   12480      0      0 R   0.3   0.0   0:00.08 top
>>>        1 root      20   0  172800      0      0 S   0.0   0.0   0:04.52 systemd
>>>
>>> The main reason is that the batch size of the percpu counter is quite large
>>> on these machines, caching a significant percpu value, since converting mm's
>>> rss stats into percpu_counter by commit f1a7941243c1 ("mm: convert mm's rss
>>> stats into percpu_counter").
> 
> Forgot asking this, sorry.  Should we add Fixes: tag and Cc stable@?

Yes, will add the Fixes tag in next version. Thanks for reviewing.


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [RFC PATCH] mm: fix the inaccurate memory statistics issue for users
  2025-05-23 10:14 ` Aboorva Devarajan
  2025-05-24  1:24   ` Baolin Wang
@ 2025-05-30 10:23   ` Michal Hocko
  1 sibling, 0 replies; 11+ messages in thread
From: Michal Hocko @ 2025-05-30 10:23 UTC (permalink / raw)
  To: Aboorva Devarajan
  Cc: Baolin Wang, akpm, david, shakeelb, lorenzo.stoakes, Liam.Howlett,
	vbabka, rppt, surenb, linux-mm, linux-fsdevel, linux-kernel

On Fri 23-05-25 15:44:37, Aboorva Devarajan wrote:
> While this change may introduce some lock contention, it only affects
> the task_mem function which is invoked only when reading
> /proc/[pid]/status. Since this is not on a performance critical path,
> it will be good to have this change in order to get accurate memory
> stats.

This particular function might not be performance critical but you are
exposing a lock contention to the userspace that could be abused and
cause contention controlled by unprivileged user. I do not think we want
that without any control. Or is the pcp lock not really affecting any
actual kernel code path?

So while precision is nice it should be weight against potential side
effects. 
-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2025-05-30 10:23 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-05-23  3:16 [RFC PATCH] mm: fix the inaccurate memory statistics issue for users Baolin Wang
2025-05-23  5:25 ` Donet Tom
2025-05-23  5:47   ` Baolin Wang
2025-05-23 10:14 ` Aboorva Devarajan
2025-05-24  1:24   ` Baolin Wang
2025-05-30 10:23   ` Michal Hocko
2025-05-23 14:11 ` Shakeel Butt
2025-05-24  1:25   ` Baolin Wang
2025-05-23 17:20 ` SeongJae Park
2025-05-23 17:23   ` SeongJae Park
2025-05-24  1:29     ` Baolin Wang

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).